Since AlexNet won the ImageNet Challenge (ILSVRC 2012), convolutional neural networks (CNNs) have been applied to a wide range of computer vision tasks, from image classification [8] to more advanced applications, e.g., object detection [17, 21], semantic segmentation, video analysis and many others. In these fields, CNNs have achieved state-of-the-art performance compared with traditional methods based on manually designed visual features.
However, deep models often have a huge number of parameters and are very large, which incurs not only a large memory footprint but also a heavy computational burden. As a result, a typical deep model is hard to deploy on resource-constrained devices, e.g., mobile phones or embedded gadgets. To make CNNs available on such devices, many studies have addressed model compression, which aims to reduce model redundancy without significant degradation in performance. Channel pruning [9, 19, 11] is one of the important methods. Unlike methods that simply make connections sparse [5, 6], channel pruning reduces the model size by directly removing redundant channels and can achieve fast inference without special software or hardware implementation.
To determine which channels to keep, existing reconstruction-based methods [9, 19, 11] usually minimize the reconstruction error of feature maps between the original model and the pruned one. However, a well-reconstructed feature map may not be optimal, because there is a gap between an intermediate feature map and the performance of the final output. Channels carrying redundant information could be mistakenly kept to minimize the reconstruction error of feature maps. To find the channels with true discriminative power for the network, DCP attempts to conduct channel selection by introducing additional discrimination-aware losses that are actually correlated with the final performance. It constructs the discrimination-aware losses with a fully connected layer that works on the entire feature map. However, the discrimination-aware loss of DCP is designed for the classification task. Since an object detection network uses a classification network as its backbone, a simple way to apply DCP to object detection is to prune the model on a classification dataset and then fine-tune the pruned model for the object detection task. But the information that the two tasks need is not exactly the same. The classification task needs strong semantic information, whereas the object detection task needs not only semantic information but also localization information. Hence, the existing training scheme may not be optimal due to the mismatched goals of feature learning for classification and object detection.
In this paper, we propose a method called localization-aware channel pruning (LCP), which conducts channel pruning directly for object detection. We propose a localization-aware auxiliary network for the object detection task. First, we design the auxiliary network with a local feature extraction layer which can obtain precise localization information of the default boxes by pixel alignment. Then, an adaptive sampling strategy is proposed to enlarge the receptive fields of the default boxes in shallow layers. Finally, we propose a scale-aware loss function which tends to preserve the information of small objects.
Our main contributions are summarized as follows. (1) We propose a localization-aware auxiliary network which can find the channels with key information, so that we can conduct channel pruning directly on an object detection dataset, which saves a lot of time and computing resources. (2) We propose an adaptive sampling strategy which enlarges the receptive fields of the default boxes in shallow layers. (3) We introduce a new scale-aware loss function which tends to preserve the channels that contain the information of small objects. Extensive experiments on benchmark datasets show that the proposed method is theoretically reasonable and practically effective. For example, our method can prune 70% of the parameters of SSD based on ResNet-50 with a modest accuracy drop on VOC2007, which outperforms the state-of-the-art method.
Network Quantization
Network quantization compresses the original network by reducing the number of bits required to represent each weight. Han et al. propose a complete deep network compression pipeline: first, unimportant connections are pruned and the sparsely connected network is retrained. Weight sharing is then used to quantize the connection weights, and the quantized weights and codebook are Huffman encoded to further compress the model. Courbariaux et al. propose to accelerate the model by reducing the precision of the weights and activations, because this greatly reduces the memory size and the number of memory accesses of the network and replaces arithmetic operators with bit-wise operators. Li et al. consider that ternary weights have better generalization capabilities than binarization and that the distribution of weights is close to a combination of a normal distribution and a uniform distribution. Zhou et al. propose a method which converts a full-precision CNN into a low-precision network whose weights are either zero or powers of two, without loss of accuracy or even with higher accuracy (shifting can be performed on embedded devices such as FPGAs).
Sparse or Low-rank Connections
Wen et al. propose a learning method called Structured Sparsity Learning, which learns a sparse structure to reduce computational cost; the learned structured sparsity can be effectively accelerated on hardware. Guo et al. propose a network compression method called dynamic network surgery, which reduces network complexity through dynamic connection pruning. Unlike previous greedy pruning methods, this approach integrates connection splicing throughout the process to avoid incorrect pruning and turns pruning into continual network maintenance. Jin et al. propose to reduce the computational complexity of the model by training a highly sparse network. By adding a norm constraint on the weights to the loss function of the network, the weights can be made sparser.
Channel Pruning
Finding unimportant weights in a network has a long history. LeCun and Hassibi consider that using the Hessian, which contains second-order derivatives, performs better than using the magnitude of the weights. However, computing the Hessian is expensive and thus it is not widely used. Han et al. propose an iterative pruning method to remove the redundancy in deep models. Their main insight is that connections whose weights fall below a threshold should be discarded. In practice, this can be aided by applying $\ell_1$ or $\ell_2$ regularization to push connection weights to become smaller. The major weakness of this strategy is the loss of universality and flexibility, so it seems less practical in real applications. Li et al. measure the importance of channels by calculating the sum of absolute values of weights. Hu et al. define the average percentage of zeros (APoZ) to measure the activation of neurons. Neurons with higher values of APoZ are considered more redundant in the network. With a sparsity regularizer in the objective function [1, 18], training-based methods are proposed to learn compact models in the training phase. With efficiency in mind, reconstruction-based methods [9, 19] transform the channel selection problem into the optimization of reconstruction error and solve it by a greedy algorithm or LASSO regression. DCP aims at selecting the most discriminative channels for each layer by considering both the reconstruction error and the discrimination-aware loss.
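Two of the importance criteria mentioned above, the sum of absolute weights per filter and APoZ, can be illustrated with a minimal NumPy sketch. The function names, tensor layouts and random data are assumptions made for this illustration and are not taken from the cited papers.

```python
import numpy as np

def l1_filter_importance(weights):
    """Sum of absolute weights per output filter.

    weights is assumed to have shape (out_channels, in_channels, kh, kw).
    """
    return np.abs(weights).sum(axis=(1, 2, 3))

def apoz(activations):
    """Average percentage of zeros per channel.

    activations is assumed to be a post-ReLU tensor of shape
    (batch, channels, height, width).
    """
    return (activations == 0).mean(axis=(0, 2, 3))

rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 3, 3, 3))
acts = np.maximum(rng.normal(size=(8, 4, 5, 5)), 0.0)  # simulated ReLU output
acts[:, 2] = 0.0  # channel 2 never fires, so its APoZ is 1

scores = l1_filter_importance(weights)  # larger means more important
zeros = apoz(acts)                      # larger means more redundant
```

Under these criteria, filters with small `scores` or channels with high `zeros` would be the first candidates for removal.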
The auxiliary network we propose mainly consists of three parts. First, a local feature extraction layer is designed to extract the features of the boxes. To this end, an algorithm is proposed to adaptively adjust the sampling area, which enlarges the receptive fields of the boxes in shallow layers. Then, a scale-aware loss is proposed to adaptively adjust the loss according to the scale of the samples. After the auxiliary network is constructed, we conduct channel pruning with the localization-aware losses of the auxiliary network. Fig. 1 shows the overall framework. The details of the proposed approach are elaborated below.
Local Feature Extraction
For the object detection task, if we predict the bounding boxes directly on the entire feature maps, there will be a huge number of parameters and unnecessary noise. It is therefore important to extract the features of the region of interest (RoI), which can be better used for classification and regression. To obtain precise localization information and find the channels that are important for classification and regression, we use a RoIAlign layer that properly aligns the extracted features with the input. RoIAlign uses bilinear interpolation to compute the exact values of the input features at four regularly sampled locations in each RoI bin and aggregates the results (using max or average); see Fig. 2 for details.
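The RoIAlign operation described above can be sketched for a single-channel feature map as follows. This is a simplified illustration rather than the implementation used in the paper: the coordinate convention (pixel centers at integer indices), the 2 x 2 output size and the function names are assumptions.

```python
import numpy as np

def bilinear(feat, y, x):
    """Bilinearly interpolate the 2D map feat at the continuous location (y, x)."""
    h, w = feat.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feat[y0, x0] + dx * feat[y0, x1]
    bottom = (1 - dx) * feat[y1, x0] + dx * feat[y1, x1]
    return (1 - dy) * top + dy * bottom

def roi_align(feat, box, out_size=2, samples=2):
    """Pool box = (x1, y1, x2, y2), given in feature-map coordinates.

    The box is split into out_size x out_size bins, and each bin averages
    samples x samples regularly spaced bilinear samples (four per bin here).
    """
    x1, y1, x2, y2 = box
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            vals = []
            for sy in range(samples):
                for sx in range(samples):
                    y = y1 + (i + (sy + 0.5) / samples) * bin_h
                    x = x1 + (j + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear(feat, y, x))
            out[i, j] = np.mean(vals)  # average aggregation
    return out

feat = np.arange(36, dtype=float).reshape(6, 6)  # feat[y, x] = 6 * y + x
pooled = roi_align(feat, (0.5, 0.5, 4.5, 4.5))
```

Because no coordinate is rounded to the feature grid, the box corners may fall between pixels, which is the property that keeps the extracted features aligned with the input.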
The RoIAlign layer extracts the features from the regions of the default boxes that have been chosen as training samples. However, a problem remains. When we prune the shallow layers, the receptive fields of the feature map sent to the auxiliary network are very small, so the extracted features do not contain enough context information. To solve this problem, we propose a strategy that changes the sampling area adaptively. The details are discussed below.
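To describe the algorithm, we first introduce some notation. For a training sample, let $(x_1^A, y_1^A, x_2^A, y_2^A)$ be the coordinates of the ground-truth box $A$ and $(x_1^B, y_1^B, x_2^B, y_2^B)$ the coordinates of the matched default box $B$. The algorithm also depends on the sampling scale of the RoIAlign layer, the total number of layers and the index of the current layer. First, we calculate the IoU of boxes $A$ and $B$:
\[ IoU_{AB} = \frac{|A \cap B|}{|A \cup B|} \tag{1} \]
$B$ is a positive sample only if $IoU_{AB}$ is larger than a preset threshold. We do not change the sampling area of the RoIAlign layer when $B$ is a negative sample. If $B$ is a positive sample, we calculate the smallest enclosing convex object $C$ for $A$ and $B$:
\begin{align}
x_1^C &= \min(x_1^A, x_1^B) \tag{2} \\
y_1^C &= \min(y_1^A, y_1^B) \tag{3} \\
x_2^C &= \max(x_2^A, x_2^B) \tag{4} \\
y_2^C &= \max(y_2^A, y_2^B) \tag{5}
\end{align}
where $(x_1^C, y_1^C, x_2^C, y_2^C)$ are the coordinates of $C$. Then the adaptive sampling area is defined as: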
where the is calculated by:
and the sampling scale of the RoIAlign layer is defined as:
where the , , are constants. With the adaptive sampling, we get enough receptive fields for the boxes in shallow layers and hardly introduce redundant context information for the boxes in deep layers. The process is summarized in Algorithm 1.
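The first steps of the sampling procedure (IoU, the positive-sample test and the enclosing box) can be sketched as below. This is only an illustration: the threshold value of 0.5 and all function names are assumptions, and the interpolation that produces the final adaptive sampling area is not reproduced here.

```python
def area(box):
    """Area of an axis-aligned box (x1, y1, x2, y2); empty boxes have area 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]),
                  min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def enclosing_box(a, b):
    """Smallest axis-aligned box that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]),
            max(a[2], b[2]), max(a[3], b[3]))

def sampling_reference(gt, default, threshold=0.5):
    """Return (is_positive, enclosing box or None).

    threshold is a hypothetical value; the text only says it is preset.
    Negative samples keep their original sampling area, so no enclosing
    box is computed for them.
    """
    if iou(gt, default) <= threshold:
        return False, None
    return True, enclosing_box(gt, default)

gt = (0.0, 0.0, 2.0, 2.0)
positive, box_c = sampling_reference(gt, (0.5, 0.0, 2.5, 2.0))
negative, no_box = sampling_reference(gt, (1.0, 0.0, 3.0, 2.0))
```

In this example the first default box overlaps the ground truth with an IoU of 0.6 and is treated as positive, while the second has an IoU of 1/3 and is left unchanged.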
Scale-aware Loss
After constructing the RoIAlign layer and the sampling strategy, we can use the gradient of the auxiliary network to conduct model pruning. However, the scale of objects in object detection varies widely, which makes detection difficult, especially for small objects. To solve this problem, we propose a scale-aware loss which adjusts the loss according to the scale of the training sample. For a small object, we generate a larger loss by giving it a greater weight, which helps to preserve the channels containing information for small objects. The details are discussed below.
The quantities involved are the input size of the network, each predicted box, and the width and height of the corresponding default box. In the channel pruning stage, we use cross entropy and GIoU to construct the loss of the auxiliary network. GIoU is a reasonable loss for box regression because it considers not only the overlapping area but also the non-overlapping area, and therefore better reflects how well two boxes overlap. The GIoU of two arbitrary shapes $A$ and $B$ is defined as:
\[ GIoU_{AB} = \frac{|A \cap B|}{|A \cup B|} - \frac{|C \setminus (A \cup B)|}{|C|} \]
where the intersection, the union and the enclosing object $C$ are calculated by Eq. 1 - Eq. 5. Fig. 3 is a schematic diagram of GIoU. For each predicted box, we then consider its GIoU with the ground truth and its cross entropy. In the pruning stage, the auxiliary network has a classification loss, a regression loss and a localization-aware loss. Finally, the loss of positive samples in the pruning stage is defined as:
where m is a constant coefficient, is the scale factor calculated by:
where k is a constant that controls the magnification.
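As a worked illustration of the GIoU definition used in the regression loss, consider two axis-aligned boxes given as $(x_1, y_1, x_2, y_2)$. The boxes are chosen here only for illustration and do not come from the experiments.

```latex
% Overlapping boxes
A = (0, 0, 2, 2), \quad B = (1, 1, 3, 3), \quad C = (0, 0, 3, 3)
\begin{align*}
|A \cap B| &= 1, \qquad |A \cup B| = 4 + 4 - 1 = 7, \qquad |C| = 9 \\
GIoU_{AB} &= \frac{1}{7} - \frac{9 - 7}{9} = \frac{9 - 14}{63} = -\frac{5}{63} \approx -0.079
\end{align*}

% Disjoint boxes
A = (0, 0, 2, 2), \quad B' = (3, 0, 5, 2), \quad C' = (0, 0, 5, 2)
\begin{align*}
|A \cap B'| &= 0, \qquad |A \cup B'| = 8, \qquad |C'| = 10 \\
GIoU_{AB'} &= 0 - \frac{10 - 8}{10} = -0.2
\end{align*}
```

In the second case the IoU is zero regardless of how far apart the boxes are, whereas GIoU still changes with the gap between them, which is why it remains informative for non-overlapping boxes.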
Localization-aware Channel Pruning
After constructing the auxiliary network and the localization-aware loss, we conduct channel pruning with them layer by layer. The pruning process of the whole model is described in Algorithm 2. To describe the channel selection algorithm, we first introduce the quantities involved. Consider a CNN model and the layer currently being pruned. The relevant quantities are the output feature map of this layer in the pruned model, its convolution filters, the convolution operation, and the output feature maps of the same layer in the original model; the feature maps are described by the number of output channels, the height and the width. Finally, we distinguish the classification loss and regression loss of the auxiliary network in the fine-tune stage from the classification loss and regression loss of the pruned network.
To find the channels which really contribute to the network, we first fine-tune the auxiliary network and the pruned network. The fine-tune loss is defined as the sum of their losses:
To minimize the reconstruction error of a layer, we introduce a reconstruction loss as DCP does, which is defined as the Euclidean distance between the feature maps of the original model and those of the pruned one:
where , represents the selected channels, represents the submatrix indexed by .
Taking into account the reconstruction error and the localization-aware loss of the auxiliary network, the channel pruning problem can be formulated as minimizing the following joint loss function:
where is a constant, is the number of channels to be selected. Directly optimizing this joint loss is NP-hard. Following the general greedy method in DCP, we conduct channel pruning by considering the gradient of Eq. 17. Specifically, the importance of each channel is defined as:
which is the sum of squared gradients of that channel. We then keep the channels with the largest importance and remove the others. After this, the selected channels are further optimized by stochastic gradient descent (SGD). The parameters of the selected channels are updated by:
where represents the learning rate. After updating , the channel pruning of a single layer is finished.
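The greedy selection step can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' code: the gradient of the joint loss is taken as given, a "channel" is assumed to be an input channel of the layer's filters (as in DCP), and the function names and toy values are hypothetical.

```python
import numpy as np

def channel_importance(grad):
    """Sum of squared gradients per input channel.

    grad is the gradient of the joint loss with respect to the filters,
    assumed to have shape (out_channels, in_channels, kh, kw).
    """
    return (grad ** 2).sum(axis=(0, 2, 3))

def select_channels(grad, num_keep):
    """Indices of the num_keep channels with the largest importance, sorted."""
    scores = channel_importance(grad)
    return np.sort(np.argsort(scores)[::-1][:num_keep])

def sgd_step(weights, grad, keep, lr):
    """One SGD update on the kept channels; removed channels are set to zero."""
    new_weights = np.zeros_like(weights)
    new_weights[:, keep] = weights[:, keep] - lr * grad[:, keep]
    return new_weights

# Toy layer with 2 output channels and 4 input channels (1 x 1 kernels).
grad = np.array([0.1, 2.0, 0.5, 1.0]).reshape(1, 4, 1, 1).repeat(2, axis=0)
weights = np.ones((2, 4, 1, 1))

keep = select_channels(grad, num_keep=2)
updated = sgd_step(weights, grad, keep, lr=0.1)
```

Here channels 1 and 3 have the largest squared-gradient sums and are kept, and only their weights receive the gradient update.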
We evaluate LCP on the popular 2D object detector SSD. Several state-of-the-art methods are adopted as baselines, including ThiNet and DCP. To verify the effectiveness of our method, we use VGG and ResNet as feature extractors, respectively.
Dataset and Evaluation
The results of all baselines are reported on a standard object detection benchmark, i.e., PASCAL VOC. PASCAL VOC2007 and 2012: the Pascal Visual Object Classes (VOC) benchmark is one of the most widely used datasets for classification, object detection and semantic segmentation. We use the union of VOC2007 and VOC2012 trainval as the training set, which contains 16551 images with objects from 20 pre-defined categories annotated with bounding boxes. We use VOC2007 test, which contains 4952 images, as the test set. On PASCAL VOC, we first compare our method only with ThiNet based on VGG-16, because the authors of DCP do not release the VGG model. We then compare our method with DCP and ThiNet based on ResNet-50, and conduct the ablation experiments of our method on PASCAL VOC. To verify the effectiveness of our method more fully, we also perform experiments on the MS COCO2017 dataset. Due to space limitations, we only compare our method with DCP on COCO.
In this paper, we use for all experiments on PASCAL VOC. For experiments on MS COCO, the main performance measure used in this benchmark is shown by AP, which is averaging mAP across different value of IoU thresholds, i.e. .
Our experiments are based on SSD and the input size of the SSD is . We use VGGNet and ResNet as the feature extraction network for experiments. For ThiNet, we implement it for object detection. And the three methods prune the same number of channels for each layer. Other common parameters are described in detail below.
For VGGNet, we use VGG-16 without Batch Normalization layers and prune the SSD from conv1-1 to conv5-3. The network is fine-tuned for 10 epochs every time a layer is pruned; the learning rate starts at 0.001 and is divided by 10 at epoch 5. After the model is pruned, we fine-tune it for 60k iterations; the learning rate starts at 0.0005 and is divided by 10 at iterations 30k and 45k.
For ResNet, we use the layers of ResNet-50 from conv1-x to conv4-x for feature extraction. The network is fine-tuned for 15 epochs every time a layer is pruned; the learning rate starts at 0.001 and is divided by 10 at epochs 5 and 10. After the model is pruned, we fine-tune it for 120k iterations; the learning rate starts at 0.001 and is divided by 10 at iterations 80k and 100k.
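The step schedules described above for the final fine-tuning can be written as a small helper. This is a sketch only: the function and dictionary names are hypothetical, and applying the drop exactly at the milestone iteration is an assumption, since the text does not say whether the boundary is inclusive.

```python
def step_lr(iteration, base_lr, milestones, gamma=0.1):
    """Learning rate after multiplying by gamma at every milestone reached."""
    passed = sum(1 for m in milestones if iteration >= m)
    return base_lr * gamma ** passed

# Final fine-tuning schedules taken from the settings in the text.
vgg_finetune = dict(base_lr=0.0005, milestones=(30_000, 45_000))      # 60k iterations
resnet_finetune = dict(base_lr=0.001, milestones=(80_000, 100_000))   # 120k iterations

lr_start = step_lr(0, **resnet_finetune)
lr_mid = step_lr(80_000, **resnet_finetune)
lr_end = step_lr(119_999, **resnet_finetune)
```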
For Adaptive Sampling, we set , , to 10, 7, 14 respectively. For Scale-aware Loss, we set , to 50, 0.5 respectively.
Experiments on PASCAL VOC
On PASCAL VOC, we prune VGG-16 from conv1-1 to conv5-3 with a compression ratio of 0.75, which is 4x faster. We report the results in Tab. 1. The results show that our method achieves the best performance under the same acceleration rate. The accuracy of reconstruction-based methods like ThiNet drops considerably, whereas LCP shows little degradation in object detection performance. This indicates that our method retains the channels which really contribute to the final performance. We then conduct the experiment based on ResNet-50 and report the results in Tab. 2. LCP achieves the best performance regardless of pruning by 75% or by 50%, which indicates that our method preserves the channels that contain key information for classification and regression. In addition, ThiNet outperforms DCP when the pruning ratio is 0.7, which indicates that pruning the model on a classification dataset for object detection is not optimal.
Experiments on MS COCO
In this section, we prune ResNet-50 by 70% on COCO2017 and compare DCP with our LCP. We report the results in Tab. 3. Our method achieves better performance than DCP, which further illustrates the effectiveness of our approach. Note that, compared with DCP, LCP has a larger gain on small objects. This verifies the effectiveness of our scale-aware loss, which keeps the information for small objects. In addition, the higher the IoU threshold, the greater the improvement of our method. This indicates that our method retains more localization information and can obtain more accurate predictions.
Gradient Analysis. In this section, we prune VGG-16 from conv1-1 to conv5-3 with a compression ratio of 0.75 on PASCAL VOC. We then count the percentage of the gradients generated by the three losses during the pruning process. From Fig. 4, we see that the gradient of the regression loss plays an important role during the pruning process, which indicates that the localization information is necessary. The gradient generated by the reconstruction error only works in the shallow layers, while the localization-aware loss contributes to the channel pruning process in every layer.
Component Analysis. To verify the effectiveness of the three components we propose, we prune the SSD based on VGG-16 by 75% with different combinations of these components. We report the results in Tab. 4. The results show that each part of the proposed method contributes to the performance. The adaptive sampling boosts performance by 0.3% and the scale-aware loss boosts performance by 0.5%.
Loss Analysis. To explore the importance of the gradient of the regression loss, we prune the SSD based on VGG-16 by 75% with different losses. We report the results in Tab. 5. The results show that the performance of our method drops considerably without the gradient of the regression loss during the pruning stage, which shows that the regression branch contains important localization information.
Visualization of predictions
In this section, we prune the SSD based on VGG-16 by 75% and compare the original model with the pruned models. From Fig. 5, we find that the predictions of our method are close to the predictions of the original model, while the predictions of ThiNet are far from them. This indicates that our method preserves more localization information for bounding box regression.
In this paper, we propose a localization-aware auxiliary network which allows us to conduct channel pruning directly for object detection. We first design the auxiliary network with a RoIAlign layer which can obtain precise localization information of the default boxes. Then, we propose an algorithm for adaptively adjusting the sampling area, which enlarges the receptive fields when we prune shallow layers. Finally, we propose a scale-aware loss function which tends to keep the channels that contain the information for small objects. Visualization shows that our method preserves layers with more localization information. Moreover, extensive experiments demonstrate the effectiveness of our method.
- (2016) Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems, pp. 2270–2278.
- (2016) Binarized neural networks: training deep neural networks with weights and activations constrained to +1 or -1. arXiv preprint arXiv:1602.02830.
- (2010) The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88 (2), pp. 303–338.
- (2016) Dynamic network surgery for efficient dnns. In Advances in Neural Information Processing Systems, pp. 1379–1387.
- (2015) Deep compression: compressing deep neural networks with pruning, trained quantization and huffman coding. arXiv preprint arXiv:1510.00149.
- (2015) Learning both weights and connections for efficient neural network. In Advances in Neural Information Processing Systems, pp. 1135–1143.
- (1993) Second order derivatives for network pruning: optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 164–171.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- (2017) Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397.
- (2016) Network trimming: a data-driven neuron pruning approach towards efficient deep architectures. arXiv preprint arXiv:1607.03250.
- (2018) Efficient dnn neuron pruning by minimizing layer-wise nonlinear reconstruction error. In IJCAI, Vol. 2018, pp. 2–2.
- (2016) Training skinny deep neural networks with iterative hard thresholding methods. arXiv preprint arXiv:1607.05423.
- (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- (1990) Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605.
- (2016) Ternary weight networks. arXiv preprint arXiv:1605.04711.
- (2016) Pruning filters for efficient convnets. arXiv preprint arXiv:1608.08710.
- (2016) Ssd: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
- (2017) Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2736–2744.
- (2017) Thinet: a filter level pruning method for deep neural network compression. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5058–5066.
- (2015) Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1520–1528.
- (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
- (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 658–666.
- (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- (2016) Temporal segment networks: towards good practices for deep action recognition. In European Conference on Computer Vision, pp. 20–36.
- (2016) Learning structured sparsity in deep neural networks. In Advances in Neural Information Processing Systems, pp. 2074–2082.
- (2017) Incremental network quantization: towards lossless cnns with low-precision weights. arXiv preprint arXiv:1702.03044.
- (2018) Discrimination-aware channel pruning for deep neural networks. In Advances in Neural Information Processing Systems, pp. 875–886.