Neural networks have enabled state-of-the-art approaches to achieve incredible results on computer vision tasks such as object detection. However, such success greatly relies on costly computation resources, which hinders people with cheap devices from appreciating the advanced technology. In this paper, we propose Cross Stage Partial Network (CSPNet) to mitigate the problem that previous works require heavy inference computations from the network architecture perspective. We attribute the problem to the duplicate gradient information within network optimization. The proposed networks respect the variability of the gradients by integrating feature maps from the beginning and the end of a network stage, which, in our experiments, reduces computations by 20% with equivalent or even superior accuracy on the ImageNet dataset, and significantly outperforms state-of-the-art approaches in terms of AP50 on the MS COCO object detection dataset. The CSPNet is easy to implement and general enough to cope with architectures based on ResNet, ResNeXt, and DenseNet. Source code is at https://github.com/WongKinYiu/CrossStagePartialNetworks.
Neural networks have been shown to be especially powerful when they get deeper [he2016deep, xie2017aggregated, huang2017densely] and wider [Zagoruyko2016WRN]. However, extending the architecture of neural networks usually brings many more computations, which makes computationally heavy tasks such as object detection unaffordable for most people. Light-weight computing has gradually received stronger attention since real-world applications usually require short inference time on small devices, which poses a serious challenge for computer vision algorithms. Although some approaches were designed exclusively for mobile CPU [howard2017mobilenets, sandler2018mobilenetv2, howard2019searching, tan2019mnasnet, zhang2018shufflenet, ma2018shufflenetv2], the depth-wise separable convolution techniques they adopted are not compatible with industrial IC design such as Application-Specific Integrated Circuit (ASIC) for edge-computing systems. In this work, we investigate the computational burden in state-of-the-art approaches such as ResNet, ResNeXt, and DenseNet. We further develop computationally efficient components that enable the mentioned networks to be deployed on both CPUs and mobile GPUs without sacrificing performance.
In this study, we introduce Cross Stage Partial Network (CSPNet). The main purpose of designing CSPNet is to enable this architecture to achieve a richer gradient combination while reducing the amount of computation. This aim is achieved by partitioning the feature map of the base layer into two parts and then merging them through a proposed cross-stage hierarchy. Our main concept is to make the gradient flow propagate through different network paths by splitting the gradient flow. In this way, we have confirmed that the propagated gradient information can have a large correlation difference by switching concatenation and transition steps. In addition, CSPNet can greatly reduce the amount of computation, and improve inference speed as well as accuracy, as illustrated in Figure 1. The proposed CSPNet-based object detector deals with the following three problems:
1) Strengthening learning ability of a CNN The accuracy of existing CNNs is greatly degraded after they are made lightweight, so we hope to strengthen the CNN's learning ability so that it can maintain sufficient accuracy while being lightweight. The proposed CSPNet can be easily applied to ResNet, ResNeXt, and DenseNet. After applying CSPNet to the above-mentioned networks, the computation effort can be reduced by 10% to 20%, while the resulting models outperform ResNet [he2016deep], ResNeXt [xie2017aggregated], DenseNet [huang2017densely], HarDNet [chao2019hardnet], Elastic [wang2019elastic], and Res2Net [gao2019res2net] in terms of accuracy on the ImageNet [deng2009imagenet] image classification task.
2) Removing computational bottlenecks Too high a computational bottleneck will result in more cycles to complete the inference process, or some arithmetic units will often idle. Therefore, we hope to distribute the amount of computation evenly across the layers of a CNN so that we can effectively raise the utilization rate of each computation unit and thus reduce unnecessary energy consumption. It is noted that the proposed CSPNet cuts the computational bottlenecks of PeleeNet [wang2018pelee] in half. Moreover, in the MS COCO [lin2014microsoft] dataset-based object detection experiments, our proposed model can effectively reduce the computational bottleneck by 80% when tested on YOLOv3-based models.
3) Reducing memory costs The wafer fabrication cost of Dynamic Random-Access Memory (DRAM) is very expensive, and it also takes up a lot of space. If one can effectively reduce the memory cost, one will greatly reduce the cost of ASIC. In addition, a small area wafer can be used in a variety of edge computing devices. To reduce memory usage, we adopt cross-channel pooling [goodfellow2013maxout] to compress the feature maps during the feature pyramid generating process. In this way, the proposed CSPNet with the proposed object detector can cut down memory usage on PeleeNet by 75% when generating feature pyramids.
Since CSPNet is able to promote the learning capability of a CNN, we use smaller models to achieve better accuracy. Our proposed model can achieve 50% COCO AP50 at 109 fps on GTX 1080ti. Since CSPNet can effectively cut down a significant amount of memory traffic, our proposed method can achieve 40% COCO AP50 at 52 fps on Intel Core i9-9900K. In addition, since CSPNet can significantly lower the computational bottleneck and the Exact Fusion Model (EFM) can effectively cut down the required memory bandwidth, our proposed method can achieve 42% COCO AP50 at 49 fps on Nvidia Jetson TX2.
CNN architectures design. In ResNeXt [xie2017aggregated], Xie et al. first demonstrate that cardinality can be more effective than the dimensions of width and depth. DenseNet [huang2017densely] can significantly reduce the number of parameters and computations due to its strategy of reusing a large number of features. It concatenates the output features of all preceding layers as the next input, which can be considered as a way to maximize cardinality. SparseNet [zhu2018sparsely] adjusts the dense connection to an exponentially spaced connection, which can effectively improve parameter utilization and thus result in better outcomes. Wang et al. further explain why high cardinality and sparse connection can improve the learning ability of the network by the concept of gradient combination, and develop the partial ResNet (PRN) [wang2019enriching]. For improving the inference speed of CNNs, Ma et al. [ma2018shufflenetv2] introduce four guidelines to be followed and design ShuffleNet-v2. Chao et al. [chao2019hardnet] propose a low memory traffic CNN called Harmonic DenseNet (HarDNet) and a metric called Convolutional Input/Output (CIO), which is an approximation of DRAM traffic proportional to the real DRAM traffic measurement.
Real-time object detector. The two most famous real-time object detectors are YOLOv3 [redmon2018yolov3] and SSD [liu2016ssd]. Based on SSD, LRF [wang2019learning] and RFBNet [liu2018receptive] can achieve state-of-the-art real-time object detection performance on GPU. Recently, anchor-free object detectors [duan2019centernet, zhou2019objects, law2018cornernet, law2019cornernet, zhang2019freeanchor] have become mainstream object detection systems. Two object detectors of this sort are CenterNet [zhou2019objects] and CornerNet-Lite [law2019cornernet], and they both perform very well in terms of efficiency and efficacy. For real-time object detection on CPU or mobile GPU, SSD-based Pelee [wang2018pelee], YOLOv3-based PRN [wang2019enriching], and Light-Head RCNN [li2017light]-based ThunderNet [qin2019thundernet] all achieve excellent performance on object detection.
DenseNet. Figure 2 (a) shows the detailed structure of one stage of the DenseNet proposed by Huang et al. [huang2017densely]. Each stage of a DenseNet contains a dense block and a transition layer, and each dense block is composed of $k$ dense layers. The output of the $i^{th}$ dense layer is concatenated with the input of the $i^{th}$ dense layer, and the concatenated outcome becomes the input of the $(i+1)^{th}$ dense layer. This mechanism can be expressed as:

$$
\begin{aligned}
x_1 &= w_1 * x_0 \\
x_2 &= w_2 * [x_0, x_1] \\
&\;\;\vdots \\
x_k &= w_k * [x_0, x_1, \dots, x_{k-1}]
\end{aligned}
\tag{1}
$$

where $*$ represents the convolution operator, $[x_0, x_1, \dots]$ means to concatenate $x_0, x_1, \dots$, and $w_i$ and $x_i$ are the weights and output of the $i^{th}$ dense layer, respectively.
If one makes use of a backpropagation algorithm to update weights, the equations of weight updating can be written as:

$$
\begin{aligned}
w_1' &= f(w_1, g_0) \\
w_2' &= f(w_2, g_0, g_1) \\
&\;\;\vdots \\
w_k' &= f(w_k, g_0, g_1, \dots, g_{k-1})
\end{aligned}
\tag{2}
$$

where $f$ is the function of weight updating, and $g_i$ represents the gradient propagated to the $i^{th}$ dense layer. We can see that a large amount of gradient information is reused for updating the weights of different dense layers. As a result, different dense layers repeatedly learn copied gradient information.
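The reuse pattern described above can be made concrete with a small counting sketch. It assumes that the update of the i-th dense layer receives the gradient terms g_0 through g_{i-1}, mirroring the concatenated input of that layer; the function names `gradient_sources` and `reuse_counts` and the string labels are illustrative and do not come from the original work.

```python
from collections import Counter


def gradient_sources(num_layers):
    """Map each dense layer index i (1-based) to the gradient terms its update sees.

    Assumption: layer i is updated with g_0 ... g_{i-1}, mirroring the
    concatenated input [x_0, ..., x_{i-1}] of a dense block.
    """
    return {i: [f"g_{j}" for j in range(i)] for i in range(1, num_layers + 1)}


def reuse_counts(num_layers):
    """Count in how many layer updates each gradient term appears."""
    counts = Counter()
    for terms in gradient_sources(num_layers).values():
        counts.update(terms)
    return dict(counts)


if __name__ == "__main__":
    for layer, terms in gradient_sources(4).items():
        print(f"w_{layer} <- {terms}")
    print(reuse_counts(4))
```

Under this assumption the earliest gradient term takes part in every layer update of the block, which is the duplication that the cross-stage design aims to reduce.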
Cross Stage Partial DenseNet. The architecture of one stage of the proposed CSPDenseNet is shown in Figure 2 (b). A stage of CSPDenseNet is composed of a partial dense block and a partial transition layer. In a partial dense block, the feature maps of the base layer in a stage are split into two parts through channel, $x_0 = [x_0', x_0'']$. Between $x_0'$ and $x_0''$, the former is directly linked to the end of the stage, and the latter goes through a dense block. The steps involved in a partial transition layer are as follows: First, the output of the dense layers, $[x_0'', x_1, \dots, x_k]$, undergoes a transition layer. Second, the output of this transition layer, $x_T$, is concatenated with $x_0'$ and undergoes another transition layer, which generates the output $x_U$. The equations of the feed-forward pass and weight updating of CSPDenseNet are shown in Equations 3 and 4, respectively.

$$
\begin{aligned}
x_k &= w_k * [x_0'', x_1, \dots, x_{k-1}] \\
x_T &= w_T * [x_0'', x_1, \dots, x_k] \\
x_U &= w_U * [x_0', x_T]
\end{aligned}
\tag{3}
$$

$$
\begin{aligned}
w_k' &= f(w_k, g_0'', g_1, \dots, g_{k-1}) \\
w_T' &= f(w_T, g_0'', g_1, \dots, g_k) \\
w_U' &= f(w_U, g_0', g_T)
\end{aligned}
\tag{4}
$$
We can see that the gradients coming from the dense layers are separately integrated. On the other hand, the feature map that did not go through the dense layers is also separately integrated. As to the gradient information for updating weights, neither side contains duplicate gradient information that belongs to the other side.
Overall, the proposed CSPDenseNet preserves the advantages of DenseNet's feature reuse characteristics, but at the same time prevents an excessive amount of duplicate gradient information by truncating the gradient flow. This idea is realized by designing a hierarchical feature fusion strategy, which is used in the partial transition layer.
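The split, dense block, transition, concatenation, and second transition described above can be traced as channel bookkeeping. The following is a minimal sketch rather than the authors' implementation: the parameter `bypass_ratio` (the share of base channels that skips the dense block) and the optional output widths `transition_out` and `stage_out` are assumptions, since the text does not fix the widths of the two transition layers.

```python
def dense_block_channels(in_channels, growth_rate, num_layers):
    """Channels after a dense block: every dense layer appends `growth_rate` channels."""
    return in_channels + growth_rate * num_layers


def csp_dense_stage(base_channels, growth_rate, num_layers,
                    bypass_ratio=0.5, transition_out=None, stage_out=None):
    """Trace channel counts through one CSPDenseNet stage.

    Assumptions (not fixed by the text): `bypass_ratio` is the share of base
    channels linked directly to the end of the stage, and each transition
    layer keeps its input width unless an explicit output width is given.
    """
    bypass = int(base_channels * bypass_ratio)   # part linked to the stage end
    dense_in = base_channels - bypass            # part sent through the dense block
    dense_out = dense_block_channels(dense_in, growth_rate, num_layers)
    x_t = dense_out if transition_out is None else transition_out  # first transition
    concat = bypass + x_t                        # concatenate bypass part with x_T
    x_u = concat if stage_out is None else stage_out               # second transition
    return {"bypass": bypass, "dense_in": dense_in, "dense_out": dense_out,
            "x_T": x_t, "concat": concat, "x_U": x_u}


if __name__ == "__main__":
    print(csp_dense_stage(base_channels=64, growth_rate=32, num_layers=4))
```

With 64 base channels, only 32 of them enter the dense layers, so each dense layer reads 32 fewer input channels than it would in a plain dense block of the same depth and growth rate.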
Partial Dense Block. The purpose of designing partial dense blocks is to (1) increase gradient paths: through the split and merge strategy, the number of gradient paths can be doubled. Because of the cross-stage strategy, one can alleviate the disadvantages caused by using explicit feature map copies for concatenation; (2) balance the computation of each layer: usually, the channel number in the base layer of a DenseNet is much larger than the growth rate. Since the base layer channels involved in the dense layer operation in a partial dense block account for only half of the original number, it can effectively remove nearly half of the computational bottleneck; and (3) reduce memory traffic: assume the base feature map size of a dense block in a DenseNet is $w \times h \times c$, the growth rate is $d$, and there are in total $m$ dense layers. Then, the CIO of that dense block is $(c \times m) + ((m^2 + m) \times d)/2$, and the CIO of the partial dense block is $((c \times m) + (m^2 + m) \times d)/2$. Since $m$ and $d$ are usually far smaller than $c$, a partial dense block is able to save at most half of the memory traffic of a network.
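The memory-traffic argument can be checked with a layer-by-layer count. This sketch assumes that the CIO of a dense layer is its number of input channels plus its number of output channels, with the spatial factor left out, and that a partial dense block feeds half of the base channels into the dense layers; the function names are illustrative.

```python
def cio_dense_block(c, d, m):
    """CIO of a dense block: layer i reads c + (i - 1) * d channels and writes d."""
    return sum((c + (i - 1) * d) + d for i in range(1, m + 1))


def cio_partial_dense_block(c, d, m):
    """Same count when only half of the base channels (c / 2) enter the dense layers."""
    return sum((c / 2 + (i - 1) * d) + d for i in range(1, m + 1))


def cio_dense_closed_form(c, d, m):
    """Closed form of the dense block count: c * m + (m^2 + m) * d / 2."""
    return c * m + (m * m + m) * d / 2


if __name__ == "__main__":
    c, d, m = 256, 32, 6  # hypothetical base channels, growth rate, layer count
    full = cio_dense_block(c, d, m)
    partial = cio_partial_dense_block(c, d, m)
    print(full, partial, partial / full)
```

For the hypothetical values above the ratio is roughly 0.65; it approaches 0.5 as the base channel count grows relative to the growth rate and the number of layers.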
Partial Transition Layer. The purpose of designing partial transition layers is to maximize the difference of gradient combination. The partial transition layer is a hierarchical feature fusion mechanism, which uses the strategy of truncating the gradient flow to prevent distinct layers from learning duplicate gradient information. Here we design two variations of CSPDenseNet to show how this sort of gradient flow truncating affects the learning ability of a network. Figure 3 (c) and Figure 3 (d) show two different fusion strategies. CSP (fusion first) means to concatenate the feature maps generated by the two parts, and then perform the transition operation. If this strategy is adopted, a large amount of gradient information will be reused. As to the CSP (fusion last) strategy, the output from the dense block goes through the transition layer and is then concatenated with the feature map coming from part 1. If one goes with the CSP (fusion last) strategy, the gradient information will not be reused since the gradient flow is truncated. If we use the four architectures shown in Figure 3 to perform image classification, the corresponding results are shown in Figure 4. It can be seen that if one adopts the CSP (fusion last) strategy to perform image classification, the computation cost drops significantly, but the top-1 accuracy only drops by 0.1%. On the other hand, the CSP (fusion first) strategy also yields a significant drop in computation cost, but the top-1 accuracy drops significantly, by 1.5%. By using the split and merge strategy across stages, we are able to effectively reduce the possibility of duplication during the information integration process. The results shown in Figure 4 indicate that if one can effectively reduce the repeated gradient information, the learning ability of a network will be greatly improved.
Apply CSPNet to Other Architectures. CSPNet can also be easily applied to ResNet and ResNeXt; the architectures are shown in Figure 5. Since only half of the feature channels go through the Res(X)Blocks, there is no need to introduce the bottleneck layer anymore. This attains the theoretical lower bound of the Memory Access Cost (MAC) when the FLoating-point OPerations (FLOPs) are fixed.
Looking Exactly to predict perfectly. We propose EFM that captures an appropriate Field of View (FoV) for each anchor, which enhances the accuracy of the one-stage object detector. For segmentation tasks, since pixel-level labels usually do not contain global information, it is usually more preferable to consider larger patches for better information retrieval [liu2015parsenet]. However, for tasks like image classification and object detection, some critical information can be obscure when observed from image-level and bounding box-level labels. Li et al. [li2018tell] found that CNN can be often distracted when it learns from image-level labels and concluded that it is one of the main reasons that two-stage object detectors outperform one-stage object detectors.
Aggregate Feature Pyramid. The proposed EFM is able to better aggregate the initial feature pyramid. The EFM is based on YOLOv3 [redmon2018yolov3], which assigns exactly one bounding-box prior to each ground truth object. Each ground truth bounding box corresponds to one anchor box that surpasses the threshold IoU. If the size of an anchor box is equivalent to the FoV of the grid cell, then for the grid cells of the $s^{th}$ scale, the corresponding bounding box will be lower bounded by the $(s-1)^{th}$ scale and upper bounded by the $(s+1)^{th}$ scale. Therefore, the EFM assembles features from the three scales.
Balance Computation. Since the concatenated feature maps from the feature pyramid are enormous, they introduce a great amount of memory and computation cost. To alleviate the problem, we incorporate the Maxout technique to compress the feature maps.
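A minimal sketch of cross-channel max pooling in the spirit of Maxout is given below. It is not the EFM implementation: the grouping of consecutive channels and the `group_size` value are assumptions, and the feature map is represented as plain Python lists to keep the example self-contained.

```python
def maxout_channels(feature_map, group_size):
    """Cross-channel max pooling.

    `feature_map` is a list of channels, each a flat list of activations of
    equal length. Consecutive groups of `group_size` channels are reduced to
    one channel by an element-wise maximum, so the channel count shrinks by
    a factor of `group_size`.
    """
    if len(feature_map) % group_size != 0:
        raise ValueError("channel count must be divisible by group_size")
    pooled = []
    for start in range(0, len(feature_map), group_size):
        group = feature_map[start:start + group_size]
        # zip(*group) walks the spatial positions across the channels of the group
        pooled.append([max(values) for values in zip(*group)])
    return pooled


if __name__ == "__main__":
    fm = [[1, 5], [3, 2], [0, 7], [4, 4]]  # 4 channels, 2 spatial positions each
    print(maxout_channels(fm, group_size=2))
```

In this toy case four channels are compressed to two while the spatial size is unchanged, which is the kind of reduction that lowers the memory needed for the concatenated pyramid features.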
We will use ImageNet’s image classification dataset [deng2009imagenet] used in ILSVRC 2012 to validate our proposed CSPNet. Besides, we also use the MS COCO object detection dataset [lin2014microsoft] to verify the proposed EFM. Details of the proposed architectures will be elaborated in the appendix.
ImageNet. In ImageNet image classification experiments, all hyper-parameters such as training steps, learning rate schedule, optimizer, and data augmentation follow the settings defined in Redmon et al. [redmon2018yolov3]. For ResNet-based models and ResNeXt-based models, we set 800,000 training steps. As to DenseNet-based models, we set 1,600,000 training steps. We set the initial learning rate to 0.1 and adopt the polynomial decay learning rate scheduling strategy. The momentum and weight decay are set to 0.9 and 0.005, respectively. All architectures are trained on a single GPU with a batch size of 128. Finally, we use the validation set of ILSVRC 2012 to validate our method.
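The polynomial decay schedule can be sketched as follows. The initial learning rate of 0.1 and the 1,600,000 steps used for DenseNet-based models come from the text; the exponent `power` is not stated there, so the default of 4.0 is an assumption, and `poly_lr` is an illustrative name.

```python
def poly_lr(step, max_steps, base_lr=0.1, power=4.0):
    """Polynomial decay: base_lr * (1 - step / max_steps) ** power.

    `power` is an assumed value; the text only names the schedule type.
    """
    step = min(max(step, 0), max_steps)  # clamp so the rate never goes negative
    return base_lr * (1.0 - step / max_steps) ** power


if __name__ == "__main__":
    total = 1_600_000  # training steps stated for DenseNet-based models
    for s in (0, total // 4, total // 2, total):
        print(s, poly_lr(s, total))
```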
MS COCO. In MS COCO object detection experiments, all hyper-parameters also follow the settings defined in Redmon et al. [redmon2018yolov3]. Altogether we ran 500,000 training steps. We adopt the step decay learning rate scheduling strategy and multiply the learning rate by a factor of 0.1 at step 400,000 and again at step 450,000. The momentum and weight decay are set to 0.9 and 0.0005, respectively. All architectures use a single GPU to execute multi-scale training with a batch size of 64. Finally, the COCO test-dev set is adopted to verify our method.
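The step decay schedule can be sketched in the same way. The milestones and the factor of 0.1 come from the text; the initial learning rate for the detection experiments is not given there, so `base_lr` is left as a required argument, and applying the decay exactly at the milestone step is an assumption.

```python
def step_lr(step, base_lr, milestones=(400_000, 450_000), factor=0.1):
    """Multiply base_lr by `factor` once for every milestone already reached.

    Assumption: the decay takes effect at the milestone step itself.
    """
    passed = sum(1 for m in milestones if step >= m)
    return base_lr * factor ** passed


if __name__ == "__main__":
    base = 1.0  # placeholder value; the text does not state the initial rate
    for s in (0, 399_999, 400_000, 450_000, 500_000):
        print(s, step_lr(s, base))
```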
Ablation study of CSPNet on ImageNet. In the ablation experiments conducted on the CSPNet, we adopt PeleeNet [wang2018pelee] as the baseline, and the ImageNet is used to verify the performance of the CSPNet. We use different partial ratios and the different feature fusion strategies for ablation study. Table 1 shows the results of ablation study on CSPNet. In Table 1, SPeleeNet and PeleeNeXt are, respectively, the architectures that introduce sparse connection and group convolution to PeleeNet. As to CSP (fusion first) and CSP (fusion last), they are the two strategies proposed to validate the benefits of a partial transition.
From the experimental results, if one only uses the CSP (fusion first) strategy on the cross-stage partial dense block, the performance can be slightly better than SPeleeNet and PeleeNeXt. However, the partial transition layer designed to reduce the learning of redundant information can achieve very good performance. For example, when the computation is cut down by 21%, the accuracy only degrades by 0.1%. One thing to be noted is that when $\gamma = 0.25$, the computation is cut down by 11%, but the accuracy is increased by 0.1%. Compared to the baseline PeleeNet, the proposed CSPPeleeNet achieves the best performance: it can cut down computation by 13% and at the same time upgrade the accuracy by 0.2%. If we adjust the partial ratio to $\gamma = 0.75$, we are able to upgrade the accuracy by 0.8% and at the same time cut down computation by 3%.
Ablation study of EFM on MS COCO. Next, we shall conduct an ablation study of EFM based on the MS COCO dataset. In this series of experiments, we compare three different feature fusion strategies shown in Figure 6. We choose two state-of-the-art lightweight models, PRN [wang2019enriching] and ThunderNet [qin2019thundernet], to make comparison. PRN is the feature pyramid architecture used for comparison, and the ThunderNet with Context Enhancement Module (CEM) and Spatial Attention Module (SAM) are the global fusion architecture used for comparison. We design a Global Fusion Model (GFM) to compare with the proposed EFM. Moreover, GIoU [rezatofighi2019generalized], SPP, and SAM are also applied to EFM to conduct an ablation study. All experiment results listed in Table 2 adopt CSPPeleeNet as the backbone.
As reflected in the experiment results, the proposed EFM is 2 fps slower than GFM, but its AP and AP50 are significantly upgraded by 2.1% and 2.4%, respectively. Although the introduction of GIoU can upgrade AP by 0.7%, the AP50 is, however, significantly degraded by 2.7%. For edge computing, what really matters is the number and locations of the objects rather than their coordinates. Therefore, we will not use GIoU training in the subsequent models. The attention mechanism used by SAM can get a better frame rate and AP compared with SPP's increase of FoV mechanism, so we use EFM (SAM) as the final architecture. In addition, although the CSPPeleeNet with swish activation can improve AP by 1%, its operation requires a lookup table in the hardware design for acceleration, so we finally also abandoned the swish activation function.
We apply the proposed CSPNet to ResNet-10 [he2016deep], ResNeXt-50 [xie2017aggregated], PeleeNet [wang2018pelee], and DenseNet-201-Elastic [wang2019elastic] and compare with state-of-the-art methods. The experimental results are shown in Table 3.
The experimental results confirm that, whether for ResNet-based, ResNeXt-based, or DenseNet-based models, introducing the concept of CSPNet reduces the computational load by at least 10% while the accuracy either remains unchanged or is upgraded. Introducing the concept of CSPNet is especially useful for the improvement of lightweight models. For example, compared to ResNet-10, CSPResNet-10 can improve accuracy by 1.8%. As to PeleeNet and DenseNet-201-Elastic, CSPPeleeNet and CSPDenseNet-201-Elastic can respectively cut down computation by 13% and 19%, and either slightly upgrade or maintain the accuracy. As to the case of ResNeXt-50, CSPResNeXt-50 can cut down computation by 22% and upgrade top-1 accuracy to 77.9%.
As for the state-of-the-art lightweight model EfficientNet-B0, although it can achieve 76.8% accuracy when the batch size is 2048, it can only reach 70.0% accuracy when the experiment environment is the same as ours, that is, when only one GPU is used. In fact, the swish activation function and SE block used by EfficientNet-B0 are not efficient on the mobile GPU. A similar analysis has been conducted during the development of EfficientNet-EdgeTPU. Here, for demonstrating the learning ability of CSPNet, we introduce swish and SE into CSPPeleeNet and then make a comparison with EfficientNet-B0*. In this experiment, SECSPPeleeNet-swish cuts down computation by 3% and upgrades top-1 accuracy by 1.1%.
The proposed CSPResNeXt-50 is compared with ResNeXt-50 [xie2017aggregated], ResNet-152 [he2016deep], DenseNet-264 [huang2017densely], and HarDNet-138s [chao2019hardnet]. In terms of parameter quantity, amount of computation, and top-1 accuracy, CSPResNeXt-50 achieves the best result in every case. As to the 10-crop test, CSPResNeXt-50 also outperforms Res2Net-50 [gao2019res2net] and Res2NeXt-50 [gao2019res2net].
In the task of object detection, we aim at three targeted scenarios: (1) real-time on GPU: we adopt CSPResNeXt50 with PANet (SPP) [liu2018path]; (2) real-time on mobile GPU: we adopt CSPPeleeNet, CSPPeleeNet Reference, and CSPDenseNet Reference with the proposed EFM (SAM); and (3) real-time on CPU: we adopt CSPPeleeNet Reference and CSPDenseNet Reference with PRN [wang2019enriching]. The comparisons between the above models and the state-of-the-art methods are listed in Table 4. The analysis of the inference speed on CPU and mobile GPU is detailed in the next subsection.
Compared to object detectors running at 30 to 100 fps, CSPResNeXt50 with PANet (SPP) achieves the best performance in AP, AP50, and AP75, reaching 38.4%, 60.6%, and 41.6%, respectively. Compared to the state-of-the-art LRF [wang2019learning] under the input image size 512×512, CSPResNeXt50 with PANet (SPP) outperforms ResNet101 with LRF by 0.7% AP, 1.5% AP50, and 1.1% AP75. Compared to object detectors running at 100 to 200 fps, CSPPeleeNet with EFM (SAM) boosts AP50 by 12.1% at the same speed as Pelee [wang2018pelee] and yields a 4.1% increase [wang2018pelee] at the same speed as CenterNet [zhou2019objects].
Compared to very fast object detectors such as ThunderNet [qin2019thundernet], YOLOv3-tiny [redmon2018yolov3], and YOLOv3-tiny-PRN [wang2019enriching], the proposed CSPDenseNetb Reference with PRN is the fastest. It can reach a frame rate of 400 fps, i.e., 133 fps faster than ThunderNet with SNet49. Besides, it gets 0.5% higher on AP. Compared to ThunderNet146, CSPPeleeNet Reference with PRN (3l) increases the frame rate by 19 fps while maintaining the same level of AP.
Computational Bottleneck. Figure 7 shows the BLOPS of each layer of PeleeNet-YOLO, PeleeNet-PRN, and the proposed CSPPeleeNet-EFM. From Figure 7, it is clear that the computational bottleneck of PeleeNet-YOLO occurs when the head integrates the feature pyramid. The computational bottleneck of PeleeNet-PRN occurs on the transition layers of the PeleeNet backbone. As to the proposed CSPPeleeNet-EFM, it can balance the overall computational bottleneck, reducing the computational bottleneck of the PeleeNet backbone by 44% and that of PeleeNet-YOLO by 80%. Therefore, we can say that the proposed CSPNet can provide hardware with a higher utilization rate.
Memory Traffic. Figure 8 shows the size of each layer of ResNeXt50 and the proposed CSPResNeXt50. The CIO of the proposed CSPResNeXt (32.6M) is lower than that of the original ResNeXt50 (34.4M). In addition, our CSPResNeXt50 removes the bottleneck layers in the ResXBlock and maintains the same number of input and output channels, which Ma et al. [ma2018shufflenetv2] show to give the lowest MAC and the most efficient computation when FLOPs are fixed. The low CIO and FLOPs enable our CSPResNeXt50 to outperform the vanilla ResNeXt50 by 22% in terms of computations.
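The equal-channel argument can be illustrated with the memory access cost of a 1×1 convolution, the case in which the bound from Ma et al. is usually stated. In this sketch the MAC is counted as input map plus output map plus weights, and the FLOPs B as h·w·c_in·c_out; the function names and the 56×56 feature map size are illustrative, and carrying the result over to the ResXBlock is only indicative.

```python
import math


def mac_1x1(h, w, c_in, c_out):
    """Memory access cost of a 1x1 convolution: input map + output map + weights."""
    return h * w * (c_in + c_out) + c_in * c_out


def mac_lower_bound(h, w, flops):
    """Lower bound 2 * sqrt(h * w * B) + B / (h * w) for fixed FLOPs B."""
    return 2 * math.sqrt(h * w * flops) + flops / (h * w)


if __name__ == "__main__":
    h = w = 56
    # Channel pairs with the same product, hence the same FLOPs.
    for c_in, c_out in ((128, 128), (64, 256), (32, 512)):
        flops = h * w * c_in * c_out
        print(c_in, c_out, mac_1x1(h, w, c_in, c_out), mac_lower_bound(h, w, flops))
```

Among the pairs with equal FLOPs, the MAC meets the lower bound only when the input and output channel counts are equal, and it grows as the two counts diverge.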
Inference Rate. We further evaluate whether the proposed methods can be deployed on real-time detectors with mobile GPU or CPU. Our experiments are based on NVIDIA Jetson TX2 and Intel Core i9-9900K, and the inference rate on CPU is evaluated with the OpenCV DNN module. We do not adopt model compression or quantization for fair comparisons. The results are shown in Table 5.
If we compare the inference speed on CPU, CSPDenseNetb Ref.-PRN achieves higher AP than SNet49-ThunderNet, YOLOv3-tiny, and YOLOv3-tiny-PRN, and it also outperforms the above three models by 55 fps, 48 fps, and 31 fps, respectively, in terms of frame rate. On the other hand, CSPPeleeNet Ref.-PRN (3l) reaches the same accuracy level as SNet146-ThunderNet but significantly upgrades the frame rate by 20 fps on CPU.
If we compare the inference speed on mobile GPU, our proposed EFM is a good model to use. Since our proposed EFM can greatly reduce the memory requirement when generating feature pyramids, it is beneficial in a mobile environment with restricted memory bandwidth. For example, CSPPeleeNet Ref.-EFM (SAM) has a higher frame rate than YOLOv3-tiny, and its AP is 11.5% higher than that of YOLOv3-tiny, which is a significant upgrade. For the same CSPPeleeNet Ref. backbone, although EFM (SAM) is 62 fps slower than PRN (3l) on GTX 1080ti, it reaches 41 fps on Jetson TX2, 3 fps faster than PRN (3l), with a 4.6% gain in AP.
We have proposed the CSPNet that enables state-of-the-art methods such as ResNet, ResNeXt, and DenseNet to be made lightweight for mobile GPUs or CPUs. One of the main contributions is that we have recognized the redundant gradient information problem that results in inefficient optimization and costly inference computations. We have proposed to utilize the cross-stage feature fusion strategy and the truncated gradient flow to enhance the variability of the learned features within different layers. In addition, we have proposed the EFM that incorporates the Maxout operation to compress the feature maps generated from the feature pyramid, which largely reduces the required memory bandwidth, so that inference is efficient enough to be compatible with edge computing devices. Experimentally, we have shown that the proposed CSPNet with the EFM significantly outperforms competitors in terms of accuracy and inference rate on mobile GPU and CPU for real-time object detection tasks.