Object detection is one of the most fundamental problems in computer vision and serves a wide range of applications, such as autonomous driving, intelligent video surveillance, and remote sensing. In recent years, great progress has been made in object detection thanks to the rapid development of deep convolutional networks, and several excellent detectors have been proposed, e.g., SSD, Faster R-CNN, RetinaNet, FPN, Mask R-CNN, and Cascade R-CNN.
Generally speaking, in a typical CNN-based object detector, a backbone network is used to extract basic features for detecting objects. This backbone is usually designed for the image classification task and pretrained on the ImageNet dataset. Not surprisingly, if a backbone can extract more representational features, its host detector performs better accordingly. In other words, a more powerful backbone brings better detection performance, as demonstrated in Table 1. Hence, starting from AlexNet, state-of-the-art detectors have exploited deeper and larger (i.e., more powerful) backbones, such as VGG, ResNet, DenseNet, and ResNeXt. Despite the encouraging results achieved by state-of-the-art detectors based on deep and large backbones, there is still plenty of room for performance improvement. Moreover, it is very expensive to achieve better detection performance by designing a novel, more powerful backbone and pretraining it on ImageNet. In addition, since almost all existing backbone networks were originally designed for the image classification task, directly employing them to extract basic features for object detection may result in suboptimal performance.
To deal with the issues mentioned above, as illustrated in Figure 1, we propose to assemble multiple identical backbones in a novel way to build a more powerful backbone for object detection. In particular, the assembled backbones are treated as a whole, which we call the Composite Backbone Network (CBNet). More specifically, CBNet consists of multiple identical backbones (called Assistant Backbones and the Lead Backbone) and composite connections between adjacent backbones. From left to right, the output of each stage in an Assistant Backbone, namely higher-level features, flows to the parallel stage of the succeeding backbone as part of its inputs through composite connections. Finally, the feature maps of the last backbone, named the Lead Backbone, are used for object detection. Because the features extracted by CBNet fuse the high-level and low-level features of multiple backbones, they improve the detection performance. It is worth mentioning that we do not need to pretrain CBNet in order to train a detector integrated with it. Instead, we only need to initialize each assembled backbone of CBNet with the pretrained model of the single backbone, which is widely and freely available today for architectures such as ResNet and ResNeXt. In other words, adopting the proposed CBNet is more economical and efficient than designing a novel, more powerful backbone and pretraining it on ImageNet.
On the widely tested MS-COCO benchmark, we conduct experiments by applying the proposed Composite Backbone Network to several state-of-the-art object detectors, such as FPN, Mask R-CNN, and Cascade R-CNN. Experimental results show that the mAPs of all the detectors consistently increase by 1.5 to 3.0 percent, which demonstrates the effectiveness of our Composite Backbone Network. Moreover, with our Composite Backbone Network, the results of instance segmentation are also improved. In particular, using Triple-ResNeXt152, i.e., the Composite Backbone Network architecture of three ResNeXt152 backbones, we achieve a new state-of-the-art result on the COCO dataset, namely an mAP of 53.3, outperforming all published object detectors.
To summarize, the major contributions of this work are two-fold:
- We propose a novel method to build a more powerful backbone for object detection by assembling multiple identical backbones, which can significantly improve the performance of various state-of-the-art detectors.
- We achieve a new state-of-the-art result on the MS-COCO dataset with a single model, namely an mAP of 53.3 for object detection.
2 Related work
Object detection. Object detection is a fundamental problem in computer vision. The state-of-the-art methods for general object detection can be briefly categorized into two major branches. The first branch contains one-stage methods such as YOLO, SSD, RetinaNet, FSAF, and NAS-FPN. The other branch contains two-stage methods such as Faster R-CNN, FPN, Mask R-CNN, Cascade R-CNN, and Libra R-CNN. Although breakthroughs have been made and encouraging results have been achieved by recent CNN-based detectors, there is still large room for performance improvement. For example, on the MS-COCO benchmark, the best publicly reported mAP is only 52.5, which is achieved by a model ensemble of four detectors.
Backbone for object detection. The backbone is a very important component of a CNN-based detector, as it extracts the basic features for object detection. Following the original works that applied deep learning to object detection (e.g., R-CNN and OverFeat), almost all recent detectors adopt the pretraining and fine-tuning paradigm, that is, they directly use networks pretrained for the ImageNet classification task as their backbones. For instance, VGG, ResNet, and ResNeXt are widely used by state-of-the-art detectors. Since these backbone networks were originally designed for the image classification task, directly employing them to extract basic features for object detection may result in suboptimal performance. More recently, two sophisticated backbones, DetNet and FishNet, have been proposed for object detection. Although these two backbones are specifically designed for the object detection task, they still need to be pretrained for the ImageNet classification task before training (fine-tuning) the detector based on them. It is well known that designing and pretraining a novel and powerful backbone like these requires much manpower and computation. As an alternative, we propose a more economical and efficient way to build a more powerful backbone for object detection: assembling multiple identical existing backbones (e.g., ResNet and ResNeXt).
Recurrent Convolutional Neural Network. As shown in Figure 2, the proposed architecture of the Composite Backbone Network is somewhat similar to an unfolded recurrent convolutional neural network (RCNN) architecture. However, the two differ in three respects. First, as illustrated in Figure 2, the connections between the parallel stages are different. Second, in RCNN, the parallel stages of different time steps share their parameters, while in the proposed CBNet, the parallel stages of the backbones do not. Third, if we use RCNN as the backbone of a detector, we need to pretrain it on ImageNet, whereas CBNet does not need to be pretrained.
3 Proposed method
This section elaborates on the proposed CBNet in detail. We first describe its architecture and variants in Section 3.1 and Section 3.2, respectively. We then describe the structure of the detection network with CBNet in Section 3.3.
3.1 Architecture of CBNet
The architecture of the proposed CBNet consists of $K$ identical backbones ($K \geq 2$). For simplicity, we call the case of $K = 2$ (as shown in Figure 2.a) Dual-Backbone (DB), and the case of $K = 3$ Triple-Backbone (TB).
As illustrated in Figure 1, the CBNet architecture consists of two types of backbones: the Lead Backbone $B_K$ and the Assistant Backbones $B_1, B_2, \ldots, B_{K-1}$. Each backbone comprises $L$ stages (generally $L = 5$), and each stage consists of several convolutional layers with feature maps of the same size. The $l$-th stage of the backbone implements a non-linear transformation $F^l(\cdot)$.
In a traditional convolutional network with only one backbone, the $l$-th stage takes the output (denoted as $x^{l-1}$) of the previous $(l-1)$-th stage as input, which can be expressed as:
$$x^l = F^l(x^{l-1}), \quad l \geq 2.$$
Unlike this, in the CBNet architecture, we employ the Assistant Backbones $B_1, B_2, \ldots, B_{K-1}$ to enhance the features of the Lead Backbone $B_K$, by iteratively feeding the output features of the previous backbone as part of the input features to the succeeding backbone, in a stage-by-stage fashion. To be more specific, the input of the $l$-th stage of the backbone $B_k$ is the fusion of the output of the previous $(l-1)$-th stage of $B_k$ (denoted as $x_k^{l-1}$) and the output of the parallel stage of the previous backbone $B_{k-1}$ (denoted as $x_{k-1}^{l}$). This operation can be formulated as follows:
$$x_k^l = F_k^l\left(x_k^{l-1} + g(x_{k-1}^{l})\right), \quad l \geq 2,$$
where $g(\cdot)$ denotes the composite connection, which consists of a $1 \times 1$ convolutional layer and a batch normalization layer to reduce the channels, followed by an upsample operation. As a result, the output features of the $l$-th stage in $B_{k-1}$ are transformed into the input of the same stage in $B_k$, and added to the original input feature maps to go through the corresponding layers. Since this composition style feeds the output of the adjacent higher-level stage of the previous backbone to the succeeding backbone, we call it Adjacent Higher-Level Composition (AHLC).
For the object detection task, only the outputs of the Lead Backbone are taken as the input of the RPN/detection head, while the output of each stage of the Assistant Backbones is forwarded into its adjacent backbone. Moreover, the backbones in CBNet can adopt various backbone architectures, such as ResNet or ResNeXt, and can be initialized directly from the pretrained model of the single backbone.
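The stage-by-stage data flow of AHLC can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: features are plain numbers rather than tensors, the name `cbnet_ahlc_forward` and its arguments are hypothetical, `composite` stands in for the composite connection (1×1 convolution, batch normalization, and upsampling in the text above), and the first stage of every backbone is assumed to receive the raw input.

```python
from typing import Callable, List, Optional, Sequence

Feature = float  # toy stand-in for a feature map
Stage = Callable[[Feature], Feature]


def cbnet_ahlc_forward(
    x: Feature,
    backbones: Sequence[Sequence[Stage]],
    composite: Callable[[Feature], Feature],
) -> List[Feature]:
    """Run the backbones in order with AHLC; return the Lead Backbone's stage outputs.

    backbones[k][l] is stage l (0-based) of backbone k; the last backbone is the Lead Backbone.
    """
    prev_outputs: Optional[List[Feature]] = None  # stage outputs of the previous backbone
    for stages in backbones:
        outputs: List[Feature] = []
        for l, stage in enumerate(stages):
            if l == 0:
                # Assumption: the first stage of every backbone sees the raw input.
                inp = x
            else:
                inp = outputs[l - 1]
                if prev_outputs is not None:
                    # Fuse the parallel-stage output of the previous backbone.
                    inp = inp + composite(prev_outputs[l])
            outputs.append(stage(inp))
        prev_outputs = outputs
    assert prev_outputs is not None, "at least one backbone is required"
    return prev_outputs


# Toy usage: a Dual-Backbone with three stages per backbone.
add_one: Stage = lambda v: v + 1.0
dual = [[add_one] * 3, [add_one] * 3]
lead_features = cbnet_ahlc_forward(0.0, dual, composite=lambda v: 2.0 * v)
print(lead_features)
```

Only the returned list (the Lead Backbone's stage outputs) would be handed to the detection head; the Assistant Backbone's outputs are consumed solely through `composite`.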
3.2 Other possible composite styles
Same Level Composition (SLC)
An intuitive and simple composite style is to fuse the output features from the same stage of the backbones. This operation of Same Level Composition (SLC) can be formulated as:
$$x_k^l = F_k^l\left(x_k^{l-1} + x_{k-1}^{l-1}\right), \quad l \geq 2.$$
To be more specific, Figure 3.b illustrates the structure of SLC when $K = 2$.
Adjacent Lower-Level Composition (ALLC)
Contrary to AHLC, another intuitive composite style is to feed the output of the adjacent lower-level stage of the previous backbone to the succeeding backbone. This operation of Adjacent Lower-Level Composition (ALLC) can be formulated as:
$$x_k^l = F_k^l\left(x_k^{l-1} + g(x_{k-1}^{l-2})\right).$$
To be more specific, Figure 3.c illustrates the structure of ALLC when $K = 2$.
Dense Higher-Level Composition (DHLC)
In DenseNet, each layer is connected to all subsequent layers to build dense connections within a stage. Inspired by this, we can utilize dense composite connections in our CBNet architecture. The operation of DHLC can be expressed as follows:
$$x_k^l = F_k^l\left(x_k^{l-1} + \sum_{i=l}^{L} g_i(x_{k-1}^{i})\right), \quad l \geq 2.$$
As shown in Figure 3.d, when $K = 2$, we assemble the features from all the higher-level stages in the Assistant Backbone, and add the composite features to the output features of the previous stage in the Lead Backbone.
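The four composite styles differ only in which stages of the previous backbone feed a given stage of the succeeding backbone. The following minimal sketch makes that difference explicit. The function name `composite_sources` and the 1-based stage indexing are assumptions made for illustration, and the index sets reflect one reading of the descriptions above rather than the authors' code.

```python
from typing import List


def composite_sources(style: str, l: int, num_stages: int) -> List[int]:
    """Return the 1-based stage indices of the previous backbone that feed stage l.

    Stage l of the succeeding backbone always receives its own stage l-1 output;
    this function lists only the extra inputs coming from the previous backbone.
    """
    if not 2 <= l <= num_stages:
        raise ValueError("l must satisfy 2 <= l <= num_stages")
    if style == "AHLC":  # adjacent higher-level: the parallel stage l
        return [l]
    if style == "SLC":  # same level as the stage l-1 output
        return [l - 1]
    if style == "ALLC":  # adjacent lower-level: stage l-2, if it exists
        return [l - 2] if l - 2 >= 1 else []
    if style == "DHLC":  # dense: every stage from l up to the last one
        return list(range(l, num_stages + 1))
    raise ValueError(f"unknown composite style: {style}")


for style in ("AHLC", "SLC", "ALLC", "DHLC"):
    print(style, composite_sources(style, l=3, num_stages=5))
```

For stage 3 of a five-stage backbone, AHLC draws on stage 3 of the previous backbone, SLC on stage 2, ALLC on stage 1, and DHLC on stages 3 to 5.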
3.3 Architecture of detection network with CBNet
The CBNet architecture is applicable to various off-the-shelf object detectors without additional modifications to their network architectures. In practice, we attach the layers of the Lead Backbone to the functional networks, such as the RPN and the detection head [21, 22, 29, 14, 8, 2].
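| Detector + backbone | Single backbone | AP (box) | AP50 (box) | AP75 (box) | AP (mask) | AP50 (mask) | AP75 (mask) |
|---|---|---|---|---|---|---|---|
| FPN + ResNet101 | ✓ | 39.4 | 61.5 | 42.8 | - | - | - |
| Mask R-CNN + ResNet101 | ✓ | 40.0 | 61.2 | 43.6 | 35.9 | 57.9 | 38.0 |
| Cascade R-CNN + ResNet101 | ✓ | 42.8 | 62.1 | 46.3 | - | - | - |
| Cascade Mask R-CNN + ResNeXt152 | ✓ | 48.3 | 67.0 | 52.8 | 41.0 | 64.1 | 44.2 |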
4 Experiments
In this section, we present experimental results on the bounding box detection task and the instance segmentation task of the challenging MS-COCO benchmark. Following the protocol in MS-COCO, we use the trainval35k set for training, which is the union of the 80k images from the train split and a random 35k subset of the 40k-image validation split. For comparison, we report COCO AP on the test-dev split, which is evaluated on the evaluation server.
4.1 Implementation details
The baseline methods in this paper are reproduced by ourselves based on the Detectron framework. All the baselines are trained with a single-scale strategy, except Cascade Mask R-CNN ResNeXt152. Specifically, the short side of the input image is resized to 800, and the longer side is limited to 1333. Most experiments are conducted on a machine with 4 NVIDIA Titan X GPUs, CUDA 9.2, and cuDNN 7.1.4. In addition, we train Cascade Mask R-CNN with Dual-ResNeXt152 on a machine with 4 NVIDIA P40 GPUs, and Cascade Mask R-CNN with Triple-ResNeXt152 on a machine with 4 NVIDIA V100 GPUs. The only data augmentation is image flipping. For most of the original baselines, the batch size on a single GPU is two images. Due to the GPU memory required by CBNet, we put one image on each GPU when training the detectors that use CBNet. Accordingly, we set the initial learning rate to half of the default value and train for the same number of epochs as the original baselines. It is worth noting that we do not change any other configuration of these baselines except the reduction of the initial learning rate and batch size.
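Halving the learning rate when the per-GPU batch size drops from two images to one is consistent with scaling the learning rate linearly with the batch size. A minimal sketch of that arithmetic follows; the function name `scaled_learning_rate` is hypothetical, and the default value of 0.02 is a placeholder chosen for illustration, not a setting taken from the baselines.

```python
def scaled_learning_rate(
    default_lr: float, default_images_per_gpu: int, images_per_gpu: int
) -> float:
    """Scale the learning rate in proportion to the per-GPU batch size."""
    return default_lr * images_per_gpu / default_images_per_gpu


# Placeholder default learning rate (assumption); two images per GPU reduced to one.
cbnet_lr = scaled_learning_rate(default_lr=0.02, default_images_per_gpu=2, images_per_gpu=1)
print(cbnet_lr)
```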
During inference, we use exactly the configuration of the original baselines. For Cascade Mask R-CNN with different backbones, we run both single-scale and multi-scale tests. For the other baseline detectors, we run single-scale tests, in which the short side of the input image is resized to 800 and the longer side is limited to 1333. Note that, for a fair comparison, we do not utilize Soft-NMS during inference.
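The single-scale resizing rule (short side 800, longer side capped at 1333) can be sketched as follows. The function name `resized_shape` and the rounding to the nearest integer are assumptions made for illustration; the exact rounding used by the framework may differ.

```python
from typing import Tuple


def resized_shape(
    height: int, width: int, short_side: int = 800, max_long_side: int = 1333
) -> Tuple[int, int]:
    """Scale so the short side reaches short_side, unless the long side would exceed the cap."""
    scale = short_side / min(height, width)
    if scale * max(height, width) > max_long_side:
        # The long side would overshoot the cap, so the cap determines the scale instead.
        scale = max_long_side / max(height, width)
    return round(height * scale), round(width * scale)


print(resized_shape(480, 640))   # short side reaches 800
print(resized_shape(500, 1500))  # long side is capped at 1333
```

For elongated images the cap takes precedence, so the short side ends up smaller than 800.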
| Method | Backbone | AP | AP50 | AP75 | AP_S | AP_M | AP_L |
|---|---|---|---|---|---|---|---|
| Faster R-CNN | VGG16 | 21.9 | 42.7 | - | - | - | - |
| Mask R-CNN | ResNet101 | 39.8 | 62.3 | 43.4 | 22.1 | 43.2 | 51.2 |
| Cascade R-CNN | ResNet101 | 42.8 | 62.1 | 46.3 | 23.7 | 45.5 | 55.2 |
| Libra R-CNN | ResNext-101 | 43.0 | 64.0 | 47.0 | 25.3 | 45.6 | 54.6 |
| SNIP (model ensemble) * | - | 48.3 | 69.7 | 53.7 | 31.4 | 51.6 | 60.7 |
| Cascade Mask R-CNN * | ResNeXt152 | 50.2 | 68.2 | 54.9 | 31.9 | 52.9 | 63.5 |
| MegDet (model ensemble) * | - | 52.5 | - | - | - | - | - |
| Cascade Mask R-CNN * | Dual-ResNeXt152 | 52.8 | 70.6 | 58.0 | 34.9 | 55.4 | 65.3 |
| Cascade Mask R-CNN * | Triple-ResNeXt152 | 53.3 | 71.9 | 58.5 | 35.5 | 55.8 | 66.7 |
4.2 Detection results
To demonstrate the effectiveness of the proposed CBNet, we conduct a series of experiments with the baselines of state-of-the-art detectors, i.e., FPN, Mask R-CNN, and Cascade R-CNN, and report the results in Table 2. In each row of Table 2, we compare a baseline (provided by Detectron) with its variants using the proposed CBNet, and one can see that CBNet consistently improves all of these baselines by a significant margin. More specifically, the mAPs of these baselines increase by 1.5 to 3 percent.
Furthermore, as presented in Table 3, a new state-of-the-art detection result of 53.3 mAP on the MS-COCO benchmark is achieved by the Cascade Mask R-CNN baseline equipped with the proposed CBNet. Notably, this result is achieved by a single model, without any improvement to the baseline other than taking CBNet as the backbone. Hence, this result demonstrates the effectiveness of the proposed CBNet architecture.
Moreover, as shown in Table 2, the proposed CBNet also improves the performance of the baselines on instance segmentation. Compared with bounding box prediction (i.e., object detection), pixel-wise classification (i.e., instance segmentation) tends to be more difficult and requires more representational features. These results therefore demonstrate the effectiveness of CBNet again.
4.3 Comparisons of different composite styles
We further conduct experiments to compare the suggested composite style AHLC with other possible composite styles illustrated in Figure 3, including SLC, ALLC, and DHLC. All of these experiments are conducted based on the Dual-Backbone architecture and the baseline of FPN ResNet101.
SLC vs. AHLC. As presented in Table 4, SLC gives an even worse result than the original baseline. We think the major reason is that the architecture of SLC brings serious parameter redundancy. To be more specific, the features extracted by the same stage of the two backbones in CBNet are similar, hence SLC cannot learn more semantic information than a single backbone. In other words, the network parameters are not fully utilized, yet they make training considerably more difficult, leading to a worse result.
ALLC vs. AHLC. As shown in Table 4, there is a large gap between ALLC and AHLC. We infer that, in our CBNet, if we directly add the lower-level (i.e., shallower) features of the previous backbone to the higher-level (i.e., deeper) ones of the succeeding backbone, the semantic information of the latter is largely harmed. On the contrary, if we add the deeper features of the previous backbone to the shallower ones of the succeeding backbone, the semantic information of the latter is largely enhanced.
DHLC vs. AHLC. The results in Table 4 show that DHLC does not improve performance as much as AHLC, although it adds more composite connections than AHLC. We infer that the success of the Composite Backbone Network lies mainly in the composite connections between adjacent stages, while the other composite connections do not enrich the features much since they are too far away.
CBNets with these composite styles have the same number of network parameters (i.e., about twice that of a single backbone), but only AHLC brings the best improvement in detection performance. These experimental results show that merely increasing the parameters or adding an additional backbone may not bring a better result. Moreover, these experiments also show that composite connections should be added properly. Hence, these experimental results demonstrate that the suggested composite style AHLC is effective and nontrivial.
4.4 Sharing weights for CBNet
Because it uses more backbones, CBNet sharply increases the number of network parameters. To further demonstrate that the improvement in detection performance mainly comes from the composite architecture rather than from the increase of network parameters, we conduct experiments on FPN with a configuration that shares the weights of the two backbones in Dual-ResNet101, and the results are shown in Table 5. We can see that when the weights of the backbones in CBNet are shared, the increase in parameters is negligible, but the detection result is still much better than the baseline (e.g., mAP 40.4 vs. 39.4). In contrast, not sharing the weights adds only a small further improvement (mAP from 40.4 to 41.0), which indicates that the composite architecture, rather than the increase of network parameters, is the dominant source of the performance boost.
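The argument above rests on simple arithmetic over the three reported mAP values, which can be made explicit as follows. The attribution of the gain to "architecture" and "extra parameters" is a rough accounting for illustration only, assuming the two effects add up; the variable names are hypothetical.

```python
baseline_ap = 39.4  # FPN + ResNet101 (single backbone)
shared_ap = 40.4    # Dual-ResNet101 with shared weights
unshared_ap = 41.0  # Dual-ResNet101 with separate weights

# Gain obtained with a negligible increase in parameters (composite architecture alone).
gain_shared = round(shared_ap - baseline_ap, 1)
# Total gain of the full Dual-Backbone model.
gain_unshared = round(unshared_ap - baseline_ap, 1)
# Additional gain from the extra, unshared parameters.
gain_extra_params = round(unshared_ap - shared_ap, 1)

fraction_from_architecture = gain_shared / gain_unshared
print(gain_shared, gain_unshared, gain_extra_params, fraction_from_architecture)
```

Under this accounting, 1.0 of the 1.6 mAP gain (about 62%) is already obtained with shared weights.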
|FPN + ResNet101||39.4||470|
4.5 Number of backbones in CBNet
We conduct experiments to investigate the relationship between the number of backbones in CBNet and the detection performance, taking FPN-ResNet101 as the baseline, and the results are shown in Figure 4. The detection mAP steadily increases with the number of backbones and tends to converge when the number of backbones reaches three. Hence, considering the speed and memory cost, we suggest using the Dual-Backbone and Triple-Backbone architectures.
| Detector + backbone | AP | Speed (fps) |
|---|---|---|
| FPN + ResNet101 | 39.4 | 8.1 |
4.6 An accelerated version of CBNet
The major drawback of the proposed CBNet is that it slows down the inference speed of the baseline detector, since it uses more backbones to extract features and thus increases the computational complexity. For example, as shown in Table 6, DB increases the AP of FPN by 1.6 percent but slows down the detection speed from 8.1 fps to 5.5 fps. To alleviate this problem, we further propose an accelerated version of CBNet, illustrated in Figure 5, which removes the two early stages of the Assistant Backbone. As demonstrated in Table 6, this accelerated version significantly improves the speed (from 5.5 fps to 6.9 fps) while only slightly reducing the detection accuracy (AP from 41.0 to 40.8).
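The speed and accuracy trade-off quoted above can be expressed in relative terms with a short calculation. This sketch only uses the figures stated in the text; the helper name `percent_change` is hypothetical.

```python
def percent_change(new: float, old: float) -> float:
    """Percentage change when going from old to new."""
    return 100.0 * (new - old) / old


baseline_fps = 8.1     # FPN + ResNet101
dual_fps = 5.5         # Dual-Backbone (DB)
accelerated_fps = 6.9  # accelerated DB

dual_slowdown = percent_change(dual_fps, baseline_fps)
accelerated_slowdown = percent_change(accelerated_fps, baseline_fps)
recovered_speed = percent_change(accelerated_fps, dual_fps)
ap_cost = round(41.0 - 40.8, 1)  # AP given up by the accelerated version

print(f"DB vs. baseline:          {dual_slowdown:.1f}% fps")
print(f"accelerated vs. baseline: {accelerated_slowdown:.1f}% fps")
print(f"accelerated vs. DB:       {recovered_speed:+.1f}% fps for {ap_cost} AP")
```

In relative terms, DB runs roughly 32% slower than the baseline, whereas the accelerated version runs roughly 15% slower, at a cost of 0.2 AP compared with DB.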
4.7 Effectiveness of basic feature enhancement by CBNet
We think the root cause of CBNet performing much better than a single backbone network on the object detection task is that it can extract more representational basic features than the original single backbone network, which was originally designed for the classification problem. To verify this, as illustrated in Figure 6, we visualize and compare the intermediate feature maps extracted by our CBNet and by the original single backbone in the detectors for some examples. The example image in Figure 6 contains two foreground objects: a person and a tennis ball. The person is a large object and the tennis ball is a small object. Hence, we correspondingly visualize the large-scale feature maps (for detecting small objects) and the small-scale feature maps (for detecting large objects) extracted by our CBNet and by the original single backbone. One can see that the feature maps extracted by our CBNet consistently have stronger activation values at the foreground objects and weaker activation values at the background. This visualization example suggests that our CBNet is more effective at extracting representational basic features for object detection.
5 Conclusion
In this paper, a novel network architecture called the Composite Backbone Network (CBNet) is proposed to boost the performance of state-of-the-art object detectors. CBNet consists of a series of backbones with the same network structure and uses composite connections to link these backbones. Specifically, the output of each stage in a previous backbone flows to the parallel stage of the succeeding backbone as part of its inputs through composite connections. Finally, the feature maps of the last backbone, namely the Lead Backbone, are used for object detection. Extensive experimental results demonstrate that the proposed CBNet improves the detection accuracy of many state-of-the-art detectors, such as FPN, Mask R-CNN, and Cascade R-CNN. To be more specific, the mAPs of the detectors mentioned above on the COCO dataset are increased by about 1.5 to 3 percent, and a new state-of-the-art result on COCO with an mAP of 53.3 is achieved by simply integrating CBNet into the Cascade Mask R-CNN baseline. In addition, experimental results show that CBNet is also very effective at improving instance segmentation performance. Additional ablation studies further demonstrate the effectiveness of the proposed architecture and the composite connection module.
References
- (2017) Soft-NMS – improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5561–5569.
- (2018) Cascade R-CNN: delving into high quality object detection. In CVPR.
- (2016) R-FCN: object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems, pp. 379–387.
- (2009) ImageNet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248–255.
- (2019) NAS-FPN: learning scalable feature pyramid architecture for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7036–7045.
- (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pp. 580–587.
- (2018) Detectron. Note: https://github.com/facebookresearch/detectron
- (2017) Mask R-CNN. In ICCV, pp. 2980–2988.
- (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
- (2017) Densely connected convolutional networks. In CVPR, Vol. 1, pp. 3.
- (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- (2018) DetNet: design backbone for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 334–350.
- (2015) Recurrent convolutional neural network for object recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3367–3375.
- (2017) Feature pyramid networks for object detection. In CVPR, Vol. 1, pp. 4.
- (2017) Focal loss for dense object detection. In ICCV.
- (2018) Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
- (2016) SSD: single shot multibox detector. In ECCV, pp. 21–37.
- (2019) Libra R-CNN: towards balanced learning for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 821–830.
- (2018) MegDet: a large mini-batch object detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6181–6189.
- (2016) You only look once: unified, real-time object detection. In CVPR.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, pp. 91–99.
- (2013) OverFeat: integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- (2018) An analysis of scale invariance in object detection SNIP. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3578–3587.
- (2018) SNIPER: efficient multi-scale training. In Advances in Neural Information Processing Systems, pp. 9333–9343.
- (2018) FishNet: a versatile backbone for image, region, and pixel level prediction. In Advances in Neural Information Processing Systems, pp. 762–772.
- (2017) Aggregated residual transformations for deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pp. 5987–5995.
- (2018) Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4203–4212.
- (2018) M2Det: a single-shot object detector based on multi-level feature pyramid network. arXiv preprint arXiv:1811.04533.
- (2019) Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 840–849.