Object detection is a fundamental research topic in image/video understanding, and serves as a prerequisite for various applications such as image/video retrieval, intelligent surveillance and autonomous driving. Existing deep learning-based detectors can be broadly categorized into two branches: one-stage detectors such as SSD, RefineDet and RetinaNet, which utilize a CNN to directly predict the bounding boxes; and two-stage methods including Faster R-CNN, R-FCN and Mask R-CNN, which generate a set of candidate proposals and then exploit region features extracted from the CNN for further refinement. Although encouraging progress has been made, existing detectors still suffer from problems caused by the scale variation across object instances.
An intuitive approach to solving the scale variation problem is to use a multi-scale image pyramid. However, the dramatic increase in inference time makes image pyramid methods infeasible for practical applications. Other methods aim to employ a feature pyramid within the network to approximate the image pyramid at a lower computational cost. Feature Pyramid Network (FPN) is the most representative one: it incorporates high-semantic information into both high-level and low-level features with a top-down pathway, achieving superior performance. However, this top-down architecture design has the following intrinsic limitations: (1) it only introduces high-semantic information from deep layers to shallow layers, but does not consider the assistance of shallow layers to deep layers; (2) the top-down architecture makes the features of small objects largely depend on the features of larger objects, and this dependence is not always beneficial. For instance, we conduct a toy experiment by changing the number of FPN channels in the baseline detector RetinaNet-ResNet50 (input size 800) to test the accuracy bottleneck; the results are shown in Figure 1. Notably, when the channel dimension increases to 768, the accuracy growth is negligible despite considerable additional computation and parameters. This experiment demonstrates that such a top-down FPN architecture has bottleneck restrictions.
To address these problems, we rethink the feature pyramid network and summarize FPN architectures into three different fashions: top-down, fusing-splitting and bottom-up. As illustrated in Figure 2, from top to bottom, we design an instance FPN for each FPN architecture. The Top-down FPN is an improved version of the original FPN, which introduces high-level semantic contexts to low-level features for better detection of small objects. In particular, we newly propose the Bottom-up FPN, which introduces low-level details to high-level features, helping the high-level features obtain more spatial information and thus better detect large objects. Departing from the interdependent relationship between deep and shallow features, we propose a novel Fusing-splitting FPN, which first fuses higher-level and lower-level features and then splits the fused feature into multi-scale features. Further, as illustrated in Figure 2, we propose a novel feature pyramid network that assembles these three FPNs of different architectures, named Mixture Feature Pyramid Network (MFPN). Experimental results show that the proposed MFPN can significantly enhance FPN-based detectors by about 2 points of Average Precision (AP), and can improve the detection performance for objects of all scale ranges (e.g., as depicted in Figure 2). Moreover, competitive single-model detection results are achieved by both one-stage and two-stage baseline detectors equipped with MFPN.
In summary, our main contributions are as follows:
We design three FPNs of different architectures, namely Top-down FPN, Bottom-up FPN, and Fusing-splitting FPN, which achieve better detection performance for small, large, and medium-sized objects, respectively.
We propose a novel Mixture Feature Pyramid Network (MFPN) which inherits all the merits of the three FPNs, by assembling them in a parallel multi-branch architecture and mixing the features extracted by each branch.
We achieve significantly better detection results than both one-stage and two-stage FPN-based detectors on the MS COCO benchmark.
2 Related Work
Addressing the scale variation issue is critical for object detection, segmentation and other tasks that require location prediction. An intuitive way to tackle the scale variation problem is to use a multi-scale image pyramid during training and inference. Different from methods with fixed or random scale transforms, SNIP selectively back-propagates the gradients of object instances of different sizes as a function of image scale. In addition, SNIPER samples low-resolution chips to accelerate multi-scale training. A multi-scale image pyramid greatly improves accuracy, but suffers from greatly increased inference time.
The feature pyramid method, i.e., constructing and using a feature pyramid within the network, is more widely used to deal with scale variation due to its lower computational cost. Methods like SSD and MS-CNN directly perform small-object detection on higher-resolution feature maps and large-object detection on lower-resolution feature maps extracted by the backbone network (e.g., VGG). Since backbone networks are originally designed for the classification task, directly using the features extracted by them leads to suboptimal performance. Hence, some recent works try to alleviate this problem by enhancing the features extracted by backbones with novel feature enhancement modules, e.g., RFBNet and TridentNet. Feature Pyramid Network (FPN), which proposes a subnet with a top-down architecture to construct the feature pyramid, is commonly exploited by state-of-the-art object detectors, e.g., Mask R-CNN, RetinaNet, RefineDet, etc. Recently, Multi-level FPN introduces multiple U-shape modules after a backbone network to extract multi-level pyramidal features, and builds a powerful one-stage detector. Libra R-CNN and  are two recently proposed feature pyramid networks with a fusing-splitting architecture, which combine features of all scales and then generate features at each scale by a global attention operation on the combined features. As stated in Section 1, these FPNs have their own intrinsic limitations since each is designed with only one specific kind of FPN architecture (i.e., top-down, fusing-splitting, or bottom-up).
3 Proposed Method
In this work, we first introduce three kinds of FPN architectures, that is, Top-down, Bottom-up and Fusing-splitting. As illustrated in Figure 2, each pyramidal feature map (denoted as $C_i$) extracted by the backbone is followed by an extra convolution. Then, the resulting feature maps (denoted $F_i$) are used to build the feature pyramid for object detection by each FPN architecture as follows.
3.1 Top-down FPN
The major characteristic of the top-down FPN architecture is that the FPN feature maps (denoted as $P^{td}_i$) are sequentially constructed in a top-down manner, that is, the smaller-scale (higher-level) feature map is constructed first. We adopt the most widely used top-down architecture, FPN, with some modifications. To be more specific, we plug an extra global average pooling (GAP) layer above the deepest layer of the backbone to extract the global context, i.e., $G_5$. Moreover, GAP can learn richer semantic information and highlight the discriminative object regions detected by CNNs, and thus propagates more semantic information to the larger-scale (lower-level) feature maps. As in the original FPN, each feature map ($P^{td}_i$) of the Top-down FPN is iteratively built by combining the same-level backbone feature map ($F_i$) and the higher-level FPN feature map ($P^{td}_{i+1}$):

$$P^{td}_i = \mathrm{Conv}\left(F_i + \mathrm{Up}(P^{td}_{i+1})\right),$$

where $\mathrm{Up}(\cdot)$ denotes the upsample operation with a factor of 2 and $\mathrm{Conv}$ is a convolution filter. Since the top-down architecture iteratively propagates the semantic information of higher-level backbone features to the more detailed lower-level FPN feature maps, it is better at detecting small objects.
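As a concrete illustration of this top-down merge, the following NumPy sketch builds a toy pyramid under simplifying assumptions: `conv_stub` stands in for the learned convolution filter, nearest-neighbour repetition stands in for the upsample operation, and the GAP-based global context $G_5$ is omitted.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour upsampling by a factor of 2 (H and W doubled).
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def conv_stub(x):
    # Stand-in for the learned convolution filter (identity in this sketch).
    return x

def top_down_fpn(feats):
    # feats: backbone maps F2..F5, ordered low-level (large) to high-level (small).
    # Build the highest level first, then iteratively merge each level with
    # the upsampled higher-level FPN map.
    ps = [conv_stub(feats[-1])]
    for f in reversed(feats[:-1]):
        ps.append(conv_stub(f + upsample2x(ps[-1])))
    return ps[::-1]  # low-level first, each level keeps its backbone resolution

# Toy pyramid: one-channel constant maps of sizes 32, 16, 8, 4.
feats = [np.ones((1, 2 ** k, 2 ** k)) for k in (5, 4, 3, 2)]
pyramid = top_down_fpn(feats)
```

With constant inputs, the activations accumulate toward the lowest level, mirroring how semantic information flows downward in this architecture.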
3.2 Bottom-up FPN
Contrary to the top-down architecture, the major characteristic of the bottom-up FPN is that the FPN feature maps (denoted as $P^{bu}_i$) are sequentially constructed in a bottom-up manner, that is, the larger-scale (lower-level) feature map is constructed first. As illustrated in Figure 2.c, each feature map ($P^{bu}_i$) of the Bottom-up FPN is obtained by merging the same-level backbone feature map ($F_i$), the backbone feature map above it ($F_{i+1}$), and the FPN feature map below it ($P^{bu}_{i-1}$), which can be formulated as:

$$P^{bu}_i = \mathrm{Conv}\left(F_i + \mathrm{Up}(F_{i+1}) + \mathrm{MP}(P^{bu}_{i-1})\right),$$

where $\mathrm{MP}(\cdot)$ denotes the MaxPool operation with a factor of 2, $\mathrm{Up}(\cdot)$ denotes upsampling with a factor of 2, and $\mathrm{Conv}$ is a convolution filter. Because the bottom-up architecture propagates the spatial detail information of lower-level backbone features to the higher-level FPN features, it is better at detecting large objects. Obviously, the Bottom-up FPN and the Top-down FPN are complementary to each other.
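A matching NumPy sketch of the bottom-up merge is below. The handling of the boundary levels (the lowest map has no FPN feature below it, the highest no backbone map above it) is our assumption, and `conv_stub` again replaces the learned filter.

```python
import numpy as np

def upsample2x(x):
    # Nearest-neighbour upsampling by a factor of 2.
    return x.repeat(2, axis=-2).repeat(2, axis=-1)

def maxpool2x(x):
    # Max pooling with a factor of 2: halves the spatial size (even dims assumed).
    h, w = x.shape[-2] // 2, x.shape[-1] // 2
    return x.reshape(*x.shape[:-2], h, 2, w, 2).max(axis=(-3, -1))

def conv_stub(x):
    # Stand-in for the learned convolution filter.
    return x

def bottom_up_fpn(feats):
    # feats: backbone maps F2..F5, low-level (large) first. Build the lowest
    # level first; each later level merges the same-level backbone map, the
    # upsampled higher-level backbone map, and the max-pooled lower-level
    # FPN map.
    ps = []
    for i, f in enumerate(feats):
        m = f.astype(float)
        if i + 1 < len(feats):
            m = m + upsample2x(feats[i + 1])
        if ps:
            m = m + maxpool2x(ps[-1])
        ps.append(conv_stub(m))
    return ps

feats = [np.ones((1, 2 ** k, 2 ** k)) for k in (5, 4, 3, 2)]
pyramid = bottom_up_fpn(feats)
```

With constant inputs, the accumulation now runs toward the highest level, the opposite of the top-down case.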
3.3 Fusing-splitting FPN
Since the feature maps of the Top-down FPN and Bottom-up FPN are built sequentially, the earlier constructed features always affect the subsequent ones, and this interdependent design may lead to intrinsic limitations. To address this problem, we design a Fusing-splitting FPN, which first combines the higher-level and lower-level backbone features, and then splits the combined features into multi-scale FPN features.
Table 2 (fragment): pair-wise FPN combinations.

| Combination | AP | $AP_S$ | $AP_M$ | $AP_L$ |
|---|---|---|---|---|
| Bottom-up + Fusing-splitting | 34.3 | 16.0 | 39.0 | 48.6 |
| Top-down + Bottom-up | 34.3 | 15.8 | 38.9 | 48.9 |
| Top-down + Fusing-splitting | 33.8 | 15.7 | 38.4 | 47.7 |
In practice, the highest two backbone feature maps are merged into a combined feature map $M_h$, and the lowest two backbone feature maps are merged into $M_l$:

$$M_l = \mathrm{MP}(F_2) + F_3, \qquad M_h = F_4 + \mathrm{Up}(F_5).$$

After obtaining the first-round combined features, we further fuse them as follows:

$$M = \mathrm{Concat}\left(\mathrm{Conv}_1(M_l),\ \mathrm{Conv}_2(\mathrm{Up}(M_h))\right),$$

where $\mathrm{Conv}_1$ and $\mathrm{Conv}_2$ are two convolution filters, and $\mathrm{Concat}$ represents the concat operation along the channel dimension. After these operations, the feature map $M$ has fused information from all feature levels. Finally, we simply resize $M$ into multi-scale pyramidal feature maps, that is,

$$P^{fs}_i = \mathrm{Resize}(M,\ s_i), \quad i = 2, \dots, 5,$$

where $s_i$ is the spatial size of the $i$-th pyramid level.
By the above two rounds of fusing and the splitting operation, all feature maps of the Fusing-splitting FPN incorporate information from the backbone feature maps of all levels. Moreover, the two medium-scale feature maps are obtained with fewer downsampling or upsampling operations. Hence, the Fusing-splitting FPN yields a stable improvement in detecting medium-sized objects.
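The fuse-then-split pipeline can be sketched as follows. The exact scales of the first-round merges are our assumption (we merge at the two middle scales, so the medium levels need the least resizing), and a channel mean stands in for the two learned convolution filters.

```python
import numpy as np

def resize_to(x, size):
    # Nearest-neighbour resize of a square (C, H, W) map to size x size,
    # used for both downsampling and upsampling in this sketch.
    idx = np.arange(size) * x.shape[-1] // size
    return x[:, idx, :][:, :, idx]

def fusing_splitting_fpn(feats):
    # feats: backbone maps F2..F5, low-level (large) first.
    # First round: merge the lowest two maps (at the F3 scale) and the
    # highest two maps (at the F4 scale).
    low = resize_to(feats[0], feats[1].shape[-1]) + feats[1]
    high = feats[2] + resize_to(feats[3], feats[2].shape[-1])
    # Second round: concat along channels; the channel mean replaces the
    # two learned convolution filters.
    fused = np.concatenate([resize_to(low, high.shape[-1]), high], axis=0)
    fused = fused.mean(axis=0, keepdims=True)
    # Split: resize the fused map back to every pyramid scale.
    return [resize_to(fused, f.shape[-1]) for f in feats]

feats = [np.ones((1, 2 ** k, 2 ** k)) for k in (5, 4, 3, 2)]
pyramid = fusing_splitting_fpn(feats)
```

Note that every output level sees the same fused map, which is what lets all scales share information from all backbone levels.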
3.4 Mixture Feature Pyramid Network (MFPN)
Table 3 (fragment):

| Detector | AP | $AP_S$ | $AP_M$ | $AP_L$ | Time (ms) |
|---|---|---|---|---|---|
| Faster R-CNN FPN | 36.4 | 21.5 | 40.0 | 46.6 | 82 |
| Cascade Mask R-CNN FPN | 42.7 | 23.8 | 46.5 | 56.9 | 196 |
Now we propose a more powerful feature pyramid network named MFPN by integrating the above three FPNs. Intuitively, MFPN inherits all the merits of the three FPNs and better handles the scale variation problem in object detection. By integrating the three FPNs in one network with a shared backbone, we avoid a large increase in the number of parameters. The network architecture of MFPN is illustrated in Figure 2. Each feature map of MFPN is obtained by element-wise summing the same-level feature maps of the three feature pyramids, that is,

$$P_i = P^{td}_i + P^{bu}_i + P^{fs}_i,$$

where $P^{td}_i$, $P^{bu}_i$ and $P^{fs}_i$ denote the $i$-th level feature maps of the Top-down, Bottom-up and Fusing-splitting FPNs, respectively.
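The mixing step itself reduces to a per-level element-wise sum, assuming the three branches produce same-shaped maps at each level; a minimal sketch:

```python
import numpy as np

def mix_pyramids(top_down, bottom_up, fusing_splitting):
    # MFPN output: element-wise sum of the same-level maps of the three FPNs.
    return [t + b + f for t, b, f in zip(top_down, bottom_up, fusing_splitting)]

# Toy check: three pyramids of constant maps at four scales.
shapes = [(1, 32, 32), (1, 16, 16), (1, 8, 8), (1, 4, 4)]
td = [np.full(s, 1.0) for s in shapes]
bu = [np.full(s, 2.0) for s in shapes]
fs = [np.full(s, 3.0) for s in shapes]
mixed = mix_pyramids(td, bu, fs)
```

A sum (rather than concatenation) keeps the channel dimension, and hence the detection heads, unchanged relative to a single-branch FPN.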
MFPN can play all the roles played by FPN, including serving as anchor features to improve accuracy, or as neck features to boost the RPN for better candidate proposals and to connect with the RoI extractor for better RoI features.
Table 4 (fragment): comparison on COCO test-dev.

| Method | Backbone | AP | $AP_{50}$ | $AP_{75}$ | $AP_S$ | $AP_M$ | $AP_L$ |
|---|---|---|---|---|---|---|---|
| Faster R-CNN w/ FPN | ResNet-101-FPN | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2 |
| Deformable R-FCN | Inc-Res-v2 | 37.5 | 58.0 | 40.8 | 19.4 | 40.1 | 52.5 |
| Mask R-CNN | ResNeXt-101 | 39.8 | 62.3 | 43.4 | 22.1 | 43.2 | 51.2 |
| Cascade R-CNN | ResNet-101-FPN | 42.8 | 62.1 | 46.3 | 23.7 | 45.5 | 55.2 |
| MFPN-Cascade Mask R-CNN | ResNeXt-101-64x4d | 47.6 | 66.7 | 52.0 | 29.4 | 50.8 | 59.6 |
4 Experiments

4.1 Dataset and Implementation details
Dataset Description. We present experimental results on the bounding box detection task of the challenging MS-COCO benchmark. For training, validation and testing, we follow common practice and train on the union of 115k training images (the 80k train split and a random 35k subset of the 40k-image val split), and conduct ablation studies on the 5k minival split for convenience. Then, to compare accuracy with state-of-the-art FPN-based methods, we report results on the test-dev split.
The backbones used in this paper are all pre-trained on ImageNet. For the ablation study experiments, we train detectors for 12 epochs in total, with the learning rate starting from 0.02 and a batch size of 16. Cascade Mask R-CNN-MFPN and RetinaNet-X101-MFPN are trained for 20 epochs with an initial learning rate of 0.01. For evaluation, detectors run on a single Titan X GPU with CUDA 9 and CUDNN 7, with a batch size of 1.
4.2 Ablation Studies
Comparing the three FPNs
As shown in Table 1, the Top-down FPN gets the highest score for small objects ($AP_S$ of 15.2), while the Bottom-up FPN wins for large objects ($AP_L$ of 48.7) and the Fusing-splitting FPN is best at detecting medium-sized objects ($AP_M$ of 38.5). When we add up the three FPNs, the overall AP is 1.5 higher than FPN. We also conduct experiments on multiple combinations of the Top-down, Bottom-up and Fusing-splitting FPNs in Table 2. The combination of Top-down and Bottom-up gets the highest result (36.8) among the pair-wise combinations. At the same time, to further improve the AP and enhance the detection accuracy of hard samples, we adopt the combination of all three FPNs. These results confirm our expectations and show that our design is reasonable and effective.
MFPN can significantly enhance FPN-based detectors We further evaluate the proposed MFPN with different backbones and detectors, using an input image scale of 800 pixels. Results are detailed in Table 3. MFPN consistently improves the detection accuracy across various backbones. For MFPN-RetinaNet and MFPN-Faster R-CNN, we adopt the balanced L1 loss instead of the smooth L1 loss to better handle the sample imbalance problem. MFPN introduces marginal computation cost to the whole detection network, leading to a negligible loss of inference speed. In particular, we improve RetinaNet with ResNeXt-101 by 2.1 AP without any additional inference latency, and Cascade Mask R-CNN with ResNet-101 by 1.6 AP with only 8 ms of additional latency.
MFPN can learn better features for object detection To verify that the proposed MFPN can learn effective features for detecting objects of various sizes, we visualize the activation values of the outputs of FPN and MFPN along the scale and level dimensions; an example is shown in Figure 3. The input image contains four dogs of different sizes. We can find that: 1) For detecting the smallest dog, the lowest-level feature from the Top-down FPN achieves clearer and noise-free semantics compared with that from FPN. 2) Compared with FPN, the Bottom-up FPN obtains better high-level FPN features, with three clear activation points, and can better detect the biggest dog. 3) The features from the Fusing-splitting FPN have larger activation regions than FPN, containing more detailed information, and thus can better detect the two medium-sized dogs. 4) The responses of MFPN to objects are accurate, while those of FPN are hindered by meaningless noise. This implies that: 1) MFPN is good at learning the characteristics of objects. 2) It is necessary to use MFPN to detect objects of various sizes.
4.3 Compare with state-of-the-art FPN-based methods
We evaluate MFPN on the COCO test-dev set and compare it with recent state-of-the-art FPN-based methods. The model is trained using scale jitter. For a fair comparison, we only compare results produced by single models without ensembling or multi-scale testing. As shown in Table 4, the MFPN-based detectors, RetinaNet-MFPN and Cascade Mask R-CNN-MFPN, achieve superior results without bells and whistles. RetinaNet-MFPN gets an AP of 43.4, which surpasses all other one-stage detectors. Cascade Mask R-CNN-MFPN obtains an AP of 47.6, outperforming TridentNet, SNIP and SNIPER, which use image pyramid training and testing strategies. In conclusion, MFPN is compatible with both powerful one-stage and two-stage detectors and achieves very competitive single-model results.
5 Conclusion

In this paper, we first describe three FPNs of different architectures (i.e., Top-down, Bottom-up, and Fusing-splitting) for extracting multi-scale features to address the scale variation problem in object detection. Based on them, we propose a novel Mixture Feature Pyramid Network (MFPN), which is effective for learning powerful multi-scale features and can be simply assembled into both one-stage and two-stage detectors. On the MS-COCO benchmark, MFPN improves the performance for all scale ranges and enhances both one-stage and two-stage FPN-based detectors by about 2% AP, leading to very competitive results.
References

- (2018) An analysis of scale invariance in object detection – SNIP. In CVPR.
- (2016) A unified multi-scale deep convolutional neural network for fast object detection. In ECCV.
- (2018) Cascade R-CNN: delving into high quality object detection. In CVPR.
- (2016) R-FCN: object detection via region-based fully convolutional networks. In NIPS.
- (2017) Deformable convolutional networks. In ICCV.
- (2017) Mask R-CNN. In ICCV.
- (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR.
- (2017) RON: reverse connection with objectness prior networks for object detection. In CVPR.
- (2018) Deep feature pyramid reconfiguration for object detection. In ECCV.
- (2018) CornerNet: detecting objects as paired keypoints. In ECCV.
- (2019) Scale-aware trident networks for object detection. In ICCV.
- (2014) Network in network. In ICLR.
- (2017) Feature pyramid networks for object detection. In CVPR.
- (2017) Focal loss for dense object detection. In ICCV.
- (2018) Receptive field block net for accurate and fast object detection. In ECCV.
- (2016) SSD: single shot multibox detector. In ECCV.
- (2019) Libra R-CNN: towards balanced learning for object detection. In CVPR.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS.
- (2015) ImageNet large scale visual recognition challenge. In IJCV.
- (2016) Beyond skip connections: top-down modulation for object detection. In CoRR.
- (2018) SNIPER: efficient multi-scale training. In NIPS.
- (2015) Going deeper with convolutions. In CVPR.
- (2018) Single-shot refinement neural network for object detection. In CVPR.
- (2019) M2Det: a single-shot object detector based on multi-level feature pyramid network. In AAAI.
- (2019) Feature selective anchor-free module for single-shot object detection. In CVPR.