MFPN: A Novel Mixture Feature Pyramid Network of Multiple Architectures for Object Detection

Feature pyramids are widely exploited in many detectors to address the scale variation problem in object detection. In this paper, we first investigate Feature Pyramid Network (FPN) architectures and briefly categorize them into three typical fashions: top-down, bottom-up and fusing-splitting, which have their own merits for detecting small objects, large objects, and medium-sized objects, respectively. We then design three FPNs with these different architectures and propose a novel Mixture Feature Pyramid Network (MFPN), which inherits the merits of all three kinds of FPNs by assembling them in a parallel multi-branch architecture and mixing their features. MFPN significantly enhances both one-stage and two-stage FPN-based detectors, with about 2 percent Average Precision (AP) increment on the MS-COCO benchmark at little sacrifice in running-time latency. By simply assembling MFPN with one-stage and two-stage baseline detectors, we achieve competitive single-model detection results on the COCO detection benchmark without bells and whistles.

1 Introduction

Object detection is a fundamental research topic in image/video understanding and serves as a prerequisite for applications such as image/video retrieval, intelligent surveillance and autonomous driving. Existing deep learning-based detectors can be briefly categorized into two branches: one-stage detectors such as SSD [17], RefineDet [24] and RetinaNet [14], which utilize a CNN to directly predict the bounding boxes; and two-stage methods including Faster R-CNN [19], R-FCN [4] and Mask R-CNN [6], which generate a set of candidate proposals and then exploit region features extracted from the CNN for further refinement. Although encouraging progress has been made, existing detectors still suffer from the problems caused by scale variation across object instances.

Figure 1: Left: the number of parameters for different numbers of FPN channels. Right: detection accuracy for different numbers of FPN channels. The baseline detector is RetinaNet500-ResNet50.
Figure 2: Example results of object detectors using feature pyramid networks of different architectures (the baseline detector is RetinaNet500-ResNet50). Our MFPN performs best, detecting small, medium-sized and large objects with the highest IoU. Green boxes: ground truth; red boxes: detection results.

An intuitive approach to the scale variation problem is to use a multi-scale image pyramid [1]. However, the dramatic increase in inference time makes image pyramid methods infeasible for practical applications. Other methods [13][21][8][17] employ a feature pyramid within the network to approximate the image pyramid at a lower computational cost. Feature Pyramid Network (FPN) [13] is the most representative one: it incorporates high-level semantic information into both high-level and low-level features via a top-down pathway and achieves superior performance. However, this top-down architecture has the following intrinsic limitations: (1) it only introduces high-level semantic information from deep layers to shallow layers, without considering the assistance of shallow layers to deep layers; (2) the top-down architecture makes the features of small objects largely depend on the features of larger objects, and this dependence is not always beneficial. To probe the accuracy bottleneck, we conduct a toy experiment by changing the number of FPN channels in the baseline detector RetinaNet-ResNet50 (input size 800) [14]; the results are shown in Figure 1. Notably, when the channel dimension increases to 768, the accuracy gain becomes negligible despite a large amount of additional computation and parameters. This experiment demonstrates that such a top-down FPN architecture has a bottleneck.

To address these problems, we rethink the feature pyramid network and summarize FPN architectures into three different fashions: top-down, fusing-splitting and bottom-up. As illustrated in Figure 2 from top to bottom, we design an instance FPN for each architecture. The Top-down FPN is an improved version of the original FPN [13], which introduces high-level semantic contexts to low-level features for better detecting small objects. We further propose a Bottom-up FPN, which introduces low-level details to high-level features, helping the high-level features obtain more spatial information and thus better detect large objects. Deviating from the interdependent relationship between deep and shallow features, we propose a novel Fusing-splitting FPN, which first fuses higher-level and lower-level features and then splits the fused features into multi-scale features. Further, as illustrated in Figure 2, we propose a novel feature pyramid network that assembles these three FPNs of different architectures, named Mixture Feature Pyramid Network (MFPN). Experimental results show that the proposed MFPN significantly enhances FPN-based detectors by about 2 percent Average Precision (AP), and improves detection performance for objects of all scale ranges (e.g., as depicted in Figure 2). Moreover, competitive single-model detection results are achieved by both one-stage and two-stage baseline detectors equipped with MFPN.

In summary, our main contributions are as follows:

  • We design three FPNs of different architectures, Top-down FPN, Bottom-up FPN, and Fusing-splitting FPN, which achieve better detection performance for small objects, large objects, and medium-sized objects, respectively.

  • We propose a novel Mixture Feature Pyramid Network (MFPN) which inherits all the merits of the three FPNs, by assembling them in a parallel multi-branch architecture and mixing the features extracted by each branch.

  • We achieve significantly better detection results than both one-stage and two-stage FPN-based detectors on the MS-COCO benchmark.

2 Related Work

Addressing the scale variation issue is critical for object detection, segmentation and other tasks that require location prediction [1]. To tackle the scale variation problem, an intuitive way is to use a multi-scale image pyramid during training and inference [4][5][7]. Different from methods with fixed or random scale transforms, SNIP [1] selectively back-propagates the gradients of object instances of different sizes as a function of image scale. In addition, SNIPER [22] samples low-resolution chips to accelerate multi-scale training. A multi-scale image pyramid greatly improves accuracy, but suffers from greatly increased inference time.

The feature pyramid method, i.e., constructing and using a feature pyramid within the network, is more widely used to deal with scale variation due to its lower computational cost. Methods like SSD [17] and MS-CNN [2] directly detect small objects on higher-resolution feature maps and large objects on lower-resolution feature maps extracted by the backbone network (e.g., VGG). Since backbone networks are originally designed for the classification task, directly using the features they extract leads to suboptimal detection performance. Hence, some recent works alleviate this problem by enhancing the backbone features with novel feature enhancement modules, e.g., RFBNet [16] and TridentNet [11]. Feature Pyramid Network (FPN) [13], which proposes a subnet with a top-down architecture to construct the feature pyramid, is commonly exploited by state-of-the-art object detectors, e.g., Mask R-CNN [6], RetinaNet [14], RefineDet [24], etc. Recently, the multi-level FPN of M2Det [25] introduces multiple U-shape modules after the backbone network to extract multi-level pyramidal features and builds a powerful one-stage detector. Libra R-CNN [18] and [9] are two recently proposed feature pyramid networks of the fusing-splitting fashion, which combine features of all scales and then generate the features at each scale by a global attention operation on the combined features. As stated in Section 1, these FPNs have their own intrinsic limitations since each is designed with only one specific kind of FPN architecture (i.e., top-down, fusing-splitting, or bottom-up).

3 Proposed Method

In this work, we first introduce three kinds of FPN architectures: Top-down, Bottom-up and Fusing-splitting. As illustrated in Figure 2, each pyramidal feature map extracted by the backbone (denoted as $C_i$) is followed by an extra convolution. The resulting feature maps (denoted as $X_i$) are then used by each of the three FPN architectures to build the feature pyramid for object detection, as follows.
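For concreteness, the following PyTorch-style sketch shows one plausible form of this per-level convolution; the 1×1 kernels, the ResNet-like input widths and the 256-channel output width are our assumptions rather than settings stated in the paper.

```python
import torch.nn as nn


class LateralConvs(nn.Module):
    """Hypothetical per-level convolutions mapping backbone maps C_i to the
    common-width maps X_i shared by the three FPN branches below."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # One 1x1 convolution per backbone level (kernel size and widths assumed).
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels]
        )

    def forward(self, cs):
        # cs: backbone pyramid [C_2, C_3, C_4, C_5], ordered low level to high level.
        return [conv(c) for conv, c in zip(self.lateral, cs)]
```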

3.1 Top-down FPN

The major characteristic of the top-down architecture is that the FPN feature maps (denoted as $T_i$) are constructed sequentially in a top-down manner, i.e., the smaller-scale (higher-level) feature map is constructed first. We adopt the most widely used top-down architecture, FPN [13], with some modifications. Specifically, we plug an extra global average pooling (GAP) [12] layer above the deepest layer of the backbone to extract a global context $G_5$. GAP can learn richer semantic information and highlight the discriminative object regions detected by CNNs [23], and thus propagates more semantic information to the larger-scale (lower-level) feature maps. As in the original FPN [13], each feature map $T_i$ of the Top-down FPN is built iteratively by combining the same-level backbone feature map $X_i$ and the higher-level FPN feature map $T_{i+1}$:

$T_i = \mathcal{F}\left(X_i + \mathcal{U}(T_{i+1})\right) \qquad (1)$

where $\mathcal{U}$ denotes the upsample operation with a factor of 2 and $\mathcal{F}$ is a convolution filter. Since the top-down architecture iteratively propagates the semantic information of higher-level backbone features to the more detailed lower-level FPN feature maps, it is better at detecting small objects.
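A minimal PyTorch sketch of the Top-down branch under these definitions follows; the four 256-channel inputs, exact factor-2 scale gaps, nearest-neighbour upsampling and 3×3 smoothing convolutions are assumptions, not the authors' exact configuration.

```python
import torch.nn as nn
import torch.nn.functional as F


class TopDownFPN(nn.Module):
    """Sketch of Eq. (1): build T_5..T_2 from the highest level downwards,
    injecting a GAP-based global context G_5 at the top (assumed layout)."""

    def __init__(self, channels=256):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)  # global context G_5 above X_5
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(4)]
        )

    def forward(self, xs):
        x2, x3, x4, x5 = xs  # X_2..X_5, low level to high level
        # Highest level first: merge X_5 with the broadcast global context.
        t5 = self.convs[3](x5 + self.gap(x5))
        # Then iteratively upsample by 2 and merge with the same-level backbone map.
        t4 = self.convs[2](x4 + F.interpolate(t5, scale_factor=2, mode="nearest"))
        t3 = self.convs[1](x3 + F.interpolate(t4, scale_factor=2, mode="nearest"))
        t2 = self.convs[0](x2 + F.interpolate(t3, scale_factor=2, mode="nearest"))
        return [t2, t3, t4, t5]
```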

3.2 Bottom-up FPN

Contrary to the top-down architecture, the major characteristic of the bottom-up FPN is that the FPN feature maps (denoted as $B_i$) are constructed sequentially in a bottom-up manner, i.e., the larger-scale (lower-level) feature map is constructed first. As illustrated in Figure 2.c, each feature map $B_i$ of the Bottom-up FPN is obtained by merging the same-level backbone feature map $X_i$, the backbone feature map $X_{i+1}$ above it, and the FPN feature map $B_{i-1}$ below it, which can be formulated as:

$B_i = \mathcal{F}\left(X_i + \mathcal{U}(X_{i+1}) + \mathcal{M}(B_{i-1})\right) \qquad (2)$

where $\mathcal{M}$ denotes the max-pooling operation with a factor of 2 and $\mathcal{F}$ is a convolution filter. Because the bottom-up architecture propagates the spatial detail information of lower-level backbone features to the higher-level FPN features, it is better at detecting large objects. Obviously, the Bottom-up FPN and the Top-down FPN are complementary to each other.
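A matching sketch of the Bottom-up branch is shown below; how the boundary levels (which lack a map above or below) are handled is our assumption.

```python
import torch.nn as nn
import torch.nn.functional as F


class BottomUpFPN(nn.Module):
    """Sketch of Eq. (2): build B_2..B_5 from the lowest level upwards, merging
    the same-level backbone map, the backbone map above and the pyramid map below."""

    def __init__(self, channels=256):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)  # M: max-pool by factor 2
        self.convs = nn.ModuleList(
            [nn.Conv2d(channels, channels, 3, padding=1) for _ in range(4)]
        )

    def forward(self, xs):
        x2, x3, x4, x5 = xs  # X_2..X_5, low level to high level
        # Lowest level first; there is no pyramid map below it (assumed handling).
        b2 = self.convs[0](x2 + F.interpolate(x3, scale_factor=2, mode="nearest"))
        b3 = self.convs[1](x3 + F.interpolate(x4, scale_factor=2, mode="nearest") + self.pool(b2))
        b4 = self.convs[2](x4 + F.interpolate(x5, scale_factor=2, mode="nearest") + self.pool(b3))
        # Highest level has no backbone map above it (assumed handling).
        b5 = self.convs[3](x5 + self.pool(b4))
        return [b2, b3, b4, b5]
```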

3.3 Fusing-splitting FPN

Since the feature maps of the Top-down FPN and the Bottom-up FPN are built sequentially, the earlier constructed features always affect the subsequent ones, and this interdependent design may lead to intrinsic limitations. To address this problem, we design a Fusing-splitting FPN, which first combines the higher-level and lower-level backbone features, and then splits the combined features into multi-scale FPN features.

Method | Params (M) | AP | AP_S | AP_M | AP_L
FPN (Baseline) | 8.00 | 33.2 | 15.0 | 37.5 | 47.4
Top-down | 8.52 | 33.5 | 15.2 | 38.1 | 47.6
Bottom-up | 8.52 | 33.5 | 14.4 | 37.9 | 48.7
Fusing-splitting | 6.49 | 33.6 | 14.7 | 38.5 | 48.1
MFPN | 11.47 | 34.8 | 16.8 | 39.1 | 49.0
Table 1: Object detection results on COCO minival for the three FPNs of different architectures and the proposed MFPN. The baseline is RetinaNet500-ResNet50.
Method | AP | AP_S | AP_M | AP_L
Baseline | 33.2 | 15.0 | 37.5 | 47.4
Bottom-up + Fusing-splitting | 34.3 | 16.0 | 39.0 | 48.6
Top-down + Bottom-up | 34.3 | 15.8 | 38.9 | 48.9
Top-down + Fusing-splitting | 33.8 | 15.7 | 38.4 | 47.7
MFPN | 34.8 | 16.8 | 39.1 | 49.0
Table 2: Object detection results on COCO minival for different combinations of the three kinds of FPN architectures. The baseline is RetinaNet500-ResNet50.

In practice, the highest two backbone feature maps ($X_5$ and $X_4$) are merged into a combined feature map $M_1$, and the lowest two backbone feature maps ($X_2$ and $X_3$) are merged into $M_2$:

$M_1 = X_4 + \mathcal{U}(X_5), \qquad M_2 = X_3 + \mathcal{M}(X_2) \qquad (3)$

After obtaining the first-round combined features, we further fuse them as follows:

$S_4 = \mathcal{F}_1\big(\mathcal{C}(M_1,\ \mathcal{M}(M_2))\big), \qquad S_3 = \mathcal{F}_2\big(\mathcal{C}(\mathcal{U}(M_1),\ M_2)\big) \qquad (4)$

where $\mathcal{F}_1$ and $\mathcal{F}_2$ are two convolution filters, and $\mathcal{C}$ represents the concatenation operation along the channel dimension. After these operations, the feature maps $S_4$ and $S_3$ have fused information from features of all levels. Finally, we simply resize them into multi-scale pyramidal feature maps, that is,

$S_5 = \mathcal{M}(S_4), \qquad S_2 = \mathcal{U}(S_3) \qquad (5)$

By the above two rounds of fusing and the splitting operation, all the feature maps of the Fusing-splitting FPN incorporate information from the backbone feature maps of all levels. Moreover, the two medium-scale feature maps ($S_4$ and $S_3$) are obtained with fewer downsampling and upsampling operations. Hence, the Fusing-splitting FPN yields a stable improvement in detecting medium-sized objects.
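The sketch below illustrates one way to realize Eqs. (3)-(5) in PyTorch; the merge order, nearest-neighbour resizing and channel widths are assumptions based on the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusingSplittingFPN(nn.Module):
    """Sketch of the Fusing-splitting branch: two rounds of fusing followed by
    splitting into the full pyramid S_2..S_5 (assumed layout)."""

    def __init__(self, channels=256):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        # F_1 and F_2: convolutions applied after channel-wise concatenation.
        self.conv_a = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.conv_b = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, xs):
        x2, x3, x4, x5 = xs  # X_2..X_5, low level to high level
        # First round (Eq. 3): merge the two highest and the two lowest maps.
        m1 = x4 + F.interpolate(x5, scale_factor=2, mode="nearest")  # level-4 scale
        m2 = x3 + self.pool(x2)                                      # level-3 scale
        # Second round (Eq. 4): concatenate along channels at the two medium scales.
        s4 = self.conv_a(torch.cat([m1, self.pool(m2)], dim=1))
        s3 = self.conv_b(torch.cat([F.interpolate(m1, scale_factor=2, mode="nearest"), m2], dim=1))
        # Splitting (Eq. 5): resize the fused maps to the remaining scales.
        s5 = self.pool(s4)
        s2 = F.interpolate(s3, scale_factor=2, mode="nearest")
        return [s2, s3, s4, s5]
```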

3.4 Mixture Feature Pyramid Network (MFPN)

Baseline | Method | AP | AP_S | AP_M | AP_L | time (ms)
RetinaNet-R50 | FPN | 35.6 | 20.0 | 39.6 | 46.8 | 85
RetinaNet-R50 | MFPN (ours) | 37.9 | 21.4 | 41.9 | 49.7 | 86
RetinaNet-X101 | FPN | 40.0 | 23.0 | 44.3 | 52.7 | 196
RetinaNet-X101 | MFPN (ours) | 42.1 | 24.9 | 46.8 | 55.3 | 196
Faster R-CNN-R50 | FPN | 36.4 | 21.5 | 40.0 | 46.6 | 82
Faster R-CNN-R50 | MFPN (ours) | 38.6 | 22.6 | 42.8 | 49.7 | 93
Cascade Mask R-CNN-R101 | FPN | 42.7 | 23.8 | 46.5 | 56.9 | 196
Cascade Mask R-CNN-R101 | MFPN (ours) | 44.4 | 25.9 | 48.1 | 58.2 | 204

Table 3: Performance comparison between FPN and MFPN on COCO minival. R: ResNet, X: ResNeXt-101-64x4d.

We now propose a more powerful feature pyramid network, named MFPN, by integrating the above three FPNs. Intuitively, MFPN inherits all the merits of the three FPNs and better handles the scale variation problem in object detection. By integrating the three FPNs in one network sharing the same backbone, we avoid a large increase in the number of parameters. The network architecture of MFPN is illustrated in Figure 2; each feature map $P_i$ of MFPN is obtained by element-wise summation of the same-level feature maps of the three feature pyramids, that is,

$P_i = T_i + B_i + S_i \qquad (6)$

MFPN can play all the roles of FPN, including serving as the anchor features to improve accuracy [14], or as the neck features that boost the RPN [19] for better candidate proposals and connect with the RoI extractor [19][4][6] for better RoI features.
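Putting the pieces together, a minimal sketch of the mixture step (Eq. 6) could look as follows, reusing the three branch sketches given above; the shared 256-channel width remains an assumption.

```python
import torch.nn as nn


class MFPN(nn.Module):
    """Sketch of MFPN: run the three branches in parallel on the shared backbone
    features and sum their same-level outputs element-wise (P_i = T_i + B_i + S_i)."""

    def __init__(self, channels=256):
        super().__init__()
        self.top_down = TopDownFPN(channels)             # see sketch in Sec. 3.1
        self.bottom_up = BottomUpFPN(channels)           # see sketch in Sec. 3.2
        self.fuse_split = FusingSplittingFPN(channels)   # see sketch in Sec. 3.3

    def forward(self, xs):
        # xs: the shared backbone maps X_2..X_5 after the extra per-level convolution.
        t = self.top_down(xs)
        b = self.bottom_up(xs)
        s = self.fuse_split(xs)
        # Mix the branches by element-wise summation at every pyramid level.
        return [ti + bi + si for ti, bi, si in zip(t, b, s)]
```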

4 Experiment

Method | Backbone | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L
one-stage:
SSD512 [17] | VGG-16 | 28.8 | 48.5 | 30.3 | 10.9 | 31.8 | 43.5
RefineDet512 [24] | ResNet-101 | 36.4 | 57.5 | 39.5 | 16.6 | 39.9 | 51.4
RetinaNet800 [14] | ResNet-101-FPN | 39.1 | 59.1 | 42.3 | 21.8 | 42.7 | 50.2
CornerNet [10] | Hourglass-104 | 40.5 | 56.5 | 43.1 | 19.4 | 42.7 | 53.9
M2Det [25] | VGG-16 | 41.0 | 59.7 | 45.0 | 22.1 | 46.5 | 53.8
FSAF [26] | ResNeXt-101-64x4d | 42.9 | 63.8 | 46.3 | 26.6 | 46.2 | 52.7
two-stage:
Faster R-CNN w/ FPN [13] | ResNet-101-FPN | 36.2 | 59.1 | 39.0 | 18.2 | 39.0 | 48.2
Deformable R-FCN [5] | Inc-Res-v2 | 37.5 | 58.0 | 40.8 | 19.4 | 40.1 | 52.5
Mask R-CNN [6] | ResNeXt-101 | 39.8 | 62.3 | 43.4 | 22.1 | 43.2 | 51.2
TridentNet [11] | ResNet-101-Deformable | 42.7 | 63.6 | 46.5 | 23.9 | 46.6 | 56.6
Cascade R-CNN [3] | ResNet-101-FPN | 42.8 | 62.1 | 46.3 | 23.7 | 45.5 | 55.2
SNIP [1] | ResNet-101-Deformable | 44.4 | 66.2 | 44.9 | 27.3 | 47.4 | 56.9
SNIPER [22] | ResNet-101-Deformable | 46.1 | 67.0 | 51.6 | 29.6 | 48.9 | 58.1
Ours:
MFPN-Cascade Mask R-CNN | ResNeXt-101-64x4d | 47.6 | 66.7 | 52.0 | 29.4 | 50.8 | 59.6
MFPN-RetinaNet | ResNeXt-101-64x4d | 43.4 | 63.4 | 46.5 | 26.1 | 47.3 | 54.0
Table 4: Detection accuracy comparison with state-of-the-art FPN-based methods on the MS-COCO test-dev set.

4.1 Dataset and Implementation details

Dataset Description. We present experimental results on the bounding-box detection task of the challenging MS-COCO benchmark [15]. Following [24] and [13], we train on the union of about 115k training images (the 80k train split plus a random 35k subset of the 40k-image val split) and conduct ablation studies on the 5k minival split for convenience. To compare with state-of-the-art FPN-based methods, we report results on the test-dev split.

Implementation Details. The backbones used in this paper are all pre-trained on ImageNet [20]. For the ablation study experiments, we train detectors for 12 epochs in total, with a learning rate starting from 0.02 and a batch size of 16. Cascade Mask R-CNN-MFPN and RetinaNet-X101-MFPN are trained for 20 epochs with an initial learning rate of 0.01. For evaluation, detectors run on a single Titan X GPU with CUDA 9 and cuDNN 7, with a batch size of 1.
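As a rough illustration of the ablation schedule, the following PyTorch snippet sets up an SGD optimizer with the stated initial learning rate; the momentum, weight decay and decay epochs (8 and 11, the common 1x schedule) are assumptions not given in the paper.

```python
import torch


def build_ablation_optimizer(model):
    """Hypothetical training schedule for the ablation setting: SGD, initial
    learning rate 0.02, 12 epochs (the batch size of 16 is handled by the data loader)."""
    optimizer = torch.optim.SGD(
        model.parameters(), lr=0.02,
        momentum=0.9, weight_decay=1e-4,  # assumed values, not stated in the paper
    )
    # Assumed step decay at epochs 8 and 11 over the 12-epoch run.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[8, 11], gamma=0.1)
    return optimizer, scheduler
```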

4.2 Ablation Studies

Comparing the three FPNs

Figure 3: Heatmap visualization examples of MFPN and FPN.

As shown in Table 1, the Top-down FPN gets the highest score for small objects (AP_S of 15.2), the Bottom-up FPN wins for large objects (AP_L of 48.7), and the Fusing-splitting FPN is best at detecting medium-sized objects (AP_M of 38.5). When we assemble the three FPNs, the overall AP is 1.6 points higher than FPN. We also evaluate pair-wise combinations of the Top-down, Bottom-up and Fusing-splitting FPNs in Table 2. The combination of Top-down and Bottom-up gets the best overall result (34.3 AP) among the pair-wise combinations. To further improve AP and enhance the detection accuracy of hard samples, we adopt the combination of all three FPNs. These results confirm our expectations and show that our design is reasonable and effective.

MFPN can significantly enhance FPN-based detectors. We further evaluate the proposed MFPN with different backbones and detectors, using an input image scale of 800 pixels; results are detailed in Table 3. MFPN consistently improves the detection accuracy for various backbones. For MFPN-RetinaNet and MFPN-Faster R-CNN, we adopt the balanced L1 loss [18] instead of smooth L1 to better handle the sample imbalance problem. MFPN introduces marginal computation cost to the whole detection network, leading to negligible loss of inference speed. In particular, we improve RetinaNet-ResNeXt-101 by 2.1 AP without any additional inference latency, and Cascade Mask R-CNN-ResNet-101 by 1.7 AP with only 8 ms additional latency.

MFPN can learn better features for object detection. To verify that the proposed MFPN learns effective features for detecting objects of various sizes, we visualize the activation values of the outputs of FPN and MFPN along the scale and level dimensions; an example is shown in Figure 3. The input image contains four dogs of different sizes. We find that: 1) for detecting the smallest dog, the lowest-level feature from the Top-down FPN achieves clearer and less noisy semantics than that from FPN; 2) compared with FPN, the Bottom-up FPN obtains better high-level FPN features, with three clear activation points in its highest-level map, and can better detect the biggest dog; 3) the medium-level features from the Fusing-splitting FPN have larger activation regions than those of FPN and contain more detailed information, and thus can better detect the two medium-sized dogs; 4) the responses of MFPN to objects are accurate, while those of FPN are hindered by meaningless noise. This implies that: 1) MFPN is good at learning the characteristics of objects; 2) it is necessary to use MFPN to detect objects of various sizes.
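The exact rendering procedure for Figure 3 is not described; a simple way to produce such per-level heatmaps, assuming channel-averaged activations resized to the image resolution, is sketched below.

```python
import matplotlib.pyplot as plt
import torch.nn.functional as F


def show_level_heatmap(feature, image_hw):
    """Visualize one pyramid level: average over channels, normalize to [0, 1]
    and resize to the input image size (assumed procedure, for illustration only)."""
    # feature: tensor of shape (1, C, H, W) taken from one FPN/MFPN output level.
    heat = feature.mean(dim=1, keepdim=True)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-6)
    heat = F.interpolate(heat, size=image_hw, mode="bilinear", align_corners=False)
    plt.imshow(heat[0, 0].detach().cpu().numpy(), cmap="jet")
    plt.axis("off")
    plt.show()
```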

4.3 Comparison with state-of-the-art FPN-based methods

We evaluate MFPN on the COCO test-dev set and compare it with recent state-of-the-art FPN-based methods. The models are trained with scale jitter. For a fair comparison, we only compare results produced by single models without ensembling or multi-scale testing. As shown in Table 4, the MFPN-based detectors, RetinaNet-MFPN and Cascade Mask R-CNN-MFPN, achieve superior results without bells and whistles. RetinaNet-MFPN gets 43.4 AP, which surpasses all the other one-stage detectors. Cascade Mask R-CNN-MFPN obtains 47.6 AP, outperforming TridentNet, SNIP and SNIPER, which use image-pyramid training and testing strategies. In conclusion, MFPN is compatible with both powerful one-stage and two-stage detectors and achieves very competitive single-model results.

5 Conclusion

In this paper, we first describe three FPNs of different architectures (i.e., Top-down, Bottom-up, and Fusing-splitting) for extracting multi-scale features to solve the scale variation problem in object detection. Based on them, we propose a novel Mixture Feature Pyramid Network (MFPN), which effectively learns powerful multi-scale features and can be simply assembled into both one-stage and two-stage detectors. On the MS-COCO benchmark, MFPN improves performance across all scale ranges and enhances both one-stage and two-stage FPN-based detectors by about 2% AP, leading to very competitive results.

References

  • [1] B. Singh and L. S. Davis (2018) An analysis of scale invariance in object detection – SNIP. In CVPR, Cited by: §1, §2, Table 4.
  • [2] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos (2016) A unified multi-scale deep convolutional neural network for fast object detection. In ECCV, Cited by: §2.
  • [3] Z. Cai and N. Vasconcelos (2018) Cascade R-CNN: delving into high quality object detection. In CVPR, Cited by: Table 4.
  • [4] J. Dai, Y. Li, K. He, and J. Sun (2016) R-FCN: object detection via region-based fully convolutional networks. In NIPS, Cited by: §1, §2, §3.4.
  • [5] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In ICCV, Cited by: §2, Table 4.
  • [6] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick (2017) Mask R-CNN. In ICCV, Cited by: §1, §2, §3.4, Table 4.
  • [7] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, Cited by: §2.
  • [8] T. Kong, F. Sun, A. Yao, H. Liu, M. Lu, and Y. Chen (2017) RON: reverse connection with objectness prior networks for object detection. In CVPR, Cited by: §1.
  • [9] T. Kong, F. Sun, W. Huang, and H. Liu (2018) Deep Feature Pyramid Reconfiguration for Object Detection. In ECCV, Cited by: §2.
  • [10] H. Law and J. Deng (2018) CornerNet: detecting objects as paired keypoints. In ECCV, Cited by: Table 4.
  • [11] Y. Li, Y. Chen, N. Wang, and Z. Zhang (2019) Scale-Aware Trident Networks for Object Detection. In ICCV, Cited by: §2, Table 4.
  • [12] M. Lin, Q. Chen, and S. Yan (2014) Network in network. In ICLR, Cited by: §3.1.
  • [13] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie (2017) Feature pyramid networks for object detection. In CVPR, Cited by: §1, §1, §2, §3.1, §4.1, Table 4.
  • [14] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar (2017) Focal Loss for Dense Object Detection. In ICCV, Cited by: §1, §1, §2, §3.4, Table 4.
  • [15] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In ECCV, Cited by: §4.1.
  • [16] S. Liu, D. Huang, and Y. Wang (2018) Receptive field block net for accurate and fast object detection. In ECCV, Cited by: §2.
  • [17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In ECCV, Cited by: §1, §1, §2, Table 4.
  • [18] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin (2019) Libra R-CNN: towards balanced learning for object detection. In CVPR, Cited by: §2, §4.2.
  • [19] S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, Cited by: §1, §3.4.
  • [20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li (2015) ImageNet large scale visual recognition challenge. In IJCV, Cited by: §4.1.
  • [21] A. Shrivastava, R. Sukthankar, J. Malik, and A. Gupta (2016) Beyond skip connections: top-down modulation for object detection. In CoRR, Cited by: §1.
  • [22] B. Singh, M. Najibi, and L. S. Davis (2018) SNIPER: efficient multi-scale training. In NIPS, Cited by: §2, Table 4.
  • [23] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In CVPR, Cited by: §3.1.
  • [24] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Li (2018) Single-shot refinement neural network for object detection. In CVPR, Cited by: §1, §2, §4.1, Table 4.
  • [25] Q. Zhao, T. Sheng, Y. Wang, and Z. Tang (2019) M2Det: a single-shot object detector based on multi-level feature pyramid network. In AAAI, Cited by: §2, Table 4.
  • [26] C. Zhu, Y. He, and M. Savvides (2019) Feature selective anchor-free module for single-shot object detection. In CVPR, Cited by: Table 4.