
Rethinking Classification and Localization in R-CNN

by Yue Wu, et al.

Modern R-CNN based detectors share the RoI feature extractor head for both the classification and localization tasks, based upon the correlation between the two tasks. In contrast, we found that different head structures (i.e. fully connected head and convolution head) have opposite preferences towards these two tasks. Specifically, the fully connected head is more suitable for the classification task, while the convolution head is more suitable for the localization task. Therefore, we propose a double-head method that separates these two tasks into different heads (i.e. a fully connected head for classification and a convolution head for box regression). Without bells and whistles, our method gains +3.4 and +2.7 points mAP on the MS COCO dataset over Feature Pyramid Network baselines with ResNet-50 and ResNet-101 backbones, respectively.





1 Introduction

A majority of two-stage object detectors [girshick15fastrcnn, girshick2014rcnn, ren2015faster, Dai_RFCN, Lin_FPN] share a similar structure: learning two tasks (classification and bounding box regression) with a shared feature extraction head on proposals, as these tasks are highly correlated. Two different head structures are widely used: a convolution head (conv5) on a single feature map (conv4) in Faster R-CNN [ren2015faster], and a fully connected head (2-fc) on multiple levels of feature maps in FPN [Lin_FPN]. However, there is a lack of understanding of how the classification and localization tasks interact with these two head structures.

Figure 1: Overview of the double-head detector. (a) the original FPN with a fully connected (2-fc) head, (b) modified FPN with a convolution head (used for toy experiment), and (c) our proposed double-head FPN, which splits classification and localization into two heads. The fully connected head is used for classification during inference, with the localization as an auxiliary task during training. The convolution head is used for bounding box regression during inference, with the classification as an auxiliary task during training.

Related to this problem, the winner of the COCO 2018 Detection Challenge (Megvii) proposed to combine bounding box regression and segmentation in a convolution head, leaving classification alone in a 2-fc head. This motivates us to rethink classification and localization with respect to different head structures. Intuitively, spatial information is crucial for object classification, to determine whether a complete object (not just part of the object) is covered by the region proposal. The fully connected head fits this task well, as it is spatially sensitive. In contrast, the regression task requires object-level context to determine the offset of the proposed bounding box in terms of center, width, and height. The convolution head is more suitable for this task due to its capability to extract object-level context. Therefore, we believe that neither a single fully connected head nor a single convolution head alone is good enough to handle classification and localization simultaneously.
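To make the spatial-sensitivity argument concrete, a rough parameter count contrasts the two head structures: each spatial position of the RoI feature gets its own fc weights, whereas conv weights are shared across positions. The dimensions below (7×7×256 RoI features, 1024-d fc layers, three 3×3 convs) are illustrative assumptions, not the paper's exact configuration.

```python
# Back-of-the-envelope parameter counts for a 2-fc head vs. a stack of
# 3x3 convolutions. Dimensions are illustrative assumptions.

ROI_H, ROI_W, ROI_C = 7, 7, 256   # pooled RoI feature map (assumed)
FC_DIM = 1024                     # width of each fully connected layer (assumed)

def fc_head_params():
    # Every spatial position has its own weights -> spatially sensitive,
    # but parameter-heavy. Biases omitted for simplicity.
    flat = ROI_H * ROI_W * ROI_C
    return flat * FC_DIM + FC_DIM * FC_DIM  # fc1 + fc2

def conv_head_params(num_convs=3, k=3):
    # Weights are shared across spatial positions -> far fewer parameters,
    # and the receptive field aggregates object-level context.
    return num_convs * (k * k * ROI_C * ROI_C)

print(fc_head_params())    # 13893632
print(conv_head_params())  # 1769472
```

The roughly 8× gap illustrates the trade-off: the fc head spends its capacity on position-specific weights, while the conv head spends it on context aggregation.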

In this paper, we propose a double-head detector, which includes a fully connected head (FC-Head) for classification and a convolution head (Conv-Head) for box regression (see Figure 1-(c)), to leverage the advantages of both heads. First, we found that our double-head design is better than using either head (FC-Head or Conv-Head) alone for both the classification and localization tasks, and that it outperforms the two single-head detectors (Figure 1-(a), (b)). This demonstrates that the fully connected head prefers the classification task while the convolution head prefers the localization task. Second, we found that our double-head model can be further improved by using the other task for supervision, i.e. adding localization supervision to FC-Head and classification supervision to Conv-Head.
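The auxiliary supervision described above can be sketched as a weighted training loss, where each head's auxiliary task is down-weighted relative to its focused task. The weighting scheme and the values of `lam_fc` and `lam_conv` below are illustrative assumptions, not the paper's tuned formulation.

```python
# Hedged sketch of a double-head training loss with auxiliary supervision.
# lam_fc / lam_conv (assumed values) control how much each head's focused
# task dominates over its auxiliary task.

def double_head_loss(cls_fc, reg_fc, cls_conv, reg_conv,
                     lam_fc=0.7, lam_conv=0.2):
    """cls_fc/reg_fc: classification / box-regression losses on FC-Head;
    cls_conv/reg_conv: the same losses on Conv-Head."""
    # FC-Head focuses on classification; its regression term is auxiliary.
    fc_loss = lam_fc * cls_fc + (1.0 - lam_fc) * reg_fc
    # Conv-Head focuses on regression; its classification term is auxiliary.
    conv_loss = lam_conv * cls_conv + (1.0 - lam_conv) * reg_conv
    return fc_loss + conv_loss
```

With `lam_fc > 0.5 > lam_conv`, each head is still trained on both tasks (supplying the auxiliary supervision), but inference only uses the FC-Head's scores and the Conv-Head's boxes.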

Our double-head detector outperforms the FPN baseline by a non-negligible margin. Experimental results on the MS COCO dataset demonstrate that our approach gains 3.4 and 2.7 points mAP over FPN baselines with ResNet-50 and ResNet-101 backbones, respectively.

2 Related Work

One-stage object detectors: One-stage methods have attracted increasing attention recently, mostly due to their computational efficiency. OverFeat [sermanet2013overfeat] detects objects by sliding multi-scale windows on shared convolutional feature maps. Recently, SSD [liu2016ssd, fu2017dssd] and YOLO [redmon2016you, redmon2017yolo9000] have been tuned for speed by predicting object classes and locations directly. RetinaNet [lin2018focal] alleviates the extreme foreground-background class imbalance by introducing the focal loss.

Two-stage object detectors: RCNN [girshick2014rich] applies a deep neural network to extract features from proposals generated by selective search, and feeds them into SVM classifiers. SPPNet [he2014spatial] speeds up RCNN significantly by introducing a spatial pyramid pooling layer to reuse features computed over feature maps generated at different scales. Fast RCNN [girshick15fastrcnn] utilizes a differentiable RoI pooling operation to fine-tune all layers end-to-end, further improving speed and performance over SPPNet. Later, Faster RCNN [ren2015faster] introduces the Region Proposal Network (RPN) into the network. R-FCN [Dai_RFCN] employs position-sensitive RoI pooling to address the translation-variance problem in object detection. Feature Pyramid Network (FPN) [Lin_FPN] builds a top-down architecture with lateral connections to exploit high-level semantic feature maps at all scales, which benefits small object detection in particular, as finer feature maps are utilized. Deformable ConvNet [dai2017deformable] proposes deformable convolution and deformable RoI pooling to augment the spatial sampling locations. Cascade RCNN [Cai_2018_CVPR] constructs a sequence of detectors trained with increasing intersection over union (IoU) thresholds, which improves object detection progressively. IoU-Net [jiang2018acquisition] introduces a standalone branch to predict the IoU between each detected bounding box and the matched ground truth, which generates a localization confidence to replace the classification confidence in non-maximum suppression (NMS).
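For reference, greedy NMS, the step where IoU-Net substitutes its predicted localization confidence for the classification score, ranks boxes by a confidence value and suppresses overlapping lower-ranked ones. The pure-Python sketch below is illustrative only; real detectors use vectorized, per-class implementations.

```python
# Minimal greedy NMS sketch. Boxes are (x1, y1, x2, y2); `scores` can be
# classification confidence or, as in IoU-Net, localization confidence.

def iou(a, b):
    # Intersection-over-union of two axis-aligned boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    if inter <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Keep the highest-confidence box, drop boxes overlapping it above
    `thresh`, and repeat on the remainder. Returns kept indices."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) < thresh]
    return keep
```

IoU-Net's observation is that the ranking (and hence which boxes survive) changes when `scores` reflects localization quality rather than classification confidence.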

Next, we further compare backbone networks and detection heads for two-stage detectors.
Backbone Networks: Fast RCNN [girshick15fastrcnn] and Faster RCNN [ren2015faster] extract features in stage conv4, while FPN [Lin_FPN] utilizes features from multiple layers (conv2 to conv5). Deformable-v1 [dai2017deformable] applies deformable convolution at the last few convolution layers, and Deformable-v2 [zhu2018deformable] adds deformable convolutions at all convolution layers in stages conv3, conv4, and conv5. Trident Network [li2019scale] generates scale-aware feature maps with a multi-branch architecture.
Detection Heads: Light-Head RCNN [li2017light] utilizes thin feature maps and a cheap subnet to reduce the computational cost of the detection head. Cascade RCNN [Cai_2018_CVPR] builds multiple detection heads in a cascade manner. Mask RCNN [he2017mask] introduces an extra head for object segmentation. IoU-Net [jiang2018acquisition] proposes an extra head to predict the IoU score of each proposal. Similar to IoU-Net, Mask Scoring RCNN [huang2019msrcnn] presents an extra head to predict the MaskIoU score of each generated segmentation mask. In contrast to existing detection heads, which share the same RoI feature extractor for both classification and bounding box regression, we propose to split these two tasks into different heads to leverage the power of both the fully connected head and the convolution head.

3 Hypothesis on Detection Heads