A majority of two-stage object detectors [girshick15fastrcnn, girshick2014rcnn, ren2015faster, Dai_RFCN, Lin_FPN] share a similar structure: they learn two tasks (classification and bounding box regression) with a shared feature extraction head on proposals, as these tasks are highly correlated. Two head structures are widely used: a convolution head (conv5) on a single feature map (conv4) in Faster R-CNN [ren2015faster] and a fully connected head (2-fc) on multi-level feature maps in FPN [Lin_FPN]. However, the correlation between the classification and localization tasks on these two head structures is not well understood.
Related to this problem, the recent COCO Detection 18 Challenge winner (Megvii; see http://cocodataset.org/#detection-leaderboard) proposed to combine bounding box regression and segmentation in a convolution head, and to leave classification alone in a 2-fc head. This motivates us to rethink classification and localization with respect to different head structures. Intuitively, spatial information is crucial for object classification to determine whether a complete object (not just part of the object) is covered by the region proposal. The fully connected head fits this task well, as it is spatially sensitive. In contrast, the regression task requires object-level context to determine the offset of the proposed bounding box in terms of center, width and height. The convolution head is more suitable for this task due to its capability to extract object-level context. Therefore, we believe that neither a single fully connected head nor a single convolution head is good enough to handle classification and localization simultaneously.
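To make the regression target concrete, the following is a minimal sketch of the box offset parameterization commonly used in the R-CNN family (center shifts normalized by the proposal size, and log-scale width and height ratios). The text above does not spell out the exact form, so the formulas and the helper names `to_center`, `encode_box`, and `decode_box` are assumptions for illustration only.

```python
import math

def to_center(box):
    """Convert (x1, y1, x2, y2) to (center_x, center_y, width, height)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0, x2 - x1, y2 - y1)

def encode_box(proposal, target):
    """Regression target that moves `proposal` onto `target`.

    Assumed parameterization: center offsets relative to the proposal size,
    and log ratios for width and height.
    """
    px, py, pw, ph = to_center(proposal)
    tx, ty, tw, th = to_center(target)
    return ((tx - px) / pw, (ty - py) / ph, math.log(tw / pw), math.log(th / ph))

def decode_box(proposal, deltas):
    """Apply predicted offsets to a proposal and return (x1, y1, x2, y2)."""
    px, py, pw, ph = to_center(proposal)
    dx, dy, dw, dh = deltas
    cx, cy = px + dx * pw, py + dy * ph
    w, h = pw * math.exp(dw), ph * math.exp(dh)
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)

proposal = (0.0, 0.0, 10.0, 10.0)
ground_truth = (5.0, 0.0, 15.0, 10.0)
deltas = encode_box(proposal, ground_truth)   # (0.5, 0.0, 0.0, 0.0)
recovered = decode_box(proposal, deltas)      # (5.0, 0.0, 15.0, 10.0)
```

In this example the proposal covers only half of the object horizontally, so the target is a shift of half the proposal width with no change in size.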
In this paper, we propose a double-head detector, which includes a fully connected head (FC-Head) for classification and a convolution head (Conv-Head) for box regression (see Figure 1-(c)), to leverage the advantages of both heads. First, we find that our double-head design is better than using either head (FC-Head or Conv-Head) alone for both the classification and localization tasks. It also outperforms the two single-head detectors (Figure 1-(a), (b)). This demonstrates that the fully connected head is better suited to the classification task, while the convolution head is better suited to the localization task. Second, we find that our double-head model can be further improved by using the other task for supervision, i.e., adding localization supervision on FC-Head and classification supervision on Conv-Head.
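The split described above can be sketched in a few lines of dependency-free Python. This is an illustrative toy, not the implementation evaluated in this paper: the single hidden layer in the FC-Head, the 1x1 convolution followed by average pooling in the Conv-Head, the random weights, and all function names are assumptions made for brevity. The additional cross-task supervision is omitted.

```python
import random

def linear(x, weight, bias):
    """Dense layer; `weight` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(n_out, n_in, rng):
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

def make_params(channels, height, width, hidden, num_classes, seed=0):
    rng = random.Random(seed)
    return {
        "fc": init_layer(hidden, channels * height * width, rng),
        "cls": init_layer(num_classes, hidden, rng),
        "conv": init_layer(hidden, channels, rng),
        "box": init_layer(4, hidden, rng),
    }

def fc_head(roi, params):
    """Classification branch: flattening gives every spatial cell its own weights."""
    flat = [v for channel in roi for row in channel for v in row]
    hidden = relu(linear(flat, *params["fc"]))
    return linear(hidden, *params["cls"])

def conv_head(roi, params):
    """Regression branch: a 1x1 convolution (weights shared across positions),
    then global average pooling and a 4-way box output."""
    channels, height, width = len(roi), len(roi[0]), len(roi[0][0])
    pooled = [0.0] * len(params["conv"][1])
    for i in range(height):
        for j in range(width):
            pixel = [roi[c][i][j] for c in range(channels)]
            out = relu(linear(pixel, *params["conv"]))
            pooled = [p + o / (height * width) for p, o in zip(pooled, out)]
    return linear(pooled, *params["box"])

def double_head(roi, params):
    """Both heads read the same RoI feature but predict different outputs."""
    return fc_head(roi, params), conv_head(roi, params)

rng = random.Random(1)
roi = [[[rng.random() for _ in range(7)] for _ in range(7)] for _ in range(2)]
params = make_params(channels=2, height=7, width=7, hidden=8, num_classes=3)
scores, deltas = double_head(roi, params)
print(len(scores), len(deltas))  # prints: 3 4
```

The point of the sketch is structural: the two branches share the RoI feature as input but have no shared parameters, so each can use the layer type that suits its task.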
Our double-head detector outperforms the FPN baseline by a non-negligible margin: experimental results on the MS COCO dataset demonstrate that our approach gains 3.4 and 2.7 mAP over FPN baselines with ResNet-50 and ResNet-101 backbones, respectively.
2 Related Work
One-stage object detectors: One-stage methods have attracted increasing attention recently, mostly due to their computational efficiency. OverFeat [sermanet2013overfeat] detects objects by sliding multi-scale windows on the shared convolutional feature maps. More recently, SSD [liu2016ssd, fu2017dssd] and YOLO [redmon2016you, redmon2017yolo9000] have been tuned for speed by predicting object classes and locations directly. RetinaNet [lin2018focal] alleviates the extreme foreground-background class imbalance problem by introducing the focal loss.
Two-stage object detectors: RCNN [girshick2014rich] applies a deep neural network to extract features for proposals generated by selective search [uijlings2013selective] and feeds them into SVM classifiers. SPPNet [he2014spatial] speeds up RCNN significantly by introducing a spatial pyramid pooling layer to reuse features computed over feature maps generated at different scales. Fast RCNN [girshick15fastrcnn] utilizes a differentiable RoI Pooling operation to fine-tune all layers end-to-end, and further improves the speed and performance over SPPNet. Later, Faster RCNN [ren2015faster] introduces a Region Proposal Network (RPN) into the network. R-FCN [Dai_RFCN] employs position-sensitive RoI pooling to address the translation-variance problem in object detection. Feature Pyramid Network (FPN) [Lin_FPN] builds a top-down architecture with lateral connections to utilize high-level semantic feature maps at all scales, which benefits small object detection in particular, as finer feature maps are utilized. Deformable ConvNet [dai2017deformable] proposes deformable convolution and deformable RoI pooling to augment the spatial sampling locations. Cascade RCNN [Cai_2018_CVPR] constructs a sequence of detectors trained with increasing intersection over union (IoU) thresholds, which improves object detection progressively. IoU-Net [jiang2018acquisition] introduces another standalone branch to predict the IoU between each detected bounding box and the matched ground truth, which generates a localization confidence to replace the classification confidence for non-maximum suppression (NMS).
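Since IoU and NMS recur throughout the methods above, a minimal sketch of both may help. It assumes axis-aligned boxes in (x1, y1, x2, y2) form and uses the standard greedy NMS procedure; the function names and the 0.5 default threshold are illustrative choices, not taken from any of the cited works. The `scores` argument can hold classification confidences or, in the spirit of IoU-Net, localization confidences.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if
    its IoU with every already kept box does not exceed `threshold`."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= threshold for k in keep):
            keep.append(i)
    return keep

boxes = [(0.0, 0.0, 10.0, 10.0), (1.0, 0.0, 11.0, 10.0), (20.0, 20.0, 30.0, 30.0)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)  # [0, 2]: the second box overlaps the first too much
```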
Next, we compare backbone networks and detection heads for two-stage detectors.
Backbone Networks: Fast RCNN [girshick15fastrcnn] and Faster RCNN [ren2015faster] extract features from stage conv4, while FPN [Lin_FPN] utilizes features from multiple layers (conv2 to conv5). Deformable-v1 [dai2017deformable] applies deformable convolution at the last few convolution layers, and Deformable-v2 [zhu2018deformable] extends deformable convolution to all convolution layers in stages conv3, conv4, and conv5. Trident Network [li2019scale] generates scale-aware feature maps with a multi-branch architecture.
Detection Heads: Light-Head RCNN [li2017light] utilizes thin feature maps and a cheap subnet in its detection heads to reduce their computational cost. Cascade RCNN [Cai_2018_CVPR] builds multiple detection heads in a cascade manner. Mask RCNN [he2017mask] introduces an extra head for object segmentation. IoU-Net [jiang2018acquisition] proposes an extra head to predict the IoU score of each proposal. Similar to IoU-Net, Mask Scoring RCNN [huang2019msrcnn] presents an extra head to predict the MaskIoU score of each generated segmentation mask. In contrast to the existing detection heads, which share the same RoI feature extractor for both classification and bounding box regression, we propose to split these two tasks into different heads to leverage the power of both the fully connected head and the convolution head.