Object detection is a location-sensitive task. A location-sensitive task requires location-sensitive features, which means the input feature should vary with its reference box. Thus, aligning features with their corresponding bounding boxes is at the core of object detection. From the early sliding-window methods to the state-of-the-art generalized R-CNNs Ren et al. (2015); He et al. (2017), detectors have made efforts to ensure feature alignment. The alignment of feature and box can be done at the image level as in R-CNN Girshick et al. (2014), at the feature level as in Fast R-CNN Girshick (2015), or even at the result level as in R-FCN Dai et al. (2016). It can be achieved by crop-and-resize or by region feature extractors such as RoIPooling Girshick (2015) and RoIAlign He et al. (2017).
The introduction of the anchor box Ren et al. (2015) changed the game. Anchor boxes are a set of virtual rectangles of different scales and aspect ratios serving as references for classification and regression. Anchor shapes are hand-crafted, chosen by clustering ground-truth bounding boxes, or learned from previous stages in the network Zhang et al. (2018); Wang et al. (2019). Each spatial location of a feature map is associated with multiple anchors of different shapes. Sharing one feature among multiple anchor boxes violates the one-to-one correspondence between reference boxes and features, which breaks the strict location sensitivity of object detection.
One-stage detectors Liu et al. (2016); Lin et al. (2017b); Redmon and Farhadi (2018) suffer most from this misalignment because they lack the explicit alignment operations, such as RoIPooling, found in two-stage detectors. The multi-scale features widely adopted in one-stage detectors partially resolve the misalignment by assigning anchors of different scales to the proper feature levels. Nevertheless, the implicit alignment conducted by multi-scale features only relieves the misalignment incurred by scale; it cannot help with the misalignment induced by aspect ratios. Besides, feature adaptation Zhang et al. (2018); Wang et al. (2019) can also be seen as a simple form of feature alignment.
In this work, by discovering the deep connection between im2col Chellapilla et al. (2006) and RoIAlign, we propose a novel RoI Convolution (RoIConv) which performs accurate feature alignment in one-stage detectors for the first time. Our RoIConv has the same computational complexity as the vanilla convolution and can be seamlessly integrated into any existing one-stage detector in a plug-and-play manner. Based on that, we also propose a Fully Convolutional AlignDet detector which combines the flexibility of learned anchors with the precision of our RoIConv. Our method enjoys the explicit alignment of two-stage detectors while retaining the fully convolutional nature and computation cost of one-stage detectors, getting the best of both worlds.
To summarize, our contributions are as follows:
- We propose a plug-and-play RoI Convolution (RoIConv) operation, which performs exact feature alignment densely for one-stage detectors for the first time.
- We propose a Fully Convolutional AlignDet model which fully utilizes the benefits of learned anchors with our RoIConv.
- Our method achieves a state-of-the-art 44.1 mAP with the ResNeXt-101 FPN backbone, improving over the strong RetinaNet baseline by 3.3 mAP while keeping the same speed.
2 Related Works
Redmon et al. propose YOLO Redmon et al. (2016), which is the first work to use an end-to-end convolutional network for detection directly on the dense feature map. SSD Liu et al. (2016) introduces anchors and multi-scale feature maps into one-stage detection. RetinaNet Lin et al. (2017b) proposes focal loss to address the overwhelming number of easy background samples introduced by dense multi-scale feature maps. Keypoint-based detection methods Law and Deng (2018); Duan et al. (2019); Zhou et al. (2019a, b) have shown promising results by detecting and grouping the corners of objects. Besides, recently emerging anchor-free methods Zhu et al. (2019); Kong et al. (2019); Tian et al. (2019); Huang et al. (2015); Yang et al. (2018) explore the potential of detecting objects without virtual anchor boxes.
Feature Alignment in Object Detection
SPP-net He et al. (2015b) is the first work to extract fixed-length features from candidate windows in convolutional networks. RoIPooling Girshick (2015) improves over SPP-net by enabling end-to-end training. Both SPP-net and RoIPooling round the sub-region to the nearest integer boundary, which incurs quantization errors. To address the quantization error of RoIPooling, RoIAlign He et al. (2017) uses bilinear interpolation to compute the exact values at sampled locations in each RoI bin, showing significant gains for localization. Deformable RoIPooling Dai et al. (2017) adds offsets to each sub-region of RoIPooling, bringing adaptiveness to the region feature. Guided Anchor Wang et al. (2019) tries to adapt features for learned anchors with anchor-guided deformable convolutions.
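The difference between snapping a coordinate to the grid and interpolating at it can be illustrated at a single sampling point. The sketch below is a simplified illustration only, not RoIPooling or RoIAlign themselves; the function names and the toy feature map are hypothetical.

```python
import math


def bilinear_sample(feature, x, y):
    """Bilinearly interpolate a 2-D grid feature[row][col] at a fractional (x, y)."""
    h, w = len(feature), len(feature[0])
    # Clamp so that the four neighbouring cells always exist.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * feature[y0][x0] + ax * feature[y0][x1]
    bottom = (1 - ax) * feature[y1][x0] + ax * feature[y1][x1]
    return (1 - ay) * top + ay * bottom


def quantized_sample(feature, x, y):
    """Snap the coordinate to the nearest integer cell before reading the value."""
    h, w = len(feature), len(feature[0])
    col = min(max(int(round(x)), 0), w - 1)
    row = min(max(int(round(y)), 0), h - 1)
    return feature[row][col]


feature = [[0.0, 10.0],
           [20.0, 30.0]]
exact = bilinear_sample(feature, 0.25, 0.4)     # uses all four neighbours
snapped = quantized_sample(feature, 0.25, 0.4)  # reads cell (0, 0) only
print(exact, snapped)
```

The gap between the two values is the kind of quantization error that interpolation removes.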
Cascaded Object Detection System
RefineDet Zhang et al. (2018) introduces an anchor refinement module to adjust center points and sizes of anchors, providing better reference boxes for further regression. Cai et al. propose the Cascade R-CNN Cai and Vasconcelos (2018), which improves the quality of proposals in cascaded stages with increasing IoU thresholds. Region features are re-extracted for the refined proposals in each stage.
3 Pilot Experiment
3.1 Multi-scale Features and One-stage Object Detection
Along with the development of detectors, multi-scale feature maps Liu et al. (2016); Lin et al. (2017a) have played a central role in handling the scale variations of objects. We argue that multi-scale feature maps are especially essential for one-stage detectors since they lack the ability to align features with their corresponding bounding boxes. To demonstrate the importance of multi-scale features, let us consider two settings for one-stage detectors. For a RetinaNet Lin et al. (2017b) detector with an FPN backbone, the strides for different pyramid levels are $\{8, 16, 32, 64, 128\}$. When equipped with an anchor box of a scale factor of 4, this detector yields a set of anchor boxes of size $\{32, 64, 128, 256, 512\}$ across five scales of features. For comparison, we can construct a detector with only one feature map, which has a single stride of 16. Given a set of anchors with scale factors of $\{2, 4, 8, 16, 32\}$, the second detector yields the same set of anchor boxes, but on a single scale of feature. However, the former detector gives an mAP of 32.4 on the COCO minival set, while the latter gives an mAP of only 20.4. We also test the same setting for a standard two-stage detector, Faster R-CNN. It shows only a minor drop (mAP 33.9 down to 31.6) when using only one scale of features. The drastic performance drop for the one-stage detector is unusual, considering that two-stage detectors that adopt a single-scale feature map can still achieve comparable results.
|Faster R-CNN Lin et al. (2017a)||ResNet-50 FPN||33.9||56.9||17.8||37.7||45.8||-|
|Faster R-CNN Lin et al. (2017a)||ResNet-50||31.6||53.1||13.2||35.6||47.1|
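The anchor bookkeeping behind the two pilot settings can be checked with a few lines of arithmetic. This is a sketch under an assumption: the pyramid strides below are the usual RetinaNet values and are not taken from the text above.

```python
# Assumption: the usual RetinaNet pyramid strides, one per feature level.
fpn_strides = [8, 16, 32, 64, 128]
anchor_scale = 4  # scale factor of the single anchor at each level

# Multi-scale setting: one anchor per level, sized scale * stride.
multi_scale_anchor_sizes = [anchor_scale * s for s in fpn_strides]

# Single-scale setting: every anchor lives on the stride-16 feature map,
# so the scale factors must absorb the change in stride.
single_stride = 16
single_scale_factors = [size // single_stride for size in multi_scale_anchor_sizes]
single_scale_anchor_sizes = [f * single_stride for f in single_scale_factors]

print("multi-scale anchors :", multi_scale_anchor_sizes)
print("single-scale factors:", single_scale_factors)
print("single-scale anchors:", single_scale_anchor_sizes)
```

Both settings end up with the same set of anchor sizes; only the feature map each anchor is attached to differs.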
A prominent difference between one-stage and two-stage detectors is that one-stage detectors lack RoI feature extractors like RoIPooling Girshick (2015) and RoIAlign He et al. (2017). An RoI feature extractor provides aligned features for each RoI. For one-stage detectors, however, all anchor boxes (the one-stage counterpart of RoIs in two-stage detectors) at the same spatial location share the same features. Multi-scale feature maps merely alleviate this misalignment issue by limiting the range of scales handled by each feature map. We hypothesize that the misalignment of features and anchor boxes leads to the catastrophic performance degradation.
4.1 Im2Col is a special RoIAlign
To fully reveal the essence of feature misalignment, we now take a deeper look at RoI feature extractors of two-stage detectors. Taking RoIAlign He et al. (2017) as an example, it first divides an RoI evenly into sub-regions and then takes the center point of each sub-region on the bilinearly interpolated feature map. (Footnote 1: This is only true for sampling ratio = 1. Since the sampling ratio has little impact as indicated in Lin et al. (2017a), we stick to sampling ratio = 1 to simplify the discussion.) The features extracted from each sub-region are then concatenated to give a feature for the given RoI. As shown in Figure 1, the whole process bears a remarkable resemblance to the im2col Chellapilla et al. (2006) operation, which is a core part of the implementation of convolution. Im2col transforms the 3-D input feature tensor into a 2-D tensor whose dimensions are determined by the height and width of the convolution kernel. Each column of the 2-D tensor represents a tile which the convolution kernel slides on. The only difference is that im2col operates on a fixed set of spatial locations on the input feature map, whereas RoIAlign operates on locations defined by the RoI. Im2col is essentially a special case of RoIAlign. Since convolution is the combination of im2col and a fully connected layer, one-stage detectors are implicitly performing RoIAlign on the backbone feature maps with a fixed bounding box size.
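The statement that a convolution is im2col followed by a fully connected layer can be checked on a toy input. The following is a minimal pure-Python sketch with hypothetical helper names, assuming stride 1, no padding, and a single output channel.

```python
def im2col(feature, k):
    """Flatten every k x k tile of a C x H x W nested list (stride 1, no padding)."""
    c, h, w = len(feature), len(feature[0]), len(feature[0][0])
    columns = []
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            columns.append([feature[ch][y + dy][x + dx]
                            for ch in range(c)
                            for dy in range(k)
                            for dx in range(k)])
    return columns  # one column of length C * k * k per output location


def fully_connected(columns, weight):
    """Dot each column with one flattened kernel, i.e. a single output channel."""
    return [sum(v * wt for v, wt in zip(col, weight)) for col in columns]


def direct_conv(feature, kernel):
    """Reference sliding-window convolution (cross-correlation) with a C x k x k kernel."""
    c, h, w = len(feature), len(feature[0]), len(feature[0][0])
    k = len(kernel[0])
    out = []
    for y in range(h - k + 1):
        for x in range(w - k + 1):
            out.append(sum(feature[ch][y + dy][x + dx] * kernel[ch][dy][dx]
                           for ch in range(c)
                           for dy in range(k)
                           for dx in range(k)))
    return out


feature = [[[1, 2, 3],
            [4, 5, 6],
            [7, 8, 9]]]   # C = 1, H = W = 3
kernel = [[[1, 0],
           [0, -1]]]      # C = 1, k = 2
flat_kernel = [v for ch in kernel for row in ch for v in row]

via_im2col = fully_connected(im2col(feature, 2), flat_kernel)
via_direct = direct_conv(feature, kernel)
print(via_im2col, via_direct)
```

The tile-gathering step is where the sampling locations are fixed, which is the step that RoIAlign replaces with RoI-dependent locations.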
After revealing the connection between RoIAlign and im2col, it is natural to ask what the RoI of a convolution is.
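For a $k \times k$ convolution, its sampling locations for location $(x, y)$ of the output feature map are given as:
$$\left(x + i - \tfrac{k-1}{2},\; y + j - \tfrac{k-1}{2}\right) \tag{1}$$
and for a RoIAlign with an RoI of $(x_1, y_1, x_2, y_2)$ and an output of $k \times k$, its sampling locations on the feature map of stride $s$ are given as
$$\left(\frac{x_1}{s} + \left(i + \tfrac{1}{2}\right)\frac{x_2 - x_1}{sk},\; \frac{y_1}{s} + \left(j + \tfrac{1}{2}\right)\frac{y_2 - y_1}{sk}\right) \tag{2}$$
where $i \in \{0, \dots, k-1\}$ and $j \in \{0, \dots, k-1\}$ for both operators. Equating Equation 1 and Equation 2 and solving for the RoI gives
$$(x_1, y_1, x_2, y_2) = \left(s\left(x - \tfrac{k}{2}\right),\; s\left(y - \tfrac{k}{2}\right),\; s\left(x + \tfrac{k}{2}\right),\; s\left(y + \tfrac{k}{2}\right)\right) \tag{3}$$
Equation 3 shows that a $k \times k$ convolution on a feature map of stride $s$ is equivalent to a $k \times k$ RoIAlign for each input location, followed by a fully connected layer whose weight is the convolution kernel.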
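The equivalence can be checked numerically. The sketch below assumes a particular coordinate convention (feature-map location x maps to image coordinate stride * x, kernel taps are indexed from 0, sampling ratio 1); the function names are hypothetical and other conventions would shift the RoI by a constant.

```python
def conv_sampling_locations(x, y, k):
    """Regular k x k sampling grid of a convolution centred on location (x, y)."""
    half = (k - 1) / 2
    return [(x + i - half, y + j - half) for j in range(k) for i in range(k)]


def roialign_sampling_locations(roi, k, stride):
    """Bin centres of a k x k RoIAlign on a feature map of the given stride."""
    x1, y1, x2, y2 = roi
    bin_w = (x2 - x1) / (stride * k)
    bin_h = (y2 - y1) / (stride * k)
    return [(x1 / stride + (i + 0.5) * bin_w, y1 / stride + (j + 0.5) * bin_h)
            for j in range(k) for i in range(k)]


def implicit_roi(x, y, k, stride):
    """RoI (in image pixels) whose RoIAlign grid coincides with the convolution grid."""
    half = stride * k / 2
    return (stride * x - half, stride * y - half,
            stride * x + half, stride * y + half)


k, stride = 3, 16
roi = implicit_roi(10, 7, k, stride)
conv_grid = conv_sampling_locations(10, 7, k)
roi_grid = roialign_sampling_locations(roi, k, stride)
print(roi)
print(conv_grid == roi_grid)
```

Under these assumptions the implicit RoI is a square of side stride * k centred on the output location.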
We now revisit the example introduced in the pilot experiment. For the feature map of the FPN with a total stride of 16, a $3 \times 3$ convolution gives an implicit RoI of $48 \times 48$. This RoI partially aligns with the anchor boxes of size $32 \times 32$ and $64 \times 64$, but incurs heavy misalignment for the other three anchors. Stacking multiple convolutions in the detection head may enlarge the implicit RoI for each spatial location, but the misalignment between a single RoI and multiple anchors persists. Detectors like SSD and RetinaNet utilize multi-scale feature maps to solve this problem. By assigning anchors to feature maps of the proper stride, the RoI of the convolution and the anchor match. We take RetinaNet as an example. The implicit RoIs of $3 \times 3$ convolutions on the FPN levels span from $24 \times 24$ to $384 \times 384$, covering the anchors from $32 \times 32$ to $512 \times 512$. This only partially addresses the misalignment: it still cannot handle harder cases such as extreme aspect ratios.
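How well one implicit RoI covers several anchors can be quantified with a concentric-box IoU. The numbers below are a sketch under assumptions not stated in the paragraph itself: a 3 x 3 head convolution on a stride-16 map, five square anchors from 32 to 512 pixels, and an illustrative 4:1 anchor chosen to have the same area as the implicit RoI.

```python
def centered_iou(w_a, h_a, w_b, h_b):
    """IoU of two axis-aligned boxes that share the same centre."""
    inter = min(w_a, w_b) * min(h_a, h_b)
    union = w_a * h_a + w_b * h_b - inter
    return inter / union


kernel_size, stride = 3, 16              # assumption: 3 x 3 convolution, stride-16 feature map
implicit_side = kernel_size * stride     # side of the implicit RoI in image pixels
anchor_sides = [32, 64, 128, 256, 512]   # assumption: square anchors of the pilot setting

ious = {side: centered_iou(implicit_side, implicit_side, side, side)
        for side in anchor_sides}
for side, iou in ious.items():
    print(f"anchor {side:>3} x {side:<3} IoU with implicit RoI = {iou:.3f}")

# Hypothetical elongated anchor: same area as the implicit RoI, 4:1 aspect ratio.
elongated_iou = centered_iou(implicit_side, implicit_side, 96, 24)
print(f"anchor  96 x 24  IoU with implicit RoI = {elongated_iou:.3f}")
```

Only the two smallest anchors overlap the implicit RoI substantially, and even an anchor with exactly the same area is poorly covered once its aspect ratio departs from 1:1.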
4.2 RoI Convolution
To address the challenges mentioned above, we devise an operator called RoIConv, which aligns features and corresponding anchor boxes in a principled way. Inspecting Equation 3 closely, we find that the misalignment between the feature and the anchor box is caused by the misalignment between the implicit RoI of the convolution and the actual bounding box. Inspired by deformable convolution Dai et al. (2017), we can adaptively sample locations of the convolution by introducing an offset map for each location. Deformable convolution learns a set of $k \times k$ offset pairs for each location of the output feature map. Each pair of offsets describes the deviation from the regular convolution sampling points. Different from deformable convolution, our offsets are calculated as the difference between the pre-defined anchor box and the implicit RoI instead of being learned. For a $k \times k$ RoIConv on a feature map of stride $s$ and its corresponding anchor box $(x_1, y_1, x_2, y_2)$, the offsets for a specific location $(x, y)$ on the output feature map are given as
$$\Delta x_i = \frac{x_1}{s} + \left(i + \tfrac{1}{2}\right)\frac{x_2 - x_1}{sk} - \left(x + i - \tfrac{k-1}{2}\right), \qquad \Delta y_j = \frac{y_1}{s} + \left(j + \tfrac{1}{2}\right)\frac{y_2 - y_1}{sk} - \left(y + j - \tfrac{k-1}{2}\right) \tag{4}$$
where $i, j \in \{0, \dots, k-1\}$ index the sampling points of the kernel.
It is worth noting that our RoIConv requires no additional computation compared with the vanilla convolution, which helps its seamless integration into existing one-stage detectors. The offsets are a linear combination of the output location and the anchor box coordinates, which can be obtained with a convolution and an element-wise addition. This helps to keep one-stage detectors fully convolutional.
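The offset computation can be sketched directly as "RoIAlign bin centre of the anchor minus regular convolution tap". The code below is an illustration under assumed conventions (feature location x maps to image coordinate stride * x, taps indexed from 0, offsets stored as (dx, dy) in row-major tap order); the function name is hypothetical and this is not the authors' implementation.

```python
def roiconv_offsets(x, y, anchor, k, stride):
    """Per-tap (dx, dy) that moves a regular k x k grid at (x, y) onto the anchor's grid."""
    x1, y1, x2, y2 = anchor
    bin_w = (x2 - x1) / (stride * k)
    bin_h = (y2 - y1) / (stride * k)
    half = (k - 1) / 2
    offsets = []
    for j in range(k):
        for i in range(k):
            target_x = x1 / stride + (i + 0.5) * bin_w   # RoIAlign bin centre
            target_y = y1 / stride + (j + 0.5) * bin_h
            regular_x = x + i - half                     # vanilla convolution tap
            regular_y = y + j - half
            offsets.append((target_x - regular_x, target_y - regular_y))
    return offsets


k, stride = 3, 16
# Anchor equal to the implicit 48 x 48 RoI at location (10, 7): no correction is needed.
aligned = roiconv_offsets(10, 7, (136, 88, 184, 136), k, stride)
# Anchor twice as wide with the same centre: taps spread out horizontally.
wide = roiconv_offsets(10, 7, (112, 88, 208, 136), k, stride)
# Same 48 x 48 shape shifted right by 32 pixels: every tap moves by 2 feature cells.
shifted = roiconv_offsets(10, 7, (168, 88, 216, 136), k, stride)
print(aligned[:3], wide[:3], shifted[:3])
```

Feeding such offsets to a deformable-convolution style operator would then sample the anchor's grid instead of the regular one.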
4.3 Fully Convolutional AlignDet
With the proposed RoI Convolution, we can explore more flexible anchor settings. Following previous work on learned anchors Zhang et al. (2018); Wang et al. (2019), we propose Fully Convolutional AlignDet, which consists of a dense proposal module (DPM) and an aligned detection module (ADM). The dense proposal module can be any network that gives a dense bounding box prediction on the feature maps, including RPN Ren et al. (2015), SSD Liu et al. (2016), RetinaNet Lin et al. (2017b) and even recently proposed anchor-free detectors Zhu et al. (2019); Kong et al. (2019); Tian et al. (2019); Huang et al. (2015); Yang et al. (2018). The dense proposal module learns the bounding box distribution from data and thus frees us from setting anchors manually. As shown in Figure 2, the aligned detection module consists of a RoIConv for aligning the backbone feature with the learned anchors and a subsequent detection head for predicting the final scores and bounding boxes. Feature alignment is especially important here because learned anchors vary far more in scale, aspect ratio, and location than hand-crafted ones.
Comparison with Other Feature Alignment Alternatives
There are also other dense detectors adopting the learned anchor paradigm, which requires feature alignment. RefineDet Zhang et al. (2018) consists of an anchor refinement module, which refines the initial anchor boxes, and an object detection module, which predicts the final class and bounding box of the refined anchors. RefineDet performs feature adaptation with a vanilla convolution, which serves as a baseline for feature alignment. Guided Anchor is an anchor-free dense detector which directly predicts the shape of the anchor box for each spatial location. Due to the varying shape of the learned anchor, Guided Anchor Wang et al. (2019) tries to adapt backbone features to fit the anchor shape by learning deformable offsets from the predicted anchor shapes. The feature adaptation in Guided Anchor is less precise than our RoIConv in two ways. First, it only considers the shape of anchors but ignores the location of anchor boxes. As indicated in Equation 4, the optimal offsets depend on both the shape and the location of the anchor box. Second, the learned offsets are just a heuristic approximation of the offsets calculated from Equation 4. Our method mathematically guarantees strict alignment between features and their corresponding anchors, while theirs cannot.
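The dependence on both shape and location can be made explicit by rewriting the horizontal offset in terms of the anchor center and width. The derivation below is a sketch in assumed notation: $k$ is the kernel size, $s$ the feature stride, $(x, y)$ the output location, $i \in \{0, \dots, k-1\}$ the tap index, and the anchor $(x_1, y_1, x_2, y_2)$ has width $w = x_2 - x_1$ and center $c_x = (x_1 + x_2)/2$.

```latex
\begin{aligned}
\Delta x_i
  &= \frac{x_1}{s} + \Bigl(i + \tfrac{1}{2}\Bigr)\frac{w}{sk}
     - \Bigl(x + i - \tfrac{k-1}{2}\Bigr) \\
  &= \frac{c_x}{s} - \frac{w}{2s} + \Bigl(i + \tfrac{1}{2}\Bigr)\frac{w}{sk}
     - x - \Bigl(i - \tfrac{k-1}{2}\Bigr) \\
  &= \underbrace{\Bigl(\frac{c_x}{s} - x\Bigr)}_{\text{location term}}
     + \underbrace{\Bigl(i - \tfrac{k-1}{2}\Bigr)\Bigl(\frac{w}{sk} - 1\Bigr)}_{\text{shape term}}
\end{aligned}
```

The shape term vanishes when $w = sk$ and the location term vanishes when the anchor is centered on the output location, so offsets predicted from the anchor shape alone can at best recover the second term. The vertical offset decomposes in the same way with $h = y_2 - y_1$ and $c_y = (y_1 + y_2)/2$.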
5.1 Implementation Details
We train the models on the COCO Lin et al. (2014) trainval35k split and report results in mAP on the COCO minival split. We use RetinaNet Lin et al. (2017b) with a single anchor of scale and aspect ratio as our DPM, followed by the ADM. The DPM only performs bounding box regression during the test phase. We set loss weight of both DPM and ADM to . The focal losses in both DPM and ADM are set to and . Specifically, we use ResNet-50 FPN Lin et al. (2017a) and ResNet-101 FPN as our backbones. We use feature pyramids from to . The backbone is initialized from ImageNet Russakovsky et al. (2015) 1k pre-training, and the newly added FPN layers are initialized with He initialization He et al. (2015a). The newly added head layers are initialized with Gaussian initializer with . We freeze the backbone up to as well as all BN parameters. Input images are resized to a short side of 800 and a long side not exceeding 1333, and horizontal flipping is adopted during training. We train all models with SGD for 90k iterations with a starting learning rate of 0.01, dividing the learning rate by 10 at 60k and 80k iterations, using a total batch size of 16 over 8 GPUs. We adopt learning rate warmup for 500 iterations. The weight decay is set to 0.0001. NMS with an IoU threshold of 0.5 is adopted for post-processing.
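The schedule and resizing rule described above can be written down as two small functions. This is a sketch, not the authors' training code: the function names are hypothetical, and the linear shape of the warmup is an assumption, since only its length (500 iterations) is stated.

```python
def learning_rate(iteration, base_lr=0.01, warmup_iters=500,
                  steps=(60000, 80000), gamma=0.1):
    """Step schedule: divide by 10 at 60k and 80k iterations, after a short warmup."""
    if iteration < warmup_iters:
        # Assumption: linear ramp up to base_lr; the text gives only the warmup length.
        return base_lr * (iteration + 1) / warmup_iters
    decays = sum(1 for step in steps if iteration >= step)
    return base_lr * gamma ** decays


def resize_scale(height, width, short_side=800, max_long_side=1333):
    """Scale that brings the short side to 800 unless the long side would exceed 1333."""
    scale = short_side / min(height, width)
    if scale * max(height, width) > max_long_side:
        scale = max_long_side / max(height, width)
    return scale


for it in (0, 499, 30000, 60000, 85000):
    print(it, learning_rate(it))
print(resize_scale(480, 640), resize_scale(500, 1000))
```

For a 500 x 1000 image the long-side cap is the binding constraint, so the short side ends up below 800.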
5.2 RoI Convolution for Single-scale One-stage Detection
We first demonstrate the importance of feature alignment by comparing the performance of RetinaNets based on a single-scale feature map with and without RoIConv. We use the same settings as in the pilot experiments in Sec. 3.1. For RetinaNet + RoIConv, we add one RoIConv on with the pre-defined anchors as RoIs. From the results in Table 2, we can see that RoIConv is an effective and efficient way to align features. With merely one extra convolution, the highly misaligned single-scale RetinaNet recovers 5.0 mAP (from 20.4 to 25.4).
|RetinaNet + RoIConv||ResNet-50||25.4||42.2||8.6||28.6||40.9|
5.3 Fully Convolutional Aligned Detection
We now present the results of our Fully Convolutional AlignDet. We compare our method with and without the ADM. AlignDet w/o ADM is essentially a RetinaNet with a single anchor. We also compare our method with the original RetinaNet, which uses anchors of 3 scales and 3 aspect ratios. As shown in Table 3, although our ADM is simply a 1024c RoIConv followed by a 1024c convolution, AlignDet achieves a 5.5/5.3 mAP improvement over the baseline without ADM for ResNet-50/101. The results demonstrate the effectiveness of feature alignment for one-stage object detection. AlignDet also achieves an improvement of 2.2/2.1 mAP over RetinaNet for ResNet-50/ResNet-101 backbones. Compared with the original RetinaNet, AlignDet uses minimal anchors, which frees users from cumbersome hyper-parameter selection for anchors. This shows that learned anchors are on par with expert-crafted anchors, which echoes recent trends in designing anchor-free one-stage detectors.
|AlignDet w/o ADM||ResNet-50 FPN||1||32.4||52.9||17.5||35.9||43.0|
|AlignDet w/o ADM||ResNet-101 FPN||1||34.5||55.7||18.1||38.4||45.6|
Variants for Feature Alignment
As discussed in Section 4.3, there are different methods for feature alignment. We conduct controlled experiments in this section to find the most effective design choice. For all variants, we employ a convolution with an output channel of 256 regardless of the convolution type to ensure the same number of parameters. For (b) and (c) the convolutions for the offset generation are initialized with a Gaussian of to ensure a roughly zero offset from the beginning. For (c) we first derive the height and width of each anchor box and then apply Batch Normalization to address the large variance of learned anchor shapes. The normalized anchor shapes are then used to generate offsets. From the results in Table 4, we can see that the proposed RoI Convolution surpasses all other variants for feature alignment. Surprisingly, (b) and (c) do not improve over the vanilla convolution. To better understand this, we decode the implicit RoIs from the offsets learned by the deformable convolution and calculate the IoU between the implicit RoIs and the learned anchors. As shown in Figure 4, deformable convolutions actually learn better alignment than vanilla convolutions, but the overall alignment with learned anchors is still far from ideal. We hypothesize that this is because the supervision from the classification loss tends to drive the focus of the convolution to the most discriminative part of the object, which may hurt the alignment.
|(c)||Anchor Guided Deform Conv||35.2||54.8||37.9||18.5||38.7||47.3|
In a cascaded detection pipeline, different parts should be specialized for different purposes. In our Fully Convolutional AlignDet, the DPM is specialized for refining the initial anchor boxes, so we lower the ground truth matching criterion to increase the number of training samples for the DPM regressor. These thresholds only affect the label assignment during training, and all refined anchors are used for prediction during testing. Due to the multi-threshold nature of COCO mAP@0.5:0.95, a detection box with 0.95 IoU counts as correct at all ten IoU thresholds, whereas a box with 0.5 IoU counts at only one, so the former carries ten times the weight of the latter. We therefore raise the ground truth matching criterion for the ADM to bias it towards high-IoU boxes.
|DPM fg / bg||ADM fg / bg||AP||AP50||AP75||AP90||APS||APM||APL|
|0.5 / 0.4||0.5 / 0.4||34.2||55.8||36.3||7.5||18.5||37.4||45.7|
|0.5 / 0.4||0.6 / 0.6||35.6||55.9||38.3||9.7||19.5||38.5||47.2|
|0.5 / 0.4||0.7 / 0.7||35.5||55.1||38.6||10.4||19.1||38.6||47.3|
|0.4 / 0.3||0.6 / 0.6||36.0||56.6||38.5||10.2||19.5||39.2||48.2|
|0.4 / 0.3||0.7 / 0.7||36.2||56.1||39.2||11.3||19.2||39.7||48.5|
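The weighting argument for high-IoU boxes follows from counting how many of the COCO IoU thresholds a detection passes. A minimal sketch (the function name is hypothetical; IoUs are kept as integer percentages to avoid floating-point round-off at the threshold boundaries):

```python
# COCO-style IoU thresholds 0.50, 0.55, ..., 0.95 expressed in percent.
IOU_THRESHOLDS_PCT = list(range(50, 100, 5))


def thresholds_passed(iou_pct):
    """Number of thresholds at which a detection with this IoU counts as a match."""
    return sum(1 for t in IOU_THRESHOLDS_PCT if iou_pct >= t)


for iou_pct in (50, 70, 95):
    print(f"IoU 0.{iou_pct}: matched at {thresholds_passed(iou_pct)} of "
          f"{len(IOU_THRESHOLDS_PCT)} thresholds")
```

A box at 0.95 IoU is matched at all ten thresholds while a box at 0.5 IoU is matched at one, which is the tenfold weighting referred to above.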
RoI Convolution Design
We now explore different design choices of RoIConv. As shown in Table 6, the performance improves steadily as the kernel size increases. A large kernel creates dense sampling points, which minimizes information loss during alignment. High-dimensional output features also help to reduce information loss. Take a $7 \times 7$ RoIConv as an example. Im2col generates a 12544-D feature ($7 \times 7 \times 256$) for each anchor box. Increasing the output channel of RoIConv allows us to preserve more information for the final prediction. A 1024-D output feature with one convolution as head already outperforms a 256-D output feature with four convolutions. Although a large kernel size and high-dimensional output incur substantial computation cost, AlignDet still runs reasonably fast on modern hardware. As shown in the last column of Table 6, our variant gains a 1.0 mAP improvement over the original RetinaNet while being 15% faster with fewer anchors. To further accelerate our AlignDet, we propose the variant which reduces the kernel size of RoIConv for the feature to . The variant is even faster than the original RetinaNet while being 1.9 mAP better.
|Kernel Size||Out Channels||ADM head||AP||AP50||AP75||APS||APM||APL||Speed|
|256||4 conv 256c||36.2||56.1||39.2||19.2||39.7||48.5||47 ms|
|1024||1 conv 1024c||36.7||56.9||40.4||20.1||40.4||49.2||49 ms|
|1024||1 conv 1024c||37.2||57.1||40.7||20.6||40.3||49.9||63 ms|
|1024||1 conv 1024c||37.9||57.7||41.7||21.5||41.1||50.8||86 ms|
|1024||1 conv 1024c||37.6||57.3||41.6||20.7||41.0||50.4||56 ms|
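The 12544-D figure and the cost of widening the output can be reproduced with simple arithmetic. This is a sketch under assumptions: a 256-channel input feature (which makes 7 x 7 x 256 match the quoted 12544), and kernel sizes 3 and 5 included only as illustrative comparison points.

```python
def im2col_feature_dim(kernel_size, in_channels):
    """Length of the per-anchor feature gathered by a k x k RoIConv before projection."""
    return kernel_size * kernel_size * in_channels


def roiconv_weight_count(kernel_size, in_channels, out_channels):
    """Number of weights in the projection that follows the gathering step (no bias)."""
    return im2col_feature_dim(kernel_size, in_channels) * out_channels


in_channels = 256  # assumption: input channel width that reproduces the 12544-D figure
dims = {k: im2col_feature_dim(k, in_channels) for k in (3, 5, 7)}
print(dims)

for out_channels in (256, 1024):
    print(out_channels, roiconv_weight_count(7, in_channels, out_channels))
```

Even a 1024-D output keeps less than a tenth of the 12544 gathered values, which is consistent with wider outputs preserving more information at a higher cost.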
5.5 Comparisons with Other Methods
In this section, we compare our method with other one-stage detectors on the COCO test-dev set. We use the same model and hyperparameters as in previous sections. Following the convention in RetinaNet, we extend the training schedule to and adopt scale jittering from 640 to 800 during training. Compared with other methods, AlignDet achieves much higher AP@0.75, which demonstrates the benefits of aligned features for precise localization.
|YOLOv3 Redmon and Farhadi (2018)||DarkNet-53||608||51ms/M||33.0||57.9||34.4||18.3||35.4||41.9|
|SSD Liu et al. (2016); Fu et al. (2017)||ResNet-101||513||125ms/M||31.2||50.4||33.3||10.2||34.5||49.8|
|DSSD Fu et al. (2017)||ResNet-101||513||156ms/M||33.2||53.3||35.2||13.0||35.4||51.1|
|RefineDet Zhang et al. (2018)||ResNet-101||512||-||36.4||57.5||39.5||16.6||39.9||51.4|
|CornerNet Law and Deng (2018)||Hourglass-104||511||300ms/P||40.5||56.5||43.1||19.4||42.7||53.9|
|ExtremeNet Zhou et al. (2019b)||Hourglass-104||511||322ms/P||40.1||55.3||43.2||20.3||43.2||53.1|
|CenterNet Duan et al. (2019)||Hourglass-104||511||340ms/P||44.9||62.4||48.1||25.6||47.4||57.4|
|CenterNet Zhou et al. (2019a)||Hourglass-104||511||128ms/P||42.1||61.1||45.9||24.1||45.5||52.8|
|RetinaNet Lin et al. (2017b)||ResNet-101 FPN||800||104ms/P||39.1||59.1||42.3||21.8||42.7||50.2|
|FoveaBox Kong et al. (2019)||ResNet-101 FPN||800||-||40.6||60.1||43.5||23.3||45.2||54.5|
|FCOS Tian et al. (2019)||ResNet-101 FPN||800||-||41.0||60.7||44.1||24.0||44.1||51.0|
|FSAF Zhu et al. (2019)||ResNet-101 FPN||800||109ms/P||40.9||61.5||44.0||24.0||44.2||51.3|
|RPDet Yang et al. (2019)||ResNet-101 FPN||800||-||41.0||62.9||44.3||23.6||44.1||51.7|
|RetinaNet Lin et al. (2017b)||ResNeXt-101-32x8d FPN||800||177ms/P||40.8||61.1||44.1||24.1||44.2||51.2|
|FoveaBox Kong et al. (2019)||ResNeXt-101-32x8d FPN||800||-||42.1||61.9||45.2||24.9||46.8||55.6|
|FCOS Tian et al. (2019)||ResNeXt-101-32x8d FPN||800||-||42.1||62.1||45.2||25.6||44.9||52.0|
|FSAF Zhu et al. (2019)||ResNeXt-101-64x4d FPN||800||188ms/P||42.9||63.8||46.3||26.6||46.2||52.7|
indicates using flip test
indicates using soft NMS
In this work, we investigate the misalignment issue in one-stage detectors. We first discover the close connection between convolution and existing region feature extractors. Based on our findings, we propose a novel RoIConv operator which aligns features with their corresponding bounding boxes effectively and efficiently for one-stage detectors. Then, based on RoIConv, we propose the AlignDet detector, which is both fast and accurate. Benchmarks on a large-scale dataset and detailed analyses verify the strength of our proposed method.
-  (2018) Cascade R-CNN: delving into high quality object detection. In CVPR, Cited by: §2.
Flexible, high performance convolutional neural networks for image classification. In Workshop on Frontiers in Handwriting Recognition, Cited by: §1, §4.1.
-  (2016) R-FCN: object detection via region-based fully convolutional networks. In NIPS, Cited by: §1.
-  (2017) Deformable convolutional networks. In ICCV, Cited by: §2, §4.2.
-  (2019) CenterNet: keypoint triplets for object detection. arXiv:1904.08189. Cited by: §2, Table 7.
-  (2017) DSSD: deconvolutional single shot detector. arXiv:1701.06659. Cited by: Table 7.
-  (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, Cited by: §1.
-  (2015) Fast R-CNN. In ICCV, Cited by: §1, §2, §3.1.
-  (2017) Mask R-CNN. In ICCV, Cited by: §1, §2, §3.1, §4.1.
-  (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In ICCV, Cited by: §5.1.
-  (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9), pp. 1904–1916. Cited by: §2.
-  (2015) DenseBox: unifying landmark localization with end to end object detection. arXiv:1509.04874. Cited by: §2, §4.3.
-  (2019) FoveaBox: beyond anchor-based object detector. arXiv:1904.03797. Cited by: §2, §4.3, Table 7.
-  (2018) CornerNet: detecting objects as paired keypoints. In ECCV, Cited by: §2, Table 7.
-  (2017) Feature pyramid networks for object detection. In CVPR, Cited by: §3.1, Table 1, §5.1, footnote 1.
-  (2017) Focal loss for dense object detection. In ICCV, Cited by: §1, §2, §3.1, §4.3, §5.1, Table 7.
-  (2014) Microsoft COCO: common objects in context. In ECCV, Cited by: §5.1.
-  (2016) SSD: single shot multibox detector. In ECCV, Cited by: §1, §2, §3.1, §4.3, Table 7.
-  (2016) You only look once: unified, real-time object detection. In CVPR, Cited by: §2.
-  (2018) YOLOv3: an incremental improvement. arXiv:1804.02767. Cited by: §1, Table 7.
-  (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, Cited by: §1, §1, §4.3.
-  (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §5.1.
-  (2019) FCOS: fully convolutional one-stage object detection. arXiv:1904.01355. Cited by: §2, §4.3, Table 7.
-  (2019) Region proposal by guided anchoring. In CVPR, Cited by: §1, §1, §2, §4.3, §4.3.
-  (2018) MetaAnchor: learning to detect objects with customized anchors. In NIPS, Cited by: §2, §4.3.
-  (2019) RepPoints: point set representation for object detection. arXiv:1904.11490. Cited by: Table 7.
-  (2018) Single-shot refinement neural network for object detection. In CVPR, Cited by: §1, §1, §2, §4.3, §4.3, Table 7.
-  (2019) Objects as points. arXiv:1904.07850. Cited by: §2, Table 7.
-  (2019) Bottom-up object detection by grouping extreme and center points. In CVPR, Cited by: §2, Table 7.
-  (2019) Feature selective anchor-free module for single-shot object detection. In CVPR, Cited by: §2, §4.3, Table 7.