Object detection requires an algorithm to predict a bounding box location and a category label for each instance of interest in an image. Prior to deep learning, the sliding-window approach was the main method [43, 35, 7], which exhaustively classifies every possible location, thus requiring feature extraction and classification evaluation to be very fast. With deep learning, detection has largely shifted to the use of fully convolutional networks (FCNs) since the invention of Faster R-CNN. All current mainstream detectors such as Faster R-CNN, SSD and YOLOv2, v3 rely on a set of pre-defined anchor boxes, and it has long been believed that the use of anchor boxes is the key to modern detectors’ success. Despite their great success, it is important to note that anchor-based detectors suffer from some drawbacks:
As shown in Faster R-CNN and RetinaNet, detection performance is sensitive to the sizes, aspect ratios and number of anchor boxes. For example, in RetinaNet, varying these hyper-parameters noticeably affects the AP on the COCO benchmark. As a result, these hyper-parameters need to be carefully tuned in anchor-based detectors.
Even with careful design, because the scales and aspect ratios of anchor boxes are kept fixed, detectors have difficulty dealing with object candidates with large shape variations, particularly for small objects. The pre-defined anchor boxes also hamper the generalization ability of detectors, as they need to be re-designed for new detection tasks with different object sizes or aspect ratios.
In order to achieve a high recall rate, an anchor-based detector is required to densely place anchor boxes on the input image (e.g., more than 180K anchor boxes in feature pyramid networks (FPN)  for an image with its shorter side being 800). Most of these anchor boxes are labeled as negative samples during training. The excessive number of negative samples aggravates the imbalance between positive and negative samples in training.
Anchor boxes also involve complicated computation such as calculating the intersection-over-union (IoU) scores with ground-truth bounding boxes.
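Fully convolutional networks (FCNs) have achieved tremendous success in dense prediction tasks such as semantic segmentation, depth estimation [24, 48], keypoint detection and counting. As one of the high-level vision tasks, object detection might be the only one deviating from the neat fully convolutional per-pixel prediction framework, mainly due to the use of anchor boxes.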
It is natural to ask: can we solve object detection in the neat per-pixel prediction fashion, analogous to FCN for semantic segmentation, for example? Those fundamental vision tasks could then be unified in (almost) one single framework. We show in this work that the answer is affirmative. Moreover, we demonstrate that the much simpler FCN-based detector can, surprisingly, achieve even better performance than its anchor-based counterparts.
In the literature, some works have attempted to leverage FCN-based frameworks for object detection, such as DenseBox. Specifically, these FCN-based frameworks directly predict a 4D vector plus a class category at each spatial location on a level of feature maps. As shown in Fig. 1 (left), the 4D vector depicts the relative offsets from the four sides of a bounding box to the location. These frameworks are similar to the FCNs for semantic segmentation, except that each location is required to regress a 4D continuous vector.
However, to handle the bounding boxes with different sizes, DenseBox  crops and resizes training images to a fixed scale. Thus DenseBox has to perform detection on image pyramids, which is against FCN’s philosophy of computing all convolutions once.
In the sequel, we take a closer look at the ambiguity caused by overlapping bounding boxes and show that with FPN it can be largely eliminated. As a result, our method can already obtain similar or even better detection accuracy compared with traditional anchor-based detectors. Furthermore, we observe that our method may produce a number of low-quality predicted bounding boxes at locations that are far from the center of a target object. It is easy to see that locations near the center of the target bounding box can make more reliable predictions. We therefore introduce a novel “center-ness” score to depict the deviation of a location from the center, as defined in Eq. (3), which is used to down-weight low-quality detected bounding boxes and thus helps to suppress these low-quality detections in NMS. The center-ness score is predicted by a branch (only one layer) in parallel with the bounding box regression branch, as shown in Fig. 2. The simple yet effective center-ness branch remarkably improves the detection performance with a negligible increase in computational time.
This new detection framework enjoys the following advantages.
Detection is now unified with many other FCN-solvable tasks such as semantic segmentation, making it easier to re-use ideas from those tasks. An example is shown in , where a structured knowledge distillation method was developed for dense prediction tasks. Thanks to the standard FCN framework of FCOS, the developed technique can be immediately applied to FCOS based object detection.
Detection becomes proposal free and anchor free, which significantly reduces the number of design parameters. The design parameters typically need heuristic tuning and many tricks are involved in order to achieve good performance. Therefore, our new detection framework makes the detector, particularly its training, considerably simpler.
By eliminating the anchor boxes, our new detector completely avoids the complicated computation related to anchor boxes, such as the IoU computation and the matching between anchor boxes and ground-truth boxes during training, resulting in faster training and testing than its anchor-based counterpart.
Without bells and whistles, we achieve state-of-the-art results among one-stage detectors. Given the improved accuracy of the much simpler anchor-free detector, we encourage the community to rethink the necessity of anchor boxes in object detection, which are currently considered the de facto standard for designing detection methods.
With considerably reduced design complexity, our proposed detector outperforms previous strong baseline detectors such as Faster R-CNN, RetinaNet, YOLOv3 and SSD. More importantly, due to its simple design, FCOS can be easily extended to solve other instance-level recognition tasks with minimal modification, as already evidenced by instance segmentation [47, 51, 20, 2], keypoint detection, text spotting, and tracking [44, 13]. We expect to see more instance recognition methods built upon FCOS.
2 Related Work
Here we review some work that is closest to ours.
Anchor-based detectors inherit the ideas from traditional sliding-window and proposal-based detectors such as Fast R-CNN. In anchor-based detectors, the anchor boxes can be viewed as pre-defined sliding windows or proposals, which are classified as positive or negative patches, with an extra offset regression to refine the prediction of bounding box locations. Therefore, the anchor boxes in these detectors may be viewed as training samples. Unlike previous detectors such as Fast R-CNN, which compute image features for each sliding window/proposal repeatedly, anchor boxes make use of the feature maps of CNNs and avoid repeated feature computation, speeding up the detection process dramatically. The design of anchor boxes was popularized by Faster R-CNN in its RPNs, SSD and YOLOv2, and has become the convention in modern detectors.
However, as described above, anchor boxes result in excessively many hyper-parameters, which typically need to be carefully tuned in order to achieve good performance. Besides the above hyper-parameters describing anchor shapes, anchor-based detectors also need other hyper-parameters to label each anchor box as a positive, ignored or negative sample. Previous works often employ the intersection over union (IoU) between anchor boxes and ground-truth boxes to determine the label of an anchor box (e.g., an anchor is positive if its IoU falls within a given range). These hyper-parameters have shown a great impact on the final accuracy and require heuristic tuning. Meanwhile, these hyper-parameters are specific to detection tasks, making detection deviate from the neat fully convolutional network architectures used in other dense prediction tasks such as semantic segmentation.
The most popular anchor-free detector might be YOLOv1. Instead of using anchor boxes, YOLOv1 predicts bounding boxes at points near the center of objects. Only the points near the center are used since they are considered to be able to produce higher-quality detections. However, since only points near the center are used to predict bounding boxes, YOLOv1 suffers from low recall, as mentioned in YOLOv2. As a result, YOLOv2 employs anchor boxes as well. Compared to YOLOv1, FCOS can take advantage of all points in a ground-truth bounding box to predict the bounding boxes, and the low-quality detected bounding boxes can be suppressed by the proposed “center-ness” branch. As a result, FCOS is able to provide recall comparable to that of anchor-based detectors, as shown in our experiments.
CornerNet  is a recently proposed one-stage anchor-free detector, which detects a pair of corners of a bounding box and groups them to form the final detected bounding box. CornerNet requires much more complicated post-processing to group the pairs of corners belonging to the same instance. An extra distance metric is learned for the purpose of grouping.
Another family of anchor-free detectors is based on DenseBox. This family of detectors has been considered unsuitable for generic object detection due to the difficulty in handling overlapping bounding boxes and the relatively low recall. In this work, we show that both problems can be largely alleviated with multi-level FPN prediction. Moreover, we also show that, together with our proposed center-ness branch, the much simpler detector can achieve much better detection performance than its anchor-based counterparts. Recently, FSAF was proposed to employ an anchor-free detection branch as a complement to an anchor-based detection branch, since its authors consider that a totally anchor-free detector cannot achieve good performance. They also make use of a feature selection module to improve the performance of the anchor-free branch, giving the anchor-free detector performance comparable to its anchor-based counterpart. However, in this work, we surprisingly show that the totally anchor-free detector can actually obtain better performance than its anchor-based counterpart, without the need for the feature selection module in FSAF. Even more surprisingly, it can outperform the combination of anchor-free and anchor-based detectors in FSAF. As a result, the long-standing anchor boxes can be completely eliminated, making detection significantly simpler.
3 Our Approach
In this section, we first reformulate object detection in a per-pixel prediction fashion. Next, we show how we make use of multi-level prediction to improve the recall and resolve the ambiguity resulting from overlapping bounding boxes. Finally, we present our proposed “center-ness” branch, which helps suppress the low-quality detected bounding boxes and improves the overall performance by a large margin.
3.1 Fully Convolutional One-Stage Object Detector
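Let $F_i$ be the feature maps at layer $i$ of a backbone CNN and $s$ be the total stride until the layer. The ground-truth bounding boxes for an input image are defined as $\{B_i\}$, where $B_i = (x_0^{(i)}, y_0^{(i)}, x_1^{(i)}, y_1^{(i)}, c^{(i)}) \in \mathbb{R}^4 \times \{1, 2, \dots, C\}$. Here $(x_0^{(i)}, y_0^{(i)})$ and $(x_1^{(i)}, y_1^{(i)})$ denote the coordinates of the left-top and right-bottom corners of the bounding box, $c^{(i)}$ is the class that the object in the bounding box belongs to, and $C$ is the number of classes, which is 80 for the MS-COCO dataset.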
For each location $(x, y)$ on the feature map $F_i$, we can map it back onto the input image as $(\lfloor s/2 \rfloor + xs, \lfloor s/2 \rfloor + ys)$, which is near the center of the receptive field of the location $(x, y)$. Different from anchor-based detectors, which consider the location on the input image as the center of (multiple) anchor boxes and regress the target bounding box with these anchor boxes as references, we directly regress the target bounding box at the location. In other words, our detector directly views locations as training samples instead of anchor boxes in anchor-based detectors, which is the same as FCNs for semantic segmentation.
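As a minimal sketch of the location mapping described above, the following Python snippet enumerates the image-plane coordinates of every cell of one feature level. It assumes the mapping $(\lfloor s/2 \rfloor + xs, \lfloor s/2 \rfloor + ys)$; the function name `feature_locations` and the row-major ordering are illustrative choices, not part of the method description.

```python
def feature_locations(height: int, width: int, stride: int) -> list[tuple[int, int]]:
    """Map every (x, y) cell of an H x W feature map to input-image coordinates.

    Assumption: a cell (x, y) maps to (floor(s/2) + x*s, floor(s/2) + y*s),
    i.e. a point near the center of the cell's receptive field.
    """
    half = stride // 2  # floor(s / 2)
    return [
        (half + x * stride, half + y * stride)
        for y in range(height)   # rows first (row-major order, an illustrative choice)
        for x in range(width)
    ]


# Example: a 2 x 3 feature map with stride 8.
locations = feature_locations(height=2, width=3, stride=8)
```

Each returned point is then treated as one training sample, so no anchor shapes need to be enumerated at this stage.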
Specifically, location $(x, y)$ is considered a positive sample if it falls into the center area of any ground-truth box, following prior work. The center area of a box centered at $(c_x, c_y)$ is defined as the sub-box $(c_x - rs, c_y - rs, c_x + rs, c_y + rs)$, where $s$ is the total stride until the current feature maps and $r$ is a hyper-parameter being 1.5 on COCO. The sub-box is clipped so that it is not beyond the original box. Note that this is different from our original conference version, where we consider the locations positive as long as they are in a ground-truth box. The class label $c^*$ of the location is the class label of the ground-truth box. Otherwise it is a negative sample and $c^* = 0$ (background class). Besides the label for classification, we also have a 4D real vector $\boldsymbol{t}^* = (l^*, t^*, r^*, b^*)$ being the regression targets for the location. Here $l^*$, $t^*$, $r^*$ and $b^*$ are the distances from the location to the four sides of the bounding box, as shown in Fig. 1 (left). If a location falls into the center area of multiple bounding boxes, it is considered an ambiguous sample. We simply choose the bounding box with minimal area as its regression target. In the next section, we will show that with multi-level prediction, the number of ambiguous samples can be reduced significantly and thus they hardly affect the detection performance. Formally, if location $(x, y)$ is associated to a bounding box $B_i$, the training regression targets for the location can be formulated as

$$l^* = (x - x_0^{(i)})/s, \quad t^* = (y - y_0^{(i)})/s, \quad r^* = (x_1^{(i)} - x)/s, \quad b^* = (y_1^{(i)} - y)/s, \tag{1}$$
where $s$ is the total stride until the feature maps $F_i$, which is used to scale down the regression targets and prevent the gradients from exploding during training. Together with these designs, FCOS can detect objects in an anchor-free way and everything is learned by the networks without the need for any pre-defined anchor boxes. It is worth noting that this is not identical to an anchor-based detector with one anchor box per location; the crucial difference is the way we define positive and negative samples. The single-anchor detector still uses pre-defined anchor boxes as a prior and uses IoUs between the anchor boxes and ground-truth boxes to determine the labels for these anchor boxes. In FCOS, we remove the need for the prior and the locations are labeled by their inclusion in ground-truth boxes. In experiments, we will show that using a single anchor can only achieve inferior performance.
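The labeling rule above (center sampling, smallest-area tie-breaking, stride-normalized distances) can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function name, the tuple layout `(x0, y0, x1, y1, cls)`, the strict inequalities at the sub-box border, and the default `radius=1.5` are assumptions.

```python
def assign_target(loc, boxes, stride, radius=1.5):
    """Label one image-plane location.

    loc:    (x, y) in input-image coordinates.
    boxes:  list of (x0, y0, x1, y1, cls) with cls >= 1 (0 is background).
    Returns (cls, (l, t, r, b)) for a positive sample, or (0, None) otherwise.
    """
    x, y = loc
    best = None  # (area, cls, targets) of the smallest matching box so far
    for x0, y0, x1, y1, cls in boxes:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        # Center sub-box of half-size radius * stride, clipped to the box itself.
        sx0, sy0 = max(cx - radius * stride, x0), max(cy - radius * stride, y0)
        sx1, sy1 = min(cx + radius * stride, x1), min(cy + radius * stride, y1)
        if not (sx0 < x < sx1 and sy0 < y < sy1):
            continue  # location is outside this box's center area
        area = (x1 - x0) * (y1 - y0)
        if best is None or area < best[0]:
            # Distances to the four sides, scaled down by the stride.
            targets = ((x - x0) / stride, (y - y0) / stride,
                       (x1 - x) / stride, (y1 - y) / stride)
            best = (area, cls, targets)
    if best is None:
        return 0, None
    return best[1], best[2]


# Two overlapping boxes: the smaller one wins for an ambiguous location.
gt = [(0, 0, 100, 100, 1), (40, 40, 60, 60, 2)]
label, target = assign_target((52, 52), gt, stride=8)
```

In this toy case the location lies in the center area of both boxes, so it is an ambiguous sample and is resolved in favor of the smaller box.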
Corresponding to the training targets, the final layer of our networks predicts an 80D vector $\boldsymbol{p}$ for classification and a 4D vector $\boldsymbol{t} = (l, t, r, b)$ encoding bounding-box coordinates. Following prior work, instead of training a multi-class classifier, we train $C$ binary classifiers. Similarly, we add two branches, each with four convolutional layers (excluding the final prediction layers), after the feature maps produced by FPNs for the classification and regression tasks, respectively. Moreover, since the regression targets are always positive, we employ an activation that maps any real number to a non-negative value on the top of the regression branch. It is worth noting that FCOS has fewer network output variables than the popular anchor-based detectors [22, 32] with 9 anchor boxes per location, which is of great importance when FCOS is applied to keypoint detection or instance segmentation.
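We define our training loss function as follows:

$$L(\{\boldsymbol{p}_{x,y}\}, \{\boldsymbol{t}_{x,y}\}) = \frac{1}{N_{\text{pos}}} \sum_{x,y} L_{\text{cls}}(\boldsymbol{p}_{x,y}, c^*_{x,y}) + \frac{\lambda}{N_{\text{pos}}} \sum_{x,y} \mathbb{1}_{\{c^*_{x,y} > 0\}} L_{\text{reg}}(\boldsymbol{t}_{x,y}, \boldsymbol{t}^*_{x,y}), \tag{2}$$

where $L_{\text{cls}}$ is focal loss and $L_{\text{reg}}$ is the GIoU loss. As shown in experiments, the GIoU loss has better performance than the IoU loss in UnitBox, which is used in our preliminary version. $N_{\text{pos}}$ denotes the number of positive samples and $\lambda$, being 1 in this paper, is the balance weight for $L_{\text{reg}}$. The summation is calculated over all locations on the feature maps $F_i$. $\mathbb{1}_{\{c^*_{x,y} > 0\}}$ is the indicator function, being 1 if $c^*_{x,y} > 0$ and 0 otherwise.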
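To make the two loss terms concrete, here is a small pure-Python sketch of a per-sample binary focal loss and a GIoU loss on $(l, t, r, b)$ distance vectors. It is a sketch under stated assumptions, not the authors' code: the function names are hypothetical, `alpha=0.25` and `gamma=2.0` are the commonly used focal-loss defaults rather than values given in this text, and the GIoU term relies on the predicted and target boxes sharing the same location, so overlaps reduce to sums of minima.

```python
import math


def focal_loss(p, is_positive, alpha=0.25, gamma=2.0, eps=1e-12):
    """Binary focal loss for one class score p in (0, 1).

    alpha and gamma are assumed defaults, not taken from the text.
    """
    p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    if is_positive:
        return -alpha * (1.0 - p) ** gamma * math.log(p)
    return -(1.0 - alpha) * p ** gamma * math.log(1.0 - p)


def giou_loss(pred, target):
    """GIoU loss between two boxes given as non-negative (l, t, r, b)
    distances measured from the same location (at least one box non-empty)."""
    l, t, r, b = pred
    lg, tg, rg, bg = target
    area_pred = (l + r) * (t + b)
    area_gt = (lg + rg) * (tg + bg)
    # Both boxes contain the shared location, so the overlap is the box
    # spanned by the per-side minima and the hull by the per-side maxima.
    inter = (min(l, lg) + min(r, rg)) * (min(t, tg) + min(b, bg))
    union = area_pred + area_gt - inter
    enclose = (max(l, lg) + max(r, rg)) * (max(t, tg) + max(b, bg))
    iou = inter / union
    giou = iou - (enclose - union) / enclose
    return 1.0 - giou
```

Under Eq. (2), the classification term would be summed over all locations and the regression term only over positive ones, with both divided by the number of positive samples.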
The inference of FCOS is straightforward. Given an input image, we forward it through the network and obtain the classification scores $\boldsymbol{p}_{x,y}$ and the regression prediction $\boldsymbol{t}_{x,y}$ for each location on the feature maps $F_i$. Following prior work, we choose the locations with $p_{x,y} > 0.05$ as positive samples and invert Eq. (1) to obtain the predicted bounding boxes.
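The decoding step can be sketched as below, assuming that Eq. (1) defines stride-normalized distances from an image-plane location to the four box sides, so that inverting it multiplies by the stride and offsets the location. The function names and the default `score_thresh=0.05` are assumptions made for illustration.

```python
def decode_box(loc, ltrb, stride):
    """Invert the stride-normalized (l, t, r, b) targets into (x0, y0, x1, y1)."""
    x, y = loc
    l, t, r, b = ltrb
    return (x - l * stride, y - t * stride, x + r * stride, y + b * stride)


def select_detections(predictions, stride, score_thresh=0.05):
    """Keep confident locations and decode their boxes.

    predictions: iterable of (loc, score, ltrb) for one feature level.
    Returns a list of (score, box) pairs; NMS would follow in a full pipeline.
    """
    return [
        (score, decode_box(loc, ltrb, stride))
        for loc, score, ltrb in predictions
        if score > score_thresh
    ]


preds = [((52, 52), 0.9, (1.5, 1.5, 1.0, 1.0)),
         ((4, 4), 0.01, (1.0, 1.0, 1.0, 1.0))]
kept = select_detections(preds, stride=8)
```

Only the first prediction survives the threshold here, and its decoded box recovers the corners from which the targets were computed.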
3.2 Multi-level Prediction with FPN for FCOS
Here we show how two possible issues of the proposed FCOS can be resolved with multi-level prediction with FPN.
First, the large stride (e.g., 16) of the final feature maps in a CNN can result in a relatively low best possible recall (BPR), i.e., the upper bound of the recall rate that a detector can achieve. For anchor-based detectors, low recall rates due to the large stride can be compensated to some extent by lowering the IoU score requirements for positive anchor boxes. For FCOS, at first glance one may think that the BPR can be much lower than that of anchor-based detectors because it is impossible to recall an object which no location on the final feature maps encodes due to a large stride. Here, we empirically show that even with a large stride, FCOS is still able to produce a good BPR, and it can be even better than the BPR of the anchor-based detector RetinaNet in the official implementation Detectron (refer to Table I). Therefore, the BPR is actually not a problem for FCOS. Moreover, with multi-level FPN prediction, the BPR can be improved further to match the best BPR the anchor-based RetinaNet can achieve.
Second, as shown in Fig. 1 (right), overlaps in ground-truth boxes can cause intractable ambiguity, i.e., which bounding box should a location in the overlap regress? This ambiguity results in degraded performance. In this work, we show that the ambiguity can be largely resolved with multi-level prediction, and FCOS can obtain performance on par with, and sometimes even better than, anchor-based detectors.
Specifically, following FPN, we detect objects of different sizes on different feature map levels. We make use of five levels of feature maps defined as $\{P_3, P_4, P_5, P_6, P_7\}$. As shown in Fig. 2, $P_3$, $P_4$ and $P_5$ are produced by the backbone CNNs’ feature maps $C_3$, $C_4$ and $C_5$ with the top-down connections as in FPN. $P_6$ and $P_7$ are produced by applying one convolutional layer with the stride being 2 on $P_5$ and $P_6$, respectively. Note that this is different from the original RetinaNet, which obtains $P_6$ and $P_7$ from the backbone feature maps $C_5$. We find both schemes achieve similar performance but the one we use has fewer parameters. Moreover, the feature levels $P_3$, $P_4$, $P_5$, $P_6$ and $P_7$ have strides 8, 16, 32, 64 and 128, respectively.
Anchor-based detectors assign anchor boxes of different scales to different feature levels. Since anchor boxes and ground-truth boxes are associated by their IoU scores, this enables different FPN feature levels to handle objects of different scales. However, this couples the sizes of anchor boxes and the target object sizes of each FPN level, which is problematic. The anchor box sizes should be data-specific, which might change from one dataset to another. The target object sizes of each FPN level should depend on the receptive field of the FPN level, which depends on the network architecture. FCOS removes the coupling as we only need to focus on the target object sizes of each FPN level and need not design the anchor box sizes. Unlike anchor-based detectors, in FCOS, we directly limit the range of bounding box regression for each level. More specifically, we first compute the regression targets $l^*$, $t^*$, $r^*$ and $b^*$ for each location on all feature levels. Next, if a location at feature level $i$ satisfies $\max(l^*, t^*, r^*, b^*) \leq m_{i-1}$ or $\max(l^*, t^*, r^*, b^*) \geq m_i$, it is set as a negative sample and thus not required to regress a bounding box anymore. Here $m_i$ is the maximum distance that feature level $i$ needs to regress. In this work, $m_2$, $m_3$, $m_4$, $m_5$, $m_6$ and $m_7$ are set as 0, 64, 128, 256, 512 and $\infty$, respectively. We argue that bounding the maximum distance is a better way to determine the range of target objects for each feature level because this makes sure that the complete objects are always in the receptive field of each feature level. Moreover, since objects of different sizes are assigned to different feature levels and overlapping mostly happens between objects with considerably different sizes, the aforementioned ambiguity can be largely alleviated. If a location, even with multi-level prediction used, is still assigned to more than one ground-truth box, we simply choose the ground-truth box with minimal area as its target. As shown in our experiments, with the multi-level prediction, both anchor-free and anchor-based detectors can achieve the same level of performance.
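The per-level range check can be illustrated with the short sketch below. The thresholds (0, 64, 128, 256, 512, infinity) for levels 3 to 7 and the use of distances measured in input-image pixels are assumptions of this sketch, as are the names `LEVEL_RANGES`, `level_accepts` and `assigned_levels`.

```python
import math

# Assumed (m_{i-1}, m_i) bounds per FPN level, in input-image pixels.
LEVEL_RANGES = {
    3: (0, 64),
    4: (64, 128),
    5: (128, 256),
    6: (256, 512),
    7: (512, math.inf),
}


def level_accepts(level, ltrb):
    """True if the largest side distance falls strictly inside the level's range.

    A location is negative when max(l, t, r, b) <= m_{i-1} or >= m_i,
    so it stays positive only for m_{i-1} < max(l, t, r, b) < m_i.
    """
    lower, upper = LEVEL_RANGES[level]
    longest = max(ltrb)
    return lower < longest < upper


def assigned_levels(ltrb):
    """Levels on which a location with these (unnormalized) distances stays positive."""
    return [level for level in LEVEL_RANGES if level_accepts(level, ltrb)]
```

Because the ranges do not overlap, a given set of distances is kept by at most one level, which is what separates overlapping objects of different sizes.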
Finally, following [21, 22], we share the heads between different feature levels, not only making the detector parameter-efficient but also improving the detection performance. However, we observe that different feature levels are required to regress different size ranges (e.g., the size range is $[0, 64]$ for $P_3$ and $[64, 128]$ for $P_4$), and therefore it may not be the optimal design to make use of identical heads for different feature levels. In our preliminary version, this issue is addressed by multiplying the convolutional layer’s outputs by a learnable scalar. In this version, since the regression targets are scaled down by the stride of FPN feature levels, as shown in Eq. (1), the scalars become less important. However, we still keep them for compatibility.
3.3 Center-ness for FCOS
After using multi-level prediction, FCOS can already achieve better performance than its anchor-based counterpart RetinaNet. Furthermore, we observed that there are a lot of low-quality detections produced by the locations far away from the center of an object.
We propose a simple yet effective strategy to suppress these low-quality detections. Specifically, we add a single-layer branch, in parallel with the regression branch (as shown in Fig. 2), to predict the “center-ness” of a location. (This is different from our conference version, which positions the center-ness on the classification branch; it has been shown that positioning it on the regression branch can obtain better performance.) The center-ness depicts the normalized distance from the location to the center of the object that the location is responsible for, as shown in Fig. 4. Given the regression targets $l^*$, $t^*$, $r^*$ and $b^*$ for a location, the center-ness target is defined as

$$\text{centerness}^* = \sqrt{\frac{\min(l^*, r^*)}{\max(l^*, r^*)} \times \frac{\min(t^*, b^*)}{\max(t^*, b^*)}}. \tag{3}$$

We employ $\sqrt{\cdot}$ here to slow down the decay of the center-ness. The center-ness ranges from 0 to 1 and is thus trained with binary cross entropy (BCE) loss. The loss is added to the loss function Eq. (2). When testing, the final score $s_{x,y}$ (used for ranking the detections in NMS) is the square root of the product of the predicted center-ness $o_{x,y}$ and the corresponding classification score $p_{x,y}$. Formally,

$$s_{x,y} = \sqrt{p_{x,y} \times o_{x,y}}, \tag{4}$$

where $\sqrt{\cdot}$ is used to calibrate the order of magnitude of the final score and has no effect on average precision (AP).
Consequently, center-ness can down-weight the scores of bounding boxes far from the center of an object. As a result, with high probability, these low-quality bounding boxes might be filtered out by the final non-maximum suppression (NMS) process, improving the detection performance remarkably.
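A small numeric sketch may help to see how center-ness behaves. It assumes the target is the square root of the product of the two min/max ratios of opposite side distances, and that the test-time score is the square root of the product of the classification score and the predicted center-ness, as described above; the function names are illustrative.

```python
import math


def centerness_target(l, t, r, b):
    """Center-ness of a location from its (positive) distances to the four box sides."""
    horizontal = min(l, r) / max(l, r)
    vertical = min(t, b) / max(t, b)
    return math.sqrt(horizontal * vertical)


def final_score(cls_score, centerness):
    """Score used to rank detections in NMS."""
    return math.sqrt(cls_score * centerness)


at_center = centerness_target(1, 1, 1, 1)    # location exactly at the box center
off_center = centerness_target(1, 2, 4, 8)   # location close to the top-left corner
```

A location at the exact center gets the maximum value, while a location near a corner is down-weighted, so its detections rank lower before NMS.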
Our experiments are conducted on the large-scale detection benchmark COCO. Following the common practice [22, 21, 32], we use the COCO training split (115K images) for training and the validation split (5K images) for our ablation study. We report our main results on the test-dev split (20K images) by uploading our detection results to the evaluation server.
are used. Specifically, our network is trained with stochastic gradient descent (SGD) for 90k iterations with an initial learning rate of 0.01 and a mini-batch of 16 images. The learning rate is reduced by a factor of 10 at iterations 60k and 80k, respectively. Weight decay and momentum are set as 0.0001 and 0.9, respectively. We initialize our backbone networks with the weights pre-trained on ImageNet. The newly added layers are initialized following prior work. Unless specified, the input images are resized to have their shorter side being 800 and their longer side less than or equal to 1333.
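The step schedule described above can be written as a small helper. This is only a sketch of the stated numbers (base rate 0.01, tenfold drops at 60k and 80k iterations); the function name and the assumption that a drop takes effect exactly at the listed iteration are illustrative, and no warm-up is modeled because none is described here.

```python
def learning_rate(iteration, base_lr=0.01, steps=(60_000, 80_000), gamma=0.1):
    """Step learning-rate schedule: multiply by gamma at each milestone reached."""
    drops = sum(iteration >= step for step in steps)  # number of milestones passed
    return base_lr * gamma ** drops


schedule = [learning_rate(it) for it in (0, 59_999, 60_000, 80_000, 89_999)]
```

For the longer multi-scale runs, the same helper could be reused with the milestones scaled proportionally.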
We first forward the input image through the network and obtain the predicted bounding boxes with the predicted class scores. The subsequent post-processing of FCOS exactly follows that of RetinaNet. The post-processing hyper-parameters are also the same, except that we use an NMS threshold of 0.6 instead of 0.5 in RetinaNet. Experiments will be conducted to show the effect of the NMS threshold. Moreover, we use the same sizes of input images as in training.
4.1 Analysis of FCOS
4.1.1 Best Possible Recall (BPR) of FCOS
|Method||w/ FPN||Low-quality matches||BPR (%)|
|w/ ctr. sampling||w/ FPN|
|Method||AP||AP50||AP75||APS||APM||APL||AR1||AR10||AR100||ARS||ARM||ARL|
|RetinaNet (#A=1) w/ imprv.||35.2||55.6||37.0||19.9||39.2||45.2||30.4||49.9||53.5||33.6||57.7||68.2|
|RetinaNet (#A=9) w/ imprv.||37.6||56.6||40.6||21.5||42.1||48.0||32.1||52.2||56.4||35.5||60.2||72.7|
|FCOS w/o ctr.-ness||38.0||57.2||40.9||21.5||42.4||49.1||32.1||52.4||56.2||36.6||60.6||71.9|
|FCOS w/ ctr.-ness||38.9||57.5||42.2||23.1||42.7||50.2||32.4||53.8||57.5||38.5||62.1||72.9|
We first address the concern that FCOS might not provide a good best possible recall (BPR) (i.e., upper bound of the recall rate). In this section, we show that the concern is not necessary by comparing the BPR of FCOS and that of its anchor-based counterpart RetinaNet on the COCO validation split. Formally, BPR is defined as the ratio of the number of ground-truth boxes that a detector can recall at the most to the number of all ground-truth boxes. A ground-truth box is considered recalled if the box is assigned to at least one training sample (i.e., a location in FCOS or an anchor box in anchor-based detectors), and a training sample can be associated to at most one ground-truth box. As shown in Table I, both with FPN, FCOS and RetinaNet obtain similar BPR. (One might think that the BPR of RetinaNet should be 100% if all the low-quality matches are used. However, this is not true in some cases as each anchor can only be associated to the ground-truth box with the highest IoU to it. For example, suppose two boxes A and B are both small and contained in the common area of all the anchor boxes at the same location. Clearly, for all these anchor boxes, the box with larger area has higher IoU scores and thus all the anchor boxes will be associated to it; the other one will be missed.) Because the best recall of current detectors is much lower than these BPRs, the small BPR gap between FCOS and the anchor-based RetinaNet will not actually affect the performance of a detector. This is also confirmed in Table III, where FCOS achieves better or similar AR than RetinaNet under the same training and testing settings. Even more surprisingly, with only the feature level with stride 16 (i.e., no FPN), FCOS can obtain a decent BPR, which is much higher than the BPR of the RetinaNet in the official implementation, where only the low-quality matches above an IoU threshold are used.
Therefore, the concern about low BPR may not be necessary.
4.1.2 Ambiguous Samples in FCOS
Another concern about the FCN-based detector is that it may have a large number of ambiguous samples due to the overlap in ground-truth boxes, as shown in Fig. 1 (right). In Table II, we show the ratios of the ambiguous samples to all positive samples on the validation split. If a location should be associated to multiple ground-truth boxes without using the rule of choosing the box with minimum area, the location is defined as an “ambiguous sample”. As shown in the table, there is indeed a large number of ambiguous samples if FPN is not used (i.e., only $P_4$ used). However, with FPN, the ratio can be significantly reduced since most overlapped objects are assigned to different feature levels. Furthermore, if center sampling is used, the ambiguous samples can be significantly reduced: as shown in Table II, the ratio is already low even without FPN, and it is reduced further by applying FPN. Note that this ratio does not imply that FCOS makes mistakes at all of these locations. As mentioned before, these locations are associated with the smallest one among the ground-truth boxes associated to the same location. Therefore, these locations only take the risk of missing some larger objects. In other words, it may harm the recall of FCOS. However, as shown in Table I, the recall gap between FCOS and RetinaNet is negligible, which suggests that the ratio of the missing objects is extremely low.
4.1.3 The Effect of Center-ness
|w/ ctr.-ness (L1)||38.9||57.6||42.0||23.0||42.3||51.0|
As mentioned before, we propose “center-ness” to suppress the low-quality detected bounding boxes produced by the locations far from the center of an object. As shown in Table IV, the center-ness branch can boost AP from 38.0 to 38.9. Compared to our conference version, the gap is relatively smaller since we make use of center sampling by default, which already eliminates a large number of false positives. However, the improvement is still impressive as the center-ness branch adds only negligible computational time. Moreover, we will show later that the center-ness can bring a large improvement in crowded scenarios. One may note that center-ness can also be computed from the predicted regression vector without introducing the extra center-ness branch. However, as shown in Table IV, the center-ness computed from the regression vector cannot improve the performance, and thus the separate center-ness branch is necessary.
We visualize the effect of applying the center-ness in Fig. 4. As shown in the figure, after applying the center-ness scores to the classification scores, the boxes with low IoU scores but high confidence scores (i.e., the points under the line in Fig. 4), which are potential false positives, are largely eliminated.
4.1.4 Other Design Choices
Other design choices are also investigated. As shown in Table V, removing group normalization (GN) in both the classification and regression heads degrades the performance. Replacing GIoU with the original IoU loss also lowers the AP. Using $C_5$ instead of $P_5$ also degrades the performance. Moreover, using $P_5$ can reduce the number of network parameters. We also conduct experiments for the radius $r$ of positive sample regions. As shown in Table VI, $r = 1.5$ has the best performance on the COCO validation split.
We also conduct experiments with different strategies of assigning objects to FPN levels. First, we experiment with the strategy that FPN uses to assign object proposals (i.e., ROIs) to FPN levels. It assigns the objects according to the formulation $k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$, where $k$ is the target FPN level, $w$ and $h$ are the ground-truth box’s width and height, respectively, and $k_0$ is the target level which an object with scale 224 should be mapped into. We use $k_0 = 5$. As shown in Table VII, this strategy results in degraded performance. We conjecture that it may be because the strategy cannot make sure the complete object is within the receptive field of the target FPN level.
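For comparison with the distance-based assignment, the FPN-style rule can be sketched as follows. The sketch assumes the well-known FPN formulation $k = \lfloor k_0 + \log_2(\sqrt{wh}/224) \rfloor$; the default `k0=5`, the clamping to levels 3 to 7, and the function name `fpn_level` are assumptions made for illustration.

```python
import math


def fpn_level(width, height, k0=5, k_min=3, k_max=7, canonical=224):
    """FPN-style level for a box of the given size (clamped to [k_min, k_max])."""
    scale = math.sqrt(width * height)
    k = math.floor(k0 + math.log2(scale / canonical))
    return max(k_min, min(k_max, k))  # clamping is an assumption of this sketch


levels = [fpn_level(w, h) for w, h in ((224, 224), (112, 112), (10, 10), (2000, 2000))]
```

Note that this rule depends only on the box area, so an elongated box can be sent to a level whose receptive field does not cover its longer side, which is consistent with the conjecture above.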
Similarly, the other alternative strategies in Table VII also deteriorate the performance. Eventually, $\max(l^*, t^*, r^*, b^*)$ achieves the best performance as the strategy makes sure that the complete target objects are always in the effective receptive field of the FPN level. Moreover, this implies that the range hyper-parameters of each FPN level (i.e., $m_i$) are mainly related to the network architecture (which determines the receptive fields). This is a desirable feature since it eliminates the hyper-parameter tuning when FCOS is applied to different datasets.
4.2 FCOS vs. Anchor-based Counterparts
Here, we compare FCOS with its anchor-based counterpart RetinaNet on the challenging benchmark COCO, demonstrating that the much simpler anchor-free FCOS is superior.
In order to make a fair comparison, we add the universal improvements in FCOS to RetinaNet. The improved RetinaNet is denoted as “RetinaNet w/ imprv.” in Table III. As shown in the table, even without the center-ness branch, FCOS achieves better AP than “RetinaNet (#A=9) w/ imprv.” (38.0 vs 37.6 in AP). The performance of FCOS can be further boosted to 38.9 with the help of the proposed center-ness branch. Moreover, it is worth noting that FCOS achieves much better performance than the RetinaNet with a single anchor per location, “RetinaNet (#A=1) w/ imprv.”, which obtains 35.2 AP, suggesting that FCOS is not equivalent to the single-anchor RetinaNet. The major difference is that FCOS does not employ IoU scores between anchor boxes and ground-truth boxes to determine the training labels.
Given the superior performance and merits of the anchor-free detector (e.g., much simpler and fewer hyper-parameters), we encourage the community to rethink the necessity of anchor boxes in object detection.
4.3 Comparison with State-of-the-art Detectors on COCO
|Method|Backbone|AP|AP$_{50}$|AP$_{75}$|AP$_S$|AP$_M$|AP$_L$|
|---|---|---|---|---|---|---|---|
|Faster R-CNN+++|ResNet-101|34.9|55.7|37.4|15.6|38.7|50.9|
|Faster R-CNN w/ FPN|ResNet-101-FPN|36.2|59.1|39.0|18.2|39.0|48.2|
|Faster R-CNN by G-RMI|Inception-ResNet-v2|34.7|55.5|36.7|13.5|38.1|52.0|
|Faster R-CNN w/ TDM|Inception-ResNet-v2-TDM|36.8|57.7|39.2|16.2|39.8|52.1|
|YOLOv2|DarkNet-19|21.6|44.0|19.2|5.0|22.4|35.5|
|FCOS w/ deform. conv. v2|ResNeXt-32x8d-101-FPN|46.6|65.9|50.8|28.6|49.1|58.6|
|FCOS w/ deform. conv. v2|ResNeXt-32x8d-101-BiFPN|47.9|66.9|51.9|30.2|50.3|59.9|
|w/ test-time augmentation:| | | | | | | |
|FCOS w/ deform. conv. v2|ResNeXt-32x8d-101-FPN|49.1|68.0|53.9|31.7|51.6|61.0|
|FCOS w/ deform. conv. v2|ResNeXt-32x8d-101-BiFPN|50.4|68.9|55.0|33.2|53.0|62.7|
We compare FCOS with other state-of-the-art object detectors on the test-dev split of the MS-COCO benchmark. For these experiments, following previous works [22, 25], we make use of multi-scale training. To be specific, during training, the shorter side of the input image is sampled from with a step of . Moreover, we double the number of iterations to 180K (with the learning rate change points scaled proportionally). Other settings are exactly the same as the model with AP on in Table III.
As shown in Table VIII, with ResNet-101-FPN, FCOS outperforms the original RetinaNet with the same backbone by AP ( vs ). Compared to other one-stage detectors such as SSD and DSSD, we also achieve much better performance. Moreover, FCOS also surpasses the classical two-stage anchor-based detector Faster R-CNN by a large margin ( vs. ). To our knowledge, this is the first time that an anchor-free detector, without any bells and whistles, outperforms anchor-based detectors by a large margin. Moreover, FCOS also outperforms the previous anchor-free detectors CornerNet and CenterNet while being much simpler, since they require grouping corners with embedding vectors, which needs a special design for the detector. Thus, we argue that FCOS is more likely to serve as a strong and simple alternative to current mainstream anchor-based detectors. Qualitative results are shown in Fig. 6. It appears that FCOS works well in a variety of challenging cases.
We also introduce some complementary techniques to FCOS. First, deformable convolutions are used in stages and of the backbone, and also replace the last convolutional layers in the classification and regression towers (i.e., the convolutions shown in Fig. 2). As shown in Table VIII, applying deformable convolutions [5, 55] to the ResNeXt-32x8d-101-FPN based FCOS improves the performance from to AP. In addition, we also attempt to replace FPN in FCOS with BiFPN. We make use of the BiFPN from the D3 model in . To be specific, the single cell of BiFPN is repeated times and the number of its output channels is set to . Note that unlike the original BiFPN, we do not employ depthwise separable convolutions in it. As a result, BiFPN generally improves all FCOS models by AP and pushes the performance of the best model to .
We also report the result of using test-time data augmentation. Specifically, during inference, the input image is resized to pixels with step . At each scale, the original image and its horizontal flip are evaluated. The results from these augmented images are merged by NMS. As shown in Table VIII, the test-time augmentation improves the best performance to AP.
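Before the detections from the resized and flipped copies can be merged by NMS, they have to be expressed in the coordinate frame of the original image. The following is a minimal sketch of that bookkeeping, not the authors' implementation. The helper name `to_original_coords`, the (x1, y1, x2, y2) box layout, and the convention that `scale` equals the resized size divided by the original size are assumptions made for illustration.

```python
import numpy as np


def to_original_coords(boxes, scale, flipped, resized_width):
    """Map (x1, y1, x2, y2) boxes predicted on a resized, optionally
    horizontally flipped image back to original-image coordinates."""
    boxes = np.asarray(boxes, dtype=np.float64).reshape(-1, 4).copy()
    if flipped:
        # Undo the horizontal flip: x -> W - x, which swaps left and right edges.
        x1 = resized_width - boxes[:, 2]
        x2 = resized_width - boxes[:, 0]
        boxes[:, 0], boxes[:, 2] = x1, x2
    # Undo the resize (scale = resized size / original size).
    return boxes / scale
```

The boxes mapped back from every scale and flip would then be concatenated, together with their scores and labels, and passed through a single NMS step.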
4.4 Real-time FCOS
|YOLOv3 (Darknet-53) ||26||33.0|
|CenterNet (DLA-34) ||52||37.4|
|FCOS-RT w/ shtw. (DLA-34)||52||39.1||39.2|
We also design a real-time version, FCOS-RT. In the real-time setting, we reduce the shorter side of input images from to and the maximum longer side from to , which decreases the inference time per image by . With the smaller input size, the higher feature levels and become less important. Thus, following BlendMask-RT, we remove and , further reducing the inference time. Moreover, in order to boost the performance of the real-time version, we employ a more aggressive training strategy. Specifically, during training, multi-scale data augmentation is used and the shorter side of the input image is sampled from to with interval . Synchronized batch normalization (SyncBN) is used. We also increase the training iterations to (i.e., ). The learning rate is decreased by a factor of at iteration and .
The resulting real-time models are shown in Table IX. With ResNet-50, FCOS-RT can achieve AP at FPS on a single 1080Ti GPU card. We further replace ResNet-50 with the backbone DLA-34, which results in a better speed/accuracy trade-off ( AP at FPS). In order to compare with CenterNet, we share the towers (i.e., conv. layers shown in Fig. 2) between the classification and regression branches, which improves the speed from FPS to FPS but degrades the performance by AP. However, as shown in Table IX, the model still outperforms CenterNet by AP at the same speed. For the real-time models, we also replace FPN with BiFPN as in Section 4.3, resulting in an AP improvement (from to ) at similar speed. A speed/accuracy comparison between FCOS and a few recent detection methods is shown in Fig. 3.
4.5 FCOS on CrowdHuman
|Method|AP|MR|JI|
|---|---|---|---|
|RetinaNet w/ imprv.|81.60|57.36|72.88|
|FCOS w/o ctr.-ness|83.16|59.04|73.09|
|FCOS w/ ctr.-ness|85.0|51.34|74.97|
|Set NMS|87.28|51.21|77.34|
We also conduct experiments on the highly crowded dataset CrowdHuman. CrowdHuman consists of images for training, for validation and images for testing. Following previous works on crowded benchmarks [4, 34], we use AP, log-average Miss Rate on False Positive Per Image in (MR) and Jaccard Index (JI) as the evaluation metrics. Note that lower MR is better. Following , all experiments here are trained on the split for epochs with batch size and then evaluated on the split. Two changes are made when FCOS is applied to the benchmark. First, the NMS threshold is set as instead of . We find that it has a large impact on MR and JI. Second, when a location is supposed to be associated with multiple ground-truth boxes, on COCO, we choose the object with minimal area as the target for the location. On CrowdHuman, we instead choose the target with minimal distance to the location. The distance between a location and an object is defined as the distance from the location to the center of the object. On COCO, both schemes result in similar performance. However, the latter has much better performance than the former on the highly crowded dataset. Other settings are the same as those for COCO.
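The two rules for resolving an ambiguous location can be compared with a small sketch. It is a simplified illustration under stated assumptions: the function name `select_target` and the rule names are hypothetical, boxes are given as (x1, y1, x2, y2), the distance to the box center is taken to be Euclidean, and the per-level regression ranges of FCOS are ignored.

```python
import numpy as np


def select_target(location, boxes, rule="min_area"):
    """Return the index of the ground-truth box assigned to `location`.

    location: (x, y); boxes: (N, 4) array of (x1, y1, x2, y2).
    Only boxes containing the location are candidates; returns -1 if none.
    """
    x, y = location
    boxes = np.asarray(boxes, dtype=np.float64).reshape(-1, 4)
    inside = ((boxes[:, 0] <= x) & (x <= boxes[:, 2])
              & (boxes[:, 1] <= y) & (y <= boxes[:, 3]))
    if not inside.any():
        return -1
    if rule == "min_area":
        # Rule described for COCO: prefer the smallest containing box.
        cost = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    elif rule == "min_center_dist":
        # Rule described for CrowdHuman: prefer the box whose center is closest.
        cx = (boxes[:, 0] + boxes[:, 2]) / 2
        cy = (boxes[:, 1] + boxes[:, 3]) / 2
        cost = np.hypot(cx - x, cy - y)
    else:
        raise ValueError(f"unknown rule: {rule}")
    # Boxes that do not contain the location can never be selected.
    return int(np.argmin(np.where(inside, cost, np.inf)))
```

For a location near the center of a large box that also falls just inside the corner of a smaller overlapping box, the two rules disagree: the area rule picks the small box, while the distance rule picks the large one.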
First, we count the ambiguous sample ratios on CrowdHuman set. With FPN-based FCOS, there are unambiguous positive samples (with one ground-truth box), with two ground-truth boxes, with three ground-truth boxes and the rest () with more than three ground-truth boxes. Given the much higher ambiguous sample ratio than COCO, it is expected that FCOS will have inferior performance on the highly crowded dataset.
We compare FCOS without center-ness with the improved RetinaNet (i.e., “RetinaNet w/ imprv.”). To our surprise, even without center-ness, FCOS can already achieve decent performance. As shown in Table X, FCOS compares favorably with its anchor-based counterpart RetinaNet on two out of three metrics (AP and JI), which suggests that anchor-based detectors have no large advantages even under the highly crowded scenario. The higher MR of FCOS indicates that FCOS might have a large number of false positives with high confidence. By using the center-ness, MR can be significantly reduced from to . As a result, FCOS can achieve better results under all the three metrics.
Furthermore, as shown in , it is more reasonable to let one proposal make multiple predictions under the highly crowded scenario (i.e., multiple instance prediction (MIP)). After that, these predictions are merged by Set NMS, which skips the suppression for the boxes from the same location. A similar idea can be easily incorporated into FCOS. To be specific, if a location should be associated with multiple objects, instead of choosing a single target (i.e., the closest one to the location), the location’s targets are set as the $K$-closest objects. Accordingly, the network is required to make $K$ predictions per location. Moreover, we do not make use of the earth mover’s distance (EMD) loss for simplicity. Finally, the results are merged by Set NMS. As shown in Table X, with MIP and Set NMS, improved performance is achieved under all the three metrics.
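The Set NMS idea, greedy NMS in which boxes predicted at the same location never suppress each other, can be sketched as follows. This is a minimal NumPy illustration rather than the reference implementation: the names `set_nms`, `iou_one_to_many`, and `set_ids` are hypothetical, boxes are assumed to be (x1, y1, x2, y2) with positive area, and the threshold is left as a parameter.

```python
import numpy as np


def iou_one_to_many(box, boxes):
    """IoU between one box and an (N, 4) array of boxes."""
    ix1 = np.maximum(box[0], boxes[:, 0])
    iy1 = np.maximum(box[1], boxes[:, 1])
    ix2 = np.minimum(box[2], boxes[:, 2])
    iy2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)


def set_nms(boxes, scores, set_ids, iou_thr):
    """Greedy NMS that skips suppression between boxes sharing a set id.

    set_ids[i] identifies the location that produced box i. Returns the
    indices of the kept boxes in order of decreasing score.
    """
    boxes = np.asarray(boxes, dtype=np.float64).reshape(-1, 4)
    scores = np.asarray(scores, dtype=np.float64)
    set_ids = np.asarray(set_ids)
    suppressed = np.zeros(len(boxes), dtype=bool)
    keep = []
    for i in np.argsort(-scores):
        if suppressed[i]:
            continue
        keep.append(int(i))
        ious = iou_one_to_many(boxes[i], boxes)
        # Suppress overlapping boxes only if they come from a different location.
        suppressed |= (ious > iou_thr) & (set_ids != set_ids[i])
    return keep
```

With plain NMS, two heavily overlapping predictions from the same location would collapse into one. Here both survive, which is the behavior needed when one location is asked to predict several mutually occluding objects.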
5 Conclusion
In this work, we have proposed an anchor-free and proposal-free one-stage detector, FCOS. Our experiments demonstrate that FCOS compares favourably against the widely-used anchor-based one-stage detectors, including RetinaNet, YOLO and SSD, but with much less design complexity. FCOS completely avoids all computation and hyper-parameters related to anchor boxes and solves object detection in a per-pixel prediction fashion, similar to other dense prediction tasks such as semantic segmentation. FCOS also achieves state-of-the-art performance among one-stage detectors. We also present some real-time models of our detector, which have state-of-the-art performance and inference speed. Given its effectiveness and efficiency, we hope that FCOS can serve as a strong and simple alternative to current mainstream anchor-based detectors.
Acknowledgments
This work was in part supported by ARC DP grant #DP200103797. We would like to thank the author of [1] for the tricks of center sampling and GIoU. We also thank Chaorui Deng for his suggestion of positioning the center-ness branch with box regression.
References
- [1] (2019) Note: https://github.com/yqyao/FCOS_PLUS
- [2] (2020) BlendMask: top-down meets bottom-up for instance segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
- [3] Adversarial PoseNet: a structure-aware convolutional network for human pose estimation. In Proc. IEEE Int. Conf. Comp. Vis.
- [4] (2020) Detection in crowded scenes: one proposal, multiple predictions. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
- [5] (2017) Deformable convolutional networks. In Proc. IEEE Int. Conf. Comp. Vis., pp. 764–773.
- [6] (2009) ImageNet: a large-scale hierarchical image database. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 248–255.
- [7] (2014) Fast feature pyramids for object detection. IEEE Trans. Pattern Anal. Mach. Intell.
- [8] (2011) Pedestrian detection: an evaluation of the state of the art. IEEE Trans. Pattern Anal. Mach. Intell. 34 (4), pp. 743–761.
- [9] (2019) CenterNet: keypoint triplets for object detection. In Proc. IEEE Int. Conf. Comp. Vis., pp. 6569–6578.
- [10] (2017) DSSD: deconvolutional single shot detector. arXiv preprint arXiv:1701.06659.
- [11] (2018) Detectron. Note: https://github.com/facebookresearch/detectron
- [12] (2015) Fast R-CNN. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 1440–1448.
- [13] (2020) SiamCAR: siamese fully convolutional classification and regression for visual tracking. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
- [14] (2016) Deep residual learning for image recognition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 770–778.
- [15] (2019) Knowledge adaptation for efficient semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
- [16] (2018) An end-to-end textspotter with explicit alignment and attention. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 5020–5029.
- [17] (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 7310–7311.
- [18] (2015) Densebox: unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874.
- [19] (2018) Cornernet: detecting objects as paired keypoints. In Proc. Eur. Conf. Comp. Vis., pp. 734–750.
- [20] (2020) CenterMask: real-time anchor-free instance segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
- [21] (2017) Feature pyramid networks for object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2117–2125.
- [22] (2017) Focal loss for dense object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2980–2988.
- [23] (2014) Microsoft COCO: common objects in context. In Proc. Eur. Conf. Comp. Vis., pp. 740–755.
- [24] (2016) Learning depth from single monocular images using deep convolutional neural fields. IEEE Trans. Pattern Anal. Mach. Intell.
- [25] (2016) SSD: single shot multibox detector. In Proc. Eur. Conf. Comp. Vis., pp. 21–37.
- [26] (2020) Structured knowledge distillation for dense prediction. IEEE Trans. Pattern Anal. Mach. Intell.
- [27] (2020) ABCNet: real-time scene text spotting with adaptive Bezier-curve network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
- [28] (2015) Fully convolutional networks for semantic segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 3431–3440.
- [29] (2016) You only look once: unified, real-time object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 779–788.
- [30] (2017) YOLO9000: better, faster, stronger. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 7263–7271.
- [31] (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767.
- [32] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Proc. Adv. Neural Inf. Process. Syst., pp. 91–99.
- [33] (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 658–666.
- [34] (2018) Crowdhuman: a benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123.
- [35] (2013) Training effective node classifiers for cascade classification. Int. J. Comp. Vis. 103 (3), pp. 326–347.
- [36] (2017) Beyond skip connections: top-down modulation for object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
- [37] Inception-v4, inception-resnet and the impact of residual connections on learning. In Proc. AAAI Conf. Artificial Intell.
- [38] (2020) EfficientDet: scalable and efficient object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
- [39] (2019) DirectPose: direct end-to-end multi-person pose estimation. arXiv preprint arXiv:1911.07451.
- [40] (2019) Decoders matter for semantic segmentation: data-dependent decoding enables flexible feature aggregation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 3126–3135.
- [41] (2019) FCOS: fully convolutional one-stage object detection. In Proc. IEEE Int. Conf. Comp. Vis.
- [42] (2020) Conditional convolutions for instance segmentation. arXiv preprint arXiv:2003.05664.
- [43] (2001) Robust real-time object detection.
- [44] (2020) Tracking by instance detection: a meta-learning approach. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
- [45] (2018) Group normalization. In Proc. Eur. Conf. Comp. Vis., pp. 3–19.
- [46] (2019) Detectron2. Note: https://github.com/facebookresearch/detectron2
- [47] (2020) PolarMask: single shot instance segmentation with polar representation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. Note: arxiv.org/abs/1909.13226
- [48] (2019) Enforcing geometric constraints of virtual normal for depth prediction. In Proc. IEEE Int. Conf. Comp. Vis.
- [49] (2018) Deep layer aggregation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 2403–2412.
- [50] (2016) Unitbox: an advanced object detection network. In Proc. ACM Int. Conf. Multimedia, pp. 516–520.
- [51] (2020) Mask encoding for single shot instance segmentation. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
- [52] (2019) Objects as points. arXiv preprint arXiv:1904.07850.
- [53] (2017) EAST: an efficient and accurate scene text detector. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 5551–5560.
- [54] (2019) Feature selective anchor-free module for single-shot object detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn.
- [55] (2019) Deformable convnets v2: more deformable, better results. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pp. 9308–9316.