Probabilistic two-stage detection

by Xingyi Zhou, et al.

We develop a probabilistic interpretation of two-stage object detection. We show that this probabilistic interpretation motivates a number of common empirical training practices. It also suggests changes to two-stage detection pipelines. Specifically, the first stage should infer proper object-vs-background likelihoods, which should then inform the overall score of the detector. A standard region proposal network (RPN) cannot infer this likelihood sufficiently well, but many one-stage detectors can. We show how to build a probabilistic two-stage detector from any state-of-the-art one-stage detector. The resulting detectors are faster and more accurate than both their one- and two-stage precursors. Our detector achieves 56.4 mAP on COCO test-dev with single-scale testing, outperforming all published results. Using a lightweight backbone, our detector achieves 49.2 mAP on COCO at 33 fps on a Titan Xp, outperforming the popular YOLOv4 model.


1 Introduction

Object detection aims to find all objects in an image and identify their locations and class likelihoods (Girshick et al., 2014). One-stage detectors jointly infer the location and class likelihood in a probabilistically sound framework (Lin et al., 2017b; Liu et al., 2016; Redmon and Farhadi, 2017). They are trained to maximize the log-likelihood of annotated ground-truth objects, and predict proper likelihood scores at inference. A two-stage detector first finds potential objects and their location (Uijlings et al., 2013; Zitnick and Dollár, 2014; Ren et al., 2015) and then (in the second stage) classifies these potential objects. The first stage is designed to maximize recall (Ren et al., 2015; He et al., 2017; Cai and Vasconcelos, 2018), while the second stage maximizes a classification objective over regions filtered by the first stage. While the second stage has a probabilistic interpretation, the combination of the two stages does not.

In this paper, we develop a probabilistic interpretation of two-stage detectors. We present a simple modification of standard two-stage detector training by optimizing a lower bound to a joint probabilistic objective over both stages. A probabilistic treatment suggests changes to the two-stage architecture. Specifically, the first stage needs to infer a calibrated object likelihood. The current region proposal network (RPN) in two-stage detectors is designed to maximize the proposal recall, and does not produce accurate likelihoods. However, full-fledged one-stage detectors can.

(a) First stage: Object likelihood
(b) Second stage: Conditional classification
Figure 1: Illustration of our framework. A class-agnostic one-stage detector predicts object likelihood. A second stage then predicts a classification score conditioned on a detection. The final detection score combines the object likelihood and the conditional classification score.

We build a probabilistic two-stage detector on top of state-of-the-art one-stage detectors. For each one-stage detection, our model extracts region-level features and classifies them. We use either a Faster R-CNN (Ren et al., 2015) or a cascade classifier (Cai and Vasconcelos, 2018) in the second stage. The two stages are trained together to maximize the log-likelihood of ground-truth objects. At inference, our detectors use this final log-likelihood as the detection score.

A probabilistic two-stage detector is faster and more accurate than both its one- and two-stage precursors. Compared to two-stage anchor-based detectors (Cai and Vasconcelos, 2018), our first stage is more accurate and allows the detector to use fewer proposals in RoI heads (256 vs. 1K), making the detector both more accurate and faster overall. Compared to single-stage detectors, our first stage uses a leaner head design and only has one output class for dense image-level prediction. The speedup due to the drastic reduction in the number of classes more than makes up for the additional costs of the second stage. Our second stage makes full use of years of progress in two-stage detection (Cai and Vasconcelos, 2018; Chen et al., 2019a) and yields a significant increase in detection accuracy over one-stage baselines. It also easily scales to large-vocabulary detection.

Experiments on COCO (Lin et al., 2014), LVIS (Gupta et al., 2019), and Objects365 (Shao et al., 2019) demonstrate that our probabilistic two-stage framework boosts the accuracy of a strong CascadeRCNN model by 1-3 mAP, while also improving its speed. Using a standard ResNeXt-101-DCN backbone with a CenterNet (Zhou et al., 2019a) first stage, our detector achieves 50.2 mAP on COCO test-dev. With a strong Res2Net-101-DCN-BiFPN (Gao et al., 2019a; Tan et al., 2020) backbone and self-training (Zoph et al., 2020), it achieves 56.4 mAP with single-scale testing, outperforming all published results. Using a small DLA-BiFPN backbone and lower input resolution, we achieve 49.2 mAP on COCO at 33 fps on a Titan Xp, outperforming the popular YOLOv4 model (43.5 mAP at 33 fps) on the same hardware. Code and models are released at

2 Related Work

One-stage detectors jointly predict an output class and location of objects densely throughout the image. RetinaNet (Lin et al., 2017b) classifies a set of predefined sliding anchor boxes and handles the foreground-background imbalance by reweighting losses for each output. FCOS (Tian et al., 2019) and CenterNet (Zhou et al., 2019a) eliminate the need of multiple anchors per pixel and classify foreground/background by location. ATSS (Zhang et al., 2020b) and PAA (Kim and Lee, 2020) further improve FCOS by changing the definition of foreground and background. GFL (Li et al., 2020b) and Autoassign (Zhu et al., 2020a) change the hard foreground-background assignment to a weighted soft assignment. AlignDet (Chen et al., 2019b) uses a deformable convolution layer before the output to gather richer features for classification and regression. RepPoint (Yang et al., 2019) and DenseRepPoint (Yang et al., 2020) encode bounding boxes as the outline of a set of points and use the features of the point set for classification. BorderDet (Qiu et al., 2020) pools features along the bounding box for better localization. Most one-stage detectors have a sound probabilistic interpretation.

While one-stage detectors have achieved competitive performance (Zhang et al., 2020b; Kim and Lee, 2020; Zhang et al., 2019; Li et al., 2020b; Zhu et al., 2020a), they usually rely on heavier separate classification and regression branches than two-stage models. In fact, they are no longer faster than their two-stage counterparts if the vocabulary (i.e., the set of object classes) is large (as in the LVIS or Objects365 datasets). Also, one-stage detectors only use the local feature of the positive cell for regression and classification, which is sometimes misaligned with the object (Chen et al., 2019b; Song et al., 2020).

Our probabilistic two-stage framework retains the probabilistic interpretation of one-stage detectors, but factorizes the probability distribution over multiple stages, improving both accuracy and speed.

Two-stage detectors first use a region proposal network (RPN) to generate coarse object proposals, and then use a dedicated per-region head to classify and refine them. FasterRCNN (Ren et al., 2015; He et al., 2017) uses two fully-connected layers as the RoI heads. CascadeRCNN (Cai and Vasconcelos, 2018) uses three cascaded stages of FasterRCNN, each with a different positive threshold so that the later stages focus more on localization accuracy. HTC (Chen et al., 2019a) utilizes additional instance and semantic segmentation annotations to enhance the inter-stage feature flow of CascadeRCNN. TSD (Song et al., 2020) decouples the classification and localization branches for each RoI.

Two-stage detectors are still more accurate in many settings (Gupta et al., 2019; Sun et al., 2020; Kuznetsova et al., 2018). Currently, all two-stage detectors use a relatively weak RPN that maximizes the recall of the top 1K proposals, and does not utilize the proposal score at test time. The large number of proposals slows the system down, and the recall-based proposal network does not directly offer the same clear probabilistic interpretation as one-stage detectors. Our framework addresses this, and integrates a strong class-agnostic single-stage object detector with later classification stages. Our first stage uses fewer, but higher quality, regions, yielding both faster inference and higher accuracy.

Other detectors. A family of object detectors identify objects via points in the image. CornerNet (Law and Deng, 2018) detects the top-left and bottom-right corners and groups them using an embedding feature. ExtremeNet (Zhou et al., 2019b) detects four extreme points and groups them using an additional center point. Duan et al. (2019) detect the center point and use it to improve corner grouping. Corner Proposal Net (Duan et al., 2020) uses pairwise corner groupings as region proposals. CenterNet (Zhou et al., 2019a) detects the center point and regresses the bounding box parameters from it.

(a) one-stage detector
(b) two-stage detector
(c) Probabilistic two-stage detector
Figure 2: Illustration of the structural differences between existing one-stage and two-stage detectors and our probabilistic two-stage framework. (a) A typical one-stage detector applies separate heavy classification and regression heads and produces a dense classification map. (b) A typical two-stage detector uses a light proposal network and extracts many region features for classification. (c) Our probabilistic two-stage framework uses a one-stage detector with shared heads to produce region proposals and extracts a few regions for classification. The proposal score from the first stage is used in the second stage in a probabilistically sound framework. Typically, the probabilistic detector uses far fewer regions (256 vs. 1K in our experiments).

DETR (Carion et al., 2020) and Deformable DETR (Zhu et al., 2020c) remove the dense output in a detector, and instead use a Transformer (Vaswani et al., 2017) that directly predicts a set of bounding boxes.

The major difference between point-based detectors, DETR, and conventional detectors lies in the network architecture. Point-based detectors use a fully-convolutional network (Newell et al., 2016; Yu et al., 2018), usually with symmetric downsampling and upsampling layers, and produce a single feature map with a small stride (i.e., stride 4). DETR-style detectors (Carion et al., 2020; Zhu et al., 2020c) use a transformer as the decoder. Conventional one- and two-stage detectors commonly use an image classification network augmented by lightweight upsampling layers, and produce multi-scale features (FPN) (Lin et al., 2017a).

3 Preliminaries

An object detector aims to predict the location $b_i \in \mathbb{R}^4$ and class-specific likelihood score $s_i \in \mathbb{R}^{|\mathcal{C}|}$ for any object $i$ for a predefined set of classes $\mathcal{C}$. The object location is most often described by two corners of an axis-aligned bounding box (Ren et al., 2015; Carion et al., 2020) or through an equivalent center+size representation (Tian et al., 2019; Zhou et al., 2019a; Zhu et al., 2020c). The main difference between object detectors lies in their representation of the class likelihood, reflected in their architectures.

One-stage detectors (Redmon and Farhadi, 2018; Lin et al., 2017b; Tian et al., 2019; Zhou et al., 2019a) jointly predict the object location and likelihood score in a single network. Let $L_{i,c} = 1$ indicate a positive detection for object candidate $i$ and class $c$, and let $L_{i,c} = 0$ indicate background. Most one-stage detectors (Lin et al., 2017b; Tian et al., 2019; Zhou et al., 2019a) then parametrize the class likelihood as a Bernoulli distribution using an independent sigmoid per class:

$$s_i(c) = P(L_{i,c} = 1) = \sigma(w_c^\top f_i),$$

where $f_i$ is a feature produced by the backbone and $w_c$ is a class-specific weight vector. During training, this probabilistic interpretation allows one-stage detectors to simply maximize the log-likelihood $\log P(L_{i,c})$ or the focal loss (Lin et al., 2017b) of ground-truth annotations. One-stage detectors differ from each other in the definition of positive and negative samples. Some use anchor overlap (Lin et al., 2017b; Zhang et al., 2020b; Kim and Lee, 2020), others use locations (Tian et al., 2019). However, all optimize log-likelihood and use the class probability to score boxes. All directly regress to bounding box coordinates.
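The per-class sigmoid parametrization described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `class_scores` and `log_likelihood`, the three-dimensional feature, and the two weight vectors are all hypothetical and not taken from the paper's code.

```python
import math

def sigmoid(x):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def class_scores(feature, class_weights):
    """Independent per-class score: sigmoid of the dot product w_c . f."""
    return [
        sigmoid(sum(w * f for w, f in zip(w_c, feature)))
        for w_c in class_weights
    ]

def log_likelihood(scores, labels):
    """Bernoulli log-likelihood summed over classes; each label is 0 or 1."""
    return sum(
        math.log(s) if y == 1 else math.log(1.0 - s)
        for s, y in zip(scores, labels)
    )

# Hypothetical feature vector and class weights (illustrative values only).
feature = [0.5, -1.0, 2.0]
class_weights = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

scores = class_scores(feature, class_weights)
ll = log_likelihood(scores, [1, 0])  # candidate is a positive for class 0 only
print(scores, ll)
```

Because each class has its own sigmoid, the scores are separate Bernoulli probabilities and need not sum to one, unlike a softmax over classes.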

Two-stage detectors (Ren et al., 2015; Cai and Vasconcelos, 2018) first extract potential object locations, called object proposals, using an objectness measure $P(O_i)$. They then extract features for each potential object, classify them into one of the object classes or background with $P(C_i \mid O_i = 1)$, where $C_i \in \mathcal{C} \cup \{\text{bg}\}$, and refine the object location. Each stage is supervised independently. In the first stage, a Region Proposal Network (RPN) learns to classify annotated objects as foreground and other boxes as background. This is commonly done through a binary classifier trained with a log-likelihood objective. However, an RPN defines background regions very conservatively. Any prediction whose overlap with an annotated object exceeds a low threshold may be considered foreground. This label definition favors recall over precision and accurate likelihood estimation. Many partial objects receive a large proposal score. In the second stage, a softmax classifier learns to classify each proposal into one of the foreground classes or background. The classifier uses a log-likelihood objective, with foreground labels consisting of annotated objects and background labels coming from high-scoring first-stage proposals without annotated objects close-by. During training, this categorical distribution is implicitly conditioned on positive detections of the first stage, as it is only trained and evaluated on them. Both the first and second stage have a probabilistic interpretation, and under their positive and negative definition estimate the log-likelihood of objects or classes respectively. However, the entire detector does not. It combines multiple heuristics and sampling strategies to independently train the first and second stages (Cai and Vasconcelos, 2018; Ren et al., 2015). The final output comprises boxes with classification scores of the second stage alone.

Next, we develop a simple probabilistic interpretation of two-stage detectors that considers the two stages as part of a single class-likelihood estimate. We show how this affects the design of the first stage, and how to train the two stages efficiently.

4 A probabilistic interpretation of two-stage detection

For each image, our goal is to produce a set of $K$ detections as bounding boxes $b_1, \ldots, b_K$ with an associated class distribution $P(C_k)$ over the object classes or background, $C_k \in \mathcal{C} \cup \{\text{bg}\}$, for each object $k$. In this work, we keep the bounding-box regression unchanged and only focus on the class distribution. A two-stage detector factorizes this distribution into two parts: a class-agnostic object likelihood $P(O_k)$ (first stage) and a conditional categorical classification $P(C_k \mid O_k)$ (second stage). Here $O_k = 1$ indicates a positive detection in the first stage, while $O_k = 0$ corresponds to background. Any negative first-stage detection $O_k = 0$ leads to a background classification: $P(C_k = \text{bg} \mid O_k = 0) = 1$. In a multi-stage detector (Cai and Vasconcelos, 2018), the classification is done by an ensemble of multiple cascaded stages, while two-stage detectors use a single classifier (Ren et al., 2015). The joint class distribution of the two-stage model then is

$$P(C_k) = \sum_{o \in \{0, 1\}} P(C_k \mid O_k = o)\, P(O_k = o). \tag{1}$$

Training objective.

We train our detectors using maximum likelihood estimation. For annotated objects, we maximize

$$\log P(C_k) = \log P(C_k \mid O_k = 1) + \log P(O_k = 1), \tag{2}$$

which reduces to independent maximum-likelihood objectives for the first and second stage respectively.

For the background class, the maximum-likelihood objective does not factorize:

$$\log P(\text{bg}) = \log\big(P(\text{bg} \mid O_k = 1)\, P(O_k = 1) + P(O_k = 0)\big).$$

This objective ties the first- and second-stage probability estimates in their loss and gradient computation. An exact evaluation requires a dense evaluation of the second stage for all first-stage outputs, which would slow down training prohibitively. We instead derive two lower bounds to the objective, which we jointly optimize. The first lower bound uses Jensen's inequality $\log(\alpha x_1 + (1 - \alpha) x_2) \ge \alpha \log x_1 + (1 - \alpha) \log x_2$ with $\alpha = P(O_k = 1)$, $x_1 = P(\text{bg} \mid O_k = 1)$, and $x_2 = 1$:

$$\log P(\text{bg}) \ge P(O_k = 1) \log P(\text{bg} \mid O_k = 1). \tag{3}$$

This lower bound maximizes the log-likelihood of background of the second stage for any high-scoring object in the first stage. It is tight for $P(O_k = 1) \to 0$ or $P(\text{bg} \mid O_k = 1) \to 1$, but can be arbitrarily loose for $P(O_k = 1) > 0$ and $P(\text{bg} \mid O_k = 1) \to 0$. Our second bound involves just the first-stage objective:

$$\log P(\text{bg}) \ge \log P(O_k = 0). \tag{4}$$

It uses $P(\text{bg} \mid O_k = 1)\, P(O_k = 1) \ge 0$ with the monotonicity of the $\log$. This bound is tight for $P(\text{bg} \mid O_k = 1) \to 0$. Ideally, the tightest bound is obtained by using the maximum of Eq. (3) and Eq. (4). This lower bound is within $\log 2$ of the actual objective, as shown in the supplementary material. In practice, however, we found optimizing both bounds jointly to work better.
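As a sanity check on the two lower bounds, the sketch below evaluates the exact background objective and both bounds on a grid of probabilities. It assumes the Jensen bound takes the form P(O=1) log P(bg|O=1) and the first-stage bound the form log P(O=0); the function names and the grid are illustrative choices, not part of the paper. A short calculation suggests the better of the two bounds lies within log 2 of the exact objective, and the loop checks this numerically.

```python
import math

def background_objective(p_obj, p_bg_given_obj):
    """Exact objective: log(P(bg | O=1) * P(O=1) + P(O=0))."""
    return math.log(p_bg_given_obj * p_obj + (1.0 - p_obj))

def jensen_bound(p_obj, p_bg_given_obj):
    """Jensen-style lower bound: P(O=1) * log P(bg | O=1)."""
    return p_obj * math.log(p_bg_given_obj)

def first_stage_bound(p_obj):
    """First-stage-only lower bound: log P(O=0)."""
    return math.log(1.0 - p_obj)

# Probabilities 0.05, 0.10, ..., 0.95 for both stages (illustrative grid).
grid = [i / 20 for i in range(1, 20)]
max_gap = 0.0
for p in grid:          # p plays the role of P(O=1)
    for q in grid:      # q plays the role of P(bg | O=1)
        exact = background_objective(p, q)
        b_jensen = jensen_bound(p, q)
        b_first = first_stage_bound(p)
        # Both expressions must lower-bound the exact objective.
        assert b_jensen <= exact + 1e-12
        assert b_first <= exact + 1e-12
        max_gap = max(max_gap, exact - max(b_jensen, b_first))

print(f"largest gap on the grid: {max_gap:.4f} (log 2 = {math.log(2):.4f})")
```

The check only covers a finite grid, so it illustrates the bounds rather than proving them.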

With lower bound Eq. (4) and the positive objective Eq. (2), first-stage training reduces to a maximum-likelihood estimate with positive labels at annotated objects and negative labels for all other locations. It is equivalent to training a binary one-stage detector, or an RPN with a strict negative definition that encourages likelihood estimation and not recall.

Detector design.

The key difference between our formulation and standard two-stage detectors lies in the use of the class-agnostic detection in the detection score Eq. (1). In our probabilistic formulation, the classification score is multiplied by the class-agnostic detection score. This requires a strong first-stage detector that not only maximizes the proposal recall (Ren et al., 2015; Uijlings et al., 2013), but also predicts a reliable object likelihood for each proposal. In our experiments, we use strong one-stage detectors to estimate this log-likelihood, as described in the next section.
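The score combination can be illustrated with a small sketch. It assumes that a negative first-stage detection always yields background, so each foreground score is the product of the object likelihood and the conditional class probability. The function name `detection_scores`, the dictionary layout, and the class names are hypothetical and chosen only for this example.

```python
def detection_scores(p_object, p_class_given_object):
    """Combine first- and second-stage outputs into a joint class distribution.

    p_object: first-stage object likelihood for one proposal, in [0, 1].
    p_class_given_object: dict mapping class name -> conditional probability
        from the second stage; must contain the key "bg" and sum to 1.
    """
    joint = {
        name: prob * p_object
        for name, prob in p_class_given_object.items()
        if name != "bg"
    }
    # Background collects the second-stage background mass plus the mass of
    # proposals that the first stage rejects outright.
    joint["bg"] = p_class_given_object["bg"] * p_object + (1.0 - p_object)
    return joint

# Hypothetical outputs for a single proposal.
joint = detection_scores(0.8, {"cat": 0.7, "dog": 0.2, "bg": 0.1})
print(joint)
```

In this example a confident second-stage "cat" prediction (0.7) is scaled down to 0.56 because the first stage assigns the proposal only a 0.8 object likelihood, and the joint distribution still sums to one.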

5 Building a probabilistic two-stage detector

The core component of a probabilistic two-stage detector is a strong first stage. This first stage needs to predict an accurate object likelihood that informs the overall detection score, rather than maximizing the object coverage. We experiment with four different first-stage designs based on popular one-stage detectors. For each, we highlight the design choices needed to convert them from a single-stage detector to a first stage in a probabilistic two-stage detector.

RetinaNet (Lin et al., 2017b) closely resembles the RPN of traditional two-stage detectors with three critical differences: a heavier head design (4 layers vs. 1 layer in RPN), a stricter positive and negative anchor definition, and the focal loss. Each of these components increases RetinaNet’s ability to produce calibrated one-stage detection likelihoods. We use all of these in our first-stage design. RetinaNet by default uses two separate heads for bounding box regression and classification. In our first-stage design, we found it sufficient to have a single shared head for both tasks, as object-or-not classification is easier and requires less network capacity. This speeds up inference.

CenterNet (Zhou et al., 2019a) finds objects as keypoints located at their center, then regresses to box parameters. The original CenterNet operates at a single scale, whereas conventional two-stage detectors use a feature pyramid (FPN) (Lin et al., 2017a). We upgrade CenterNet to multiple scales using an FPN. Specifically, we use the RetinaNet-style ResNet-FPN as the backbone (Lin et al., 2017b), with output feature maps from stride 8 to 128 (i.e., P3-P7). We apply a 4-layer classification branch and regression branch (Tian et al., 2019) to all FPN levels to produce a detection heatmap and bounding box regression map. During training, we assign ground-truth center annotations to specific FPN levels based on the object size, within a fixed assignment range (Tian et al., 2019). Inspired by GFL (Li et al., 2020b), we add locations in the neighborhood of the center that already produce high-quality bounding boxes (i.e., with a sufficiently low regression loss) as positives. We use the distance to boundaries as the bounding box representation (Tian et al., 2019), and use the gIoU loss for bounding box regression (Rezatofighi et al., 2019). We evaluate both one-stage and probabilistic two-stage versions of this architecture. We refer to the improved CenterNet as CenterNet*.
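To make the box representation and regression loss concrete, here is a minimal sketch of decoding a distance-to-boundaries prediction and computing a generalized IoU loss. It assumes boxes in (x1, y1, x2, y2) corner format with positive area and the common loss form 1 - gIoU; the helper names `decode_ltrb`, `giou`, and `giou_loss` are hypothetical and the code is not the authors' implementation.

```python
def decode_ltrb(x, y, left, top, right, bottom):
    """Turn distances from location (x, y) to the four box sides into corners."""
    return (x - left, y - top, x + right, y + bottom)

def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2) with positive area."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    # Intersection is empty when the boxes do not overlap along an axis.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    iou = inter / union
    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (enclose - union) / enclose

def giou_loss(pred_box, target_box):
    """Loss in [0, 2]; zero when the boxes coincide."""
    return 1.0 - giou(pred_box, target_box)

pred = decode_ltrb(2.0, 2.0, 1.0, 1.0, 1.0, 1.0)   # -> (1.0, 1.0, 3.0, 3.0)
target = (0.0, 0.0, 2.0, 2.0)
print(pred, giou(pred, target), giou_loss(pred, target))
```

Unlike plain IoU, the enclosing-box term keeps the loss informative for non-overlapping boxes, which is one reason it is a popular choice for box regression.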

ATSS (Zhang et al., 2020b) models the class likelihood of a one-stage detector with an adaptive IoU threshold for each object, and uses centerness (Tian et al., 2019) to calibrate the score. In a probabilistic two-stage baseline, we use ATSS (Zhang et al., 2020b) as is, and multiply the centerness and the foreground classification score for each proposal. We again merge the classification and regression heads for a slight speedup.

GFL (Li et al., 2020b) uses regression quality to guide the object likelihood training. In a probabilistic two-stage baseline, we remove the integration-based regression and only use the distance-based regression (Tian et al., 2019) for consistency, and again merge the two heads.

The above one-stage architectures infer the object likelihood $P(O_k)$. For each, we combine them with the second stage that infers the conditional classification $P(C_k \mid O_k)$. We experiment with two basic second-stage designs: FasterRCNN (Ren et al., 2015) and CascadeRCNN (Cai and Vasconcelos, 2018).


A two-stage detector (Ren et al., 2015) typically uses FPN levels P2-P6 (stride 4 to stride 64), while most one-stage detectors use FPN levels P3-P7 (stride 8 to stride 128). To make them compatible, we use levels P3-P7 for both one- and two-stage detectors. This modification slightly improves the baselines. Following Wang et al. (2019), we increase the positive IoU threshold in the second stage for Faster RCNN (and for CascadeRCNN) to compensate for the IoU distribution change in the second stage. We use a maximum of 256 proposal boxes in the second stage for probabilistic two-stage detectors, and use the default 1K boxes for RPN-based models unless stated otherwise. We also increase the NMS threshold for our probabilistic detectors as we use fewer proposals. These hyperparameter changes are necessary for probabilistic detectors, but we found they do not improve the RPN-based detector in our experiments.

We implement our method based on detectron2 (Wu et al., 2019). Our default model follows the standard setting in detectron2 (Wu et al., 2019). Specifically, we train the network with the SGD optimizer for 90K iterations (1x schedule). The base learning rate is set separately for two-stage and one-stage detectors, and is dropped by 10x at iterations 60K and 80K. We use multi-scale training with the short edge in the range [640,800] and the long edge up to 1333. During training, we set a dedicated first-stage loss weight, as one-stage detectors are typically trained with a different learning rate. During testing, we use a fixed short edge at 800 and long edge up to 1333.
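The step schedule described above (90K iterations with 10x drops at 60K and 80K) can be written as a small function. This is a sketch under assumptions: the function name `learning_rate` is hypothetical, the drop is assumed to take effect at the listed iteration, any warm-up phase is ignored, and the base learning rate is left as a parameter because it is not fixed here.

```python
def learning_rate(iteration, base_lr, drop_iterations=(60_000, 80_000), drop_factor=0.1):
    """Step schedule: multiply base_lr by drop_factor at each listed iteration."""
    num_drops = sum(1 for step in drop_iterations if iteration >= step)
    return base_lr * drop_factor ** num_drops

# With base_lr = 1.0 the return value is the multiplier applied to any base rate.
for it in (0, 59_999, 60_000, 80_000, 89_999):
    print(it, learning_rate(it, base_lr=1.0))
```

Passing `base_lr=1.0` makes the output read as a relative multiplier, which keeps the example independent of the actual base rate used for one- or two-stage detectors.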

We instantiate our probabilistic two-stage framework on four different backbones. We use a default ResNet-50 (He et al., 2016) model for most ablations and comparisons among design choices, and then compare to state-of-the-art methods using the same large ResNeXt-32x8d-101-DCN (Xie et al., 2017) backbone, and use a lightweight DLA (Yu et al., 2018) backbone for a real-time model. We also integrate the most recent advances (Zoph et al., 2020; Tan et al., 2020; Gao et al., 2019a) and design an extra-large backbone for the high-accuracy regime. Further details about each backbone are in the supplement.

Detector                        mAP   First-stage runtime  Total runtime
FasterRCNN-RPN (original)       37.9  46ms                 55ms
CascadeRCNN-RPN (original)      41.6  48ms                 78ms

RetinaNet (Lin et al., 2017b)   37.4  82ms                 82ms
FasterRCNN-RetinaNet            40.4  60ms                 63ms
CascadeRCNN-RetinaNet           42.6  61ms                 69ms

GFL (Li et al., 2020b)          40.2  51ms                 51ms
FasterRCNN-GFL                  41.7  46ms                 50ms
CascadeRCNN-GFL                 42.7  46ms                 57ms

ATSS (Zhang et al., 2020b)      39.7  56ms                 56ms
FasterRCNN-ATSS                 41.5  47ms                 50ms
CascadeRCNN-ATSS                42.7  47ms                 57ms

CenterNet*                      40.2  51ms                 51ms
FasterRCNN-CenterNet            41.5  46ms                 50ms
CascadeRCNN-CenterNet           42.9  47ms                 57ms

Table 1: Performance and runtime of a number of two-stage detectors, one-stage detectors, and corresponding probabilistic two-stage detectors (our approach). Results on COCO validation. Top block: two-stage FasterRCNN and CascadeRCNN detectors. Other blocks: Four one-stage detectors, each with two corresponding probabilistic two-stage detectors, one based on FasterRCNN and one based on CascadeRCNN. For each detector, we list its first-stage runtime and total runtime. All results are reported using standard Res50-1x with multi-scale training.

6 Results

We evaluate our framework on three large detection datasets: COCO (Lin et al., 2014), LVIS (Gupta et al., 2019), and Objects365 (Shao et al., 2019). Details of each dataset can be found in the supplement. We use COCO to perform ablation studies and comparisons to the state of the art. We use LVIS and Objects365 to test the generality of our framework, particularly in the large-vocabulary regime. In all datasets, we report the standard mAP. Runtimes are reported on a Titan Xp GPU with PyTorch 1.4.0 and CUDA 10.1.

Table 1 compares one- and two-stage detectors to corresponding probabilistic two-stage detectors designed via our framework. The first block of the table shows the performance of the original reference two-stage detectors, FasterRCNN and CascadeRCNN. The following blocks show the performance of four one-stage detectors (discussed in Section 5) and the corresponding probabilistic two-stage detectors, obtained when using the respective one-stage detector as the first stage in a probabilistic two-stage framework. For each one-stage detector, we show two versions of probabilistic two-stage models, one based on FasterRCNN and one based on CascadeRCNN.

Detector      Backbone         Epochs  mAP   Runtime
FCOS-RT       DLA-BiFPN-P3     48      42.1  21ms
CenterNet2    DLA-BiFPN-P3     48      43.7  25ms

CenterNet     DLA              230     37.6  18ms
YOLOv4        CSPDarknet-53    300     43.5  30ms
EfficientDet  EfficientNet-B2  500     43.5  23ms*
EfficientDet  EfficientNet-B3  500     46.8  37ms*
CenterNet2    DLA-BiFPN-P3     288     45.6  25ms
CenterNet2    DLA-BiFPN-P5     288     49.2  30ms

Table 2: Performance of real-time object detectors on COCO validation. Top: we compare CenterNet2 to realtime-FCOS under exactly the same setting. Bottom: we compare to detectors with different backbones and training schedules. *The runtime of EfficientDet is taken from the original paper (Tan et al., 2020) as the official model is not available. Other runtimes are measured on the same machine.

Detector                                  Backbone                AP    AP50  AP75  APS   APM   APL
CornerNet (Law and Deng, 2018)            Hourglass-104           40.6  56.4  43.2  19.1  42.8  54.3
CenterNet (Zhou et al., 2019a)            Hourglass-104           42.1  61.1  45.9  24.1  45.5  52.8
Duan et al. (Duan et al., 2019)           Hourglass-104           44.9  62.4  48.1  25.6  47.4  57.4
RepPoint (Yang et al., 2019)              ResNet101-DCN           45.0  66.1  49.0  26.6  48.6  57.5
MAL (Ke et al., 2020)                     ResNeXt-101             45.9  65.4  49.7  27.8  49.1  57.8
FreeAnchor (Zhang et al., 2019)           ResNeXt-101             46.0  65.6  49.8  27.8  49.5  57.7
CentripetalNet (Dong et al., 2020)        Hourglass-104           46.1  63.1  49.7  25.3  48.7  59.2
FCOS (Tian et al., 2019)                  ResNeXt-101-DCN         46.6  65.9  50.8  28.6  49.1  58.6
TridentNet (Li et al., 2019)              ResNet-101-DCN          46.8  67.6  51.5  28.0  51.2  60.5
CPN (Duan et al., 2020)                   Hourglass-104           47.0  65.0  51.0  26.5  50.2  60.7
SAPD (Zhu et al., 2020b)                  ResNeXt-101-DCN         47.4  67.4  51.1  28.1  50.3  61.5
ATSS (Zhang et al., 2020b)                ResNeXt-101-DCN         47.7  66.6  52.1  29.3  50.8  59.7
BorderDet (Qiu et al., 2020)              ResNeXt-101-DCN         48.0  67.1  52.1  29.4  50.7  60.5
GFL (Li et al., 2020b)                    ResNeXt-101-DCN         48.2  67.4  52.6  29.2  51.7  60.2
PAA (Kim and Lee, 2020)                   ResNeXt-101-DCN         49.0  67.8  53.3  30.2  52.8  62.2
TSD (Song et al., 2020)                   ResNeXt-101-DCN         49.4  69.6  54.4  32.7  52.5  61.0
RepPointv2 (Yang et al., 2019)            ResNeXt-101-DCN         49.4  68.9  53.4  30.3  52.1  62.3
AutoAssign (Zhu et al., 2020a)            ResNeXt-101-DCN         49.5  68.7  54.0  29.9  52.6  62.0
Deformable DETR (Zhu et al., 2020c)       ResNeXt-101-DCN         50.1  69.7  54.6  30.6  52.8  65.6
CascadeRCNN (Cai and Vasconcelos, 2018)   ResNeXt-101-DCN         48.8  67.7  52.9  29.7  51.8  61.8
CenterNet*                                ResNeXt-101-DCN         49.1  67.8  53.3  30.2  52.4  62.0
CenterNet2 (ours)                         ResNeXt-101-DCN         50.2  68.0  55.0  31.2  53.5  63.6

CRCNN-ResNeSt (Zhang et al., 2020a)       ResNeSt-200             49.1  67.8  53.2  31.6  52.6  62.8
GFLV2 (Li et al., 2020a)                  Res2Net-101-DCN         50.6  69.0  55.3  31.3  54.3  63.5
DetectRS (Qiao et al., 2020)              ResNeXt-101-DCN-RFP     53.3  71.6  58.5  33.9  56.5  66.9
EfficientDet-D7x (Tan et al., 2020)       EfficientNet-D7x-BiFPN  55.1  73.4  59.9  -     -     -
ScaledYOLOv4 (Wang et al., 2020)          CSPDarkNet-P7           55.4  73.3  60.7  38.1  59.5  67.4
CenterNet2 (ours)                         Res2Net-101-DCN-BiFPN   56.4  74.0  61.6  38.7  59.7  68.6

Table 3: Comparison to the state of the art on COCO test-dev. We list object detection accuracy with single-scale testing. We retrained our baselines, CascadeRCNN (ResNeXt-101-DCN) and CenterNet*, under comparable settings. Other results are taken from the original publications. Top: detectors with comparable backbones (ResNeXt-101-DCN) and training schedules (2x). Bottom: detectors with their best-fit backbones, input size, and schedules.

All probabilistic two-stage detectors outperform their one-stage and two-stage precursors. In Table 1, each probabilistic two-stage FasterRCNN model improves upon its one-stage precursor by 1.3 to 3.0 percentage points in mAP, and outperforms the original two-stage FasterRCNN by up to 3.8 percentage points in mAP. More interestingly, each two-stage probabilistic FasterRCNN is faster than its one-stage precursor due to the leaner head design. A number of probabilistic two-stage FasterRCNN models are faster than the original two-stage FasterRCNN, due to more efficient FPN levels (P3-P7 vs. P2-P6) and because the probabilistic detectors use fewer proposals (256 vs. 1K). We observe similar trends with the CascadeRCNN models.
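The gains quoted for the FasterRCNN variants can be recomputed directly from the mAP column of Table 1. The snippet below is plain arithmetic on those published numbers; the dictionary layout and variable names are illustrative.

```python
# mAP values copied from Table 1 (COCO validation, Res50-1x).
one_stage = {"RetinaNet": 37.4, "GFL": 40.2, "ATSS": 39.7, "CenterNet*": 40.2}
prob_faster_rcnn = {"RetinaNet": 40.4, "GFL": 41.7, "ATSS": 41.5, "CenterNet*": 41.5}
faster_rcnn_rpn = 37.9  # original two-stage FasterRCNN baseline

# Gain of each probabilistic FasterRCNN over its one-stage precursor.
gain_vs_one_stage = {
    name: round(prob_faster_rcnn[name] - one_stage[name], 1) for name in one_stage
}
# Gain of each probabilistic FasterRCNN over the RPN-based FasterRCNN.
gain_vs_rpn = {
    name: round(m - faster_rcnn_rpn, 1) for name, m in prob_faster_rcnn.items()
}

print(gain_vs_one_stage)
print(gain_vs_rpn)
```

Rounding to one decimal avoids floating-point noise in the differences, since the table itself reports one decimal place.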

The CascadeRCNN-CenterNet design performs best among these probabilistic two-stage models. We thus adopt this basic structure in the following experiments and refer to it as CenterNet2 for brevity.

Real-time models.

Table 2 compares our real-time model to other real-time detectors. CenterNet2 outperforms realtime-FCOS (Tian et al., 2020) by 1.6 mAP with the same backbone and training schedule, and is only 4 ms slower. Using the same FCOS-based backbone with longer training schedules (Tan et al., 2020; Bochkovskiy et al., 2020), it improves upon the original CenterNet (Zhou et al., 2019a; 37.6 mAP), and comfortably outperforms the popular YOLOv4 (Bochkovskiy et al., 2020) and EfficientDet-B2 (Tan et al., 2020) detectors with 45.6 mAP at 40 fps. Using a slightly different FPN structure and combining with self-training (Zoph et al., 2020), CenterNet2 gets 49.2 mAP at 33 fps. While most existing real-time detectors are one-stage, here we show that two-stage detectors can be as fast as one-stage designs, while delivering higher accuracy.
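The frame rates quoted in the text follow from the per-image runtimes in Table 2. The conversion below is a simple illustration; the function name `frames_per_second` and the dictionary keys are hypothetical labels for the table rows.

```python
def frames_per_second(runtime_ms):
    """Convert a per-image runtime in milliseconds to frames per second."""
    return 1000.0 / runtime_ms

# Runtimes copied from Table 2.
runtimes_ms = {
    "FCOS-RT (DLA-BiFPN-P3)": 21,
    "CenterNet2 (DLA-BiFPN-P3)": 25,
    "YOLOv4 (CSPDarknet-53)": 30,
    "CenterNet2 (DLA-BiFPN-P5)": 30,
}

for name, ms in runtimes_ms.items():
    print(f"{name}: {frames_per_second(ms):.1f} fps")
```

A 25 ms runtime corresponds to 40 fps and a 30 ms runtime to roughly 33 fps, matching the figures used in the comparison.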

State-of-the-art comparison.

Table 3 compares our large models to state-of-the-art detectors on COCO test-dev. Using a “standard” large backbone ResNeXt101-DCN, CenterNet2 achieves 50.2 mAP, outperforming all existing models with the same backbone, both one- and two-stage. Note that CenterNet2 outperforms the corresponding CascadeRCNN model with the same backbone by 1.4 percentage points in mAP. This again highlights the benefits of a probabilistic treatment of two-stage detection.

To push the state of the art in object detection, we further switch to a stronger backbone, Res2Net (Gao et al., 2019a) with BiFPN (Tan et al., 2020), a larger input resolution in both training and testing with heavy crop augmentation (ratio 0.1 to 2) (Tan et al., 2020), and a longer schedule with self-training (Zoph et al., 2020) on COCO unlabeled images. Our final model achieves 56.4 mAP with a single model, outperforming all published numbers in the literature. More details about the extra-large model can be found in the supplement.

P3-P7 256p. 4 l. loss prob mAP
37.9 46ms 55ms
38.6 38ms 45ms
38.5 38ms 45ms
38.3 38ms 40ms
38.9 60ms 70ms
38.6 60ms 63ms
39.1 60ms 63ms
40.4 60ms 63ms
Table 4: A detailed ablation between FasterRCNN-RPN (top) and a probabilistic two-stage FasterRCNN-RetinaNet (bottom). FasterRCNN-RetinaNet changes the FPN levels (P2-P6 to P3-P7), uses 256 instead of 1000 proposals, a 4-layer first-stage head, a stricter IoU threshold with focal loss (loss), and multiplies the first and second stage probabilities (prob). All results are reported using standard Res50-1x with multi-scale training.

| Model | mAP |
| --- | --- |
| CascadeRCNN-RPN (P3-P7) | 42.1 |
| CascadeRCNN-RPN w. prob. | 42.1 |
| CascadeRCNN-CenterNet | 42.1 |
| CascadeRCNN-CenterNet w. prob. (Ours) | 42.9 |

Table 5: Ablation of our probabilistic modeling (w. prob.) of CascadeRCNN with the default RPN and CenterNet proposal.

| #prop. | CascadeRCNN mAP | CascadeRCNN AR | CascadeRCNN Runtime | CenterNet2 mAP | CenterNet2 AR | CenterNet2 Runtime |
| --- | --- | --- | --- | --- | --- | --- |
| 1000 | 42.1 | 62.4 | 66ms | 43.0 | 70.8 | 75ms |
| 512 | 41.9 | 60.4 | 56ms | 42.9 | 69.0 | 61ms |
| 256 | 41.6 | 57.4 | 48ms | 42.9 | 66.6 | 57ms |
| 128 | 40.8 | 53.7 | 45ms | 42.7 | 63.5 | 54ms |
| 64 | 39.6 | 49.2 | 42ms | 42.1 | 59.7 | 52ms |

Table 6: Accuracy-runtime trade-off of using different numbers of proposals (#prop.) on COCO validation. We show the overall mAP, the proposal recall (AR), and runtime for both the original CascadeRCNN and our probabilistic two-stage detector (CenterNet2). The results are reported with Res50-1x and multi-scale training. We highlight the default number of proposals in gray.
Figure 3: Visualization of region proposals on COCO validation, contrasting CascadeRCNN and its probabilistic counterpart, CascadeRCNN-CenterNet (or CenterNet2). Left: region proposals from the first stage of CascadeRCNN (RPN). Right: region proposals from the first stage of CenterNet2. For clarity, we only show regions with score above 0.3.

6.1 Ablation studies

From FasterRCNN-RPN to FasterRCNN-RetinaNet.

Table 4 shows the detailed road map from the default RPN-FasterRCNN to a probabilistic two-stage FasterRCNN with RetinaNet as the first stage. First, switching to the RetinaNet-style FPN already gives a favorable improvement. However, directly multiplying the first-stage probability here does not give an improvement, because the original RPN is weak and does not provide a proper likelihood. Making the RPN stronger by adding layers makes it possible to use fewer proposals in the second stage, but does not improve accuracy. Switching to the RetinaNet loss (a stricter IoU threshold and focal loss), the proposal quality is improved, yielding a 0.5 mAP improvement over the original RPN loss. With the improved proposals, incorporating the first-stage score in our probabilistic framework significantly boosts accuracy to 40.4.
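
The score multiplication in the final step (the "prob" column of Table 4) can be sketched in a few lines. This is a minimal illustration rather than the actual implementation: the helper name `combine_scores` and the toy numbers are assumptions. It multiplies each proposal's first-stage object likelihood with its second-stage class probabilities.

```python
def combine_scores(objectness, class_probs):
    """Multiply first-stage objectness with second-stage class probabilities.

    objectness:  one first-stage likelihood per proposal
    class_probs: one list of second-stage class probabilities per proposal
    (hypothetical helper, for illustration only)
    """
    if len(objectness) != len(class_probs):
        raise ValueError("one objectness value is needed per proposal")
    return [[o * p for p in probs] for o, probs in zip(objectness, class_probs)]


# Toy inputs (made up): two proposals, three classes.
objectness = [0.9, 0.2]
class_probs = [
    [0.7, 0.2, 0.1],
    [0.7, 0.2, 0.1],
]

scores = combine_scores(objectness, class_probs)
for row in scores:
    print([round(s, 2) for s in row])
```

Both toy proposals receive the same second-stage class distribution, yet the one with a low first-stage likelihood ends up with a much lower final score. This only helps when the first-stage likelihood is reliable, which is consistent with the ablation: the multiplication gives no gain with the original RPN but a clear gain with the stronger first stage.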

Table 5 reports similar ablations on CascadeRCNN. The observations are consistent: multiplying the first-stage probabilities with the original RPN does not improve accuracy, while using a strong one-stage detector can. This suggests that both ingredients in our design are necessary: a stronger proposal network and incorporating the proposal score.

Trade-off in the number of proposals.

Table 6 shows how the mAP, proposal average recall (AR), and runtime change when using different numbers of proposals for the original RPN-based CascadeRCNN and CenterNet2. Both CascadeRCNN and CenterNet2 get faster with fewer proposals. However, the accuracy of the original CascadeRCNN drops steeply as the number of proposals decreases, while our detector performs well even with relatively few proposals. For example, CascadeRCNN drops by 1.3 mAP when using 128 instead of 1000 proposals, while CenterNet2 only loses 0.3 mAP. The average recall of 128 CenterNet2 proposals (63.5) is higher than that of 1000 RPN proposals (62.4).
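
The comparison above can be recomputed directly from the rows of Table 6. The snippet below is a small worked example; the dictionary layout and the helper `map_drop` are illustrative, while the numbers are copied from the table.

```python
# Rows copied from Table 6: #proposals -> (mAP, AR, runtime in ms).
cascade_rcnn = {
    1000: (42.1, 62.4, 66),
    512: (41.9, 60.4, 56),
    256: (41.6, 57.4, 48),
    128: (40.8, 53.7, 45),
    64: (39.6, 49.2, 42),
}
centernet2 = {
    1000: (43.0, 70.8, 75),
    512: (42.9, 69.0, 61),
    256: (42.9, 66.6, 57),
    128: (42.7, 63.5, 54),
    64: (42.1, 59.7, 52),
}


def map_drop(table, many, few):
    """mAP lost when going from `many` proposals down to `few`."""
    return round(table[many][0] - table[few][0], 1)


print("CascadeRCNN mAP drop, 1000 -> 128:", map_drop(cascade_rcnn, 1000, 128))
print("CenterNet2 mAP drop, 1000 -> 128:", map_drop(centernet2, 1000, 128))
print("AR of 128 CenterNet2 proposals exceeds AR of 1000 RPN proposals:",
      centernet2[128][1] > cascade_rcnn[1000][1])
```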

| Model | mAP | mAP_r | mAP_c | mAP_f | Runtime |
| --- | --- | --- | --- | --- | --- |
| GFL (Li et al., 2020b) | 18.5 | 6.9 | 15.8 | 26.6 | 69ms |
| CenterNet* | 19.1 | 7.8 | 16.3 | 27.4 | 69ms |
| CascadeRCNN | 24.0 | 7.6 | 22.9 | 32.7 | 100ms |
| CenterNet2 | 26.7 | 12.0 | 25.4 | 34.5 | 60ms |
| CenterNet2 w. FedLoss | 28.2 | 18.8 | 26.4 | 34.4 | 60ms |

Table 7: Object detection results on LVIS v1 validation. The experiments are conducted with Res50-1x, multi-scale training, and repeat-factor sampling (Gupta et al., 2019).

| Model | mAP | mAP50 | mAP75 | Runtime |
| --- | --- | --- | --- | --- |
| GFL (Li et al., 2020b) | 18.8 | 28.1 | 20.2 | 56ms |
| CenterNet* | 18.7 | 27.5 | 20.1 | 55ms |
| CascadeRCNN | 21.7 | 31.7 | 23.4 | 67ms |
| CenterNet2 | 22.6 | 31.6 | 24.6 | 56ms |

Table 8: Object detection results on Objects365. The experiments are conducted with Res50-1x, multi-scale training, and class-aware sampling (Shen et al., 2016).

6.2 Large vocabulary detection

Tables 7 and 8 report object detection results on LVIS (Gupta et al., 2019) and Objects365 (Shao et al., 2019), respectively. CenterNet2 improves on the CascadeRCNN baselines by 2.7 mAP on LVIS and 0.8 mAP on Objects365, showing the generality of our approach. On both datasets, two-stage detectors (CascadeRCNN, CenterNet2) outperform one-stage designs (GFL, CenterNet) by significant margins: 5-8 mAP on LVIS and 3-4 mAP on Objects365. On LVIS, the runtime of one-stage detectors increases relative to COCO as the number of categories grows from 80 to 1203, because their dense classification heads score every category at every location. The runtime of CenterNet2, on the other hand, increases by a smaller margin. This highlights the advantages of probabilistic two-stage detection in large-vocabulary settings.
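
A rough count of classification outputs illustrates why dense heads scale poorly with vocabulary size. The sketch below is a back-of-the-envelope model under stated assumptions, not a runtime measurement: the number of feature-map locations (20,000) is hypothetical, the first stage is treated as class-agnostic with one score per location, and anchors and the background class are ignored. The class counts (80 and 1203) and the 256 proposals come from the text.

```python
def dense_head_scores(num_locations, num_classes):
    """One-stage head: a score for every class at every location."""
    return num_locations * num_classes


def two_stage_scores(num_locations, num_proposals, num_classes):
    """Class-agnostic first stage plus per-proposal classification."""
    return num_locations + num_proposals * num_classes


NUM_LOCATIONS = 20_000  # hypothetical number of feature-map locations
NUM_PROPOSALS = 256     # proposals used by the probabilistic detectors
COCO_CLASSES, LVIS_CLASSES = 80, 1203

for name, classes in [("COCO", COCO_CLASSES), ("LVIS", LVIS_CLASSES)]:
    dense = dense_head_scores(NUM_LOCATIONS, classes)
    staged = two_stage_scores(NUM_LOCATIONS, NUM_PROPOSALS, classes)
    print(f"{name}: dense={dense:,} two-stage={staged:,}")
```

In this toy count, moving from 80 to 1203 classes adds over 22 million scores to the dense head but fewer than 0.3 million to the two-stage design, since only the 256 retained proposals are classified.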

Two-stage detectors allow using a more dedicated classification loss in the second stage. In the supplement, we propose a federated loss for handling the federated construction of LVIS. The results are reported in Table 7 (CenterNet2 w. FedLoss).

7 Conclusion

We developed a probabilistic interpretation of two-stage detection. This interpretation motivates the use of a strong first stage that learns to estimate object likelihoods rather than maximize recall. These likelihoods are then combined with the classification scores from the second stage to yield principled probabilistic scores for the final detections. Probabilistic two-stage detectors are both faster and more accurate than their one- or two-stage counterparts. Our work paves the way for an integration of advances in both one- and two-stage designs that combines accuracy with speed.


References

  • A. Bochkovskiy, C. Wang, and H. M. Liao (2020) YOLOv4: optimal speed and accuracy of object detection. arXiv:2004.10934. Cited by: §6.
  • Z. Cai and N. Vasconcelos (2018) Cascade r-cnn: delving into high quality object detection. In CVPR, Cited by: §1, §1, §1, §2, §3, §4, §5, Table 3.
  • N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. arXiv:2005.12872. Cited by: §2, §2, §3.
  • K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, et al. (2019a) Hybrid task cascade for instance segmentation. In CVPR, Cited by: §1, §2.
  • Y. Chen, C. Han, N. Wang, and Z. Zhang (2019b) Revisiting feature alignment for one-stage object detection. arXiv:1908.01570. Cited by: §2, §2.
  • Z. Dong, G. Li, Y. Liao, F. Wang, P. Ren, and C. Qian (2020) CentripetalNet: pursuing high-quality keypoint pairs for object detection. In CVPR, Cited by: Table 3.
  • K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian (2019) CenterNet: object detection with keypoint triplets. ICCV. Cited by: §2, Table 3.
  • K. Duan, L. Xie, H. Qi, S. Bai, Q. Huang, and Q. Tian (2020) Corner proposal network for anchor-free, two-stage object detection. arXiv:2007.13816. Cited by: §2, Table 3.
  • S. Gao, M. Cheng, K. Zhao, X. Zhang, M. Yang, and P. H. Torr (2019a) Res2net: a new multi-scale backbone architecture. TPAMI. Cited by: §1, §5, §6.
  • Y. Gao, H. Shen, D. Zhong, J. Wang, Z. Liu, T. Bai, X. Long, and S. Wen (2019b) A solution for densely annotated large scale object detection task. Cited by: §6.
  • R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, Cited by: §1.
  • A. Gupta, P. Dollar, and R. Girshick (2019) LVIS: a dataset for large vocabulary instance segmentation. In CVPR, Cited by: §1, §2, §6.2, Table 7, §6.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In ICCV, Cited by: §1, §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §5.
  • W. Ke, T. Zhang, Z. Huang, Q. Ye, J. Liu, and D. Huang (2020) Multiple anchor learning for visual object detection. In CVPR, Cited by: Table 3.
  • K. Kim and H. S. Lee (2020) Probabilistic anchor assignment with iou prediction for object detection. In ECCV, Cited by: §2, §2, §3, Table 3.
  • A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, T. Duerig, et al. (2018) The open images dataset v4: unified image classification, object detection, and visual relationship detection at scale. arXiv:1811.00982. Cited by: §2.
  • H. Law and J. Deng (2018) CornerNet: detecting objects as paired keypoints. In ECCV, Cited by: §2, Table 3.
  • X. Li, W. Wang, X. Hu, J. Li, J. Tang, and J. Yang (2020a) Generalized focal loss v2: learning reliable localization quality estimation for dense object detection. arXiv preprint. Cited by: Table 3.
  • X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang (2020b) Generalized focal loss: learning qualified and distributed bounding boxes for dense object detection. In Neural Information Processing Systems, Cited by: §2, §2, Table 1, §5, §5, Table 3, Table 7, Table 8.
  • Y. Li, Y. Chen, N. Wang, and Z. Zhang (2019) Scale-aware trident networks for object detection. ICCV. Cited by: Table 3.
  • T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie (2017a) Feature pyramid networks for object detection.. In CVPR, Cited by: §2, §5.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017b) Focal loss for dense object detection. ICCV. Cited by: §1, §2, §3, Table 1, §5, §5.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In ECCV, Cited by: §1, §6.
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In ECCV, Cited by: §1.
  • A. Newell, K. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. In ECCV, Cited by: §2.
  • S. Qiao, L. Chen, and A. Yuille (2020) DetectoRS: detecting objects with recursive feature pyramid and switchable atrous convolution. arXiv:2006.02334. Cited by: Table 3.
  • H. Qiu, Y. Ma, Z. Li, S. Liu, and J. Sun (2020) BorderDet: border feature for dense object detection. ECCV. Cited by: §2.
  • J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. CVPR. Cited by: §1.
  • J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv:1804.02767. Cited by: §3.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Neural Information Processing Systems, Cited by: §1, §1, §2, §3, §3, §4, §4, §5, §5.
  • H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In CVPR, Cited by: §5.
  • S. Shao, Z. Li, T. Zhang, C. Peng, G. Yu, X. Zhang, J. Li, and J. Sun (2019) Objects365: a large-scale, high-quality dataset for object detection. In ICCV, Cited by: §1, §6.2.
  • L. Shen, Z. Lin, and Q. Huang (2016) Relay backpropagation for effective learning of deep convolutional neural networks. In ECCV, Cited by: Table 8.
  • G. Song, Y. Liu, and X. Wang (2020) Revisiting the sibling head in object detector. In CVPR, Cited by: §2, §2, Table 3.
  • P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, et al. (2020) Scalability in perception for autonomous driving: an open dataset benchmark. CVPR. Cited by: §2.
  • M. Tan, R. Pang, and Q. V. Le (2020) Efficientdet: scalable and efficient object detection. In CVPR, Cited by: §1, §5, §6, §6, Table 2, Table 3.
  • Z. Tian, C. Shen, H. Chen, and T. He (2019) FCOS: fully convolutional one-stage object detection. In ICCV, Cited by: §2, §3, §3, §5, §5, §5, Table 3.
  • Z. Tian, C. Shen, H. Chen, and T. He (2020) Fcos: a simple and strong anchor-free object detector. TPAMI. Cited by: §6.
  • J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders (2013) Selective search for object recognition. IJCV. Cited by: §1, §4.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Neural Information Processing Systems, Cited by: §2.
  • C. Wang, A. Bochkovskiy, and H. M. Liao (2020) Scaled-yolov4: scaling cross stage partial network. arXiv preprint arXiv:2011.08036. Cited by: Table 3.
  • J. Wang, K. Chen, S. Yang, C. C. Loy, and D. Lin (2019) Region proposal by guided anchoring. In CVPR, Cited by: §5.
  • Y. Wu, A. Kirillov, F. Massa, W. Lo, and R. Girshick (2019) Detectron2. Cited by: §5.
  • S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In CVPR, Cited by: §5.
  • Z. Yang, S. Liu, H. Hu, L. Wang, and S. Lin (2019) RepPoints: point set representation for object detection. ICCV. Cited by: §2, Table 3.
  • Z. Yang, Y. Xu, H. Xue, Z. Zhang, R. Urtasun, L. Wang, S. Lin, and H. Hu (2020) Dense reppoints: representing visual objects with dense point sets. In Neural Information Processing Systems, Cited by: §2.
  • F. Yu, D. Wang, E. Shelhamer, and T. Darrell (2018) Deep layer aggregation. In CVPR, Cited by: §2, §5.
  • H. Zhang, C. Wu, Z. Zhang, Y. Zhu, Z. Zhang, H. Lin, Y. Sun, T. He, J. Muller, R. Manmatha, M. Li, and A. Smola (2020a) ResNeSt: split-attention networks. arXiv:2004.08955. Cited by: Table 3.
  • S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li (2020b) Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In CVPR, Cited by: §2, §2, §3, Table 1, §5, Table 3.
  • X. Zhang, F. Wan, C. Liu, R. Ji, and Q. Ye (2019) Freeanchor: learning to match anchors for visual object detection. In Neural Information Processing Systems, Cited by: §2, Table 3.
  • X. Zhou, D. Wang, and P. Krähenbühl (2019a) Objects as points. arXiv:1904.07850. Cited by: §1, §2, §2, §3, §3, §5, §6, Table 3.
  • X. Zhou, J. Zhuo, and P. Krähenbühl (2019b) Bottom-up object detection by grouping extreme and center points. In CVPR, Cited by: §2.
  • B. Zhu, J. Wang, Z. Jiang, F. Zong, S. Liu, Z. Li, and J. Sun (2020a) AutoAssign: differentiable label assignment for dense object detection. arXiv:2007.03496. Cited by: §2, §2, Table 3.
  • C. Zhu, F. Chen, Z. Shen, and M. Savvides (2020b) Soft anchor-point object detection. ECCV. Cited by: Table 3.
  • X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai (2020c) Deformable detr: deformable transformers for end-to-end object detection. arXiv:2010.04159. Cited by: §2, §2, §3, Table 3.
  • C. L. Zitnick and P. Dollár (2014) Edge boxes: locating object proposals from edges. In ECCV, Cited by: §1.
  • B. Zoph, G. Ghiasi, T. Lin, Y. Cui, H. Liu, E. D. Cubuk, and Q. V. Le (2020) Rethinking pre-training and self-training. In Neural Information Processing Systems, Cited by: §1, §5, §6, §6.