Dynamic R-CNN: Towards High Quality Object Detection via Dynamic Training

04/13/2020 ∙ by Hongkai Zhang, et al. ∙ Institute of Computing Technology, Chinese Academy of Sciences

Although two-stage object detectors have continuously advanced the state-of-the-art performance in recent years, the training process itself is far from crystal clear. In this work, we first point out the inconsistency between the fixed network settings and the dynamic training procedure, which greatly affects performance. For example, the fixed label assignment strategy and regression loss function cannot fit the changing distribution of proposals and are thus harmful to training high quality detectors. Consequently, we propose Dynamic R-CNN, which automatically adjusts the label assignment criteria (IoU threshold) and the shape of the regression loss function (parameters of the SmoothL1 loss) based on the statistics of proposals during training. This dynamic design makes better use of the training samples and pushes the detector to fit more high quality samples. Specifically, our method improves upon the ResNet-50-FPN baseline by 1.9% AP and 5.5% AP90 on the MS COCO dataset with no extra overhead. Code and models are available at https://github.com/hkzhang95/DynamicRCNN.



1 Introduction

Benefiting from the advances in deep convolutional neural networks (CNNs) [19, 37, 14], object detection has made remarkable progress in recent years. Modern detection frameworks can be divided into two major categories: one-stage detectors [34, 29, 26] and two-stage detectors [10, 12, 35], and various improvements have been made in recent studies [40, 23, 44, 45, 22, 21, 28, 5]. In the training procedure of both kinds of pipelines, a classifier and a regressor are adopted to solve the recognition and localization tasks, respectively. Therefore, an effective training process plays a crucial role in achieving high quality object detection. (Here, "high quality" refers to the results under high IoU.)

Figure 1: (a) The number of positive proposals under different IoU thresholds during the training process. The curves show that the numbers of positives vary significantly during training, with corresponding changes in the distribution of regression errors (the σ values denote the standard deviations of the corresponding regression errors). (b) The IoU and regression error of the 75th and 10th most accurate proposals, respectively, over the training procedure. These curves further show the improved quality of proposals.

Different from the image classification task, the annotations for the classification task in object detection are the ground-truth boxes in the image. So it is not clear how to assign positive and negative labels to the proposals in classifier training, since their separation can be ambiguous. The most widely used strategy is to set a threshold on the IoU between a proposal and its corresponding ground-truth. As shown in Table 1, training with a certain IoU threshold leads to a classifier whose performance degrades at other IoUs. However, we cannot simply set a high IoU threshold from the beginning of training due to the scarcity of positive samples. The solution Cascade R-CNN [3] provides is to gradually refine the proposals through several stages, which is effective yet time-consuming. The problem is similar for the regressor: during training the quality of proposals improves, yet the parameter β of the SmoothL1 loss is fixed, which leads to insufficient training on the high quality proposals.

To address this issue, we first examine an overlooked fact: the quality of proposals does improve over the training process, as shown in Figure 1. Even under higher IoU thresholds, the number of positive samples still increases significantly during training. Inspired by this observation, we propose Dynamic R-CNN, a simple yet effective method that better exploits the dynamic quality of proposals for object detection. It consists of two components, Dynamic Label Assignment and Dynamic SmoothL1 Loss, designed for the classification and regression branches, respectively. First, to train a classifier that is more discriminative for high IoU proposals, we gradually adjust the IoU threshold for positive/negative samples based on the proposal distribution during training. Specifically, we set the threshold to the IoU of the proposal at a certain percentile, since it reflects the quality of the overall distribution. Second, for regression, we change the shape of the regression loss function to adaptively fit the distribution change of the errors and to ensure the contribution of high quality samples to training. In particular, we adjust the β in the SmoothL1 loss based on the distribution of the regression errors, since β controls the gradient magnitude for small errors (shown in Figure 4).

With this dynamic scheme, we not only alleviate the data scarcity issue at the beginning of training, but also harvest the benefit of high IoU training. The two modules address different parts of the detector and thus work collaboratively towards high quality object detection. Furthermore, despite its simplicity, Dynamic R-CNN brings consistent performance gains on MS COCO [27] with almost no extra computational cost during training and none at all during inference. Extensive experiments also verify that the proposed method generalizes to other baselines with stronger performance.

2 Related Work

Region-based object detectors. The general practice of region-based object detectors is to convert object detection into a bounding box classification problem and a regression problem. In recent years, region-based approaches have been the leading paradigm with top performance. For example, R-CNN [10], Fast R-CNN [12] and Faster R-CNN [35] first generate candidate region proposals and then randomly sample a small batch with a certain foreground-background ratio from all the proposals. These proposals are fed into a second stage to classify the categories and refine the locations at the same time. Later works extended Faster R-CNN to address different problems: R-FCN [7] makes the whole network fully convolutional to improve speed, and FPN [25] proposes a top-down pathway to combine multi-scale features. Besides, various improvements have been witnessed in recent studies [16, 41, 23, 24, 49].

Classification in object detection. Recent research focuses on improving the object classifier from various perspectives [26, 17, 31, 39, 22, 46, 6, 15]. The classification scores in detection not only determine the semantic category for each proposal but also imply its localization accuracy, since Non-Maximum Suppression (NMS) suppresses less confident boxes using more reliable ones, ranking the resultant boxes by classification score. However, as mentioned in IoU-Net [17], the classification score has low correlation with localization accuracy, which leads to noisy rankings and limited performance. Therefore, IoU-Net [17] adopts an extra branch for predicting IoU scores and refining the classification confidence, while Softer NMS [15] devises a KL loss to model the variance of bounding box regression directly and uses it for voting in NMS. Another direction for improvement is to raise the IoU threshold for training high quality classifiers, since training with different IoU thresholds leads to classifiers of corresponding quality. However, as mentioned in Cascade R-CNN [3], directly raising the IoU threshold is impractical due to the vanishing positive samples. Therefore, to produce high quality training samples, some approaches [3, 45] adopt sequential stages, which are effective yet time-consuming. Essentially, these methods ignore the inherent dynamic property of the training procedure, which is useful for training high quality classifiers.

Backbone        IoU  AP    AP50  AP60  AP70  AP80  AP90
ResNet-50-FPN   0.4  35.4  58.2  53.0  44.1  29.2  7.3
ResNet-50-FPN   0.5  36.6  58.1  53.5  45.8  31.5  8.8
ResNet-50-FPN   0.6  35.7  56.0  51.6  44.5  31.6  9.3
Table 1: Object detection results with different IoU thresholds using FPN-based Faster R-CNN [35, 25], evaluated on the COCO minival set.

Bounding box regression. It has been proved that the performance of models depends on the relative weights between losses in multi-task learning [18]. Cascade R-CNN [3] also adopts different regression normalization factors to adjust the magnitude of the regression term in different stages. Besides, Libra R-CNN [31] proposes to promote the regression gradients from the accurate samples, and SABL [42] localizes each side of the bounding box with a lightweight two-step bucketing scheme for precise localization. However, these works mainly focus on a fixed scheme, ignoring the dynamic distribution of learning targets during training.

Dynamic training. Various studies follow the idea of dynamic training. A widely used example is adjusting the learning rate based on the training iteration [30]. Besides, Curriculum Learning [1] and Self-paced Learning [20] focus on improving the order in which training examples are presented. Moreover, for object detection, hard mining methods [36, 26, 31] can also be regarded as a form of dynamic training. However, they do not address core issues in object detection such as the constant label assignment strategy. Our method is complementary to theirs.

3 Dynamic Quality in the Training Procedure

Generally speaking, object detection is complex since it needs to solve two main tasks: recognition and localization. The recognition task needs to distinguish foreground objects from the background and determine their semantic categories, while the localization task needs to find accurate bounding boxes for different objects. To achieve high quality object detection, we need to further explore the training process of both tasks as follows.

3.1 Proposal Classification

How to assign labels is an interesting question for the classifier in object detection. It differs from other classification problems because the annotations for detection are ground-truth boxes in the image. There is no doubt that a proposal should be negative if it does not overlap with any ground-truth, and positive if its overlap with a ground-truth is 100%. However, it is a dilemma whether a proposal with an IoU of, say, 0.5 should be labeled as positive or negative.

In the original Faster R-CNN [35], the labels are assigned by comparing the box’s highest IoU with ground-truths using a pre-defined IoU threshold. Formally, the paradigm can be formulated as follows (we take a binary classification loss for simplicity):

$$\mathrm{label} = \begin{cases} 1, & \text{if } \max \mathrm{IoU}(b, G) \geq T_{+}, \\ 0, & \text{if } \max \mathrm{IoU}(b, G) < T_{-}, \\ -1, & \text{otherwise}, \end{cases} \qquad (1)$$

Here b stands for a bounding box, G represents the set of ground-truths, and T+ and T− are the positive and negative IoU thresholds. The labels 1, 0 and −1 stand for positives, negatives and ignored samples, respectively. In the second stage of Faster R-CNN, T+ and T− are both set to 0.5 by default [11]. So the definition of positives and negatives is essentially hand-crafted.
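For concreteness, here is a minimal PyTorch sketch of this fixed-threshold assignment; the function name and tensor layout are our own illustration rather than the paper's code.

```python
import torch

def assign_labels(ious, t_pos=0.5, t_neg=0.5):
    """Label proposals per Equation (1): 1 = positive, 0 = negative, -1 = ignored.

    ious: (num_proposals, num_gts) matrix of pairwise IoUs.
    """
    max_iou, _ = ious.max(dim=1)             # highest IoU with any ground-truth
    labels = torch.full_like(max_iou, -1.0)  # default: ignored
    labels[max_iou < t_neg] = 0.0            # negatives
    labels[max_iou >= t_pos] = 1.0           # positives
    return labels
```

With T+ = T− = 0.5 as in the default second stage, the ignored band is empty; the DLA module introduced later replaces both constants with a single dynamic threshold.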

Since the goal of the classifier is to distinguish the positives from the negatives, training with different IoU thresholds leads to classifiers of corresponding quality [3]. We again verify this on the COCO benchmark [27] with an FPN-based Faster R-CNN detector. From Table 1 we can see that although the overall mAP does not change much, different IoU thresholds lead to different performance under different metrics.

Therefore, to achieve high quality object detection, we need to train the classifier with a high IoU threshold. However, as mentioned in Cascade R-CNN [3], directly raising the IoU threshold is impractical due to the vanishing positive samples. Cascade R-CNN uses several sequential stages to lift the IoU of the proposals, which is effective yet time-consuming. Is there a way to get the best of both worlds? As noted above, the quality of the proposals actually improves along with training. This observation inspires us to take a progressive approach: at the beginning, the proposal network is not capable of producing enough high quality proposals, so we use a lower IoU threshold to better accommodate these imperfect proposals in second stage training. As training goes on, the quality of proposals improves and we gradually obtain enough high quality proposals, so we can increase the threshold to better utilize them and train a high quality detector that is more discriminative at higher IoU. We formulate this process in the following section.

3.2 Bounding Box Regression

Figure 2: The distribution of the regression targets at different iterations and IoU thresholds (we randomly select some points for clarity). The first two columns show that, at the same IoU threshold, the regression targets become more concentrated as training proceeds. Comparing the last two columns shows that changing the IoU threshold also significantly changes the distribution for the positives.

The task of bounding box regression is to regress the positive candidate bounding box b to a target ground-truth g. This is learned under the supervision of the regression loss function L_reg. To encourage a regression target invariant to scale and location, L_reg operates on the offset Δ = (δ_x, δ_y, δ_w, δ_h) defined by

$$\delta_x = \frac{g_x - b_x}{b_w}, \quad \delta_y = \frac{g_y - b_y}{b_h}, \quad \delta_w = \log\frac{g_w}{b_w}, \quad \delta_h = \log\frac{g_h}{b_h}. \qquad (2)$$

Since bounding box regression is performed on these offsets, the absolute values in Equation (2) can be very small. To balance the different terms in multi-task learning, Δ is usually normalized by a pre-defined mean and standard deviation (stdev), as widely used in many works [35, 25, 13].
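As a reference point, the offset computation with normalization might look as follows, assuming boxes in center-size (x, y, w, h) format; the normalization constants shown are those commonly used in FPN-based Faster R-CNN implementations, not values prescribed by this paper.

```python
import torch

def bbox_offsets(proposals, gts, mean=(0., 0., 0., 0.), std=(0.1, 0.1, 0.2, 0.2)):
    """Compute normalized regression targets per Equation (2).

    proposals, gts: (N, 4) tensors in (x_center, y_center, w, h) format.
    """
    bx, by, bw, bh = proposals.unbind(dim=1)
    gx, gy, gw, gh = gts.unbind(dim=1)
    delta = torch.stack([(gx - bx) / bw,
                         (gy - by) / bh,
                         torch.log(gw / bw),
                         torch.log(gh / bh)], dim=1)
    # normalize by a pre-defined mean and stdev to balance the loss terms
    return (delta - delta.new_tensor(mean)) / delta.new_tensor(std)
```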

However, we discover that the distribution of the regression targets shifts during training. As shown in Figure 2, we calculate the statistics of the regression targets under different iterations and IoU thresholds. First, from the first two columns, we find that under the same IoU threshold for positives, the mean and stdev decrease as training goes on, due to the improved quality of proposals. With the same normalization factors, the contributions of those high quality samples are reduced under the definition of the SmoothL1 loss, which is harmful to the training of high quality regressors. Moreover, with a higher IoU threshold, the quality of the positive samples is further enhanced, so their contributions are reduced even more, which greatly limits the overall performance. Therefore, to achieve high quality object detection, we need to fit the distribution change and adjust the shape of the regression loss function to compensate for the increase of high quality proposals.

4 Dynamic R-CNN

To better exploit the dynamic property of the training procedure, we propose Dynamic R-CNN which is shown in Figure 3. Our key insight is adjusting the second stage classifier and regressor to fit the distribution change of proposals. The two components designed for the classification and localization branch will be elaborated in the following sections.

Figure 3: The overall pipeline of the proposed Dynamic R-CNN. Considering the dynamic property of the training process, Dynamic R-CNN consists of two main components: (a) the Dynamic Label Assignment (DLA) process and (b) the Dynamic SmoothL1 Loss (DSL), which work from different perspectives. The left part of (a) shows that there are more high quality proposals as training goes on. With the improved quality of proposals, DLA automatically raises the IoU threshold based on the proposal distribution. Positive (green) and negative (red) labels are then assigned to the proposals by DLA, as shown in the right part of (a). Meanwhile, to fit the distribution change and compensate for the increase of high quality proposals, the shape of the regression loss function is adjusted correspondingly in (b). Best viewed in color.

4.1 Dynamic Label Assignment

The Dynamic Label Assignment (DLA) process is illustrated in Figure 3 (a). Based on the common practice of label assignment in Equation (1) in object detection, the DLA module can be formulated as follows:

$$\mathrm{label} = \begin{cases} 1, & \text{if } \max \mathrm{IoU}(b, G) \geq T_{now}, \\ 0, & \text{if } \max \mathrm{IoU}(b, G) < T_{now}, \end{cases} \qquad (3)$$

where T_now stands for the current IoU threshold. Considering the dynamic property of training, the distribution of proposals changes over time. Our DLA updates T_now automatically based on the statistics of the proposals to fit this distribution change. Specifically, we first calculate the IoUs I between the proposals and their target ground-truths, and then select the K_I-th largest value from I as the threshold T_now. As training goes on, T_now increases gradually, reflecting the improved quality of the proposals. In practice, we calculate the K_I-th largest value in each batch and update T_now every C iterations using the mean of these values, to enhance the robustness of training. Note that the calculation of IoUs is already performed by the original method, so our method introduces almost no additional complexity. The resultant IoU thresholds used in training are illustrated in Figure 3 (a).
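The following is a minimal sketch of how this update could be implemented; the class and attribute names are our own, K_I = 75 follows the ablations in Section 5.5, C = 100 and the 0.4 floor are assumptions based on Table 6 and the clipping mentioned in Section 5.4, and keeping T_now monotone non-decreasing is our simplification of the increasing trend in Figure 5.

```python
import torch

class DynamicIoUThreshold:
    """Track the K_I-th largest matched IoU per batch and refresh T_now
    every C iterations with the mean of the recorded values (DLA)."""

    def __init__(self, t_init=0.4, k_i=75, update_period=100):
        self.t_now = t_init                 # current threshold T_now
        self.k_i = k_i                      # K_I
        self.update_period = update_period  # C
        self.record = []                    # per-batch K_I-th largest IoUs

    def step(self, matched_ious, iteration):
        # matched_ious: (num_proposals,) IoU of each proposal with its best GT
        n = matched_ious.numel()
        if n >= self.k_i:
            # the K_I-th largest value is the (n - K_I + 1)-th smallest one
            kth = torch.kthvalue(matched_ious, n - self.k_i + 1).values
            self.record.append(kth.item())
        if iteration % self.update_period == 0 and self.record:
            # mean over the interval for robustness; never drop below the start
            self.t_now = max(self.t_now, sum(self.record) / len(self.record))
            self.record = []
        return self.t_now
```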

4.2 Dynamic SmoothL1 Loss

The localization task for object detection is supervised by the commonly used SmoothL1 Loss, which can be formulated as follows:

$$\mathrm{SmoothL1}(x, \beta) = \begin{cases} 0.5|x|^2/\beta, & \text{if } |x| < \beta, \\ |x| - 0.5\beta, & \text{otherwise}. \end{cases} \qquad (4)$$

Here x stands for the regression error, and β is a hyper-parameter controlling the range in which a softer, L2-like loss is used instead of the original L1 loss. Considering the robustness of training, β is set to 1.0 by default to prevent exploding losses from the poorly trained network in the early stages. We illustrate the impact of β in Figure 4: changing β leads to different loss and gradient curves. A smaller β accelerates the saturation of the gradient magnitude, so that smaller errors contribute more to network training.

Figure 4: Curves for (a) the loss and (b) the gradient of the SmoothL1 loss with different β. β is set to 1.0 by default in the R-CNN part.
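To make the effect of β tangible, here is a small self-contained example; smooth_l1 is our own implementation of Equation (4), not library code.

```python
import torch

def smooth_l1(x, beta=1.0):
    """SmoothL1 per Equation (4): quadratic below beta, linear above."""
    abs_x = torch.abs(x)
    return torch.where(abs_x < beta, 0.5 * abs_x ** 2 / beta, abs_x - 0.5 * beta)

# The gradient is x / beta for |x| < beta and sign(x) otherwise, so shrinking
# beta makes the same small error contribute a much larger gradient:
x = torch.tensor([0.05], requires_grad=True)
smooth_l1(x, beta=1.0).sum().backward()
print(x.grad)   # tensor([0.0500]) under the default beta
x.grad = None
smooth_l1(x, beta=0.1).sum().backward()
print(x.grad)   # tensor([0.5000]): ten times larger with beta = 0.1
```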

As analyzed in Section 3.2, we need to fit the distribution change and adjust the regression loss function to compensate for the high quality samples. So we propose Dynamic SmoothL1 Loss (DSL), which changes the shape of the loss function to gradually focus on high quality samples:

$$\mathrm{DSL}(x, \beta_{now}) = \begin{cases} 0.5|x|^2/\beta_{now}, & \text{if } |x| < \beta_{now}, \\ |x| - 0.5\beta_{now}, & \text{otherwise}. \end{cases} \qquad (5)$$

Similar to DLA, DSL changes the value of β_now according to the statistics of the regression errors, which reflect the localization accuracy. To be more specific, we first obtain the regression errors E between the proposals and their target ground-truths, then select the K_β-th smallest value from E to update β_now in Equation (5). Similarly, we update β_now every C iterations, using the median of the K_β-th smallest errors recorded in each batch. We choose the median instead of the mean used in the classification branch because we find more outliers in the regression errors. In this dynamic way, an appropriate β_now is adopted automatically, as shown in Figure 3 (b), which better exploits the training samples and leads to a high quality regressor.
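A matching sketch for the β_now update, mirroring the DLA helper above; the class is again illustrative, K_β = 10 follows the best setting reported in Section 5.5, the 1.0 start mirrors the clipping in Section 5.4, and keeping β_now monotone non-increasing is our simplification of the decreasing trend in Figure 5.

```python
import torch

class DynamicSmoothL1Beta:
    """Track the K_beta-th smallest regression error per batch and refresh
    beta_now every C iterations with the median of the recorded values (DSL)."""

    def __init__(self, beta_init=1.0, k_beta=10, update_period=100):
        self.beta_now = beta_init           # current beta_now
        self.k_beta = k_beta                # K_beta
        self.update_period = update_period  # C
        self.record = []                    # per-batch K_beta-th smallest errors

    def step(self, reg_errors, iteration):
        # reg_errors: (num_positives,) regression errors of positive proposals
        if reg_errors.numel() >= self.k_beta:
            kth = torch.kthvalue(reg_errors, self.k_beta).values
            self.record.append(kth.item())
        if iteration % self.update_period == 0 and self.record:
            # median rather than mean: regression errors have more outliers
            self.beta_now = min(self.beta_now,
                                torch.tensor(self.record).median().item())
            self.record = []
        return self.beta_now
```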

To summarize the whole method, we describe the proposed Dynamic R-CNN in Algorithm 1. Besides the proposals P and ground-truths G, Dynamic R-CNN has three hyperparameters: the IoU-threshold top-k K_I, the β top-k K_β, and the update iteration count C. Note that compared with the baseline we only introduce one additional hyperparameter, and we will show that the results are actually quite robust to the choice of these hyperparameters.

Input: Proposal set P, ground-truth set G; IoU-threshold top-k K_I, β top-k K_β, update iteration count C.
Output: Trained object detector D.
1: Initialize the IoU threshold T_now and the SmoothL1 β_now.
2: Build two empty sets S_I and S_E for recording the IoUs and regression errors.
3: for i = 1 to max_iter do
4:     Obtain the matched IoUs I and regression errors E between P and G.
5:     Select the K_I-th largest IoU from I and the K_β-th smallest error from E.
6:     Record the selected values: add the IoU to S_I and the error to S_E.
7:     if i mod C = 0 then
8:         Compute the new IoU threshold: T_now = Mean(S_I).
9:         Compute the new SmoothL1 β: β_now = Median(S_E).
10:        Update the current values and empty S_I and S_E.
11:    end if
12:    Train the network with T_now and β_now.
13: end for
14: return the improved object detector D.
Algorithm 1 Dynamic R-CNN
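Assuming the DynamicIoUThreshold and DynamicSmoothL1Beta helpers sketched above are in scope, the toy loop below emulates Algorithm 1 on synthetic statistics in which matched IoUs drift upward and regression errors shrink, as in Figure 1; real training would of course derive these statistics from the detector itself.

```python
import torch

torch.manual_seed(0)
dla = DynamicIoUThreshold(t_init=0.4, k_i=75, update_period=20)
dsl = DynamicSmoothL1Beta(beta_init=1.0, k_beta=10, update_period=20)

for it in range(1, 101):
    progress = it / 100.0
    # synthetic stand-ins: matched IoUs improve and errors shrink over time
    ious = torch.clamp(0.40 + 0.30 * progress + 0.15 * torch.randn(512), 0.0, 1.0)
    errors = torch.abs(torch.randn(128)) * (1.0 - 0.7 * progress)
    t_now = dla.step(ious, it)
    beta_now = dsl.step(errors, it)

print(f"T_now={t_now:.2f}, beta_now={beta_now:.3f}")
# T_now climbs above its 0.4 start while beta_now falls below 1.0,
# reproducing the trends of Figure 5 on toy data.
```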

5 Experiments

5.1 Dataset and Evaluation Metrics

Experimental results are presented on the bounding box detection track of the challenging MS COCO [27] dataset, which includes 80 object classes. Following common practice [26, 13], we use the COCO trainval35k split (the union of the 80k images from train and a random 35k subset of images from the 40k-image val split) for training, and report ablation studies on the minival split (the remaining 5k images from val). We also submit our main results to the evaluation server for the final performance on the test-dev split, which has no disclosed labels. The COCO-style Average Precision (AP) is chosen as the main evaluation metric; it averages AP across IoU thresholds from 0.5 to 0.95 with an interval of 0.05. We also include other metrics to better understand the behavior of the proposed method.

Method                   Backbone         AP    AP50  AP75  APS   APM   APL
Faster R-CNN             ResNet-50        37.3  58.5  40.6  20.3  39.2  49.1
Faster R-CNN+2×          ResNet-50        38.1  58.9  41.5  20.5  40.0  50.0
Faster R-CNN             ResNet-101       39.3  60.5  42.7  21.3  41.8  51.7
Faster R-CNN+2×          ResNet-101       39.9  60.6  43.5  21.4  42.4  52.1
Faster R-CNN+2×+MST      ResNet-101       42.8  63.8  46.8  24.8  45.6  55.6
Faster R-CNN+2×+MST      ResNet-101-DCN   44.8  65.5  48.8  26.2  47.6  58.1
Faster R-CNN+2×+MST*     ResNet-101-DCN   46.9  68.1  51.4  30.6  49.6  58.1
Ours:
Dynamic R-CNN            ResNet-50        39.1  58.0  42.8  21.3  40.9  50.3
Dynamic R-CNN+2×         ResNet-50        39.9  58.6  43.7  21.6  41.5  51.9
Dynamic R-CNN            ResNet-101       41.2  60.1  45.1  22.5  43.6  53.2
Dynamic R-CNN+2×         ResNet-101       42.0  60.7  45.9  22.7  44.3  54.3
Dynamic R-CNN+2×+MST     ResNet-101       44.7  63.6  49.1  26.0  47.4  57.2
Dynamic R-CNN+2×+MST     ResNet-101-DCN   46.9  65.9  51.3  28.1  49.6  60.0
Dynamic R-CNN+2×+MST*    ResNet-101-DCN   49.2  68.6  54.0  32.5  51.7  60.3
Table 2: Comparisons with different baselines on the COCO test-dev set. We report our re-implemented results. "MST" and "*" stand for multi-scale training and testing, respectively; "2×" is a training schedule that follows the settings explained in Detectron [11].

5.2 Implementation Details

For fair comparisons, all experiments are implemented in PyTorch [32] and follow the settings of maskrcnn-benchmark (https://github.com/facebookresearch/maskrcnn-benchmark) and SimpleDet [4]. We adopt FPN-based Faster R-CNN [35, 25] with a ResNet-50 [14] model pre-trained on ImageNet [9] as our baseline. All models are trained on COCO trainval35k and tested on minival with an image short side of 800 pixels unless noted otherwise. Due to the scarcity of positives in the training procedure, we set the NMS threshold of the RPN to 0.85 instead of 0.7 for all experiments.
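For reference, with the yacs-style configuration used by maskrcnn-benchmark this change amounts to a single override; the config key below is taken from that codebase's defaults file and is an assumption on our part.

```python
from maskrcnn_benchmark.config import cfg  # assumes the standard maskrcnn-benchmark layout

# Raise the RPN NMS threshold from its 0.7 default so that more overlapping
# proposals survive, easing the scarcity of positives early in training.
cfg.merge_from_list(["MODEL.RPN.NMS_THRESH", 0.85])
```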

5.3 Main Results

We compare Dynamic R-CNN with the corresponding baselines on the COCO test-dev set in Table 2. For fair comparisons, we report our re-implemented results here.

First, we prove that our method can work on different backbones. With the dynamic design, Dynamic R-CNN achieves 39.1% AP with ResNet-50 [14], which is 1.8 points higher than the FPN-based Faster R-CNN baseline. With a stronger backbone like ResNet-101, Dynamic R-CNN can also achieve consistent gains (+1.9 points).

Then, our dynamic design is also compatible with other training and testing tricks. The results are consistently improved by progressively adding a longer training schedule, multi-scale training and testing, and deformable convolutions [49]. With the best combination, our Dynamic R-CNN achieves 49.2% AP, still 2.3 points higher than the Faster R-CNN baseline. Note that we only use five scales in multi-scale testing, and the result would be better if more scales were adopted.

These experimental results show the effectiveness and robustness of our method since it can work together with different backbones and multiple training and testing skills. It should also be noted that the performance gains are almost free.

Backbone        DLA  DSL  AP    ΔAP   AP50  AP60  AP70  AP80  AP90
ResNet-50-FPN             37.0  -     58.0  53.5  46.0  32.6  9.7
ResNet-50-FPN        ✓    38.0  +1.0  57.6  53.5  46.7  34.4  13.2
ResNet-50-FPN   ✓         38.2  +1.2  57.5  53.6  47.1  35.2  12.6
ResNet-50-FPN   ✓    ✓    38.9  +1.9  57.3  53.6  47.4  36.3  15.2
Table 3: Effects of each component in our Dynamic R-CNN. Results are reported using ResNet-50-FPN on the COCO minival set.
Figure 5: Trends of (a) the IoU threshold and (b) the SmoothL1 β under different settings of our method. Clearly the distributions change substantially during training.

5.4 Ablation Experiments

To show the effectiveness of each proposed component, we report the overall ablation studies in Table 3.

1) Dynamic Label Assignment (DLA). DLA brings 1.2 points higher box AP than the ResNet-50-FPN baseline. More specifically, results at higher IoU metrics are consistently improved, especially the 2.9 points gain in AP90. This proves the effectiveness of our method in pushing the classifier to be more discriminative at higher IoU thresholds.

2) Dynamic SmoothL1 Loss (DSL). DSL improves the box AP from 37.0 to 38.0. Results at higher IoU metrics like AP80 and AP90 are improved substantially, which validates the effectiveness of changing the loss function to compensate for the high quality samples during training. Moreover, as analyzed in Section 3.2, DLA further improves the quality of positives, so their contributions are reduced even more; applying DSL on top of DLA therefore also brings reasonable gains, especially on the high quality metrics. To sum up, Dynamic R-CNN improves the baseline by 1.9 points in AP and by 5.5 points in AP90.

3) Illustration of dynamic training. To further illustrate the dynamics of the training procedure, we show the trends of the IoU threshold and the SmoothL1 β under different settings of our method in Figure 5. Here we clip the values of the IoU threshold and β to 0.4 and 1.0, respectively, at the beginning of training. Regardless of the specific values of K_I and K_β, the overall trend of the IoU threshold is increasing while that of the SmoothL1 β is decreasing during training. These results again verify that the proposed method works as expected.

5.5 Studies on the effect of hyperparameters

Ablation study on K_I in DLA. Experimental results with different K_I for Dynamic Label Assignment are shown in Table 4. Compared to the Faster R-CNN baseline, DLA achieves consistent AP gains regardless of the choice of K_I. These results prove the effectiveness of the DLA part and the robustness to the hyperparameter K_I.

Backbone        Setting     AP    AP50  AP60  AP70  AP80  AP90
ResNet-50-FPN   Baseline    37.0  58.0  53.5  46.0  32.6  9.7
ResNet-50-FPN   K_I = 64    38.1  57.2  53.3  46.8  35.1  12.8
ResNet-50-FPN   K_I = 75    38.2  57.5  53.6  47.1  35.2  12.6
ResNet-50-FPN   K_I = 100   37.9  57.9  53.8  46.9  34.2  11.6
Table 4: Ablation studies for the hyperparameter K_I of Dynamic Label Assignment on the COCO minival set.

Moreover, the performance under various metrics changes with K_I. Since K_I controls the label assignment process, choosing K_I as 64/75/100 means that nearly 12.5%/15%/20% of the whole batch (512 sampled proposals per image) is selected as positives. Generally speaking, a smaller K_I increases the quality of the selected samples, leading to better accuracy under higher IoU metrics like AP90; conversely, a larger K_I helps more at lower IoU metrics. Finally, we find that setting K_I to 75 achieves the best trade-off and use it as the default value in further experiments. All these ablations prove the effectiveness and robustness of Dynamic Label Assignment. Note that, as illustrated in Table 1, merely adjusting the thresholds but leaving them fixed results in inferior performance.

Backbone        Setting             AP    AP50  AP60  AP70  AP80  AP90
ResNet-50-FPN   β = 1.0 (default)   37.0  58.0  53.5  46.0  32.6  9.7
ResNet-50-FPN   fixed β (larger)    35.9  57.7  53.2  45.1  30.1  8.3
ResNet-50-FPN   fixed β (smaller)   37.5  57.6  53.3  46.4  33.5  11.3
ResNet-50-FPN   DSL, K_β = 15       37.6  57.3  53.1  46.0  33.9  12.5
ResNet-50-FPN   DSL, K_β = 10       38.0  57.6  53.5  46.7  34.4  13.2
ResNet-50-FPN   DSL, K_β = 8        37.6  57.5  53.3  45.9  33.9  12.4
Table 5: Ablation studies for the hyperparameter K_β of Dynamic SmoothL1 Loss on the COCO minival set. The first three rows use fixed values of β; the last three use DSL with different K_β.

Ablation study on K_β in DSL. We show the results in Table 5. We first conduct experiments with different fixed values of β on the original Faster R-CNN and empirically find that a smaller β leads to better performance. Then, experiments of DSL under different K_β show the effect of this hyperparameter. Regardless of the specific value of K_β, DSL achieves consistent improvements over the various fine-tuned baselines, which proves the effectiveness of the DSL part. Specifically, with our best setting, DSL brings 1.0 point higher AP than the baseline, and the improvement lies mainly in the high quality metrics like AP90 (+3.5 points). These results show that DSL is effective in compensating for high quality samples and leads to a better regressor thanks to the dynamic design.

Backbone        Setting    AP    AP50  AP60  AP70  AP80  AP90
ResNet-50-FPN   Baseline   37.0  58.0  53.5  46.0  32.6  9.7
ResNet-50-FPN   C = 20     38.0  57.4  53.5  47.0  35.0  12.5
ResNet-50-FPN   C = 100    38.2  57.5  53.6  47.1  35.2  12.6
ResNet-50-FPN   C = 500    38.1  57.6  53.5  47.2  34.8  12.6
Table 6: Ablation studies for the iteration count C on the COCO minival set. Experiments are conducted with DLA only.

Ablation study on the iteration count C. As mentioned in Section 4, out of concern for robustness we update the IoU threshold every C iterations using the mean of the IoU values recorded in the last interval. To show the effect of the iteration count, we try different values of C with the proposed method. As shown in Table 6, setting C to 20, 100 or 500 leads to very similar results, which proves the robustness to this hyperparameter.

Backbone         Method               Inference speed
ResNet-50-FPN    Dynamic R-CNN        13.9 FPS
ResNet-50-FPN    Cascade R-CNN        11.2 FPS
ResNet-50-FPN    Dynamic Mask R-CNN   11.3 FPS
ResNet-50-FPN    Cascade Mask R-CNN   7.3 FPS
ResNet-101-FPN   Dynamic R-CNN        11.2 FPS
ResNet-101-FPN   Cascade R-CNN        9.6 FPS
ResNet-101-FPN   Dynamic Mask R-CNN   9.8 FPS
ResNet-101-FPN   Cascade Mask R-CNN   6.4 FPS
Table 7: Comparison of inference speed between the cascade manner and our dynamic design under different methods and backbones. Speeds are reported per image on a single RTX 2080Ti GPU.

Complexity and speed. As shown in Algorithm 1, the main computational cost of our method lies in calculating the IoUs and the regression errors. However, these statistics are already computed by the original method, so the additional overhead of Dynamic R-CNN reduces to calculating the mean or median of a short vector, which barely increases the training time. Moreover, since our method only changes the training procedure, no additional overhead is introduced during inference.

Our advantage over other high quality detectors like Cascade R-CNN lies in training efficiency and inference speed. As shown in Table 7, though the results of Cascade R-CNN are slightly better than ours, it increases the training time and slows down inference, while our method does not. Specifically, under the ResNet-50-FPN backbone, our Dynamic R-CNN achieves 13.9 FPS, roughly 1.25 times faster than Cascade R-CNN (11.2 FPS). The difference becomes more apparent as the backbone gets smaller, because the main overhead of Cascade R-CNN is the two additional heads it introduces. Moreover, with larger heads the cascade manner slows inference down further: comparing Cascade Mask R-CNN with Dynamic Mask R-CNN, our method runs about 1.5 times faster under both backbones.

Backbone         Task  +Dynamic  AP    AP50  AP75  APS   APM   APL
ResNet-50-FPN    bbox            37.5  58.0  40.7  20.2  40.2  50.9
ResNet-50-FPN    bbox  ✓         39.4  57.6  43.3  21.3  41.8  53.2
ResNet-50-FPN    segm            33.8  54.6  36.0  14.5  35.7  51.5
ResNet-50-FPN    segm  ✓         34.8  55.0  37.5  15.0  36.9  52.1
ResNet-101-FPN   bbox            39.7  60.7  43.2  21.9  42.9  54.3
ResNet-101-FPN   bbox  ✓         41.8  60.4  45.8  23.4  44.9  55.6
ResNet-101-FPN   segm            35.6  56.9  37.7  15.7  38.0  53.6
ResNet-101-FPN   segm  ✓         36.7  57.5  39.4  16.6  39.3  54.0
Table 8: The generalization capacity of our Dynamic R-CNN. We apply the idea of dynamic training to Mask R-CNN under different backbones. "bbox" and "segm" stand for the object detection and instance segmentation results, respectively. Results are reported on the COCO minival set.
Method               Backbone          AP    AP50  AP75  APS   APM   APL
RetinaNet [26]       ResNet-101        39.1  59.1  42.3  21.8  42.7  50.2
CornerNet [21]       Hourglass-104     40.5  56.5  43.1  19.4  42.7  53.9
FCOS [40]            ResNet-101        41.0  60.7  44.1  24.0  44.1  51.0
FreeAnchor [47]      ResNet-101        41.8  61.1  44.9  22.6  44.7  53.9
RepPoints [44]       ResNet-101-DCN    45.0  66.1  49.0  26.6  48.6  57.5
CenterNet [48]       Hourglass-104     45.1  63.9  49.3  26.6  47.1  57.7
ATSS [46]            ResNet-101-DCN    46.3  64.7  50.4  27.7  49.8  58.4
Faster R-CNN [25]    ResNet-101        36.2  59.1  39.0  18.2  39.0  48.2
Mask R-CNN [13]      ResNet-101        38.2  60.3  41.7  20.1  41.1  50.2
Regionlets [43]      ResNet-101        39.3  59.8  -     21.7  43.7  50.9
Libra R-CNN [31]     ResNet-101        41.1  62.1  44.7  23.4  43.7  52.5
Cascade R-CNN [3]    ResNet-101        42.8  62.1  46.3  23.7  45.5  55.2
SNIP [38]            ResNet-101-DCN    44.4  66.2  49.9  27.3  47.4  56.9
DCNv2 [49]           ResNet-101-DCN    46.0  67.9  50.8  27.8  49.1  59.5
TridentNet [23]      ResNet-101-DCN    48.4  69.7  53.5  31.8  51.3  60.3
Dynamic R-CNN        ResNet-101        42.0  60.7  45.9  22.7  44.3  54.3
Dynamic R-CNN*       ResNet-101-DCN    50.1  68.3  55.6  32.8  53.0  61.2
Table 9: Comparisons of single-model results for different detectors on the COCO test-dev set.

5.6 Generalization Capacity

Since dynamic training is a general viewpoint, we believe it can be adopted in different methods. To validate the generalization capacity, we further apply the dynamic design to Mask R-CNN with different backbones. As shown in Table 8, adopting the dynamic design achieves consistent gains (+1.9 and +2.1 points) in box AP across backbones. Moreover, we also evaluate the results on the instance segmentation task and find that our dynamic design again achieves consistent gains regardless of the backbone. It is worth noting that we only adopt DLA and DSL, which are designed for object detection, so these results further demonstrate the generality and effectiveness of our dynamic design in improving the training procedure of current state-of-the-art object detectors.

5.7 Comparison with State-of-the-Arts

We compare Dynamic R-CNN with the state-of-the-art object detection approaches on the COCO test-dev set in Table 9.

Considering that various backbones and training/testing settings are adopted by different detectors (including deformable convolutions [8, 49], image pyramid schemes [38], large-batch Batch Normalization [33] and Soft-NMS [2]), we report the results of our method in two settings.

Dynamic R-CNN applies our method to FPN-based Faster R-CNN with ResNet-101 as the backbone and achieves 42.0% AP without bells and whistles. Dynamic R-CNN* additionally adopts the image pyramid scheme (multi-scale training and testing), deformable convolutions and Soft-NMS, which further improves the results to 50.1% AP, outperforming all previous detectors.

6 Conclusion

In this paper, we provide a thorough analysis of the training process of detectors and find that the fixed training scheme limits the overall performance. Based on this dynamic viewpoint, we propose Dynamic R-CNN to better exploit the training procedure. With the help of simple but effective components, namely Dynamic Label Assignment and Dynamic SmoothL1 Loss, Dynamic R-CNN brings significant improvements on the challenging COCO dataset with no extra cost. Extensive experiments with various detectors and backbones validate the generality and effectiveness of Dynamic R-CNN. We hope this dynamic viewpoint inspires further research.

References

  • [1] Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In ICML, Cited by: §2.
  • [2] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis (2017) Soft-NMS – improving object detection with one line of code. In ICCV, Cited by: §5.7.
  • [3] Z. Cai and N. Vasconcelos (2018) Cascade R-CNN: delving into high quality object detection. In CVPR, Cited by: §1, §2, §2, §3.1, §3.1, Table 9.
  • [4] Y. Chen, C. Han, Y. Li, Z. Huang, Y. Jiang, N. Wang, and Z. Zhang (2019) SimpleDet: a simple and versatile distributed framework for object detection and instance recognition. JMLR 20 (156), pp. 1–8. Cited by: §5.2.
  • [5] Y. Chen, C. Han, N. Wang, and Z. Zhang (2019) Revisiting feature alignment for one-stage object detection. arXiv:1908.01570. Cited by: §1.
  • [6] B. Cheng, Y. Wei, H. Shi, R. Feris, J. Xiong, and T. Huang (2018) Revisiting RCNN: on awakening the classification power of faster RCNN. In ECCV, Cited by: §2.
  • [7] J. Dai, Y. Li, K. He, and J. Sun (2016) R-FCN: object detection via region-based fully convolutional networks. In NIPS, Cited by: §2.
  • [8] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In ICCV, Cited by: §5.7.
  • [9] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR, Cited by: §5.2.
  • [10] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, Cited by: §1, §2.
  • [11] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He (2018) Detectron. Note: https://github.com/facebookresearch/detectron Cited by: §3.1, Table 2.
  • [12] R. Girshick (2015) Fast R-CNN. In ICCV, Cited by: §1, §2.
  • [13] K. He, G. Gkioxari, P. Dollar, and R. Girshick (2017) Mask R-CNN. In ICCV, Cited by: §3.2, §5.1, Table 9.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §1, §5.2, §5.3.
  • [15] Y. He, C. Zhu, J. Wang, M. Savvides, and X. Zhang (2019) Bounding box regression with uncertainty for accurate object detection. In CVPR, Cited by: §2.
  • [16] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, Cited by: §2.
  • [17] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang (2018) Acquisition of localization confidence for accurate object detection. In ECCV, Cited by: §2.
  • [18] A. Kendall, Y. Gal, and R. Cipolla (2018) Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In CVPR, Cited by: §2.
  • [19] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In NIPS, Cited by: §1.
  • [20] M. P. Kumar, B. Packer, and D. Koller (2010) Self-paced learning for latent variable models. In NIPS, Cited by: §2.
  • [21] H. Law and J. Deng (2018) CornerNet: detecting objects as paired keypoints. In ECCV, Cited by: §1, Table 9.
  • [22] H. Li, Z. Wu, C. Zhu, C. Xiong, R. Socher, and L. S. Davis (2019) Learning from noisy anchors for one-stage object detection. arXiv:1912.05086. Cited by: §1, §2.
  • [23] Y. Li, Y. Chen, N. Wang, and Z. Zhang (2019) Scale-aware trident networks for object detection. In ICCV, Cited by: §1, §2, Table 9.
  • [24] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun (2018) DetNet: design backbone for object detection. In ECCV, Cited by: §2.
  • [25] T. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR, Cited by: Table 1, §2, §3.2, §5.2, Table 9.
  • [26] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar (2017) Focal loss for dense object detection. In ICCV, Cited by: §1, §2, §2, §5.1, Table 9.
  • [27] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In ECCV, Cited by: §1, §3.1, §5.1.
  • [28] S. Liu, D. Huang, and Y. Wang (2018) Receptive field block net for accurate and fast object detection. In ECCV, Cited by: §1.
  • [29] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In ECCV, Cited by: §1.
  • [30] I. Loshchilov and F. Hutter (2017) SGDR: stochastic gradient descent with warm restarts. In ICLR, Cited by: §2.
  • [31] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin (2019) Libra R-CNN: towards balanced learning for object detection. In CVPR, Cited by: §2, §2, §2, Table 9.
  • [32] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Workshop, Cited by: §5.2.
  • [33] C. Peng, T. Xiao, Z. Li, Y. Jiang, X. Zhang, K. Jia, G. Yu, and J. Sun (2018) MegDet: a large mini-batch object detector. In CVPR, Cited by: §5.7.
  • [34] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In CVPR, Cited by: §1.
  • [35] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, Cited by: §1, Table 1, §2, §3.1, §3.2, §5.2.
  • [36] A. Shrivastava, A. Gupta, and R. Girshick (2016) Training region-based object detectors with online hard example mining. In CVPR, Cited by: §2.
  • [37] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In ICLR, Cited by: §1.
  • [38] B. Singh and L. S. Davis (2018) An analysis of scale invariance in object detection - SNIP. In CVPR, Cited by: §5.7, Table 9.
  • [39] Z. Tan, X. Nie, Q. Qian, N. Li, and H. Li (2019) Learning to rank proposals for object detection. In ICCV, Cited by: §2.
  • [40] Z. Tian, C. Shen, H. Chen, and T. He (2019) FCOS: fully convolutional one-stage object detection. In ICCV, Cited by: §1, Table 9.
  • [41] J. Wang, K. Chen, S. Yang, C. C. Loy, and D. Lin (2019) Region proposal by guided anchoring. In CVPR, Cited by: §2.
  • [42] J. Wang, W. Zhang, Y. Cao, K. Chen, J. Pang, T. Gong, J. Shi, C. C. Loy, and D. Lin (2019) Side-aware boundary localization for more precise object detection. arXiv:1912.04260. Cited by: §2.
  • [43] H. Xu, X. Lv, X. Wang, Z. Ren, N. Bodla, and R. Chellappa (2018) Deep regionlets for object detection. In ECCV, Cited by: Table 9.
  • [44] Z. Yang, S. Liu, H. Hu, L. Wang, and S. Lin (2019) RepPoints: point set representation for object detection. In ICCV, Cited by: §1, Table 9.
  • [45] H. Zhang, H. Chang, B. Ma, S. Shan, and X. Chen (2019) Cascade RetinaNet: maintaining consistency for single-stage object detection. In BMVC, Cited by: §1, §2.
  • [46] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li (2019) Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. arXiv:1912.02424. Cited by: §2, Table 9.
  • [47] X. Zhang, F. Wan, C. Liu, R. Ji, and Q. Ye (2019) FreeAnchor: learning to match anchors for visual object detection. In NeurIPS, Cited by: Table 9.
  • [48] X. Zhou, D. Wang, and P. Krähenbühl (2019) Objects as points. arXiv:1904.07850. Cited by: Table 9.
  • [49] X. Zhu, H. Hu, S. Lin, and J. Dai (2019) Deformable convnets v2: more deformable, better results. In CVPR, Cited by: §2, §5.3, §5.7, Table 9.