Dynamic R-CNN: Towards High Quality Object Detection via Dynamic Training
Although two-stage object detectors have continuously advanced the state-of-the-art performance in recent years, the training process itself is far from crystal clear. In this work, we first point out the inconsistency problem between the fixed network settings and the dynamic training procedure, which greatly affects the performance. For example, the fixed label assignment strategy and regression loss function cannot fit the distribution change of proposals and thus are harmful to training high quality detectors. Consequently, we propose Dynamic R-CNN to adjust the label assignment criteria (IoU threshold) and the shape of regression loss function (parameters of SmoothL1 Loss) automatically based on the statistics of proposals during training. This dynamic design makes better use of the training samples and pushes the detector to fit more high quality samples. Specifically, our method improves upon the ResNet-50-FPN baseline by 1.9 points AP on the MS COCO dataset with no extra overhead. Codes and models are available at https://github.com/hkzhang95/DynamicRCNN.
Benefiting from the advances in deep convolutional neural networks (CNNs) [19, 37, 14], object detection has made remarkable progress in recent years. Modern detection frameworks can be divided into two major categories: one-stage detectors [34, 29, 26] and two-stage detectors [10, 12, 35]. Various improvements have been made in recent studies [40, 23, 44, 45, 22, 21, 28, 5]. In the training procedure of both kinds of pipelines, a classifier and a regressor are adopted to solve the recognition and localization tasks, respectively. Therefore, an effective training process plays a crucial role in achieving high quality object detection (here, high quality refers to results under high IoU).
[Figure 1: (a) panel description truncated in the source; it reports the standard deviations of two quantities. (b) The IoU and regression error of the 75th and 10th most accurate proposals, respectively, during the training procedure. These curves further show the improved quality of proposals.]
Different from the image classification task, the annotations for the classification task in object detection are the ground-truth boxes in the image. So it is not clear how to assign positive and negative labels to the proposals in classifier training, since their separation may be ambiguous. The most widely used strategy is to set a threshold on the IoU between the proposal and the corresponding ground-truth. As shown in Table 1, training with a certain IoU threshold leads to a classifier whose performance degrades at other IoUs. However, we cannot directly set a high IoU threshold from the beginning of training due to the scarcity of positive samples. The solution that Cascade R-CNN provides is to gradually refine the proposals over several stages, which is effective yet time-consuming. For the regressor, the problem is similar. During training, the quality of proposals improves, but the parameter in SmoothL1 Loss is fixed, which leads to insufficient training on the high quality proposals.
To solve this issue, we first examine an overlooked fact: the quality of proposals does improve over the training process, as shown in Figure 1. We find that even under different IoU thresholds, the number of positive samples still increases significantly. Inspired by these observations, we propose Dynamic R-CNN, a simple yet effective method to better exploit the dynamic quality of proposals for object detection. It consists of two components, Dynamic Label Assignment and Dynamic SmoothL1 Loss, which are designed for the classification and regression branches, respectively. First, to train a better classifier that is discriminative for high IoU proposals, we gradually adjust the IoU threshold for positive/negative samples based on the proposal distribution during training. Specifically, we set the threshold as the IoU of the proposal at a certain percentage, since it reflects the quality of the overall distribution. For regression, we change the shape of the regression loss function to adaptively fit the distribution change of errors and ensure the contribution of high quality samples to training. In particular, we adjust the parameter $\beta$ in SmoothL1 Loss based on the error distribution of the regression loss function, where $\beta$ controls the magnitude of the gradient of small errors (shown in Figure 4).
With this dynamic scheme, we not only alleviate the data scarcity issue at the beginning of training, but also harvest the benefit of high IoU training. The two modules address different parts of the detector and thus can work collaboratively towards high quality object detection. Furthermore, despite the simplicity of our proposed method, Dynamic R-CNN brings consistent performance gains on MS COCO with almost no extra computational complexity in training. During the inference phase, our method does not introduce any additional overhead. Moreover, extensive experiments verify that the proposed method generalizes to other baselines with stronger performance.
Region-based object detectors. The general practice of region-based object detectors is to convert the object detection task into a bounding box classification problem and a regression problem. In recent years, region-based approaches have been the leading paradigm with top performance. For example, R-CNN, Fast R-CNN and Faster R-CNN first generate some candidate region proposals, then randomly sample a small batch with a certain foreground-background ratio from all the proposals. These proposals are fed into a second stage that classifies the categories and refines the locations at the same time. Later, some works extended Faster R-CNN to address different problems. R-FCN makes the whole network fully convolutional to improve the speed, and FPN proposes a top-down pathway to combine multi-scale features. Besides, various improvements have been witnessed in recent studies [16, 41, 23, 24, 49].
Classification in object detection. Recent research focuses on improving the object classifier from various perspectives [26, 17, 31, 39, 22, 46, 6, 15]. The classification scores in detection not only determine the semantic category for each proposal, but also imply the localization accuracy, since Non-Maximum Suppression (NMS) suppresses less confident boxes using more reliable ones. It ranks the resultant boxes first using the classification scores. However, as mentioned in IoU-Net, the classification score has low correlation with localization accuracy, which leads to noisy ranking and limited performance. Therefore, IoU-Net adopts an extra branch for predicting IoU scores and refining the classification confidence. Softer NMS devises a KL loss to model the variance of bounding box regression directly, and uses that for voting in NMS. Another direction is to raise the IoU threshold for training high quality classifiers, since training with different IoU thresholds leads to classifiers with corresponding quality. However, as mentioned in Cascade R-CNN, directly raising the IoU threshold is impractical due to the vanishing positive samples. Therefore, to produce high quality training samples, some approaches [3, 45] adopt sequential stages, which are effective yet time-consuming. Essentially, these methods ignore the inherent dynamic property of the training procedure, which is useful for training high quality classifiers.
Bounding box regression. It has been shown that the performance of models depends on the relative weight between losses in multi-task learning. Cascade R-CNN also adopts different regression normalization factors to adjust the amplitude of the regression term in different stages. Besides, Libra R-CNN proposes to promote the regression gradients from the accurate samples, and SABL localizes each side of the bounding box with a lightweight two-step bucketing scheme for precise localization. However, they mainly focus on a fixed scheme, ignoring the dynamic distribution of learning targets during training.
Dynamic training. Various studies follow the idea of dynamic training. A widely used example is adjusting the learning rate based on the training iterations. Besides, Curriculum Learning and Self-paced Learning focus on improving the training order of the examples. Moreover, for object detection, hard mining methods [36, 26, 31] can also be regarded as a dynamic approach. However, they do not handle the core issues in object detection, such as the constant label assignment strategy. Our method is complementary to theirs.
Generally speaking, object detection is complex since it needs to solve two main tasks: recognition and localization. The recognition task needs to distinguish foreground objects from backgrounds and determine their semantic category. The localization task needs to find accurate bounding boxes for different objects. To achieve high quality object detection, we need to further explore the training process of both tasks as follows.
How to assign labels is an interesting question for the classifier in object detection. It differs from other classification problems since the annotations for detection are the ground-truth boxes in the image. There is no doubt that a proposal should be negative if it does not overlap with any ground-truth, and positive if its overlap with a ground-truth is 100%. However, it is a dilemma to define whether a proposal with IoU 0.5 should be labeled as positive or negative.
In the original Faster R-CNN, the labels are assigned by comparing the box's highest IoU with the ground-truths against a pre-defined IoU threshold. Formally, the paradigm can be formulated as follows (we take a binary classification loss for simplicity):
$$\mathrm{label} = \begin{cases} 1, & \text{if } \max \mathrm{IoU}(b, G) \ge T_{+} \\ 0, & \text{if } \max \mathrm{IoU}(b, G) < T_{-} \\ -1, & \text{otherwise} \end{cases} \tag{1}$$
Here $b$ stands for a bounding box, $G$ represents the set of ground-truths, and $T_{+}$ and $T_{-}$ are the positive and negative thresholds for IoU. The values 1, 0 and -1 stand for positives, negatives and ignored samples, respectively. For the second stage of Faster R-CNN, $T_{+}$ and $T_{-}$ are both set to 0.5 by default. So the definition of positives and negatives is essentially hand-crafted.
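The fixed label assignment rule described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the `(x1, y1, x2, y2)` box format and the names `iou`, `assign_label`, `t_pos` and `t_neg` are assumptions made for the example.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_label(box, gt_boxes, t_pos=0.5, t_neg=0.5):
    """Return 1 (positive), 0 (negative) or -1 (ignored) for one proposal.

    t_pos and t_neg are the fixed positive and negative IoU thresholds
    (hypothetical parameter names); both default to 0.5 as in the text.
    """
    best = max((iou(box, g) for g in gt_boxes), default=0.0)
    if best >= t_pos:
        return 1
    if best < t_neg:
        return 0
    return -1
```

With both thresholds at 0.5 no proposal is ever ignored; the ignored label only appears when the positive threshold is set above the negative one.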
Since the goal of the classifier is to distinguish the positives from the negatives, training with different IoU thresholds leads to classifiers with corresponding quality. We verify this again on the COCO benchmark with an FPN-based Faster R-CNN detector. From Table 1 we can see that although the overall mAP does not change much, different IoU thresholds lead to different performance under different metrics.
Therefore, to achieve high quality object detection, we need to train the classifier with a high IoU threshold. However, as mentioned in Cascade R-CNN, directly raising the IoU threshold is impractical due to the vanishing positive samples. Cascade R-CNN uses several sequential stages to lift the IoU of the proposals, which is effective yet time-consuming. Is there a way to get the best of both worlds? As mentioned above, the quality of the proposals actually improves during training. This observation inspires us to take a progressive approach in training. At the beginning, the proposal network cannot produce enough high quality proposals, so we use a lower IoU threshold to better accommodate these imperfect proposals in second stage training. As training goes on, the quality of proposals improves and we gradually obtain enough high quality proposals. As a result, we may increase the threshold to better utilize them and train a high quality detector that is more discriminative at higher IoU. We formulate this process in the following section.
The task of bounding box regression is to regress the positive candidate bounding box $b$ to a target ground-truth $g$. This is learned under the supervision of the regression loss function $L_{reg}$. To encourage a regression target invariant to scale and location, $L_{reg}$ operates on the offset $\Delta = (\delta_x, \delta_y, \delta_w, \delta_h)$ defined by
$$\delta_x = (g_x - b_x)/b_w, \quad \delta_y = (g_y - b_y)/b_h, \quad \delta_w = \log(g_w/b_w), \quad \delta_h = \log(g_h/b_h). \tag{2}$$
Since bounding box regression operates on the offsets, the absolute values in Equation (2) can be very small. To balance the different terms in multi-task learning, $\Delta$ is usually normalized by a pre-defined mean and stdev (standard deviation), as is widely done in prior work [35, 25, 13].
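As a hedged illustration of the offset encoding and normalization just described, the sketch below encodes one proposal against its ground-truth. The center-size box format `(cx, cy, w, h)`, the function name `encode_offsets`, and the default `means` and `stds` values are assumptions chosen for the example, not values taken from this paper.

```python
import math


def encode_offsets(box, gt, means=(0.0, 0.0, 0.0, 0.0), stds=(0.1, 0.1, 0.2, 0.2)):
    """Encode a ground-truth box relative to a proposal, then normalize.

    Boxes are (cx, cy, w, h). The means/stds defaults are illustrative
    assumptions; real detectors configure them per head.
    """
    bx, by, bw, bh = box
    gx, gy, gw, gh = gt
    raw = (
        (gx - bx) / bw,        # horizontal shift, scale-invariant
        (gy - by) / bh,        # vertical shift, scale-invariant
        math.log(gw / bw),     # log width ratio
        math.log(gh / bh),     # log height ratio
    )
    return tuple((d - m) / s for d, m, s in zip(raw, means, stds))
```

A proposal that already matches its ground-truth encodes to all zeros, so better proposals produce regression targets that concentrate around zero, which is the shift discussed in the text.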
However, we discover that the distribution of regression targets shifts during training. As shown in Figure 2, we calculate the statistics of the regression targets under different iterations and IoU thresholds. First, from the first two columns, we find that under the same IoU threshold for positives, the mean and stdev decrease as training goes on, due to the improved quality of proposals. With the same normalization factors, the contributions of those high quality samples are reduced by the definition of the SmoothL1 loss function, which is harmful to the training of high quality regressors. Moreover, with a higher IoU threshold, the quality of positive samples is further enhanced, so their contributions are reduced even more, which greatly limits the overall performance. Therefore, to achieve high quality object detection, we need to fit the distribution change and adjust the shape of the regression loss function to compensate for the increasing number of high quality proposals.
To better exploit the dynamic property of the training procedure, we propose Dynamic R-CNN which is shown in Figure 3. Our key insight is adjusting the second stage classifier and regressor to fit the distribution change of proposals. The two components designed for the classification and localization branch will be elaborated in the following sections.
The Dynamic Label Assignment (DLA) process is illustrated in Figure 3 (a). Based on the common practice of label assignment in Equation (1) in object detection, the DLA module can be formulated as follows:
$$\mathrm{label} = \begin{cases} 1, & \text{if } \max \mathrm{IoU}(b, G) \ge T_{now} \\ 0, & \text{if } \max \mathrm{IoU}(b, G) < T_{now} \end{cases} \tag{3}$$
where $b$ is a bounding box, $G$ is the set of ground-truths, and $T_{now}$ stands for the current IoU threshold. Considering the dynamic property of training, the distribution of proposals changes over time. Our DLA updates $T_{now}$ automatically based on the statistics of proposals to fit this distribution change. Specifically, we first calculate the IoUs $I$ between proposals and their target ground-truths, and then select the $K_I$-th largest value from $I$ as the threshold $T_{now}$. As training goes on, $T_{now}$ increases gradually, which reflects the improved quality of proposals. In practice, we first calculate the $K_I$-th largest value in each batch, and then update $T_{now}$ every $C$ iterations using the mean of these values to enhance the robustness of the training. It should be noted that the calculation of IoUs is already done by the original method, so there is almost no additional complexity in our method. The resultant IoU thresholds used in training are illustrated in Figure 3 (a).
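The update rule described above can be sketched as a small stateful helper. This is an illustrative reading of the text, not the released code: the class and argument names are hypothetical, the defaults (`k_i=75`, `interval=100`, `init_threshold=0.4`) are taken from values that appear elsewhere in the paper, and keeping the threshold at or above its initial value is an assumption about how the initial clipping is applied.

```python
class DynamicIoUThreshold:
    """Tracks a label-assignment IoU threshold from proposal statistics."""

    def __init__(self, k_i=75, interval=100, init_threshold=0.4):
        self.k_i = k_i                      # rank of the IoU recorded per batch
        self.interval = interval            # number of batches between updates
        self.init_threshold = init_threshold
        self.threshold = init_threshold
        self._recorded = []

    def step(self, ious):
        """Record one batch of proposal IoUs and return the current threshold."""
        if ious:
            ranked = sorted(ious, reverse=True)
            k = min(self.k_i, len(ranked))
            self._recorded.append(ranked[k - 1])   # k-th largest IoU
        if len(self._recorded) >= self.interval:
            mean_iou = sum(self._recorded) / len(self._recorded)
            # Assumption: never drop below the initial threshold.
            self.threshold = max(self.init_threshold, mean_iou)
            self._recorded.clear()
        return self.threshold


def assign_labels(ious, threshold):
    """Binary labels: 1 if the IoU reaches the current threshold, else 0."""
    return [1 if v >= threshold else 0 for v in ious]
```

The only extra state is a short list of per-batch values, which is consistent with the claim that the scheme adds almost no training cost.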
The localization task for object detection is supervised by the commonly used SmoothL1 Loss, which can be formulated as follows:
$$\mathrm{SmoothL1}(x, \beta) = \begin{cases} 0.5\,|x|^2/\beta, & \text{if } |x| < \beta \\ |x| - 0.5\,\beta, & \text{otherwise} \end{cases} \tag{4}$$
Here $x$ stands for the regression error. $\beta$ is a hyper-parameter controlling in which range we should use a softer loss function like the $\ell_1$ loss instead of the original $\ell_2$ loss. Considering the robustness of training, $\beta$ is set to 1.0 by default to prevent the loss from exploding due to the poorly trained network in the early stages. We also illustrate the impact of $\beta$ in Figure 4, in which changing $\beta$ leads to different curves of loss and gradient. It is easy to see that a smaller $\beta$ accelerates the saturation of the gradient magnitude, and thus makes smaller errors contribute more to network training.
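To make the role of the SmoothL1 parameter concrete, the following sketch evaluates the standard SmoothL1 loss and its gradient for a scalar error. The function names are chosen for this example, and the piecewise form is the common definition (quadratic below the parameter, linear above it).

```python
def smooth_l1(x, beta=1.0):
    """Standard SmoothL1 loss for a scalar regression error x."""
    ax = abs(x)
    if ax < beta:
        return 0.5 * ax * ax / beta   # quadratic region for small errors
    return ax - 0.5 * beta            # linear region for large errors


def smooth_l1_grad(x, beta=1.0):
    """Derivative of smooth_l1 with respect to x."""
    if abs(x) < beta:
        return x / beta               # grows linearly until |x| reaches beta
    return 1.0 if x > 0 else -1.0     # saturated gradient magnitude
```

For an error of 0.1, the gradient is 0.1 with a parameter value of 1.0 but 0.5 with a value of 0.2, which illustrates why a smaller value lets accurate samples contribute more to training.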
As analyzed in Section 3.2, we need to fit the distribution change and adjust the regression loss function to compensate for the high quality samples. So we propose Dynamic SmoothL1 Loss (DSL), which changes the shape of the loss function to gradually focus on high quality samples as follows:
$$\mathrm{DSL}(x, \beta_{now}) = \begin{cases} 0.5\,|x|^2/\beta_{now}, & \text{if } |x| < \beta_{now} \\ |x| - 0.5\,\beta_{now}, & \text{otherwise} \end{cases} \tag{5}$$
Similar to DLA, DSL changes the value of $\beta_{now}$ according to the statistics of regression errors, which reflect the localization accuracy. To be more specific, we first obtain the regression errors $E$ between proposals and their target ground-truths, then select the $K_\beta$-th smallest value from $E$ to update $\beta_{now}$ in Equation (5). Similarly, we update $\beta_{now}$ every $C$ iterations using the median of the $K_\beta$-th smallest errors recorded in each batch. We choose the median instead of the mean used in the classification branch because we find more outliers in regression errors. In this dynamic way, an appropriate $\beta_{now}$ is adopted automatically, as shown in Figure 3 (b), which better exploits the training samples and leads to a high quality regressor.
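The regression-side update can be sketched in the same style. Again this is a hedged illustration under stated assumptions: the names are hypothetical, `k_beta=10` and `interval=100` are example defaults, and capping the value at its initial setting of 1.0 is an assumption about how the initial clipping is applied.

```python
import statistics


class DynamicBeta:
    """Tracks the SmoothL1 parameter from regression-error statistics."""

    def __init__(self, k_beta=10, interval=100, init_beta=1.0):
        self.k_beta = k_beta          # rank of the error recorded per batch
        self.interval = interval      # number of batches between updates
        self.init_beta = init_beta
        self.beta = init_beta
        self._recorded = []

    def step(self, errors):
        """Record one batch of regression errors and return the current value."""
        if errors:
            ranked = sorted(abs(e) for e in errors)
            k = min(self.k_beta, len(ranked))
            self._recorded.append(ranked[k - 1])   # k-th smallest error
        if len(self._recorded) >= self.interval:
            # The median is used because it is robust to outlier batches.
            # Assumption: never exceed the initial value.
            self.beta = min(self.init_beta, statistics.median(self._recorded))
            self._recorded.clear()
        return self.beta
```

If one batch in an interval records an unusually large error, the median of the recorded values is largely unaffected, whereas a mean would be pulled upward.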
To summarize the whole method, we describe the proposed Dynamic R-CNN in Algorithm 1. Besides the proposals $P$ and ground-truths $G$, Dynamic R-CNN has three hyperparameters: the IoU threshold top-k $K_I$, the $\beta$ top-k $K_\beta$ and the update iteration count $C$. Note that compared with the baseline, we only introduce one additional hyperparameter. We will show shortly that the results are quite robust to the choice of these hyperparameters.
Experimental results are presented on the bounding box detection track of the challenging MS COCO dataset, which includes 80 object classes. Following the common practice [26, 13], we use the COCO trainval35k split (union of 80k images from train and a random 35k subset of images from the 40k image val split) for training and report the ablation studies on the minival split (the remaining 5k images from val). We also submit our main results to the evaluation server for the final performance on the test-dev split, which has no disclosed labels. The COCO-style Average Precision (AP) is chosen as the main evaluation metric; it averages AP across IoU thresholds from 0.5 to 0.95 with an interval of 0.05. We also include other metrics to better understand the behavior of the proposed method.
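The averaging behind the COCO-style AP can be written out explicitly. This small sketch only shows the final averaging step over the ten IoU thresholds; the per-threshold AP values are assumed to be computed elsewhere, and the function names are hypothetical.

```python
def coco_iou_thresholds():
    """The ten IoU thresholds 0.50, 0.55, ..., 0.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def coco_style_ap(ap_per_threshold):
    """Average a mapping {iou_threshold: AP} over the COCO thresholds."""
    thresholds = coco_iou_thresholds()
    return sum(ap_per_threshold[t] for t in thresholds) / len(thresholds)
```

Because the stricter thresholds enter the average with equal weight, gains at high IoU move the overall AP even when AP at 0.5 is unchanged.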
For fair comparisons, all experiments are implemented on PyTorch and follow the settings in maskrcnn-benchmark (https://github.com/facebookresearch/maskrcnn-benchmark) and SimpleDet. We adopt FPN-based Faster R-CNN [35, 25] with a ResNet-50 model pre-trained on ImageNet as our baseline. All models are trained on the COCO trainval35k and tested on minival with the image short side at 800 pixels unless noted. Due to the scarcity of positives in the training procedure, we set the NMS threshold of RPN to 0.85 instead of 0.7 for all the experiments.
We compare Dynamic R-CNN with the corresponding baselines on the COCO test-dev set in Table 2. For fair comparisons, we report our re-implemented results here.
First, we show that our method works on different backbones. With the dynamic design, Dynamic R-CNN achieves 39.1% AP with ResNet-50, which is 1.8 points higher than the FPN-based Faster R-CNN baseline. With a stronger backbone like ResNet-101, Dynamic R-CNN also achieves consistent gains (+1.9 points).
Second, our dynamic design is also compatible with other training and testing techniques. The results are consistently improved by progressively adding a longer training schedule, multi-scale training and testing, and deformable convolution. With the best combination, our Dynamic R-CNN achieves 49.2% AP, which is still 2.3 points higher than the Faster R-CNN baseline. It should be noted that we only use five scales in multi-scale testing, and the result will be better if more scales are adopted.
These experimental results show the effectiveness and robustness of our method since it can work together with different backbones and multiple training and testing skills. It should also be noted that the performance gains are almost free.
To show the effectiveness of each proposed component, we report the overall ablation studies in Table 3.
1) Dynamic Label Assignment (DLA). Dynamic Label Assignment brings 1.2 points higher box AP than the ResNet-50-FPN baseline. To be more specific, results on the higher IoU metrics are consistently improved, with a gain of up to 2.9 points. This shows the effectiveness of our method in pushing the classifier to be more discriminative at higher IoU thresholds.
2) Dynamic SmoothL1 Loss (DSL). Dynamic SmoothL1 Loss improves the box AP from 37.0 to 38.0. Results on the higher IoU metrics are hugely improved, which validates the effectiveness of changing the loss function to compensate for the high quality samples during training. Moreover, as analyzed in Section 3.2, with DLA the quality of positives is further improved and thus their contributions are reduced even more. So applying DSL on top of DLA also brings reasonable gains, especially on the high quality metrics. To sum up, Dynamic R-CNN improves the baseline by 1.9 points AP and brings a 5.5 point gain in $AP_{90}$.
3) Illustration of dynamic training. To further illustrate the dynamics in the training procedure, we show the trends of the IoU threshold and the SmoothL1 $\beta$ under different settings of our method in Figure 5. Here we clip the values of the IoU threshold and $\beta$ to 0.4 and 1.0, respectively, at the beginning of training. Regardless of the specific values of the top-k hyperparameters $K_I$ and $K_\beta$, the overall trend of the IoU threshold is increasing while that of the SmoothL1 $\beta$ is decreasing during training. These results again verify that the proposed method works as expected.
Ablation study on $K_I$ in DLA. Experimental results with different values of the top-k hyperparameter $K_I$ of Dynamic Label Assignment are shown in Table 4. Compared to the Faster R-CNN baseline, DLA achieves consistent gains in AP regardless of the choice of $K_I$. These results show the effectiveness of the DLA part and also the universality of the hyperparameter $K_I$.
Moreover, the performance under various metrics changes with the top-k hyperparameter $K_I$. Since $K_I$ controls the result of the label assignment process, choosing $K_I$ as 64/75/100 means that nearly 12.5%/15%/20% of the whole batch are selected as positives. Generally speaking, a smaller $K_I$ increases the quality of the selected samples, which leads to better accuracy under higher IoU metrics. On the contrary, a larger $K_I$ is more helpful for the metrics at lower IoU. Finally, we find that setting $K_I$ to 75 achieves the best trade-off and use it as the default value for further experiments. All these ablations show the effectiveness and robustness of the Dynamic Label Assignment. Note that, as illustrated in Table 1, adjusting the thresholds but leaving them fixed during training gives inferior results.
Ablation study on $K_\beta$ in DSL. We show the results in Table 5. We first conduct experiments with different values of the SmoothL1 parameter $\beta$ on the original Faster R-CNN and empirically find that a smaller $\beta$ leads to better performance. Then, experiments of DSL under different values of the top-k hyperparameter $K_\beta$ are provided to show its effects. Regardless of the specific value of $K_\beta$, DSL achieves consistent improvements compared with the various fine-tuned baselines, which shows the effectiveness of the DSL part. Specifically, with our best setting, DSL brings 1.0 point higher AP than the baseline, and the improvement mainly lies in the high quality, high-IoU metrics (+3.5 points). These experimental results show that our DSL is effective in compensating for high quality samples and leads to a better regressor due to the dynamic design.
Ablation study on iteration count $C$. As mentioned in Section 4, due to the concern of robustness, we update the IoU threshold every $C$ iterations using the mean of the IoU values recorded in the last interval. To show the effects of different iteration counts, we try different values of $C$ on the proposed method. As shown in Table 6, setting $C$ to 20, 100 and 500 leads to very similar results, which shows the robustness to this hyperparameter.
|Backbone||Method||Extra training time||Inference speed|
|ResNet-50-FPN||Dynamic R-CNN||-||13.9 FPS|
|ResNet-50-FPN||Cascade R-CNN||✓||11.2 FPS|
|ResNet-50-FPN||Dynamic Mask R-CNN||-||11.3 FPS|
|ResNet-50-FPN||Cascade Mask R-CNN||✓||7.3 FPS|
|ResNet-101-FPN||Dynamic R-CNN||-||11.2 FPS|
|ResNet-101-FPN||Cascade R-CNN||✓||9.6 FPS|
|ResNet-101-FPN||Dynamic Mask R-CNN||-||9.8 FPS|
|ResNet-101-FPN||Cascade Mask R-CNN||✓||6.4 FPS|
Complexity and speed. As shown in Algorithm 1, the main computational complexity of our method lies in the calculation of IoUs and regression errors. However, these time-consuming statistics are already computed by the original method. Thus the additional overhead of Dynamic R-CNN only lies in calculating the mean or median of a short vector, which basically does not increase the training time. Moreover, since our method only changes the training procedure, no additional overhead is introduced during the inference phase.
Our advantage compared to other high quality detectors like Cascade R-CNN is the training efficiency and inference speed. As shown in Table 7, though the results of Cascade R-CNN are slightly better than ours, it increases the training time and slows down the inference speed, while our method does not. Specifically, under the ResNet-50-FPN backbone, our Dynamic R-CNN achieves 13.9 FPS, which is about 1.25 times faster than Cascade R-CNN (11.2 FPS). Note that the difference becomes more apparent as the backbone gets smaller, because the main overhead of Cascade R-CNN is the two additional heads it introduces. Moreover, with larger heads, the cascade manner further slows down the inference speed. We compare the inference speed of Cascade Mask R-CNN and Dynamic Mask R-CNN, and find that our method runs about 1.5 times faster than the cascade manner under different backbones.
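The speed ratios quoted above follow directly from the FPS values in Table 7. The short check below recomputes them; the dictionary layout and the `speedup` helper are only an illustration.

```python
# Inference speeds in FPS, copied from Table 7.
FPS = {
    ("ResNet-50-FPN", "Dynamic R-CNN"): 13.9,
    ("ResNet-50-FPN", "Cascade R-CNN"): 11.2,
    ("ResNet-50-FPN", "Dynamic Mask R-CNN"): 11.3,
    ("ResNet-50-FPN", "Cascade Mask R-CNN"): 7.3,
    ("ResNet-101-FPN", "Dynamic R-CNN"): 11.2,
    ("ResNet-101-FPN", "Cascade R-CNN"): 9.6,
    ("ResNet-101-FPN", "Dynamic Mask R-CNN"): 9.8,
    ("ResNet-101-FPN", "Cascade Mask R-CNN"): 6.4,
}


def speedup(backbone, dynamic_method, cascade_method):
    """Ratio of the dynamic method's FPS to the cascade method's FPS."""
    return FPS[(backbone, dynamic_method)] / FPS[(backbone, cascade_method)]


for backbone in ("ResNet-50-FPN", "ResNet-101-FPN"):
    box = speedup(backbone, "Dynamic R-CNN", "Cascade R-CNN")
    mask = speedup(backbone, "Dynamic Mask R-CNN", "Cascade Mask R-CNN")
    print(f"{backbone}: box {box:.2f}x, mask {mask:.2f}x")
```

The box-detection ratio is about 1.24 with ResNet-50-FPN and about 1.17 with ResNet-101-FPN, while the Mask R-CNN ratios are about 1.55 and 1.53, in line with the rounded figures in the text.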
|Method||Backbone||AP||AP50||AP75||APS||APM||APL|
|Faster R-CNN||ResNet-101||36.2||59.1||39.0||18.2||39.0||48.2|
|Mask R-CNN||ResNet-101||38.2||60.3||41.7||20.1||41.1||50.2|
|Libra R-CNN||ResNet-101||41.1||62.1||44.7||23.4||43.7||52.5|
|Cascade R-CNN||ResNet-101||42.8||62.1||46.3||23.7||45.5||55.2|
Since the viewpoint of dynamic training is a general concept, we believe that it can be adopted in different methods. To validate the generalization capacity, we further apply the dynamic design to Mask R-CNN with different backbones. As shown in Table 8, adopting the dynamic design achieves consistent gains (+1.9 to 2.0 points) on the box AP across different backbones. Moreover, we also evaluate the results on the task of instance segmentation and find that our dynamic design also achieves consistent gains regardless of backbone. It is worth noting that we only adopt DLA and DSL, which are designed for object detection, so these results further demonstrate the generality and effectiveness of our dynamic design for improving the training procedure of current state-of-the-art object detectors.
We compare Dynamic R-CNN with the state-of-the-art object detection approaches on the COCO test-dev set in Table 9. Because different detectors adopt different settings (for example, large-batch Batch Normalization and Soft-NMS), we report the results of our method in two variants.
Dynamic R-CNN applies our method on FPN-based Faster R-CNN with ResNet-101 as backbone, and it can achieve 42.0% AP without bells and whistles. Dynamic R-CNN* adopts image pyramid scheme (multi-scale training and testing), deformable convolutions and Soft-NMS. It further improves the results to 50.1% AP, outperforming all the previous detectors.
In this paper, we conduct a thorough analysis of the training process of detectors and find that the fixed scheme limits the overall performance. Based on the dynamic viewpoint, we propose Dynamic R-CNN to better exploit the training procedure. With the help of simple but effective components, Dynamic Label Assignment and Dynamic SmoothL1 Loss, Dynamic R-CNN brings significant improvements on the challenging COCO dataset with no extra cost. Extensive experiments with various detectors and backbones validate the generality and effectiveness of Dynamic R-CNN. We hope that this dynamic viewpoint can inspire further research in the future.
SGDR: Stochastic gradient descent with warm restarts. In ICLR.