1 Introduction
Recent years have witnessed remarkable progress in object detection thanks to the advance of deep convolutional networks [17, 31, 32, 13]. Among existing methods, the two-stage approach is the leading paradigm of object detection in the deep learning era, and the Region Proposal Network (RPN) [28] is its de facto component for generating candidate regions.

Figure 1(a) shows the IoU histogram of the Regions of Interest (RoIs) generated by the RPN. As the IoU increases, the number of RoIs decreases sharply, leading to an IoU distribution imbalance. Since the subsequent R-CNN takes these RoIs as training samples, the distribution of training samples is naturally skewed towards lower IoUs. Moreover, the total number of positive samples per image is no more than 100 during training, while the total number of training samples is 512. We argue that the IoU distribution imbalance and the inadequate quantity of positive samples hinder the optimization of the detector, especially at high IoU levels. In Figure 1(b), we plot the IoU of the RoIs with their corresponding ground-truth (GT) bounding boxes before and after regression. The localization accuracy gains of RoIs after the refinement of the regressor are mainly concentrated at low IoU levels, and accuracy even decays at high IoU levels. We attribute this to the loss imbalance during training. Figure 1(c) illustrates the composition of the regression loss during training: the low-IoU RoIs comprise the majority of the loss and dominate the gradients. As a result, a detector optimized at low IoU levels is not necessarily optimal at other levels, which degrades the overall performance of the detector.
To address these problems, Cascade R-CNN [2] proposed a multi-stage object detection framework. The detectors are trained stage by stage, with the training samples of each stage drawn from the output of the previous one; since the output IoU of a regressor is almost invariably better than its input IoU, the detector obtains enough samples at different IoU levels, which improves overall performance. Although Cascade R-CNN obtains solid improvements, the multi-stage approach is not flexible enough. Libra R-CNN [25] proposed IoU-balanced sampling, a simple but effective method, to alleviate the distribution imbalance among hard negative samples. However, because high-IoU samples vanish exponentially, it is hard to obtain a uniform IoU distribution for positive samples with this sampling strategy.
This raises a question: for a two-stage detector, must the training samples of the R-CNN come from the output of the RPN? The answer is no. In this paper, instead of taking positive samples from the RPN, we propose to add controllable jitter to each GT bounding box to directly generate positive training samples for the R-CNN. In this way, we can simply and effectively obtain adequate, uniformly distributed training samples for not only the regression branch but also the IoU prediction branch.

Moreover, since non-maximum suppression (NMS) is a critical post-processing procedure for filtering redundant bounding boxes, IoU-Net [16] pointed out that the misalignment between classification confidence and localization accuracy may lead to accurately localized bounding boxes being suppressed by less accurate ones during NMS. To solve this problem, IoU-Net proposed an IoU prediction branch that predicts the IoU between the predicted bounding box and the corresponding ground-truth bounding box; the predicted IoU replaces the classification score as the metric for ranking bounding boxes. Nevertheless, we argue that there is still a mismatch in the IoU prediction branch. During training, the input of the IoU predictor is the RoI feature at the current position, and the IoU predictor outputs the predicted IoU of the RoI with its corresponding GT bounding box. At test time, however, the predicted IoU is assigned to the bounding box that has been moved to a new position by the regression branch. This shift of the RoI position also brings a feature offset, and it is this feature offset that results in a misalignment between the predicted IoU and the localization accuracy. As shown in Figure 2, the red and orange dotted boxes are proposals covering the same GT bounding box. Although the IoU of the red box is initially smaller than that of the orange box, the red box overtakes the orange box after the refinement of the R-CNN network. However, since the input of the following branch is still the feature at the position before regression, the red box's score remains lower than the orange box's, which causes the more accurate box to be suppressed during NMS. In this paper, we further improve the performance of the IoU prediction branch by eliminating the feature offsets of RoIs at inference, without any retraining. Experiments show that it boosts the performance of the detector, and the stronger the IoU prediction branch is, the more gains it brings.
Our main contributions are summarized as follows: (1) Our study reveals the importance of addressing the limitations of the RPN, and our proposed IoU-uniform R-CNN alleviates the IoU distribution imbalance and the shortage of positive training samples by generating samples with a uniform IoU distribution. (2) We improve the performance of the IoU prediction branch by eliminating the feature offsets of RoIs at inference. (3) Our method consistently obtains significant improvements over multiple state-of-the-art detectors. Specifically, without bells and whistles, it achieves a 2.4-point AP improvement over Faster R-CNN (with a ResNet-101-FPN backbone) on the MS COCO dataset.
2 Related Work
Development of the model architecture. As deep learning techniques have been widely applied to various computer vision tasks, approaches based on convolutional neural networks (CNNs) have come to dominate object detection, and model architectures are constantly evolving. CNN-based detectors, first introduced with R-CNN, can be classified as two-stage methods: they first obtain a sparse set of proposals and then classify and refine these proposals in a second stage. On the other hand, single-stage detectors, popularized by YOLO, detect objects directly without a proposal stage, and successive designs have enhanced their feature extraction capabilities to further improve performance. Recently, as both single-stage and two-stage detection frameworks have matured, anchor-free methods have become a new research hotspot: instead of using anchor boxes, they predict bounding boxes in a per-pixel prediction fashion.
Imbalance problems in object detection. As model architectures mature, more and more research has turned to improving the training process of the detector. Under such circumstances, sample imbalance problems during training have attracted growing attention. [24] reviews the deep-learning-era object detection literature and identifies eight different imbalance problems. Numerous studies have also shown that mitigating sample imbalance, especially the foreground-background imbalance, brings significant gains to detector performance. For example, Focal Loss [21] and the Gradient Harmonizing Mechanism (GHM) [19] alleviate the foreground-background imbalance by soft sampling, which suppresses the gradients originating from easy positives and negatives, while SSD [23] and OHEM [29] restrict the imbalance by hard example mining. However, the IoU distribution imbalance has received relatively little attention in object detection. Cascade R-CNN [2] tackles this problem with a cascade framework: RoIs are iteratively refined so that the detector ultimately obtains enough samples at different IoU levels. Libra R-CNN [25] proposed IoU-balanced sampling to alleviate the IoU distribution imbalance among hard negative samples.
Improvement of duplicate removal. Duplicate removal is an essential post-processing procedure of object detectors for removing duplicated bounding boxes, and its efficacy heavily affects the final performance. The most widely used algorithm is non-maximum suppression (NMS), which iteratively selects proposals according to a confidence score (usually the classification score) and suppresses overlapping proposals. However, the classification score is not accurate enough to guarantee that the most accurate detection results are preserved. Instead of directly eliminating overlapping proposals, Soft-NMS [1] decays the bounding box scores, and Softer-NMS [14] averages the selected boxes in a softer way. Fitness NMS [34] incorporates localization information into the confidence used to rank bounding boxes. Differently, Prime Sample Attention [3] investigates sample importance and makes the classifier more prone to give high scores to high-IoU proposals. IoU-Net [16] claims that it is improper to use classification scores as the ranking criterion; it proposes an IoU prediction branch that predicts the IoU between the predicted bounding box and the corresponding ground truth, and the predicted IoU replaces the classification score as the ranking metric.
3 Methodology
In this section, we elaborate the proposed IoU-uniform R-CNN for object detection. As our goal is to break through the limitations of the RPN during detector training, we first replace the RoI training samples with generated samples to obtain a more powerful regressor and IoU predictor. Furthermore, we simply tune the loss weights of different IoU intervals to control the balance of the regression loss composition. We then propose to eliminate the feature offsets of RoIs during the inference of the IoU prediction branch. With a more powerful regressor and IoU predictor, IoU-uniform R-CNN achieves superior performance. All components are elaborated below.
3.1 Generate positive samples with uniform IoU distribution
We first revisit the pipeline of the two-stage approach. As illustrated in Figure 4(a), the Region Proposal Network (RPN) generates a sparse set of proposals that should cover all foreground objects while filtering out the majority of negative locations. At the second stage, a region-wise subnetwork refines these proposals by further classification and regression. The whole network is trained end-to-end, and the region-wise subnetwork takes the output of the RPN as training samples. The region-wise subnetwork is expected to perform well on proposals of varying quality, but things go awry. Figure 3 shows the average localization improvement of proposals from different IoU intervals after refinement. We find that, as IoU increases, the gain from refinement gets smaller, and performance even degrades at high IoU levels. As discussed in Section 1, this performance imbalance may come from the loss imbalance during training: since most training samples come from low IoU levels (IoU < 0.7), the low-IoU RoIs comprise the majority of the loss and dominate the gradients.

A natural solution for alleviating the imbalance is to resample the proposals or to tune the loss weights of different IoU intervals. However, since the quantity gap between low and high IoU levels is too wide and the number of positive samples is insufficient (no more than 100 positive samples per image), it is hard to obtain a uniform IoU distribution for positive samples by resampling. It is also hard to determine appropriate weights to balance the composition of the regression loss.
In this paper, in order to obtain samples with a uniform IoU distribution for the region-wise subnetwork, we propose to directly generate training samples around each GT bounding box instead of taking proposals from the RPN. We first divide the IoU range into intervals, and then generate samples for each bounding box in each interval by adding controllable jitters. Given an image with annotated ground truths, a bounding box is represented by its center coordinates, width, and height, $g = (g_x, g_y, g_w, g_h)$. A generated sample $b$ is determined by:
$$b = \big(g_x + \delta_x g_w,\; g_y + \delta_y g_h,\; g_w(1+\delta_w),\; g_h(1+\delta_h)\big), \quad \delta_x, \delta_y, \delta_w, \delta_h \sim U(-r, r), \qquad (1)$$

where $r$ controls the range of the random jitter.

In order to obtain enough samples from a limited number of attempts, the random range varies across intervals depending on the target IoU level of the generated RoI samples; precisely, for higher-IoU samples the random range should be smaller. Besides, to guarantee the validity of generated samples, we only keep the RoIs whose maximum IoU over all GT boxes is attained at the current GT bounding box. As we keep $N$ samples for each of the $K$ IoU intervals per GT bounding box, the overall number of training samples for one image is $N \times K \times M$ for $M$ GT boxes, and we obtain a totally uniform IoU distribution. To this end, our training pipeline is shown in Figure 4(b): the generated, IoU-uniformly-distributed samples are used to train not only the regression branch but also the IoU prediction branch. The following experiments show that this greatly promotes the performance of both the regression and IoU prediction branches. The classification branch, by contrast, still takes the output of the RPN as training samples, because a large difference in RoI distribution between training and testing is harmful to classifier performance.
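To make the sampling procedure concrete, the sketch below generates jittered boxes around a single GT box by rejection sampling, keeping a fixed number of samples per IoU interval. The jitter form follows Eq. (1) and the interval edges match Section 4.1, while the per-interval jitter ranges (`RANGES`), the helper names, and the single-GT simplification are our own illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(ix2 - ix1, 0.0) * max(iy2 - iy1, 0.0)
    area = lambda t: max(t[2] - t[0], 0.0) * max(t[3] - t[1], 0.0)
    return inter / (area(a) + area(b) - inter + 1e-9)

def cxcywh_to_xyxy(b):
    cx, cy, w, h = b
    return np.array([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])

# IoU intervals from Sec. 4.1; the jitter ranges are assumed values, chosen so
# that higher-IoU intervals use smaller perturbations, as the text prescribes.
INTERVALS = [(0.5, 0.6), (0.6, 0.7), (0.7, 0.8), (0.8, 1.0)]
RANGES = [0.35, 0.25, 0.15, 0.08]

def generate_uniform_samples(gt, n_per_interval=16, max_tries=2000, rng=None):
    """Rejection-sample n_per_interval boxes per IoU interval around one GT box
    (given and returned in center/width/height format), following Eq. (1)."""
    rng = rng or np.random.default_rng()
    gx, gy, gw, gh = gt
    gt_xyxy = cxcywh_to_xyxy(gt)
    samples = []
    for (lo, hi), r in zip(INTERVALS, RANGES):
        kept = 0
        for _ in range(max_tries):
            if kept == n_per_interval:
                break
            dx, dy, dw, dh = rng.uniform(-r, r, size=4)
            cand = np.array([gx + dx * gw, gy + dy * gh,
                             gw * (1 + dw), gh * (1 + dh)])   # Eq. (1)
            if lo <= iou(cxcywh_to_xyxy(cand), gt_xyxy) < hi:
                samples.append(cand)   # candidate lands in the target interval
                kept += 1
    return np.stack(samples)  # ~uniform IoU distribution w.r.t. this GT box
```

In the full detector, a candidate would additionally be discarded if its best-matching GT box is not the one it was generated from, per the validity check above.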
Although the number of samples in different IoU intervals is now the same, we can still observe a regression loss imbalance, which may be caused by the initialization. The regression layers are randomly initialized from a normal distribution with mean 0 and standard deviation 0.001, so the outputs of the regression branch are concentrated around 0 at the beginning of training. Because the regressor is trained to predict the offset between an RoI and its GT bounding box, this initialization lets the low-IoU samples dominate the total loss early in training: the offset of a low-IoU RoI is naturally bigger than that of a high-IoU one. Since the imbalance problem has been largely mitigated and we already have enough training samples for all IoU intervals, tuning the regression weights according to the IoU of the proposals becomes feasible and easy. The weighted regression loss is shown in Eq. (2),
$$L_{reg} = \frac{1}{N_{pos}} \sum_{k=1}^{K} w_k \sum_{i \in S_k} \mathrm{SmoothL1}\big(t_i - t_i^*\big), \qquad (2)$$

where $S_k$ denotes the samples whose IoU falls in the $k$-th interval, $w_k$ is the loss weight of that interval, $t_i$ and $t_i^*$ are the predicted and target regression offsets, and $N_{pos}$ is the total number of positive samples.
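A minimal PyTorch sketch of Eq. (2) follows: the interval edges match Section 4.1, while the weight values in `IOU_WEIGHTS` are illustrative placeholders rather than the paper's tuned values.

```python
import torch
import torch.nn.functional as F

# Interval edges from Sec. 4.1; the weights are placeholders that up-weight
# high-IoU samples, whose offsets (and hence losses) are naturally smaller.
IOU_BOUNDS = torch.tensor([0.6, 0.7, 0.8])           # splits [0.5, 1.0) into 4 intervals
IOU_WEIGHTS = torch.tensor([0.5, 0.75, 1.0, 1.25])   # assumed w_k, not the paper's values

def weighted_reg_loss(pred_deltas, target_deltas, sample_ious):
    """Eq. (2): smooth L1 loss re-weighted by each sample's IoU interval."""
    k = torch.bucketize(sample_ious, IOU_BOUNDS, right=True)   # interval index per sample
    per_sample = F.smooth_l1_loss(pred_deltas, target_deltas,
                                  reduction="none").sum(dim=1)  # sum over the 4 offsets
    return (IOU_WEIGHTS[k] * per_sample).mean()
```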
3.2 Eliminating the feature offsets of RoIs
The IoU prediction branch [16] was proposed to predict the localization confidence of each detected bounding box, and it is more sensitive to localization accuracy; thus, the feature offsets of RoIs cannot be neglected. As shown in Figure 5, to eliminate the feature offsets we take the output bounding box of the region-wise subnetwork as a new RoI and obtain its features via RoIAlign pooling [12]. The new RoI features are then fed to the region-wise subnetwork again to obtain the final IoU prediction. This second feature extraction at inference eliminates the feature offsets of RoIs without any retraining.
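The two-pass inference can be sketched as follows. The names `roi_extractor`, `bbox_head`, and `apply_deltas` are generic stand-ins for a Faster R-CNN-style head with an IoU prediction output, not mmdetection's exact API; only the control flow (re-pooling features at the refined boxes before reading off the IoU prediction) reflects the method.

```python
def inference_without_feature_offset(feats, proposals, roi_extractor,
                                     bbox_head, apply_deltas):
    """Pass 1: usual classification + regression on RPN proposals.
    Pass 2: re-pool features at the *refined* boxes so the IoU prediction
    is computed where each box actually ends up (no retraining needed)."""
    roi_feats = roi_extractor(feats, proposals)
    cls_score, bbox_pred, _ = bbox_head(roi_feats)
    refined = apply_deltas(proposals, bbox_pred)      # decoded boxes
    refined_feats = roi_extractor(feats, refined)     # second RoIAlign pass
    _, _, iou_score = bbox_head(refined_feats)        # aligned IoU prediction
    return refined, cls_score, iou_score
```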
Ideally, we would like the IoU score of each bounding box candidate to replace the classification score as the suppression criterion of the NMS algorithm. However, the IoU scores of the numerous background bounding boxes are not credible at test time, because the IoU prediction branch is trained only on positive samples whose IoU is above 0.5; including the messy background samples in training would disturb the IoU prediction branch and hurt its accuracy on the high-IoU bounding boxes that drive detection performance, so it is inappropriate to train the branch on both positive and negative samples. To remedy this, we use the product of the IoU score and the classification score as the final ranking score for NMS, since the classification branch helps suppress background boxes through their low classification scores.
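A usage sketch of this score fusion, with torchvision's standard NMS operator standing in for the detector's own implementation:

```python
import torch
from torchvision.ops import nms

def rank_and_suppress(boxes, cls_score, iou_score, iou_thr=0.5):
    """Rank boxes by cls_score * iou_score before NMS: the IoU score promotes
    accurately localized boxes, while the classification score suppresses
    background boxes whose predicted IoU is unreliable."""
    final_score = cls_score * iou_score
    keep = nms(boxes, final_score, iou_thr)   # boxes: (N, 4) tensor in xyxy format
    return boxes[keep], final_score[keep]
```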

Table 1: Results on the PASCAL VOC 2007 test set (COCO-style metric).

| Backbone | Detector | Our method | AP | AP50 | AP60 | AP70 | AP80 | AP90 |
|---|---|---|---|---|---|---|---|---|
| ResNet-50-FPN | Faster R-CNN | No | 50.2 | 79.1 | 75.6 | 63.4 | 42.4 | 10.8 |
| ResNet-50-FPN | Faster R-CNN | Yes | 55.4 | 79.8 | 75.9 | 67.2 | 51.6 | 23.0 |
| ResNet-50-FPN | Cascade R-CNN | No | 54.9 | 79.2 | 74.3 | 66.2 | 51.7 | 23.2 |
| ResNet-50-FPN | Cascade R-CNN | Yes | 56.1 | 77.8 | 74.1 | 66.1 | 53.4 | 29.4 |
| ResNet-101-FPN | Faster R-CNN | No | 52.8 | 82.2 | 77.1 | 66.3 | 45.9 | 12.1 |
| ResNet-101-FPN | Faster R-CNN | Yes | 57.6 | 81.4 | 77.6 | 69.2 | 54.1 | 25.4 |
| ResNet-101-FPN | Cascade R-CNN | No | 57.7 | 81.9 | 77.5 | 68.7 | 56.3 | 25.9 |
| ResNet-101-FPN | Cascade R-CNN | Yes | 57.6 | 78.4 | 74.4 | 67.2 | 55.0 | 31.7 |
4 Experiments
We comprehensively evaluate our method on two widely used object detection benchmarks: MS COCO [22] and PASCAL VOC [8]. MS COCO is a large-scale dataset with 80 object categories; it consists of 115k images for training (train-2017), 5k images for validation (val-2017), and 20k images for testing without publicly released annotations. We use the train split for training and report performance on the val and test-dev splits. For PASCAL VOC, we use the union of the VOC2007 and VOC2012 trainval sets for training, which contains 16,551 images with objects from 20 pre-defined categories annotated with bounding boxes, and we evaluate on the VOC2007 test set.
Table 2: Results on MS COCO val-2017.

| Backbone | Detector | Our method | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|---|
| ResNet-50-FPN | Faster R-CNN | No | 36.4 | 58.4 | 39.1 | 21.5 | 40.0 | 46.6 |
| ResNet-50-FPN | Faster R-CNN | Yes | 39.1 | 57.7 | 42.2 | 22.5 | 42.6 | 50.7 |
| ResNet-50-FPN | Cascade R-CNN | No | 40.4 | 58.5 | 43.9 | 21.5 | 43.7 | 53.8 |
| ResNet-50-FPN | Cascade R-CNN | Yes | 40.9 | 58.1 | 43.9 | 23.1 | 44.1 | 53.7 |
| ResNet-101-FPN | Faster R-CNN | No | 38.5 | 60.3 | 41.6 | 22.3 | 43.0 | 49.8 |
| ResNet-101-FPN | Faster R-CNN | Yes | 40.9 | 59.7 | 43.8 | 22.9 | 44.9 | 54.4 |
| ResNet-101-FPN | Cascade R-CNN | No | 42.0 | 60.3 | 45.9 | 23.2 | 45.9 | 23.2 |
| ResNet-101-FPN | Cascade R-CNN | Yes | 42.4 | 59.7 | 45.7 | 23.8 | 45.9 | 56.5 |
Table 3: Comparison with state-of-the-art detectors on COCO test-dev.

| Method | Backbone | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|
| YOLOv2 | DarkNet-19 | 21.6 | 44.0 | 19.2 | 5.0 | 22.4 | 35.5 |
| YOLOv3 | DarkNet-53 | 33.0 | 57.9 | 34.4 | 18.3 | 35.4 | 41.9 |
| SSD513 | ResNet-101 | 31.2 | 50.4 | 33.3 | 10.2 | 34.5 | 49.8 |
| RetinaNet | ResNet-101 | 39.1 | 59.1 | 42.3 | 21.8 | 42.7 | 50.2 |
| Faster R-CNN | ResNet-101-FPN | 38.8 | 60.9 | 42.3 | 22.3 | 42.2 | 48.6 |
| Faster R-CNN by G-RMI [15] | Inception-ResNet-v2 | 34.7 | 55.5 | 36.7 | 13.5 | 38.1 | 52.0 |
| Faster R-CNN w/TDM [30] | Inception-ResNet-v2-TDM | 36.8 | 57.5 | 39.2 | 16.2 | 39.8 | 52.1 |
| Deformable R-FCN [6] | Aligned-Inception-ResNet | 37.5 | 58.0 | 40.8 | 19.4 | 40.1 | 52.5 |
| Mask R-CNN [12] | ResNet-101-FPN | 38.2 | 60.3 | 41.7 | 20.1 | 41.1 | 50.2 |
| Cascade R-CNN | ResNet-50-FPN | 40.7 | 59.3 | 44.1 | 23.1 | 43.6 | 51.4 |
| Cascade R-CNN | ResNet-101-FPN | 42.4 | 61.1 | 46.1 | 23.6 | 45.4 | 54.1 |
| Libra R-CNN | ResNet-101-FPN | 40.3 | 61.3 | 43.9 | 22.9 | 43.1 | 51.0 |
| IoU-Net | ResNet-101-FPN | 40.6 | 59.0 | - | - | - | - |
| Faster R-CNN+IoU-uniform R-CNN | ResNet-50-FPN | 39.0 | 57.8 | 42.0 | 22.4 | 41.9 | 48.7 |
| Faster R-CNN+IoU-uniform R-CNN | ResNet-101-FPN | 41.2 | 60.1 | 44.3 | 23.6 | 44.1 | 52.2 |
| Cascade R-CNN+IoU-uniform R-CNN | ResNet-50-FPN | 41.3 | 58.8 | 44.4 | 23.8 | 43.9 | 52.0 |
| Cascade R-CNN+IoU-uniform R-CNN | ResNet-101-FPN | 42.8 | 60.3 | 46.1 | 24.1 | 45.7 | 54.5 |
4.1 Implementation details
For fair comparison, all experiments are implemented with PyTorch and the mmdetection toolbox [4]. Considering the increase in the number of positive samples, we choose to double the learning rate: it is initialized to 0.01 and 0.005 for MS COCO and PASCAL VOC, respectively. We use SGD as the optimizer and train all models for 12 epochs. All other hyper-parameters follow the mmdetection defaults unless otherwise noted.
As for the hyper-parameters of RoI sample generation, we set $K = 4$ and $N = 16$: the IoU range is split into 4 intervals, [0.5, 0.6), [0.6, 0.7), [0.7, 0.8), and [0.8, 1.0), and each GT bounding box generates 64 samples. The regression loss weights for the different IoU intervals are tuned as described in Section 3.1. We also tried a larger $N$ and split the IoU range into finer intervals but did not obtain noticeable improvements.
4.2 Main results
Experiments on PASCAL VOC. The original evaluation metric of PASCAL VOC is mAP at an IoU threshold of 0.5. As our method is mainly designed to alleviate the performance imbalance among different IoU levels, we extend the original metric to the COCO-style criterion, which averages AP across IoU thresholds from 0.5 to 0.95 with a step of 0.05. We evaluate the validity of our method on two state-of-the-art object detectors: Faster R-CNN and Cascade R-CNN. From Table 1, we can see that the improvement on Faster R-CNN is obvious: 5.2 and 4.8 points with the 50- and 101-layer backbones, respectively. We also find that even though Cascade R-CNN and our method address the same IoU distribution imbalance problem in different ways, our method still raises its performance. These results further demonstrate the compatibility and adaptivity of our method. Analyzing the performance at different IoU thresholds, we observe that most of the improvement comes from high IoUs, in accordance with our expectation.
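Concretely, the extended metric is just the mean of the per-threshold APs; in the toy snippet below, `ap_at` is a hypothetical mapping from IoU threshold to the AP measured at that threshold:

```python
# COCO-style AP: mean of APs at IoU thresholds 0.50, 0.55, ..., 0.95
thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]
coco_style_ap = sum(ap_at[t] for t in thresholds) / len(thresholds)
```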
Experiments on MS COCO. To further demonstrate the generalization capacity of our approach, we also conduct experiments on the more challenging COCO dataset. All reported results follow the standard COCO-style Average Precision (AP) metrics: AP50 (AP at IoU threshold 0.5), AP75 (AP at IoU threshold 0.75), and APS, APM, APL, which correspond to results at small, medium, and large scales, respectively. Table 2 shows the results on the validation set: our approach brings consistent and substantial improvements across multiple detectors with different backbones. Specifically, it improves Faster R-CNN and Cascade R-CNN by 2.7 and 0.5 points with the ResNet-50-FPN backbone, and by 2.4 and 0.4 points with the ResNet-101-FPN backbone.
Comparison with state-of-the-art methods. We compare IoU-uniform R-CNN with state-of-the-art object detection approaches on COCO test-dev in Table 3. Without bells and whistles, it achieves 41.2 AP with ResNet-101-FPN, 2.4 points higher than the baseline. With a more powerful feature extractor and base detector, IoU-uniform R-CNN achieves 42.8 AP, demonstrating the superior performance of our method.
4.3 Analysis
We perform a thorough study of each component of our method and explain how it works with detailed statistics.
Table 4: Ablation of each component on PASCAL VOC 2007 (Faster R-CNN, ResNet-50-FPN). EFO: eliminating feature offsets.

| Method | AP | AP50 | AP75 | Improvement |
|---|---|---|---|---|
| Baseline | 50.2 | 79.1 | 54.7 | |
| +uniform IoU distribution | 52.1 | 79.8 | 56.2 | +1.9/+0.7/+1.5 |
| +EFO (with IoU predictor) | 54.4 | 79.5 | 59.6 | +4.2/+0.4/+4.9 |
| +tuning weight | 55.4 | 79.8 | 59.6 | +5.2/+0.7/+4.9 |
Component analysis. To analyze the importance of each proposed component, we report the overall ablation studies in Table 4. We gradually add our strategies to Faster R-CNN with a ResNet-50-FPN backbone and report results on PASCAL VOC 2007. The generated, IoU-uniformly-distributed samples give the detector more potential to make progress: each component brings gains, and their combination achieves a total gain of 5.2 AP.
How do the generated training samples affect the regressor? To verify the improvement they bring to the regressor, we first conducted a sanity-check experiment that splits the classifier and regressor into two branches and found that performance drops only slightly, which excludes the effect of splitting the original branch. As shown in Figure 3, compared with the original, our regressor significantly improves its performance on high-IoU proposals. Going a step further, we plot the IoU histogram of bounding boxes after the refinement of the regressor in Figure 6: training with the IoU-uniformly-distributed samples yields more high-IoU bounding boxes.

How do the generated training samples affect the IoU predictor? To verify the validity of training the IoU predictor with IoU-uniformly-distributed samples, we use the generated samples only for the IoU predictor, while the regression branch is still trained on the output of the RPN. For comparison, we also train a model whose IoU predictor uses the output of the RPN as training samples. As shown in Table 6, the detector whose IoU predictor is trained with the generated samples outperforms the one trained with the output of the RPN by 2.3 points. With a more powerful IoU predictor, we obtain a more reliable metric for ranking the bounding boxes, which promotes proposal preservation in NMS. To analyze the improvement, we plot the recall curves for different NMS ranking criteria in Figure 7, with the matching IoU ranging from 0.5 to 1. Our method achieves better recall across IoU thresholds, indicating that it helps the NMS process preserve accurately localized boxes.
Influence of uniform IoU distribution versus the number of samples. Although we have shown the validity of using generated training samples, it remains unclear whether the improvement mainly comes from the uniform distribution or simply from the larger number of samples. Hence we design an experiment in which we resample the generated samples so that their quantity matches the number of original positive samples produced by the RPN. As shown in Table 5, the IoU-uniformly-distributed samples achieve a 3.2-point AP improvement with the same number of samples; doubling the quantity further improves performance by 1.7 points. From these results, we identify the IoU imbalance as the main obstacle to better performance of the existing network structure, while additional gains can be obtained from more RoI samples.
Influence of tuning the regression weights. As discussed in Section 3.1, we can construct a more balanced regression loss by tuning the loss weights of different IoU intervals. This is also supported by our experiments: as shown in Figure 3, the performance of the regressor is further improved.
Table 5: Influence of the number of generated samples on PASCAL VOC 2007.

| Method | Num of samples | AP | AP50 | AP75 |
|---|---|---|---|---|
| Baseline | – | 50.2 | 79.1 | 54.7 |
| IoU-uniform R-CNN | Equal | 53.4 | 79.3 | 57.5 |
| IoU-uniform R-CNN | Double | 55.1 | 80.3 | 59.5 |
Table 6: Influence of the training samples and of eliminating feature offsets (AP on PASCAL VOC 2007).

| Training samples | Eliminating feature offsets | AP |
|---|---|---|
| Output of RPN | No | 48.7 |
| Output of RPN | Yes | 49.9 |
| Generated samples | No | 50.2 |
| Generated samples | Yes | 52.2 |
Influence of eliminating the feature offsets. From the results in Table 6, we can see its major impact on final performance: without eliminating the feature offsets, the gain from the IoU-uniformly-distributed samples is almost cancelled out. Figure 8 suggests why it has such a large impact. The x-axis is the IoU between the refined bounding box and its matched ground truth, and the y-axis is the predicted value. In Figure 8(a), the predicted value is not well correlated with the ground truth. We attribute this to the transition from low IoU to high IoU: since the average IoU increment for low-IoU candidate boxes is around 0.2, the stale predicted value lags far behind the ground truth, which leads to the potential suppression of accurately localized boxes. As visualized in Figure 8(b), the IoU estimation becomes much more accurate after eliminating the feature offsets. It is worth mentioning that the stronger the IoU prediction branch is, the more gains elimination brings: the gain with generated uniform samples is 2.0 points, compared to 1.2 points with RPN samples. Qualitative comparisons between detectors with and without feature-offset elimination are provided in Figure 9: eliminating the feature offsets of RoIs helps preserve more accurate detection results. These observations further demonstrate the effectiveness of eliminating the feature offsets of RoIs for better IoU prediction.


Table 7: Ablation on the number of images per GPU (Num) and learning rate (Lr) on PASCAL VOC 2007.

| Num | Lr | AP | AP50 | AP60 | AP70 | AP80 | AP90 |
|---|---|---|---|---|---|---|---|
| 2 | 0.0025 | 53.18 | 79.3 | 73.7 | 64.5 | 48.9 | 19.9 |
| 2 | 0.005 | – | 79.5 | 74.4 | 66.0 | 50.1 | 21.8 |
| 4 | 0.005 | 53.44 | 79.0 | 74.1 | 64.8 | 48.3 | 18.2 |
| 4 | 0.01 | 54.19 | 79.7 | 74.7 | 65.6 | 50.5 | 21.5 |
| 4 | 0.015 | 53.84 | 78.6 | 73.9 | 65.5 | 49.7 | 21.4 |

4.4 Ablation study
During training, we found that the performance of IoU-uniform R-CNN is sensitive to the batch size and learning rate. According to the Linear Scaling Rule [11], the learning rate should be divided by 4, as we only have 2 GPUs available compared with the default 8. Thus, for Faster R-CNN with a ResNet-50-FPN backbone on PASCAL VOC, the learning rate should be 0.0025. However, we obtain better results by doubling it, which we attribute to the increased number of training samples for the regression branch: the number of original training samples for regression is usually no more than 100 per image, but when we generate samples ourselves, the average number reaches 200. We further conduct ablation studies on the number of images per GPU and the learning rate to determine suitable hyper-parameters. As shown in Table 7, we obtain the best results with num = 2 and learning rate = 0.005.
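The arithmetic behind this choice, assuming mmdetection's default of 8 GPUs with a base learning rate of 0.01 for this VOC configuration, is simply:

```python
base_lr, base_gpus, gpus = 0.01, 8, 2   # assumed defaults; 2 GPUs available
lr = base_lr * gpus / base_gpus         # Linear Scaling Rule [11] -> 0.0025
lr *= 2                                 # doubled for the extra regression samples -> 0.005
```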
5 Conclusion
In this paper, we reveal the limitations of the RPN and rethink the IoU distribution imbalance problem in object detection. The proposed IoU-uniform R-CNN, a simple but effective method, alleviates the imbalance in both the number of samples and the regression loss among different IoU intervals. In particular, we first replace the RoI training samples with generated, IoU-uniformly-distributed samples, and then tune the loss weights of different IoU intervals to further balance the composition of the regression loss. Besides, we identify the feature offsets of RoIs during the inference of the IoU prediction branch and resolve them by re-extracting the features of the refined RoIs. Extensive experiments show superior performance on both the PASCAL VOC and MS COCO datasets, as well as compatibility and adaptivity with many object detection architectures.
References
- [1] (2017) Soft-nms–improving object detection with one line of code. In Proceedings of the IEEE international conference on computer vision, pp. 5561–5569. Cited by: §2.
- [2] (2018) Cascade r-cnn: delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6154–6162. Cited by: §1, §2, §2.
- [3] (2019) Prime sample attention in object detection. arXiv preprint arXiv:1904.04821. Cited by: §2.
- [4] (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: §4.1.
- [5] (2016) R-fcn: object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pp. 379–387. Cited by: §2.
- [6] (2017) Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 764–773. Cited by: Table 3.
- [7] (2019) Centernet: keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6569–6578. Cited by: §2.
- [8] (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §4.
- [9] (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §2.
- [10] (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §2.
- [11] (2017) Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §4.1.
- [12] (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §3.2, Table 3.
- [13] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §4.1.
- [14] (2018) Softer-nms: rethinking bounding box regression for accurate object detection. arXiv preprint arXiv:1809.08545. Cited by: §2.
- [15] (2017) Speed/accuracy trade-offs for modern convolutional object detectors. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7310–7311. Cited by: Table 3.
- [16] (2018) Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–799. Cited by: §1, §1, §2, §3.2.
- [17] (2012) Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105. Cited by: §1.
- [18] (2018) Cornernet: detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750. Cited by: §2.
- [19] (2019) Gradient harmonized single-stage detector. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8577–8584. Cited by: §2.
- [20] (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §2.
- [21] (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §1, §2, §2.
- [22] (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §4.
- [23] (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §2, §2.
- [24] (2019) Imbalance problems in object detection: a review. arXiv preprint arXiv:1909.00169. Cited by: §2.
- [25] (2019) Libra r-cnn: towards balanced learning for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 821–830. Cited by: §1, §2.
- [26] (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §2.
- [27] (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263–7271. Cited by: §2.
- [28] (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1, §2.
- [29] (2016) Training region-based object detectors with online hard example mining. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 761–769. Cited by: §2.
- [30] (2016) Beyond skip connections: top-down modulation for object detection. arXiv preprint arXiv:1612.06851. Cited by: Table 3.
- [31] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1.
- [32] (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §1.
- [33] (2019) FCOS: fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355. Cited by: §2.
- [34] (2018) Improving object localization with fitness nms and bounded iou loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6877–6885. Cited by: §2.
- [35] (2019) Feature selective anchor-free module for single-shot object detection. arXiv preprint arXiv:1903.00621. Cited by: §2.