Prime Sample Attention in Object Detection

It is a common paradigm in object detection frameworks to treat all samples equally and aim to maximize the average performance. In this work, we revisit this paradigm through a careful study of how different samples contribute to the overall performance measured in terms of mAP. Our study suggests that the samples in each mini-batch are neither independent nor equally important, and therefore a better classifier on average does not necessarily mean higher mAP. Motivated by this study, we propose the notion of Prime Samples, those that play a key role in driving the detection performance. We further develop a simple yet effective sampling and learning strategy called PrIme Sample Attention (PISA) that directs the focus of the training process towards such samples. Our experiments demonstrate that it is often more effective to focus on prime samples than hard samples when training a detector. In particular, on the MSCOCO dataset, PISA outperforms the random sampling baseline and hard mining schemes, e.g. OHEM and Focal Loss, consistently by more than 1% with a strong backbone ResNeXt-101.



1 Introduction

Modern object detection frameworks, including both single-stage [17, 14] and two-stage [7, 6, 19], usually adopt a region-based approach, where a detector is trained to classify and localize sampled regions. Therefore, the choice of region samples is critical to the success of an object detector. In practice, most of the samples are located in the background areas. Hence, simply feeding all the region samples, or a random subset thereof, through a network and optimizing the average loss is obviously not a very effective strategy.

Recent studies [17, 20, 14] showed that focusing on difficult samples is an effective way to boost the performance of an object detector. A number of methods have been developed to implement this idea in various ways. Representative methods along this line include OHEM [20] and Focal Loss [14]. The former explicitly selects hard samples, i.e., those with high loss values, while the latter uses a reshaped loss function to reweight the samples, emphasizing difficult ones.

Figure 1: Left shows both a prime sample (in red color) and a hard sample (in blue color) for an object against the ground-truth. The prime sample has a high IoU with the ground-truth and is located more precisely around the object. Right shows the RoC curves obtained with different sampling strategies, which suggests that attention to prime samples instead of hard samples is a more effective strategy to boost the performance of a detector.

Though simple and widely adopted, random sampling and hard mining are not necessarily the optimal sampling strategies for training an effective detector. In particular, a question remains open: what are the most important samples for training an object detector? In this work, we carry out a study on this issue with the aim of finding a more effective way to sample/weight regions.

Our study reveals two significant aspects that need to be taken into consideration when designing a sampling strategy: (1) Samples should not be treated as independent and equally important. Region-based object detection selects a small subset of bounding boxes out of a large number of candidates to cover all objects in an image. Hence, the decisions on different samples compete with each other, instead of being independent (as in a classification task). In general, it is more advisable for a detector to yield high scores on one bounding box around each object while ensuring all objects of interest are sufficiently covered, instead of trying to produce high scores for all positive samples, i.e., those that substantially overlap with objects. In particular, our study shows that focusing on those samples with the highest IoUs with the ground-truth objects is an effective way towards this goal. (2) The objectives of classification and localization are correlated. The observation that samples precisely located around ground-truth objects are particularly important has a strong implication: the objective of classification is closely related to that of localization. In particular, well-located samples need to be well classified with high confidence.

Inspired by the study, we propose PrIme Sample Attention (PISA), a simple yet effective method to sample regions and learn object detectors, where we refer to those samples that play a more important role in achieving high detection performance as the prime samples. Specifically, we define IoU Hierarchical Local Rank (IoU-HLR) to rank the samples in each mini-batch. This ranking strategy places the samples with the highest IoUs around each object at the top of the ranked list and directs the focus of the training process to them via a simple re-weighting scheme. We also devise a classification-aware regression loss to jointly optimize both the classification and the regression branches. In particular, this loss suppresses the scores of those samples with large regression loss, thus reinforcing the attention to prime samples.

We tested PISA with both two-stage and single-stage detection frameworks. On the MSCOCO [15] test-dev, with a strong backbone of ResNeXt-101-32x4d, PISA improves Faster R-CNN [19], Mask R-CNN [8] and RetinaNet [14] by 1.2%, 1.0% and 1.4% respectively. For SSD, PISA achieves a gain of 2.0%.

Our main contributions lie in three aspects: (1) Our study leads to a new insight into what samples are important for training an object detector, thus establishing the notion of prime samples. (2) We devise IoU Hierarchical Local Rank (IoU-HLR) to rank the importance of samples, and on top of that an importance-based reweighting scheme. (3) We introduce a new loss called classification-aware regression loss that jointly optimizes both the classification and regression branches, which further reinforces the attention to prime samples.

2 Related Work

Region-based object detectors. Region-based object detectors transform the detection task into a bounding box classification and regression problem. Contemporary approaches mostly fall into two categories, i.e., the two-stage and single-stage detection paradigms. Two-stage detectors such as R-CNN [7], Fast R-CNN [6] and Faster R-CNN [19] first generate a set of candidate proposals, and then randomly sample a relatively small batch of proposals from all the candidates. These proposals are classified into foreground classes or background, and their locations are simultaneously refined by bounding box regression. There are also some recent improvements on the architectures [4, 13, 16], pipelines [8, 1, 2] and components [10, 25, 23]. In contrast, single-stage detectors like SSD [17] and RetinaNet [14] directly predict class scores and box offsets from anchors, without the region proposal step. Other variants include [26, 12, 27, 28]. The proposed PISA is not designed for any specific detector but can be easily applied to both paradigms.

Sampling strategies in object detection. The most widely adopted sampling scheme in object detection is random sampling, that is, to randomly select some samples from all candidates. Since negative samples usually far outnumber positive ones, a fixed ratio may be set for positive and negative samples during sampling, as in [6, 19]. Another popular idea is to sample hard samples, which have larger losses; this strategy can lead to better optimization of classifiers. The principle of hard mining was proposed in early detection work [22, 5], and has also been adopted by more recent methods in the deep learning era. Specifically, [17] and [7] use hard negative mining, which selects hard samples among all negative ones in each image. OHEM [20] mines hard examples on the fly at each iteration, regardless of whether they are positive or negative. Libra R-CNN [18] proposes IoU-balanced Sampling as an approximation of hard negative mining, which samples an equal number of negative RoIs in different IoU ranges. RetinaNet [14] does not perform actual sampling, although it can be seen as a soft version of sampling: it applies different loss weights to different samples through Focal Loss, to focus more on hard positive and negative samples. However, the goal of hard mining is to boost the average performance of the classifier, and it does not investigate the difference between detection and classification. In contrast, PISA can achieve a biased performance on different samples. According to our study in Sec. 3, we find that prime samples are more likely to be easy ones, which is the opposite of hard mining.

Relation between samples. Unlike conventional detectors that predict all samples independently, [10] proposes an attention module adapted from the natural language processing field to model relations between objects. Though it is effective with a more complicated framework, all samples are treated equally and relations are learned implicitly, without understanding what exactly they are. In PISA, samples are attended to differently according to their importance.

Improvement of NMS with localization confidence. IoU-Net [11] claims that it is not proper to use classification scores for NMS and that the localization confidence is needed. Besides the conventional branches for classification and regression, it adds an extra branch to predict the IoU of samples, and uses the localization confidence (predicted IoU) to rank all samples. There are some major differences between IoU-Net and our method. Firstly, we do not exploit an additional branch to predict the localization confidence, but correlate the training of the two branches. More importantly, our goal is not to improve NMS. We investigate the sample importance and propose to pay more attention to prime samples with the importance-based reweighting.

Figure 2: Precision-recall curves under different IoU thresholds. The solid lines correspond to the baseline; the dashed and dotted lines are the results of reducing the classification loss by increasing the scores of the top-5 and top-25 IoU-HLR positive samples, respectively.

3 Prime Samples

In this section, we introduce the concept of Prime Samples, namely those that have greater influence on the performance of object detection. Specifically, we carry out a study on the importance of different samples by revisiting mAP, the major performance metric for object detection. Our study shows that the importance of each sample depends on how its IoU with the ground-truth object compares to that of the others overlapping with the same object. Therefore, we propose IoU-HLR, a new ranking strategy, as a quantitative way to assess the importance.

A Revisit to mAP.

Mean Average Precision (mAP) is a widely adopted metric for assessing the performance in object detection, which is computed as follows. Given an image with annotated ground-truths, each bounding box is marked as a true positive (TP) when: (i) the IoU between this bounding box and its nearest ground truth is greater than a threshold $\theta$, and (ii) there is no other box with a higher score that is also a TP of the same ground truth. All other bounding boxes are considered false positives (FP). Then, the recall is defined as the fraction of ground-truths that are covered by TPs, and the precision is defined as the fraction of resulting bounding boxes that are TPs. On a testing dataset, one can obtain a precision-recall curve by varying the threshold $\theta$, usually ranging from 0.5 to 0.95, and compute the average precision (AP) for each class as the area under the curve. Then mAP is defined as the mean of the AP values over all classes.
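The TP/FP rule above can be illustrated with a minimal sketch. It assumes boxes are given as `(x1, y1, x2, y2)` tuples and handles a single class in a single image; the function names `iou`, `mark_true_positives` and `precision_recall` are hypothetical helpers introduced here for illustration, not part of any evaluation toolkit.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mark_true_positives(dets, gts, thr):
    """dets: list of (box, score); gts: list of boxes.

    Returns one bool per detection (in input order). A detection is a TP when
    the IoU with its nearest ground truth exceeds `thr` and no higher-scored
    detection is already a TP of that ground truth.
    """
    is_tp = [False] * len(dets)
    if not gts:
        return is_tp
    claimed = set()
    # Visit detections from highest to lowest score.
    for i in sorted(range(len(dets)), key=lambda j: -dets[j][1]):
        box = dets[i][0]
        nearest = max(range(len(gts)), key=lambda g: iou(box, gts[g]))
        if iou(box, gts[nearest]) > thr and nearest not in claimed:
            claimed.add(nearest)
            is_tp[i] = True
    return is_tp


def precision_recall(is_tp, num_gt):
    """Precision = TPs / detections, recall = TPs / ground truths."""
    tp = sum(is_tp)
    precision = tp / len(is_tp) if is_tp else 0.0
    recall = tp / num_gt if num_gt else 0.0
    return precision, recall
```

In this sketch, raising `thr` can turn a high-scored but loosely located box into an FP, which frees the ground truth for a lower-scored box with higher IoU.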

The way that mAP works reveals two criteria on which samples are more important for an object detector. (1) Among all bounding boxes that overlap with a ground-truth object, the one with the highest IoU is the most important, as its IoU value directly influences the recall. (2) Across all highest-IoU bounding boxes for different objects, the ones with higher IoUs are more important, as they are the last ones to fall below the IoU threshold as $\theta$ increases and thus have great impact on the overall precision.

Figure 3: Two steps to compute IoU-HLR. Samples are first sorted by IoU locally, and then sorted in the same-rank group.

IoU Hierarchical Local Rank (IoU-HLR).

Based on the analysis above, we propose IoU Hierarchical Local Rank (IoU-HLR) to rank the importance of the bounding box samples in a mini-batch. This rank is computed in a hierarchical manner, which reflects the IoU relation both locally (around each ground truth object) and globally (over the whole image or mini-batch). Notably, IoU-HLR is computed based on the final located position of samples, rather than the bounding box coordinates before regression, since mAP is evaluated based on the regressed sample location. As shown in Figure 3, we first divide all samples into different groups, based on their nearest ground truth objects. Next, we sort the samples within each group in descending order by their IoU with the ground truth, and get the IoU Local Rank (IoU-LR). Subsequently, we take samples with the same IoU-LR and sort them in descending order of IoU. Specifically, all top-1 IoU-LR samples are collected and sorted, followed by top-2, top-3, and so on. These two stages of sorting result in a linear order among all samples in a batch, that is, the IoU-HLR.
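The two sorting stages can be sketched in a few lines. This is a minimal illustration under stated assumptions: each sample is represented as a `(gt_index, iou)` pair, where `iou` is measured after regression and `gt_index` identifies its nearest ground truth; all samples are assumed to belong to one class; and the function name `iou_hlr` is hypothetical.

```python
from collections import defaultdict


def iou_hlr(samples):
    """samples: list of (gt_index, iou). Returns the IoU-HLR of each sample,
    where rank 0 is the most important sample."""
    # Stage 1: group by nearest ground truth and rank by IoU inside each group.
    groups = defaultdict(list)
    for idx, (gt, _) in enumerate(samples):
        groups[gt].append(idx)
    local_rank = [0] * len(samples)
    for members in groups.values():
        members.sort(key=lambda i: -samples[i][1])
        for r, i in enumerate(members):
            local_rank[i] = r
    # Stage 2: order by local rank first, then by IoU within the same local rank.
    order = sorted(range(len(samples)),
                   key=lambda i: (local_rank[i], -samples[i][1]))
    hlr = [0] * len(samples)
    for r, i in enumerate(order):
        hlr[i] = r
    return hlr
```

For example, with two objects, the best box of each object occupies the first two ranks, and the second-best boxes follow regardless of their absolute IoU.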

IoU-HLR follows the two criteria mentioned above. First, it places those samples with higher local ranks ahead, which are the samples that are most important to each individual ground-truth object. Second, within each local group, it re-ranks the samples according to IoU, which aligns with the second criterion. Note that it is often good enough to ensure high accuracy on those samples that top this ranked list, as they directly influence both the recall and the precision, especially when the IoU threshold is high, while those down the list are less important in terms of achieving high detection performance. As shown in Figure 2, the solid lines are the precision-recall curves under different IoU thresholds. We simulate some experiments by increasing the scores of samples. With the same budget, e.g., reducing the total loss by 10%, we increase the scores of the top-5 and top-25 IoU-HLR samples and plot the results in dashed and dotted lines respectively. The results suggest that focusing on only the top samples is better than attending to more samples equally.

We plot the distributions of random, hard, and prime samples in Figure 4, with the IoU vs. classification loss. We can observe that prime samples tend to have high IoUs and low classification loss, while hard samples tend to have higher classification losses and scatter over a wider range along the IoU axis. This suggests that these two categories of samples have essentially different characteristics.

Figure 4: The distribution of random, hard, and prime samples. Here, the hard samples are chosen as the ones with top three loss values from each image; while the prime samples are those ranked as top three according to IoU-HLR.

Figure 5: Examples of PISA and random sampling baseline. PISA significantly increases the scores of samples with high local IoU rank.

4 Learn Detectors via Prime Sample Attention

The aim of object detection is not to obtain a better classification accuracy on average, but to achieve as good performance as possible on prime samples in the set, as discussed in Section 3. Nevertheless, this is nontrivial. If we just use top IoU-HLR samples for training in the way OHEM does, the mAP will drop significantly because most prime samples are easy ones and cannot provide enough gradients to optimize the classifier. In this work, we propose Prime Sample Attention, a simple and effective sampling and learning strategy which pays more attention to prime samples. PISA consists of two components: Importance-based Sample Reweighting (ISR) and Classification-Aware Regression Loss (CARL). With the proposed method, the training process is biased towards prime samples rather than treating all samples evenly. Firstly, the loss weights of prime samples are larger than the others, so that the classifier tends to predict higher scores on these samples. Secondly, the classifier and regressor are learned with a joint objective, thus the scores of prime samples get boosted relative to unimportant ones.

4.1 Importance-based Sample Reweighting

Given the same classifier, the distribution of performance usually matches the distribution of training samples. If part of the samples occurs more frequently in the training data, a better classification accuracy on those samples is supposed to be achieved. Hard sampling and soft sampling are two different ways to change the training data distribution. Hard sampling selects a subset of samples from all candidates to train a model, while soft sampling assigns different weights for all samples. Hard sampling can be seen as a special case of soft sampling, where each sample is assigned a loss weight of either 0 or 1.

To make fewer modifications and fit existing frameworks, we propose a soft sampling strategy named Importance-based Sample Reweighting (ISR), which assigns different loss weights to samples according to their importance. Given that we adopt IoU-HLR as the importance measurement, the remaining question is how to map the importance to an appropriate loss weight.

We first transform the rank to a real value with a linear mapping. According to its definition, IoU-HLR is computed separately within each class. For class $j$, supposing there are $n_j$ samples in total with the IoU-HLR $\{r_1, r_2, \dots, r_{n_j}\}$, where $0 \le r_i \le n_j - 1$, we use a linear function to transform each $r_i$ to $u_i$ as shown in Equ. 1. Here $u_i$ denotes the importance value of the $i$-th sample of class $j$.


A monotone increasing function is needed to further cast the sample importance $u_i$ to a loss weight $w_i$. Here we adopt an exponential form as Equ. 2, where $\gamma$ is the degree factor indicating how much preference will be given to important samples and $\beta$ is a bias that decides the minimum sample weight.
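The rank-to-weight mapping can be sketched as follows. The exact forms are assumptions chosen to match the description: a linear map $u_i = (n - r_i)/n$ from rank to importance, where the normalizer $n$ is taken here to be the number of samples of the class, and an exponential map $w_i = ((1-\beta)\,u_i + \beta)^{\gamma}$ from importance to loss weight. The function names are hypothetical.

```python
def importance_from_rank(ranks, n_total):
    """Assumed linear mapping: rank 0 -> 1.0, rank n_total - 1 -> 1 / n_total."""
    return [(n_total - r) / n_total for r in ranks]


def isr_weight(importances, gamma=2.0, beta=0.0):
    """Assumed exponential mapping from importance to loss weight.

    gamma controls how strongly top-ranked samples are preferred;
    beta sets the floor of the value before exponentiation.
    """
    return [((1.0 - beta) * u + beta) ** gamma for u in importances]


# Usage: four samples of one class with IoU-HLR 0..3.
u = importance_from_rank([0, 1, 2, 3], n_total=4)
w = isr_weight(u, gamma=2.0, beta=0.0)
```

With `gamma=2.0` and `beta=0.0`, the weights fall off quadratically with rank; a larger `beta` flattens the curve so that low-ranked samples keep a non-negligible weight.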


With the proposed reweighting scheme, the cross entropy classification loss can be rewritten as Equ. 3, where $n$ and $m$ are the total numbers of positive and all samples respectively, and $s_i$ and $\hat{s}_i$ denote the predicted score and classification target of the $i$-th sample. Note that simply adding loss weights will change the total value of the losses and the ratio between the losses of positive and negative samples, so we normalize $w$ to $w'$ in order to keep the total loss of positive samples unchanged.
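The normalization step can be illustrated with a small sketch. It assumes one particular way to keep the total positive loss unchanged, namely rescaling every weight by the ratio of the unweighted to the weighted loss sum; the function name `normalized_isr_loss` and the input format (predicted probability of the target class for each positive sample) are hypothetical.

```python
import math


def normalized_isr_loss(p_target, weights):
    """p_target: predicted probability of the target class per positive sample.
    weights: raw ISR loss weights, one per positive sample.

    Returns (normalized_weights, weighted_loss). The rescaling is an assumed
    form that keeps the summed positive loss equal to the unweighted sum.
    """
    ce = [-math.log(p) for p in p_target]  # per-sample cross entropy
    plain = sum(ce)
    weighted = sum(w * l for w, l in zip(weights, ce))
    scale = plain / weighted if weighted > 0 else 1.0
    norm_w = [w * scale for w in weights]
    return norm_w, sum(w * l for w, l in zip(norm_w, ce))
```

After rescaling, the relative preference among samples is preserved while the balance between positive and negative losses stays the same as in the unweighted case.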


4.2 Classification-Aware Regression Loss

Re-weighting the classification loss is a straightforward way to focus on prime samples. Besides that, we develop another method to highlight the prime samples, motivated by the earlier discussion that classification and localization are correlated. We propose to jointly optimize the two branches with a Classification-Aware Regression Loss (CARL). CARL can boost the scores of prime samples while suppressing the scores of other ones. The regression quality determines the importance of a sample, and we expect the classifier to output higher scores for important samples. The optimization of the two branches should be correlated rather than independent.

Our solution is to make the regression loss aware of the classification scores, so that gradients are propagated from the regression branch to the classification branch. To this end, we propose CARL as shown in Equ. 5.

Here $p_i$ denotes the predicted probability of the corresponding ground truth class and $d_i$ denotes the output regression offset. We use an exponential function to transform $p_i$ to $v_i$, and then rescale it according to the average value over all samples. Similar to Equ. 3, the classification awareness is also normalized to keep the loss scale unchanged. $\mathcal{L}$ is the commonly used smooth L1 loss.
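A minimal numeric sketch of the loss described above follows. It assumes the exponential transform has the same shape as the ISR one, $v_i = ((1-b)\,p_i + b)^k$, and that rescaling divides by the mean of $v$ over the positive samples; both are assumptions made for illustration. The function names are hypothetical, and plain Python is used, so no gradients are computed here.

```python
def smooth_l1(diff, beta=1.0):
    """Standard smooth L1 penalty on a single residual."""
    a = abs(diff)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta


def carl_loss(probs, reg_residuals, k=1.0, b=0.2):
    """probs: predicted probability of the ground-truth class per positive sample.
    reg_residuals: per-sample list of residuals (predicted minus target offsets).

    Returns (c, loss), where c holds the classification-aware weights.
    In an autograd framework the mean used for rescaling would be detached.
    """
    v = [((1.0 - b) * p + b) ** k for p in probs]
    mean_v = sum(v) / len(v)
    c = [x / mean_v for x in v]  # weights average to 1, keeping the loss scale
    per_sample = [sum(smooth_l1(r) for r in res) for res in reg_residuals]
    return c, sum(ci * li for ci, li in zip(c, per_sample))
```

Because the weights average to one, the overall scale of the regression loss is preserved, while samples with higher classification scores contribute more to it.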


It is obvious that the gradient of $c_i$ is proportional to the original regression loss $\mathcal{L}(d_i, \hat{d}_i)$. In the supplementary, we prove that there is a positive correlation between $\mathcal{L}(d_i, \hat{d}_i)$ and the gradient of $p_i$. Namely, samples with greater regression loss will receive larger gradients for the classification scores, which means stronger suppression of the classification scores. From another view, $\mathcal{L}(d_i, \hat{d}_i)$ reflects the localization quality of sample $i$, and thus can be seen as an estimate of IoU and further as an estimate of IoU-HLR. Approximately, top-ranked samples have low regression loss, thus the gradients of their classification scores are smaller. With CARL, the classification branch also gets supervised by the regression loss. The scores of unimportant samples are greatly suppressed, while the attention to prime samples is reinforced.
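As a worked illustration of this dependence, suppose (as an assumption, not the exact form used in the derivation of the supplementary) that the classification-aware weight is $c_i = \alpha\, v_i$ with $v_i = ((1-b)\,p_i + b)^k$, where $\alpha$ is the detached normalization factor and is therefore treated as a constant. With $L_{carl} = \sum_i c_i\, \mathcal{L}(d_i, \hat{d}_i)$, the gradients are

$$
\frac{\partial L_{carl}}{\partial c_i} = \mathcal{L}(d_i, \hat{d}_i),
\qquad
\frac{\partial L_{carl}}{\partial p_i} = \alpha\, k\, (1-b)\, \big((1-b)\,p_i + b\big)^{k-1}\, \mathcal{L}(d_i, \hat{d}_i).
$$

Under these assumptions, for $k = 1$ the gradient with respect to $p_i$ reduces to $\alpha (1-b)\, \mathcal{L}(d_i, \hat{d}_i)$, which is exactly proportional to the regression loss; for other $k > 0$ it still grows with the regression loss at a fixed $p_i$. Gradient descent then lowers $p_i$ more strongly for samples that are poorly localized.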

5 Experiments

| Method | Backbone | Train time | $AP$ | $AP_{50}$ | $AP_{75}$ | $AP_S$ | $AP_M$ | $AP_L$ |
|---|---|---|---|---|---|---|---|---|
| Two-stage detectors | | | | | | | | |
| Faster R-CNN | ResNet-50 | 0.585 | 36.7 | 58.8 | 39.6 | 21.6 | 39.8 | 44.9 |
| Faster R-CNN | ResNeXt-101 | 1.004 | 40.3 | 62.7 | 44.0 | 24.4 | 43.7 | 49.8 |
| Mask R-CNN | ResNet-50 | 0.746 | 37.5 | 59.4 | 40.7 | 22.1 | 40.6 | 46.2 |
| Mask R-CNN | ResNeXt-101 | 1.134 | 41.4 | 63.4 | 45.2 | 24.5 | 44.9 | 51.8 |
| Faster R-CNN w/ PISA | ResNet-50 | 0.594 | 37.8 (+1.1) | 58.0 | 41.7 | 22.1 | 40.8 | 46.6 |
| Faster R-CNN w/ PISA | ResNeXt-101 | 1.018 | 41.5 (+1.2) | 61.8 | 45.8 | 24.7 | 44.7 | 51.9 |
| Mask R-CNN w/ PISA | ResNet-50 | 0.765 | 38.5 (+1.0) | 58.5 | 42.5 | 22.3 | 41.2 | 48.1 |
| Mask R-CNN w/ PISA | ResNeXt-101 | 1.224 | 42.4 (+1.0) | 62.4 | 46.9 | 25.0 | 45.9 | 53.2 |
| Single-stage detectors | | | | | | | | |
| RetinaNet | ResNet-50 | 0.526 | 35.9 | 56.0 | 38.3 | 19.8 | 38.9 | 45.0 |
| RetinaNet | ResNeXt-101 | 1.017 | 39.0 | 59.7 | 41.9 | 22.3 | 42.5 | 48.9 |
| SSD300 | VGG16 | 0.256 | 25.7 | 44.2 | 26.4 | 7.0 | 27.1 | 41.5 |
| RetinaNet w/ PISA | ResNet-50 | 0.575 | 37.2 (+1.3) | 55.8 | 40.2 | 20.2 | 40.2 | 47.2 |
| RetinaNet w/ PISA | ResNeXt-101 | 1.100 | 40.4 (+1.4) | 59.7 | 43.8 | 22.9 | 43.6 | 51.3 |
| SSD300 w/ PISA | VGG16 | 0.287 | 27.7 (+2.0) | 44.3 | 29.3 | 7.9 | 28.7 | 44.1 |

Table 1: Results of different detectors on COCO test-dev. The training time is measured on GTX 1080Ti.
| Method | Backbone | AP (VOC) | AP (COCO) |
|---|---|---|---|
| Faster R-CNN | ResNet-50 | 79.1 | 48.4 |
| Faster R-CNN w/ PISA | ResNet-50 | 79.7 | 50.4 |
| RetinaNet | ResNet-50 | 79.0 | 51.8 |
| RetinaNet w/ PISA | ResNet-50 | 78.9 | 53.2 |
| SSD300 | VGG16 | 77.8 | 49.5 |
| SSD300 w/ PISA | VGG16 | 77.8 | 51.3 |

Table 2: Results of different detectors on VOC2007 test.

5.1 Experimental Setting

Dataset and evaluation metric.

We conduct experiments on the challenging MS COCO 2017 dataset [15]. It consists of two subsets: the train split with 118k images and the val split with 5k images. We use the train split for training and report the performance on val and test-dev. The standard COCO-style AP metric is adopted, which averages mAP over multiple IoU thresholds from 0.5 to 0.95 with an interval of 0.05.
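The averaging over IoU thresholds can be sketched as below. This is only an illustration of the final averaging step; the per-threshold mAP values are assumed to be computed elsewhere, and the function names are hypothetical.

```python
def coco_iou_thresholds():
    """IoU thresholds 0.50, 0.55, ..., 0.95 (ten values)."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def coco_style_ap(map_at_threshold):
    """map_at_threshold: dict mapping each IoU threshold to the mAP at it.
    Returns the mean over the ten COCO thresholds."""
    thresholds = coco_iou_thresholds()
    return sum(map_at_threshold[t] for t in thresholds) / len(thresholds)
```

Since nine of the ten thresholds are above 0.5, gains at high IoU thresholds weigh heavily in this metric.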

Implementation details. We implement our methods based on mmdetection [3]. ResNet-50 [9], ResNeXt-101-32x4d [24] and VGG16 [21] are adopted as backbones in our experiments. Detailed settings are described in the supplementary material.

5.2 Results

Overall results. We evaluate the proposed PISA on both two-stage and single-stage detectors, on two popular benchmarks. We use the same hyper-parameters of PISA for all backbones and datasets. The results on the MS COCO dataset are shown in Table 1. PISA achieves consistent mAP improvements on all detectors with different backbones, indicating its effectiveness and generality. Specifically, with a strong backbone ResNeXt-101, it improves Faster R-CNN, Mask R-CNN and RetinaNet by 1.2%, 1.0% and 1.4% respectively. On SSD, the gain is as significant as 2.0%. The cost in training time brought by PISA is negligible, and there is no difference during inference. On the smaller PASCAL VOC dataset, PISA also outperforms the baselines, as shown in Table 2. PISA brings very limited gains under the VOC evaluation metric, which uses 0.5 as the IoU threshold, but achieves significantly better results under the COCO metric, which uses the average over multiple IoU thresholds. This implies that PISA is especially beneficial to high IoU metrics and makes more accurate predictions on precisely located samples.

Comparison of different sampling methods. We compare PISA with random sampling and hard mining. To investigate the effects of different sampling methods, we apply them to positive and negative samples respectively. Faster R-CNN is adopted as the baseline method. As shown in Table 3, PISA outperforms OHEM and its variants in all cases. We find that hard mining is effective when applied to negative samples, but hampers the performance when applied to positive samples. When adopting the random sampling strategy for negative samples, PISA is 1.7% higher than hard mining and 1.3% higher than random sampling. When adopting the hard mining strategy for negative samples, PISA is 1.3% and 1.1% superior to hard mining and random sampling. It is noted that the gain originates from the AP at high IoU thresholds, such as $AP_{75}$. $AP_{50}$ is even slightly lower than the baselines. This indicates that attending to prime samples helps the classifier to be more accurate on samples with high IoUs. We demonstrate some qualitative results of PISA and the baseline in Figure 5.

| pos | neg | $AP$ | $AP_{50}$ | $AP_{75}$ | $AP_S$ | $AP_M$ | $AP_L$ |
|---|---|---|---|---|---|---|---|
| R | R | 36.4 | 58.4 | 39.1 | 21.6 | 40.1 | 46.6 |
| H* | R | 36.0 | 58.3 | 38.7 | 21.1 | 39.5 | 45.8 |
| C | R | 37.7 | 57.5 | 41.6 | 21.8 | 41.3 | 48.7 |
| H | H | 36.9 | 58.1 | 40.3 | 21.2 | 40.4 | 47.9 |
| H* | H* | 36.8 | 58.2 | 39.8 | 21.2 | 40.4 | 48.5 |
| R | H* | 37.0 | 58.1 | 40.4 | 21.3 | 40.7 | 48.3 |
| C | H* | 38.1 | 57.6 | 42.4 | 21.8 | 41.7 | 50.3 |

Table 3: Comparison of different sampling strategies on Faster R-CNN. Results are evaluated on COCO val. “R”, “H” and “C” denote random sampling, OHEM and prime sample attention respectively. “H*” denotes the variant of OHEM which adopts a fixed positive/negative ratio so that we can combine it with other sampling schemes.

5.3 Analysis

We perform a thorough study on each component of PISA, and explain how it works compared with random sampling and hard mining.

Component Analysis. Table 4 shows the effects of each component of PISA. ISR and CARL improve the AP by 0.7% and 0.8% respectively, and their combination achieves a total gain of 1.3%.

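| ISR | CARL | $AP$ | $AP_{50}$ | $AP_{75}$ | $AP_S$ | $AP_M$ | $AP_L$ |
|---|---|---|---|---|---|---|---|
| | | 36.4 | 58.4 | 39.1 | 21.6 | 40.1 | 46.6 |
| ✓ | | 37.1 | 58.7 | 40.5 | 21.9 | 41.0 | 47.6 |
| | ✓ | 37.2 | 57.6 | 40.6 | 22.1 | 40.9 | 48.2 |
| ✓ | ✓ | 37.7 | 57.5 | 41.6 | 21.8 | 41.3 | 48.7 |

Table 4: Effectiveness of components of PISA.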

Ablation experiments of hyper-parameters. For both ISR and CARL, we use an exponential transformation function of the form in Equ. 2, and two hyper-parameters ($\gamma, \beta$ for ISR and $k, b$ for CARL) are introduced. As shown in Figure 6, the exponential factor $\gamma$ or $k$ controls the steepness of the curve, while the constant factor $\beta$ or $b$ affects the minimum value.

Figure 6: The transformation function under different hyper-parameters.

When performing the ablation study on the hyper-parameters of ISR, we do not adopt CARL, and vice versa. A larger $\gamma$ and a smaller $\beta$ mean a larger gap between prime samples and unimportant samples, so that we focus more on prime samples. The opposite case means we pay more equal attention to all samples. When adopting $\gamma = 2.0$ and $\beta = 0.0$, our model achieves 37.1%, which is the best performance among these settings. We use it as the default setting in our main experiments. In CARL, we fail to run the setting $k = 2.0$, which diverges at the very beginning. Therefore we adopt $k = 1.0$, and find that the performance is not sensitive to different values of $b$. $b = 0.2$ is used in other experiments as the default setting.

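| ISR $\gamma$ | ISR $\beta$ | $AP$ | CARL $k$ | CARL $b$ | $AP$ |
|---|---|---|---|---|---|
| 0.5 | 0.0 | 36.7 | 0.5 | 0.0 | 36.9 |
| 1.0 | 0.0 | 37.0 | 1.0 | 0.0 | 36.9 |
| 2.0 | 0.0 | 37.1 | 2.0 | 0.0 | - |
| 2.0 | 0.1 | 37.0 | 1.0 | 0.1 | 37.2 |
| 2.0 | 0.2 | 36.8 | 1.0 | 0.2 | 37.2 |
| 2.0 | 0.3 | 36.9 | 1.0 | 0.3 | 37.2 |

Table 5: Varying $\gamma$, $\beta$ in ISR and $k$, $b$ in CARL.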

We also study the influence of batch size when ranking the samples. In our experiments using IoU-HLR, samples are ranked within the batch of a GPU (usually 2 images). We can also rank the samples within each image or a “sub-batch” instead. We use a small backbone of ResNet-18 so that each GPU can host 8 images, then we rank the samples within 1, 2, 4, 8 images respectively. Results in Table 6 show that ranking the samples among more images brings some minor gains. During evaluation, samples are ranked in the whole dataset.

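| ranking batch | $AP$ | $AP_{50}$ | $AP_{75}$ | $AP_S$ | $AP_M$ | $AP_L$ |
|---|---|---|---|---|---|---|
| baseline | 32.0 | 53.2 | 34.1 | 18.2 | 34.9 | 41.2 |
| 1 | 33.4 | 53.1 | 36.6 | 18.2 | 36.0 | 43.8 |
| 2 | 33.5 | 52.7 | 36.7 | 17.7 | 36.5 | 44.1 |
| 4 | 33.6 | 52.8 | 37.0 | 18.1 | 36.5 | 43.7 |
| 8 | 33.7 | 53.1 | 37.1 | 18.5 | 36.6 | 43.5 |

Table 6: Different batch sizes for ranking samples.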

What samples do different sampling strategies prefer? To understand how ISR works, we study the sample distribution of random sampling, hard mining, and ISR from two aspects, IoU and loss. The weights of samples are the same for random sampling and hard mining, while different for ISR. Thus we take the sample weights into account when computing the distribution statistics. The IoU and loss distributions of selected samples are shown in Figure 7. We can see that hard samples selected by hard mining and important samples selected by PISA diverge from each other. Hard samples have high losses and low IoUs, while important samples come with high IoUs and low losses, indicating that important samples tend to be easier for classifiers.

Figure 7: (a) IoU distribution and (b) loss distribution of different sampling schemes.

How does ISR affect classification scores? ISR assigns larger weights to prime samples, but does it achieve the biased classification performance as expected? We plot the score distribution of samples of different rankings in Figure 8. The average score of top-ranked samples is higher than the baseline, while that of lower-ranked samples is lower. The results demonstrate that ISR biases the classifier, thus boosting the prime samples while suppressing others.

Figure 8: Average scores of different IoU-HLR samples. ISR increases the average score of prime samples while decreasing the scores of unimportant ones.

How does CARL affect classification scores? Introducing classification scores to the regression loss has two effects. The first one is the gradient, as discussed in Sec. 4.2. There is another side effect: different regression weights are assigned to different samples, so samples with higher scores receive more attention during the training of the regressor. Here we investigate which one plays the most important role. We train two models with different settings: (a) backpropagate the gradient from the regression loss to the classification scores, and (b) no gradient (implemented with the detach() method in PyTorch). All other settings are kept the same. Models (a) and (b) achieve 37.2% and 36.6% mAP respectively. The results show that using scores as loss weights brings only minor gains, and the gradient that suppresses unimportant samples is the key contribution.

We plot the average scores of samples of different IoU, as shown in Figure 9. Compared with the baseline, CARL suppresses the scores of low IoU samples as expected.

Figure 9: Average scores at different IoU samples. CARL decreases the score of low IoU samples more significantly.

Is IoU-HLR better than other metrics? The results above show that IoU-HLR is an effective importance metric while loss is not, but is it better than others? We test other metrics for ISR, including loss rank, IoU, and IoU before regression. The results are shown in Table 7, which suggests that (1) the performance is more related to IoU than to loss, (2) using the locations after regression is important, and (3) IoU-HLR is better than IoU. These results match our intuition and analysis in Sec. 3.

Metric                  AP    AP50  AP75  APS   APM   APL
Loss Rank               36.3  58.0  39.3  21.4  40.1  46.4
IoU before regression   36.5  58.4  39.5  21.6  40.3  46.4
IoU                     36.8  58.6  40.0  21.9  41.3  47.2
IoU-HLR                 37.1  58.7  40.3  21.7  40.9  47.1
Table 7: Comparison of ISR with different importance metrics.
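To make the difference between ranking by IoU and ranking by IoU-HLR concrete, the sketch below orders a toy set of positive samples both ways. It reflects one reading of the hierarchical ranking defined in Sec. 3 (group samples by their nearest ground-truth object, rank within each group by IoU, then sort the samples of each local rank by IoU and concatenate the levels); the function names and the toy inputs are hypothetical.

```python
from collections import defaultdict


def iou_order(ious):
    """Indices of samples sorted by IoU, highest first."""
    return sorted(range(len(ious)), key=lambda i: ious[i], reverse=True)


def iou_hlr_order(ious, gt_ids):
    """Indices of samples sorted by hierarchical local rank, most important first.

    ious[i]   : IoU between positive sample i and its nearest ground truth
    gt_ids[i] : index of that nearest ground-truth object
    """
    # Step 1: group samples by their nearest ground-truth object.
    groups = defaultdict(list)
    for idx, gt in enumerate(gt_ids):
        groups[gt].append(idx)

    # Step 2: rank samples within each group by IoU (descending).
    by_local_rank = defaultdict(list)
    for members in groups.values():
        members.sort(key=lambda i: ious[i], reverse=True)
        for local_rank, idx in enumerate(members):
            by_local_rank[local_rank].append(idx)

    # Step 3: take all top-1 samples, then all top-2 samples, and so on,
    # sorting each level by IoU (descending) before concatenating.
    order = []
    for local_rank in sorted(by_local_rank):
        level = by_local_rank[local_rank]
        order.extend(sorted(level, key=lambda i: ious[i], reverse=True))
    return order


toy_ious = [0.9, 0.8, 0.6, 0.7]
toy_gt_ids = [0, 0, 1, 1]
plain = iou_order(toy_ious)                       # [0, 1, 3, 2]
hierarchical = iou_hlr_order(toy_ious, toy_gt_ids)  # [0, 3, 1, 2]
```

In this toy case, sample 3 is the best box for object 1, so the hierarchical order places it ahead of sample 1 even though sample 1 has the higher IoU.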

6 Conclusion

We study the question of which samples are the most important for training an object detector, and establish the notion of prime samples. We present PrIme Sample Attention (PISA), a simple and effective sampling and learning strategy that highlights important samples. On both the MS COCO and PASCAL VOC datasets, PISA achieves consistent improvements over its random sampling and hard mining counterparts.

Appendix A: Derivative of CARL

As discussed in Section 4.2, we prove that there is a positive correlation between and , where


Since we detach the inputs when computing the normalization ratio , gradients are not propagated from to . Thus we can just denote as a constant .




Denoting , we have


The batch size is usually large, so . Thus we have


On the other hand,


We have and , so . Especially when , .

Combining (6)(7)(9)(10),


When , , indicating that is proportional to , otherwise and are positively correlated.

Appendix B: Implementation details

We use 8 GTX 1080Ti GPUs in all experiments. For SSD, we adopt VGG16 as the backbone and resize input images to . We train the model for a total of 120 epochs with a minibatch of 64 images (8 images per GPU). The learning rate is initialized to 0.001 and reduced by a factor of 10 after 80 and 110 epochs. For other methods, we use ResNet-50 or ResNeXt-101-32x4d as the backbone. FPN is used by default. The batch size is 16 (2 images per GPU). Models are trained for 12 epochs with an initial learning rate of 0.02, which is reduced by a factor of 10 after 8 and 11 epochs. For the random sampling baseline, we sample 512 RoIs from 2000 proposals, and the ratio of positive to negative samples is set to 1:3. When OHEM is used, we forward all 2000 proposals and select the 512 samples with the highest loss. A variant of OHEM is also explored, in which the ratio of positive to negative samples is set to 1:3 and the two groups are mined independently. When applying ISR to RetinaNet, we simply multiply the focal loss by our weights.
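The schedule and the sampling baselines described above can be sketched in plain Python. This is a simplified illustration and not the code used in the experiments: the function names are hypothetical, epochs are assumed to be 0-indexed, and filling the batch with extra negatives when fewer than 128 positives are available is an assumption.

```python
import random


def step_lr(epoch, base_lr, milestones, gamma=0.1):
    """Learning rate at a 0-indexed epoch under step decay."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


def random_sample(pos, neg, num=512, pos_fraction=0.25, rng=random):
    """Randomly sample RoIs with a positive-to-negative ratio of 1:3."""
    num_pos = min(int(num * pos_fraction), len(pos))
    # Assumption: missing positives are replaced by additional negatives.
    num_neg = min(num - num_pos, len(neg))
    return rng.sample(pos, num_pos), rng.sample(neg, num_neg)


def ohem_sample(losses, num=512):
    """Indices of the `num` RoIs with the highest loss."""
    order = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)
    return order[:num]


def ohem_sample_balanced(pos_losses, neg_losses, num=512, pos_fraction=0.25):
    """OHEM variant: mine positives and negatives independently at 1:3."""
    num_pos = min(int(num * pos_fraction), len(pos_losses))
    return ohem_sample(pos_losses, num_pos), ohem_sample(neg_losses, num - num_pos)


# 12-epoch schedule: 0.02 for epochs 0-7, then two tenfold reductions.
schedule = [step_lr(e, 0.02, (8, 11)) for e in range(12)]
```

With `milestones=(80, 110)` and `base_lr=0.001`, the same `step_lr` helper covers the 120-epoch SSD schedule.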


  • [1] Z. Cai and N. Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [2] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, et al. Hybrid task cascade for instance segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [3] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, C. C. Loy, and D. Lin. mmdetection., 2018.
  • [4] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pages 379–387, 2016.
  • [5] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
  • [6] R. Girshick. Fast r-cnn. In IEEE International Conference on Computer Vision, 2015.
  • [7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • [8] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In IEEE International Conference on Computer Vision, 2017.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [10] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei. Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3588–3597, 2018.
  • [11] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang. Acquisition of localization confidence for accurate object detection. In European Conference on Computer Vision, 2018.
  • [12] B. Li, Y. Liu, and X. Wang. Gradient harmonized single-stage detector. In AAAI Conference on Artificial Intelligence, 2019.
  • [13] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [14] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal loss for dense object detection. In IEEE International Conference on Computer Vision, 2017.
  • [15] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, 2014.
  • [16] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia. Path aggregation network for instance segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European Conference on Computer Vision, 2016.
  • [18] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin. Libra r-cnn: Towards balanced learning for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [19] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, 2015.
  • [20] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016.
  • [21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [22] K.-K. Sung and T. Poggio. Example-based learning for view-based human face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):39–51, 1998.
  • [23] J. Wang, K. Chen, S. Yang, C. C. Loy, and D. Lin. Region proposal by guided anchoring. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [24] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [25] H. Xu, X. Lv, X. Wang, Z. Ren, N. Bodla, and R. Chellappa. Deep regionlets for object detection. In European Conference on Computer Vision, 2018.
  • [26] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li. Single-shot refinement neural network for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [27] Q. Zhao, T. Sheng, Y. Wang, Z. Tang, Y. Chen, L. Cai, and H. Ling. M2det: A single-shot object detector based on multi-level feature pyramid network. In AAAI Conference on Artificial Intelligence, 2019.
  • [28] R. Zhu, S. Zhang, X. Wang, L. Wen, H. Shi, L. Bo, and T. Mei. Scratchdet: Exploring to train single-shot object detectors from scratch. arXiv preprint arXiv:1810.08425, 2018.