DR Loss: Improving Object Detection by Distributional Ranking

07/23/2019 ∙ by Qi Qian, et al. ∙ 1

Most of object detection algorithms can be categorized into two classes: two-stage detectors and one-stage detectors. For two-stage detectors, a region proposal phase can filter massive background candidates in the first stage and it masks the classification task more balanced in the second stage. Recently, one-stage detectors have attracted much attention due to its simple yet effective architecture. Different from two-stage detectors, one-stage detectors have to identify foreground objects from all candidates in a single stage. This architecture is efficient but can suffer from the imbalance issue with respect to two aspects: the imbalance between classes and that in the distribution of background, where only a few candidates are hard to be identified. In this work, we propose to address the challenge by developing the distributional ranking (DR) loss. First, we convert the classification problem to a ranking problem to alleviate the class-imbalance problem. Then, we propose to rank the distribution of foreground candidates above that of background ones in the constrained worst-case scenario. This strategy not only handles the imbalance in background candidates but also improves the efficiency for the ranking algorithm. Besides the classification task, we also improve the regression loss by gradually approaching the L_1 loss as suggested in interior-point methods. To evaluate the proposed losses, we replace the corresponding losses in RetinaNet that reports the state-of-the-art performance as a one-stage detector. With the ResNet-101 as the backbone, our method can improve mAP on COCO data set from 39.1% to 41.1% by only changing the loss functions and it verifies the effectiveness of the proposed losses.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The performance of object detection has been improved dramatically with the development of deep neural networks in the past few years.

Figure 1: Illustration of the proposed distributional ranking loss. First, we re-weight examples to derive the constrained distributions for foreground and background from the original distributions, respectively. Then, we learn to rank the expectation of the derived distribution of foreground above that of background by a large margin.

Most of detection algorithms fall into two categories: two-stage detectors [3, 11, 12, 14] and one-stage detectors [6, 15, 17, 20]

. For the two-stage schema, the procedure of the algorithms can be divided into two parts. In the first stage, a region propose method will filter most of background candidate bounding boxes and keep only a small set of candidates. In the following stage, these candidates are classified as foreground classes or background and the bounding box is further refined by optimizing a regression loss. Two-stage detectors demonstrate the superior performance on real-world data sets while the efficiency can be an issue in practice, especially for the devices with limited computing resources, e.g., smart phones, cameras, etc.

Therefore, one-stage detectors are developed for an efficient detection. Different from two-stage detectors, one-stage algorithms consist of a single phase and have to identify foreground objects from all candidates directly. The structure of a ons-stage detector is straightforward and efficient. However, a one-stage detector may suffer from the imbalance problem that can reside in the following two aspects. First, the numbers of candidates between classes are imbalanced. Without a region proposal phase, the number of background candidates can easily overwhelm that of foreground ones. Second, the distribution of background candidates is imbalanced. Most of them can be easily separated from foreground objects while only a few of them are hard to classify.

To alleviate the imbalance problem, SSD [17] adopts hard negative mining, which keeps a small set of background candidates with the highest loss. By eliminating simple background candidates, the strategy balances the number of candidates between classes and the distribution of background simultaneously. However, some important classification information from background can be lost, and thus the detection performance can degrade. RetinaNet [15]

proposes to keep all background candidates but assign different weights for loss functions. The weighted cross entropy loss is called focal loss. It makes the algorithm focus on the hard candidates while reserving the information from all candidates. This strategy improves the performance of one-stage detectors significantly. Despite the success of focal loss, it re-weights classification losses in a heuristic way and can be insufficient to address the class-imbalance problem. Besides, the design of focal loss is data independent and lacks the exploration of the data distribution, which is essential to balance the distribution of background candidates.

In this work, we propose a data dependent ranking loss to handle the imbalance challenge. First, to alleviate the effect of the class-imbalance problem, we convert the classification problem to a ranking problem, which optimizes ranks of pairs. Since each pair consists of a foreground candidate and a background candidate, it is well balanced. Moreover, considering the imbalance in background candidates, we introduce the distributional ranking (DR) loss to rank the constrained distribution of foreground above that of background candidates. By re-weighting the candidates to derive the distribution corresponding to the worst-case loss, the loss can focus on the decision boundary between foreground and background distributions. Besides, we rank the expectation of distributions in lieu of original examples, which reduces the number of pairs in ranking and improves the efficiency. Compared with the re-weighting strategy in focal loss, that for DR loss is data dependent and can balance the distribution of background better. Fig. 1 illustrates the proposed DR loss. Besides the classification task, the regression is also important for detection to refine the bounding boxes of objects. The smoothed loss is prevalently adopted to approximate the loss in detection algorithms. We propose to improve the regression loss by gradually approaching the loss for better approximation, where the similar trick is also applied in interior-point methods [1].

We conduct the experiments on COCO [16] data set to demonstrate the proposed losses. Since RetinaNet reports the state-of-the-art performance among one-stage detectors, we replace the corresponding losses in RetinaNet with our proposed losses while the other components are retained. For fair comparison, we implement our algorithm in Detectron 111https://github.com/facebookresearch/Detectron, which is the official codebase of RetinaNet. With ResNet-101 [12] as the backbone, optimizing our loss functions can boost the mAP of RetinaNet from to , which confirms the effectiveness of proposed losses.

The rest of this paper is organized as follows. Section 2 reviews the related work in object detection. Section 3 describes the details of the proposed DR loss and regression loss. Section 4 compares the proposed losses to others on COCO detection task. Finally, Section 5 concludes this work with future directions.

2 Related Work

Detection is a fundamental task in computer vision. In conventional methods, hand crafted features, e.g., HOG 

[4] and SIFT [18], are used for detection either with a sliding-window strategy which holds a dense set of candidates, e.g., DPM [5] or with a region proposal method which keeps a sparse set of candidates, e.g., Selective Search [23]. Recently, since deep neural networks have shown the dominating performance in classification tasks [13], the features obtained from neural networks are leveraged for detection tasks.

R-CNN [9]

equips the region proposal stage and works as a two-stage algorithm. It first obtains a sparse set of regions by selective search. In the next stage, a deep convolutional neural network is applied to extract features for each region. Finally, regions are classified with a conventional classifier, e.g., SVM. R-CNN improves the performance of detection by a large margin but the procedure is too slow for real-world applications. Hence, many variants are developed to accelerate it 

[8, 21]. To further improve the accuracy, Mask-RCNN [11] adds a branch for object mask prediction to boost the performance with the additional information from multi-task learning. Besides the two-stage structure, cascade R-CNN [2] develops a multiple stage strategy to promote the quality of detectors after region proposal stage in a cascade fashion.

One-stage detectors are also developed for efficiency [17, 19, 22]. Since there is no region proposal phase to sample background candidates, one-stage detectors can suffer from the imbalance issue both between classes and in the background distribution. To alleviate the challenge, SSD [17] adopts hard example mining, which only keeps the hard background candidates for training. Recently, RetinaNet [15] is proposed to address the problem by focal loss. Unlike SSD, it keeps all background candidates but re-weights them such that the hard example will be assigned with a large weight. Focal loss improves the performance of detection explicitly, but the imbalance problem in detection is still not explored sufficiently. In this work, we develop the distributional ranking loss that ranks the distributions of foreground and background. It can alleviate the imbalance issue and capture the data distribution better with a data dependent mechanism.

3 DR Loss

Given a set of candidate bounding boxes from an image, a detector has to identify the foreground objects from background ones with a classification model. Let denote a classifier and it can be learned by optimizing the problem



is the number of total images. In this work, we employ sigmoid function to predict the probability for each example.

is determined by

and indicates the estimated probability that the

-th candidate in the -th image is from the -th class. is the loss function. In most of detectors, the classifier is learned by optimizing the cross entropy loss. For the binary classification problem, it can be written as

where is the label.

The objective in Eqn. 1 is conventional for object detection and it suffers from the class-imbalance problem. This can be demonstrated by rewriting the problem in the equivalent form


where and denote the positive (i.e., foreground) and negative (i.e., background) examples, respectively. and are the corresponding number of examples. When , the accumulated loss from the latter term will dominate. This issue is from the fact that the losses for positive and negative examples are separated and the contribution of positive examples will be overwhelmed by negative examples. A heuristic way to handle the problem is emphasizing positive examples, which can increase the weights for the corresponding losses. In this work, we aim to address the problem in a fundamental way.

3.1 Ranking

To alleviate the challenge from class-imbalance, we optimize the rank between positive and negative examples. Given a pair of positive and negative examples, an ideal ranking model can rank the positive example above the negative one with a large margin


where is a non-negative margin. Compared with the objective in Eqn. 1, the ranking model optimizes the relationship between individual positive and negative examples, which is well balanced.

The objective of ranking can be written as


where can be the hinge loss as

The objective can be interpreted as


It demonstrates that the objective measures the expected ranking loss by uniformly sampling a pair of positive and negative examples.

The ranking loss addresses the class-imbalance issue by comparing the rank of each positive example to negative examples. However, it ignores a phenomenon in object detection, where the distribution of negative examples is also imbalanced. Besides, the ranking loss introduces a new challenge, that is, the vast number of pairs. We tackle them in the following subsections.

3.2 Distributional Ranking

As indicated in Eqn. 3.1, the ranking loss in Eqn. 4 punishes a mis-ranking for a uniformly sampled pair. In detection, most of negative examples can be easily ranked well, that is, a randomly sampled pair will not incur the ranking loss with high probability. Therefore, we propose to optimize the ranking boundary to avoid the trivial solution


If we can rank the positive example with the lowest score above the negative one with the highest confidence, the whole set of candidates are perfectly ranked. Compared with the conventional ranking loss, the worst case loss is much more efficient by reducing the number of pairs from to

. Moreover, it clearly eliminates the class-imbalance issue since only a single pair of positive and negative examples are required for each image. However, this formulation is very sensitive to outliers, which can lead to the degraded detection model.

To improve the robustness, we first introduce the distribution for the positive and negative examples and obtain the expectation as

where and denote the distribution over positive and negative examples, respectively. and represent the expected ranking score under the corresponding distribution. is the simplex as . When and

are the uniform distribution,

and demonstrates the expectation from the original distribution.

By deriving the distribution corresponding to the worst-case loss from the original distribution

we can rewrite the problem in Eqn. 6 in the equivalent form

which can be considered as ranking the distributions between positive and negative examples in the worst case. It is obvious that the original formulation is not robust due to the fact that the domain of the generated distribution is unconstrained. Consequently, it will concentrate on a single example while ignoring the original distribution. Hence, we improve the robustness of the ranking loss by regularizing the freedom of the derived distribution as

where is a regularizer for the diversity of the distribution to prevent the distribution from the trivial one-hot solution. It can be different forms of entropy, e.g., Rényi entropy, Shannon entropy, etc. and are constants to control the freedom of distributions.

To obtain the constrained distribution, we investigate the subproblem

According to the dual theory [1], given , we can find the parameter to obtain the optimal by solving the problem

We observe that the former term is linear in . Hence, if is strongly concave in , the problem can be solved efficiently by first order algorithms [1].

Considering the efficiency, we adopt the Shannon entropy as the regularizer in this work and we can have the closed-form solution as follows.

Proposition 1.

For the problem

we have the closed-form solution as


It can be proved directly from K.K.T. condition [1]. ∎

For the distribution over positive examples, we have the similar result as

Proposition 2.

For the problem

we have the closed-form solution as

Figure 2: Illustration of the drifting in the distribution. We randomly sample

points from a Gaussian distribution to mimic negative examples. We change the weights of examples according to the proposed strategy as in Proposition 


and then plot the curves of different probability density functions (PDF) when varying


Remark 1

These Propositions show that the harder the example, the larger the weight of the example. Besides, the weight is data dependent and is affected by the data distribution.

Fig. 2 illustrates the drifting of the distribution with the proposed strategy. The derived distribution is approaching the distribution corresponding to the worst-case loss when decreasing .

With the closed-form solutions of distributions, the expectation of distributions can be computed as


Finally, smoothness is crucial for the convergence of non-convex optimization [7]. So we use the smoothed approximation instead of the original hinge loss as the loss function [25]


where controls the smoothness of the function. The larger the is , the more closer to the hinge loss the approximation is. Fig. 3 compares the hinge loss to its smoothed version in Eqn. 8.

Figure 3: Illustration of the hinge loss and its smoothed variants.

Incorporating all of these components, our distributional ranking loss can be defined as


where and are given in Eqn. 7 and is in Eqn. 8. Compared with the conventional ranking loss, we rank the expectation between two distributions. It shrinks the number of pairs to that leads to the efficient optimization.

The objective in Eqn. 9 looks complicated but its gradient is easy to compute. The detailed calculation of the gradient can be found in the appendix.

If we optimize the DR loss by the standard stochastic gradient descent (SGD) with mini-batch as

we can show that it can converge as in the following theorem and the detailed proof is cast to the appendix.

Theorem 1.

Let denote the model obtained from the -th iteration with SGD optimizer, where mini-batch size is . When

, if we assume the variance of the gradient is bounded as

and set the learning rate as , we have

Remark 2

Theorem 1 implies that the learning rate depends on the mini-batch size and the number of iterations as and the convergence rate is . We let , and denote an initial setting for training. If we increase the mini-batch size as and shrink the number of iterations as where , the convergence rate remains the same. However, the learning rate has to be increased as when , which is consistent with the observation in [10].

Remark 3

Theorem  1 also indicates that the convergence rate depends on . Therefore, trades between the approximation error and the convergence rate. When is large, the smoothed loss can simulate the hinge loss better while the convergence can become slow.

3.3 Recover Classification from Ranking

In detection, we have to identify foreground from background. Therefore, the results from ranking has to be converted to classification. A straightforward way is setting a threshold for the ranking score. However, the scores from different pairs can be inconsistent for classification. For example, given two pairs as

we observe that both of them have the perfect ranking but it is hard to set a threshold to classify positive examples from negative ones simultaneously. To make the ranking result meaningful for classification, we enforce a large margin in the constraint 3 as . Therefore, the constraint becomes

Due to the non-negative property of probability, it implies

which recovers the standard criterion for classification.

3.4 Bounding Box Regression

Besides classification, regression is also important for detection to refine the bounding box. Most of detectors apply smoothed loss to optimize the bounding box


It smoothes loss by loss in the interval of and guarantees that the whole loss function is smooth. It is reasonable since smoothness is important for convergence as indicated in Theorem 1. However, it may result in the slow optimization in the interval of loss. Inspired by the interior-point method [1], which gradually approximates the non-smooth domain by increasing the weight of the corresponding barrier function at different stages, we obtain from a decreasing function to reduce the gap between and losses. As suggested in the interior-point method, the current objective should be solved to optimum before changing the weight for the barrier function. We decay the value of in a stepwise manner. Specifically, we compute at the -th iteration as

where is a constant and denotes the width of a step.

Combining the regression loss, the objective of training the detector becomes

where is to balance the weights between classification and regression.

4 Experiments

4.1 Implementation Details

We evaluate the proposed losses on COCO 2017 data set [16], which contains about images for training, images for validation, and images for test. To focus on the comparison of loss functions, we employ the structure of RetinaNet [15] as the backbone and only substitute the corresponding loss functions. For fair comparison, we make the adequate modifications in the official codebase of RetinaNet, which is released in Detectron. Besides, we train the model with the same setting as RetinaNet. Specifically, the model is learned with SGD on GPUs and the mini-batch size is set as where each GPU can hold images at each iteration. Most of experiments are trained with iterations and the length is denoted as “”. The initial learning rate is and is decayed by a factor of after iterations and then iterations. For anchor density, we apply the same setting as in [15], where each location has scales and aspect ratios. The standard COCO evaluation criterion is used to compare the performance of different methods.

Since RetinaNet lacks the optimization of the relationship between positive and negative distributions, it has to initialize the output probability of the classifier at to fit the distribution of background. In contrast, we initialize the probability of the sigmoid function at , which is more reasonable for binary classification scenario without any prior knowledge. It also verifies that the proposed DR loss can handle class-imbalance better.

In Eqn. 7, we compute the constrained distribution over positive and negative examples with and , respectively. To reduce the number of parameters, we fix the ratio between and as and tune the scale as

It is easy to show that this strategy is equivalent to fixing and as and , and changing the base in the definition of the entropy regularizer as

Note that RetinaNet applies Feature Pyramid Network (FPN) [14] to obtain multiple scale features. To compute DR loss in one image, we collect candidates from multiple pyramid levels and obtain a single distribution for foreground and background, respectively.

4.2 Effect of Parameters

First, we take ablation experiments to evaluate the effect of multiple parameters on the validation set. All experiments in this subsection are implemented with a single image scale of for training and test. ResNet- is applied as the backbone for comparison. Only horizontal flipping is adopted as the data augmentation in this subsection.



5 3     38.6 58.3 41.7 21.5 43.0 51.4
6 5     38.7 58.8 41.5 21.1 42.9 52.1
7 5     38.7 58.9 41.4 21.7 42.9 52.0
8 5     38.6 58.7 41.3 21.6 42.4 51.9
Table 1: Comparison of the smooth term in Eqn. 8. Training uses iterations and ResNet-101 as the backbone.

Effect of :

controls the smoothness of the loss function in Eqn. 8. We compare the model with different in Table 1. Note that also changes the function value, we adjust the weight of classification loss accordingly. The base of entropy regularizer is fixed as . We observe that the loss function is quite stable for the choice of different smooth values. Besides, a larger will result in a smaller function value as shown in Fig. 3 and it suggests to increase the weight of classification loss to balance the losses. We keep and in the rest experiments.



3.5     38.7 58.6 41.7 21.1 43.0 52.5
4.0     38.7 58.8 41.5 21.1 42.9 52.1
4.5     38.6 58.5 41.4 21.1 42.6 52.6
Table 2: Comparison of the base in entropy regularizer . Training uses iterations and ResNet-101 as the backbone.

Effect of :

Next, we evaluate the effect of . changes the scale of and in the standard entropy regularizer. As illustrated in Fig. 2, a large will push the generated distribution to the extreme case while a small will make the derived distribution close to the original distribution. We vary the range of and summarize the results in Table 2. It is obvious that is also not sensitive in a reasonable range and we fix it to in the following experiments.



Fixed     38.7 58.8 41.5 21.1 42.9 52.1
Linear     39.0 58.7 42.0 21.7 42.9 52.6
=     39.0 58.7 41.7 21.4 43.1 52.3
=     39.1 58.8 42.3 21.5 43.3 52.3
Table 3: Comparison of the smooth term in Eqn. 10. Training uses iterations and ResNet-101 as the backbone.
Figure 4: Different strategies for decaying in Eqn. 10.
    RetinaNet Dr.Retina
depth scale     AP AP AP AP AP AP AP AP AP AP AP AP


50 400     30.5 47.8 32.7 11.2 33.8 46.1 32.1 50.2 34.0 11.5 34.5 47.5
50 500     32.5 50.9 34.8 13.9 35.8 46.7 34.2 53.1 36.3 14.8 36.8 48.6
50 600     34.3 53.2 36.9 16.2 37.4 47.4 35.8 55.0 38.4 17.5 38.2 48.9
50 700     35.1 54.2 37.7 18.0 39.3 46.4 36.6 56.2 39.0 19.1 39.2 48.5
50 800     35.7 55.0 38.5 18.9 38.9 46.3 37.2 56.7 39.8 20.1 40.0 48.3
101 400     31.9 49.5 34.1 11.6 35.8 48.5 33.5 51.9 35.6 11.9 36.5 49.8
101 500     34.4 53.1 36.8 14.7 38.5 49.1 36.0 55.1 38.2 15.6 38.9 51.0
101 600     36.0 55.2 38.7 17.4 39.6 49.7 37.6 57.1 40.4 18.1 40.6 51.4
101 700     37.1 56.6 39.8 19.1 40.6 49.4 38.6 58.4 41.4 20.3 41.5 51.2
101 800     37.8 57.5 40.8 20.2 41.1 49.2 39.2 59.0 42.1 21.5 42.3 51.0
Table 4: Comparison of different input scales and backbones. Training uses iterations. Results on the test set are reported.
Methods Backbone AP AP AP AP AP AP


two-stage detectors
Faster R-CNN+++ [12] ResNet-101-C4 34.9 55.7 37.4 15.6 38.7 50.9
Faster R-CNN w FPN [14] ResNet-101-FPN 36.2 59.1 39.0 18.2 39.0 48.2
Deformable R-FCN [3] Aligned-Inception-ResNet 37.5 58.0 40.8 19.4 40.1 52.5
Mask R-CNN [11] Resnet-101-FPN 38.2 60.3 41.7 20.1 41.1 50.2
one-stage detectors
YOLOv2 [20] DarkNet-19 21.6 44.0 19.2 5.0 22.4 35.5
SSD513 [17] ResNet-101-SSD 31.2 50.4 33.3 10.2 34.5 49.8
DSSD513 [6] ResNet-101-DSSD 33.2 53.3 35.2 13.0 35.4 51.1
RetinaNet [15] ResNet-101-FPN 39.1 59.1 42.3 21.8 42.7 50.2
RetinaNet [15] ResNeXt-101-FPN 40.8 61.1 44.1 24.1 44.2 51.2
Dr.Retina ResNet-101-FPN 40.6 60.7 43.9 22.9 43.7 51.9
Dr.Retina ResNet-101-FPN 41.1 60.7 44.3 23.3 44.1 52.6
Dr.Retina ResNeXt-101-FPN 42.5 62.8 45.9 25.2 45.8 53.5
Table 5: Comparison with the state-of-the-art methods on COCO test set.

Effect of :

Finally, we demonstrate the different strategies for changing in the smoothed loss. In the implementation of RetinaNet, is fixed to . We compare three strategies to decay to , which are illustrated in Fig. 4. The results are shown in Table 3. First, it is evident that all strategies with decayed can improve the performance of detectors with a fixed . Then, the stepwise decay with outperforms linear decay and it verifies that the objective should be optimized sufficiently before moving to the decay step. We adopt stepwise decay in the next subsections.

Effect of DR Loss:

To illustrate the effect of DR loss, we collect the confidence scores of examples from all images in the validation set and compare the empirical probability density in Fig. 6. We include cross entropy loss and focal loss in the comparison. The model with cross entropy loss is trained by ourselves while the model with focal loss is downloaded directly from the official model zoo with the same configuration as DR loss.

First, we observe that most of examples have the extremely low confidence with cross entropy loss. It is because the number of negative examples overwhelms that of positive ones and it will classify most of examples to negative to obtain a small loss as demonstrated in Eqn. 2. Second, focal loss is better than cross entropy loss by drifting the distribution of foreground. However, the expectation of the foreground distribution is still close to that of background, and it has to adopt a small threshold as to identify positive examples from negative ones. Compared to cross entropy and focal loss, DR loss optimizes the foreground distribution significantly. By optimizing the ranking loss with a large margin, the expectation of the foreground examples is larger than while that of background is smaller than . It confirms that DR loss can address the imbalance between classes well. Consequently, DR loss allows us to set a large threshold for classification. We set the threshold as in experiments while it is not sensitive in the range of . Besides, the distribution of background examples with DR loss is more balanced than that with cross entropy or focal loss. It verifies that with the data dependent re-weighting strategy, DR loss can handle the imbalance in background distribution and focus on the hard negative examples appropriately.

(a) Background distribution (b) Foreground distribution
Figure 5: Illustration of empirical PDF of distributions that are computed from images in the validation set.

4.3 Performance with Different Scales

With the parameters suggested from ablation studies, we train the model with different scales and backbones to show the robustness of the proposed losses. We adopt ResNet-50 and ResNet-101 as backbones in the comparison. Training applies only horizontal flipping as the data augmentation. Table. 4 compares the performance with different scales to that of RetinaNet. We let “Dr.Retina” denote the RetinaNet with the proposed DR loss and the decaying strategy for smoothed loss. Evidently, Dr.Retina performs better than RetinaNet over all scales with different backbones. Since we only change the loss functions in RetinaNet, the inference time remains the same while the mAP is consistently improved by about . The comparison also shows that the parameters in Dr.Retina is not sensitive to the scale of input images. It implies that the proposed losses is applicable for real-world applications.

4.4 Comparison with State-of-the-Art

Finally, we compare Dr.Retina to the state-of-the-art two-stage and one-stage detectors on COCO test set. We follow the setting in [15] to increase the number of training iterations to , which contains iterations, and applies scale jitter in as the additional data augmentation for training. Note that we still use a single image scale and a single crop for test as above. Table 5 summarizes the comparison for Dr.Retina. To emphasize the effectiveness of DR loss, we first train a model with the original regression loss, which is denoted as “Dr.Retina”. With ResNet-101 as the backbone, we can observe that Dr.Retina improves AP from to and it confirms that DR loss can handle the imbalance issue in detection better than focal loss. With gradually approaching regression loss, Dr.Retina gains another improvement and surpasses RetinaNet by . Equipped with ResNeXt-32x8d-101 [24] and training, the performance of Dr.Retina can achieve as a one-stage detector on COCO detection task.

5 Conclusion

In this work, we propose the distributional ranking loss to address the imbalance challenge in one-stage object detection. It first converts the original classification problem to a ranking problem, which balances the classes of foreground and background. Furthermore, we propose to rank the expectation of derived distributions in lieu of original examples to focus on the hard examples, which balances the distribution of background. Besides, we improve the regression loss by developing the strategy to optimize loss better. Experiments on COCO verifies the effectiveness of the proposed losses. Since RPN also has the imbalance issue in two-stage detectors, applying DR loss for that can be our future work.


  • [1] S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.
  • [2] Z. Cai and N. Vasconcelos. Cascade R-CNN: delving into high quality object detection. In CVPR, pages 6154–6162, 2018.
  • [3] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, pages 764–773, 2017.
  • [4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, pages 886–893, 2005.
  • [5] P. F. Felzenszwalb, D. A. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.
  • [6] C. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. DSSD : Deconvolutional single shot detector. CoRR, abs/1701.06659, 2017.
  • [7] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  • [8] R. B. Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
  • [9] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
  • [10] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017.
  • [11] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask R-CNN. In ICCV, pages 2980–2988, 2017.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
  • [13] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
  • [14] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, pages 936–944, 2017.
  • [15] T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In ICCV, pages 2999–3007, 2017.
  • [16] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. In ECCV, pages 740–755, 2014.
  • [17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: single shot multibox detector. In ECCV, pages 21–37, 2016.
  • [18] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
  • [19] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, pages 779–788, 2016.
  • [20] J. Redmon and A. Farhadi. YOLO9000: better, faster, stronger. In CVPR, pages 6517–6525, 2017.
  • [21] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
  • [22] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
  • [23] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
  • [24] S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, pages 5987–5995, 2017.
  • [25] T. Zhang and F. J. Oles. Text categorization based on regularized linear classification methods. Inf. Retr., 4(1):5–31, 2001.

Appendix A Gradient of DR Loss

We define the DR loss as




It looks complicated but its gradient is easy to compute. Here we give the detailed gradient form. For , we have

where .

For , we have

Appendix B Proof of Theorem 1


We assume that the loss in Eqn. 9 is -smoothness, so we have

According to the definition, we have

If we assume that the variance is bounded as

then we have

Therefore, we have

By assuming and adding from to , we have

We finish the proof by letting

Appendix C Experiments

Effect of DR Loss:

We illustrate the empirical PDF of foreground and background from DR loss in Fig. 6. Fig. 6 (a) show the original density of foreground and background. To make the results more explicit, we decay the density of background by a factor of and demonstrate the result in Fig. 6 (b). It is obvious that DR loss can separate the foreground and background with a large margin in the imbalance scenario.

(a) Original Density (b) Decayed Density
Figure 6: Illustration of empirical PDF of distributions from DR loss.