The performance of object detection has been improved dramatically with the development of deep neural networks in the past few years.
Detection algorithms largely follow one of two schemas: two-stage or one-stage. For the two-stage schema, the procedure can be divided into two parts. In the first stage, a region proposal method filters out most background candidate bounding boxes and keeps only a small set of candidates. In the following stage, these candidates are classified as foreground classes or background, and the bounding boxes are further refined by optimizing a regression loss. Two-stage detectors demonstrate superior performance on real-world data sets, but their efficiency can be an issue in practice, especially on devices with limited computing resources, e.g., smart phones and cameras.
Therefore, one-stage detectors are developed for efficient detection. Different from two-stage detectors, one-stage algorithms consist of a single phase and have to identify foreground objects from all candidates directly. The structure of a one-stage detector is straightforward and efficient. However, a one-stage detector may suffer from an imbalance problem that resides in the following two aspects. First, the numbers of candidates between classes are imbalanced. Without a region proposal phase, the number of background candidates can easily overwhelm that of foreground ones. Second, the distribution of background candidates is imbalanced. Most of them can be easily separated from foreground objects, while only a few are hard to classify.
To alleviate the imbalance problem, SSD adopts hard negative mining, which keeps only a small set of background candidates with the highest loss. By eliminating easy background candidates, this strategy balances the number of candidates between classes and the distribution of background simultaneously. However, some important classification information from the background can be lost, and thus the detection performance can degrade. RetinaNet proposes to keep all background candidates but assign different weights to their loss terms. The weighted cross entropy loss is called focal loss. It makes the algorithm focus on hard candidates while preserving the information from all candidates. This strategy improves the performance of one-stage detectors significantly. Despite the success of focal loss, it re-weights classification losses in a heuristic way, which can be insufficient to address the class-imbalance problem. Besides, the design of focal loss is data independent and lacks any exploration of the data distribution, which is essential for balancing the distribution of background candidates.
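As a concrete reference, the focal loss can be sketched in a few lines of NumPy. The defaults gamma=2 and alpha=0.25 follow the published focal loss formulation; the function name and the example probabilities are ours:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    p: predicted foreground probabilities in (0, 1)
    y: binary labels (1 = foreground, 0 = background)
    """
    p_t = np.where(y == 1, p, 1.0 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

# An easy background candidate (p close to 0) contributes almost nothing,
# while a hard one (p close to 1) keeps a large weight.
easy = focal_loss(np.array([0.01]), np.array([0]))
hard = focal_loss(np.array([0.9]), np.array([0]))
```

The modulating factor down-weights well-classified examples, which is exactly the heuristic re-weighting discussed above.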
In this work, we propose a data-dependent ranking loss to handle the imbalance challenge. First, to alleviate the effect of the class-imbalance problem, we convert the classification problem to a ranking problem, which optimizes ranks of pairs. Since each pair consists of one foreground candidate and one background candidate, it is well balanced. Moreover, considering the imbalance in background candidates, we introduce the distributional ranking (DR) loss to rank the constrained distribution of foreground above that of background candidates. By re-weighting the candidates to derive the distribution corresponding to the worst-case loss, the loss can focus on the decision boundary between the foreground and background distributions. Besides, we rank the expectations of the distributions in lieu of the original examples, which reduces the number of pairs in ranking and improves the efficiency. Compared with the re-weighting strategy in focal loss, that of DR loss is data dependent and can balance the distribution of background better. Fig. 1 illustrates the proposed DR loss. Besides the classification task, regression is also important for detection to refine the bounding boxes of objects. The smoothed L1 loss is prevalently adopted to approximate the L1 loss in detection algorithms. We propose to improve the regression loss by gradually approaching the L1 loss for a better approximation, where a similar trick is also applied in interior-point methods.
We conduct experiments on the COCO data set to demonstrate the proposed losses. Since RetinaNet reports the state-of-the-art performance among one-stage detectors, we replace the corresponding losses in RetinaNet with our proposed losses while the other components are retained. For a fair comparison, we implement our algorithm in Detectron (https://github.com/facebookresearch/Detectron), which is the official codebase of RetinaNet. With ResNet-101 as the backbone, optimizing our loss functions boosts the mAP of RetinaNet, which confirms the effectiveness of the proposed losses.
The rest of this paper is organized as follows. Section 2 reviews the related work in object detection. Section 3 describes the details of the proposed DR loss and regression loss. Section 4 compares the proposed losses to others on COCO detection task. Finally, Section 5 concludes this work with future directions.
2 Related Work
Detection is a fundamental task in computer vision. In conventional methods, hand-crafted features, e.g., HOG and SIFT, are used for detection either with a sliding-window strategy, which holds a dense set of candidates, e.g., DPM, or with a region proposal method, which keeps a sparse set of candidates, e.g., Selective Search. Recently, since deep neural networks have shown dominating performance in classification tasks, the features obtained from neural networks have been leveraged for detection tasks.
R-CNN equips the region proposal stage and works as a two-stage algorithm. It first obtains a sparse set of regions by selective search. In the next stage, a deep convolutional neural network is applied to extract features for each region. Finally, regions are classified with a conventional classifier, e.g., an SVM. R-CNN improves the performance of detection by a large margin, but the procedure is too slow for real-world applications. Hence, many variants have been developed to accelerate it [8, 21]. To further improve the accuracy, Mask R-CNN adds a branch for object mask prediction to boost the performance with the additional information from multi-task learning. Besides the two-stage structure, Cascade R-CNN develops a multi-stage strategy that promotes the quality of detectors after the region proposal stage in a cascade fashion.
One-stage detectors are also developed for efficiency [17, 19, 22]. Since there is no region proposal phase to sample background candidates, one-stage detectors can suffer from the imbalance issue both between classes and within the background distribution. To alleviate this challenge, SSD adopts hard example mining, which only keeps the hard background candidates for training. Recently, RetinaNet was proposed to address the problem with focal loss. Unlike SSD, it keeps all background candidates but re-weights them such that hard examples are assigned large weights. Focal loss improves the performance of detection significantly, but the imbalance problem in detection is still not sufficiently explored. In this work, we develop the distributional ranking loss, which ranks the distributions of foreground and background. It can alleviate the imbalance issue and capture the data distribution better with a data-dependent mechanism.
3 DR Loss
Given a set of candidate bounding boxes from an image, a detector has to identify the foreground objects from the background ones with a classification model. The classifier can be learned by optimizing the problem

where the objective accumulates a loss over the estimated probability that the j-th candidate in the i-th image belongs to the k-th class. In most detectors, the classifier is learned by optimizing the cross entropy loss. For the binary classification problem, it can be written as

where the label indicates whether the candidate is foreground.
The objective in Eqn. 1 is conventional for object detection, but it suffers from the class-imbalance problem. This can be demonstrated by rewriting the problem in the equivalent form

where the two terms sum the losses of positive (i.e., foreground) and negative (i.e., background) examples, respectively. When the number of negative examples far exceeds the number of positive ones, the accumulated loss from the latter term dominates. The issue stems from the fact that the losses for positive and negative examples are separated, so the contribution of positive examples is overwhelmed by negative examples. A heuristic way to handle the problem is emphasizing positive examples by increasing the weights of the corresponding losses. In this work, we aim to address the problem in a more fundamental way.
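A quick numeric sketch illustrates how the accumulated background loss dominates the objective; the candidate counts and score ranges below are illustrative, not taken from the paper:

```python
import numpy as np

def bce(p, y):
    """Binary cross entropy for probabilities p and labels y."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
n_pos, n_neg = 100, 100_000            # illustrative 1:1000 class imbalance
p_pos = rng.uniform(0.6, 0.9, n_pos)   # reasonably confident foreground scores
p_neg = rng.uniform(0.05, 0.2, n_neg)  # mostly easy background scores

pos_loss = bce(p_pos, np.ones(n_pos)).sum()
neg_loss = bce(p_neg, np.zeros(n_neg)).sum()
# Even though each background loss is tiny, their sheer number dominates the sum,
# so gradient descent mostly optimizes the background term.
```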
3.1 Ranking

To alleviate the challenge from class imbalance, we optimize the rank between positive and negative examples. Given a pair of one positive and one negative example, an ideal ranking model ranks the positive example above the negative one by a non-negative margin. Compared with the objective in Eqn. 1, the ranking model optimizes the relationship between individual positive and negative examples, which is well balanced.
The objective of ranking can be written as
where the pairwise loss can be the hinge loss
The objective can be interpreted as
It demonstrates that the objective measures the expected ranking loss by uniformly sampling a pair of positive and negative examples.
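The expected pairwise ranking loss over uniformly sampled pairs can be sketched as follows; the margin value and function names are our illustrative choices:

```python
import numpy as np

def pairwise_hinge_rank_loss(s_pos, s_neg, margin=0.5):
    """Mean hinge loss over all (positive, negative) score pairs.

    Penalizes a pair whenever the positive score does not exceed
    the negative score by at least `margin`.
    """
    diff = s_pos[:, None] - s_neg[None, :]   # all n+ x n- pairwise score gaps
    return np.maximum(0.0, margin - diff).mean()

s_pos = np.array([0.9, 0.8])
s_neg = np.array([0.1, 0.2, 0.95])           # one hard negative
loss = pairwise_hinge_rank_loss(s_pos, s_neg)
```

Note that the loss enumerates every foreground/background pair, which is exactly the efficiency challenge discussed next.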
The ranking loss addresses the class-imbalance issue by comparing the rank of each positive example to negative examples. However, it ignores a phenomenon in object detection: the distribution of negative examples is also imbalanced. Besides, the ranking loss introduces a new challenge, namely the vast number of pairs. We tackle these issues in the following subsections.
3.2 Distributional Ranking
As indicated in Eqn. 3.1, the ranking loss in Eqn. 4 penalizes a mis-ranking for a uniformly sampled pair. In detection, most negative examples can be ranked well easily; that is, a randomly sampled pair will not incur a ranking loss with high probability. Therefore, we propose to optimize the ranking boundary to avoid this trivial solution
If we can rank the positive example with the lowest score above the negative example with the highest confidence, the whole set of candidates is perfectly ranked. Compared with the conventional ranking loss, the worst-case loss is much more efficient, since it reduces the number of pairs to a single pair per image. Moreover, it clearly eliminates the class-imbalance issue, since only a single pair of positive and negative examples is required for each image. However, this formulation is very sensitive to outliers, which can lead to a degraded detection model.
To improve the robustness, we first introduce distributions over the positive and negative examples and obtain the expectations as

where the two distributions, constrained to the probability simplex, are over positive and negative examples, respectively, and the expected ranking scores are taken under the corresponding distributions. When both distributions are uniform, the expectations recover those of the original distribution.
By deriving the distribution corresponding to the worst-case loss from the original distribution
we can rewrite the problem in Eqn. 6 in the equivalent form
which can be considered as ranking the distributions of positive and negative examples in the worst case. The original formulation is not robust because the domain of the generated distribution is unconstrained. Consequently, the distribution will concentrate on a single example while ignoring the original distribution. Hence, we improve the robustness of the ranking loss by regularizing the freedom of the derived distribution as
where the regularizer encourages the diversity of the distribution and prevents it from collapsing to the trivial one-hot solution. It can take different forms of entropy, e.g., Rényi entropy or Shannon entropy, and the accompanying constants control the freedom of the distributions.
To obtain the constrained distribution, we investigate the subproblem
According to duality theory, given the regularization weight, we can find the dual parameter that yields the optimal distribution by solving the problem
We observe that the former term is linear in the distribution. Hence, if the regularizer is strongly concave, the problem can be solved efficiently by first-order algorithms.
Considering the efficiency, we adopt the Shannon entropy as the regularizer in this work, and we obtain the closed-form solution as follows.
For the problem
we have the closed-form solution as
This can be proved directly from the K.K.T. conditions. ∎
For the distribution over positive examples, we have a similar result.
For the problem
we have the closed-form solution as
These propositions show that the harder the example, the larger its weight. Besides, the weight is data dependent and is affected by the data distribution.
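With Shannon entropy as the regularizer, the worst-case distribution takes a softmax form over the per-example losses. The following sketch illustrates the data-dependent re-weighting; the function name, variable names, and temperature values are ours:

```python
import numpy as np

def worst_case_weights(losses, lam=0.1):
    """Entropy-regularized worst-case distribution over examples.

    Solves max_q sum_j q_j * losses_j + lam * H(q) on the simplex.
    With Shannon entropy H, the closed form is a softmax with temperature lam:
        q_j = exp(losses_j / lam) / sum_k exp(losses_k / lam)
    """
    z = losses / lam
    z = z - z.max()                 # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()

losses = np.array([0.1, 0.2, 2.0])  # one hard example
q = worst_case_weights(losses, lam=0.5)
# Harder examples receive larger weights. A small lam pushes the distribution
# toward one-hot (the worst case); a large lam keeps it close to uniform.
```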
Fig. 2 illustrates the drifting of the distribution under the proposed strategy. The derived distribution approaches the distribution corresponding to the worst-case loss as the regularization weight decreases.
With the closed-form solutions of distributions, the expectation of distributions can be computed as
Incorporating all of these components, our distributional ranking loss can be defined as
where the two expectations are given in Eqn. 7 and the smoothed surrogate is in Eqn. 8. Compared with the conventional ranking loss, we rank the expectations of the two distributions. This shrinks the number of pairs to one per image, which leads to efficient optimization.
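Putting the pieces together, one possible end-to-end sketch of the DR loss follows. We use softmax re-weighting for both distributions and a logistic function as a smoothed hinge surrogate; all constants, and the specific surrogate, are illustrative choices rather than the paper's exact values:

```python
import numpy as np

def dr_loss(p_pos, p_neg, lam_pos=0.5, lam_neg=0.1, margin=0.5, L=6.0):
    """Distributional ranking loss (illustrative sketch).

    Re-weights positives/negatives toward their worst cases, then ranks
    the two expectations with a smoothed hinge (logistic) surrogate.
    """
    def softmax(z):
        z = z - z.max()
        w = np.exp(z)
        return w / w.sum()

    # The worst case for positives emphasizes LOW scores,
    # while the worst case for negatives emphasizes HIGH scores.
    q_pos = softmax(-p_pos / lam_pos)
    q_neg = softmax(p_neg / lam_neg)
    e_pos = np.dot(q_pos, p_pos)        # expected positive score
    e_neg = np.dot(q_neg, p_neg)        # expected negative score
    z = margin - (e_pos - e_neg)        # violation of the ranked margin
    return np.log1p(np.exp(L * z)) / L  # smoothed hinge; -> max(0, z) as L grows

well_ranked = dr_loss(np.array([0.9, 0.95]), np.array([0.05, 0.1]))
mis_ranked = dr_loss(np.array([0.4, 0.5]), np.array([0.6, 0.7]))
```

Only one expectation pair enters the surrogate per image, matching the efficiency argument above.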
The objective in Eqn. 9 looks complicated but its gradient is easy to compute. The detailed calculation of the gradient can be found in the appendix.
If we optimize the DR loss by the standard stochastic gradient descent (SGD) with mini-batches, we can show that it converges as stated in the following theorem; the detailed proof is deferred to the appendix.
Let the iterate at the t-th step denote the model obtained from the SGD optimizer with a given mini-batch size. If we assume that the variance of the stochastic gradient is bounded and set the learning rate accordingly, we have
Theorem 1 implies that the learning rate depends on the mini-batch size and the number of iterations, and it characterizes the convergence rate. Consider an initial setting for training with a given mini-batch size and number of iterations. If we increase the mini-batch size by some factor and shrink the number of iterations by the same factor, the convergence rate remains the same. However, the learning rate has to be increased proportionally, which is consistent with the observation in [10].
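This trade-off mirrors the linear scaling rule for large mini-batch training; a minimal sketch, with all numbers illustrative:

```python
def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule: the learning rate grows proportionally to batch size."""
    return base_lr * batch / base_batch

# Doubling the mini-batch (and halving the number of iterations) keeps the
# convergence rate of Theorem 1 unchanged if the learning rate is doubled too.
lr = scaled_lr(0.01, 16, 32)
```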
Theorem 1 also indicates that the convergence rate depends on the smoothness parameter. Therefore, this parameter trades off the approximation error against the convergence rate. When it is large, the smoothed loss simulates the hinge loss better, while the convergence can become slow.
3.3 Recover Classification from Ranking
In detection, we have to identify foreground from background. Therefore, the results from ranking have to be converted to classification results. A straightforward way is setting a threshold for the ranking score. However, the scores from different pairs can be inconsistent for classification: two pairs can both be perfectly ranked while no single threshold separates their positive examples from their negative ones. To make the ranking result meaningful for classification, we enforce a large margin in the constraint of Eqn. 3. The constraint then becomes
Due to the non-negativity of probabilities, this implies
which recovers the standard criterion for classification.
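Concretely, with scores interpreted as probabilities and an illustrative margin of 1/2, the ranking constraint pins down the standard classification threshold:

```latex
% Ranking constraint with margin \gamma:
p_{+} - p_{-} \ge \gamma .
% Since probabilities lie in [0, 1], taking \gamma = \tfrac{1}{2} gives
p_{+} \ge \gamma + p_{-} \ge \tfrac{1}{2},
\qquad
p_{-} \le p_{+} - \gamma \le \tfrac{1}{2},
% so thresholding the score at 1/2 separates positives from negatives,
% recovering the standard criterion for binary classification.
```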
3.4 Bounding Box Regression
Besides classification, regression is also important for detection to refine the bounding boxes. Most detectors apply the smoothed L1 loss to optimize the bounding box

It smooths the L1 loss with an L2 loss in a small interval around zero and guarantees that the whole loss function is smooth. This is reasonable since smoothness is important for convergence, as indicated in Theorem 1. However, it may result in slow optimization inside the L2 interval. Inspired by the interior-point method, which gradually approximates the non-smooth domain by increasing the weight of the corresponding barrier function at different stages, we obtain the transition point from a decreasing function to reduce the gap between the smoothed and the original L1 losses. As suggested in the interior-point method, the current objective should be solved to optimality before changing the weight of the barrier function. Therefore, we decay the transition point in a stepwise manner. Specifically, we compute its value at the t-th iteration as

where the decay factor is a constant and the step width denotes the number of iterations between decays.
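A minimal sketch of the stepwise schedule and the smoothed L1 loss follows; all constants (initial transition point, decay factor, step width) are illustrative, not the paper's values:

```python
def smooth_l1_delta(t, delta0=0.11, decay=0.5, step=30_000, delta_min=1e-3):
    """Stepwise decay of the smooth-L1 transition point delta.

    Every `step` iterations, the interval in which the L2 branch is used
    shrinks by `decay`, so the loss gradually approaches a pure L1 loss.
    """
    return max(delta_min, delta0 * decay ** (t // step))

def smooth_l1(x, delta):
    """Standard smooth-L1: quadratic inside [-delta, delta], linear outside."""
    ax = abs(x)
    if ax < delta:
        return 0.5 * ax * ax / delta
    return ax - 0.5 * delta
```

Keeping the schedule piecewise constant means each intermediate objective is optimized for a while before the transition point shrinks, matching the interior-point intuition above.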
Combining the regression loss, the objective of training the detector becomes
where a constant balances the weights between classification and regression.
4 Experiments

4.1 Implementation Details
We evaluate the proposed losses on the COCO 2017 data set, which contains about 118k images for training, 5k images for validation, and 20k images for test. To focus on the comparison of loss functions, we employ the structure of RetinaNet and only substitute the corresponding loss functions. For a fair comparison, we make the necessary modifications in the official codebase of RetinaNet, which is released in Detectron. Besides, we train the model with the same setting as RetinaNet: the model is learned with SGD on multiple GPUs with a few images per GPU at each iteration, most experiments use the standard schedule length, and the initial learning rate is decayed twice by a factor of 10 during training. For anchor density, we apply the same setting as RetinaNet, where each location has three scales and three aspect ratios. The standard COCO evaluation criterion is used to compare the performance of different methods.
Since RetinaNet lacks the optimization of the relationship between the positive and negative distributions, it has to initialize the output probability of the classifier at a small prior to fit the distribution of background. In contrast, we initialize the probability of the sigmoid function at 0.5, which is more reasonable for a binary classification scenario without any prior knowledge. It also verifies that the proposed DR loss handles class imbalance better.
In Eqn. 7, we compute the constrained distributions over positive and negative examples with separate regularization weights. To reduce the number of parameters, we fix the ratio between the two weights and tune only a single scale. It is easy to show that this strategy is equivalent to fixing the two weights and changing the base in the definition of the entropy regularizer.
Note that RetinaNet applies a Feature Pyramid Network (FPN) to obtain multi-scale features. To compute the DR loss in one image, we collect candidates from multiple pyramid levels and obtain a single distribution for foreground and background, respectively.
4.2 Effect of Parameters
First, we conduct ablation experiments to evaluate the effect of multiple parameters on the validation set. All experiments in this subsection are implemented with a single image scale for training and test. A ResNet backbone is applied for comparison, and only horizontal flipping is adopted as the data augmentation in this subsection.
Effect of the smoothness parameter:
The smoothness parameter controls the smoothness of the loss function in Eqn. 8. We compare models with different values in Table 1. Since this parameter also changes the function value, we adjust the weight of the classification loss accordingly; the base of the entropy regularizer is fixed. We observe that the loss function is quite stable over different smoothness values. Besides, a larger value results in a smaller function value, as shown in Fig. 3, which suggests increasing the weight of the classification loss to balance the losses. We keep this setting in the remaining experiments.
Effect of the regularization scale:
Next, we evaluate the effect of the scale of the entropy regularizer, which changes the regularization weights for both distributions. As illustrated in Fig. 2, a large scale pushes the generated distribution toward the extreme case, while a small one keeps the derived distribution close to the original distribution. We vary the scale over a range and summarize the results in Table 2. The performance is also not sensitive within a reasonable range, and we fix the value in the following experiments.
|Method||Backbone||AP||AP50||AP75||APS||APM||APL|
|Faster R-CNN+++ ||ResNet-101-C4||34.9||55.7||37.4||15.6||38.7||50.9|
|Faster R-CNN w FPN ||ResNet-101-FPN||36.2||59.1||39.0||18.2||39.0||48.2|
|Deformable R-FCN ||Aligned-Inception-ResNet||37.5||58.0||40.8||19.4||40.1||52.5|
|Mask R-CNN ||ResNet-101-FPN||38.2||60.3||41.7||20.1||41.1||50.2|
Effect of the decay strategy:
Finally, we compare different strategies for changing the transition point in the smoothed L1 loss. In the implementation of RetinaNet, it is fixed to a constant. We compare three strategies for decaying it, which are illustrated in Fig. 4, and report the results in Table 3. First, it is evident that all decaying strategies improve the performance of the detector over a fixed transition point. Then, the stepwise decay outperforms the linear decay, which verifies that the objective should be optimized sufficiently before moving to the next decay step. We adopt the stepwise decay in the following subsections.
Effect of DR Loss:
To illustrate the effect of DR loss, we collect the confidence scores of examples from all images in the validation set and compare the empirical probability densities in Fig. 6. We include the cross entropy loss and focal loss in the comparison. The model with the cross entropy loss is trained by us, while the model with focal loss is downloaded directly from the official model zoo with the same configuration as DR loss.
First, we observe that most examples have extremely low confidence with the cross entropy loss. This is because the number of negative examples overwhelms that of positive ones, so the classifier pushes most examples toward negative to obtain a small loss, as demonstrated in Eqn. 2. Second, focal loss is better than the cross entropy loss by drifting the distribution of foreground. However, the expectation of the foreground distribution is still close to that of the background, so a small threshold has to be adopted to identify positive examples from negative ones. Compared to the cross entropy and focal losses, DR loss optimizes the foreground distribution significantly. By optimizing the ranking loss with a large margin, the expectations of the foreground and background scores are well separated, which confirms that DR loss can address the imbalance between classes well. Consequently, DR loss allows us to set a large threshold for classification, and the performance is not sensitive to the threshold within a wide range. Besides, the distribution of background examples with DR loss is more balanced than that with the cross entropy or focal loss. It verifies that, with the data-dependent re-weighting strategy, DR loss can handle the imbalance in the background distribution and focus on the hard negative examples appropriately.
4.3 Performance with Different Scales
With the parameters suggested by the ablation studies, we train the model with different scales and backbones to show the robustness of the proposed losses. We adopt ResNet-50 and ResNet-101 as backbones in the comparison, and training applies only horizontal flipping as the data augmentation. Table 4 compares the performance with different scales to that of RetinaNet. We let "Dr.Retina" denote RetinaNet with the proposed DR loss and the decaying strategy for the smoothed L1 loss. Evidently, Dr.Retina performs better than RetinaNet over all scales with different backbones. Since we only change the loss functions in RetinaNet, the inference time remains the same while the mAP is consistently improved. The comparison also shows that the parameters in Dr.Retina are not sensitive to the scale of input images, which implies that the proposed losses are applicable to real-world applications.
4.4 Comparison with State-of-the-Art
Finally, we compare Dr.Retina to the state-of-the-art two-stage and one-stage detectors on the COCO test set. We follow the setting of RetinaNet to increase the number of training iterations and apply scale jitter as additional data augmentation for training. Note that we still use a single image scale and a single crop for test as above. Table 5 summarizes the comparison. To emphasize the effectiveness of DR loss, we first train a variant with the original smoothed L1 regression loss. With ResNet-101 as the backbone, we observe that this variant already improves AP over RetinaNet, which confirms that DR loss can handle the imbalance issue in detection better than focal loss. With the gradually approaching regression loss, Dr.Retina gains a further improvement and surpasses RetinaNet clearly. Equipped with ResNeXt-32x8d-101 and longer training, Dr.Retina achieves competitive performance as a one-stage detector on the COCO detection task.
5 Conclusion

In this work, we propose the distributional ranking loss to address the imbalance challenge in one-stage object detection. It first converts the original classification problem to a ranking problem, which balances the foreground and background classes. Furthermore, we propose to rank the expectations of the derived distributions in lieu of the original examples to focus on hard examples, which balances the distribution of background. Besides, we improve the regression loss by developing a strategy to optimize the L1 loss better. Experiments on COCO verify the effectiveness of the proposed losses. Since the RPN in two-stage detectors also suffers from the imbalance issue, applying DR loss to it can be our future work.
-  S. Boyd and L. Vandenberghe. Convex optimization. Cambridge university press, 2004.
-  Z. Cai and N. Vasconcelos. Cascade R-CNN: delving into high quality object detection. In CVPR, pages 6154–6162, 2018.
-  J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, pages 764–773, 2017.
-  N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, pages 886–893, 2005.
-  P. F. Felzenszwalb, D. A. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In CVPR, 2008.
-  C. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. DSSD : Deconvolutional single shot detector. CoRR, abs/1701.06659, 2017.
-  S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
-  R. B. Girshick. Fast R-CNN. In ICCV, pages 1440–1448, 2015.
-  R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pages 580–587, 2014.
-  P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017.
-  K. He, G. Gkioxari, P. Dollár, and R. B. Girshick. Mask R-CNN. In ICCV, pages 2980–2988, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012.
-  T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, pages 936–944, 2017.
-  T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In ICCV, pages 2999–3007, 2017.
-  T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. In ECCV, pages 740–755, 2014.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg. SSD: single shot multibox detector. In ECCV, pages 21–37, 2016.
-  D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.
-  J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, pages 779–788, 2016.
-  J. Redmon and A. Farhadi. YOLO9000: better, faster, stronger. In CVPR, pages 6517–6525, 2017.
-  S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
-  P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. In ICLR, 2014.
-  J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 104(2):154–171, 2013.
-  S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, pages 5987–5995, 2017.
-  T. Zhang and F. J. Oles. Text categorization based on regularized linear classification methods. Inf. Retr., 4(1):5–31, 2001.
Appendix A Gradient of DR Loss
We restate the DR loss defined in Eqn. 9. Although it looks complicated, its gradient is easy to compute. Here we give the detailed gradient forms for the positive and negative terms separately.
Appendix B Proof of Theorem 1
We assume that the loss in Eqn. 9 is smooth, so we have
According to the definition, we have
If we assume that the variance is bounded as
then we have
Therefore, we have
By summing the inequality over all iterations, we have

We finish the proof by setting the learning rate as stated in the theorem.
Appendix C Experiments
Effect of DR Loss:
We illustrate the empirical PDF of foreground and background scores from DR loss in Fig. 6. Fig. 6 (a) shows the original density of foreground and background. To make the results more explicit, we scale down the density of the background by a constant factor and show the result in Fig. 6 (b). It is obvious that DR loss separates the foreground and background with a large margin in the imbalanced scenario.