Rank & Sort Loss for Object Detection and Instance Segmentation

07/24/2021 · by Kemal Oksuz et al. · Middle East Technical University

We propose Rank & Sort (RS) Loss, a ranking-based loss function to train deep object detection and instance segmentation methods (i.e. visual detectors). RS Loss supervises the classifier, a sub-network of these methods, to rank each positive above all negatives as well as to sort positives among themselves with respect to (wrt.) their continuous localisation qualities (e.g. Intersection-over-Union – IoU). To tackle the non-differentiable nature of ranking and sorting, we reformulate the incorporation of error-driven update with backpropagation as Identity Update, which enables us to model our novel sorting error among positives. With RS Loss, we significantly simplify training: (i) thanks to our sorting objective, the positives are prioritized by the classifier without an additional auxiliary head (e.g. for centerness, IoU, mask-IoU); (ii) due to its ranking-based nature, RS Loss is robust to class imbalance, and thus no sampling heuristic is required; and (iii) we address the multi-task nature of visual detectors using tuning-free task-balancing coefficients. Using RS Loss, we train seven diverse visual detectors only by tuning the learning rate, and show that it consistently outperforms baselines: e.g. our RS Loss improves (i) Faster R-CNN by ~3 box AP and aLRP Loss (ranking-based baseline) by ~2 box AP on the COCO dataset, (ii) Mask R-CNN with repeat factor sampling (RFS) by 3.5 mask AP (~7 AP for rare classes) on the LVIS dataset; and it also outperforms all counterparts. Code is available at https://github.com/kemaloksuz/RankSortLoss


1 Introduction

Figure 1: A ranking-based classification loss vs RS Loss. (a) Enforcing positives to rank above negatives provides a useful objective for training; however, it ignores the ordering among positives. (b) Our RS Loss, in addition to ranking positives above negatives, aims to sort positives wrt. their continuous IoUs (positives: a green tone based on their labels; negatives: orange). We propose Identity Update (Section 3), a reformulation of error-driven update with backpropagation, to tackle these ranking and sorting operations, which are difficult to optimize due to their non-differentiable nature.

Owing to their multi-task (e.g. classification, box regression, mask prediction) nature, object detection and instance segmentation methods rely on loss functions of the form:

$\mathcal{L}_{VD} = \sum_{k \in \mathcal{K}} \sum_{t \in \mathcal{T}^k} \lambda_t^k \mathcal{L}_t^k, \quad (1)$

which combines $\mathcal{L}_t^k$, the loss function for task $t$ on stage $k$ (e.g. $|\mathcal{K}| = 2$ for Faster R-CNN [31] with its RPN and R-CNN stages), weighted by a hyper-parameter $\lambda_t^k$. In such formulations, the number of hyper-parameters can easily exceed 10 [26], with additional hyper-parameters arising from task-specific imbalance problems [28], e.g. the positive-negative imbalance in the classification task, and if a cascaded architecture is used (e.g. HTC [6] employs a cascade of R-CNNs, each with its own $\lambda_t^k$). Thus, although such loss functions have led to unprecedented successes on several benchmarks, they necessitate tuning, which is time-consuming, leads to sub-optimal solutions and makes fair comparison of methods challenging.
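To make Eq. 1 concrete, here is a minimal sketch in PyTorch-style Python of such a weighted multi-task combination; the stage/task names and the weights below are illustrative placeholders, not values from the paper:

    import torch

    def combined_loss(stage_losses, weights):
        """stage_losses: {stage: {task: scalar tensor}}; weights: {stage: {task: float}} (the lambdas in Eq. 1)."""
        total = torch.zeros(())
        for stage, tasks in stage_losses.items():
            for task, loss in tasks.items():
                total = total + weights[stage][task] * loss
        return total

    # e.g. a two-stage detector with classification and box regression in each stage
    losses = {"rpn":  {"cls": torch.tensor(0.8), "reg": torch.tensor(0.5)},
              "rcnn": {"cls": torch.tensor(1.1), "reg": torch.tensor(0.7)}}
    weights = {"rpn": {"cls": 1.0, "reg": 1.0}, "rcnn": {"cls": 1.0, "reg": 1.0}}
    print(combined_loss(losses, weights))

Every additional head or stage adds more weights to this dictionary, which is exactly the tuning burden that the rest of the paper aims to remove.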

Recently proposed ranking-based loss functions, namely “Average Precision (AP) Loss” [8] and “average Localisation Recall Precision (aLRP) Loss” [26], offer two important advantages over the classical score-based functions (e.g. Cross-entropy Loss and Focal Loss [21]): (1) They directly optimize the performance measure (e.g. AP), thereby providing consistency between training and evaluation objectives. This also reduces the number of hyper-parameters as the performance measure (e.g. AP) does not typically have any hyper-parameters. (2) They are robust to class-imbalance due to their ranking-based error definition. Although these losses have yielded impressive performances, they require longer training and more augmentation.

Broadly speaking, the ranking-based losses (AP Loss and aLRP Loss) focus on ranking positive examples over negatives, and they do not explicitly model positive-to-positive interactions. However, prioritizing predictions wrt. their localisation qualities by using an auxiliary (aux. - e.g. IoU, centerness) head has been a common approach to improve performance [15, 37, 43, 16]. Besides, as recently shown by Li et al. [17] (in Quality Focal Loss - QFL), when the classifier is directly supervised to regress IoUs of the predictions (i.e. to prioritize predictions wrt. IoU), one can remove the aux. head and further improve the performance.

In this paper, we propose Rank & Sort (RS) Loss as a ranking-based loss function to train visual detection (VD – i.e. object detection and instance segmentation) methods. RS Loss not only ranks positives above negatives (Fig. 1(a)) but also sorts positives among themselves with respect to their continuous IoU values (Fig. 1(b)). This approach brings in several crucial benefits. Due to the prioritization of positives during training, detectors trained with RS Loss do not need an aux. head, and due to its ranking-based nature, RS Loss can handle extremely imbalanced data (e.g. object detection [28]) without any sampling heuristics. Besides, except for the learning rate, RS Loss does not need any hyper-parameter tuning thanks to our tuning-free task-balancing coefficients. Owing to this significant simplification of training, we can apply RS Loss to different methods (i.e. multi-stage, one-stage, anchor-based, anchor-free) easily (i.e. only by tuning the learning rate) and demonstrate that RS Loss consistently outperforms baselines.

Our contributions can be summarized as follows:

(1) We reformulate the incorporation of error-driven optimization into backpropagation to optimize non-differentiable ranking-based losses as Identity Update, which uniquely provides interpretable loss values during training and allows definition of intra-class errors (e.g. the sorting error among positives).

(2) We propose Rank & Sort Loss that defines a ranking objective between positives and negatives as well as a sorting objective to prioritize positives wrt. their continuous IoUs. Due to this ranking-based nature, RS Loss can train models in the presence of highly imbalanced data.

(3) We present the effectiveness of RS Loss on a diverse set of four object detectors and three instance segmentation methods, only by tuning the learning rate and without any aux. heads or sampling heuristics, on the widely-used COCO and long-tailed LVIS benchmarks: e.g. (i) our RS-R-CNN improves Faster R-CNN by ~3 box AP on COCO; (ii) our RS-Mask R-CNN improves Mask R-CNN with repeat factor sampling by 3.5 mask AP (~7 AP for rare classes) on LVIS.

2 Related Work

Auxiliary heads and continuous labels. Predicting the localisation quality of a detection with an aux. centerness [37, 43], IoU [15, 16], mask-IoU [14] or uncertainty (i.e. variance) head [13], and combining these predictions with the classification scores for NMS, have been shown to improve detection performance. Li et al. [17] discovered that using continuous IoUs of predictions to supervise the classifier outperforms using an aux. head. Currently, Li et al.'s "Quality Focal Loss" [17] is the only method that is robust to class imbalance [28] and uses continuous labels to train the classifier. In this work, we investigate the generalizability of this idea to different networks (e.g. multi-stage networks [31, 2]) and to a different task (i.e. instance segmentation) by using our ranking-based RS Loss.

Ranking-based losses in VD. Despite their advantages, ranking-based losses are non-differentiable and difficult to optimize. To address this challenge, black-box solvers [33] use an interpolated AP surface, though yielding little gain in object detection. DR Loss [30] achieves ranking between positives and negatives by enforcing a margin with Hinge Loss, which is differentiable. Finally, AP Loss [8] and aLRP Loss [26] optimize the performance metrics AP and LRP [25] respectively by using the error-driven update of perceptron learning [34] for the non-differentiable parts. The main difference of RS Loss is that it also considers continuous localisation qualities as labels.

Objective imbalance in VD. The common strategy in VD is to use a scalar multiplier $\lambda_t^k$ (Eq. 1) on each task and tune them by grid search [16, 1]. Recently, Oksuz et al. [26] employed a self-balancing strategy to balance the classification and box regression heads, both of which compete for the bounded range of aLRP Loss. Similarly, Chen et al. [5] use the ratio of the classification and regression losses to balance these tasks. In our design, each loss for a specific head has its own bounded range and thus no competition ensues among heads. Besides, we use losses with similar ranges, and show that our RS Loss can simply be combined with a simple task-balancing strategy based on loss values, and hence does not require any tuning except the learning rate.

3 Identity Update for Ranking-based Losses

Using a ranking-based loss function is attractive thanks to its compatibility with common performance measures (e.g. AP). It is challenging, however, due to the non-differentiable nature of ranking. Here we first revisit an existing solution [8, 26] that overcomes this non-differentiability by incorporating error-driven update [34] into backpropagation (Section 3.1), and then present our reformulation (Section 3.2), which uniquely (i) provides interpretable loss values and (ii) takes into account intra-class errors, which is crucial for using continuous labels.

Figure 2: Three-step computation (green arrows) and optimization (orange arrows) algorithms of ranking-based loss functions. Our Identity Update (i) yields interpretable loss values (see Appendix A for an example on our RS Loss), (ii) replaces Eq. 2 of previous work [26] by Eq. 5 (green arrow in Step 2) to allow intra-class errors, crucial to model our RS Loss, and (iii) results in a simple "Identity Update" rule (orange arrow in Step 2): $\Delta x_{ij} = L_{ij}$.

3.1 Revisiting the Incorporation of Error-Driven Optimization into Backpropagation

Definition of the Loss. Oksuz et al. [26] propose writing a ranking-based loss as $\mathcal{L} = \frac{1}{Z} \sum_{i \in \mathcal{P}} \ell(i)$, where $Z$ is a problem-specific normalization constant, $\mathcal{P}$ is the set of positive examples and $\ell(i)$ is the error term computed on $i \in \mathcal{P}$.

Computation of the Loss. Given logits $s_i$ ($i \in \mathcal{P} \cup \mathcal{N}$), $\mathcal{L}$ can be computed in three steps [8, 26] (Fig. 2 green arrows):

Step 1. The difference transform between logits $s_i$ and $s_j$ is computed as $x_{ij} = s_j - s_i$.

Step 2. Using $x_{ij}$, errors originating from each pair of examples are calculated as primary terms ($L_{ij}$):

$L_{ij} = \begin{cases} \ell(i)\, p(j|i), & \text{for } i \in \mathcal{P}, j \in \mathcal{N}, \\ 0, & \text{otherwise}, \end{cases} \quad (2)$

where $p(j|i)$ is a probability mass function (pmf) that distributes $\ell(i)$, the error computed on $i \in \mathcal{P}$, over $j \in \mathcal{N}$, where $\mathcal{N}$ is the set of negative examples. By definition, the ranking-based error $\ell(i)$, and thus $L_{ij}$, requires a pairwise binary ranking relation between outputs $i$ and $j$, which is determined by the non-differentiable unit step function $H(x)$ (i.e. $H(x) = 1$ if $x \geq 0$ and $H(x) = 0$ otherwise) with input $x_{ij}$.

Using $H(x_{ij})$, different ranking-based functions can be introduced to define $\ell(i)$ and $p(j|i)$: e.g. the rank of the $i$th example, $\mathrm{rank}(i) = \sum_{j \in \mathcal{P} \cup \mathcal{N}} H(x_{ij})$; the rank of the $i$th example among positives, $\mathrm{rank}^+(i) = \sum_{j \in \mathcal{P}} H(x_{ij})$; and the number of false positives with logits larger than $s_i$, $N_{FP}(i) = \sum_{j \in \mathcal{N}} H(x_{ij})$. As an example, for AP Loss [8], using these definitions, $\ell(i)$ and $p(j|i)$ can simply be defined as $N_{FP}(i)/\mathrm{rank}(i)$ and $H(x_{ij})/N_{FP}(i)$ respectively [26].

Step 3. Finally, $\mathcal{L}$ is calculated as the normalized sum of the primary terms [26]: $\mathcal{L} = \frac{1}{Z} \sum_{i \in \mathcal{P}} \sum_{j \in \mathcal{N}} L_{ij}$.
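As an illustration, the following minimal sketch walks through the three steps above for AP Loss; it is a simplified reference only, assuming a hard unit step (the actual AP/aLRP/RS implementations smooth it around zero):

    import torch

    def ap_loss(logits, labels):
        pos, neg = logits[labels == 1], logits[labels == 0]
        # Step 1: difference transform x_ij = s_j - s_i for i in P, j in N
        x = neg[None, :] - pos[:, None]                                     # |P| x |N|
        H = (x >= 0).float()                                                # unit step H(x_ij)
        # ranking functions built from H: N_FP(i), rank+(i), rank(i)
        n_fp = H.sum(dim=1)
        rank_pos = ((pos[None, :] - pos[:, None]) >= 0).float().sum(dim=1)
        rank_all = rank_pos + n_fp
        # Step 2: primary terms L_ij = l(i) p(j|i), with l(i) = N_FP(i)/rank(i)
        l_i = n_fp / rank_all
        p_j_given_i = H / n_fp.clamp(min=1)[:, None]
        L = l_i[:, None] * p_j_given_i
        # Step 3: normalized sum of primary terms (Z = |P| for AP Loss)
        return L.sum() / pos.numel()

    logits = torch.tensor([2.0, 1.5, 0.3, -0.2, 1.0])
    labels = torch.tensor([1, 0, 1, 0, 0])
    print(ap_loss(logits, labels))   # 0.25, i.e. 1 - AP for this ranking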

Optimization of the Loss. Here, the aim is to find updates $\Delta s_i$ for the logits, and then proceed with backpropagation through the model parameters. Among the three computation steps (Fig. 2 orange arrows), Step 1 and Step 3 are differentiable, whereas a primary term $L_{ij}$ is not a differentiable function of the difference transforms. Denoting the update in $x_{ij}$ by $\Delta x_{ij}$ and using the chain rule, $\Delta s_i$ can be expressed as:

$\Delta s_i = -\sum_{j,k} \frac{\partial \mathcal{L}}{\partial L_{jk}}\, \Delta x_{jk}\, \frac{\partial x_{jk}}{\partial s_i}. \quad (3)$

Chen et al. [8] incorporate the error-driven update [34] and replace $\Delta x_{ij}$ by $-(L^*_{ij} - L_{ij})$, where $L^*_{ij}$ is the target primary term indicating the desired error for the pair $(i, j)$. Both AP Loss [8] and aLRP Loss [26] are optimized this way.

3.2 Our Reformulation: Identity Update

We first identify two drawbacks of the formulation in Section 3.1: (D1) The resulting loss value ($\mathcal{L}$) does not consider the target primary terms $L^*_{ij}$, and thus is not easily interpretable when $L^*_{ij} \neq 0$ (cf. aLRP Loss [26] and our RS Loss – Section 4); (D2) Eq. 2 assigns a non-zero primary term only if $i \in \mathcal{P}$ and $j \in \mathcal{N}$, effectively ignoring intra-class errors. These errors become especially important with continuous labels: the larger the (continuous) label of an example, the larger its logit should be.

Definition of the Loss. We redefine the loss function as:

$\mathcal{L} = \frac{1}{Z} \sum_{i \in \mathcal{P} \cup \mathcal{N}} \big( \ell(i) - \ell^*(i) \big), \quad (4)$

where $\ell^*(i)$ is the desired error term on $i$. Our loss definition has two benefits: (i) $\mathcal{L}$ directly measures the difference between the current and the desired errors, yielding an interpretable loss value to address (D1), and (ii) we do not constrain $\ell(i)$ to be defined only on positives and replace "$i \in \mathcal{P}$" with "$i \in \mathcal{P} \cup \mathcal{N}$". Although we do not use $i \in \mathcal{N}$ to model RS Loss, it makes the definition complete in the sense that, if necessary, individual errors ($\ell(i)$) can be computed on each output, and hence $\mathcal{L}$ can be approximated more precisely or a larger set of ranking-based loss functions can be represented.

Computation of the Loss. In order to compute $\mathcal{L}$ (Eq. 4), we only replace Eq. 2 with:

$L_{ij} = \big( \ell(i) - \ell^*(i) \big)\, p(j|i) \quad (5)$

in the three-step algorithm (Section 3.1, Fig. 2 green arrows) and allow all pairs to have a non-zero error, addressing (D2).

Optimization of the Loss. Since the error of a pair $(i, j)$ is minimized when $L_{ij} = 0$, Eq. 5 has a target of $L^*_{ij} = 0$ regardless of $\ell^*(i)$. Thus, $\Delta x_{ij}$ in Eq. 3 is simply the primary term itself: $\Delta x_{ij} = L_{ij}$, concluding the derivation of our Identity Update.
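A practical consequence is that backpropagation only needs the matrix of primary terms. The sketch below is our reading of Eq. 3 combined with $\Delta x_{ij} = L_{ij}$ (with the sign convention $x_{ij} = s_j - s_i$), not the paper's exact code:

    import torch

    def identity_update(L, Z):
        """L: KxK matrix of primary terms L_ij over all examples; returns the gradient wrt. each logit."""
        received = L.sum(dim=0)   # sum_j L_ji: errors distributed onto this example by others
        given = L.sum(dim=1)      # sum_j L_ij: errors computed on this example
        # update: Delta s_i = (1/Z)(sum_j L_ij - sum_j L_ji); the gradient is its negative
        return (received - given) / Z

With Eq. 2-style primary terms, this promotes positives (their rows carry errors) and demotes the negatives onto which those errors are distributed.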

4 Rank & Sort Loss

In order to supervise the classifier of visual detectors by considering the localisation qualities of the predictions (e.g. IoU), RS Loss decomposes the problem into two tasks: (i) Ranking task, which aims to rank each positive higher than all negatives, and (ii) sorting task, which aims to sort the logits in descending order wrt. continuous ground-truth labels (e.g. IoUs). We define RS Loss and compute its gradients using our Identity Update (Section 3.2 – Fig. 2).

Definition. Given logits $s_i$ and their continuous ground-truth labels $y_i$ (e.g. IoU), we define RS Loss as the average of the differences between the current ($\ell_{RS}(i)$) and target ($\ell^*_{RS}(i)$) RS errors over positives (i.e. $y_i > 0$):

$\mathcal{L}_{RS} := \frac{1}{|\mathcal{P}|} \sum_{i \in \mathcal{P}} \big( \ell_{RS}(i) - \ell^*_{RS}(i) \big), \quad (6)$

where $\ell_{RS}(i)$ is a summation of the current ranking error and the current sorting error:

$\ell_{RS}(i) = \ell_R(i) + \ell_S(i) = \frac{N_{FP}(i)}{\mathrm{rank}(i)} + \frac{\sum_{j \in \mathcal{P}} H(x_{ij}) (1 - y_j)}{\mathrm{rank}^+(i)}. \quad (7)$

For $i \in \mathcal{P}$, while the "current ranking error" $\ell_R(i)$ is simply the precision error, the "current sorting error" $\ell_S(i)$ penalizes the positives with logits larger than $s_i$ by the average of their inverted labels, $1 - y_j$. Note that when $i \in \mathcal{P}$ is ranked above all $j \in \mathcal{N}$, $N_{FP}(i) = 0$ and the target ranking error, $\ell^*_R(i)$, is $0$. For the target sorting error, we average over the inverted labels of $j \in \mathcal{P}$ with both larger logits ($H(x_{ij})$) and larger labels ($y_j \geq y_i$) than $i$, corresponding to the desired sorted order:

$\ell^*_{RS}(i) = \ell^*_R(i) + \ell^*_S(i) = 0 + \frac{\sum_{j \in \mathcal{P}} H(x_{ij}) [y_j \geq y_i] (1 - y_j)}{\sum_{j \in \mathcal{P}} H(x_{ij}) [y_j \geq y_i]}, \quad (8)$

where $[\cdot]$ is the Iverson bracket (i.e. 1 if the predicate is true; else 0), and, similar to previous work [8], $H(x)$ is smoothed in the interval $[-\delta_{RS}, \delta_{RS}]$ as $x / (2\delta_{RS}) + 0.5$.

Computation. We follow the three-step algorithm (Section 3, Fig. 2) and define the primary terms, $L_{ij}$, using Eq. 5, which allows us to express the errors among positives as:

$L_{ij} = \begin{cases} \big( \ell_R(i) - \ell^*_R(i) \big)\, p_R(j|i), & \text{for } i \in \mathcal{P}, j \in \mathcal{N}, \\ \big( \ell_S(i) - \ell^*_S(i) \big)\, p_S(j|i), & \text{for } i \in \mathcal{P}, j \in \mathcal{P}, \\ 0, & \text{otherwise}, \end{cases} \quad (9)$

where $\ell_R(i)$ and $\ell_S(i)$ are the ranking and sorting components of $\ell_{RS}(i)$ (Eq. 7), $\ell^*_R(i)$ and $\ell^*_S(i)$ are the corresponding targets (Eq. 8), and the ranking ($p_R(j|i)$) and sorting ($p_S(j|i)$) pmfs uniformly distribute the ranking and sorting errors on $i$ respectively over the examples causing the error (i.e. for ranking, $j \in \mathcal{N}$ with $s_j \geq s_i$; for sorting, $j \in \mathcal{P}$ with $s_j \geq s_i$ but $y_j < y_i$):

$p_R(j|i) = \frac{H(x_{ij})}{\sum_{k \in \mathcal{N}} H(x_{ik})}, \qquad p_S(j|i) = \frac{H(x_{ij})\, [y_j < y_i]}{\sum_{k \in \mathcal{P}} H(x_{ik})\, [y_k < y_i]}. \quad (10)$

Optimization. To obtain the gradients, we simply replace $\Delta x_{ij}$ (Eq. 3) by the primary terms of RS Loss, $L_{ij}$ (Eq. 9), following Identity Update (Section 3.2). The resulting $\partial \mathcal{L}_{RS} / \partial s_i$ for $i \in \mathcal{N}$ then becomes (see Appendix A for derivations):

$\frac{\partial \mathcal{L}_{RS}}{\partial s_i} = \frac{1}{|\mathcal{P}|} \sum_{j \in \mathcal{P}} \ell_R(j)\, p_R(i|j). \quad (11)$

Owing to the additional sorting error (Eqs. 7, 8), $\partial \mathcal{L}_{RS} / \partial s_i$ for $i \in \mathcal{P}$ includes update signals for both promotion and demotion to sort the positives accordingly:

$\frac{\partial \mathcal{L}_{RS}}{\partial s_i} = \frac{1}{|\mathcal{P}|} \Big( \ell^*_{RS}(i) - \ell_{RS}(i) + \sum_{j \in \mathcal{P}} \big( \ell_S(j) - \ell^*_S(j) \big)\, p_S(i|j) \Big). \quad (12)$

Note that the directions of the first and second parts of Eq. 12 are different. To place $i \in \mathcal{P}$ in the desired ranking, the first part promotes $i$ based on the error computed on $i$ itself, whereas the second part demotes $i$ based on the signals from $j \in \mathcal{P}$. We provide more insight into RS Loss with an example in Appendix A.
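Putting Eqs. 6-12 together, the sketch below computes the RS Loss value and the per-logit gradients via the Identity Update; it is a simplified reference implementation that uses a hard unit step instead of the smoothed one and omits the efficiency details of the official code:

    import torch

    def rs_loss_and_grads(logits, labels):
        """logits: K logits; labels: continuous in [0,1] for positives, 0 for negatives."""
        P = labels > 0
        step = (logits[None, :] >= logits[:, None]).float()        # H(x_ij): s_j >= s_i
        rank = step.sum(dim=1)                                      # rank(i)
        rank_pos = (step * P.float()[None, :]).sum(dim=1)           # rank+(i)
        n_fp = (step * (~P).float()[None, :]).sum(dim=1)            # N_FP(i)
        inv = 1.0 - labels
        l_rank = n_fp / rank                                         # current ranking error (Eq. 7)
        l_sort = (step * P.float()[None, :] * inv[None, :]).sum(dim=1) / rank_pos.clamp(min=1)
        desired = step * P.float()[None, :] * (labels[None, :] >= labels[:, None]).float()
        l_sort_tgt = (desired * inv[None, :]).sum(dim=1) / desired.sum(dim=1).clamp(min=1)  # Eq. 8
        # primary terms (Eq. 9) with the ranking/sorting pmfs of Eq. 10
        pR = step * (~P).float()[None, :] / n_fp.clamp(min=1)[:, None]
        s_mask = step * P.float()[None, :] * (labels[None, :] < labels[:, None]).float()
        pS = s_mask / s_mask.sum(dim=1, keepdim=True).clamp(min=1)
        L = P.float()[:, None] * (l_rank[:, None] * pR + (l_sort - l_sort_tgt)[:, None] * pS)
        # Identity Update: gradient per logit = (errors received - errors given) / |P|
        grads = (L.sum(dim=0) - L.sum(dim=1)) / P.sum()
        loss = ((l_rank + l_sort - l_sort_tgt) * P.float()).sum() / P.sum()  # Eq. 6
        return loss, grads

    logits = torch.tensor([1.2, 0.4, 2.0, -0.3, 0.9])
    ious = torch.tensor([0.8, 0.0, 0.5, 0.0, 0.6])      # continuous labels; zeros mark negatives
    print(rs_loss_and_grads(logits, ious))

In this toy example the negatives receive positive gradients (demotion), while positives with high IoU but low logits receive negative gradients (promotion), which is exactly the ranking-and-sorting behaviour described above.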

5 Using RS Loss to Train Visual Detectors

Figure 3: (a) A generic visual detection pipeline includes many heads, possibly from multiple stages. An aux. head, in addition to the standard ones, is common in recent methods (e.g. the centerness head for ATSS [43], the IoU head for IoU-Net [15], and the mask-IoU head for Mask-scoring R-CNN [14]) to regress localisation quality and prioritize examples during inference (e.g. by multiplying the classification scores by the predicted localisation quality). Sampling heuristics are also common to ensure balanced training. Such architectures use many hyper-parameters and are sensitive to their tuning. (b) Training detectors with our RS Loss removes (i) aux. heads, by directly supervising the classification (Cls.) head with continuous IoUs (in red & bold), and (ii) sampling heuristics, owing to its robustness against class imbalance. We use losses with a range similar to that of our RS Loss in the other branches (i.e. GIoU Loss, Dice Loss), also weighting each prediction by its classification score (obtained by applying a sigmoid to the logit). (c) Instead of tuning $\lambda_t^k$s, we simply balance tasks by considering their loss values. With this design, we train several detectors only by tuning the learning rate and improve their performance consistently.

This section develops an overall loss function to train detectors with RS Loss, in which only the learning rate needs tuning. As commonly done in the literature [17, 16], Section 5.2 analyses different design choices on ATSS [43], a SOTA one-stage object detector (i.e. $|\mathcal{K}| = 1$ in Eq. 1); and Section 5.3 extends our design to other architectures.

5.1 Dataset and Implementation Details

Unless explicitly specified, we use (i) the standard configuration of each detector and only replace the loss function, (ii) the mmdetection framework [7], (iii) a single batch of 16 images (Tesla V100 GPUs) during training, (iv) a 12-epoch training schedule, (v) single-scale testing, (vi) a ResNet-50 backbone with FPN [20], (vii) the COCO trainval35K (115K images) and minival (5k images) sets [22] to train and test our models, and (viii) COCO-style AP for reporting.

5.2 Analysis and Tuning-Free Design Choices

ATSS [43], with its classification, box regression and centerness heads, is originally trained by minimizing:

$\mathcal{L}^{ATSS} = \mathcal{L}_{cls} + \lambda_{box} \mathcal{L}_{box} + \lambda_{ctr} \mathcal{L}_{ctr}, \quad (13)$

where $\mathcal{L}_{cls}$ is Focal Loss [21]; $\mathcal{L}_{box}$ is GIoU Loss [32]; $\mathcal{L}_{ctr}$ is Cross-entropy Loss with continuous labels to supervise the centerness prediction; and $\lambda_{box}$ and $\lambda_{ctr}$ are the task-balancing weights. We first remove the centerness head and replace $\mathcal{L}_{cls}$ by our RS Loss (Section 4), $\mathcal{L}_{RS}$, using the IoU between a prediction box ($\hat{B}$) and its ground-truth box ($B$) as the continuous labels:

$\mathcal{L}^{RS\text{-}ATSS} = \mathcal{L}_{RS} + \lambda_{box} \mathcal{L}_{box}, \quad (14)$

where $\lambda_{box}$, the task-level balancing coefficient, is generally set to a constant scalar by grid search.

Inspired by recent work [26, 5], we investigate two tuning-free heuristics to determine $\lambda_{box}$ every iteration: (i) value-based, setting $\lambda_{box}$ to the ratio of the loss values, and (ii) magnitude-based, setting $\lambda_{box}$ to the ratio of the gradient magnitudes (L1 norms) wrt. the box regression and classification head outputs respectively. In our analysis on ATSS trained with RS Loss, we observed that value-based task balancing performs similarly to tuning $\lambda_{box}$. Also, we use score-based weighting [17] by multiplying the GIoU Loss of each prediction by its classification score (details of the analysis are in Appendix B). Note that value-based task balancing and score-based instance weighting are both hyper-parameter-free and easily applicable to all networks. With these design choices, Eq. 14 has only one hyper-parameter ($\delta_{RS}$ in $H(x)$, used to smooth the unit-step function), which we keep fixed.
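A minimal sketch of these two design choices follows; the exact normalization of the score-based weights is one simple option among several, and the function and variable names are illustrative rather than the paper's API:

    import torch

    def balance_and_weight(loss_rs, giou_per_positive, pos_cls_scores):
        """loss_rs: scalar RS Loss; giou_per_positive: per-positive GIoU losses; pos_cls_scores: sigmoid scores."""
        # score-based instance weighting: weight each positive's localisation loss by its
        # (detached) classification score
        w = pos_cls_scores.detach()
        loss_box = (w * giou_per_positive).sum() / w.sum().clamp(min=1e-12)
        # value-based task balancing: lambda_box is recomputed every iteration so that the
        # weighted box loss matches the value of the classification loss
        lam = (loss_rs / loss_box.clamp(min=1e-12)).detach()
        return loss_rs + lam * loss_box

    loss_rs = torch.tensor(0.62)
    giou = torch.tensor([0.40, 0.90, 0.20])
    scores = torch.tensor([0.80, 0.30, 0.60])
    print(balance_and_weight(loss_rs, giou, scores))

Because both the weights and the balancing coefficient are derived from quantities already available each iteration, nothing here requires a grid search.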

5.3 Training Different Architectures

Fig. 3 presents a comparative overview of how we adapt RS Loss to train different architectures: when we use RS Loss to train the classifier (Fig. 3(b)), we remove aux. heads (e.g. the IoU head in IoU-Net [15]) and sampling heuristics (e.g. OHEM in YOLACT [1], random sampling in Faster R-CNN [31]). We adopt score-based weighting in the box regression and mask prediction heads, and prefer Dice Loss, instead of the common Cross-entropy Loss, to train the mask prediction head for instance segmentation due to (i) its bounded range (between 0 and 1), and (ii) its holistic evaluation of the predictions, both similar to GIoU Loss. Finally, we set $\lambda_t^k$ (Eq. 1) by the loss-value ratio every iteration (Fig. 3(c)), with the single exception of RPN, whose losses we multiply by a constant weight following aLRP Loss [26].
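For the mask branch, here is a minimal sketch of a Dice Loss with the score-based instance weighting described above; the squared-sum denominator is one common Dice formulation and is an assumption rather than necessarily the paper's exact variant:

    import torch

    def weighted_dice_loss(pred_masks, gt_masks, cls_scores, eps=1e-6):
        """pred_masks: N x H x W probabilities (after sigmoid); gt_masks: N x H x W in {0,1} (float); cls_scores: N."""
        p = pred_masks.flatten(1)
        g = gt_masks.flatten(1)
        dice = (2 * (p * g).sum(dim=1) + eps) / (p.pow(2).sum(dim=1) + g.pow(2).sum(dim=1) + eps)
        per_instance = 1.0 - dice                  # bounded in [0, 1], like GIoU Loss
        w = cls_scores.detach()                    # score-based instance weighting
        return (w * per_instance).sum() / w.sum().clamp(min=eps)

The bounded, per-instance nature of this loss is what makes the simple loss-value-based balancing above applicable without extra tuning.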

6 Experiments

Method Assigner Sampler Aux. Head AP oLRP H# Venue
FPN [20] IoU-based Random None CVPR 17
aLRP Loss [26] IoU-based None None NeurIPS 20
GIoU Loss [32] IoU-based Random None CVPR 19
IoU-Net [15] IoU-based Random IoU Head ECCV 18
Libra R-CNN [29] IoU-based IoU-based None CVPR 19
AutoLoss-A [23] IoU-based Random None ICLR 21
Carafe FPN [38] IoU-based Random None ICCV 19
Dynamic R-CNN [42] Dynamic Random None ECCV 20
RS-R-CNN (Ours) IoU-based None None
RS-R-CNN+ (Ours) IoU-based None None
Table 1: RS-R-CNN uses the standard IoU-based assigner, is sampling-free, employs no aux. head, is almost tuning-free wrt. task-balancing weights ($\lambda_t^k$s – Eq. 1), and thus has the least number of hyper-parameters (H# = 3 – two $\delta_{RS}$, one for each RS Loss to train RPN & R-CNN, and one RPN weight). Still, RS-R-CNN improves standard Faster R-CNN with FPN by ~3 AP, aLRP Loss (the ranking-based loss baseline) by ~2 AP, and IoU-Net (a method with an IoU head). RS-R-CNN+ replaces the upsampling of FPN by the lightweight Carafe operation [38] and maintains the AP gap over Carafe FPN. All models use ResNet-50, are evaluated on COCO minival and are trained for 12 epochs on mmdetection, except for IoU-Net. H#: number of hyper-parameters (Appendix C presents details on H#).

To present the contribution of RS Loss in terms of performance and tuning simplicity, we conduct experiments on seven visual detectors with a diverse set of architectures: four object detectors (i.e. Faster R-CNN [31], Cascade R-CNN [2], ATSS [43] and PAA [16] – Section 6.1) and three instance segmentation methods (i.e. Mask R-CNN [12], YOLACT [1] and SOLOv2 [39] – Section 6.2). Finally, Section 6.3 presents ablation analysis.

6.1 Experiments on Object Detection

6.1.1 Multi-stage Object Detectors

To train Faster R-CNN [31] and Cascade R-CNN [2] with our RS Loss (i.e. RS-R-CNN), we remove sampling from all stages (i.e. RPN and R-CNNs), use all anchors to train RPN and the top-scoring proposals per image (the mmdetection defaults for Faster R-CNN and Cascade R-CNN [7]), replace the softmax classifiers by binary sigmoid classifiers, and tune only the initial learning rate.

RS Loss on a standard Faster R-CNN outperforms (Table 1): (i) FPN [20] (Cross-entropy & Smooth L1 losses) by ~3 AP, (ii) aLRP Loss [26], a SOTA ranking-based baseline, by ~2 AP, (iii) IoU-Net [15], which uses an aux. head, and (iv) Dynamic R-CNN, its closest counterpart. We then use the lightweight Carafe [38] as the upsampling operation in FPN and obtain a further gain (RS-R-CNN+), still maintaining the AP gap over Carafe FPN [38] and outperforming the other methods in the AP- and oLRP-based [25, 27] performance measures except a localisation-oriented one, which implies that our main contribution is in the classification task trained by our RS Loss and that there is still room for improvement in the localisation task. RS Loss also improves the stronger Cascade R-CNN [2] baseline (Appendix C presents detailed results for Cascade R-CNN). Finally, RS Loss has the least number of hyper-parameters (H# = 3, Table 1) and does not need a sampler, an aux. head or tuning of $\lambda_t^k$s (Eq. 1).

6.1.2 One-stage Object Detectors

Loss Function Unified Rank-based Aux. Head ATSS [43] PAA [16] H#
AP oLRP AP oLRP
Focal Loss [21]
AP Loss [8]
QFL [17]
aLRP Loss [26]
RS Loss (Ours) 67.9 67.3
Table 2: RS Loss has the least number of hyper-parameters (H#) and outperforms (i) rank-based alternatives significantly, (ii) the default setting with an aux. head (underlined), and (iii) the score-based alternative, QFL, especially on PAA. We test unified losses (i.e. losses that consider localisation quality while training the classification head) only without an aux. head. All models use ResNet-50.

We train ATSS [43] and PAA [16], whose architectures include a centerness head and an IoU head respectively. We adopt the anchor configuration of Oksuz et al. [26] for all ranking-based losses (different anchor configurations do not affect the performance of standard ATSS [43]) and tune only the learning rate. While training PAA, we keep the scoring function that splits positives and negatives, for a fair comparison among different loss functions.

Comparison with AP and aLRP Losses, the ranking-based baselines: we simply replace Focal Loss by AP Loss to train the networks, and for aLRP Loss, similar to our RS Loss, we tune only its learning rate owing to its tuning simplicity. For both ATSS and PAA, RS Loss provides significant gains over the ranking-based alternatives, which in previous work [8, 26] were trained for 100 epochs using SSD-like augmentation [24] (Table 2).

Comparison with Focal Loss, the default loss function: RS Loss provides a clear AP gain when both networks are trained equally without an aux. head (Table 2), and it also outperforms the default networks with aux. heads.

Comparison with QFL, the score-based loss function using continuous IoUs as labels: to apply QFL [17] to PAA, we remove the aux. IoU head (as we did for ATSS), test two possible options ((i) the default PAA setting with IoU-based weighting, (ii) the default QFL setting with score-based weighting – Section 5.2) and report the best result for QFL. While the results of QFL and RS Loss are similar for ATSS, there is an AP gap in favor of our RS Loss on PAA, which can be due to the different positive-negative labelling method in PAA (Table 2).

6.1.3 Comparison with SOTA

Here, we use our RS-R-CNN since it yields the largest improvement over its baseline. We train RS-R-CNN for 36 epochs using multiscale training (randomly resizing the shorter side) on ResNet-101 with DCNv2 [44]. Table 3 reports the results on COCO test-dev: our RS-R-CNN outperforms the similarly trained Faster R-CNN and Dynamic R-CNN. Although we do not increase the number of parameters of Faster R-CNN, RS-R-CNN outperforms all multi-stage detectors including TridentNet [18], which has more parameters. Our RS-R-CNN+ (Section 6.1.1) and RS-Mask R-CNN+ (Section 6.2) improve the results further, outperforming all one- and multi-stage counterparts.

Method AP
One-stage ATSS [43]
GFL [17]
PAA [16]
RepPointsv2 [10]
Multi-stage Faster R-CNN [42]
Trident Net [18]
Dynamic R-CNN [42]
D2Det [3]
Ours RS-R-CNN
RS-R-CNN+
RS-Mask R-CNN+
RS-Mask R-CNN+*
Table 3: Comparison with SOTA for object detection on COCO test-dev using ResNet-101 (except *) with DCN. The result of the similarly trained Faster R-CNN is acquired from Zhang et al. [42]. +: upsampling of FPN is Carafe [38], *:ResNeXt-64x4d-101

6.2 Experiments on Instance Segmentation

6.2.1 Multi-stage Instance Segmentation Methods

We train Mask R-CNN [12] on COCO and LVIS datasets by keeping all design choices of Faster R-CNN the same.

COCO: We observe an AP gain in both segmentation and detection performance over Mask R-CNN (Table 4). Also, as hypothesized, RS-Mask R-CNN outperforms Mask Scoring R-CNN [14], which uses an additional mask-IoU head, in both mask and box AP as well as in mask oLRP.

Method Aux Segmentation Performance H#
Head AP oLRP
Mask R-CNN
Mask-sc. R-CNN
RS-Mask R-CNN
Table 4: Without an aux. head, RS-Mask R-CNN improves Mask R-CNN [12] and outperforms Mask-scoring R-CNN [14], which employs an additional mask-IoU head as an aux. head.

LVIS: Replacing the cross-entropy loss used to train Mask R-CNN with repeat factor sampling (RFS) by our RS Loss improves performance by 3.5 mask AP on the long-tailed LVIS dataset (reaching 25.2 mask AP, with ~7 AP improvement on rare classes) and outperforms recent counterparts (Table 5).

Method Venue
RFS [11] CVPR 19
BAGS [19] CVPR 20
Eq. Lossv2 [36] CVPR 21
RFS+RS Loss
Table 5: Comparison on LVIS v1.0 val set [11] with ResNet-50.

6.2.2 One-stage Instance Segmentation Methods

Here, we train two different approaches with our RS Loss: (i) YOLACT [1], a real-time instance segmentation method involving sampling heuristics (e.g. OHEM [35]), an aux. head and carefully tuned loss weights, for which we demonstrate that RS Loss can discard all of these while improving performance; and (ii) SOLOv2 [39], an anchor-free SOTA method.

Method Additional Training Heuristics Segmentation Performance Detection Performance H#
OHEM [35] Size-based Norm. Sem.Segm. Head AP oLRP AP oLRP
YOLACT [1]
RS-YOLACT
Table 6: RS-YOLACT does not employ any additional training heuristics and outperforms YOLACT by a significant margin.

YOLACT: Following YOLACT [1], we train (and test) RS-YOLACT with the same image size and number of epochs as the baseline. Instead of searching for the epochs at which to decay the learning rate, carefully tuned for YOLACT, we simply adopt cosine annealing and tune only the initial learning rate. Then, we remove (i) OHEM, (ii) the semantic segmentation head, (iii) the carefully tuned task weights and (iv) size-based normalization (i.e. normalization of the mask-head loss of each instance by its ground-truth area). Removing each of these heuristics from the baseline causes a slight to significant performance drop, or at least requires retuning of the task weights (Table 6). After these simplifications, our RS-YOLACT improves the baseline in both mask AP and box AP.
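A minimal sketch of this schedule swap is shown below; the model, initial learning rate and epoch count are placeholders, not the paper's values:

    import torch

    model = torch.nn.Linear(8, 2)                    # stand-in for the detector
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    num_epochs = 55                                  # placeholder; match the baseline schedule
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)

    for epoch in range(num_epochs):
        # ... run one training epoch ...
        scheduler.step()                             # anneal the learning rate once per epoch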

SOLOv2: Following Wang et al. [39], we train the anchor-free SOLOv2 with RS Loss for 36 epochs using multiscale training in its two different settings: (i) SOLOv2-light, the real-time setting with ResNet-34 and smaller test images, trained with 32 images/batch; and (ii) SOLOv2, the SOTA setting with ResNet-101, trained with 16 images/batch. In both cases we tune only the learning rate. Since SOLOv2 does not have a box regression head, we use the Dice coefficient as the continuous label of RS Loss (see Appendix C for an analysis of using different localisation qualities as labels for instance segmentation). Again, RS Loss performs better than the baseline (i.e. Focal Loss and Dice Loss) only by tuning the learning rate (Table 7).

6.2.3 Comparison with SOTA

We use our RS-Mask R-CNN (i.e. standard Mask R-CNN with RS Loss) to compare with SOTA methods. In order to fit into the 16GB memory of our V100 GPUs while keeping all other settings unchanged, we limit the maximum number of proposals in the mask head to 200, which can simply be omitted for GPUs with larger memory. Following our counterparts [39, 40], we first train RS-Mask R-CNN for 36 epochs with multiscale training using ResNet-101 (Table 8), improving Mask R-CNN and outperforming all SOTA methods by a notable margin. Then, we train RS-Mask R-CNN+ (i.e. standard Mask R-CNN except that the upsampling of FPN is the lightweight Carafe [38]) with a wider multiscale range, which even outperforms all models with DCN. With DCN [44] on ResNet-101, our RS-Mask R-CNN+ improves further.

Method Backbone AP oLRP H#
SOLOv2-light ResNet-34
RS-SOLOv2-light ResNet-34
SOLOv2 ResNet-101
RS-SOLOv2 ResNet-101
Table 7: Comparison on anchor-free SOLOv2.
Method AP
w/o DCN Polar Mask [41]
Mask R-CNN [9]
SOLOv2 [39]
Center Mask [40]
RS-Mask R-CNN (Ours)
RS-Mask R-CNN+ (Ours)
w DCN Mask-scoring R-CNN [14]
BlendMask [4]
SOLOv2 [39]
RS-Mask R-CNN+ (Ours)
RS-Mask R-CNN+* (Ours)
Table 8: Comparison with SOTA for instance segmentation on COCO test-dev. All methods (except *) use ResNet-101. The result of the similarly trained Mask R-CNN is acquired from Chen et al. [9]. +: upsampling of FPN is Carafe, *:ResNeXt-64x4d-101

6.3 Ablation Experiments

Contribution of the components: Replacing Focal Loss by RS Loss improves the performance significantly (Table 9). Score-based weighting has a minor contribution, and value-based task balancing simplifies tuning.

Architecture RS Loss score-based w. task bal. H#
3
ATSS+ResNet50 2
w.o. aux. head 2
1
Table 9: Contribution of the components of RS Loss on ATSS.
Dataset Sampler Desired Neg # Actual Neg #
RPN R-CNN RPN R-CNN RPN R-CNN
COCO Random Random 1 3 7 702 38.5
None Random 1 N/A 6676 702 39.3
None None N/A N/A 6676 1142 39.6
LVIS None None N/A N/A 3487 10470 25.2
Table 10: Ablation with different degrees of imbalance on different datasets and samplers. The number of negatives (neg) corresponding to a single positive (pos), averaged over the iterations of the first epoch, is presented. Quantitatively, the pos:neg ratio varies from 1:7 to 1:10470. RS Loss successfully trains under different degrees of imbalance without tuning (Tables 1 and 5). Details: Appendix C.

Robustness to imbalance: Without tuning, RS Loss can successfully train models with very different imbalance levels (Table 10): our RS Loss (i) yields 38.5 AP on COCO with the standard random samplers (i.e. the data is relatively balanced, especially for RPN), (ii) utilizes more data when the samplers are removed, resulting in an AP gain (38.5 to 39.6 AP), and (iii) outperforms all counterparts on the long-tailed LVIS dataset (cf. Table 5), where the imbalance is extreme for R-CNN (pos:neg ratio of 1:10470 – Table 10). Appendix C presents a detailed discussion.

Contribution of the sorting error: To see the contribution of our additional sorting error, during training we track Spearman's ranking correlation coefficient ($\rho$) between IoUs and classification scores, as an indicator of sorting quality, with and without the additional sorting error (see Eqs. 6-8). As hypothesized, using the sorting error improves the sorting quality ($\rho$, averaged both over all iterations and over the last 100 iterations) for RS-R-CNN.
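A minimal sketch of this probe follows; it uses the standard Spearman formula and ignores ties, which is sufficient for monitoring:

    import torch

    def spearman_rho(ious, scores):
        """Rank correlation between the positives' IoUs and their classification scores."""
        def ranks(x):
            r = torch.empty(len(x), dtype=torch.float)
            r[x.argsort()] = torch.arange(len(x), dtype=torch.float)
            return r
        d = ranks(ious) - ranks(scores)
        n = len(ious)
        return 1.0 - 6.0 * (d * d).sum() / (n * (n * n - 1))

    # perfectly sorted scores give rho = 1.0
    print(spearman_rho(torch.tensor([0.9, 0.3, 0.7, 0.5]), torch.tensor([2.1, -0.4, 1.3, 0.2])))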

Effect on efficiency: On average, one training iteration with RS Loss takes somewhat longer than with score-based losses. See Appendix C for more discussion on the effect of RS Loss on training and inference time.

7 Conclusion

In this paper, we proposed RS Loss as a ranking-based loss function to train object detectors and instance segmentation methods. Unlike existing ranking-based losses, which aim to rank positives above negatives, our RS Loss also sorts positives wrt. their localisation qualities, which is consistent with NMS and the performance measure, AP. With RS Loss, we employed a simple, loss-value-based, tuning-free heuristic to balance all heads in the visual detectors. As a result, we showed on seven diverse visual detectors that RS Loss both consistently improves performance and significantly simplifies the training pipeline.

Acknowledgments: This work was supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK) (under project no 117E054 and 120E494). We also gratefully acknowledge the computational resources kindly provided by TÜBİTAK ULAKBIM High Performance and Grid Computing Center (TRUBA) and Roketsan Missiles Inc. used for this research. Dr. Oksuz is supported by the TÜBİTAK 2211-A National Scholarship Programme for Ph.D. students. Dr. Kalkan is supported by the BAGEP Award of the Science Academy, Turkey.

References

  • [1] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee (2019) YOLACT: real-time instance segmentation. In IEEE/CVF International Conference on Computer Vision (ICCV).
  • [2] Z. Cai and N. Vasconcelos (2018) Cascade R-CNN: delving into high quality object detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • [3] J. Cao, H. Cholakkal, R. M. Anwer, F. S. Khan, Y. Pang, and L. Shao (2020) D2Det: towards high quality object detection and instance segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • [4] H. Chen, K. Sun, Z. Tian, C. Shen, Y. Huang, and Y. Yan (2020) BlendMask: top-down meets bottom-up for instance segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • [5] J. Chen, D. Liu, T. Xu, S. Zhang, S. Wu, B. Luo, X. Peng, and E. Chen (2019) Is sampling heuristics necessary in training deep object detectors?. arXiv:1909.04868.
  • [6] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, C. C. Loy, and D. Lin (2019) Hybrid task cascade for instance segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • [7] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv:1906.07155.
  • [8] K. Chen, W. Lin, J. Li, J. See, J. Wang, and J. Zou (2020) AP-loss for accurate one-stage object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
  • [9] X. Chen, R. Girshick, K. He, and P. Dollár (2019) TensorMask: a foundation for dense object segmentation. In IEEE/CVF International Conference on Computer Vision (ICCV).
  • [10] Y. Chen, Z. Zhang, Y. Cao, L. Wang, S. Lin, and H. Hu (2020) RepPoints v2: verification meets regression for object detection. In Advances in Neural Information Processing Systems (NeurIPS).
  • [11] A. Gupta, P. Dollár, and R. Girshick (2019) LVIS: a dataset for large vocabulary instance segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • [12] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In IEEE/CVF International Conference on Computer Vision (ICCV).
  • [13] Y. He, C. Zhu, J. Wang, M. Savvides, and X. Zhang (2019) Bounding box regression with uncertainty for accurate object detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • [14] Z. Huang, L. Huang, Y. Gong, C. Huang, and X. Wang (2019) Mask Scoring R-CNN. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • [15] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang (2018) Acquisition of localization confidence for accurate object detection. In The European Conference on Computer Vision (ECCV).
  • [16] K. Kim and H. S. Lee (2020) Probabilistic anchor assignment with IoU prediction for object detection. In The European Conference on Computer Vision (ECCV).
  • [17] X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang (2020) Generalized focal loss: learning qualified and distributed bounding boxes for dense object detection. In Advances in Neural Information Processing Systems (NeurIPS).
  • [18] Y. Li, Y. Chen, N. Wang, and Z. Zhang (2019) Scale-aware trident networks for object detection. In IEEE/CVF International Conference on Computer Vision (ICCV).
  • [19] Y. Li, T. Wang, B. Kang, S. Tang, C. Wang, J. Li, and J. Feng (2020) Overcoming classifier imbalance for long-tail object detection with balanced group softmax. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • [20] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie (2017) Feature pyramid networks for object detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • [21] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2020) Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 42(2), pp. 318–327.
  • [22] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In The European Conference on Computer Vision (ECCV).
  • [23] P. Liu, G. Zhang, B. Wang, H. Xu, X. Liang, Y. Jiang, and Z. Li (2021) Loss function discovery for object detection via convergence-simulation driven search. In International Conference on Learning Representations (ICLR).
  • [24] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In The European Conference on Computer Vision (ECCV).
  • [25] K. Oksuz, B. C. Cam, E. Akbas, and S. Kalkan (2018) Localization recall precision (LRP): a new performance metric for object detection. In The European Conference on Computer Vision (ECCV).
  • [26] K. Oksuz, B. C. Cam, E. Akbas, and S. Kalkan (2020) A ranking-based, balanced loss function unifying classification and localisation in object detection. In Advances in Neural Information Processing Systems (NeurIPS).
  • [27] K. Oksuz, B. C. Cam, E. Akbas, and S. Kalkan (2020) One metric to measure them all: localisation recall precision (LRP) for evaluating visual detection tasks. arXiv:2011.10772.
  • [28] K. Oksuz, B. C. Cam, S. Kalkan, and E. Akbas (2020) Imbalance problems in object detection: a review. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).
  • [29] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin (2019) Libra R-CNN: towards balanced learning for object detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • [30] Q. Qian, L. Chen, H. Li, and R. Jin (2020) DR Loss: improving object detection by distributional ranking. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • [31] S. Ren, K. He, R. Girshick, and J. Sun (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 39(6), pp. 1137–1149.
  • [32] H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese (2019) Generalized intersection over union: a metric and a loss for bounding box regression. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • [33] M. Rolínek, V. Musil, A. Paulus, M. Vlastelica, C. Michaelis, and G. Martius (2020) Optimizing rank-based metrics with blackbox differentiation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • [34] F. Rosenblatt (1958) The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, pp. 65–386.
  • [35] A. Shrivastava, A. Gupta, and R. Girshick (2016) Training region-based object detectors with online hard example mining. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • [36] J. Tan, X. Lu, G. Zhang, C. Yin, and Q. Li (2021) Equalization loss v2: a new gradient balance approach for long-tailed object detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • [37] Z. Tian, C. Shen, H. Chen, and T. He (2019) FCOS: fully convolutional one-stage object detection. In IEEE/CVF International Conference on Computer Vision (ICCV).
  • [38] J. Wang, K. Chen, R. Xu, Z. Liu, C. C. Loy, and D. Lin (2019) CARAFE: content-aware reassembly of features. In IEEE/CVF International Conference on Computer Vision (ICCV).
  • [39] X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen (2020) SOLOv2: dynamic and fast instance segmentation. In Advances in Neural Information Processing Systems (NeurIPS).
  • [40] Y. Wang, Z. Xu, H. Shen, B. Cheng, and L. Yang (2020) CenterMask: single shot instance segmentation with point representation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • [41] E. Xie, P. Sun, X. Song, W. Wang, X. Liu, D. Liang, C. Shen, and P. Luo (2020) PolarMask: single shot instance segmentation with polar representation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • [42] H. Zhang, H. Chang, B. Ma, N. Wang, and X. Chen (2020) Dynamic R-CNN: towards high quality object detection via dynamic training. In The European Conference on Computer Vision (ECCV).
  • [43] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li (2020) Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
  • [44] X. Zhu, H. Hu, S. Lin, and J. Dai (2019) Deformable ConvNets v2: more deformable, better results. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

Appendices

A Details of RS Loss

In this section, we present the derivations of gradients and obtain the loss value and gradients of RS Loss on an example in order to provide more insight.

A.1 Derivation of the Gradients

The gradients of a ranking-based loss function can be determined as follows. Eq. 3 in the paper states that

$\Delta s_i = -\sum_{j,k} \frac{\partial \mathcal{L}}{\partial L_{jk}}\, \Delta x_{jk}\, \frac{\partial x_{jk}}{\partial s_i}. \quad (A.15)$

Our Identity Update reformulation suggests replacing $\Delta x_{jk}$ by $L_{jk}$; recalling that the gradient is the negative of the update ($\partial \mathcal{L} / \partial s_i = -\Delta s_i$) and that $\partial x_{jk} / \partial s_i$ is $1$ if $k = i$ and $-1$ if $j = i$, this yields:

$\frac{\partial \mathcal{L}}{\partial s_i} = \frac{1}{Z} \Big( \sum_{j} L_{ji} - \sum_{j} L_{ij} \Big). \quad (A.16)$

We split both summations into two based on the labels of the examples, and express the gradient using four terms:

$\frac{\partial \mathcal{L}}{\partial s_i} = \frac{1}{Z} \Big( \sum_{j \in \mathcal{P}} L_{ji} + \sum_{j \in \mathcal{N}} L_{ji} - \sum_{j \in \mathcal{P}} L_{ij} - \sum_{j \in \mathcal{N}} L_{ij} \Big). \quad (A.17)$

Then we simply use the primary terms of RS Loss, defined in Eq. 9 of the paper as:

$L_{ij} = \begin{cases} \big( \ell_R(i) - \ell^*_R(i) \big)\, p_R(j|i), & \text{for } i \in \mathcal{P}, j \in \mathcal{N}, \\ \big( \ell_S(i) - \ell^*_S(i) \big)\, p_S(j|i), & \text{for } i \in \mathcal{P}, j \in \mathcal{P}, \\ 0, & \text{otherwise}. \end{cases} \quad (A.18)$

With these primary term definitions, we obtain the gradients of RS Loss using Eq. A.17.

Gradients for $i \in \mathcal{N}$. For $i \in \mathcal{N}$, we can respectively express the four terms in Eq. A.17 as follows:

  • $\sum_{j \in \mathcal{P}} L_{ji} = \sum_{j \in \mathcal{P}} \big( \ell_R(j) - \ell^*_R(j) \big)\, p_R(i|j)$,

  • $\sum_{j \in \mathcal{N}} L_{ji} = 0$ (no negative-to-negative error is defined for RS Loss – see Eq. A.18),

  • $\sum_{j \in \mathcal{P}} L_{ij} = 0$ (no error is defined when $i \in \mathcal{N}$ and $j \in \mathcal{P}$ for RS Loss – see Eq. A.18),

  • $\sum_{j \in \mathcal{N}} L_{ij} = 0$ (no negative-to-negative error is defined for RS Loss – see Eq. A.18),

which can then be expressed as (by also replacing $Z = |\mathcal{P}|$ following the definition of RS Loss):

$\frac{\partial \mathcal{L}_{RS}}{\partial s_i} = \frac{1}{|\mathcal{P}|} \sum_{j \in \mathcal{P}} L_{ji} \quad (A.19)$
$= \frac{1}{|\mathcal{P}|} \sum_{j \in \mathcal{P}} \big( \ell_R(j) - \ell^*_R(j) \big)\, p_R(i|j) \quad (A.20)$
$= \frac{1}{|\mathcal{P}|} \sum_{j \in \mathcal{P}} \ell_R(j)\, p_R(i|j), \quad (A.21)$

concluding the derivation of the gradients for $i \in \mathcal{N}$.

Gradients for $i \in \mathcal{P}$. We follow the same methodology for $i \in \mathcal{P}$ and express the same four terms as follows:

  • $\sum_{j \in \mathcal{P}} L_{ji} = \sum_{j \in \mathcal{P}} \big( \ell_S(j) - \ell^*_S(j) \big)\, p_S(i|j)$,

  • $\sum_{j \in \mathcal{N}} L_{ji} = 0$ (no error is defined when $j \in \mathcal{N}$ and $i \in \mathcal{P}$ for RS Loss – see Eq. A.18),

  • $\sum_{j \in \mathcal{P}} L_{ij}$ reduces to $\ell_S(i) - \ell^*_S(i)$ simply by rearranging the terms and since $p_S(j|i)$ is a pmf:

    $\sum_{j \in \mathcal{P}} L_{ij} = \sum_{j \in \mathcal{P}} \big( \ell_S(i) - \ell^*_S(i) \big)\, p_S(j|i) \quad (A.22)$
    $= \big( \ell_S(i) - \ell^*_S(i) \big) \sum_{j \in \mathcal{P}} p_S(j|i) \quad (A.23)$
    $= \ell_S(i) - \ell^*_S(i). \quad (A.24)$

  • Similarly, $\sum_{j \in \mathcal{N}} L_{ij}$ reduces to $\ell_R(i)$.