Official PyTorch Implementation of Rank & Sort Loss [ICCV2021]
We propose Rank & Sort (RS) Loss, a ranking-based loss function to train deep object detection and instance segmentation methods (i.e. visual detectors). RS Loss supervises the classifier, a sub-network of these methods, to rank each positive above all negatives as well as to sort positives among themselves with respect to (wrt.) their continuous localisation qualities (e.g. Intersection-over-Union, IoU). To tackle the non-differentiable nature of ranking and sorting, we reformulate the incorporation of error-driven update with backpropagation as Identity Update, which enables us to model our novel sorting error among positives. With RS Loss, we significantly simplify training: (i) thanks to our sorting objective, the positives are prioritized by the classifier without an additional auxiliary head (e.g. for centerness, IoU, mask-IoU), (ii) due to its ranking-based nature, RS Loss is robust to class imbalance, and thus, no sampling heuristic is required, and (iii) we address the multi-task nature of visual detectors using tuning-free task-balancing coefficients. Using RS Loss, we train seven diverse visual detectors only by tuning the learning rate, and show that it consistently outperforms baselines: e.g. our RS Loss improves (i) Faster R-CNN by ~3 box AP and aLRP Loss (ranking-based baseline) by ~2 box AP on the COCO dataset, (ii) Mask R-CNN with repeat factor sampling (RFS) by ~3.5 mask AP (~7 AP for rare classes) on the LVIS dataset; and also outperforms all counterparts. Code is available at https://github.com/kemaloksuz/RankSortLoss
Owing to their multi-task (e.g. classification, box regression, mask prediction) nature, object detection and instance segmentation methods rely on loss functions of the form:

$$\mathcal{L}^{VD} = \sum_{k \in \mathcal{K}} \sum_{t \in \mathcal{T}} \lambda_t^k \mathcal{L}_t^k,$$

which combines $\mathcal{L}_t^k$, the loss function for task $t$ on stage $k$ (e.g. $|\mathcal{K}| = 2$ for Faster R-CNN with RPN and R-CNN), each weighted by a hyper-parameter $\lambda_t^k$. In such formulations, the number of hyper-parameters can easily exceed 10, with additional hyper-parameters arising from task-specific imbalance problems, e.g. the positive-negative imbalance in the classification task, and from cascaded architectures (e.g. HTC employs cascaded R-CNNs with different $\lambda_t^k$). Thus, although such loss functions have led to unprecedented success on several benchmarks, they necessitate tuning, which is time-consuming, leads to sub-optimal solutions and makes fair comparison of methods challenging.
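The structure of this combined loss can be sketched in a few lines (an illustrative sketch only; the stage/task names and dictionary layout are ours, not the official implementation's):

```python
import torch

def combined_loss(task_losses, weights):
    """Lambda-weighted sum of per-(stage, task) losses, the shape of Eq. 1.

    task_losses / weights: dicts keyed by (stage, task) pairs,
    e.g. ("rpn", "cls"). Names here are illustrative.
    """
    total = torch.zeros(())
    for key, loss in task_losses.items():
        total = total + weights[key] * loss
    return total

# Each lambda is a hyper-parameter that would normally need grid search.
losses = {("rpn", "cls"): torch.tensor(0.7),
          ("rpn", "box"): torch.tensor(0.3),
          ("rcnn", "cls"): torch.tensor(1.1),
          ("rcnn", "box"): torch.tensor(0.5)}
weights = {k: 1.0 for k in losses}
total = combined_loss(losses, weights)
```

With four tasks already carrying four weights, it is easy to see how the hyper-parameter count grows for cascaded architectures.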
Recently proposed ranking-based loss functions, namely "Average Precision (AP) Loss" and "average Localisation Recall Precision (aLRP) Loss", offer two important advantages over the classical score-based functions (e.g. Cross-entropy Loss and Focal Loss): (1) They directly optimize the performance measure (e.g. AP), thereby providing consistency between training and evaluation objectives. This also reduces the number of hyper-parameters, as the performance measure (e.g. AP) does not typically have any hyper-parameters. (2) They are robust to class imbalance due to their ranking-based error definition. Although these losses have yielded impressive performance, they require longer training and more augmentation.
Broadly speaking, the ranking-based losses (AP Loss and aLRP Loss) focus on ranking positive examples over negatives, and they do not explicitly model positive-to-positive interactions. However, prioritizing predictions wrt. their localisation qualities by using an auxiliary (aux. - e.g. IoU, centerness) head has been a common approach to improve performance [15, 37, 43, 16]. Besides, as recently shown by Li et al.  (in Quality Focal Loss - QFL), when the classifier is directly supervised to regress IoUs of the predictions (i.e. to prioritize predictions wrt. IoU), one can remove the aux. head and further improve the performance.
In this paper, we propose Rank & Sort (RS) Loss as a ranking-based loss function to train visual detection (VD – i.e. object detection and instance segmentation) methods. RS Loss not only ranks positives above negatives (Fig. 1(a)) but also sorts positives among themselves with respect to their continuous IoU values (Fig. 1(b)). This approach brings several crucial benefits. Due to the prioritization of positives during training, detectors trained with RS Loss do not need an aux. head, and due to its ranking-based nature, RS Loss can handle extremely imbalanced data (e.g. in object detection) without any sampling heuristics. Besides, except for the learning rate, RS Loss does not need any hyper-parameter tuning thanks to our tuning-free task-balancing coefficients. Owing to this significant simplification of training, we can apply RS Loss to different methods (i.e. multi-stage, one-stage, anchor-based, anchor-free) easily (i.e. only by tuning the learning rate) and demonstrate that RS Loss consistently outperforms baselines.
Our contributions can be summarized as follows:
(1) We reformulate the incorporation of error-driven optimization into backpropagation to optimize non-differentiable ranking-based losses as Identity Update, which uniquely provides interpretable loss values during training and allows definition of intra-class errors (e.g. the sorting error among positives).
(2) We propose Rank & Sort Loss that defines a ranking objective between positives and negatives as well as a sorting objective to prioritize positives wrt. their continuous IoUs. Due to this ranking-based nature, RS Loss can train models in the presence of highly imbalanced data.
(3) We present the effectiveness of RS Loss on a diverse set of four object detectors and three instance segmentation methods, only by tuning the learning rate and without any aux. heads or sampling heuristics, on the widely-used COCO and long-tailed LVIS benchmarks: e.g. (i) our RS-R-CNN improves Faster R-CNN by ~3 box AP on COCO, (ii) our RS-Mask R-CNN improves repeat factor sampling by ~3.5 mask AP (~7 AP for rare classes) on LVIS.
Auxiliary heads and continuous labels in VD. Using an aux. head (e.g. a centerness, IoU, mask-IoU or uncertainty (i.e. variance) head) and combining its predictions with the classification scores for NMS is shown to improve detection performance. Li et al. discovered that using continuous IoUs of predictions to supervise the classifier outperforms using an aux. head. Currently, Li et al.'s "Quality Focal Loss" is the only method that is robust to class imbalance and uses continuous labels to train the classifier. In this work, we investigate the generalizability of this idea on different networks (e.g. multi-stage networks [31, 2]) and on a different task (i.e. instance segmentation) by using our ranking-based RS Loss.
Ranking-based losses in VD. Despite their advantages, ranking-based losses are non-differentiable and difficult to optimize. To address this challenge, black-box solvers use an interpolated AP surface, though yielding little gain in object detection. DR Loss achieves ranking between positives and negatives by enforcing a margin with Hinge Loss, which is differentiable. Finally, AP Loss and aLRP Loss optimize the performance metrics AP and LRP, respectively, by using the error-driven update of perceptron learning for the non-differentiable parts. The main difference of RS Loss is that it also considers continuous localisation qualities as labels.
Objective imbalance in VD. The common strategy in VD is to use a scalar multiplier $\lambda_t^k$ (Eq. 1) on each task and tune them by grid search [16, 1]. Recently, Oksuz et al. employed a self-balancing strategy to balance the classification and box regression heads, both of which compete for the bounded range of aLRP Loss. Similarly, Chen et al. use the ratio of the classification and regression losses to balance these tasks. In our design, each loss for a specific head has its own bounded range and thus no competition ensues among heads. Besides, we use losses ($\mathcal{L}_t^k$) with similar ranges, and show that our RS Loss can simply be combined with a simple task-balancing strategy based on loss values, and hence does not require any tuning except the learning rate.
Using a ranking-based loss function is attractive thanks to its compatibility with common performance measures (e.g. AP). It is challenging, however, due to the non-differentiable nature of ranking. Here we first revisit an existing solution [8, 26] that overcomes this non-differentiability by incorporating error-driven update  into backpropagation (Section 3.1), and then present our reformulation (Section 3.2), which uniquely (i) provides interpretable loss values and (ii) takes into account intra-class errors, which is crucial for using continuous labels.
Definition of the Loss. Oksuz et al. propose writing a ranking-based loss as $\mathcal{L} = \frac{1}{Z} \sum_{i \in \mathcal{P}} \ell(i)$, where $Z$ is a problem-specific normalization constant, $\mathcal{P}$ is the set of positive examples and $\ell(i)$ is the error term computed on $i \in \mathcal{P}$.
Computation of the Loss.
Given logits $s_i$ ($i \in \mathcal{P} \cup \mathcal{N}$), $\mathcal{L}$ can be computed in three steps [8, 26] (Fig. 2, green arrows):

Step 1. The difference transform between logits $s_i$ and $s_j$ is computed by $x_{ij} = s_j - s_i$.

Step 2. Using $x_{ij}$, errors originating from each pair of examples are calculated as primary terms ($L_{ij}$): $L_{ij} = \ell(i)\, p(j|i)$ if $i \in \mathcal{P}, j \in \mathcal{N}$, and $L_{ij} = 0$ otherwise, where $p(j|i)$ is a probability mass function (pmf) that distributes $\ell(i)$, the error computed on $i \in \mathcal{P}$, over $j \in \mathcal{N}$, with $\mathcal{N}$ the set of negative examples. By definition, the ranking-based error $\ell(i)$, and thus $L_{ij}$, requires the pairwise binary ranking relation between outputs $i$ and $j$, which is determined by the non-differentiable unit step function $H(x)$ (i.e. $H(x) = 1$ if $x \geq 0$ and $H(x) = 0$ otherwise) with input $x_{ij}$.

Using $H(x_{ij})$, different ranking-based functions can be introduced to define $\ell(i)$ and $p(j|i)$: e.g. the rank of the $i$th example, $\mathrm{rank}(i) = \sum_{j \in \mathcal{P} \cup \mathcal{N}} H(x_{ij})$; the rank of the $i$th example among positives, $\mathrm{rank}^+(i) = \sum_{j \in \mathcal{P}} H(x_{ij})$; and the number of false positives with logits larger than $s_i$, $N_{FP}(i) = \sum_{j \in \mathcal{N}} H(x_{ij})$. As an example, for AP Loss, using these definitions, $\ell(i)$ and $p(j|i)$ can simply be defined as $\frac{N_{FP}(i)}{\mathrm{rank}(i)}$ and $\frac{H(x_{ij})}{N_{FP}(i)}$ respectively.

Step 3. Finally, $\mathcal{L}$ is calculated as the normalized sum of the primary terms: $\mathcal{L} = \frac{1}{Z} \sum_{i \in \mathcal{P}} \sum_{j \in \mathcal{N}} L_{ij}$.
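The three steps above can be sketched for the value of AP Loss as follows (a minimal NumPy sketch with a hard unit step, ignoring the smoothing and gradient machinery; function and variable names are ours, not the paper's code):

```python
import numpy as np

def ap_loss_value(scores, is_pos):
    """Sketch of the three-step loss computation for AP Loss's value.

    scores: (N,) logits; is_pos: (N,) booleans marking positives.
    """
    s = np.asarray(scores, dtype=float)
    pos = np.asarray(is_pos, dtype=bool)
    # Step 1: difference transform x_ij = s_j - s_i
    x = s[None, :] - s[:, None]
    # Hard unit step H(x_ij): 1 if s_j >= s_i (no smoothing here)
    H = (x >= 0).astype(float)
    losses = []
    for i in np.where(pos)[0]:
        rank_i = H[i].sum()       # rank(i); includes i itself (x_ii = 0)
        n_fp = H[i][~pos].sum()   # negatives ranked above i
        # Steps 2-3: l(i) = N_FP(i) / rank(i), normalized over positives
        losses.append(n_fp / rank_i)
    return float(np.mean(losses))
```

For example, with scores `[3.0, 2.0, 1.0]` and positives at indices 0 and 2, the second positive is outranked by one negative, so its error is 1/3 and the loss is the average over the two positives.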
Optimization of the Loss. Here, the aim is to find updates $\Delta x_{ij}$ for the difference transforms, and then proceed with backpropagation through the model parameters. Among the three computation steps (Fig. 2, orange arrows), Step 1 and Step 3 are differentiable, whereas a primary term $L_{ij}$ is not a differentiable function of the difference transforms. Denoting this update in $x_{ij}$ by $\Delta x_{ij}$ and using the chain rule, $\frac{\partial \mathcal{L}}{\partial s_i}$ can be expressed as:

$$\frac{\partial \mathcal{L}}{\partial s_i} = \frac{1}{Z}\Big(\sum_{j} \Delta x_{ji} - \sum_{j} \Delta x_{ij}\Big).$$

Chen et al. incorporate the error-driven update of perceptron learning and replace the problematic derivative $\frac{\partial L_{ij}}{\partial x_{ij}}$ by $-(L^*_{ij} - L_{ij})$, where $L^*_{ij}$ is the target primary term indicating the desired error for the pair $(i, j)$. Both AP Loss and aLRP Loss are optimized this way.
We first identify two drawbacks of the formulation in Section 3.1: (D1) The resulting loss value $\mathcal{L}$ does not consider the targets $L^*_{ij}$, and thus is not easily interpretable when the targets are non-zero (cf. aLRP Loss and our RS Loss – Section 4); (D2) Eq. 2 assigns a non-zero primary term only if $i \in \mathcal{P}$ and $j \in \mathcal{N}$, effectively ignoring intra-class errors. These errors become especially important with continuous labels: the larger the label of a positive $j$, the larger its logit should be.
Definition of the Loss. We redefine the loss function as:

$$\mathcal{L} = \frac{1}{Z} \sum_{i \in \mathcal{P} \cup \mathcal{N}} \big(\ell(i) - \ell^*(i)\big),$$

where $\ell^*(i)$ is the desired error term on $i$. Our loss definition has two benefits: (i) $\mathcal{L}$ directly measures the difference between the current and the desired errors, yielding an interpretable loss value to address (D1), and (ii) we do not constrain $\ell(i)$ to be defined only on positives and replace "$i \in \mathcal{P}$" with "$i \in \mathcal{P} \cup \mathcal{N}$". Although we do not use "$i \in \mathcal{N}$" to model RS Loss, it makes the definition complete in the sense that, if necessary to obtain $\mathcal{L}$, individual errors ($\ell(i)$) can be computed on each output, and hence the desired measure can be approximated more precisely or a larger set of ranking-based loss functions can be represented.
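A minimal sketch of this redefinition (names ours) shows why the value is interpretable: the loss is exactly the average gap to the desired errors, hence zero when every example attains its target:

```python
def identity_update_loss(current_errors, target_errors):
    """Loss as the normalized sum of (current - desired) error gaps.

    current_errors / target_errors: per-example error lists; here the
    normalization constant Z is simply the number of examples.
    """
    assert len(current_errors) == len(target_errors)
    Z = len(current_errors)
    return sum(c - t for c, t in zip(current_errors, target_errors)) / Z

# When every example already attains its desired error, the loss is 0,
# regardless of whether the targets themselves are non-zero.
identity_update_loss([0.0, 0.25], [0.0, 0.25])
```

This is the property (D1) asks for: a value of 0 means "nothing left to fix", which does not hold for the formulation in Section 3.1 when the targets are non-zero.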
In order to supervise the classifier of visual detectors by considering the localisation qualities of the predictions (e.g. IoU), RS Loss decomposes the problem into two tasks: (i) Ranking task, which aims to rank each positive higher than all negatives, and (ii) sorting task, which aims to sort the logits in descending order wrt. continuous ground-truth labels (e.g. IoUs). We define RS Loss and compute its gradients using our Identity Update (Section 3.2 – Fig. 2).
Definition. Given logits $s_i$ and their continuous ground-truth labels $y_i \in [0, 1]$ (e.g. IoU), we define RS Loss as the average of the differences between the current ($\ell_{RS}(i)$) and target ($\ell^*_{RS}(i)$) RS errors over positives (i.e. $y_i > 0$):

$$\mathcal{L}_{RS} = \frac{1}{|\mathcal{P}|} \sum_{i \in \mathcal{P}} \big(\ell_{RS}(i) - \ell^*_{RS}(i)\big),$$

where $\ell_{RS}(i)$ is a summation of the current ranking error and the current sorting error:

$$\ell_{RS}(i) = \underbrace{\frac{N_{FP}(i)}{\mathrm{rank}(i)}}_{\ell_R(i)} + \underbrace{\frac{\sum_{j \in \mathcal{P}} H(x_{ij})\,(1 - y_j)}{\mathrm{rank}^+(i)}}_{\ell_S(i)}.$$

For $i \in \mathcal{P}$, while the current ranking error, $\ell_R(i)$, is simply the precision error, the current sorting error, $\ell_S(i)$, penalizes the positives with logits larger than $s_i$ by the average of their inverted labels, $1 - y_j$. Note that when $i$ is ranked above all $j \in \mathcal{N}$, the target ranking error, $\ell^*_R(i)$, is $0$. For the target sorting error, we average over the inverted labels of $j \in \mathcal{P}$ with both larger logits ($H(x_{ij})$) and larger labels ($y_j \geq y_i$) than $i$, corresponding to the desired sorted order:

$$\ell^*_S(i) = \frac{\sum_{j \in \mathcal{P}} H(x_{ij})\,[y_j \geq y_i]\,(1 - y_j)}{\sum_{j \in \mathcal{P}} H(x_{ij})\,[y_j \geq y_i]},$$

where $[\cdot]$ is the Iverson Bracket (i.e. 1 if the predicate is True; else 0) and, similar to previous work, $H(\cdot)$ is smoothed in the interval $[-\delta, \delta]$ as $x/(2\delta) + 0.5$.

Obtaining the gradients via Identity Update (Eq. 12 in the paper), the ranking pmf $p_R(j|i)$ and the sorting pmf $p_S(j|i)$ uniformly distribute the ranking and sorting errors of $i$, respectively, over the examples causing those errors (i.e. for ranking, over $j \in \mathcal{N}$ with $s_j \geq s_i$; for sorting, over $j \in \mathcal{P}$ with $s_j \geq s_i$ but $y_j < y_i$).

Note that the directions of the first and the second part of Eq. 12 are different. To place $i \in \mathcal{P}$ in the desired ranking, the first part promotes $i$ based on the error computed on $i$ itself, whereas the second part demotes $i$ based on the signals from other positives. We provide more insight for RS Loss on an example in Appendix A.
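Putting the definitions together, the value of RS Loss can be sketched as follows (our reading of the definitions, with a hard unit step and no smoothing; an illustrative sketch, not the official implementation, which also supplies gradients via Identity Update):

```python
import numpy as np

def rs_loss_value(scores, labels):
    """Sketch of the RS Loss value.

    scores: (N,) logits; labels: (N,) continuous labels in [0, 1],
    where an example is positive iff its label > 0.
    """
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    pos = y > 0
    # H[i, j] = H(x_ij): 1 if s_j >= s_i (hard unit step, no smoothing)
    H = ((s[None, :] - s[:, None]) >= 0).astype(float)
    total = 0.0
    for i in np.where(pos)[0]:
        rank_i = H[i].sum()
        rank_pos = H[i][pos].sum()
        n_fp = H[i][~pos].sum()
        l_rank = n_fp / rank_i                                  # current ranking error
        l_sort = (H[i][pos] * (1 - y[pos])).sum() / rank_pos    # current sorting error
        # target sorting error: average inverted label over positives that
        # should outrank i (label >= y_i) and currently outscore it
        mask = H[i][pos] * (y[pos] >= y[i])
        l_sort_star = (mask * (1 - y[pos])).sum() / max(mask.sum(), 1)
        total += (l_rank + l_sort) - (0.0 + l_sort_star)        # target ranking error is 0
    return float(total / pos.sum())
```

With a positive scored above a negative and positives sorted by IoU, the value is 0; swapping the sort order of two positives produces a positive loss even though the ranking over negatives is still perfect, which is exactly the extra signal the sorting error contributes.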
This section develops an overall loss function to train detectors with RS Loss, in which only the learning rate needs tuning. As commonly performed in the literature [17, 16], Section 5.2 analyses different design choices on ATSS, a SOTA one-stage object detector (i.e. a single stage in Eq. 1); and Section 5.3 extends our design to other architectures.
Unless explicitly specified, we use (i) the standard configuration of each detector and only replace the loss function, (ii) the mmdetection framework, (iii) 16 images in a single batch during training (on Tesla V100 GPUs), (iv) a standard training schedule (12 epochs), (v) single-scale testing, (vi) a ResNet-50 backbone with FPN, (vii) the COCO trainval35K (115K images) and minival (5K images) sets to train and test our models, and (viii) COCO-style AP as the performance measure.
ATSS, with its classification, box regression and centerness heads, is originally trained by minimizing:

$$\mathcal{L}_{ATSS} = \mathcal{L}_{cls} + \lambda_{box} \mathcal{L}_{box} + \lambda_{ctr} \mathcal{L}_{ctr},$$

where $\mathcal{L}_{cls}$ is Focal Loss; $\mathcal{L}_{box}$ is GIoU Loss; $\mathcal{L}_{ctr}$ is Cross-entropy Loss with continuous labels to supervise the centerness prediction; and $\lambda_{box}$ and $\lambda_{ctr}$ are task-balancing coefficients. We first remove the centerness head and replace $\mathcal{L}_{cls}$ by our RS Loss (Section 4), $\mathcal{L}_{RS}$, using the IoU between a prediction box ($\hat{b}$) and its ground-truth box ($b$) as the continuous label:

$$\mathcal{L} = \mathcal{L}_{RS} + \lambda_{box} \mathcal{L}_{box},$$

where $\lambda_{box}$, the task-level balancing coefficient, is generally set to a constant scalar by grid search.

Inspired by recent work [26, 5], we investigate two tuning-free heuristics to determine $\lambda_{box}$ every iteration: (i) value-based, using the ratio of the loss values, $\mathcal{L}_{RS} / \mathcal{L}_{box}$, and (ii) magnitude-based, using the ratio of the L1 norms of the classification and box regression head outputs. In our analysis on ATSS trained with RS Loss, we observed that value-based task balancing performs similarly to tuning $\lambda_{box}$. Also, we use score-based weighting by multiplying the GIoU Loss of each prediction by its classification score (details of the analysis are in Appendix B). Note that value-based task balancing and score-based instance weighting are both hyper-parameter-free and easily applicable to all networks. With these design choices, our final loss (Eq. 14 in the paper) has only one hyper-parameter (i.e. $\delta$ in $H(\cdot)$, used to smooth the unit step function).
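Value-based task balancing can be sketched as follows (assuming, per our reading of the heuristic, that each task's weighted loss is matched to the classification loss's value each iteration; names are ours):

```python
import torch

def value_based_weights(loss_cls, other_losses):
    """Tuning-free, per-iteration task weights.

    Each auxiliary task t gets weight loss_cls / loss_t, so that its
    weighted contribution equals the classification loss's value.
    Weights are detached: they scale gradients but are not themselves
    differentiated through.
    """
    return {name: (loss_cls.detach() / loss.detach().clamp(min=1e-12))
            for name, loss in other_losses.items()}

loss_cls = torch.tensor(0.8)
others = {"box": torch.tensor(0.2)}
w = value_based_weights(loss_cls, others)
total = loss_cls + w["box"] * others["box"]  # box term now contributes 0.8
```

The appeal is that no grid search is needed: the weights adapt every iteration to whatever scale the individual losses happen to have.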
Fig. 3 presents a comparative overview of how we adapt RS Loss to train different architectures: When we use RS Loss to train the classifier (Fig. 3(b)), we remove aux. heads (e.g. the IoU head in IoU-Net) and sampling heuristics (e.g. OHEM in YOLACT, random sampling in Faster R-CNN). We adopt score-based weighting in the box regression and mask prediction heads, and prefer Dice Loss, instead of the common Cross-entropy Loss, to train the mask prediction head for instance segmentation due to (i) its bounded range (between 0 and 1), and (ii) its holistic evaluation of the predictions, both similar to GIoU Loss. Finally, we set $\lambda_t^k$ (Eq. 1) to a scalar via value-based balancing every iteration (Fig. 3(c)), with the single exception of RPN, where we multiply the losses of RPN by a constant weight following aLRP Loss.
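The Dice Loss mentioned above can be sketched as follows (a common soft-Dice formulation; variants exist, e.g. squaring the denominator terms, so this is one illustrative choice):

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice Loss for a predicted mask and a binary ground-truth mask.

    pred: probabilities in [0, 1]; target: {0, 1}. The loss is bounded
    in [0, 1] and scores the mask holistically rather than per pixel,
    unlike cross-entropy.
    """
    pred, target = pred.flatten(), target.flatten().float()
    inter = (pred * target).sum()
    return 1 - (2 * inter + eps) / (pred.sum() + target.sum() + eps)

perfect = torch.ones(4, 4)
loss = dice_loss(perfect, perfect)  # near 0 for perfect overlap
```

A completely wrong mask yields a loss near 1, so, like GIoU Loss, every instance contributes a comparably scaled term.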
[Table 1: comparison with R-CNN variants – FPN (CVPR 17), aLRP Loss (NeurIPS 20), GIoU Loss (CVPR 19), IoU-Net (ECCV 18), Libra R-CNN (CVPR 19), AutoLoss-A (ICLR 21), Carafe FPN (ICCV 19) and Dynamic R-CNN (ECCV 20), compared by assignment, sampler and aux. head; the numeric result columns were lost in extraction.]
To present the contribution of RS Loss in terms of performance and tuning simplicity, we conduct experiments on seven visual detectors with a diverse set of architectures: four object detectors (i.e. Faster R-CNN , Cascade R-CNN , ATSS  and PAA  – Section 6.1) and three instance segmentation methods (i.e. Mask R-CNN , YOLACT  and SOLOv2  – Section 6.2). Finally, Section 6.3 presents ablation analysis.
To train Faster R-CNN and Cascade R-CNN with our RS Loss (i.e. RS-R-CNN), we remove sampling from all stages (i.e. RPN and R-CNNs), use all anchors to train RPN and the top-scoring proposals per image (the mmdetection defaults for Faster R-CNN and Cascade R-CNN), replace the softmax classifiers by binary sigmoid classifiers and tune only the initial learning rate.
RS Loss on a standard Faster R-CNN outperforms (Table 1): (i) FPN (Cross-entropy & Smooth L1 losses) by ~3 box AP, (ii) aLRP Loss, a SOTA ranking-based baseline, by ~2 box AP, (iii) IoU-Net, which has an aux. head, and (iv) Dynamic R-CNN, its closest counterpart. We then use the lightweight Carafe as the upsampling operation in FPN (RS-R-CNN+), still maintaining a gap over Carafe FPN and outperforming all methods in all AP- and oLRP-based [25, 27] performance measures except one localisation-oriented measure, which implies that our main contribution is in the classification task trained by our RS Loss and that there is still room for improvement in the localisation task. RS Loss also improves the stronger Cascade R-CNN baseline (Appendix C presents detailed results for Cascade R-CNN). Finally, RS Loss has the fewest hyper-parameters (Table 1) and needs neither a sampler, an aux. head nor tuning of $\lambda_t^k$ (Eq. 1).
[Table 2: loss-function comparison on ATSS and PAA – Focal Loss, AP Loss, aLRP Loss and our RS Loss, marked by unified/rank-based properties, aux. head and H#; most numeric columns were lost in extraction (the RS Loss row retains 67.9 for ATSS and 67.3 for PAA).]
We train ATSS and PAA, which include a centerness head and an IoU head respectively in their architectures. We adopt the anchor configuration of Oksuz et al. for all ranking-based losses (different anchor configurations do not affect the performance of standard ATSS) and tune only the learning rate. While training PAA, we keep its scoring function, which splits positives and negatives, for a fair comparison among the different loss functions.
Comparison with AP and aLRP Losses, the ranking-based baselines: We simply replaced Focal Loss by AP Loss to train the networks, and for aLRP Loss, similar to our RS Loss, we tuned only its learning rate owing to its tuning simplicity. For both ATSS and PAA, RS Loss provides significant gains over both ranking-based alternatives, which required 100 epochs of training with SSD-like augmentation in previous work [8, 26] (Table 2).
Comparison with Focal Loss, the default loss function: RS Loss provides a clear AP gain when both networks are trained equally without an aux. head (Table 2), and still improves over the default networks with aux. heads.
Comparison with QFL, the score-based loss function using continuous IoUs as labels: To apply QFL to PAA, we remove the aux. IoU head (as we did with ATSS), test two possible options ((i) the default PAA setting with IoU-based weighting, (ii) the default QFL setting with score-based weighting – Section 5.2) and report the best result for QFL. While the results of QFL and RS Loss are similar for ATSS, there is an AP gap in favor of our RS Loss on PAA, which may be due to the different positive-negative labelling method in PAA (Table 2).
Here, we use our RS-R-CNN since it yields the largest improvement over its baseline. We train RS-R-CNN for 36 epochs using multiscale training (randomly resizing the shorter side) on ResNet-101 with DCNv2. Table 3 reports the results on COCO test-dev: our RS-R-CNN outperforms the similarly trained Faster R-CNN and Dynamic R-CNN. Although we do not increase the number of parameters of Faster R-CNN, RS-R-CNN outperforms all multi-stage detectors including TridentNet, which has more parameters. Our RS-R-CNN+ (Section 6.1.1) and RS-Mask R-CNN+ (Section 6.2) improve further, outperforming all one- and multi-stage counterparts.
[Table 3: COCO test-dev comparison with multi-stage detectors – Faster R-CNN, TridentNet, Dynamic R-CNN; the numeric columns were lost in extraction.]
We train Mask R-CNN  on COCO and LVIS datasets by keeping all design choices of Faster R-CNN the same.
COCO: We observe an AP gain in both segmentation and detection performance over Mask R-CNN (Table 4). Also, as hypothesized, RS-Mask R-CNN outperforms Mask Scoring R-CNN, which has an additional mask-IoU head, in both mask and box AP as well as in mask oLRP.
LVIS: Replacing the cross-entropy loss used to train Mask R-CNN with repeat factor sampling (RFS) by our RS Loss improves performance by ~3.5 mask AP on the long-tailed LVIS dataset (with ~7 AP improvement on rare classes) and outperforms recent counterparts (Table 5).
Here, we train two different approaches with our RS Loss: (i) YOLACT, a real-time instance segmentation method involving sampling heuristics (e.g. OHEM), an aux. head and carefully tuned loss weights, for which we demonstrate that RS Loss can discard all of these while improving performance; and (ii) SOLOv2, an anchor-free SOTA method.
[Table 6: effect of removing YOLACT's additional training heuristics (OHEM, size-based normalization, semantic segmentation head) on segmentation and detection AP/oLRP, with H#; the numeric rows were lost in extraction.]
YOLACT: Following YOLACT, we train (and test) RS-YOLACT with the same image size and number of epochs. Instead of searching for the epochs at which to decay the learning rate, carefully tuned for YOLACT, we simply adopt cosine annealing. Then, we remove (i) OHEM, (ii) the semantic segmentation head, (iii) the carefully tuned task weights and (iv) size-based normalization (i.e. normalizing the mask-head loss of each instance by its ground-truth area). Removing each heuristic from the baseline causes a slight-to-significant performance drop, or at least requires retuning of the task weights (Table 6). After these simplifications, our RS-YOLACT improves the baseline in both mask AP and box AP.
SOLOv2: Following Wang et al., we train anchor-free SOLOv2 with RS Loss for 36 epochs using multiscale training in its two different settings: (i) SOLOv2-light, the real-time setting with a ResNet-34 backbone, trained with 32 images per batch; and (ii) SOLOv2, the SOTA setting with a ResNet-101 backbone, trained with 16 images per batch. Since SOLOv2 does not have a box regression head, we use the Dice coefficient as the continuous label of RS Loss (see Appendix C for an analysis of using different localisation qualities as labels for instance segmentation). Again, RS Loss performs better than the baseline (i.e. Focal Loss and Dice Loss) only by tuning the learning rate (Table 7).
We use our RS-Mask R-CNN (i.e. standard Mask R-CNN with RS Loss) to compare with SOTA methods. To fit in the 16GB memory of our V100 GPUs while keeping all other settings unchanged, we limit the maximum number of proposals in the mask head to 200, which can simply be omitted for GPUs with larger memory. Following our counterparts [39, 40], we first train RS-Mask R-CNN for 36 epochs with multiscale training using ResNet-101 (Table 8), improving Mask R-CNN and outperforming all SOTA methods by a notable margin. Then, we train RS-Mask R-CNN+ (i.e. standard Mask R-CNN except that the upsampling in FPN is the lightweight Carafe), also extending the multiscale range, which even outperforms all models with DCN. With DCN on ResNet-101, our RS-Mask R-CNN+ improves further still.
[Table 8: SOTA comparison – without DCN: PolarMask, Mask R-CNN, CenterMask, RS-Mask R-CNN (Ours), RS-Mask R-CNN+ (Ours); with DCN: Mask Scoring R-CNN, RS-Mask R-CNN+ (Ours), RS-Mask R-CNN+* (Ours); the numeric columns were lost in extraction.]
Contribution of the components: Replacing Focal Loss by RS Loss improves performance significantly (Table 9). Score-based weighting has a minor contribution, and value-based task balancing simplifies tuning.
[Table 9 header: architecture / RS Loss / score-based weighting / task balancing / H# (the "w/o aux. head" row retains two checkmarks and H# = 2). Table 10 header: dataset / sampler / desired vs. actual negative counts; the numeric rows were lost in extraction.]
Robustness to imbalance: Without tuning, RS Loss can successfully train models at very different imbalance levels (Table 10): our RS Loss (i) performs well on COCO with the standard random samplers (i.e. the data is relatively balanced, especially for RPN), (ii) utilizes more data when the samplers are removed, resulting in an AP gain, and (iii) outperforms all counterparts on the long-tailed LVIS dataset (cf. Table 5), where the imbalance is extreme for R-CNN (Table 10). Appendix C presents a detailed discussion.
Contribution of the sorting error: To see the contribution of our additional sorting error, we track, during training, Spearman's ranking correlation coefficient (ρ) between the IoUs and the classification scores, as an indicator of sorting quality, with and without the additional sorting error (see Eq. 6-8). As hypothesized, using the sorting error improves the sorting quality, i.e. ρ averaged over all (and over the last 100) iterations, for RS-R-CNN.
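Tracking the sorting quality as described takes only a few lines (a minimal Spearman's ρ without tie handling, adequate when the scores and IoUs are distinct; with ties one would instead use e.g. scipy.stats.spearmanr):

```python
import numpy as np

def spearman_rho(a, b):
    """Spearman's rank correlation between two equal-length score lists.

    Ranks are obtained via double argsort; Pearson correlation of the
    centered ranks then gives rho. No tie correction is applied.
    """
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float((ra * rb).sum() / np.sqrt((ra ** 2).sum() * (rb ** 2).sum()))

ious = [0.9, 0.7, 0.5, 0.3]
scores = [0.8, 0.6, 0.4, 0.2]   # same ordering as the IoUs -> rho = 1
spearman_rho(ious, scores)
```

A ρ of 1 means the classification scores sort the positives exactly by IoU, which is the behavior the sorting error is designed to encourage.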
Effect on efficiency: On average, one training iteration of RS Loss takes somewhat longer than that of score-based losses. See Appendix C for more discussion of the effect of RS Loss on training and inference time.
In this paper, we proposed RS Loss as a ranking-based loss function to train object detectors and instance segmentation methods. Unlike existing ranking-based losses, which aim to rank positives above negatives, our RS Loss also sorts positives wrt. their localisation qualities, which is consistent with NMS and the performance measure, AP. With RS Loss, we employed a simple, loss-value-based, tuning-free heuristic to balance all heads in the visual detectors. As a result, we showed on seven diverse visual detectors that RS Loss both consistently improves performance and significantly simplifies the training pipeline.
Acknowledgments: This work was supported by the Scientific and Technological Research Council of Turkey (TÜBİTAK) (under project no 117E054 and 120E494). We also gratefully acknowledge the computational resources kindly provided by TÜBİTAK ULAKBIM High Performance and Grid Computing Center (TRUBA) and Roketsan Missiles Inc. used for this research. Dr. Oksuz is supported by the TÜBİTAK 2211-A National Scholarship Programme for Ph.D. students. Dr. Kalkan is supported by the BAGEP Award of the Science Academy, Turkey.
In this section, we present the derivations of gradients and obtain the loss value and gradients of RS Loss on an example in order to provide more insight.
The gradients of a ranking-based loss function can be determined as follows. Eq. 3 in the paper states that

$$\frac{\partial \mathcal{L}}{\partial s_i} = \frac{1}{Z}\Big(\sum_{j} \Delta x_{ji} - \sum_{j} \Delta x_{ij}\Big).$$

Our Identity Update reformulation suggests replacing $\Delta x_{ij}$ by $L_{ij} - L^*_{ij}$, yielding:

$$\frac{\partial \mathcal{L}}{\partial s_i} = \frac{1}{Z}\Big(\sum_{j} \big(L_{ji} - L^*_{ji}\big) - \sum_{j} \big(L_{ij} - L^*_{ij}\big)\Big).$$

We split both summations into two based on the labels of the examples ($j \in \mathcal{P}$ or $j \in \mathcal{N}$) and express $\frac{\partial \mathcal{L}}{\partial s_i}$ using four terms:

$$\frac{\partial \mathcal{L}}{\partial s_i} = \frac{1}{Z}\Big(\sum_{j \in \mathcal{P}} \big(L_{ji} - L^*_{ji}\big) + \sum_{j \in \mathcal{N}} \big(L_{ji} - L^*_{ji}\big) - \sum_{j \in \mathcal{P}} \big(L_{ij} - L^*_{ij}\big) - \sum_{j \in \mathcal{N}} \big(L_{ij} - L^*_{ij}\big)\Big). \quad \text{(A.17)}$$
Then, the primary terms of RS Loss, defined in Eq. 9 in the paper, assign a ranking term to pairs with $i \in \mathcal{P}, j \in \mathcal{N}$ and a sorting term to pairs with $i, j \in \mathcal{P}$ (Eq. A.18). With these primary-term definitions, we obtain the gradients of RS Loss using Eq. A.17.
Gradients for $i \in \mathcal{N}$. For $i \in \mathcal{N}$, three of the four terms in Eq. A.17 vanish: no negative-to-negative error is defined for RS Loss, and no error originates from a negative in the outgoing direction (see Eq. A.18). Expressing the single remaining term with the primary-term definitions of RS Loss then yields the gradient of RS Loss for $i \in \mathcal{N}$, concluding the derivation for this case.
Gradients for $i \in \mathcal{P}$. We follow the same methodology for $i \in \mathcal{P}$ and express the same four terms: the terms involving $j \in \mathcal{N}$ simplify since no error is incurred when a positive outranks a negative (see Eq. A.18), and the remaining terms reduce by rearranging and by using the fact that each $p(\cdot|\cdot)$ is a pmf, yielding the gradient of RS Loss for $i \in \mathcal{P}$.