1 Introduction
Rank-based metrics are frequently used to evaluate performance on a wide variety of computer vision tasks. For example, in the case of image retrieval, these metrics are required since, at test time, the models produce a ranking of images based on their relevance to a query. Rank-based metrics are also popular in classification tasks with unbalanced class distributions or multiple classes per image. One prominent example is object detection, where an average over multiple rank-based metrics is used for final evaluation. The most common metrics are recall [12], Average Precision (AP) [68], normalized discounted cumulative gain (NDCG) [6], and the Spearman coefficient [9].
Directly optimizing for rank-based metrics is inviting but also notoriously difficult due to the non-differentiable (piecewise constant) and non-decomposable nature of such metrics. A trivial solution is to use one of several popular surrogate functions such as the 0-1 loss [32], the area under the ROC curve [1], or cross entropy. Many studies from the last two decades have addressed direct optimization, with approaches ranging from histogram-binning approximations [4, 48, 19], finite-difference estimation [22], loss-augmented inference [68, 39], and gradient approximation [55], all the way to using a large LSTM to fit the ranking operation [10]. Despite the clear progress in direct optimization [39, 4, 8], these methods are notably omitted in the most publicly used implementation hubs for object detection [7, 64, 36, 24] and image retrieval [50]. The reasons include poor scaling with sequence lengths, lack of publicly available implementations that are efficient on modern hardware, and fragility of the optimization itself.
In a clean formulation, backpropagation through rank-based losses reduces to providing a meaningful gradient of the piecewise constant ranking function. This is an interpolation problem, rather than a gradient estimation problem (the true gradient is simply zero almost everywhere). Accordingly, the properties of the resulting interpolation (whose gradients are returned) should be the central focus, rather than the gradient itself.
In this work, we interpolate the ranking function via blackbox backpropagation [60], a framework recently proposed in the context of combinatorial solvers. This framework is the first to give mathematical guarantees on an interpolation scheme. It applies to piecewise constant functions that originate from minimizing a discrete objective function. To use this framework, we reduce the ranking function to a combinatorial optimization problem. In effect, we inherit two important features of [60]: mathematical guarantees and the ability to compute the gradient only with the use of a non-differentiable blackbox implementation of the ranking function. This allows using implementations of ranking functions that are already present in popular machine learning frameworks, which results in straightforward implementation and significant practical speedup. Finally, differentiating the ranking function directly gives additional flexibility for designing loss functions.
Having a conceptually pure solution for the differentiation, we can then focus on another key aspect: sound loss design. To avoid ad-hoc modifications, we take a deeper look at the caveats of direct optimization for rank-based metrics. We offer multiple approaches for addressing these caveats; most notably, we introduce margin-based versions of rank-based losses and mathematically derive a recall-based loss function that provides dense supervision.
Experimental evaluation is carried out on image retrieval tasks, where we optimize the recall-based loss, and on object detection, where we directly optimize mean Average Precision. On the retrieval experiments, we achieve performance that is on par with the state of the art while using a simpler setup. On the detection tasks, we show consistent improvement over highly optimized implementations that use the cross-entropy loss, while our loss is used in an out-of-the-box fashion. We release the code used for our experiments.¹ ¹https://github.com/martiuslab/blackboxbackprop
2 Related work
Optimizing for rank-based metrics
As rank-based evaluation metrics are now central to multiple research areas, their direct optimization has become of great interest to the community. Traditional approaches typically rely on different flavors of loss-augmented inference [39, 68, 38, 37] or gradient approximation [55, 22]. These approaches often require solving a combinatorial problem as a subroutine, where the nature of the problem depends on the particular rank-based metric. Consequently, efficient algorithms for these subproblems were proposed [39, 55, 68]. More recently, differentiable histogram-binning approximations [4, 19, 20, 48] have gained popularity, as they offer a more flexible framework. Completely different techniques, including learning a distribution over rankings [58], using a policy-gradient update rule [44], learning the sorting operation entirely with a deep LSTM [10], or perceptron-like error-driven updates [8], have also been applied.
Metric learning
There is a great body of work on metric learning for retrieval tasks, where defining a suitable loss function plays an essential role. Bellet et al. [2] and Kulis et al. [29] provide a broader survey of metric learning techniques and applications. Approaches with local losses range from employing pair losses [3, 28] and triplet losses [51, 23, 55] to quadruplet losses [30]. While the majority of these works focus on local, decomposable losses as above, multiple lines of work exist for directly optimizing global rank-based losses [10, 58, 48]. The importance of good batch-sampling strategies is also well known and is the subject of multiple studies [41, 51, 12, 63], while others focus on generating novel training examples [70, 55, 40].
Object detection
Modern object detectors use a combination of different losses during training [14, 47, 33, 46, 18, 31]. While the biggest performance gains have originated from improved architectures [47, 18, 45, 13] and feature extractors [17, 71], some works have focused on formulating better loss functions [31, 49]. Since its introduction in the Pascal VOC object detection challenge [11], mean Average Precision (mAP) has become the main evaluation metric for detection benchmarks. Using the mAP metric as a replacement for other, less suitable objective functions has thus been studied in several works [55, 22, 44, 8].
3 Background
3.1 Rank-based metrics
For a positive integer $n$, we denote by $\Pi_n$ the set of all permutations of $\{1, \dots, n\}$. The rank of a vector $s = (s_1, \dots, s_n) \in \mathbb{R}^n$, denoted by $\mathrm{rk}(s)$, is a permutation $\pi \in \Pi_n$ satisfying
$s_{\pi^{-1}(1)} \ge s_{\pi^{-1}(2)} \ge \dots \ge s_{\pi^{-1}(n)}$, (1)
i.e. sorting $s$ in descending order. Note that the rank is not defined uniquely for vectors in which any two components coincide. In the formal presentation, we restrict our attention to proper rankings, in which ties do not occur.
Equivalently, the rank of the $i$-th element is one plus the number of elements in the sequence exceeding its value, i.e.
$\mathrm{rk}(s)_i = 1 + |\{\, j : s_j > s_i \,\}|$. (2)
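As a concrete illustration of Eqs. 1 and 2, the rank can be computed with two passes of argsort; the following minimal NumPy sketch assumes distinct scores (the function name is ours):

```python
import numpy as np

def rank(s):
    """Rank of each score: 1 + number of strictly larger scores (Eq. 2).

    Equivalently, the permutation that sorts s in descending order (Eq. 1).
    """
    s = np.asarray(s, dtype=float)
    order = np.argsort(-s)                 # indices sorting s descending
    ranks = np.empty(len(s), dtype=int)
    ranks[order] = np.arange(1, len(s) + 1)  # invert the permutation
    return ranks

scores = np.array([0.2, 0.9, 0.5])
print(rank(scores))  # [3 1 2]: 0.9 is ranked first, 0.2 last
```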
3.1.1 Average Precision
For a fixed query, let $s \in \mathbb{R}^n$ be a vector of relevance scores of $n$ examples. We denote by $y^* \in \{0, 1\}^n$ the vector of their ground-truth labels (relevant/irrelevant) and by
$P^* = \{\, i : y^*_i = 1 \,\}$ (3)
the set of indices of the relevant examples. Then Average Precision is given by
$\mathrm{AP}(s, y^*) = \frac{1}{|P^*|} \sum_{i \in P^*} \mathrm{Prec}\big(\mathrm{rk}(s)_i\big)$, (4)
where precision at $k$ is defined as
$\mathrm{Prec}(k) = \frac{1}{k}\,\big|\{\, i \in P^* : \mathrm{rk}(s)_i \le k \,\}\big|$ (5)
and describes the ratio of relevant examples among the $k$ highest-scoring examples.
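Under these definitions, AP can be computed in a single sorted pass over the labels; a minimal NumPy sketch (the function name is ours, ties assumed absent):

```python
import numpy as np

def average_precision(s, y):
    """AP per Eq. 4: mean of Prec(rk(s)_i) over the relevant indices i."""
    s, y = np.asarray(s, float), np.asarray(y, int)
    order = np.argsort(-s)            # descending by score
    rel = y[order]                    # relevance labels in ranked order
    k = np.arange(1, len(s) + 1)
    prec_at_k = np.cumsum(rel) / k    # Eq. 5 evaluated at every k
    return prec_at_k[rel == 1].mean() # average over relevant positions

# Ranked order is: relevant, irrelevant, relevant, irrelevant
print(average_precision([0.9, 0.7, 0.5, 0.1], [1, 0, 1, 0]))
# (1/1 + 2/3) / 2 = 0.8333...
```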
In classification tasks, the dataset typically consists of $N$ annotated images. We formalize this as pairs $(x_j, y_j)$, where $x_j$ is an input image and $y_j \in \{0, 1\}^C$ is a binary class vector in which, for every class $c$, the entry $y_j^c$ denotes whether image $x_j$ belongs to class $c$. Then, for each example, the model provides a vector of suggested class-relevance scores $s_j = \phi(x_j; \theta)$, where $\theta$ are the parameters of the model.
To evaluate mean Average Precision (mAP), we consider for each class $c$ the vector of scores $s^c$ and labels $y^{*c}$. We then take the mean of Average Precisions over all the classes:
$\mathrm{mAP}(\theta) = \frac{1}{C} \sum_{c=1}^{C} \mathrm{AP}(s^c, y^{*c})$. (6)
Note that $0 \le \mathrm{mAP} \le 1$ and that the highest score 1 corresponds to perfect score prediction, in which all relevant examples precede all irrelevant ones.
3.1.2 Recall
Recall is a metric that is often used in information retrieval. Let again $s$ and $y^*$ be the scores and the ground-truth labels for a given query over a dataset. For a positive integer $K$, we set
$\mathrm{R@K}(s, y^*) = 1$ if $\min_{i \in P^*} \mathrm{rk}(s)_i \le K$, and $0$ otherwise, (7)
where $P^*$ is given in Eq. 3.
In a setup where each element $x_q$ of the dataset is a possible query, we define the ground-truth matrix $y^*$ as follows. We set $y^*_{qi} = 1$ if $x_i$ belongs to the same class as the query $x_q$, and zero otherwise. The scores suggested by the model are again denoted by $s_q = \phi(x_q; \theta)$.
In order to evaluate the model over the whole dataset, we average over all $N$ queries, namely
$\mathrm{R@K}(\theta) = \frac{1}{N} \sum_{q=1}^{N} \mathrm{R@K}(s_q, y^*_q)$. (8)
Again, $0 \le \mathrm{R@K} \le 1$ for every $K$. The highest score 1 means that a relevant example is always found among the top $K$ predictions.
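The per-query quantity of Eq. 7 is a single indicator; a minimal NumPy sketch (the function name is ours), with the dataset-wide value of Eq. 8 obtained by averaging over queries:

```python
import numpy as np

def recall_at_k(s, y, k):
    """R@K per Eq. 7: 1 iff some relevant element has rank <= K."""
    s, y = np.asarray(s, float), np.asarray(y, int)
    order = np.argsort(-s)            # descending by score
    return int(y[order][:k].any())    # any relevant item in the top K?

scores = [0.1, 0.8, 0.3, 0.9]
labels = [1, 0, 0, 0]                 # only the first element is relevant
print(recall_at_k(scores, labels, 1))  # 0: the top-1 prediction is index 3
print(recall_at_k(scores, labels, 4))  # 1: the relevant item is in the top 4
```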
3.2 Blackbox differentiation of combinatorial solvers
In order to differentiate the ranking function, we employ a method for efficient backpropagation through combinatorial solvers, recently proposed in [60]. It turns algorithms or solvers for problems like shortest-path, the traveling-salesman problem, and various graph cuts into differentiable building blocks of neural network architectures.
With minor simplifications, such solvers (e.g. for the Multicut problem) can be formalized as maps that take a continuous input $w$ (e.g. edge weights of a fixed graph) and return a discrete output $y$ (e.g. the indicator vector of a subset of edges forming a cut) that minimizes a combinatorial objective expressed as an inner product (e.g. the cost of the cut). Note that our notation differs from the one in [60]. In short, a blackbox solver is a mapping
$w \mapsto \mathrm{Solver}(w) = \arg\min_{y \in Y} \langle w, y \rangle$, (9)
where $Y$ is the discrete set of admissible assignments (e.g. subsets of edges forming cuts).
The key technical challenge when computing the backward pass is meaningful differentiation of the piecewise constant function $w \mapsto L(\mathrm{Solver}(w))$, where $L$ is the final loss of the network. To that end, [60] constructs a family of continuous and (almost everywhere) differentiable functions $f_\lambda$ parametrized by a single hyperparameter $\lambda > 0$ that controls the trade-off between "faithfulness to the original function" and "informativeness of the gradient", see Fig. 1. For a fixed $\lambda$, the gradient of such an interpolation at the point $\hat{w}$ is computed and passed further down the network (instead of the true zero gradient) as
$\nabla f_\lambda(\hat{w}) = -\frac{1}{\lambda}\big[\hat{y} - y_\lambda\big]$, (10)
where $y_\lambda = \mathrm{Solver}(w_\lambda)$ is the output of the solver for a certain precisely constructed modification $w_\lambda$ of the input $\hat{w}$. This modification is where the incoming gradient information is used. For full details, including the mathematical guarantees on the tightness of the interpolation, see [60].
The main advantage of this method is that only a blackbox implementation of the solver (i.e. of the forward pass) is required to compute the backward pass. This implies that powerful optimized solvers can be used instead of relying on suboptimal differentiable relaxations.
4 Method
4.1 Blackbox differentiation for ranking
In order to apply the blackbox differentiation method to ranking, we need to find a suitable combinatorial objective. Let $s \in \mathbb{R}^n$ be a vector of real numbers (the scores) and let $\mathrm{rk}(s)$ be their ranks. The connection between a blackbox solver and ranking is captured in the following proposition.
Proposition 1. Interpreting a permutation $\pi \in \Pi_n$ as the vector $(\pi(1), \dots, \pi(n))$, we have
$\mathrm{rk}(s) = \arg\min_{\pi \in \Pi_n} \langle s, \pi \rangle$. (11)
In other words, the mapping $s \mapsto \mathrm{rk}(s)$ is a minimizer of a linear combinatorial objective, just as Eq. 9 requires.
The proof of Proposition 1 rests upon a classical rearrangement inequality [15, Theorem 368]. The following theorem is its weaker formulation that is sufficient for our purpose.
Theorem 1 (Rearrangement inequality).
For every positive integer $n$, every choice of real numbers $y_1 \ge y_2 \ge \dots \ge y_n$, and every permutation $\pi \in \Pi_n$, it is true that
$\sum_{i=1}^{n} i\, y_i \le \sum_{i=1}^{n} \pi(i)\, y_i$.
Moreover, if $y_1, \dots, y_n$ are distinct, equality occurs precisely for the identity permutation $\pi = \mathrm{id}$.
[Proof of Proposition 1.] Let $\pi$ be the permutation that minimizes (11). This means that the value of the sum
$\sum_{i=1}^{n} s_i\, \pi(i)$ (12)
is the lowest possible. Using the inverse permutation $\sigma = \pi^{-1}$, (12) rewrites as
$\sum_{i=1}^{n} s_{\sigma(i)}\, i$, (13)
and therefore, $\sigma$ being minimal in (13) makes (1) hold due to Theorem 1. This shows that $\pi = \mathrm{rk}(s)$. ∎
The resulting gradient computation is provided in Algorithm 1 and only takes a few lines of code. We call the method Ranking Metric Blackbox Optimization (RaMBO).
Note again the presence of a blackbox ranking operation. In a practical implementation, we can delegate this to a built-in function of the employed framework (e.g. torch.argsort). Consequently, we inherit its computational complexity as well as a fast vectorized implementation on a GPU. To our knowledge, the resulting algorithm is the first to have both truly subquadratic complexity (for both the forward and the backward pass) and to operate with a general ranking function (see also Tab. 1).
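Combining Proposition 1 with the update rule (10) of [60], the backward pass indeed takes only a few lines. The following NumPy sketch illustrates the idea under our own naming; a practical version would be a torch.autograd.Function built on torch.argsort:

```python
import numpy as np

def rank(s):
    """Rank via descending argsort (Eqs. 1-2), assuming distinct scores."""
    order = np.argsort(-s)
    r = np.empty(len(s), dtype=float)
    r[order] = np.arange(1, len(s) + 1)
    return r

def rambo_rank_backward(s, grad_output, lam):
    """Blackbox-backprop gradient of the ranking per Eq. 10.

    grad_output is dL/d(rank) arriving from the loss; lam is lambda.
    """
    r = rank(s)                           # forward output (cached in practice)
    r_lam = rank(s + lam * grad_output)   # solver on the perturbed input
    return -(r - r_lam) / lam             # gradient passed to the scores

s = np.array([0.2, 0.9, 0.5])
# Suppose the loss wants the first element ranked higher (smaller rank value):
grad_out = np.array([1.0, 0.0, 0.0])
print(rambo_rank_backward(s, grad_out, lam=1.0))  # → [-2.  1.  1.]
```

A gradient-descent step on these scores raises the first score and lowers the others, exactly the direction the incoming gradient asked for.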
Method  forward + backward  general ranking 
RaMBO  ✓  
Mohapatra et al. [39]  ✗  
Chen et al. [8]  ✓  
Yue et al. [68]  ✗  
FastAP [4]  ✓  
SoDeep [10]  ✓ 
4.2 Caveats for sound loss design
Is resolving the non-differentiability all that is needed for direct optimization? Unfortunately not. To obtain well-behaved loss functions, some delicate considerations need to be made. Below we list the problems (P1)–(P3) that arise from direct optimization without further adjustments.
(P1)
Evaluation of rank-based metrics is typically carried out over the whole test set, while direct optimization methods rely on minibatch approximations. This, however, does not yield an unbiased gradient estimate. In particular, small minibatch sizes result in optimizing a very poor approximation of the dataset-wide metric, see Fig. 3.
(P2)
Rank-based metrics are brittle when many ties happen in the ranking. As an example, note that any rank-based metric attains all its values in the neighborhood of a dataset-wide tie. Additionally, once a positive example is rated higher than all negative examples, even by the slightest difference, the metric gives no incentive for increasing the difference. This induces a high sensitivity to potential shifts in the statistics when switching to the test set. The need to pay special attention to ties was also noted in [4, 19].
(P3)
Some metrics give only sparse supervision. For example, the value of R@K only improves if the highest-ranked positive example moves up the ranking, while the other positives have no incentive to do so. Similarly, Average Precision gives no incentive to decrease the possibly high scores of negative examples, unless some positive examples are also present in the minibatch. Since positive examples are typically rare, this can be problematic.
4.3 Score memory
In order to mitigate the negative impact of small batch sizes on approximating the dataset-wide loss (P1), we introduce a simple running memory. It stores the scores for the elements of the last several batches, thereby reducing the bias of the estimate. All entries are concatenated for the loss evaluation, but the gradients only flow through the current batch. This is a simpler variant of the "batch-extension" mechanisms introduced in [4, 48].
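A minimal sketch of such a memory, under our own naming and with NumPy standing in for the framework's tensors (in an autograd setting, the stored scores would be detached so that gradients flow only through the current batch):

```python
from collections import deque
import numpy as np

class ScoreMemory:
    """Keeps scores and labels of the last `length` batches."""
    def __init__(self, length):
        self.buffer = deque(maxlen=length)

    def extended_batch(self, scores, labels):
        """Return the current batch concatenated with the remembered ones."""
        past = list(self.buffer)
        self.buffer.append((scores.copy(), labels.copy()))  # detach analogue
        if not past:
            return scores, labels
        old_s, old_y = zip(*past)
        return (np.concatenate([scores, *old_s]),
                np.concatenate([labels, *old_y]))

mem = ScoreMemory(length=2)
mem.extended_batch(np.array([0.1, 0.9]), np.array([0, 1]))  # first batch
ext_s, ext_y = mem.extended_batch(np.array([0.4, 0.6]), np.array([1, 0]))
print(ext_s)  # current batch first, then the remembered scores
```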
4.4 Score margin
Our remedy for the brittleness around ties (P2) is inspired by the triplet loss [51]: we introduce a shift in the scores during training in order to induce a margin. In particular, we add a negative shift to the positively labeled scores and a positive shift to the negatively labeled scores, as illustrated in Fig. 3. This also implicitly removes the destabilizing scale invariance. Using the notation as before, we modify the scores as
$\tilde{s}_i = s_i - \alpha$ if $y^*_i = 1$, and $\tilde{s}_i = s_i + \alpha$ if $y^*_i = 0$, (14)
where $\alpha$ is the prescribed margin. In the implementation, we replace the ranking operation $\mathrm{rk}$ with $\mathrm{rk}_\alpha$ given by
$\mathrm{rk}_\alpha(s, y^*) = \mathrm{rk}(\tilde{s})$. (15)
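The margin-adjusted ranking of Eqs. 14 and 15 can be sketched as follows (names ours); a positive example now has to beat the negatives by the margin to keep its rank:

```python
import numpy as np

def rank_with_margin(s, y, alpha):
    """Margin-adjusted ranking rk_alpha (Eqs. 14-15).

    Positively labeled scores are shifted down by alpha, negative ones up.
    """
    shifted = np.where(y == 1, s - alpha, s + alpha)  # Eq. 14
    order = np.argsort(-shifted)
    r = np.empty(len(s), dtype=int)
    r[order] = np.arange(1, len(s) + 1)
    return r                                          # Eq. 15

s = np.array([0.52, 0.50, 0.10])
y = np.array([1, 0, 0])
print(rank_with_margin(s, y, alpha=0.0))  # [1 2 3]: the positive barely wins
print(rank_with_margin(s, y, alpha=0.1))  # [2 1 3]: within the margin, it is penalized
```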
4.5 Recall loss design
Let $s$ be scores and $y^*$ the ground-truth labels, as usual. As noted in (P3), the value of R@K only depends on the highest-scoring relevant element. We overcome the sparsity of the supervision by introducing a refined metric
$\mathrm{R@K}^{\dagger}(s, y^*) = \frac{1}{|P^*|}\,\big|\{\, i \in P^* : \mathrm{rk}^{-}(s)_i < K \,\}\big|$, (16)
where $P^*$ denotes the set of relevant elements (3) and $\mathrm{rk}^{-}(s)_i$ stands for the number of irrelevant elements outrunning the $i$-th element. Formally,
$\mathrm{rk}^{-}(s)_i = \mathrm{rk}(s)_i - \mathrm{rk}^{+}(s)_i$, (17)
in which $\mathrm{rk}^{+}(s)_i$ denotes the rank of the $i$-th element only within the relevant ones. Note that $\mathrm{R@K}^{\dagger}$ depends on all the relevant elements, as intended. We then define the loss at $K$ as
$L_{\mathrm{R@K}}(s, y^*) = 1 - \mathrm{R@K}^{\dagger}(s, y^*)$. (18)
Next, we choose a weighting of these losses
$L_{\mathrm{R}}(s, y^*) = \sum_{K} c_K\, L_{\mathrm{R@K}}(s, y^*)$ (19)
over values of $K$. In particular, harmonic weights $c_K \propto 1/K$ yield, approximately, the logarithmic form
$L_{\mathrm{R}}(s, y^*) = \frac{1}{|P^*|} \sum_{i \in P^*} \log\big(1 + \mathrm{rk}^{-}(s)_i\big)$. (20)
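A sketch of the resulting loss in NumPy, under our reading of the construction: the count of irrelevant elements outrunning each relevant one is wrapped in a logarithm, applied once or twice depending on the RaMBO variant (names ours; a training implementation would route the ranks through the blackbox-differentiable ranking):

```python
import numpy as np

def rank(s):
    """Rank via descending argsort, assuming distinct scores."""
    order = np.argsort(-s)
    r = np.empty(len(s), dtype=int)
    r[order] = np.arange(1, len(s) + 1)
    return r

def recall_loss(s, y, log_twice=False):
    """Dense recall surrogate: mean log(1 + rk^-) over relevant elements.

    rk^-(s)_i = rk(s)_i - rk^+(s)_i counts the irrelevant elements ranked
    above element i (Eq. 17); applying the log twice gives the second variant.
    """
    s, y = np.asarray(s, float), np.asarray(y, int)
    pos = y == 1
    rk_minus = rank(s)[pos] - rank(s[pos])  # Eq. 17
    vals = np.log1p(rk_minus)
    if log_twice:
        vals = np.log1p(vals)
    return vals.mean()

s = np.array([0.9, 0.8, 0.7, 0.6])
y = np.array([1, 0, 0, 1])
print(recall_loss(s, y))  # positives at ranks 1 and 4 -> (log 1 + log 3) / 2
```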
4.6 Average Precision loss design
Having differentiable ranking, the generic AP does not require any further modifications. Indeed, for any relevant element index $i \in P^*$, its precision obeys
$\mathrm{Prec}\big(\mathrm{rk}(s)_i\big) = \frac{\mathrm{rk}^{+}(s)_i}{\mathrm{rk}(s)_i}$, (21)
where $\mathrm{rk}^{+}(s)_i$ is the rank of the $i$-th element within the relevant ones. The loss then reads
$L_{\mathrm{AP}}(s, y^*) = 1 - \frac{1}{|P^*|} \sum_{i \in P^*} \frac{\mathrm{rk}^{+}(s)_i}{\mathrm{rk}(s)_i}$. (22)
For calculating the mean Average Precision loss $L_{\mathrm{mAP}}$, we simply take the mean over the classes.
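Since Eq. 22 only involves ranks, the loss is a few lines on top of a ranking routine; a NumPy sketch with our own names, assuming distinct scores (during training, the ranks would come from the blackbox-differentiable ranking):

```python
import numpy as np

def rank(s):
    """Rank via descending argsort, assuming distinct scores."""
    order = np.argsort(-s)
    r = np.empty(len(s), dtype=int)
    r[order] = np.arange(1, len(s) + 1)
    return r

def ap_loss(s, y):
    """1 - AP computed from ranks as in Eqs. 21-22.

    Prec(rk(s)_i) = rk^+(s)_i / rk(s)_i for every relevant i, where
    rk^+ ranks the element among the relevant ones only.
    """
    s, y = np.asarray(s, float), np.asarray(y, int)
    pos = y == 1
    rk_all = rank(s)[pos]            # rank among all elements
    rk_pos = rank(s[pos])            # rank among relevant elements
    return 1.0 - np.mean(rk_pos / rk_all)

s = np.array([0.9, 0.7, 0.5, 0.1])
y = np.array([1, 0, 1, 0])
print(ap_loss(s, y))  # 1 - (1/1 + 2/3)/2 = 1/6
```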
5 Experiments
We evaluate the performance of RaMBO on object detection and several image retrieval benchmarks. The experiments demonstrate that our method for differentiating through mAP and recall is generally on par with state-of-the-art results and in some cases yields better performance. We will release code upon publication. Throughout the experimental section, the numbers we report for RaMBO are averaged over three restarts.
5.1 Image retrieval
To evaluate the proposed recall loss (Eq. 20) derived from RaMBO, we run experiments for image retrieval on the CUB-200-2011 [62], Stanford Online Products [54], and In-shop Clothes [34] benchmarks. We compare against a variety of methods from recent years, multiple of which achieve state-of-the-art performance. The best-performing methods are ABE-8 [26], FastAP [4], and Proxy NCA [40].
Architecture
For all experiments, we follow the most standard setup. We use a pretrained ResNet-50 [17] in which we replace the final softmax layer with a fully connected embedding layer that produces a fixed-dimensional embedding vector for each batch element. We normalize each vector so that it represents a point on the unit sphere. The cosine similarities of all distinct pairs of elements in the batch are then computed, and the ground-truth similarities are set to 1 for pairs of elements belonging to the same class and to 0 otherwise. The trivial similarity of each element with itself is disregarded. We compute the recall loss for each batch element with respect to all other batch elements using these similarities and average it to obtain the final loss. Note that our method does not require any sophisticated sampling strategy to compute the loss.
Parameters
We use the Adam optimizer [27] with an amplified learning rate for the embedding layer. We consistently set the batch size to 128 so that each experiment runs on a GPU with 16 GB of memory. Full details regarding the training schedules and exact values of hyperparameters for the different datasets are in the Supplementary material.
Datasets
For data preparation, we resize images, then randomly crop and flip them during training, and use a single center crop at evaluation.
We use the Stanford Online Products dataset, consisting of 120,053 images with 22,634 classes crawled from eBay. The classes are grouped into 12 superclasses (e.g. cup, bicycle), which are used for minibatch preparation following the procedure proposed in [4]. We follow the evaluation protocol proposed in [54], using 59,551 images corresponding to 11,318 classes for training and 60,502 images corresponding to 11,316 classes for testing.
The In-shop Clothes dataset consists of 52,712 images with 7,982 classes. The classes are grouped into 23 superclasses (e.g. MEN/Denim, WOMEN/Dresses), which we use for minibatch preparation as before. We follow previous work by using 25,882 images corresponding to 3,997 classes for training and 14,218 + 12,612 images corresponding to 3,985 classes for testing (split into a query and a gallery set, respectively). Given an image from the query set, we retrieve corresponding images from the gallery set.
The CUB-200-2011 dataset consists of 11,788 images of 200 bird categories. Again, we follow the evaluation protocol proposed in [54], using the first 100 classes comprising 5,864 images for training and the remaining 100 classes with 5,924 images for testing.
Results
For all retrieval results in the tables, we add the embedding dimension as a superscript and the backbone architecture as a subscript. The letters R, G, I, V represent ResNet [21], GoogLeNet [57], Inception [56], and VGG16 [52], respectively. We report results for both RaMBO variants, the main difference being whether the logarithm is applied once or twice to the rank in Eq. (20).
On Stanford Online Products, we report R@K for K ∈ {1, 10, 100, 1000} in Tab. 3. The fact that the dataset contains the highest number of classes seems to favor RaMBO, as it outperforms all other methods. Some example retrievals are presented in Fig. 4.
On CUB-200-2011, we report R@K for K ∈ {1, 2, 4, 8} in Tab. 3. For fairness, we include the performance of Proxy NCA with a ResNet-50 [17] backbone, even though the results are only reported in an online implementation [50]. With this implementation, Proxy NCA and RaMBO are the best-performing methods.
On In-shop Clothes, we report R@K for K ∈ {1, 10, 20, 30, 40, 50} in Tab. 5. The best-performing method is probably FastAP, even though the situation regarding reproducibility is rather puzzling.² RaMBO matches the performance of ABE-8 [26], a complex attention-based ensemble method. We followed the reporting strategy of [26] by evaluating on the test set at regular training intervals and reporting performance at a time point that maximizes the recall.
²FastAP public code [5] offers Matlab and PyTorch implementations. Confusingly, the two implementations give very different results. We contacted the authors, but neither we nor they were able to identify the source of this discrepancy in two seemingly identical implementations. We report both numbers.
5.2 Object detection
We follow a common protocol for testing new components by using Faster R-CNN [47], the most commonly used model in object detection, with standard hyperparameters for all our experiments. We compare against baselines from the highly optimized mmdetection toolbox [7] and only exchange the cross-entropy loss of the classifier with a weighted combination of the mAP and recall losses, adjusting the learning rate accordingly.
Datasets and evaluation
All experiments are performed on the widely used Pascal VOC dataset [11]. We train our models on the Pascal VOC 07 and VOC 12 trainval sets and test them on the VOC 07 test set. Performance is measured in mAP@0.5, which is computed for bounding boxes with at least 50% intersection-over-union overlap with any of the ground-truth bounding boxes.
Parameters
Training is done for 12 epochs on a single GPU with a batch size of 8. The initial learning rate of 0.1 is reduced by a factor of 10 after 9 epochs. For the mAP loss, we use a fixed memory length, margin α, and interpolation hyperparameter λ. The two losses are weighted against each other in a fixed ratio.
Results
We evaluate Faster R-CNN trained on VOC 07 and VOC 07+12 with three different backbones (ResNet-50, ResNet-101, and ResNeXt-101 32x4d [17, 65]). Training with our loss gives a consistent improvement (see Tab. 5) and pushes the standard Faster R-CNN very close to the state-of-the-art values achieved by significantly more complex architectures [69, 25].
5.3 Speed
Since RaMBO can be implemented using sorting functions, it is very fast to compute (see Tab. 7) and can be used on very long sequences. Computing the mAP loss for sequences with 320k elements, as in the object detection experiments, takes less than 5 ms for the forward/backward pass. This is a negligible fraction of the overall computation time on a batch.
5.4 Ablation studies
We verify the validity of our loss design in multiple ablation studies. Table 7 shows the relevance of the margin and the batch memory for the retrieval task. In fact, some of the runs without a margin even diverged. The importance of the margin is also shown for the mAP loss in Tab. 8. Moreover, we can see that the hyperparameter λ of the blackbox optimization scheme does not need precise tuning. Values of λ within a factor of 5 of the selected one still outperform the baseline.
Method  RaMBO  λ  margin  mAP 
Faster R-CNN        74.2 
Faster R-CNN  ✓  0.5    74.6 
Faster R-CNN  ✓  0.1  ✓  75.2 
Faster R-CNN  ✓  0.5  ✓  75.7 
Faster R-CNN  ✓  2.5  ✓  74.3 
6 Discussion
The proposed method RaMBO is singled out by its conceptual purity in directly optimizing for the desired metric while being simple, flexible, and computationally efficient. Driven only by basic loss-design principles and without serious engineering effort, it can compete with state-of-the-art methods on image retrieval and consistently improve near-state-of-the-art object detectors. Exciting opportunities for future work lie in utilizing the ability to efficiently optimize ranking metrics of sequences with millions of elements.
Acknowledgement
We thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Marin Vlastelica and Claudio Michaelis. We acknowledge the support from the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039B). Claudio Michaelis was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) via grant EC 479/11 and the Collaborative Research Center (Projektnummer 276693517 – SFB 1233: Robust Vision).
References
 Bartell et al. [1994] Brian T. Bartell, Garrison W. Cottrell, and Richard K. Belew. Automatic combination of multiple ranked retrieval systems. In ACM Conference on Research and Development in Information Retrieval, SIGIR ’94, pages 173–181. Springer, 1994.
 Bellet et al. [2013] Aurélien Bellet, Amaury Habrard, and Marc Sebban. A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709, 2013.
 Bromley et al. [1994] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a “siamese” time delay neural network. In Advances in Neural Information Processing Systems, NIPS’94, pages 737–744, 1994.

 Cakir et al. [2019a] Fatih Cakir, Kun He, Xide Xia, Brian Kulis, and Stan Sclaroff. Deep metric learning to rank. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR'19, pages 1861–1870, 2019a.
 Cakir et al. [2019b] Fatih Cakir, Kun He, Xide Xia, Brian Kulis, and Stan Sclaroff. Deep Metric Learning to Rank. https://github.com/kunhe/FastAPmetriclearning, 2019b. Commit: 7ca48aa.
 Chakrabarti et al. [2008] Soumen Chakrabarti, Rajiv Khanna, Uma Sawant, and Chiru Bhattacharyya. Structured learning for nonsmooth ranking losses. In KDD, 2008.
 Chen et al. [2019a] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019a. Commit: 9d767a03c0ee60081fd8a2d2a200e530bebef8eb.
 Chen et al. [2019b] Kean Chen, Jianguo Li, Weiyao Lin, John See, Ji Wang, Lingyu Duan, Zhibo Chen, Changwei He, and Junni Zou. Towards accurate onestage object detection with aploss. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5119–5127, 2019b.
 Cohendet et al. [2018] R. Cohendet, Claire-Hélène Demarty, Ngoc Duong, Mats Sjöberg, Bogdan Ionescu, and Thanh-Toan Do. MediaEval 2018: Predicting media memorability. arXiv:1807.01052, 2018.
 Engilberge et al. [2019] Martin Engilberge, Louis Chevallier, Patrick Pérez, and Matthieu Cord. Sodeep: a sorting deep net to learn ranking loss surrogates. In IEEE Conference on Computer Vision and Pattern Recognition, pages 10792–10801, 2019.
 Everingham et al. [2010] Mark Everingham, Luc Van Gool, Chris Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 2010.
 Ge [2018] Weifeng Ge. Deep metric learning with hierarchical triplet loss. In European Conference on Computer Vision, ECCV’18, pages 269–285, 2018.
 Ghiasi et al. [2019] Golnaz Ghiasi, TsungYi Lin, and Quoc V Le. Nasfpn: Learning scalable feature pyramid architecture for object detection. In CVPR, 2019.
 Girshick et al. [2014] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR’14, pages 580–587, 2014.
 Hardy et al. [1952] Godfrey Harold Hardy, John Edensor Littlewood, and George Pólya. Inequalities. Cambridge University Press, Cambridge, England, 1952.
 Harwood et al. [2017] Ben Harwood, BG Kumar, Gustavo Carneiro, Ian Reid, Tom Drummond, et al. Smart mining for deep metric learning. In IEEE International Conference on Computer Vision, pages 2821–2829, 2017.
 He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR'16, pages 770–778, 2016.
 He et al. [2017a] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask RCNN. In IEEE International Conference on Computer Vision, ICCV ’17, pages 2980–2988. IEEE, 2017a.
 He et al. [2018a] Kun He, Fatih Cakir, Sarah Adel Bargal, and Stan Sclaroff. Hashing as tieaware learning to rank. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4023–4032, 2018a.
 He et al. [2018b] Kun He, Yan Lu, and Stan Sclaroff. Local descriptors optimized for average precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 596–605, 2018b.
 He et al. [2017b] Wenhao He, XuYao Zhang, Fei Yin, and ChengLin Liu. Deep direct regression for multioriented scene text detection. In ICCV, 2017b.
 Henderson and Ferrari [2016] Paul Henderson and Vittorio Ferrari. Endtoend training of object class detectors for mean average precision. In Asian Conference on Computer Vision, pages 198–213. Springer, 2016.
 Hoffer and Ailon [2015] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on SimilarityBased Pattern Recognition, pages 84–92. Springer, 2015.
 Huang et al. [2017] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, and Kevin Murphy. Tensorflow Object Detection API. https://github.com/tensorflow/models/tree/master/research/object_detection, 2017. Commit: 0ba83cf.
 Kim et al. [2018a] SeungWook Kim, HyongKeun Kook, JeeYoung Sun, MunCheon Kang, and SungJea Ko. Parallel feature pyramid network for object detection. In European Conference on Computer Vision, ECCV’18, pages 230–256, 2018a.
 Kim et al. [2018b] Wonsik Kim, Bhavya Goyal, Kunal Chawla, Jungmin Lee, and Keunjoo Kwon. Attentionbased ensemble for deep metric learning. In European Conference on Computer Vision, ECCV’18, pages 736–751, 2018b.
 Kingma and Ba [2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, ICLR’14, 2014. arXiv:1412.6980.

 Koch et al. [2015] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
 Kulis et al. [2013] Brian Kulis et al. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287–364, 2013.
 Law et al. [2013] Marc T Law, Nicolas Thome, and Matthieu Cord. Quadrupletwise image similarity learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 249–256, 2013.
 Lin et al. [2018] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
 Lin et al. [2002] Yi Lin, Yoonkyung Lee, and Grace Wahba. Support vector machines for classification in nonstandard situations. Machine Learning, 46(1):191–202, 2002. doi: 10.1023/A:1012406528296.
 Liu et al. [2016a] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, ChengYang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In ECCV, pages 21–37. Springer, 2016a.
 Liu et al. [2016b] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016b.
 Liu et al. [2016c] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In IEEE conference on computer vision and pattern recognition, pages 1096–1104, 2016c.
 Massa and Girshick [2018] Francisco Massa and Ross Girshick. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark, 2018. Commit: f027259.
 McFee and Lanckriet [2010] Brian McFee and Gert R Lanckriet. Metric learning to rank. In Proceedings of the 27th International Conference on Machine Learning (ICML10), pages 775–782, 2010.
 Mohapatra et al. [2014] Pritish Mohapatra, CV Jawahar, and M Pawan Kumar. Efficient optimization for average precision svm. In Advances in Neural Information Processing Systems, pages 2312–2320, 2014.
 Mohapatra et al. [2018] Pritish Mohapatra, Michal Rolinek, CV Jawahar, Vladimir Kolmogorov, and M Pawan Kumar. Efficient optimization for rank-based loss functions. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3693–3701, 2018.
 MovshovitzAttias et al. [2017] Yair MovshovitzAttias, Alexander Toshev, Thomas K Leung, Sergey Ioffe, and Saurabh Singh. No fuss distance metric learning using proxies. In IEEE International Conference on Computer Vision, pages 360–368, 2017.
 Oh Song et al. [2016] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR’16, pages 4004–4012, 2016.
 Oh Song et al. [2017] Hyun Oh Song, Stefanie Jegelka, Vivek Rathod, and Kevin Murphy. Deep metric learning via facility location. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5382–5390, 2017.
 Opitz et al. [2018] Michael Opitz, Georg Waltner, Horst Possegger, and Horst Bischof. Deep metric learning with BIER: Boosting independent embeddings robustly. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
 Rao et al. [2018] Yongming Rao, Dahua Lin, Jiwen Lu, and Jie Zhou. Learning globally optimized object detector via policy gradient. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR’18, pages 6190–6198, 2018.
 Redmon and Farhadi [2018] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
 Redmon et al. [2016] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, realtime object detection. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR’16, 2016.
 Ren et al. [2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, NIPS’15, pages 91–99, 2015.
 Revaud et al. [2019] Jerome Revaud, Jon Almazan, Rafael Sampaio de Rezende, and Cesar Roberto de Souza. Learning with Average Precision: Training image retrieval with a listwise loss. arXiv:1906.07589, 2019.
 Rezatofighi et al. [2019] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, 2019.
 Roth and Brattoli [2019] Karsten Roth and Biagio Brattoli. Easily extendable basic deep metric learning pipeline. https://github.com/Confusezius/Deep-Metric-Learning-Baselines, 2019. Commit: 59d48f9.

 Schroff et al. [2015] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR’15, pages 815–823, 2015.
 Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 Sohn [2016] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, NIPS’16, pages 1857–1865, 2016.
 Song et al. [2016a] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016a.
 Song et al. [2016b] Yang Song, Alexander Schwing, Raquel Urtasun, et al. Training deep neural networks via direct loss minimization. In International Conference on Machine Learning, ICML’16, pages 2169–2177, 2016b.
 Szegedy et al. [2014] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
 Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2015.
 Taylor et al. [2008] Michael Taylor, John Guiver, Stephen Robertson, and Tom Minka. SoftRank: optimizing non-smooth rank metrics. In 2008 International Conference on Web Search and Data Mining, pages 77–86. ACM, 2008.
 Ustinova and Lempitsky [2016] Evgeniya Ustinova and Victor Lempitsky. Learning deep embeddings with histogram loss. In Advances in Neural Information Processing Systems, NIPS’16, pages 4170–4178, 2016.
 Vlastelica et al. [2019] Marin Vlastelica, Anselm Paulus, Vít Musil, Georg Martius, and Michal Rolínek. Differentiation of blackbox combinatorial solvers, 2019. URL https://arxiv.org/abs/1912.02175.
 Wang et al. [2017] Jian Wang, Feng Zhou, Shilei Wen, Xiao Liu, and Yuanqing Lin. Deep metric learning with angular loss. In IEEE International Conference on Computer Vision, ICCV’17, pages 2593–2601, 2017.
 Welinder et al. [2010] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
 Wu et al. [2017] ChaoYuan Wu, R Manmatha, Alexander J Smola, and Philipp Krahenbuhl. Sampling matters in deep embedding learning. In IEEE International Conference on Computer Vision, ICCV’17, pages 2840–2848, 2017.
 Wu et al. [2019] Yuxin Wu, Alexander Kirillov, Francisco Massa, WanYen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019. Commit: dd5926a.
 Xie et al. [2017] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR’17, pages 5987–5995, 2017.
 Xuan et al. [2018] Hong Xuan, Richard Souvenir, and Robert Pless. Deep randomized ensembles for metric learning. In European Conference on Computer Vision, ECCV’18, pages 723–734, 2018.
 Yuan et al. [2017] Yuhui Yuan, Kuiyuan Yang, and Chao Zhang. Hard-aware deeply cascaded embedding. In IEEE International Conference on Computer Vision, ICCV’17, pages 814–823, 2017.
 Yue et al. [2007] Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. A support vector method for optimizing average precision. In ACM SIGIR Conference on Research and Development in Information Retrieval, pages 271–278. ACM, 2007.
 Zhang et al. [2018] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z Li. Single-shot refinement neural network for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR’18, pages 4203–4212, 2018.
 Zhao et al. [2018] Yiru Zhao, Zhongming Jin, Guojun Qi, Hongtao Lu, and Xiansheng Hua. An adversarial approach to hard triplet generation. In European Conference on Computer Vision, ECCV’18, pages 501–517, 2018.
 Zoph et al. [2018] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In CVPR, 2018.
Appendix A Parameters of retrieval experiments
In all experiments we used the Adam optimizer with weight decay and a batch size of 128. All experiments ran for at most 80 epochs, with a learning rate drop after 35 epochs and a batch memory of length 3. We used higher learning rates for the embedding layer, as specified by the defaults in Cakir et al. [5].
We used a super-label batch preparation strategy in which we sample consecutive batches for the same super-label pair, as specified by Cakir et al. [5]. For the In-shop Clothes dataset we used 4 batches per pair of super-labels and 8 samples per class within a batch. For the Online Products dataset we used 10 batches per pair of super-labels and 4 samples per class within a batch. For CUB-200 there are no super-labels and we simply sample 4 examples per class within a batch. These values again follow Cakir et al. [5]. The remaining settings are listed in Table 9.
         Online Products   In-shop   CUB-200
margin   0.02              0.05      0.02
         4                 0.2       0.2
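The super-label batch preparation described above can be sketched as follows. This is an illustrative implementation, not the code of Cakir et al. [5]; the function and argument names are our own, and samples are drawn without replacement within a class for simplicity.

```python
import random
from collections import defaultdict

def superlabel_batches(labels, superlabels, batches_per_pair=4,
                       samples_per_class=8, batch_size=128, rng=None):
    """Yield batches: pick a pair of super-labels, then draw several
    consecutive batches whose classes all belong to that pair, with a
    fixed number of samples per class (hypothetical sketch)."""
    rng = rng or random.Random(0)
    # group sample indices by class, and classes by super-label
    by_class = defaultdict(list)
    by_super = defaultdict(set)
    for idx, (c, s) in enumerate(zip(labels, superlabels)):
        by_class[c].append(idx)
        by_super[s].add(c)
    supers = sorted(by_super)
    classes_per_batch = batch_size // samples_per_class
    while True:
        s1, s2 = rng.sample(supers, 2)            # a pair of super-labels
        pool = list(by_super[s1] | by_super[s2])  # their classes
        for _ in range(batches_per_pair):         # consecutive batches
            batch = []
            for c in rng.sample(pool, min(classes_per_batch, len(pool))):
                members = by_class[c]
                batch.extend(rng.sample(members,
                                        min(samples_per_class, len(members))))
            yield batch
```

With the In-shop settings (8 samples per class, batch size 128), each batch contains 16 classes drawn from a single super-label pair.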
Appendix B Proofs
Lemma 1.
Let be a sequence of non-negative weights and let be positive integers. Then
(24) 
where
(25) 
Note that the sum on the left-hand side of (24) is finite.
Proposition 2.
[Proof of (20).] Let us set for . Then from Taylor’s expansion of we have the desired and
Appendix C Ranking surrogates visualization
For the interested reader, we additionally present visualizations of the smoothing effects introduced by different approaches for direct optimization of rank-based metrics. We display the behaviour of our approach based on blackbox differentiation [60], of FastAP [4], and of SoDeep [10].
In the following, we fix a 20-dimensional score vector and a loss function that is a (random but fixed) linear combination of its ranks. We plot a (random but fixed) two-dimensional section of the loss landscape. In Fig. 5(a) we see the true piecewise constant function. In Fig. 5(b), Fig. 5(c) and Fig. 5(d), the ranking is replaced by the interpolated ranking [60], by the FastAP soft-binning ranking [4], and by the pretrained SoDeep LSTM [10], respectively. In Fig. 4(a) and Fig. 4(b), the evolution of the loss landscape with respect to the parameters is displayed for the blackbox ranking and for FastAP.
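The piecewise constant landscape of Fig. 5(a) is straightforward to reproduce. A minimal NumPy sketch of the construction follows; the seed and the two section directions are arbitrary choices for illustration, not the ones used in the figures.

```python
import numpy as np

def rank(theta):
    """rk(theta): rank of each score under decreasing sort
    (rank 1 = highest score), as used by rank-based losses."""
    order = np.argsort(-theta)
    ranks = np.empty_like(order)
    ranks[order] = np.arange(1, len(theta) + 1)
    return ranks

rng = np.random.default_rng(0)
dim = 20
theta0 = rng.normal(size=dim)       # fixed 20-dimensional score vector
w = rng.uniform(size=dim)           # fixed linear combination of the ranks
d1, d2 = rng.normal(size=(2, dim))  # a fixed 2-D section of score space

def loss(u, v):
    # piecewise constant in (u, v): the value only changes when
    # two perturbed scores cross and the ranking reorders
    return float(w @ rank(theta0 + u * d1 + v * d2))

# the true landscape is flat almost everywhere: a tiny perturbation
# that does not reorder the scores leaves the loss unchanged
assert loss(0.0, 0.0) == loss(1e-9, 0.0)
```

Evaluating `loss` on a grid of `(u, v)` values yields the staircase surface of Fig. 5(a); the smoothed variants in Figs. 5(b)-(d) replace `rank` with the respective differentiable surrogate.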

