Optimizing Rank-based Metrics with Blackbox Differentiation

12/07/2019 ∙ by Michal Rolinek, et al.

Rank-based metrics are some of the most widely used criteria for performance evaluation of computer vision models. Despite years of effort, direct optimization for these metrics remains a challenge due to their non-differentiable and non-decomposable nature. We present an efficient, theoretically sound, and general method for differentiating rank-based metrics with mini-batch gradient descent. In addition, we address optimization instability and sparsity of the supervision signal that both arise from using rank-based metrics as optimization targets. The resulting losses based on recall and Average Precision are applied to image retrieval and object detection tasks. We obtain performance that is competitive with the state of the art on standard image retrieval datasets and consistently improve the performance of near state-of-the-art object detectors.







1 Introduction

Rank-based metrics are frequently used to evaluate performance on a wide variety of computer vision tasks. For example, in the case of image retrieval, these metrics are required since, at test time, the models produce a ranking of images based on their relevance to a query. Rank-based metrics are also popular in classification tasks with unbalanced class distributions or multiple classes per image. One prominent example is object detection, where an average over multiple rank-based metrics is used for final evaluation. The most common metrics are recall [12], Average Precision (AP) [68], normalized discounted cumulative gain (NDCG) [6], and the Spearman correlation coefficient [9].

Directly optimizing for rank-based metrics is inviting but also notoriously difficult due to the non-differentiable (piecewise constant) and non-decomposable nature of such metrics. A trivial solution is to use one of several popular surrogate functions such as the 0-1 loss [32], the area under the ROC curve [1], or cross entropy. Many studies from the last two decades have addressed direct optimization, with approaches ranging from histogram-binning approximations [4, 48, 19], finite-difference estimation [22], loss-augmented inference [68, 39], and gradient approximation [55] all the way to using a large LSTM to fit the ranking operation [10].

Despite the clear progress in direct optimization [39, 4, 8], these methods are notably omitted in the most publicly used implementation hubs for object detection [7, 64, 36, 24], and image retrieval [50]. The reasons include poor scaling with sequence lengths, lack of publicly available implementations that are efficient on modern hardware, and fragility of the optimization itself.

In a clean formulation, backpropagation through rank-based losses reduces to providing a meaningful gradient of the piecewise constant ranking function. This is an interpolation problem, rather than a gradient estimation problem (the true gradient is simply zero almost everywhere). Accordingly, the properties of the resulting interpolation (whose gradients are returned) should be of central focus, rather than the gradient itself.

In this work, we interpolate the ranking function via blackbox backpropagation [60], a framework recently proposed in the context of combinatorial solvers. This framework is the first to give mathematical guarantees on an interpolation scheme. It applies to piecewise constant functions that originate from minimizing a discrete objective function. To use this framework, we reduce the ranking function to a combinatorial optimization problem. In effect, we inherit two important features of [60]: mathematical guarantees and the ability to compute the gradient with only a non-differentiable blackbox implementation of the ranking function. This allows using implementations of ranking functions that are already present in popular machine learning frameworks, which results in straightforward implementation and significant practical speed-up. Finally, directly differentiating the ranking function gives additional flexibility for designing loss functions.

Having a conceptually pure solution for the differentiation, we can then focus on another key aspect: sound loss design. To avoid ad-hoc modifications, we take a deeper look at the caveats of direct optimization for rank-based metrics. We offer multiple approaches for addressing these caveats, most notably we introduce margin-based versions of rank-based losses and mathematically derive a recall-based loss function that provides dense supervision.

Experimental evaluation is carried out on image retrieval tasks, where we optimize the recall-based loss, and on object detection, where we directly optimize mean Average Precision. On the retrieval experiments, we achieve performance that is on par with the state of the art while using a simpler setup. On the detection tasks, we show consistent improvement over highly optimized implementations that use the cross-entropy loss, while our loss is used in an out-of-the-box fashion. We release the code used for our experiments at https://github.com/martius-lab/blackbox-backprop.

2 Related work

Optimizing for rank-based metrics

As rank-based evaluation metrics are now central to multiple research areas, their direct optimization has become of great interest to the community. Traditional approaches typically rely on different flavors of loss-augmented inference [39, 68, 38, 37] or gradient approximation [55, 22]. These approaches often require solving a combinatorial problem as a subroutine, where the nature of the problem depends on the particular rank-based metric. Consequently, efficient algorithms for these subproblems were proposed [39, 55, 68].

More recently, differentiable histogram-binning approximations [4, 19, 20, 48] have gained popularity as they offer a more flexible framework. Completely different techniques have also been applied, including learning a distribution over rankings [58], using a policy-gradient update rule [44], learning the sorting operation entirely with a deep LSTM [10], and perceptron-like error-driven updates.
Metric learning

There is a great body of work on metric learning for retrieval tasks, where defining a suitable loss function plays an essential role. Bellet et al. [2] and Kulis et al. [29] provide a broader survey of metric learning techniques and applications. Approaches with local losses range from employing pair losses [3, 28], triplet losses [51, 23, 55] to quadruplet losses [30]. While the majority of these works focus on local, decomposable losses as above, multiple lines of work exist for directly optimizing global rank-based losses [10, 58, 48]. The importance of good batch sampling strategies is also well-known, and is the subject of multiple studies [41, 51, 12, 63], while others focus on generating novel training examples [70, 55, 40].

Object detection

Modern object detectors use a combination of different losses during training [14, 47, 33, 46, 18, 31]. While the biggest performance gains have originated from improved architectures [47, 18, 45, 13] and feature extractors [17, 71], some works focused on formulating better loss functions [31, 49]. Since its introduction in the Pascal VOC object detection challenge [11] mean Average Precision () has become the main evaluation metric for detection benchmarks. Using the metric as a replacement for other less suitable objective functions has thus been studied in several works [55, 22, 44, 8].

3 Background

3.1 Rank-based metrics

For a positive integer $n$, we denote by $\Pi_n$ the set of all permutations of $\{1, \ldots, n\}$. The rank of a vector $y = (y_1, \ldots, y_n) \in \mathbb{R}^n$, denoted by $\mathrm{rank}(y)$, is a permutation $\pi \in \Pi_n$ satisfying

$$y_{\pi^{-1}(1)} \ge y_{\pi^{-1}(2)} \ge \cdots \ge y_{\pi^{-1}(n)}, \tag{1}$$

i.e. $\pi$ sorts $y$ in descending order. Note that the rank is not defined uniquely for vectors in which any two components coincide. In the formal presentation, we restrict our attention to proper rankings in which ties do not occur.

The rank of the $i$-th element is one plus the number of members of the sequence exceeding its value, i.e.

$$\mathrm{rank}(y)_i = 1 + \bigl| \{\, j : y_j > y_i \,\} \bigr|. \tag{2}$$
3.1.1 Average Precision

For a fixed query, let $y \in \mathbb{R}^n$ be a vector of relevance scores of $n$ examples. We denote by $y^* \in \{0, 1\}^n$ the vector of their ground-truth labels (relevant/irrelevant) and by

$$P = \{\, i : y^*_i = 1 \,\} \tag{3}$$

the set of indices of the relevant examples. Then Average Precision is given by

$$\mathrm{AP}(y, y^*) = \frac{1}{|P|} \sum_{i \in P} \mathrm{Prec}\bigl(\mathrm{rank}(y)_i\bigr), \tag{4}$$

where precision at $k$ is defined as

$$\mathrm{Prec}(k) = \frac{\bigl| \{\, i \in P : \mathrm{rank}(y)_i \le k \,\} \bigr|}{k} \tag{5}$$

and describes the ratio of relevant examples among the $k$ highest-scoring examples.

In classification tasks, the dataset typically consists of annotated images. We formalize this as pairs $(x_i, y^*_i)$, where $x_i$ is an input image and $y^*_i \in \{0, 1\}^C$ is a binary class vector in which, for every class $c \in \{1, \ldots, C\}$, the entry $y^*_{i,c}$ denotes whether image $x_i$ belongs to class $c$. Then, for each example, the model provides a vector of suggested class-relevance scores $y_i = f(x_i; \theta)$, where $\theta$ are the parameters of the model.

To evaluate mean Average Precision ($\mathrm{mAP}$), we consider for each class $c$ the vector of scores $y^{(c)} = (y_{1,c}, \ldots, y_{N,c})$ and labels $y^{*(c)}$. We then take the mean of Average Precisions over all the classes:

$$\mathrm{mAP}(y, y^*) = \frac{1}{C} \sum_{c=1}^{C} \mathrm{AP}\bigl(y^{(c)}, y^{*(c)}\bigr). \tag{6}$$

Note that $0 \le \mathrm{mAP} \le 1$ and that the highest score 1 corresponds to a perfect score prediction in which all relevant examples precede all irrelevant examples.

3.1.2 Recall

Recall is a metric that is often used in information retrieval. Let again $y$ and $y^*$ be the scores and the ground-truth labels for a given query over a dataset. For a positive integer $K$, we set

$$\mathrm{rec}_K(y, y^*) = \begin{cases} 1 & \text{if } \mathrm{rank}(y)_i \le K \text{ for some } i \in P, \\ 0 & \text{otherwise}, \end{cases} \tag{7}$$

where $P$ is given in Eq. (3).

In a setup where each element $x_i$ of the dataset is a possible query, we define the ground-truth matrix $y^*$ as follows. We set $y^*_{i,j} = 1$ if $x_j$ belongs to the same class as the query $x_i$, and zero otherwise. The scores suggested by the model are again denoted by $y_i$.

In order to evaluate the model over the whole dataset of $N$ elements, we average over all the queries, namely

$$\mathrm{R@}K(y, y^*) = \frac{1}{N} \sum_{i=1}^{N} \mathrm{rec}_K(y_i, y^*_i). \tag{8}$$

Again, $0 \le \mathrm{R@}K \le 1$ for every $K$. The highest score 1 means that a relevant example is always found among the top $K$ predictions.
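To make the definition concrete, here is a minimal pure-Python sketch of $\mathrm{rec}_K$ for a single query. The helper names `rank_of` and `recall_at_k` are ours, not from the paper's released code:

```python
def rank_of(y):
    """Overall rank of each element: 1 + number of strictly larger scores."""
    return [1 + sum(other > yi for other in y) for yi in y]

def recall_at_k(y, y_true, k):
    """1.0 iff some relevant element (y_true[i] == 1) is ranked in the top k."""
    ranks = rank_of(y)
    relevant = [i for i, t in enumerate(y_true) if t == 1]
    return 1.0 if any(ranks[i] <= k for i in relevant) else 0.0
```

For instance, with scores `[0.1, 0.9, 0.5]` and a single relevant element at index 0 (ranked third), recall at 1 is 0 while recall at 3 is 1.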

3.2 Blackbox differentiation of combinatorial solvers

In order to differentiate the ranking function, we employ a method for efficient backpropagation through combinatorial solvers recently proposed in [60]. It turns algorithms or solvers for problems like shortest path, the traveling salesman problem, and various graph cuts into differentiable building blocks of neural network architectures.

With minor simplifications, such solvers (e.g. for the Multicut problem) can be formalized as maps that take a continuous input $w$ (e.g. edge weights of a fixed graph) and return a discrete output (e.g. the indicator vector of a subset of edges forming a cut) minimizing a combinatorial objective expressed as an inner product (e.g. the cost of the cut). Note that our notation differs from that of [60]. In short, a blackbox solver is

$$s(w) = \operatorname*{arg\,min}_{s \in S} \, \langle w, s \rangle, \tag{9}$$

where $S$ is the discrete set of admissible assignments (e.g. subsets of edges forming cuts).

The key technical challenge when computing the backward pass is meaningful differentiation of the piecewise constant function $w \mapsto L(s(w))$, where $L$ is the final loss of the network. To that end, [60] constructs a family of continuous and (almost everywhere) differentiable functions $f_\lambda$ parametrized by a single hyperparameter $\lambda > 0$ that controls the trade-off between "faithfulness to the original function" and "informativeness of the gradient"; see Fig. 1. For a fixed $\lambda$, the gradient of such an interpolation at a point $w$ is computed and passed further down the network (instead of the true zero gradient) as

$$\nabla_w f_\lambda(w) = -\frac{1}{\lambda} \bigl[ s(w) - s(w_\lambda) \bigr], \tag{10}$$

where $s(w_\lambda)$ is the output of the solver for a certain precisely constructed modification $w_\lambda$ of the input. The modification is where the incoming gradient information is used. For full details, including the mathematical guarantees on the tightness of the interpolation, see [60].

The main advantage of this method is that only a blackbox implementation of the solver (i.e. of the forward pass) is required to compute the backward pass. This implies that powerful optimized solvers can be used instead of relying on suboptimal differentiable relaxations.

4 Method

4.1 Blackbox differentiation for ranking

In order to apply the blackbox differentiation method to ranking, we need to find a suitable combinatorial objective. Let $y \in \mathbb{R}^n$ be a vector of real numbers (the scores) and let $\mathrm{rank}(y)$ be their ranks. The connection between a blackbox solver and ranking is captured in the following proposition.

Proposition 1.

In the notation set by Eqs. (1) and (2), and identifying a permutation $\pi$ with the vector $(\pi(1), \ldots, \pi(n))$, we have

$$\mathrm{rank}(y) = \operatorname*{arg\,min}_{\pi \in \Pi_n} \, \langle y, \pi \rangle. \tag{11}$$

In other words, the mapping $y \mapsto \mathrm{rank}(y)$ is a minimizer of a linear combinatorial objective, just as Eq. (9) requires.

The proof of Proposition 1 rests upon a classical rearrangement inequality [15, Theorem 368]. The following theorem is its weaker formulation that is sufficient for our purpose.

Theorem 1 (Rearrangement inequality).

For every positive integer $n$, every choice of real numbers $y_1 \ge y_2 \ge \cdots \ge y_n$, and every permutation $\pi \in \Pi_n$, it is true that

$$\sum_{i=1}^{n} i \, y_i \;\le\; \sum_{i=1}^{n} \pi(i) \, y_i.$$

Moreover, if $y_1, \ldots, y_n$ are distinct, equality occurs precisely for the identity permutation $\pi = \mathrm{id}$.

[Proof of Proposition 1.] Let $\pi$ be the permutation that minimizes (11). This means that the value of the sum

$$\sum_{i=1}^{n} y_i \, \pi(i) \tag{12}$$

is the lowest possible. Using the inverse permutation $\pi^{-1}$, (12) rewrites as

$$\sum_{i=1}^{n} y_{\pi^{-1}(i)} \, i, \tag{13}$$

and therefore, being minimal in (13) makes (1) hold due to Theorem 1. This shows that $\pi = \mathrm{rank}(y)$.

The resulting gradient computation is provided in Algorithm 1 and only takes a few lines of code. We call the method Ranking Metric Blackbox Optimization (RaMBO).

Note again the presence of a blackbox ranking operation. In a practical implementation, we can delegate this to a built-in function of the employed framework (e.g. torch.argsort). Consequently, we inherit the computational complexity as well as a fast vectorized implementation on a GPU. To our knowledge, the resulting algorithm is the first to have both truly sub-quadratic complexity (for both forward and backward pass) and to operate with a general ranking function (see also Tab. 1).

 Method | forward + backward | general ranking
 Mohapatra et al. [39] | – | ✗
 Chen et al. [8] | – | ✓
 Yue et al. [68] | – | ✗
 FastAP [4] | – | ✓
 SoDeep [10] | – | ✓
 RaMBO (ours) | O(n log n) | ✓
Table 1: Computational complexity of different approaches for differentiable ranking. For SoDeep, the cost depends on the LSTM's hidden state size, and for FastAP on the number of bins. RaMBO is the first method to directly differentiate general ranking with a truly sub-quadratic complexity.


define Ranker as a blackbox operation computing ranks

function ForwardPass(y)
     rk := Ranker(y)
     save y and rk for the backward pass
     return rk

function BackwardPass(∂L/∂rk)
     load y and rk from the forward pass
     load hyperparameter λ
     y_λ := y + λ · ∂L/∂rk
     rk_λ := Ranker(y_λ)
     return −(rk − rk_λ)/λ

Algorithm 1 RaMBO: Blackbox differentiation for ranking
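The scheme above can be sketched in a few lines of plain Python. The released code uses PyTorch and torch.argsort; the function names below are illustrative, not the paper's API:

```python
def rank(y):
    """rank(y)_i = 1 + number of elements strictly exceeding y_i
    (ties broken by position, matching argsort-of-argsort behaviour)."""
    order = sorted(range(len(y)), key=lambda i: -y[i])
    ranks = [0] * len(y)
    for position, i in enumerate(order):
        ranks[i] = position + 1
    return ranks

def rambo_backward(y, incoming_grad, lam):
    """Blackbox-differentiation backward pass: call the (non-differentiable)
    ranker once more at the perturbed input y + lam * dL/drank and return
    the finite-difference-style gradient -(rank(y) - rank(y_lam)) / lam."""
    rk = rank(y)                                   # saved from the forward pass
    y_lam = [yi + lam * g for yi, g in zip(y, incoming_grad)]
    rk_lam = rank(y_lam)                           # one extra call to the ranker
    return [-(a - b) / lam for a, b in zip(rk, rk_lam)]
```

For example, with scores `[0.3, 0.9, 0.5]` and an incoming gradient penalizing the rank of the first element, the returned gradient pushes that score up and the competing scores down.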

4.2 Caveats for sound loss design

Is resolving the non-differentiability all that is needed for direct optimization? Unfortunately not. To obtain well-behaved loss functions, some delicate considerations need to be made. Below we list a few problems that arise from direct optimization without further adjustments.


Evaluation of rank-based metrics is typically carried out over the whole test set, while direct optimization methods rely on mini-batch approximations. This, however, does not yield an unbiased gradient estimate. Small mini-batch sizes in particular result in optimizing a very poor approximation of the dataset-wide $\mathrm{mAP}$; see Fig. 2.


Rank-based metrics are brittle when many ties happen in the ranking. As an example, note that any rank-based metric attains all its values in the neighborhood of a dataset-wide tie. Additionally, once a positive example is rated higher than all negative examples even by the slightest difference, the metric gives no incentive for increasing the difference. This induces a high sensitivity to potential shifts in the statistics when switching to the test set. The need to pay special attention to ties was also noted in [4, 19].


Some metrics give only sparse supervision. For example, the value of $\mathrm{rec}_K$ only improves if the highest-ranked positive example moves up the ranking, while the other positives have no incentive to do so. Similarly, Average Precision gives no incentive to decrease the possibly high scores of negative examples unless some positive examples are also present in the mini-batch. Since positive examples are typically rare, this can be problematic.

4.3 Score memory

In order to mitigate the negative impact of small batch sizes on approximating the dataset-wide loss (Sec. 4.2), we introduce a simple running memory. It stores the scores for elements of the preceding batches, thereby reducing the bias of the estimate. All entries are concatenated for loss evaluation, but the gradients only flow through the current batch. This is a simpler variant of the "batch-extension" mechanisms introduced in [4, 48].
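A minimal sketch of such a memory. The class and method names are ours; a real implementation would additionally detach the stored scores from the autograd graph so that gradients only flow through the current batch:

```python
from collections import deque

class ScoreMemory:
    """Keep the scores of the last `num_batches` batches; concatenate them
    with the current batch for loss evaluation."""

    def __init__(self, num_batches):
        self.buffer = deque(maxlen=num_batches)  # old batches are evicted

    def extend(self, batch_scores):
        """Return stored scores plus the current batch, then store the batch."""
        combined = [s for stored in self.buffer for s in stored] + list(batch_scores)
        self.buffer.append(list(batch_scores))
        return combined
```

With a memory of two batches, each loss evaluation sees up to three batches' worth of scores while only the newest batch would receive gradients.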

4.4 Score margin

Our remedy for the brittleness around ties (Sec. 4.2) is inspired by the triplet loss [51]; we introduce a shift in the scores during training in order to induce a margin. In particular, we add a negative shift to the positively labeled scores and a positive shift to the negatively labeled scores, as illustrated in Fig. 3. This also implicitly removes the destabilizing scale-invariance. Using notation as before, we modify the scores as

$$y^{\alpha}_i = \begin{cases} y_i - \alpha & \text{if } y^*_i = 1, \\ y_i + \alpha & \text{if } y^*_i = 0, \end{cases}$$

where $\alpha$ is the prescribed margin. In the implementation, we replace the ranking operation $\mathrm{rank}(y)$ with $\mathrm{rank}^{\alpha}(y, y^*)$ given by

$$\mathrm{rank}^{\alpha}(y, y^*) = \mathrm{rank}(y^{\alpha}).$$
Figure 2: Mini-batch estimation of mean Average Precision. The expected mini-batch $\mathrm{mAP}$ (i.e. the optimized loss) is an overly optimistic estimator of the true $\mathrm{mAP}$ over the dataset, particularly for small batch sizes. The mean and standard deviations over sampled mini-batch estimates are displayed.

Figure 3: Naive rank-based losses can collapse during optimization. Shifting the scores during training induces a margin and a suitable scale for the scores. Red lines indicate negative scores and green positive scores.

4.5 Recall loss design

Let $y$ be the scores and $y^*$ the ground-truth labels, as usual. As noted in Sec. 4.2, the value of $\mathrm{rec}_K$ depends only on the highest-scoring relevant element. We overcome the sparsity of the supervision by introducing a refined metric

$$\mathrm{rec}'_K(y, y^*) = \frac{1}{|P|} \bigl| \{\, i \in P : N(i) < K \,\} \bigr|, \tag{16}$$

where $P$ denotes the set of relevant elements (3) and $N(i)$ stands for the number of irrelevant elements outrunning the $i$-th element. Formally,

$$N(i) = \mathrm{rank}(y)_i - \mathrm{rank}_P(y)_i, \tag{17}$$

in which $\mathrm{rank}_P(y)_i$ denotes the rank of the $i$-th element only within the relevant ones. Note that $\mathrm{rec}'_K$ depends on all the relevant elements, as intended. We then define the loss at $K$ as

$$L_K(y, y^*) = 1 - \mathrm{rec}'_K(y, y^*). \tag{18}$$

Next, we choose a weighting of these losses

$$L(y, y^*) = \sum_{K \ge 1} w_K \, L_K(y, y^*) \tag{19}$$

over values of $K$.

Proposition 2 (see the Supplementary material) computes a closed form of (19) for a given sequence of weights $(w_K)$. Here, we exhibit closed-form solutions for two natural decreasing sequences of weights:

$$L(y, y^*) = \frac{1}{|P|} \sum_{i \in P} \sum_{K=1}^{N(i)} w_K, \qquad \text{so that} \qquad w_K = \frac{1}{K} \;\Rightarrow\; L \approx \frac{1}{|P|} \sum_{i \in P} \log N(i), \qquad w_K = \frac{1}{K \log K} \;\Rightarrow\; L \approx \frac{1}{|P|} \sum_{i \in P} \log \log N(i), \tag{20}$$

where $\log$ denotes the natural logarithm and the approximations hold up to additive constants.

This also gives a theoretical explanation of why some previous works [8, 22] found it "beneficial" to optimize the logarithm of a ranking metric rather than the metric itself. In our case, the $\log$ arises from the most natural weight decay $w_K \propto 1/K$.
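Under this reading of the definitions, the weighted loss (19) reduces to a partial sum of the weights per relevant element; a pure-Python sketch with illustrative helper names follows:

```python
def rank_of(y):
    """Overall rank of each element: 1 + number of strictly larger scores."""
    return [1 + sum(other > yi for other in y) for yi in y]

def recall_loss(y, y_true, w=lambda k: 1.0 / k):
    """Weighted recall loss L = (1/|P|) * sum_{i in P} sum_{K <= N(i)} w_K,
    where N(i) counts irrelevant elements ranked ahead of relevant i."""
    relevant = [i for i, t in enumerate(y_true) if t == 1]
    ranks = rank_of(y)
    ranks_within = rank_of([y[i] for i in relevant])  # rank among relevant only
    total = 0.0
    for pos, i in enumerate(relevant):
        n_i = ranks[i] - ranks_within[pos]            # irrelevant elements ahead of i
        total += sum(w(k) for k in range(1, n_i + 1))  # harmonic-like partial sum
    return total / len(relevant)
```

A perfect ranking (all positives first) gives zero loss; with one negative ahead of each of two positives, the default `w` yields a loss of 1.0.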

4.6 Average Precision loss design

Having differentiable ranking, the generic $\mathrm{AP}$ loss does not require any further modifications. Indeed, for any relevant element index $i \in P$, its precision obeys

$$\mathrm{Prec}\bigl(\mathrm{rank}(y)_i\bigr) = \frac{\mathrm{rank}_P(y)_i}{\mathrm{rank}(y)_i},$$

where $\mathrm{rank}_P(y)_i$ is the rank of the $i$-th element within all the relevant ones. The loss then reads

$$L_{\mathrm{AP}}(y, y^*) = 1 - \mathrm{AP}(y, y^*) = 1 - \frac{1}{|P|} \sum_{i \in P} \frac{\mathrm{rank}_P(y)_i}{\mathrm{rank}(y)_i}.$$

For calculating the mean Average Precision loss $L_{\mathrm{mAP}}$, we simply take the mean over the classes.

To alleviate the sparsity of supervision caused by rare positive examples (Sec. 4.2), we also consider the loss across all the classes. More specifically, we treat the score and label matrices as concatenated vectors $\bar{y}$ and $\bar{y}^*$, respectively, and set

$$L_{\overline{\mathrm{AP}}}(y, y^*) = L_{\mathrm{AP}}(\bar{y}, \bar{y}^*).$$

This practice is consistent with [8].
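The rank-based identity above makes the AP loss a short sum over the relevant indices; a pure-Python sketch (helper names ours):

```python
def rank_of(y):
    """Overall rank of each element: 1 + number of strictly larger scores."""
    return [1 + sum(other > yi for other in y) for yi in y]

def ap_loss(y, y_true):
    """1 - AP, with Prec(rank_i) = rank-within-relevant / overall rank
    for each relevant index i."""
    relevant = [i for i, t in enumerate(y_true) if t == 1]
    ranks = rank_of(y)
    ranks_within = rank_of([y[i] for i in relevant])  # rank among relevant only
    ap = sum(rw / ranks[i] for rw, i in zip(ranks_within, relevant)) / len(relevant)
    return 1.0 - ap
```

For scores `[0.9, 0.1, 0.8, 0.3]` with positives at indices 0 and 3, the positives sit at positions 1 and 3 of the ranking, so AP = (1/1 + 2/3)/2 = 5/6 and the loss is 1/6, matching the standard position-by-position computation of AP.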

5 Experiments

We evaluate the performance of RaMBO on object detection and several image retrieval benchmarks. The experiments demonstrate that our method for differentiating through $\mathrm{AP}$ and recall is generally on par with state-of-the-art results and in some cases yields better performance. Throughout the experimental section, the numbers we report for RaMBO are averaged over three restarts.

5.1 Image retrieval

To evaluate the proposed Recall Loss (Eq. 20) derived from RaMBO we run experiments for image retrieval on the CUB-200-2011 [62], Stanford Online Products [54], and In-shop Clothes [34] benchmarks. We compare against a variety of methods from recent years, multiple of which achieve state-of-the-art performance. The best-performing methods are ABE-8 [26], FastAP [4], and Proxy NCA [40].

Figure 4: Stanford Online Products image retrieval examples.

For all experiments, we follow the most standard setup. We use a pretrained ResNet50 [17] in which we replace the final softmax layer with a fully connected embedding layer that produces an embedding vector for each batch element. We normalize each vector so that it represents a point on the unit sphere. The cosine similarities of all distinct pairs of elements in the batch are then computed, and the ground-truth similarities are set to 1 for elements belonging to the same class and 0 otherwise. The obvious similarity of each element with itself is disregarded. We compute the recall loss for each batch element with respect to all other batch elements using the similarities and average it to compute the final loss. Note that our method does not require any sophisticated sampling strategy to compute the loss.


We use the Adam optimizer [27] with an amplified learning rate for the embedding layer. We consistently set the batch size to 128 so that each experiment runs on a GPU with 16 GB of memory. Full details regarding training schedules and exact hyperparameter values for the different datasets are in the Supplementary material.


For data preparation, we resize the images and randomly crop and flip them during training, using a single center crop at evaluation.

We use the Stanford Online Products dataset, consisting of product images crawled from eBay. The classes are grouped into 12 superclasses (e.g. cup, bicycle), which are used for mini-batch preparation following the procedure proposed in [4]. We follow the evaluation protocol proposed in [54], splitting the classes into disjoint training and testing sets.

The In-shop Clothes dataset consists of clothing images grouped into classes, which in turn belong to 23 superclasses (e.g. MEN/Denim, WOMEN/Dresses) that we use for mini-batch preparation as before. We follow previous work in splitting the classes into a training and a testing set, with the latter further split into a query and a gallery set. Given an image from the query set, we retrieve corresponding images from the gallery set.

The CUB-200-2011 dataset consists of images of 200 bird categories. Again we follow the evaluation protocol proposed in [54], using the first 100 classes for training and the remaining classes for testing.

 Method | R@1 | R@10 | R@100 | R@1000
 Contrastive [41] | 42.0 | 58.2 | 73.8 | 89.1
 Triplet [41] | 42.1 | 63.5 | 82.5 | 94.8
 LiftedStruct [41] | 62.1 | 79.8 | 91.3 | 97.4
 Binomial Deviance [59] | 65.5 | 82.3 | 92.3 | 97.6
 Histogram Loss [59] | 63.9 | 81.7 | 92.2 | 97.7
 N-Pair-Loss [53] | 67.7 | 83.8 | 93.0 | 97.8
 Clustering [42] | 67.0 | 83.7 | 93.2 | –
 HDC [67] | 69.5 | 84.4 | 92.8 | 97.7
 Angular Loss [61] | 70.9 | 85.0 | 93.5 | 98.0
 Margin [63] | 72.7 | 86.2 | 93.8 | 98.0
 Proxy NCA [40] | 73.7 | – | – | –
 A-BIER [43] | 74.2 | 86.9 | 94.0 | 97.8
 HTL [12] | 74.8 | 88.3 | 94.8 | 98.4
 ABE-8 [26] | 76.3 | 88.4 | 94.8 | 98.2
 FastAP [4] | 76.4 | 89.1 | 95.4 | 98.5
 RaMBO | 77.8 | 90.1 | 95.9 | 98.7
 RaMBO | 78.6 | 90.5 | 96.0 | 98.7
Table 2: Comparison with the state of the art on Stanford Online Products [41]. On this dataset, with the highest number of classes in the test set, RaMBO gives better performance than other state-of-the-art methods.
 Method | R@1 | R@2 | R@4 | R@8
 Contrastive [41] | 26.4 | 37.7 | 49.8 | 62.3
 Triplet [41] | 36.1 | 48.6 | 59.3 | 70.0
 LiftedStruct [41] | 47.2 | 58.9 | 70.2 | 80.2
 Binomial Deviance [59] | 52.8 | 64.4 | 74.7 | 83.9
 Histogram Loss [59] | 50.3 | 61.9 | 72.6 | 82.4
 N-Pair-Loss [53] | 51.0 | 63.3 | 74.3 | 83.2
 Clustering [42] | 48.2 | 61.4 | 71.8 | 81.9
 Proxy NCA [40] | 49.2 | 61.9 | 67.9 | 72.4
 Smart Mining [16] | 49.8 | 62.3 | 74.1 | 83.3
 Margin [63] | 63.8 | 74.4 | 83.1 | 90.0
 HDC [67] | 53.6 | 65.7 | 77.0 | 85.6
 Angular Loss [61] | 54.7 | 66.3 | 76.0 | 83.9
 HTL [12] | 57.1 | 68.8 | 78.7 | 86.5
 A-BIER [43] | 57.5 | 68.7 | 78.3 | 86.2
 ABE-8 [26] | 60.6 | 71.5 | 80.5 | 87.7
 Proxy NCA [50] | 64.0 | 75.4 | 84.2 | 90.5
 RaMBO | 63.5 | 74.8 | 84.1 | 90.4
 RaMBO | 64.0 | 75.3 | 84.1 | 90.6
Table 3: Comparison with the state of the art on the CUB-200-2011 [62] dataset. Our method RaMBO is on par with an (unofficial) ResNet50 implementation of Proxy NCA.
 Method | R@1 | R@10 | R@20 | R@30 | R@50
 FashionNet [35] | 53.0 | 73.0 | 76.0 | 77.0 | 80.0
 HDC [67] | 62.1 | 84.9 | 89.0 | 91.2 | 93.1
 DREML [66] | 78.4 | 93.7 | 95.8 | 96.7 | –
 HTL [12] | 80.9 | 94.3 | 95.8 | 97.2 | 97.8
 A-BIER [43] | 83.1 | 95.1 | 96.9 | 97.5 | 98.0
 ABE-8 [26] | 87.3 | 96.7 | 97.9 | 98.2 | 98.7
 FastAP-Matlab [5]* | 90.9 | 97.7 | 98.5 | 98.8 | 99.1
 FastAP-Python [5] | 83.8 | 95.5 | 96.9 | 97.5 | 98.2
 RaMBO | 88.1 | 97.0 | 97.9 | 98.4 | 98.8
 RaMBO | 86.3 | 96.2 | 97.4 | 97.9 | 98.5
Table 4: Comparison with state-of-the-art methods on the In-shop Clothes [34] dataset. RaMBO is on par with the ensemble method ABE-8. Leading performance is achieved by the Matlab implementation of FastAP.

*The FastAP public code [5] offers Matlab and PyTorch implementations. Confusingly, the two implementations give very different results. We contacted the authors, but neither we nor they were able to identify the source of this discrepancy in two seemingly identical implementations. We report both numbers.
 Method | Backbone | Training | CE | RaMBO
 Faster R-CNN | ResNet50 | 07 | 74.2 | 75.7
 Faster R-CNN | ResNet50 | 07+12 | 80.4 | 81.4
 Faster R-CNN | ResNet101 | 07+12 | 82.4 | 82.9
 Faster R-CNN | X101 32×4d | 07+12 | 83.2 | 83.6
Table 5: Object detection performance on the Pascal VOC 07 test set measured in mAP. Backbone X stands for ResNeXt and CE for the cross-entropy loss.

For all retrieval results in the tables, we add the embedding dimension as a superscript and the backbone architecture as a subscript where available. The letters R, G, I, V represent ResNet [21], GoogLeNet [57], Inception [56], and VGG-16 [52], respectively. We report results for two variants of RaMBO, the main difference being whether the logarithm in Eq. (20) is applied to the rank once or twice.

On Stanford Online Products we report recall at $K \in \{1, 10, 100, 1000\}$ in Tab. 2. The fact that the dataset contains the highest number of classes seems to favor RaMBO, as it outperforms all other methods. Some example retrievals are presented in Fig. 4.

On CUB-200-2011 we report recall at $K \in \{1, 2, 4, 8\}$ in Tab. 3. For fairness, we include the performance of Proxy NCA with a ResNet50 [17] backbone even though the results are only reported in an online implementation [50]. With this implementation, Proxy NCA and RaMBO are the best-performing methods.

On In-shop Clothes we report recall at $K \in \{1, 10, 20, 30, 50\}$ in Tab. 4. The best-performing method is probably FastAP, even though the situation regarding reproducibility is rather puzzling (see the footnote to Tab. 4). RaMBO matches the performance of ABE-8 [26], a complex attention-based ensemble method.

We followed the reporting strategy of [26] by evaluating on the test set at regular training intervals and reporting performance at a time point that maximizes recall.

5.2 Object detection

We follow a common protocol for testing new components by using Faster R-CNN [47], the most commonly used model in object detection, with standard hyperparameters for all our experiments. We compare against baselines from the highly optimized mmdetection toolbox [7]; we only exchange the cross-entropy loss of the classifier for a weighted combination of the Average Precision losses from Sec. 4.6 and adjust the learning rate.

Datasets and evaluation

All experiments are performed on the widely used Pascal VOC dataset [11]. We train our models on the Pascal VOC 07 and VOC 12 trainval sets and test them on the VOC 07 test set. Performance is measured in mAP, computed for bounding boxes with at least 0.5 intersection-over-union overlap with a ground-truth bounding box.


Training is done for 12 epochs on a single GPU with a batch size of 8. The initial learning rate of 0.1 is reduced by a factor of 10 after 9 epochs. For the loss, we use the score memory (Sec. 4.3), the score margin (Sec. 4.4), and a fixed $\lambda$; the losses $L_{\mathrm{mAP}}$ and $L_{\overline{\mathrm{AP}}}$ are combined with a fixed weighting.


We evaluate Faster R-CNN trained on VOC 07 and VOC 07+12 with three different backbones (ResNet50, ResNet101, and ResNeXt101 32×4d [17, 65]). Training with our loss gives a consistent improvement (see Tab. 5) and pushes the standard Faster R-CNN very close to the state-of-the-art values achieved by significantly more complex architectures [69, 25].

5.3 Speed

Since RaMBO can be implemented using sorting functions, it is very fast to compute (see Tab. 6) and can be used on very long sequences. Computing the loss for sequences with 320k elements, as in the object detection experiments, takes less than 5 ms for the forward/backward pass, a small fraction of the overall computation time on a batch.

 Length | 100k | 1M | 10M | 100M
 CPU | 33 ms | 331 ms | 3.86 s | 36.4 s
 GPU | 1.3 ms | 7 ms | 61 ms | 0.62 s
Table 6: Processing time of Average Precision (using a plain PyTorch implementation) depending on sequence length, for forward/backward computation on a single Tesla V100 GPU and one Xeon Gold CPU core at 2.2 GHz.
 Method | CUB200 | In-shop | Online Prod.
 Full RaMBO | 64.0 | 88.1 | 78.6
 No batch memory | 62.5 | 87.0 | 72.4
 No margin | 63.2 | ✗ | ✗
Table 7: Ablation experiments for the margin (Sec. 4.4) and batch memory (Sec. 4.3) in retrieval on the CUB200, In-shop, and Stanford Online Products datasets. Runs marked ✗ diverged.

5.4 Ablation studies

We verify the validity of our loss design in multiple ablation studies. Table 7 shows the relevance of the margin and batch memory for the retrieval task. In fact, some of the runs without a margin even diverged. The importance of the margin is also shown for the detection loss in Tab. 8. Moreover, we can see that the hyperparameter $\lambda$ of the blackbox differentiation scheme does not need precise tuning: values of $\lambda$ within a factor of 5 of the selected one still outperform the baseline.

 Method | λ | margin | mAP
 Faster R-CNN (CE) | – | – | 74.2
 Faster R-CNN RaMBO | 0.5 | ✗ | 74.6
 Faster R-CNN RaMBO | 0.1 | ✓ | 75.2
 Faster R-CNN RaMBO | 0.5 | ✓ | 75.7
 Faster R-CNN RaMBO | 2.5 | ✓ | 74.3
Table 8: Ablation for RaMBO on the object detection task.

6 Discussion

The proposed method RaMBO stands out through its conceptual purity: it directly optimizes the desired metric while being simple, flexible, and computationally efficient. Driven only by basic loss-design principles and without serious engineering effort, it can compete with state-of-the-art methods on image retrieval and consistently improve near-state-of-the-art object detectors. Exciting opportunities for future work lie in utilizing the ability to efficiently optimize ranking metrics over sequences with millions of elements.


We thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting Marin Vlastelica and Claudio Michaelis. We acknowledge the support from the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (FKZ: 01IS18039B). Claudio Michaelis was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) via grant EC 479/1-1 and the Collaborative Research Center (Projektnummer 276693517 – SFB 1233: Robust Vision).


  • Bartell et al. [1994] Brian T. Bartell, Garrison W. Cottrell, and Richard K. Belew. Automatic combination of multiple ranked retrieval systems. In ACM Conference on Research and Development in Information Retrieval, SIGIR ’94, pages 173–181. Springer, 1994.
  • Bellet et al. [2013] Aurélien Bellet, Amaury Habrard, and Marc Sebban. A survey on metric learning for feature vectors and structured data. arXiv preprint arXiv:1306.6709, 2013.
  • Bromley et al. [1994] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a “siamese” time delay neural network. In Advances in Neural Information Processing Systems, NIPS’94, pages 737–744, 1994.
  • Cakir et al. [2019a] Fatih Cakir, Kun He, Xide Xia, Brian Kulis, and Stan Sclaroff. Deep metric learning to rank. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR'19, pages 1861–1870, 2019a.
  • Cakir et al. [2019b] Fatih Cakir, Kun He, Xide Xia, Brian Kulis, and Stan Sclaroff. Deep Metric Learning to Rank. https://github.com/kunhe/FastAP-metric-learning, 2019b. Commit: 7ca48aa.
  • Chakrabarti et al. [2008] Soumen Chakrabarti, Rajiv Khanna, Uma Sawant, and Chiru Bhattacharyya. Structured learning for non-smooth ranking losses. In KDD, 2008.
  • Chen et al. [2019a] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019a. Commit: 9d767a03c0ee60081fd8a2d2a200e530bebef8eb.
  • Chen et al. [2019b] Kean Chen, Jianguo Li, Weiyao Lin, John See, Ji Wang, Lingyu Duan, Zhibo Chen, Changwei He, and Junni Zou. Towards accurate one-stage object detection with AP-loss. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5119–5127, 2019b.
  • Cohendet et al. [2018] R. Cohendet, Clair-Hélène Demarty, Ngoc Duong, Mats Sjöberg, Bogdan Ionescu, and Thanh-Toan Do. MediaEval 2018: Predicting media memorability. arXiv:1807.01052, 2018.
  • Engilberge et al. [2019] Martin Engilberge, Louis Chevallier, Patrick Pérez, and Matthieu Cord. Sodeep: a sorting deep net to learn ranking loss surrogates. In IEEE Conference on Computer Vision and Pattern Recognition, pages 10792–10801, 2019.
  • Everingham et al. [2010] Mark Everingham, Luc Van Gool, Chris Williams, John Winn, and Andrew Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 2010.
  • Ge [2018] Weifeng Ge. Deep metric learning with hierarchical triplet loss. In European Conference on Computer Vision, ECCV’18, pages 269–285, 2018.
  • Ghiasi et al. [2019] Golnaz Ghiasi, Tsung-Yi Lin, and Quoc V Le. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In CVPR, 2019.
  • Girshick et al. [2014] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR’14, pages 580–587, 2014.
  • Hardy et al. [1952] Godfrey Harold Hardy, John Edensor Littlewood, and George Pólya. Inequalities. Cambridge University Press, Cambridge, England, 1952.
  • Harwood et al. [2017] Ben Harwood, BG Kumar, Gustavo Carneiro, Ian Reid, Tom Drummond, et al. Smart mining for deep metric learning. In IEEE International Conference on Computer Vision, pages 2821–2829, 2017.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR’16, pages 770–778, 2016.
  • He et al. [2017a] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In IEEE International Conference on Computer Vision, ICCV ’17, pages 2980–2988. IEEE, 2017a.
  • He et al. [2018a] Kun He, Fatih Cakir, Sarah Adel Bargal, and Stan Sclaroff. Hashing as tie-aware learning to rank. In IEEE Conference on Computer Vision and Pattern Recognition, pages 4023–4032, 2018a.
  • He et al. [2018b] Kun He, Yan Lu, and Stan Sclaroff. Local descriptors optimized for average precision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 596–605, 2018b.
  • He et al. [2017b] Wenhao He, Xu-Yao Zhang, Fei Yin, and Cheng-Lin Liu. Deep direct regression for multi-oriented scene text detection. In ICCV, 2017b.
  • Henderson and Ferrari [2016] Paul Henderson and Vittorio Ferrari. End-to-end training of object class detectors for mean average precision. In Asian Conference on Computer Vision, pages 198–213. Springer, 2016.
  • Hoffer and Ailon [2015] Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer, 2015.
  • Huang et al. [2017] Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, and Kevin Murphy. Tensorflow Object Detection API. https://github.com/tensorflow/models/tree/master/research/object_detection, 2017. Commit: 0ba83cf.
  • Kim et al. [2018a] Seung-Wook Kim, Hyong-Keun Kook, Jee-Young Sun, Mun-Cheon Kang, and Sung-Jea Ko. Parallel feature pyramid network for object detection. In European Conference on Computer Vision, ECCV’18, pages 230–256, 2018a.
  • Kim et al. [2018b] Wonsik Kim, Bhavya Goyal, Kunal Chawla, Jungmin Lee, and Keunjoo Kwon. Attention-based ensemble for deep metric learning. In European Conference on Computer Vision, ECCV’18, pages 736–751, 2018b.
  • Kingma and Ba [2014] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, ICLR’14, 2014. arXiv:1412.6980.
  • Koch et al. [2015] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2, 2015.
  • Kulis et al. [2013] Brian Kulis et al. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287–364, 2013.
  • Law et al. [2013] Marc T Law, Nicolas Thome, and Matthieu Cord. Quadruplet-wise image similarity learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 249–256, 2013.
  • Lin et al. [2018] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • Lin et al. [2002] Yi Lin, Yoonkyung Lee, and Grace Wahba. Support vector machines for classification in nonstandard situations. Machine Learning, 46(1):191–202, 2002. doi: 10.1023/A:1012406528296.
  • Liu et al. [2016a] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In ECCV, pages 21–37. Springer, 2016a.
  • Liu et al. [2016b] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016b.
  • Liu et al. [2016c] Ziwei Liu, Ping Luo, Shi Qiu, Xiaogang Wang, and Xiaoou Tang. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In IEEE conference on computer vision and pattern recognition, pages 1096–1104, 2016c.
  • Massa and Girshick [2018] Francisco Massa and Ross Girshick. maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch. https://github.com/facebookresearch/maskrcnn-benchmark, 2018. Commit: f027259.
  • McFee and Lanckriet [2010] Brian McFee and Gert R Lanckriet. Metric learning to rank. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 775–782, 2010.
  • Mohapatra et al. [2014] Pritish Mohapatra, CV Jawahar, and M Pawan Kumar. Efficient optimization for average precision svm. In Advances in Neural Information Processing Systems, pages 2312–2320, 2014.
  • Mohapatra et al. [2018] Pritish Mohapatra, Michal Rolinek, CV Jawahar, Vladimir Kolmogorov, and M Pawan Kumar. Efficient optimization for rank-based loss functions. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3693–3701, 2018.
  • Movshovitz-Attias et al. [2017] Yair Movshovitz-Attias, Alexander Toshev, Thomas K Leung, Sergey Ioffe, and Saurabh Singh. No fuss distance metric learning using proxies. In IEEE International Conference on Computer Vision, pages 360–368, 2017.
  • Oh Song et al. [2016] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR’16, pages 4004–4012, 2016.
  • Oh Song et al. [2017] Hyun Oh Song, Stefanie Jegelka, Vivek Rathod, and Kevin Murphy. Deep metric learning via facility location. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5382–5390, 2017.
  • Opitz et al. [2018] Michael Opitz, Georg Waltner, Horst Possegger, and Horst Bischof. Deep metric learning with BIER: Boosting independent embeddings robustly. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • Rao et al. [2018] Yongming Rao, Dahua Lin, Jiwen Lu, and Jie Zhou. Learning globally optimized object detector via policy gradient. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR’18, pages 6190–6198, 2018.
  • Redmon and Farhadi [2018] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
  • Redmon et al. [2016] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR’16, 2016.
  • Ren et al. [2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, NIPS’15, pages 91–99, 2015.
  • Revaud et al. [2019] Jerome Revaud, Jon Almazan, Rafael Sampaio de Rezende, and Cesar Roberto de Souza. Learning with Average Precision: Training image retrieval with a listwise loss. arXiv:1906.07589, 2019.
  • Rezatofighi et al. [2019] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, 2019.
  • Roth and Brattoli [2019] Karsten Roth and Biagio Brattoli. Easily extendable basic deep metric learning pipeline. https://github.com/Confusezius/Deep-Metric-Learning-Baselines, 2019. Commit: 59d48f9.
  • Schroff et al. [2015] Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR’15, pages 815–823, 2015.
  • Simonyan and Zisserman [2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • Sohn [2016] Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, NIPS’16, pages 1857–1865, 2016.
  • Song et al. [2016a] Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016a.
  • Song et al. [2016b] Yang Song, Alexander Schwing, Raquel Urtasun, et al. Training deep neural networks via direct loss minimization. In International Conference on Machine Learning, ICML’16, pages 2169–2177, 2016b.
  • Szegedy et al. [2014] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. arXiv preprint arXiv:1409.4842, 2014.
  • Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, 2015.
  • Taylor et al. [2008] Michael Taylor, John Guiver, Stephen Robertson, and Tom Minka. Softrank: optimizing non-smooth rank metrics. In 2008 International Conference on Web Search and Data Mining, pages 77–86. ACM, 2008.
  • Ustinova and Lempitsky [2016] Evgeniya Ustinova and Victor Lempitsky. Learning deep embeddings with histogram loss. In Advances in Neural Information Processing Systems, NIPS’16, pages 4170–4178, 2016.
  • Vlastelica et al. [2019] Marin Vlastelica, Anselm Paulus, Vít Musil, Georg Martius, and Michal Rolínek. Differentiation of blackbox combinatorial solvers, 2019. URL https://arxiv.org/abs/1912.02175.
  • Wang et al. [2017] Jian Wang, Feng Zhou, Shilei Wen, Xiao Liu, and Yuanqing Lin. Deep metric learning with angular loss. In IEEE International Conference on Computer Vision, ICCV’17, pages 2593–2601, 2017.
  • Welinder et al. [2010] Peter Welinder, Steve Branson, Takeshi Mita, Catherine Wah, Florian Schroff, Serge Belongie, and Pietro Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
  • Wu et al. [2017] Chao-Yuan Wu, R Manmatha, Alexander J Smola, and Philipp Krahenbuhl. Sampling matters in deep embedding learning. In IEEE International Conference on Computer Vision, ICCV’17, pages 2840–2848, 2017.
  • Wu et al. [2019] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019. Commit: dd5926a.
  • Xie et al. [2017] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR’17, pages 5987–5995, 2017.
  • Xuan et al. [2018] Hong Xuan, Richard Souvenir, and Robert Pless. Deep randomized ensembles for metric learning. In European Conference on Computer Vision, ECCV’18, pages 723–734, 2018.
  • Yuan et al. [2017] Yuhui Yuan, Kuiyuan Yang, and Chao Zhang. Hard-aware deeply cascaded embedding. In IEEE International Conference on Computer Vision, ICCV’17, pages 814–823, 2017.
  • Yue et al. [2007] Yisong Yue, Thomas Finley, Filip Radlinski, and Thorsten Joachims. A support vector method for optimizing average precision. In ACM SIGIR Conference on Research and Development in Information Retrieval, pages 271–278. ACM, 2007.
  • Zhang et al. [2018] Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, and Stan Z Li. Single-shot refinement neural network for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR’18, pages 4203–4212, 2018.
  • Zhao et al. [2018] Yiru Zhao, Zhongming Jin, Guo-jun Qi, Hongtao Lu, and Xian-sheng Hua. An adversarial approach to hard triplet generation. In European Conference on Computer Vision, ECCV’18, pages 501–517, 2018.
  • Zoph et al. [2018] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. In CVPR, 2018.

Appendix A Parameters of retrieval experiments

In all experiments we used the Adam optimizer with weight decay and a batch size of 128. All experiments ran for at most 80 epochs, with a learning-rate drop after 35 epochs and a batch memory of length 3. We used higher learning rates for the embedding layer, as specified by the defaults in Cakir et al. [5].

We used a super-label batch preparation strategy in which we sample consecutive batches for the same super-label pair, as specified by Cakir et al. [5]. For the In-shop Clothes dataset we used 4 batches per pair of super-labels and 8 samples per class within a batch. For the Online Products dataset we used 10 batches per pair of super-labels and 4 samples per class within a batch. For CUB200 there are no super-labels, so we simply sample 4 examples per class within a batch. These values again follow Cakir et al. [5]. The remaining settings are listed in Table 9.

Online Products In-shop CUB200
 margin 0.02 0.05 0.02
  4 0.2 0.2
Table 9: Hyperparameter values for retrieval experiments.
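The super-label batch preparation described above can be sketched as follows. This is a hypothetical re-implementation for illustration only: the function name, shuffling, and tie-breaking details are our assumptions and are not taken from Cakir et al. [5].

```python
import random
from collections import defaultdict

def super_label_batches(labels, super_of, batch_size=128,
                        batches_per_pair=4, samples_per_class=8, seed=0):
    """Yield batches of dataset indices: pick a pair of super-labels, then
    emit `batches_per_pair` consecutive batches whose classes all come from
    that pair, with `samples_per_class` examples per sampled class."""
    rng = random.Random(seed)
    # Group dataset indices by class, and classes by super-label.
    by_class = defaultdict(list)
    for idx, c in enumerate(labels):
        by_class[c].append(idx)
    classes_of = defaultdict(list)
    for c in by_class:
        classes_of[super_of[c]].append(c)
    super_ids = sorted(classes_of)
    while True:
        s1, s2 = rng.sample(super_ids, 2)
        pool = classes_of[s1] + classes_of[s2]
        for _ in range(batches_per_pair):
            batch = []
            while len(batch) < batch_size:
                c = rng.choice(pool)
                members = by_class[c]
                batch.extend(rng.sample(members,
                                        min(samples_per_class, len(members))))
            yield batch[:batch_size]
```

For CUB200, which has no super-labels, one would bypass the pair selection and simply sample 4 examples per class.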

Appendix B Proofs

Lemma 1.

Let be a sequence of nonnegative weights and let be positive integers. Then




Note that the sum on the left-hand side of (24) is finite.

Proposition 2.

Let be nonnegative weights for and assume that is given by




where is as in (25).

Proof. Taking the complement of the set in the definition of , we get


whence (26) reads as

Equation (27) then follows by Lemma 1.

Proof of Lemma 1. Observe that and . Then

and (24) follows.

Proof of (20). Let us set for . Then from Taylor’s expansion of we have the desired and

If we set

then, using Taylor’s expansions again,


The conclusion then follows by Proposition 2.

Appendix C Ranking surrogates visualization

For the interested reader, we additionally present visualizations of the smoothing effects introduced by different approaches to direct optimization of rank-based metrics. We display the behaviour of our blackbox-differentiation approach [60], of FastAP [4], and of SoDeep [10].

In the following, we fix a 20-dimensional score vector and a loss function that is a (random but fixed) linear combination of its ranks. We plot a (random but fixed) two-dimensional section of the loss landscape. Fig. 6(a) shows the true piecewise constant function. In Fig. 6(b), Fig. 6(c), and Fig. 6(d) the ranking is replaced by the interpolated ranking of [60], the pretrained SoDeep LSTM [10], and the FastAP soft-binning ranking [4], respectively. Fig. 5(a) and Fig. 5(b) display the evolution of the loss landscape with respect to the respective parameters for the blackbox ranking and for FastAP.

(a) Ranking interpolation by [60] for .
(b) FastAP [4] with bin counts .
Figure 5: Evolution of the ranking-surrogate landscapes with respect to their parameters.
(a) Original piecewise constant landscape
(b) Piecewise linear interpolation scheme of [60] with
(c) SoDeep LSTM-based ranking surrogate [10]
(d) FastAP [4] soft-binning with 10 bins.
Figure 6: Visual comparison of various differentiable proxies for a piecewise constant function.
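To make the comparison concrete, the raw landscape of Fig. 6(a) can be reproduced by evaluating a linear-in-ranks loss on a two-dimensional section through score space. The sketch below uses illustrative dimensions and resolution (the figures themselves use a 20-dimensional score vector and sections not specified here):

```python
import numpy as np

def rank_loss_section(y0, u, v, w, extent=1.0, res=50):
    """Evaluate the piecewise-constant loss L(y) = <w, rk(y)> on the grid
    y0 + a*u + b*v for a, b in [-extent, extent] (cf. Fig. 6(a))."""
    ranks = lambda y: np.argsort(np.argsort(-y)) + 1  # largest score -> rank 1
    grid = np.linspace(-extent, extent, res)
    return np.array([[w @ ranks(y0 + a * u + b * v) for b in grid]
                     for a in grid])
```

Because the rank vector takes at most n! distinct values, the resulting surface is flat almost everywhere; the surrogates shown in Fig. 6(b)–(d) exist precisely to replace this with something informative to gradient descent.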