1 Introduction
Learning visual similarity is of great importance for a wide range of vision tasks, such as image clustering (Bouchacourt et al., 2018; Schroff et al., 2015) and image retrieval (Wu et al., 2017). Measuring similarity requires learning an embedding space which captures images and reasonably reflects their similarity using a defined distance metric. One of the most widely adopted classes of algorithms for this task is Deep Metric Learning (DML), which leverages deep neural networks to learn such a distance-preserving embedding.
Due to the growing interest in DML, a large body of literature has emerged, contributing to its success. However, as recent DML approaches explore more diverse research directions such as architectures (Xuan et al., 2018; Jacob et al., 2019), objective functions (Wang et al., 2019a; Yuan et al., 2019) and additional training tasks (Roth et al., 2019; Lin et al., 2018), an unbiased comparison of their results becomes increasingly difficult. Further, undisclosed technical details such as data augmentations and training regularization pose a challenge to the reproducibility of such methods, which is of great concern in the machine learning community in general (Bouthillier et al., 2019). One goal of this work is to counteract this worrying trend by providing a comprehensive comparison of the most important and widely used DML baselines under identical training conditions on standard benchmark datasets (Fig. 1). In addition, we thoroughly review common design choices of DML models which strongly influence generalization performance, to allow for better comparability of current and future work.
On that basis, we extend our analysis to: (i) The process of data sampling, which is well known to impact DML optimization (Schroff et al., 2015). While previous works only studied this process in the specific context of triplet mining strategies for ranking-based objectives (Wu et al., 2017; Harwood et al., 2017), we examine the model-agnostic case of sampling informative minibatches. (ii) The generalization capabilities of DML models, by analyzing the structure of their learned embedding spaces. While we are not able to reliably link typically targeted concepts such as large inter-class margins (Liu et al., 2017; Deng et al., 2018) and intra-class variance (Lin et al., 2018) to generalization performance, we uncover a strong correlation to the compression of the learned representations. Lastly, based on this observation, we propose a simple yet effective regularization technique which boosts the performance of ranking-based approaches on standard benchmark datasets, as also demonstrated in Fig. 1. In summary, our most important contributions can be described as follows:
We provide an exhaustive analysis of recent DML objective functions, their training strategies, the influence of data sampling, and model design choices to set a standard benchmark. To this end, we will publish all code.

We provide new insights into DML generalization by analyzing its correlation to the embedding space compression (as measured by its spectral decay), inter-class margins and intra-class variance.

Based on the result above, we propose a simple technique to regularize the embedding space compression, which we find to boost the generalization performance of ranking-based DML approaches.
2 Related Works
Deep Metric Learning:
Deep Metric Learning (DML) has become increasingly important for applications ranging from image retrieval (Movshovitz-Attias et al., 2017; Roth et al., 2019; Wu et al., 2017; Lin et al., 2018) to zero-shot classification (Schroff et al., 2015; Sanakoyeu et al., 2019) and face verification (Hu et al., 2014; Liu et al., 2017). Many approaches use ranking-based objectives based on tuples of samples such as pairs (Hadsell et al., 2006), triplets (Wu et al., 2017; Yu et al., 2018), quadruplets (Chen et al., 2017) or higher-order variants like N-Pairs (Sohn, 2016), lifted structure losses (Oh Song et al., 2016; Yu et al., 2018) or NCA-based criteria (Movshovitz-Attias et al., 2017). Further, classification-based methods adjusted to DML (Deng et al., 2018; Zhai and Wu, 2018) have proven to be effective for learning distance-preserving embedding spaces. To address the computational complexity of tuple-based methods (as an example, the number of triplets scales with $\mathcal{O}(N^3)$, where $N$ is the dataset size), different sampling strategies have been introduced (Schroff et al., 2015; Wu et al., 2017; Ge, 2018). Moreover, proxy-based approaches address this issue by approximating class distributions using only a few virtual representatives (Movshovitz-Attias et al., 2017; Qian et al., 2019).
Additionally, more involved research extending the above objectives has been proposed: Sanakoyeu et al. (2019) follow a divide-and-conquer strategy by splitting and subsequently merging both the data and embedding space; Opitz et al. (2018) and Xuan et al. (2018) employ an ensemble of specialized learners; and Roth et al. (2019) combine DML with self-supervised learning. Moreover, Lin et al. (2018) and Zheng et al. (2019) generate artificial samples to effectively augment the training data, thus learning more complex ranking relations. The majority of these methods are trained using the essential objective functions and, further, hinge on the training parameters discussed in our study, thus directly benefiting from our findings. Moreover, we propose an effective regularization technique to improve ranking-based objectives.
Minibatch selection: The benefits of large minibatches for training are well studied (Smith et al., 2017; Goyal et al., 2017; Keskar et al., 2016). However, there has been limited research examining effective strategies for the creation of minibatches. Research into minibatch creation has been done to improve convergence in optimization methods for classification tasks (Mirzasoleiman et al., 2020; Johnson and Guestrin, 2018) or to construct informative minibatches using coreset selection to optimize generative models (Sinha et al., 2019). Similarly, we analyze mining strategies maximizing data diversity and compare their impact to standard heuristics employed in DML (Wu et al., 2017; Roth et al., 2019; Sanakoyeu et al., 2019).
Generalization in DML: Generalization capabilities of representations (Achille and Soatto, 2016; Shwartz-Ziv and Tishby, 2017) and, in particular, of discriminative models have been well studied (Jiang et al., 2020; Belghazi et al., 2018; Goyal et al., 2017), e.g. in the light of compression (Tishby and Zaslavsky, 2015; Shwartz-Ziv and Tishby, 2017), which is backed by strong experimental support (Goyal et al., 2019; Belghazi et al., 2018; Alemi et al., 2016). Verma et al. (2018) link compression to a 'flattening' of a representation in the context of classification. We apply this concept to analyze generalization in DML and find that strong compression actually hurts DML generalization. Existing works on generalization in metric learning focus on the robustness of linear or kernel-based distance metrics (Bellet and Habrard, 2015; Bellet, 2013) and examine bounds on the generalization error (Huai et al., 2019). In contrast, we examine the correlation between generalization and structural characteristics of the learned embedding space.
3 Training a Deep Metric Learning Model
In this section, we briefly summarize key components for training a DML model and motivate the main aspects of our study. We first introduce the common categories of training objectives which we consider for comparison in Sec. 3.1. Next, in Sec. 3.2 we examine the data sampling process and present strategies for sampling informative minibatches. Finally, in Sec. 3.3, we discuss components of a DML model which impact its performance and exhibit an increased divergence in the field, thus impairing objective comparisons.
3.1 The objective function
In Deep Metric Learning we learn an embedding function $\phi: \mathcal{X} \to \Phi \subseteq \mathbb{R}^D$ mapping datapoints $x \in \mathcal{X}$ into an embedding space $\Phi$, which allows us to measure the similarity between $x_i, x_j$ as $d_\phi(x_i, x_j) := d(\phi(x_i), \phi(x_j))$, with $d(\cdot, \cdot)$ being a predefined distance function.
For that, let $\phi := \phi_\theta$ be a deep neural network parametrised by $\theta$, with its output typically normalized to the real hypersphere $\mathbb{S}^{D-1}$ for regularization purposes (Wu et al., 2017; Huai et al., 2019).
In order to train $\phi_\theta$ to reflect the semantic similarity defined by given labels $y \in \mathcal{Y}$, many objective functions have been proposed based on different concepts, which we now briefly summarize.
Ranking-based: The most popular family are ranking-based loss functions operating on pairs (Hadsell et al., 2006), triplets (Schroff et al., 2015; Wu et al., 2017) or larger sets of datapoints (Sohn, 2016; Oh Song et al., 2016; Chen et al., 2017; Wang et al., 2019a). Learning is defined as an ordering task, such that the distance $d_\phi(x_a, x_p)$ between an anchor $x_a$ and a positive $x_p$ of the same class is minimized, and the distances $d_\phi(x_a, x_n)$ of $x_a$ to negative samples $x_n$ with different class labels are maximized. For example, triplet-based formulations typically optimize their relative distances as long as a margin $\gamma$ is violated, i.e. as long as $d_\phi(x_a, x_n) - d_\phi(x_a, x_p) < \gamma$. Further, ranking-based objectives are also extended to histogram matching, as proposed in (Ustinova and Lempitsky, 2016).
Classification-based: As DML is essentially solving a discriminative task, some approaches (Zhai and Wu, 2018; Deng et al., 2018; Liu et al., 2017) can be derived from softmax logits. For example, Deng et al. (2018) exploit the regularization to the real hypersphere $\mathbb{S}^{D-1}$ and the equality $\cos(\varphi) = s^\top t$ for $s, t \in \mathbb{S}^{D-1}$ to maximize the margin between classes by direct optimization over angles $\varphi$. Further, standard cross-entropy optimization also proves to be effective under normalization (Zhai and Wu, 2018).
Proxy-based: These methods approximate the distributions for the full class by one (Movshovitz-Attias et al., 2017) or more (Qian et al., 2019) learned representatives. By considering the class representatives for computing the training loss, individual samples are directly compared to an entire class. Additionally, proxy-based methods help to alleviate the issue of tuple mining which is encountered in ranking-based loss functions.
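As an illustration of the ranking-based family, the following minimal sketch computes a hinge-style triplet loss on embeddings normalized to the unit hypersphere. It is not the implementation used in this study: the function names, the Euclidean distance, the default margin of 0.2 and the random example data are all assumptions made for the example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Project each row of x onto the unit hypersphere."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Mean hinge loss max(0, d(a, p) - d(a, n) + margin) with Euclidean d.

    A triplet contributes only while the margin is violated,
    i.e. while d(a, n) - d(a, p) < margin. The margin value is hypothetical.
    """
    d_ap = np.linalg.norm(anchor - positive, axis=-1)
    d_an = np.linalg.norm(anchor - negative, axis=-1)
    return float(np.mean(np.maximum(0.0, d_ap - d_an + margin)))

# Random stand-ins for embedded anchors, positives and negatives (illustration only).
rng = np.random.default_rng(0)
a, p, n = (l2_normalize(rng.normal(size=(4, 8))) for _ in range(3))
loss = triplet_margin_loss(a, p, n)
```

Triplets whose negative is already farther away than the positive by at least the margin yield zero loss, which is why the choice of informative triplets matters for this family of objectives.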
3.2 Data sampling
The synergy between tuple mining strategies and ranking losses has been widely studied (Wu et al., 2017; Schroff et al., 2015; Ge, 2018). To analyze the impact of data sampling on performance in the scope of our study, we consider the process of mining informative minibatches $\mathcal{B}$. This process is independent of the specific training objective and so far has been commonly neglected in DML research. In the following, we present batch mining strategies operating on both labels and the data itself: label samplers, which are sampling heuristics that follow selection rules based on label information only, and embedded samplers, which operate on data embeddings themselves to create batches of diverse data statistics.
Label Samplers: To control the class distribution within $\mathcal{B}$, we examine two different heuristics based on the number, $n$, of 'Samples Per Class' (SPC-$n$):
SPC-2/4/8: Given batch size $b$, we randomly select $b/n$ unique classes, from each of which we randomly select $n$ samples.
SPC-R: We randomly select $b-1$ samples from the dataset and choose the last sample to have the same label as one of the other $b-1$ samples, to ensure that at least one triplet can be mined from $\mathcal{B}$. Thus, we effectively vary the number of unique classes within minibatches.
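The two label samplers can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in the study: the function names are hypothetical, sampling is done without replacement, and classes with fewer than `n` samples are simply skipped.

```python
import random
from collections import defaultdict

def spc_batch(labels, batch_size, n, rng=random):
    """SPC-n: pick batch_size // n classes, then n sample indices from each."""
    if batch_size % n != 0:
        raise ValueError("batch_size must be divisible by n")
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    # Assumption: only classes with at least n samples are eligible.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= n]
    batch = []
    for c in rng.sample(eligible, batch_size // n):
        batch.extend(rng.sample(by_class[c], n))
    return batch

def spc_r_batch(labels, batch_size, rng=random):
    """SPC-R: batch_size - 1 random indices plus one index sharing a label with them."""
    first = rng.sample(range(len(labels)), batch_size - 1)
    chosen = set(first)
    present = {labels[i] for i in first}
    # Assumes at least one drawn class has a further, not yet drawn sample.
    candidates = [i for i, y in enumerate(labels) if y in present and i not in chosen]
    return first + [rng.choice(candidates)]

# Toy dataset: 10 classes with 10 samples each (illustration only).
labels = [i // 10 for i in range(100)]
spc2 = spc_batch(labels, batch_size=8, n=2, rng=random.Random(0))
spcr = spc_r_batch(labels, batch_size=8, rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps the example reproducible; in a training loop one such batch would be drawn per iteration.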
Embedded Samplers:
Increasing the batch size has proven to be beneficial for stabilizing optimization due to an effectively larger data diversity and richer training information (Mirzasoleiman et al., 2020; Brock et al., 2018; Sinha et al., 2019). As DML training is commonly performed on a single GPU (limited especially by the tuple mining process on the minibatch), the batch size $b$ is bounded by memory. Nevertheless, in order to 'virtually' maximize the data diversity, we distill the information content of a large set of samples $\mathcal{B}^*$, with $b^* = |\mathcal{B}^*| > b$, into a minibatch $\mathcal{B}$ of size $b$ by matching the statistics of $\mathcal{B}$ and $\mathcal{B}^*$ under the embedding $\phi$. To avoid computational overhead, we sample $\mathcal{B}^*$ from a continuously updated memory bank $\mathcal{M}$ of embedded training samples. Similar to Misra and van der Maaten (2019), $\mathcal{M}$ is generated by iteratively updating its elements based on the steady stream of training batches $\mathcal{B}$. Using $\mathcal{M}$, we mine minibatches by first randomly sampling $\mathcal{B}^*$ from $\mathcal{M}$ and subsequently finding a minibatch $\mathcal{B}$ that matches its data statistics, using one of the following criteria:
Greedy Coreset Distillation (GC):
Greedy Coreset (Agarwal et al., 2005) finds a batch $\mathcal{B}$ by iteratively adding samples $x^* \in \mathcal{B}^*$ which maximize the distance from the samples $x \in \mathcal{B}$ that have already been selected, thereby maximizing the covered space within $\Phi$ by solving $\min_{\mathcal{B}: |\mathcal{B}| = b} \max_{x^* \in \mathcal{B}^*} \min_{x \in \mathcal{B}} d_\phi(x, x^*)$.
Matching of distance distributions (DDM):
DDM aims to preserve the distance distribution of $\mathcal{B}^*$. We randomly select $m$ candidate minibatches and choose the batch $\mathcal{B}$ with the smallest Wasserstein distance between the normalized distance histograms of $\mathcal{B}$ and $\mathcal{B}^*$ (Rubner et al., 2000).
FRD-Score Matching (FRD):
Similar to the recent GAN evaluation setting, we compute the Fréchet distance (Heusel et al., 2017) between $\mathcal{B}$ and $\mathcal{B}^*$ to measure the similarity between their distributions using $\mathrm{FRD}(\mathcal{B}, \mathcal{B}^*) = \|\mu_{\mathcal{B}} - \mu_{\mathcal{B}^*}\|_2^2 + \mathrm{Tr}\big(\Sigma_{\mathcal{B}} + \Sigma_{\mathcal{B}^*} - 2(\Sigma_{\mathcal{B}} \Sigma_{\mathcal{B}^*})^{1/2}\big)$,
with $\mu_\bullet, \Sigma_\bullet$ being the mean and covariance of the embedded set of samples $\bullet$. Like in DDM, we select the closest batch $\mathcal{B}$ to $\mathcal{B}^*$ among $m$ randomly sampled candidates.
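Two of the criteria above can be sketched in a few lines of NumPy. The code below is an assumption-laden illustration, not the study's implementation: `greedy_coreset` uses standard farthest-point selection with an arbitrary first element, `frechet_distance` fits a Gaussian to each set of embeddings, and `match_by_frd` (a hypothetical helper) picks the best of `m` random candidate batches.

```python
import numpy as np

def greedy_coreset(emb, b, seed=0):
    """Select b row indices of emb (shape [b_star, D]) by farthest-point selection."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(emb)))]  # arbitrary starting sample (assumption)
    # Distance of every candidate to its nearest already selected sample.
    min_dist = np.linalg.norm(emb - emb[selected[0]], axis=1)
    while len(selected) < b:
        nxt = int(np.argmax(min_dist))  # candidate farthest from the current selection
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(emb - emb[nxt], axis=1))
    return selected

def frechet_distance(x, y):
    """Frechet distance between Gaussians fitted to the rows of x and y."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x, cov_y = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
    # Tr((cov_x cov_y)^(1/2)) is the sum of square roots of the eigenvalues of cov_x cov_y.
    eig = np.linalg.eigvals(cov_x @ cov_y).real
    tr_sqrt = np.sqrt(np.clip(eig, 0.0, None)).sum()
    return float(np.sum((mu_x - mu_y) ** 2) + np.trace(cov_x) + np.trace(cov_y) - 2.0 * tr_sqrt)

def match_by_frd(emb, b, m, seed=0):
    """Among m random candidate batches, return the indices closest to emb under FRD."""
    rng = np.random.default_rng(seed)
    candidates = [rng.choice(len(emb), size=b, replace=False) for _ in range(m)]
    return min(candidates, key=lambda idx: frechet_distance(emb[idx], emb))

# Random stand-in for embeddings drawn from the memory bank (illustration only).
bank = np.random.default_rng(1).normal(size=(64, 4))
gc_idx = greedy_coreset(bank, b=16)
frd_idx = match_by_frd(bank, b=16, m=5)
```

Computing the trace term from eigenvalues avoids an explicit matrix square root; this relies on the product of two covariance matrices having real, non-negative eigenvalues, with small negative values from round-off clipped to zero.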
3.3 Training parameters, regularization and architecture
| Network | GN | IBN | R50 |
|---|---|---|---|
| CUB200, R@1 | 45.41 | 48.78 | 43.77 |
| CARS196, R@1 | 35.31 | 43.36 | 36.39 |
| SOP, R@1 | 44.28 | 49.05 | 48.65 |

Table 1: Recall performance of commonly used network architectures after ImageNet pretraining. The final linear layer is randomly initialized and normalized.
Next to the objective function and data sampling process, successfully learning a DML model hinges on a reasonable choice of the training environment. While there is a multitude of parameters to be set, we identify several factors which both influence performance and exhibit a divergence across recently proposed works.
Architectures: In recent DML literature, predominantly three basic network architectures are used: GoogLeNet (Szegedy et al., 2015) (GN, typically with embedding dimensionality 512), Inception-BN (Ioffe and Szegedy, 2015) (IBN, 512) and ResNet50 (He et al., 2016) (R50, 128) (with optionally frozen Batch-Normalization layers for improved convergence and stability across varying batch sizes; note that Batch-Normalization is still performed, but no parameters are learned; see e.g. Roth et al. (2019); Cakir et al. (2019)). Due to the varying number of parameters and configuration of layers, each architecture exhibits a different starting point for learning, based on its initialization by ImageNet pretraining (Deng et al., 2009). Table 1 compares their initial DML performance measured in Recall@1 (R@1). Differences in architecture are one of the main arguments used by individual works for not comparing themselves to competing approaches. Disconcertingly, even when reporting additional results using adjusted networks is feasible, typically only results using a single architecture are reported. Consequently, a fair comparison between approaches is heavily impaired.
Weight Decay: Commonly, network optimization is regularized using weight decay/L2-regularization (Krogh and Hertz, 1992). In DML, particularly on small datasets, its careful adjustment is crucial to maximize generalization performance. Nevertheless, many works do not report it.
Embedding dimensionality: Choosing a dimensionality of the embedding space influences the learned manifold and consequently generalization performance. While each architecture typically uses an individual, standardized dimensionality in DML, recent works differ without reporting proper baselines using an adjusted dimensionality. Again, comparison to existing works and the assessment of the actual contribution is impaired.
Advanced DML methodologies: There are many extensions to the objective functions, architectures and training setup discussed so far. However, although these extensions are highly individual, they still rely on these components and thus benefit from the findings of the following experiments, evaluations and analysis.
4 Analyzing DML training strategies
Datasets
The three examined benchmarking datasets are:
CUB200-2011: Contains 11,788 images of birds over 200 classes. Train/Test sets are made up of the first/last 100 classes and 5,864/5,924 images respectively (Wah et al., 2011). Samples are distributed evenly across classes.
CARS196: Contains 16,185 images of cars in 196 classes, with even sample distribution. Train/Test sets are made up of the first/last 98 classes and 8,054/8,131 images respectively (Krause et al., 2013).
Stanford Online Products (SOP): Contains 120,053 product images divided into 22,634 classes. Train/Test sets are provided and contain 11,318 classes/59,551 images in the Train set and 11,316 classes/60,502 images in the Test set (Oh Song et al., 2016). In SOP, unlike the other benchmarks, most classes have few instances, leading to a significantly different data distribution compared to CUB200-2011 and CARS196.
4.1 Experimental Protocol
Our training protocol follows parts of Wu et al. (2017), who utilize a ResNet50 architecture with frozen Batch-Normalization layers. We set the embedding dimensionality to 128 to be comparable with previously reported results for this architecture. While both GoogLeNet (Szegedy et al., 2015) and Inception-BN (Ioffe and Szegedy, 2015) are also often employed in DML literature, we choose ResNet50 due to its success in recent state-of-the-art approaches (Roth et al., 2019; Sanakoyeu et al., 2019). In line with standard practices, we randomly resize and crop images to a fixed resolution for training and center crop to the same size for evaluation. During training, random horizontal flipping is used. Optimization is performed using Adam (Kingma and Ba, 2015) with a fixed learning rate. For all evaluations and experiments, no learning rate scheduling is used, to allow for unbiased comparison. Weight decay, if not noted otherwise, is set to a constant value, as motivated in Section 4.2. We implemented all models using the PyTorch framework (Paszke et al., 2017), and experiments are performed on individual Nvidia Titan X, V100 and T4 GPUs with memory usage limited to 12GB. Each training is run over 150 epochs for CUB200-2011/CARS196 and 100 epochs for Stanford Online Products, if not stated otherwise. For batch sampling we utilize the SPC-2 strategy, as motivated in Section 4.3. Finally, each result is averaged over multiple seeds to avoid seed-based performance fluctuations. All loss-specific hyperparameters are discussed in the supplementary material, along with their original implementation details. For our study, we examine the following evaluation metrics (described further in the supplementary): Recall at 1 and 2 (Jegou et al., 2011), Normalized Mutual Information (NMI) (Manning et al., 2010), F1 score (Sohn, 2016) and mean average precision measured on recall (mAP).
4.2 Studying DML parameters and architectures
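| Approaches | CUB200-2011 R@1 | CUB200-2011 NMI | CARS196 R@1 | CARS196 NMI | SOP R@1 | SOP NMI |
|---|---|---|---|---|---|---|
| ImageNet (Deng et al., 2009) | | | | | | |
| Angular (Wang et al., 2017) | | | | | | |
| ArcFace (Deng et al., 2018) | | | | | | |
| Contrastive (D) (Hadsell et al., 2006) | | | | | | |
| GenLifted (Yu et al., 2018) | | | | | | |
| Hist. (Ustinova and Lempitsky, 2016) | | | | | | |
| Margin (D) (Wu et al., 2017) | | | | | | |
| Multi-Sim. (Wang et al., 2019b) | | | | | | |
| N-Pair (Sohn, 2016) | | | | | | |
| ProxyNCA (Movshovitz-Attias et al., 2017) | | | | | | |
| Quadruplet (D) (Chen et al., 2017) | | | | | | |
| SNR (D) (Yuan et al., 2019) | | | | | | |
| SoftTriplet (Qian et al., 2019) | | | | | | |
| Softmax (Zhai and Wu, 2018) | | | | | | |
| Triplet (R) (Schroff et al., 2015) | | | | | | |
| Triplet (S) (Schroff et al., 2015) | | | | | | |
| Triplet (D) (Wu et al., 2017) | | | | | | |
| Margin(D, | | | | | | |
| R-Margin (D, ) | | | | | | |
| R-Margin (D, ) | | | | | | |
| R-Contrastive (D) | | | | | | |
| R-Triplet (D) | | | | | | |
| R-SNR (D) | | | | | | |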
Now we study the influence of the parameters and architectures discussed in Sec. 3.3 using five different objective functions. For each experiment, performance is measured across all metrics noted in Sec. 4.1. For each loss, every metric is normalized by its maximum across the evaluated value range. This allows us to report an aggregated summary of performance across all metrics, where differences correspond to metric-agnostic relative improvement.
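The normalization step can be illustrated with a short sketch. The scores below are made-up numbers for a single loss, and averaging the normalized metrics is only one possible aggregation, assumed here for illustration.

```python
import numpy as np

def normalize_by_max(scores):
    """Divide every metric column by its maximum over the evaluated settings."""
    scores = np.asarray(scores, dtype=float)
    return scores / scores.max(axis=0, keepdims=True)

# Hypothetical scores of one loss: rows are evaluated parameter values,
# columns are two metrics (e.g. R@1 and NMI).
raw = [[0.60, 0.66], [0.63, 0.68], [0.57, 0.64]]
normalized = normalize_by_max(raw)
summary = normalized.mean(axis=1)  # one aggregated value per evaluated setting
```

After normalization the best setting for each metric has value 1, so a value of 0.95 reads as a 5% relative drop regardless of the metric's original scale.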
Fig. 2 analyzes each component by evaluating a reasonable range of values, while fixing the other parameters to the experimental setup of Sec. 4.1. For weight decay we observe heavily model- and dataset-dependent behavior, which also affects relative performance. This underlines the importance of a complete declaration of the training protocol to facilitate reproducibility and comparability. Similar results are observed for the embedding dimensionality $D$. Our analysis shows that training objectives perform differently given a certain dimensionality, although their performance appears to converge. However, e.g. in the case of R50, $D$ is typically fixed to 128, thus disadvantaging some training objectives over others. Finally, comparing common DML architectures reveals a strong impact on performance. In addition, the variance in performance between loss functions differs across networks, with R50 and IBN being more consistent than GN.
Implications: In order to warrant unbiased comparability, equal training protocols, architectures and transparent model evaluation are essential, as even small deviations can result in large deviations in performance.
4.3 Batch sampling impacts DML training
We now analyze how the data sampling process for minibatches impacts the performance of DML models using the sampling strategies presented in Sec. 3.2. To conduct an unbiased study, we experiment with six conceptually different objective functions: Margin loss with Distance-Weighted Sampling, Triplet Loss with Random Sampling, ProxyNCA, Multi-Similarity Loss, Histogram loss and Normalized Softmax loss. To aggregate our evaluation metrics (cf. Sec. 4.1), we utilize the same normalization procedure discussed in Sec. 4.2. Fig. 3 summarizes the results for each sampling strategy by reporting the distributions of normalized scores of all pairwise combinations of training loss and evaluation metrics. Our analysis reveals that the batch sampling process indeed affects DML training, with differences in mean performance between samplers. While there is no clear winner across all datasets, we observe that the SPC-2 and FRD samplers perform very well and, in particular, consistently outperform the SPC-4 strategy which is commonly reported in the literature (Wu et al., 2017; Schroff et al., 2015).
Implications:
Our study indicates that DML benefits from data diversity in minibatches, independent of the chosen training objective. While complex mining strategies may perform better, simple heuristics like SPC2 are sufficient.
4.4 Comparing DML models
Based on our training parameter and batch-sampling evaluations, we now compare a large selection of different DML objectives under the fixed training conditions noted in Sections 4.1 and 4.2. For ranking-based models, we employ distance-based tuple mining (D) (Wu et al., 2017), which proved most effective, except for the tuple mining study using the classic triplet loss, for which we also include random and semi-hard sampling (Schroff et al., 2015). Loss-specific hyperparameters are determined via a small cross-validation grid search around originally proposed values to adjust for our training setup. Exact parameters and method details, including the originally utilized setup, are listed in the supplementary. Fig. 2 summarizes our evaluation results on all benchmarks, while Fig. 4 measures correlations between all evaluation metrics. Particularly on CUB200-2011 and CARS196, we observe a higher performance saturation between methods compared to the SOP dataset, due to the strong difference in data distribution. We find that representatives of ranking-based objectives in general outperform their classification-based counterparts. On average, margin loss offers the best performance across datasets. Remarkably, under our carefully chosen training setting, a multitude of losses compete with or even outperform more involved state-of-the-art DML approaches (including strong ensemble methods) on the SOP dataset. For a detailed comparison to the state of the art, we refer to the supplementary material.
Implications: Carefully trained baseline models are able to outperform state-of-the-art approaches which use considerably stronger architectures. To evaluate the true benefit of proposed contributions, baseline models need to be competitive.
5 Generalization in Deep Metric Learning
In the previous section we showed how different model and training parameter choices result in models of vastly different performance. However, how such differences can best be explained on the basis of the learned embedding space is an open question and is, for instance, studied under the concept of compression (Tishby and Zaslavsky, 2015). Recent work (Verma et al., 2018) links compression to a class-conditioned flattening of the representation, indicated by an increased decay of singular values obtained by Singular Value Decomposition (SVD) on the data representations. As a result, the class representations occupy a more compact volume, thus reducing the number of directions with significant variance. The subsequent strong focus on the most discriminative directions is shown to be beneficial for classic classification scenarios with i.i.d. training and test distributions. However, this effect also overly discards features which could be useful for capturing data characteristics outside the training distribution. As a result, generalization in transfer learning problems like DML is hindered due to the shift between training and testing distribution (Bellet and Habrard, 2015). Given this observation, we hypothesize that actually retaining a considerable number of directions of significant variance is an important requirement for learning a well-generalizing embedding function $\phi$.
To verify this assumption, we analyze the spectral decay of an embedding space $\Phi$ by performing SVD on the embedded training data $\Phi_{\mathcal{X}}$ instead of considering individual training class representations, as testing and training distribution are shifted (for comparison, we also analyse the class-conditioned singular value spectra as in Verma et al. (2018) in the supplementary). Next, we normalize the sorted spectrum of singular values $\mathcal{S}_{\Phi_{\mathcal{X}}}$ by their sum and compute the KL-divergence to a $D$-dimensional discrete uniform distribution $\mathcal{U}_D$, i.e. $\rho(\Phi) = \mathrm{KL}(\mathcal{U}_D \,\|\, \mathcal{S}_{\Phi_{\mathcal{X}}})$ (for simplicity we use the notation $\rho(\Phi)$ instead of $\rho(\Phi_{\mathcal{X}})$), which is proportional to the entropy of $\mathcal{S}_{\Phi_{\mathcal{X}}}$. Lower values of $\rho(\Phi)$ indicate more directions of significant variance. Using this measure, we analyze DML models trained with a large selection of different objectives in Fig. 5 (rightmost) on the CUB200-2011, CARS196 and SOP datasets (a detailed comparison can be found in the supplementary). Comparing their R@1 accuracy and $\rho(\Phi)$ reveals a significant inverse correlation between generalization performance and the spectral decay of the embedding spaces $\Phi$. This observation strongly confirms the positive effect of more directions of variance in the presence of training-testing distribution shifts.
We now compare our finding to commonly exploited concepts for training such as (i) larger margins between classes (Deng et al., 2018; Liu et al., 2017), i.e. an increase in average inter-class distances $\pi_{\text{inter}}(\Phi) = \frac{1}{Z_{\text{inter}}} \sum_{y_l, y_k, l \neq k} d(\mu(\Phi_{y_l}), \mu(\Phi_{y_k}))$; (ii) explicitly introducing intra-class variance (Lin et al., 2018), which is indicated by an increase in average intra-class distance $\pi_{\text{intra}}(\Phi) = \frac{1}{Z_{\text{intra}}} \sum_{y_l} \sum_{\phi_i, \phi_j \in \Phi_{y_l}, i \neq j} d(\phi_i, \phi_j)$; and (iii) their relation by using the ratio $\pi_{\text{ratio}}(\Phi) = \pi_{\text{intra}}(\Phi) / \pi_{\text{inter}}(\Phi)$. Here, $\Phi_{y_l}$ denotes the set of embedded samples of a class $y_l$, $\mu(\Phi_{y_l})$ their mean embedding and $Z_{\text{inter}}, Z_{\text{intra}}$ normalization constants. Fig. 5 compares these measures with $\rho(\Phi)$. It is evident that none of the distance-related measures consistently exhibits significant correlation with generalization performance when taking all three datasets into account. Individually on SOP, only one of these measures exhibits similarly strong correlation to generalization performance, due to the strong imbalance between dataset size and number of samples per class.
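The spectral decay measure described in this section can be sketched as follows. This is a minimal illustration under several assumptions: the KL-divergence is taken from the uniform distribution to the normalized spectrum, the embeddings are not centered before the SVD, the number of samples is at least the embedding dimensionality, and the function name and example data are hypothetical.

```python
import numpy as np

def spectral_decay(emb, eps=1e-12):
    """KL(U || S) between a uniform distribution and the normalized singular value spectrum.

    emb has shape [num_samples, D]; assumes num_samples >= D so that D singular values exist.
    """
    s = np.linalg.svd(emb, compute_uv=False)  # singular values in descending order
    s = s / s.sum()                           # normalize the spectrum to sum to one
    s = np.maximum(s, eps)                    # guard against log(0) for vanishing directions
    u = np.full_like(s, 1.0 / len(s))         # discrete uniform distribution
    return float(np.sum(u * np.log(u / s)))

# Illustration: an isotropic embedding versus one whose variance
# is concentrated in a few directions (strong compression).
rng = np.random.default_rng(0)
isotropic = rng.normal(size=(500, 16))
compressed = isotropic * np.geomspace(1.0, 1e-3, 16)
rho_iso, rho_comp = spectral_decay(isotropic), spectral_decay(compressed)
```

A flat spectrum gives a value close to zero, while a quickly decaying spectrum gives a larger value, matching the reading that lower values indicate more directions of significant variance.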
Implications: Generalization performance in DML exhibits strong inverse correlation to the decay of the singular value spectrum of a learned representation. This indicates that representation learning under considerable shifts between training and testing distribution is hurt by excessive compression.
5.1 $\rho$-regularization for improved generalization
We now exploit our findings from the previous section to propose a simple regularization for ranking-loss-based approaches that counteracts the compression of the representation. We randomly perform a switch operation within tuples by exchanging negative samples $x_n$ with the positive $x_p$ in a given ranking-loss formulation (cf. Sec. 3.1) with fixed probability $p_{\text{switch}}$. This regularization pushes samples of the same class apart, thus enabling a DML model to capture extra non-label-discriminative features. Simultaneously, this process dampens the compression induced by strong discriminative training signals.
Fig. 6 depicts a 2D toy example (details in supplementary) which illustrates the effect of our proposed regularization and further highlights the issue of overly compressed data representations. Even though the training distribution exhibits features needed to separate all test classes, these features are disregarded by the strong discriminative training signal. Regularizing the compression by attenuating the spectral decay enables the model to capture more information and, as a result, exhibit stronger generalization to the unseen test classes. In addition, Fig. 8 verifies that the regularization also leads to a decreased spectral decay on DML benchmark datasets, resulting in improved recall performance (cf. Tab. 2 (bottom)). We further observe that the vast number of classes in datasets such as SOP naturally counteracts the compression of a representation, so that these representations already exhibit a considerable number of directions of significant variance. Finally, we control the degree of the regularization by varying the probability $p_{\text{switch}}$ and study the influence on performance in Fig. 7. Increasing the probability boosts the generalization performance until class boundaries get too close and, thus, discriminativity is lost.
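For triplets, the switch operation can be sketched as below. This is an illustrative reading of the description above, not the released implementation: the function name, the representation of triplets as index tuples and the example values are assumptions.

```python
import random

def switch_triplets(triplets, p_switch, rng=random):
    """With probability p_switch, exchange the positive and negative of each triplet.

    triplets is a list of (anchor, positive, negative) index tuples.
    A switched triplet makes the loss push a same-class sample away from the anchor.
    """
    switched = []
    for a, p, n in triplets:
        if rng.random() < p_switch:
            p, n = n, p
        switched.append((a, p, n))
    return switched

# Hypothetical mined triplets, given as dataset indices.
mined = [(0, 1, 7), (2, 3, 9), (4, 5, 8)]
regularized = switch_triplets(mined, p_switch=0.3, rng=random.Random(0))
```

The switched tuples are then passed to the unchanged ranking loss, so the regularization adds no new loss term and only one extra hyperparameter.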
6 Conclusion
In this work, we counteract the worrying trend of diverging training protocols in Deep Metric Learning (DML). We conduct a large, comprehensive study of important training components and objectives for DML to contribute to improved comparability of recent and future approaches. On this basis, we study generalization performance in DML and uncover a strong correlation to the level of compression of the learned data representation. Our findings reveal that highly compressed representations disregard helpful features for capturing data characteristics that transfer to unknown test distributions. To address this, we propose a simple technique for ranking-based methods to regularize the compression of the learned embedding space, which results in boosted performance across all benchmark datasets.
Acknowledgements
We thank Alex Lamb (Mila) for insightful discussions that helped with the design of the toy examples. We would also like to thank Sharan Vaswani and Dmitry Serdyuk (both Mila) for their feedback. We would also like to thank NVIDIA for donating an NVIDIA DGX-1, and Compute Canada for providing resources for this research.
References
 Information dropout: learning optimal representations through noisy computation. External Links: 1611.01353 Cited by: §2.
 Geometric approximation via coresets. Combinatorial and computational geometry 52, pp. 1–30. Cited by: §3.2.
 Deep variational information bottleneck. External Links: 1612.00410 Cited by: §2.

MINE: mutual information neural estimation
. External Links: 1801.04062 Cited by: §2.  Robustness and generalization for metric learning. Neurocomputing 151, pp. 259–267. External Links: ISSN 09252312, Link, Document Cited by: §2, §5.
 Supervised metric learning with generalization guarantees. External Links: 1307.4514 Cited by: §2.

Multilevel variational autoencoder: learning disentangled representations from grouped observations
. In AAAI 2018, Cited by: §1.  Unreproducible research is reproducible. In International Conference on Machine Learning, pp. 725–734. Cited by: §1.
 Large scale GAN training for high fidelity natural image synthesis. CoRR abs/1809.11096. External Links: Link, 1809.11096 Cited by: §3.2.

Deep metric learning to rank. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.3.
Beyond triplet loss: a deep quadruplet network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §A.1, §2, §3.1, Table 2.
 ImageNet: A Large-Scale Hierarchical Image Database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.3, Table 2.
 ArcFace: additive angular margin loss for deep face recognition. External Links: 1801.07698 Cited by: §A.1, §1, §2, §3.1, Table 2, §5.
 Deep metric learning with hierarchical triplet loss. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 269–285. Cited by: Table 3, §2, §3.2.
 InfoBot: transfer and exploration via the information bottleneck. External Links: 1901.10902 Cited by: §2.
 Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §2.
 Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §A.1, Appendix B, §2, §3.1, Table 2.
 Smart mining for deep metric learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2821–2829. Cited by: §1.
 Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.3.
 In defense of the triplet loss for person re-identification. External Links: 1703.07737 Cited by: §A.1.
 GANs trained by a two timescale update rule converge to a local nash equilibrium. External Links: 1706.08500 Cited by: §3.2.
 Discriminative deep metric learning for face verification in the wild. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §A.1, §A.2, §2.

Deep metric learning: the generalization analysis and an adaptive algorithm. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 2535–2541. External Links: Document, Link Cited by: §2, §3.1.
Batch normalization: accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning. Cited by: §3.3, §4.1.
 Metric learning with horde: highorder regularizer for deep embeddings. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
 Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence 33 (1), pp. 117–128. Cited by: §A.3, §4.1.
 Fantastic generalization measures and where to find them. In International Conference on Learning Representations, External Links: Link Cited by: §2.
 Training deep models faster with robust, approximate importance sampling. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. CesaBianchi, and R. Garnett (Eds.), pp. 7265–7275. External Links: Link Cited by: §2.

On large-batch training for deep learning: generalization gap and sharp minima. arXiv preprint arXiv:1609.04836. Cited by: §2.
Attention-based ensemble for deep metric learning. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: Table 3.
 Adam: a method for stochastic optimization. Cited by: §4.1.
 3D object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 554–561. Cited by: Table 5, Table 8, Appendix G, §4.
 A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, Cited by: §3.3.
 Deep variational metric learning. In The European Conference on Computer Vision (ECCV), Cited by: Table 3, §1, §2, §5.
 SphereFace: deep hypersphere embedding for face recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §2, §3.1, §5.
 Least squares quantization in pcm. IEEE Trans. Information Theory 28, pp. 129–136. Cited by: §A.3.
 Introduction to information retrieval. Natural Language Engineering 16 (1), pp. 100–103. Cited by: §A.3, §4.1.
 Coresets for accelerating incremental gradient methods. External Links: Link Cited by: §2, §3.2.
 Self-supervised learning of pretext-invariant representations. External Links: 1912.01991 Cited by: §3.2.
 No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pp. 360–368. Cited by: §A.1, §2, §3.1, Table 2.
 Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012. Cited by: §A.1, Table 3, Table 6, Table 9, Appendix G, §2, §3.1, §4.
 Deep metric learning with bier: boosting independent embeddings robustly. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.
 Automatic differentiation in pytorch. In NIPSW, Cited by: §4.1.
 SoftTriple loss: deep metric learning without triplet sampling. Cited by: §A.1, §A.1, §2, §3.1, Table 2.
 MIC: mining interclass characteristics for improved metric learning. Cited by: §A.1, Table 3, §1, §2, §3.3, §4.1.
 The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vision 40 (2), pp. 99–121. External Links: ISSN 09205691, Link, Document Cited by: §3.2.
 Divide and conquer the embedding space for metric learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §A.1, Table 3, §2, §4.1.
 Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §A.1, §A.2, §A.2, §1, §2, §3.1, §3.2, §4.3, §4.4, Table 2.
 Opening the black box of deep neural networks via information. External Links: 1703.00810 Cited by: §2.
 Smallgan: speeding up gan training using coresets. arXiv preprint arXiv:1910.13540. Cited by: §2, §3.2.
 Don’t decay the learning rate, increase the batch size. arXiv preprint arXiv:1711.00489. Cited by: §2.
 Improved deep metric learning with multi-class N-pair loss objective. In Advances in Neural Information Processing Systems, pp. 1857–1865. Cited by: §A.1, §A.1, §A.3, §2, §3.1, §4.1, Table 2.
 Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §3.3, §4.1.
 Deep learning and the information bottleneck principle. External Links: 1503.02406 Cited by: §2, §5.
 Learning deep embeddings with histogram loss. In Advances in Neural Information Processing Systems, Cited by: §A.1, §A.1, §3.1, Table 2.

Manifold mixup: better representations by interpolating hidden states. External Links: 1806.05236 Cited by: Appendix C, Appendix F, §2, §5, footnote 3.
The Caltech-UCSD Birds-200-2011 dataset. Cited by: Table 4, Table 7, Appendix G, §4.
 Deep metric learning with angular loss. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2593–2601. Cited by: §A.1, Table 2.
 Ranked list loss for deep metric learning. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: Table 3, §1, §3.1.
 Multisimilarity loss with general pair weighting for deep metric learning. External Links: 1904.06627 Cited by: §A.1, Table 2.
 Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2840–2848. Cited by: §A.1, §A.1, §A.2, §A.2, §A.2, Table 3, Appendix D, §1, §2, §3.1, §3.2, Figure 7, §4.1, §4.3, §4.4, Table 2.
 Deep randomized ensembles for metric learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 723–734. Cited by: §1, §2.
 Correcting the triplet selection bias for triplet loss. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 71–87. Cited by: §2, Table 2.
 Signal-to-noise ratio: a robust distance metric for deep metric learning. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §A.1, §1, Table 2.
 Classification is a strong baseline for deep metric learning. External Links: 1811.12649 Cited by: §A.1, §2, §3.1, Table 2.
 Hardness-aware deep metric learning. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.
Appendix A Description of Methods
In this section, we briefly describe each DML training objective and triplet mining strategy used in our study, together with the choice of their individual hyperparameters. General training parameters and details of the training protocol are discussed in Sec. 4.1 of the main paper. For notation, we refer to the embedding of an image including output normalization as . The non-normalized version is denoted as . All methods operate on the minibatch containing image indices. If not mentioned otherwise, all embeddings operate in dimension .
A.1 Training Criteria
Contrastive (Hadsell et al., 2006)
The contrastive training formalism is simple: Given embedding pairs (sampled from a minibatch of size ) containing an anchor from class and either a positive with or a negative from a different class, , the network is trained to minimize
(1) 
with margin , which we set to . The margin ensures that embeddings are not projected arbitrarily far apart from each other. For our distance function we utilize the standard Euclidean distance . We combine the contrastive loss with the distance-weighted negative sampling described below.
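As a rough illustration of the pairwise objective described above, the following sketch uses the common formulation in which a positive pair is penalized by its distance and a negative pair by a hinge on the margin. The function names and the default `margin=1.0` are placeholders chosen for this example rather than values from the study, and the use of the unsquared Euclidean distance is an assumption (some formulations square it).

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(anchor, other, same_class, margin=1.0):
    """Contrastive loss for a single embedding pair.

    Positive pairs are pulled together (loss = distance); negative pairs
    are pushed apart only until they are at least `margin` away.
    `margin=1.0` is an illustrative default, not a value from the paper.
    """
    d = euclidean(anchor, other)
    if same_class:
        return d
    return max(0.0, margin - d)
```

The hinge on the negative term is what keeps negatives from being pushed arbitrarily far apart: once a negative lies beyond the margin, it no longer contributes to the loss.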
Triplet (Hu et al., 2014)
Triplets extend the contrastive formalism to provide a concurrent ranking surrogate for both negative and positive sample embeddings using triplets sampled from a minibatch:
(2) 
with margin , thus following recent implementations in, e.g., Roth et al. (2019) or Wu et al. (2017). Initial works using the triplet loss (Schroff et al., 2015) commonly utilized random or semi-hard triplet sampling and a GoogLeNet-based architecture. Recent methods typically employ the more effective distance-weighted sampling (Wu et al., 2017) and more powerful networks (Roth et al., 2019; Sanakoyeu et al., 2019). For completeness, we compare the performance of the triplet loss combined with the random, semi-hard and distance-weighted sampling schemes introduced below.
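A minimal sketch of the triplet objective for one (anchor, positive, negative) triplet, assuming the usual hinge form on the difference of anchor-positive and anchor-negative distances. The helper names and the default `margin=0.2` are assumptions made for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-based triplet loss for a single triplet.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin` (an illustrative default).
    """
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(0.0, d_ap - d_an + margin)
```

Triplets that already satisfy the margin yield zero loss, which is why the choice of sampling scheme matters for how much training signal a minibatch provides.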
Generalized Lifted Structure (Hermans et al., 2017)
The Generalized Lifted Structure loss extends the standard lifted structure loss (Oh Song et al., 2016) to include all available anchor-positive and anchor-negative distance pairs within a minibatch , instead of utilizing only a single anchor-positive combination:
(3) 
with minibatch samples of a class grouped into , and sets of contained in . denotes the non-normalized version of . The margin serves the standard purpose of avoiding over-distancing already correct image pairs. To account for increasing values, regularizes the embeddings.
N-Pair (Sohn, 2016)
N-Pair or N-Tuple losses extend the triplet formalism to incorporate all negatives in the minibatch by
(4) 
with embedding regularization , as Sohn (2016) noted slow convergence for normalized embeddings.
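The idea of contrasting one positive against all negatives can be sketched for a single anchor as follows, assuming the commonly used form log(1 + Σ exp(a·n − a·p)) on non-normalized embeddings. The `l2_weight` argument is a hypothetical stand-in for the embedding regularization weight, and applying it to the anchor only is a simplification made for this example.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def npair_loss(anchor, positive, negatives, l2_weight=0.0):
    """N-pair style loss for one anchor against all given negatives.

    `l2_weight` scales a squared-norm penalty on the anchor embedding;
    its default of 0.0 is a placeholder, not a value from the paper.
    """
    pos_sim = dot(anchor, positive)
    ranking = math.log1p(
        sum(math.exp(dot(anchor, n) - pos_sim) for n in negatives)
    )
    return ranking + l2_weight * dot(anchor, anchor)
```

Because the similarities are unbounded inner products, the norm penalty is what prevents the loss from being reduced simply by scaling up the embeddings.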
Angular (Wang et al., 2017)
By introducing an angle-based penalty, the angular loss introduces scale invariance and higher-order geometric constraints that are not explicitly enforced by standard contrastive losses:
(5) 
with angular margin , which, as proposed in the original paper, is set to . is the trade-off between standard ranking losses and the angular constraint.
ArcFace (Deng et al., 2018)
ArcFace transfers the standard softmax formulation typically used in classification problems to retrieval-based problems by enforcing an angular margin between the embeddings and an approximate center for each class, resulting in
(6) 
Further, this training objective also introduces the additive angular margin penalty for increased inter-class discrepancy, while the scaling denotes the radius of the effectively utilized hypersphere . The class centers are optimized with learning rate .
Histogram (Ustinova and Lempitsky, 2016)
In contrast to many sample-based ranking objective functions, Histogram Loss learns to minimize the probability of a positive sample pair having a lower similarity score than a negative pair. Given a minibatch , the sets of positive similarities and negative similarities , one optimises
(7)  
(8)  
(9) 
resulting in soft, differentiable histogram assignments. The final objective then penalizes strong overlap between the probability of positive pairs having higher distance (i.e. its cumulative distribution to point ) than respective negative pairs. Such a histogram loss introduces a single hyperparameter, namely the degree of histogram discretisation , which we set to for our study. In general, our implementation borrows from the original code base used in (Ustinova and Lempitsky, 2016).
Margin (Wu et al., 2017)
Margin loss extends the standard triplet loss by introducing a dynamic, learnable boundary between positive and negative pairs. This transfers the common triplet ranking problem to a relative ordering of pairs :
(10) 
The learning rate of the boundary is set to , with initial value either or and triplet margin . For our implementation, we utilise the distance-weighted triplet sampling method described below.
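To make the role of the boundary concrete, here is a small sketch of the margin objective for one triplet, assuming the formulation of Wu et al. (2017) in which positives are pushed inside `beta - margin` and negatives outside `beta + margin`. In practice `beta` would be a learnable parameter; here it is a plain argument, and the defaults `beta=1.2` and `margin=0.2` are illustrative assumptions.

```python
def margin_loss(d_ap, d_an, beta=1.2, margin=0.2):
    """Margin loss for one triplet given precomputed distances.

    d_ap: anchor-positive distance, d_an: anchor-negative distance.
    `beta` is the boundary between positive and negative pairs and
    `margin` the slack around it; both defaults are placeholders.
    """
    pos_term = max(0.0, d_ap - beta + margin)  # positive too far away
    neg_term = max(0.0, beta - d_an + margin)  # negative too close
    return pos_term + neg_term
```

Unlike the triplet loss, each pair is compared against the boundary rather than against the other pair, which is what turns the ranking problem into a relative ordering of pairs.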
Multi-Similarity (Wang et al., 2019b)
Unlike contrastive and triplet-based ranking methods, the Multi-Similarity loss concurrently evaluates similarities between anchor and negative, anchor and positive, as well as positive-positive and negative-negative pairs in relation to an anchor:
(11)  
(12) 
where and denote the set of positive and negative samples for a sample , with cosine similarity for two normalized vectors . For our hyperparameters, we use , , and .
ProxyNCA (Movshovitz-Attias et al., 2017)
The sampling complexity of tuples heavily affects training convergence. ProxyNCA remedies this by introducing class proxies, which act as approximations to entire classes. This way, only an anchor is sampled and compared against the respective positive and negative class proxies. Utilizing one proxy per class , ProxyNCA is then defined as
(13) 
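The proxy-based comparison can be sketched for one anchor as below, assuming the standard NCA-style form −log(exp(−d(a, p_y)) / Σ_{c≠y} exp(−d(a, p_c))). The dictionary layout of `proxies`, the function names, and the use of the unsquared Euclidean distance are assumptions made for this example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def proxy_nca_loss(anchor, label, proxies):
    """ProxyNCA-style loss for a single anchor.

    proxies: dict mapping each class label to its (single) proxy vector.
    The anchor is compared only against proxies, never against other
    samples, so no tuple mining is needed.
    """
    d_pos = euclidean(anchor, proxies[label])
    neg_sum = sum(
        math.exp(-euclidean(anchor, p))
        for c, p in proxies.items()
        if c != label
    )
    # -log(exp(-d_pos) / neg_sum) simplifies to d_pos + log(neg_sum)
    return d_pos + math.log(neg_sum)
```

With one proxy per class, the cost per anchor grows with the number of classes rather than with the number of possible tuples in the minibatch.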
Quadruplet (Chen et al., 2017)
The quadruplet loss is an extension of the triplet loss that introduces a higher-level ordering constraint on sample embeddings. By using an anchor, a positive and two exclusive negatives, the quadruplet criterion is defined as:
(14) 
with margin parameters and . We utilize distance-weighted sampling to propose the first negative sample , which we found to work better than the quadruplet sampling scheme originally proposed in the paper.
SNR (Yuan et al., 2019)
The Signal-to-Noise-Ratio loss (SNR) introduces a novel distance metric based on the ratio between the variance of the anchor embedding and the variance of the noise, where the noise is simply defined as the difference between the anchor and the compared embedding. This optimises the embedding space directly for informativeness. The complete loss can then be written as
(15) 
with margin parameter and regularization to ensure zero-mean distributions. Note that .
SoftTriple (Qian et al., 2019)
Similar to ProxyNCA, the SoftTriple objective function utilizes learnable data proxies to tackle the sampling problem. However, instead of class-discriminative proxies, a set of normalized intra-class proxies per class is learned using the NCA-based similarity measure of a sample to all proxies of a class . Denoting the set of available classes as and the total set of proxies as , we get
(16)  
(17)  
(18) 
The second term denotes a regularization on the learned proxies to ensure sparseness in the class set of proxies. For our tests, we utilised the following hyperparameter values (borrowing from the official implementation in (Qian et al., 2019)): , , , and the number of proxies per class . The proxy learning rate is set to .
Normalized Softmax (Zhai and Wu, 2018)
Similar to other classification-based losses in DML that are based on reformulations of the standard softmax function (such as ), the normalized softmax loss is optimized by comparing input embeddings to class proxies per class :
(19) 
with temperature for gradient boosting and class proxy learning rate set to .
A.2 Tuple Mining
Basic contrastive, triplet or higher-order ranking losses commonly need to mine their training tuples from the available minibatch. In our study, we measure the influence of tuple sampling on the standard triplet loss, while utilising Distance-Weighted Mining for all ranking-based objective functions except N-Pair-based methods.
Random Tuple Mining (Hu et al., 2014)
The trivial way involves the random sampling of tuples. Simply put, per sample we select a respective positive or negative sample .
Semi-hard Triplet Mining (Schroff et al., 2015)
The potential number of triplets scales cubically with the training set size. During learning, more and more of those triplets are correctly ordered and effectively provide no training signal (Schroff et al., 2015), thus impairing the remaining training process. To alleviate this, negative samples are carefully selected based on the distance between the anchor and its positive (both of which are sampled at random). Given an anchor embedding and its positive , the negative is sampled randomly from the set
(20) 
This way, only negatives that are reasonably hard to separate from an anchor are considered. Moreover, this mining strategy avoids sampling overly hard negatives, which often correspond to data noise and can lead to model collapse and bad local minima (Schroff et al., 2015).
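The selection rule can be sketched as follows, assuming the semi-hard definition of Schroff et al. (2015): a negative qualifies if it is farther from the anchor than the positive but still within the margin. The function names, the default `margin=0.2`, and the fallback to a uniformly random negative when no candidate qualifies are assumptions made for this example.

```python
import random


def semihard_negatives(d_ap, neg_distances, margin=0.2):
    """Indices of negatives n with d_ap < d(a, n) < d_ap + margin."""
    return [
        i for i, d_an in enumerate(neg_distances)
        if d_ap < d_an < d_ap + margin
    ]


def sample_semihard(d_ap, neg_distances, margin=0.2, rng=random):
    """Draw one negative index uniformly from the semi-hard set.

    If the set is empty, fall back to any negative (an assumption of
    this sketch, not necessarily the behaviour used in the study).
    """
    candidates = semihard_negatives(d_ap, neg_distances, margin)
    if not candidates:
        candidates = list(range(len(neg_distances)))
    return rng.choice(candidates)
```

Negatives closer to the anchor than the positive are excluded on purpose, which is how the strategy sidesteps the overly hard cases mentioned above.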
Distance-Weighted Tuple Mining (Wu et al., 2017)
In DML, the embedding spaces are typically normalized to a dimensional (unit) hypersphere for regularisation purposes (Wu et al., 2017). The analytical distribution of pairwise distances on a hypersphere follows
(21) 
for arbitrary embedding pairs . In order to sample negatives from the whole range of possible distances to an anchor, Wu et al. (2017) propose to sample negatives based on a distance distribution inverse to , i.e.
(22) 
We set and limit the distances to .
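A sketch of the resulting sampling probabilities, assuming the pairwise distance density on the unit hypersphere given by Wu et al. (2017), q(d) ∝ d^(D−2) (1 − d²/4)^((D−3)/2), and working in log space for numerical stability. The defaults `dim=128`, `lower=0.5` and `upper=1.4` are illustrative placeholders. This sketch omits the cap on the inverse density and relies on distance clipping alone to bound the weights, which is a simplification.

```python
import math
import random


def log_inverse_density(d, dim):
    """log(1 / q(d)) up to an additive constant, for 0 < d < 2."""
    return (2.0 - dim) * math.log(d) - 0.5 * (dim - 3) * math.log(1.0 - 0.25 * d * d)


def distance_weighted_probs(neg_distances, dim=128, lower=0.5, upper=1.4):
    """Sampling probabilities over negatives, inverse to the density q."""
    clipped = [min(max(d, lower), upper) for d in neg_distances]
    log_w = [log_inverse_density(d, dim) for d in clipped]
    shift = max(log_w)  # subtract the maximum before exponentiating
    weights = [math.exp(lw - shift) for lw in log_w]
    total = sum(weights)
    return [w / total for w in weights]


def sample_negative(neg_distances, rng=random, **kwargs):
    """Draw one negative index according to the distance-weighted probabilities."""
    probs = distance_weighted_probs(neg_distances, **kwargs)
    return rng.choices(range(len(neg_distances)), weights=probs, k=1)[0]
```

In high dimensions most pairwise distances concentrate near √2, so weighting by the inverse density shifts probability mass toward the rarer, closer negatives.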
A.3 Evaluation Metrics
In this section, we describe the evaluation metrics used to measure the performance of the studied models on the test set .
Recall@k (Jegou et al., 2011)
Let
(23) 
be the set of the first nearest neighbours of a sample , then we measure Recall@k as
(24) 
which measures the fraction of queries for which at least one sample among the top nearest neighbours shares the query's class, i.e. .
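The metric can be computed directly from its definition, as in the following sketch. It assumes Euclidean distance for the nearest-neighbour search and excludes each query from its own neighbour list; the function names are chosen for this example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def recall_at_k(embeddings, labels, k):
    """Fraction of queries with at least one same-class sample in the top k."""
    hits = 0
    for i, query in enumerate(embeddings):
        # Rank all other samples by their distance to the query.
        ranked = sorted(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: euclidean(query, embeddings[j]),
        )
        if any(labels[j] == labels[i] for j in ranked[:k]):
            hits += 1
    return hits / len(embeddings)
```

For example, with five 2D points where one sample lies far from the rest of its class, Recall@1 counts that sample as a miss, while a larger k can still recover it.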
Normalized Mutual Information (NMI) (Manning et al., 2010)
To measure the clustering quality using NMI, we embed all samples to obtain and perform a clustering (e.g. K-Means (Lloyd, 1982)). Subsequently, we assign each sample a cluster label indicating the closest cluster center and define with and being the number of classes and clusters. Similarly, for the true labels we define with . The normalized mutual information is then computed as
(25) 
with mutual information between clusters and labels, and entropy on the clusters and labels, respectively.
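Given cluster assignments and ground-truth labels, NMI can be computed as sketched below. This assumes normalization by the arithmetic mean of the two entropies, i.e. 2·I / (H(clusters) + H(labels)), which is one common convention; the function names and the value returned in the degenerate zero-entropy case are choices made for this example.

```python
import math
from collections import Counter


def entropy(assignments):
    """Shannon entropy (in nats) of a list of discrete assignments."""
    n = len(assignments)
    return -sum((c / n) * math.log(c / n) for c in Counter(assignments).values())


def mutual_information(clusters, labels):
    """Mutual information (in nats) between two assignments of the same samples."""
    n = len(labels)
    joint = Counter(zip(clusters, labels))
    n_cluster = Counter(clusters)
    n_label = Counter(labels)
    return sum(
        (c / n) * math.log(n * c / (n_cluster[w] * n_label[y]))
        for (w, y), c in joint.items()
    )


def nmi(clusters, labels):
    """Normalized mutual information in [0, 1]."""
    denom = entropy(clusters) + entropy(labels)
    if denom == 0.0:
        return 1.0  # both assignments are constant; treated as perfect agreement
    return 2.0 * mutual_information(clusters, labels) / denom
```

NMI is invariant to a permutation of the cluster indices, so a clustering that matches the classes up to relabelling scores 1.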
F1-Score (Sohn, 2016)
The F1-score is the harmonic mean of precision and recall and is a commonly used retrieval metric that places equal importance on both. It is defined as
(26) 
with precision and recall defined over nearest neighbour retrieval, as done for Recall@k.
Mean Average Precision measured on Recall (mAP):
The mAP score measured on recall follows the same definition as standard mAP. In our case, the mAP is equivalent to the mean over the class-wise average precision@ with being the number of samples with label . With defined as in Eq. 23, this gives
(27) 
Appendix B Correlation between performance and spectral decay
Similar to Fig. 5 (rightmost) in the main paper, we now provide a more detailed illustration in Fig. 9 comparing the performance of the training objectives and their corresponding spectral decay . For ranking losses, we further include the results using regularization while training, which shows that in each case a gain in performance is related to a decrease of . In particular, the contrastive loss (Hadsell et al., 2006) greatly profits from our proposed regularization, as also indicated by the analysis of the singular value spectra (cf. Fig. 8 of main paper). Its large gains, more than on the CARS196 dataset, are well explained by comparing its training objective with those of triplet-based formulations. The latter optimize over relative positive () and negative distances () up to a fixed margin , which counteracts a compression of the embedding space to a certain extent. The contrastive loss, on the other hand, controls only the negative distances by and is thus able to perform an unconstrained contraction of entire classes, which facilitates overly compressed embedding spaces .
Appendix C Analysis of perclass singular value spectra
In Sec. 5 of our main paper we analyze generalization in DML by considering the decay of the singular value spectrum over all embedded samples . Thus, we analyze the general compression of the entire embedding space, as unseen test classes can be projected anywhere in , in contrast to Verma et al. (2018), who conduct a class-conditioned analysis for i.i.d. classification problems. In order to show that the effect of regularization (as shown in Fig. 8 in main paper) is also reflected in the class-conditioned singular value spectrum, we perform SVD on and subsequently average over all classes . Fig. 10 compares the sorted, first singular values for models trained with and without regularization. We clearly see that the regularization decreases the average decay of singular values, similar to the total singular value spectra shown in the main paper.
Appendix D Comparison to state-of-the-art approaches on SOP dataset
Approach | Architecture | Dim | R@1 | R@10 | R@100 | NMI
DVML (Lin et al., 2018) | GoogLeNet | 512 | 70.2 | 85.2 | 93.8 | 90.8
HTL (Ge, 2018) | InceptionBN | 512 | 74.8 | 88.3 | 94.8 | -
MIC (Roth et al., 2019) | ResNet50 | 128 | 77.2 | 89.4 | 95.6 | 90.0
D&C (Sanakoyeu et al., 2019) | ResNet50 | 128 | 75.9 | 88.4 | 94.9 | 90.2
Rank (Wang et al., 2019a) | InceptionBN | 1536 | 79.8 | 91.3 | 96.3 | 90.4
ABE (Kim et al., 2018) | GoogLeNet | 512 | 76.3 | 88.4 | 94.8 | -
Margin (ours) (Wu et al., 2017) | ResNet50 | 128 | 78.3 | - | - | 90.8
In this section, we provide a detailed comparison between current state-of-the-art DML approaches and our strongest baseline model, margin loss (D, ) (Wu et al., 2017), on the SOP dataset in Tab. 6. The results for these approaches are taken from their public manuscripts. We observe that our baseline model outperforms all compared models except Rank in R@1, across varying architectures, and in particular the other ResNet50-based implementations. While ResNet50 proves to be a stronger base network than GoogLeNet-based models (cf. Fig. 2 of main paper), the improvements of at least 1.1 percentage points in R@1 over MIC and D&C, which use the same backbone, and over methods based on the similarly strong InceptionBN showcase the relevance of a well-defined baseline. Additionally, even though Rank and ABE employ considerably more powerful network ensembles, our carefully motivated baseline exhibits competitive performance.
Appendix E 2D Toy Examples
For our toy examples, we use a fully-connected network with two 30-neuron layers. Both input and embedding are 2D, with the latter normalized onto the unit circle. Each of the four training and test lines contains 15 samples taken from either the diagonal or the vertical/horizontal line segments, respectively. We train the networks both with and without regularization for iterations, a batch size of and learning rate of using a standard contrastive loss (Eq. 1) with margin . For regularisation, we set . Similar to Fig. 6 in the main paper, Fig. 11 shows another 2D toy example based on vertical lines, which again demonstrates the effect of compression and of our proposed regularization. The example consists of four training lines that are separable only by their coordinate and a test set of lines which are separable by their coordinate. As we observe, the test samples are collapsed onto a single point in the non-regularized embedding space and thus cannot be distinguished. In contrast, the regularized representation allows us to separate the test classes and, further, exhibits a decreased decay in the singular value spectrum.
Appendix F Influence of Manifold Mixup on DML
Now, we examine the effect of applying the regularization proposed in Manifold Mixup (Verma et al., 2018) to the DML transfer learning setting. As Manifold Mixup has been proposed to increase the compression of a learned representation in the context of standard supervised classification, it is expected to decrease the performance of DML models. To test this, we train three different DML models on the CUB200-2011 dataset: (1) Normalized Softmax, (2) Triplet with Distance Sampling and (3) Margin loss with and Distance Sampling. For (1), the implementation directly follows the standard implementation of Verma et al. (2018). For the ranking-based training objectives, we perform mixup in our ResNet50 and generate the mixed class labels, which consequently have either one entry (if images from the same class are mixed) or two entries (if images from different classes are mixed). Per (mixed) anchor embedding, this gives rise to up to two possible sets of triplets, for which we compute the loss and weigh it by the respective mixup coefficient :
(28) 
where denotes the set of triplets given the