Characterizing Generalization under Out-Of-Distribution Shifts in Deep Metric Learning

07/20/2021 · by Timo Milbich, et al.

Deep Metric Learning (DML) aims to find representations suitable for zero-shot transfer to a priori unknown test distributions. However, common evaluation protocols only test a single, fixed data split in which train and test classes are assigned randomly. More realistic evaluations should consider a broad spectrum of distribution shifts with potentially varying degree and difficulty. In this work, we systematically construct train-test splits of increasing difficulty and present the ooDML benchmark to characterize generalization under out-of-distribution shifts in DML. ooDML is designed to probe the generalization performance on much more challenging, diverse train-to-test distribution shifts. Based on our new benchmark, we conduct a thorough empirical analysis of state-of-the-art DML methods. We find that while generalization tends to consistently degrade with difficulty, some methods are better at retaining performance as the distribution shift increases. Finally, we propose few-shot DML as an efficient way to consistently improve generalization in response to unknown test shifts presented in ooDML. Code available here: https://github.com/Confusezius/Characterizing_Generalization_in_DeepMetricLearning.

1 Introduction

Image representations that generalize well are the foundation of numerous computer vision tasks, such as image and video retrieval Sohn (2016); Wu et al. (2017); Roth et al. (2020); Milbich et al. (2020b); Brattoli et al. (2020), face (re-)identification Schroff et al. (2015); Liu et al. (2017); Deng et al. (2019) and image classification Tian et al. (2020a); Chen et al. (2020); He et al. (2020); Misra and Maaten (2020). Ideally, these representations should not only capture data within the training distribution, but also transfer to new, out-of-distribution (OOD) data. However, in practice, achieving effective OOD generalization is more challenging than in-distribution generalization Koh et al. (2021); Engstrom et al. (2019); Hendrycks and Dietterich (2019); Recht et al. (2019); Krueger et al. (2021). In the case of zero-shot generalization, where train and test classes are completely distinct, Deep Metric Learning (DML) is used to learn metric representation spaces that capture and transfer visual similarity to unseen classes, constituting a priori unknown test distributions with unspecified shift. To approximate such a setting, current DML benchmarks use single, predefined and fixed data splits of disjoint train and test classes, which are assigned arbitrarily (e.g., the first and second half of alphabetically sorted class names in CUB200-2011 Wah et al. (2011) and CARS196 Krause et al. (2013)) Wu et al. (2017); Deng et al. (2019); Sohn (2016); Jacob et al. (2019); Duan et al. (2018); Lin et al. (2018); Zheng et al. (2019); Roth et al. (2019); Milbich et al. (2017); Roth et al. (2020); Musgrave et al. (2020); Kim et al. (2020); Teh et al. (2020); Seidenschwarz et al. (2021). This means that (i) generalization is only evaluated at a fixed problem difficulty, (ii) the generalization difficulty is only implicitly defined by the arbitrary data split, (iii) the distribution shift is not measured and (iv) cannot be changed. As a result, proposed models can overfit to these singular evaluation settings, which puts into question the true zero-shot generalization capabilities of proposed DML models.

In this work, we first construct a new benchmark, ooDML, to characterize generalization under out-of-distribution shifts in DML. We systematically build ooDML as a comprehensive benchmark for evaluating OOD generalization in changing zero-shot learning settings, covering a much larger variety of zero-shot transfer scenarios potentially encountered in practice. We construct training and testing data splits of increasing difficulty as measured by their Frechet Inception Distance (FID) Heusel et al. (2017) and extensively evaluate the performance of current DML approaches.

Our experiments reveal that the standard evaluation splits are often close to i.i.d. evaluation settings. In contrast, our novel benchmark continually evaluates models on significantly harder learning problems, providing a more complete perspective on OOD generalization in DML. Second, we perform a large-scale study of representative DML methods on ooDML and study the actual benefit of underlying regularizations such as self-supervision Milbich et al. (2020b), knowledge distillation Roth et al. (2020), adversarial regularization Sinha et al. (2020) and specialized objective functions Wu et al. (2017); Wang et al. (2019); Deng et al. (2019); Kim et al. (2020); Roth et al. (2020). We find that conceptual differences between DML approaches play a more significant role as the distribution shift to the test split grows.

Finally, we present a study on few-shot DML as a simple extension to achieve systematic and consistent OOD generalization. As the transfer learning problem becomes harder, even very little in-domain knowledge effectively helps to adjust learned metric representation spaces to novel test distributions. We publish our code and train-test splits on three established benchmark sets, CUB200-2011 Wah et al. (2011), CARS196 Krause et al. (2013) and Stanford Online Products (SOP) Oh Song et al. (2016). Similarly, we provide training and evaluation episodes for further research into few-shot DML.

Overall, our contributions can be summarized as:

  • Proposing the ooDML benchmark to create a set of more realistic train-test splits that evaluate DML generalization capabilities under increasingly more difficult zero-shot learning tasks.

  • Analyzing the current DML method landscape under ooDML to characterize benefits and drawbacks of different conceptual approaches to DML.

  • Introducing and examining few-shot DML as a potential remedy for systematically improved OOD generalization, especially when moving to larger train-test distribution shifts.

2 Related Work

DML has become essential for many applications, especially zero-shot image and video retrieval Sohn (2016); Wu et al. (2017); Roth et al. (2019); Jacob et al. (2019); Brattoli et al. (2020). Proposed approaches most commonly rely on a surrogate ranking task over tuples during training Suárez et al. (2018), ranging from simple pairs Hadsell et al. (2006) and triplets Schroff et al. (2015); Milbich et al. (2020a) to higher-order quadruplets Chen et al. (2017) and more generic n-tuples Sohn (2016); Oh Song et al. (2016); Hermans et al. (2017); Wang et al. (2019). These ranking tasks can also leverage additional context such as geometrical embedding structures Wang et al. (2017); Deng et al. (2019). However, due to the exponentially growing complexity of the tuple sampling space, these methods are usually combined with tuple sampling objectives, relying on predefined or learned heuristics to avoid training over tuples that are too easy or too hard Schroff et al. (2015); Xuan et al. (2020), or to reduce tuple redundancy encountered during training Wu et al. (2017); Ge (2018); Harwood et al. (2017); Roth et al. (2020). More recent work has tackled sampling complexity through the use of proxy representations that serve as sample stand-ins during training, following an NCA Goldberger et al. (2005) objective Movshovitz-Attias et al. (2017); Kim et al. (2020); Teh et al. (2020), leveraging softmax-style training through class proxies Deng et al. (2019); Zhai and Wu (2018), or simulating intraclass structures Qian et al. (2019).

Unfortunately, the true benefit of these proposed objectives has recently been put into question, with Roth et al. (2020) and Musgrave et al. (2020) highlighting high levels of performance saturation of these discriminative DML objectives on default benchmark splits under fair comparison. Instead, orthogonal work extending the standard DML training paradigm through multi-task approaches Sanakoyeu et al. (2019); Roth et al. (2019); Milbich et al. (2020c), boosting Opitz et al. (2017, 2018), attention Kim et al. (2018), sample generation Duan et al. (2018); Lin et al. (2018); Zheng et al. (2019), multi-feature learning Milbich et al. (2020b) or self-distillation Roth et al. (2020) has shown more promise, with strong relative improvements under fair comparison Roth et al. (2020); Milbich et al. (2020b), although still only in single-split benchmark settings. It thus remains unclear how well these methods generalize in more realistic settings Koh et al. (2021) under potentially much more challenging train-to-test distribution shifts, which we investigate in this work.

3 ooDML: Constructing a Benchmark for OOD Generalization in DML

An image representation φ learned on samples drawn from a training distribution p_train generalizes well if it transfers to test data that are not observed during training. In the particular case of OOD generalization, the learned representation φ is supposed to transfer to samples which are not independently and identically distributed (i.i.d.) with respect to p_train. A successful approach to learning such representations is DML, which is evaluated for the special case of zero-shot generalization, i.e. the transfer of φ to distributions over unknown classes Schroff et al. (2015); Wu et al. (2017); Jacob et al. (2019); Deng et al. (2019); Roth et al. (2020); Musgrave et al. (2020). More specifically, DML models aim to learn an embedding φ mapping datapoints x into an embedding space Φ, which allows one to measure the similarity between samples x_i and x_j as d(φ(x_i), φ(x_j)). Typically, d is a predefined metric, such as the Euclidean or cosine distance, and φ is parameterized by a deep neural network.
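As a minimal, illustrative sketch of this setup (our own example: the backbone choice, embedding dimensionality and helper names are assumptions, not details specified by the paper), an embedding φ together with a metric d could look as follows in PyTorch:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class EmbeddingNet(nn.Module):
    """Sketch of an embedding phi: ImageNet-pretrained backbone + linear projection."""
    def __init__(self, embed_dim=128):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        backbone.fc = nn.Identity()              # drop the classification head
        self.backbone = backbone
        self.proj = nn.Linear(2048, embed_dim)   # project pooled features into Phi

    def forward(self, x):
        z = self.proj(self.backbone(x))
        return nn.functional.normalize(z, dim=-1)  # unit norm -> cosine-friendly geometry

def pairwise_distances(phi_x):
    """d(phi(x_i), phi(x_j)) for all pairs in a batch, here using the Euclidean metric."""
    return torch.cdist(phi_x, phi_x, p=2)
```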

In realistic zero-shot learning scenarios, test distributions are not specified a priori. Thus, their respective distribution shifts relative to the training distribution, which indicate the difficulty of the transfer learning problem, are unknown as well. To determine the generalization capabilities of φ, we would ideally measure its performance on different test distributions covering a large spectrum of distribution shifts, which we also refer to as "problem difficulties" in this work. Unfortunately, standard evaluation protocols test the generalization of φ on a single, fixed train-test data split of predetermined difficulty. Hence, such protocols only allow for limited conclusions about true zero-shot generalization properties.

To thoroughly assess and compare zero-shot generalization of DML models, we aim to build an evaluation protocol that resembles the undetermined nature of the transfer learning problem. To achieve this, we need to be able to change, measure and control the difficulty of train-test data splits. To this end, we present an approach for constructing multiple train-test splits of measurably increasing difficulty, which make up the ooDML benchmark. Our generated train-test splits are based on the established DML benchmark sets and are subsequently used in Sec. 4 to thoroughly analyze the current state of the art in DML. For future research, this approach is also easily applicable to other datasets and transfer learning problems.

3.1 Measuring the gap between train and test distributions

To create our train-test data splits, we need a way of measuring the distance between image datasets. This is a difficult task due to the high dimensionality and natural noise of images. Recently, the Frechet Inception Distance (FID) Heusel et al. (2017) was proposed to measure the distance between two image distributions using the neural embeddings of an Inception-v2 network trained for classification on the ImageNet dataset. FID assumes that the embeddings of the penultimate layer follow a Gaussian distribution with mean μ and covariance Σ for a distribution of images p(x). The FID between two data distributions p_1 and p_2 is then defined as

FID(p_1, p_2) = ‖μ_1 − μ_2‖² + Tr(Σ_1 + Σ_2 − 2(Σ_1 Σ_2)^(1/2))    (1)

In this paper, instead of the Inception network, we use the embeddings of a ResNet-50 classifier (Frechet ResNet Distance) for consistency with most DML studies (see e.g. Wu et al. (2017); Teh et al. (2020); Kim et al. (2020); Sanakoyeu et al. (2019); Roth et al. (2019); Milbich et al. (2020b); Roth et al. (2020); Seidenschwarz et al. (2021)). For simplicity, in the following sections we will still use the abbreviation FID.
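As a concrete reference, the following sketch (our own, using NumPy/SciPy; it assumes the penultimate-layer features have already been extracted) computes the Frechet distance of Eq. 1 between two sets of embeddings:

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_1, feats_2):
    """Frechet distance (Eq. 1) between two embedding sets, each of shape (N, D)."""
    mu1, mu2 = feats_1.mean(axis=0), feats_2.mean(axis=0)
    sigma1 = np.cov(feats_1, rowvar=False)
    sigma2 = np.cov(feats_2, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)  # matrix square root of the product
    covmean = covmean.real        # discard tiny imaginary parts from numerical error
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```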

3.2 On the issue with default train-test splits in DML

Dataset                                    CUB           CARS          SOP
Default (different classes train/test)     52.62         8.59          3.43
i.i.d. (same classes train/test)           4.87 ± 0.05   2.33 ± 0.03   0.98 ± 0.01
Table 1: FID scores between i.i.d. subsampled training and test sets compared to FID scores measured on the default splits used in standard DML evaluation protocols. The train-test distribution shift of two out of three benchmarks is actually close to an i.i.d. setting, in particular when compared to the train-test splits evaluated in Fig. 1, which reach scores over 200.

To motivate the need for more comprehensive OOD evaluation protocols, we look at the split difficulty, as measured by FID, of typically used train-test splits and compare it to i.i.d. sampling of training and test sets from the same benchmark. Empirical results in Tab. 1 show that the commonly utilized train-test splits of CARS196 and SOP are very close to in-distribution learning problems when compared to genuinely out-of-distribution splits (see Fig. 1). This indicates that semantic differences due to disjoint train and test classes do not necessarily translate into significant distribution shifts between train and test set. This also explains the consistently lower zero-shot retrieval performance on CUB200-2011 as compared to both CARS196 and SOP in the literature Wu et al. (2017); Wang et al. (2019); Jacob et al. (2019); Roth et al. (2020); Musgrave et al. (2020); Milbich et al. (2020b), despite SOP containing significantly more classes with fewer examples per class. In addition to the previously discussed issues of DML evaluation protocols, this further questions conclusions drawn from these protocols about the OOD generalization of representations φ.

3.3 Creating train-test splits of increasing difficulty

Figure 1: FID progression with iterative class swapping and removal for train-test split generation. (Col. 1-3) FID per swapping iteration on all benchmarks. (Rightmost) FID of data splits obtained by additional iterations of removing classes. The blue bar denotes the maximal FID after swapping.

Let (T_train, T_test) denote the original train and test split of a given benchmark dataset. To generate train-test splits of increasing difficulty while retaining the available data and keeping the split sizes balanced, we exchange samples between them. To ensure semantic consistency and avoid biasing the data distributions by image context unrelated to the target object categories, we swap entire classes instead of individual samples. Using FID to measure distribution similarity, the goal at each iteration k is to identify a train class c_train and a test class c_test whose exchange increases the FID between the splits. To this end, we select c_train and c_test as

c_train = argmin_{c ∈ T_train^k} ‖μ_c − μ(T_test^k)‖₂    (2)
c_test = argmin_{c ∈ T_test^k} ‖μ_c − μ(T_train^k)‖₂    (3)

where μ_c denotes the mean class representation and μ(·) the mean representation of a data split. Similar unimodal approximations to intraclass distributions have also been used in recent literature Lin et al. (2018); Roth et al. (2019); Milbich et al. (2020b). By iteratively exchanging classes between the data splits, i.e. moving c_train to the test set and vice versa, we obtain a more difficult train-test split at each iteration step k. Hence, we obtain a sequence of train-test splits (T_train^k, T_test^k), k = 0, …, K, starting from the original split (T_train^0, T_test^0) = (T_train, T_test) and increasing in FID. Note that in Eq. 2 and 3, we only consider the class means and neglect the covariance term from Eq. 1; we observed this approximation to the Frechet distance to be sufficient to generate data splits of increasing FID. Fig. 1 (columns 1-3) indeed shows that FID gradually increases with each swap until it cannot be increased further by swapping classes. For CUB200-2011 and CARS196, we swap two classes per iteration, while for Stanford Online Products we swap 1000 classes due to its significantly higher class count. Moreover, to cover the overall spectrum of distribution shifts and ensure comparability between benchmarks, we also reverse the iteration procedure on CUB200-2011 to generate splits minimizing FID while still maintaining disjoint train and test classes.

To further increase the FID beyond the convergence point of the swapping procedure (see Fig. 1), we subsequently also identify and remove classes from both splits. More specifically, we remove classes from the train set that are closest to the mean of the test set and vice versa. We successively repeat this class removal as long as a sufficient portion of the original data is maintained in these additional train-test splits. Fig. 1 (rightmost) shows how splits generated through class removal progressively increase the FID beyond what was achieved by swapping alone. To verify that the generated data splits are not inherently biased towards the backbone network used for FID computation, we also repeat the train-test split generation based on feature representations obtained from different architectures, pretraining methods and datasets. The corresponding results in the supplementary show a consistent, similar increase in (normalized) FID. Overall, using class swapping and removal, we select splits that cover the broadest possible FID range while still maintaining sufficient data. Our splits are significantly more difficult than the default splits, thereby much more closely resembling potential distribution shifts faced in practice.
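The split-construction procedure of this section can be condensed into the following sketch (our own simplification: class assignments are represented as lists of class labels, only class means are used as in Eq. 2-3, and the number of classes swapped per iteration is a placeholder):

```python
import numpy as np

def split_step(train_classes, test_classes, class_means, n_swap=2):
    """One swapping iteration: move the train classes closest to the test mean into
    the test split and vice versa, which tends to increase the FID between splits."""
    mu_train = np.mean([class_means[c] for c in train_classes], axis=0)
    mu_test = np.mean([class_means[c] for c in test_classes], axis=0)
    to_test = sorted(train_classes, key=lambda c: np.linalg.norm(class_means[c] - mu_test))[:n_swap]
    to_train = sorted(test_classes, key=lambda c: np.linalg.norm(class_means[c] - mu_train))[:n_swap]
    train_classes = [c for c in train_classes if c not in to_test] + to_train
    test_classes = [c for c in test_classes if c not in to_train] + to_test
    return train_classes, test_classes

def removal_step(train_classes, test_classes, class_means):
    """Once swapping has converged, drop the train class closest to the test mean
    (and vice versa) to push the FID further, at the cost of some data."""
    mu_train = np.mean([class_means[c] for c in train_classes], axis=0)
    mu_test = np.mean([class_means[c] for c in test_classes], axis=0)
    drop_train = min(train_classes, key=lambda c: np.linalg.norm(class_means[c] - mu_test))
    drop_test = min(test_classes, key=lambda c: np.linalg.norm(class_means[c] - mu_train))
    return ([c for c in train_classes if c != drop_train],
            [c for c in test_classes if c != drop_test])
```

In the full procedure, the FID of each candidate split is tracked so that splits covering an approximately uniform range of difficulties can be selected afterwards (cf. Sec. 4.1).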

4 Assessing the State of Generalization in Deep Metric Learning

This section assesses the state of zero-shot generalization in DML via a large experimental study of representative DML methods on our ooDML benchmark, offering a much more complete and thorough perspective on zero-shot generalization in DML as compared to previous DML studies Fehervari et al. (2019); Roth et al. (2020); Musgrave et al. (2020); Milbich et al. (2020c).

Figure 2: Zero-shot generalization performance under varying distribution shifts. (top row) Absolute Recall@1 performance for each increasingly more difficult train-test split in the ooDML benchmark (cf. Sec. 3) on CUB200-2011, CARS196 and SOP. We report mean Recall@1 performance and standard deviations over 5 runs. For results based on mAP@1000 see the supplementary. (bottom) Differences in performance against the mean over all methods for each train-test split.

4.1 Experimental Setup

For our experiments we use the three most widely used benchmarks in DML: CUB200-2011 Wah et al. (2011), CARS196 Krause et al. (2013) and Stanford Online Products Oh Song et al. (2016). Implementation and training details not explicitly stated in the respective sections are listed in the supplementary. To measure generalization performance, we resort to the most widely used metric for image retrieval in DML, Recall@k Jegou et al. (2011). Additionally, we also evaluate results using mean average precision (mAP@1000) Roth et al. (2020); Musgrave et al. (2020), but provide the respective tables and visualizations in the supplementary when the interpretation of results is not impacted.
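For completeness, Recall@k over a test split can be computed from the learned embeddings as in the sketch below (our own minimal version; the released benchmark code may differ in details such as the distance used or memory-efficient batching):

```python
import numpy as np

def recall_at_k(embeddings, labels, k=1):
    """Fraction of queries whose k nearest neighbors (excluding themselves)
    contain at least one sample of the same class."""
    dists = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    np.fill_diagonal(dists, np.inf)              # exclude self-matches
    nn_idx = np.argsort(dists, axis=1)[:, :k]    # indices of the k nearest neighbors
    hits = (labels[nn_idx] == labels[:, None]).any(axis=1)
    return float(hits.mean())
```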

The exact training and test splits utilized throughout this work are selected based on Fig. 1 to ensure approximately uniform coverage of the spectrum of distribution shifts, ranging from the lowest (near-i.i.d. splits) to the highest generated shift achieved with class removal. For experiments on CARS196 and Stanford Online Products, eight splits were investigated in total, including the original default benchmark split. For CUB200-2011, we select nine splits to also account for benchmark additions with reduced distribution shift. The exact FID ranges are provided in the supplementary. Training on CARS196 and CUB200-2011 was done for a maximum of 200 epochs following the standard training protocol of Roth et al. (2020), while 150 epochs were used for the much larger SOP dataset. Additional training details not directly stated in the respective sections can be found in the supplementary. For further usage, we will release our source code and train-test splits.

4.2 Zero-shot generalization under varying distribution shifts

Many different concepts have been proposed in DML to learn embedding functions that generalize from the training distribution to differently distributed test data. To analyze the zero-shot transfer capabilities of DML models, we consider representative approaches making use of the following concepts: (i) surrogate ranking tasks and tuple mining heuristics (Margin loss with distance-based sampling Wu et al. (2017) and Multisimilarity loss Wang et al. (2019)), (ii) geometric constraints or class proxies (ArcFace Deng et al. (2019), ProxyAnchor Kim et al. (2020)), (iii) learning of semantically diverse features (R-Margin Roth et al. (2020)), self-supervised training (DiVA Milbich et al. (2020b)) and adversarial regularization (Uniform Prior Sinha et al. (2020)), and (iv) knowledge self-distillation (S2SD Roth et al. (2020)). Fig. 2 (top) analyzes these methods with respect to their generalization under the varying degrees of distribution shift represented in ooDML. The top row shows absolute zero-shot retrieval performance measured by Recall@1 (results for mAP@1000 can be found in the supplementary) with respect to the FID between train and test sets. Additionally, Fig. 2 (bottom) examines the difference in performance to the mean performance over all methods for each train-test split. Based on these experiments, we make the following main observations:

(i) Performance deteriorates monotonically with the distribution shift. Independent of dataset, approach or evaluation metric, performance drops steadily as the distribution shift increases.


(ii) Relative performance differences are affected by train-test split difficulty. We see that the overall ranking between approaches oftentimes remains stable on the CARS196 and CUB200-2011 datasets. However, we also see that, particularly on the large-scale SOP dataset, the established proxy-based approaches ArcFace Deng et al. (2019) (which incorporates additional geometric constraints) and ProxyAnchor Kim et al. (2020) are surprisingly susceptible to more difficult distribution shifts. Both methods perform poorly compared to the more consistent general trend of the other approaches. Hence, conclusions about the generality of methods based solely on the default benchmarks need to be handled with care: at least for SOP, performance comparisons reported on single (e.g. the standard) data splits do not translate to more general train-test scenarios.
(iii) Conceptual differences matter at larger distribution shifts. While the ranking between most methods is largely consistent on CUB200-2011 and CARS196, their differences in performance become more prominent with increasing distribution shifts. The relative changes (deviation from the mean over all methods at each stage) depicted in Fig. 2 (bottom) clearly indicate that methods based on techniques such as self-supervision and feature diversity (DiVA, R-Margin) and self-distillation (S2SD) are among the best at generalizing to more challenging splits, while retaining strong performance on more i.i.d. splits as well.

While directly reporting performance as a function of the individual distribution shift offers a detailed overview, the overall comparison of approaches is typically based on single benchmark scores. To provide a single metric of comparison, we use the well-known Area-under-the-Curve (AUC) score to condense performance (based on either Recall@1 or mAP@1000) over all available distribution shifts into a single aggregated score indicating general zero-shot capabilities. This Aggregated Generalization Score (AGS) is computed over the FID scores of our splits normalized to the interval [0, 1]. As Recall@k and mAP@k scores are naturally bounded to [0, 1], AGS is similarly bounded to [0, 1], with higher values indicating a better model. The corresponding results are visualized in Fig. 3. Indeed, AGS reflects our observations from Fig. 2, with self-supervision (DiVA) and self-distillation (S2SD) generally performing best when facing unknown train-test shifts. Exact scores are provided in the supplementary.
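Concretely, AGS can be computed as in the following sketch (our own formulation of the aggregation described above; `fid_scores` and `recalls` stand for the per-split values shown in Fig. 2):

```python
import numpy as np

def aggregated_generalization_score(fid_scores, recalls):
    """AGS: area under the performance curve over FID-normalized split difficulty.

    fid_scores: FID of each train-test split, sorted in increasing order.
    recalls:    corresponding Recall@1 (or mAP@1000) values in [0, 1].
    """
    fid = np.asarray(fid_scores, dtype=float)
    r = np.asarray(recalls, dtype=float)
    x = (fid - fid.min()) / (fid.max() - fid.min())             # difficulty in [0, 1]
    return float(np.sum(0.5 * (r[1:] + r[:-1]) * np.diff(x)))   # trapezoidal AUC, bounded by [0, 1]
```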

Figure 3: Comparison of DML methods via AGS based on Recall@1 across benchmarks. To compute AGS (cf. Sec. D.1), we aggregate the performances from Fig. 2 across all train-test distribution shifts of our proposed ooDML benchmark using the Area-Under-the-Curve metric.

4.3 Consistency of structural representation properties on ooDML

Figure 4: Generalization metrics computed on the ooDML benchmark for CUB200-2011 and SOP. Each column plots one of the (normalized) measured structural representation properties (cf. Sec. 4.3) against the corresponding Recall@1 performance for all examined DML methods and distribution shifts. Trendlines are computed as least-squares fits over all datapoints (overall), respectively only those corresponding to the default splits (default).

Roth et al. Roth et al. (2020) attempt to identify potential drivers of generalization in DML by measuring the following structural properties of a representation φ: (i) the mean distance between the centers of the embedded samples of each class, (ii) the mean distance between the embedded samples within a class, (iii) the 'embedding space density', measured as the ratio of the mean intra-class to the mean inter-class distance, and (iv) the 'spectral decay', measuring the degree of uniformity of the singular values obtained by singular value decomposition of the training sample representations, which indicates the number of significant directions of variance. For a more detailed description, we refer to Roth et al. (2020). These metrics were empirically shown to exhibit a certain correlation with generalization performance on the default benchmark splits. In contrast, we are now interested in whether these observations hold when measuring generalization performance on the ooDML train-test splits of varying difficulty.
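A sketch of these four structural metrics (our own re-implementation from the textual description above; the exact normalizations and the direction of the KL divergence used for the spectral decay in Roth et al. (2020) may differ):

```python
import numpy as np

def structural_metrics(embeddings, labels):
    classes = np.unique(labels)
    centers = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])

    # (i) mean distance between class centers
    cd = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
    pi_inter = cd[np.triu_indices(len(classes), k=1)].mean()

    # (ii) mean intra-class distance, averaged over classes with at least two samples
    intra = []
    for c in classes:
        x = embeddings[labels == c]
        if len(x) > 1:
            d = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
            intra.append(d[np.triu_indices(len(x), k=1)].mean())
    pi_intra = float(np.mean(intra))

    # (iii) embedding space density as the ratio of intra- to inter-class distances
    pi_ratio = pi_intra / pi_inter

    # (iv) spectral decay: uniformity of the singular value spectrum of the
    # centered training representations, here via a KL divergence to uniform
    s = np.linalg.svd(embeddings - embeddings.mean(axis=0), compute_uv=False)
    p = s / s.sum()
    rho = float(np.sum(p * np.log(p * len(p) + 1e-12)))
    return pi_inter, pi_intra, pi_ratio, rho
```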

We visualize our results in Fig. 4 for CUB200-2011 and SOP, with CARS196 provided in the supplementary. For better visualization, we normalize all measured values of metrics (i)-(iv) and the Recall@1 performances to the interval [0, 1] for each train-test split. Thus, the relation between structural properties and generalization performance becomes comparable across all train-test splits, allowing us to examine whether superior generalization is still correlated with the structural properties of the learned representation φ, i.e. whether the correlation is independent of the underlying distribution shift. For a perfectly descriptive metric, one would expect a strong correlation between the normalized metric and the normalized generalization performance jointly across shifts. Unfortunately, our results show only a small indication of any structural metric being consistently correlated with generalization performance across varying distribution shifts. This is also observed when evaluating only against basic, purely discriminative DML objectives, as was done in Roth et al. (2020) for the default split, as well as when incorporating methods that extend and change the base DML training setup (such as DiVA Milbich et al. (2020b) or adversarial regularization Sinha et al. (2020)).

This not only demonstrates that experimental conclusions derived from the analysis of only a single benchmark split may not hold for zero-shot generalization in general, but also that future research should consider more general learning problems and difficulties to better understand the conceptual impact of various regulatory approaches. To this end, our benchmark protocol offers a more comprehensive experimental ground for future studies to find potential drivers of zero-shot generalization in DML.

4.4 Network capacity and pretrained representations

A common way to improve generalization, as also highlighted in Roth et al. (2020) and Musgrave et al. (2020), is to select a stronger backbone architecture for feature extraction. In this section, we look at how changes in network capacity influence OOD generalization across distribution shifts. Moreover, we also analyze the zero-shot performance of a diverse set of state-of-the-art pretraining approaches.

Influence of network capacity. In Fig. 5, we compare members of the ResNet architecture family He et al. (2016) of increasing capacity, each of which achieves increasingly higher performance on i.i.d. test benchmarks such as ImageNet Deng et al. (2009), going from a small ResNet18 (R18) over ResNet50 (R50) to ResNet101 (R101). While larger network capacity helps to some extent, we observe that performance saturates in zero-shot transfer settings, regardless of the DML approach and dataset (in particular also on the large-scale SOP dataset). Interestingly, we also observe that the performance drop with increasing distribution shift is consistent across network capacities, suggesting that zero-shot generalization is driven less by network capacity than by the conceptual choices of the learning formulation (compare Fig. 2).

Generic representations versus Deep Metric Learning.

Figure 5: Generalization performance of different backbone architectures for varying distribution shifts on CUB200-2011. We show absolute Recall@1 performance averaged over 5 runs for each train-test split. Other datasets show similar results and are provided in the supplementary.
Figure 6: Comparison of DML to various non-adapted generic representations pretrained on large amounts of unlabelled data using state-of-the-art architectures. For DML, we show the best and worst DML objectives based on the results in Fig. 2. The performance of the generic representations is heavily dependent on the target dataset, architecture, amount of training data and learning objective.
Figure 7: Few-shot adaptation of DML representations on CUB200-2011. Columns show average Recall@1 performance over 10 episodes of 2- and 5-shot adaptation for various DML approaches (few-shot and zero-shot), highlighting a substantial benefit of few-shot adaptation under a priori unknown distribution shifts (the black line highlights relative improvements).

Recently, self-supervised representation learning has taken great leaps, with ever stronger models trained on huge amounts of image Kolesnikov et al. (2020); Radford et al. (2021) and language data Devlin et al. (2019); Liu et al. (2019); Brown et al. (2020). These approaches are designed to learn expressive, well-transferring features, and methods like CLIP Radford et al. (2021) even prove surprisingly useful for zero-shot classification. We now evaluate and compare such representations against state-of-the-art DML models to understand whether generic representations that are readily available nowadays pose an actual alternative to explicitly applying DML. We select the state-of-the-art self-supervision model SwAV Caron et al. (2020) (ResNet50 backbone), CLIP Radford et al. (2021) trained via natural-language supervision on a large dataset of 400 million image-sentence pairs (VisionTransformer Dosovitskiy et al. (2021) backbone), BiT(-M) Kolesnikov et al. (2020), which trains a ResNet50-V2 Kolesnikov et al. (2020) on both the standard ImageNet Deng et al. (2009) (1 million training samples) and the ImageNet-21k dataset Deng et al. (2009); Ridnik et al. (2021) with 14 million training samples and over 21 thousand classes, an EfficientNet-B0 Tan and Le (2019) trained on ImageNet, and a standard baseline ResNet50 trained on ImageNet. We note that none of these representations has been additionally adapted to the benchmark sets; only the pretrained representations are evaluated, in contrast to the DML approaches, which have been trained on the respective train splits.
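Evaluating such frozen, non-adapted representations reduces to feature extraction followed by the retrieval metric from Sec. 4.1. A sketch (ours; shown only for the baseline torchvision ResNet50, assuming a recent torchvision version and a hypothetical `test_loader`):

```python
import torch
import torchvision.models as models

@torch.no_grad()
def frozen_features(backbone, loader, device="cuda"):
    """Extract penultimate-layer features of a frozen, pretrained backbone."""
    backbone.eval().to(device)
    feats, labels = [], []
    for images, targets in loader:
        feats.append(backbone(images.to(device)).cpu())
        labels.append(targets)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

# Baseline: ImageNet-pretrained ResNet50 with the classification head removed,
# so calling it returns the 2048-dimensional pooled features.
resnet = models.resnet50(weights="IMAGENET1K_V1")
resnet.fc = torch.nn.Identity()
# feats, labels = frozen_features(resnet, test_loader)   # test_loader: hypothetical DataLoader
# print(recall_at_k(feats, labels, k=1))                 # reuse the Recall@k sketch from Sec. 4.1
```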

The results presented in Fig. 6 show large performance differences between the pretrained representations, which are largely dependent on the test dataset. While BiT outperforms the DML state of the art on CUB200-2011 without any finetuning, it trails significantly behind the DML models on the other two datasets. On CARS196, only CLIP comes close to the DML approaches when the distribution shift is sufficiently large. Finally, on SOP, none of these models comes even close to the adapted DML methods. This shows that, although representations learned by extensive pretraining can offer strong zero-shot generalization, their performance heavily depends on the target dataset and the specific model. Furthermore, the generalization abilities notably depend on the size of the pretraining dataset (compare e.g. BiT-1k vs. BiT-21k or CLIP), which is significantly larger than the number of training images seen by the DML methods. We find that only actual training on these datasets provides sufficiently reliable performance.

4.5 Few-shot adaption boosts generalization performance in DML

Since distribution shifts can be arbitrarily large, the zero-shot transfer of φ can be ill-posed. Features learned on a training set will not transfer meaningfully to test samples once they are sufficiently far from the training distribution, as already indicated by Fig. 2. As a remedy, few-shot learning Snell et al. (2017); Triantafillou et al. (2017); Finn et al. (2017); Raghu et al. (2020); Lee et al. (2019); Chen et al. (2020); Tian et al. (2020b) assumes a few samples of the test distribution to be available for adjusting a previously learned representation. While these approaches are typically explicitly trained for fast adaptation to novel classes, we are interested in whether a similar adaptation of DML representations helps to bridge increasingly large distribution shifts.

To investigate this hypothesis, we follow the evaluation protocol of few-shot learning and use a few representatives (also referred to as shots) of each class from a test set as a support set for finetuning the penultimate embedding network layer. The remaining test samples then form the new test set for evaluating retrieval performance, also referred to as the query set. For evaluation, we perform multiple episodes, i.e. we repeat and average the adaptation of φ over different, randomly sampled support sets and their corresponding query sets. Independent of the DML model used for learning the original representation, adaptation to the support data is conducted using the Margin loss Wu et al. (2017) with distance-based sampling Wu et al. (2017), due to its fast convergence. This also ensures a fair comparison of the adaptation benefit across methods and allows us to adapt complex approaches like self-supervision (DiVA Milbich et al. (2020b)) to the small number of samples in the support sets.
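The episodic protocol can be sketched as follows (our own simplification; `finetune_margin_loss` and `adapted.embed` are hypothetical placeholders for the Margin-loss finetuning of the penultimate layer and the feature extraction, and `recall_at_k` refers to the sketch in Sec. 4.1):

```python
import numpy as np

def sample_episode(test_labels, n_shots, rng):
    """Split a test set into support indices (n_shots per class, used for finetuning)
    and query indices (the remaining samples, used for retrieval evaluation)."""
    support_idx, query_idx = [], []
    for c in np.unique(test_labels):
        idx = rng.permutation(np.where(test_labels == c)[0])
        support_idx.extend(idx[:n_shots])
        query_idx.extend(idx[n_shots:])
    return np.array(support_idx), np.array(query_idx)

def few_shot_evaluation(model, test_images, test_labels, n_shots=2, n_episodes=10, seed=0):
    rng = np.random.default_rng(seed)
    recalls = []
    for _ in range(n_episodes):
        sup, qry = sample_episode(test_labels, n_shots, rng)
        adapted = finetune_margin_loss(model, test_images[sup], test_labels[sup])  # hypothetical helper
        emb = adapted.embed(test_images[qry])                                      # hypothetical helper
        recalls.append(recall_at_k(emb, test_labels[qry], k=1))
    return float(np.mean(recalls))
```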

Fig. 7 shows 2- and 5-shot results on CUB200-2011, with CARS196 available in the supplementary. SOP is not considered since each class already consists of only a small number of samples. As we see, even very limited in-domain data can significantly improve generalization performance, with the benefit becoming stronger for larger distribution shifts. Moreover, we observe that weaker approaches like ArcFace Deng et al. (2019) seem to benefit more than state-of-the-art methods like S2SD Roth et al. (2020) or DiVA Milbich et al. (2020b). We presume this is because the underlying concepts of the latter already encourage learning of more robust and general features. To conclude, few-shot learning provides a substantial benefit when facing difficult OOD learning settings, and offers an especially strong starting point for reliable improvement in settings where the shift is not known a priori.

5 Conclusion

In this work we analyzed zero-shot transfer of image representations learned by Deep Metric Learning (DML) models. We proposed a systematic construction of train-test data splits of increasing difficulty, as opposed to standard evaluation protocols that test out-of-distribution generalization only on single data splits of fixed difficulty. Based on this, we presented the novel benchmark ooDML and thoroughly assessed current DML methods. Our study reveals the following main findings:

Standard evaluation protocols are insufficient to probe general out-of-distribution transfer: Prevailing train-test splits in DML are often close to i.i.d. evaluation settings. Hence, they only provide limited insights into the impact of train-test distribution shift on generalization performance. Our benchmark ooDML alleviates this issue by evaluating a large, controllable and measurable spectrum of problem difficulty to facilitate future research.

Larger distribution shifts show the impact of conceptual differences in DML approaches: Our study reveals that generalization performance degrades consistently with increasing problem difficulty for all DML methods. However, certain concepts underlying the approaches are shown to be more robust to shifts than others, such as semantic feature diversity and knowledge-distillation.

Generic, self-supervised representations without finetuning can surpass dedicated data adaptation: When facing large distribution shifts, representations learned only by self-supervision on large amounts of unlabelled data are competitive with explicit DML training without any finetuning. However, their performance is heavily dependent on the data distribution and the models themselves.

Few-shot adaptation consistently improves out-of-distribution generalization in DML: Even very few examples from a target data distribution effectively help to adapt DML representations. The benefit becomes even more prominent with increasing train-test distribution shifts, and encourages further research into few-shot adaptation in DML.

Acknowledgements

The research has been funded by the German Federal Ministry for Economic Affairs and Energy within the project "KI-Absicherung – Safe AI for automated driving" and by the German Research Foundation (DFG) within project 421703927. Moreover, it was funded in part by a CIFAR AI Chair at the Vector Institute, Microsoft Research, and an NSERC Discovery Grant. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute (www.vectorinstitute.ai/#partners). We thank the International Max Planck Research School for Intelligent Systems (IMPRS-IS) for supporting K.R; K.R. acknowledges his membership in the European Laboratory for Learning and Intelligent Systems (ELLIS) PhD program.

References

  • B. Brattoli, J. Tighe, F. Zhdanov, P. Perona, and K. Chalupka (2020) Rethinking zero-shot video classification: end-to-end training for realistic applications. Cited by: §1, §2.
  • T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 1877–1901. External Links: Link Cited by: §4.4.
  • M. Caron, I. Misra, J. Mairal, P. Goyal, P. Bojanowski, and A. Joulin (2020) Unsupervised learning of visual features by contrasting cluster assignments. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 9912–9924. External Links: Link Cited by: Appendix B, §4.4.
  • T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. pp. 1597–1607. External Links: Link Cited by: §1.
  • W. Chen, X. Chen, J. Zhang, and K. Huang (2017) Beyond triplet loss: a deep quadruplet network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • Y. Chen, X. Wang, Z. Liu, H. Xu, and T. Darrell (2020) A new meta-baseline for few-shot learning. CoRR abs/2003.04390. External Links: Link, 2003.04390 Cited by: §4.5.
  • J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei (2009) ImageNet: A Large-Scale Hierarchical Image Database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.4, §4.4.
  • J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2019) ArcFace: additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table S3, §1, §1, §2, §3, §4.2, §4.2, §4.5.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §4.4.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. External Links: Link Cited by: §4.4.
  • Y. Duan, W. Zheng, X. Lin, J. Lu, and J. Zhou (2018) Deep adversarial metric learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
  • L. Engstrom, B. Tran, D. Tsipras, L. Schmidt, and A. Madry (2019) Exploring the landscape of spatial robustness. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 1802–1811. External Links: Link Cited by: §1.
  • I. Fehervari, A. Ravichandran, and S. Appalaraju (2019) Unbiased evaluation of deep metric learning algorithms. External Links: 1911.12528 Cited by: §4.
  • C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, pp. 1126–1135. External Links: Link Cited by: §4.5.
  • W. Ge (2018) Deep metric learning with hierarchical triplet loss. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 269–285. Cited by: §2.
  • J. Goldberger, G. E. Hinton, S. Roweis, and R. R. Salakhutdinov (2005) Neighbourhood components analysis. In Advances in Neural Information Processing Systems, L. Saul, Y. Weiss, and L. Bottou (Eds.), Vol. 17, pp. . External Links: Link Cited by: §2.
  • R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • B. Harwood, B. Kumar, G. Carneiro, I. Reid, T. Drummond, et al. (2017) Smart mining for deep metric learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2821–2829. Cited by: §2.
  • K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2020) Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.4.
  • D. Hendrycks and T. Dietterich (2019) Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations, External Links: Link Cited by: §1.
  • A. Hermans, L. Beyer, and B. Leibe (2017) In Defense of the Triplet Loss for Person Re-Identification. arXiv e-prints, pp. arXiv:1703.07737. External Links: 1703.07737 Cited by: §2.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local nash equilibrium. External Links: 1706.08500 Cited by: §1, §3.1.
  • P. Jacob, D. Picard, A. Histace, and E. Klein (2019) Metric learning with horde: high-order regularizer for deep embeddings. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2, §3.2, §3.
  • H. Jegou, M. Douze, and C. Schmid (2011) Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence 33 (1), pp. 117–128. Cited by: §4.1.
  • S. Kim, D. Kim, M. Cho, and S. Kwak (2020) Proxy anchor loss for deep metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Appendix B, Table S3, §1, §1, §2, §3.1, §4.2, §4.2.
  • W. Kim, B. Goyal, K. Chawla, J. Lee, and K. Kwon (2018) Attention-based ensemble for deep metric learning. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §2.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. Cited by: Appendix B.
  • P. W. Koh, S. Sagawa, H. Marklund, S. M. Xie, M. Zhang, A. Balsubramani, W. Hu, M. Yasunaga, R. L. Phillips, I. Gao, T. Lee, E. David, I. Stavness, W. Guo, B. A. Earnshaw, I. S. Haque, S. Beery, J. Leskovec, A. Kundaje, E. Pierson, S. Levine, C. Finn, and P. Liang (2021) WILDS: a benchmark of in-the-wild distribution shifts. External Links: 2012.07421 Cited by: §1, §2.
  • A. Kolesnikov, L. Beyer, X. Zhai, J. Puigcerver, J. Yung, S. Gelly, and N. Houlsby (2020) Big transfer (bit): general visual representation learning. Cham, pp. 491–507. External Links: ISBN 978-3-030-58558-7 Cited by: §4.4.
  • J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 554–561. Cited by: Appendix B, §1, §4.1, footnote 1.
  • D. Krueger, E. Caballero, J. Jacobsen, A. Zhang, J. Binas, D. Zhang, R. L. Priol, and A. Courville (2021) Out-of-distribution generalization via risk extrapolation (rex). External Links: 2003.00688 Cited by: §1.
  • K. Lee, S. Maji, A. Ravichandran, and S. Soatto (2019) Meta-learning with differentiable convex optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.5.
  • X. Lin, Y. Duan, Q. Dong, J. Lu, and J. Zhou (2018) Deep variational metric learning. In The European Conference on Computer Vision (ECCV), Cited by: §1, §2, §3.3.
  • W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017) SphereFace: deep hypersphere embedding for face recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized bert pretraining approach. Note: cite arxiv:1907.11692 External Links: Link Cited by: §4.4.
  • S. Marcel and Y. Rodriguez (2010) Torchvision: the machine-vision package of Torch. In Proceedings of the 18th ACM International Conference on Multimedia, MM '10, New York, NY, USA, pp. 1485–1488. External Links: ISBN 9781605589336, Link, Document Cited by: Appendix B.
  • L. McInnes, J. Healy, N. Saul, and L. Grossberger (2018) UMAP: uniform manifold approximation and projection. The Journal of Open Source Software 3 (29), pp. 861. Cited by: Appendix B.
  • T. Milbich, M. Bautista, E. Sutter, and B. Ommer (2017) Unsupervised video understanding by reconciliation of posture similarities. In Proceedings of the International Conference on Computer Vision (ICCV), Cited by: §1.
  • T. Milbich, O. Ghori, F. Diego, and B. Ommer (2020a) Unsupervised representation learning by discovering reliable image relations. Pattern Recognition (PR). Cited by: §2.
  • T. Milbich, K. Roth, H. Bharadhwaj, S. Sinha, Y. Bengio, B. Ommer, and J. P. Cohen (2020b) DiVA: diverse visual feature aggregation for deep metric learning. In Computer Vision – ECCV 2020, A. Vedaldi, H. Bischof, T. Brox, and J. Frahm (Eds.), Cham, pp. 590–607. External Links: ISBN 978-3-030-58598-3 Cited by: Appendix B, Table S3, §1, §1, §2, §3.1, §3.2, §3.3, §4.2, §4.3, §4.5, §4.5.
  • T. Milbich, K. Roth, B. Brattoli, and B. Ommer (2020c) Sharing matters for generalization in deep metric learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2, §4.
  • I. Misra and L. v. d. Maaten (2020) Self-supervised learning of pretext-invariant representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh (2017) No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pp. 360–368. Cited by: §2.
  • K. Musgrave, S. Belongie, and S. Lim (2020) A metric learning reality check. External Links: 2003.08505 Cited by: Appendix B, §1, §2, §3.2, §3, §4.1, §4.4, §4.
  • H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012. Cited by: Appendix B, §1, §2, §4.1.
  • M. Opitz, G. Waltner, H. Possegger, and H. Bischof (2017) Bier-boosting independent embeddings robustly. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5189–5198. Cited by: §2.
  • M. Opitz, G. Waltner, H. Possegger, and H. Bischof (2018) Deep metric learning with bier: boosting independent embeddings robustly. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.
  • A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS-W, Cited by: Appendix B.
  • Q. Qian, L. Shang, B. Sun, J. Hu, H. Li, and R. Jin (2019) SoftTriple loss: deep metric learning without triplet sampling. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §2.
  • A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever (2021) Learning transferable visual models from natural language supervision. CoRR abs/2103.00020. External Links: Link, 2103.00020 Cited by: Appendix B, §4.4.
  • A. Raghu, M. Raghu, S. Bengio, and O. Vinyals (2020) Rapid learning or feature reuse? towards understanding the effectiveness of maml. In International Conference on Learning Representations, External Links: Link Cited by: §4.5.
  • B. Recht, R. Roelofs, L. Schmidt, and V. Shankar (2019) Do ImageNet classifiers generalize to ImageNet?. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 5389–5400. External Links: Link Cited by: §1.
  • T. Ridnik, E. B. Baruch, A. Noy, and L. Zelnik-Manor (2021) ImageNet-21k pretraining for the masses. CoRR abs/2104.10972. External Links: Link, 2104.10972 Cited by: §4.4.
  • K. Roth, B. Brattoli, and B. Ommer (2019) MIC: mining interclass characteristics for improved metric learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8000–8009. Cited by: §1, §2, §2, §3.1, §3.3.
  • K. Roth, T. Milbich, B. Ommer, J. P. Cohen, and M. Ghassemi (2020) S2SD: simultaneous similarity-based self-distillation for deep metric learning. CoRR abs/2009.08348. External Links: Link, 2009.08348 Cited by: Appendix B, §1, §2, §4.2, §4.5.
  • K. Roth, T. Milbich, and B. Ommer (2020) PADS: policy-adapted sampling for visual similarity learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • K. Roth, T. Milbich, S. Sinha, P. Gupta, B. Ommer, and J. P. Cohen (2020) Revisiting training strategies and generalization performance in deep metric learning. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 8242–8252. External Links: Link Cited by: Appendix B, Table S3, §1, §1, §2, §3.1, §3.2, §3, §4.1, §4.1, §4.2, §4.3, §4.3, §4.4, §4.
  • A. Sanakoyeu, V. Tschernezki, U. Buchler, and B. Ommer (2019) Divide and conquer the embedding space for metric learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §3.1.
  • F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §1, §2, §3.
  • J. Seidenschwarz, I. Elezi, and L. Leal-Taixé (2021) Learning intra-batch connections for deep metric learning. External Links: 2102.07753 Cited by: §1, §3.1.
  • S. Sinha, K. Roth, A. Goyal, M. Ghassemi, H. Larochelle, and A. Garg (2020) Uniform priors for data-efficient transfer. External Links: 2006.16524 Cited by: §1, §4.2, §4.3.
  • J. Snell, K. Swersky, and R. Zemel (2017) Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30, pp. . External Links: Link Cited by: §4.5.
  • K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pp. 1857–1865. Cited by: §1, §2.
  • J. L. Suárez, S. García, and F. Herrera (2018) A tutorial on distance metric learning: mathematical foundations, algorithms and experiments. External Links: 1812.05944 Cited by: §2.
  • M. Tan and Q. Le (2019) EfficientNet: rethinking model scaling for convolutional neural networks. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, pp. 6105–6114. External Links: Link Cited by: §4.4.
  • E. W. Teh, T. DeVries, and G. W. Taylor (2020) ProxyNCA++: revisiting and revitalizing proxy neighborhood component analysis. Cham, pp. 448–464. External Links: ISBN 978-3-030-58586-0 Cited by: §1, §2, §3.1.
  • Y. Tian, D. Krishnan, and P. Isola (2020a) Contrastive representation distillation. External Links: Link Cited by: §1.
  • Y. Tian, Y. Wang, D. Krishnan, J. B. Tenenbaum, and P. Isola (2020b) Rethinking few-shot image classification: a good embedding is all you need?. Cham, pp. 266–282. External Links: ISBN 978-3-030-58568-6 Cited by: §4.5.
  • E. Triantafillou, R. Zemel, and R. Urtasun (2017) Few-shot learning through an information retrieval lens. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30, pp. . External Links: Link Cited by: §4.5.
  • C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical report Technical Report CNS-TR-2011-001, California Institute of Technology. Cited by: Appendix B, §1, §4.1, footnote 1.
  • J. Wang, F. Zhou, S. Wen, X. Liu, and Y. Lin (2017) Deep metric learning with angular loss. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2593–2601. Cited by: §2.
  • X. Wang, X. Han, W. Huang, D. Dong, and M. R. Scott (2019) Multi-similarity loss with general pair weighting for deep metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Table S3, §1, §2, §3.2, §4.2.
  • R. Wightman (2019) PyTorch image models. GitHub. Note: https://github.com/rwightman/pytorch-image-models External Links: Document Cited by: Appendix B.
  • C. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl (2017) Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2840–2848. Cited by: Appendix B, Table S3, §1, §1, §2, §3.1, §3.2, §3, §4.2, §4.5.
  • H. Xuan, A. Stylianou, and R. Pless (2020) Improved embeddings with easy positive triplet mining. Cited by: §2.
  • A. Zhai and H. Wu (2018) Classification is a strong baseline for deep metric learning. External Links: 1811.12649 Cited by: §2.
  • W. Zheng, Z. Chen, J. Lu, and J. Zhou (2019) Hardness-aware deep metric learning. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §2.

Appendix A Analyzing the model bias for selecting train-test splits

Figure S1: Normalized FID progression for ooDML train-test splits using different training models and networks. Values are normalized for comparability of the FID progressions, as FID scores are not upper bounded and, as such, absolute values for different networks and pretraining methods differ.
Figure S2: UMAPs for easy (split-id 1), medium (split-id 5) and hard (split-id 8/9) for all benchmarks using a ResNet50 backbone pretrained on ImageNet. As can be seen, our iterative class swapping (and removal) procedure (cf. Sec. 3.2 main paper) creates splits in which training and test distributions become increasingly disjoint. Note that while we shifted full classes for semantic consistency, each point corresponds to a single sample (for SOP, random subsampling to 20000 total points was performed).

To analyze the impact of the network architecture, pretraining method and training data, i.e., the learned feature representations, on the construction of train-test splits and the resulting difficulties, we repeat the class swapping and removal procedure introduced in Section 3 of the main paper using different self-supervised models. Subsequently, we select train-test splits from the same iteration steps. Fig. S1 compares the progression of distribution shifts based on FID scores, normalized to a common interval for valid comparison. We observe that across all pretrained models, the general FID progressions and sampled train-test splits exhibit very similar learning problem difficulties, indicating that our sampling procedure is robust to the choice of readily available, state-of-the-art self-supervised pretrained models.
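To make this normalization concrete, the following minimal sketch min-max normalizes FID progressions from different backbones to a common [0, 1] range so that only the shape of each progression is compared. The first progression is the CUB200-2011 row of Tab. S1; the second backbone and its values are purely hypothetical.

```python
import numpy as np

def normalize_progression(fid_values):
    """Min-max normalize a FID progression to [0, 1].

    FID is not upper bounded, so absolute values are not comparable across
    backbones; after normalization only the shape of the progression matters.
    """
    fid = np.asarray(fid_values, dtype=float)
    return (fid - fid.min()) / (fid.max() - fid.min())

# CUB200-2011 progression from Tab. S1; the second backbone's values are hypothetical.
fid_resnet50_imagenet = [19.2, 28.5, 52.6, 72.2, 92.5, 120.4, 136.5, 152.0, 173.9]
fid_other_backbone = [12.1, 20.3, 41.0, 60.5, 78.2, 101.7, 118.9, 131.4, 150.2]

for name, values in [("ResNet50 (ImageNet)", fid_resnet50_imagenet),
                     ("other backbone", fid_other_backbone)]:
    print(name, np.round(normalize_progression(values), 3))
```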

Appendix B Further Details regarding the Experimental Setup

Datasets.

In total, we utilized three widely used Deep Metric Learning benchmarks: (1) CUB200-2011 Wah et al. (2011), which comprises a total of 11,788 images over 200 classes of birds, (2) CARS196 Krause et al. (2013), containing 16,185 images of cars distributed over 196 classes, and (3) Stanford Online Products (SOP) Oh Song et al. (2016), which provides 120,053 product images over 22,634 total classes. For CUB200-2011 and CARS196, default splits are generated by simply selecting the last half of the alphabetically sorted classes as test classes, whereas SOP provides a predefined split with 11,318 training and 11,316 test classes.
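For illustration, a minimal sketch of how such a default class-disjoint split can be constructed, assuming a dataset directory with one subfolder per class; the path is a placeholder.

```python
import os

def default_split(data_root):
    """Default DML split: alphabetically sorted classes,
    first half -> train, second half -> test (class-disjoint)."""
    classes = sorted(os.listdir(data_root))  # one subfolder per class
    half = len(classes) // 2
    return classes[:half], classes[half:]

# Hypothetical dataset location; CUB200-2011 ships one folder per bird class.
train_classes, test_classes = default_split("/path/to/CUB_200_2011/images")
print(len(train_classes), "train classes,", len(test_classes), "test classes")
```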

Training details.

For our implementation, we leveraged the PyTorch framework Paszke et al. (2017). During training, images were randomly resized and cropped, whereas for testing, images were resized and center cropped. Optimization was performed with Adam Kingma and Ba (2015) using a fixed learning rate and weight decay. Batch sizes were chosen within the range of [86, 112], depending on the size of the utilized backbone network. For default DML ResNet-architectures, we follow previous literature Wu et al. (2017); Kim et al. (2020); Roth et al. (2020); Musgrave et al. (2020) and freeze Batch-Normalization layers during training. We consistently use the same embedding dimensionality across methods for comparability. For DiVA Milbich et al. (2020b), S2SD Roth et al. (2020) and ProxyAnchor Kim et al. (2020), parameter choices were set to the default values given in the original publications, with small grid searches performed to adapt to backbone changes. For all remaining objectives, parameter choices were adopted from Roth et al. (2020), who provide a hyperparameter selection for best comparability of methods. All experiments were performed on GPU servers containing NVIDIA P100, T4 and Titan X GPUs, with results always averaged over multiple seeds: for our objective study, five random seeds were utilized, whereas for other ablation-type studies at least three seeds were utilized. These settings are used throughout our study. For the few-shot experiments, the same pipeline parameters were utilized, with changes noted in the respective section.
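The following sketch illustrates the general backbone setup described above (frozen Batch-Normalization layers, Adam optimization of a ResNet50 with a linear embedding head). The learning rate, weight decay and embedding dimensionality shown here are placeholders for illustration only, not the values used in our experiments.

```python
import torch
import torch.nn as nn
import torchvision

# ResNet50 backbone with a linear embedding head (embedding size is a placeholder).
backbone = torchvision.models.resnet50(pretrained=True)
backbone.fc = nn.Linear(backbone.fc.in_features, 128)

def freeze_batchnorm(model):
    """Put all BatchNorm layers into eval mode and stop training their affine
    parameters. Note: model.train() resets the mode, so this has to be
    re-applied (or train() overridden) before every training epoch."""
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            module.eval()
            module.weight.requires_grad_(False)
            module.bias.requires_grad_(False)

freeze_batchnorm(backbone)

# Adam over the remaining trainable parameters; lr and weight decay are placeholders.
optimizer = torch.optim.Adam(
    (p for p in backbone.parameters() if p.requires_grad),
    lr=1e-5,
    weight_decay=4e-4,
)
```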


Pretrained network weights for ResNet-architectures were taken from torchvision Marcel and Rodriguez (2010), EfficientNet and BiT weights from timm Wightman (2019), and SwAV and CLIP pretrained weights from the respective official repositories (Caron et al. (2020) and Radford et al. (2021)).
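A minimal sketch of how such weights can be loaded; the exact timm model identifiers are assumptions and may differ across timm versions.

```python
import torchvision
import timm

resnet50 = torchvision.models.resnet50(pretrained=True)               # torchvision weights
efficientnet = timm.create_model("efficientnet_b0", pretrained=True)  # timm weights
bit = timm.create_model("resnetv2_50x1_bitm", pretrained=True)        # timm weights (BiT)
# SwAV and CLIP weights are loaded from their respective official repositories.
```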

FID scores between ooDML data splits.

In Tab. S1 we report the measured FID scores between the train and test sets of each ooDML split for the CUB200-2011, CARS196 and SOP benchmarks, respectively.

Dataset split-ID 1 2 3 4 5 6 7 8 9
CUB200-2011 19.2 28.5 52.6 72.2 92.5 120.4 136.5 152.0 173.9
CARS196 8.6 14.3 32.2 43.6 63.3 86.5 101.2 123.0 -
SOP 3.4 24.6 53.5 99.4 135.5 155.3 189.8 235.1 -
Table S1: FID scores between the train and test sets of each split in our ooDML benchmark. For details on how the train-test splits constituting ooDML are created, please see Sec. 3 of the main paper.
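For reference, a generic FID computation between two feature sets looks as follows; this sketch follows the standard Fréchet distance formula and is not necessarily identical to the estimator implementation used for Tab. S1.

```python
import numpy as np
from scipy import linalg

def fid(features_a, features_b):
    """Frechet distance between two feature sets of shape (n_samples, dim).

    FID = ||mu_a - mu_b||^2 + Tr(C_a + C_b - 2 (C_a C_b)^{1/2})
    """
    mu_a, mu_b = features_a.mean(0), features_b.mean(0)
    cov_a = np.cov(features_a, rowvar=False)
    cov_b = np.cov(features_b, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)
    if np.iscomplexobj(covmean):  # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

# Toy example with random features; in the benchmark, features come from a
# pretrained backbone applied to the train and test images of a split.
rng = np.random.default_rng(0)
print(fid(rng.normal(size=(500, 64)), rng.normal(loc=0.5, size=(500, 64))))
```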

Qualitative introspection of ooDML train-test splits using UMAP.

Fig. S2 visualizes the distribution shift between the train and test sets of splits from our proposed ooDML benchmark using the UMAP McInnes et al. (2018) algorithm. For each dataset, we show examples of an easy, medium and hard train-test split. Indeed, the distribution shift from train to test data increases consistently, as indicated by our monotonically increasing FID progressions.
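A minimal sketch of how such a UMAP visualization can be produced from pre-extracted backbone features; the file paths are placeholders and the UMAP parameters are library defaults rather than tuned values.

```python
import numpy as np
import matplotlib.pyplot as plt
import umap  # from the umap-learn package

# Pre-extracted backbone features for one train-test split (placeholder paths).
train_feats = np.load("features_train_split5.npy")
test_feats = np.load("features_test_split5.npy")

features = np.concatenate([train_feats, test_feats], axis=0)
is_test = np.array([0] * len(train_feats) + [1] * len(test_feats))

# 2D UMAP embedding of all samples of the split.
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(features)

plt.scatter(embedding[:, 0], embedding[:, 1], c=is_test, s=2, cmap="coolwarm")
plt.title("Train vs. test samples of one ooDML split")
plt.show()
```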

Appendix C On the limits of OOD generalization in Deep Metric Learning

Figure S3: Out-of-Domain Generalization. Each plot showcases transfer performance from the training dataset (source) to a test dataset from a novel domain (test). The dashed line represents baseline performance achieved by ResNet50 pretrained on ImageNet.
Direction (train→test) CUB→CARS CARS→CUB CUB→SOP SOP→CUB CARS→SOP SOP→CARS Max. ooDML
FID 349 359 359 370 386 376 155 (235)
Table S2: FID scores between training and test sets across different datasets, compared to the highest FID measured among our generated train-test splits.

To investigate how well representations learned by DML approaches transfer across benchmark datasets, we train our models on the default training set of one benchmark and evaluate them on the default test set of another. Tab. S2 first lists the FID scores for all pairwise combinations of the CUB200-2011, CARS196 and SOP datasets. We find that all FID scores far exceed those of the learning problems previously considered in our ooDML benchmark. However, the fact that the FID scores are relatively close to one another despite large semantic differences between the datasets may indicate that our utilized FID estimator (Sec. 3) has reached its limit as an indicator of distribution shift and is no longer sufficiently sensitive. Fig. S3 summarizes the generalization performance of different DML approaches in this experimental setting. As can be seen, there are only a few cases in which DML training offers a benefit over the ResNet50 ImageNet baseline, indicating that generalization of DML approaches is primarily limited to shifts within a data domain. Beyond these limits, generic representations learned by self-supervised learning may offer better zero-shot generalization, as also discussed in Sec. 4.4.
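For completeness, a minimal sketch of the retrieval metric underlying this cross-domain comparison: Recall@1 via a nearest-neighbour lookup over the embedded test set. The L2 normalization and Euclidean distance used here are assumptions of this sketch, not a statement about our exact evaluation code.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def recall_at_1(embeddings, labels):
    """Fraction of queries whose nearest neighbour (excluding themselves)
    shares the query's class label."""
    labels = np.asarray(labels)
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    _, idx = NearestNeighbors(n_neighbors=2).fit(embeddings).kneighbors(embeddings)
    return float(np.mean(labels[idx[:, 1]] == labels))  # idx[:, 0] is the query itself

# Cross-domain evaluation: embed the test set of another benchmark both with a
# DML model trained on the source benchmark and with the raw ImageNet backbone,
# then compare their Recall@1 scores.
```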

Appendix D Additional Experimental Results

D.1 Zero-shot generalization under varying distribution shifts

This section provides additional results for the experiments presented in Sec. 4.2 of the main paper. To this end, we provide the exact performance values used to visualize Fig. 2 in the main paper in Tab. S4-S6. For the comparison based on the Aggregated Generalization Score (AGS) introduced in Sec. 4.2 of the main paper, Tab. S3 provides the empirical results for AGS computed on both Recall@1 and mAP@1000. For the latter, Fig. S4 summarizes the AGS results using a bar plot similar to Fig. 3 in the main paper.
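Since the exact AGS definition is given in Sec. 4.2 of the main paper, the following is only a schematic sketch of an AUC-style aggregation of per-split performance over the covered FID range; the Recall@1 values used are illustrative and not taken from our tables.

```python
import numpy as np

def aggregated_generalization_score(fid_scores, performances):
    """Schematic AUC-style aggregation: area under the performance-vs-FID
    curve (trapezoidal rule), normalized by the covered FID range."""
    fid = np.asarray(fid_scores, dtype=float)
    perf = np.asarray(performances, dtype=float)
    order = np.argsort(fid)
    fid, perf = fid[order], perf[order]
    area = np.sum(0.5 * (perf[1:] + perf[:-1]) * np.diff(fid))
    return float(area / (fid[-1] - fid[0]))

# FID values of the CUB200-2011 splits (Tab. S1) with illustrative Recall@1 scores.
fid_cub = [19.2, 28.5, 52.6, 72.2, 92.5, 120.4, 136.5, 152.0, 173.9]
recall_at_1 = [0.66, 0.64, 0.60, 0.57, 0.53, 0.49, 0.46, 0.44, 0.41]
print(aggregated_generalization_score(fid_cub, recall_at_1))
```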

Figure S4: Comparison of DML methods via AGS based on mAP@1000 across benchmarks. To compute AGS (cf. Sec. 4.2 main paper), we aggregate the mAP@1000 performances in Tab. S4-S6 across all train-test distribution shifts of our proposed ooDML benchmark using the Area-Under-the-Curve metric.
Benchmark CUB200-2011 CARS196 SOP
Approaches AUC R@1 mAP@1000 R@1 mAP@1000 R@1 mAP@1000
Margin (D) Wu et al. (2017)
Multisimilarity Wang et al. (2019)
ArcFace Deng et al. (2019)
ProxyAnchor Kim et al. (2020)
R-Margin Roth et al. (2020)
Uniform Prior
DiVA Milbich et al. (2020b)
S2SD
Table S3: Results for the Aggregated Generalization Score (AGS) (cf. Sec. 4.2 main paper) based on Recall@1 and mAP@1000, computed on the ooDML benchmark. We show results for various DML methods averaged over multiple runs.
Metric Method | Split (FID) 1 (19.2) 2 (28.5) 3 (52.6) 4 (72.2) 5 (92.5) 6 (120.4) 7 (136.5) 8 (152.0) 9 (173.9)
R@1 Margin (D)
Multisimilarity
ArcFace
ProxyAnchor
R-Margin (D)
Uniform Prior
S2SD
DiVA
mAP@1000 Margin (D)
Multisimilarity
ArcFace
ProxyAnchor
R-Margin (D)
Uniform Prior
S2SD
DiVA
Table S4: DML generalization performance measured by Recall@1 and mAP@1000 on each train-test split of our ooDML benchmark for the CUB200-2011 dataset.
Metric Method | Split (FID) 1 (8.6) 2 (14.3) 3 (32.2) 4 (43.6) 5 (63.3) 6 (86.5) 7 (101.2) 8 (123.0)
R@1 Margin (D)
Multisimilarity
ArcFace
ProxyAnchor
R-Margin (D)
Uniform Prior
S2SD
DiVA
mAP@1000 Margin (D)
Multisimilarity
ArcFace
ProxyAnchor
R-Margin (D)
Uniform Prior
S2SD
DiVA
Table S5: DML generalization performance measured by Recall@1 and mAP@1000 on each train-test split of our ooDML benchmark for the CARS196 dataset.
Metric Method | Split (FID) 1 (3.4) 2 (24.6) 3 (53.5) 4 (99.4) 5 (135.5) 6 (155.3) 7 (189.8) 8 (235.1)
R@1 Margin (D)
Multisimilarity
ArcFace
ProxyAnchor
R-Margin (D)
Uniform Prior
S2SD
DiVA
mAP@1000 Margin (D)
Multisimilarity
ArcFace
ProxyAnchor
R-Margin (D)
Uniform Prior
S2SD
DiVA
Table S6: DML generalization performance measured by Recall@1 and mAP@1000 on each train-test split of our ooDML benchmark for the SOP dataset.

D.2 Influence of network capacity

In Fig. S5 we present all results of our study on the influence of network capacity from Sec. 4.4 in the main paper, in particular also for the remaining datasets CARS196 and SOP. Additionally, we show the difference in performance against the mean over all methods for each train-test split ("change against mean"). As already discussed in Sec. 4.4 in the main paper, these experiments similarly show that network capacity has only a limited impact on OOD generalization, with benefits eventually saturating.
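A minimal sketch of the "change against mean" computation, under the assumption that the mean is taken over the compared backbones per split; the Recall@1 values below are illustrative only and not taken from Fig. S5.

```python
import numpy as np

# Illustrative Recall@1 values per backbone and train-test split.
recall = {
    "resnet50": [0.63, 0.58, 0.52, 0.47, 0.42, 0.38],
    "resnet101": [0.64, 0.59, 0.53, 0.48, 0.43, 0.38],
    "efficientnet_b4": [0.65, 0.60, 0.53, 0.48, 0.42, 0.37],
}

matrix = np.array(list(recall.values()))            # backbones x splits
change_against_mean = matrix - matrix.mean(axis=0)  # deviation from the per-split mean

for name, delta in zip(recall, change_against_mean):
    print(name, np.round(delta, 3))
```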

Figure S5: Generalization performance of different backbone architectures under varying distribution shifts on the full ooDML benchmark (CUB200-2011, CARS196, SOP). To reduce the computational load, we only utilized two thirds of the studied splits. We show absolute Recall@1 performance averaged over 5 runs for each train-test split.

D.3 Measuring structural representation properties on ooDML

This section extends the results presented in Sec. 4.3 of the main paper. We show results for all datasets, i.e. CUB200-2011, CARS196 and SOP, and for all metrics measuring structural representation properties discussed in Sec. 4.3. We analyze the correlation of these metrics with generalization performance based on both Recall@1 (Fig. S6) and mAP@1000 (Fig. S7). As discussed in the main paper, independent of the underlying performance metric, none of the structural representation properties shows a consistent correlation with generalization performance across all datasets, suggesting further research into meaningful latent space properties that can be linked to zero-shot generalization independent of the chosen objectives and shifts.
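The trendlines in Figs. S6 and S7 are least-squares fits; the sketch below shows one way to compute such a fit together with a Pearson correlation coefficient. Whether Pearson correlation is the summary statistic of choice is an assumption of this sketch, and all values are illustrative.

```python
import numpy as np
from scipy import stats

def trend_and_correlation(property_values, performances):
    """Least-squares trendline (slope, intercept) and Pearson correlation
    between a structural representation property and a performance metric."""
    slope, intercept = np.polyfit(property_values, performances, deg=1)
    r, p_value = stats.pearsonr(property_values, performances)
    return slope, intercept, r, p_value

# Illustrative values: one (normalized) structural metric vs. Recall@1
# across methods and distribution shifts.
prop = np.array([0.21, 0.35, 0.42, 0.55, 0.61, 0.70])
r_at_1 = np.array([0.63, 0.60, 0.58, 0.52, 0.49, 0.47])
print(trend_and_correlation(prop, r_at_1))
```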

Figure S6: Generalization metrics computed on the ooDML benchmark for all datasets, measured against Recall@1. Each column plots one of the (normalized) structural representation properties (cf. Sec. 4.3 main paper) over the corresponding Recall@1 performance for all examined DML methods and distribution shifts. Trendlines are computed as a least-squares fit either over all datapoints (overall) or only over those corresponding to the default splits (default).
Figure S7: Generalization metrics computed on the ooDML benchmark for all datasets, measured against mAP@1000. Each column plots one of the (normalized) structural representation properties (cf. Sec. 4.3 main paper) over the corresponding mAP@1000 performance for all examined DML methods and distribution shifts. Trendlines are computed as a least-squares fit either over all datapoints (overall) or only over those corresponding to the default splits (default).

D.4 Few-Shot DML

In Sec. 4.5 of the main paper, we analyzed few-shot adaptation of DML representations to novel test distributions as a remedy for bridging their distribution shift to the training data. This section extends the showcased results: Fig. S8 presents all our results on both the CUB200-2011 (a+b) and CARS196 (c+d) datasets based on both Recall@1 and mAP@1000. The results on CARS196 confirm what we already observed for the CUB200-2011 dataset: leveraging very few examples for embedding space adaptation consistently improves over strict zero-shot transfer based on the original DML representation, and the benefit is disproportionately large for larger distribution shifts. The underlying data for Fig. S8 is presented in Tab. S7 for the CUB200-2011 dataset and in Tab. S8 for the CARS196 dataset.
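The relative improvements reported in Fig. S8 follow the simple computation sketched below; all Recall@1 values shown are illustrative and not taken from Tab. S7/S8.

```python
import numpy as np

def relative_improvement(few_shot, zero_shot):
    """Relative change of few-shot over zero-shot performance per split."""
    few = np.asarray(few_shot, dtype=float)
    zero = np.asarray(zero_shot, dtype=float)
    return (few - zero) / zero

# Illustrative Recall@1 values over splits of increasing distribution shift.
zero_shot_r1 = [0.62, 0.57, 0.51, 0.45, 0.40]
two_shot_r1 = [0.64, 0.60, 0.55, 0.50, 0.46]
print(np.round(relative_improvement(two_shot_r1, zero_shot_r1), 3))
```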

Figure S8: Few-shot adaptation of DML representations on CUB200-2011 and CARS196. Columns show average Recall@1 performance over 10 episodes of 2- and 5-shot adaptation, as well as the baseline zero-shot DML results on the same train-test splits (based on the ooDML benchmark) for various DML approaches (fewshot and zeroshot), highlighting a substantial benefit of few-shot adaptation under a priori unknown distribution shifts (see the black line highlighting relative improvements). Relative improvements are computed as the relative change of few-shot performance against the respective zero-shot performance.
Metric Shot Use Method | Split 1 2 3 4 5 6 7 8 9
R@1 2 Zero ArcFace
Multisimilarity
S2SD
DiVA
Few ArcFace
Multisimilarity
S2SD
DiVA
5 Zero ArcFace
Multisimilarity
S2SD
DiVA
Few ArcFace
Multisimilarity
S2SD
DiVA
mAP@1000 2 Zero ArcFace
Multisimilarity
S2SD
DiVA
Few ArcFace
Multisimilarity
S2SD
DiVA
5 Zero ArcFace
Multisimilarity
S2SD
DiVA
Few ArcFace
Multisimilarity
S2SD
DiVA
Table S7: Evaluation of zero-shot generalization and subsequent few-shot adaptation, measured by Recall@1 and mAP@1000, based on few-shot data splits built from the train-test splits of the ooDML benchmark (CUB200-2011). Results are further summarized in Fig. S8 (a) and (b).
Metric Shot Use Method | Split 1 2 3 4 5 6 7 8
R@1 2 Zero ArcFace
Multisimilarity
S2SD