
DiVA: Diverse Visual Feature Aggregation for Deep Metric Learning

by   Timo Milbich, et al.

Visual Similarity plays an important role in many computer vision applications. Deep metric learning (DML) is a powerful framework for learning such similarities which not only generalize from training data to identically distributed test distributions, but in particular also translate to unknown test classes. However, its prevailing learning paradigm is class-discriminative supervised training, which typically results in representations specialized in separating training classes. For effective generalization, however, such an image representation needs to capture a diverse range of data characteristics. To this end, we propose and study multiple complementary learning tasks, targeting conceptually different data relationships by only resorting to the available training samples and labels of a standard DML setting. Through simultaneous optimization of our tasks we learn a single model to aggregate their training signals, resulting in strong generalization and state-of-the-art performance on multiple established DML benchmark datasets.


1 Introduction

Many applications in computer vision, such as image retrieval [36, 31, 59] and face verification [49, 50], rely on capturing visual similarity, where approaches are commonly driven by Deep Metric Learning (DML) [50, 59, 36]. These models aim to learn an embedding space which meaningfully reflects similarity between training images and, more importantly, generalizes to test classes which are unknown during training. Even though models are evaluated on transfer learning, the prevailing training paradigm in DML utilizes discriminative supervised learning. Consequently, the learned embedding space is specialized to features which help only in separating among training classes, and may not correctly translate to unseen test classes. Now, if supervised learning does not result in sufficient generalization for DML, how can we exploit the available training data and class labels to provide additional training signals beyond the standard discriminative task?

Recent breakthroughs in self-supervised learning have shown that contrastive image relations inferred from the images themselves yield rich feature representations which even surpass the ability of supervised features to generalize to novel downstream tasks [40, 19, 7]. However, although DML typically also learns from image relations in the form of pairs [18], triplets [59, 49] or more general image tuples [39, 8], the complementary benefit of self-supervision in DML is largely unstudied. Moreover, the commonly available class assignments give rise to image relations aside from the standard, supervised learning task of ‘pulling’ samples with identical class labels together while ‘pushing’ away samples with different labels. As such ranking-based learning is not limited to discriminative training only, other relations can be exploited to learn beneficial data characteristics which so far have seen little coverage in DML literature.

Figure 1: DML using diverse learning tasks vs. direct incorporation of self-supervision. (Left) Generalization performance increases with each task added to training, independent of the exact combination of our proposed tasks (blue: one extra task, orange: two extra tasks, green: all tasks). (Right) Directly combining supervised learning with self-supervised learning techniques such as DeepC(luster) [5], Rot(Net) [16], Inst.Dis(crimination) [60] or Mo(mentum)Co(ntrast) [19] actually hurts DML generalization.

In this paper, we tackle the issue of generalization in DML by designing diverse learning tasks complementing standard supervised training, leveraging only the commonly provided training samples and labels. Each of these tasks aims at learning features representing different relationships between our training classes and samples: (i) features discriminating among classes, (ii) features shared across different classes, (iii) features capturing variations within classes and (iv) features contrasting between individual images. Finally, we present how to effectively incorporate them in a unified learning framework. In our experiments we study mutual benefits of these tasks and show that joint optimization of diverse representations greatly improves generalization performance as shown in Fig. 1 (left), outperforming the state-of-the-art in DML. Our contributions can be summarized as follows:

  • We design novel triplet learning tasks resulting in a diverse set of features and study their complementary impact on supervised DML.

  • We adapt recent contrastive self-supervised learning to the problem of DML and extend it to effectively support supervised learning, as direct incorporation of self-supervised learning does not benefit DML (cf. Fig. 1, right).

  • We show how to effectively incorporate these learning tasks in a single model, resulting in state-of-the-art performance on standard DML benchmark sets.

2 Related Work

Deep Metric Learning. Deep Metric Learning is one of the primary frameworks for image retrieval [36, 31, 46, 59], zero-shot generalization [49, 46, 48] or face verification [22, 32, 10]. It is also closely related to recent successful unsupervised representation learning approaches employing contrastive learning [19, 35, 7]. Commonly, DML approaches are formulated as ranking tasks on data tuples such as image pairs [18], triplets [22, 59], quadruplets [8] or higher order relations [50, 39, 58]. Effective training of these methods is typically promoted by tuple mining strategies alleviating the high sampling complexity, such as distance-based [49, 59] or hierarchical [15] heuristics. Methods like ProxyNCA [36], Softtriple [44], Arcface [10] or Normalized Softmax [62] introduce learnable data proxies which represent entire subsets of the data, thus circumventing the tuple selection process. Orthogonally, DML research has started to place more emphasis on the training process itself. This involves the generation of artificial training data [31, 64] or adversarial objectives [12]. MIC [46] explains away intra-class variance to strengthen the discriminative embedding. [48] propose to separate the input data space to learn subset-specific, yet still only class-discriminative representations similar to standard ensemble methods [61, 26, 41]. In contrast, we learn different embeddings on conceptually different tasks to capture as much data structure as possible.
Self-supervised Representation Learning. Commonly, self-supervised representation learning aims to learn transferable feature representations from unlabelled data, and is typically applied as pre-training for downstream tasks [21, 34]. Early methods on representation learning are based on sample reconstructions [53, 28], which have been further extended by interpolation constraints [3] and generative adversarial networks [9, 13, 2, 11]. Further, introducing manually designed surrogate objectives encourages self-supervised models to learn about data-related properties. Such tasks range from predicting image rotations [16] and solving a Jigsaw puzzle [37, 38] to iteratively refining the initial network bias using clustering algorithms [5]. Recently, self-supervision approaches based on contrastive learning have produced strong features performing close to or even better than supervised pretraining [19, 35, 7] by leveraging invariance to realistic image augmentations. As these approaches are essentially defined on pairwise image relations, they share common ground with ranking-based DML. In our work, we extend such a contrastive objective to effectively complement supervised DML training.
Multi-task Learning. Concurrently solving different tasks is also employed by classical multi-task learning, which is often based on a divide-and-conquer principle with multiple learners, each optimizing a given subtask. [4] utilizes additional training data and annotations to capture extra information, while our tasks are defined on standard training data only. [43] learn different classifiers for groups of entire categories, thus following a similar motivation as some DML approaches [41, 48]. The latter aim at learning more fine-grained, yet only discriminative features by reducing the data variance for each learner, thus being related to standard hard-negative mining [49]. In contrast, our work formulates various specific learning tasks to target different data characteristics of the training data.

3 Method

Let $f_i := f(I_i, \theta) \in \mathbb{R}^N$ be an $N$-dimensional encoding of an image $I_i \in \mathcal{I}$ represented by a deep neural network with parameters $\theta$. Based on $f_i$, deep metric learning (DML) aims to learn image embeddings $\phi_i := \phi(f_i, \zeta)$ which allow us to measure the similarity between images $I_i, I_j$ as $d_\phi(I_i, I_j) := d(\phi_i, \phi_j)$ under a predefined distance metric $d(\cdot, \cdot)$. Typically, $\phi$ is a linear layer on the feature representation $f$, parameterized by $\zeta$ and normalized to the real hypersphere $\mathbb{S}^{D-1}$ for regularization [59]. $d$ is usually chosen to be the Euclidean distance. In standard supervised DML, $\phi$ is then optimized to reflect semantic similarity between images $I_i \in \mathcal{I}$ defined by the corresponding class labels $y_i \in \mathcal{Y}$.
While there are many ways to define training objectives on $\phi$, ranking losses, such as variants of the popular triplet loss [59, 50, 39], are a natural surrogate for the DML problem. Based on image triplets $t = (I_a, I_p, I_n)$ with $I_a$ defined as anchor, $I_p$ as a similar, positive and $I_n$ as a negative image, we minimize

$$\mathcal{L}_{tri}(t) = \left[ d_\phi(I_a, I_p) - d_\phi(I_a, I_n) + \gamma \right]_+ \qquad (1)$$

where $[\cdot]_+$ defines the hinge function which clips any negative value to zero. Hence, we maximize the gap between $d_\phi(I_a, I_n)$ and $d_\phi(I_a, I_p)$ as long as a margin $\gamma$ is violated.
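The triplet objective described above can be illustrated with a minimal sketch in plain Python. It assumes embeddings are given as lists of floats and that the distance is Euclidean; the function names and the default margin of 0.2 are illustrative assumptions, not values taken from the paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin.

    The margin default is a hypothetical value chosen for illustration.
    The loss is zero once the negative is farther from the anchor than
    the positive by at least the margin.
    """
    gap = euclidean(anchor, positive) - euclidean(anchor, negative) + margin
    return max(0.0, gap)
```

A triplet whose negative already lies far enough away contributes no gradient, which is why the choice of informative triplets matters in the following sections.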
In supervised DML, $\phi$ is typically optimized to discriminate between classes. Thus, $\phi$ is trained to predominantly capture highly discriminative features while being invariant to image characteristics which do not facilitate training class separation. However, as we are interested in generalizing to unknown test distributions, we should rather aim at maximizing the amount of features captured from the training set, thereby increasing the potential of the embedding space to transfer to unseen images.

3.1 Diverse learning tasks for DML

We now introduce several tasks for learning a diverse set of features, resorting only to the standard training information provided in a DML problem. Each of these tasks is designed to learn features which are conceptually neglected by the others to be mutually complementary. First, we introduce the intuition behind each feature type, before describing how to learn it based on pairwise or triplet-based image relations.

Figure 2: Schematic description of each task. We learn four complementary tasks to capture features focusing on different data characteristics: the standard class-discriminative task, which learns features separating samples of different classes; the shared task, which captures features relating samples across different classes; a sample-specific task, which enforces image representations invariant to transformations; and finally the intra-class task, which models data variations within classes.

Class-discriminative features These features are learned by standard class-discriminative optimization of $\phi$ and focus on data characteristics which allow us to accurately separate one class from all others. It is the prevailing training signal of common classification-based [56, 10, 62], proxy-based [36, 44] or ranking-based [46, 41] approaches. For the latter, we can formulate the training task using Eq. 1 by means of triplets $t = (I_a, I_p, I_n) \in \mathcal{T}_{disc}$ with $y_a = y_p$ and $y_a \neq y_n$, as

$$\mathcal{L}_{disc} = \frac{1}{|\mathcal{T}_{disc}|} \sum_{t \in \mathcal{T}_{disc}} \mathcal{L}_{tri}(t) \qquad (2)$$

thus minimizing embedding distances between samples of the same class while maximizing them for samples of different classes. Moreover, the discriminative signal is important to learn how to aggregate features into classes, following the intuition of “the whole is more than the sum of its parts” analyzed in Gestalt theory [54].
Class-shared features In contrast to discriminative features, which look for characteristics separating classes, class-shared features capture commonalities, i.e. variations, shared across classes. For instance, birds and cars share a similar variety in colors, which is of little help when separating between them. However, learning about this characteristic is actually beneficial when describing other colorful object classes like flowers or fish. Given suitable label information, learning such features would naturally follow the standard discriminative training setup. However, having only class labels available, we must resort to approximations. To this end, we exploit the hypothesis that for most arbitrarily sampled triplets $t = (I_a, I_p, I_n) \in \mathcal{T}_{shared}$ with each constituent coming from mutually different classes, i.e. $y_a \neq y_p$, $y_a \neq y_n$ and $y_p \neq y_n$, the anchor $I_a$ and positive $I_p$ share some common pattern when compared to the negative image $I_n$. Commonalities which are frequently observed between classes will occur more often than noisy patterns which are unique to a few triplets, an effect commonly observed when learning on imbalanced data [6, 14, 29]. Learning is then performed by optimizing

$$\mathcal{L}_{shared} = \frac{1}{|\mathcal{T}_{shared}|} \sum_{t \in \mathcal{T}_{shared}} \mathcal{L}_{tri}(t) \qquad (3)$$

As deep networks learn from frequent backpropagation of similar learning signals resulting in informative gradients, only prominent shared features are captured. Further, since shared features can be learned between any classes, we need to warrant diverse combinations of classes in our triplets $\mathcal{T}_{shared}$. Thus, enabling triplet constituents to be sampled from the whole embedding space using distance-based sampling [59] is crucial to avoid any bias towards samples which are mostly far from (random sampling) or close to (hard-negative sampling) a given anchor $I_a$.
Intra-class features The tasks defined so far model image relations across classes. In contrast, intra-class features describe variations within a given class. While these variations may also apply to other classes (thus exhibiting a certain overlap with class-shared features), more class-specific details are targeted. Hence, to capture such data characteristics by means of triplet constraints, we train this task following a similar intuition as for learning class-shared features: we define triplets $t = (I_a, I_p, I_n) \in \mathcal{T}_{intra}$ with $y_a = y_p = y_n$ and minimize

$$\mathcal{L}_{intra} = \frac{1}{|\mathcal{T}_{intra}|} \sum_{t \in \mathcal{T}_{intra}} \mathcal{L}_{tri}(t) \qquad (4)$$
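The three triplet-based tasks differ only in the label relation they require between anchor, positive and negative. The sketch below makes this explicit by assigning each ordered index triplet to a task based on its class labels. The function names are hypothetical, and the exhaustive enumeration is for illustration only; in practice triplets are sampled (e.g. by distance-based sampling [59]) rather than enumerated.

```python
from itertools import permutations


def triplet_task(y_anchor, y_positive, y_negative):
    """Return the task a triplet belongs to, based on class labels only."""
    if y_anchor == y_positive == y_negative:
        return "intra"            # all three images from one class
    if y_anchor == y_positive:
        return "discriminative"   # same-class pair vs. other-class negative
    if len({y_anchor, y_positive, y_negative}) == 3:
        return "shared"           # three mutually different classes
    return None                   # not used by any of the three tasks


def triplets_by_task(labels):
    """Group all ordered triplets of distinct sample indices by task."""
    groups = {"discriminative": [], "shared": [], "intra": []}
    for a, p, n in permutations(range(len(labels)), 3):
        task = triplet_task(labels[a], labels[p], labels[n])
        if task is not None:
            groups[task].append((a, p, n))
    return groups
```

For labels `[0, 0, 1, 2]`, this yields 4 discriminative triplets, 12 shared triplets and no intra-class triplets, since no class has three samples.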


Sample-specific features Recent approaches for self-supervised learning [40, 19, 1] based on noise contrastive estimation (NCE) [17] show that features exhibiting strong generalization for transfer learning can be learned from the training images alone. As NCE learns to increase the correlation between embeddings of an anchor sample and a similar positive sample by contrasting against a set of negative samples, it naturally translates to DML. He et al. [19] proposed an efficient self-supervised framework which first applies data augmentation to generate positive surrogates $\tilde{I}_a$ for a given anchor $I_a$. Next, using NCE we contrast their embeddings against randomly sampled negatives $I_n \in \mathcal{N}$ by minimizing

$$\mathcal{L}_{nce} = -\frac{1}{|\mathcal{B}|} \sum_{I_a \in \mathcal{B}} \log \frac{\exp(\phi_a^\top \tilde{\phi}_a / \tau)}{\exp(\phi_a^\top \tilde{\phi}_a / \tau) + \sum_{I_n \in \mathcal{N}} \exp(\phi_a^\top \phi_n / \tau)} \qquad (5)$$

where $\mathcal{B}$ denotes the mini-batch and the temperature parameter $\tau$ is adjusted during optimization to control the training signal, especially during earlier stages of training. By contrasting each sample against many negatives, i.e. large sets $\mathcal{N}$, this task effectively yields a general, class-agnostic feature description of our data. Moreover, as the contrastive objective explicitly increases the similarity of an anchor image with its augmentations, invariance against data transformations and scaling is learned. Fig. 2 summarizes and visually explains the different training objectives of each task.
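As an illustration of the contrastive objective, the following sketch computes the NCE loss for a single anchor in the common InfoNCE form used by [19]. It assumes unit-normalized embeddings given as lists of floats; the function names and the default temperature are assumptions made for this example.

```python
import math


def dot(u, v):
    """Inner product of two equally sized vectors."""
    return sum(a * b for a, b in zip(u, v))


def nce_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss for one anchor.

    `positive` is the embedding of an augmented view of the anchor and
    `negatives` is a list of embeddings of other images. The temperature
    default is a hypothetical value for illustration.
    """
    pos = math.exp(dot(anchor, positive) / temperature)
    neg = sum(math.exp(dot(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

The loss shrinks as negatives become less similar to the anchor, so negatives that are already far away contribute little training signal.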

3.2 Improved generalization by multi feature learning

In the following, we show how to efficiently incorporate the learning tasks introduced in the previous section into a single DML model. We first extend the objective in Eq. 5 using established triplet sampling strategies for improved adjustment to DML, before we jointly train our learning tasks for maximal feature diversity.

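Adapting noise contrastive estimation to DML Efficient strategies for mining informative negatives are a key factor [49] for successful training of ranking-based DML models. Since NCE essentially translates to a ranking between images $I_a$, $\tilde{I}_a$ and $I_n$, its learning signal is also impaired if the negatives $I_n \in \mathcal{N}$ are uninformative, i.e. if $d_\phi(I_a, I_n)$ is large. To this end, we control the contribution of each negative $I_n$ to $\mathcal{L}_{nce}$ by a weight factor $w(d) = \min(\lambda, q^{-1}(d))$ with $d = d_\phi(I_a, I_n)$. Here, $q(d) \propto d^{D-2} \left[1 - \frac{1}{4} d^2\right]^{\frac{D-3}{2}}$ is the distribution of pairwise distances on the $D$-dimensional unit hypersphere and $\lambda$ a cut-off parameter. Similar to [59], $w$ helps to equally weigh negatives from the whole range of possible distances in $\mathcal{N}$ and, in particular, increases the impact of harder negatives. Thus, our distance-adapted NCE loss becomes

$$\mathcal{L}_{dance} = -\frac{1}{|\mathcal{B}|} \sum_{I_a \in \mathcal{B}} \log \frac{\exp(\phi_a^\top \tilde{\phi}_a / \tau)}{\exp(\phi_a^\top \tilde{\phi}_a / \tau) + \sum_{I_n \in \mathcal{N}} w(d_\phi(I_a, I_n)) \exp(\phi_a^\top \phi_n / \tau)} \qquad (6)$$

NCE-based objectives learn best using large sets of negatives [19]. However, naively utilizing only negatives from the current mini-batch constrains $\mathcal{N}$ to the available GPU memory. To alleviate this limitation, we follow [19] and realize $\mathcal{N}$ as a large memory queue, which is constantly updated with embeddings from each training iteration by utilising a running-average copy of the network.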
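The distance-based weighting can be sketched as follows, assuming the inverse-density weighting of distance-weighted sampling [59], where the density of pairwise distances on the unit hypersphere is proportional to d^(D-2) (1 - d^2/4)^((D-3)/2). The function names, the use of an unnormalized density, and all parameter values are assumptions for illustration rather than the authors' implementation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def inverse_density_weight(distance, dim, cutoff):
    """min(cutoff, 1 / q(d)) for the unnormalized distance density q (dim > 3)."""
    if distance <= 0.0 or distance >= 2.0:
        return cutoff  # q vanishes at both ends, so its inverse is clipped
    log_q = (dim - 2) * math.log(distance) \
        + 0.5 * (dim - 3) * math.log(1.0 - 0.25 * distance ** 2)
    if -log_q >= math.log(cutoff):
        return cutoff  # compare in log space to avoid overflow
    return math.exp(-log_q)


def dance_loss(anchor, positive, negatives, dim, temperature, cutoff):
    """NCE loss for one anchor with each negative reweighted by its distance."""
    pos = math.exp(dot(anchor, positive) / temperature)
    neg = 0.0
    for n in negatives:
        cos = dot(anchor, n)
        # Euclidean distance between unit vectors from their cosine similarity.
        distance = math.sqrt(max(0.0, 2.0 - 2.0 * cos))
        neg += inverse_density_weight(distance, dim, cutoff) * math.exp(cos / temperature)
    return -math.log(pos / (pos + neg))
```

Because the density concentrates around a distance of sqrt(2) in high dimensions, the inverse weighting raises the contribution of the rarer, closer negatives up to the cut-off.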

Figure 3: Architecture of our proposed model. Each task optimizes an individual embedding implemented as a linear layer on top of a shared underlying feature encoder $f$. Pairwise decorrelation of the embeddings, utilizing a mapping $\xi$ based on a two-layer MLP, encourages each task to further emphasize its targeted data characteristics. Gradient inversion is applied during the backward pass to each embedding head.

Joint optimization for maximal feature diversity The tasks presented in Sec. 3.1 are formulated to extract mutually complementary information from our training data. In order to capture their learned features in a single model to obtain a rich image representation, we now discuss how to jointly optimize these tasks.
While each task targets a semantically different concept of features, their driving learning signals are based on potentially contradicting ranking constraints on the learned embedding space. Thus, aggregating these signals to optimize a joint, single embedding function may entail detrimental interference between them. In order to circumvent this issue, we learn a dedicated embedding space for each task, as often conducted in multi-task optimization [46, 45], i.e. $\phi^{disc}$, $\phi^{shared}$, $\phi^{intra}$ and $\phi^{dance}$, each defined as in Sec. 3. As all embeddings share the same feature extractor $f$, each task still benefits from the aggregated learning signals. Additionally, as there may still be redundant overlap in the information captured by each task, we mutually decorrelate these representations, thus maximizing the diversity of the overall training signal. Similar to [41, 46], we minimize the mutual information of two embedding functions $\phi^a$, $\phi^b$ by maximizing their correlation in the embedding space of $\phi^b$, followed by a gradient reversal. For that, we learn a mapping $\xi$ from $\phi^a$ to $\phi^b$ given an image $I_i$ and compute the correlation $c(\phi_i^a, \phi_i^b) = \| \xi(R(\phi_i^a)) \odot R(\phi_i^b) \|_2^2$ with $\odot$ being the point-wise product. $R$ denotes a gradient reversal operation, which inverts the resulting gradients during backpropagation. Maximizing $c$ results in $\xi$ aiming to make $\phi^a$ and $\phi^b$ comparable. However, through the subsequent gradient reversal, we actually decorrelate the embedding functions. Joint training of all tasks is finally performed by minimizing

$$\mathcal{L} = \mathcal{L}_{disc} + \alpha_1 \mathcal{L}_{shared} + \alpha_2 \mathcal{L}_{intra} + \alpha_3 \mathcal{L}_{dance} - \sum_{(\phi^a, \phi^b) \in \mathcal{P}} \rho_{a,b} \, c(\phi^a, \phi^b) \qquad (7)$$

where $\mathcal{P}$ denotes the pairs of embeddings to be decorrelated. We found

$$\mathcal{P} = \{ (\phi^{disc}, \phi^{dance}), (\phi^{disc}, \phi^{shared}), (\phi^{disc}, \phi^{intra}) \} \qquad (8)$$

to work best, which decorrelates the auxiliary tasks with the class-discriminative task. Initial experiments showed that further decorrelation among the auxiliary tasks does not result in further benefit, so it is disregarded. The weighting parameters $\rho_{a,b}$ adjusting the degree of decorrelation between the embeddings are set to the same, constant value in our implementation. Fig. 3 provides an overview of our model architecture. Finally, to combine our learned embedding representations, we concatenate the individual task embeddings, thus forming an ensemble representation, and perform subsequent re-normalization during testing before computing pairwise distances.
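The test-time aggregation described above (concatenate the task embeddings, re-normalize, then compute pairwise distances) can be sketched in a few lines. The function names are hypothetical, and the sketch assumes each task embedding is a list of floats.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def aggregate(task_embeddings):
    """Concatenate per-task embeddings and re-normalize the result."""
    concatenated = [x for embedding in task_embeddings for x in embedding]
    return l2_normalize(concatenated)


def distance(u, v):
    """Euclidean distance between two aggregated embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Re-normalization matters because the concatenation of k unit vectors has length sqrt(k), which would otherwise rescale all pairwise distances.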
Computational costs We train all tasks using the same mini-batch to avoid computational overhead. While optimizing each learner on an individual batch can further alleviate training signal interference [48, 61], training time increases significantly. Using a single batch per iteration, we limit the required extra computations to the extra forward pass through the running-average network (without computing gradients) for contrasting against negatives sampled from the memory queue, as well as the small mapping networks $\xi$. Across datasets, the resulting increase in training time per epoch over a standard supervised DML task is comparable to or lower than that of other methods, which perform a full clustering on the dataset [46, 48] after each epoch, compute extensive embedding statistics [24] or simultaneously train generative models [31].

4 Experiments

In the following, we first present our implementation details and the benchmark datasets. Next, we evaluate our proposed model and study how our learning tasks complement each other and improve over baseline performances. Finally, we discuss our results in the context of the current state-of-the-art and conduct analysis and ablation experiments.
Implementation details We follow the common training protocol of [59, 46, 48] for implementations utilizing a ResNet50 backbone. The shorter image axis is resized to a fixed length, followed by a random crop and a random horizontal flip. During evaluation, only a center crop is taken after resizing. The embedding dimension is fixed per task embedding. For model variants using Inception-V1 with Batch-Normalization [23], we follow the settings of [58, 24]. Resizing, cropping and flipping are done in the same way as for the ResNet50 versions. For training, we use Adam [27] with a preset learning rate and weight decay. For ablations, we use no learning rate scheduling, while our final model is trained using scheduling values determined by cross-validation. The implementation is done using the PyTorch framework [42], and experiments are performed on compute clusters containing NVIDIA Titan X, Tesla T4, P100 and V100 GPUs, always limited to 12GB VRAM following the standard training protocol [59]. For DiVA, we utilise the triplet-based margin loss [59] with fixed margin parameters. The utilized hyperparameters for DiVA are listed in Tab. 1. Training is run for 200 epochs on CUB200/CARS196 and 150 epochs on SOP.

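| Parameter | IBN: CUB200 | IBN: CARS196 | IBN: SOP | R50: CUB200 | R50: CARS196 | R50: SOP |
|---|---|---|---|---|---|---|
| Decorrelation weight $\rho$, task weighting $\alpha$ | 300, 0.1 | 100, 0.1 | 50, 0.1 | 1500, 0.3 | 100, 0.1 | 300, 0.2 |
| Temperature $\tau$ | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
| LR annealing epoch, annealing factor | 70, 0.3 | 160, 0.3 | 100, 0.3 | 60, 0.3 | 70, 0.3 | 70, 0.3 |

Table 1: Hyperparameters for the Inception-V1 + BN (IBN) and ResNet50 (R50) backbones. The last row gives the epoch at which the learning rate is annealed and the factor by which it is annealed. Decorrelation weights $\rho$ and weightings $\alpha$ (eq. 7) are the same for all pairs in $\mathcal{P}$ (see eq. 8). $\tau$ denotes the temperature in $\mathcal{L}_{dance}$ (eq. 6).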
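To illustrate how a schedule entry such as "70, 0.3" would act, read here as (annealing epoch, annealing factor), the sketch below applies a single step decay. The function name is hypothetical, the base learning rate in the usage is a placeholder rather than the paper's value, and whether the decay takes effect exactly at the listed epoch is an assumption.

```python
def annealed_learning_rate(base_lr, epoch, anneal_epoch, factor):
    """Single-step schedule: multiply the base rate by `factor` from `anneal_epoch` on."""
    if epoch >= anneal_epoch:
        return base_lr * factor
    return base_lr


# Usage with a placeholder base rate of 1.0 and the schedule entry (70, 0.3).
schedule = [annealed_learning_rate(1.0, epoch, 70, 0.3) for epoch in (0, 69, 70, 199)]
```

In a PyTorch training loop, the same behaviour is typically obtained with a step-based learning rate scheduler instead of a hand-written function.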

Datasets We evaluate the performance on three common benchmark datasets with standard training/test splits (see e.g. [59, 46, 48, 58]). CARS196 [30] contains 16,185 images from 196 car classes. The first 98 classes containing 8,054 images are used for training, while the remaining 98 classes with 8,131 images are used for testing. CUB200-2011 [55] contains 11,788 bird images from 200 classes. Training/test sets contain the first/last 100 classes with 5,864/5,924 images respectively. Stanford Online Products (SOP) [39] provides 120,053 images divided into 22,634 product classes. 11,318 classes with 59,551 images are used for training, while the remaining 11,316 classes with 60,502 images are used for testing.

| Approach | Dim | CUB200-2011 [55] R@1 | R@2 | NMI | CARS196 [30] R@1 | R@2 | NMI | SOP [39] R@1 | R@10 | NMI |
|---|---|---|---|---|---|---|---|---|---|---|
| Margin [59] (orig, R50) | 128 | 63.6 | 74.4 | 69.0 | 79.6 | 86.5 | 69.1 | 72.7 | 86.2 | 90.7 |
| Margin [59] (ours, IBN) | 512 | 63.6 | 74.7 | 68.3 | 79.4 | 86.6 | 66.2 | 76.6 | 89.2 | 89.8 |
| DiVA (IBN, D & Da) | 512 | 64.5 | 76.0 | 68.8 | 80.4 | 87.7 | 67.2 | 77.0 | 89.4 | 90.1 |
| DiVA (IBN, D & S) | 512 | 65.1 | 76.4 | 69.0 | 81.5 | 88.3 | 66.8 | 77.2 | 89.6 | 90.0 |
| DiVA (IBN, D & I) | 512 | 64.9 | 75.8 | 68.4 | 80.6 | 87.9 | 67.4 | 76.9 | 89.4 | 89.9 |
| DiVA (IBN, D & Da & I) | 510 | 65.3 | 76.5 | 68.3 | 82.2 | 89.1 | 67.8 | 75.8 | 89.0 | 89.8 |
| DiVA (IBN, D & S & I) | 510 | 65.5 | 76.4 | 68.4 | 82.1 | 89.4 | 67.2 | 77.0 | 89.3 | 89.7 |
| DiVA (IBN, D & Da & S) | 510 | 65.9 | 76.7 | 68.9 | 82.6 | 89.6 | 68.0 | 77.4 | 89.6 | 90.1 |
| DiVA (IBN, D & Da & S & I) | 512 | 66.4 | 77.2 | 69.6 | 83.1 | 90.0 | 68.1 | 77.5 | 90.3 | 90.1 |

Table 2: Comparison of our proposed method using different combinations of learning tasks. IBN (Inception-V1 with Batch-Normalization) and R50 (ResNet50) denote the backbone architecture. No learning rate scheduling is used. Our tasks are denoted by D(iscriminative), S(hared), I(ntra-Class) and Da(NCE). For fair comparison, the dimensionality per task embedding depends on the number of tasks incorporated to ensure a total of at most 512. With two tasks, each embedding therefore uses 256 dimensions, with three tasks 170 (510 in total) and with four tasks 128.
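The "Dim" column of Table 2 follows from splitting a budget of 512 dimensions evenly across the tasks with integer division, which is why three-task models end up with 510 dimensions in total. A short check (the function name is illustrative):

```python
def per_task_dimension(total_budget, num_tasks):
    """Largest equal embedding size per task that fits into the total budget."""
    return total_budget // num_tasks


for num_tasks in (2, 3, 4):
    dim = per_task_dimension(512, num_tasks)
    print(num_tasks, "tasks:", dim, "per task,", dim * num_tasks, "in total")
```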

4.1 Performance study of multi-feature DML

We now study our model and the complementary benefit of our proposed feature learning tasks for supervised DML. Tab. 2 evaluates the performance of our model based on margin loss [59], a triplet-based objective with an additionally learnable margin, and distance-weighted triplet sampling [59]. We use Inception-V1 with Batchnorm and a maximal aggregated embedding dimensionality of 512. Thus, if two tasks are utilized, each embedding has 256 dimensions, while three tasks result in 170 and four tasks in 128 dimensions each. No learning rate scheduling is used. Evaluation is conducted on CUB200-2011 [55], CARS196 [30] and SOP [39]. Retrieval performance is measured through Recall@k [25] and clustering quality via Normalized Mutual Information (NMI) [33]. While our results vary between possible task combinations, we observe that the generalization of our model consistently increases with each task added to the joint optimization. Our strongest model including all proposed tasks improves Recall@1 over the margin-loss baseline by 2.8 percentage points on CUB200-2011, 3.7 on CARS196 and 0.9 on SOP. This highlights that (i) purely discriminative supervised learning disregards valuable training information and (ii) carefully designed learning tasks are able to capture this information for improved generalization to unknown test classes. We further analyze our observations in the ablation experiments.
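For reference, Recall@k as commonly used in DML counts a query as correct if at least one of its k nearest neighbours (excluding the query itself) shares its class label. The sketch below is a brute-force version under that assumed definition; the function names are hypothetical and the quadratic search is only suitable for small examples.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def recall_at_k(embeddings, labels, k):
    """Fraction of queries with at least one same-class sample among their k nearest neighbours."""
    hits = 0
    for i, query in enumerate(embeddings):
        # Rank all other samples by distance to the query.
        ranked = sorted(
            (euclidean(query, other), j)
            for j, other in enumerate(embeddings)
            if j != i
        )
        if any(labels[j] == labels[i] for _, j in ranked[:k]):
            hits += 1
    return hits / len(embeddings)
```

For four points on a line at 0, 0.1, 1 and 1.1 with labels `[0, 0, 1, 1]`, every query retrieves its class mate first, so Recall@1 is 1.0.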

| Approach | Dim | CUB200-2011 [55] R@1 | R@2 | NMI | CARS196 [30] R@1 | R@2 | NMI | SOP [39] R@1 | R@10 | NMI |
|---|---|---|---|---|---|---|---|---|---|---|
| HTG [63] | 512 | 59.5 | 71.8 | - | 76.5 | 84.7 | - | - | - | - |
| HDML [64] | 512 | 53.7 | 65.7 | 62.6 | 79.1 | 87.1 | 69.7 | 68.7 | 83.2 | 89.3 |
| Margin [59] | 128 | 63.6 | 74.4 | 69.0 | 79.6 | 86.5 | 69.1 | 72.7 | 86.2 | 90.8 |
| HTL [15] | 512 | 57.1 | 68.8 | - | 81.4 | 88.0 | - | 74.8 | 88.3 | - |
| DVML [31] | 512 | 52.7 | 65.1 | 61.4 | 82.0 | 88.4 | 67.6 | 70.2 | 85.2 | 90.8 |
| MultiSim [58] | 512 | 65.7 | 77.0 | - | 84.1 | 90.4 | - | 78.2 | 90.5 | - |
| D&C [48] | 128 | 65.9 | 76.6 | 69.6 | 84.6 | 90.7 | 70.3 | 75.9 | 88.4 | 90.2 |
| MIC [46] | 128 | 66.1 | 76.8 | 69.7 | 82.6 | 89.1 | 68.4 | 77.2 | 89.4 | 90.0 |
| Significant increase in network parameters: | | | | | | | | | | |
| HORDE [24] + Contr. [18] | 512 | 66.3 | 76.7 | - | 83.9 | 90.3 | - | - | - | - |
| Softtriple [44] | 512 | 65.4 | 76.4 | - | 84.5 | 90.7 | 70.1 | 78.3 | 90.3 | 92.0 |
| Ensemble methods: | | | | | | | | | | |
| A-BIER [41] | 512 | 57.5 | 68.7 | - | 82.0 | 89.0 | - | 74.2 | 86.9 | - |
| Rank [57] | 1536 | 61.3 | 72.7 | 66.1 | 82.1 | 89.3 | 71.8 | 79.8 | 91.3 | 90.4 |
| DREML [61] | 9216 | 63.9 | 75.0 | 67.8 | 86.0 | 91.7 | 76.4 | - | - | - |
| ABE [26] | 512 | 60.6 | 71.5 | - | 85.2 | 90.5 | - | 76.3 | 88.4 | - |
| Ours (DiVA-IBN-512) | 512 | 66.8 | 77.7 | 70.0 | 84.1 | 90.7 | 68.7 | 78.1 | 90.6 | 90.4 |
| Ours (Margin [59]-R50-512) | 512 | 64.4 | 75.4 | 68.4 | 82.2 | 89.0 | 68.1 | 78.3 | 90.0 | 90.1 |
| Ours (DiVA-R50-512) | 512 | 69.2 | 79.3 | 71.4 | 87.6 | 92.9 | 72.2 | 79.6 | 91.2 | 90.6 |

Table 3: Comparison to the state-of-the-art methods on CUB200-2011 [55], CARS196 [30] and SOP [39]. DiVA-Arch-Dim describes the backbone used with DiVA (IBN: Inception-V1 with Batchnorm, R50: ResNet50) and the total training and testing embedding dimensionality. For fair comparison, we also ran a standard ResNet50 with embedding dimensionality of 512. As can be seen, DiVA significantly outperforms other methods on CUB200 and CARS196 while achieving competitive performance on SOP. Even with the weaker IBN backbone we reach state-of-the-art on CUB200 and comparable results on CARS196 and SOP.

4.2 Comparison to state-of-the-art approaches

Next, we compare our strongest model, trained with the same hyperparameters and a fixed learning rate schedule per benchmark (Tab. 1), to the current state-of-the-art approaches in DML. For fair comparison to the different methods, we report results using both Inception-BN (IBN) and ResNet50 (R50) as backbone architecture. As Inception-BN is typically trained with an embedding dimensionality of 512, we restrict each task embedding to 128 dimensions for direct comparison with non-ensemble methods. Thus we deliberately impair the potential of our model due to a significantly lower capacity per task, compared to the standard 512. For comparison with ensemble approaches and maximal performance, we use a ResNet50 [59, 46, 48] architecture and the corresponding standard dimensionality per task. Tab. 3 summarizes our results. We significantly improve over methods with comparable backbone architectures and achieve new state-of-the-art results with our ResNet50 ensemble. In particular, we outperform the strongest ensemble methods, including DREML [61], which utilizes a much higher total embedding dimensionality. The large improvement is explained by the diverse and mutually complementary learning signals contributed by each task in our ensemble. In contrast, previous ensemble methods rely on the same, purely class-discriminative training signal for each learner. Note that some approaches strongly differ from the standard training protocols and architectures, resulting in more parameters and much higher GPU memory consumption, such as Rank [57] (32GB), ABE [26] (24GB), Softtriple [44] and HORDE [24]. Additionally, Rank [57] employs much larger batch sizes to increase the number of classes per batch. This is especially crucial on the SOP dataset, which greatly benefits from higher class coverage due to its vast number of classes, as shown by [47]. Nevertheless, our model outperforms these methods - in some cases even in its constrained version (IBN-512).

4.3 Ablation Studies

Figure 4: Analysis of complementary tasks for supervised learning. (left): Performance comparison between class-discriminative training only (Baseline), an ensemble of class-discriminative learners (Discr. Ensemble) and our proposed DiVA, which exhibits a large boost in performance. (right): Evaluation of self-supervised learning approaches combined with standard discriminative DML.

In this section we conduct ablation experiments for various parts of our model. For every ablation we again use the Inception-BN network. The dimensionality setting follows the performance study in sec. 4.1. Again, we train each model with a fixed learning rate for fair comparison among ablations.
Influence of distance-adaption in DaNCE: To evaluate the benefit of our extension from standard NCE [19, 17] to DaNCE, we compare both versions in combination with standard supervised DML (i.e. class-discriminative features) in Fig. 4 (right). Our experiment indicates two positive effects: (i) the training convergence with our extended objective is much faster and (ii) the performance differs greatly between employing standard NCE and DaNCE. In fact, using the standard NCE objective is even detrimental to learning, while our extended version improves over the purely discriminatively trained baseline. We attribute this both to the slow convergence of standard NCE, which is not able to support the faster discriminative learning, and to the emphasis on harder negatives in DaNCE. In particular the latter is an important factor in ranking-based DML [49], as during training more and more negatives become uninformative. To tackle this issue, we also experimented with learning the temperature parameter $\tau$. While convergence speed increases slightly, we find no significant benefit in final generalization performance.
Evaluation of self-supervision methods: Fig. 4 (right) compares DaNCE to other methods from self-supervised representation learning. For that purpose we train the discriminative task together with either DeepCluster [5], RotNet [16] or Instance Discrimination [60]. We observe that none of these tasks is able to provide complementary information to improve generalization. DeepCluster, trained with 300 pseudo classes for classification, actually aims at approximating the class-discriminative learning signal, while RotNet is strongly dependent on the variance of the training classes and converges very slowly. Instance Discrimination seems to provide a contradictory training signal to the supervised task. These results are in line with previous works [20] which report difficulties in directly combining supervised and self-supervised learning for improved test performance. In contrast, we explicitly adapt NCE to DML in our proposed objective DaNCE.

Methods  | Baseline | DiVA | No De-correlation | Separated models
Recall@1 | 63.6     | 66.4 | 65.6              | 48.7
Table 4: Ablation studies. We compare standard margin loss as baseline and DiVA performance against ablations of our model: no decorrelation between embeddings (No De-correlation) and training an independent model for each task (Separated models). Total embedding dimensionality is 512.
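
For readers unfamiliar with the metric reported in Tab. 4, the following is a minimal sketch of how Recall@1 is commonly computed for an embedding space: each sample queries all other samples, and the query counts as a hit if its nearest neighbor shares its class label. The function name `recall_at_1`, the choice of cosine similarity and the toy data are assumptions for illustration and are not taken from the authors' evaluation code.

```python
import numpy as np


def recall_at_1(embeddings, labels):
    """Fraction of samples whose nearest neighbor (excluding itself) has the same label."""
    labels = np.asarray(labels)
    # L2-normalize so that the dot product is the cosine similarity.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)  # a query must not retrieve itself
    nearest = sim.argmax(axis=1)
    return float(np.mean(labels[nearest] == labels))


# Hypothetical toy embeddings: two tight pairs pointing in different directions.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(recall_at_1(emb, [0, 0, 1, 1]))  # 1.0, every nearest neighbor shares the label
```

Benchmark results such as those in Tab. 4 are usually reported as this fraction multiplied by 100.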

Comparison to purely class-discriminative ensemble: We now compare DiVA to an ensemble of class-discriminative learners (Discr. Ensemble) based on the same model architecture and using embedding decorrelation in Fig. 4 (left). While the discriminative ensemble improves over the baseline, the amount of data information it captures eventually saturates, and it thus performs significantly worse than our multi-feature DiVA ensemble. Further, our ablation reveals that joint optimization of diverse learning tasks regularizes training and reduces the overfitting effects which eventually occur during later stages of DML training.
Benefit of task decorrelation: We analyze the role of decorrelating the embedding representations of each task during learning by comparison to a model trained without this constraint. Firstly, Tab. 4 demonstrates that the model without decorrelation still outperforms the standard margin loss (‘Baseline’) by 2.0 points in Recall@1 (65.6 vs. 63.6) while operating on the same total embedding dimensionality. This shows that learning diverse features significantly improves generalization. Adding the de-correlation constraint then boosts performance by a further 0.8 points (66.4 vs. 65.6), as now each task is further encouraged to capture distinct data characteristics.
Learning without feature sharing: To highlight the importance of feature sharing among our learning tasks, we train an individual, independent model for each of the class-discriminative, class-shared, sample-specific and intra-class tasks. At testing time, we combine their embeddings in the same way as in our proposed model. Tab. 4 shows a dramatic drop in performance to 48.7 Recall@1 for the disconnected ensemble (‘Separated models’), showing that sharing the complementary information captured from different data characteristics is a crucial element of our model and mutually benefits learning. Without the class-discriminative signal, the other tasks lack the concept of an object class, which hurts the aggregation of embeddings (cf. Sec. 3). In addition, the supervised task again suffers from strong overfitting to the training data.
Generalization and embedding space compression: Recent work [47] links improvements in DML generalization to a decreased compression [51] of the embedding space. Their findings indicate that the number of directions with significant variance [52, 47] of a representation correlates with the generalization ability in DML. We therefore analyze our model using their proposed spectral decay (lower is better), which is computed as the KL-divergence between the normalized singular value spectrum and a uniform distribution. Fig. 5 compares the spectral decays of our model and a standard supervised baseline model. As expected, due to the diverse information captured, our model learns a more complex representation, which results in a significantly lower spectral decay and better generalization.
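
The spectral decay described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the reference implementation of [47]: the function name `spectral_decay` is hypothetical, the KL direction is taken as KL(uniform || spectrum), and no centering or truncation of the spectrum is applied, any of which may differ from the original protocol.

```python
import numpy as np


def spectral_decay(embeddings, eps=1e-12):
    """KL-divergence between a uniform distribution and the normalized singular values."""
    # Singular values of the (num_samples x dim) embedding matrix.
    s = np.linalg.svd(embeddings, compute_uv=False)
    # Normalize the spectrum so it sums to one and can be read as a distribution.
    p = s / s.sum()
    # Uniform reference distribution over the same number of directions.
    u = np.full_like(p, 1.0 / p.size)
    # Assumed direction: KL(u || p); eps guards against log(0) for vanishing directions.
    return float(np.sum(u * (np.log(u) - np.log(p + eps))))


# Hypothetical toy embeddings with orthonormal columns (perfectly flat spectrum) ...
rng = np.random.default_rng(0)
flat = np.linalg.qr(rng.normal(size=(64, 8)))[0]
# ... and a variant in which a single direction dominates the variance.
peaked = flat * np.array([8.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])

print(spectral_decay(flat), spectral_decay(peaked))
```

In this sketch, a representation that spreads variance evenly over all directions yields a value close to zero, while a representation dominated by a few directions yields a larger value, matching the "lower is better" reading used in the text.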

Figure 5: Singular Value Spectrum. We analyze the singular value spectrum of DiVA embeddings and that of a network trained with the standard discriminative task. Consistent with [47] we find that our improvements in generalization performance (as shown in Tab. 3 and Tab. 2) are reflected by a reduced spectral decay and more directions of variance in our learned representation space.

5 Conclusion

In this paper we propose several learning tasks which complement the class-discriminative training signal of standard, supervised Deep Metric Learning (DML) for improved generalization to unknown test distributions. Each of our tasks is designed to capture different characteristics of the training data: class-discriminative, class-shared, intra-class and sample-specific features. For the latter, we adapt contrastive self-supervised learning to the needs of supervised DML. Jointly optimizing all tasks results in a diverse overall training signal which is further amplified by mutual decorrelation between the individual tasks. Unifying these distinct representations greatly boosts generalization over purely discriminatively trained models. Our experiments and ablation studies show significantly boosted generalization performance, improving over existing state-of-the-art DML approaches.

References


  • [1] P. Bachman, R. D. Hjelm, and W. Buchwalter (2019) Learning representations by maximizing mutual information across views. arXiv preprint arXiv:1906.00910. Cited by: §3.1.
  • [2] M. I. Belghazi, S. Rajeswar, O. Mastropietro, N. Rostamzadeh, J. Mitrovic, and A. Courville (2018) Hierarchical adversarially learned inference. External Links: 1802.01071 Cited by: §2.
  • [3] D. Berthelot, C. Raffel, A. Roy, and I. Goodfellow (2018) Understanding and improving interpolation in autoencoders via an adversarial regularizer. External Links: 1807.07543 Cited by: §2.
  • [4] B. Bhattarai, G. Sharma, and F. Jurie (2016) CP-mtml: coupled projection multi-task metric learning for large scale face retrieval. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4226–4235. Cited by: §2.
  • [5] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. European Conference on Computer Vision. Cited by: Figure 1, §2, §4.3.
  • [6] N. V. Chawla, N. Japkowicz, and A. Kotcz (2004-06) Editorial: special issue on learning from imbalanced data sets. SIGKDD Explor. Newsl. 6 (1), pp. 1–6. External Links: ISSN 1931-0145, Link, Document Cited by: §3.1.
  • [7] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton (2020) A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709. Cited by: §1, §2.
  • [8] W. Chen, X. Chen, J. Zhang, and K. Huang (2017) Beyond triplet loss: a deep quadruplet network for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.
  • [9] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) InfoGAN: interpretable representation learning by information maximizing generative adversarial nets. External Links: 1606.03657 Cited by: §2.
  • [10] J. Deng, J. Guo, N. Xue, and S. Zafeiriou (2018) ArcFace: additive angular margin loss for deep face recognition. External Links: 1801.07698 Cited by: §2, §3.1.
  • [11] J. Donahue, P. Krähenbühl, and T. Darrell (2016) Adversarial feature learning. External Links: 1605.09782 Cited by: §2.
  • [12] Y. Duan, W. Zheng, X. Lin, J. Lu, and J. Zhou (2018-06) Deep adversarial metric learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [13] V. Dumoulin, I. Belghazi, B. Poole, O. Mastropietro, A. Lamb, M. Arjovsky, and A. Courville (2016) Adversarially learned inference. External Links: 1606.00704 Cited by: §2.
  • [14] L. Gautheron, E. Morvant, A. Habrard, and M. Sebban (2019) Metric learning from imbalanced data. External Links: 1909.01651 Cited by: §3.1.
  • [15] W. Ge (2018) Deep metric learning with hierarchical triplet loss. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 269–285. Cited by: §2, Table 3.
  • [16] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. CoRR abs/1803.07728. External Links: Link, 1803.07728 Cited by: Figure 1, §2, §4.3.
  • [17] M. Gutmann and A. Hyvärinen (2010) Noise-contrastive estimation: a new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, Cited by: §3.1, §4.3.
  • [18] R. Hadsell, S. Chopra, and Y. LeCun (2006) Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2, Table 3.
  • [19] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick (2019) Momentum contrast for unsupervised visual representation learning. External Links: 1911.05722 Cited by: Figure 1, §1, §2, §3.1, §3.2, §4.3.
  • [20] D. Hendrycks, M. Mazeika, S. Kadavath, and D. Song (2019) Using self-supervised learning can improve model robustness and uncertainty. CoRR abs/1906.12340. External Links: Link, 1906.12340 Cited by: §4.3.
  • [21] K. Hsu, S. Levine, and C. Finn (2018) Unsupervised learning via meta-learning. External Links: 1810.02334 Cited by: §2.
  • [22] J. Hu, J. Lu, and Y. Tan (2014) Discriminative deep metric learning for face verification in the wild. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [23] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. International Conference on Machine Learning. Cited by: §4.
  • [24] P. Jacob, D. Picard, A. Histace, and E. Klein (2019) Metric learning with horde: high-order regularizer for deep embeddings. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §3.2, §4.2, Table 3, §4.
  • [25] H. Jegou, M. Douze, and C. Schmid (2011) Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence 33 (1), pp. 117–128. Cited by: §4.1.
  • [26] W. Kim, B. Goyal, K. Chawla, J. Lee, and K. Kwon (2018) Attention-based ensemble for deep metric learning. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §2, §4.2, Table 3.
  • [27] D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. Cited by: §4.
  • [28] D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §2.
  • [29] T. Konno and M. Iwazume (2018) Cavity filling: pseudo-feature generation for multi-class imbalanced data problems in deep learning. External Links: 1807.06538 Cited by: §3.1.
  • [30] J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 554–561. Cited by: §4.1, Table 2, Table 3, §4.
  • [31] X. Lin, Y. Duan, Q. Dong, J. Lu, and J. Zhou (2018-09) Deep variational metric learning. In The European Conference on Computer Vision (ECCV), Cited by: §1, §2, §3.2, Table 3.
  • [32] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017) SphereFace: deep hypersphere embedding for face recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.
  • [33] C. Manning, P. Raghavan, and H. Schütze (2010) Introduction to information retrieval. Natural Language Engineering 16 (1), pp. 100–103. Cited by: §4.1.
  • [34] T. Milbich, O. Ghori, F. Diego, and B. Ommer (2020-06) Unsupervised representation learning by discovering reliable image relations. Pattern Recognition (PR) 102. Cited by: §2.
  • [35] I. Misra and L. van der Maaten (2019) Self-supervised learning of pretext-invariant representations. External Links: 1912.01991 Cited by: §2.
  • [36] Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh (2017) No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pp. 360–368. Cited by: §1, §2, §3.1.
  • [37] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84. Cited by: §2.
  • [38] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash (2018) Boosting self-supervised learning via knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9359–9367. Cited by: §2.
  • [39] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012. Cited by: §1, §2, §3, §4.1, Table 2, Table 3, §4.
  • [40] A. Oord, Y. Li, and O. Vinyals (2018) Representation learning with contrastive predictive coding. External Links: Link Cited by: §1, §3.1.
  • [41] M. Opitz, G. Waltner, H. Possegger, and H. Bischof (2018) Deep metric learning with bier: boosting independent embeddings robustly. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2, §3.1, §3.2, Table 3.
  • [42] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In NIPS-W, Cited by: §4.
  • [43] J. Pu, Y. Jiang, J. Wang, and X. Xue (2014) Which looks like which: exploring inter-class relationships in fine-grained visual categorization. In Proceedings of the IEEE European Conference on Computer Vision (ECCV), pp. 425–440. Cited by: §2.
  • [44] Q. Qian, L. Shang, B. Sun, J. Hu, H. Li, and R. Jin (2019) SoftTriple loss: deep metric learning without triplet sampling. Cited by: §3.1, §4.2, Table 3.
  • [45] S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In NeurIPS, Cited by: §3.2.
  • [46] K. Roth, B. Brattoli, and B. Ommer (2019-10) MIC: mining interclass characteristics for improved metric learning. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2, §3.1, §3.2, §4.2, Table 3, §4, §4.
  • [47] K. Roth, T. Milbich, S. Sinha, P. Gupta, B. Ommer, and J. P. Cohen (2020) Revisiting training strategies and generalization performance in deep metric learning. External Links: 2002.08473, Link Cited by: Figure 5, §4.2, §4.3.
  • [48] A. Sanakoyeu, V. Tschernezki, U. Buchler, and B. Ommer (2019) Divide and conquer the embedding space for metric learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §3.2, §4.2, Table 3, §4, §4.
  • [49] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §1, §2, §3.2, §4.3.
  • [50] K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pp. 1857–1865. Cited by: §1, §2, §3.
  • [51] N. Tishby and N. Zaslavsky (2015) Deep learning and the information bottleneck principle. External Links: 1503.02406 Cited by: §4.3.
  • [52] V. Verma, A. Lamb, C. Beckham, A. Najafi, I. Mitliagkas, A. Courville, D. Lopez-Paz, and Y. Bengio (2018) Manifold mixup: better representations by interpolating hidden states. External Links: 1806.05236 Cited by: §4.3.
  • [53] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. Cited by: §2.
  • [54] J. Wagemans, J. H. Elder, M. Kubovy, S. E. Palmer, M. A. Peterson, M. Singh, and R. von der Heydt (2012) A century of gestalt psychology in visual perception: i. perceptual grouping and figure–ground organization. Psychological bulletin 138 (6), pp. 1172. Cited by: §3.1.
  • [55] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The caltech-ucsd birds-200-2011 dataset. Cited by: §4.1, Table 2, Table 3, §4.
  • [56] J. Wang, F. Zhou, S. Wen, X. Liu, and Y. Lin (2017) Deep metric learning with angular loss. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2593–2601. Cited by: §3.1.
  • [57] X. Wang, Y. Hua, E. Kodirov, G. Hu, R. Garnier, and N. M. Robertson (2019) Ranked list loss for deep metric learning. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §4.2, Table 3.
  • [58] X. Wang, X. Han, W. Huang, D. Dong, and M. R. Scott (2019) Multi-similarity loss with general pair weighting for deep metric learning. CoRR abs/1904.06627. External Links: Link, 1904.06627 Cited by: §2, Table 3, §4, §4.
  • [59] C. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl (2017) Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2840–2848. Cited by: §1, §2, §3.1, §3.2, §3, §4.1, §4.2, Table 2, Table 3, §4, §4.
  • [60] Z. Wu, Y. Xiong, S. Yu, and D. Lin (2018) Unsupervised feature learning via non-parametric instance-level discrimination. External Links: 1805.01978 Cited by: Figure 1, §4.3.
  • [61] H. Xuan, R. Souvenir, and R. Pless (2018) Deep randomized ensembles for metric learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 723–734. Cited by: §2, §3.2, §4.2, Table 3.
  • [62] A. Zhai and H. Wu (2018) Classification is a strong baseline for deep metric learning. External Links: 1811.12649 Cited by: §2, §3.1.
  • [63] Y. Zhao, Z. Jin, G. Qi, H. Lu, and X. Hua (2018) An adversarial approach to hard triplet generation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 501–517. Cited by: Table 3.
  • [64] W. Zheng, Z. Chen, J. Lu, and J. Zhou (2019) Hardness-aware deep metric learning. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2, Table 3.