The Group Loss for Deep Metric Learning

12/01/2019, by Ismail Elezi et al.

Deep metric learning has yielded impressive results in tasks such as clustering and image retrieval by leveraging neural networks to obtain highly discriminative feature embeddings, which can be used to group samples into different classes. Much research has been devoted to the design of smart loss functions or data mining strategies for training such networks. Most methods consider only pairs or triplets of samples within a mini-batch to compute the loss function, which is commonly based on the distance between embeddings. We propose Group Loss, a loss function based on a differentiable label-propagation method that enforces embedding similarity across all samples of a group while promoting, at the same time, low-density regions amongst data points belonging to different groups. Guided by the smoothness assumption that "similar objects should belong to the same group", the proposed loss trains the neural network for a classification task, enforcing a consistent labelling amongst samples within a class. We show state-of-the-art results on clustering and image retrieval on several datasets, and show the potential of our method when combined with other techniques such as ensembles.






1 Introduction

Measuring object similarity is at the core of many important machine learning problems like clustering and object retrieval. For visual tasks, this means learning a distance function over images. With the rise of deep neural networks, the focus has shifted towards learning a feature embedding that is easily separable using a simple distance function, such as the Euclidean distance. In essence, objects of the same class (similar) should be close in the learned manifold, while objects of different classes (dissimilar) should be far apart.

Currently, the best performing approaches obtain deep feature embeddings from so-called Siamese networks [2], which are typically trained using the contrastive loss [2] or the triplet loss [29, 40]. A clear drawback of these losses is that they consider only pairs or triplets of data points, missing key information about the relationships between all members of the mini-batch. On a mini-batch of size n, even though the number of pairwise relations between samples is O(n²), contrastive loss uses only n/2 pairwise relations, while triplet loss uses 2n/3 relations. Additionally, these methods consider only the relations between objects of the same class (positives) and objects of other classes (negatives), without distinguishing between negatives belonging to different classes. This leads to not taking into consideration the global structure of the embedding space, and consequently results in lower clustering and retrieval performance. To compensate for that, researchers rely on other tricks to train neural networks for deep metric learning: intelligent sampling [16], multi-task learning [45] or hard-negative mining [28]. We advocate that we can bypass the use of such tricks by adopting a loss function that exploits, in a principled way, the global structure of the embedding space.
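As a back-of-the-envelope illustration of this gap, the relation counts above can be computed directly (the function name and the per-batch arithmetic framing are ours, not code from the paper):

```python
def relation_counts(n):
    """Relations in a mini-batch of size n.

    - total unordered pairs: n * (n - 1) / 2, i.e. O(n^2)
    - contrastive loss on disjoint pairs uses n / 2 relations
    - triplet loss on disjoint triplets uses 2n / 3 relations
      (each triplet contributes two distances: anchor-positive and
      anchor-negative)
    """
    total = n * (n - 1) // 2
    contrastive = n // 2
    triplet = 2 * n // 3
    return total, contrastive, triplet
```

For a mini-batch of 30 samples, the 435 available pairwise relations dwarf the 15 used by the contrastive loss and the 20 used by the triplet loss.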

Figure 1:

A comparison between a neural model trained with the Group Loss (left) and the triplet loss (right). Given a mini-batch of images belonging to different classes, their embeddings are computed through a convolutional neural network. Such embeddings are then used to generate a similarity matrix that is fed to the Group Loss along with prior distributions of the images over the possible classes. The green contours around some mini-batch images denote anchors. It is worth noting that, differently from the triplet loss, the Group Loss considers multiple classes and the pairwise relations between all the samples. Numbers from ① to ③ refer to the Group Loss steps, see Sec. 3.1 for details.

In this work, we propose a novel loss function for deep metric learning, called the Group Loss, which considers the similarity between all samples in a mini-batch. To create the mini-batch, we sample from a fixed number of classes, with samples coming from a class forming a group. Thus, each mini-batch consists of several randomly chosen groups, and each group has a fixed number of samples. An iterative, fully-differentiable label propagation algorithm is then used to build feature embeddings which are similar for samples belonging to the same group, and dissimilar otherwise.

At the core of our method lies an iterative process called replicator dynamics [39, 6], which refines the local information, given by the softmax layer of a neural network, with the global information of the mini-batch, given by the similarity between embeddings. The driving rationale is that the more similar two samples are, the more they affect each other in choosing their final label, and hence the more they tend to end up in the same group (hence the name Group Loss), while dissimilar samples do not affect each other's choices. Neural networks optimized with the Group Loss learn to provide similar features for samples belonging to the same class, making clustering and image retrieval easier.

Our contribution in this work is four-fold:

  • We propose a novel loss function to train neural networks for deep metric embedding that takes into account the local information of the samples, as well as their similarity.

  • We propose a differentiable label-propagation iterative model to embed the similarity computation within backpropagation, allowing end-to-end training with our new loss function.

  • We perform a comprehensive robustness analysis showing the stability of our module with respect to the choice of hyperparameters.

  • We show state-of-the-art qualitative and quantitative results in several standard clustering and retrieval datasets.

To further facilitate research, our PyTorch [22] code, hyperparameters and trained models will be released upon acceptance of the paper.

2 Related Work

The first attempt at using a neural network for feature embedding was done in the seminal work of Siamese Networks [2]. A cost function called contrastive loss was designed in such a way as to minimize the distances between pairs of images belonging to the same cluster, and maximize the distances between pairs of images coming from different clusters. In [3], researchers used the principle to successfully address the problem of face verification.

Another line of research on convex approaches for metric learning led to the triplet loss [29, 40], which was later combined with the expressive power of neural networks [28]. The main difference from the original Siamese network is that the loss is computed using triplets (an anchor, a positive and a negative data point). The loss is defined to make the distance between features of the anchor and the positive sample smaller than the distance between the anchor and the negative sample.
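The triplet formulation described above can be sketched in a few lines of numpy; the margin value and function name are illustrative choices of ours, not taken from the paper:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss [29, 40] on one (anchor, positive, negative)
    triple of embedding vectors: the anchor-positive distance should be
    smaller than the anchor-negative distance by at least `margin`."""
    d_ap = np.linalg.norm(anchor - positive)  # anchor-positive distance
    d_an = np.linalg.norm(anchor - negative)  # anchor-negative distance
    return max(0.0, d_ap - d_an + margin)     # hinge on the margin violation
```

When the positive already sits much closer to the anchor than the negative, the hinge is inactive and the triplet contributes zero loss, which is exactly why most randomly drawn triplets are uninformative.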

The approach was so successful in the field of face recognition and clustering that many works soon followed. The majority of works on the Siamese architecture consist of finding better cost functions, resulting in better performance on clustering and retrieval. In [30], the authors generalized the concept of the triplet by allowing a joint comparison among several negative examples instead of just one. [32] designed an algorithm for taking advantage of the mini-batches during the training process by lifting the vector of pairwise distances within the batch to the matrix of pairwise distances, thus enabling the algorithm to learn a feature embedding by optimizing a novel structured prediction objective on the lifted problem. The work was later extended in [31], proposing a new metric learning scheme based on structured prediction that is designed to optimize a clustering quality metric, i.e., the normalized mutual information [17]. Better results were achieved in [36], where the authors proposed a novel angular loss that takes the angular relationship between samples into account.

Knowing that the number of possible triplets is extremely large even for moderately-sized datasets, and having found that the majority of triplets are not informative [28], researchers also investigated sampling. In the original triplet loss paper [28], it was found that using semi-hard negative mining, the network can be trained to a good performance, but the training is computationally inefficient. The work of [16] found that while the majority of research is focused on designing new loss functions, selecting training examples plays an equally important role. The authors proposed a distance-weighted sampling procedure, which selects more informative and stable examples than traditional approaches, achieving excellent results in the process. The authors of [18] proposed optimizing the triplet loss on a different space of triplets than the original samples, consisting of an anchor data point and similar and dissimilar proxy data points which are learned as well. These proxies approximate the original data points so that a triplet loss over the proxies is a tight upper bound of the original loss. A very different problem formulation was given by [14], where the authors used a spectral clustering-inspired approach to achieve deep embedding. A recent work presents several extensions of the triplet loss that reduce the bias in triplet selection by adaptively correcting the distribution shift on the selected triplets [43]. The majority of recent works has focused on complementary research directions such as intelligent sampling [16, 7, 4, 38, 41] or ensemble methods [42, 27, 11, 19, 44]. As we will show in the experimental section, these can be combined with our novel loss.

Our work significantly differs from the works mentioned above. We move away entirely from the triplet loss formulation, which considers only positive and negative pairs. Instead, we focus on grouping objects using the aforementioned label-propagation iterative procedure, which allows us to compute the Group Loss and enforces consistent labeling across all members of all groups, showing superior results compared to other approaches.

3 Group Loss

Most loss functions used for deep metric learning [28, 32, 30, 31, 36, 38, 37, 14, 7, 16] do not use a classification loss function, e.g., cross-entropy, but rather a loss function based on embedding distances. The rationale behind this is that what matters for a classification network is that the output is correct, not that the embeddings of samples belonging to the same class are similar. Since each sample is classified independently, it is entirely possible that two images of the same class have two distant embeddings that both allow for a correct classification. We argue that a classification loss can still be used for deep metric learning if the decisions do not happen independently for each sample, but rather jointly for a whole group, i.e., the set of images of the same class in a mini-batch. This is the goal of the Group Loss: to take into consideration the global structure of a mini-batch, i.e., the overall class separation for all samples. To achieve this, we propose an iterative procedure that refines the local information, given by the softmax layer of a neural network, with the global information of the mini-batch, given by the similarity between embeddings. This iterative procedure categorizes samples into different groups and enforces consistent labelling among the samples of a group. While the softmax cross-entropy loss judges each sample in isolation, the Group Loss jointly takes into consideration all the samples in a mini-batch; moreover, its formulation is very different from that of softmax cross-entropy and has different mathematical properties.

3.1 Overview of Group Loss

Given a mini-batch B consisting of n images, consider the problem of assigning a class label λ ∈ Λ = {1, …, m} to each image in B. In the remainder of the manuscript, X = (x_iλ) represents a (non-negative) n × m matrix of image-label soft assignments. In other words, each row of X represents a probability distribution over the label set Λ.
The proposed model consists of the following steps (see also Fig. 1 and Algorithm 1):

  1. Initialization: initialize the matrix X(0) of image-label assignments using the softmax outputs of the neural network. Compute the n × n pairwise similarity matrix S using the neural network embeddings.

  2. Refinement: iteratively refine X, considering the similarities between all the mini-batch images, as encoded in S, as well as their labeling preferences.

  3. Loss computation: compute the cross-entropy loss of the refined probabilities and update the weights of the neural network using backpropagation.

We now provide a more detailed description of the three steps of our method.

3.2 Initialization

Image-label assignment matrix. The initial assignment matrix, denoted X(0), comes from the softmax output of the neural network. We can replace some of the initial assignments in matrix X(0) with one-hot labelings of those samples. We call these samples anchors, as their assignments do not change during the iterative refinement process and consequently do not directly affect the loss function. However, by using their correct label instead of the predicted label (coming from the softmax output of the NN), they guide the remaining samples towards their correct label.
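The anchor mechanism amounts to overwriting the softmax rows of X with one-hot rows for the chosen samples. A minimal sketch (the function name and data layout are ours):

```python
import numpy as np

def set_anchors(X, anchor_ids, labels):
    """Return a copy of the assignment matrix X (n x m) in which the rows of
    the anchor samples are replaced by one-hot encodings of their true labels.
    Non-anchor rows keep their softmax assignments."""
    X = X.copy()
    for i in anchor_ids:
        X[i] = 0.0            # clear the softmax row
        X[i, labels[i]] = 1.0 # fix it to the ground-truth label
    return X
```

During refinement this function would be re-applied after every update so that anchor rows stay fixed while still propagating their label to similar samples.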

Similarity matrix. A measure of similarity is computed among all pairs of embeddings (computed via a CNN) in B to generate a similarity matrix S. In this work, we compute the similarity measure using the Pearson's correlation coefficient [23]:

S(i, j) = corr(φ(I_i), φ(I_j))     (1)

for i ≠ j, and set S(i, i) to 0, where φ(I_i) denotes the embedding of image I_i. The choice of this measure over other options, such as a cosine layer, Gaussian kernels, or learned similarities, is motivated by the observation that the correlation coefficient uses data standardization, thus providing invariance to scaling and translation – unlike the cosine similarity, which is invariant to scaling only – and it does not require additional hyperparameters, unlike Gaussian kernels [5]. The fact that a measure of the linear relationship among features provides a good similarity measure can be explained by the fact that the computed features are actually a highly non-linear function of the inputs. Thus, the linear correlation among the embeddings actually captures a non-linear relationship among the original images.
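The pairwise correlation matrix with a zeroed diagonal can be computed vectorized; a minimal numpy sketch (the function name is ours):

```python
import numpy as np

def pearson_similarity(E):
    """Pairwise Pearson correlation between the row embeddings of E (n x d),
    with the diagonal set to 0 as in the definition above."""
    Z = E - E.mean(axis=1, keepdims=True)          # center each embedding
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)  # scale to unit norm
    S = Z @ Z.T                                    # correlation coefficients
    np.fill_diagonal(S, 0.0)                       # no self-similarity
    return S
```

Because each row is standardized before the dot product, scaling or shifting an embedding leaves its correlations unchanged, which is exactly the invariance argued for above.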

3.3 Refinement

In this core step of the proposed algorithm, the initial assignment matrix X(0) is refined in an iterative manner, taking into account the similarity information provided by matrix S. X is updated in accordance with the smoothness assumption, which prescribes that similar objects should share the same label.

To this end, let us define the n × m support matrix

Π(t) = S X(t),     (2)

whose (i, λ)-component

π_iλ(t) = Σ_j s_ij x_jλ(t)     (3)

represents the support that the current mini-batch gives to the hypothesis that the i-th image in B belongs to class λ. Intuitively, in obedience to the smoothness principle, π_iλ is expected to be high if images similar to i are likely to belong to class λ.

Figure 2:

A high-level illustration of the refinement procedure. Given two anchors A and B, and an unlabeled sample C, the goal of our procedure is to classify sample C based on its local information (in this case initialized as a uniform distribution) and its similarity with samples A and B. We see that, because the similarity of C with A is much higher than with B, it quickly (within 3 iterations) receives the same final label as A.

Given the initial assignment matrix X(0), our algorithm refines it using the following update rule:

x_iλ(t+1) = x_iλ(t) π_iλ(t) / Σ_μ x_iμ(t) π_iμ(t),     (4)

where the denominator represents a normalization factor which guarantees that the rows of the updated matrix sum up to one. This is known as multi-population replicator dynamics in evolutionary game theory [39] and is equivalent to nonlinear relaxation labeling processes [25, 24].

In matrix notation, the update rule (4) can be written as:

X(t+1) = Q⁻¹(t) [X(t) ⊙ Π(t)],     (5)

where

Q(t) = diag([X(t) ⊙ Π(t)] 1)

and 1 is the all-one m-dimensional vector. Π(t) = S X(t) as defined in (2), and ⊙ denotes the Hadamard (element-wise) matrix product. In other words, the diagonal elements of Q(t) represent the normalization factors in (4), which can also be interpreted as the average support that object i obtains from the current mini-batch at iteration t. Intuitively, the motivation behind our update rule is that at each step of the refinement process, for each image i, a label λ will increase its probability x_iλ if and only if its support π_iλ is higher than the average support among all the competing label hypotheses, Σ_μ x_iμ(t) π_iμ(t).¹

¹ This can be motivated by a Darwinian survival-of-the-fittest selection principle, see e.g. [39].

Thanks to the Baum-Eagon inequality [24], it is easy to show that the dynamical system defined by (4) has very nice convergence properties. In particular, it strictly increases at each step the following functional:

F(X) = Σ_i Σ_j Σ_λ s_ij x_iλ x_jλ,     (6)

which represents a measure of "consistency" of the assignment matrix X, in accordance with the smoothness assumption (F rewards assignments where highly similar objects are likely to be assigned the same label). In other words:

F(X(t+1)) ≥ F(X(t)),     (7)

with equality if and only if X(t) is a stationary point. Hence, our update rule (4) is, in fact, an algorithm for maximizing the functional F over the space of row-stochastic matrices. Note that this contrasts with classical gradient methods, for which an increase in the objective function is guaranteed only when infinitesimal steps are taken, and determining the optimal step size entails computing higher-order derivatives. Here, instead, the step size is implicit and yet, at each step, the value of the functional increases.
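This monotonicity can be checked numerically. The sketch below (our illustration, not code from the paper) assumes a symmetric, nonnegative similarity matrix with zero diagonal, under which the replicator update coincides with the Baum-Eagon growth transform of F:

```python
import numpy as np

def consistency(X, S):
    """F(X) = sum_ij s_ij (X X^T)_ij, the consistency functional of Eq. (6)."""
    return float(np.sum(S * (X @ X.T)))

def replicator_step(X, S):
    """One application of the matrix update rule (5)."""
    Y = X * (S @ X)                        # X(t) ⊙ Π(t)
    return Y / Y.sum(axis=1, keepdims=True)  # row-wise normalization Q^{-1}

# Random nonnegative symmetric similarities and random row-stochastic X.
rng = np.random.default_rng(0)
S = rng.random((6, 6)); S = (S + S.T) / 2; np.fill_diagonal(S, 0.0)
X = rng.random((6, 3)); X /= X.sum(axis=1, keepdims=True)

values = [consistency(X, S)]
for _ in range(10):
    X = replicator_step(X, S)
    values.append(consistency(X, S))
# `values` should be non-decreasing, per the Baum-Eagon inequality.
```

At every step the rows of X remain valid probability distributions while F never decreases, without any step-size tuning.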

3.4 Loss computation

Once the labeling assignments converge (or, in practice, once a maximum number of iterations is reached), we apply the cross-entropy loss to quantify the classification error and backpropagate the gradients. Recall that the refinement procedure is driven by replicator dynamics, as shown in the previous section. By studying Equation (5), it is straightforward to see that it is composed of fully differentiable operations (matrix-vector and scalar products), so it can be easily integrated within backpropagation. Although the refinement procedure has no parameters to be learned, its gradients can be backpropagated to the previous layers of the neural network, producing, in turn, better embeddings for similarity computation.

3.5 Summary of the Group Loss

In this section, we proposed the Group Loss function for deep metric learning. During training, the Group Loss works by grouping together similar samples based on both the similarity between the samples in the mini-batch and the local information of the samples. The similarity between samples is computed by the correlation between the embeddings obtained from a CNN, while the local information is computed with a softmax layer on the same CNN embeddings. Using an iterative procedure, we combine both sources of information and effectively bring together embeddings of samples that belong to the same class.

During inference, we simply forward pass the images through the neural network to compute their embeddings, which are directly used for image retrieval within a nearest neighbor search scheme. The iterative procedure is not used during inference, thus making the feature extraction as fast as that of any other competing method.

Algorithm 1: The Group Loss
Input: set of pre-processed images in the mini-batch B, set of labels y, neural network φ with learnable parameters θ, similarity function ω, number of iterations T
1: Compute feature embeddings φ(B) via the forward pass
2: Compute the similarity matrix S = ω(φ(B))
3: Initialize the matrix of priors X(0) from the softmax layer
4: for t = 0, …, T − 1 do
5:     X(t+1) = Q⁻¹(t) [X(t) ⊙ (S X(t))]   (update rule (5))
6: Compute the cross-entropy J(X(T), y)
7: Compute the derivatives ∂J/∂θ via backpropagation, and update the weights θ
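A forward pass of the algorithm above can be sketched in plain numpy (the released code is PyTorch). All names below are ours; as simplifying assumptions not in the paper, the softmax priors are stubbed with uniform distributions and negative correlations are clipped to zero so that the supports stay nonnegative:

```python
import numpy as np

def group_loss_forward(E, labels, n_classes, anchors=(), T=5, eps=1e-12):
    """Sketch of Algorithm 1: E is the (n, d) matrix of embeddings, `labels`
    the ground-truth classes, `anchors` the indices whose assignment rows are
    fixed to their one-hot labels. Returns the loss and the refined X."""
    n = E.shape[0]
    # Step 1-2: Pearson-correlation similarity matrix, zeroed diagonal,
    # negative entries clipped (an assumption of this sketch).
    Z = E - E.mean(axis=1, keepdims=True)
    Z /= np.linalg.norm(Z, axis=1, keepdims=True)
    S = np.maximum(Z @ Z.T, 0.0)
    np.fill_diagonal(S, 0.0)
    # Step 3: priors X(0); uniform stand-in for softmax, one-hot for anchors.
    X = np.full((n, n_classes), 1.0 / n_classes)
    for i in anchors:
        X[i] = 0.0
        X[i, labels[i]] = 1.0
    # Steps 4-5: replicator-dynamics refinement, Eq. (5).
    for _ in range(T):
        Y = X * (S @ X)
        X = Y / (Y.sum(axis=1, keepdims=True) + eps)
        for i in anchors:  # anchor rows stay one-hot
            X[i] = 0.0
            X[i, labels[i]] = 1.0
    # Steps 6-7: cross-entropy of the refined assignments (backprop omitted).
    loss = -np.mean(np.log(X[np.arange(n), labels] + eps))
    return loss, X
```

On a toy batch with two well-separated classes and one anchor per class, the refinement pulls the unlabeled samples toward the label of their most correlated anchor.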
Figure 3: Retrieval results on a set of images from the CUB-200-2011 (left), Cars 196 (middle), and Stanford Online Products (right) datasets using our Group Loss model. The left column contains query images. The results are ranked by distance. The green square indicates that the retrieved image is from the same class as the query image, while the red box indicates that the retrieved image is from a different class.

4 Experiments

In this section, we compare the Group Loss with state-of-the-art deep metric learning models on both image retrieval and clustering tasks. Our method achieves state-of-the-art results on three public benchmark datasets.

4.1 Implementation details

We use the PyTorch [22] library for the implementation of the Group Loss. We choose GoogleNet [33] with batch normalization [8] as the backbone feature extraction network. We pretrain the network on the ILSVRC 2012-CLS dataset [26]. For pre-processing, in order to get a fair comparison, we follow the implementation details of [31]. The inputs are resized to 256 × 256 pixels, and then randomly cropped to 227 × 227. Like other methods, except for [30], we use only a center crop during testing time. We first train the network on the classification task, and then train it on the Group Loss task using the Adam optimizer [12], lowering the learning rate after a fixed number of epochs. We find the hyperparameters using random search [1]. We use small mini-batches. As sampling strategy, for each mini-batch we first randomly sample a fixed number of classes, and then, for each of the chosen classes, we sample a fixed number of samples.

4.2 Benchmark datasets

We perform experiments on publicly available datasets, evaluating our algorithm on both clustering and retrieval metrics. For training and testing, we follow the conventional splitting procedure [32].

CUB-200-2011 [35] is a dataset containing 200 species of birds with 11,788 images, where the first 100 species (5,864 images) are used for training and the remaining 100 species (5,924 images) are used for testing.

Cars 196 [13] dataset is composed of 16,185 images belonging to 196 classes. We use the first 98 classes (8,054 images) for training and the other 98 classes (8,131 images) for testing.

Stanford Online Products dataset, as introduced in [32], contains 22,634 classes with 120,053 product images in total, where 11,318 classes (59,551 images) are used for training and the remaining 11,316 classes (60,502 images) are used for testing.

4.3 Evaluation metrics

Based on the experimental protocol detailed above, we evaluate retrieval performance and clustering quality on data from unseen classes of the aforementioned datasets. For the retrieval task, we calculate the percentage of the testing examples whose K nearest neighbors contain at least one example of the same class. This quantity is also known as Recall@K [9] and is the most widely used metric for image retrieval evaluation.
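The metric as defined above can be implemented in a few lines (a minimal numpy sketch; the function name and the use of Euclidean distances are our choices):

```python
import numpy as np

def recall_at_k(embeddings, labels, k):
    """Fraction of queries whose k nearest neighbors (excluding the query
    itself, Euclidean distance) contain at least one sample of the query's
    own class."""
    d = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a query never matches itself
    nn = np.argsort(d, axis=1)[:, :k]      # indices of the k nearest neighbors
    hits = (labels[nn] == labels[:, None]).any(axis=1)
    return hits.mean()
```

Note that a single correct neighbor among the top K counts as a hit, which is why Recall@K grows monotonically with K.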

Similar to all other approaches we compare with, we perform clustering using the K-means algorithm [15] on the embedded features. Like in other works, we evaluate the clustering quality using the Normalized Mutual Information (NMI) measure [17]. The choice of NMI is motivated by the fact that it is invariant to label permutation, a desirable property for cluster evaluation.
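For reference, NMI can be computed from the contingency table of the two labelings; the pure-numpy sketch below uses the square-root normalization, one common convention (the definition in [17] may normalize differently), and the function name is ours:

```python
import numpy as np

def nmi(a, b):
    """Normalized Mutual Information I(a; b) / sqrt(H(a) H(b)) between two
    labelings a (e.g. K-means clusters) and b (e.g. ground-truth classes)."""
    a = np.asarray(a); b = np.asarray(b)
    n = len(a)
    ca, cb = np.unique(a), np.unique(b)
    # contingency table of joint label counts
    C = np.array([[np.sum((a == i) & (b == j)) for j in cb] for i in ca], float)
    pij = C / n                              # joint distribution
    pi = pij.sum(axis=1, keepdims=True)      # marginal of a
    pj = pij.sum(axis=0, keepdims=True)      # marginal of b
    nz = pij > 0
    mi = np.sum(pij[nz] * np.log(pij[nz] / (pi @ pj)[nz]))
    ha = -np.sum(pi[pi > 0] * np.log(pi[pi > 0]))
    hb = -np.sum(pj[pj > 0] * np.log(pj[pj > 0]))
    return mi / np.sqrt(ha * hb) if ha > 0 and hb > 0 else 1.0
```

Relabeling the clusters leaves the contingency table's structure unchanged, which makes the score permutation invariant, the property argued for above.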

4.4 Results

We now show the results of our model and comparisons to state-of-the-art methods. Our main comparison is with other loss functions, e.g., the triplet loss, without including any other tricks such as sampling or ensembles. To compare with perpendicular research on intelligent sampling strategies or ensembles, and to show the power of the Group Loss, we propose a simple ensemble version of our method. Our ensemble network is built by training several (two or five) independent neural networks with the same hyperparameter configuration. During inference, their embeddings are concatenated. Note that this type of ensemble is much simpler than the works of [44, 42, 11, 20, 27] and is given only to show that, when optimized for performance, our method can be extended to ensembles, giving higher clustering and retrieval performance than any other method in the literature. Nevertheless, we consider that the main focus of comparison should be the other loss functions that do not use advanced sampling or ensemble methods.

CUB-200-2011 (Table 1):
Method NMI R@1 R@2 R@4 R@8
Loss
Triplet [28] 55.3 42.5 55.0 66.4 77.2
Lifted Structure [32] 56.5 43.5 56.5 68.5 79.6
Npairs [30] 60.2 51.9 64.3 74.9 83.2
Facility Location [31] 59.2 48.1 61.4 71.8 81.9
Angular Loss [36] 61.1 54.7 66.3 76.0 83.9
Proxy-NCA [18] 59.5 49.2 61.9 67.9 72.4
Deep Spectral [14] 59.2 53.2 66.1 76.7 85.2
Bias Triplet [43] - 46.6 58.6 70.0 -
Ours 67.9 64.3 75.8 84.1 90.5
Loss + Sampling/Mining
Samp. Matt. [16] 69.0 63.6 74.4 83.1 90.0
Hier. triplet [7] - 57.1 68.8 78.7 86.5
DAMLRRM [41] 61.7 55.1 66.5 76.8 85.3
DE-DSP [4] 61.7 53.6 65.5 76.9 -
GPW [38] - 65.7 77.0 86.3 91.2
Teacher-Student
RKD [21] - 61.4 73.0 81.9 89.0
Loss + Ensembles
BIER 6 [19] - 55.3 67.2 76.9 85.1
HDC 3 [44] - 54.6 66.8 77.6 85.9
DRE 48 [42] 62.1 58.9 69.6 78.4 85.6
ABE 2 [11] - 55.7 67.9 78.3 85.5
ABE 8 [11] - 60.6 71.5 79.8 87.4
A-BIER 6 [20] - 57.5 68.7 78.3 86.2
D and C 8 [27] 69.6 65.9 76.6 84.4 90.6
RLL 3 [37] 66.1 61.3 72.7 82.7 89.4
Ours 2-ensemble 68.5 65.8 76.7 85.2 91.2
Ours 5-ensemble 70.0 66.9 77.1 85.4 91.5

Cars 196 (Table 2):
Method NMI R@1 R@2 R@4 R@8
Loss
Triplet [28] 53.4 51.5 63.8 73.5 82.4
Lifted Structure [32] 56.9 53.0 65.7 76.0 84.3
Npairs [30] 62.7 68.9 78.9 85.8 90.9
Facility Location [31] 59.0 58.1 70.6 80.3 87.8
Angular Loss [36] 63.2 71.4 81.4 87.5 92.1
Proxy-NCA [18] 64.9 73.2 82.4 86.4 88.7
Deep Spectral [14] 64.3 73.1 82.2 89.0 93.0
Bias Triplet [43] - 79.2 86.7 91.4 -
Ours 70.7 83.7 89.9 93.7 96.3
Loss + Sampling/Mining
Samp. Matt. [16] 69.1 79.6 86.5 91.9 95.1
Hier. triplet [7] - 81.4 88.0 92.7 95.7
DAMLRRM [41] 64.2 73.5 82.6 89.1 93.5
DE-DSP [4] 64.4 72.9 81.6 88.8 -
GPW [38] - 84.1 90.4 94.0 96.5
Teacher-Student
RKD [21] - 82.3 89.8 94.2 96.6
Loss + Ensembles
HDC 6 [44] - 75.0 83.9 90.3 94.3
BIER 3 [19] - 78.0 85.8 91.1 95.1
DRE 48 [42] 71.0 84.2 89.4 93.2 95.5
ABE 2 [11] - 76.8 84.9 90.2 94.0
ABE 8 [11] - 85.2 90.5 94.0 96.1
A-BIER 6 [20] - 82.0 89.0 93.2 96.1
D and C 8 [27] 70.3 84.6 90.7 94.1 96.5
RLL 3 [37] 71.8 82.1 89.3 93.7 96.7
Ours 2-ensemble 72.6 86.2 91.6 95.0 97.1
Ours 5-ensemble 74.2 88.0 92.5 95.7 97.5
Table 1: Retrieval and Clustering performance on the CUB-200-2011. Bold indicates best results.
Table 2: Retrieval and Clustering performance on the Cars 196 dataset. Bold indicates best results.

4.4.1 Quantitative results

Loss comparison. In Tables 1–3, we present the results of our method and compare them with the results of other approaches. On the CUB-200-2011 dataset, we outperform the other approaches by a large margin, with the second-best model (Angular Loss [36]) having circa 10 percentage points lower absolute accuracy on the Recall@1 metric. On the NMI metric, our method achieves a score of 67.9, which is almost 7 points higher than the second-best method. Similarly, on Cars 196, our method achieves the best results on Recall@1, with Bias Triplet [43] coming second with a 4.5 points lower score. Proxy-NCA [18] is second on the NMI metric, with a 5.8 points lower score. On Stanford Online Products, our method outperforms all the other loss functions on the Recall@1 metric. On the same dataset, when evaluated on the NMI score, our loss outperforms any other method, including methods that exploit advanced sampling or ensembles.

Loss with ensembles. Our ensemble method (concatenating the embeddings of five networks) is the highest performing model on the CUB-200-2011 and Cars 196 datasets, outperforming all other methods on both the Recall@1 and NMI metrics. On Stanford Online Products, our ensemble reaches the third highest result on the Recall@1 metric (after Ranked List Loss [37] and GPW [38]) and reaches the best overall result on the NMI metric.

4.4.2 Qualitative results

Fig. 3 shows qualitative results on the retrieval task for all three datasets. In all cases, the query image is given on the left, with the four nearest neighbors given on the right. Green boxes indicate cases where the retrieved image is of the same class as the query image, and red boxes indicate a different class. As we can see, our model is able to perform well even in cases where the images suffer from occlusion and rotation. On the Cars 196 dataset, we see a successful retrieval even when the query image is taken indoors and the retrieved image outdoors, and vice versa. The first example of the Cars 196 dataset is of particular interest. Despite the fact that the query image contains multiple cars, all four retrieved nearest neighbors have the same class as the query image, showing the robustness of the algorithm to uncommon input image configurations. We provide the results of a t-SNE [34] projection in the supplementary material.

4.5 Robustness analysis

Method NMI R@1 R@10 R@100
Triplet [28] 89.5 66.7 82.4 91.9
Lifted Structure [32] 88.7 62.5 80.8 91.9
Npairs [30] 87.9 66.4 82.9 92.1
Facility Location [31] 89.5 67.0 83.7 93.2
Angular Loss [36] 88.6 70.9 85.0 93.5
Proxy-NCA [18] 90.6 73.7 - -
Deep Spectral [14] 89.4 67.6 83.7 93.3
Bias Triplet [43] - 63.0 79.8 90.7
Ours 90.8 75.1 87.5 94.2
Loss + Sampling/Mining
Samp. Matt [16] 90.7 72.7 86.2 93.8
Hier. triplet [7] - 74.8 88.3 94.8
DAMLRRM [41] 88.2 69.7 85.2 93.2
DE-DSP [4] 89.2 68.9 84.0 92.6
GPW [38] - 78.2 90.5 96.0
RKD [21] - 75.1 88.3 95.2
Loss + Ensembles
HDC 6 [44] - 70.1 84.9 93.2
BIER 3 [19] - 72.7 86.5 94.0
DRE 48 [42] - - - -
ABE 2 [11] - 75.4 88.0 94.7
ABE 8 [11] - 76.3 88.4 94.8
A-BIER 6 [20] - 74.2 86.9 94.0
D and C 8 [27] 90.2 75.9 88.4 94.9
RLL 3 [37] 90.4 79.8 91.3 96.3
Ours 2-ensemble 91.1 75.9 88.0 94.5
Ours 5-ensemble 91.1 76.3 88.3 94.6
Table 3: Retrieval and Clustering performance on Stanford Online Products. Bold indicates best results.
Figure 4: The effect of the number of anchors and the number of samples per class.
Figure 5: The effect of the number of classes per mini-batch.
Figure 6: Recall@1 as a function of training epochs on Cars196 dataset. Figure adapted from [18].
Number of anchors.

In Fig. 4, we show the effect of the number of anchors with respect to the number of samples per class. We do the analysis on the CUB-200-2011 dataset and give a similar analysis for the Cars 196 dataset in the supplementary material. The results reported are the percentage-point differences in terms of Recall@1 with respect to the best performing set of parameters (see Tab. 1). The number of anchors ranges from 0 to 4, while the number of samples per class varies from 5 to 10. It is worth noting that our best setting considers 1 or 2 anchors over 9 samples (the two zeros in Fig. 4). Moreover, even when we do not use any anchors, the drop in Recall@1 is marginal.

Number of classes per mini-batch.

In Fig. 5, we present the change in Recall@1 on the CUB-200-2011 dataset as we increase the number of classes we sample at each iteration. The best results are reached when the number of classes is not too large. This is a welcome property, as we are able to train on small mini-batches, which are known to achieve better generalization performance [10].

Convergence rate.

In Fig. 6, we present the convergence rate of the model on the Cars 196 dataset. Our model achieves state-of-the-art results early in training, making it significantly faster than other approaches. Note that other models, with the exception of Proxy-NCA [18], need hundreds of epochs to converge. Additionally, we compare the training time with Proxy-NCA [18]: on a single Volta V100 GPU, the average per-epoch running time of our method is lower than that of Proxy-NCA on both CUB-200-2011 and Cars 196. Hence, our method is faster than one of the fastest methods in the literature. Note that the inference time of every method is the same, because the network is used only for feature-embedding extraction during inference.

5 Conclusions and Future Work

In this work, we proposed the Group Loss, a new loss function for deep metric learning that goes beyond triplets. By considering the content of a mini-batch, it promotes embedding similarity across all samples of the same class, while enforcing dissimilarity for elements of different classes. This is achieved with a fully-differentiable layer which is used to train a convolutional network in an end-to-end fashion. We show that our model outperforms state-of-the-art methods on several datasets, and at the same time shows fast convergence.

In our work, we did not consider any advanced sampling strategy; instead, we randomly sample objects from a few classes at each iteration. Sampling has been shown to play a very important role in feature embedding [16], so in future work we will explore sampling techniques suitable for our module. Additionally, we plan to investigate the applicability of the Group Loss to other problems, such as person re-identification and deep semi-supervised learning.

Acknowledgements. This research was partially funded by the Humboldt Foundation through the Sofja Kovalevskaja Award. We thank Michele Fenzi, Maxim Maximov and Guillem Braso Andilla for useful discussions.

Appendix A Robustness analysis on the Cars 196 dataset

Figure 7: The effect of the number of anchors and the number of samples per class.
Figure 8: Training vs testing Recall@1 curves in Cars 196 dataset.
Figure 9: Training vs testing Recall@1 curves in Stanford Online Products dataset.

A.1 Number of anchors

In the main work, we showed the robustness analysis on the CUB-200-2011 [35] dataset (see Figure 4 in the main paper). Here, we report the same analysis for the Cars 196 [13] dataset. This leads us to the same conclusions as shown in the main paper.

As for the experiment in the main paper, we do a grid search over the total number of elements per class versus the number of anchors. We increase the number of elements per class from to , and in each case vary the number of anchors from to . We show the results in Fig. 7. Note that the results decrease mainly when we do not have any labeled sample, i.e., when we use zero anchors. The method shows the same robustness as on the CUB-200-2011 [35] dataset, with the best result being only percentage points better in the Recall@1 metric than the worst result.

Appendix B Implicit regularization and less overfitting

In Figures 8 and 9, we compare training vs. testing results on the Cars 196 [13] and Stanford Online Products [32] datasets. Note that the difference between Recall@1 at train and test time is small, especially on the Stanford Online Products dataset. On Cars 196, the best results we get on the training set are circa in the Recall@1 measure, only percentage points () better than what we reach on the testing set. Among the papers we compare with, the only one that reports results on the training set is Deep Spectral Clustering Learning [14]. They reported results of over in all metrics for all three datasets (for the training sets), much above the test-set accuracy, which lies at on Cars 196 and on the Stanford Online Products dataset. This clearly shows that our method is much less prone to overfitting.

We further implement the P-NCA [18] loss function and perform a similar experiment, in order to compare training and test accuracies directly with our method. In Figure 8, we show the training and testing curves of P-NCA on the Cars 196 [13] dataset. We see that while on the training set P-NCA reaches results higher than our method, on the testing set our method outperforms P-NCA by around .

Unfortunately, we were unable to reproduce the results of the paper [18] (on the testing set) for the Stanford Online Products dataset.

Furthermore, even when we turn off L2-regularization, the generalization performance does not drop at all. Our intuition is that, by taking into account the structure of the entire manifold of the dataset, our method introduces a form of regularization. We can clearly see a smaller gap between training and test results when compared to competing methods, indicating less overfitting. We plan to further investigate this phenomenon in future work.

Appendix C More implementation details

We first finetune all methods on a classification task for epochs. We then train our networks on all three datasets [35, 13, 32] for epochs, achieving state-of-the-art results. During training, we use a simple learning-rate schedule in which we divide the learning rate by after the first epochs.
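Such a step schedule is a one-liner; the sketch below uses placeholder values for the drop epoch and division factor, since the exact settings are not reproduced here:

```python
def step_lr(base_lr, epoch, drop_epoch=30, factor=10.0):
    """Step schedule: divide the base learning rate by `factor` once
    `drop_epoch` epochs have passed. The defaults are hypothetical
    placeholders, not the paper's exact values."""
    return base_lr if epoch < drop_epoch else base_lr / factor
```

In a training loop, the scheduler is queried once per epoch and the result is written into the optimizer's parameter groups.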

We tune all hyperparameters using random search [1]. For the weight-decay (L2-regularization) parameter, we search over the interval , while for the learning rate we search over the interval . We achieve the best results with the regularization parameter set to for CUB-200-2011, for the Cars 196 dataset, and for the Stanford Online Products dataset. This further strengthens our intuition that the method is implicitly regularized and does not require strong regularization.
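Random search over such intervals is typically done log-uniformly; a minimal sketch follows (the interval endpoints are illustrative assumptions, since the exact ranges above are specific to our setup):

```python
import math
import random

def sample_hyperparams(n_trials, lr_range=(1e-5, 1e-2),
                       wd_range=(1e-6, 1e-2), seed=0):
    """Random search [1]: draw learning rate and weight decay
    log-uniformly from the given intervals (ranges are illustrative)."""
    rng = random.Random(seed)

    def log_uniform(lo, hi):
        # uniform in log-space covers several orders of magnitude evenly
        return math.exp(rng.uniform(math.log(lo), math.log(hi)))

    return [{"lr": log_uniform(*lr_range), "weight_decay": log_uniform(*wd_range)}
            for _ in range(n_trials)]

trials = sample_hyperparams(20)
```

Each trial dictionary is then used to configure one training run, and the configuration with the best validation Recall@1 is kept.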

Appendix D Dealing with negative similarities

Equation (4) in the main paper assumes that the similarity matrix is non-negative. However, for similarity computation we use a correlation metric (see Equation (1) in the main paper), which produces values in the range [-1, 1]. In similar situations, different authors propose different methods to deal with the negative outputs. The most common approach is to shift the similarity matrix towards the positive regime by subtracting the biggest negative value from every entry in the matrix [6]. Nonetheless, this shift has a side effect: if a sample of one class has very low similarities to the elements of a large group of samples of another class, these similarity values (which after being shifted are all positive) are summed up. When the cardinality of that class is very large, summing up all these small values leads to a large total, and consequently affects the solution of the algorithm. What we want instead is to ignore these negative similarities; hence we propose clamping. More concretely, we apply a ReLU activation function to the output of Equation (1).
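The two strategies can be contrasted in a few lines (a minimal numpy sketch; the function names are ours, not the paper's):

```python
import numpy as np

def shift_similarities(w):
    """Shift [6]: subtract the most negative entry so that all values
    become non-negative; former negatives still contribute mass."""
    m = w.min()
    return w - m if m < 0 else w.copy()

def clamp_similarities(w):
    """Clamp (ReLU): zero out negative correlations so that dissimilar
    samples contribute nothing."""
    return np.maximum(w, 0.0)

# toy correlation matrix with values in [-1, 1]
w = np.array([[ 1.0,  0.8, -0.3],
              [ 0.8,  1.0, -0.1],
              [-0.3, -0.1,  1.0]])
shifted = shift_similarities(w)  # negatives moved to >= 0 but still add up
clamped = clamp_similarities(w)  # negatives ignored entirely
```

Under shifting, the many small positive values left behind by a large dissimilar class accumulate; under clamping they are exactly zero, which is the behavior the Group Loss requires.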

We compare the results of shifting vs. clamping. On the Cars 196 dataset, we do not see a significant difference between the two approaches. However, on the CUB-200-2011 dataset, the Recall@1 metric is with shifting, much below the obtained when using clamping. We investigate the matrix of similarities for the two datasets, and see that the number of entries with negative values is higher for the CUB-200-2011 dataset than for the Cars 196 dataset. This explains the difference in behavior, and also verifies our hypothesis that clamping is a better strategy to use within the Group Loss.

Appendix E t-SNE on CUB-200-2011 dataset

Figure 10: t-SNE [34] visualization of our embedding on the CUB-200-2011 [35] dataset, with some clusters highlighted. Best viewed on a monitor when zoomed in.

Figure 10 visualizes the t-distributed stochastic neighbor embedding (t-SNE) [34] of the embedding vectors obtained by our method on the CUB-200-2011 [35] dataset. The plot is best viewed on a high-resolution monitor when zoomed in. We highlight several representative groups by enlarging the corresponding regions in the corners. Despite the large pose and appearance variation, our method efficiently generates a compact feature mapping that preserves semantic similarity.


  • [1] J. Bergstra and Y. Bengio (2012) Random search for hyper-parameter optimization. Journal of Machine Learning Research 13, pp. 281–305. External Links: Link Cited by: Appendix C, §4.1.
  • [2] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1994) Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pp. 737–744. Cited by: §1, §2.
  • [3] S. Chopra, R. Hadsell, and Y. LeCun (2005) Learning a similarity metric discriminatively, with application to face verification. See DBLP:conf/cvpr/2005, pp. 539–546. External Links: Link, Document Cited by: §2.
  • [4] Y. Duan, L. Chen, J. Lu, and J. Zhou (2019) Deep embedding learning with discriminative sampling policy. In IEEE Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, USA. Cited by: §2, Table 2, Table 2, Table 3.
  • [5] I. Elezi, A. Torcinovich, S. Vascon, and M. Pelillo (2018) Transductive label augmentation for improved deep network learning. See DBLP:conf/icpr/2018, pp. 1432–1437. External Links: Link, Document Cited by: §3.2.
  • [6] A. Erdem and M. Pelillo (2012) Graph transduction as a noncooperative game. Neural Computation 24 (3), pp. 700–723. External Links: Link, Document Cited by: Appendix D, §1.
  • [7] W. Ge, W. Huang, D. Dong, and M. R. Scott (2018) Deep metric learning with hierarchical triplet loss. See DBLP:conf/eccv/2018-6, pp. 272–288. External Links: Link, Document Cited by: §2, §3, Table 2, Table 2, Table 3.
  • [8] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. See DBLP:conf/icml/2015, pp. 448–456. External Links: Link Cited by: §4.1.
  • [9] H. Jégou, M. Douze, and C. Schmid (2011) Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 33 (1), pp. 117–128. External Links: Link, Document Cited by: §4.3.
  • [10] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang (2017) On large-batch training for deep learning: generalization gap and sharp minima. In Proceedings of the 5th International Conference on Learning Representations (ICLR). Cited by: §4.5.
  • [11] W. Kim, B. Goyal, K. Chawla, J. Lee, and K. Kwon (2018) Attention-based ensemble for deep metric learning. See DBLP:conf/eccv/2018-1, pp. 760–777. External Links: Link, Document Cited by: §2, §4.4, Table 2, Table 2, Table 3.
  • [12] D. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), Cited by: §4.1.
  • [13] J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia. Cited by: §A.1, Appendix B, Appendix B, Appendix C, §4.2.
  • [14] M. T. Law, R. Urtasun, and R. S. Zemel (2017) Deep spectral clustering learning. See DBLP:conf/icml/2017, pp. 1985–1994. External Links: Link Cited by: Appendix B, §2, §3, Table 2, Table 2, Table 3.
  • [15] J. MacQueen (1967) Some methods for classification and analysis of multivariate observations. In Proc. Fifth Berkeley Symp. on Math. Statist. and Prob., Vol. 1, pp. 281–297. Cited by: §4.3.
  • [16] R. Manmatha, C. Wu, A. J. Smola, and P. Krähenbühl (2017) Sampling matters in deep embedding learning. See DBLP:conf/iccv/2017, pp. 2859–2867. External Links: Link, Document Cited by: §1, §2, §3, Table 2, Table 2, Table 3, §5.
  • [17] A. F. McDaid, D. Greene, and N. J. Hurley (2011) Normalized mutual information to evaluate overlapping community finding algorithms. CoRR abs/1110.2515. External Links: Link, 1110.2515 Cited by: §2, §4.3.
  • [18] Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh (2017) No fuss distance metric learning using proxies. See DBLP:conf/iccv/2017, pp. 360–368. External Links: Link, Document Cited by: Appendix B, Appendix B, §2, Figure 6, §4.4.1, §4.5, Table 2, Table 2, Table 3.
  • [19] M. Opitz, G. Waltner, H. Possegger, and H. Bischof (2017) BIER - boosting independent embeddings robustly. See DBLP:conf/iccv/2017, pp. 5199–5208. External Links: Link, Document Cited by: §2, Table 2, Table 2, Table 3.
  • [20] M. Opitz, G. Waltner, H. Possegger, and H. Bischof (2018) Deep metric learning with BIER: boosting independent embeddings robustly. CoRR abs/1801.04815. External Links: Link, 1801.04815 Cited by: §4.4, Table 2, Table 2, Table 3.
  • [21] W. Park, D. Kim, Y. Lu, and M. Cho (2019) Relational knowledge distillation. In IEEE Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, USA,, Vol. abs/1904.05068. External Links: Link, 1904.05068 Cited by: Table 2, Table 2, Table 3.
  • [22] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. NIPS Workshops. Cited by: §1, §4.1.
  • [23] K. Pearson (1895) Notes on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London 58, pp. 240–242. Cited by: §3.2.
  • [24] M. Pelillo (1997) The dynamics of nonlinear relaxation labeling processes. Journal of Mathematical Imaging and Vision 7 (4), pp. 309–323. Cited by: §3.3, §3.3.
  • [25] A. Rosenfeld, R. A. Hummel, and S. W. Zucker (1976) Scene labeling by relaxation operations. IEEE Trans. Syst. Man Cybern. 6, pp. 420–433. Cited by: §3.3.
  • [26] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li (2014) ImageNet large scale visual recognition challenge. CoRR abs/1409.0575. External Links: Link, 1409.0575 Cited by: §4.1.
  • [27] A. Sanakoyeu, V. Tschernezki, U. Büchler, and B. Ommer (2019) Divide and conquer the embedding space for metric learning. In IEEE Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, USA,, Vol. abs/1906.05990. External Links: Link, 1906.05990 Cited by: §2, §4.4, Table 2, Table 2, Table 3.
  • [28] F. Schroff, D. Kalenichenko, and J. Philbin (2015) FaceNet: A unified embedding for face recognition and clustering. See DBLP:conf/cvpr/2015, pp. 815–823. External Links: Link, Document Cited by: §1, §2, §2, §3, Table 2, Table 2, Table 3.
  • [29] M. Schultz and T. Joachims (2003) Learning a distance metric from relative comparisons. See DBLP:conf/nips/2003, pp. 41–48. External Links: Link Cited by: §1, §2.
  • [30] K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. See DBLP:conf/nips/2016, pp. 1849–1857. External Links: Link Cited by: §2, §3, §4.1, Table 2, Table 2, Table 3.
  • [31] H. O. Song, S. Jegelka, V. Rathod, and K. Murphy (2017) Deep metric learning via facility location. See DBLP:conf/cvpr/2017, pp. 2206–2214. External Links: Link, Document Cited by: §2, §3, §4.1, Table 2, Table 2, Table 3.
  • [32] H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. See DBLP:conf/cvpr/2016, pp. 4004–4012. External Links: Link, Document Cited by: Appendix B, Appendix C, §2, §3, §4.2, §4.2, Table 2, Table 2, Table 3.
  • [33] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. See DBLP:conf/cvpr/2015, pp. 1–9. External Links: Link, Document Cited by: §4.1.
  • [34] L. van der Maaten and G. E. Hinton (2012) Visualizing non-metric similarities in multiple maps. Machine Learning 87 (1), pp. 33–55. External Links: Link, Document Cited by: Figure 10, Appendix E, §4.4.2.
  • [35] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical report Technical Report CNS-TR-2011-001, California Institute of Technology. Cited by: §A.1, §A.1, Appendix C, Figure 10, Appendix E, §4.2.
  • [36] J. Wang, F. Zhou, S. Wen, X. Liu, and Y. Lin (2017) Deep metric learning with angular loss. See DBLP:conf/iccv/2017, pp. 2612–2620. External Links: Link, Document Cited by: §2, §3, §4.4.1, Table 2, Table 2, Table 3.
  • [37] X. Wang, Y. Hua, E. Kodirov, G. Hu, R. Garnier, and N. M. Robertson (2019) Ranked list loss for deep metric learning. See DBLP:conf/cvpr/2019, pp. 5207–5216. External Links: Link Cited by: §3, §4.4.1, Table 2, Table 2, Table 3.
  • [38] X. Wang, X. Han, W. Huang, D. Dong, and M. R. Scott (2019) Multi-similarity loss with general pair weighting for deep metric learning. In IEEE Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, USA,, Vol. abs/1904.06627. External Links: Link, 1904.06627 Cited by: §2, §3, §4.4.1, Table 2, Table 2, Table 3.
  • [39] J.W. Weibull (1997) Evolutionary game theory. MIT Press. External Links: ISBN 9780262731218 Cited by: §1, §3.3, footnote 1.
  • [40] K. Q. Weinberger and L. K. Saul (2009) Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10, pp. 207–244. External Links: Link Cited by: §1, §2.
  • [41] X. Xu, Y. Yang, C. Deng, and F. Zheng Deep asymmetric metric learning via rich relationship mining. In IEEE Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, USA,, Cited by: §2, Table 2, Table 2, Table 3.
  • [42] H. Xuan, R. Souvenir, and R. Pless (2018) Deep randomized ensembles for metric learning. See DBLP:conf/eccv/2018-16, pp. 751–762. External Links: Link, Document Cited by: §2, §4.4, Table 2, Table 2, Table 3.
  • [43] B. Yu, T. Liu, M. Gong, C. Ding, and D. Tao (2018) Correcting the triplet selection bias for triplet loss. See DBLP:conf/eccv/2018-6, pp. 71–86. External Links: Link, Document Cited by: §2, §4.4.1, Table 2, Table 2, Table 3.
  • [44] Y. Yuan, K. Yang, and C. Zhang (2017) Hard-aware deeply cascaded embedding. See DBLP:conf/iccv/2017, pp. 814–823. External Links: Link, Document Cited by: §2, §4.4, Table 2, Table 2, Table 3.
  • [45] X. Zhang, F. Zhou, Y. Lin, and S. Zhang (2016) Embedding label structures for fine-grained feature representation. See DBLP:conf/cvpr/2016, pp. 1114–1123. External Links: Link, Document Cited by: §1.