Measuring object similarity is at the core of many important machine learning problems like clustering and object retrieval. For visual tasks, this means learning a distance function over images. With the rise of deep neural networks, the focus has shifted towards learning a feature embedding that is easily separable using a simple distance function, such as the Euclidean distance. In essence, objects of the same class (similar) should be close by in the learned manifold, while objects of a different class (dissimilar) should be far away.
Currently, the best performing approaches obtain deep feature embeddings from so-called siamese networks, which are typically trained using the contrastive loss or the triplet loss [29, 40]. A clear drawback of these losses is that they only consider pairs or triplets of data points, missing key information about the relationships between all members of the mini-batch. On a mini-batch of size $n$, even though the number of pairwise relations between samples is $O(n^2)$, contrastive loss uses only $n/2$ pairwise relations, while triplet loss uses $2n/3$ relations. Additionally, these methods consider only the relations between objects of the same class (positives) and objects of other classes (negatives), without making any distinction between the different classes the negatives belong to. This ignores the global structure of the embedding space, and consequently results in lower clustering and retrieval performance. To compensate for that, researchers rely on other tricks to train neural networks for deep metric learning: intelligent sampling, multi-task learning, or hard-negative mining. We advocate that we can bypass the use of such tricks by adopting a loss function that exploits, in a principled way, the global structure of the embedding space.
In this work, we propose a novel loss function for deep metric learning, called the Group Loss, which considers the similarity between all samples in a mini-batch. To create the mini-batch, we sample from a fixed number of classes, with samples coming from a class forming a group. Thus, each mini-batch consists of several randomly chosen groups, and each group has a fixed number of samples. An iterative, fully-differentiable label propagation algorithm is then used to build feature embeddings which are similar for samples belonging to the same group, and dissimilar otherwise.
The Group Loss is based on an iterative procedure that refines the local information, given by the softmax layer of a neural network, with the global information of the mini-batch, given by the similarity between embeddings. The driving rationale is that the more similar two samples are, the more they affect each other in choosing their final label and tend to end up in the same group (hence the name Group Loss), while dissimilar samples do not affect each other in their choices. Neural networks optimized with the Group Loss learn to provide similar features for samples belonging to the same class, making clustering and image retrieval easier.
Our contribution in this work is four-fold:
We propose a novel loss function to train neural networks for deep metric embedding that takes into account the local information of the samples, as well as their similarity.
We propose a differentiable label-propagation iterative model to embed the similarity computation within backpropagation, allowing end-to-end training with our new loss function.
We perform a comprehensive robustness analysis showing the stability of our module with respect to the choice of hyperparameters.
We show state-of-the-art qualitative and quantitative results in several standard clustering and retrieval datasets.
2 Related Work
The first attempt at using a neural network for feature embedding was done in the seminal work of Siamese Networks . A cost function called contrastive loss was designed in such a way as to minimize the distances between pairs of images belonging to the same cluster, and maximize the distances between pairs of images coming from different clusters. In , researchers used the principle to successfully address the problem of face verification.
Another line of research on convex approaches for metric learning led to the triplet loss [29, 40], which was later combined with the expressive power of neural networks . The main difference from the original Siamese network is that the loss is computed using triplets (an anchor, a positive and a negative data point). The loss is defined to make the distance between features of the anchor and the positive sample smaller than the distance between the anchor and the negative sample.
The approach was so successful in the field of face recognition and clustering, that soon many works followed. The majority of works on the Siamese architecture consist of finding better cost functions, resulting in better performances on clustering and retrieval. In, the authors generalized the concept of triplet by allowing a joint comparison among negative examples instead of just one. 
designed an algorithm for taking advantage of the mini-batches during the training process by lifting the vector of pairwise distances within the batch to the matrix of pairwise distances, thus enabling the algorithm to learn feature embedding by optimizing a novel structured prediction objective on the lifted problem. The work was later extended in, proposing a new metric learning scheme based on structured prediction that is designed to optimize a clustering quality metric, i.e., the normalized mutual information . Better results were achieved on , where the authors proposed a novel angular loss, which takes angle relationship into account.
Knowing that the number of possible triplets is extremely large even for moderately-sized datasets, and having found that the majority of triplets are not informative , researchers also investigated sampling. In the original triplet loss paper , it was found that using semi-hard negative mining, the network can be trained to a good performance, but the training is computationally inefficient. The work of  found out that while the majority of research is focused on designing new loss functions, selecting training examples plays an equally important role. The authors proposed a distance-weighted sampling procedure, which selects more informative and stable examples than traditional approaches, achieving excellent results in the process. The authors of  proposed optimizing the triplet loss on a different space of triplets than the original samples, consisting of an anchor data point and similar and dissimilar proxy data points which are learned as well. These proxies approximate the original data points so that a triplet loss over the proxies is a tight upper bound of the original loss. A very different problem formulation was given by 
, where the authors used a spectral clustering-inspired approach to achieve deep embedding. A recent work presents several extensions of the triplet loss that reduce the bias in triplet selection by adaptively correcting the distribution shift on the selected triplets. The majority of recent works has been focused on complementary research directions such as intelligent sampling [16, 7, 4, 38, 41] or ensemble methods [42, 27, 11, 19, 44]. As we will show in the experimental section, these can be combined with our novel loss.
Our work significantly differs from the other works mentioned here. We move away entirely from the triplet loss formulation that considers only positive and negative pairs. Instead, we focus on grouping objects using the aforementioned label-propagation iterative procedure, which allows us to compute the Group Loss and enforces consistent labeling across all members of all groups, yielding superior results compared to other approaches.
3 Group Loss
Most loss functions used for deep metric learning do not use a classification loss function, e.g., cross-entropy, but rather a loss function based on embedding distances. The rationale behind this is that what matters for a classification network is that the output is correct, not that the embeddings of samples belonging to the same class are similar. Since each sample is classified independently, it is entirely possible that two images of the same class have two distant embeddings that both allow for a correct classification. We argue that a classification loss can still be used for deep metric learning, if the decisions do not happen independently for each sample, but rather jointly for a whole group, i.e., the set of images of the same class in a mini-batch. This is the goal of the Group Loss: to take into consideration the global structure of a mini-batch, i.e., the overall class separation for all samples. To achieve this, we propose an iterative procedure that refines the local information, given by the softmax layer of a neural network, with the global information of the mini-batch, given by the similarity between embeddings. This iterative procedure categorizes samples into different groups, and enforces consistent labelling among the samples of a group. While the softmax cross-entropy loss judges each sample in isolation, the Group Loss jointly takes into consideration all the samples in a mini-batch. Furthermore, in section 3.2, we show that the formulation of the Group Loss is very different from that of softmax cross-entropy and that our loss has different mathematical properties.
3.1 Overview of Group Loss
Given a mini-batch $\mathcal{B}$ consisting of $n$ images, consider the problem of assigning a class label $\lambda \in \Lambda = \{1, \dots, m\}$ to each image in $\mathcal{B}$. In the remainder of the manuscript, $X = (x_{i\lambda})$ represents a (non-negative) $n \times m$ matrix of image-label soft assignments. In other words, each row of $X$ represents a probability distribution over the label set $\Lambda$.
Initialization: Initialize $X(0)$, the image-label assignment matrix, using the softmax outputs of the neural network. Compute the $n \times n$ pairwise similarity matrix $W$ using the neural network embeddings.
Refinement: Iteratively refine $X$, considering the similarities between all the mini-batch images, as encoded in $W$, as well as their labeling preferences.
Loss computation: Compute the cross-entropy loss of the refined probabilities and update the weights of the neural network using backpropagation.
We now provide a more detailed description of the three steps of our method.
Image-label assignment matrix. The initial assignment matrix, denoted $X(0)$, comes from the softmax output of the neural network. We can replace some of the initial assignments in matrix $X(0)$ with one-hot labelings of those samples. We call these samples anchors, as their assignments do not change during the iterative refinement process and consequently do not directly affect the loss function. However, by using their correct label instead of the predicted label (coming from the softmax output of the NN), they guide the remaining samples towards their correct label.
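As an illustration, the anchor mechanism can be sketched in a few lines of NumPy; `init_assignments` and its arguments are hypothetical names introduced here, not part of the paper:

```python
import numpy as np

def init_assignments(softmax_probs, anchor_idx, anchor_labels, num_classes):
    """Initialize the soft-assignment matrix X from softmax outputs,
    overwriting anchor rows with one-hot vectors of their true labels."""
    X = softmax_probs.copy()
    for i, y in zip(anchor_idx, anchor_labels):
        X[i] = np.eye(num_classes)[y]  # anchors are fixed to their ground truth
    return X
```

Non-anchor rows keep the network's own (soft) predictions and are free to change during refinement.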
Similarity matrix. A measure of similarity is computed among all pairs of embeddings (computed via a CNN) in $\mathcal{B}$ to generate a similarity matrix $W \in \mathbb{R}^{n \times n}$. In this work, we compute the similarity measure using the Pearson's correlation coefficient:

$$w_{ij} = \operatorname{corr}(\phi(I_i), \phi(I_j)) \qquad (1)$$

for $i \neq j$, and set $w_{ii}$ to $0$, where $\phi(I_i)$ denotes the embedding of image $I_i$. The choice of this measure over other options, such as a cosine layer, Gaussian kernels, or learned similarities, is motivated by the observation that the correlation coefficient uses data standardization, thus providing invariance to scaling and translation – unlike the cosine similarity, which is invariant to scaling only – and it does not require additional hyperparameters, unlike Gaussian kernels. The fact that a measure of the linear relationship among features provides a good similarity measure can be explained by the fact that the computed features are a highly non-linear function of the inputs. Thus, the linear correlation among the embeddings actually captures a non-linear relationship among the original images.
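A minimal NumPy sketch of such a correlation-based similarity matrix (function and argument names are our own; the clamping of negative values anticipates the strategy discussed in Appendix D):

```python
import numpy as np

def pearson_similarity(E, clamp=True):
    """Pairwise Pearson correlation between row embeddings E (n x d).
    The diagonal is zeroed so a sample does not support itself, and
    negative correlations are optionally clamped to zero."""
    Z = E - E.mean(axis=1, keepdims=True)              # per-row centering
    Z /= (np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12)
    W = Z @ Z.T                                        # correlations in [-1, 1]
    np.fill_diagonal(W, 0.0)
    return np.maximum(W, 0.0) if clamp else W
```

Because each row is centered and normalized, the result is unchanged if every embedding is scaled and shifted by the same affine map, matching the invariance argument above.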
Refinement. In this core step of the proposed algorithm, the initial assignment matrix $X(0)$ is refined in an iterative manner, taking into account the similarity information provided by matrix $W$. $X$ is updated in accordance with the smoothness assumption, which prescribes that similar objects should share the same label.
To this end, let us define the support matrix $\Pi = (\pi_{i\lambda}) \in \mathbb{R}^{n \times m}$ as

$$\Pi = W X \qquad (2)$$

whose $(i, \lambda)$-component, $\pi_{i\lambda} = \sum_{j=1}^{n} w_{ij}\, x_{j\lambda}$, represents the support that the current mini-batch gives to the hypothesis that the $i$-th image in $\mathcal{B}$ belongs to class $\lambda$. Intuitively, in obedience to the smoothness principle, $\pi_{i\lambda}$ is expected to be high if images similar to $I_i$ are likely to belong to class $\lambda$.
Given the initial assignment matrix $X(0)$, our algorithm refines it using the following update rule:

$$x_{i\lambda}(t+1) = \frac{x_{i\lambda}(t)\, \pi_{i\lambda}(t)}{\sum_{\mu=1}^{m} x_{i\mu}(t)\, \pi_{i\mu}(t)} \qquad (4)$$
where the denominator represents a normalization factor which guarantees that the rows of the updated matrix sum up to one. This is known as multi-population replicator dynamics in evolutionary game theory and is equivalent to nonlinear relaxation labeling processes [25, 24].
In matrix notation, the update rule (4) can be written as:

$$X(t+1) = Q^{-1}(t)\, \big[ X(t) \odot \Pi(t) \big] \qquad (5)$$

where $Q(t) = \operatorname{diag}\big( [X(t) \odot \Pi(t)]\, \mathbf{1} \big)$ and $\mathbf{1}$ is the all-one $m$-dimensional vector. $\Pi(t) = W X(t)$ as defined in (2), and $\odot$ denotes the Hadamard (element-wise) matrix product. In other words, the diagonal elements of $Q(t)$ represent the normalization factors in (4), which can also be interpreted as the average support that object $i$ obtains from the current mini-batch at iteration $t$. Intuitively, the motivation behind our update rule is that at each step of the refinement process, for each image $i$, a label $\lambda$ will increase its probability $x_{i\lambda}$ if and only if its support $\pi_{i\lambda}$ is higher than the average support among all the competing label hypotheses, $\sum_{\mu} x_{i\mu}(t)\, \pi_{i\mu}(t)$. (This can be motivated by a Darwinian survival-of-the-fittest selection principle, see e.g. .)
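The update rule in matrix form can be sketched as follows in NumPy; the small `eps` is our own numerical guard against all-zero support rows, not part of the formulation:

```python
import numpy as np

def replicator_refine(X, W, iters=5, eps=1e-12):
    """Multi-population replicator-dynamics refinement:
    X(t+1) = row_normalize(X(t) * (W @ X(t))), * being elementwise."""
    for _ in range(iters):
        Pi = W @ X                         # support matrix, Eq. (2)
        X = X * (Pi + eps)                 # numerator of the update rule
        X = X / X.sum(axis=1, keepdims=True)  # row normalization, Eq. (4)
    return X
```

Each iteration is just a matrix product, an elementwise product, and a row normalization, which is why the procedure is cheap and fully differentiable.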
Thanks to the Baum-Eagon inequality , it is easy to show that the dynamical system defined by (4) has very nice convergence properties. In particular, it strictly increases at each step the following functional:

$$F(X) = \sum_{i=1}^{n} \sum_{j=1}^{n} \sum_{\lambda=1}^{m} w_{ij}\, x_{i\lambda}\, x_{j\lambda} \qquad (6)$$

which represents a measure of "consistency" of the assignment matrix $X$, in accordance with the smoothness assumption ($F$ rewards assignments where highly similar objects are likely to be assigned the same label). In other words:

$$F(X(t+1)) \geq F(X(t)) \qquad (7)$$

with equality if and only if $X(t)$ is a stationary point. Hence, our update rule (4) is, in fact, an algorithm for maximizing the functional $F$ over the space of row-stochastic matrices. Note that this contrasts with classical gradient methods, for which an increase in the objective function is guaranteed only when infinitesimal steps are taken, and determining the optimal step size entails computing higher-order derivatives. Here, instead, the step size is implicit, and yet, at each step, the value of the functional increases.
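As an illustrative sanity check (not a proof), the monotonicity of the functional under the update can be verified numerically for a random symmetric, non-negative $W$; function names are ours:

```python
import numpy as np

def consistency(X, W):
    """F(X) = sum_{i,j,lam} w_ij * x_{i,lam} * x_{j,lam} = sum(W * X X^T)."""
    return float(np.sum(W * (X @ X.T)))

def one_step(X, W, eps=1e-12):
    """A single replicator-dynamics step (same ops as the refinement loop)."""
    Pi = W @ X
    X = X * (Pi + eps)
    return X / X.sum(axis=1, keepdims=True)
```

Running several steps from a random row-stochastic start, the sequence of functional values should be non-decreasing, as the Baum-Eagon inequality guarantees.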
3.4 Loss computation
Once the labeling assignments converge (or, in practice, a maximum number of iterations is reached), we apply the cross-entropy loss to quantify the classification error and backpropagate the gradients. Recall that the refinement procedure is optimized via replicator dynamics, as shown in the previous section. By studying Equation (5), it is straightforward to see that it is composed of fully differentiable operations (matrix-vector and scalar products), and so it can be easily integrated within backpropagation. Although the refinement procedure has no parameters to be learned, its gradients can be backpropagated to the previous layers of the neural network, producing, in turn, better embeddings for similarity computation.
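A minimal NumPy sketch of the final loss computation, assuming refined assignments and integer ground-truth labels (names are ours); in a deep-learning framework the same operations are differentiable, so gradients flow back to the embedding layers:

```python
import numpy as np

def group_loss(X_refined, labels, eps=1e-12):
    """Cross-entropy between the refined assignment matrix and the
    ground-truth labels: mean of -log p_i(y_i) over the mini-batch."""
    n = X_refined.shape[0]
    p = X_refined[np.arange(n), labels]  # probability of the true class
    return float(-np.log(p + eps).mean())
```

The loss is small when the refinement pushes each row's mass towards the correct class, which only happens when similar samples support each other.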
3.5 Summary of the Group Loss
In this section, we proposed the Group Loss function for deep metric learning. During training, the Group Loss works by grouping together similar samples based on both the similarity between the samples in the mini-batch and the local information of the samples. The similarity between samples is computed by the correlation between the embeddings obtained from a CNN, while the local information is computed with a softmax layer on the same CNN embeddings. Using an iterative procedure, we combine both sources of information and effectively bring together embeddings of samples that belong to the same class.
During inference, we simply forward pass the images through the neural network to compute their embeddings, which are directly used for image retrieval within a nearest neighbor search scheme. The iterative procedure is not used during inference, thus making the feature extraction as fast as that of any other competing method.
4 Experiments
In this section, we compare the Group Loss with state-of-the-art deep metric learning models on both image retrieval and clustering tasks. Our method achieves state-of-the-art results on three public benchmark datasets.
4.1 Implementation details
We use a convolutional neural network with batch-normalization as the backbone feature extraction network. We pretrain the network on the ILSVRC 2012-CLS dataset . For pre-processing, in order to get a fair comparison, we follow the implementation details of . The inputs are resized to  pixels, and then randomly cropped to . Like other methods except for , we use only a center crop during testing time. We train all networks on the classification task for  epochs. We then train the network on the Group Loss task for  epochs using the Adam optimizer . After  epochs, we lower the learning rate by multiplying it by . We find the hyperparameters using random search . We use small mini-batches of size . As a sampling strategy, for each mini-batch we first randomly sample a fixed number of classes, and then, for each of the chosen classes, we sample a fixed number of samples.
4.2 Benchmark datasets
We perform experiments on publicly available datasets, evaluating our algorithm on both clustering and retrieval metrics. For training and testing, we follow the conventional splitting procedure .
CUB-200-2011  is a dataset containing species of birds with images, where the first species ( images) are used for training and the remaining species ( images) are used for testing.
Cars 196  dataset is composed of images belonging to classes. We use the first classes ( images) for training and the other classes ( images) for testing.
Stanford Online Products dataset, as introduced in , contains classes with product images in total, where classes ( images) are used for training and the remaining classes ( images) are used for testing.
4.3 Evaluation metrics
Based on the experimental protocol detailed above, we evaluate retrieval performance and clustering quality on data from unseen classes of the aforementioned datasets. For the retrieval task, we calculate the percentage of the testing examples whose nearest neighbors contain at least one example of the same class. This quantity is also known as Recall@K  and is the most used metric for image retrieval evaluation.
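Recall@K as described above can be sketched as follows (a brute-force NumPy version suitable for small test sets; function names are ours):

```python
import numpy as np

def recall_at_k(embeddings, labels, ks=(1, 2, 4, 8)):
    """Fraction of queries whose K nearest neighbors (Euclidean distance,
    excluding the query itself) contain at least one same-class sample."""
    d = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a query may not retrieve itself
    order = np.argsort(d, axis=1)          # neighbors sorted by distance
    out = {}
    for k in ks:
        hits = (labels[order[:, :k]] == labels[:, None]).any(axis=1)
        out[k] = float(hits.mean())
    return out
```

For large test sets one would replace the dense distance matrix with an approximate nearest-neighbor index, but the metric itself is unchanged.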
Similar to all other approaches we compare with, we perform clustering using K-means algorithm on the embedded features. Like in other works, we evaluate the clustering quality using the Normalized Mutual Information measure (NMI) . The choice of NMI measure is motivated by the fact that it is invariant to label permutation, a desirable property for cluster evaluation.
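For completeness, a self-contained NumPy sketch of the NMI measure (in practice one would use an off-the-shelf implementation; the `eps` guard and names are ours):

```python
import numpy as np

def nmi(true_labels, cluster_ids, eps=1e-12):
    """Normalized Mutual Information, 2*I(U;V) / (H(U) + H(V)).
    Invariant to permutations of the cluster ids."""
    u_cats, u = np.unique(true_labels, return_inverse=True)
    v_cats, v = np.unique(cluster_ids, return_inverse=True)
    n = len(u)
    joint = np.zeros((len(u_cats), len(v_cats)))
    for a, b in zip(u, v):                 # empirical joint distribution
        joint[a, b] += 1
    joint /= n
    pu, pv = joint.sum(axis=1), joint.sum(axis=0)
    hu = -np.sum(pu * np.log(pu + eps))    # marginal entropies
    hv = -np.sum(pv * np.log(pv + eps))
    mask = joint > 0
    mi = np.sum(joint[mask] * np.log(joint[mask] / np.outer(pu, pv)[mask]))
    return float(2 * mi / (hu + hv + eps))
```

The permutation invariance mentioned above follows because relabeling clusters only permutes rows or columns of the joint distribution.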
4.4 Results
We now show the results of our model and compare them to state-of-the-art methods. Our main comparison is with other loss functions, e.g., the triplet loss, without including any other tricks such as sampling or ensembles. To compare with perpendicular research on intelligent sampling strategies or ensembles, and to show the power of the Group Loss, we propose a simple ensemble version of our method. Our ensemble network is built by training independent neural networks with the same hyperparameter configuration. During inference, their embeddings are concatenated. Note that this type of ensemble is much simpler than the works of [44, 42, 11, 20, 27] and is given only to show that, when optimized for performance, our method can be extended to ensembles, giving higher clustering and retrieval performance than any other method in the literature. Nevertheless, we consider that the main focus of comparison should be the other loss functions that do not use advanced sampling or ensemble methods.
Table 2: Retrieval and clustering performance on CUB-200-2011.

|Method|NMI|R@1|R@2|R@4|R@8|
|Loss|
|Triplet |55.3|42.5|55.0|66.4|77.2|
|Lifted Structure |56.5|43.5|56.5|68.5|79.6|
|Npairs |60.2|51.9|64.3|74.9|83.2|
|Facility Location |59.2|48.1|61.4|71.8|81.9|
|Angular Loss |61.1|54.7|66.3|76.0|83.9|
|Proxy-NCA |59.5|49.2|61.9|67.9|72.4|
|Deep Spectral |59.2|53.2|66.1|76.7|85.2|
|Bias Triplet |-|46.6|58.6|70.0|-|
|Ours|67.9|64.3|75.8|84.1|90.5|
|Loss + Sampling/Mining|
|Samp. Matt. |69.0|63.6|74.4|83.1|90.0|
|Hier. triplet |-|57.1|68.8|78.7|86.5|
|DAMLRRM |61.7|55.1|66.5|76.8|85.3|
|DE-DSP |61.7|53.6|65.5|76.9|-|
|GPW |-|65.7|77.0|86.3|91.2|
|Teacher-Student|
|RKD |-|61.4|73.0|81.9|89.0|
|Loss + Ensembles|
|BIER 6 |-|55.3|67.2|76.9|85.1|
|HDC 3 |-|54.6|66.8|77.6|85.9|
|DRE 48 |62.1|58.9|69.6|78.4|85.6|
|ABE 2 |-|55.7|67.9|78.3|85.5|
|ABE 8 |-|60.6|71.5|79.8|87.4|
|A-BIER 6 |-|57.5|68.7|78.3|86.2|
|D and C 8 |69.6|65.9|76.6|84.4|90.6|
|RLL 3 |66.1|61.3|72.7|82.7|89.4|
|Ours 2-ensemble|68.5|65.8|76.7|85.2|91.2|
|Ours 5-ensemble|70.0|66.9|77.1|85.4|91.5|

Table 3: Retrieval and clustering performance on Cars 196.

|Method|NMI|R@1|R@2|R@4|R@8|
|Loss|
|Triplet |53.4|51.5|63.8|73.5|82.4|
|Lifted Structure |56.9|53.0|65.7|76.0|84.3|
|Npairs |62.7|68.9|78.9|85.8|90.9|
|Facility Location |59.0|58.1|70.6|80.3|87.8|
|Angular Loss |63.2|71.4|81.4|87.5|92.1|
|Proxy-NCA |64.9|73.2|82.4|86.4|88.7|
|Deep Spectral |64.3|73.1|82.2|89.0|93.0|
|Bias Triplet |-|79.2|86.7|91.4|-|
|Ours|70.7|83.7|89.9|93.7|96.3|
|Loss + Sampling/Mining|
|Samp. Matt. |69.1|79.6|86.5|91.9|95.1|
|Hier. triplet |-|81.4|88.0|92.7|95.7|
|DAMLRRM |64.2|73.5|82.6|89.1|93.5|
|DE-DSP |64.4|72.9|81.6|88.8|-|
|GPW |-|84.1|90.4|94.0|96.5|
|Teacher-Student|
|RKD |-|82.3|89.8|94.2|96.6|
|Loss + Ensembles|
|HDC 6 |-|75.0|83.9|90.3|94.3|
|BIER 3 |-|78.0|85.8|91.1|95.1|
|DRE 48 |71.0|84.2|89.4|93.2|95.5|
|ABE 2 |-|76.8|84.9|90.2|94.0|
|ABE 8 |-|85.2|90.5|94.0|96.1|
|A-BIER 6 |-|82.0|89.0|93.2|96.1|
|D and C 8 |70.3|84.6|90.7|94.1|96.5|
|RLL 3 |71.8|82.1|89.3|93.7|96.7|
|Ours 2-ensemble|72.6|86.2|91.6|95.0|97.1|
|Ours 5-ensemble|74.2|88.0|92.5|95.7|97.5|
4.4.1 Quantitative results
Loss comparison. In Tables 2-3, we present the results of our method and compare them with the results of other approaches. On the CUB-200-2011 dataset, we outperform the other approaches by a large margin, with the second-best model (Angular Loss ) having circa  percentage points lower absolute accuracy on the Recall@1 metric. On the NMI metric, our method achieves a score of , which is almost  higher than the second-best method. Similarly, on Cars 196, our method achieves the best results on Recall@1, with Bias Triplet  coming second with a  lower score. Proxy-NCA  is second on the NMI metric, with a  lower score. On Stanford Online Products, our method outperforms all the other loss functions on the Recall@1 metric. On the same dataset, when evaluated on the NMI score, our loss outperforms any other method, including methods that exploit advanced sampling or ensemble methods.
Loss with ensembles. Our ensemble method (using  neural networks) is the highest performing model on the CUB-200-2011 and Cars 196 datasets, outperforming all other methods on both the Recall@1 and NMI metrics. On Stanford Online Products, our ensemble reaches the third-highest result on the Recall@1 metric (after Ranked List Loss  and GPW ) and the best overall result on the NMI metric.
4.4.2 Qualitative results
Fig. 3 shows qualitative results on the retrieval task for all three datasets. In all cases, the query image is given on the left, with the four nearest neighbors on the right. Green boxes indicate cases where the retrieved image is of the same class as the query image, and red boxes indicate a different class. As we can see, our model is able to perform well even in cases where the images suffer from occlusion and rotation. On the Cars 196 dataset, we see a successful retrieval even when the query image is taken indoors and the retrieved image outdoors, and vice versa. The first example of the Cars 196 dataset is of particular interest. Despite the fact that the query image contains multiple cars, all four retrieved nearest neighbors have the same class as the query image, showing the robustness of the algorithm to uncommon input image configurations. We provide the results of a t-SNE  projection in the supplementary material.
4.5 Robustness analysis
Table 4: Retrieval and clustering performance on Stanford Online Products.

|Method|NMI|R@1|R@10|R@100|
|Loss|
|Lifted Structure |88.7|62.5|80.8|91.9|
|Facility Location |89.5|67.0|83.7|93.2|
|Angular Loss |88.6|70.9|85.0|93.5|
|Deep Spectral |89.4|67.6|83.7|93.3|
|Bias Triplet |-|63.0|79.8|90.7|
|Loss + Sampling/Mining|
|Samp. Matt |90.7|72.7|86.2|93.8|
|Hier. triplet |-|74.8|88.3|94.8|
|Loss + Ensembles|
|HDC 6 |-|70.1|84.9|93.2|
|BIER 3 |-|72.7|86.5|94.0|
|DRE 48 |-|-|-|-|
|ABE 2 |-|75.4|88.0|94.7|
|ABE 8 |-|76.3|88.4|94.8|
|A-BIER 6 |-|74.2|86.9|94.0|
|D and C 8 |90.2|75.9|88.4|94.9|
|RLL 3 |90.4|79.8|91.3|96.3|
Number of anchors.
In Fig. 6, we show the effect of the number of anchors with respect to the number of samples per class. We do the analysis on the CUB-200-2011 dataset and give a similar analysis for the CARS dataset in the supplementary material. The results reported are the percentage-point differences in terms of Recall@1 with respect to the best performing set of parameters (see Tab. 2). The number of anchors ranges from 0 to 4, while the number of samples per class varies from 5 to 10. It is worth noting that our best setting considers 1 or 2 anchors over 9 samples (the two zeros in Fig. 6). Moreover, even when we do not use any anchor, the difference in Recall@1 is no more than .
Number of classes per mini-batch.
In Fig. 6, we present the change in Recall@1 on the CUB-200-2011 dataset if we increase the number of classes we sample at each iteration. The best results are reached when the number of classes is not too large. This is a welcome property, as we are able to train on small mini-batches, known to achieve better generalization performance .
In Fig. 6, we present the convergence rate of the model on the Cars 196 dataset. Within the first  epochs, our model achieves state-of-the-art results, making it significantly faster to train than other approaches. Note that other models, with the exception of Proxy-NCA , need hundreds of epochs to converge. Additionally, we compare the training time with Proxy-NCA . On a single Volta V100 GPU, the average running time of our method per epoch is  seconds on CUB-200-2011 and  seconds on Cars 196, compared to  and  for Proxy-NCA . Hence, our method is faster than one of the fastest methods in the literature. Note that the inference time of every method is the same, because the network is used only for feature embedding extraction during inference.
5 Conclusions and Future Work
In this work, we proposed the Group Loss, a new loss function for deep metric learning that goes beyond triplets. By considering the content of a mini-batch, it promotes embedding similarity across all samples of the same class, while enforcing dissimilarity for elements of different classes. This is achieved with a fully-differentiable layer which is used to train a convolutional network in an end-to-end fashion. We show that our model outperforms state-of-the-art methods on several datasets, and at the same time shows fast convergence.
In our work, we did not consider any advanced and intelligent sampling strategy. Instead, we randomly sample objects from a few classes at each iteration. Sampling has shown to have a very important role in feature embedding 
, therefore, we will explore in future work sampling techniques which can be suitable for our module. Additionally, we are going to investigate the applicability of Group Loss to other problems, such as person re-identification and deep semi-supervised learning.
Acknowledgements. This research was partially funded by the Humboldt Foundation through the Sofja Kovalevskaja Award. We thank Michele Fenzi, Maxim Maximov and Guillem Braso Andilla for useful discussions.
Appendix A Robustness analysis on the Cars 196 dataset
A.1 Number of anchors
In the main work, we showed the robustness analysis on the CUB-200-2011  dataset (see Figure 4 in the main paper). Here, we report the same analysis for the Cars 196  dataset. This leads us to the same conclusions as shown in the main paper.
As for the experiment in the main paper, we do a grid search over the total number of elements per class versus the number of anchors. We increase the number of elements per class from  to , and in each case, we vary the number of anchors from  to . We show the results in Fig. 9. Note that the results decrease mainly when we do not have any labeled sample, i.e., when we use zero anchors. The method shows the same robustness as on the CUB-200-2011  dataset, with the best result being only  percentage points better on the Recall@1 metric than the worst result.
Appendix B Implicit regularization and less overfitting
In Figures 8 and 9, we compare training vs. testing results on the Cars 196  and Stanford Online Products  datasets. Note that the difference in Recall@1 between train and test time is small, especially on the Stanford Online Products dataset. On Cars 196, the best results we get on the training set are circa  in the Recall@1 measure, only  percentage points better than what we reach on the testing set. Of the papers we compare with, the only one which reports results on the training set is Deep Spectral Clustering Learning . They reported results of over  in all metrics for all three datasets (on the training sets), much above the test set accuracy, which lies at  on Cars 196 and  on the Stanford Online Products dataset. This clearly shows that our method is much less prone to overfitting.
We further implement the P-NCA  loss function and perform a similar experiment, in order to compare training and test accuracies directly with our method. In Figure 9, we show the training and testing curves of P-NCA on the Cars 196  dataset. We see that while on the training set P-NCA reaches results  higher than our method, on the testing set our method outperforms P-NCA by around .
Unfortunately, we were unable to reproduce the results of the paper  (on the testing set) for the Stanford Online Products dataset.
Furthermore, even when we turn off L2-regularization, the generalization performance does not drop at all. Our intuition is that, by taking into account the structure of the entire manifold of the dataset, our method introduces a form of regularization. We can clearly see a smaller gap between training and test results when compared to competing methods, indicating less overfitting. We plan to further investigate this phenomenon in future work.
Appendix C More implementation details
We first finetune all methods on the classification task for  epochs. We then train our networks on all three datasets [35, 13, 32] for  epochs, achieving state-of-the-art results. During training, we use a simple learning rate schedule in which we divide the learning rate by  after the first  epochs.
We tune all hyperparameters using random search . For the weight decay (L2-regularization) parameter, we search over the interval , while for the learning rate we search over the interval . We achieve the best results with a regularization parameter set to  for CUB-200-2011,  for the Cars 196 dataset, and  for the Stanford Online Products dataset. This further strengthens our intuition that the method is implicitly regularized and does not require strong regularization.
Appendix D Dealing with negative similarities
Equation (4) in the main paper assumes that the similarity matrix is non-negative. However, for similarity computation, we use a correlation metric (see Equation (1) in the main paper), which produces values in the range $[-1, 1]$. In similar situations, different authors propose different methods to deal with the negative outputs. The most common approach is to shift the similarity matrix towards the positive regime by subtracting the most negative value from every entry in the matrix. Nonetheless, this shift has a side effect: if a sample of one class has very low similarities to the elements of a large group of samples of another class, these similarity values (which after being shifted are all positive) will be summed up. When the cardinality of that other class is very large, summing up all these small values leads to a large value, which consequently affects the solution of the algorithm. What we want instead is to ignore these negative similarities, hence we propose clamping. More concretely, we use a ReLU activation function over the output of Equation (1).
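The two strategies can be contrasted in a couple of lines (function names are ours):

```python
import numpy as np

def shift_similarities(W):
    """Shift the whole matrix into the positive regime (common alternative):
    every entry, including weakly negative ones, becomes a positive vote."""
    return W - W.min() if W.min() < 0 else W

def clamp_similarities(W):
    """ReLU-style clamping: negative correlations are ignored entirely."""
    return np.maximum(W, 0.0)
```

With shifting, many small shifted-up values from a large dissimilar class can accumulate into a sizeable support term; clamping removes them from the sum altogether.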
We compare the results of shifting versus clamping. On the Cars 196 dataset, we do not see a significant difference between the two approaches. However, on the CUB-200-2011 dataset, the Recall@1 metric is considerably lower with shifting than with clamping. Inspecting the similarity matrices for the two datasets, we see that the number of entries with negative values is higher for CUB-200-2011 than for Cars 196. This explains the difference in behavior, and also supports our hypothesis that clamping is the better strategy to use within the Group Loss.
Appendix E t-SNE on CUB-200-2011 dataset
Figure 10 visualizes the t-distributed stochastic neighbor embedding (t-SNE)  of the embedding vectors obtained by our method on the CUB-200-2011  dataset. The plot is best viewed on a high-resolution monitor when zoomed in. We highlight several representative groups by enlarging the corresponding regions in the corners. Despite the large pose and appearance variation, our method efficiently generates a compact feature mapping that preserves semantic similarity.
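A visualization of this kind can be produced with scikit-learn's t-SNE; in this sketch, `embeddings` is a random placeholder standing in for the network's output features:

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder features: 100 samples with 32-dimensional embeddings.
embeddings = np.random.RandomState(0).randn(100, 32)

# Project the embeddings to 2D for plotting.
points_2d = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
```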
- (2012) Random search for hyper-parameter optimization. Journal of Machine Learning Research 13, pp. 281–305.
- (1994) Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pp. 737–744.
- (2005) Learning a similarity metric discriminatively, with application to face verification. In CVPR, pp. 539–546.
- Deep embedding learning with discriminative sampling policy.
- (2018) Transductive label augmentation for improved deep network learning. In ICPR, pp. 1432–1437.
- (2012) Graph transduction as a noncooperative game. Neural Computation 24 (3), pp. 700–723.
- (2018) Deep metric learning with hierarchical triplet loss. In ECCV, pp. 272–288.
- (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In ICML, pp. 448–456.
- (2011) Product quantization for nearest neighbor search. IEEE Trans. Pattern Anal. Mach. Intell. 33 (1), pp. 117–128.
- On large-batch training for deep learning: generalization gap and sharp minima. In ICLR.
- (2018) Attention-based ensemble for deep metric learning. In ECCV, pp. 760–777.
- (2014) Adam: a method for stochastic optimization. In ICLR.
- (2013) 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia.
- (2017) Deep spectral clustering learning. In ICML, pp. 1985–1994.
- (1967) Some methods for classification and analysis of multivariate observations. In Proc. Fifth Berkeley Symp. on Math. Statist. and Prob., Vol. 1, pp. 281–297.
- (2017) Sampling matters in deep embedding learning. In ICCV, pp. 2859–2867.
- (2011) Normalized mutual information to evaluate overlapping community finding algorithms. CoRR abs/1110.2515.
- (2017) No fuss distance metric learning using proxies. In ICCV, pp. 360–368.
- (2017) BIER - boosting independent embeddings robustly. In ICCV, pp. 5199–5208.
- (2018) Deep metric learning with BIER: boosting independent embeddings robustly. CoRR abs/1801.04815.
- (2019) Relational knowledge distillation. In CVPR.
- (2017) Automatic differentiation in PyTorch. NIPS Workshops.
- (1895) Notes on regression and inheritance in the case of two parents. Proceedings of the Royal Society of London 58, pp. 240–242.
- (1997) The dynamics of nonlinear relaxation labeling processes. Journal of Mathematical Imaging and Vision 7 (4), pp. 309–323.
- (1976) Scene labeling by relaxation operations. IEEE Trans. Syst. Man Cybern. 6, pp. 420–433.
- (2014) ImageNet large scale visual recognition challenge. CoRR abs/1409.0575.
- (2019) Divide and conquer the embedding space for metric learning. In CVPR.
- (2015) FaceNet: a unified embedding for face recognition and clustering. In CVPR, pp. 815–823.
- (2003) Learning a distance metric from relative comparisons. In NIPS, pp. 41–48.
- (2016) Improved deep metric learning with multi-class n-pair loss objective. In NIPS, pp. 1849–1857.
- (2017) Deep metric learning via facility location. In CVPR, pp. 2206–2214.
- (2016) Deep metric learning via lifted structured feature embedding. In CVPR, pp. 4004–4012.
- (2015) Going deeper with convolutions. In CVPR, pp. 1–9.
- (2012) Visualizing non-metric similarities in multiple maps. Machine Learning 87 (1), pp. 33–55.
- (2011) The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.
- (2017) Deep metric learning with angular loss. In ICCV, pp. 2612–2620.
- (2019) Ranked list loss for deep metric learning. In CVPR, pp. 5207–5216.
- (2019) Multi-similarity loss with general pair weighting for deep metric learning. In CVPR.
- (1997) Evolutionary game theory. MIT Press.
- (2009) Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research 10, pp. 207–244.
- (2019) Deep asymmetric metric learning via rich relationship mining. In CVPR.
- (2018) Deep randomized ensembles for metric learning. In ECCV, pp. 751–762.
- (2018) Correcting the triplet selection bias for triplet loss. In ECCV, pp. 71–86.
- (2017) Hard-aware deeply cascaded embedding. In ICCV, pp. 814–823.
- (2016) Embedding label structures for fine-grained feature representation. In CVPR, pp. 1114–1123.