
MIC: Mining Interclass Characteristics for Improved Metric Learning

by   Karsten Roth, et al.

Metric learning seeks to embed images of objects such that class-defined relations are captured by the embedding space. However, variability in images is not just due to different depicted object classes, but also depends on other latent characteristics such as viewpoint or illumination. In addition to these structured properties, random noise further obstructs the visual relations of interest. The common approach to metric learning is to enforce a representation that is invariant under all factors but the ones of interest. In contrast, we propose to explicitly learn the latent characteristics that are shared by and go across object classes. We can then directly explain away structured visual variability, rather than assuming it to be unknown random noise. We propose a novel surrogate task to learn visual characteristics shared across classes with a separate encoder. This encoder is trained jointly with the encoder for class information by reducing their mutual information. On five standard image retrieval benchmarks the approach significantly improves upon the state-of-the-art.





Code Repositories


(ICCV 2019) This repo contains code for "MIC: Mining Interclass Characteristics for Improved Metric Learning", which proposes an auxiliary training task to explain away intra-class variations.


1 Introduction

Images live in a high-dimensional space rich in structured information and unstructured noise. An image can therefore be described by a finite combination of latent characteristics. The goal of computer vision is then to learn the relevant latent characteristics needed to solve a given task. In object classification in particular, discriminative characteristics (e.g. car shape) are used to group images according to predefined classes. To tackle intra-class variability, modern classifiers can easily learn to be invariant to unstructured noise (e.g. random clutter, occlusion, image brightness). However, a considerable part of the variability is due to structured information shared among classes (e.g. viewpoints and notions of color).

Figure 1: (Left) Images can be described by combinations of latent characteristics and white noise. (Green) Standard metric learning encoders extract class-discriminative information while disregarding object-specific properties (e.g. color, orientation). Achieving invariance to such characteristics requires substantial training data. (Brown) Instead, the model can explain them away by learning their structure explicitly. Our novel approach explicitly separates class-specific and shared properties during training to boost the performance of the discriminative encoding.

Figure 2: Overview of our approach. We aim to learn two separate encoding spaces such that the class information extracted by the class encoder is free from shared properties, which are explicitly described by an auxiliary encoder. Given a set of image/label pairs, their CNN feature representation groups images by both class-specific (car model) and shared (orientation, color) characteristics. We separate these by training the class-discriminative encoder with ground-truth labels (boundary color). Simultaneously, an auxiliary encoder is trained on labels from a surrogate task (right) to explain away interclass features. The required surrogate labels are generated by standardizing the embedded training data per class and performing clustering. This recovers labels representing the shared structures (contour line-styles). By training both tasks together, the class encoder learns a robust encoding free of shared properties, which are now explicitly explained by the auxiliary encoder.

For metric learning this becomes especially important. As metric learning approaches project images into a high-dimensional feature space to measure similarities between images, every learned feature contributes. This means that finding a strong set of latent characteristics is crucial. Learning the characteristics shared across classes should therefore benefit the model [20], as it can better explain the object variance within a class. Take for example a model trained only on white cars of a certain category. This model will very likely not be able to recognize a blue car of the same category (Fig. 1 top-right). In this example, the encoder ignores the concept of "color" for that particular class, even though it can be learned from the data as a latent variable shared across all cars (Fig. 1 bottom-right). This is a typical generalization problem and is traditionally solved by providing more labeled data. However, besides being a costly solution, metric learning models also need to generalize to unknown classes, a task which should work independently of the amount of labels provided.

Explicitly modeling intra-class variation has already proven successful [20, 15, 1]. Spatial transformer layers [15], for example, explicitly learn the possible rotations and translations of an object category.

We therefore propose a model to discriminate between classes while simultaneously learning the shared properties of the objects. To strip intra-class characteristics away from our primary class encoder, thereby facilitating the task of learning good discriminative features, we utilize an auxiliary encoder. While the class encoder can be trained using ground-truth labels, the auxiliary encoder is learned through a novel surrogate task which extracts class-independent information without any additional annotations. Finally, an additional mutual information loss further purifies the class encoder from non-discriminative characteristics by eliminating the information learned from the auxiliary encoder.

This solution can be utilized with any standard metric learning loss, as shown in the results section. Our approach is evaluated on three standard benchmarks for zero-shot learning, CUB200-2011 [38], CARS196 [19] and Stanford Online Products [28], as well as two more recent datasets, In-Shop Clothes [36] and PKU VehicleID [21]. The results show that the proposed approach consistently enhances the performance of existing methods.

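Input: data, full encoder, inter-/intra-class encoders, CNN, class targets, batch size, cluster number, update frequency, (adversarial) mutual information loss and its weight, projection network, gradient reversal op R, metric learning loss functions for both encoders

surrogate labels ← Cluster(Stand(Embed(data, CNN, encoder)), cluster number)
epoch ← 0
while not converged do
    repeat
        batch ← GetBatch(data, class targets, surrogate labels, batch size)
        class step: Embed the batch, evaluate the class metric loss on the class targets plus the weighted mutual information loss (evaluated through R), then Backward
        auxiliary step: Embed the batch, evaluate the auxiliary metric loss on the surrogate labels plus the weighted mutual information loss (evaluated through R), then Backward
    until end of epoch
    if epoch is a multiple of the update frequency then
        surrogate labels ← Cluster(Embed(data, CNN, encoder), cluster number)
    end if
    epoch ← epoch + 1
end while

Algorithm 1: Training a model via MIC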

2 Related Work

After the success of deep learning in object classification, many researchers have been investigating neural networks for metric learning. A network for classification extracts only the features necessary for discriminating between classes. Metric learning instead encodes the images into a Euclidean space where semantically similar ones are grouped much closer together. This makes metric learning effective in various computer vision applications, such as object retrieval [28, 40], zero-shot learning [40] and face verification [7, 34]. The triplet paradigm [34] is the standard in the field and much work has been done to improve upon the original approach. As an exponential number of possible triplets makes the computation infeasible, many papers propose solutions for mining triplets more efficiently [40, 34, 12, 11, 14]. Recently, Duan et al. [8] have proposed a generative model to directly produce hard negatives. ProxyNCA [24] generates a set of class proxies and optimizes the distance of the anchor to said proxies, solving the triplet complexity problem. Others have explored orthogonal directions by extending the triplet paradigm, e.g. making use of every sample in the (specifically constructed) batch at once [28, 35], enforcing an angular triplet constraint [39], minimizing a cluster quality surrogate [27] or optimizing the overlap between positive and negative similarity histograms [37]. In addition, ensembles have been used quite successfully by combining multiple encoding spaces [29, 30, 42, 9] to maximize their efficiency.

Our work makes use of class-agnostic grouping of our data (see e.g. [3, 2]) and shares similarities with proposals from Liu et al. [20], who explicitly decompose images into class-specific and intra-class embeddings using a generative model, as well as Bai et al. [1], who, before training, divide each image class into subgroups to find an approximator for intra-class variances that can be included in the loss. However, unlike [1] and [20], we explicitly search for structures shared between classes instead of modelling the intra-class variance per sample [20] or class [1]. In addition, unlike [1], we assume class-independent intra-class variance and iteratively train a second encoder to model intra-class features, thereby purifying the main encoder from non-discriminative features and achieving significantly better results.

Finally, some works have exploited the latent structure of the data as a supervisory signal [25, 26, 6, 4, 5, 33, 32]. In particular, Caron et al. [6] learn an unsupervised image representation by clustering the data, starting from a Sobel filter prior initialization. Our approach includes such latent data structures in a similar way; however, we use them as auxiliary information to improve upon the metric learning task.

3 Improving Metric Learning

The main idea behind our method is the inclusion of class-shared characteristics into the metric learning process to help the model explain them away. In doing so, we gain robustness to intrinsic, non-discriminative properties of the data, in contrast to the common approach of simply forcing invariance towards them. However, three main problems arise with this approach: (i) Extracting both class and class-independent characteristics using a single encoder is infeasible and detrimental to the main goal. (ii) We lack the labels for extracting these latent properties. (iii) We need to explicitly remove unwanted properties from the class embedding. We propose solutions to each of these problems in Sections 3.2, 3.3 and 3.4.

Figure 3: Example of clustering the data based on the standardized features (see Sec. 3.3) for two datasets: CARS196 [19] and SOP [28]. We group the dataset into 5 clusters (rows) and select the first 5 classes (columns) with at least one sample per cluster. For each entry, we select the sample closest to the centroid per class. On the left is our interpretation of the cluster structure. The results show that subtracting the class-specific features by standardization helps to group images based on more generic properties, like car orientation and bike parts.

3.1 Preliminaries

Metric learning encodes the characteristics that discriminate between classes into an embedding vector, with the goal of training an encoder $E$ such that images from the same class are nearby in the encoding space and samples from different classes are far apart, given a standard distance in the embedding space.

In deep metric learning, image features are extracted using a neural network producing an image representation vector $f(I_i)$, which is used as input for the encoder $E$ of the embedding. The latter is implemented as a fully connected layer generating an embedding vector of dimension $D$ used for computing similarities. The features $f$ and the encoder $E$ can then be trained jointly by standard back-propagation.

With $d_{ij} = \lVert E(f(I_i)) - E(f(I_j)) \rVert_2$ defining the Euclidean distance between the images $I_i$ and $I_j$, we require that $d_{ij} < d_{ik}$ if $y_i = y_j$ and $y_i \neq y_k$. Given a triplet $(a, p, n)$ with $y_a = y_p$ and $y_a \neq y_n$, the loss is then defined as $l(a, p, n) = [d_{ap}^2 - d_{an}^2 + m]_+$, where $m$ is a margin parameter. Many variants of this loss have been proposed recently, with margin loss [40] (adding an additional learnable margin $\beta$) proving to be best.
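The triplet constraint described above can be illustrated with a minimal sketch in plain Python. It assumes the common hinge form on squared Euclidean distances; the function names and the default margin value are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss on squared distances (margin value is an assumption)."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(0.0, d_ap ** 2 - d_an ** 2 + margin)


anchor, positive, negative = [0.0, 0.0], [0.0, 1.0], [3.0, 0.0]
# Positive is much closer than the negative: the constraint holds, loss is zero.
easy = triplet_loss(anchor, positive, negative)
# Swapping the roles violates the constraint and yields a positive loss.
hard = triplet_loss(anchor, negative, positive)
```

The loss only contributes a gradient for triplets that violate the margin, which is why the choice of triplets matters so much in practice.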

Method Dim R@1 R@2 R@4 NMI
DVML[20] 512 52.7 65.1 75.5 61.4
BIER[29] 512 55.3 67.2 76.9 -
HTL[11] 512 57.1 68.8 78.7 -
A-BIER[30] 512 57.5 68.7 78.3 -
HTG[43] - 59.5 71.8 81.3 -
DREML[41] 9216 63.9 75.0 83.1 67.8
Semihard[34] - 42.6 55.0 66.4 55.4
Semihard* 128 57.2 69.4 79.9 63.9
MIC+semih 128 58.8 70.8 81.2 66.0
ProxyNCA[24] 64 49.2 61.9 67.9 64.9
ProxyNCA* 128 57.4 69.2 79.1 62.5
MIC+ProxyNCA 128 60.6 72.2 81.5 64.9
Margin[40] 128 63.6 74.4 83.1 69.0
Margin* 128 62.9 74.1 82.9 66.3
MIC+margin 128 66.1 76.8 85.6 69.7
Table 1: Recall@k for k nearest neighbor and NMI on CUB200-2011 [38]. Our model outperforms all previous approaches, even those using a larger number of parameters. (*) indicates our best re-implementation with ResNet50.

3.2 Auxiliary Encoder

To separate the process of extracting both inter- and intra-class (shared) characteristics, we utilize two separate encodings: a class encoder $E_\alpha$, which aims to extract class-discriminative features, and an auxiliary encoder $E_\beta$, which finds shared properties. These encoders are trained together (Fig. 2). To efficiently train the underlying deep neural network, the two encoders share the same image representation $f$, which is updated by both during training. In the first training task, the class encoder $E_\alpha$ is trained using the provided ground-truth labels $Y_\alpha = \{y_1, \dots, y_N\}$ associated with each image, with $N$ the number of samples. The metric-based loss function can be selected arbitrarily (such as a standard triplet loss or the aforementioned margin loss), as this part follows the generic training setup for metric learning problems. Because labels are not provided for the training of our auxiliary encoder $E_\beta$, we define an automatic process to mine shared latent structure information from the original data. This information is then used to provide a new set of training labels $Y_\beta$ to train our auxiliary encoder (Fig. 2 right). As the training scheme is now equivalent to the primary task, we may choose from the same set of loss functions.

Method Dim R@1 R@2 R@4 NMI
HTG[43] - 76.5 84.7 90.4 -
BIER[29] 512 78.0 85.8 91.1 -
HTL[11] 512 81.4 88.0 92.7 -
DVML[20] 512 82.0 88.4 93.3 67.6
A-BIER[30] 512 82.0 89.0 93.2 -
DREML[41] 9216 86.0 91.7 95.0 76.4
Semihard[34] - 51.5 63.8 73.5 53.4
Semihard* 128 65.5 76.9 85.2 58.3
MIC+semih 128 70.5 80.5 87.4 61.6
ProxyNCA[24] 64 73.2 82.4 86.4 -
ProxyNCA* 128 73.0 81.3 87.9 59.5
MIC+ProxyNCA 128 75.9 84.1 90.1 60.5
Margin[40] 128 79.6 86.5 90.1 69.1
Margin* 128 80.0 87.7 92.3 66.3
MIC+margin 128 82.6 89.1 93.2 68.4
Table 2: Recall@k for k nearest neighbor and NMI on CARS196 [19]. DREML[41] is not comparable given the large embedding dimension. (*) indicates our ResNet50 re-implementation.

3.3 Extracting Inter-class Characteristics

We seek a task which, without human supervision, spots structured characteristics within the data while ignoring class-specific information. As structured properties are generally defined by characteristics shared among several images, they create homogeneous groups. To find these, clustering offers a well-established solution. Clustering associates the images with surrogate labels $Y_\beta = \{c_1, \dots, c_N\}$, with $c_i \in \{1, \dots, C\}$ and $C$ being the predefined number of clusters. However, applied directly to the data, this method is biased towards class-specific structures, since images from the same class share many common properties, like color, context and shape, mainly injected through the data collection process (e.g. a class may be composed of pictures of the same object from multiple angles).
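How clustering produces surrogate labels can be sketched with a tiny Lloyd-style k-means in plain Python. This is only an illustrative stand-in: the function name, the naive initialization from the first `k` points and the toy data are assumptions, and a real pipeline would use an optimized library implementation.

```python
import math


def kmeans(points, k, iters=20):
    """Minimal k-means; returns one cluster index per point and the centers."""
    centers = [list(p) for p in points[:k]]  # naive deterministic init (assumption)
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for every point.
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centers[c] = [sum(d) / len(members) for d in zip(*members)]
    return assign, centers


points = [[0.0, 0.0], [5.0, 5.0], [0.2, 0.1], [5.1, 4.9], [0.1, 0.3], [4.8, 5.2]]
surrogate_labels, _ = kmeans(points, k=2)
```

The resulting cluster indices play the role of training labels for a second metric learning task, in the same way ground-truth class labels do for the primary task.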

To remove the characteristics shared within the class, we apply normalization guided by the ground-truth classes. For each class $k$ we compute the mean $\mu_k$ and standard deviation $\sigma_k$ based on the features $f_i$ of that class. Then we obtain the new standardized image representation $Z = \{z_1, \dots, z_N\}$ with $z_i = (f_i - \mu_{y_i}) / \sigma_{y_i}$, where the class influence is now reduced. Afterwards, the auxiliary encoder $E_\beta$ can be trained using the surrogate labels $Y_\beta$ produced by clustering the space $Z$.
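The per-class standardization step can be sketched as follows in plain Python. The function name, the use of the population standard deviation and the small `eps` added for numerical stability are assumptions made for this illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_per_class(features, labels, eps=1e-8):
    """Subtract the class mean and divide by the class std, per dimension."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    out = [None] * len(features)
    for idxs in by_class.values():
        dims = list(zip(*(features[i] for i in idxs)))  # one tuple per dimension
        mu = [mean(d) for d in dims]
        sigma = [pstdev(d) for d in dims]
        for i in idxs:
            out[i] = [(x - m) / (s + eps)
                      for x, m, s in zip(features[i], mu, sigma)]
    return out


# Two classes that differ by a large offset but share the same internal layout.
features = [[1.0, 10.0], [3.0, 14.0], [101.0, -2.0], [103.0, 2.0]]
labels = [0, 0, 1, 1]
z = standardize_per_class(features, labels)
```

In this toy example the first sample of each class ends up at the same standardized position, so a subsequent clustering groups samples by their role within the class rather than by class identity.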

For that to work as intended, a strong prior is needed. It is standard procedure for deep metric learning to initialize the representation backend with weights pretrained on ImageNet. This provides a sufficiently good starting point for clustering, which is then reinforced through training.

Fig. 3 shows some examples of clusters detected using our surrogate task. This task and the encoder training are summarized in Fig. 2.

Method Dim R@1 R@10 R@100 NMI
DVML[20] 512 70.2 85.2 93.8 90.8
BIER[29] 512 72.7 86.5 94.0 -
ProxyNCA[24] 64 73.7 - - -
A-BIER[30] 512 74.2 86.9 94.0 -
HTL[11] 512 74.8 88.3 94.8 -
Margin[40] 128 72.7 86.2 93.8 90.7
Margin* 128 74.4 87.2 94.0 89.4
MIC+margin 128 77.2 89.4 95.6 90.0
Table 3: Recall@k for k nearest neighbor and NMI on Stanford Online Products [28]. (*) indicates our ResNet50 re-implementation.

Method Dim R@1 R@10 R@30 R@50
BIER[29] 512 76.9 92.8 96.2 97.1
HTG[43] - 80.3 93.9 96.6 97.1
HTL[11] 512 80.9 94.3 97.2 97.8
A-BIER[30] 512 83.1 95.1 97.5 98.0
DREML[41] 9216 78.4 93.7 96.7 -
Margin* 128 84.5 95.7 97.6 98.3
MIC+margin 128 88.2 97.0 98.0 98.8
Table 4: Recall@k for k nearest neighbor on In-Shop Clothes [36]. (*) indicates our best re-implementation with ResNet50.

Method Dim Small:R@1 Small:R@5 Large:R@1 Large:R@5
MixDiff+CCL[21] - 49.0 73.5 38.2 61.6
GS-TRS[1] - 75.0 83.0 73.2 81.9
BIER[29] 512 82.6 90.6 76.0 86.4
A-BIER[30] 512 86.3 92.7 81.9 88.7
DREML[41] 9216 88.5 94.8 83.1 92.4
Margin* 128 85.1 92.4 80.4 88.9
MIC+margin 128 86.9 93.4 82.0 91.0
Table 5: Recall@k for k nearest neighbor on PKU VehicleID [21] for the small and large test splits. DREML [41] is not comparable given the large embedding dimension. (*) indicates our best ResNet50 re-implementation.

Figure 4: Qualitative nearest neighbor evaluation for CUB200-2011, CARS196 and SOP based on the class and auxiliary encodings and their combination. The results show that the auxiliary encoder leverages class-independent information (posture, parts), while the class encoder becomes independent of those features and focuses on class detection. The combination of the two reintroduces both.

3.4 Minimizing Mutual Information

The class encoder $E_\alpha$ and auxiliary encoder $E_\beta$ can then be trained using the respective labels $Y_\alpha$ and $Y_\beta$. As we utilize two different learning tasks, $E_\alpha$ and $E_\beta$ learn distinct characteristics. However, as both share the same input, the image features $f$, a dependency between the encoders can be induced, leading to both encoders learning some similar properties. To reduce this effect and to constrain the discriminative and shared characteristics to their respective encoding spaces, we introduce a mutual information loss, which we compute through an adversarial setup

$$l_d = -\left(E_\alpha^r \odot \xi(E_\beta^r)\right)^2$$

with $\xi$ being a learned, small two-layered fully-connected neural network with normalized output projecting $E_\beta$ to the encoding space of $E_\alpha$. $\odot$ stands for an elementwise product, while the superscript $r$ denotes a gradient reversal layer [10], which flips the gradient sign such that when trying to minimize $l_d$, i.e. maximizing correlation, the similarity between both encoders is actually decreased. A similar method has been adopted by [30], where shared information is minimized between an ensemble of encoders. In contrast, our goal is to transfer non-discriminative characteristics to an auxiliary encoder. Finally, as $l_d$ scales with the norm of $\xi(E_\beta)$, we avoid trivial solutions by enforcing $\xi(E_\beta)$ to have unit length, similar to $E_\alpha$ and $E_\beta$.

Finally, the total loss $L$ to train our two encoders and the representation $f$ is computed by $L = L_\alpha + L_\beta + \gamma \, l_d$, where $\gamma$ weights the contribution of the mutual information loss with respect to the class triplet loss $L_\alpha$ and the auxiliary triplet loss $L_\beta$. The full training is described in Alg. 1.

4 Experiments

In this section we offer a quantitative and qualitative analysis of our method, also in comparison to previous work. After providing technical information for reproducing the results of our model, we give some information regarding the standard benchmarks for metric learning and provide comparisons to previous methods. Finally, we offer insights into the model by studying its key components.

Figure 5: UMAP projection of the class encoding for CARS196. Seven clusters are selected, showing six images near the centroid and their ground-truth labels. We see that the encoding extracts class-specific information and ignores other information (e.g. orientation).

4.1 Implementation details

We implement our method using the PyTorch framework [31]. As baseline architecture, we utilize ResNet50 [13] due to its widespread use in recent metric learning work. All experiments use a single NVIDIA GeForce Titan X. In practice, the class and auxiliary encoders $E_\alpha$ and $E_\beta$ use the same training protocol (following [40], with an embedding dimension of 128 as listed in Tables 1 to 5) with alternating iterations to maximize the usable batch size. The dimensionality of the auxiliary encoder $E_\beta$ is fixed (except for the ablations in Sec. 5) to the dimensionality of $E_\alpha$ to ensure similar computational efficiency compared to previous work. However, due to GPU memory limitations, we use a smaller batch size than originally proposed, with no relevant changes in performance.

During training, we randomly crop images after resizing them to a fixed size, followed by random horizontal flips. For all experiments, we use the original images without bounding boxes. We train the model using Adam [18] and leave all parameters other than the learning rate at their defaults. We set the triplet parameters following [40], both for the initial value of the learnable margin in the margin loss and for the fixed triplet margin. Per mini-batch, we sample a fixed number of images per class for a random set of classes until the batch size is reached. For the weight of the mutual information loss (Sec. 3.4), we use dataset-dependent values determined via cross-validation.
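The class-balanced mini-batch construction described above might look roughly like the following sketch. The function name and the concrete values (`batch_size=8`, `per_class=4`) are placeholders chosen for the example, not the settings used in the experiments.

```python
import random
from collections import defaultdict


def sample_batch(labels, batch_size, per_class, rng):
    """Draw `per_class` sample indices for random classes until the batch is full."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    # Only classes with enough samples can contribute a full group.
    eligible = [y for y, idxs in by_class.items() if len(idxs) >= per_class]
    classes = rng.sample(eligible, batch_size // per_class)
    batch = []
    for y in classes:
        batch.extend(rng.sample(by_class[y], per_class))
    return batch


# Toy dataset: 10 classes with 6 samples each.
labels = [c for c in range(10) for _ in range(6)]
batch = sample_batch(labels, batch_size=8, per_class=4, rng=random.Random(0))
```

Sampling several images per class guarantees that every anchor in the batch has at least one positive, which triplet-style losses require.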

After class standardization, the clustering is performed via standard k-means using the faiss framework [17]. Using the hyperparameters proposed in this paragraph, the computational cost introduced by our approach is 10-20% of total training time. For efficiency, the clustering can be computed on GPU using faiss [17]. The number of clusters is set before training to a fixed, problem-specific value for each of CUB200-2011 [38], CARS196 [19], Stanford Online Products [28], In-Shop Clothes [36] and PKU VehicleID [21]. We update the cluster labels every other epoch. Notably, however, our model is robust to both parameters, since a large range of values gives comparable results. In Section 5 we study the effect of cluster numbers and cluster label update frequencies for each dataset in more detail to motivate the chosen numbers. Finally, class assignments by clustering, especially in the initial training stages, become near arbitrary for samples further away from cluster centers. To ensure that we do not reinforce such a strong initial bias, we found it beneficial to ease the class constraint by randomly switching samples with samples from different cluster classes.

Figure 6: UMAP projection of the auxiliary encoding for CARS196. Seven clusters are selected, showing six images near the centroid and their ground-truth (GT) labels. The result shows that the encoding extracts intrinsic characteristics of the object (car) independent of GT classes.

4.2 Datasets

Our model is evaluated on five standard benchmarks for image retrieval typically used in deep metric learning. We report the Recall@k metric [16] to evaluate image retrieval and the normalized mutual information score (NMI) [22] for the clustering quality. The training and evaluation procedure follows the standard setup as used in [40].
CARS196 [19] with 196 car models over 16,185 images. We use the first 98 classes (8,054 images) for training and the remaining 98 (8,131 images) for testing.
Stanford Online Products [28] with 120,053 product images in 22,634 classes. 59,551 images (11,318 classes) are used for training, 60,502 (11,316 classes) for testing.
CUB200-2011 [38] with 200 bird species over 11,788 images. The training and test sets contain the first and last 100 classes (5,864/5,924 images), respectively.
In-Shop Clothes [36] with 52,712 clothing images in 7,982 classes. 3,997 classes are used for training and 3,985 classes for evaluation. The test set is divided into a query set (14,218 images) and a gallery set (12,612 images).
PKU VehicleID [21] with 221,736 surveillance images of 26,267 vehicles with shared car models. We follow [21] and use 13,134 classes (110,178 images) for training. Testing is done on a predefined small and large testing subset with 7,332 (small) and 20,038 (large) images, respectively.
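For readers unfamiliar with the Recall@k metric used throughout the evaluation, the following plain-Python sketch shows one common way to compute it when every test image serves as a query against all other test images. The function name and toy data are illustrative assumptions; benchmarks with separate query and gallery sets (such as In-Shop Clothes) would search the gallery instead.

```python
import math


def recall_at_k(embeddings, labels, k):
    """Fraction of queries with at least one same-class sample among the k nearest."""
    hits = 0
    for i, (e_i, y_i) in enumerate(zip(embeddings, labels)):
        # Distances to every other sample, excluding the query itself.
        neighbours = sorted(
            (math.dist(e_i, e_j), j)
            for j, e_j in enumerate(embeddings) if j != i
        )
        if any(labels[j] == y_i for _, j in neighbours[:k]):
            hits += 1
    return hits / len(embeddings)


embeddings = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [0.4, 0.0]]
labels = [0, 0, 1, 1, 1]
r1 = recall_at_k(embeddings, labels, k=1)
r2 = recall_at_k(embeddings, labels, k=2)
```

Recall@k can only grow with k, since a larger neighbourhood gives each query more chances to find a sample of its own class.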

Figure 7: Evaluation of the class encoder as a function of the capacity of the auxiliary encoder. For CARS196 [19] and CUB200-2011 [38], we plot Recall@1 against the auxiliary encoder dimension during training. The results show that the increase in capacity of the auxiliary encoder, and thus the ability to learn properties shared among classes, directly benefits the class encoder.

4.3 Quantitative and Qualitative Results

In this section we compare our approach with existing models from recent literature. Our method is applied on three different losses, the standard triplet loss with semi-hard negative mining [34], Proxy-NCA [24] and the state-of-the-art margin loss with weighted sampling [40]. For full transparency, we also provide results with our re-implementation of the baselines.

The results show a consistent gain over the state of the art for all datasets, see Tables 1, 2, 3, 4 and 5. In particular, our approach achieves better results than more complex ensembles. On CUB200-2011, we outperform even DREML [41], which trains multiple ResNet models in parallel.

Qualitative results are shown in Fig.4: the class encoder retrieves images sharing class-specific characteristics, while the auxiliary encoder finds intrinsic, class-independent object properties (e.g. posture, context). The combination retrieves images with both characteristics.

5 Ablations

In this section, we investigate the properties of our model and evaluate its components. We qualitatively examine the proposed encoder properties by checking recalled images for both encoders and study the influence of the auxiliary encoder on the recall performance, see Section 5.1. In Section 5 we measure the relation between the intra-class variance and the capacity of our auxiliary encoder. In addition, ablation studies are performed to examine the relevance of each pipeline component and hyper-parameter. We primarily utilize the most common benchmarks CUB200-2011, CARS196 and SOP.

Figure 8: Measure of the intra-class variance in the class embedding as a function of the auxiliary encoder dimension. The result shows that the intra-class variance decreases with an increase in capacity. This suggests that a larger auxiliary encoder makes it easier for the class encoder to disregard class-independent information.

5.1 Embedding Properties

Firstly, we visualize the characteristics of the class encoder $E_\alpha$ (Fig. 5) and the auxiliary encoder $E_\beta$ (Fig. 6) by projecting the embedded test data to two dimensions using UMAP [23]. The figures show $E_\alpha$ extracting class-discriminative information while $E_\beta$ encodes characteristics shared across classes (e.g. car orientation).

To evaluate the effect of the auxiliary encoder $E_\beta$ on the class encoder $E_\alpha$, we study the properties of the class encoding as a function of the capability of $E_\beta$ to learn shared characteristics. First, we study the performance of $E_\alpha$ on CARS196 [19] and CUB200-2011 [38] relative to the auxiliary encoder dimension. Utilizing varying dimensionalities, Fig. 7 shows a direct relation between the capacity of $E_\beta$ and the retrieval capability. $E_\beta$ with dimension 0 indicates the baseline method [40]. For all other evaluations, the dimension of $E_\beta$ is equal to that of $E_\alpha$ to keep the computational cost comparable to the baseline [40] (see Sec. 4.1).

To examine our initial assumption that learning shared characteristics produces more compact classes, we study the intra-class variance by computing the mean pairwise distances per class, averaged over all classes. These distances are normalized by the average inter-class distance, approximated by the distance between two class centers. As summarized in Fig. 8, we see higher intra-class variance for the basic margin loss ($E_\beta$ dimension equal to 0). More importantly, the class compactness is directly related to the capacity of the auxiliary encoder $E_\beta$.
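The compactness measure described above can be sketched in plain Python as follows. The function name is hypothetical, and averaging the center distance over all pairs of class centers is an assumption about how the inter-class normalization is approximated.

```python
import math
from collections import defaultdict
from itertools import combinations


def normalized_intra_class_distance(embeddings, labels):
    """Mean pairwise intra-class distance divided by mean distance between class centers."""
    by_class = defaultdict(list)
    for e, y in zip(embeddings, labels):
        by_class[y].append(e)
    # Mean pairwise distance within each class that has at least two samples.
    intra = []
    for members in by_class.values():
        pairs = list(combinations(members, 2))
        if pairs:
            intra.append(sum(math.dist(a, b) for a, b in pairs) / len(pairs))
    # Class centers and their mean pairwise distance.
    centers = [[sum(d) / len(m) for d in zip(*m)] for m in by_class.values()]
    center_pairs = list(combinations(centers, 2))
    inter = sum(math.dist(a, b) for a, b in center_pairs) / len(center_pairs)
    return (sum(intra) / len(intra)) / inter


embeddings = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels = [0, 0, 1, 1]
score = normalized_intra_class_distance(embeddings, labels)
```

Lower values indicate classes that are compact relative to how far apart they lie, which is the property the ablation tracks as the auxiliary encoder grows.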

We also offer a qualitative evaluation of the surrogate task in Fig.3. After class-standardization, the clustering recognizes latent structures of the data shared across classes.

Figure 9: Ablation study: influence of the number of clusters on Recall@1. A fixed cluster label update period of 1 was used with equal learning rate and consistent scheduling.

Clust Stand MutInfo CARS CUB SOP
- - - 80.0 62.9 73.2
+ - - 79.2 59.1 71.9
+ + - 81.3 64.9 75.8
+ + + 82.6 66.1 77.2
Table 6: Ablation study: relevance of different contributions, reported as Recall@1. Each component is crucial for reaching the best performance. (Clust: training with clusters, Stand: standardization before clustering (Sec. 3.3), MutInfo: mutual information loss (Sec. 3.4))

5.2 Testing Components and Parameters

In order to analyze our modules, we evaluate different models, each lacking one of the proposed contributions, see Tab. 6. The table shows that each component is needed for the best performance. Comparing to the baseline in the first row, we see that simply introducing an additional task based on clustering the data deteriorates the performance, as we add another class-discriminative training signal that introduces worse or even contradictory information. However, by utilizing standardization, we allow our second encoder to explicitly learn new features to support the class encoder instead of working against it, giving a significant performance boost. A final mutual information loss emphasises the feature separation to improve the results further.

Our approach can be combined with most existing metric learning losses, which we evaluate on ProxyNCA[24] and triplet loss with semihard sampling[34] in Tab.1 and 2. On both CARS196 and CUB200-2011, we see improved image retrieval performance.

To examine the newly introduced hyper-parameters, Fig. 9 compares the performance on the three benchmarks using a range of cluster numbers. The plot shows how the number of clusters influences the final performance, meaning that the quality of the latent structure extracted by the auxiliary encoder is crucial for a better classification. At the same time, near-optimal Recall@1 is reached for a large set of cluster values, making the model robust to this hyper-parameter. For these cumulative tests, a higher learning rate and fewer training epochs were used to both reduce computation time and avoid overfitting to the test set. Based on these examinations, we set a fixed, but dataset-dependent cluster number for all other training runs, see Sec. 4.1.

A similar evaluation has been performed on the update frequency for the auxiliary labels (Fig. 10). Updating the clusters frequently clearly provides a boost to our model, suggesting that the auxiliary encoder improves upon the initial clustering. However, within a reasonable range of values (an update every 1 to 10 epochs) the model shows no significant drop in performance. We therefore fix this parameter to an update every two epochs for all experiments.

Figure 10: Ablation study: influence of the cluster label update frequency on Recall@1. An optimal number of clusters (see Sec. 4.1) and consistent scheduling was used.

6 Conclusion

In this paper we have introduced a novel extension for standard metric learning methods to incorporate structured intra-class information into the learning process. We do so by separating the encoding space into two distinct subspaces. One incorporates information about class-dependent characteristics, with the remaining encoder handling shared, class-independent properties. While the former is trained using standard metric learning setups, we propose a new learning task for the second encoder to learn shared characteristics and describe a combined training setup. Experiments on several standard image retrieval datasets show that our method consistently boosts standard approaches, outperforming the current state-of-the-art methods and reducing intra-class variance.

Acknowledgements. This work has been supported by Bayer and by hardware donations from NVIDIA Corporation.


  • [1] Y. Bai, F. Gao, Y. Lou, S. Wang, T. Huang, and L. Duan (2017) Incorporating intra-class variance to fine-grained visual recognition. CoRR abs/1703.00196. Cited by: §1, §2, Table 5.
  • [2] M. A. Bautista, A. Sanakoyeu, E. Tikhoncheva, and B. Ommer (2016) CliqueCNN: deep unsupervised exemplar learning. In Advances in Neural Information Processing Systems, pp. 3846–3854. Cited by: §2.
  • [3] M. Á. Bautista, A. Sanakoyeu, and B. Ommer (2017) Deep unsupervised similarity learning using partially ordered sets. CoRR abs/1704.02268. Cited by: §2.
  • [4] B. Brattoli, U. Büchler, A. Wahl, M. E. Schwab, and B. Ommer (2017) LSTM self-supervision for detailed behavior analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [5] U. Büchler, B. Brattoli, and B. Ommer (2018) Improving spatiotemporal self-supervision by deep reinforcement learning. In European Conference on Computer Vision (ECCV), Cited by: §2.
  • [6] M. Caron, P. Bojanowski, A. Joulin, and M. Douze (2018) Deep clustering for unsupervised learning of visual features. CoRR abs/1807.05520. Cited by: §2.
  • [7] S. Chopra, R. Hadsell, and Y. LeCun (2005) Learning a similarity metric discriminatively, with application to face verification. pp. 539–546. Cited by: §2.
  • [8] Y. Duan, W. Zheng, X. Lin, J. Lu, and J. Zhou (2018-06) Deep adversarial metric learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [9] Y. Freund and R. E. Schapire (1997) A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences 55 (1), pp. 119–139. Cited by: §2.
  • [10] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016-01) Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17 (1), pp. 2096–2030. ISSN 1532-4435. Cited by: §3.4.
  • [11] W. Ge (2018) Deep metric learning with hierarchical triplet loss. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 269–285. Cited by: §2, Table 1, Table 2, Table 3, Table 4.
  • [12] B. Harwood, B. Kumar, G. Carneiro, I. Reid, T. Drummond, et al. (2017) Smart mining for deep metric learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2821–2829. Cited by: §2.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1.
  • [14] A. Iscen, G. Tolias, Y. Avrithis, and O. Chum (2018) Mining on manifolds: metric learning without labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7642–7651. Cited by: §2.
  • [15] M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025. Cited by: §1.
  • [16] H. Jegou, M. Douze, and C. Schmid (2011) Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence 33 (1), pp. 117–128. Cited by: §4.2.
  • [17] J. Johnson, M. Douze, and H. Jégou (2017) Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734. Cited by: §4.1.
  • [18] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
  • [19] J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013) 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 554–561. Cited by: §1, Figure 3, Table 2, Figure 7, §4.1, §4.2, §5.1.
  • [20] X. Lin, Y. Duan, Q. Dong, J. Lu, and J. Zhou (2018-09) Deep variational metric learning. In The European Conference on Computer Vision (ECCV), Cited by: §1, §1, §2, Table 1, Table 2, Table 3.
  • [21] H. Liu, Y. Tian, Y. Wang, L. Pang, and T. Huang (2016) Deep relative distance learning: tell the difference between similar vehicles. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2167–2175. Cited by: §1, Table 5, §4.1, §4.2.
  • [22] C. Manning, P. Raghavan, and H. Schütze (2010) Introduction to information retrieval. Natural Language Engineering 16 (1), pp. 100–103. Cited by: §4.2.
  • [23] L. McInnes, J. Healy, and J. Melville (2018) Umap: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. Cited by: §5.1.
  • [24] Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh (2017) No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pp. 360–368. Cited by: §2, Table 1, Table 2, Table 3, §4.3, §5.2.
  • [25] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84. Cited by: §2.
  • [26] M. Noroozi, A. Vinjimoor, P. Favaro, and H. Pirsiavash (2018) Boosting self-supervised learning via knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9359–9367. Cited by: §2.
  • [27] H. Oh Song, S. Jegelka, V. Rathod, and K. Murphy (2017) Deep metric learning via facility location. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5382–5390. Cited by: §2.
  • [28] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese (2016) Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012. Cited by: §1, §2, Figure 3, Table 3, §4.1, §4.2.
  • [29] M. Opitz, G. Waltner, H. Possegger, and H. Bischof (2017) Bier-boosting independent embeddings robustly. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5189–5198. Cited by: §2, Table 1, Table 2, Table 3, Table 4, Table 5.
  • [30] M. Opitz, G. Waltner, H. Possegger, and H. Bischof (2018) Deep metric learning with bier: boosting independent embeddings robustly. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2, §3.4, Table 1, Table 2, Table 3, Table 4, Table 5.
  • [31] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. In NIPS-W, Cited by: §4.1.
  • [32] A. Sanakoyeu, V. Tschernezki, U. Büchler, and B. Ommer (2019) Divide and conquer the embedding space for metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
  • [33] N. Sayed, B. Brattoli, and B. Ommer (2018) Cross and learn: cross-modal self-supervision. In German Conference on Pattern Recognition (GCPR), Cited by: §2.
  • [34] F. Schroff, D. Kalenichenko, and J. Philbin (2015) Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §2, Table 1, Table 2, §4.3, §5.2.
  • [35] K. Sohn (2016) Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pp. 1857–1865. Cited by: §2.
  • [36] X. Tang (2016-06) DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, Table 4, §4.1, §4.2.
  • [37] E. Ustinova and V. Lempitsky (2016) Learning deep embeddings with histogram loss. In Advances in Neural Information Processing Systems, pp. 4170–4178. Cited by: §2.
  • [38] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The caltech-ucsd birds-200-2011 dataset. Cited by: §1, Table 1, Figure 7, §4.1, §4.2, §5.1.
  • [39] J. Wang, F. Zhou, S. Wen, X. Liu, and Y. Lin (2017) Deep metric learning with angular loss. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2593–2601. Cited by: §2.
  • [40] C. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl (2017) Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2840–2848. Cited by: §2, §3.1, Table 1, Table 2, Table 3, §4.1, §4.1, §4.2, §4.3, §5.1.
  • [41] H. Xuan, R. Souvenir, and R. Pless (2018) Deep randomized ensembles for metric learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 723–734. Cited by: Table 1, Table 2, Table 4, Table 5, §4.3.
  • [42] Y. Yuan, K. Yang, and C. Zhang (2017) Hard-aware deeply cascaded embedding. In Proceedings of the IEEE international conference on computer vision, pp. 814–823. Cited by: §2.
  • [43] Y. Zhao, Z. Jin, G. Qi, H. Lu, and X. Hua (2018) An adversarial approach to hard triplet generation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 501–517. Cited by: Table 1, Table 2, Table 4.