Images live in a high-dimensional space rich in structured information and unstructured noise. An image can therefore be described by a finite combination of latent characteristics. The goal of computer vision is then to learn the relevant latent characteristics needed to solve a given task. In object classification in particular, discriminative characteristics (e.g. car shape) are used to group the images according to predefined classes. To tackle intra-class variability, modern classifiers can easily learn to be invariant to unstructured noise (e.g. random clutter, occlusion, image brightness). However, a considerable part of the variability is due to structured information shared among classes (e.g. viewpoints and notions of color).
For metric learning this becomes especially important. As metric learning approaches project images into a high-dimensional feature space to measure similarities between images, every learned feature contributes. This means that finding a strong set of latent characteristics is crucial. Learning the characteristics shared across classes should therefore benefit the model, as it can better explain the object variance within a class. Take for example a model trained only on white cars of a certain category. This model will very likely not be able to recognize a blue car of the same category (Fig.1 top-right). In this example, the encoder ignores the concept of "color" for that particular class, even though it can be learned from the data as a latent variable shared across all cars (Fig.1 bottom-right). This is a typical generalization problem and is traditionally solved by providing more labeled data. However, besides being costly, this solution does not address the fact that metric learning models also need to generalize to unknown classes, a task which should work independently of the amount of labels provided.
Explicitly modeling intra-class variation has already proven successful [20, 15, 1], for example with spatial transformer layers, which explicitly learn the possible rotations and translations of an object category.
We therefore propose a model to discriminate between classes while simultaneously learning the shared properties of the objects. To strip intra-class characteristics away from our primary class encoder, thereby facilitating the task of learning good discriminative features, we utilize an auxiliary encoder. While the class encoder can be trained using ground-truth labels, the auxiliary encoder is learned through a novel surrogate task which extracts class-independent information without any additional annotations. Finally, an additional mutual information loss further purifies the class encoder from non-discriminative characteristics by eliminating the information learned from the auxiliary encoder.
This solution can be utilized with any standard metric learning loss, as shown in the results section. Our approach is evaluated on three standard benchmarks for zero-shot learning, CUB200-2011, CARS196 and Stanford Online Products, as well as two more recent datasets, In-Shop Clothes and PKU VehicleID. The results show that the proposed approach consistently enhances the performance of existing methods.
2 Related Work
After the success of deep learning in object classification, many researchers have been investigating neural networks for metric learning. A network for classification extracts only the features necessary for discriminating between classes. Metric learning instead encodes the images into a Euclidean space where semantically similar ones are grouped much closer together. This makes metric learning effective in various computer vision applications, such as object retrieval [28, 40], zero-shot learning and face verification [7, 34]. The triplet paradigm is the standard in the field and much work has been done to improve upon the original approach. As an exponential number of possible triplets makes the computation infeasible, many papers propose solutions for mining triplets more efficiently [40, 34, 12, 11, 14]. Recently, Duan et al. have proposed a generative model to directly produce hard negatives. ProxyNCA generates a set of class proxies and optimizes the distance of the anchor to said proxies, solving the triplet complexity problem. Others have explored orthogonal directions by extending the triplet paradigm, e.g. making use of every sample in the (specifically constructed) batch at once [28, 35], enforcing an angular triplet constraint, minimizing a cluster quality surrogate or optimizing the overlap between positive and negative similarity histograms. In addition, ensembles have been used quite successfully by combining multiple encoding spaces [29, 30, 42, 9] to maximize their efficiency.
Our work makes use of class-agnostic grouping of our data (see e.g. [3, 2]) and shares similarities with proposals from Liu et al., who explicitly decompose images into class-specific and intra-class embeddings using a generative model, as well as Bai et al., who, before training, divide each image class into subgroups to find an approximator for intra-class variances that can be included in the loss. However, unlike both, we explicitly search for structures shared between classes instead of modelling the intra-class variance per sample or per class. In addition, we assume class-independent intra-class variance and iteratively train a second encoder to model intra-class features, thereby purifying the main encoder from non-discriminative features and achieving significantly better results.
Finally, some works have exploited the latent structure of the data as a supervisory signal [25, 26, 6, 4, 5, 33, 32]. In particular, Caron et al. learn an unsupervised image representation by clustering the data, starting from a Sobel filter prior initialization. Our approach includes such latent data structures in a similar way; however, we use them as auxiliary information to improve upon the metric learning task.
3 Improving Metric Learning
The main idea behind our method is the inclusion of class-shared characteristics into the metric learning process to help the model explain them away. In doing so, we gain robustness to intrinsic, non-discriminative properties of the data, in contrast to the common approach of simply forcing invariance towards them. However, three main problems arise with this approach: (i) Extracting both class and class-independent characteristics using a single encoder is infeasible and detrimental to the main goal. (ii) We lack the labels for extracting these latent properties. (iii) We need to explicitly remove unwanted properties from the class embedding. We propose solutions to each of these problems in Sections 3.2, 3.3 and 3.4.
Metric learning encodes the characteristics that discriminate between classes into an embedding vector, with the goal of training an encoder $E$ such that images $x_i$ from the same class are nearby in the encoding space and samples from different classes are far apart, given a standard distance in the embedding space.
In deep metric learning, image features are extracted using a neural network producing an image representation vector $f(x_i)$, which is used as input for the encoder $E$ of the embedding. The latter is implemented as a fully connected layer generating an embedding vector of dimension $D$ used for computing similarities. The features $f$ and the encoder $E$ can then be trained jointly by standard back-propagation.
With $d(x_i, x_j) = \lVert E(f(x_i)) - E(f(x_j)) \rVert_2$ defining the Euclidean distance between the images $x_i$ and $x_j$, and $y_i$ denoting the class label of $x_i$, we require that $d(x_i, x_j) < d(x_i, x_k)$ if $y_i = y_j$ and $y_i \neq y_k$.
Given a triplet $(x_i, x_j, x_k)$ with $y_i = y_j$ and $y_i \neq y_k$, the loss is then defined as
$$\ell(x_i, x_j, x_k) = \max\big(d(x_i, x_j) - d(x_i, x_k) + m,\; 0\big),$$
where $m$ is a margin parameter. Many variants of this loss have been proposed recently, with margin loss (adding an additional learnable margin $\beta$) proving to be best.
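The triplet objective described above can be illustrated with a small numeric sketch. It assumes unsquared Euclidean distances and a margin of 0.2; both the margin value and the function names are illustrative assumptions, not the authors' implementation.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: max(d(a, p) - d(a, n) + margin, 0)."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

anchor = [0.0, 0.0]
positive = [0.3, 0.4]   # distance 0.5 from the anchor
negative = [3.0, 4.0]   # distance 5.0 from the anchor

# Satisfied triplet: the negative is already much farther away than the positive.
easy = triplet_loss(anchor, positive, negative)
# Violated triplet: swapping the roles makes the "positive" the far sample.
hard = triplet_loss(anchor, negative, positive)
```

A satisfied triplet contributes no loss, while a violated one is penalized in proportion to how far the ordering is off, which is why mining informative triplets matters in practice.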
3.2 Auxiliary Encoder
To separate the extraction of inter-class and intra-class (shared) characteristics, we utilize two separate encodings: a class encoder, which aims to extract class-discriminative features, and an auxiliary encoder, which finds shared properties. These encoders are trained together (Fig.2). To efficiently train the underlying deep neural network, the two encoders share the same image representation $f$, which is updated by both during training. In the first training task, the class encoder is trained using the provided ground-truth labels $y_i$ associated with each image $x_i$, $i \in \{1, \dots, N\}$, with $N$ the number of samples. The corresponding metric-based loss function can be selected arbitrarily (such as a standard triplet loss or the aforementioned margin loss), as this part follows the generic training setup for metric learning problems. Because labels are not provided for the training of our auxiliary encoder, we define an automatic process to mine shared latent structure information from the original data. This information is then used to provide a new set of training labels to train our auxiliary encoder (Fig.2 right). As the training scheme is now equivalent to the primary task, we may choose from the same set of loss functions.
3.3 Extracting Inter-class Characteristics
We seek a task which, without human supervision, spots structured characteristics within the data while ignoring class-specific information. As structured properties are generally defined by characteristics shared among several images, they create homogeneous groups. To find these, clustering offers a well-established solution. This algorithm assigns each image $x_i$ a surrogate label $c_i \in \{1, \dots, C\}$, with $C$ being the predefined number of clusters. However, applied directly to the data, this method is biased towards class-specific structures, since images from the same class share many common properties, like color, context and shape, mainly injected through the data collection process (e.g. a class may be composed of pictures of the same object from multiple angles).
To remove the characteristics shared within the class, we apply normalization guided by the ground-truth classes. For each class we compute the mean of the features $f(x_i)$ belonging to that class. We then obtain a new, standardized image representation $Z$ by normalizing each feature vector with the statistics of its own class, so that the class influence is reduced. Afterwards, the auxiliary encoder can be trained using the surrogate labels produced by clustering the space $Z$.
For that to work as intended, a strong prior is needed. It is standard procedure for deep metric learning to initialize the representation backend with weights pretrained on ImageNet. This provides a sufficiently good starting point for clustering, which is then reinforced through training.
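The class-guided normalization of this section can be sketched as follows. This is a minimal pure-Python illustration, assuming that every feature dimension is centered with the class mean and scaled with the class standard deviation; the function name `class_standardize` and the `eps` term are hypothetical choices for this sketch, not the authors' code.

```python
import math
from collections import defaultdict

def class_standardize(features, labels, eps=1e-8):
    """Standardize each feature vector with the mean/std of its own class."""
    by_class = defaultdict(list)
    for vec, y in zip(features, labels):
        by_class[y].append(vec)

    # Per-class mean and (population) standard deviation for every dimension.
    stats = {}
    for y, vecs in by_class.items():
        dim = len(vecs[0])
        mean = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
        std = [
            math.sqrt(sum((v[d] - mean[d]) ** 2 for v in vecs) / len(vecs))
            for d in range(dim)
        ]
        stats[y] = (mean, std)

    standardized = []
    for vec, y in zip(features, labels):
        mean, std = stats[y]
        # eps (an assumption of this sketch) guards against zero variance.
        standardized.append([(x - m) / (s + eps) for x, m, s in zip(vec, mean, std)])
    return standardized

# Two classes with very different offsets but the same internal structure.
features = [[1.0, 10.0], [3.0, 14.0], [101.0, -5.0], [103.0, -1.0]]
labels = [0, 0, 1, 1]
z = class_standardize(features, labels)
```

In this toy example the first samples of both classes end up at (almost) the same point after standardization, so a subsequent clustering step would group them by their shared within-class position rather than by class identity.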
3.4 Minimizing Mutual Information
The class encoder and auxiliary encoder can then be trained using the respective labels. As we utilize two different learning tasks, the two encoders learn distinct characteristics. However, as both share the same input, the image features $f(x_i)$, a dependency between the encoders can be induced, leading to both encoders learning some similar properties. To reduce this effect and to constrain the discriminative and shared characteristics to their respective encoding spaces, we introduce a mutual information loss $\mathcal{L}_d$, which we compute through an adversarial setup based on the elementwise product between one embedding and a learned projection $\xi$ of the other.
Here, $\xi$ is a learned, small two-layer fully-connected neural network with normalized output that projects one embedding into the encoding space of the other. Both embeddings pass through a gradient reversal layer, which flips the gradient sign such that when trying to minimize $\mathcal{L}_d$, i.e. maximizing correlation, the similarity between both encoders is actually decreased. A similar method has previously been adopted to minimize shared information between an ensemble of encoders. In contrast, our goal is to transfer non-discriminative characteristics to an auxiliary encoder. Finally, as $\mathcal{L}_d$ scales with the magnitude of the output of $\xi$, we avoid trivial solutions by enforcing this output to have unit length, similar to the class and auxiliary embeddings.
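As a sketch of how such a loss could be written, the following form is consistent with the description in this section. All symbols here are assumptions introduced for illustration: $E_{\text{class}}$ and $E_{\text{aux}}$ denote the two embeddings, $\xi$ the learned projection (assumed here to map the auxiliary embedding into the class embedding space), and the superscript $r$ a gradient reversal layer. It is not necessarily the authors' exact formulation.

```latex
\mathcal{L}_d \;=\; -\sum_{k}
  \Big( E_{\text{class}}^{\,r}(x)_k \,\cdot\, \xi\big(E_{\text{aux}}^{\,r}(x)\big)_k \Big)^{2},
\qquad
\big\lVert \xi\big(E_{\text{aux}}^{\,r}(x)\big) \big\rVert_2 = 1 .
```

Under this reading, the projection $\xi$ is updated to make the two embeddings as correlated as possible, while the reversed gradients push both encoders in the opposite direction, so that whatever $\xi$ can still predict from one embedding is gradually removed from the other.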
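Finally, the total loss to train our two encoders and the representation is computed as $\mathcal{L} = \mathcal{L}_{\text{class}} + \mathcal{L}_{\text{aux}} + \gamma \mathcal{L}_d$, where $\gamma$ weights the contribution of the mutual information loss $\mathcal{L}_d$ with respect to the class triplet loss $\mathcal{L}_{\text{class}}$ and the auxiliary triplet loss $\mathcal{L}_{\text{aux}}$. The full training is described in Alg. 1.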
In this section we offer a quantitative and qualitative analysis of our method, also in comparison to previous work. After providing technical information for reproducing the results of our model, we give some information regarding the standard benchmarks for metric learning and provide comparisons to previous methods. Finally, we offer insights into the model by studying its key components.
4.1 Implementation details
We implement our method using the PyTorch framework. As baseline architecture, we utilize ResNet50 due to its widespread use in recent metric learning work. All experiments use a single NVIDIA GeForce Titan X. In practice, the class and auxiliary encoders use the same training protocol, following prior work, with alternating iterations to maximize the usable batch size. The dimensionality of the auxiliary encoder is fixed (except for ablations in Sec. 5) to the dimensionality of the class encoder to ensure similar computational efficiency compared to previous work. However, due to GPU memory limitations, we use a smaller batch size than originally proposed, with no relevant changes in performance.
During training, we resize the images and take random crops, followed by random horizontal flips. For all experiments, we use the original images without bounding boxes. We train the model using Adam and set all parameters other than the learning rate to their defaults. We set the triplet parameters following prior work, both for the initial value of the learnable margin in the margin loss and for the fixed triplet margin. Per mini-batch, we sample a fixed number of images per class for a random set of classes, until the batch size is reached. For the weight of the mutual information loss (Sec. 3.4) we utilize dataset-dependent values determined via cross-validation.
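The per-class batch construction mentioned above can be sketched in a few lines. The batch size of 8 and the 4 images per class used below are purely illustrative values, and `sample_batch` is a hypothetical helper, not part of the authors' code.

```python
import random

def sample_batch(indices_by_class, batch_size, per_class, seed=0):
    """Fill a batch with `per_class` images from randomly chosen classes."""
    rng = random.Random(seed)
    classes = list(indices_by_class)
    rng.shuffle(classes)  # random set of classes for this mini-batch

    batch = []
    for c in classes:
        if len(batch) + per_class > batch_size:
            break  # batch size reached
        pool = indices_by_class[c]
        if len(pool) >= per_class:
            # Enough images: sample without replacement.
            batch.extend(rng.sample(pool, per_class))
        else:
            # Small class: fall back to sampling with replacement.
            batch.extend(rng.choices(pool, k=per_class))
    return batch

# Toy dataset: 6 classes with 10 image indices each (class c owns 10*c .. 10*c+9).
indices_by_class = {c: list(range(c * 10, c * 10 + 10)) for c in range(6)}
batch = sample_batch(indices_by_class, batch_size=8, per_class=4)
classes_in_batch = {i // 10 for i in batch}
```

Sampling several images per class guarantees that every anchor in the batch has at least one positive, which both the class encoder and the auxiliary encoder (with cluster labels in place of class labels) need to form triplets.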
After class standardization, the clustering is performed via standard k-means using the faiss framework. Using the hyperparameters proposed in this paragraph, the computational cost introduced by our approach is 10-20% of total training time. For efficiency, the clustering can be computed on GPU using faiss. The number of clusters is set before training to a fixed, dataset-specific value for each of CUB200-2011, CARS196, Stanford Online Products, In-Shop Clothes and PKU VehicleID. We update the cluster labels every other epoch. Notably, however, our model is robust to both parameters, since a large range of values gives comparable results. Later, in Section 5, we study the effect of cluster numbers and cluster label update frequencies for each dataset in more detail to motivate the chosen numbers. Finally, class assignments by clustering, especially in the initial training stages, become near arbitrary for samples further away from cluster centers. To ensure that we do not reinforce such a strong initial bias, we found it beneficial to ease the class constraint by randomly switching samples with samples from different cluster classes with a fixed probability.
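One possible reading of the random switching step is that each sample is, with a small probability, treated as a member of a different cluster. The sketch below implements this reading; the function name, the uniform choice of the replacement cluster and the seed handling are assumptions made for illustration only.

```python
import random

def switch_cluster_labels(cluster_labels, num_clusters, switch_prob, seed=0):
    """With probability `switch_prob`, move a sample to a different, random cluster."""
    rng = random.Random(seed)
    noisy = []
    for label in cluster_labels:
        if num_clusters > 1 and rng.random() < switch_prob:
            # Draw uniformly from the other num_clusters - 1 clusters.
            other = rng.randrange(num_clusters - 1)
            noisy.append(other if other < label else other + 1)
        else:
            noisy.append(label)
    return noisy

labels = [0, 1, 2, 1, 0]
unchanged = switch_cluster_labels(labels, num_clusters=3, switch_prob=0.0)
all_switched = switch_cluster_labels(labels, num_clusters=3, switch_prob=1.0)
```

With a probability of zero the surrogate labels are used as produced by k-means; larger values inject more label noise and therefore weaken the influence of an unreliable early clustering.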
Our model is evaluated on five standard benchmarks for image retrieval typically used in deep metric learning. We report the Recall@k metric to evaluate image retrieval and the normalized mutual information score (NMI) for the clustering quality. The training and evaluation procedure follows the standard setup used in prior work.
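For reference, Recall@k as commonly used in image retrieval counts a query as a success if at least one of its k nearest neighbours (excluding the query itself) shares its class. The following is a minimal brute-force sketch of that definition on toy embeddings; it is meant to illustrate the metric, not to reproduce the evaluation code behind the reported numbers.

```python
import math

def recall_at_k(embeddings, labels, k):
    """Fraction of queries with a same-class sample among the k nearest neighbours."""
    hits = 0
    for i, query in enumerate(embeddings):
        # Rank all other samples by Euclidean distance to the query.
        others = [j for j in range(len(embeddings)) if j != i]
        others.sort(key=lambda j: math.dist(query, embeddings[j]))
        if any(labels[j] == labels[i] for j in others[:k]):
            hits += 1
    return hits / len(embeddings)

embeddings = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [2.0, 2.0]]
labels = [0, 0, 1, 1, 1]

r1 = recall_at_k(embeddings, labels, k=1)
r3 = recall_at_k(embeddings, labels, k=3)
```

In this toy example the last sample lies closer to the other class, so it only counts as retrieved once k is large enough to reach its own class.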
CARS196 with 196 car models over 16,185 images. We use the first 98 classes for training and the remaining 98 for testing.
Stanford Online Products with 120,053 product images in 22,634 classes. 59,551 images (11,318 classes) are used for training, 60,502 (11,316 classes) for testing.
CUB200-2011 with 200 bird species over 11,788 images. The training and test sets contain the first and last 100 classes (5,864/5,924 images), respectively.
In-Shop Clothes with 72,712 clothing images in 7,986 classes. 3,997 classes are used for training and 3,985 classes for evaluation. The test set is divided into a query set (14,218 images) and a gallery set (12,612 images).
PKU VehicleID with 221,736 surveillance images of 26,267 vehicles with shared car models. Following prior work, we use 13,134 classes (110,178 images) for training. Testing is done on predefined small and large test subsets with 7,332 (small) and 20,038 (large) images, respectively.
4.3 Quantitative and Qualitative Results
In this section we compare our approach with existing models from recent literature. Our method is applied to three different losses: the standard triplet loss with semi-hard negative mining, Proxy-NCA and the state-of-the-art margin loss with weighted sampling. For full transparency, we also provide results with our re-implementation of the baselines.
The results show a consistent gain over the state of the art for all datasets, see Tables 1, 2, 3, 4 and 5. In particular, our approach achieves better results than more complex ensembles. On CUB200-2011, we outperform even DREML, which trains an ensemble of ResNet models in parallel.
Qualitative results are shown in Fig.4: the class encoder retrieves images sharing class-specific characteristics, while the auxiliary encoder finds intrinsic, class-independent object properties (e.g. posture, context). The combination retrieves images with both characteristics.
In this section, we investigate the properties of our model and evaluate its components. We qualitatively examine the proposed encoder properties by checking recalled images for both encoders and study the influence of the auxiliary encoder on the recall performance, see Section 5.1. In Section 5 we measure the relation between the intra-class variance and the capacity of our auxiliary encoder. In addition, ablation studies are performed to examine the relevance of each pipeline component and hyper-parameter. We primarily utilize the most common benchmarks CUB200-2011, CARS196 and SOP.
5.1 Embedding Properties
First, we visualize the characteristics of the class encoder (Fig.5) and auxiliary encoder (Fig.6) by projecting the embedded test data to two dimensions using UMAP. The figures show the class encoder extracting class-discriminative information, while the auxiliary encoder encodes characteristics shared across classes (e.g. car orientation).
To evaluate the effect of the auxiliary encoder on the class encoder, we study the properties of the class encoding as a function of the capability of the auxiliary encoder to learn shared characteristics. First, we study the performance of the class encoder on CARS196 and CUB200-2011 relative to the auxiliary encoder dimension. Utilizing varying dimensionalities, Fig.7 shows a direct relation between the capacity of the auxiliary encoder and the retrieval capability. The setting without an auxiliary encoder corresponds to the baseline method. For all other evaluations, the dimension of the auxiliary encoder is equal to that of the class encoder to keep the computational cost comparable to the baseline (see Sec.4.1).
To examine our initial assumption that learning shared characteristics produces more compact classes, we study the intra-class variance by computing the mean pairwise distances per class, averaged over all classes. These distances are normalized by the average inter-class distance, approximated by the distance between two class centers. As summarized in Fig.8, we see higher intra-class variance for the basic margin loss, which uses no auxiliary encoder. More importantly, the class compactness is directly related to the capacity of the auxiliary encoder.
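The compactness measure described above can be written down directly. The sketch below assumes Euclidean distances and averages the center-to-center distance over all pairs of classes; the function name `compactness_ratio` is hypothetical, and the input needs at least two classes and one class with two or more samples.

```python
import math
from itertools import combinations

def compactness_ratio(embeddings, labels):
    """Mean intra-class pairwise distance divided by mean distance between class centers."""
    by_class = {}
    for vec, y in zip(embeddings, labels):
        by_class.setdefault(y, []).append(vec)

    # Mean pairwise distance inside each class, then averaged over classes.
    intra = []
    for vecs in by_class.values():
        pairs = list(combinations(vecs, 2))
        if pairs:
            intra.append(sum(math.dist(a, b) for a, b in pairs) / len(pairs))
    mean_intra = sum(intra) / len(intra)

    # Inter-class distance approximated by distances between class centers.
    centers = [[sum(col) / len(vecs) for col in zip(*vecs)] for vecs in by_class.values()]
    center_pairs = list(combinations(centers, 2))
    mean_inter = sum(math.dist(a, b) for a, b in center_pairs) / len(center_pairs)
    return mean_intra / mean_inter

embeddings = [[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]]
labels = [0, 0, 1, 1]
ratio = compactness_ratio(embeddings, labels)
```

A lower ratio means tighter classes relative to their separation; normalizing by the inter-class distance keeps the measure comparable between models whose embeddings live at different scales.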
We also offer a qualitative evaluation of the surrogate task in Fig.3. After class-standardization, the clustering recognizes latent structures of the data shared across classes.
5.2 Testing Components and Parameters
In order to analyze our modules, we evaluate different models, each lacking one of the proposed contributions, see Tab. 6. The table shows that each component is needed for the best performance. Compared to the baseline in the first row, we see that simply introducing an additional task based on clustering the data deteriorates the performance, as we add another class-discriminative training signal that introduces worse or even contradictory information. However, by utilizing standardization, we allow our second encoder to explicitly learn new features to support the class encoder instead of working against it, giving a significant performance boost. A final mutual information loss emphasizes the feature separation to improve the results further.
Our approach can be combined with most existing metric learning losses, which we evaluate on ProxyNCA and triplet loss with semihard sampling in Tab.1 and 2. On both CARS196 and CUB200-2011, we see improved image retrieval performance.
To examine the newly introduced hyper-parameters, Fig.9 compares the performance on the three benchmarks using a range of cluster numbers. The plot shows that the number of clusters influences the final performance, meaning that the quality of the latent structure extracted by the auxiliary encoder is crucial for better classification. At the same time, near-optimal performance in terms of Recall@1 is reached by a large set of cluster values, making the model robust to this hyper-parameter. For these cumulative tests, a higher learning rate and fewer training epochs were used to both reduce computation time and avoid overfitting to the test set. Based on these examinations, we set a fixed but dataset-dependent cluster number for all other training runs, see Sec. 4.1.
A similar evaluation has been performed on the update frequency for the auxiliary labels (Fig.10). Updating the clusters frequently clearly provides a boost to our model, suggesting that the auxiliary encoder improves upon the initial clustering. However, within a reasonable range of values (between an update every 1 to 10 epochs) the model shows no significant drop in performance. We therefore fix this parameter to an update every two epochs for all experiments.
In this paper we have introduced a novel extension for standard metric learning methods to incorporate structured intra-class information into the learning process. We do so by separating the encoding space into two distinct subspaces. One incorporates information about class-dependent characteristics, with the remaining encoder handling shared, class-independent properties. While the former is trained using standard metric learning setups, we propose a new learning task for the second encoder to learn shared characteristics and explain a combined training setup.
Experiments on several standard image retrieval datasets show that our method consistently boosts standard approaches, outperforming the current state-of-the-art methods and reducing intra-class variance.
Acknowledgements. This work has been supported by Bayer and hardware donations by NVIDIA corporation.
-  (2017) Incorporating intra-class variance to fine-grained visual recognition. CoRR abs/1703.00196. External Links: Cited by: §1, §2, Table 5.
-  (2016) Cliquecnn: deep unsupervised exemplar learning. In Advances in Neural Information Processing Systems, pp. 3846–3854. Cited by: §2.
-  (2017) Deep unsupervised similarity learning using partially ordered sets. CoRR abs/1704.02268. External Links: Cited by: §2.
LSTM self-supervision for detailed behavior analysis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
Improving spatiotemporal self-supervision by deep reinforcement learning. In European Conference on Computer Vision (ECCV), Cited by: §2.
Deep clustering for unsupervised learning of visual features. CoRR abs/1807.05520. External Links: Cited by: §2.
-  (2005) Learning a similarity metric discriminatively, with application to face verification. pp. 539–546. Cited by: §2.
-  (2018-06) Deep adversarial metric learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (1997) A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences 55 (1), pp. 119–139. Cited by: §2.
-  (2016-01) Domain-adversarial training of neural networks. J. Mach. Learn. Res. 17 (1), pp. 2096–2030. External Links: Cited by: §3.4.
-  (2018) Deep metric learning with hierarchical triplet loss. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 269–285. Cited by: §2, Table 1, Table 2, Table 3, Table 4.
-  (2017) Smart mining for deep metric learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2821–2829. Cited by: §2.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.1.
-  (2018) Mining on manifolds: metric learning without labels. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7642–7651. Cited by: §2.
-  (2015) Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025. Cited by: §1.
-  (2011) Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence 33 (1), pp. 117–128. Cited by: §4.2.
-  (2017) Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734. Cited by: §4.1.
-  (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.1.
-  (2013) 3d object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 554–561. Cited by: §1, Figure 3, Table 2, Figure 7, §4.1, §4.2, §5.1.
-  (2018-09) Deep variational metric learning. In The European Conference on Computer Vision (ECCV), Cited by: §1, §1, §2, Table 1, Table 2, Table 3.
-  (2016) Deep relative distance learning: tell the difference between similar vehicles. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2167–2175. Cited by: §1, Table 5, §4.1, §4.2.
-  (2010) Introduction to information retrieval. Natural Language Engineering 16 (1), pp. 100–103. Cited by: §4.2.
-  (2018) Umap: uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. Cited by: §5.1.
-  (2017) No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pp. 360–368. Cited by: §2, Table 1, Table 2, Table 3, §4.3, §5.2.
-  (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84. Cited by: §2.
Boosting self-supervised learning via knowledge transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9359–9367. Cited by: §2.
-  (2017) Deep metric learning via facility location. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5382–5390. Cited by: §2.
-  (2016) Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4004–4012. Cited by: §1, §2, Figure 3, Table 3, §4.1, §4.2.
-  (2017) Bier-boosting independent embeddings robustly. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5189–5198. Cited by: §2, Table 1, Table 2, Table 3, Table 4, Table 5.
-  (2018) Deep metric learning with bier: boosting independent embeddings robustly. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2, §3.4, Table 1, Table 2, Table 3, Table 4, Table 5.
-  (2017) Automatic differentiation in pytorch. In NIPS-W, Cited by: §4.1.
-  (2019) Divide and conquer the embedding space for metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §2.
-  (2018) Cross and learn: cross-modal self-supervision. In German Conference on Pattern Recognition (GCPR), Cited by: §2.
Facenet: a unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823. Cited by: §2, Table 1, Table 2, §4.3, §5.2.
-  (2016) Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pp. 1857–1865. Cited by: §2.
-  (2016-06) DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, Table 4, §4.1, §4.2.
-  (2016) Learning deep embeddings with histogram loss. In Advances in Neural Information Processing Systems, pp. 4170–4178. Cited by: §2.
-  (2011) The caltech-ucsd birds-200-2011 dataset. Cited by: §1, Table 1, Figure 7, §4.1, §4.2, §5.1.
-  (2017) Deep metric learning with angular loss. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2593–2601. Cited by: §2.
-  (2017) Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2840–2848. Cited by: §2, §3.1, Table 1, Table 2, Table 3, §4.1, §4.1, §4.2, §4.3, §5.1.
-  (2018) Deep randomized ensembles for metric learning. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 723–734. Cited by: Table 1, Table 2, Table 4, Table 5, §4.3.
-  (2017) Hard-aware deeply cascaded embedding. In Proceedings of the IEEE international conference on computer vision, pp. 814–823. Cited by: §2.
-  (2018) An adversarial approach to hard triplet generation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 501–517. Cited by: Table 1, Table 2, Table 4.