Deep metric learning (DML) is an important yet challenging task in the Computer Vision community, with numerous applications such as multi-modal retrieval [Carvalho_2018_SIGIR, Wehrmann_2018_CVPR], face verification [Schroff_2015_CVPR] or person re-identification [Liu_2016_CVPR]. DML methods intend to learn an embedding space where visually-related images (e.g., two different birds from the same breed) have similar representations, while unrelated images (e.g., two different breeds of crows from North America and Europe) have dissimilar representations. To learn this embedding space, recent contributions focus on three main points: (1) loss functions to improve generalization [Wang_2019_CVPR], (2) ensemble methods to tackle the embedding space diversity [Opitz_2017_ICCV] and (3) hard example mining strategies to resume the training when randomly sampling informative tuples becomes nearly impossible [Xu_2019_CVPR].
Example generation has recently been proposed as a hard negative mining strategy. In this case, a generator and the metric learning network are trained together to provide informative tuples using either VAEs [Lin_2018_ECCV] or GANs [Duan_2018_CVPR, Zhao_2018_ECCV, Zheng_2019_CVPR]. In the case of VAEs, a large number of examples is generated by sampling with respect to the training sample distribution estimated from the data. Usually, this leads to sampling inside the class manifolds and rarely produces hard negative examples. Such variational approaches are interesting when there are few training samples per class, but they are not well suited for mining informative examples at later training stages. In contrast, GAN-based approaches generate discriminative examples. However, adversarial generators are difficult to tune due to the conflicting objectives of the DML network and the adversarial learning of the generator. On the one hand, if the adversarial loss is much lower than the DML loss, the generated examples tend to be at the center of the class manifold and the method faces the same problems as VAE generators. On the other hand, if the adversarial loss is much higher than the DML loss, some examples can be generated beyond the boundary of the class manifolds and lead to label ambiguity, as illustrated in Figure 1a. The mining strategy then produces examples with incorrect labels with respect to the training classes.
As the main contribution of this paper, we propose MIRAGE, a method that leverages virtual classes composed solely of generated examples to tackle the problem of label ambiguity arising from hard negative generation. Virtual classes play the role of buffer areas, as shown in Figure 1b. Hard negative examples that lie between the training class manifolds are generated inside these buffer areas, without any label ambiguity, by sampling the virtual classes. In addition to solving the problem of label ambiguity, virtual class example generation leads to better generalization: the metric learning network obtains better results on unseen classes than other adversarial approaches [Duan_2018_CVPR, Zhao_2018_ECCV, Zheng_2019_CVPR].
The paper is organized as follows. In section 2, we present the related work in deep metric learning and the recent contributions in example generation, and we motivate the need for our method. In section 3, we present the core aspects of MIRAGE and its simple implementation. In section 4, we experimentally show that MIRAGE indeed produces buffer areas between the training classes, and we perform an ablation study of the different aspects of the method. Finally, in section 5, we show that our method improves over other sample generation methods on four DML datasets (Cub-200-2011, Cars-196, Stanford Online Products and In-Shop Clothes Retrieval) and obtains results comparable to the state of the art.
2 Related work
In deep metric learning, we train a deep network to provide representations and a corresponding metric to measure similarities. The training procedure of such a network relies on three main points, namely a loss function, a sampling strategy and, optionally, an ensemble method. In the case of the loss function, original methods consider pairs [Chopra_CVPR_2005] or triplets [Schroff_2015_CVPR] of similar/dissimilar examples. These approaches have been enhanced by considering larger tuples [Chen_2017_CVPR, Song_2016_CVPR, Sohn_2016_NIPS, Ustinova_2016_NIPS] or by improving the properties of the loss functions [Rippel_2016_ICLR, Wang_2018_CVPR, Wang_2017_ICCV, Yu_2018_ECCV]. When randomly sampling informative tuples becomes too hard, sampling strategies can be exploited to resume the training. These methods can be based on efficient batch construction [Song_2016_CVPR, Sohn_2016_NIPS, Ustinova_2016_NIPS], scalable mining strategies [Harwood_2017_ICCV, Schroff_2015_CVPR] or proxy-based approximations [Movshovitz-Attias_2017_ICCV, Song_2017_CVPR, Rippel_2016_ICLR]. Finally, ensemble methods have become a popular way of improving the performance of DML architectures [Kim_2018_ECCV, Opitz_2017_ICCV, Opitz_toap_PAMI, Xuan_2018_ECCV, Yuan_2017_ICCV]. Our proposed MIRAGE is a complementary approach to loss functions and ensemble methods.
MIRAGE differs from other hard negative example generation methods, such as DAML [Duan_2018_CVPR], HTG [Zhao_2018_ECCV], HDML [Zheng_2019_CVPR] and DVML [Lin_2018_ECCV]. Both DAML and HTG sample a triplet, feed it to a generator trained in an adversarial manner, and produce a hard negative example that replaces the original negative one. However, both methods suffer from the problem of label ambiguity illustrated in Figure 1, in that the generator can output an example inside another class manifold. Zheng et al. [Zheng_2019_CVPR] face the same issue with HDML. HDML tries to alleviate this effect by first generating an intermediate example that may be outside its class manifold; then a generator projects this example into the class manifold. In case of a failure, they use the DML loss over the real examples to weight the generator loss: if the triplet is an easy one, the generator only slightly modifies the example to avoid the generation of an intermediate example that would be too far from its class manifold. Moreover, the metric learning loss that is computed on the generated triplets is also weighted by the reconstruction loss: the worse the reconstruction is, the less they take into account the new triplet. In other words, to mitigate the effect of label ambiguity, HDML tends to discard really hard negative examples, which limits its hard negative generation capabilities.
At the same time, DVML gets rid of the triplet constraints for the generator by considering the class manifold as a Gaussian distribution. By estimating the parameters of the distribution, the sampling of new examples is performed using a variational approach. Because there is no adversarial training, examples tend to be mostly sampled at the center of the Gaussian distribution. As such, they only slightly contribute to the DML loss.
To solve the problem of label ambiguity while generating hard negatives, we propose to insert buffer areas between the training classes. To that end, we introduce virtual classes that we encourage to migrate between the training classes. Sampling hard negatives from the virtual classes allows us to use a generative sampling process similar to DVML [Lin_2018_ECCV], which is simpler than the triplet-based adversarial methods. At the same time, it also removes the need to take into account the possible incorrectness of the labels, since these generated examples do not correspond to existing classes.
3 Method overview
In this section, we start by giving an overview of MIRAGE. Then, we detail the core aspects involved in the approach. Finally, we describe the overall MIRAGE architecture.
3.1 MIRAGE overview
MIRAGE is designed to improve deep metric learning by using the following core aspects:
DML training. Like any other DML method, MIRAGE uses a deep neural network to embed feature vectors into a latent representation space where visually-related images have similar representations and where unrelated images have dissimilar representations. We use the standard metric learning approach, which extracts deep local features using a backbone network (e.g., GoogleNet [Szegedy_2015_CVPR] or BN-Inception [Ioffe_2015_ICML]), computes a feature vector (e.g., using an average pooling) and projects it into an embedding space in order to learn the metric.
Training class sample generation. As is done in variational approaches, MIRAGE generates artificial examples from the training classes in order to provide a better sampling of each training class manifold. By doing so, class manifolds are filled with synthetic examples. These generated examples are added to the mini-batch alongside real examples in order to have larger batches from which informative tuples can be sampled. These additional tuples are then used to train the DML model.
Virtual class hard negative sample generation. Similarly to adversarial sample generation, MIRAGE also generates hard negative examples. However, current hard negative generation methods are prone to label ambiguity (see Figure 1a). To tackle this issue, we add virtual classes between training classes that play the role of buffer areas (see Figure 1b). Hard negative examples are consequently generated inside these buffer areas by sampling within these virtual class manifolds. Similarly to training class generation, these examples are added to the mini-batch. We experimentally show that this leads to better performance than other generation-based methods.
3.2 Deep metric learning
The first part of a DML network extracts a feature vector $x$. For example, we use GoogleNet [Szegedy_2015_CVPR] followed by a global average pooling to compute $x$. Then, we want to learn a Mahalanobis distance so that the distance between two feature vectors $x_i$ and $x_j$ is:
$$d_M(x_i, x_j) = \sqrt{(x_i - x_j)^\top M (x_i - x_j)} \approx \left\| W^\top x_i - W^\top x_j \right\|_2$$
where $W W^\top$ is a low-rank approximation of $M$. The feature vectors are projected into the embedding space with $W$, where their corresponding examples are denoted $z$. In practice, all examples are $\ell_2$-normalized to ease the optimization. We note $f$ the function that transforms a feature vector $x$ into an example $z$ using the following equation:
$$z = f(x) = \frac{W^\top x}{\left\| W^\top x \right\|_2}$$
To train the network, we rely on standard metric learning loss functions such as the contrastive loss [Chopra_CVPR_2005], the triplet loss [Schroff_2015_CVPR] or the binomial loss [Ustinova_2016_NIPS]. As proposed by [Movshovitz-Attias_2017_ICCV], we use a class representation prototype to accelerate the training. The encoder and the prototypes are trained together using a DML loss (triplet, contrastive, etc.), referred to below as the DML loss.
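As a rough illustration of the projection, normalization and prototype-based training described in this subsection, the NumPy sketch below embeds feature vectors with a bias-free linear map followed by L2 normalization, and evaluates a contrastive-style loss against class prototypes. The names `embed`, `pairwise_sq_dist` and `prototype_contrastive_loss`, the margin value, the feature and embedding sizes and the exact form of the loss are assumptions made for this example; they are not taken from the paper.

```python
import numpy as np

def embed(x, W):
    """Project feature vectors with W (no bias), then L2-normalize each row."""
    z = x @ W                                          # (n, D) @ (D, d) -> (n, d)
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    return z / np.maximum(norms, 1e-12)                # guard against zero vectors

def pairwise_sq_dist(za, zb):
    """Squared Euclidean distance between every row of za and every row of zb."""
    diff = za[:, None, :] - zb[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def prototype_contrastive_loss(z, labels, prototypes, margin=0.5):
    """Contrastive-style loss against class prototypes (hypothetical form).

    Each example is pulled toward the prototype of its own class and pushed
    at least `margin` away from the prototypes of all other classes.
    """
    d = np.sqrt(pairwise_sq_dist(z, prototypes) + 1e-12)   # (n, C)
    own = np.zeros_like(d, dtype=bool)
    own[np.arange(len(labels)), labels] = True
    pull = d[own] ** 2
    push = np.maximum(0.0, margin - d[~own]) ** 2
    return float(pull.mean() + push.mean())

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1024))      # stand-in for pooled backbone features
W = rng.normal(size=(1024, 512))    # hypothetical projection to a 512-d embedding
z = embed(x, W)
d2 = pairwise_sq_dist(z, z)
```

Because the embedded examples have unit norm, every squared distance lies in [0, 4], which is one reason a fixed margin is easy to set in such a space.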
3.3 Training class example generation
To generate examples from the training classes, MIRAGE relies on a conditional generator that is designed to produce an artificial example from the prototype of class and a Gaussian noise , as follows:
The generated example is then used as a training example to optimize the DML loss function described in the previous section.
To train the generator , we use a reconstruction loss by computing the ElasticNet loss between a feature vector extracted from a real image and a feature vector generated by feeding the generator G with , as follows:
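As a rough illustration of this generation and reconstruction step, the sketch below builds a small two-layer generator and an ElasticNet-style loss in NumPy. Several choices are assumptions made for the example and are not specified in the text: the prototype and the noise are concatenated at the input, both layers use a ReLU, the L1 and L2 terms are equally weighted, and the helper names (`init_generator`, `generate`, `elastic_net_loss`) and all dimensions are hypothetical.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def init_generator(proto_dim, noise_dim, hidden_dim, feat_dim, rng):
    """Small random weights for a two-layer fully-connected generator."""
    W1 = 0.01 * rng.normal(size=(proto_dim + noise_dim, hidden_dim))
    b1 = np.zeros(hidden_dim)
    W2 = 0.01 * rng.normal(size=(hidden_dim, feat_dim))
    b2 = np.zeros(feat_dim)
    return W1, b1, W2, b2

def generate(prototypes, noise, params):
    """Map (prototype, Gaussian noise) pairs to artificial feature vectors."""
    W1, b1, W2, b2 = params
    inputs = np.concatenate([prototypes, noise], axis=1)   # assumed conditioning
    hidden = relu(inputs @ W1 + b1)
    return relu(hidden @ W2 + b2)

def elastic_net_loss(x_real, x_generated, l1_weight=1.0, l2_weight=1.0):
    """Weighted sum of the L1 and squared L2 reconstruction errors, batch mean."""
    diff = x_real - x_generated
    l1 = np.abs(diff).sum(axis=1)
    l2 = (diff ** 2).sum(axis=1)
    return float(np.mean(l1_weight * l1 + l2_weight * l2))

rng = np.random.default_rng(0)
params = init_generator(proto_dim=512, noise_dim=64, hidden_dim=256,
                        feat_dim=1024, rng=rng)
prototypes = rng.normal(size=(8, 512))        # one class prototype per example
noise = rng.normal(size=(8, 64))              # Gaussian noise
x_generated = generate(prototypes, noise, params)
x_real = np.abs(rng.normal(size=(8, 1024)))   # stand-in for real pooled features
loss = elastic_net_loss(x_real, x_generated)
```

In an actual training loop, the loss would be differentiated with respect to the generator parameters by an automatic differentiation framework; only the forward computation is shown here.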
3.4 Virtual class example generation
To generate hard negative examples, we consider a set of virtual classes, each associated with a prototype. Examples are generated from these prototypes exactly as if they were training classes. To produce hard negative samples, we encourage the prototypes and the generator to output realistic samples between the training classes. To that end, we use a discriminative classifier, the discriminator. The discriminator is trained using binary cross-entropy to distinguish between real and generated samples (first output head). It is also trained using categorical cross-entropy to predict classes (second output head). The combined loss for training the discriminator is a two-head classification loss based on cross-entropy:
where and are the class labels of the prototypes and respectively.
To encourage the generator to output realistic samples that are between the training classes, the generator is trained to maximize the discriminator loss, with the discriminator being fixed. By optimizing over the real/generated term, the generator is encouraged to output generated samples that are indistinguishable from real samples. By optimizing over the class prediction term, the generator is encouraged to output samples at the boundaries of the classes (i.e., in the buffer area described in Figure 1b). Just like the training class sample generation, virtual class sample generation is used to populate the mini-batches used for training with the DML loss.
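The two-head objective described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact loss of the paper: the two cross-entropy terms are summed with equal weights, class labels are plain integer indices, and the function names are hypothetical. The generator objective is written as the negative of the discriminator loss on generated samples, so that minimizing it corresponds to maximizing the discriminator loss with the discriminator fixed.

```python
import numpy as np

def binary_cross_entropy(p_real, is_real):
    """Real-vs-generated head: p_real is the predicted probability of 'real'."""
    p = np.clip(p_real, 1e-7, 1.0 - 1e-7)
    return float(-np.mean(is_real * np.log(p) + (1.0 - is_real) * np.log(1.0 - p)))

def categorical_cross_entropy(class_probs, labels):
    """Class head: class_probs has one row of softmax outputs per sample."""
    picked = np.clip(class_probs[np.arange(len(labels)), labels], 1e-7, 1.0)
    return float(-np.mean(np.log(picked)))

def discriminator_loss(p_real, is_real, class_probs, labels):
    """Two-head classification loss (unweighted sum, an assumption)."""
    return (binary_cross_entropy(p_real, is_real)
            + categorical_cross_entropy(class_probs, labels))

def generator_adversarial_loss(p_real_gen, class_probs_gen, labels_gen):
    """Quantity the generator minimizes: minus the discriminator loss
    evaluated on generated samples only (all of them flagged as not real)."""
    is_real = np.zeros_like(p_real_gen)
    return -discriminator_loss(p_real_gen, is_real, class_probs_gen, labels_gen)

# Toy batch: one real and one generated sample, four classes, and a
# discriminator that is completely undecided on both heads.
p_real = np.array([0.5, 0.5])
is_real = np.array([1.0, 0.0])
class_probs = np.full((2, 4), 0.25)
labels = np.array([0, 3])
d_loss = discriminator_loss(p_real, is_real, class_probs, labels)
g_loss = generator_adversarial_loss(p_real[1:], class_probs[1:], labels[1:])
```

An undecided class head is the situation the generator is pushed toward: a sample that no class claims with confidence is one that sits near the class boundaries.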
3.5 MIRAGE architecture
We describe the implementation of the MIRAGE architecture in Figure 2. A set of deep local features is first extracted from the image using a backbone network such as GoogleNet [Szegedy_2015_CVPR] or BN-Inception [Ioffe_2015_ICML]. These local features are then aggregated into feature vectors using an average pooling. They are followed by the encoder, which is composed of a single fully-connected layer without bias followed by a normalization. A prototype is used for each class and is represented by a star, either in plain lines for training classes or in dashed lines for virtual classes. The generator is composed of two fully-connected layers with ReLU activation. The discriminator is composed of a fully-connected layer with ReLU activation, which is followed by two fully-connected layers: one with sigmoid activation for the binary classification of real or virtual feature vectors and one with softmax activation for the class prediction.
To train MIRAGE, we generate mini-batches composed of training examples, generated examples from the training classes and generated examples from the virtual classes. The ratio of training examples to generated examples in the mini-batch corresponds to how much each aspect of MIRAGE is used. This ratio is investigated in the ablation studies. The backbone network, the encoder and the prototypes are trained together on the entire mini-batch by minimizing the metric learning loss function from subsection 3.2. The generator is trained by minimizing the reconstruction loss from subsection 3.3 and by maximizing the discriminator loss from subsection 3.4. Finally, the discriminator is trained on the entire batch by minimizing the discriminator loss from subsection 3.4.
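The mini-batch composition described above can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: `compose_minibatch` and its parameters are hypothetical names, the two ratios are expressed relative to the number of real examples, virtual classes receive label indices placed after the training class indices, and `sample_fn` stands in for the generator.

```python
import numpy as np

def compose_minibatch(real_feats, real_labels, train_protos, virtual_protos,
                      sample_fn, train_ratio=1.0, virtual_ratio=1.0, rng=None):
    """Concatenate real examples with examples generated from training class
    prototypes and from virtual class prototypes.

    sample_fn(prototypes, rng) must return one generated feature per prototype.
    """
    rng = np.random.default_rng() if rng is None else rng
    n_real = len(real_feats)
    n_train_gen = int(round(train_ratio * n_real))
    n_virtual_gen = int(round(virtual_ratio * n_real))
    n_train_classes = len(train_protos)

    # Pick the class of each generated example uniformly at random.
    train_cls = rng.integers(0, n_train_classes, size=n_train_gen)
    virtual_cls = rng.integers(0, len(virtual_protos), size=n_virtual_gen)

    gen_train = sample_fn(train_protos[train_cls], rng)
    gen_virtual = sample_fn(virtual_protos[virtual_cls], rng)

    feats = np.concatenate([real_feats, gen_train, gen_virtual])
    # Virtual classes get their own labels, after the training class labels.
    labels = np.concatenate([real_labels, train_cls, n_train_classes + virtual_cls])
    is_real = np.concatenate([np.ones(n_real), np.zeros(n_train_gen + n_virtual_gen)])
    return feats, labels, is_real

def noisy_copy(protos, rng):
    """Toy stand-in for the generator: the prototype plus Gaussian noise."""
    return protos + 0.1 * rng.normal(size=protos.shape)

rng = np.random.default_rng(0)
train_protos = rng.normal(size=(5, 16))      # 5 training classes
virtual_protos = rng.normal(size=(3, 16))    # 3 virtual classes
real_feats = rng.normal(size=(8, 16))
real_labels = rng.integers(0, 5, size=8)
feats, labels, is_real = compose_minibatch(
    real_feats, real_labels, train_protos, virtual_protos, noisy_copy,
    train_ratio=1.0, virtual_ratio=0.5, rng=rng)
```

The `is_real` flags are what a real-versus-generated head would be trained on, while `labels` serves both the class head and the metric learning loss.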
4 Ablation study
In this section, we provide different ablation studies that include (1) a visualization of the learned embedding and the virtual classes as well as some relevant statistics, (2) the impact of the number of generated examples in the mini-batches, (3) the impact of the virtual class generation and (4) the combination of training class generation and virtual class generation. The model denoted as the Baseline is a GoogleNet backbone network with a 512-dimensional embedding, trained with the contrastive loss on the Cub-200-2011 dataset. We use a fixed batch size of real examples for all experiments.
4.1 Prototype visualization
The first ablation considers an empirical analysis of the learned embedding space with the virtual classes. The objective is to verify that our architecture encourages the virtual classes to settle between the training classes. To that end, we show a t-SNE visualization of the training and virtual prototypes of the model trained on Cub-200-2011 in Figure a. As we can see, the virtual prototypes (in gray) are indeed in the middle of the training class prototypes. Quantitatively, we found that 80% of the training class prototypes have a virtual class prototype as nearest neighbor in the 512-dimensional latent space. This numerically shows that our architecture is able to produce virtual classes that act as buffer areas between training classes.
To avoid the bias introduced by the 2D embedding performed by t-SNE, we also train a model on the popular MNIST dataset with a latent space of dimension 2. We plot the resulting prototypes as well as examples generated from these prototypes in Figure b. As we can see, the virtual prototypes (shown in pale colors) are indeed acting as buffers between the training classes, even under the strong constraint of such a low-dimensional latent space.
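The nearest-neighbor statistic reported above (the share of training class prototypes whose closest other prototype is a virtual one) can be computed along the following lines. The function name and the use of the Euclidean distance are assumptions made for this sketch; the text does not state which distance was used for the measurement.

```python
import numpy as np

def fraction_with_virtual_nearest_neighbor(train_protos, virtual_protos):
    """Share of training prototypes whose nearest other prototype is virtual."""
    all_protos = np.concatenate([train_protos, virtual_protos])
    dists = np.linalg.norm(
        train_protos[:, None, :] - all_protos[None, :, :], axis=-1)
    n_train = len(train_protos)
    # A prototype must not be counted as its own nearest neighbor.
    dists[np.arange(n_train), np.arange(n_train)] = np.inf
    nearest = dists.argmin(axis=1)
    # Indices >= n_train point into the virtual prototypes.
    return float(np.mean(nearest >= n_train))

train_protos = np.array([[0.0, 0.0], [10.0, 0.0]])
between = np.array([[5.0, 0.0]])      # a virtual prototype between the two classes
far_away = np.array([[100.0, 0.0]])   # a virtual prototype far from both classes
frac_between = fraction_with_virtual_nearest_neighbor(train_protos, between)
frac_far = fraction_with_virtual_nearest_neighbor(train_protos, far_away)
```

In this toy case the statistic is 1.0 when the virtual prototype sits between the two training prototypes and 0.0 when it lies far from both, which is the behavior the measurement is meant to capture.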
4.2 Sample generation ablation
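Table 1: Recall@1 on Cub-200-2011 for varying training class example ratios and virtual class ratios.
|Training class examples ratio||Virtual class ratio|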
First, we evaluate the impact of the number of generated training class examples. For that purpose, we do not use virtual class prototypes. We vary the size of the generated example set with respect to a ratio of the real example set , such that: . We report Recall@1 on the Cub-200-2011 dataset in Table 1 for .
The reported value for means that no examples have been generated and obtains a strong Baseline of Recall@1. One can note that even a small number of generated examples, e.g., , increases the performance by in Recall@1 on the Cub-200-2011 dataset. Hence, this confirms the benefit of a generation-based mining strategy to improve DML. With a further increase of the size of the generated example set, we improve the performance of the Baseline from to Recall@1, a significant increase of nearly .
Next, we evaluate the impact of the number of virtual classes. We fix the size of the generated examples set to the size of the training examples batch , that is: . Then, we vary the number of virtual class prototypes as a ratio of the number of training classes , such that . We only generate examples from these virtual classes and not from the training classes. We report Recall@1 on the Cub-200-2011 dataset in Table 1 for . Interestingly, even a small number of additional classes, e.g. , already improves the Baseline by a significant increase of more than in Recall@1, from to . Increasing the number of virtual classes improves the performance even further, and leads to the best results for Recall@1 with , a significant increase of more than over the Baseline.
Finally, we evaluate the merging of both the training class example generation and the virtual class example generation. Results are reported in Table 2. Following the two previous ablations, we set and . To avoid any bias in the selection of these parameters, we report results on the Cars-196 dataset for three different DML losses, namely the contrastive loss [Chopra_CVPR_2005], the triplet loss [Schroff_2015_CVPR] and the binomial loss [Ustinova_2016_NIPS]. We also compare three different approaches, namely the Baseline, the training class example generation only (denoted as TCG) and MIRAGE. For the three DML losses, both the training class example generation and MIRAGE lead to significant improvements over the Baseline. E.g., with the contrastive loss, the Baseline is improved from to which is nearly a improvement in Recall@1. Besides, the performance of the binomial loss and the triplet loss is improved from to and from to respectively, which is an absolute improvement of and in Recall@1. These improvements are achieved without tuning the parameters for the Cars-196 dataset and for all evaluated DML loss functions.
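Table 2: Recall@K on Cars-196 for training class example generation only (TCG) and MIRAGE with three DML losses.
|Contrastive + TCG||76.3||85.2||90.8||94.6|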
|Contrastive + MIRAGE||78.8||86.4||91.7||95.4|
|Triplet + TCG||72.0||81.4||88.1||93.2|
|Triplet + MIRAGE||73.6||82.2||88.5||93.2|
|Binomial + TCG||74.6||83.8||89.7||93.9|
|Binomial + MIRAGE||77.8||86.1||91.3||94.7|
5 Comparison to the state-of-the-art
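Table 3: Recall@K on Cub-200-2011.
|BN-Inception||MS loss [Wang_2019_CVPR]||65.7||77.0||86.3||91.2|
Table 4: Recall@K on Cars-196.
|BN-Inception||MS loss [Wang_2019_CVPR]||84.1||90.4||94.0||96.5|
Table 5: Recall@K on Stanford Online Products.
|BN-Inception||MS loss [Wang_2019_CVPR]||78.2||90.5||96.0|
Table 6: Recall@K on In-Shop Clothes Retrieval.
|BN-Inception||MS loss [Wang_2019_CVPR]||89.7||97.9||98.5||98.8|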
In this section, we present the benefits of MIRAGE on four deep metric learning datasets: Cub-200-2011 [CUB_200_2011], Cars-196 [CARS_196], Stanford Online Products [Song_2016_CVPR] and In-Shop Clothes Retrieval [Liu_2016_CVPR_INSHOP]. We follow the standard splits from [Opitz_toap_PAMI], and Recall@K is reported for each dataset in Table 3, Table 4, Table 5 and Table 6, respectively.
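For reference, Recall@K on a single test set can be computed as in the sketch below: a query counts as a hit if at least one of its K nearest neighbors, excluding the query itself, shares its label. The function name and the brute-force distance computation are choices made for this example, and the single-set protocol shown here is an assumption: In-Shop Clothes Retrieval is usually evaluated with separate query and gallery sets, which would require a small change.

```python
import numpy as np

def recall_at_k(embeddings, labels, ks=(1, 2, 4, 8)):
    """Recall@K where every example queries all the other examples."""
    dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)                 # push each query to the end
    order = np.argsort(dists, axis=1)[:, :-1]       # drop the query itself
    hits = labels[order] == labels[:, None]         # (n, n - 1) boolean matrix
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}

# Four points on a line with alternating labels.
embeddings = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [3.5, 0.0]])
labels = np.array([0, 1, 0, 1])
recalls = recall_at_k(embeddings, labels, ks=(1, 2, 3))
```

The pairwise distance tensor grows quadratically with the number of examples, so a larger test set would need a chunked or approximate nearest-neighbor search instead.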
We first compare our architecture with recent sample generation approaches from the literature using the now standard GoogleNet backbone to ensure all results are fairly comparable. As we can see, MIRAGE obtains very strong results on all datasets. We achieve the best performance on Cub-200-2011 and Cars-196, and the second best on Stanford Online Products. This shows the importance of combining in-class sample generation, as in [Lin_2018_ECCV], with hard sample generation, as in [Zheng_2019_CVPR], which MIRAGE achieves with a simple architecture.
In order to compare MIRAGE with recent methods, we also report Recall@K using BN-Inception [Ioffe_2015_ICML] with the same hyper-parameters as the ones used for GoogleNet. MIRAGE obtains strong performance when compared to very recent state-of-the-art methods. On Cub-200-2011, we obtain the second best performance, being only 0.4% behind HORDE [Jacob_2019_ICCV]. On Cars-196 and Stanford Online Products, we obtain performance comparable to that of Multi-Similarity loss [Wang_2019_CVPR] and SoftTriplet [Qian_2019_ICCV]. On In-Shop Clothes Retrieval, we obtain results comparable to MS loss [Wang_2019_CVPR] and better than the recently proposed D&C [Sanakoyeu_2019_CVPR] and MIC [Roth_2019_ICCV], which use the stronger ResNet50 backbone network. We want to emphasize that the reported results were obtained using the contrastive loss function, and yet bring improvements over the baseline comparable to those obtained with a much more advanced loss function such as [Wang_2019_CVPR], [Qian_2019_ICCV] or [Jacob_2019_ICCV]. We believe this demonstrates the soundness of our approach.
6 Conclusion
In this paper, we introduce MIRAGE, a generation-based strategy for hard example mining. MIRAGE naturally solves the problem of generating incorrectly labeled hard negative examples by relying on a set of virtual classes solely composed of generated examples. Even when the generator produces examples beyond their class manifolds, the presence of virtual classes ensures that the examples are still generated with the correct labels regarding the training classes. We empirically show that MIRAGE outperforms the state-of-the-art mining strategies and leads to competitive results when compared to complementary approaches. This is validated on four deep metric learning datasets: Cub-200-2011, Cars-196, Stanford Online Products and In-Shop Clothes Retrieval.