Learning representative image embeddings has been a core problem for many computer vision applications such as image search and retrieval, recommendations, near-duplicate image detection, and face or logo recognition, to name a few. Following the advances of deep learning, arguably the most popular technique to learn image representations is deep metric learning (DML). The idea is to learn a mapping function that projects images to a lower-dimensional latent space where semantically similar instances are clustered together. Instead of learning discriminative characteristics of the classes in the training data, DML models learn a function that enables a measure of similarity via distances (e.g. Euclidean or cosine), allowing them to generalize to unseen classes as well.
There have been several proposals on how to learn such embedding functions. The classic ones are based on losses computed on pairwise distances. For example, the contrastive loss contrastiveloss uses matching and non-matching image pairs, while the triplet loss tripletloss operates on a tuple of two instances from the same class (anchor $a$ and positive $p$) and a third one from a different class (negative $n$), to exploit pairwise distance relationships. The fundamental problem of these approaches is that, ideally, one would have to sample every possible pair or triplet during training, which is computationally intractable. Instead, several sampling strategies were proposed for triplet-based losses schroff2015facenet; Harwood2017SmartMF; wu2017sampling; wang2019multi; wang2020xbm to ensure that only informative tuples are used in each training batch. In fact, if a sample is too easy, meaning $d(a, n) > d(a, p) + m$, where $d(x, y)$ defines the distance between samples $x$ and $y$ and $m$ is the margin, the loss converges to 0, slowing down training. Samples that are too hard, on the other hand, where $d(a, n) \ll d(a, p)$, could destabilize training and cause the embeddings to collapse into a single point. Though such sampling strategies help in training DML models, they often require complex implementations that are hard to parallelize or to compute ahead of each batch.
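As a minimal NumPy illustration of the triplet loss and the informative-sample (semi-hard) condition discussed above — the margin value of 0.2 here is an arbitrary choice for the example, not a recommendation:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss on embeddings (a sketch; the margin of 0.2
    is an illustrative value)."""
    d_ap = np.linalg.norm(anchor - positive)  # anchor-positive distance
    d_an = np.linalg.norm(anchor - negative)  # anchor-negative distance
    return max(0.0, d_ap - d_an + margin)

def is_informative(anchor, positive, negative, margin=0.2):
    """Semi-hard condition: the negative is farther than the positive,
    but still within the margin, so the loss is non-zero yet stable."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return d_ap < d_an < d_ap + margin
```

Easy negatives yield a zero loss, while semi-hard negatives keep the loss small but informative, which is why sampling strategies target the latter.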
In 2017, Movshovitz-Attias et al. movshovitz2017no introduced the idea of proxies: learnable class centers that substitute the positive and negative instances in the loss function during training. Such an approach promises faster convergence and requires no sampling. A very similar idea was also proposed by Snell et al. snell2017prototypical, where the class centers were instead derived from a larger batch. More recently, a combination of the softmax cross-entropy loss and learnable proxies was shown to achieve even better results normproxies, with low implementation complexity, quick convergence, and easily parallelizable operation.
Even though several novel approaches to metric learning losses are published every year, a new line of research suggests that, with a proper evaluation protocol and hyper-parameter tuning in place, the actual differences between the performances of these losses are negligible Fehrvri2019UnbiasedEO; Roth2020RevisitingTS; Musgrave2020AML. We hypothesize that a potential way to improve metric learning is by utilizing extra information existing in other modalities during training. A promising approach was presented last year, when Zhao et al. Zhao2019AWS used product titles as weak supervision to establish adaptive margins in the triplet loss function. Although this method achieves better retrieval performance than the triplet loss baseline and results comparable to a state-of-the-art DML method on a very large Amazon fashion dataset, it suffers from slow convergence and requires complex, unparallelizable sampling code. It is also unclear whether this method can be applied to state-of-the-art DML losses and whether it would achieve state-of-the-art results on a smaller public dataset.
In this paper, we show how the weakly-supervised adaptive margins can be applied to cross-entropy-based losses by proposing an adaptive additive angular margin term in the softmax function. Our method is easy to implement, requires no sampling, and achieves better performance on the Amazon fashion retrieval benchmark dataset and state-of-the-art results on the public DeepFashion dataset, even with lower embedding dimensions.
The main contributions of this paper are the following:
We empirically demonstrate that an additive angular margin in the softmax loss is effective for proxy-based deep metric learning;
We show the advantages of using non-constant margins on the negative classes;
We demonstrate that the negative margins can be derived from representations of another modality, similarly to Zhao2019AWS, and that this improves retrieval performance;
We evaluate the proposed approach on the Amazon fashion retrieval dataset as well as on the public DeepFashion dataset and set new state-of-the-art results on both.
2 Adaptive Additive Softmax
Deep metric learning aims to learn an embedding space where the similarity between the input samples is preserved as distances in the latent space. For example, metric learning losses such as the contrastive loss contrastiveloss or the triplet loss tripletloss are designed to minimize intra-class distances and maximize inter-class distances. More recent approaches, however, consider the relationships of all samples in the training batch to maximize efficiency wang2019multi. In fact, a key problem in DML is how to sample informative training samples that yield near-optimal convergence. Semi-hard mining, proposed by Schroff et al. schroff2015facenet, has been widely adopted for many tasks due to its ability to mine samples online. More recent approaches propose sampling by weighting distances wu2017sampling or dynamically building class hierarchies at training time hierarchical_triplet.
Softmax-based losses have been widely applied in face verification tasks, achieving state-of-the-art results Wang2018CosFaceLM; wen_2016; Liu2017SphereFaceDH. The advantage of these losses is the decreased emphasis on the sampling technique, at the cost of additional hyper-parameters and potentially worse generalization. A theoretical connection between classification and deep metric learning has been investigated in movshovitz2017no, where the authors propose to use learnable class centers during training, which they call proxies. Let us consider the temperature-scaled normalized softmax loss function for proxy-based metric learning originally defined in normproxies:
$$\mathcal{L} = -\log \frac{\exp(x^\top p_y / T)}{\sum_{z \in Z} \exp(x^\top p_z / T)},$$

where $x$ is an L2-normalized embedding corresponding to the output of the last linear layer of the model, $y$ is the class label of $x$ out of all possible classes $Z$, and $p_y$ is its respective proxy embedding. The temperature parameter $T$ is used to scale the logits to emphasize the difference between classes, thus boosting the gradients Liu2017SphereFaceDH; Wang2018CosFaceLM.
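A minimal NumPy sketch of this temperature-scaled normalized softmax loss, assuming a single embedding and a matrix of proxies; the temperature value used as a default here is illustrative only, not the paper's setting:

```python
import numpy as np

def normalized_softmax_loss(x, proxies, y, T=0.05):
    """Proxy-based normalized softmax loss (a sketch; T=0.05 is an
    illustrative value).
    x: (D,) embedding; proxies: (C, D) learnable class centers; y: int label."""
    x = x / np.linalg.norm(x)                                      # L2-normalize embedding
    p = proxies / np.linalg.norm(proxies, axis=1, keepdims=True)   # L2-normalize proxies
    logits = p @ x / T                 # temperature-scaled cosine similarities
    logits -= logits.max()             # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[y]
```

With L2-normalized inputs the dot products are cosine similarities, so the small temperature sharpens the distribution over classes and boosts the gradients, as noted above.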
Similarly to works in the face classification domain Wang2018CosFaceLM; wang2018additive, an additive large-margin term can be introduced on the positive pairs in the proxy-based softmax loss to improve class separation in the latent space, leading to the Large-Margin Cosine Loss (LMCL):

$$\mathcal{L}_{\mathrm{LMCL}} = -\log \frac{\exp((x^\top p_y - m) / T)}{\exp((x^\top p_y - m) / T) + \sum_{z \in Z \setminus \{y\}} \exp(x^\top p_z / T)},$$

where $m$ is a constant margin and $Z \setminus \{y\}$ indicates the set of all classes except $y$. As presented later in our experiments, a constant margin already improves the retrieval results.
However, we can also introduce additive margins on the negative pairs based on semantic class similarity in another modality, as demonstrated for the triplet loss by Zhao et al. Zhao2019AWS. Our proposed loss with the adaptive additive large margin on the proxy-based classification loss is the following:

$$\mathcal{L}_{\mathrm{adaptive}} = -\log \frac{\exp((x^\top p_y - m) / T)}{\exp((x^\top p_y - m) / T) + \sum_{z \in Z \setminus \{y\}} \exp((x^\top p_z + m\,\delta(y, z)) / T)},$$

where $\delta(y, z)$ is the Euclidean or cosine distance of the class representations of $y$ and $z$ in the other modality, normalized between 0 and 1.
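A minimal NumPy sketch of the adaptive additive margin idea: the positive logit receives a constant margin, while each negative logit is pushed by a margin scaled by the pre-computed, normalized cross-modal class distance. The margin and temperature defaults below are illustrative values, not the paper's tuned settings:

```python
import numpy as np

def adaptive_margin_softmax_loss(x, proxies, y, delta, m=0.35, T=0.05):
    """Sketch of the adaptive additive margin loss (m and T are
    illustrative values).
    delta: (C,) normalized [0, 1] distances between class y and every
    class in the other modality; delta[y] is ignored."""
    x = x / np.linalg.norm(x)
    p = proxies / np.linalg.norm(proxies, axis=1, keepdims=True)
    cos = p @ x                          # cosine similarity to each proxy
    logits = (cos + m * delta) / T       # adaptive margin on the negatives
    logits[y] = (cos[y] - m) / T         # constant margin on the positive
    logits -= logits.max()               # numerical stability
    return -(logits[y] - np.log(np.exp(logits).sum()))
```

Semantically similar classes (small cross-modal distance) thus receive a small extra margin, while dissimilar classes are pushed further apart.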
There are multiple ways of obtaining such class representations; for example, natural language descriptions or sets of attributes of each class can be used. For the former, one can apply state-of-the-art pre-trained models like BERT Devlin2019BERTPO. In fact, in this work we use a model that consists of BERT further trained on Amazon's textual datasets such as product titles, descriptions, bullet points, and product reviews; we call this model AmaBERT. For the attributes, an average vector of all word embeddings of the attributes could be used. When no such data exists, a pre-trained image captioning model could be used as a form of knowledge distillation.
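As a concrete illustration, the pre-computed class-distance matrix could be derived from per-class text embeddings as follows; this is a sketch assuming cosine distance with min-max normalization, and the exact normalization scheme is a design choice rather than the paper's prescribed procedure:

```python
import numpy as np

def class_distance_matrix(text_emb):
    """Pre-compute normalized pairwise cosine distances between class
    representations (e.g. averaged fastText title embeddings); a sketch.
    text_emb: (C, D) array, one vector per class. Returns (C, C) in [0, 1]."""
    e = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    dist = 1.0 - e @ e.T                 # cosine distance, in [0, 2]
    dist -= dist.min()                   # min-max normalize to [0, 1]
    if dist.max() > 0:
        dist /= dist.max()
    return dist
```

Because this matrix depends only on the class labels, it can be computed once before training and looked up per batch at negligible cost.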
The extensions we propose introduce negligible extra computational and memory costs and, with the class distances pre-computed, enable much faster training compared to sampling-based approaches. In subsection 3.3, we present and discuss concrete computational results and compare the performance of different approaches on two benchmark datasets.
We evaluate the impact of our proposed extensions by comparing multiple variants trained and benchmarked against the same datasets liuLQWTcvpr16DeepFashion; Zhao2019AWS. Our experiments cover classification-loss-based variants with different temperatures and no/constant/adaptive margins normproxies; Wang2018CosFaceLM; wang2018additive, as well as the adaptive-margin-based variant that introduces new modalities Zhao2019AWS. Moreover, we evaluate different feature extractors for both the image embedding backbone (ResNet50, EfficientNet-B0 and -B1) and the embeddings of the textual modality (fastText bojanowski2016enriching, and the BERT-based AmaBERT). Models trained with softmax losses have been observed to produce sparse embeddings; hence, with the introduction of a parameterless layer normalization, one can easily threshold the embedding vectors without much loss of accuracy. In our experiments we also investigate how much the performance of each of these models suffers when such thresholding is applied.
We build our experimental evaluation on two datasets built from different sources and used to benchmark previous work Zhao2019AWS. Both of these datasets were set up to assess image retrieval performance on product imagery from real-world retailers; however, they differ in size and content specifics, which we summarize below.
Amazon Fashion Retrieval dataset.
The Amazon fashion retrieval benchmark dataset consists of 82,465 images sourced from 22,200 fashion products in the Amazon catalog. This dataset was built with the help of three fashion specialists, who removed irrelevant imagery from a much larger initial collection of 164K products of 84 different types, totalling over 1.4 million images. Details of the steps taken in cleaning the original collection can be found in Zhao2019AWS. A key characteristic of this dataset is that it contains associated textual information in addition to the imagery, sourced from the Product Detail Page. We use the product title as the additional (textual) modality augmenting each product's images, through text embeddings computed using pre-trained fastText bojanowski2016enriching and AmaBERT.
DeepFashion’s In-Shop Clothes Retrieval dataset.
In addition to the larger Amazon dataset, we use the publicly available DeepFashion benchmark liuLQWTcvpr16DeepFashion (sourced from Forever 21's product catalog) to evaluate the performance of our models in cross-domain retrieval. Specifically, we leverage its In-Shop Clothes Retrieval version for its inclusion of textual attributes describing product characteristics, in addition to imagery spanning 11,735 products and totaling 54,642 images. Mirroring the methodology used in Zhao2019AWS, our evaluation follows the approach detailed in liuLQWTcvpr16DeepFashion, and Top-K accuracy is reported as the main evaluation metric.
It is worth noting that, in both of these datasets, classes are associated with the products themselves, thus comprising tens of thousands of classes, in stark contrast to the popular CUBS WahCUB_200_2011 and Cars cars196 benchmarks with 100 and 98 classes in their test sets, respectively.
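For reference, Top-K retrieval accuracy (Recall@K) counts a query as correct when at least one of its K nearest neighbors in the index set shares the query's class; a brute-force NumPy sketch over cosine similarities:

```python
import numpy as np

def recall_at_k(query_emb, query_labels, index_emb, index_labels, k=1):
    """Top-K retrieval accuracy (Recall@K), a brute-force sketch.
    query_emb: (Q, D), index_emb: (N, D); labels are integer arrays."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = index_emb / np.linalg.norm(index_emb, axis=1, keepdims=True)
    sims = q @ g.T                                  # cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]         # K nearest per query
    hits = [query_labels[i] in index_labels[topk[i]] for i in range(len(q))]
    return float(np.mean(hits))
```

In practice, an approximate nearest-neighbor index would replace the exhaustive similarity matrix for large galleries; the metric itself is unchanged.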
All experiments were conducted with a single Tesla V100 GPU in the PyTorch deep learning framework, version 1.4. On the Amazon Fashion Retrieval dataset we used SGD with a batch size of 75, momentum of 0.9, and a learning rate of 0.01 with exponential decay, for 500,000 iterations. We used linear learning rate warmup for the first 3,000 iterations. For feature extractors we used ImageNet-pretrained ResNet50 and EfficientNet-B0 models with embedding sizes of 2048 and 1280, respectively. For all models, we used the same image input size. The fully-connected embedding layers with non-parametric layer normalization were initialized randomly and were added right after the last pooling layer in each architecture. The output of the model was L2-normalized. Except where explicitly noted, we used a fixed temperature scaling parameter $T$ in the softmax cross-entropy loss. For all experiments that included a margin, we used the same value of $m$, derived via hyper-parameter tuning. During training we used class-balanced sampling with a fixed number of images per class. When training on the In-Shop dataset we used the same setup as in normproxies.
3.3 Results and Discussion
Amazon Fashion Retrieval dataset.
For our experiments on the Amazon fashion dataset, we set the triplet-based loss with adaptive margin and with an embedding size of 128 as our baseline, as proposed in the respective paper. However, while reproducing the results presented in Zhao2019AWS using the original codebase, we obtained slightly better results, which we use here instead. We found that the original normalized softmax loss with a 2048-dimensional embedding easily outperforms this baseline, lifting the Recall@1 from 88.46% to 91.61%. This advantage is retained even if we binarize these high-dimensional embeddings, via thresholding at 0, into vectors that have the same memory footprint as 64-dimensional float embeddings. This further supports the superiority of softmax-based losses over the traditional triplet loss. In fact, switching to an EfficientNet-B0 backbone, a state-of-the-art feature extractor with one fifth of the trainable parameters of ResNet50, we achieve superior results even with its smaller embedding size of 1280. Furthermore, the lack of online sampling reduced our training times from 8 days down to 33 hours on the same machine, nearly a 6-fold improvement in computational performance, alongside an increase in the target quality metrics.
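The binarization referred to above is a simple sign thresholding of the embedding; a sketch (the Hamming-distance helper is our illustration of how such binary codes could be compared, not necessarily the paper's retrieval procedure):

```python
import numpy as np

def binarize_embedding(emb):
    """Threshold a float embedding at 0 into packed bits (a sketch).
    A 2048-d float32 vector (8192 bytes) becomes 2048 bits = 256 bytes,
    the footprint of a 64-d float32 embedding."""
    bits = (emb > 0).astype(np.uint8)
    return np.packbits(bits)

def hamming_distance(a, b):
    """Distance between two packed binary embeddings."""
    return int(np.unpackbits(a ^ b).sum())
```

Because softmax-trained embeddings tend to be sparse after layer normalization, most of the ranking signal survives this 32x compression.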
Effect of margin.
Our experiments with a constant margin introduced on the positive pair (LMCL) already improved the results compared to the vanilla normalized softmax function. In order to ensure that the observed gains are attributable to the margin, we ran several experiments with various temperature scales and margin values, but found no further improvements. This suggests that DML setups can benefit from large margins combined with temperature scaling, similarly to classification problems.
We tried two text-embedding models on the product titles to test the adaptive-margin method:
average of fastText word embeddings;
sentence embedding with pre-trained AmaBERT.
For both cases the performance was better compared to any previous experiment, with the fastText embeddings reaching 91.64% and the AmaBERT 92.11% Recall@1. Although these results are just marginally higher than the best LMCL setting, both outperform the baseline by a considerable margin. Again, this advantage remains present even when the embeddings are further binarized. For more details please refer to Table 1.
In order to test the generalization capability of our models across different datasets, we also performed evaluations on the DeepFashion retrieval set. Even though the adaptive margin with AmaBERT embeddings does perform slightly better than the baseline (from 77% to 78.08% Recall@1), all other configurations perform significantly worse. This suggests that the softmax-based approaches somewhat overfit on the training domain, which makes them perform worse on other datasets. Detailed results are presented in Table 2.
We trained the best performing models on the DeepFashion dataset to be able to gauge the performance of the proposed approach against the state of the art. This dataset, however, does not contain titles for the classes. Thus, with the fastText model, we computed the average of the word embeddings of all attributes per class. With AmaBERT we embedded the first bullet-point description. Our results show state-of-the-art performance for both models compared to other recent DML approaches, even after binarizing the resulting embeddings. The difference between the two text embedders is, however, very small (91.79% vs. 91.9% Recall@1). We summarize the results on this dataset in Table 3.
We have shown how adaptive additive margins can be introduced to a classification-based loss popular in deep metric learning. We leverage this to take advantage of additional data available in other modalities, and show how to incorporate text from product titles and attributes during training using different sentence embedding methods like fastText and BERT. Moreover, we have demonstrated that this adaptive extension to the classification loss is compatible with the use of proxies, and that it not only inherits the computational and simplicity advantages of this combination but pushes them further, allowing us to set a new state of the art for DML-based image retrieval on both the public DeepFashion In-Shop Clothes Retrieval benchmark and a larger Amazon-internal fashion dataset. Our results are consistent across different image-feature extraction backbones and text embedding models, and still show improvements when large-dimensional feature vectors are binarized (allowing sparse and compact feature vectors for indexing).
The authors would like to thank Xiaonan Zhao for sharing the code, hyperparameters, and the Amazon dataset used for the paper that provided the basis of this work. We also would like to thank Sergey Sokolov for sharing his AmaBERT code and model which enabled us to experiment with state-of-the-art domain-specific text representations.