Deep Metric Learning (DML) is an important yet challenging topic in the Computer Vision community, with a broad spectrum of applications such as person or vehicle identification [Zhou_2017_ICCV], visual product search [Liu_2016_CVPR_INSHOP, Song_2016_CVPR] and multi-modal retrieval [Wehrmann_2018_CVPR, Carvalho_2018_SIGIR]. By learning the image representations and an embedding space together, DML methods produce compact representations where visually-related images (e.g., images of the same car model) are close to each other and dissimilar images (e.g., images of two cars from the same brand but from different models) are distant.
Recent contributions mainly address the training part of deep metric learning, proposing loss functions (e.g., Angular loss [Wang_2017_ICCV]), sampling strategies (e.g., DAMLRMM [Xu_2019_CVPR]) and ensemble methods (e.g., BIER [Opitz_2017_ICCV]). All of these methods are built upon a backbone network such as GoogleNet [Szegedy_2015_CVPR] to extract the local features from which the image representations and the corresponding metric are computed. Nowadays, global average pooling is the most widely used pooling strategy to compute the image representations. This is due to interesting properties such as full back-propagation of the gradient or localization ability without direct supervision [Zhou_2016_CVPR]. In almost every method, these representations are computed as the mean of the deep features and used as they are.
However, the average is also known to be a non-robust representation because it is very sensitive to outliers and to sampling problems [Jacob_2019_ICCV]. A well-known solution is to strengthen this representation by computing the average only on visually-related features, using a set of attention maps as in ABE [Kim_2018_ECCV]. ABE is based on what we next call dimension-wise selection with pre-attention. This method shows very good results with few additional parameters, through both pooling of visually-related features and feature denoising. ABE is also trained with a divergence loss, which ensures that the attention maps are complementary by enforcing two attention maps to be dissimilar, even for visually-related images. Due to this optimization criterion and its trade-off parameter, both the training procedure and the parameterization become more complex. NetVLAD [Arandjelovic_2016_CVPR] is based on what we next call feature-wise selection with post-attention. This method aggregates visually-similar features using a structural constraint based on a dictionary strategy. However, the feature-wise selection does not contribute to feature denoising.
In this paper, we introduce the method DIABLO, a DIctionary-based Attention BLOck, that produces robust image representations by taking advantage of both ABE and NetVLAD benefits. We evaluate attention strategies named pre-attention and post-attention in the DML setup, together with two selection strategies named feature-wise and dimension-wise selection. We show in practice that DIABLO consistently improves the state-of-the-art on four DML datasets (Cub-200-2011, Cars-196, Stanford Online Products and In-Shop Clothes Retrieval).
The remainder of the paper is organized as follows: in the next section, we present the related work on deep metric learning and support our proposition. In section 3, we give an overview of the proposed architecture, then we detail the attention strategies and the feature-wise and dimension-wise selection methods. In section 4, we show ablation studies on the selection and attention strategies, the dictionary size and the comparison to ABE. Finally, in section 5, we compare our approach to the state-of-the-art methods on four image retrieval datasets (Cub-200-2011, Cars-196, Stanford Online Products and In-Shop Clothes Retrieval).
Figure 1: Overview of the method. The two functions applied to the deep features are non-linear transformations. The selection blocks S correspond to either Equation 7 or Equation 11. The merging blocks M are illustrated in Figure 2.
2 Related work
In DML, we learn the image representations and an embedding together in such a way that the Euclidean distance corresponds with the semantic content of the images. The standard strategy is to extract deep features using a backbone network such as GoogleNet [Szegedy_2015_CVPR] and to learn the target representation with a linear projection on the average of these deep features. The whole network is fine-tuned to solve the metric learning task according to three criteria: a similarity-based loss function, a sampling strategy, and an ensemble method.
Standard loss functions rely on pairs or triplets of similar/dissimilar samples. Most recent ones extend these formulations by considering larger tuples [Ustinova_2016_NIPS] or by improving the design [Wang_2017_ICCV]. The batch construction can either be done by random sampling, mining strategies [Xu_2019_CVPR], proxy-based approximations [Movshovitz-Attias_2017_ICCV] or generation [Lin_2018_ECCV]. Finally, ensemble methods have recently become a popular way of improving the performances of DML architectures [Opitz_2017_ICCV, Opitz_toap_PAMI].
The latest DML approaches consider using a codebook strategy [Arandjelovic_2016_CVPR] or even attention maps [Kim_2018_ECCV]. [Kim_2018_ECCV] propose ABE, an attention-based ensemble method to enforce diversity within the embedding space. With this aim, they train a set of attention blocks to select several feature map entries per block. These blocks are then trained with a divergence loss function, which ensures that all attention maps are complementary. Their proposed design relies on what we call a dimension-wise selection with pre-attention: they select a set of dimensions from the feature maps using a given attention map. Then, these selected dimensions are further processed with Inception blocks. We call this selection scheme a pre-attention strategy, and we argue that one of its properties is to denoise the deep features. This improves the image representations, because the subsequent processing is applied to features that contain only relevant information. However, there are two major drawbacks in using a divergence loss. On the one hand, we need to cross-validate the additional trade-off hyper-parameter. On the other hand, the training time is increased because the divergence loss plays the opposite role of the similarity-based loss function.
[Arandjelovic_2016_CVPR] propose NetVLAD, which takes advantage of a dictionary strategy that avoids the divergence loss. In their case, the optimization constraint is replaced by a structural one, which simplifies the training procedure by directly optimizing the task-dependent loss function. However, NetVLAD's design relies on a strategy that we call feature-wise selection with post-attention. Feature-wise selection ensures that only visually-related features are pooled together, but this selection strategy does not have the denoising property of the dimension-wise selection. Also, in the case of NetVLAD, the pre-processing is a centering and an intra-projection, which can be improved using a non-linear transformation learned by a CNN.
In DIABLO, we show the benefits of dimension-wise selection versus feature-wise selection with pre-attention or post-attention strategies. With this aim, we propose to replace the divergence loss in ABE by a structural constraint, which is based on a dictionary strategy. This dictionary is trained without direct supervision, as is the case with NetVLAD. We show that it leads to better results than ABE on four DML datasets.
3 Proposed method
In this section, we start by giving an overview of DIABLO. Then, we describe the attention maps computation and the two selection strategies named feature-wise and dimension-wise selection.
3.1 Method overview
We give an overview of the method illustrated in Figure 1. We start by extracting a deep feature map using the local feature extractor F; this feature map is characterized by its height, its width and the dimension of its deep features. We further process the extracted feature map using a non-linear function implemented by a convolutional neural network (CNN). In order to compute the attention maps, whose number is given by the dictionary size, we pass this processed feature map into the selection block S, which is either the feature-wise selection (subsection 3.3) or the dimension-wise selection (subsection 3.4).
In the post-attention setup (Figure 1a), the feature map is further processed with a non-linear function (implemented by a CNN) and transformed into an intermediate feature map. It is then combined with the attention maps using the block M to produce the new feature maps. In the pre-attention setup (Figure 1b), the feature map is directly combined with the attention maps using the block M. In this case, we further process the resulting feature maps using a non-linear function. The different blocks M used to combine the attention maps and the original feature map are illustrated in Figure 2.
Each of these feature maps is pooled using global average pooling and followed by an embedding layer, one per branch. The output representation is obtained by concatenating the embeddings of these branches.
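The pooling and concatenation step can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the array layout, the names `branch_embeddings` and `weights`, the use of one linear embedding per branch and the L2-normalization of each branch are choices made for the example.

```python
import numpy as np

def branch_embeddings(branches, weights, eps=1e-12):
    """Pool each branch, embed it, and concatenate the results.

    branches: (K, H, W, D) feature maps, one per attention branch (assumed layout)
    weights:  (K, D, E) one linear embedding per branch (hypothetical parameters)
    returns:  (K * E,) concatenated image representation
    """
    pooled = branches.mean(axis=(1, 2))                   # global average pooling -> (K, D)
    embedded = np.einsum("kd,kde->ke", pooled, weights)   # per-branch embedding -> (K, E)
    norms = np.linalg.norm(embedded, axis=1, keepdims=True)
    embedded = embedded / np.maximum(norms, eps)          # L2-normalize each branch (assumption)
    return embedded.reshape(-1)                           # concatenate the K branches

rng = np.random.default_rng(0)
K, H, W, D, E = 8, 7, 7, 32, 64                           # 8 branches of 64 dimensions -> 512
branches = rng.normal(size=(K, H, W, D))
weights = rng.normal(size=(K, D, E))
representation = branch_embeddings(branches, weights)
```

With 8 branches of 64 dimensions each, the concatenated vector has the 512 dimensions used for all models.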
3.2 Attention strategies
This section focuses on computing a set of attention maps from the feature map, using one of the selection blocks S and its corresponding dictionary. For this purpose, we process the feature map with a non-linear function implemented by a CNN. Then, the set of attention maps is computed by a selection block S from this processed feature map and the dictionary. This set of attention maps is used with one of the following attention strategies.
In the post-attention strategy illustrated in Figure 1a, we first process the feature map with a non-linear function to obtain an intermediate feature map.
Then, we combine this intermediate feature map with the attention maps using the merging block M, which yields the set of feature maps of the post-attention strategy, one per attention map.
From these feature maps, we perform a spatial pooling of the local features and we add an embedding layer to generate the corresponding image representation.
The main idea of post-attention is to aggregate only related features, unlike global pooling, which aggregates unrelated features. For example, features that describe the background are aggregated using a different attention block than those that represent the desired object. One can note that this approach is strongly related to NetVLAD, for which the function applied to the features before merging is a centering. However, it differs from NetVLAD in two ways: first, it learns a non-linear clustering of the local features through the function that feeds the selection block; second, it learns a non-linear pre-processing of the deep features through the function that produces the intermediate feature map.
In the pre-attention strategy illustrated in Figure 1b, we perform the operations in reverse order. We first combine the feature map with the set of attention maps using the merging block M to produce a set of feature maps, one per attention block.
Then, we process each of these feature maps using a non-linear function to generate the refined feature maps.
From these refined feature maps, we pool the local features and we add an embedding layer to generate the corresponding image representation.
The pre-attention strategy is similar to a refinement approach. Indeed, the attention selects dimensions or features, and a refinement function improves these extracted features before they are aggregated. Hence, the role of the attention maps is to select only the relevant information for a given attention block. The refinement function is trained to refine this information so that it improves generalization.
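The difference between the two strategies is only the order in which merging and non-linear processing are applied. The sketch below makes this explicit under simplifying assumptions: `transform` and `refine` are hypothetical stand-ins for the CNNs (an element-wise `tanh` here), and `merge` is a plain weighting of the feature map by each attention map.

```python
import numpy as np

def merge(f, att):
    """Weight the (H, W, D) feature map by each of the K (H, W) attention maps."""
    return att.transpose(2, 0, 1)[..., None] * f[None]    # (K, H, W, D)

def post_attention(f, att, transform):
    """Process the feature map first, then merge it with the attention maps."""
    return merge(transform(f), att)

def pre_attention(f, att, refine):
    """Merge first, then refine each branch with a function shared across branches."""
    return np.stack([refine(branch) for branch in merge(f, att)])

rng = np.random.default_rng(0)
f = rng.normal(size=(5, 5, 4))        # toy feature map
att = rng.random(size=(5, 5, 3))      # toy attention maps, K = 3
post = post_attention(f, att, np.tanh)
pre = pre_attention(f, att, np.tanh)
```

In this toy setting the two orders coincide when the function is the identity and differ once it is non-linear, which is why the choice of strategy matters.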
3.3 Feature-wise selection
In this section, we give details about the computation of the selection block S and the merging block M for the feature-wise selection illustrated in Figure 2a. We start from a feature map from which we compute a set of attention maps using a dictionary, with one attention map per dictionary entry. Before computing the feature-wise selection, the feature map is processed by a non-linear function implemented by a CNN. In the following, we consider a feature from the original feature map at a given spatial location, together with its transformation at the same spatial location in the processed feature map.
3.3.1 Selection block
In the feature-wise selection, the objective is to assign each feature from the feature map to one of the dictionary entries. To do so, we consider the cosine similarity between the transformed feature and the dictionary entry.
Using the cosine similarity has a main advantage when compared to the Euclidean distance: it is more stable during training thanks to the L2-normalization.
From this similarity measure, we compute the weight of each feature-wise attention map for every dimension of the feature at a given spatial location, as a one-hot assignment of the feature to its most similar dictionary entry.
However, this one-hot encoder is not differentiable due to the argmax operator. We relax the constraint using the soft-max operator in order to train the deep network end-to-end.
In this relaxation, a hyper-parameter controls the hardness of the assignment. With this formulation, a given feature is assigned to a dictionary entry according to the similarity between its transformation and this entry. We show in subsection 4.3 that the feature-wise selection increases the performance of the attention-based models when compared to the baseline model (without attention map). Next, we detail the block M that merges the attention maps obtained by feature-wise selection with the raw features.
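The relaxed assignment can be sketched as follows. This is a minimal NumPy illustration under assumptions: the shapes, the names `g`, `dictionary` and `beta`, and the choice of multiplying the cosine similarity by the hardness hyper-parameter before the soft-max are not taken from the text.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit Euclidean norm."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def feature_wise_attention(g, dictionary, beta=10.0):
    """Soft-assign each local feature to the dictionary entries.

    g:          (H, W, D) transformed feature map (assumed layout)
    dictionary: (K, D) dictionary entries
    beta:       hardness of the assignment (assumed to scale the similarity)
    returns:    (H, W, K) attention weights that sum to 1 over the K entries
    """
    sim = l2_normalize(g) @ l2_normalize(dictionary).T    # cosine similarity, (H, W, K)
    logits = beta * sim
    logits -= logits.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
g = rng.normal(size=(7, 7, 16))
dictionary = rng.normal(size=(8, 16))
soft = feature_wise_attention(g, dictionary, beta=1.0)    # smooth assignment
hard = feature_wise_attention(g, dictionary, beta=1e4)    # close to the one-hot case
```

Increasing `beta` moves the weights toward the one-hot assignment while keeping the operation differentiable.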
3.3.2 Merging block
The combination of the attention maps and the feature map is illustrated in Figure 2a. For a given attention map, spatial location and dimension, the corresponding entry of the merged feature map is obtained by weighting the same dimension of the feature at this location with the assignment weight of this attention map.
The operation is applied with the assignment weights of each attention map and the features at all spatial locations.
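A minimal sketch of this merging step is given below, assuming that the merge is a product between the assignment weight and the local feature. The function name `merge_feature_wise` and the array layout are hypothetical.

```python
import numpy as np

def merge_feature_wise(f, att):
    """Weight every local feature by its assignment to each dictionary entry.

    f:   (H, W, D) feature map
    att: (H, W, K) feature-wise attention weights
    returns: (K, H, W, D), one weighted feature map per attention map
    """
    return np.einsum("hwk,hwd->khwd", att, f)

rng = np.random.default_rng(0)
f = rng.normal(size=(7, 7, 16))
att = rng.random(size=(7, 7, 4))
att /= att.sum(axis=-1, keepdims=True)    # weights sum to 1 over the K entries
merged = merge_feature_wise(f, att)
```

Because the weights sum to one over the dictionary entries, summing the merged maps over the branches recovers the original feature map: the branches split the features rather than duplicate them.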
3.4 Dimension-wise selection
In this section, we extend the feature-wise selection from subsection 3.3 to the dimension-wise selection. Similarly, we detail the computation of the attention maps through the selection block S before explaining the merging strategy with the block M.
3.4.1 Selection block
In the dimension-wise selection, each dictionary entry of the selection block holds a set of directions, one per dimension of the input feature, whereas the feature-wise selection uses a single vector per dictionary entry. Thus, for a given feature at a given spatial location, the cosine similarity is computed between the transformed feature and each direction of each dictionary entry.
Then, each entry of the attention map is computed from these similarities with a soft-max, as in the feature-wise selection.
The attention map has an attention weight for each dimension of the input feature and for each of the attention blocks. Subsection 4.3 highlights that the dimension-wise selection produces stronger image representations than the feature-wise selection, with much higher performance.
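The dimension-wise selection can be sketched as follows. This is an illustration under assumptions: the shapes, the names `directions` and `beta`, and the choice of taking the soft-max over the dictionary entries independently for each dimension are not taken from the text.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors along `axis` to unit Euclidean norm."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def dimension_wise_attention(g, directions, beta=10.0):
    """Soft-assign each feature dimension to the dictionary entries.

    g:          (H, W, C) transformed feature map (assumed layout)
    directions: (K, D, C) one direction per dictionary entry and per dimension
    beta:       hardness of the assignment (assumed to scale the similarity)
    returns:    (H, W, K, D) weights that sum to 1 over the K entries
    """
    sim = np.einsum("hwc,kdc->hwkd", l2_normalize(g), l2_normalize(directions))
    logits = beta * sim
    logits = logits - logits.max(axis=2, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=2, keepdims=True)

rng = np.random.default_rng(0)
g = rng.normal(size=(7, 7, 16))
directions = rng.normal(size=(8, 32, 16))
att = dimension_wise_attention(g, directions)
```

Compared with the feature-wise case, the attention tensor gains one axis: every dimension of every local feature receives its own distribution over the dictionary entries.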
3.4.2 Merging block
The merging block M used in the case of dimension-wise selection is illustrated in Figure 2b. For a given attention map, spatial location and dimension, the entry of the merged feature map is obtained by weighting the corresponding dimension of the local feature with the attention weight of that dimension.
Note that, unlike in the feature-wise case, the weight depends on the dimension. The computation can easily be re-written using the Kronecker product and the Hadamard product, as illustrated in Figure 2b. Using the Kronecker product, we duplicate the local feature once per attention map. Then, the merged feature maps are computed using the element-wise product between the attention weights and the duplicated feature map.
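The equivalence between the Kronecker-based formulation and a direct per-dimension weighting can be checked on a single local feature. The sketch below uses hypothetical sizes and random values, and the variable names are chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 4, 6
feature = rng.normal(size=D)          # one local feature
att = rng.random(size=(K, D))         # dimension-wise weights at that location

# Kronecker product with a vector of ones: the feature repeated K times, shape (K * D,)
duplicated = np.kron(np.ones(K), feature)

# Hadamard (element-wise) product with the flattened attention weights
merged_kron = (att.reshape(-1) * duplicated).reshape(K, D)

# Same result with broadcasting: each row of `att` weights the feature
merged_broadcast = att * feature
```

Both formulations give the same K weighted copies of the feature; the Kronecker form simply makes the duplication explicit.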
3.5 Implementation details
For a fair comparison with other methods, we use a GoogleNet [Szegedy_2015_CVPR] pre-trained on ImageNet. The embedding size is fixed to 512 for all models. As is common practice, we set the triplet margin, the contrastive and the binomial margin and the negative sample weight in the binomial deviance. We follow the same training procedure as state-of-the-art methods [Opitz_2017_ICCV, Kim_2018_ECCV]: training hyper-parameters (learning rate, batch size, etc.) are empirically chosen based on the training loss after a few epochs. Model hyper-parameters (number of layers, etc.) are set to values comparable to those of the models reported in ABE [Kim_2018_ECCV]. We use the following data augmentation on the images: multi-resolution where the size is uniformly sampled in times the crop size as well as random crop and horizontal flip during the training. During testing, we re-scale the images to . We use Adam [Diederik_2015_ICLR] with the default parameters and a learning rate of . For all models, the function is shared across the attention maps to reduce the large number of parameters induced. In practice, it still leads to strong experimental results and avoids over-fitting on small datasets such as Cub-200-2011 and Cars-196.
4 Ablation studies
In this section, we compare our approach to the baseline in terms of model complexity and computation. Then, we present the evaluation protocol of our ablation studies. Ablation studies are performed to evaluate the benefits of pre and post-attention, the assignment approaches (feature and dimension-wise), the dictionary strategy and the dictionary size.
Table 1: Recall@K (%) on Cub-200-2011 and Cars-196.
|Method|Cub-200-2011 R@1|R@2|R@4|R@8|R@16|R@32|Cars-196 R@1|R@2|R@4|R@8|R@16|R@32|
|Angular loss [Wang_2017_ICCV]|54.7|66.3|76.0|83.9|-|-|71.4|81.4|87.5|92.1|-|-|
|Contrastive + DIABLO|62.3|73.6|82.6|89.2|94.0|96.9|84.8|90.5|94.3|96.6|98.1|98.9|
|Triplet + DIABLO|59.6|70.6|80.3|87.7|92.7|96.2|75.0|83.4|89.4|93.5|96.4|98.1|
|Binomial + DIABLO|62.8|73.9|82.4|89.3|94.0|96.7|85.0|90.8|94.0|96.4|98.0|98.9|
|Binomial + DIABLO|63.9|74.3|82.4|88.8|94.0|96.8|85.4|91.3|95.0|97.2|98.5|99.1|

Table 2: Recall@K (%) on Stanford Online Products and In-Shop Clothes Retrieval.
|Method|Stanford Online Products R@1|R@10|R@100|R@1000|In-Shop Clothes Retrieval R@1|R@10|R@20|R@30|R@40|R@50|
|Angular loss [Wang_2017_ICCV]|70.9|85.0|93.5|98.0|-|-|-|-|-|-|
|Contrastive + DIABLO|77.8|89.5|95.3|98.4|91.3|98.1|98.7|99.0|99.1|99.1|
|Triplet + DIABLO|73.5|87.8|95.0|98.5|87.4|97.2|98.1|98.6|98.8|98.9|
4.1 Model complexity and computation cost
In this section, we give the architecture choices for the non-linear functions used in the post-attention and in the pre-attention architectures. Then, we analyze the induced computations and the complexity introduced by the dictionary approach.
In post-attention, the GoogleNet backbone extracts the feature map up to and including the max-pooling between the fourth and the fifth scales. One of the two non-linear functions is computed by adding the two Inception blocks named '5a' and '5b' on top of these features. The other function is also composed of two independent Inception blocks '5a' and '5b'. The attention maps and the weighted features are computed using one of the two proposed strategies. Then, each branch is pooled using a global average pooling, and an embedding layer, whose size is 512 divided by the number of branches, and an L2-normalization are added. The output representation is the concatenation of these branches, which leads to a 512d representation.
In pre-attention, we use the GoogleNet features up to and including the max-pooling between the third and the fourth scales. The non-linear function that feeds the selection block is composed of the five Inception blocks from the fourth scale of GoogleNet pre-trained on ImageNet. The refinement function is composed of the fourth and fifth scales of GoogleNet; its parameters are shared across the attention maps but are independent from those of the previous function. The attention maps and the weighted feature maps are computed using either dimension-wise or feature-wise selection. As is the case with post-attention, each branch is pooled using a global average pooling, and an embedding layer, whose size is 512 divided by the number of branches, and an L2-normalization are added before the concatenation of all branches in order to produce the full 512d representation.
These choices directly follow ABE [Kim_2018_ECCV] and we refer the reader to the related paper for more ablations on the architecture, including the multi-head approach, the attention module and the sharing of parameters across the attention modules.
All additional parameters come from the non-linear function that feeds the selection block. By using the five Inception blocks from the fourth scale of GoogleNet, this leads to 3.5M additional parameters. Note that these parameters are shared across the dictionary entries, which drastically reduces the number of parameters. Also, note that the refinement function is already included in the number of parameters of GoogleNet.
In terms of computation, the most important additional computations come from the refinement function, which is computed on each attention map. The computation of the fourth and fifth Inception scales is estimated at 0.7 Gflop (see [Kim_2018_ECCV], Table 1). In comparison, the whole GoogleNet requires 1.6 Gflop to produce the image embedding. Note that in the case of the pre-attention, all parameters of this function are shared across the attention maps, which leads to fewer additional parameters. Overall, these choices lead to better experimental results for both ABE and DIABLO compared to the baseline.
4.2 Model selection protocol
In this section, we detail the evaluation protocol for all ablation studies on the Cub-200-2011 dataset that are performed to select the best model. We perform 10 random train-val splits on the training set of Cub-200-2011 for deep metric learning: we randomly choose 50 classes for the training set and we keep the rest of the classes for the validation set. Then, we train each model on the training set and we select the model that gives the best performance on the validation set. We then compute Recall@K on the testing set of Cub-200-2011 for each train-val split. All reported results in Table 3 and Table 4 are the average and the standard deviation of Recall@K on the testing set for the ten runs.
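The splitting and aggregation steps of this protocol can be sketched as follows. The helper names `class_splits` and `summarize`, the seed and the use of the population standard deviation (NumPy's default) are assumptions made for the example; training and evaluation are left out.

```python
import numpy as np

def class_splits(class_ids, n_train=50, n_splits=10, seed=0):
    """Yield (train_classes, val_classes) pairs with disjoint classes."""
    rng = np.random.default_rng(seed)
    class_ids = np.asarray(list(class_ids))
    for _ in range(n_splits):
        shuffled = rng.permutation(class_ids)
        yield shuffled[:n_train], shuffled[n_train:]

def summarize(recalls):
    """Mean and standard deviation of Recall@K over the runs."""
    recalls = np.asarray(recalls, dtype=float)
    return recalls.mean(), recalls.std()

# The Cub-200-2011 training set contains 100 classes: 50 for training, 50 for validation.
splits = list(class_splits(range(100)))
```

Each split is used to train one model and select it on its validation classes; the Recall@K values measured on the testing set are then passed to `summarize`.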
4.3 Feature selection and attention strategy
First, we evaluate the benefit of the pre and post-attention strategies with respect to the assignment strategy. To that end, we fix the dictionary size to 8 and we use the training procedure from subsection 3.5. Results are reported in Table 3 for the dataset Cub-200-2011 with binomial loss. We remark that all strategies, with the exception of the feature-wise selection with pre-attention, improve over the baseline, which confirms the benefit of attention maps. This experiment also shows that feature-wise attention and dimension-wise attention impact the model differently.
In feature-wise attention, the post-attention leads to stronger representations, with +1.7% on Recall@1 over the pre-attention. We argue that selecting features makes better sense with post-attention than with pre-attention: only related features are aggregated together with post-attention whereas in pre-attention, the refinement part mostly processes sparse feature maps. In the case of dimension-wise attention, both approaches provide stronger results (+4.3% with pre-attention and +3.8% with post-attention) over the best feature-wise strategy, even though it is still better to use the pre-attention approach. We argue that pre-attention is better with dimension-wise selection because the refinement part processes denoised features. Indeed, certain dimensions may be useless for a given dictionary entry and the dimension-wise approach can select only the relevant dimensions. Then, the refinement part is trained with sparse vectors which contain only the relevant information. Moreover, feature-wise selection with post-attention aggregates sparse vectors together, which might lead to sub-optimal results because some dimensions are rarely used.
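Table 3: Recall@1 (%, mean ± standard deviation over the ten runs) on Cub-200-2011 with binomial loss and a dictionary size of 8.
|Metric|Baseline|Feat. + Pre-att.|Feat. + Post-att.|Dim. + Pre-att.|Dim. + Post-att.|
|R@1|52.9 ± 0.2|52.3 ± 0.4|54.0 ± 0.4|58.3 ± 0.2|57.8 ± 0.1|

Table 4: Recall@1 (%, mean ± standard deviation over the ten runs) on Cub-200-2011 for different dictionary sizes; the third and fourth columns correspond to dictionary sizes 8 and 16.
|Feat. + Post-att.|53.4 ± 0.3|53.3 ± 0.3|54.0 ± 0.4|48.0 ± 0.4|
|Dim. + Pre-att.|57.1 ± 0.3|57.4 ± 0.3|58.3 ± 0.2|58.9 ± 0.3|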
4.4 Comparison to ABE
Second, we evaluate the impact of the structural constraints imposed by the dictionary (Equation 7 and Equation 11) by comparing our method to ABE [Kim_2018_ECCV]. In ABE, the authors show that an M-head approach already provides strong results on the Cars-196 dataset with 76.1% (+8.9%) Recall@1 for . However, this architecture tends to overfit due to the large number of parameters. They therefore propose an enhanced version named ABE that takes advantage of attention maps. The divergence loss increases the performance from 69.7% without the divergence loss to 85.2% (+15.5%) in Recall@1 (see Table 2 in [Kim_2018_ECCV]).
Among the drawbacks of the divergence loss, besides the additional hyper-parameter, is the fact that optimizing it generates gradients that oppose those of the metric learning loss. Indeed, it is designed to reduce the similarity between different branches even when the images are similar. We solve this issue in DIABLO, where this optimization constraint is replaced by a structural constraint (soft-max in Equation 7 and Equation 11). The orthogonality is ensured by the design of DIABLO, which allows us to simply remove the divergence loss at the price of a reduced expressiveness: feature map entries can be chosen independently in ABE whereas in DIABLO they are constrained to only one dictionary entry. In Table 1 and Table 2, DIABLO performs well compared to ABE, with similar results on the Cars-196 dataset (-0.2% compared to ABE) and higher ones on other datasets such as Cub-200-2011 (+2.2%), Stanford Online Products (+1.5%) and In-Shop Clothes Retrieval (+3.4%) for this set of parameters.
4.5 Dictionary size
In this section, we evaluate the impact of the dictionary size on DIABLO. We evaluate both the feature-wise and the dimension-wise selections with dictionary sizes in . Recall@1 on the Cub-200-2011 dataset is reported in Table 4.
In post-attention with feature-wise selection, extreme combinations (e.g., 16 branches with 32 dimensions each) lead to results below the baseline. Thus, to increase the performance of such an attention strategy, a compromise must be found between the representation size of each branch and the number of branches (+1.1% over the baseline at best). In pre-attention with dimension-wise selection, all parameter combinations lead to better results than the baseline (+4.2% to +6.0%). Contrary to the previous strategy, the performance increases with the log of the dictionary size, i.e., with the number of branches.
5 Comparison to the state-of-the-art
In this section, we compare DIABLO to the state-of-the-art on four DML datasets, namely Cub-200-2011 [CUB_200_2011], Cars-196 [CARS_196], Stanford Online Products [Song_2016_CVPR] and In-Shop Clothes Retrieval [Liu_2016_CVPR_INSHOP]. For Cub-200-2011, Cars-196 and Stanford Online Products, we follow the standard splits from [Song_2016_CVPR] and for In-Shop Clothes Retrieval we follow the one from [Liu_2016_CVPR_INSHOP]. In particular, the Cub-200-2011 training set is composed of the first 100 classes (i.e., the training and the validation sets from the ablation studies) for a total of 5864 images and its testing set is composed of the last 100 classes for a total of 5924 images. The Cars-196 training set is composed of the first 98 classes for a total of 8054 images and its testing set is composed of the last 98 classes for a total of 8131 images. The Stanford Online Products training set is composed of 11318 classes for a total of 59551 images and its testing set is composed of 11316 classes for a total of 60502 images. Finally, the In-Shop Clothes Retrieval training set is composed of 3997 classes for a total of 25882 images and its testing set is composed of 3985 classes, split into a query set of 17218 images and a collection set of 12612 images. We report the Recall@K which evaluates, for a given query, if there is at least one image with the same label in the top-K retrieved images. We use K in {1, 2, 4, 8, 16, 32} for Cub-200-2011 and Cars-196, K in {1, 10, 100, 1000} for Stanford Online Products and K in {1, 10, 20, 30, 40, 50} for In-Shop Clothes Retrieval.
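A minimal sketch of the Recall@K computation is given below, assuming that every test image is used in turn as a query against all the other test images and that retrieval uses the Euclidean distance. The function name `recall_at_k` and its default values of K are hypothetical, and the separate query and collection sets of In-Shop Clothes Retrieval are not handled.

```python
import numpy as np

def recall_at_k(embeddings, labels, ks=(1, 2, 4, 8)):
    """Fraction of queries with at least one same-label image among the K nearest.

    embeddings: (N, E) image representations
    labels:     (N,) class labels
    ks:         values of K to evaluate (illustrative defaults)
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    sq = (embeddings ** 2).sum(axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * embeddings @ embeddings.T
    np.fill_diagonal(dist, np.inf)                  # a query never retrieves itself
    max_k = min(max(ks), len(labels) - 1)           # keep the query out of the ranking
    order = np.argsort(dist, axis=1)[:, :max_k]     # nearest neighbours first
    hits = labels[order] == labels[:, None]         # (N, max_k) same-label flags
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}
```

A query whose class has no other image in the set can never be a hit, so the measure is bounded by the fraction of queries that have at least one same-label candidate.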
We report the results for Cub-200-2011 and Cars-196 in Table 1 and in Table 2 for Stanford Online Products and In-Shop Clothes Retrieval. DIABLO consistently improves the already strong baseline on the four datasets and for three different loss functions. E.g., using the binomial loss, the baseline is improved from 59.6% to 63.9% (+4.3%) on Cub-200-2011 and from 78.8% to 85.4% (+6.6%) on the Cars-196 dataset. The same observation is made for both the other loss functions on these datasets.
Moreover, DIABLO leads to better results when compared to the similar approach ABE. Their best reported approach, ABE-8, is outperformed by DIABLO with and (total dimension 512) by 2.2% in R@1 on Cub-200-2011, by 1.5% on Stanford Online Products and by 4% on In-Shop Clothes Retrieval. We also report results with and (total dimension 512) which are further improved on these datasets: 1.1% in Recall@1 on the Cub-200-2011 and by 0.4% on the Cars-196 dataset, leading to the state-of-the-art on the four deep metric learning datasets.
In this paper, we have presented a dictionary-based attention method named DIABLO which consistently improves DML models. An ablation study is undertaken to evaluate the benefits of the feature-wise and the dimension-wise selections for two attention strategies named pre-attention and post-attention. We show that DIABLO consistently outperforms the baseline for three different loss functions on four datasets (Cub-200-2011, Cars-196, Stanford Online Products, and Inshop Clothes Retrieval). Moreover, it outperforms the current state-of-the-art methods on the four datasets.
The authors would like to acknowledge the COMUE Paris Seine University, the Cergy-Pontoise University and M2M Factory for their financial and technical support.