We consider the problem of learning similarity predictors for metric learning and related applications. Given a query image of an object, our task is to retrieve, from a set of reference images, the object image that is most similar to the query image. This problem arises in a variety of tasks, including image retrieval, person re-identification (re-id) [9, 47], and even low-shot learning [28, 34, 4]. There has been substantial recent progress in learning distance functions for these similarity learning applications, with [35, 21, 10, 42] providing a representative overview.
Existing deep similarity predictors are trained in a distance learning fashion, where the big-picture goal is to embed features of same-class data points close to each other in the learned embedding while pushing features of data from other classes further away. Consequently, most techniques distill this problem into optimizing a ranking objective that respects the relative ordinality of pairs, triplets [1, 36], or even quadruplets of training examples. These methods are characterized by the specifics of how the similarity model is trained, e.g., data (pairs, triplets, etc.) sampling [7, 44], sample weighting, and adaptive ranking, among others. However, a key limitation of these approaches is their lack of decision reasoning, i.e., explanations for why the model predicts that the input images are similar or dissimilar. As we demonstrate in this work, our method not only offers model explainability, but such decision reasoning can also be infused into the model training process, in turn helping bootstrap and improve the generalizability of the trained similarity model.
Recent developments in convolutional neural network (CNN) visualization [46, 19, 52, 27] have led to a surge of interest in visual explainability. Some methods [15, 41] enforce attention constraints using gradient-based attention, resulting in improved attention maps as well as downstream model performance. These techniques essentially ask the following question: where is the object in the image? By design, this limits their applicability to scenarios involving object categorization. In this paper, on the other hand, we ask: what makes two images look similar (or dissimilar)? (See Figure 1.) Existing work can explain classification/categorization models, but extending it to similarity models is not trivial. While one recent paper proposed to use a binary classification term to compute network attention for a pair of person images for re-id, it has several application-specific algorithmic design choices (e.g., upright/standing pose, horizontal part-pooling) that limit its applicability to re-id only (e.g., vs. generic image retrieval, semantic segmentation, etc.). For instance, in low-shot semantic segmentation, the problem is to segment unseen object regions given very few annotated examples as support. If we are able to generate visual explanations for similarity models, we can exploit this explainability to discover regions in the test image that look most visually similar to the support images. Once such correspondences are established, assigning semantic labels to pixels in the test image is trivial.
In this work, we tackle these limitations in a principled manner. We propose a generic method to address the similarity explanation problem, generating network attention directly from similarity predictions. Note that this is substantially different from existing work [15, 48, 41] where an additional classification module is needed to compute the network attention. Furthermore, we show that the resulting similarity attention can be modeled as the output of a differentiable operation, thereby enabling its use in model training as an explicit trainable constraint, which we empirically show improves model generalizability. A key feature of our proposed technique is its generality, evidenced by two characteristics we demonstrate. First, our design is not limited to a particular type of similarity learning architecture; we show applicability to and results with three different types of architectures: Siamese, triplet, and quadruplet. Next, we demonstrate the versatility of our framework (Figure 2 shows a summary) in addressing problems different from image retrieval (e.g., low-shot semantic segmentation) by exploiting its decision reasoning functionality to discover (application-specific) regions of interest.
To summarize, our key contributions include:
- To the best of our knowledge, we present the first technique to generate visual explanations, by means of network attention, from generic similarity metrics, equipping similarity models with visual explanation capability.
- We show how one can model such similarity attention as a differentiable operation, enabling its use in enforcing trainable similarity constraints.
- We propose the similarity mining learning objective, enabling a new similarity-attention-driven learning mechanism for training similarity predictors that results in improved model generalizability.
- We demonstrate the versatility of our proposed framework by means of a diverse set of experiments on a variety of tasks (e.g., image retrieval, person re-identification, and low-shot semantic segmentation) and similarity model architectures.
2 Related Work
Our work is related to both the metric learning and visual explainability literature, and we briefly review closely-related methods along these directions, helping differentiate our work and put it in proper context.
Learning Distance Metrics. Metric learning approaches attempt to learn a discriminative feature space that minimizes intra-class variations while maximizing the inter-class variance. Traditionally, this translated to optimizing learning objectives based on the Mahalanobis distance function or its variants [43, 11, 22, 18]. Much recent progress with CNNs has focused on developing novel objective functions or data sampling strategies. Wu et al. demonstrated the importance of careful data sampling, developing a weighted data sampling technique that resulted in reduced bias, more stable training, and improved model performance. On the other hand, Harwood et al. showed that a smart data sampling procedure that progressively adjusts the selection boundary in constructing more informative training triplets can improve the discriminability of the learned embedding. Substantial effort has also been expended in proposing new objective functions for learning the distance metric. Some recent examples include the multi-class N-pair, lifted structured embedding, and proxy-NCA losses. The goal of these and related objective functions is essentially to explore ways to penalize training data samples (pairs, triplets, quadruplets, or even distributions) so as to learn a discriminative embedding. In this work, we take a different approach. Instead of just optimizing a distance objective (e.g., triplet), we also explicitly consider and model network attention during training. This leads to two key innovations over existing work. First, we equip our trained model with decision reasoning functionality. Second, by means of trainable network attention, we guide the network to discover local regions in images that contribute the most to the final decision, thereby improving model generalizability.
Learning Visual Explanations. Dramatic performance improvements of vision algorithms driven by black-box CNNs have led to a recent surge in attempts [46, 19, 53, 39, 27, 37, 3, 6] to explain and interpret model decisions. To date, most CNN visual explanation techniques fall into either response-based or gradient-based categories. Class Activation Map (CAM) used an additional fully-connected unit on top of the original deep model to generate attention maps, thereby requiring architectural modification during inference and limiting its utility. Grad-CAM, a gradient-based approach, solved this problem by generating attention maps using class-specific gradients of predictions with respect to convolutional layers. Li et al., Wang et al., and Zheng et al. took a step forward, using the attention maps to enforce trainable attention constraints, demonstrating improved model performance. These explanation techniques were very specific in their focus. Li et al. and Wang et al. focused on categorization, whereas Zheng et al. focused on re-id, leading to very application-specific assumptions and pipeline design. In our work, we propose a generic algorithm that can generate similarity attention from, in principle, any similarity measure, and additionally, can enforce trainable constraints using the generated similarity attention. Our design leads to a powerful and flexible technique that we show results in generalizable models and performance improvements in areas ranging from metric learning to low-shot semantic segmentation.
3 Proposed Method
Given a set of $N$ labeled images $\{(x_i, y_i)\}_{i=1}^{N}$, each belonging to one of $K$ categories, where $x_i$ denotes an image and $y_i \in \{1, \ldots, K\}$ its label, we seek to learn a distance metric to measure the similarity between two images $x_1$ and $x_2$. Our key innovation is the design of a flexible technique to produce similarity model explanations, by means of CNN attention, which we show can be used to enforce trainable constraints during model training. This leads to a model equipped with similarity explanation capability as well as improved model generalizability. In Section 3.1, we first briefly discuss the basics of existing similarity learning architectures followed by our proposed technique to learn similarity attention, and show how it can be easily integrated with existing networks. In Section 3.2, we discuss how the proposed similarity attention mechanism facilitates principled attentive training of similarity models with our new similarity mining learning objective.
3.1 Similarity attention
Traditional similarity predictors such as Siamese or triplet models are trained to respect the relative ordinality of distances between data points. For instance, given a training set of triplets $\{(x_a, x_p, x_n)\}$, where $x_a$ and $x_p$ have the same categorical label while $x_a$ and $x_n$ belong to different classes, a triplet similarity predictor learns a $d$-dimensional feature embedding of the input $x$, $f(x) \in \mathbb{R}^d$, such that the distance between $f(x_a)$ and $f(x_n)$ is larger than that between $f(x_a)$ and $f(x_p)$ (within a predefined margin $m$).
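As a point of reference, the triplet criterion described above can be sketched in a few lines of NumPy. This is only an illustrative hinge-style formulation: the function names, the unit-norm feature assumption, and the margin value of 0.2 are hypothetical choices and are not taken from our implementation.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit Euclidean norm (assumed convention for embeddings)."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def triplet_loss(f_a, f_p, f_n, margin=0.2):
    """Hinge-style triplet loss for one (anchor, positive, negative) triple.

    The loss is zero once the anchor-negative distance exceeds the
    anchor-positive distance by at least `margin` (a hypothetical value).
    """
    d_ap = np.linalg.norm(f_a - f_p)
    d_an = np.linalg.norm(f_a - f_n)
    return max(0.0, d_ap - d_an + margin)

# Toy 3-D embeddings: the positive is close to the anchor, the negative is far.
f_a = l2_normalize(np.array([1.0, 0.0, 0.0]))
f_p = l2_normalize(np.array([0.9, 0.1, 0.0]))
f_n = l2_normalize(np.array([0.0, 1.0, 0.0]))

loss_ok = triplet_loss(f_a, f_p, f_n)        # criterion satisfied
loss_violated = triplet_loss(f_a, f_n, f_p)  # roles swapped, criterion violated
```

Swapping the positive and negative samples in the last line violates the criterion and therefore yields a positive loss, which is the signal a triplet model is trained to remove.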
Starting from such a baseline predictor (we choose the triplet model for all discussion here, but later show variants with Siamese and quadruplet models as well), our key insight is that we can use the similarity scores from the predictor to generate visual explanations, in the form of attention maps as in GradCAM, for why the current input triplet indeed satisfies the triplet criterion with respect to the learned feature embedding $f$. As a concrete example of our final result, see Figure 2, where our model highlights the common (cat) face region in the anchor (A) and the positive (B) images, and the corresponding face and ears region in the dog image (negative, C), clearly illustrating why this triplet satisfies the triplet criterion. This is what we refer to as similarity attention: the ability of the similarity predictor to automatically discover local regions in the input that contribute the most to the final decision (in this case, satisfying the triplet condition) and visualize these regions with attention maps.
Note that the way we generate similarity attention differs from what GradCAM and related techniques do. These techniques have a classification module that takes in feature vectors as input and produces classification probabilities. The gradients of the classification scores are then computed with respect to convolutional feature maps to obtain the attention maps. In our case, we are not limited by the requirement of having a classification module. Instead, as we discuss below, we compute a similarity score directly from the feature vectors (e.g., $f_a$, $f_p$, and $f_n$), which is then used to compute gradients and obtain the attention map.
Given a triplet sample $(x_a, x_p, x_n)$, we first extract feature vectors $f(x_a)$, $f(x_p)$, and $f(x_n)$ (denoted $f_a$, $f_p$, and $f_n$ respectively going forward). Ideally, a perfectly trained triplet similarity model must result in $f_a$, $f_p$, and $f_n$ satisfying the triplet criterion. Under this scenario, local differences between the images in the image space will roughly correspond to proportional differences in the feature space as well. Consequently, there must exist some dimensions in the feature space that contribute the most to this particular triplet satisfying the triplet criterion, and we seek to identify these elements in order to compute the attention maps. To this end, we compute the absolute difference between the pairs ($f_a$, $f_p$) and ($f_a$, $f_n$), and construct the weight vectors $w_p$ and $w_n$ as $w_p = 1 - |f_a - f_p|$ and $w_n = |f_a - f_n|$.
With $w_p$, we seek to highlight the feature dimensions that have a small absolute difference value (e.g., for those dimensions, $w_p$ will be closer to 1), whereas with $w_n$ we seek to highlight the feature dimensions with large absolute differences. Given $w_p$ and $w_n$, we construct a single weight vector $w = w_p \odot w_n$, where $\odot$ denotes the element-wise product operation. With $w$, we obtain a higher weight for feature dimensions that have a high value in both $w_p$ and $w_n$. In other words, we seek to focus on elements that contribute the most to (a) the positive feature pair being close, and (b) the negative feature pair being further away. This way, we identify dimensions in the feature space that contribute the most to the feature vectors $f_a$, $f_p$, and $f_n$ satisfying the triplet criterion. We now use these dimensions in the feature space to compute network attention for the current triplet of images $(x_a, x_p, x_n)$.
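The construction of the weight vector can be illustrated with a small NumPy sketch. It assumes the weights take the form $w_p = 1 - |f_a - f_p|$ and $w_n = |f_a - f_n|$, and that features are scaled so that per-dimension absolute differences lie in $[0, 1]$; the function name and toy values are hypothetical.

```python
import numpy as np

def triplet_weight_vector(f_a, f_p, f_n):
    """Per-dimension weights that are large where the positive pair agrees
    AND the negative pair disagrees (element-wise product of the two terms)."""
    w_p = 1.0 - np.abs(f_a - f_p)  # close to 1 where anchor and positive match
    w_n = np.abs(f_a - f_n)        # large where anchor and negative differ
    return w_p * w_n

# Toy 3-D features (assumed to lie in [0, 1] per dimension).
f_a = np.array([0.8, 0.1, 0.5])
f_p = np.array([0.8, 0.6, 0.5])  # differs from the anchor only in dimension 1
f_n = np.array([0.1, 0.1, 0.5])  # differs from the anchor only in dimension 0

w = triplet_weight_vector(f_a, f_p, f_n)
most_relevant_dim = int(np.argmax(w))
```

In this toy case only dimension 0 is both shared by the positive pair and discriminative for the negative pair, so it is the only dimension that receives a non-zero weight.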
Given the weight vector $w$, we compute the dot product with the feature vectors $f_a$, $f_p$, and $f_n$ to get the sample scores $s_a = w^\top f_a$, $s_p = w^\top f_p$, and $s_n = w^\top f_n$ for each image in the triplet $(x_a, x_p, x_n)$. We then compute the gradients of these sample scores with respect to the image’s convolutional feature maps to get the attention map. Specifically, given a score $s_i$, $i \in \{a, p, n\}$, the attention map $M_i$ is determined as:
$$M_i = \mathrm{ReLU}\Big(\sum_k \alpha_k A_k\Big) \tag{1}$$
where $A_k$ is the $k$-th convolutional feature channel (from one of the intermediate layers) of the convolutional feature map $A$ and $\alpha_k = \mathrm{GAP}\left(\partial s_i / \partial A_k\right)$. The GAP operation is the same global average pooling operation described in GradCAM.
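To make the attention computation concrete, the following NumPy sketch works through a deliberately simplified case in which the gradient is available in closed form. It assumes a GradCAM-style map, i.e., a ReLU over the gradient-weighted sum of feature channels, that the embedding is the global average pool of the chosen convolutional feature map, and that the weight vector is treated as a constant when differentiating. In a real network the gradients would instead come from automatic differentiation; the function name is hypothetical.

```python
import numpy as np

def similarity_attention(conv_maps, w):
    """GradCAM-style attention from a similarity score.

    conv_maps: array of shape (K, H, W), the K feature channels A_k.
    w:         array of shape (K,), the weight vector (treated as constant).

    Assumption: the embedding is f = GAP(A), so the score is s = w . GAP(A)
    and ds/dA_k equals w_k / (H * W) at every spatial location.
    """
    K, H, W = conv_maps.shape
    grads = np.broadcast_to((w / (H * W))[:, None, None], conv_maps.shape)
    alpha = grads.mean(axis=(1, 2))                   # GAP of the gradients, shape (K,)
    weighted = np.tensordot(alpha, conv_maps, axes=1)  # sum_k alpha_k * A_k, shape (H, W)
    return np.maximum(0.0, weighted)                   # ReLU

rng = np.random.default_rng(0)
A = rng.random((4, 5, 5))            # toy non-negative feature maps
w = np.array([0.7, 0.0, 0.2, 0.1])   # toy weight vector
M = similarity_attention(A, w)
```

Channels with a zero weight (here the second one) do not contribute to the map, which mirrors the intent of focusing attention on the feature dimensions that drive the similarity decision.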
3.1.1 Extensions to other architectures
Our proposed technique to generate similarity attention is not limited to triplet CNNs and is extensible to other architectures as well. Here, we describe how to generate similarity attention using our proposed technique in conjunction with Siamese and quadruplet models.
For a Siamese similarity model, the inputs are pairs of data samples $(x_1, x_2)$. Given their feature vectors $f_1$ and $f_2$, we compute the weight vector $w$ in the same way as the triplet scenario. If $x_1$ and $x_2$ belong to the same class, $w = 1 - |f_1 - f_2|$. If they belong to different classes, $w = |f_1 - f_2|$. With $w$, we compute the sample scores $s_1 = w^\top f_1$ and $s_2 = w^\top f_2$, and use Equation 1 to compute attention maps $M_1$ and $M_2$ for $x_1$ and $x_2$ respectively.
For a quadruplet similarity model, the inputs are quadruplets of data samples $(x_a, x_p, x_{n1}, x_{n2})$, where $x_p$ is the positive sample and $x_{n1}$ and $x_{n2}$ are negative samples with respect to the anchor $x_a$. Here, we compute the three difference feature vectors $d_1 = |f_a - f_p|$, $d_2 = |f_a - f_{n1}|$, and $d_3 = |f_a - f_{n2}|$. Following the intuition described in the triplet case, we get the difference weight vectors as $w_1 = 1 - d_1$ for the positive pair and $w_2 = d_2$ and $w_3 = d_3$ for the two negative pairs. The overall weight vector is then computed as the element-wise product of the three individual weight vectors: $w = w_1 \odot w_2 \odot w_3$. Given $w$, we compute the sample scores $s_a$, $s_p$, $s_{n1}$, and $s_{n2}$, and use Equation 1 to obtain the four attention maps $M_a$, $M_p$, $M_{n1}$, and $M_{n2}$.
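The Siamese and quadruplet variants of the weight vector can be sketched in the same way as the triplet case. The snippet below assumes that a same-class pair uses one minus the absolute feature difference, that a different-class pair uses the absolute difference itself, and that the quadruplet weight is the element-wise product of one positive-pair term and two negative-pair terms (both negatives taken with respect to the anchor). Function names and toy values are hypothetical.

```python
import numpy as np

def siamese_weight_vector(f_1, f_2, same_class):
    """Weight vector for a pair: emphasize agreeing dimensions for a
    same-class pair and disagreeing dimensions for a different-class pair."""
    diff = np.abs(f_1 - f_2)
    return 1.0 - diff if same_class else diff

def quadruplet_weight_vector(f_a, f_p, f_n1, f_n2):
    """Element-wise product of one positive-pair and two negative-pair weights."""
    w_1 = 1.0 - np.abs(f_a - f_p)  # positive pair
    w_2 = np.abs(f_a - f_n1)       # first negative pair
    w_3 = np.abs(f_a - f_n2)       # second negative pair
    return w_1 * w_2 * w_3

# Toy 2-D features (assumed to lie in [0, 1] per dimension).
f_a = np.array([0.8, 0.1])
f_p = np.array([0.8, 0.6])
f_n1 = np.array([0.2, 0.1])
f_n2 = np.array([0.3, 0.9])

w_pos = siamese_weight_vector(f_a, f_p, same_class=True)
w_neg = siamese_weight_vector(f_a, f_n1, same_class=False)
w_quad = quadruplet_weight_vector(f_a, f_p, f_n1, f_n2)
```

Because the quadruplet weight is a product, a dimension must be informative for all three pairs to receive a non-zero weight.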
3.2 Learning with similarity mining
With our proposed mechanism to compute similarity attention, one can generate attention maps, like in Figure 2, to explain why the similarity model predicted that the data sample satisfies the similarity criterion. However, we note all operations leading up to and including Equation 1, where we compute the similarity attention, are differentiable and we can use the generated attention maps to further bootstrap the training process. As we show later, this helps improve downstream model performance, leading to better generalizability. To this end, we describe a new learning objective, similarity mining, that enables such similarity-attention-driven training of similarity models.
The goal of similarity mining is to facilitate the complete discovery of local image regions that the model deems necessary to satisfy the similarity criterion. To this end, given the three attention maps $M_a$, $M_p$, and $M_n$ (triplet case), we upsample them to be the same size as the input image and perform soft-masking, producing masked images that exclude pixels corresponding to high-response regions in the attention maps. This is realized as: $\hat{x} = x \odot (1 - \Sigma(M))$, where $\Sigma(Z) = \mathrm{sigmoid}(\omega (Z - \sigma))$ with scale $\omega$ and threshold $\sigma$ (all element-wise operations). These masked images are then fed back to the same encoder of the triplet model to obtain the feature vectors $\hat{f}_a$, $\hat{f}_p$, and $\hat{f}_n$. Our proposed similarity mining loss, $L_{sm}$, can then be expressed as:
$$L_{sm} = \Big|\, \|\hat{f}_a - \hat{f}_n\| - \|\hat{f}_a - \hat{f}_p\| \,\Big| \tag{2}$$
where $\|\cdot\|$ and $|\cdot|$ represent the Euclidean norm and the absolute value, respectively. The intuition here is that minimizing $L_{sm}$ drives the model to a state in which it has difficulty predicting whether the masked input triplet satisfies the triplet condition. This is because as $L_{sm}$ gets smaller, the model will have exhaustively discovered all possible local regions in the triplet, and erasing these regions (via soft-masking above) will leave no relevant features available for the model to predict that the triplet satisfies the criterion.
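The masking and mining steps can be sketched as follows. This is a minimal NumPy illustration under several assumptions that are ours for the purpose of the example: nearest-neighbour upsampling stands in for whatever interpolation is used in practice, the attention map is rescaled to $[0, 1]$ before gating, the soft mask is a sigmoid with hypothetical scale `omega` and threshold `sigma`, and the mining loss is taken to be the absolute difference between the anchor-negative and anchor-positive distances of the masked features.

```python
import numpy as np

def soft_mask(image, attn, omega=10.0, sigma=0.5):
    """Suppress high-attention pixels of `image`.

    image: (H, W, C) array; attn: (h, w) array with H % h == 0 and W % w == 0.
    omega and sigma are hypothetical scale/threshold values for the sigmoid gate.
    """
    H, W = image.shape[:2]
    h, w = attn.shape
    up = np.kron(attn, np.ones((H // h, W // w)))        # nearest-neighbour upsample
    up = up / (up.max() + 1e-12)                         # rescale to [0, 1]
    gate = 1.0 / (1.0 + np.exp(-omega * (up - sigma)))   # soft threshold
    return image * (1.0 - gate)[..., None]               # erase attended regions

def similarity_mining_loss(fh_a, fh_p, fh_n):
    """Small when the masked features no longer separate positive from negative."""
    return abs(np.linalg.norm(fh_a - fh_n) - np.linalg.norm(fh_a - fh_p))

# Toy example: attention concentrated in the top-left quadrant.
image = np.ones((4, 4, 3))
attn = np.array([[1.0, 0.0],
                 [0.0, 0.0]])
masked = soft_mask(image, attn)

# Masked features that are equidistant from the anchor give zero loss.
loss_equal = similarity_mining_loss(np.array([0.0, 0.0]),
                                    np.array([1.0, 0.0]),
                                    np.array([0.0, 1.0]))
loss_separated = similarity_mining_loss(np.array([0.0, 0.0]),
                                        np.array([1.0, 0.0]),
                                        np.array([0.0, 3.0]))
```

In the toy example the attended quadrant is almost fully erased while the rest of the image is left nearly intact, and the loss vanishes exactly when the masked features give the model no basis for ranking the positive above the negative.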
3.2.1 Extensions to other architectures
Like similarity attention, similarity mining is also extensible to other similarity learning architectures, and here we briefly describe how to integrate similarity mining with Siamese and quadruplet models.
For a Siamese similarity model, we consider only the positive pairs when enforcing the similarity mining objective. Given the two attention maps $M_1$ and $M_2$, we perform the soft-masking operation described above to obtain the masked images, which gives the feature vectors $\hat{f}_1$ and $\hat{f}_2$. The similarity mining objective $L_{sm}$ in this case attempts to maximize the distance between $\hat{f}_1$ and $\hat{f}_2$, so that $L_{sm}$ decreases as this distance grows. Like the triplet case, the intuition behind $L_{sm}$ here is that it seeks to get the model to a state where, after erasing, the model can no longer predict that the data pair belongs to the same class. This is because as $L_{sm}$ gets smaller, the model will have exhaustively discovered all corresponding regions that are responsible for the data pair to be predicted as belonging to the same class (i.e., low feature space distance), and erasing these regions (via soft-masking) will result in a larger feature space distance between the positive samples.
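For a quadruplet similarity model, using the four attention maps, we compute the feature vectors $\hat{f}_a$, $\hat{f}_p$, $\hat{f}_{n1}$, and $\hat{f}_{n2}$ using the same masking strategy above. We then consider the two triplets $(\hat{f}_a, \hat{f}_p, \hat{f}_{n1})$ and $(\hat{f}_a, \hat{f}_p, \hat{f}_{n2})$ in constructing the similarity mining objective as $L_{sm} = L_{sm}^{1} + L_{sm}^{2}$, where $L_{sm}^{1}$ and $L_{sm}^{2}$ correspond to Equation 2 evaluated for $(\hat{f}_a, \hat{f}_p, \hat{f}_{n1})$ and $(\hat{f}_a, \hat{f}_p, \hat{f}_{n2})$ respectively.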
3.3 Overall training objective
We train similarity models with both the traditional similarity/metric learning objective $L_{ml}$ (e.g., contrastive, triplet, etc.) as well as our proposed similarity mining objective $L_{sm}$. Our overall training objective $L$ is:
$$L = L_{ml} + \gamma L_{sm} \tag{3}$$
where $\gamma$ is a weight factor controlling the relative importance of $L_{ml}$ and $L_{sm}$. Figure 3 provides a visual summary of our training pipeline.
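A batched sketch of how the two terms might be combined is given below. It assumes the overall objective is the metric learning loss plus a weighted similarity mining loss, uses a hinge-style triplet loss as the metric learning term, and takes the mining term to be the absolute gap between anchor-negative and anchor-positive distances of the masked features. The weight `gamma=0.2`, the margin, and all function names are hypothetical and chosen only for illustration.

```python
import numpy as np

def batch_triplet_loss(f_a, f_p, f_n, margin=0.2):
    """Mean hinge-style triplet loss over a batch of (B, d) feature arrays."""
    d_ap = np.linalg.norm(f_a - f_p, axis=1)
    d_an = np.linalg.norm(f_a - f_n, axis=1)
    return np.maximum(0.0, d_ap - d_an + margin).mean()

def batch_mining_loss(fh_a, fh_p, fh_n):
    """Mean absolute gap between masked anchor-negative and anchor-positive distances."""
    d_ap = np.linalg.norm(fh_a - fh_p, axis=1)
    d_an = np.linalg.norm(fh_a - fh_n, axis=1)
    return np.abs(d_an - d_ap).mean()

def overall_objective(feats, masked_feats, gamma=0.2, margin=0.2):
    """Metric learning term plus gamma times the similarity mining term."""
    l_ml = batch_triplet_loss(*feats, margin=margin)
    l_sm = batch_mining_loss(*masked_feats)
    return l_ml + gamma * l_sm

# Random stand-ins for encoder outputs on original and soft-masked images.
rng = np.random.default_rng(0)
feats = tuple(rng.normal(size=(8, 16)) for _ in range(3))
masked_feats = tuple(rng.normal(size=(8, 16)) for _ in range(3))
total = overall_objective(feats, masked_feats, gamma=0.2)
```

Setting `gamma=0` recovers the baseline metric learning objective, which makes the contribution of the mining term easy to ablate.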
4 Experiments and Results
We conduct experiments on three different tasks: image retrieval (Sec. 4.1), person re-identification (Sec. 4.2), and one-shot semantic segmentation (Sec. 4.3) to demonstrate the efficacy and generality of our proposed framework.
4.1 Image Retrieval
We conduct experiments on the CUB200 (“CUB”), Cars-196 (“Cars”), and Stanford Online Products (“SOP”) datasets, following the protocol of Wang et al. and reporting performance using the standard Recall@K (R-K) metric. We first show ablation results to demonstrate performance gains achieved by the proposed similarity attention and similarity mining techniques. Here, we also empirically evaluate our proposed technique with three different similarity learning architectures to demonstrate its generality. In Table 1, we show both baseline results (trained only with the metric learning objective $L_{ml}$) and our results (trained with the overall objective $L$) with the Siamese, triplet, and quadruplet architectures. As can be noted from these numbers, our method consistently improves the baseline performance across all three architectures. Since the triplet model gives the best performance among the three architectures considered in Table 1, for all subsequent experiments, we only report results with the triplet variant.
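For readers unfamiliar with Recall@K, the sketch below shows one common way to compute it: each test embedding is used as a query against all other test embeddings, and a query counts as a hit at K if any of its K nearest neighbours shares its label. This is a generic illustration rather than our evaluation code; the function name and toy data are hypothetical, and Euclidean distance is assumed.

```python
import numpy as np

def recall_at_k(embeddings, labels, ks=(1, 2, 4)):
    """Recall@K with every sample used as a query against all other samples."""
    dists = np.linalg.norm(embeddings[:, None, :] - embeddings[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)            # a query must not retrieve itself
    order = np.argsort(dists, axis=1)          # nearest neighbours first
    hits = labels[order] == labels[:, None]    # True where a neighbour shares the label
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}

# Toy 2-D embeddings: the last sample of class 1 sits next to the class-0 cluster.
embeddings = np.array([[0.0, 0.0],
                       [0.1, 0.0],
                       [1.0, 1.0],
                       [1.1, 1.0],
                       [0.3, 0.0]])
labels = np.array([0, 0, 1, 1, 1])
recalls = recall_at_k(embeddings, labels)
```

In the toy data the misplaced sample only finds a same-class neighbour at rank 3, so it lowers Recall@1 and Recall@2 but not Recall@4.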
We next compare the performance of our proposed method with competing, state-of-the-art metric learning methods. To this end, we consider many recently published algorithms, summarizing our comparative performance in Table 2. We note our proposed method is quite competitive, with an R-1 performance improvement on CUB, matching (with DeML) R-1 and slightly better R-2 performance on Cars, and very close R-1 and slightly better R-1k performance (w.r.t. MS) on SOP.
In addition to obtaining superior quantitative performance, another key difference between our method and competing algorithms is explainability. With our proposed similarity attention mechanism, we can now visualize, by means of similarity attention maps, the model’s decision reasoning. In Figures 4 and 5, we show examples of attention maps generated with our method on testing data unseen during training. Figures 4(a) (model trained and tested on CUB) and 5(a) (model trained and tested on Cars) each show three triplet examples. As can be seen from these figures, our proposed method is generally able to highlight intuitively satisfying correspondence regions across the images in each triplet. For example, in Figure 4(a) (left triplet), the region around the face is what makes the second bird image similar, and the third bird image dissimilar, to the first (anchor) bird image. In Figure 5(a) (left triplet), the region around the headlights is highlighted by the model as being important for satisfying the triplet criterion.
To further demonstrate model generalizability, we show inter-dataset results in Figure 4(b) (model trained on Cars and tested on CUB) and Figure 5(b) (model trained on CUB and tested on Cars). We note that, despite not being trained on relevant data, our model trained with similarity attention is able to discover local regions contributing to the final decision. For example, in Figure 4(b) (left triplet), the model is able to discover the corresponding regions around the face and neck of the bird images, whereas in Figure 5(b) (center triplet), the model is able to discover the corresponding region around the rear window/wheels of the car images.
Finally, we show the impact of similarity mining on the generated attention maps in Figure 6, where (a) and (b) show one triplet example from CUB and Cars respectively (left triplet: baseline trained with $L_{ml}$ only, right triplet: proposed method trained with $L_{sm}$ as well). We clearly see from both examples that the proposed $L_{sm}$ results in more exhaustive and accurate discovery of local regions, further demonstrating (along with Table 1) its impact in improving model performance.
4.2 Person Re-Identification
Since re-id is a special case of image retrieval, our proposed method is certainly applicable, and we conduct experiments on the CUHK03-NP detected (“CUHK”) [16, 51] and DukeMTMC-reid (“Duke”) [25, 50] datasets, following the protocol in Sun et al.
|CASN (PCB) ||71.5||64.4||87.7||73.7|
We train the model for 40 epochs with the Adam optimizer. We summarize our results in Table 3, where we note our method results in a rank-1 performance improvement on CUHK and very close performance (88.5% rank-1) to the best performing method (MGN) on Duke. We note that some of these competing methods have re-id specific design choices (e.g., upright pose assumption for attention consistency in CASN, hard attention in HA-CNN, attentive feature refinement and alignment in DuATM). On the other hand, we make no such assumptions, and despite this, our method is able to achieve competitive performance.
4.3 Weakly supervised one-shot semantic segmentation
In the one-shot semantic segmentation task, we are given a test image and a pixel-level semantically labeled support image, and we are to semantically segment the test image. Given that we learn similarity predictors, we can use our model to establish correspondences between the test and the support images. One aspect that is particularly appealing with our method is explainability, and the resulting similarity attention maps we generate can be used as cues to perform semantic segmentation.
We use the PASCAL dataset (“Pascal”) for all experiments, following the same protocol as Shaban et al. Given a test image and the corresponding support image, we first use our trained model to generate two similarity attention maps, one for each image. We then use the attention map for the test image as a cue to generate the final segmentation mask using the GrabCut algorithm. We first show some qualitative results in Figure 7 (left to right: test image, support image, test attention map, support image attention map, predicted segmentation mask, ground truth mask). In the first row, we see that, in the test attention map, our method is able to capture the “dog” regions in the test image, helping generate the final segmentation result. In the third row, despite the presence of a horse in the support image, we are able to get high-response attention on mostly the person part of the image, resulting in a fairly reasonable segmentation mask. We also show mean IoU results in Table 4. Here, we highlight several aspects. First, all these competing methods are specifically trained towards the one-shot segmentation task, whereas our model is trained for metric learning. Second, they use the support image label mask both during training and inference, whereas our method does not use this label data. Finally, they are trained on Pascal, i.e., relevant data, whereas our model was trained on CUB and Cars, data that is irrelevant in this context. Despite these seemingly disadvantageous factors, our method is able to achieve competitive performance, and even outperforms these methods in some cases and on the overall mean. These results demonstrate the potential of our proposed method in training similarity predictors that can generalize to data unseen during training and also to tasks the models were not originally trained for.
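The step from an attention map to a segmentation cue, and the IoU measure underlying the mean IoU numbers, can be sketched as follows. The snippet only prepares seed labels of the kind an interactive segmentation method such as GrabCut could be initialized with; it does not run GrabCut itself. The thresholds, the seed label codes (0: background, 1: foreground, 2: undecided), and the function names are hypothetical choices made for this illustration.

```python
import numpy as np

def attention_to_seeds(attn, fg_thresh=0.6, bg_thresh=0.2):
    """Turn an attention map into coarse seed labels.

    Returns an array with 1 for confident foreground, 0 for confident
    background, and 2 for pixels left for the segmentation step to decide.
    """
    a = (attn - attn.min()) / (attn.max() - attn.min() + 1e-12)  # rescale to [0, 1]
    seeds = np.full(a.shape, 2, dtype=np.uint8)
    seeds[a >= fg_thresh] = 1
    seeds[a <= bg_thresh] = 0
    return seeds

def iou(pred, gt):
    """Intersection over union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 1.0

# Toy attention map with a high response in the top-left corner.
attn = np.array([[0.9, 0.8, 0.1],
                 [0.7, 0.5, 0.0],
                 [0.1, 0.0, 0.0]])
gt_mask = np.array([[1, 1, 0],
                    [1, 1, 0],
                    [0, 0, 0]])

seeds = attention_to_seeds(attn)
score = iou(seeds == 1, gt_mask)
```

Here the confident foreground seeds cover three of the four ground-truth pixels, and the remaining pixel is left undecided for the downstream segmentation step to resolve.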
While we show results in the one-shot case, our proposed pipeline is not limited by the number of support images and is certainly extensible to multi-shot scenarios without modifying the core architecture of our proposed method.
5 Summary and Future Work
We presented new techniques to explain and visualize, with gradient-based attention, predictions of similarity models. We showed our resulting similarity attention is generic and applicable to many commonly used similarity architectures. We presented a new paradigm for learning similarity functions with our similarity mining learning objective, resulting in improved downstream model performance. We also demonstrated the versatility of our framework in learning models for a variety of unrelated applications, e.g., image retrieval (including re-id) and low-shot semantic segmentation. While we show one-shot (e.g., one image per class in triplet) similarity explanations, one can easily extend this approach to generate explanations for set-to-set matching too. Our results also suggest that the similarity explanations we generate can be used to address a variety of label propagation problems that need a first step of correspondence learning. With our similarity attention, we can establish such correspondences in an unsupervised fashion, opening new avenues for advances in zero- or few-shot learning. Our method can also find use in targeted retrieval for medical diagnosis applications where a doctor can “tag” a certain region in the image under examination and retrieve relevant “similar” historical records for further diagnosis.
-  Sameer Agarwal, Josh Wills, Lawrence Cayton, Gert Lanckriet, David Kriegman, and Serge Belongie. Generalized non-metric multidimensional scaling. In AISTATS, 2007.
-  Sergi Caelles, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixé, Daniel Cremers, and Luc Van Gool. One-shot video object segmentation. In CVPR, 2017.
-  Aditya Chattopadhay, Anirban Sarkar, Prantik Howlader, and Vineeth N. Balasubramanian. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In WACV, 2018.
-  Binghui Chen and Weihong Deng. Hybrid-attention based decoupled metric learning for zero-shot image retrieval. In CVPR, 2019.
-  Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, 2005.
-  Hiroshi Fukui, Tsubasa Hirakawa, Takayoshi Yamashita, and Hironobu Fujiyoshi. Attention branch network: Learning of attention mechanism for visual explanation. In CVPR, 2019.
-  Ben Harwood, BG Kumar, Gustavo Carneiro, Ian Reid, and Tom Drummond. Smart mining for deep metric learning. In ICCV, 2017.
-  Ben Harwood, G VijayKumarB., Gustavo Carneiro, Ian Reid, and Tom Drummond. Smart mining for deep metric learning. In ICCV, 2017.
-  Srikrishna Karanam, Mengran Gou, Ziyan Wu, Angels Rates-Borras, Octavia Camps, and Richard J Radke. A systematic evaluation and benchmark for person re-identification: Features, metrics, and datasets. IEEE T-PAMI, 41(3):523–536, 2018.
-  Wonsik Kim, Bhavya Goyal, Kunal Chawla, Jungmin Lee, and Keunjoo Kwon. Attention-based ensemble for deep metric learning. In ECCV, 2018.
-  Martin Koestinger, Martin Hirzer, Paul Wohlhart, Peter M. Roth, and Horst Bischof. Large scale metric learning from equivalence constraints. In CVPR, 2012.
-  Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-grained categorization. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 554–561, 2013.
-  Brian Kulis. Metric learning: A survey. Foundations and Trends® in Machine Learning, 5(4):287–364, 2013.
-  Marc T Law, Nicolas Thome, and Matthieu Cord. Quadruplet-wise image similarity learning. In ICCV, 2013.
-  Kunpeng Li, Ziyan Wu, Kuan-Chuan Peng, Jan Ernst, and Yun Fu. Tell me where to look: Guided attention inference network. In CVPR, 2018.
-  Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. DeepReID: Deep filter pairing neural network for person re-identification. In CVPR, 2014.
-  Wei Li, Xiatian Zhu, and Shaogang Gong. Harmonious attention network for person re-identification. In CVPR, 2018.
-  Shengcai Liao, Yang Hu, Xiangyu Zhu, and Stan Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In CVPR, 2015.
-  Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In CVPR, 2015.
-  Yair Movshovitz-Attias, Alexander Toshev, Thomas K. Leung, Sergey Ioffe, and Saurabh P. Singh. No fuss distance metric learning using proxies. In ICCV, 2017.
-  Michael Opitz, Georg Waltner, Horst Possegger, and Horst Bischof. BIER - boosting independent embeddings robustly. In ICCV, 2017.
-  Sateesh Pedagadi, James Orwell, Sergio Velastin, and Boghos Boghossian. Local Fisher discriminant analysis for pedestrian re-identification. In CVPR, 2013.
-  Kate Rakelly, Evan Shelhamer, Trevor Darrell, Alyosha Efros, and Sergey Levine. Conditional networks for few-shot semantic segmentation. In ICLR Workshops, 2018.
-  Oren Rippel, Manohar Paluri, Piotr Dollar, and Lubomir Bourdev. Metric learning with adaptive density discrimination. In ICLR, 2016.
-  Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In ECCVW, 2016.
-  Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. ACM transactions on graphics (TOG), 23(3):309–314, 2004.
-  Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
-  Amirreza Shaban, Shray Bansal, Zhen Liu, Irfan Essa, and Byron Boots. One-shot learning for semantic segmentation. In BMVC, 2017.
-  Jianlou Si, Honggang Zhang, Chun-Guang Li, Jason Kuen, Xiangfei Kong, Alex C Kot, and Gang Wang. Dual attention matching network for context-aware feature sequence based person re-identification. In CVPR, 2018.
-  Kihyuk Sohn. Improved deep metric learning with multi-class N-pair loss objective. In NIPS, 2016.
-  Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In CVPR, 2016.
-  Yifan Sun, Liang Zheng, Weijian Deng, and Shengjin Wang. SVDNet for pedestrian retrieval. In ICCV, 2017.
-  Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV, 2018.
-  Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In CVPR, 2018.
-  Evgeniya Ustinova and Victor Lempitsky. Learning deep embeddings with histogram loss. In NIPS, 2016.
-  Laurens Van Der Maaten and Kilian Weinberger. Stochastic triplet embedding. In MLSP, 2012.
-  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
-  Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset, 2011.
-  Fei Wang, Mengqing Jiang, Chen Qian, Shuo Yang, Cheng Li, Honggang Zhang, Xiaogang Wang, and Xiaoou Tang. Residual attention network for image classification. In CVPR, 2017.
-  Guanshuo Wang, Yufeng Yuan, Xiong Chen, Jiwei Li, and Xi Zhou. Learning discriminative features with multiple granularities for person re-identification. In ACM MM, 2018.
-  Lezi Wang, Ziyan Wu, Srikrishna Karanam, Kuan-Chuan Peng, Rajat Vikram Singh, Bo Liu, and Dimitris Metaxas. Sharpen focus: Learning with attention separability and consistency. In ICCV, 2019.
-  Xun Wang, Xintong Han, Weiling Huang, Dengke Dong, and Matthew R. Scott. Multi-similarity loss with general pair weighting for deep metric learning. In CVPR, 2019.
-  Kilian Q. Weinberger and Lawrence K. Saul. Distance metric learning for large margin nearest neighbor classification. The Journal of Machine Learning Research, 2009.
-  Chao-Yuan Wu, R Manmatha, Alexander J Smola, and Philipp Krahenbuhl. Sampling matters in deep embedding learning. In ICCV, 2017.
-  Yuhui Yuan, Kuiyuan Yang, and Chao Zhang. Hard-aware deeply cascaded embedding. In ICCV, 2017.
-  Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In ECCV, 2014.
-  Liang Zheng, Yi Yang, and Alexander G Hauptmann. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984, 2016.
-  Meng Zheng, Srikrishna Karanam, Ziyan Wu, and Richard J Radke. Re-identification with consistent attentive siamese networks. In CVPR, 2019.
-  Wenzhao Zheng, Zhaodong Chen, Jiwen Lu, and Jie Zhou. Hardness-aware deep metric learning. In CVPR, 2019.
-  Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In ICCV, 2017.
-  Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. Re-ranking person re-identification with k-reciprocal encoding. In CVPR, 2017.
-  Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, 2016.
-  Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In CVPR, 2016.