Re-Identification with Consistent Attentive Siamese Networks

We propose a new deep architecture for person re-identification (re-id). While re-id has seen much recent progress, spatial localization and view-invariant representation learning for robust cross-view matching remain key, unsolved problems. We address these questions by means of a new attention-driven Siamese learning architecture, called the Consistent Attentive Siamese Network. Our key innovations compared to existing, competing methods include (a) a flexible framework design that produces attention with only identity labels as supervision, (b) explicit mechanisms to enforce attention consistency among images of the same person, and (c) a new Siamese framework that integrates attention and attention consistency, producing principled supervisory signals as well as the first mechanism that can explain the reasoning behind the Siamese framework's predictions. We conduct extensive evaluations on the CUHK03-NP, DukeMTMC-ReID, and Market-1501 datasets, and establish a new state of the art, with our proposed method resulting in mAP performance improvements of 6.4%, 4.2%, and 1.2%, respectively.



1 Introduction

Given an image or a set of images of a person of interest in a “probe” camera view, person re-identification (re-id) attempts to retrieve this person of interest among a set of “gallery” candidates in another camera view. Due to its broad appeal in several video analytics applications such as surveillance, re-id has seen explosive growth in the computer vision community [14, 42, 40].

While we have seen tremendous progress in re-id [3, 30, 32, 2, 34, 26, 29, 33], there are several problems that still hinder the reliable, real-world use of person re-id. Probe and gallery camera views in real-world applications typically have large viewpoint variations, causing substantial view misalignment between probe and gallery images of the same person. Illumination differences between the locations where the cameras are installed, as well as occlusions in the captured data, add to re-id’s challenges. Ideally, we want a method that can reliably spatially localize the person of interest in the image, while also providing a robust representation of the localized part in order to match accurately to the gallery of candidates. This suggests we consider the spatial localization and feature representation problems jointly and formulate the learning objective in a way that can facilitate end-to-end learning.

Figure 1: We present the first framework for re-id that provides mechanisms to make attention and attention consistency end-to-end trainable in a Siamese learning architecture, resulting in a technique for robust cross-view matching that can also explain why the model predicts that the two images belong to the same person.

Attention is a powerful concept for understanding and interpreting neural network decisions [23, 46, 8, 25], providing ways to generate attentive regions given image-level labels and trained models, and to perform spatial localization. Some recent extensions [16] take this a step forward by training models with attention providing end-to-end supervision, resulting in improved spatial localization. These methods were not designed for the re-id problem and consequently did not have to consider localization and invariant representation learning jointly. While there have been some attempts at joint learning with these two objectives [35, 15, 22, 19], these methods do not explicitly enforce any sort of attention consistency between images of the same person. Intuitively, given two images of the same person from different views, there typically exist some common regions in the images that are important for matching, which should be reflected in how attention is modeled and used for supervision.

Furthermore, such attention consistency should lead to consistent feature representations for the two different images, leading to invariant representations for robust cross-view matching. These considerations naturally suggest the design of a Siamese framework that jointly learns consistent attention regions for images of the same person while also producing robust, invariant feature representations. While one recent paper approached these problems jointly [35], this method requires specially-designed architectures for attention modeling and considers the attention in each image independently, ignoring the intuition that attentive regions across images of the same person have to be consistent. It also does not have an explicit mechanism to explain the reasoning behind the model’s prediction. To this end, we design and propose a new deep architecture for re-id, which we call the Consistent Attentive Siamese Network (CASN), addressing all the key questions and considerations discussed above (Figure 1). Specifically, we design a novel two-branch architecture that (a) produces attentive regions during training without requiring any additional supervision other than identity labels or any specially-designed architecture for modeling attention, (b) explicitly enforces these attentive regions to be consistent for the same person, (c) uses attention and attention consistency as an explicit and principled part of the learning process, and (d) learns to produce robust representations for cross-view matching.

To summarize, our key contributions include:

  • We present a technique that makes spatial localization of the person of interest a principled part of the learning process, providing supervision only by means of person identity labels. This makes spatial localization end-to-end trainable and automatically discovers complete attentive regions.

  • We present a new scheme that explicitly enforces attention consistency as part of the learning process, providing supervision that facilitates end-to-end learning of consistent attentive regions of images of the same person.

  • We present the first learning architecture that integrates attention consistency and Siamese learning in a joint learning framework.

  • We present the first Siamese attention mechanism that jointly models consistent attention across similar images, resulting in a powerful method that can help explain the reasoning behind the network’s prediction.

  • We establish a new state of the art on the CUHK03-NP, DukeMTMC-ReID, and Market1501 datasets, with our method resulting in substantial mAP performance improvements of 6.4%, 4.2%, and 1.2% respectively as of ECCV 2018.

2 Related Work

Traditional person re-id algorithms involved hand-crafted feature design followed by supervised distance metric learning. See Karanam et al. [14] and Zheng et al. [42] for excellent experimental and algorithmic studies.

Recent developments in deep learning [10, 11] have influenced the design of re-id algorithms as well, with deep re-id algorithms achieving impressive performance on challenging datasets [30, 32, 3]. However, naive training of re-id models without being spatial-localization-aware will not result in satisfactory performance due to cross-view misalignment, occlusions, and clutter. To get around these issues, several recent methods adopt some form of localized representation learning. Zhao et al. [39] decomposed person images into different part regions and learned region-specific representations followed by an aggregation scheme to produce the overall image representation. Li et al. [15] proposed to first learn and localize part body features by means of spatial transformer networks [13], followed by a combination of local and global features to learn a classification network. Su et al. [29] used human pose information as a supervisory signal to learn normalized human part representations as part of an identification network. However, these and several other recent methods [16] consider the spatial localization problem in itself and produce representations and localizations that are not cross-view consistent. On the other hand, our approach tackles spatial localization and representation learning in a holistic, joint framework while enforcing consistency, which is key to re-id.

Attention has been used in re-id to tackle localization and misalignment problems. Liu et al. [22] proposed the HydraPlus-Net architecture that learns to discover low- and semantic-level attentive features for richer image representations. Li et al. [19] designed a scheme to simultaneously learn “hard” region-level and “soft” pixel-level attentive features for a multi-granular feature representation. Li et al. [17] learned multiple, predefined attention models and showed that each model corresponds to a specific body part, the outputs of which are then aggregated by means of a temporal attention model. These methods typically have inflexible region-specific attention models as part of the overall framework to learn important regions in the image, and more importantly, do not have an explicit mechanism to enforce attention consistency. Our approach is markedly different from these and other methods [37, 28] in this category in that we only need image-level labels to learn attention, while also enforcing attention consistency by making it a principled part of the learning process.

Consistency is an important aspect of re-id to account for cross-view differences, which has typically been reflected in Siamese-like designs for re-id models that attempt to learn invariant feature representations [18, 7, 5, 26, 38]. While these models learn features and distance metrics jointly, they do not address the spatial localization problem directly, typically formulating a local parts-based approach to solve the problem. In scenarios involving occlusion and clutter, this may not be an optimal solution, with attention leading to better spatial localization. To this end, our method, as opposed to these approaches, exploits attention during the learning process while also learning consistent spatial localization and invariant feature representations jointly.

3 The Consistent Attentive Siamese Network

In this section, we introduce our proposed attention-based deep architecture for person re-id, the Consistent Attentive Siamese Network (CASN), summarized in Figure 2. CASN includes an identification module and a Siamese module that provide for a powerful, flexible approach to deal with viewpoint variations, occlusions, and background clutter. The identification module (Section 3.1), with its explicit attention guidance as supervision given only identity labels, helps find reliable and accurate spatial localization for the person of interest in the image and performs identity (ID) prediction. The Siamese module (Section 3.2) provides the network with supervisory signals from attention consistency, ensuring that we obtain spatially consistent attention regions for images of the same person, as well as learning view-invariant feature representations for robust gallery matching.

In the following, we describe each of these two modules in more detail, leading up to the overall design of the CASN.

Figure 2: The proposed Consistent Attentive Siamese Network (CASN).
Figure 3: Architecture of the Identification (IDE) Baseline. $\mathbf{f}$ is the feature vector extracted after ResNet50 conv5. $\mathbf{y}$ is the ID prediction vector, which has dimensionality equal to the total number of different training identities. $y^{c}$ is the prediction score of ID label $c$ for the input image.

3.1 The Identification Module

We first introduce the architecture of the identification module of the CASN. We begin by describing the baseline architecture for training an identification (IDE) model [42], followed by the overall identification module that integrates attention guidance into the IDE architecture.

3.1.1 The IDE Baseline Architecture

The IDE baseline is based on the ResNet50 architecture [10], following the work in [42] and recent papers that adopt ResNet50 [17, 32, 33]. Convolutional layers from conv1 through conv5 are pretrained on ImageNet, following which an IDE classifier comprised of two fully-connected layers produces the identity prediction for the input image.

Figure 4: An attention map with identification loss (left) and identification loss with attention learning (right).

The identification baseline is visually summarized in Figure 3. The model is learned by optimizing the identification loss, which essentially maximizes the likelihood of predicting the correct class (identity) label for each training image. Formally, given $N$ training images $\{I_i\}_{i=1}^{N}$ belonging to $C$ different identities, with each image $I_i$ having an identity label $c_i$, we optimize the following multi-class cross-entropy loss:

$$L_{ide} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(y_i^{c_i})}{\sum_{j=1}^{C} \exp(y_i^{j})} \tag{1}$$

where $y_i^{j}$ is the prediction of class $j$ from the IDE classifier for input image $I_i$.
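As an illustration of the identification loss described above, the following is a minimal sketch in plain Python (no deep learning framework, so it runs without dependencies). The function name `identification_loss` and the toy scores are assumptions made for this example; in practice the scores would come from the IDE classifier.

```python
import math

def identification_loss(logits, labels):
    """Mean multi-class cross-entropy over a batch.

    logits: list of per-image score lists (one score per training identity).
    labels: list of integer identity labels, one per image.
    """
    total = 0.0
    for scores, label in zip(logits, labels):
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        # log_norm - scores[label] equals -log(softmax(scores)[label]).
        total += log_norm - scores[label]
    return total / len(labels)

# Toy batch: two images, three identities (hypothetical values).
loss = identification_loss([[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]], [0, 2])
```

A confident, correct prediction contributes a small term, while a uniform prediction over three identities contributes log 3.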

3.1.2 Identification Attention

Spatial localization of the person of interest is a key first step for a re-id algorithm, which should be reflected in the end-to-end learning process. While much recent work has focused on generating attention regions given image-level labels [23, 46, 8, 25], we need to make attention an explicit part of the learning process itself, which can then guide the network to better localize the person of interest. To this end, we adopt the framework of Li et al. [16] and introduce attention learning as part of our identification module, helping the network generate spatially attentive regions in person images without needing any extra information as supervision other than identity labels, which are already available.

Given an input image $I$ with its identity label $c$, we first obtain the attention (localization) map $M$ from the IDE classifier prediction by means of Grad-CAM [25]. However, a re-id model trained only with IDE loss would focus only on the most discriminative regions important for satisfying the current classification objective, and may not generalize well. To better illustrate this concept, consider the Grad-CAM attention map example shown in Figure 4 (left) for an image from Market-1501 [41]. The gray pants of the person attract the most attention, but the blue jacket is also useful information that is ignored in the attention map on the left. To obtain more complete attention maps and focus on the foreground subject, we use the notion of attention learning. Specifically, given $I$ and $c$, we compute its attention map $M$ and mask out the most discriminative regions in $I$ (corresponding to high responses in $M$) by means of the soft-masking operation $\Sigma(\cdot)$ to get $\hat{I} = I \odot (1 - \Sigma(M))$, where $\odot$ is pixel-wise multiplication and $\Sigma(\cdot)$ is a sigmoid-based soft threshold on the attention map, as in [16]. This produces an $\hat{I}$ that excludes all high-response image pixels. If $M$ perfectly spatially localizes the person of interest, $\hat{I}$ will contain no pixels contributing to the corresponding identity prediction $y^{c}$. We use this notion to provide supervision to the identification module to produce more complete spatial localization. Specifically, we define the identification attention loss $L_{ia}$ for the identification module as the prediction score of the masked input image $\hat{I}$:

$$L_{ia} = y^{c}(\hat{I}) \tag{2}$$
A comparison of the attention maps retrieved from a model trained only with the identification loss and from one trained with both the identification and identification attention losses is shown in Figure 4, where we see more foreground subject coverage with attention learning on the right. To summarize, in the identification module, we first use the IDE baseline architecture to obtain identity predictions. Attention maps are then computed by means of Grad-CAM, which are then refined using the identification attention objective on masked images that exclude high-attention regions to perform more complete spatial localization.
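The soft-masking step can be sketched as follows. This is only an illustration under stated assumptions: the sigmoid form with scale `omega` and threshold `sigma` follows the style of guided attention in [16], and the default values, function names, and the single-channel toy image are all hypothetical.

```python
import math

def soft_mask(attention, omega=10.0, sigma=0.5):
    """Sigmoid soft threshold: close to 1 where attention exceeds sigma."""
    return [[1.0 / (1.0 + math.exp(-omega * (a - sigma))) for a in row]
            for row in attention]

def mask_image(image, attention, omega=10.0, sigma=0.5):
    """Suppress high-attention pixels: masked = image * (1 - mask)."""
    mask = soft_mask(attention, omega, sigma)
    return [[pix * (1.0 - m) for pix, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]

# Toy single-channel 2x2 image with attention normalized to [0, 1].
image = [[0.8, 0.6], [0.4, 0.2]]
attention = [[0.95, 0.05], [0.5, 0.0]]
masked = mask_image(image, attention)
# The identification attention loss would then be the classifier's score for
# the true identity on `masked`; the classifier is outside this sketch.
```

Pixels under strong attention are almost removed, pixels under weak attention are almost unchanged, and a pixel exactly at the threshold is halved. Minimizing the true-identity score on the masked image therefore pushes the attention map to cover every region that still carries identity evidence.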

3.1.3 Discussion

While the IDE architecture can provide a good baseline feature representation for matching [42, 14, 33] and our proposed identification module discussed above can further lead to reasonable spatial localization by design, several problems still remain unaddressed. First, the identification module has no mechanism to ensure we obtain consistent attention regions for different images of the same person. This can be inferred from the design itself, which lacks any guiding principle to result in attention consistency. Intuitively, this is key to robust re-id since there are typically common regions in different images of the same person that need to be brought out as important during matching. Second, the identification module has no mechanism to learn invariant identity-aware representations across different camera views. Furthermore, attention consistency should correspond to consistent feature representations, suggesting it should inform representation learning. Finally, the attention component of the identification module is not particularly suitable during inference since we do not know the identity of a test image to compute its attention map. While a workaround to this problem would be to use the top-k predictions to compute attention, this clearly would be a sub-optimal solution.

The problems with the identification module lead us to the design of the Siamese module of the CASN, which attempts to address these issues in a principled manner.

Figure 5: Demonstration of the Siamese Attention Mechanism. Yellow arrows denote backward operation and green arrows denote forward operation. The BCE loss and spatial constraint are added to form the Siamese attention loss $L_{sa}$.

3.2 The Siamese Module

In this section, we introduce the Siamese module to complement the identification module of the proposed CASN. Given a pair of input images, we first consider a binary classification problem (Section 3.2.1), whose objective function is then used to formulate a Siamese attention mechanism (Section 3.2.2) to enforce attention consistency and consistency-aware invariant representation learning.

3.2.1 Binary Classification

Given a pair of input images, we construct a binary classification objective for predicting whether or not the pair belongs to the same class. Given feature vectors $\mathbf{f}_1$ and $\mathbf{f}_2$ for the images $I_1$ and $I_2$ in the input pair (see Figure 3), we compute the difference $\mathbf{f}_d = \mathbf{f}_1 - \mathbf{f}_2$, which forms the input for a classifier that uses the binary cross-entropy objective (BCE) to get the class prediction for the current input pair. The BCE classifier is structurally similar to the IDE classifier in Section 3.1.1, with two fully connected layers. The output prediction vector of the BCE classifier is a 2-dimensional vector, which indicates whether or not the input pair belongs to the same identity. The BCE classification objective that is optimized is defined, for a batch of $N$ input pairs, as:

$$L_{bce} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(z_i^{c_i})}{\exp(z_i^{1}) + \exp(z_i^{2})} \tag{3}$$

where $z_i^{j}$ is the same ($j = 1$) or different ($j = 2$) identity prediction of the BCE classifier for input pair $i$, and $c_i \in \{1, 2\}$ is the ground-truth label of that pair.

3.2.2 The Siamese Attention Mechanism

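As discussed previously, identification attention alone does not ensure attention consistency and identity-aware invariant representations. To this end, we propose a new Siamese attention mechanism with explicit guidance towards attention consistency. Consider two images $I_1$ and $I_2$ of the same identity and the corresponding BCE classifier prediction score $z^{1}$ for the same-identity class. We first localize the attentive regions in the two images that contribute to this BCE prediction. To this end, we compute the gradient of the prediction score $z^{1}$ with respect to the feature difference vector $\mathbf{f}_d = \mathbf{f}_1 - \mathbf{f}_2$, i.e., $\partial z^{1} / \partial \mathbf{f}_d$. We then find the features in $\mathbf{f}_d$ that have a positive influence on the final BCE prediction by means of an indicator vector $\boldsymbol{\delta}$ constructed as:

$$\delta_k = \begin{cases} 1, & \text{if } \partial z^{1} / \partial f_{d,k} > 0 \\ 0, & \text{otherwise} \end{cases} \tag{4}$$

Based on the indicator vector $\boldsymbol{\delta}$, the importance scores for the input feature vectors $\mathbf{f}_1$ and $\mathbf{f}_2$ can be calculated as the dot products of $\boldsymbol{\delta}$ and the feature vectors: $s_1 = \boldsymbol{\delta}^{\top} \mathbf{f}_1$ and $s_2 = \boldsymbol{\delta}^{\top} \mathbf{f}_2$. In the same spirit as Grad-CAM [25], gradients backpropagated from $s_1$ and $s_2$ are first globally average-pooled to find the channel importance weights $\alpha_1^{k}$ and $\alpha_2^{k}$ at the last convolutional layer:

$$\alpha_m^{k} = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial s_m}{\partial A_{m,ij}^{k}}, \quad m \in \{1, 2\} \tag{5}$$

where $A_1$ and $A_2$ are feature maps of image $I_1$ and $I_2$ at the last convolutional layer, $k$ indexes channels, and $Z$ is the number of spatial locations. The attention maps can then be computed as:

$$M_m = \mathrm{ReLU}\Big( \sum_{k} \alpha_m^{k} A_m^{k} \Big), \quad m \in \{1, 2\} \tag{6}$$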
A visualization of the attention maps, extracted from the BCE loss, is shown in Figure 6. For images of the same person, we want the attention maps and to provide consistent importance to corresponding regions in the images. For instance, as we can see in Figure 6(b), the attention map in Image 1 focuses on the full body of the person while that in Image 2 mostly focuses on the lower part. To provide an explicit attention-consistency-aware supervisory signal and guide the network to discover consistent cross-view importance regions, we introduce the notion of spatial attention constraints based on the attention maps derived from the BCE classification objective.
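The chain from pair prediction to attention map can be made concrete with a simplified sketch. Two assumptions are made here that are not taken from the paper: the pair classifier is treated as a single linear layer (so the gradient of the same-identity score with respect to the feature difference is just that output's weight vector), and each feature is the global average of one last-layer feature map. The function name and toy values are hypothetical.

```python
def siamese_attention_map(w_same, feature_maps):
    """Attention map for one image of a pair.

    w_same: weights of the "same identity" output of an assumed linear pair
        classifier; for a linear layer this is also the gradient of that
        score with respect to the feature difference.
    feature_maps: last-layer feature maps, indexed [channel][row][col].
    """
    # Indicator vector: 1 where the feature has positive influence.
    indicator = [1.0 if w > 0 else 0.0 for w in w_same]
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    # With feature k = mean(map k) and score = indicator . features, the
    # gradient of the score w.r.t. every pixel of map k is
    # indicator[k] / (H * W), so its spatial average is the same value.
    channel_weights = [d / (height * width) for d in indicator]
    attention = [[0.0] * width for _ in range(height)]
    for weight, fmap in zip(channel_weights, feature_maps):
        for i in range(height):
            for j in range(width):
                attention[i][j] += weight * fmap[i][j]
    # ReLU keeps only regions with positive evidence.
    return [[max(0.0, a) for a in row] for row in attention]

maps = [
    [[4.0, 0.0], [0.0, -8.0]],          # channel 0: positive classifier weight
    [[100.0, 100.0], [100.0, 100.0]],   # channel 1: negative weight, ignored
]
attention = siamese_attention_map([0.7, -0.3], maps)
```

With a deeper classifier the gradient would be obtained by backpropagation rather than read off the weights, but the indicator, weighting, and ReLU steps would stay the same.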

Figure 6: Demonstration of attention maps from BCE loss. (a-c): positive pairs, (d-f): negative pairs.

Given the attention maps $M_1$ and $M_2$, we first apply the max-pooling operation to compute the highest response across each horizontal row of pixels, giving us the two importance vectors $\mathbf{v}_1$ and $\mathbf{v}_2$. To enforce attention consistency, we explicitly constrain them to be as close as possible. To avoid alignment issues as in Figure 6(c), we find the first and the last element of the vertical vector larger than a certain threshold in $\mathbf{v}_1$ and $\mathbf{v}_2$, and then resize the remaining elements to be of the same dimensions. We define the Siamese attention loss $L_{sa}$ that enforces attention consistency as:

$$L_{sa} = L_{bce} + \gamma \, \lVert \hat{\mathbf{v}}_1 - \hat{\mathbf{v}}_2 \rVert \tag{7}$$

where $L_{bce}$ is defined in Equation 3, $\hat{\mathbf{v}}_1$ and $\hat{\mathbf{v}}_2$ are resized vectors of $\mathbf{v}_1$ and $\mathbf{v}_2$ after alignment, $\lVert \hat{\mathbf{v}}_1 - \hat{\mathbf{v}}_2 \rVert$ is the distance between $\hat{\mathbf{v}}_1$ and $\hat{\mathbf{v}}_2$, and $\gamma$ is a weight parameter controlling the importance of the BCE loss vis-a-vis the spatial attention constraints.

A visual summary of our proposed Siamese attention mechanism is shown in Figure 5. For input pairs belonging to the same identity, attention maps are retrieved from the BCE classifier predictions, following which they are max-pooled to gather localization statistics for enforcing spatial attention consistency.
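The spatial attention constraint (row-wise max-pooling, threshold-based cropping, resizing, and a distance) can be sketched as below. The threshold of 0.2, the target length of 8, the linear interpolation, and the Euclidean distance are assumptions chosen for illustration; the text above only specifies "a certain threshold" and "the distance".

```python
def row_max(attention):
    """Max-pool each horizontal row, giving one value per image row."""
    return [max(row) for row in attention]

def crop_active(vector, threshold):
    """Keep the span from the first to the last element above threshold."""
    active = [i for i, v in enumerate(vector) if v > threshold]
    if not active:
        return list(vector)  # nothing above threshold: keep as is
    return vector[active[0]:active[-1] + 1]

def resize(vector, length):
    """Linear interpolation to a fixed length (length >= 2)."""
    if len(vector) == 1:
        return [vector[0]] * length
    out = []
    for k in range(length):
        pos = k * (len(vector) - 1) / (length - 1)
        lo = int(pos)
        hi = min(lo + 1, len(vector) - 1)
        frac = pos - lo
        out.append(vector[lo] * (1 - frac) + vector[hi] * frac)
    return out

def spatial_constraint(att1, att2, threshold=0.2, length=8):
    """Distance between the aligned row-importance profiles of two maps."""
    v1 = resize(crop_active(row_max(att1), threshold), length)
    v2 = resize(crop_active(row_max(att2), threshold), length)
    return sum((a - b) ** 2 for a, b in zip(v1, v2)) ** 0.5

# Same vertical attention profile, shifted between the two images.
att_a = [[0.0, 0.0], [0.0, 0.0], [0.9, 0.3], [0.8, 0.9], [0.0, 0.1]]
att_b = [[0.9, 0.1], [0.2, 0.9], [0.0, 0.0], [0.0, 0.0], [0.1, 0.0]]
aligned = spatial_constraint(att_a, att_b)
unaligned = sum((a - b) ** 2
                for a, b in zip(row_max(att_a), row_max(att_b))) ** 0.5
```

In this toy pair the person occupies different rows in the two images, so the raw row profiles are far apart, while the cropped and resized profiles coincide. The Siamese attention loss would add this aligned distance, scaled by a weight, to the pair classification loss.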

3.3 Overall Design of the CASN

With the identification and Siamese modules discussed in the previous sections, we now present our overall framework that integrates these two modules. Our proposed CASN, depicted in Figure 7, is a two-branch architecture. During training, we pass as input a pair of images belonging either to the same or different identity. After feature extraction (see Figure 3), the feature vectors are input to the identification module and Siamese module separately. In the identification module, the feature vectors are first passed to the IDE classifier for identity classification, following which an attention map for the input image in the current branch is retrieved from its identity label. The identification attention loss then guides the identification module to discover complete attention regions for the input image. The Siamese module takes as input the element-wise subtraction of the feature vectors from two branches, which is then input to the BCE Classifier to retrieve the image-pair attention maps from the BCE loss. Given the attention map pair, we enforce the spatial constraint objective to ensure spatial consistency of attentive regions across the two images in the input pair.

Figure 7: Identification and Siamese modules in the CASN.

We optimize our proposed CASN for all the objectives described here jointly, with the overall CASN training objective given as:

$$L = L_{ide} + \lambda_1 L_{ia} + \lambda_2 L_{sa} \tag{8}$$

where $L_{ide}$ is the IDE classification loss, $L_{ia}$ is the identification attention loss, $L_{sa}$ is the Siamese attention loss, and $\lambda_1$ and $\lambda_2$ are weight parameters. Note that the feature extraction blocks across the two branches in Figure 7 share weights. The proposed CASN addresses all problems discussed previously in a principled fashion, allowing us to (a) generate attention maps with attention consistency, (b) learn identity-aware invariant representations by design, and (c) use attention maps during inference for identities not seen during training. Furthermore, compared to existing attention mechanisms employed in person re-id, our framework is flexible by design in that it can be used in conjunction with any base architecture or baseline re-id algorithm. For instance, in Section 4, we show performance improvements with both the IDE [36] and the PCB [32] baselines. Furthermore, we only need identity labels during training (which are used by competing algorithms as well), but crucially, do not need any specially designed architecture sub-modules to make attention a part of the learning process.

4 Experiments and Results

Datasets. For evaluation, we use three benchmark datasets: Market-1501 [41], CUHK03-NP [18, 45], and DukeMTMC-ReID [43, 24]. Market-1501 [41] collects person images from 6 camera views, containing 12,936 training images with 751 different identities. Gallery and query sets have 19,732 and 3,368 images respectively with 750 different identities. CUHK03-NP is a new training-testing split protocol for CUHK03 [18], first proposed in [45]. CUHK03 datasets contain two subsets which provide labeled and detected (from a person detector) person images. The detected CUHK03 set includes 7,365 training images, 1,400 query images and 5,332 gallery images. The labeled set contains 7,368 training, 1,400 query and 5,328 gallery images respectively. The new protocol in [45] splits the training and testing sets into 767 and 700 identities. DukeMTMC-ReID [43] is an image-based re-id dataset generated from DukeMTMC [24] that randomly splits the training and testing sets equally into 702 different identities. It includes 16,522 training, 2,228 query, and 17,661 gallery images.

Implementation Details. Before training, we resize all images to

. We adopt the SGD optimizer with a momentum factor of 0.9, a learning rate of 0.03, and a total of 40 epochs, with the learning rate decreased by a factor of 10 at epoch 30. The parameter

in our Siamese attention loss (Equation 7) is set to 0.2; parameters and in Equation 8 are , respectively for all experiments. In addition to the IDE baseline introduced in Section 3.1.1, we also report evaluation results with the Part-based Convolutional Baseline (PCB) [32]. PCB [32] is a modification of IDE that replaces the global average pooling operation in IDE with spatial pooling for discriminative part-informed feature learning. For the PCB baseline, we follow the same evaluation settings as in [32] and resize the images to

before training. We set the batch size to 16 in all experiments; the model training uses two NVIDIA GTX-1080Ti GPUs. All our code is written in the PyTorch framework.
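The learning-rate schedule stated above (0.03 for 40 epochs, divided by 10 at epoch 30) can be written as a small step function. The sketch assumes 0-based epoch indices and that the decay takes effect at the start of epoch index 30; the text does not specify either convention, and the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.03, decay_epoch=30, factor=10.0):
    """Step schedule: base_lr before decay_epoch, base_lr / factor from it on."""
    return base_lr if epoch < decay_epoch else base_lr / factor

# One learning rate per epoch for the 40-epoch run.
schedule = [learning_rate(epoch) for epoch in range(40)]
```

Under these assumptions the first 30 epochs run at 0.03 and the final 10 at 0.003.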


Evaluation Protocol. After training, we send the query and gallery as pair inputs to obtain attention maps from BCE classifier predictions. The distance of the attention maps (Equation 7 in Section 3.2.2) and distance of the feature vectors after feature extraction are normalized and summed for final ranking. Cumulative Match Characteristic (CMC) and mean average precision (mAP) statistics are then calculated as evaluation results for our proposed model. We report the rank-1 and mAP results, as is conventional in the re-id community.
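A rough sketch of this ranking and scoring procedure for a single query follows. Min-max normalization is an assumption (the text only says the two distances are "normalized"), and the function names and toy distances are hypothetical.

```python
def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def rank_gallery(feature_dists, attention_dists):
    """Gallery indices sorted by the sum of the two normalized distances."""
    fused = [f + a for f, a in
             zip(min_max(feature_dists), min_max(attention_dists))]
    return sorted(range(len(fused)), key=lambda i: fused[i])

def average_precision(ranking, is_match):
    """AP for one query; is_match[i] marks gallery item i as same identity."""
    hits, precisions = 0, []
    for position, index in enumerate(ranking, start=1):
        if is_match[index]:
            hits += 1
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0

# One query against a four-image gallery (toy distances).
ranking = rank_gallery([0.9, 0.2, 0.5, 0.4], [0.8, 0.1, 0.7, 0.3])
is_match = [False, True, True, False]
rank1_hit = is_match[ranking[0]]
ap = average_precision(ranking, is_match)
```

Rank-1 accuracy is the fraction of queries for which `rank1_hit` is true, and mAP is the mean of `ap` over all queries.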

4.1 Comparison to the State of the Art

In Tables 1 and 2, we compare the performance of our method with several recently proposed algorithms applied to the CUHK03-NP, DukeMTMC-ReID, and Market-1501 datasets. Note that all our results are obtained without re-ranking [45] and with the PCB [32] architecture as the backend.

CUHK03-NP. We report experimental results on both detected and labeled person images. The new train-test split, containing only around 7,300 training images, is much more prone to overfitting when compared to the other datasets. However, results show that our method surpasses the state of the art by a large margin for rank-1 (+7.8%, +7.6%) and mAP (+5.4%, +6.4%) on detected and labeled sets respectively, demonstrating the strong generalization ability of the CASN. More crucially, compared to a recently proposed attention-based method, HA-CNN [19], our CASN achieves 29.8% and 25.8% rank-1 and mAP improvements (on detected sets) respectively, clearly bringing out the efficacy of the proposed attention mechanisms.

| Method | Detected R-1 | Detected mAP | Labeled R-1 | Labeled mAP |
|---|---|---|---|---|
| BoW+XQDA [40] | 6.4% | 6.4% | 7.9% | 7.3% |
| LOMO+XQDA [20] | 12.8% | 11.5% | 14.8% | 13.6% |
| IDE [42] | 21.3% | 19.7% | 22.2% | 21.0% |
| PAN [44] | 36.3% | 34.0% | 36.9% | 35.0% |
| DPFL [6] | 40.7% | 37.0% | 43.0% | 40.5% |
| HA-CNN [19] | 41.7% | 38.6% | 44.4% | 41.0% |
| MLFN [2] | 52.8% | 47.8% | 54.7% | 49.2% |
| DaRe+RE [34] | 63.3% | 59.0% | 66.1% | 61.6% |
| PCB+RPP [32] | 63.7% | 57.5% | - | - |
| CASN (proposed) | 71.5% | 64.4% | 73.7% | 68.0% |

Table 1: Comparisons to the state of the art on CUHK03-NP (detected and labeled) [18, 45].

| Method | DukeMTMC-ReID R-1 | DukeMTMC-ReID mAP | Market-1501 R-1 | Market-1501 mAP |
|---|---|---|---|---|
| BoW+KISSME [40] | 25.1% | 12.2% | 44.4% | 20.8% |
| LOMO+XQDA [20] | 30.8% | 17.0% | 43.8% | 22.2% |
| SVDNet [31] | 76.7% | 56.8% | 82.3% | 62.1% |
| HA-CNN [19] | 80.5% | 63.8% | 91.2% | 75.7% |
| DuATM [26] | 81.8% | 64.6% | 91.4% | 76.6% |
| PCB+RPP [32] | 83.3% | 69.2% | 93.8% | 81.6% |
| DNN_CRF [4] | 84.9% | 69.5% | - | - |
| CASN (proposed) | 87.7% | 73.7% | 94.4% | 82.8% |

Table 2: Comparisons to the state of the art on DukeMTMC-ReID [43, 24] and Market-1501 [41] (SQ).

(a) Attention maps retrieved from BCE loss (training)
(b) Attention maps retrieved from BCE loss with Siamese Attention loss (training)
(c) Attention maps retrieved from model trained with Siamese Attention (Rank 1 gallery match for query images)
Figure 8: Demonstrating the efficacy of the proposed Siamese attention by means of attention maps for same-person images.

DukeMTMC-ReID. Our proposed CASN establishes a new state of the art here as well, producing rank-1 and mAP performance improvements of 2.8% and 4.2% respectively. Again, compared to recently proposed attention-based methods, HA-CNN [19] and DuATM [26], our CASN achieves 7.2% and 5.9% rank-1 accuracy improvements and 9.9% and 9.1% mAP improvements respectively.

Market-1501. Table 2 compares the proposed CASN with state-of-the-art methods on Market-1501. Our method outperforms the next best method (PCB+RPP) by a small margin, 0.6% for rank-1 accuracy and 1.2% for mAP. However, compared to recently proposed attention-based methods, e.g., HA-CNN [19] and DuATM [26] (shown in the table), and CAN [21] (R-1: 60.3%, mAP: 35.9%), HPN [22] (R-1: 76.9%), MSCAN [15] (R-1: 80.3%, mAP: 57.5%) our method produces much higher results for both rank-1 and mAP evaluations.
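The improvements quoted in this section are absolute differences in percentage points between table entries. As a quick check, the following snippet recomputes the margins over HA-CNN [19] from the values in Tables 1 and 2; the dictionary layout and function name are choices made for this example only.

```python
# (rank-1, mAP) in percent, copied from Tables 1 and 2.
casn = {
    "CUHK03-NP (detected)": (71.5, 64.4),
    "DukeMTMC-ReID": (87.7, 73.7),
    "Market-1501": (94.4, 82.8),
}
ha_cnn = {
    "CUHK03-NP (detected)": (41.7, 38.6),
    "DukeMTMC-ReID": (80.5, 63.8),
    "Market-1501": (91.2, 75.7),
}

def margins(ours, theirs):
    """Percentage-point gains of `ours` over `theirs`, per dataset."""
    return {name: tuple(round(o - t, 1)
                        for o, t in zip(ours[name], theirs[name]))
            for name in ours}

gains = margins(casn, ha_cnn)
```

The CUHK03-NP and DukeMTMC-ReID entries of `gains` reproduce the 29.8 / 25.8 and 7.2 / 9.9 figures stated in the text.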

As can be noted from these results, the proposed CASN substantially outperforms existing attention-based methods for re-id. More importantly, unlike these competing attention-based methods, CASN does not require any specially designed deep architecture for modeling attention, relying only on identity labels for supervision. This allows the CASN to be highly flexible for use in conjunction with any baseline CNN architecture, such as VGGNet [27], DenseNet [11], or SqueezeNet [12]. For instance, with DenseNet and the IDE baseline, CASN achieves a rank-1 and mAP performance of and respectively on CUHK03-NP (detected), which is close to CASN’s results with ResNet50 and IDE, discussed next.

4.2 Ablation Study and Discussion

In this section, we further study the role of the identification attention and Siamese attention mechanisms individually, and how they influence the performance of the CASN. In Table 3, we report evaluation results of our proposed model on CUHK03-NP (detected), DukeMTMC-ReID and Market-1501, starting from baseline IDE and PCB architectures and working up to the full CASN model. From Table 3, we can see clear performance improvements over the baseline with individual attention modules. For instance, on CUHK03-NP, IDE+IA improves the rank-1 and mAP performance of baseline IDE by 11.0% and 9.2%, whereas IDE+SA improves them by 11.4% and 10.2% respectively. This provides evidence for our initial hypothesis that spatial localization, via end-to-end trainable attention mechanisms, should be an important and integral part of the framework design. Furthermore, adding both attention modules improves performance as measured by both rank-1 accuracy and mAP, demonstrating the importance of using both identification and Siamese modules.

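| Method | CUHK03-NP R-1 | CUHK03-NP mAP | DukeMTMC-ReID R-1 | DukeMTMC-ReID mAP | Market-1501 (SQ) R-1 | Market-1501 (SQ) mAP |
| --- | --- | --- | --- | --- | --- | --- |
| IDE [32] | 43.8% | 38.9% | 73.2% | 52.8% | 85.3% | 68.5% |
| IDE + IA | 54.8% | 48.1% | 83.2% | 66.0% | 91.0% | 76.9% |
| IDE + SA | 55.2% | 49.1% | 83.5% | 66.0% | 91.6% | 77.7% |
| CASN(IDE) | 57.4% | 50.7% | 84.5% | 67.0% | 92.0% | 78.0% |
| PCB [32] | 61.3% | 54.2% | 81.7% | 66.1% | 92.4% | 77.3% |
| PCB + IA | 68.5% | 62.4% | 87.3% | 73.4% | 93.9% | 81.8% |
| PCB + SA | 69.9% | 64.2% | 86.8% | 73.5% | 94.1% | 82.6% |
| CASN(PCB) | 71.5% | 64.4% | 87.7% | 73.7% | 94.4% | 82.8% |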
Table 3: Ablation study on CUHK03-NP (detected), DukeMTMC-ReID and Market-1501 (Single-Query). IA: Identification Attention, SA: Siamese Attention.
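
The per-module gains can be read off Table 3 by subtracting each baseline row from the rows built on top of it. The short sketch below does this for the CUHK03-NP (detected) columns. The numbers are copied from Table 3, while the dictionary layout and the helper name `gains_over` are illustrative assumptions.

```python
from typing import Dict, List, Tuple

# (rank-1 %, mAP %) on CUHK03-NP (detected), copied from Table 3.
TABLE3_CUHK03: Dict[str, Tuple[float, float]] = {
    "IDE": (43.8, 38.9),
    "IDE + IA": (54.8, 48.1),
    "IDE + SA": (55.2, 49.1),
    "CASN(IDE)": (57.4, 50.7),
    "PCB": (61.3, 54.2),
    "PCB + IA": (68.5, 62.4),
    "PCB + SA": (69.9, 64.2),
    "CASN(PCB)": (71.5, 64.4),
}


def gains_over(baseline: str, variants: List[str]) -> Dict[str, Tuple[float, float]]:
    """Absolute (rank-1, mAP) improvement of each variant over the baseline."""
    base_r1, base_map = TABLE3_CUHK03[baseline]
    return {
        name: (
            round(TABLE3_CUHK03[name][0] - base_r1, 1),
            round(TABLE3_CUHK03[name][1] - base_map, 1),
        )
        for name in variants
    }


ide_gains = gains_over("IDE", ["IDE + IA", "IDE + SA", "CASN(IDE)"])
pcb_gains = gains_over("PCB", ["PCB + IA", "PCB + SA", "CASN(PCB)"])

for name, (d_r1, d_map) in {**ide_gains, **pcb_gains}.items():
    print(f"{name:<10} +{d_r1:.1f} R-1, +{d_map:.1f} mAP")
```

For both baselines, the full model shows the largest gain in this column pair; for example, CASN(PCB) is 10.2 points above PCB in both rank-1 accuracy and mAP.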

Figure 8(a-b) compares the attention maps obtained from models trained with the BCE loss alone and with the BCE loss plus the Siamese attention loss. With the proposed Siamese attention mechanism, the attention maps for an image pair of the same person are more consistent in Figure 8(b) than in Figure 8(a). We also show attention maps for testing image pairs in Figure 8(c), where we again see attention consistency between the query and the retrieved gallery images. These examples demonstrate the effectiveness of our proposed Siamese attention mechanism, and the maps also serve as an interpretability tool: they indicate why our Siamese network predicts a given input image pair to be similar or dissimilar, leading to intuitive explanations for person re-id.

5 Conclusions

We proposed the first learning architecture that integrates attention consistency modeling and Siamese representation learning in a joint learning framework, called the Consistent Attentive Siamese Network (CASN), for person re-id. Our framework provides principled supervisory signals that guide the model towards discovering consistent attentive regions for same-identity images while also learning identity-aware invariant representations for cross-view matching. We conducted extensive evaluations on three popular person re-id datasets and achieved new state-of-the-art results. While we show results on re-id, a natural extension of our work would be to study and evaluate CASN in the context of other similarity learning tasks, such as generic image retrieval.


This material is based upon work supported by the U.S. Department of Homeland Security under Award Number 2013-ST-061-ED0001. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Department of Homeland Security.


  • [1] PyTorch.
  • [2] X. Chang, T. M. Hospedales, and T. Xiang. Multi-level factorisation net for person re-identification. In CVPR, 2018.
  • [3] D. Chen, H. Li, X. Liu, Y. Shen, Z. Yuan, and X. Wang. Improving deep visual representation for person re-identification by global and local image-language association. In ECCV, 2018.
  • [4] D. Chen, D. Xu, H. Li, N. Sebe, and X. Wang. Group consistent similarity learning via deep CRF for person re-identification. In CVPR, 2018.
  • [5] W. Chen, X. Chen, J. Zhang, and K. Huang. Beyond triplet loss: A deep quadruplet network for person re-identification. In CVPR, 2017.
  • [6] Y. Chen, X. Zhu, and S. Gong. Person re-identification by deep learning multi-scale representations. In ICCVW, 2017.
  • [7] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng. Person re-identification by multi-channel parts-based CNN with improved triplet loss function. In CVPR, 2016.
  • [8] R. G. Cinbis, J. Verbeek, and C. Schmid. Weakly supervised object localization with multi-fold multiple instance learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, Jan 2017.
  • [9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [11] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
  • [12] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, W. J. Dally, and K. Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size. arXiv:1602.07360, 2016.
  • [13] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In Advances in Neural Information Processing Systems 28, 2015.
  • [14] S. Karanam, M. Gou, Z. Wu, A. Rates-Borras, O. Camps, and R. J. Radke. A Systematic Evaluation and Benchmark for Person Re-Identification: Features, Metrics, and Datasets. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
  • [15] D. Li, X. Chen, Z. Zhang, and K. Huang. Learning deep context-aware features over body and latent parts for person re-identification. In CVPR, 2017.
  • [16] K. Li, Z. Wu, K. Peng, J. Ernst, and Y. Fu. Tell me where to look: Guided attention inference network. In CVPR, 2018.
  • [17] S. Li, S. Bak, P. Carr, and X. Wang. Diversity regularized spatiotemporal attention for video-based person re-identification. In CVPR, 2018.
  • [18] W. Li, R. Zhao, T. Xiao, and X. Wang. DeepReID: Deep filter pairing neural network for person re-identification. In CVPR, 2014.
  • [19] W. Li, X. Zhu, and S. Gong. Harmonious attention network for person re-identification. In CVPR, 2018.
  • [20] S. Liao, Y. Hu, X. Zhu, and S. Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In CVPR, 2015.
  • [21] H. Liu, J. Feng, M. Qi, J. Jiang, and S. Yan. End-to-end comparative attention networks for person re-identification. IEEE Transactions on Image Processing, July 2017.
  • [22] X. Liu, H. Zhao, M. Tian, L. Sheng, J. Shao, J. Yan, and X. Wang. HydraPlus-Net: Attentive deep features for pedestrian analysis. In ICCV, 2017.
  • [23] M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Is object localization for free? - Weakly-supervised learning with convolutional neural networks. In CVPR, 2015.
  • [24] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In ECCV, 2016.
  • [25] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
  • [26] J. Si, H. Zhang, C.-G. Li, J. Kuen, X. Kong, A. C. Kot, and G. Wang. Dual attention matching network for context-aware feature sequence based person re-identification. In CVPR, 2018.
  • [27] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [28] C. Song, Y. Huang, W. Ouyang, and L. Wang. Mask-guided contrastive attention model for person re-identification. In CVPR, pages 1179–1188, 2018.
  • [29] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian. Pose-driven deep convolutional model for person re-identification. In ICCV, 2017.
  • [30] Y. Suh, J. Wang, S. Tang, T. Mei, and K. M. Lee. Part-aligned bilinear representations for person re-identification. In ECCV, 2018.
  • [31] Y. Sun, L. Zheng, W. Deng, and S. Wang. SVDNet for pedestrian retrieval. In ICCV, 2017.
  • [32] Y. Sun, L. Zheng, Y. Yang, Q. Tian, and S. Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV, 2018.
  • [33] G. Wang, Y. Yuan, X. Chen, J. Li, and X. Zhou. Learning Discriminative Features with Multiple Granularities for Person Re-Identification. ArXiv e-prints, 2018.
  • [34] Y. Wang, L. Wang, Y. You, X. Zou, V. Chen, S. Li, G. Huang, B. Hariharan, and K. Q. Weinberger. Resource aware person re-identification across multiple resolutions. In CVPR, 2018.
  • [35] L. Wu, Y. Wang, J. Gao, and X. Li. Where-and-when to look: Deep siamese attention networks for video-based person re-identification. IEEE Transactions on Multimedia, 2018.
  • [36] T. Xiao, H. Li, W. Ouyang, and X. Wang. Learning deep feature representations with domain guided dropout for person re-identification. In CVPR, 2016.
  • [37] J. Xu, R. Zhao, F. Zhu, H. Wang, and W. Ouyang. Attention-aware compositional network for person re-identification. In CVPR, 2018.
  • [38] Y. Shen, H. Li, S. Yi, D. Chen, and X. Wang. Person re-identification with deep similarity-guided graph neural network. In ECCV, 2018.
  • [39] L. Zhao, X. Li, J. Wang, and Y. Zhuang. Deeply-learned part-aligned representations for person re-identification. In ICCV, 2017.
  • [40] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In ICCV, 2015.
  • [41] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In ICCV, 2015.
  • [42] L. Zheng, Y. Yang, and A. G. Hauptmann. Person re-identification: Past, present and future. ArXiv e-prints, 2016.
  • [43] Z. Zheng, L. Zheng, and Y. Yang. Unlabeled samples generated by GAN improve the person re-identification baseline in vitro. In ICCV, 2017.
  • [44] Z. Zheng, L. Zheng, and Y. Yang. Pedestrian alignment network for large-scale person re-identification. IEEE Transactions on Circuits and Systems for Video Technology, 2018.
  • [45] Z. Zhong, L. Zheng, D. Cao, and S. Li. Re-ranking person re-identification with k-reciprocal encoding. In CVPR, 2017.
  • [46] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In CVPR, 2016.