Cross-Correlated Attention Networks for Person Re-Identification

06/17/2020 · Jieming Zhou et al. · CSIRO, Monash University, Australian National University

Deep neural networks need to make robust inference in the presence of occlusion, background clutter, and pose and viewpoint variations, to name a few, when the task of person re-identification is considered. Attention mechanisms have recently proven successful in handling these challenges to some degree. However, previous designs fail to capture inherent inter-dependencies between the attended features, leading to restricted interactions between the attention blocks. In this paper, we propose a new attention module called Cross-Correlated Attention (CCA), which aims to overcome such limitations by maximizing the information gain between different attended regions. Moreover, we also propose a novel deep network that makes use of different attention mechanisms to learn robust and discriminative representations of person images. The resulting model is called the Cross-Correlated Attention Network (CCAN). Extensive experiments demonstrate that CCAN outperforms current state-of-the-art algorithms by a tangible margin.




1 Introduction

In this paper, we propose a Cross-Correlated Attention Network (CCAN) to jointly learn a holistic attention selection mechanism along with discriminative feature representations for person Re-IDentification (Re-ID). To this end, we make use of complementary attentional information along a global and a local branch (or feature extractor), in order to localize and focus on the discriminative regions of the input image.

Person Re-ID refers to the task of judging whether two images depicting people belong to the same individual or not. In general, the two images are obtained from two distinct cameras without any overlapping views. More specifically, given a query image containing the person of interest (or probe), Re-ID aims to find all the images that contain the same identity (id) as that of the query image from a large gallery set zheng2016person.

Any robust Re-ID algorithm is required to address the following challenges: (1) viewpoint variations in visual appearance and environmental conditions due to different non-overlapping camera views, (2) significant pose changes for the same probe across time, space and camera views, (3) background clutter and occlusions, (4) different individuals with similar appearance across different cameras, or the same individual with dissimilar appearance, and (5) low resolution of the images, limiting the use of face-based biometric systems decann2015modelling. All these factors lead to significant visual deformations across the multiple camera views for the same person of interest.

In order to overcome these challenges, most of the early works focused on (1) designing discriminative hand-engineered feature representations which are invariant to lighting, pose and viewpoint changes, and occlusion or clutter zheng2016person; farenzena2010person; or (2) learning a robust distance metric for similarity measurement such that the embedded feature vectors belonging to the same class are closer to each other compared to the ones from different classes chen2016similarity; koestinger2012large.

With the success of Deep Learning (DL) algorithms across a large number of tasks in computer vision, recent deep Re-ID algorithms combine both the aforementioned aspects into a unified end-to-end framework. While some deep algorithms address Re-ID by developing distinct global feature extraction units AhmedEjaz2015CVPRAnimprovedDeepLearningArchitectureforReID; li2014deepreid, others use a hybrid model which holistically combines the global and local features for improved performance TianMaoqing2018CVPREliminatingBackgroundBiasforReID; LiDeepJoint. Body-part detectors have been predominantly used to extract local features that are distinct, discriminative and compatible with global features LiDangwei2017CVPRLearningDeepContextAwarefeaturesforReID; LiWei2018CVPRHarmoniousAttentionNetforReID. Similarly, pose estimation, correction and normalization networks su2017pose; ZhengLiang2017arXivPoseInvariantEmbeddingforReID; Sarfraz_2018_CVPR have also shown great potential, with or without part detectors, in handling the misalignment and viewpoint variations prevalent in Re-ID datasets. The use of such special-purpose auxiliary information tends to improve upon the methods it is applied to.

Attention-based person Re-ID models have also been showing promising results as of late. Attention, as the name suggests, comprises two basic conceptual functionalities: “where to look” and “how carefully to look”. Hard-attention often uses a window produced by, e.g., a Spatial Transformer Network (STN) JaderbergMax2015NIPSSTN that models the former with a binary mask over the input features, whereas soft-attention simulates the latter by importance weighting of the input features xu2015show.

Both these attention-based learning approaches have been successfully integrated when addressing the person Re-ID task LiDangwei2017CVPRLearningDeepContextAwarefeaturesforReID; LiWei2018CVPRHarmoniousAttentionNetforReID. However, these models do not capture spatial inter-dependencies (i.e., self-attention) within the input features, thereby failing to recognize spatially distant, yet visually similar, regions. They also do not capture (or improve) any inter- (or cross-correlated) dependencies between the separately attended regions, thus failing to boost the overall Signal-to-Noise Ratio (SNR) in the learnt feature maps. Moreover, convolution-based soft-attention blocks are not able to capture the inherent contextual information that exists in the input features.

To address the aforementioned drawbacks, we design the CCAN, a novel yet intuitive Cross-Correlated Attention based deep network. CCAN consists of a novel attention module which aims to exploit and explore the correlation between different regions at various levels of a deep model. It also benefits from a top-down interaction scheme between the global and local feature extractors through the different attention modules to automatically focus and extract distinct regions in the input image for enhanced feature representation learning.

The major contributions of our work are as follows:

  • A novel Cross-Correlated Attention (CCA) module to model the inherent spatial relations between different attended regions within the deep architecture.

  • A novel deep architecture for joint end-to-end cross correlated attention and representational learning.

  • State-of-the-art results in terms of mAP and Rank-1 accuracies across several challenging datasets, namely Market-1501 zheng2015scalable, DukeMTMC-reID ristani2016MTMC, CUHK03 li2014deepreid and MSMT17 wei2018person.

2 Related Work

Much of the earlier work in person Re-ID was focused on hand-engineered feature representations liao2015person; li2016richly; wang2016highly; zhong2017re; zheng2016person or learning a robust metric zheng2013reidentification; koestinger2012large; xiong2014person to overcome the associated challenges. Recent studies employ Deep Neural Networks (DNNs) for joint learning of the discriminative features and similarity measures in end-to-end frameworks AhmedEjaz2015CVPRAnimprovedDeepLearningArchitectureforReID; ChengDe2016CVPRPersonReIDbyMultiChannelPartswithImprovedTriplet. Since we are chiefly interested in attention methods for person Re-ID in this paper, we will not cover part/pose-based solutions here and refer interested readers to su2017pose; ZhengLiang2017arXivPoseInvariantEmbeddingforReID; sun2018beyond.

To address the viewpoint/pose variations and misalignment issues commonly present in a Re-ID system, a profound idea is to benefit from the use of attention techniques in DNNs zhao2017deeply; LiWei2018CVPRHarmoniousAttentionNetforReID; LiuXihui2017ICCVHydraPlusNetforReID; LiuHao2017arXivEndtoEndComparativeAttentionNetworksforReID; LiDangwei2017CVPRLearningDeepContextAwarefeaturesforReID; Mancs; AANet_A; Fang_2019_ICCV. Li et al. LiDangwei2017CVPRLearningDeepContextAwarefeaturesforReID used a Spatial Transformer Network (STN) JaderbergMax2015NIPSSTN as a basis for creating a form of hard-attention to search and focus on the discriminative regions in the image, subject to a pre-defined spatial constraint. Zhao et al. zhao2017deeply designed a novel hard-attention module (with components similar to STN) and integrated it into a CNN. This helped to focus on more discriminative regions; subsequently, by extracting and processing features from the attention regions, improvements to the overall performance were observed. AANet AANet_A proposed a Part Feature Network by cropping body parts according to the location of the peak activation in the feature maps. Arguably, hard-attention modules fail to capture the coherence between image pixels within the attention windows due to their inflexible modelling nature. The Comparative Attention Network (CAN) LiuHao2017arXivEndtoEndComparativeAttentionNetworksforReID employs LSTMs to perform soft-attention at a holistic scale and identify discriminative regions in Re-ID images. Liu et al. LiuXihui2017ICCVHydraPlusNetforReID proposed HydraPlus-Net (HPN), which utilizes soft-attention across multiple scales and levels to learn discriminative representations. Dual ATtention Matching networks (DuATMs) DuATM use spatial bi-directional attention along with sequence matching to learn context-aware feature representations. Wang et al. proposed Mancs Mancs and designed a soft-attentional block and a novel curriculum sampling method to learn focused attention masks. In contrast to the aforementioned algorithms, HA-CNN LiWei2018CVPRHarmoniousAttentionNetforReID uses both hard and soft attention modules to efficiently learn where to look and how carefully to look simultaneously.

Recently, Zhou et al. zhou2019discriminative proposed a novel attention regularizer along with a novel triplet loss which consistently learns correlated attention masks from low-, mid- and high-level feature maps within an interactive loop. DGNet zheng2019joint couples person Re-ID learning and image generation in a unified joint learning framework, such that the Re-ID learning stage can benefit from the generated data through an inherent feedback loop, in order to learn a superior embedding space. CAMA yang2019towards enhances the learning of traditional global representations for person Re-ID by learning class activation maps to discover discriminative and distinct visual features. CASN zheng2019re designed a new siamese framework in order to learn discriminative attention masks and enforce attention consistency among images of the same person. Likewise, OSNet zhou2019omni designed a new aggregation gate that dynamically fuses features at multiple different scales with channel-wise attentional weights. MHAN chen2019mixed proposed High-Order Attention (HOA) to integrate complex and higher-order statistical information in learning an attention mask, so as to capture and distinguish subtle differences between the pedestrian and the background.

In contrast to the aforementioned techniques, CCAN makes use of a novel, yet intuitive, cross-correlated attention module which discovers and exploits inter-correlated spatial dependencies in the learnt feature maps. It then propagates these learnt dependencies along the feature extraction units to inherently learn robust and discriminative features and attention maps; thereby improving the overall information gain in a data-driven fashion.

3 Cross-Correlated Attention Networks

Let x be an image in the image space X = R^(H×W×C), where H, W and C indicate its rows, columns and channels, respectively. In person Re-ID, we are provided with pairs of the form (x, y), with y representing the identity of the person depicted in x. The aim, here, is to learn a generic non-linear mapping from the image space onto a latent feature space such that, in that space, embeddings coming from the same identity are closer to each other than those of different identities. We achieve this by exploiting the complementary nature of global and local information in Re-ID images using a combination of two different, and complementary, learnable attention modules. We first provide a detailed overview of the attention modules (§3.1), followed by the overall structure of CCAN (§3.2).

3.1 Attention Layers

In CCAN, we introduce a variation of self-attention named Cross-Correlated Attention, which aims to capture, exploit and boost spatial inter-dependencies (or cross-correlations) between different selected regions.

The Cross-Correlated Attention (CC-Attention or CCA) module aims to model the cross-correlation (or inter-dependencies) between different feature maps as a means to construct the attention mask. Each CCA module accepts two inputs and calculates the attention as a weighted combination of the input feature maps (see Fig. 1 for a conceptual diagram). This, as will be shown empirically, captures the inter-dependencies between the spatial regions in various feature maps with only a small computational overhead.

Figure 1: The architecture of Cross-Correlated Attention (CC-Attention) used in our model (blue blocks in Fig. 3). CC-Attention is able to find correlated spatial locations in its two different input feature maps, which are further processed by the subsequent processing layers for discriminative feature learning.

The CCA block works with so-called positional matrices. In our application, the positional matrices are constructed from two feature maps by reshaping along the spatial dimensions. The two matrices are then transformed into two feature spaces using independent non-linear mappings g and f, respectively; the mappings are realized through learnable weights, with the non-linearity acting element-wise on the outputs of f and g. These two spaces are then used to calculate a primary attention map between the inputs at the different spatial locations as follows:


where , denotes the concatenation operation along the width. Furthermore, is a linear layer with weight . is a measure of spatial dependencies between the and the spatial locations of the positional matrices and respectively; thereby realizing a measure of cross-correlation between them. The symmetric operation described above guides the CCA module to focus on the correlated positions in both the and , which is processed by the subsequent layers of the network. The resultant map is then used to generate for input as follows:


where the Hadamard (element-wise) product is applied to form a weighted combination of the responses at all positions, and h is also a non-linear layer with its own weights. We further pass the result through a linear layer w to obtain the final output of the CC-Attention module as follows


The output is reshaped to match the shape of the input. In all our experiments, the reduction ratio of the internal projections is kept fixed.

An intuitive way of thinking about the CCA module is to see g and f as non-linear signatures of the elements of its two inputs. The cross-correlation between the non-linear signatures acts as a gate, controlling the information flow based on inter-correlation when generating the mask. The information itself is encoded through h. The result is further pruned by w, which generates the attention map in an additive form. This additive form resembles residual computing, which has proven beneficial in training deep architectures.
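As a concrete illustration, the gating view above can be sketched in NumPy. This is a minimal sketch under stated assumptions, not the authors' implementation: the symmetric correlation is approximated by adding the transposed score matrix (the paper instead concatenates the two maps and applies a learned linear layer), ReLU stands in for the unspecified non-linearity, and all weight shapes are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def relu(z):
    return np.maximum(z, 0.0)

def cc_attention(X, Y, Wf, Wg, Wh, Ww):
    """One plausible reading of the CCA block (hypothetical weights/shapes).

    X, Y: positional matrices of shape (N, C), i.e. feature maps reshaped
    over the spatial dimension. Wf, Wg, Wh project C -> C'; Ww maps C' -> C.
    """
    F = relu(X @ Wf)                 # non-linear signature of X, (N, C')
    G = relu(Y @ Wg)                 # non-linear signature of Y, (N, C')
    # Cross-correlation between the signatures, symmetrized by adding the
    # transposed term so both inputs' correlated positions are emphasized.
    S = F @ G.T                      # (N, N)
    B = softmax(S + S.T, axis=-1)    # primary attention map (rows sum to 1)
    H = relu(Y @ Wh)                 # content encoding of Y, (N, C')
    M = (B @ H) @ Ww                 # weighted combination, back to (N, C)
    return X + M                     # additive, residual-style output

rng = np.random.default_rng(0)
N, C, Cp = 6, 8, 4                   # toy sizes: 6 positions, 8 channels
X, Y = rng.normal(size=(N, C)), rng.normal(size=(N, C))
Wf, Wg, Wh = (rng.normal(size=(C, Cp)) for _ in range(3))
Ww = rng.normal(size=(Cp, C))
out = cc_attention(X, Y, Wf, Wg, Wh, Ww)
```

Note the residual form: with all weights zeroed the gate contributes nothing and the input passes through unchanged, which mirrors the "additive form resembles residual computing" remark above.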

Remark 1

In the CCA module, we have introduced a symmetric cross-correlation operation between its input feature maps and to generate the attention map (see Eqn. 1). It thereby encapsulates symmetrical inter-dependencies between its inputs. The standard cross-correlation operation does not take into account such symmetric relationships between the inputs. We believe that this subtle change makes CCA attend to highly correlated regions in both of its input feature maps.

Remark 2

When the two inputs coincide, the overall structure represents a form of Symmetric Self-Attention (SS-Attention or SSA) that aims to model highly correlated regions within a single feature map. This form of symmetric self-attention is applied in the global branch, where it models the intra-dependencies within the input. Further simplification of the SS-Attention module, by removing the Concat operation and its associated block, leads to the Non-Local Self-Attention module shown in Fig. 2. Thus we equip the traditional self-attention module with these two important changes to model symmetric cross-correlated attention between its two different inputs.

Figure 2: Schematic of the Non-Local Self-Attention module as defined in WangXiaolong2017CVPRNonLocalNN.
Figure 3: Architecture of CCAN. and denote global and local branches. The local branch has sub-branches. The local branches receive part patches from the global branch (i.e and ). Building blocks of the sub-branches are shown by dashed boxes (refer to § 3.2 for more detail). and denote cross-entropy and triplet loss respectively. Green arrows indicate inputs for creating part patches.

3.2 Structure of the CCAN

A CCAN consists of two main branches (i.e., streams or feature extractors), namely the global and the local branch (see Fig. 3 for an overview of the architecture of CCAN). The purpose of the global branch is to capture and encode the overall appearance of a person, while the local branch encodes part information. The local branch, itself, consists of several sub-branches (or part-streams).

The basic building block of all branches is the Inception block of GoogLeNet szegedy2015going. The global branch makes use of three Inception blocks, along with a self-attention module, to encode the global appearance. The Inception blocks in the global stream enable us to analyze the input at various resolutions, thereby realizing a coarse-to-fine global representation. The local branch, as the name implies, attends to the local and discriminative parts of the input image. It comprises several sub-branches, each intended to extract features belonging to a distinct part of the input image (see Fig. 3 for details). We emphasize that each sub-branch is an independent module, meaning that weights are not shared across the part-streams.

In order to feed part information into the local branches, we slice the feature maps at two stages of the global branch (i.e., the input and output of one of its Inception blocks) into horizontal patches of equal size, independently. Thereafter, all the sliced patches are resized to the size of their corresponding feature maps using bilinear interpolation. Moreover, each of the sub-branches contains a cross-correlated attention module, which calculates the cross-correlation between the sliced part patches of the two feature maps in each of the sub-branches independently. This sharing of feature maps between the attention modules across the global and local branches within CCAN leads to the discovery of highly correlated regions, thereby realizing a simple but effective CCA scheme within CCAN.
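The slicing-and-resize step can be sketched as follows. This is a minimal NumPy sketch with illustrative shapes; nearest-neighbour up-sampling stands in for the bilinear interpolation used in the paper.

```python
import numpy as np

def slice_parts(fmap, p):
    """Slice a (H, W, C) feature map into p equal horizontal patches and
    resize each back to (H, W, C). The paper uses bilinear interpolation;
    nearest-neighbour up-sampling is used here for brevity (assumption)."""
    H, W, C = fmap.shape
    assert H % p == 0, "H must be divisible by the number of parts"
    h = H // p
    parts = []
    for k in range(p):
        strip = fmap[k * h:(k + 1) * h]            # (H/p, W, C)
        parts.append(np.repeat(strip, p, axis=0))  # back to (H, W, C)
    return parts

# Toy 8x4 feature map with 2 channels, sliced into 4 horizontal parts.
fmap = np.arange(8 * 4 * 2, dtype=float).reshape(8, 4, 2)
parts = slice_parts(fmap, p=4)
```

Each returned patch matches the spatial size of the original map, so every local sub-branch receives an input of identical shape, as the architecture requires.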

The global branch is appended with a global average pooling (GAP) layer and two fully connected (FC) layers, with the output of the first FC layer realizing the embedding space. Similarly, the outputs of the local sub-branches are passed through GAP layers and concatenated to produce a feature vector. This is then passed through an FC layer to produce the embedding vector of the local branch, which is further passed through a second FC layer. It should be noted that the final FC layers realize representations suitable for classification; as such, their output dimensionality equals the number of identities in the training set. We will discuss this in more detail later.

3.3 Loss function

Following the common practice in learning embeddings weinberger2009distance; oh2016deep; hu2014discriminative; song2017deep, we make use of a combination of classification and ranking losses (cross-entropy loss with Label-Smoothing Regularization (LSR) Szegedy2016RethinkingTI and the semi-hard triplet loss SchroffFlorian2015CVPRFaceNet; manmatha2017sampling, respectively) to jointly optimize the global and the local branch. The overall loss is defined as follows:


where the subscripts denote the cross-entropy and triplet losses, and the superscripts indicate the global and local branches, respectively. We briefly describe the semi-hard triplet mining strategy used in our algorithm for calculating the triplet loss.
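A minimal sketch of the classification part of this objective, assuming the common label-smoothing convention (uniform eps mass over all classes) and an unweighted sum over the four loss terms; the paper's exact smoothing value and any loss weights are not reproduced in this text.

```python
import numpy as np

def lsr_cross_entropy(logits, label, eps=0.1):
    """Cross-entropy with Label-Smoothing Regularization (LSR).
    eps = 0.1 is a common choice (assumption), spread uniformly over
    the K classes with 1 - eps extra mass on the true class."""
    K = logits.shape[0]
    m = logits.max()
    logp = logits - (m + np.log(np.exp(logits - m).sum()))  # log-softmax
    target = np.full(K, eps / K)
    target[label] += 1.0 - eps
    return float(-(target * logp).sum())

def ccan_loss(ce_g, ce_l, tri_g, tri_l):
    """Overall objective as an unweighted sum over the cross-entropy and
    triplet terms of the global and local branches (weighting, if any, is
    an assumption not stated in this text)."""
    return ce_g + ce_l + tri_g + tri_l
```

With uniform logits over K classes, the LSR loss reduces to log K regardless of the smoothing value, which is a handy sanity check when wiring up such a head.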

Semi-hard Triplet Mining

In each mini-batch of training samples, we mine triplets of the form (anchor, positive, negative), with the constraint that the anchor and the positive belong to the same identity, while the negative does not. We also use the semi-hard mining strategy SchroffFlorian2015CVPRFaceNet to generate robust triplets for training the network. More specifically, given the anchor and its positive example, we obtain the top semi-hard negative triplets as follows

where the number of mined negatives is kept fixed for all the datasets. Moreover, to avoid any degeneracy, we randomly pick a number of different identities and sample random images from each of the selected identities to create the mini-batch. These triplets are then used to compute the triplet embedding loss:


where [·]_+ denotes the hinge function and the margin is a user-specified constant.
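The mining-plus-hinge computation can be sketched as follows; the margin value, the number of kept negatives, and the brute-force loops are illustrative simplifications of the batched version described above.

```python
import numpy as np

def semi_hard_triplet_loss(emb, labels, margin=0.3, k=1):
    """Semi-hard triplet loss sketch. For every (anchor, positive) pair we
    keep up to k negatives that are farther than the positive but still
    inside the margin, and apply the hinge [d_ap - d_an + margin]_+.
    margin and k here are illustrative, not the paper's settings."""
    losses = []
    n = len(emb)
    for a in range(n):
        for p in range(n):
            if a == p or labels[a] != labels[p]:
                continue
            d_ap = np.linalg.norm(emb[a] - emb[p])
            # semi-hard negatives satisfy d_ap < d_an < d_ap + margin
            cand = [np.linalg.norm(emb[a] - emb[g])
                    for g in range(n) if labels[g] != labels[a]]
            semi = sorted(d for d in cand if d_ap < d < d_ap + margin)[:k]
            losses += [max(0.0, d_ap - d_an + margin) for d_an in semi]
    return float(np.mean(losses)) if losses else 0.0

well_separated = semi_hard_triplet_loss(
    np.array([[0.0], [0.1], [5.0], [5.1]]), [0, 0, 1, 1])
```

When the classes are separated by more than the margin, no semi-hard negative exists and the loss vanishes, which is exactly the behaviour the mining strategy is designed to exploit.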

3.4 Person Re-ID by CCAN

Given a trained CCAN model and an input image, we first obtain its global feature and its local feature. We perform L2 normalization on each of them separately, and then concatenate them to obtain the joint feature vector. Thus, given a probe image from one camera view and all the gallery images from the other camera views, we compute the between-camera matching distances using the Euclidean distance. We then rank all gallery images in ascending order of their distance to the probe and use this ranking to evaluate the identity of the probe.
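The matching procedure just described reduces to a few lines; the feature dimensions and gallery contents below are hypothetical.

```python
import numpy as np

def l2n(v):
    """L2-normalize a feature vector."""
    return v / np.linalg.norm(v)

def rank_gallery(probe_g, probe_l, gallery):
    """Rank gallery entries for one probe, as described above: L2-normalize
    the global and local features separately, concatenate them, then sort
    by Euclidean distance (ascending, best match first)."""
    q = np.concatenate([l2n(probe_g), l2n(probe_l)])
    d = [np.linalg.norm(q - np.concatenate([l2n(g), l2n(l)]))
         for g, l in gallery]
    return np.argsort(d)

# Toy example: gallery entry 1 points in the same direction as the probe.
probe_g, probe_l = np.array([1.0, 0.0]), np.array([0.0, 1.0])
gallery = [(np.array([0.0, 1.0]), np.array([1.0, 0.0])),
           (np.array([2.0, 0.0]), np.array([0.0, 3.0]))]
ranking = rank_gallery(probe_g, probe_l, gallery)
```

Because each sub-feature is normalized before concatenation, the global and local branches contribute equally to the joint distance regardless of their raw magnitudes.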

4 Experiments

Datasets and Evaluation Protocol. In this section, we show the effectiveness of our proposed algorithm through an extensive set of experiments across four well-known person Re-ID datasets: (a) Market-1501 zheng2015scalable, (b) DukeMTMC-reID (or DukeMTMC) ristani2016MTMC, (c) CUHK03 li2014deepreid and (d) MSMT17 wei2018person. Market-1501 has a 751/750 train/test identity split and 32,668 images in total. DukeMTMC-reID has a 702/702 train/test identity split. CUHK03 has 14,097 images in total. In order to make the re-identification task more challenging on CUHK03, we use the 767/700 train/test identity split Re_rankingReID instead of the standard split. The train/test id splits and the test protocols are shown in Table 1. The MSMT17 wei2018person dataset consists of 126,441 person images from 4,101 identities, thus constituting the largest person Re-ID dataset at present. All person images are detected using Faster R-CNN girshickICCV15fastrcnn. The dataset is collected using multiple cameras, with images captured over different days and weather conditions during a month. The training set covers 1,041 identities, whereas the test set contains images belonging to the remaining 3,060 identities; the test set is further randomly divided into query and gallery sets. Both mean Average Precision (mAP) and Cumulative Matching Characteristic (CMC) metrics are used for measuring performance on these datasets.
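For reference, the two metrics can be sketched per query as follows, given a boolean list marking which ranked gallery images share the query identity. This is a single-gallery-shot simplification; the official protocols add camera-id filtering not shown here.

```python
import numpy as np

def average_precision(ranked_match):
    """AP for one query: ranked_match[i] is True iff the i-th ranked
    gallery image shares the query identity. mAP averages this over all
    queries."""
    hits, precisions = 0, []
    for i, m in enumerate(ranked_match, start=1):
        if m:
            hits += 1
            precisions.append(hits / i)   # precision at each correct hit
    return float(np.mean(precisions)) if precisions else 0.0

def cmc_rank1(ranked_match):
    """CMC Rank-1 for one query: 1 if the top-ranked image matches."""
    return float(bool(ranked_match and ranked_match[0]))
```

mAP rewards retrieving all matching images early in the ranking, whereas Rank-1 only checks the single best match, which is why the two numbers can disagree across methods in the tables below.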

Dataset      Images    IDs     Train   Test           TS
Market1501   32,668    1,501   751     750            SQ/MQ
DukeMTMC     -         -       702     702 + 408 dis  SQ
CUHK03       14,097    1,467   767     700            SS
MSMT17       126,441   4,101   1,041   3,060          SQ
Table 1: The details of evaluated datasets. dis refers to the distractor images of the DukeMTMC-reID dataset. TS, SQ, MQ and SS stand for Test Setting, Single Query, Multiple Query and Single Shot, respectively.
Method                                                                mAP    R1
SVDNet SunYifan2017ICCVSVDNet                                         62.1   82.3
MHAN chen2019mixed                                                    85.0   95.1
Dare Wang_2018_CVPR                                                   69.9   86.0
AOS huang2018adversarially                                            70.4   86.5
MLFN MLFN                                                             74.3   90.0
SGGNN Shen_2018_ECCV                                                  82.8   92.3
IANet IANet                                                           83.1   94.4
PCB sun2018beyond                                                     81.6   93.1
MSCAN LiDangwei2017CVPRLearningDeepContextAwarefeaturesforReID        57.5   80.3
JLML LiDeepJoint                                                      65.5   85.1
PBR Suh_2018_ECCV                                                     76.0   90.2
MGCAM SongChunfeng2018CVPRMaskGuidedContrastiveAttentionModelforReID  74.3   83.8
AANet AANet_A                                                         83.4   93.9
HPN LiuXihui2017ICCVHydraPlusNetforReID                               -      76.9
DKPM Shen_2018_CVPR                                                   75.3   90.1
DuATM DuATM                                                           76.6   91.4
Mancs Mancs                                                           82.3   93.1
HA-CNN LiWei2018CVPRHarmoniousAttentionNetforReID                     75.7   91.2
CASN zheng2019re                                                      82.8   94.4
CAR zhou2019discriminative                                            84.7   96.1
OSNet zhou2019omni                                                    84.9   94.8
DGNet zheng2019joint                                                  86.0   94.8
CAMA yang2019towards                                                  84.5   94.7
CCAN (Ours)                                                           87.0   94.6
Table 2: Comparison results on the Market-1501 zheng2015scalable dataset.
Method                                                                mAP    R1
SVDNet SunYifan2017ICCVSVDNet                                         56.8   76.7
IDE zheng2016person                                                   64.2   80.1
Dare Wang_2018_CVPR                                                   56.3   74.5
AOS huang2018adversarially                                            62.1   79.2
MLFN MLFN                                                             62.8   81.0
SGGNN Shen_2018_ECCV                                                  68.2   81.1
IANet IANet                                                           73.4   87.1
PCB sun2018beyond                                                     69.7   83.9
MSCAN LiDangwei2017CVPRLearningDeepContextAwarefeaturesforReID        -      -
JLML LiDeepJoint                                                      56.4   73.3
PBR Suh_2018_ECCV                                                     64.2   82.1
MGCAM SongChunfeng2018CVPRMaskGuidedContrastiveAttentionModelforReID  -      -
AANet AANet_A                                                         74.3   87.7
HPN LiuXihui2017ICCVHydraPlusNetforReID                               -      -
DKPM Shen_2018_CVPR                                                   63.2   80.3
DuATM DuATM                                                           64.6   81.8
Mancs Mancs                                                           71.8   84.9
HA-CNN LiWei2018CVPRHarmoniousAttentionNetforReID                     63.8   80.5
CASN zheng2019re                                                      73.7   87.7
CAR zhou2019discriminative                                            73.1   86.3
OSNet zhou2019omni                                                    73.5   88.6
DGNet zheng2019joint                                                  74.8   86.6
CAMA yang2019towards                                                  72.9   85.8
CCAN (Ours)                                                           76.8   87.2
Table 3: Comparison results on the DukeMTMC ristani2016MTMC dataset.

4.1 Implementation

Our CCAN model is implemented in PyTorch paszke2017automatic. We use GoogLeNet-V1 szegedy2015going with Batch Normalization, pretrained on ImageNet russakovsky2015imagenet, as our backbone architecture. The dimensionality of the output feature maps is fixed per level of the global branch, and likewise for each local sub-branch. The embedding dimension and the number of local parts are kept the same across all four datasets. None of the Inception and FC layers share weights with each other. The ADAM optimizer kingma2014adam is used to train the model, with the two moment terms and the weight decay set to fixed values. The learning rate is set initially for Market-1501 and DukeMTMC-reID, and separately for CUHK03 in both the labeled and detected settings; it is kept fixed for an initial number of epochs and decayed by a constant factor at regular epoch intervals thereafter. Each mini-batch is composed of a fixed number of identities with a fixed number of images per identity in all the datasets. The smoothing parameter of LSR and the margin for the triplet loss (refer to Eqn. 5) are set per dataset. The training images are first resized and then randomly cropped, followed by a random horizontal flip. Following the protocol of Mancs, we apply random erasing zhong2017random after an initial number of epochs. During the test phase, the images are only resized, without any such data-augmentation techniques. We report the results after a fixed number of training epochs.
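The schedule described above (initial rate held for a warm period, then stepwise decay) can be sketched as follows; all numeric values in the usage line are placeholders, since the paper's exact settings are not reproduced in this text.

```python
def step_lr(base_lr, epoch, warm_epochs, decay_every, gamma):
    """Step-decay learning-rate schedule: hold base_lr for the first
    warm_epochs epochs, then multiply by gamma once at warm_epochs and
    again after every further decay_every epochs. One plausible reading
    of the schedule; the boundary convention is an assumption."""
    if epoch < warm_epochs:
        return base_lr
    steps = 1 + (epoch - warm_epochs) // decay_every
    return base_lr * (gamma ** steps)

# Hypothetical values: 1e-3 base rate, held 100 epochs, /10 every 50.
lr = step_lr(base_lr=1e-3, epoch=120, warm_epochs=100, decay_every=50, gamma=0.1)
```

Keeping the schedule a pure function of the epoch index makes it trivial to resume training from a checkpoint without replaying optimizer state.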

4.2 Comparison to State-of-the-Art Methods

(We report our results in bold, while red marks the best previously reported results.)

Evaluation on Market-1501

We have evaluated against a number of recently proposed methods, with or without the use of attention modules. Table 2 clearly shows the superior performance of CCAN against all the other methods in terms of mAP and Rank-1 accuracy on the Market-1501 dataset. More specifically, CCAN improves over the current state of the art, AANet, by a prominent margin in the single query setting. We also outperform the hard- and soft-attention based HA-CNN with respect to both mAP and Rank-1 in the single query setting.

Evaluation on DukeMTMC-reID

We further evaluated our proposed CCAN on the DukeMTMC-reID ristani2016MTMC dataset. Greater variations in resolution and viewpoint due to wider camera views, and a more complex environmental layout, make DukeMTMC-reID more challenging than Market-1501 for the task of Re-ID. Table 3 shows that CCAN again outperforms almost all the baseline algorithms, except AANet in terms of Rank-1; however, we achieve a higher mAP by a significant margin. We also outperform the hard- and soft-attention based HA-CNN with respect to both mAP and Rank-1.

Labeled Detected
Measure (%) mAP R1 mAP R1
MLFN MLFN 49.2 54.7 47.8 52.8
IDE zheng2016person 48.5 52.9 46.3 50.4
AOS huang2018adversarially - - 47.1 43.4
Dare (De) Wang_2018_CVPR 52.2 56.4 50.1 54.3
PCB sun2018beyond 56.8 61.9 54.4 60.6
SVDNet SunYifan2017ICCVSVDNet - - 37.3 41.5
MGCAM SongChunfeng2018CVPRMaskGuidedContrastiveAttentionModelforReID 50.2 50.1 46.9 46.7
Mancs Mancs 63.9 69.0 60.5 65.5
HA-CNN LiWei2018CVPRHarmoniousAttentionNetforReID 41.0 44.4 38.6 41.7
CAMA yang2019towards - - 64.2 66.6
OSNet zhou2019omni - - 67.8 72.3
CASN zheng2019re 68.0 73.7 64.4 71.5
CCAN (Ours) 72.9 75.2 70.7 73.0
Table 4: Comparison results on CUHK03 dataset in both the Labeled and the Detected settings.

Evaluation on CUHK03

We have also evaluated CCAN on both the manually labeled and the detected person bounding-box versions of CUHK03. The 767/700 split results in a training set much smaller than those of Market-1501 and DukeMTMC-reID. Even with such a constrained training setting, Table 4 clearly shows a notable improvement for CCAN over the baseline methods, including the current state-of-the-art Mancs, in both the labeled and detected settings. Furthermore, we also outperform HA-CNN in terms of both mAP and Rank-1 in both settings.

Evaluation on MSMT17

Table 5 shows the results of our proposed CCAN when trained and evaluated on the new and challenging MSMT17 wei2018person dataset. As can be seen, CCAN achieves a significant performance gain in mAP and Rank-1 over all the baseline algorithms. Specifically, CCAN outperforms the current state-of-the-art algorithm on MSMT17, i.e. Glad glad, in terms of both mAP and Rank-1.

Model mAP R-1 R-5 R-10
GLNet szegedy2015going 23.0 47.6 65.0 71.8
PDC SuChi2017ICCVPoseDrivenDeepModelforReID 29.7 58.0 73.6 79.4
Glad glad 34.0 61.4 76.8 81.6
PCB sun2018beyond 40.4 68.2 - -
OSNet zhou2019omni 52.9 78.7 - -
IANet IANet 46.8 75.5 - -
DGNet zheng2019joint 52.3 77.2 - -
CCAN (Ours) 53.6 76.3 86.9 90.2
Table 5: Comparison results on MSMT17 dataset.

These results, on all four challenging datasets mentioned above, clearly demonstrate and validate our proposed approach of cross-correlation based joint attention and discriminative feature learning for person Re-ID. CCAN outperforms all the current methods that rely only on hard attention, soft attention, or a combination of the two.

5 Ablation Study

In this section, we undertake a detailed study of the various aspects of our proposed CCAN framework.

Figure 4: Ablation study of the (a) dimensionality of the embedding space (i.e. ) and (b) number of body parts (i.e. ). Both the experiments were conducted on Market-1501 in the Single Query setting.

5.1 Dimensionality of the embedding space.

We first evaluate CCAN for different values of the embedding dimensionality on the Market-1501 zheng2015scalable dataset. As observed in Fig. 4, both mAP and R1 continue to increase as the embedding dimensionality grows, with the highest values obtained at a particular setting which we adopt for all the experiments. It is to be noted that even with a smaller embedding dimension, we still outperform all baseline algorithms (refer to Table 2). This clearly shows that CCAN is able to learn discriminative features and achieve state-of-the-art results for a large range of embedding dimensionalities.

5.2 Number of body parts

We further evaluated the effect of varying the number of parts in CCAN. Fig. 4 provides a detailed overview of this evaluation for five different values. It can be seen that CCAN performs best with four parts, suggesting that CCAN is able to detect and focus on the distinct regions of the input person image, namely (a) head-shoulder, (b) upper-body, (c) thighs, and (d) crus-foot. It should also be noted that even with a different number of parts, CCAN achieves competitive results against several baseline algorithms. This demonstrates that CCAN is successful in exploiting the complementary nature of the learnt CCA attention modules even when a smaller number of parts is specified. Based on this, in all the subsequent experiments, we have fixed the dimensionality of the embedding space and the number of parts to the best-performing values above.

5.3 Importance of various attention modules

We perform an ablation study in order to assess the importance of the various attention modules in CCAN. The results, evaluated on the Market-1501 zheng2015scalable dataset in the single query setting, are shown in Table 6. The following critical insights are observed: (a) the global branch (Id = 1) and the local branch (Id = 2) by themselves read 81.7 and 79.5 mAP, respectively; (b) though combining the two branches helps (Id = 3), incorporating only the attention module along the global branch (Id = 4) leads to almost similar performance; (c) furthermore, comparing Id = 3 with Id = 5 shows the importance of adding a CCA module along the local branch; and (d) finally, CCAN improves over Id = 5 with the addition of a further attention module (refer to Fig. 3). This verifies the joint interactive learning of the attention modules and feature extractors in obtaining a discriminative embedding space for the person images. It is to be noted that in all our experiments we have kept the final structure of CCAN fixed across all the datasets, suggesting a novel and rich architecture for the task of Re-ID that generalises well.

Id 1 2 3 4 5 6
Setting G L G+L G+ G+L+ CCAN
mAP 81.7 79.5 83.6 83.3 85.6 87.0
R1 92.7 92.1 93.3 92.9 94.3 94.6
Table 6: Study of the importance of various attention modules on Market-1501 dataset.

6 Conclusions

In this paper, we propose a new attention module, called Cross-Correlated Attention (CCA), which aims to improve the information gain by learning to focus on the correlated regions of the input image. We incorporate CCA into a novel deep attention architecture that we name the Cross-Correlated Attention Network (CCAN), achieving state-of-the-art results on four challenging datasets by utilizing the complementary nature of the attention mechanisms. In contrast to most existing attention-based Re-ID models that use constrained attention learning algorithms, CCAN is capable of exploring and exploiting correlated interactions among the attention modules to locate and focus on the discriminative regions of the input person image, without the need for any part- (or pose-) based estimator or detector network, in a unified end-to-end CNN architecture. In the future, we plan to design and incorporate an attention-diversity loss into CCAN to obtain further improvements and better-focused attention maps. We also plan to study the effects of augmenting CCAN with additional part/pose estimation or detection networks.