Person re-identification (Re-ID) aims at retrieving a person of interest across multiple non-overlapping cameras deployed at different locations via image matching. It is a long-standing research topic due to its wide range of applications in intelligent video surveillance [48, 45]. Re-ID is a challenging task because of significant pose variations, varying illumination conditions, frequent human occlusions, background clutter and different camera views. All of these conditions give rise to the notorious image-matching misalignment challenge in cross-view Re-ID, making it difficult to extract discriminative person features. Extracting the subtle features that fully characterize a person, while distinguishing that person from others, is not straightforward due to small inter-class variations and large intra-class differences. Therefore, obtaining discriminative person features is crucial for person Re-ID.
Person Re-ID can be considered a fine-grained classification task, and also a zero-shot task, where the test identities are never seen during training. In some practical scenarios, Re-ID is even a cross-domain task, where training and testing are conducted on different domains (datasets) with different camera networks; the data distributions of the training and testing domains are also apparently different. For example, Market-1501 contains identities mostly wearing shorts, while DukeMTMC-reID consists of pedestrians usually wearing coats and trousers, as shown in Fig. 1. All of these discrepancies place higher demands on a Re-ID model to adaptively extract discriminative and robust person features.
With the prosperity of deep learning, deep convolutional neural networks (CNNs) have dominated this community for extracting representations of person images with better discrimination and robustness, and have significantly boosted the performance of person Re-ID. A main advantage of CNNs is that they can optimize the network parameters by arranging visual feature extraction, metric learning and classification in an end-to-end learning manner. From the perspective of feature extraction, deep learning based methods for person Re-ID can be divided into two main categories.
The first category is global feature learning, an intuitive approach that extracts person features with deep convolutional neural networks from the whole body in an image [60, 50, 35, 1, 44, 3, 51]. Global feature learning aims to capture the most salient appearance cues to describe identities and distinguish them from other people. Most state-of-the-art CNN-based person Re-ID methods adopt CNN models pre-trained on ImageNet (e.g., the ResNet model) and fine-tune them on person Re-ID datasets under the supervision of different losses (e.g., softmax and triplet loss). The second category is partial feature learning. The latest state of the art on Re-ID benchmarks [18, 32, 34, 52, 36, 41, 9, 25, 23, 61, 40] is almost entirely achieved with deep-learned part features, which confirms that locating significant body parts in whole images to represent local person features is an effective approach for boosting Re-ID performance. Deep-learned part features can also serve as an important complement to global features.
In the single-domain setting, training and testing on a single dataset, person Re-ID has achieved impressive performance. However, in the cross-domain setting, training on one dataset while testing on another, Re-ID methods that directly apply a pre-trained model to new domains (datasets) always suffer a huge performance drop. More and more researchers are leveraging the auxiliary information of unlabeled target-domain data to improve cross-domain person Re-ID [42, 5, 46, 27, 15]. However, using such auxiliary information usually introduces extra tasks and high complexity, such as pose estimation and image translation [5, 46]. Therefore, we take up the challenge of performing cross-domain person Re-ID by directly applying a pre-trained model to new domains, without any information from the target domain.
In this case, the most challenging point is how to improve the generalization and adaptation of the pre-trained model; namely, how to make sure that a model pre-trained on one dataset can still extract discriminative features on another dataset. We argue that models with the inherent ability to adaptively extract discriminative features on different datasets possess high generalization and adaptation. Therefore, to address this problem, we propose to introduce the attention mechanism for discriminative feature learning to enhance model generalization and adaptation in cross-domain person Re-ID.
It is well known that the attention mechanism plays an important role in the human visual perception system [16, 30, 4]. Human visual perception does not process the whole image at once, but exploits a sequence of partial glimpses and selectively focuses on salient parts to capture the visual information well. It is thus one kind of partial feature learning approach. The attention mechanism dynamically focuses on salient parts according to the detailed content of each image, performing a high-level information integration to emphasize the salient aspects. This characteristic can contribute to enhancing model generalization and adaptation for cross-domain person Re-ID.
In this paper, we address model generalization and adaptation under the single-dataset training setting for cross-domain person Re-ID, by focusing on adaptively extracting discriminative person features. The attention mechanism is introduced for discriminative feature learning to perform the cross-domain Re-ID task. We adopt two popular types of attention mechanisms, long-range dependency based attention and direct generation based attention. Both can perform attention along the spatial or channel dimension alone, or even a combination of the two. We also illustrate the structures of the different attentions. Based on a strong baseline, the attention modules are incorporated to improve performance, especially for cross-domain person Re-ID where a pre-trained CNN model is directly applied to new domains. In summary, our paper makes the following main contributions.
We implement a strong baseline based on the ResNet50 model, with some empirical architecture modifications and training strategies.
We introduce two types of attention modules, type I: long-range dependency based attention, and type II: direct generation based attention. Both leverage the attention mechanism from the perspective of the spatial and channel dimensions. The possible arrangements for combining the spatial and channel attention modules are explored.
Through a simple way of incorporating attention, we find that attention is surprisingly effective at enhancing model generalization and adaptation.
By directly applying a pre-trained CNN model to new domains, we demonstrate the effectiveness of attention for enhancing model generalization and adaptation through excellent performance on three Re-ID datasets, especially in the cross-domain experiments.
II Related Works
In this section, we briefly review related works on person Re-ID from the following aspects: discriminative feature learning, attention, and cross-domain Re-ID.
Discriminative feature learning: Recently, the performance of deep person Re-ID has been pushed to a new level. Among all of these strategies, focusing on local features from parts of person images may be the most effective one for handling the misalignment of person images. To accurately locate body parts with semantics, pose estimation methods [2, 49] are adopted to predict the body landmarks, based on which the part features are learned [33, 57, 18, 32, 34, 52]. To avoid the extra pose estimation task, more recent works directly and uniformly partition the feature maps to learn body features [55, 36, 41, 9, 25]. The part-based convolutional baseline (PCB) adopts a uniform partition strategy with identity supervision on every local part, and a refined part pooling method to enhance the part representations. Wang et al. carefully designed a multi-branch deep network architecture, the multiple granularity network (MGN), consisting of one branch for global feature representations and two branches for local feature representations, to learn features at various granularities. Moreover, horizontal pyramid matching (HPM) was proposed to use partial feature representations at different horizontal pyramid scales, to enhance the discriminative capabilities of various person parts. CANet was proposed to learn appearance features from both horizontal and vertical body parts of pedestrians with spatial dependencies among body parts, while simultaneously exploiting the semantic attributes of the person.
Attention: Alternatively, more and more researchers are adopting the attention mechanism to focus on the desired body parts. Liu et al. proposed a soft attention based model, the comparative attention network (CAN), to selectively focus on parts of pairs of person images with a long short-term memory (LSTM) method. Li et al. designed a multi-scale context-aware network (MSCAN) to learn powerful features over the full body and body parts, where the body parts are located by hard attention using spatial transformer networks (STN) with spatial constraints. Zhao et al. realised attention models through deep convolutional networks to compute representations over part regions. Liu et al. proposed a new attention-based model, HydraPlus-Net (HP-net), which multi-directionally feeds multi-level attention maps to different feature layers. HP-net is capable of capturing multiple attentions from low level to semantic level, and explores the multi-scale selectiveness of attentive features to enrich the final feature representations of a pedestrian image. Li et al. formulated a novel harmonious attention CNN (HA-CNN) model for joint learning of soft pixel attention and hard regional attention along with simultaneous optimisation of feature representations, dedicated to optimising person Re-ID in uncontrolled (misaligned) images. Zheng et al. proposed the consistent attentive siamese network, providing mechanisms to make attention and attention consistency end-to-end trainable in a siamese learning architecture; it is an effective technique for robust cross-view matching and can also explain why the model predicts two images to belong to the same person. Wang et al. proposed a multi-task attentional network with curriculum sampling (MANCS) for Re-ID, fully utilizing the attention mechanism for the person misalignment problem and properly sampling for the ranking loss to obtain more stable person representations.
Different from existing attention-based Re-ID methods, we introduce two types of attention to adaptively extract discriminative person features, enhancing model generalization and adaptation for cross-domain person Re-ID. We also systematically explore the possible arrangements of the two types of attention, the combination of spatial and channel attention, and hierarchical attention.
Cross-domain: The cross-domain property of person Re-ID imposes high requirements on model generalization and adaptation. Recently, more and more research has concentrated on incorporating properties of the target domain into model training with the source-domain data. Wang et al. introduced a transferable joint attribute-identity deep learning (TJ-AIDL) model for simultaneously learning an attribute-semantic and identity-discriminative feature representation in a multi-task learning framework, which can be transferred to any unseen target domain. With the big success of CycleGAN in image translation, the similarity preserving generative adversarial network (SPGAN) and the person transfer generative adversarial network (PTGAN) were proposed to translate source-domain images into the target-domain style, and then train on those translated images with their corresponding labels from the source domain. Lv et al. proposed an unsupervised incremental learning algorithm, TFusion, to use the abundant unlabeled data in the target domain by adding transfer learning of the pedestrians' spatio-temporal patterns in the target domain. Huang et al. proposed the enhancing alignment network (EANet) to address the cross-domain Re-ID task. EANet consists of two new modules: part aligned pooling (PAP) and part segmentation (PS), both of which are based on body keypoints predicted by a pose estimation model.
All of the aforementioned cross-domain methods leverage unlabeled target-domain data. In contrast, we directly apply our model, trained only on source-domain data, to the cross-domain Re-ID task through a simple way of incorporating the attention modules.
III The Framework of Attention-Based Discriminative Feature Learning
The goal of this paper is to perform cross-domain person Re-ID under the single-dataset training setting. The most important point is to enhance the generalization and adaptation of a model trained only with source-domain data. We argue that models with the inherent ability to adaptively extract discriminative features on different datasets possess high generalization and adaptation. Therefore, how to adaptively extract discriminative features on different datasets is the key point to focus on. To address this problem, we propose to introduce the attention mechanism for discriminative feature learning. In summary, we incorporate the attention mechanism to enhance model generalization and adaptation for cross-domain person Re-ID.
In this section, we introduce our simple framework of attention-based discriminative feature learning (ADFL). Based on the popular attention modules (Sec. IV), we use a simple way to incorporate them into well-designed deep convolutional neural networks, such as ResNet50, in order to boost cross-domain person Re-ID performance.
Fig. 2 depicts the detailed structure of our proposed ADFL model, which mainly consists of three components: the backbone network, the incorporation of attention modules, and the skip connections for additionally incorporating attention features.
III-A Backbone Network
ADFL can take any deep neural network designed for image classification as the backbone, e.g., Google Inception and ResNet. We take the ResNet50 model as an example to perform the Re-ID task, considering its competitive performance in some Re-ID systems [36, 40] as well as its relatively concise architecture. The ResNet50 model mainly consists of four res-convolution blocks, as illustrated in Fig. 2. The modifications to the original ResNet50 model are as follows.
We employ no down-sampling operations in the block, to preserve larger areas of receptive fields for local (body part) features.
A global max pooling layer substitutes for the original global average pooling layer.
Then a fully connected layer is added as a bottleneck to reduce the feature dimension if necessary.
Finally, a fully connected layer with the desired dimension (corresponding to the number of person identities in our model) is adopted to perform classification with the softmax loss.
The procedure of the backbone network follows the black lines (without the attention modules), as illustrated in Fig. 2.
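The modified head described above (global max pooling, then a bottleneck fully connected layer, then a classifier) can be sketched in a few lines of numpy. All shapes here (2048 channels from the last ResNet50 stage, a 512-d bottleneck, 751 Market-1501 training identities) are illustrative choices, not values stated in the paper:

```python
import numpy as np

def global_max_pool(feat):
    """Global max pooling over the spatial dims of a (N, C, H, W) feature map."""
    return feat.max(axis=(2, 3))  # -> (N, C)

def head_forward(feat, w_bottleneck, w_classifier):
    """Pooled features -> bottleneck FC (dim reduction) -> classifier FC.

    w_bottleneck: (C, D) reduces the channel dim C to D.
    w_classifier: (D, num_ids) maps the embedding to identity logits
    for the softmax loss.
    """
    pooled = global_max_pool(feat)      # (N, C)
    embedding = pooled @ w_bottleneck   # (N, D) person feature used at test time
    logits = embedding @ w_classifier   # (N, num_ids)
    return embedding, logits

# Toy run: batch of 2, 2048-channel map, 512-d bottleneck, 751 identities.
rng = np.random.default_rng(0)
feat = rng.standard_normal((2, 2048, 16, 8))
emb, logits = head_forward(feat,
                           rng.standard_normal((2048, 512)),
                           rng.standard_normal((512, 751)))
```

At test time only the bottleneck embedding is kept as the person representation; the classifier branch exists solely for the softmax supervision.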
III-B Attention Modules Incorporation
Since our attention modules (Sec. IV) all operate in a size-identical map manner, i.e., the output and input have identical size, we can incorporate them into the ResNet50 model at any position. For simplicity, as illustrated in Fig. 2, we only consider the following three simple cases:
Incorporating the attention module after . We term it as .
Incorporating the attention module after . We term it as .
Incorporating the attention module after . We term it as .
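Because each attention module maps its input to an output of identical size, it can be slotted in after any res-convolution block without touching the rest of the network. A toy sketch of this insertion contract, with all stage functions and names purely illustrative:

```python
import numpy as np

def identity_attention(x):
    """Placeholder size-identical attention: output shape equals input shape.
    The real spatial/channel modules of Sec. IV obey the same contract."""
    return x  # a real module would reweight x here

def insert_attention(stages, attention, position):
    """Return a new stage list with `attention` inserted after stage `position`.
    Because attention is size-identical, any position is valid."""
    return stages[:position + 1] + [attention] + stages[position + 1:]

def forward(stages, x):
    for stage in stages:
        x = stage(x)
    return x

# Toy "stages": each halves the spatial size, like res-convolution blocks.
downsample = lambda x: x[:, :, ::2, ::2]
stages = [downsample, downsample, downsample]
x = np.ones((1, 8, 32, 16))
y_plain = forward(stages, x)
y_attn = forward(insert_attention(stages, identity_attention, 1), x)
```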
III-C Additional Attention Features Incorporation
Attention modules can dynamically focus on salient parts according to the detailed content of each image, performing a high-level information integration to emphasize the salient aspects. Moreover, the attention outputs are mid-level features relative to the final model outputs. To directly take advantage of these attention features, we fuse them with the features before the final layer of the backbone network to obtain the final person features. The final person features therefore carry both high- and mid-level semantic visual information.
As illustrated in Fig. 2, taking one attention output as an example, the red lines depict the procedure for incorporating additional attention features (AF).
The attention features first pass through a convolutional block to match the feature map size of the backbone network. We design this convolutional block in a bottleneck manner, consisting of two convolutional layers, each followed by a batch normalization layer and a ReLU layer, as illustrated in Fig. 3.
A global max pooling layer then follows, similar to the backbone network.
The attention features and the backbone features are fused to obtain the final person features. For simplicity, we only consider sum or concatenation as the fusion mechanism, which is explored in the experiments.
Similar to the backbone network, a batch normalization layer with the triplet loss and then a fully connected layer with the softmax loss are adopted.
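The two fusion mechanisms considered above differ only in the final feature dimension: summation keeps it, concatenation doubles it. A minimal sketch with illustrative shapes:

```python
import numpy as np

def fuse(backbone_feat, attention_feat, mode="concat"):
    """Fuse pooled backbone and attention features (both (N, D)) into the
    final person representation, by sum or concatenation."""
    if mode == "sum":
        return backbone_feat + attention_feat  # (N, D)
    if mode == "concat":
        return np.concatenate([backbone_feat, attention_feat], axis=1)  # (N, 2D)
    raise ValueError(mode)

b = np.ones((4, 512))        # pooled backbone features
a = np.full((4, 512), 2.0)   # pooled attention features
```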
IV Attention Modules
We argue that the attention mechanism inherently carries the characteristic of discriminative feature learning. The attention mechanism plays an important role in the human visual perception system, where perception does not process the whole image at once, but exploits a sequence of partial glimpses and selectively focuses on salient parts to capture the visual information well.
For the person Re-ID task, as discussed in Sec. I, partial body features play an important role in distinguishing one person from others. Partial body features are intuitively learned through spatial partitioning from the human visual perception view, which corresponds to spatial attention. Spatial attention mainly focuses on “where” the informative parts are in a person image.
What's more, in deep convolutional neural networks, due to the adoption of the fully connected layer and the softmax loss, each channel of the high-level feature maps can be regarded as a class-specific response, and different semantic responses are associated with each other. In the person Re-ID task, different semantic classes may correspond to salient body parts. This indicates that different person body parts cause different responses in different channels of the feature map. Therefore, by exploring the interdependencies between channel maps, which corresponds to channel attention, we can improve the person features with specific semantics for body parts. Channel attention mainly focuses on “what” the most meaningful parts are in a person image.
In this section, we apply two types of attention mechanisms from different aspects to extract discriminative person features well. One type captures the long-range dependency of contextual information, while the other directly generates the attention maps. Both types take the spatial and channel dimensions into account. Finally, different aggregations of spatial and channel attention for further refinement are briefly described.
Note that attention mechanisms are well explored in the literature [39, 43, 58, 26, 23, 61, 40]; we only make some subtle modifications accordingly. Therefore, we only outline the attention modules in Figs. 4 and 5. The computation details of the attention modules can be found in our supplementary materials.
IV-A Type I: Long-Range Dependency Based Attention
The long-range dependency is emphasized in the transformer and the non-local network, which compute the response at a position as a weighted sum of the features at all positions in the input feature map:
$y_i = \sum_{j} a_{ij} \, x_j$,

where $x$ is the input with $N$ positions, $a_{ij}$ is the attention map entry measuring the $j$-th position's impact on the $i$-th position, and $y$ is the output with the same size as $x$. (Note that all of the attention modules in this paper are in a size-identical map manner: the output and input have identical sizes.)
Here, we exploit the effectiveness of the self-attention mechanism from the aspect of long-range dependency relationships to represent person features. Given an intermediate feature map $X \in \mathbb{R}^{B \times C \times H \times W}$ as input, where $B$, $C$, $H$, and $W$ are the batch size, number of channels, height, and width, we design attention modules for the spatial, channel, spatial+channel (hyper), and batch dimensions, respectively. All of the attention modules are shown in Fig. 4. The detailed calculation procedures of these attention modules can be found in the supplementary materials.
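As a concrete illustration of the long-range dependency computation, the following numpy sketch applies self-attention over the spatial positions of a single feature map. It uses plain dot-product affinities, whereas the actual modules (Fig. 4) use learned embeddings, so this is only a simplified stand-in:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def long_range_spatial_attention(x):
    """Self-attention over the spatial positions of one (C, H, W) feature map.

    Each row of the attention map is softmax-normalized, so every output
    position is a convex combination of all positions: y_i = sum_j a_ij x_j.
    """
    C, H, W = x.shape
    feats = x.reshape(C, H * W).T   # (HW, C): one feature vector per position
    affinity = feats @ feats.T      # (HW, HW) pairwise impact scores
    a = softmax(affinity, axis=1)   # attention map, rows sum to 1
    y = a @ feats                   # (HW, C) weighted sums over all positions
    return y.T.reshape(C, H, W)     # size-identical output

x = np.random.default_rng(1).standard_normal((8, 4, 2))
y = long_range_spatial_attention(x)
```

The channel variant follows the same pattern with the affinity computed between the C channel vectors instead of the HW position vectors.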
IV-B Type II: Direct Generation Based Attention
Another, more intuitive type of attention directly assigns different weights to different positions:
$y_i = w_i \, x_i$,

where $x$ is the input feature map, $w_i$ is the attention weight for the $i$-th position, and $y$ is the attention output.
We design the direct generation based attention for the spatial, channel and spatial+channel (hyper) dimensions, respectively. All of the attention modules are shown in Fig. 5. The detailed calculation procedures of these attention modules can also be found in the supplementary materials. Compared to existing direct generation based attention works, we additionally apply an operation over the attention map to emphasize the importance of one position relative to all the other positions, following the philosophy of long-range dependency in the previous section (Sec. IV-A).
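A minimal numpy sketch of a direct generation channel attention in this spirit: one weight is generated per channel and softmax-normalized, so each channel is emphasized relative to all the others. The pooling-based weight generation is an illustrative choice, not the paper's exact design:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def direct_channel_attention(x):
    """Directly generate one weight per channel of a (C, H, W) map from its
    globally pooled descriptor, normalize the weights over all channels, and
    rescale: y_c = w_c * x_c (size-identical output)."""
    descriptor = x.mean(axis=(1, 2))  # (C,) global average per channel
    w = softmax(descriptor)           # attention weights, sum to 1
    return w[:, None, None] * x, w

x = np.random.default_rng(2).standard_normal((4, 3, 3))
y, w = direct_channel_attention(x)
```

A spatial counterpart would generate one weight per (h, w) position from the channel-pooled map instead.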
IV-C Arrangements of the Spatial and Channel Attentions
Given an input image, the hyper attentions (in Fig. 4 (c) and Fig. 5 (c)) model the spatial and channel attentions simultaneously. However, when spatial and channel attentions are calculated separately, we can arrange them in a sequential or parallel manner for both types of attention.
As illustrated in Fig. 6 (a) and (b), the spatial and channel attentions can be arranged sequentially to obtain the final attention results. Alternatively, the spatial and channel attentions can be performed simultaneously in a parallel manner and then summed together, as shown in Fig. 6 (c).
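The sequential and parallel arrangements can be sketched as function compositions; the toy spatial and channel modules below are illustrative placeholders for the real attention modules:

```python
import numpy as np

def spatial_attn(x):
    """Toy spatial attention: reweight positions by channel-mean energy."""
    m = np.abs(x).mean(axis=0, keepdims=True)        # (1, H, W)
    return x * (m / (m.max() + 1e-8))

def channel_attn(x):
    """Toy channel attention: reweight channels by spatial-mean energy."""
    m = np.abs(x).mean(axis=(1, 2))                  # (C,)
    return x * (m / (m.max() + 1e-8))[:, None, None]

def sequential(x):
    """Spatial first, then channel (Fig. 6 (a); (b) swaps the order)."""
    return channel_attn(spatial_attn(x))

def parallel(x):
    """Both attentions on the same input, summed (Fig. 6 (c))."""
    return spatial_attn(x) + channel_attn(x)

x = np.random.default_rng(3).standard_normal((4, 3, 2))
```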
For the type II attention (direct generation based attention, Sec. IV-B), which calculates the attention maps along the spatial and channel dimensions respectively, obtaining a 2D spatial attention map $S \in \mathbb{R}^{H \times W}$ and a 1D channel attention map $C \in \mathbb{R}^{C}$, we consider performing a multiplication to obtain a 3D spatial-channel attention map $A \in \mathbb{R}^{C \times H \times W}$, as illustrated in Fig. 6 (d):

$A = C \otimes S$,

where $\otimes$ is the multiplication with the broadcast mechanism accordingly.
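The broadcast multiplication of the 2D spatial map and the 1D channel map can be written directly with numpy broadcasting:

```python
import numpy as np

def hyper_attention_map(spatial_map, channel_map):
    """Combine a 2D spatial map S of shape (H, W) and a 1D channel map C of
    shape (C,) into a 3D spatial-channel map A of shape (C, H, W) via
    broadcast multiplication: A[c, h, w] = C[c] * S[h, w]."""
    return channel_map[:, None, None] * spatial_map[None, :, :]

S = np.array([[0.1, 0.9],
              [0.5, 0.5]])      # (H, W) = (2, 2)
C = np.array([1.0, 2.0, 3.0])  # (C,) = (3,)
A = hyper_attention_map(S, C)  # (3, 2, 2)
```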
V Experiments and Analysis
In this section, we evaluate the effectiveness of our proposed ADFL model for person Re-ID with three datasets, Market-1501, DukeMTMC-reID, and MSMT17, under both the single-domain and cross-domain settings.
V-A Datasets and Protocols
This section will briefly introduce these datasets and the evaluation protocols.
Market-1501. This dataset contains 32,668 annotated bounding boxes, predicted by the DPM detector, of 1,501 identities captured by 6 different cameras. The identities are split into 751 training IDs with 12,936 images and 750 query IDs. There are in total 3,368 query images, each randomly selected from each camera so that cross-camera search can be performed. Retrieval for each query image is conducted over a gallery of 19,732 images, including 6,796 junk images.
DukeMTMC-reID. This dataset consists of 36,411 images of 1,812 identities from 8 high-resolution cameras. Among them, 16,522 images of 702 identities are randomly selected as the training set. The testing set consists of the remaining 1,110 identities, among which 2,228 images of 702 identities serve as the query set, and 17,661 images of the 1,110 identities serve as the gallery set.
MSMT17. This is a large and challenging Re-ID dataset with 126,441 annotated bounding boxes of 4,101 identities from 15 cameras, involving wide lighting variations and different weather conditions. The bounding boxes are predicted by Faster R-CNN. The training set contains 32,621 bounding boxes of 1,041 identities, and the testing set contains 93,820 bounding boxes of 3,060 identities. From the testing set, 11,659 bounding boxes are randomly selected as query images and the other 82,161 serve as gallery images.
Protocols. We adopt the cumulative matching characteristics (CMC) at rank-1 and the mean average precision (mAP) as evaluation indicators to report the performance of different Re-ID methods on these datasets.
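For reference, a simplified computation of rank-1 and mAP from a query-gallery distance matrix is sketched below; it omits the same-camera and junk-image filtering used in the official evaluation protocols:

```python
import numpy as np

def rank1_and_map(dist, q_ids, g_ids):
    """Compute rank-1 accuracy and mAP from a (num_query, num_gallery)
    distance matrix; q_ids/g_ids hold identity labels. Same-camera and
    junk-image filtering (used by the official protocols) is omitted."""
    aps, hits = [], []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])             # gallery sorted by distance
        matches = (g_ids[order] == q_ids[i])
        hits.append(float(matches[0]))          # rank-1: is the best match correct?
        if matches.any():
            pos = np.where(matches)[0]          # ranks of the true matches
            precision = (np.arange(len(pos)) + 1) / (pos + 1)
            aps.append(precision.mean())        # average precision for this query
    return np.mean(hits), np.mean(aps)

# Toy example: 2 queries, 3 gallery images.
dist = np.array([[0.1, 0.5, 0.9],
                 [0.8, 0.2, 0.4]])
q = np.array([0, 1])
g = np.array([0, 1, 1])
r1, mAP = rank1_and_map(dist, q, g)
```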
V-B Implementation Details
The implementation of our proposed method is based on the PyTorch framework.
Model. We adopt the ResNet50 model pre-trained on ImageNet as the backbone network of our proposed model, changing the stride of the convolution from 2 to 1.
Preprocessing. To capture more detailed information from the person images, in the training phase the input image is resized, then padded with 10 pixels, and randomly left-right flipped and cropped for data augmentation. Left-right image flipping is also utilized in the testing phase.
Mini-batch size. Each mini-batch is sampled with randomly selected identities and randomly sampled images for each identity from the training set, to meet the requirement of the triplet loss, leading to a mini-batch size of 32.
Optimization. We use the Adam method as the optimizer. The network is trained for 150 epochs in total, and the layers in the backbone network are fixed for the first 5 epochs.
Learning rate. (1) The initial learning rate is set to 3.5e-4 and the weight decay to 5e-4. (2) The learning rate is increased in the early 20 epochs from 3e-6 to 3.5e-4 with a linear warmup strategy. (3) The learning rate is decayed by a factor of 0.1 at epochs 80 and 130, respectively.
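The learning rate schedule above (linear warmup over 20 epochs, then a 0.1 decay at epochs 80 and 130) can be sketched as follows; the exact warmup interpolation is an assumption:

```python
def learning_rate(epoch, base_lr=3.5e-4, warmup_start=3e-6,
                  warmup_epochs=20, decay_epochs=(80, 130), gamma=0.1):
    """Warmup + step-decay schedule: linear warmup from warmup_start to
    base_lr over the first warmup_epochs, then multiply by gamma at each
    decay epoch. Epochs are 1-indexed."""
    if epoch <= warmup_epochs:
        frac = epoch / warmup_epochs
        return warmup_start + frac * (base_lr - warmup_start)
    lr = base_lr
    for d in decay_epochs:
        if epoch >= d:
            lr *= gamma
    return lr
```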
Features. For the baseline and the attention modules incorporation methods, the features after the final layer of the backbone network are adopted as the person representations during testing. For the additional attention features incorporation method, the features after the other final layer in the skip-connection flow (the red lines in Fig. 2) are adopted instead.
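The identity-balanced mini-batch sampling described in the implementation details (P identities times K images per identity, with P·K = 32) can be sketched as below; the particular split P = 8, K = 4 is illustrative, since the paper only states the product:

```python
import random

def pk_sample(labels, P, K, rng=random.Random(0)):
    """Sample one mini-batch of P identities x K images each (size P*K),
    as required by the triplet loss. `labels[i]` is the identity of image i.
    P and K are illustrative; the paper only states that P*K = 32."""
    by_id = {}
    for idx, pid in enumerate(labels):
        by_id.setdefault(pid, []).append(idx)
    ids = rng.sample(sorted(by_id), P)          # P distinct identities
    batch = []
    for pid in ids:
        pool = by_id[pid]
        # sample with replacement only if an identity has fewer than K images
        picks = (rng.sample(pool, K) if len(pool) >= K
                 else [rng.choice(pool) for _ in range(K)])
        batch.extend(picks)
    return batch

labels = [i // 5 for i in range(50)]   # 10 toy identities, 5 images each
batch = pk_sample(labels, P=8, K=4)
```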
V-C The Strong Baseline
The input image is resized to a larger size than the usually used one.
Loss function: the softmax+triplet combination vs. only softmax or only triplet.
Fixing the backbone network for 5 epochs or not.
Employing the warmup strategy or not.
Employing the batch normalization (BN) layer after the final pooling layer or not.
We mainly conducted experiments training only on the Market-1501 dataset and testing on both the Market-1501 and DukeMTMC-reID datasets, trying to directly apply our model trained only on the source-domain data (Market-1501) to the cross-domain (Market-1501 to DukeMTMC-reID) Re-ID task. The dimension of the final person features is kept fixed. The corresponding results are listed in Table I, from which we can observe the following.
All of these components utilized in the backbone network, including both the architecture designs and the training details, can truly improve the Re-ID performance.
Compared to some recently published methods, we obtain an extremely strong baseline. Our baseline is even better than methods with multi-layer feature fusion (DaRe), multi-region partition (PCB+RPP), image generation (CamStyle), and attention mechanisms (MLFN, HA-CNN and Mancs).
V-D The Effectiveness of Attention Modules Incorporation
As described in Sec. III-B, we consider the attention modules incorporation in the three cases. For the two types of attention mechanisms described in Sec. IV, type I: long-range dependency based attention and type II: direct generation based attention, we conducted experiments training on the Market-1501 dataset and testing on the Market-1501 and DukeMTMC-reID datasets, to verify the effectiveness of these attention modules. Since the type I attention modules contain many learnable parameters, which makes them hard to converge in some cases under our training settings, we evaluated each type of module only in a subset of the three cases. For the batch attention module in type I, which mainly depends on the sample composition of every testing mini-batch (as analyzed in the supplementary materials), we conducted experiments 10 times, changing the mini-batch size in the range [10:10:100] and recording the mean and standard deviation. The results are reported in Table II, from which we can find the following.
Compared to our baseline in Table I, surprising results are obtained. (1) The attention modules incorporation methods do not perform better in the single-domain training-testing setting (M→M), which demonstrates the effectiveness of our baseline for single-domain person Re-ID. (2) However, the attention modules always achieve much better performance in the cross-domain setting (M→D), by large margins. This demonstrates the effectiveness of attention incorporation for model generalization and adaptation.
For type I: long-range dependency based attention, (1) in the single-domain setting, the different incorporation positions obtain comparable performance. However, in the cross-domain setting, one incorporation position always performs much better than the other, which may be attributed to its learned features containing higher-level semantic information of the person body. (2) In the cross-domain setting, the combined spatial and channel attention modules seem to perform better than spatial or channel attention alone. (3) The batch attention truly exhibits randomness depending on the composition of the testing mini-batch samples.
For type II: direct generation based attention, (1) in the single-domain setting, the incorporation positions also obtain comparable performance. However, in the cross-domain setting, they perform differently, sometimes better and sometimes worse than one another, with no regular pattern to follow. (2) Spatial or channel attention alone, and their different arrangements, likewise perform differently under different settings, not always better or worse than the others.
In all cases, the channel attention module performs the same as or a little better than the spatial attention module.
In summary, the type I long-range dependency based attention performs best in the cross-domain setting (M→D, 54.8 vs. 48.9 (baseline) at Rank-1 (%)). This demonstrates the effectiveness of computing attention in the long-range dependency manner for extracting discriminative person features.
V-E The Effectiveness of Additional Attention Features Incorporation
In the above section, we evaluated the effectiveness of attention modules incorporation, especially in the cross-domain setting. In this section, we consider additionally incorporating the attention features, forming our final attention-based discriminative feature learning (ADFL) method. Based on the experimental results in Sec. V-D, we adopt the best-performing attention modules, the type I attention with spatial and channel combination methods. For the attention features incorporation, we mainly consider the concatenation and summation fusion methods with different feature dimensions. As illustrated in Fig. 2, the attention features are passed into a convolutional block and then fused with the backbone features. The details of the combination methods and the corresponding results are listed in Table III, where each method is named by the fusion method, the output dimension, the position of attention modules incorporation, and the attention module used. From the results we can find the following.
| Method | Market-1501, Rank-1 (mAP) | M→D, Rank-1 (mAP) |
| --- | --- | --- |
| baseline (ours) | 94.0 (83.1) | 48.9 (30.8) |
| | 94.0 (83.4) | 53.5 (34.3) |
| | 94.2 (84.4) | 53.6 (34.7) |
| | 94.1 (83.6) | 54.0 (34.9) |
| | 94.4 (83.1) | 56.8 (37.4) |
| | 94.1 (83.2) | 53.3 (34.6) |
| | 93.6 (82.8) | 54.5 (36.0) |
| | 94.1 (81.8) | 56.1 (36.4) |
| | 93.9 (82.6) | 54.3 (34.9) |
| | 93.8 (81.8) | 54.7 (36.1) |
| | 94.0 (82.1) | 56.0 (36.6) |
| | 94.0 (82.4) | 55.2 (35.1) |
| | 93.7 (81.5) | 55.4 (36.3) |
In the single-domain setting, the ADFL methods obtain results comparable to the baseline. However, in the cross-domain setting the ADFL methods perform much better than the baseline: for M→D, 56.8 vs. 48.9 (baseline) at Rank-1 (%), an improvement of almost 8 percentage points. Therefore, in the following we focus only on the cross-domain setting.
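For reference, the Rank-1 (%) and mAP numbers reported in these tables follow the standard Re-ID retrieval protocol. A simplified sketch (camera-view filtering is omitted for brevity, so this is not the exact evaluation code):

```python
import numpy as np

def rank1_and_map(dist, q_ids, g_ids):
    """Sketch of the Rank-1 and mAP metrics used in the tables.
    `dist` is a (num_query, num_gallery) distance matrix; `q_ids`/`g_ids`
    are the person identity labels of queries and gallery images."""
    rank1_hits, aps = [], []
    for i in range(len(q_ids)):
        order = np.argsort(dist[i])                    # gallery sorted by distance
        matches = (g_ids[order] == q_ids[i]).astype(float)
        rank1_hits.append(matches[0])                  # correct identity at top-1?
        # average precision: precision evaluated at each correct-match position
        hit_pos = np.flatnonzero(matches)
        precision = (np.arange(len(hit_pos)) + 1) / (hit_pos + 1)
        aps.append(precision.mean())
    return np.mean(rank1_hits), np.mean(aps)
```

Rank-1 counts how often the closest gallery image shares the query identity, while mAP also rewards ranking all correct matches early, which is why the two numbers can move somewhat independently across the table rows.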
Comparing the results in Table II with those in Table III, we find that the ADFL methods, which combine the attention modules and the attention features, achieve better performance than the corresponding methods with attention modules alone. This demonstrates the effectiveness of our proposed attention-based discriminative feature learning in enhancing model generalization and adaptation by making the learned person features more discriminative and robust.
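The two ways of combining the attention features with the backbone features, concatenation and summation, can be sketched as follows. This is a hypothetical helper, assuming the attention features have already been projected to a compatible dimension:

```python
import numpy as np

def fuse_features(base, attn, method="concat"):
    """Sketch of the two fusion methods considered for incorporating the
    attention features into the final embedding. Concatenation grows the
    embedding; summation keeps the dimension fixed (the attention features
    are assumed to be projected to the same length beforehand)."""
    if method == "concat":
        return np.concatenate([base, attn])   # dim = len(base) + len(attn)
    if method == "sum":
        return base + attn                    # dim unchanged
    raise ValueError(f"unknown fusion method: {method}")
```

The trade-off is that concatenation preserves both feature sets intact at the cost of a larger retrieval embedding, while summation keeps matching cheap but forces the two sets to share one space.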
For the attention modules, the methods perform better than the alternatives in all but a few cases.
Some methods consistently perform worse than the others.
In some cases, several methods achieve comparable performance.
V-F Comparison with state-of-the-art
We compare our proposed ADFL methods with state-of-the-art methods on the Market-1501, DukeMTMC-reID and MSMT17 datasets, under both the single-domain and cross-domain settings. Based on the results in Table III, we select the best-performing variants as the representatives of our ADFL methods. Table IV lists the results on Market-1501 and DukeMTMC-reID under the single-domain setting, Table V lists the results on Market-1501 and DukeMTMC-reID under the cross-domain setting, and Table VI lists the corresponding results on MSMT17 under both settings.
From the results in these tables, we find the following.
The attention-based discriminative feature learning method is effective for enhancing model generalization and adaptation, as can especially be verified in the cross-domain setting by the results in Table V. Our ADFL methods achieve the best performance, even better than methods that leverage information from the target-domain data, although we perform cross-domain person Re-ID by directly applying our ADFL model, trained only on the source domain, without any auxiliary information.
Compared to existing cross-domain person Re-ID methods, which leverage information from the target-domain data and perform additional tasks [42, 46, 5, 15, 62], the success of our simple attention incorporation for cross-domain person Re-ID points to an effective way of designing models for cross-domain tasks.
This work mainly aims at enhancing model generalization and adaptation by focusing on adaptively extracting discriminative and robust person features. We emphasized learning discriminative person features through the simple incorporation of attention modules, which makes the learned person features more discriminative and robust. Based on our strong baseline and the incorporation of attention modules, the experimental results on three Re-ID datasets demonstrated the effectiveness of our proposed ADFL methods compared to state-of-the-art approaches, especially under the cross-domain setting, where a pre-trained Re-ID model is directly applied to new domains. The surprisingly good results obtained by simple attention incorporation alone may offer new insights when considering cross-domain tasks in the future.
This work was supported by the National Natural Science Foundation of China (grant numbers 61671125, 61201271, 61301269), and the State Key Laboratory of Synthetical Automation for Process Industries (grant number PAL-N201401).
- (2017) Scalable person re-identification on supervised smoothed manifold. In CVPR, pp. 3356–3365.
- (2017) Realtime multi-person 2D pose estimation using part affinity fields. In CVPR, pp. 1302–1310.
- (2018) Multi-level factorisation net for person re-identification. In CVPR.
- (2002) Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience 3 (3), pp. 201.
- (2018) Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In CVPR, pp. 994–1003.
- (2018) Feature affinity based pseudo labeling for semi-supervised person re-identification. IEEE Transactions on Multimedia.
- (2018) Unsupervised person re-identification: clustering and fine-tuning. ACM Transactions on Multimedia Computing, Communications, and Applications 14 (4), pp. 83.
- (2008) A discriminatively trained, multiscale, deformable part model. In CVPR, pp. 1–8.
- (2018) Horizontal pyramid matching for person re-identification. arXiv preprint arXiv:1804.05275.
- (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778.
- (2017) In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737.
- (1997) Long short-term memory. Neural Computation 9 (8), pp. 1735–1780.
- (2018) Squeeze-and-excitation networks. In CVPR, pp. 7132–7141.
- (2018) Adversarially occluded samples for person re-identification. In CVPR, pp. 5098–5107.
- (2018) EANet: enhancing alignment for cross-domain person re-identification. arXiv preprint arXiv:1812.11369.
- (1998) A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (11), pp. 1254–1259.
- (2015) Spatial transformer networks. In NeurIPS, pp. 2017–2025.
- (2018) Human semantic parsing for person re-identification. In CVPR, pp. 1062–1071.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Learning to combine foveal glimpses with a third-order Boltzmann machine. In NeurIPS, pp. 1243–1251.
- (2017) Learning deep context-aware features over body and latent parts for person re-identification. In CVPR, pp. 384–393.
- (2017) Person re-identification by deep joint learning of multi-loss classification. In IJCAI, pp. 2194–2200.
- (2018) Harmonious attention network for person re-identification. In CVPR, pp. 2285–2294.
- (2017) End-to-end comparative attention networks for person re-identification. IEEE Transactions on Image Processing 26 (7), pp. 3492–3506.
- (2018) CAnet: contextual-attentional attribute-appearance network for person re-identification. In ACM MM, pp. 737–745.
- Hydraplus-net: attentive deep features for pedestrian analysis. In ICCV, pp. 350–359.
- (2018) Unsupervised cross-dataset person re-identification by transfer learning of spatial-temporal patterns. In CVPR, pp. 7948–7956.
- (2018) Pose-normalized image generation for person re-identification. In ECCV, pp. 650–667.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NeurIPS, pp. 91–99.
- (2000) The dynamic representation of scenes. Visual Cognition 7 (1-3), pp. 17–42.
- (2016) Performance measures and a data set for multi-target, multi-camera tracking. In ECCV, pp. 17–35.
- (2018) A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking. In CVPR, Vol. 7, pp. 8.
- (2017) Pose-driven deep convolutional model for person re-identification. In ICCV, pp. 3980–3989.
- (2018) Part-aligned bilinear representations for person re-identification. In ECCV.
- (2017) SVDNet for pedestrian retrieval. In ICCV, pp. 3820–3828.
- (2018) Beyond part models: person retrieval with refined part pooling (and a strong convolutional baseline). In ECCV, pp. 501–518.
- Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, Vol. 4, pp. 12.
- (2015) Going deeper with convolutions. In CVPR, pp. 1–9.
- (2017) Attention is all you need. In NeurIPS, pp. 5998–6008.
- (2018) Mancs: a multi-task attentional network with curriculum sampling for person re-identification. In ECCV, pp. 384–400.
- (2018) Learning discriminative features with multiple granularities for person re-identification. In ACM MM.
- (2018) Transferable joint attribute-identity deep learning for unsupervised person re-identification. In CVPR.
- (2018) Non-local neural networks. In CVPR, pp. 7794–7803.
- (2018) Resource aware person re-identification across multiple resolutions. In CVPR, pp. 8042–8051.
- (2015) Zero-shot person re-identification via cross-view consistency. IEEE Transactions on Multimedia 18 (2), pp. 260–272.
- (2018) Person transfer GAN to bridge domain gap for person re-identification. In CVPR, pp. 79–88.
- (2017) GLAD: global-local-alignment descriptor for pedestrian retrieval. In ACM MM, pp. 420–428.
- (2018) GLAD: global-local-alignment descriptor for scalable person re-identification. IEEE Transactions on Multimedia.
- (2018) Simple baselines for human pose estimation and tracking. In ECCV.
- (2017) Joint detection and identification feature learning for person search. In CVPR, pp. 3376–3385.
- (2018) Towards good practices on building effective CNN baseline model for person re-identification. arXiv preprint arXiv:1807.11042.
- (2018) Attention-aware compositional network for person re-identification. In CVPR.
- (2017) The devil is in the middle: exploiting mid-level representations for cross-domain instance matching. arXiv preprint arXiv:1711.08106.
- (2014) Visualizing and understanding convolutional networks. In ECCV, pp. 818–833.
- (2017) AlignedReID: surpassing human-level performance in person re-identification. arXiv preprint arXiv:1711.08184.
- (2018) Deep mutual learning. In CVPR, pp. 4320–4328.
- (2017) Spindle Net: person re-identification with human body region guided feature decomposition and fusion. In CVPR, pp. 907–915.
- (2017) Deeply-learned part-aligned representations for person re-identification. In ICCV, pp. 3219–3228.
- (2015) Scalable person re-identification: a benchmark. In ICCV, pp. 1116–1124.
- (2016) Person re-identification: past, present and future. arXiv preprint arXiv:1610.02984.
- (2018) Re-identification with consistent attentive siamese networks. arXiv preprint arXiv:1811.07487.
- (2018) Generalizing a person retrieval model hetero- and homogeneously. In ECCV, pp. 172–188.
- (2018) Camera style adaptation for person re-identification. In CVPR, pp. 5157–5166.
- Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.