Given a probe image, the goal of person re-identification (re-ID) is to retrieve the pedestrian images of the same identity from a gallery set. It has been widely used in many applications, such as video surveillance and self-driving. In recent years, many deep learning based person re-ID approaches dai2019batch ; Zhong_2018_CVPR ; Deng_2018_CVPR ; song2018mask ; liu2017end ; li2018harmonious ; xu2018attention ; li2018diversity ; zhang2020semi ; zhu2019simultaneous have been proposed and achieve great improvements over traditional approaches. However, the re-ID performance in some challenging scenarios is still unsatisfactory due to the influence of cluttered backgrounds, illumination, motion blur, low resolution, and occlusion.
Attention mechanisms usually pursue the most discriminative feature learning and have been successfully used in many other computer vision tasks. However, how to learn robust fine-grained local features for person re-ID is still a challenging problem. Some works also introduce other information, such as attribute recognition wang2019pedestrian , pose estimation saquib2018pose , and part detection sun2019perceive , to improve person re-ID. Recent works demonstrate that hard sample mining/generation strategies are greatly beneficial for robust feature learning wang2017fast ; wang2018learning ; Wang_2018_CVPR ; Zhong_2018_CVPR ; Deng_2018_CVPR ; Simo2014Fracking ; Xiaolong2015Unsupervised ; Loshchilov2015Online ; dai2019batch ; choe2019attention ; wang2019improved . Among them, the Batch DropBlock Network (BDB) dai2019batch is a recent feature learning approach that jointly utilizes global and local features for person re-ID. Specifically, it introduces a feature dropping module to randomly erase the most discriminative features and thus focuses more on non-discriminative features. However, one main limitation of BDB is that it drops features randomly to generate hard samples for training, which may be sub-optimal. Recent works wang2017fast ; Wang_2018_CVPR ; choe2019attention reveal that a carefully designed feature dropping module can achieve better performance. This inspires us to rethink how to drop specific regions of the extracted feature maps to obtain better fine-grained local features.
Based on the above discussion, this paper develops a novel Self-attention guided Adaptive DropBlock Network (SaADB) for person re-ID, as shown in Figure 3. The key aspect of SaADB is to adaptively erase the most discriminative features according to the estimated attention map, rather than dropping them randomly as in BDB dai2019batch . More specifically, SaADB mainly contains three sub-networks, i.e., a global branch, an attention branch, and a local feature drop network. First, we employ the global branch to extract the global feature representation of the input pedestrian image. Second, we use the local feature dropping module to adaptively erase the most discriminative parts, guided by the estimated self-attention regions, which makes our neural network more sensitive to non-discriminative features. Finally, we introduce spatial and channel attention for more discriminative feature representation learning for person re-ID. The attention branch and the local feature drop network are randomly selected and optimized along with the global branch in the training phase. Compared with existing person re-ID approaches, such as the BDB network dai2019batch , the proposed SaADB has the following advantages: 1) Our feature dropping network adaptively erases the most discriminative features according to the attention regions, while other models erase features randomly and can only obtain sub-optimal results. 2) Previous works generally utilize either a feature dropping module or an attention module for robust feature learning, while our method exploits the advantages of both learning schemes simultaneously.
The main contributions of this paper can be summarized as the following three aspects:
We propose a novel Self-attention guided Adaptive feature Dropping Module (SaADB) for the person image representation and identification tasks.
We jointly utilize the adaptive feature dropping module and attention scheme which can attain better feature representation for person re-ID.
Extensive experiments on multiple person re-ID benchmark datasets validate the effectiveness of the proposed SaADB network.
For the rest of this paper, we first review some related works in section 2. Then, we overview and elaborate our method in section 3.1 and section 3.2 respectively. We compare the proposed SaADB model with state-of-the-art person re-ID approaches in section 4.3, followed by the component analysis in section 4.4. Section 5 concludes our paper.
2 Related Work
Inspired by the powerful feature representation of convolutional neural networks (CNNs), recent person re-ID approaches generally utilize CNNs to automatically learn deep features from massive training datasets chang2018multi ; Shen2018Deep ; sun2017svdnet ; Zhao2017Deeply ; Zheng2016A ; fayyaz2019person ; cheng2019hierarchical ; cheng2020scale . Dicing images into local parts is a popular strategy to extract local features Varior2016A . The dense semantic estimation technique (DensePose) is first used in work Zhao2017Spindle to perform pixel-wise fine-grained semantic estimation and obtain the semantic information of each pixel. It can handle the problem of spatial semantic misalignment and significantly improves the performance of person re-ID tasks.
In addition, many works introduce attention modules into person re-ID networks, such as liu2017end ; li2018harmonious ; song2018mask ; xu2018attention ; li2018diversity . Specifically, Liu et al. liu2017end demonstrate that multiple local areas with more distinguishable information can further improve the overall performance. Li et al. li2018harmonious propose to jointly learn hard and soft attention for person re-ID. A continuous attention model guided by a binary mask is introduced in work song2018mask . It first uses binary segmentation masks to construct synthetic RGB-Mask pairs as the input, and then employs a mask-guided contrastive attention model to learn features separately from the body and background regions. Xu et al. xu2018attention introduce the pose-guided part attention (PPA) and attention-aware feature composition (AFC) for person re-ID, in which PPA is used to mask out undesirable background features in person feature maps and can also handle the part occlusion issue. Li et al. li2018diversity propose to use spatial attention to handle the alignment between frames, which can avoid occlusion damage. Although these works achieve better results by exploiting attention models, they all attempt to mine the most discriminative features and thus ignore the fine-grained local features, which are important cues in some challenging scenarios. In this paper, we jointly utilize the feature dropping module and the attention model, which can obtain better local feature representations for person re-ID.
Hard Example Mining/Generation: Some researchers attempt to design hard example mining/generation techniques wang2017fast ; wang2018learning ; Wang_2018_CVPR ; Zhong_2018_CVPR ; Deng_2018_CVPR ; Simo2014Fracking ; Xiaolong2015Unsupervised ; Loshchilov2015Online ; dai2019batch ; choe2019attention ; wang2019improved for person re-ID and other related computer vision tasks. Specifically, Wang et al. wang2019improved propose to utilize person attributes to mine hard mini-batch samples for training their network. Wang et al. wang2018learning propose to combine global features with multi-granularity local features to characterize the integrity of the input image. Dai et al. dai2019batch propose a Batch DropBlock Network (BDB) to learn attentive local features for re-ID. Although the BDB network obtains better performance for person re-ID and some related retrieval tasks, the design of its dropping module may still not be optimal. In this paper, we propose to drop features guided by a self-attention module and design a novel SaADB for person image representation and identification.
3 The Proposed Approach
In this section, we first give an overview of our proposed person re-ID model. Then, we provide the details of each component of our model. Finally, we present the details of the proposed model in the training and testing phases.
3.1 Overview
As shown in Figure 3, the proposed network mainly contains three modules, i.e., a global branch, an attention branch, and a local feature drop module. The global branch is used to encode the global feature representation of the given pedestrian image. To capture the local detailed information of pedestrian images, we introduce the local feature drop network to adaptively erase the most discriminative parts by employing a self-attention scheme. In addition, we introduce the widely used spatial and channel attention modules to further improve the discriminative ability of the learned features. The attention branch and the local feature drop network are randomly selected and optimized along with the global branch. More details about these modules are described below.
3.2 Network Architecture
3.2.1 Global Branch
For the person re-ID task, a CNN is usually adopted for global feature extraction. As shown in Figure 3, we utilize ResNet-50 he2016deepResidual as our backbone network by following dai2019batch . Given the feature map predicted by the backbone network, we first use a GAP (Global Average Pooling) layer to transform the feature map into a vector, followed by two fully connected (FC) layers that encode the feature vector into a fixed dimension. The numbers of neurons of the two FC layers are set to 2048 and 512, respectively.
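The global branch described above can be sketched in NumPy as follows. This is a minimal illustration, not the paper's implementation; the ReLU between the two FC layers and all variable names are our assumptions, and only the layer widths (2048 and 512) come from the text.

```python
import numpy as np

def global_branch(feature_map, w1, b1, w2, b2):
    """Global branch sketch: GAP over the spatial dims, then two FC layers.

    feature_map: (C, H, W) backbone output; (w1, b1) and (w2, b2) are the
    weights of the two FC layers (2048 and 512 units per the paper).
    """
    v = feature_map.mean(axis=(1, 2))     # GAP: (C,)
    h = np.maximum(v @ w1 + b1, 0.0)      # FC-1 with ReLU (an assumption)
    return h @ w2 + b2                    # FC-2: fixed 512-d embedding

rng = np.random.default_rng(0)
C, H, W = 2048, 24, 8
f = rng.standard_normal((C, H, W))
w1, b1 = rng.standard_normal((C, 2048)) * 0.01, np.zeros(2048)
w2, b2 = rng.standard_normal((2048, 512)) * 0.01, np.zeros(512)
emb = global_branch(f, w1, b1, w2, b2)
print(emb.shape)  # (512,)
```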
3.2.2 Adaptive Dropping Branch
The Self-attention guided Adaptive Dropping Branch is proposed to learn fine-grained non-discriminative features by erasing the most discriminative ones. The motivation of this module is that the aforementioned global branch already works well in regular person re-ID scenarios, where it can capture the globally discriminative features. However, in some challenging cases, the most discriminative features may not belong to the target person due to the influence of similar targets or occlusion. Inspired by the feature dropping module proposed in BDB dai2019batch , which erases features randomly, we utilize such a mechanism to boost the robustness of the person re-ID model.
Specifically, as shown in Figure 3, our adaptive dropping branch takes the feature maps extracted from the backbone network as input. Then, we employ a channel-wise pooling operation on the feature map to obtain a self-attention map, from which a corresponding drop mask is derived via a threshold selection operation. The drop mask is used to mask the input feature map and generate the dropped feature map. The obtained feature map contains non-discriminative features, which makes our neural network more sensitive to these features, as discussed in work choe2019attention . The threshold is defined as follows. Assuming $v_{max}$ is the maximum value in the self-attention map, we set the threshold to $\gamma \cdot v_{max}$ to attain the drop mask, where $\gamma \in (0, 1)$ is a hyper-parameter. More concretely, positions of the self-attention map whose values are greater than $\gamma \cdot v_{max}$ are set to 0 in the drop mask, and the remaining positions are set to 1. After obtaining the drop mask, we multiply it with the input feature map to obtain the dropped feature map. Such a feature map does not contain the discriminative features of the target object; therefore, we can enforce the neural network to pay more attention to the non-discriminative features.
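The dropping step can be sketched in NumPy. The function name, the symbol `gamma` for the threshold hyper-parameter, and the value 0.9 are our illustrative choices; only the pooling-threshold-mask pipeline comes from the text.

```python
import numpy as np

def adaptive_drop(feature_map, gamma=0.9):
    """Self-attention guided dropping sketch.

    feature_map: (C, H, W). Channel-wise average pooling gives an (H, W)
    self-attention map; positions whose attention exceeds gamma * max are
    zeroed out, so only non-discriminative regions survive.
    """
    attn = feature_map.mean(axis=0)             # channel-wise pooling -> (H, W)
    threshold = gamma * attn.max()
    drop_mask = (attn <= threshold).astype(feature_map.dtype)
    return feature_map * drop_mask, drop_mask

rng = np.random.default_rng(1)
f = np.abs(rng.standard_normal((16, 6, 4)))
dropped, mask = adaptive_drop(f, gamma=0.9)
# the position holding the attention maximum is always erased
attn = f.mean(axis=0)
assert mask[np.unravel_index(attn.argmax(), attn.shape)] == 0
```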
3.2.3 Attention Module
In addition to the above feature dropping module, which aims to mine the non-discriminative features, we further introduce attention estimation to learn the most discriminative feature map. The attention mechanism can help the neural network learn better feature representations of pedestrian images, as demonstrated in many previous works liu2017end ; li2018harmonious ; song2018mask ; xu2018attention ; li2018diversity . We introduce spatial and channel attention woo2018cbam to highlight the most discriminative features for person re-ID. Therefore, we can attain a more robust feature representation by randomly selecting between the adaptive feature dropping module and the attention module when training our neural network.
Channel Attention: First, we take the feature map $F \in \mathbb{R}^{C \times H \times W}$ as input ($C$, $H$, and $W$ denote the channel number, height, and width of the feature map, respectively) and use the feature correlation between channels to generate channel attention features. The channel information of the feature map is aggregated with global max pooling and global average pooling operations over the spatial dimensions. Therefore, we obtain two channel descriptions $F^{c}_{avg}$ and $F^{c}_{max}$ and feed them into a shared network to generate two feature descriptors. The shared network consists of a multi-layer perceptron (MLP) with one hidden layer. To reduce the parameter overhead, the hidden activation size is set to $C/r$, where $r$ represents the parameter reduction ratio. Then, we merge the two feature descriptors using element-wise summation and obtain the channel attention map $M_{c} \in \mathbb{R}^{C \times 1 \times 1}$ through the Sigmoid activation function woo2018cbam , i.e.,
$$M_{c}(F) = \sigma\big(\mathrm{MLP}(F^{c}_{avg}) + \mathrm{MLP}(F^{c}_{max})\big),$$
where $\sigma$ denotes the Sigmoid activation function.
Therefore, we obtain the attended feature map $F'$ by multiplying the channel attention map with the original feature map as
$$F' = M_{c}(F) \otimes F,$$
where $\otimes$ denotes element-wise multiplication. It is worth noting that the channel attention values are broadcast (copied) along the spatial dimension before the multiplication.
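The channel attention computation can be sketched in NumPy as below. The ReLU hidden activation in the shared MLP is our assumption (as in CBAM woo2018cbam ); all variable names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """Channel attention sketch: global avg/max pooling over (H, W), a
    shared one-hidden-layer MLP (W1: C x C/r, W2: C/r x C, reduction r),
    element-wise sum, Sigmoid. Returns M_c with shape (C, 1, 1)."""
    avg = F.mean(axis=(1, 2))               # F_avg^c: (C,)
    mx = F.max(axis=(1, 2))                 # F_max^c: (C,)
    def mlp(v):
        return np.maximum(v @ W1, 0.0) @ W2  # ReLU hidden layer (assumed)
    m_c = sigmoid(mlp(avg) + mlp(mx))       # (C,)
    return m_c[:, None, None]               # broadcastable over H, W

rng = np.random.default_rng(2)
C, r = 8, 2
F = rng.standard_normal((C, 5, 3))
W1 = rng.standard_normal((C, C // r)) * 0.1
W2 = rng.standard_normal((C // r, C)) * 0.1
M_c = channel_attention(F, W1, W2)
F_attended = M_c * F   # attention broadcast along the spatial dims
```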
Spatial Attention: Different from channel attention, spatial attention focuses on mining useful feature regions from the perspective of spatial coordinates. The spatial attention module takes the output of channel attention $F'$ as input and returns two channel descriptions $F^{s}_{avg} \in \mathbb{R}^{1 \times H \times W}$ and $F^{s}_{max} \in \mathbb{R}^{1 \times H \times W}$ with channel-wise global average pooling and global max pooling operations, respectively. These descriptions are fed into a convolutional layer to obtain two feature descriptors, which are then merged using element-wise summation; the spatial attention map $M_{s} \in \mathbb{R}^{1 \times H \times W}$ is generated through a Sigmoid layer as woo2018cbam
$$M_{s}(F') = \sigma\big(f(F^{s}_{avg}) + f(F^{s}_{max})\big),$$
where $f$ denotes the convolutional layer and $\sigma$ the Sigmoid function.
Thus, we obtain the final attended feature map $F''$ by multiplying the spatial attention map with the channel attended feature map as
$$F'' = M_{s}(F') \otimes F'.$$
Similarly, the attention values are broadcast (copied) along the channel dimension before the multiplication.
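A matching NumPy sketch of the spatial attention step follows. For simplicity we model the convolutional layer as a 1x1 convolution, which on a single-channel map reduces to a scalar weight; this simplification, and all names, are our assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(F, conv_weight):
    """Spatial attention sketch: channel-wise avg/max pooling give two
    (1, H, W) descriptions, each passed through a conv layer (a 1x1 conv,
    i.e. a scalar weight here), summed, then Sigmoid."""
    avg = F.mean(axis=0, keepdims=True)   # F_avg^s: (1, H, W)
    mx = F.max(axis=0, keepdims=True)     # F_max^s: (1, H, W)
    d_avg = conv_weight * avg             # 1x1 conv on 1 channel == scaling
    d_max = conv_weight * mx
    return sigmoid(d_avg + d_max)         # M_s: (1, H, W)

rng = np.random.default_rng(3)
F_prime = rng.standard_normal((8, 5, 3))
M_s = spatial_attention(F_prime, conv_weight=0.5)
F_final = M_s * F_prime   # attention broadcast along the channel dim
```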
3.2.4 Random Selection
Once the attended feature map and the erased feature map are obtained, how to use them for discriminative feature learning is another key aspect of the proposed approach. Inspired by the random selection mechanism in previous work choe2019attention , we randomly select one of them for the subsequent classification in the training process. On the one hand, the attention module learned in previous iterations improves the quality of the feature map and helps the feature dropping module erase the most discriminative features more accurately. Therefore, we can learn better feature representations than by using the adaptive feature dropping module alone. On the other hand, our model can capture local fine-grained features, which can be more effective than the attention module in challenging scenarios where the most discriminative features are inexplicit. The two branches can be trained simultaneously, but the features generated by the adaptive dropping module and the attention module are opposite in nature; concatenating them into one representation may weaken the attention branch or the local feature drop network. Therefore, we randomly select one branch to train in each iteration, as suggested in choe2019attention . Our experimental results also demonstrate that such an alternating learning approach boosts re-ID performance significantly.
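The per-iteration branch selection amounts to a single biased coin flip; a small sketch (the selection probability 0.25 is the value reported later in Section 4.2, the function name is ours):

```python
import numpy as np

def select_branch(rng, p_drop=0.25):
    """Random branch selection sketch: each training iteration picks the
    dropped branch with probability p_drop, the attention branch otherwise."""
    return "drop" if rng.random() < p_drop else "attention"

rng = np.random.default_rng(4)
picks = [select_branch(rng) for _ in range(10000)]
frac_drop = picks.count("drop") / len(picks)
# law of large numbers: the empirical rate is close to 0.25
assert abs(frac_drop - 0.25) < 0.02
```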
3.3 Training and Testing Phase
In the training phase, we adopt both a label prediction loss and a metric learning loss to train the proposed network. Formally, the label prediction loss is defined as
$$L_{id} = -\sum_{i=1}^{C} y_{i} \log(p_{i}),$$
where $p_{i}$ is the $i$-th value of the output vector $p$, i.e., the probability that the sample belongs to the $i$-th category, $C$ is the number of categories, and $y$ is the ground-truth vector whose dimension is $C$.
For the metric learning loss, we adopt a soft-margin batch-hard triplet loss hermans2017defense , which aims to increase the distances between negative and anchor samples and decrease the distances between positive and anchor samples. The detailed formulation of this metric learning loss can be written as
$$L_{triplet} = \sum_{i=1}^{P} \sum_{a=1}^{K} \log\Big(1 + \exp\big(\max_{p=1,\dots,K} D(f_{a}^{i}, f_{p}^{i}) - \min_{\substack{j=1,\dots,P,\, j \neq i \\ n=1,\dots,K}} D(f_{a}^{i}, f_{n}^{j})\big)\Big),$$
where $P$ and $K$ indicate the number of person IDs and the number of images per ID, respectively, with the hardest positive and hardest negative mined within the batch as defined in hermans2017defense . Therefore, we have $P \times K$ triplets in a mini-batch. For each triplet, $f_{a}^{i}$ and $f_{p}^{i}$ denote the feature vectors of the anchor and a positive sample of identity $i$, respectively, $f_{n}^{j}$ denotes the feature vector of a negative sample, and $D(\cdot, \cdot)$ is the Euclidean distance function. The feature vector of each sample is obtained from the last fully connected layer of our network.
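The two training losses can be made concrete with a NumPy sketch. This is our own minimal implementation of a cross-entropy label loss and the soft-margin batch-hard triplet loss of hermans2017defense , not the paper's code; variable names are illustrative.

```python
import numpy as np

def id_loss(p, y):
    """Cross-entropy label prediction loss: -sum_i y_i log p_i."""
    return -np.sum(y * np.log(p + 1e-12))

def soft_margin_batch_hard_triplet(feats, pids):
    """For each anchor: hardest positive (max distance, same ID) and
    hardest negative (min distance, other ID), then log(1 + exp(dp - dn))."""
    n = len(feats)
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    loss = 0.0
    for a in range(n):
        pos = pids == pids[a]
        dp = d[a][pos].max()          # hardest positive
        dn = d[a][~pos].min()         # hardest negative
        loss += np.log1p(np.exp(dp - dn))
    return loss / n

ce = id_loss(np.array([0.7, 0.2, 0.1]), np.array([1.0, 0.0, 0.0]))

rng = np.random.default_rng(5)
# P=2 identities, K=3 images each; well-separated clusters -> small loss
feats = np.concatenate([rng.normal(0, 0.1, (3, 8)), rng.normal(5, 0.1, (3, 8))])
pids = np.array([0, 0, 0, 1, 1, 1])
l_tight = soft_margin_batch_hard_triplet(feats, pids)
l_rand = soft_margin_batch_hard_triplet(rng.standard_normal((6, 8)), pids)
assert l_tight < l_rand   # separated identities give a lower loss
```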
In the testing phase, we jointly utilize features from the global and attention branch as the embedding vector of a given pedestrian image. That is, the local feature drop network is only used in the training phase for robust feature learning.
3.4 Comparison with Related Works
The proposed SaADB method is most related to BDB dai2019batch , which proposes a batch dropblock network for person re-ID. Different from BDB dai2019batch , the proposed SaADB further employs a self-attention scheme choe2019attention to adaptively select the attentive regions to erase, which makes our features more discriminative and improves the final re-ID performance significantly. SaADB is also different from ADL choe2019attention . First, SaADB is designed for person image representation and re-ID tasks, while ADL choe2019attention focuses on weakly supervised object localization. Second, SaADB exploits both channel and spatial attention for feature enhancement, while only self (spatial) attention is used in ADL choe2019attention . The channel attention woo2018cbam mainly focuses on the channel-wise information of the input, while the spatial attention woo2018cbam mainly focuses on its positional information. Therefore, we can attain better feature representations with these two operations.
4 Experiments
In this section, we first introduce the datasets, evaluation metrics, and implementation details in sections 4.1 and 4.2, respectively. Then, we compare the proposed SaADB with other state-of-the-art person re-ID methods, followed by ablation studies to evaluate the effectiveness of each module in our method.
4.1 Datasets and Evaluation Metrics
We evaluate our model on three widely used person re-ID benchmark datasets, including Market-1501 zheng2015scalable , DukeMTMC zheng2017unlabeled , and CUHK03 zhong2017re . We follow the same protocols as previous works liu2018pose ; zheng2017unlabeled ; dai2019batch ; Zhong_2018_CVPR . Two evaluation metrics are adopted, i.e., mAP and Rank-$k$ accuracy zheng2015scalable .
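The two metrics can be made concrete with a small NumPy sketch. This is our own single-query illustration (no junk-image filtering or camera constraints, unlike the official evaluation protocols):

```python
import numpy as np

def rank_k(dist_row, gallery_pids, query_pid, k):
    """Rank-k hit for one query: is a correct identity among the k nearest
    gallery entries? (building block of the CMC curve)."""
    order = np.argsort(dist_row)
    return int(query_pid in gallery_pids[order[:k]])

def average_precision(dist_row, gallery_pids, query_pid):
    """AP for one query; mAP is the mean of AP over all queries."""
    order = np.argsort(dist_row)
    matches = (gallery_pids[order] == query_pid).astype(float)
    if matches.sum() == 0:
        return 0.0
    cum = np.cumsum(matches)
    prec_at_hits = cum[matches == 1] / (np.flatnonzero(matches) + 1)
    return prec_at_hits.mean()

gallery_pids = np.array([3, 1, 2, 1])
dist = np.array([0.9, 0.1, 0.5, 0.2])  # query closest to the two id-1 images
assert rank_k(dist, gallery_pids, query_pid=1, k=1) == 1
assert average_precision(dist, gallery_pids, query_pid=1) == 1.0
```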
4.2 Implementation Details
In this paper, ResNet-50 he2016deepResidual is adopted as our backbone network with a slight modification, i.e., removing the down-sampling layers after layer-3. All person images are resized to 384 × 128. Our model size is 34.8 M, and training is conducted on a PC with 4 GTX-1080 GPUs. The batch size is 128, with 32 identities in each batch. We use Adam kingma2014adam as the optimizer; a dynamic learning rate proportional to the epoch index is used in the first 50 epochs, and the learning rate is then decayed after 200 and 300 epochs, respectively. Our network is trained for 600 epochs. In the local feature drop network, 20% of the activation values in the feature maps are erased. In each iteration, the probability of selecting the dropped branch is 0.25; that is, the probability of choosing the attention branch is 0.75. In the testing phase, we jointly utilize the feature vectors from both the global branch and the attention branch as the embedding vector of a pedestrian image.
4.3 Comparison with State-of-the-art Algorithms
In this section, we report our re-ID results and compare with other state-of-the-art approaches on three benchmark datasets, including the Market-1501, CUHK03, and DukeMTMC datasets.
Results on Market-1501 dataset. As shown in Table 1, BDB dai2019batch achieves 86.3% and 94.7% on the mAP and Rank-1, respectively, while our method obtains 86.7% and 95.2% on these two metrics. It is also worth noting that CASN zheng2019re is also developed by combining local and global features, achieving 82.8% and 94.4% on the mAP and Rank-1; our method significantly outperforms CASN zheng2019re . When extracting fine-grained features, CASN zheng2019re only focuses on each local area by manually segmenting the feature maps, without targeted learning of local features, while our proposed SaADB focuses on learning the non-discriminative features via the local feature dropping network and emphasizing the most discriminative features via the attention module. Therefore, the features extracted by the SaADB network are more powerful and thus achieve better re-ID performance. In addition, our model does not require the division of local features, which makes it more efficient than CASN zheng2019re . Mancs wang2018mancs introduces a multi-branch network that combines an attention module and a global module for person re-ID, achieving 82.3% and 93.1% on the mAP and Rank-1, respectively. Benefiting from the attention module, the local feature dropping network, and the global module, our proposed SaADB significantly beats Mancs wang2018mancs . Our proposed algorithm also performs better on this benchmark than some recent and strong re-ID algorithms published at CVPR-2020, including HOReID wang2020high (mAP/Rank-1: 84.9/94.2) and SNR jin2020style (mAP/Rank-1: 84.7/94.4).
Results on CUHK03 dataset. As shown in Table 2, the mAP and Rank-1 accuracy of our model reach 80.0% and 83.2%, respectively, on the CUHK03-labeled dataset, and 76.2% and 79.3% on the CUHK03-detected dataset. The baseline approach BDB dai2019batch achieves 76.7% and 79.4% on the mAP and Rank-1 of the CUHK03-labeled dataset; our results are thus 3.3% and 3.8% higher than BDB dai2019batch on this dataset. Meanwhile, our results are also significantly better than BDB dai2019batch on the CUHK03-detected dataset, where BDB dai2019batch achieves 73.5% and 76.4% on the mAP and Rank-1. BDB dai2019batch randomly erases the feature maps to obtain local features; however, this simple random dropping operation cannot guarantee that the discriminative regions in the original feature maps are discarded. Therefore, it can only obtain sub-optimal samples for training. Our proposed SaADB generates effective local features using the self-attention guided adaptive feature dropping module, which makes our model focus more on the non-discriminative features. In addition, we introduce the spatial and channel attention modules to achieve more discriminative features in the testing phase. These modules all contribute to our final re-ID performance, which significantly outperforms all compared state-of-the-art person re-ID approaches.
Results on DukeMTMC dataset. As shown in Table 3, our approach achieves 77.1% on mAP and 89.1% on Rank-1 on the DukeMTMC dataset, which is significantly better than the compared state-of-the-art algorithms, including PCB sun2018beyond (73.4% and 84.1% on the mAP and Rank-1) and BDB dai2019batch (76.0% and 88.7% on the mAP and Rank-1). The results consistently demonstrate the capability of our model for fine-grained local feature learning. It is worth noting that our re-ID algorithm also outperforms the recent HOReID and SNR, as shown in Table 3, which fully demonstrates the advantages of our proposed algorithm.
4.4 Component Analysis
In this subsection, we conduct a component analysis on the DukeMTMC and Market-1501 datasets to evaluate the effectiveness of each module in our re-ID algorithm. Specifically, six variants of our model are implemented, as shown in Table 4 and Table 5, built from the following components:
Global: global branch used for the feature learning;
Drop: local feature drop network is adopted for robust feature learning;
Attention: attention module is adopted for discriminative feature learning.
As shown in Table 4, the baseline approach, i.e., using only the Global branch for feature learning, gives the lowest mAP and Rank-1 scores. When integrated with the Drop branch, the re-ID performance improves on both evaluation metrics, and jointly introducing the Global and Attention modules likewise improves the results. These two experiments validate the effectiveness of the proposed Drop and Attention branches for person re-ID. After integrating all three modules together, the mAP and Rank-1 are further boosted to the best results. Similar conclusions can also be drawn from Table 5, which consistently validates the effectiveness of each component in our model.
In addition to the aforementioned quantitative analysis, we provide some visualizations to show the advantages of our proposed modules. As shown in Figure 4, our activation maps all focus on the most discriminative regions of the target person, which corresponds to the more robust and discriminative feature learning of our method for person re-ID. As shown in Figure 5, it is clear that our person re-ID algorithm is significantly better than the baseline method BDB dai2019batch . These visualizations intuitively verify the effectiveness of our proposed fine-grained local feature learning scheme.
5 Conclusion
In this paper, a novel Self-attention guided Adaptive DropBlock Network (SaADB) is proposed for robust feature representation learning and person re-ID. Our feature learning framework contains three modules: a global branch, a local feature drop network, and an attention branch. To learn more detailed information, we introduce the feature dropping module to erase the most discriminative features. Different from the random feature dropping used in many previous works, we erase features according to the learned attention maps, which is more effective and accurate. In addition, we utilize the attention mechanism to emphasize the most discriminative local features. The feature dropping module and the attention module are trained in an alternating manner via random selection. Extensive experiments on three large-scale person re-ID benchmark datasets demonstrate the effectiveness of the proposed SaADB method.
- (1) Chang, X., Hospedales, T.M., Xiang, T.: Multi-level factorisation net for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2109–2118 (2018)
- (2) Cheng, K., Tao, F., Zhan, Y., Li, M., Li, K.: Hierarchical attributes learning for pedestrian re-identification via parallel stochastic gradient descent combined with momentum correction and adaptive learning rate. Neural Computing and Applications pp. 1–18 (2019)
- (3) Cheng, L., Jing, X.Y., Zhu, X., Ma, F., Hu, C.H., Cai, Z., Qi, F.: Scale-fusion framework for improving video-based person re-identification performance. Neural Computing and Applications pp. 1–18 (2020)
- (4) Choe, J., Shim, H.: Attention-based dropout layer for weakly supervised object localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2219–2228 (2019), arxiv.org/abs/1908.10028
- (5) Dai, Z., Chen, M., Gu, X., Zhu, S., Tan, P.: Batch dropblock network for person re-identification and beyond. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3691–3701 (2019)
- (6) Deng, W., Zheng, L., Ye, Q., Kang, G., Yang, Y., Jiao, J.: Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018), arxiv.org/abs/1711.07027v3
- (7) Fayyaz, M., Yasmin, M., Sharif, M., Shah, J.H., Raza, M., Iqbal, T.: Person re-identification with features-based clustering and deep features. Neural Computing and Applications pp. 1–22 (2019)
- (8) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
- (9) Hermans, A., Beyer, L., Leibe, B.: In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737 (2017)
- (10) Huang, H., Li, D., Zhang, Z., Chen, X., Huang, K.: Adversarially occluded samples for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5098–5107 (2018)
- (11) Jin, X., Lan, C., Zeng, W., Chen, Z., Zhang, L.: Style normalization and restitution for generalizable person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3143–3152 (2020)
- (12) Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
- (13) Li, S., Bak, S., Carr, P., Wang, X.: Diversity regularized spatiotemporal attention for video-based person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 369–378 (2018), arxiv.org/abs/1803.09882
- (14) Li, W., Zhu, X., Gong, S.: Harmonious attention network for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2285–2294 (2018), arxiv.org/abs/1802.08122
- (15) Liu, H., Feng, J., Qi, M., Jiang, J., Yan, S.: End-to-end comparative attention networks for person re-identification. IEEE Transactions on Image Processing 26(7), 3492–3506 (2017), arxiv.org/abs/1606.04404v1
- (16) Liu, J., Ni, B., Yan, Y., Zhou, P., Cheng, S., Hu, J.: Pose transferrable person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4099–4108 (2018)
- (17) Liu, X., Zhao, H., Tian, M., Sheng, L., Shao, J., Yi, S., Yan, J., Wang, X.: Hydraplus-net: Attentive deep features for pedestrian analysis. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 350–359 (2017)
- (18) Loshchilov, I., Hutter, F.: Online batch selection for faster training of neural networks. arXiv preprint arXiv:1511.06343 (2015)
- (19) Qian, X., Fu, Y., Xiang, T., Wang, W., Qiu, J., Wu, Y., Jiang, Y.G., Xue, X.: Pose-normalized image generation for person re-identification. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 650–667 (2018)
- (20) Saquib Sarfraz, M., Schumann, A., Eberle, A., Stiefelhagen, R.: A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 420–429 (2018)
- (21) Shen, Y., Li, H., Xiao, T., Yi, S., Chen, D., Wang, X.: Deep group-shuffling random walk for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2265–2274 (2018)
- (22) Shen, Y., Li, H., Yi, S., Chen, D., Wang, X.: Person re-identification with deep similarity-guided graph neural network. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 486–504 (2018)
- (23) Shen, Y., Xiao, T., Li, H., Yi, S., Wang, X.: End-to-end deep kronecker-product matching for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6886–6895 (2018)
- (24) Si, J., Zhang, H., Li, C.G., Kuen, J., Gang, W.: Dual attention matching network for context-aware feature sequence based person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)
- (25) Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Moreno-Noguer, F.: Fracking deep convolutional image descriptors. arXiv preprint arXiv:1412.6537 (2014)
- (26) Song, C., Huang, Y., Ouyang, W., Wang, L.: Mask-guided contrastive attention model for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1179–1188 (2018), sciencedirect.com/science/article/abs/pii/S0167865518304562
- (27) Su, C., Li, J., Zhang, S., Xing, J., Gao, W., Tian, Q.: Pose-driven deep convolutional model for person re-identification. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3960–3969 (2017)
- (28) Suh, Y., Wang, J., Tang, S., Mei, T., Mu Lee, K.: Part-aligned bilinear representations for person re-identification. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 402–419 (2018)
- (29) Sun, Y., Xu, Q., Li, Y., Zhang, C., Li, Y., Wang, S., Sun, J.: Perceive where to focus: Learning visibility-aware part-level features for partial person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 393–402 (2019)
- (30) Sun, Y., Zheng, L., Deng, W., Wang, S.: Svdnet for pedestrian retrieval. In: IEEE International Conference on Computer Vision (2017)
- (31) Sun, Y., Zheng, L., Yang, Y., Tian, Q., Wang, S.: Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 480–496 (2018)
- (32) Varior, R.R., Bing, S., Lu, J., Dong, X., Gang, W.: A siamese long short-term memory architecture for human re-identification. In: European Conference on Computer Vision (2016)
- (33) Wang, C., Zhang, Q., Huang, C., Liu, W., Wang, X.: Mancs: A multi-task attentional network with curriculum sampling for person re-identification. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)
- (34) Wang, G., Yang, S., Liu, H., Wang, Z., Yang, Y., Wang, S., Yu, G., Zhou, E., Sun, J.: High-order information matters: Learning relation and topology for occluded person re-identification. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6449–6458 (2020)
- (35) Wang, G., Yuan, Y., Chen, X., Li, J., Zhou, X.: Learning discriminative features with multiple granularities for person re-identification. In: Proceedings of the 2018 ACM Multimedia Conference. pp. 274–282. ACM (2018)
- (36) Wang, X., Chen, Z., Yang, R., Luo, B., Tang, J.: Improved hard example mining by discovering attribute-based hard person identity. arXiv preprint arXiv:1905.02102 (2019)
- (37) Wang, X., Li, C., Luo, B., Tang, J.: Sint++: Robust visual tracking via adversarial positive instance generation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2018)
- (38) Wang, X., Zheng, S., Yang, R., Luo, B., Tang, J.: Pedestrian attribute recognition: A survey. arXiv preprint arXiv:1901.07474 (2019)
- (39) Wang, X., Gupta, A.: Unsupervised learning of visual representations using videos. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2794–2802 (2015)
- (40) Wang, X., Shrivastava, A., Gupta, A.: A-fast-rcnn: Hard positive generation via adversary for object detection. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3039–3048. IEEE (2017)
- (41) Wang, Y., Wang, L., You, Y., Zou, X., Chen, V., Li, S., Huang, G., Hariharan, B., Weinberger, K.Q.: Resource aware person re-identification across multiple resolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8042–8051 (2018)
- (42) Woo, S., Park, J., Lee, J.Y., So Kweon, I.: Cbam: Convolutional block attention module. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 3–19 (2018)
- (43) Xu, J., Zhao, R., Zhu, F., Wang, H., Ouyang, W.: Attention-aware compositional network for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2119–2128 (2018)
- (44) Yu, R., Dou, Z., Bai, S., Zhang, Z., Xu, Y., Bai, X.: Hard-aware point-to-set deep metric for person re-identification. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 188–204 (2018)
- (45) Zhang, X., Jing, X.Y., Zhu, X., Ma, F.: Semi-supervised person re-identification by similarity-embedded cycle gans. Neural Computing and Applications pp. 1–10 (2020)
- (46) Zhao, H., Tian, M., Sun, S., Jing, S., Tang, X.: Spindle net: Person re-identification with human body region guided feature decomposition and fusion. In: IEEE Conference on Computer Vision and Pattern Recognition (2017)
- (47) Zhao, L., Li, X., Zhuang, Y., Wang, J.: Deeply-learned part-aligned representations for person re-identification. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3219–3228 (2017)
- (48) Zheng, A., Lin, X., Dong, J., Wang, W., Tang, J., Luo, B.: Multi-scale attention vehicle re-identification. Neural Computing and Applications pp. 1–15 (2020)
- (49) Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., Tian, Q.: Scalable person re-identification: A benchmark. In: Proceedings of the IEEE international conference on computer vision. pp. 1116–1124 (2015)
- (50) Zheng, M., Karanam, S., Wu, Z., Radke, R.J.: Re-identification with consistent attentive siamese networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5735–5744 (2019)
- (51) Zheng, Z., Zheng, L., Yang, Y.: A discriminatively learned cnn embedding for person re-identification. ACM Transactions on Multimedia Computing, Communications, and Applications 14(1) (2016)
- (52) Zheng, Z., Zheng, L., Yang, Y.: Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 3754–3762 (2017)
- (53) Zhong, Z., Zheng, L., Cao, D., Li, S.: Re-ranking person re-identification with k-reciprocal encoding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1318–1327 (2017)
- (54) Zhong, Z., Zheng, L., Zheng, Z., Li, S., Yang, Y.: Camera style adaptation for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5157–5166 (2018)
- (55) Zhu, X., Jing, X.Y., Ma, F., Cheng, L., Ren, Y.: Simultaneous visual-appearance-level and spatial-temporal-level dictionary learning for video-based person re-identification. Neural Computing and Applications 31(11), 7303–7315 (2019)