SaADB: A Self-attention Guided ADB Network for Person Re-identification

07/07/2020 ∙ by Bo Jiang, et al. ∙ NetEase, Inc

Recently, the Batch DropBlock network (BDB) has demonstrated its effectiveness on person image representation and the re-ID task via feature erasing. However, BDB drops features randomly, which may lead to sub-optimal results. In this paper, we propose a novel Self-attention guided Adaptive DropBlock network (SaADB) for person re-ID which can adaptively erase the most discriminative regions. Specifically, SaADB first obtains a self-attention map by channel-wise pooling and derives a drop mask by thresholding the self-attention map. Then, the input features and the self-attention guided drop mask are multiplied to generate the dropped feature maps. Meanwhile, we utilize spatial and channel attention to learn a better feature map and train it iteratively with the feature dropping module for person re-ID. Experiments on several benchmark datasets demonstrate that the proposed SaADB significantly outperforms prevailing competitors in person re-ID.


1 Introduction

Given a probe image, the goal of person re-identification (re-ID) is to retrieve pedestrian images of the same identity from a gallery set. It has been widely used in many applications, such as video surveillance and self-driving. In recent years, many deep learning based person re-ID approaches dai2019batch ; Zhong_2018_CVPR ; Deng_2018_CVPR ; song2018mask ; liu2017end ; li2018harmonious ; xu2018attention ; li2018diversity ; zhang2020semi ; zhu2019simultaneous have been proposed and achieve great improvements over traditional approaches. However, re-ID performance in some challenging scenarios is still unsatisfactory due to the influence of cluttered backgrounds, illumination changes, motion blur, low resolution, and occlusion.

Figure 1: The comparison between our approach and other person re-ID algorithms on the Market-1501 dataset (Rank-1 and mAP are used for the evaluation).

To address the above issues, many researchers resort to attention mechanisms for person re-ID liu2017end ; li2018harmonious ; song2018mask ; xu2018attention ; li2018diversity ; zheng2020multi . Attention mechanisms usually aim to exploit the most discriminative features and have been successfully used in many other computer vision tasks. However, how to learn robust fine-grained local features for person re-ID is still a challenging problem. Some works also introduce auxiliary information, such as attribute recognition wang2019pedestrian , pose estimation saquib2018pose , and part detection sun2019perceive , to improve person re-ID. Recent works demonstrate that hard sample mining/generation strategies are greatly beneficial for robust feature learning wang2017fast ; wang2018learning ; Wang_2018_CVPR ; Zhong_2018_CVPR ; Deng_2018_CVPR ; Simo2014Fracking ; Xiaolong2015Unsupervised ; Loshchilov2015Online ; dai2019batch ; choe2019attention ; wang2019improved . Among them, the Batch DropBlock network (BDB) dai2019batch is a recent feature learning approach that jointly utilizes global and local features for person re-ID. Specifically, it introduces a feature dropping module that randomly erases the most discriminative features, so that the network focuses more on non-discriminative features. However, one main limitation of BDB is that it drops features randomly to generate hard samples for training, which may be sub-optimal. Recent works wang2017fast ; Wang_2018_CVPR ; choe2019attention reveal that a carefully designed feature dropping module can achieve better performance. This inspires us to rethink how to drop specific regions of the extracted feature maps to obtain better fine-grained local features.

Based on the above discussion, this paper develops a novel Self-attention guided Adaptive DropBlock network (SaADB) for person re-ID, as shown in Figure 3. The key aspect of SaADB is to adaptively erase the most discriminative features according to an estimated attention map, rather than dropping features randomly as in BDB dai2019batch . More specifically, SaADB mainly contains three sub-networks, i.e., a global branch, an attention branch, and a local feature drop network. First, we employ the global branch to extract the global feature representation of the input pedestrian image. Second, we use the local feature dropping module to adaptively erase the most discriminative parts, guided by the estimated self-attention regions, which makes the network more sensitive to non-discriminative features. Finally, we introduce spatial and channel attention for more discriminative feature representation learning. The attention branch and the local feature drop network are randomly selected and optimized along with the global branch in the training phase. Compared with existing person re-ID approaches, such as the BDB network dai2019batch , the proposed SaADB network has the following advantages: 1) our feature dropping network adaptively erases the most discriminative features according to the attention regions, while other models erase features randomly and can only obtain sub-optimal results; 2) previous works generally utilize either a feature dropping module or an attention module for robust feature learning, while our method makes full use of the advantages of both learning schemes simultaneously.

Figure 2: The process of drop mask generation. In particular, red regions in the self-attention map denote more discriminative features compared with the blue regions. The elements of the drop mask are 0 and 1, where 0 (blue regions) corresponds to the discriminative area in the self-attention map and the remaining area is set to 1.

The main contributions of this paper can be summarized as the following three aspects:

  • We propose a novel Self-attention guided Adaptive DropBlock network (SaADB) for person image representation and identification tasks.

  • We jointly utilize the adaptive feature dropping module and the attention scheme, which attains better feature representations for person re-ID.

  • Extensive experiments on multiple person re-ID benchmark datasets validate the effectiveness of the proposed SaADB network.

For the rest of this paper, we first review related works in Section 2. Then, we overview and elaborate our method in Sections 3.1 and 3.2 respectively. We compare the proposed SaADB model with state-of-the-art person re-ID approaches in Section 4.3, followed by component analysis in Section 4.4. Section 5 concludes the paper.

2 Related Work

Person Re-identification:

Inspired by the powerful feature representation ability of convolutional neural networks (CNNs), recent person re-ID approaches generally utilize CNNs to automatically learn deep features from massive training datasets chang2018multi ; Shen2018Deep ; sun2017svdnet ; Zhao2017Deeply ; Zheng2016A ; fayyaz2019person ; cheng2019hierarchical ; cheng2020scale . Image dicing is a popular strategy to extract local features Varior2016A . The dense semantic estimation technique (DensePose) is first used in Zhao2017Spindle to perform pixel-wise fine-grained semantic estimation and obtain the semantic information of each pixel. It can handle the problem of spatial semantic misalignment and significantly improves person re-ID performance.

In addition, many works introduce attention modules into person re-ID networks liu2017end ; li2018harmonious ; song2018mask ; xu2018attention ; li2018diversity . Specifically, Liu et al. liu2017end demonstrate that multiple local areas with more distinguishable information can further improve overall performance. Li et al. li2018harmonious propose to jointly learn hard and soft attention for person re-ID. A continuous attention model guided by a binary mask is introduced in song2018mask : it first uses binary segmentation masks to construct synthetic RGB-Mask pairs as input, and then employs a mask-guided contrastive attention model to learn features separately from the body and background regions. Xu et al. xu2018attention introduce pose-guided part attention (PPA) and attention-aware feature composition (AFC) for person re-ID, in which PPA is used to mask out undesirable background features in person feature maps and can also handle part occlusion. Li et al. li2018diversity propose to use spatial attention to handle misalignment between frames, which can mitigate occlusion damage. Although these works achieve better results by exploiting attention models, they all attempt to mine the most discriminative features and thus ignore the fine-grained local features, which are important cues in some challenging scenarios. In this paper, we jointly utilize the feature dropping module and the attention model to obtain better local feature representations for person re-ID.

Hard Example Mining/Generation: Some researchers design hard example mining/generation techniques wang2017fast ; wang2018learning ; Wang_2018_CVPR ; Zhong_2018_CVPR ; Deng_2018_CVPR ; Simo2014Fracking ; Xiaolong2015Unsupervised ; Loshchilov2015Online ; dai2019batch ; choe2019attention ; wang2019improved for person re-ID and other related computer vision tasks. Specifically, Wang et al. wang2019improved propose to utilize person attributes to mine hard mini-batch samples for network training. Wang et al. wang2018learning propose to combine global features with multi-granularity local features to characterize the integrity of the input image. Dai et al. dai2019batch propose the Batch DropBlock network (BDB) to learn attentive local features for re-ID. Although BDB obtains better performance on person re-ID and some related retrieval tasks, the design of its dropping module may still not be optimal. In this paper, we propose to drop features guided by a self-attention module and design a novel SaADB for person image representation and identification.

3 The Proposed Approach

In this section, we first give an overview of our proposed person re-ID model. Then, we provide the details of each component of our model. Finally, we present the details of the proposed model in the training and testing phases.

3.1 Overview

As shown in Figure 3, the proposed network mainly contains three modules, i.e., a global branch, an attention branch, and a local feature drop module. The global branch is used to encode the global feature representation of the given pedestrian image. To capture the local detailed information of pedestrian images, we introduce the local feature drop network to adaptively erase the most discriminative parts by employing a self-attention scheme. In addition, we introduce the widely used spatial and channel attention modules to further improve the discriminative ability of the learned features. The attention branch and the local feature drop network are randomly selected and optimized along with the global branch. More details of these modules are described below.

Figure 3: The procedure of our proposed person re-identification network with adaptive dropblock module.

3.2 Network Architecture

3.2.1 Global Branch

For the person re-ID task, a CNN is usually adopted for global feature extraction. As shown in Figure 3, we utilize ResNet-50 he2016deepResidual as our backbone network, following dai2019batch . Given the feature map predicted by the backbone network, we first use a GAP (Global Average Pooling) layer to transform the feature map into a vector, followed by two fully connected (FC) layers that encode the feature vector into a fixed dimension. The numbers of neurons of the two FC layers are set to 2048 and 512, respectively.
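For concreteness, the GAP-plus-two-FC computation of the global branch can be sketched as follows. This is an illustrative NumPy sketch only: the bias-free weight matrices, their shapes, and the ReLU after the first FC layer are our assumptions, not details from the paper.

```python
import numpy as np

def global_branch(feature_map, W_fc1, W_fc2):
    """Global branch sketch: GAP followed by two FC layers.

    In the paper the two FC layers have 2048 and 512 neurons; here the
    dimensions are taken from the weight shapes so the sketch stays small.
    """
    v = feature_map.mean(axis=(1, 2))      # Global Average Pooling -> (C,)
    h = np.maximum(v @ W_fc1, 0.0)         # first FC layer with ReLU (assumed)
    return h @ W_fc2                       # second FC layer -> fixed-dim embedding
```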

3.2.2 Adaptive Dropping Branch

The self-attention guided adaptive dropping branch is proposed to learn fine-grained non-discriminative features by erasing the most discriminative ones. The motivation of this module is that the aforementioned global branch already works well in regular person re-ID scenarios, where it captures the globally discriminative features. However, in some challenging cases, the most discriminative features may not belong to the target person due to the influence of similar targets or occlusion. Inspired by the feature dropping module of BDB dai2019batch , which erases features randomly, we utilize such a mechanism to boost the robustness of the person re-ID model.

Specifically, as shown in Figure 3, our SaADB takes the feature maps extracted from the backbone network as input. Then, we employ a channel-wise pooling operation on the feature map to obtain a self-attention map, and derive a corresponding drop mask via a threshold selection operation. The drop mask is used to mask the input feature map and generate the dropped feature map. The obtained feature map contains the non-discriminative features and makes the network more sensitive to them, as discussed in choe2019attention . The threshold is defined as follows. Assuming m is the maximum value of the self-attention map, we set the threshold to γ·m to attain the drop mask, where γ is a hyper-parameter. More concretely, positions of the self-attention map whose value is greater than γ·m are set to 0 in the drop mask; otherwise, they are set to 1. After obtaining the drop mask, we multiply it with the input feature map to obtain the dropped feature map. Such a feature map no longer contains the most discriminative features of the target object; therefore, we can enforce the network to pay more attention to the non-discriminative features.
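The dropping procedure above can be sketched in a few lines of NumPy. This is an illustrative sketch: the function name, the hyper-parameter name `gamma`, and the choice of average pooling for the channel-wise pooling step are our assumptions (the paper does not specify which channel-wise pooling it uses).

```python
import numpy as np

def adaptive_drop(feature_map, gamma=0.8):
    """Self-attention guided adaptive dropping (illustrative sketch).

    feature_map: array of shape (C, H, W); gamma: threshold ratio.
    Returns the dropped feature map and the binary drop mask.
    """
    # Channel-wise pooling: average over channels -> (H, W) self-attention map.
    attention = feature_map.mean(axis=0)
    # Threshold at gamma * max: the most discriminative positions get mask 0.
    threshold = gamma * attention.max()
    drop_mask = (attention <= threshold).astype(feature_map.dtype)
    # Broadcast the (H, W) mask across channels and erase the peak regions.
    return feature_map * drop_mask[None, :, :], drop_mask
```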

3.2.3 Attention Module

In addition to the above feature dropping module, which aims to mine non-discriminative features, we further introduce attention estimation to learn the most discriminative feature map. Attention mechanisms can help the network learn better feature representations of pedestrian images, as demonstrated in many previous works liu2017end ; li2018harmonious ; song2018mask ; xu2018attention ; li2018diversity . We introduce spatial and channel attention woo2018cbam to highlight the most discriminative features for person re-ID. Therefore, we can attain more robust feature representations by randomly selecting between the adaptive feature dropping module and the attention module when training the network.

Channel Attention: First, we take the feature map F ∈ R^{C×H×W} as input (C, H, and W denote the channel number, height, and width of the feature map respectively) and use the feature correlation between channels to generate channel attention. The channel information of the feature map is aggregated with global max pooling and global average pooling over the spatial dimensions (height and width). Therefore, we obtain two channel descriptions F^c_avg and F^c_max and feed them into a shared network to generate two feature descriptors. The shared network is a multi-layer perceptron (MLP) with one hidden layer. To reduce the parameter overhead, the hidden activation size is set to C/r, where r is the parameter reduction ratio. Then, we merge the two feature descriptors using element-wise summation and obtain the channel attention map M_c through the Sigmoid activation function woo2018cbam , i.e.,

M_c(F) = σ(MLP(F^c_avg) + MLP(F^c_max)),      (1)

where σ denotes the Sigmoid activation function.

Therefore, we obtain the attended feature map F′ by multiplying the channel attention map with the original feature map as

F′ = M_c(F) ⊗ F,      (2)

where ⊗ denotes element-wise multiplication. It is worth noting that the channel attention values are broadcast (copied) along the spatial dimensions before the multiplication.
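The channel attention computation of Eqs. (1)-(2) can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions: the MLP is bias-free with a ReLU hidden layer, and the weight matrices W1 (shape (C, C/r)) and W2 (shape (C/r, C)) are hypothetical stand-ins for the shared network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """Channel attention sketch: shared MLP on avg- and max-pooled
    channel descriptors, element-wise sum, Sigmoid, then re-weighting."""
    f_avg = F.mean(axis=(1, 2))                 # global average pooling -> (C,)
    f_max = F.max(axis=(1, 2))                  # global max pooling -> (C,)
    mlp = lambda v: np.maximum(v @ W1, 0) @ W2  # shared MLP, ReLU hidden layer
    M_c = sigmoid(mlp(f_avg) + mlp(f_max))      # channel attention map, Eq. (1)
    # Broadcast along the spatial dims and re-weight the feature map, Eq. (2).
    return M_c[:, None, None] * F
```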

Spatial Attention: Different from channel attention, spatial attention focuses on mining useful feature regions from the perspective of spatial coordinates. The spatial attention module takes the output F′ of the channel attention as input and produces two descriptions F^s_avg and F^s_max via channel-wise global average pooling and global max pooling respectively. These descriptions are fed into a convolutional layer f to obtain two feature descriptors, which are merged using element-wise summation; the spatial attention map M_s is then generated through a Sigmoid layer as woo2018cbam ,

M_s(F′) = σ(f(F^s_avg) + f(F^s_max)).      (3)

Thus, we obtain the final attended feature map F″ by multiplying the spatial attention map with the channel-attended feature map as

F″ = M_s(F′) ⊗ F′.      (4)

Similarly, the attention values are broadcast (copied) along the channel dimension before the multiplication.
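The spatial attention computation of Eqs. (3)-(4) can be sketched similarly. As a simplifying assumption, the convolutional layer on each pooled description is replaced by a scalar weight (a degenerate 1×1 convolution); the paper's convolution would mix local spatial neighborhoods.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(Fp, w_avg, w_max):
    """Spatial attention sketch: channel-wise avg/max pooling, a (degenerate)
    convolution on each description, element-wise sum, Sigmoid, re-weighting."""
    d_avg = Fp.mean(axis=0)                         # (H, W) channel-wise average
    d_max = Fp.max(axis=0)                          # (H, W) channel-wise max
    M_s = sigmoid(w_avg * d_avg + w_max * d_max)    # spatial attention map, Eq. (3)
    # Broadcast along the channel dimension before multiplication, Eq. (4).
    return M_s[None, :, :] * Fp
```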

3.2.4 Random Selection

Once the attended feature map and the erased feature map are obtained, how to use them for discriminative feature learning is another key aspect of the proposed approach. Inspired by the random selection mechanism in choe2019attention , we randomly select one of them for subsequent classification during training. On the one hand, the attention module learned in previous iterations improves the quality of the feature map and helps the feature dropping module erase the most discriminative features more accurately; therefore, we can learn better feature representations than with the adaptive feature dropping module alone. On the other hand, our model can capture local fine-grained features, which can be more effective than the attention module in challenging scenarios where the most discriminative features are inexplicit. The two branches can be trained simultaneously, but the features generated by the adaptive dropping module and the attention module are opposite in nature; concatenating them into one representation may weaken the attention branch or the local feature drop network. Therefore, we randomly select one branch to train in each iteration, as suggested in choe2019attention . Our experimental results also demonstrate that such an alternating learning approach can boost re-ID performance significantly.
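The per-iteration branch choice can be sketched as a single biased coin flip. The function name and label strings are illustrative; 0.25 is the dropping-branch probability reported in Section 4.2.

```python
import random

def select_branch(p_drop=0.25):
    """Random branch selection per training iteration (sketch).

    With probability p_drop the dropped feature map is used for
    classification in this iteration; otherwise the attended feature map is.
    """
    return "drop" if random.random() < p_drop else "attention"
```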

3.3 Training and Testing Phase

In the training phase, we adopt both a label prediction loss and a metric learning loss to train the proposed network:

L = L_id + L_tri.      (5)

Formally, the label prediction loss is defined as the cross-entropy,

L_id = − Σ_{i=1}^{C} y_i log(p_i),      (6)

where p_i is the i-th value of the output vector p, i.e., the probability that the sample belongs to the i-th category, C is the number of categories, and y is the ground-truth one-hot vector of dimension C.

For the metric learning loss, we adopt the soft-margin batch-hard triplet loss hermans2017defense , which aims to increase the distances between negative and anchor samples and decrease the distances between positive and anchor samples. The detailed formulation of this metric learning loss can be written as

L_tri = Σ_{i=1}^{P} Σ_{a=1}^{K} S( max_{p=1,…,K} D(f(x_a^i), f(x_p^i)) − min_{j≠i; n=1,…,K} D(f(x_a^i), f(x_n^j)) ),      (7)

where P and K indicate the number of person IDs and the number of images per ID respectively, and the soft-margin function S is defined as hermans2017defense ,

S(x) = log(1 + e^x).      (8)

Therefore, we have P × K triplets in a mini-batch. For each triplet, x_a^i and x_p^i denote the anchor and positive samples of identity i respectively, and x_n^j denotes a negative sample from a different identity j. D is the Euclidean distance function and f(x) represents the feature vector of sample x, obtained from the last fully connected layer of our network.
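A minimal sketch of the soft-margin batch-hard triplet loss, following hermans2017defense: for each anchor, take the hardest positive (largest distance) and hardest negative (smallest distance) within the mini-batch and apply the softplus soft margin. The function name and the small epsilon inside the square root are illustrative.

```python
import numpy as np

def batch_hard_triplet_loss(features, labels):
    """Soft-margin batch-hard triplet loss sketch.

    features: (N, d) embedding vectors; labels: (N,) person IDs.
    Returns the mean over anchors of log(1 + exp(d_pos_hardest - d_neg_hardest)).
    """
    N = features.shape[0]
    # Pairwise Euclidean distance matrix.
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2) + 1e-12)
    same = labels[:, None] == labels[None, :]
    losses = []
    for a in range(N):
        hardest_pos = dist[a][same[a]].max()    # farthest same-ID sample
        hardest_neg = dist[a][~same[a]].min()   # closest other-ID sample
        losses.append(np.log1p(np.exp(hardest_pos - hardest_neg)))
    return float(np.mean(losses))
```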

In the testing phase, we jointly utilize features from the global and attention branch as the embedding vector of a given pedestrian image. That is, the local feature drop network is only used in the training phase for robust feature learning.

3.4 Comparison with Related Works

The proposed SaADB method is most related to BDB dai2019batch , which proposes a batch dropblock network for person re-ID. Different from BDB dai2019batch , the proposed SaADB further employs a self-attention scheme choe2019attention to adaptively select the attentive regions to erase, which makes our features more discriminative and improves final re-ID performance significantly. SaADB also differs from ADL choe2019attention . First, SaADB is designed for person image representation and re-ID tasks, while ADL choe2019attention focuses on weakly supervised object localization. Second, SaADB exploits both channel and spatial attention for feature enhancement, while only self-attention (spatial) attention is used in ADL choe2019attention . The channel attention woo2018cbam mainly focuses on the channel information of the input, while the spatial attention woo2018cbam mainly focuses on positional information. Therefore, we can attain better feature representations with these two operations.

4 Experiments

In this section, we first introduce the datasets, evaluation metrics, and implementation details in Sections 4.1 and 4.2 respectively. Then, we compare the proposed SaADB with other state-of-the-art person re-ID methods, followed by ablation studies evaluating the effectiveness of each module in our method.

4.1 Datasets and Evaluation Metrics

We evaluate our model on three widely used person re-ID benchmark datasets: Market-1501 zheng2015scalable , DukeMTMC zheng2017unlabeled , and CUHK03 zhong2017re . We follow the same protocols as previous works liu2018pose ; zheng2017unlabeled ; dai2019batch ; Zhong_2018_CVPR . Two evaluation metrics are adopted, mAP and Rank-k zheng2015scalable .
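As a brief illustration of the ranking metric, Rank-k accuracy counts the fraction of queries whose k nearest gallery samples contain the correct identity. This sketch is ours (function and argument names are hypothetical), not part of the evaluation code used in the paper.

```python
import numpy as np

def rank_k_accuracy(dist, query_ids, gallery_ids, k=1):
    """Rank-k accuracy sketch.

    dist: (num_query, num_gallery) distance matrix; smaller = more similar.
    Returns the fraction of queries with a correct ID among the top-k matches.
    """
    order = np.argsort(dist, axis=1)[:, :k]   # k nearest gallery indices per query
    hits = [(gallery_ids[order[q]] == query_ids[q]).any()
            for q in range(len(query_ids))]
    return float(np.mean(hits))
```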

4.2 Implementation Details

In this paper, ResNet-50 he2016deepResidual is adopted as our backbone network with a slight modification, i.e., removing the down-sampling operation after layer-3. All person images are resized to 384 × 128. Our model size is 34.8 M and training is conducted on a PC with 4 GTX-1080 GPUs. The batch size is 128 with 32 identities in each batch. We use Adam kingma2014adam as the optimizer, with a dynamic learning rate that depends on the epoch index during the first 50 epochs; the learning rate is then further decayed after 200 and 300 epochs respectively. Our network is trained for 600 epochs. In the local feature drop network, 20% of the activation values in the feature maps are erased. In each iteration, the probability of selecting the dropping branch is 0.25; that is, the probability of choosing the attention branch is 0.75. In the testing phase, we jointly utilize the feature vectors from the global branch and the attention branch as the embedding vector of a pedestrian image.

Methods Reference mAP Rank-1 Rank-5 Rank-10
SVDNet sun2017svdnet ICCV2017 62.1 82.3 92.3 95.2
HydraPlus liu2017hydraplus ICCV2017 76.9 91.3 94.5 -
PDC* su2017pose ICCV2017 63.451 84.14 92.73 94.92
Mancs wang2018mancs ECCV2018 82.3 93.1 - -
HAP2SE yu2018hard ECCV2018 69.43 84.59 - -
PN-GAN qian2018pose ECCV2018 72.58 89.43 - -
SGGNN shen2018person ECCV2018 82.8 92.3 96.1 97.4
PABR suh2018part ECCV2018 79.6 91.7 - -
MGCAM song2018mask CVPR2018 74.33 83.79 - -
CamStyle+RE Zhong2018Camera CVPR2018 71.55 89.49 - -
ECN+PSE saquib2018pose CVPR2018 80.5 90.4 - -
HA-CNN li2018harmonious CVPR2018 75.7 91.2 - -
MLFN chang2018multi CVPR2018 74.3 90.0 - -
DuATM++ si2018dual CVPR2018 76.62 91.42 - -
DaRe(De)+RE+RR wang2018resource CVPR2018 86.7 90.9 - -
KPM+RSA+HG shen2018end CVPR2018 75.3 90.1 - -
AOS huang2018adversarially CVPR2018 70.43 86.49 - -
PCB sun2018beyond CVPR2019 83.0 93.4 - -
CASN(PCB) zheng2019re CVPR2019 82.8 94.4 - -
HOReID wang2020high CVPR2020 84.9 94.2 - -
SNR jin2020style CVPR2020 84.7 94.4 - -
BDB dai2019batch ICCV2019 86.3 94.7 - -
SaADB 86.7 95.2 97.9 98.6

Table 1: Comparison with other re-ID algorithms on Market-1501 dataset.
CUHK03-labeled CUHK03-detected
Methods Reference mAP Rank-1 mAP Rank-1
Mancs wang2018mancs ECCV2018 63.9 69.0 60.5 65.5
PN-GAN qian2018pose ECCV2018 - 79.76 - 67.65
MGCAM song2018mask CVPR2018 50.21 50.14 46.87 46.71
HA-CNN li2018harmonious CVPR2018 41.0 44.4 38.6 41.7
MLFN chang2018multi CVPR2018 49.2 54.7 47.8 52.8
DaRe(De)+RE+RR wang2018resource CVPR2018 74.7 73.8 71.6 70.6
CASN(PCB) zheng2019re CVPR2019 68.8 73.7 64.4 71.5
BDB dai2019batch ICCV2019 76.7 79.4 73.5 76.4
SaADB 80.0 83.2 76.2 79.3
Table 2: Comparison on the CUHK03 dataset (767/700 split).

4.3 Comparison with State-of-the-art Algorithms

In this section, we report our re-ID results and compare with other state-of-the-art approaches on three benchmark datasets, including Market-1501, CUHK03 and DukeMTMC dataset.

Results on the Market-1501 dataset. As shown in Table 1, BDB dai2019batch achieves 86.3% and 94.7% mAP and Rank-1 respectively, while our method obtains 86.7% and 95.2% on these two metrics. It is also worth noting that CASN zheng2019re is likewise developed by combining local and global features, achieving 82.8% and 94.4% mAP and Rank-1; our method significantly outperforms it. When obtaining fine-grained features, CASN zheng2019re only focuses on each local area by manually segmenting the feature maps, without targeted learning of local features, while our proposed SaADB focuses on learning non-discriminative features via the local feature dropping network and emphasizing the most discriminative features via the attention module. Therefore, the features extracted by the SaADB network are more powerful and achieve better re-ID performance. In addition, our model does not require the division of local features, which makes it more efficient than CASN zheng2019re . Mancs wang2018mancs introduces a multi-branch network that combines an attention module and a global module for person re-ID, achieving 82.3% and 93.1% mAP and Rank-1. Benefiting from the attention module, the local feature dropping network, and the global module, our proposed SaADB significantly beats Mancs wang2018mancs . Our algorithm also performs better on this benchmark than some recent and strong re-ID algorithms published at CVPR 2020, including HOReID wang2020high (mAP/Rank-1: 84.9/94.2) and SNR jin2020style (mAP/Rank-1: 84.7/94.4).

Results on the CUHK03 dataset. As shown in Table 2, the mAP and Rank-1 accuracy of our model reach 80.0% and 83.2% respectively on CUHK03-labeled, and 76.2% and 79.3% on CUHK03-detected. The baseline approach BDB dai2019batch achieves 76.7% and 79.4% mAP and Rank-1 on CUHK03-labeled, so our results are 3.3% and 3.8% higher. Our results are also significantly better than BDB dai2019batch on CUHK03-detected, where BDB achieves 73.5% and 76.4% mAP and Rank-1. BDB randomly erases the feature maps to obtain local features; this simple random dropping operation cannot reliably discard the discriminative regions of the original feature maps, and can therefore only obtain sub-optimal samples for training. Our proposed SaADB generates effective local features using the self-attention guided adaptive feature dropping module, which makes our model focus more on the non-discriminative features. In addition, we introduce the spatial and channel attention modules to obtain more discriminative features in the testing phase. These modules all contribute to our final re-ID performance, which significantly outperforms all the compared state-of-the-art person re-ID approaches.

Results on the DukeMTMC dataset. As shown in Table 3, our approach achieves 77.1% mAP and 89.1% Rank-1 on the DukeMTMC dataset, which is significantly better than the compared state-of-the-art algorithms, including PCB sun2018beyond (73.4% and 84.1% mAP and Rank-1) and BDB dai2019batch (76.0% and 88.7%). These results consistently confirm the capability of our model for fine-grained local feature learning. It is worth noting that our re-ID algorithm also outperforms the recent HOReID and SNR, as shown in Table 3, which fully demonstrates the advantages of the proposed algorithm.

Methods Reference mAP Rank-1
HAP2SE yu2018hard ECCV2018 59.58 76.08
PN-GAN qian2018pose ECCV2018 53.2 73.58
SGGNN shen2018person ECCV2018 68.2 81.1
PABR suh2018part ECCV2018 69.3 84.4
CamStyle+RE Zhong2018Camera CVPR2018 57.61 78.32
PSE+ECN saquib2018pose CVPR2018 75.5 84.5
HA-CNN li2018harmonious CVPR2018 63.8 80.5
MLFN chang2018multi CVPR2018 62.8 81.0
DuATM++ si2018dual CVPR2018 64.58 81.82
DaRe(De)+RE+RR wang2018resource CVPR2018 80.0 84.4
KPM+RSA+HG shen2018end CVPR2018 63.2 80.3
AOS huang2018adversarially CVPR2018 62.1 79.17
PCB sun2018beyond CVPR2019 73.4 84.1
CASN(PCB) zheng2019re CVPR2019 73.7 87.7
HOReID wang2020high CVPR2020 75.6 86.9
SNR jin2020style CVPR2020 72.9 84.4
BDB dai2019batch ICCV2019 76.0 88.7
SaADB 77.1 89.1
Table 3: Comparison on the DukeMTMC dataset.

4.4 Component Analysis

In this subsection, we conduct component analysis on DukeMTMC and Market-1501 datasets to evaluate the effectiveness of each module in our re-ID algorithm. Specifically, six kinds of variants of our model are implemented as shown in Table 4 and Table 5.

Global: the global branch is used for feature learning;

Drop: the local feature drop network is adopted for robust feature learning;

Attention: the attention module is adopted for discriminative feature learning.

As shown in Table 4, the baseline, i.e., using only the Global branch for feature learning, is clearly inferior to the combined variants. Integrating the Drop branch with the Global branch improves both evaluation metrics, and jointly introducing the Global and Attention modules likewise improves the results. These experiments validate the effectiveness of the proposed Drop and Attention branches for person re-ID. After integrating all three modules, the mAP and Rank-1 are further boosted to the best 77.1%/89.1%. Similar conclusions can be drawn from Table 5, which consistently validates the effectiveness of each component of our model.

Global Drop Attention mAP Rank-1
74.9 87.5
73.0 86.4
73.4 87.7
75.4 88.0
76.6 88.6
77.1 89.1

Table 4: Component analysis on the DukeMTMC dataset

Global Drop Attention mAP Rank-1
84.4 93.7
84.4 93.5
84.1 94.0
84.8 94.4
85.7 93.7
86.7 95.2

Table 5: Component analysis on the Market-1501 dataset

4.5 Visualization

In addition to the aforementioned quantitative analysis, we provide some visualizations to show the advantages of the proposed modules. As shown in Figure 4, our activation maps consistently focus on the most discriminative regions of the target person, which corresponds to the more robust and discriminative feature learning of our method for person re-ID. As shown in Figure 5, our person re-ID algorithm is clearly better than the baseline method BDB dai2019batch . These visualizations intuitively verify the effectiveness of the proposed fine-grained local feature learning scheme.

Figure 4: Visualization of the attention maps of our model and the baseline approach. Red regions represent areas with high response and blue regions represent areas with low response.
Figure 5: Visualization of person re-ID results by our proposed algorithm and the baseline method BDB. Images highlighted in blue are correct results, and images highlighted in red are incorrect results.

5 Conclusion

In this paper, a novel self-attention guided adaptive dropblock network (SaADB) is proposed for robust feature representation learning and person re-ID. Our feature learning framework contains three modules: a global branch, a local feature drop network, and an attention branch. To learn more detailed information, we introduce the feature dropping module to erase the most discriminative features. Different from the random feature dropping networks used in many previous works, we erase features according to learned attention maps, which is more effective and accurate. In addition, we utilize the attention mechanism to emphasize the most discriminative local features. The feature dropping module and the attention module are trained in an alternating manner via random selection. Extensive experiments on three large-scale person re-ID benchmark datasets demonstrate the effectiveness of the proposed SaADB method.

References