Parameter-Free Spatial Attention Network for Person Re-Identification

11/29/2018 ∙ by Haoran Wang, et al. ∙ Max Planck Society Xidian University 12

Global average pooling (GAP) allows to localize discriminative information for recognition [40]. While GAP helps the convolution neural network to attend to the most discriminative features of an object, it may suffer if that information is missing e.g. due to camera viewpoint changes. To circumvent this issue, we argue that it is advantageous to attend to the global configuration of the object by modeling spatial relations among high-level features. We propose a novel architecture for Person Re-Identification, based on a novel parameter-free spatial attention layer introducing spatial relations among the feature map activations back to the model. Our spatial attention layer consistently improves the performance over the model without it. Results on four benchmarks demonstrate a superiority of our model over the state-of-the-art achieving rank-1 accuracy of 94.7 DukeMTMC-ReID, 74.9



There are no comments yet.


page 2

page 3

page 4

page 5

page 6

page 7

page 9

page 10

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

The aim of Person Re-Identification (Re-ID) is to match people from different camera viewpoints. The task is challenging because of significant pose-variations, frequent occlusions, and different camera viewpoints.

Figure 1: Class activation maps (CAM) for four training images. In each example, from left to right are the original image, the CAM from GAP with our proposed parameter-free spatial attention and the CAM from plain GAP. The highlighted area is less concentrated with the help of the spatial attention.

Global Average Pooling (GAP) [14] is a well-known technique to reduce the number of parameters in the fully-connected layer. GAP often improves model performance due to a regularization effect of the capacity. Zhou et al. [40] analyzed the ability of the GAP layer to localize important regions from the input image even when the convolutional neural network is merely trained on image-level labels. This localization can be thought of as an implicit attention mechanism. At the same time, they proposed the class activation map (CAM) as a generic visualization method to show which regions of the image the model attends to when making a prediction. The third image in each example in figure 1 shows some class activation maps using plain GAP when the corresponding person is recognized. We can see that the highlighted area from GAP is always spatially concentrated. If the model fails to put the concentration onto the most discriminative region or if this region is missing in another camera view, then it may fail to identify that person. For example, GAP focuses on the waist of the first person in figure 1, and if the person turns around in another image, the model might fail to match that person. This issue is less pronounced for general image classification as the classes are relatively distinctive. Therefore, the degradation from the absence of certain features is not necessarily fatal. But the Person Re-ID task is rather a fine-grained classification task, where the combination of many details and feature patterns of the object matter. Thus, the model should focus on the overall feature patterns and their relations. We will show that the spatially concentrated attention of GAP is due to the lack of the spatial relation among the activations on the feature map in section 3.3 and we thus propose a new model to re-introduce spatial relations.

Attention-based methods are particularly promising to model spatial relations [12, 22, 31, 19, 33, 23]. Attention mechanisms [1] equip the model with the ability by which it can pay attention to the most informative regions rather than the entire image during recognition. Inspired by these mechanisms, we proposes a parameter-free spatial attention layer to incorporate the spatial relation for GAP. The second image in each example in figure 1 shows the change of the class activation maps when our spatial attention layer is employed. From the images we can observe that the attention is, on the one hand, more distributed over the image than for GAP, but, on the other hand, also attends to regions that are attended to by GAP. Such attention thus can be more robust to partial occlusions or camera viewpoint changes.

Figure 2:

The structure of a common classifier with GAP is shown on the right; the modification on the left is that a spatial attention layer (SA) is inserted before GAP.

Figure 2 shows where the spatial attention layer (SA) is added for a common GAP-based classifier. Instead of passing the feature map from the last convolution layer through GAP before a fully-connected layer, we first assign different importance to different spatial positions with the help of the spatial attention layer. The importance of a spatial position is computed based on the total intensity of the activations along the corresponding channel. Therefore, it regularizes the GAP in a complementary way where all learned filters compete with each other to become more important. This leads to a distributed attention, thus covering more details of the image. We will show in section 3.3 that this is achieved by introducing the spatial relation on the feature map. From another perspective, the importance of an activation of a GAP-based model is only determined by its own intensity. In our case, however, it is also enhanced by other activations along the same channel, which in turn stabilizes the gradient flow, discussed in section 3.4.

Another advantage of the spatial attention layer over the previous attention-based methods is that this module is entirely parameter-free. The increase of the calculation is only proportional to the number of times it appears in a model, which means it doesn’t grow with respect to the scale of the dataset or the model size. Despite its simple structure, we show experimentally in section 5.3 that it improves the performance consistently over the model without it.

Overall, our contribution is threefold. 1) We introduce a parameter-free spatial attention layer which shows consistent improvement for GAP. 2) We propose a novel architecture with the spatial attention layer for Person Re-ID and achieve state-of-the-art performance on four benchmarks. 3) We compare the difference of GAP with and without spatial attention analytically for a better understanding.

2 Related Work

Weakly Supervised Learning.

There are many weakly supervised learning approaches using various pooling operations to localize objects, thus making use of more details on the feature map for better classification

[6, 16, 13, 15].

Zhou et al. [40] first showed the ability of GAP to localize the most discriminative image region. Instead of GAP, WELDON [7] provides a more robust and automatic activation selection strategy for pooling by selecting multiple high and low score regions from the last feature map, which is a generalization of the min + max prediction function in [6]. However, they just selected top positive and negative instances with the highest and lowest activations and then aggregated them. Different from them, we use all of the activations on the last feature map while having a fixed budget for the total importance by the softmax function. Note, that both are parameter-free methods.

[5] also adopted a similar idea for spatial pooling by introducing an extra hyper-parameter to trade off relative importance between positive and negative instances. But still, it doesn’t take all activations into consideration. In our case, the importance of an activation is determined by its relative magnitude to others.

Based on the fact that lower convolution layers learn redundant filters to extract both positive and negative information due to the positive output from ReLU, Shang

et al. [18] and Blot et al. [2]

proposed a Concatenated Rectified Linear Unit to preserve both positive and negative information by concatenating a negated copy of the same feature map before ReLU, thus preventing the model from learning highly negatively-correlated pairs of the filters. This has a similar effect as our spatial attention layer, which enforces the filters to compete with each other by setting a budget on the total importance with the help of the softmax function.

Figure 3: The proposed architecture formulates the task as a classification. It consists of four components. The yellow region represents the backbone feature extractor. The red region represents the deeply supervised branches (DS). The blue region represents six part classifiers (P) [25]. The two green region represents two sets of spatial attention layers (SA), SA1 is not used for the main results. It only appears in the section 5.3 for the sake of ablation study. Then the total loss is the summation over all deep supervision losses, six part losses and the loss from the backbone. Note that the spatial attention is only added before GAP.

Attention in Person Re-ID. The attention mechanism proposed in [1]

has shown great success for a broad range of computer vision tasks. There also exists many attention-based approaches for Person Re-ID.

One way to introduce attention is to insert a trainable layer into the main body of the model [12]. However, it allows the model to attend not only to the different spatial locations but also to the different channels with various weight, which means the model has to decide for every single activation inside a feature map whether it is worth to pay attention to it. This, on the one hand, gives the model a lot of flexibility to make use of the most valuable features, but on the other hand can be harmful if no proper regularization techniques is applied to reduce overfitting. Another similar work is from Hu et al. [9]

, where the feature map was first squeezed by GAP and then fed through two fully-connected layers before rescaled by a sigmoid function. The resulting vector had the same length as the number of channels of the original feature map, onto which each entry of the vector was multiplied back for the corresponding channel. It differs from our work in three aspects: we use softmax rather than sigmoid for the resulting vector; the attention in their work was mainly spread among the channels rather than the spatial positions; and our parameter-free module is only used before the last GAP layer.

Deep Supervision and Part-Level Feature. The idea of deep supervision was first proposed by Lee et al. in [10], where a progressively decay for the weight was adopted on the auxiliary losses which went to zero eventually. The auxiliary losses can lead to more discriminative intermediate features. This technique was introduced by Wang et al. [28] for the first time to the Person Re-ID task. They took the summation over all the losses as their final loss, which implicitly assigned equal weight to each individual loss. In this work, we use a fixed proportion for the main and auxiliary losses, namely, and . It achieves empirically similar results as the decay strategy in [10].

Part-level features have shown powerful performance in Person Re-ID due to the robustness to partial occlusion and camera view changes [25, 20, 35, 21]. Sun et al. showed a strong baseline model in [25], where the last feature map was uniformly partitioned into six horizontal stripes before being averaged into a single-channel part-level feature. Then six independent losses were applied for them, respectively. The final loss was simply the summation of all six losses. We leverage this technique along with a seventh global loss which is obtained by taking the six horizontal stripes as a whole.

3 Proposed Method

This section will propose a novel architecture for Person Re-ID that is, on the one hand, a novel combination of power-full components proposed in the literature, and, on the other hand, also contains our novel parameter-free spatial attention layer resulting in consistent improvements.

3.1 Network Architecture

Figure 4: Parameter-free spatial attention layer has no trainable parameter. It assigns different importance for different locations. It consists of a summation and a softmax. denotes the summation over the activations along the channel. The softmax values are computed over then multiplied to all activations from the same location.

We formulate the Person Re-ID task as a classification problem. Our overall architecture leverages several established and powerful techniques and is enhanced by our novel spatial attention layer. In particular we leverage the following techniques: we adopt a standard ResNet-50 up to stage 4 for the backbone as it has shown good performance for our task [25, 28, 32]; we also employ deep supervision due to its ability to also influence the lower feature layers of our network [28]; and we also partition the last layer according to PCB [25] which uniformly partitioned the last feature map from the backbone in order to produce a loss for each part-level feature. PCB is proven to give competitive results in Person Re-ID task, too. Therefore, combining the ideas from [28] and [25], we propose a novel architecture as shown in the figure 3. The yellow part is the backbone ResNet-50. The red part represents the deep supervision branches, for which the spatial attention layers (the left green part) are added before GAP. The blue part denotes the part classifiers from [25]. Each part classifier produces a loss based on the part-level feature. Note that the right spatial attention layer in green is not a part of the main model. It is only used in the ablation study in the section 5.3.

The input for every deeply supervised branch is a feature map . The spatial attention layer first assigns different importance for different locations. The processed feature map is then fed into a GAP layer followed by a fully-connected layer before computing the cross-entropy loss . The losses calculated from three intermediate feature maps will in the end be added to the final loss. Therefore, our final loss consists of three auxiliary losses, partition losses and the main loss, that is , where controls the proportion of losses. The uses the same weight with the auxiliary losses because the backbone can be also thought of as another deeply supervised branch. The only difference is that the feature vectors obtained from stage 4 will be used in inference.

During training, the yellow backbone first takes a image as the input, then the extracted feature map whose size is at the last layer is then vertically partitioned into segments, over each of which GAP is then performed. The resulting part-level feature is of size . Afterwards, each part-level feature is convolved with a shared filter into the shape before an independent linear classifier with cross-entropy loss is deployed. Up to now, different partition losses are computed. Taking all reshaped part-level features as a whole, another GAP is applied to obtain a feature which is then used to compute the main loss . The partition loss should be thought of as a dropout [24] for high-level features. It has a significant regularization effect on the model because each partition loss is independent of each other, thus forcing each part-level feature to be discriminative enough to yield the correct prediction. This is more efficient than random erasing [39] since the latter one is applied on the pixel level.

During inference, all the deeply supervised branches and part classifiers on top are dismissed. The backbone will then be used as a feature extractor, which is applied to all the query images and the images in gallery to obtain the features. Then a standard information retrieval is performed based on the Euclidean distance between these features.

3.2 Parameter-Free Spatial Attention Layer

The spatial relation among the activations is helpful to distribute model attention. However, GAP treats the activations on the same feature map equally despite their locations, making the model less robust to the absence of certain features. Our spatial attention layer can add this information back into the model. Figure 4 shows the details. The first operation is a summation over the channels for each spatial position, which indicates the importance of that position. To make the learned filter more competitive, softmax rather than sigmoid is then applied on the summed activations. This encourages different filters to learn different features, thus making the model more robust. The output from softmax is then used to rescale the magnitude of the activations on the original feature map. This implies that the importance of different activations at the same spatial position can be enhanced by other activations along that channel. All activations from the same spatial position are rescaled by the same weight while different spatial positions have different weights. The spatial relation is considered because the weight is proportional to the relative magnitude of the summed activations at the corresponding position.

Formally, given a feature map with size , a summation along the channel axis yields a 2-D matrix with size , where . Then a softmax function is applied to the flattened matrix in order to assign each spatial position a value indicating the degree of importance for that location. The afore-produced values will be multiplied to all the activations along the channel axis of for the corresponding spatial positions. Therefore, the output of a parameter-free spatial attention layer can be written as:


where .

3.3 Impact on the Class Activation Map

The class activation map (CAM) proposed in [40] visualizes the attention distribution over the input image when the model identifies a specific class. As shown in figure 1, the spatial attention layer helps to distribute the focus of the model to other regions besides the most discriminative one. The class activation map of GAP of the first person highlights the waist and the feet of that person. But with spatial attention, other parts of the body which are also informative are highlighted as well.

Following the notations from section 3.2, is the feature map from the last convolution layer. If a fully-connected layer is directly performed on

for classification, the logits

for a specific class is then given as:


where is the weight connecting the logits and . Therefore, the class activation map can be defined as:


If Global Average Pooling is employed, then the result after applying GAP can be computed as . The corresponding logits for class is then:


This leads to the following form of the class activation map:


The advantage of GAP lies in its regularization effect. The parameters in the fully-connected layer become after adding the GAP layer, which means GAP operations reduce the model capacity by enforcing to be the same for any spatial position and , thus preventing overfitting of the model. But from another aspect, it can also be harmful if the spatial relation across the high-level features is completely ignored, which resembles the spirit of bag-of-words methods [26], thus suffering from the same drawback. Ideally, we want to make a compromise between these two extreme cases, where GAP neglects the spatial relation while directly performing fully-connected layer lacks effective restrictions. To this end, the parameter-free spatial attention layer shows a combination of advantages from both sides.

The formula for the logits with the parameter-free spatial attention layer can be written as:


where is the output from the parameter-free spatial attention layer. Therefore, the class activation map will be:


It can be thought of as a factorization of where controls the inter-channel attention and controls the spatial attention. The assumption behind this is that the spatial and channel attention are independent to each other. This factorization decomposes the coupling between them and at the same time maintains the spatial term. The magnitude of depends on the number of activations and the intensity of each activation across all channels. Typically, large indicates a region of interest in the input image which contains lots of desired features. Thus more attention should be payed to it.

3.4 Backpropagation through Parameter-Free Spatial Attention Layer

This section aims to justify the parameter-free spatial attention layer in terms of the gradient flow.

Given the last feature map , the common strategy is to compute , and then logits for a specific class is . Therefore, the gradient for in the model is given as:

When the parameter-free spatial attention layer (SA) is used, then the forward pass will become , , . And the corresponding backward pass will be:

The only difference between (3.4) and (3.4) is an extra intermediate term that scales the gradients before it is passed further.

Following the notation defined in the section 3.2, we can compute the gradient of with respect to

using the chain rule:

where the latter part doesn’t vanish only if .

The derivative of the softmax function is given as:

Plugging in the above equations, we can conclude that

The derivation shows that the gradient which flows backwards from a certain activation will be effected by three different sources, that is, the activations at the same spatial position but different channel, itself and others. Since and are always non-negative, for the first two cases has the same sign as , which means that the activations from other channels but the same spatial position boost the gradient flow according to . Other activations, on the contrary, always have the opposite sign to

indicating a cancel-out effect to the gradient. Therefore, it stabilizes the gradient in a competitive way so that the training procedure suffers less from the vanishing or exploding gradient problem, which leads to fast convergence.

4 Experimental Setup

Datasets and Protocols. The experiments are evaluated on four large-scale datasets Market-1501 [34], DukeMTMC-ReID [36] [17] and CUHK03-NP [11].

Market-1501 contains 32,668 annotated bounding boxes of 1,501 identities which are taken from six different camera views. The identities are split into 751 training IDs and 750 query IDs. There are in total 3,368 query images, each of which is randomly selected from each camera so that cross-camera search can be performed. The retrieval for each query image is conducted over a gallery of 19,732 images including 6,796 junk images. DukeMTMC-ReID dataset uses the the same format and evaluation protocols as Market-1501. It contains 16,522 training images of 702 identities, 2,228 query images of the other 702 identities and 17,661 gallery images from 8 high-resolution cameras. CUHK03-NP is re-formulated from the old CUHK03 dataset. It is designed for the new training/testing protocol proposed by Market-1501 [34]. It contains 767 identities with 7,368 images and 700 identities with 5,328 for training and testing respectively. Each identity is observed from two non-overlapping cameras. The difficulty is thus increased since there are less ground truth images for a single query.

The evaluation protocol proposed in [34] is used for Market-1501 and DukeMTMC-ReID. The Cumulative Matching Characteristic (CMC) for rank-1, rank-5 and the mean average precision (mAP) are measured. The new protocol from [11] is employed for CUHK03-NP dataset. Note that all the following results are evaluated under the single-query mode.

Datasets Market-1501 DukeMTMC-ReID CUHK03 labeled CUHK03 detected
Metrics(%) R-1 mAP R-1 mAP R-1 mAP R-1 mAP
SVD-net [39] 89.1 83.9 84.0 78.3 - - 41.5 37.6
OIM loss [30] 82.1 60.9 68.1 47.4 - - - -
DPFL [3] 88.9 73.1 79.2 60.6 43.0 40.5 40.7 37.0
PAN [37] 82.2 63.3 71.6 51.6 36.9 35.0 36.3 34.0
FMN [4] 87.9 80.6 79.5 72.8 46.0 47.5 47.5 48.5
HPM [8] 94.2 82.7 86.6 74.3 - - 63.9 57.5
PCB+RPP [25] 93.8 81.6 83.3 69.2 - - 63.7 57.5
HA-CNN [12] 91.2 75.7 80.5 63.8 44.4 41.0 41.7 38.6
Mancs [27] 93.1 82.3 84.9 71.8 69.0 63.9 65.5 60.5
DaRe(R)+RE+RR [29] 90.8 85.9 84.4 79.6 72.9 73.7 69.8 71.2
Ours 94.7 91.7 89.0 85.9 74.9 76.5 69.7 72.2
Table 1: Comparison between our model and other state-of-the-art methods. The highest value for each category is marked in blue.

Implementation Details.We use a similar setting to [25]

. The backbone model is a pre-trained ResNet-50 from ImageNet. All three deeply supervised branches take the feature map from the corresponding convolution stage as input. The spatial attention layer and the GAP layer are applied before the fully-connected layer. Dropout is not used in the deeply supervised branches.

The feature map after the fourth convolution stage is treated differently according to [25]. It is first partitioned into six part-level features. Then average pooling is performed within each feature. The resulting feature map is then convolved by a filter to reduce the dimension of each feature vector from to before being processed further. The six part-level features also produce six independent losses.

The input person images are all resized to . Random erasing [39] and random horizontal flipping with 0.5 are applied as the data augmentation. Re-ranking strategy from [38] is also used to further improve the performance. We adopt the recommended number of partitioning from [25]

. Dropout rate for Market-1501 is 0.47 and 0.5 for other datasets. Pre-trained weights from Imagenet are used. Batch size is set to 48. Stochastic Gradient Descent (SGD) with momentum is deployed as the optimizer. The base learning rate starts from 0.1 and decays to 0.01 after 40 epochs. The learning rate for all pre-trained layers is set to 0.1 times the base learning rate. All models are trained until convergence. The final loss consists of 0.8 partition losses and 0.2 deeply supervised losses.

5 Experimental Results and Discussions

5.1 Main Results

Evaluated on the four aforementioned datasets, the proposed architecture achieves state-of-the-art performance on Market-1501, DukeMTMC-ReID and CUHK03-NP-labeled datasets regarding both rank-1 accuracy and mAP as shown in table 1. Especially on DukeMTMC-ReID, it shows a dramatic improvement of 3.3% for R-1 and 6.3% for mAP. For the CUHK03-NP-detected dataset, the model is slightly worse (0.1pp) than the best model but is still competitive and outperforms the third best method by a significant margin. But for the other datasets, our model achieves top performance by a large margin.

5.2 Analysis of the Class Activation Map

Figure 5: We visualize CAM of four stages for the model with and without the spatial attention. In both examples, the model with spatial attention shows a properly distributed focus for every stage.
Figure 6: Given a query image and two test images, (a) is taken from the same camera view with the query and (b) is from another one. The third column of each example shows the original class activation maps. The second column shows the class activation maps with the spatial attention. The failure cases are marked with a red box. All are taken from the stage 2.

Zhou et al. [40] used class activation maps (CAM) to illustrate the distribution of the contribution of different image regions when a certain class of object is recognized. Following the same scheme, we visualize the CAM of different stages and different examples in figure 5 and figure 6. As mentioned before, GAP with the spatial attention layers (SA) distributes the focus of the model, thus covers more details of the image. Since the feature map from each stage of the proposed model has a deeply supervised branch, we can visualize the CAM for different stages given an input image. Figure 5 compares the CAM using GAP with or without SA for each stage. GAP with SA shows a pattern of less concentrated attention in general no matter how deep the feature map is. The activations from stage 1 and 2 represents some low-level and intermediate-level features. However, in the stage 3 and 4, the receptive field of the activations can cover the entire width of the image. Therefore, the highlighted area on CAM is more elongated. But still, GAP with SA is more scattered along the vertical axis.

Since the CAM of stage 2 shows a more clear comparison, we study some failure cases of GAP under cross-camera matching and visualize the corresponding CAM for stage 2 in figure 6. The first test image is from the same camera view as the query image and the second test image is from another camera view. In each example, GAP and GAP with SA are both able to recognize the image from the same camera view. However, GAP fails to identify the person from another camera view while GAP with SA still recognizes it. The second and third rows show the CAM of the model with or without the spatial attention layer respectively. Because of lack of spatial relation, the attention of the model without SA is restricted to some certain region of the image. Therefore, the model tends to focus on the most discriminative part of the person, for example, the handbag of the lady in the first example. However, this can be harmful if the expected feature is missing in another camera view. From the class activation map at the lower right corner of the first example, we can see that even though the model tries to find the handbag on the image, the patterns it found look not so convincing to make the claim that it is the same person since the red color of the handbag from the front view is absent. Therefore it fails to recognize her. This is, however, not the case for GAP with SA since it covers a lot more different regions on the image. For example, the edge of the skirt and the haircut are also involved during the recognition. These features may not be discriminative enough alone, but the combination of them can alter the decision. This redundancy equips the model with robustness. The image (b) only shows the back of the lady, but the model with spatial attention can still recognize her correctly based on the edge of the skirt and the haircut. The same thing happened also in other examples.

Metric(%) SA1 SA2 R-1 mAP
no - 84.8 67.0
Backbone yes - 87.4 68.1
no no 86.2 68.3
no yes 86.6 68.8
Backbone + DS yes yes 87.1 69.4
no no 91.7 77.0
no yes 92.4 78.2
Backbone + P + DS yes yes 92.6 78.3
Table 2: The results are evaluated on Market-1501. P represents part classifiers; DS represents deep supervision; SA1 represents the spatial attention layer on the backbone; SA2 represents the spatial attention layer on the deeply supervised branches.

5.3 Ablation Study

We show in this section that the spatial attention improves the performance consistently. The ablation study is conducted on the Market-1501 dataset because it gives relatively stable results due to its large scale. Random erasing and re-ranking are not used here. All models are trained until convergence. For the baseline, we use the blue part in figure 3. It is basically just a normal ResNet-50 with one additional convolution layer at the very end. This convolution layer is to shrink the 2048-channel feature map to a 256-channel feature map. This baseline model gives 84.8% R-1 and 67.0% mAP. Based on this, we evaluate the performance gain of the spatial attention layers. The results are shown in the table 2. DS denotes the deep supervision branch in figure 3. P denotes the usage of part classifiers. SA1 and SA2 denote the spatial attention layer for the backbone and the deeply supervised branches respectively. The models with the spatial attention layers show a consistent improvement to the ones without it. This demonstrates the effectiveness of our proposed spatial attention.

Another experiment is shown in table 3 and compares the proposed architecture with or without the spatial attention layer across four datasets. Random erasing and horizontal flipping are used as data augmentation. The other settings are the same as the previous experiment. From table 3, a consistent improvement for models with spatial attention can be observed again. On the CUHK03 labeled dataset, the model equipped with spatial attention outperforms the attention-free model by a large margin of 2.4% for the rank-1 accuracy. The largest gain on mAP of 1.8% is for DukeMTMC-ReID.

Metrics(%) +SA -SA
R-1 93.3 92.5
Market-1501 mAP 81.7 79.6
R-1 84.3 83.4
DukeMTMC-ReID mAP 72.1 70.3
R-1 64.3 61.9
CUHK03 labeled mAP 61.4 60.3
R-1 58.9 57.8
CUHK03 detected mAP 57.5 55.8
Table 3: The proposed architecture from section 3.1 is used here. The models with the spatial attention shows a consistent improvement over the ones without it. All the models are trained with random erasing until convergence. Re-ranking is not applied.

6 Conclusion

We propose a parameter-free spatial attention layer to incorporate the spatial relation among the activations on the same feature map for GAP. It is computationally efficient and produces consistent improvement over the model without it. The mechanism behind this module is studied both analytically and empirically in order to understand the behavior. We expect a broader range of applications for it.