Recognition of pedestrian attributes, such as gender, age, and clothing style, has drawn extensive attention because of its great potential in video surveillance applications, such as face verification [facever], person retrieval [retrieval2, retrieval1], and person re-identification [layne2012pedestrian, Peng2016JointL, wang2018transferable]. Recently, methods based on Convolutional Neural Networks (CNN) [resnet, bn] have achieved great success in pedestrian attribute recognition by learning powerful features from images. Some existing works [deepmar, sudowe2015person] treat pedestrian attribute recognition as a multi-label classification problem and extract feature representations only from the whole input image. These holistic methods usually rely on global features, but regional features are more significant for fine-grained attribute classification.
Intuitively, attributes can be localized into some relevant regions in a pedestrian image. As illustrated in Figure 1 (b), when recognizing Longhair, it is reasonable to focus on the head-related regions.
Recent methods attempt to leverage attention localization to promote learning discriminative features for attribute recognition. A popular solution [hpnet, deepimb, zhuspatialreg] is to employ the visual attention mechanism to capture the most relevant features. These methods usually generate attention masks from certain layers and then multiply them with the corresponding feature maps to extract the attentive features. However, it is ambiguous which mask encodes a given attribute's location, and there is no specific mechanism that guarantees the correspondence between attributes and attention masks. As shown in Figure 1 (c), the learned attention mask attends to a broad region which is not specific to the required attribute Longhair. An alternative way is to leverage predefined rigid parts [zhu2015multi] or external part localization modules [li2018pose, liu2018localization, yang2016attribute, zhang2014panda]. Some works apply body-part detection [zhang2014panda], pose estimation [li2018pose, yang2016attribute], and region proposals [liu2018localization] to learn part-based local features. As shown in Figure 1 (d), these methods extract local features from the localized body parts (head, torso, and legs). However, most of them just fuse the part-based features with global features, which still fails to indicate the attribute-region correspondence and requires extra computational resources for sophisticated part localization.
Different from these methods, we propose a flexible Attribute Localization Module (ALM) that can automatically discover the discriminative regions and extract region-based feature representations in an attribute-specific manner. Specifically, the ALM consists of a tiny channel-attention sub-network to fully exploit the inter-channel dependencies of the input features, followed by a spatial transformer [stn] to localize the attribute-specific regions adaptively. Moreover, we embed multiple ALMs at different feature levels and introduce a feature pyramid architecture that integrates high-level semantics to reinforce the attribute localization at low levels. In addition, ALMs at different feature levels are trained with the same set of attribute supervisions, called deep supervision [lee2015deeply, wang2018resource], where the final predictions are obtained through a voting scheme that outputs the maximum responses across different feature levels. This voting scheme assumes that the best prediction comes from the feature level with the most accurate attribute region, without interference from negative features from inappropriate regions. The proposed framework is end-to-end trainable and requires only image-level annotations. The contributions of this work can be summarized as follows:
We propose an end-to-end trainable framework which performs attribute-specific localization at multiple scales to discover the most discriminative attribute regions in a weakly-supervised manner.
We propose a feature pyramid architecture by leveraging both low-level details and high-level semantics to enhance the multi-scale attribute localization and region-based feature learning in a mutually reinforcing manner. The multi-scale attribute predictions are further fused by an effective voting scheme.
We conduct extensive experiments on three publicly available pedestrian attribute datasets (PETA [deng2014pedestrian], RAP [rap], and PA-100K [hpnet]) and achieve significant improvement over the previous state-of-the-art methods.
2 Related Works
Pedestrian Attribute Recognition. Earlier pedestrian attribute recognition methods [deng2014pedestrian, layne2012pedestrian, zhu2013pedestrian] rely on hand-crafted features, such as color and texture histograms, and are trained separately. However, the performance of these traditional methods is far from satisfactory. More recently, methods based on Convolutional Neural Networks have achieved great success in pedestrian attribute recognition. Wang et al. [wang2019PARSurvey] give a brief review of these methods. Sudowe et al. [sudowe2015person] propose a holistic CNN model to jointly learn different attributes. Li et al. [deepmar] formulate pedestrian attribute recognition as a multi-label classification problem and propose an improved cross-entropy loss function. However, the performance of these holistic methods is limited due to the lack of consideration of the prior information in attributes. Some recent approaches attempt to exploit the spatial relations and semantic relations among attributes to further improve the recognition performance. These methods can be classified into three basic categories: (1) Relation-based: Some works [wang2017attribute, zhao2018grouping] exploit semantic relations to assist attribute recognition. Wang et al. [wang2017attribute] propose a CNN-RNN based framework to exploit the interdependency and correlation among attributes. Zhao et al. [zhao2018grouping] divide the attributes into several groups and attempt to explore the intra-group and inter-group relationships. However, these methods require manually defined rules, such as the prediction order and the attribute groups, which are hard to determine in real applications. (2) Attention-based: Some researchers [hpnet, deepimb, deepview, zhuspatialreg] introduce the visual attention mechanism in attribute recognition. Liu et al. [hpnet] propose a multi-directional attention model to learn multi-scale attentive features for pedestrian analysis. Sarafianos et al. [deepimb] extend the spatial regularization module [zhuspatialreg] to learn effective attention maps at multiple scales. Although recognition accuracy has been improved, these methods are attribute-agnostic and fail to take the attribute-specific information into consideration. (3) Part-based: The part-based methods usually extract features from some localized body parts. Zhu et al. [zhu2015multi] divide the whole image into 15 rigid patches and fuse features from different patches. Yang et al. [yang2016attribute] and Li et al. [li2018pose] leverage an external pose estimation module to localize body parts. Liu et al. [liu2018localization] also explore attribute regions in a weakly supervised manner, but they assign attribute regions to some fixed proposals generated by EdgeBoxes [Zitnick2014EdgeBL] in advance, which is neither fully adaptive nor end-to-end trainable. These methods rely either on predefined rigid parts or on sophisticated part localization mechanisms, which are less robust to pose variations and require extra computational resources. By contrast, the proposed method localizes the most discriminative regions in an attribute-specific manner, which is not considered in most of the existing works.
Weakly Supervised Attention Localization. In addition to pedestrian attribute recognition, the idea of performing attention localization without region annotations is also extensively investigated in other visual tasks. Jaderberg et al. [stn] propose the well-known Spatial Transformer Network (STN), which can extract attentional regions with any spatial transformation in an end-to-end trainable manner. Some recent works [li2017learning, li2018harmonious] adopt STN to localize body parts for person re-identification. Fu et al. [fu2017look] attempt to recursively learn discriminative regions for fine-grained image recognition. Wang et al. [wang2017multi] search for the discriminative regions with STN and LSTM for multi-label classification, but not in a label-specific manner. The proposed method is inspired by these works but can adaptively localize an individual informative region for each attribute.
Feature Pyramid Architecture. Several works exploit top-down or skip connections that incorporate features across levels, such as U-Net [ronneberger2015u] and the stacked hourglass network [newell2016stacked]. The proposed feature pyramid architecture is similar to Feature Pyramid Networks (FPN) [fpn], which have been studied in various object detection and segmentation models [shrivastava2016beyond, zhu2018bidirectional]. To the best of our knowledge, this work is the first attempt to employ these ideas to localize attentive regions for pedestrian attribute recognition.
3 Proposed Method
The overview of the proposed framework is illustrated in Figure 2. As shown, the proposed framework consists of a main network with feature pyramid structures, and a group of Attribute Localization Modules (ALM) applied to different feature levels. The input pedestrian image is first fed into the main network without additional region annotations, and a prediction vector is obtained at the end of the bottom-up pathway. The details of ALM are shown in Figure 3. Each ALM performs attribute localization and region-based feature learning for only one attribute at a single feature level. The ALMs at different feature levels are trained in a deep supervision manner. Formally, we are given an input pedestrian image $I$ along with its corresponding attribute labels $y = [y^1, y^2, \dots, y^M]^T$, where $M$ is the total number of attributes in the dataset and $y^m$, $m \in \{1, \dots, M\}$, is a binary label that indicates the presence of the $m$-th attribute if $y^m = 1$ and its absence otherwise. We adopt the BN-Inception [bn] architecture as the backbone network in our framework. In principle, the backbone can be replaced with any other CNN architecture. Implementation details are shown in Appendix A.
3.1 Network Architecture
The key idea of this work is to perform attribute-specific localization for improving attribute recognition. It is well known that features in deeper CNN layers have coarser resolutions. Even though we can precisely localize the attribute regions based on semantically stronger features, it is still difficult to extract region-based discriminative features since some finer details may disappear. In contrast, features in lower layers capture rich details but poor contextual information, resulting in unreliable attribute localization. Low-level details and high-level semantics are therefore complementary to each other. We thus propose a feature pyramid architecture, inspired by FPN-like models [fpn, zhu2018bidirectional], to enhance the attribute localization and region-based feature learning in a mutually reinforcing manner. As illustrated in Figure 2, the proposed feature pyramid architecture consists of a bottom-up pathway and a top-down pathway.
The bottom-up pathway, implemented by the BN-Inception network, consists of multiple inception blocks at different feature levels. In this paper, we conduct attribute localization with bottom-up features generated from three different levels: the incep_3b, incep_4d, and incep_5b blocks, which have increasing strides with respect to the input image. Each selected inception block is at the end of its corresponding stage, where blocks of the same stage keep the same feature map resolution, since we believe the last block of a stage should have the strongest features. Given an input image $I$, we denote the bottom-up features generated from the above blocks as $\phi_i(I)$, $i \in \{1, 2, 3\}$, whose spatial sizes decrease from the lowest to the highest level.
In addition, the top-down pathway contains three lateral connections and two top-down connections, as shown in Figure 2. The lateral connections are simply convolutional layers that reduce the channel dimensionality of the bottom-up features. The higher-level features are transmitted through the top-down connections and upsampled by nearest-neighbor interpolation. Afterward, features from adjacent levels are concatenated along the channel dimension. Since the highest-level features have no top-down connection, we only conduct dimensionality reduction for them. The combined features, denoted $X_i$ for level $i \in \{1, 2, 3\}$, are used for attribute-specific localization.
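The fusion step described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the nested-list `[C][H][W]` layout, the pointwise form of the lateral layer, and the choice to upsample the already reduced higher-level features (so that both concatenated halves have the same number of channels, as Section 3.2 states) are all assumptions made for illustration.

```python
def reduce_channels(feat, weights):
    """Pointwise projection of a [C_in][H][W] feature map.

    weights has shape [C_out][C_in]; it stands in for the lateral layer
    (assumed here to act on each pixel independently).
    """
    height, width = len(feat[0]), len(feat[0][0])
    return [
        [[sum(w * feat[c][y][x] for c, w in enumerate(row)) for x in range(width)]
         for y in range(height)]
        for row in weights
    ]


def upsample_nearest(feat, factor=2):
    """Nearest-neighbour upsampling: repeat every pixel `factor` times per axis."""
    out = []
    for channel in feat:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(factor)]
            rows.extend(list(wide) for _ in range(factor))
        out.append(rows)
    return out


def fuse_levels(bottom_up, lateral_weights, factor=2):
    """Return one combined feature map per level, ordered from low to high."""
    reduced = [reduce_channels(f, w) for f, w in zip(bottom_up, lateral_weights)]
    fused = []
    for i, own in enumerate(reduced):
        if i + 1 < len(reduced):
            # Channel concatenation: own details + upsampled higher-level semantics.
            fused.append(own + upsample_nearest(reduced[i + 1], factor))
        else:
            # Highest level has no top-down input: dimensionality reduction only.
            fused.append(own)
    return fused
```

With two levels whose lateral layers both output two channels, the lower level ends up with four channels at its own resolution, and the highest level keeps two.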
3.2 Attribute Localization Module
As mentioned in Section 1, several existing methods attempt to extract local features through attribute-agnostic visual attention, predefined rigid parts, or external part localization modules. However, these methods are not optimal since they overlook the significance of attribute-specific localization. As shown in Figure 1 (c, d), attentive regions belonging to different attributes are mixed together, which is inconsistent with the original intention of narrowing the attentive region to improve attribute recognition. We believe that attribute-specific localization is a better choice since it can disentangle the confused attention masks into several individual regions, each corresponding to a specific attribute. Moreover, the learned attribute-specific regions are more interpretable since we can observe the attribute-region correspondence intuitively. What we need is a mechanism that can learn an individual bounding box, representing the discriminative region, in the feature maps for a given attribute. The well-known RoI pooling technique [girshick2015fast] is inappropriate since it requires region annotations, which are not available in pedestrian attribute datasets. Inspired by the recent success of the Spatial Transformer Network (STN) [stn], we propose a flexible Attribute Localization Module (ALM) to automatically discover the discriminative regions for each attribute in a weakly-supervised manner. The overview of the proposed ALM is illustrated in Figure 3.
As shown, each ALM contains a spatial transformer layer that originates from STN. STN is a differentiable module capable of applying a spatial transformation to a feature map, such as cropping, translation, and scaling. In this paper, we adopt a simplified version of STN since we treat the attribute region as a simple bounding box, which can be realized through the following transformation:

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \begin{bmatrix} s_x & 0 & t_x \\ 0 & s_y & t_y \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$

where $s_x$, $s_y$ are scaling parameters and $t_x$, $t_y$ are translation parameters; the expected bounding box can be obtained from these four parameters. $(x_i^s, y_i^s)$ and $(x_i^t, y_i^t)$ are the source coordinates and target coordinates of the $i$-th pixel. To some extent, this simplified spatial transformer can be viewed as a differentiable RoI pooling, which is end-to-end trainable without region annotations. To accelerate convergence, we simply constrain $s_x$, $s_y$ to $(0, 1)$ and $t_x$, $t_y$ to $(-1, 1)$ by a sigmoid and a tanh activation, respectively.
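To make the box parameterization concrete, the following sketch maps four unconstrained values to scaling and translation parameters and then computes, for each target pixel, the source coordinates it samples from. The names `s_x`, `s_y`, `t_x`, `t_y`, the function names, and the use of normalized coordinates in [-1, 1] (the usual STN convention) are assumptions of this sketch, not details confirmed by the text.

```python
import math


def constrain(raw):
    """Map four unconstrained values to (s_x, s_y, t_x, t_y)."""
    def sigmoid(v):
        # Overflow-safe form of 1 / (1 + exp(-v)).
        return 0.5 * (1.0 + math.tanh(0.5 * v))

    s_x, s_y = sigmoid(raw[0]), sigmoid(raw[1])       # scales in (0, 1)
    t_x, t_y = math.tanh(raw[2]), math.tanh(raw[3])   # shifts in (-1, 1)
    return s_x, s_y, t_x, t_y


def sampling_grid(params, out_h, out_w):
    """Source coordinates (x_s, y_s) for every target pixel of the output."""
    s_x, s_y, t_x, t_y = params

    def normalised(index, size):
        # Target pixels are spread evenly over [-1, 1].
        return 0.0 if size == 1 else -1.0 + 2.0 * index / (size - 1)

    return [
        [(s_x * normalised(c, out_w) + t_x, s_y * normalised(r, out_h) + t_y)
         for c in range(out_w)]
        for r in range(out_h)
    ]


def bounding_box(params):
    """Region covered by the grid, as (x_min, y_min, x_max, y_max)."""
    s_x, s_y, t_x, t_y = params
    return (t_x - s_x, t_y - s_y, t_x + s_x, t_y + s_y)
```

With all four raw values at zero, the box is centered and covers half of each axis. A box whose center is shifted far enough can extend beyond [-1, 1], so a sampler built on this grid needs a padding rule for out-of-range coordinates.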
In addition, we introduce a tiny channel-attention sub-network, as shown in Figure 3. As mentioned above, the ALM takes the features combined from adjacent levels as input, where finer details and strong semantics take the same proportion (both have the same number of channels), which means they contribute equally to attribute localization. However, the expected proportion should vary from attribute to attribute. For example, more attention should be paid to details when recognizing finer attributes. Therefore, we introduce this channel-attention sub-network, similar to SE-Net [hu2018squeeze], to modulate the inter-channel dependencies.
Specifically, the input features $X_i$ pass through a series of linear and nonlinear layers, producing a weight vector for feature recalibration across channels. The reweighted features are obtained by channel-wise multiplying the weight vector with $X_i$, and an extra residual link is applied to preserve the complementary information. Subsequently, a fully-connected layer is applied to estimate the transformation matrix, and the region-based features sampled by bilinear interpolation are then used for attribute classification. We denote the prediction for the $m$-th attribute at the $i$-th level as $\hat{y}_i^m$.
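The recalibration step can be sketched as follows. The global average pooling, the ReLU between the two linear layers, and the sigmoid gate follow the SE-Net pattern that the text says the sub-network resembles; they are assumptions here, as are the function name and the weight shapes.

```python
import math


def channel_attention(feat, w_down, w_up):
    """Recalibrate a [C][H][W] feature map and add the input back.

    w_down has shape [C_hidden][C] and w_up has shape [C][C_hidden];
    they stand in for the two linear layers of the sub-network.
    """
    # One descriptor per channel (global average pooling).
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]
    # Linear -> ReLU -> linear -> sigmoid gives one weight per channel.
    hidden = [max(0.0, sum(w * p for w, p in zip(row, pooled))) for row in w_down]
    gates = [
        0.5 * (1.0 + math.tanh(0.5 * sum(w * h for w, h in zip(row, hidden))))
        for row in w_up
    ]
    # Channel-wise reweighting followed by the residual link.
    return [[[v * g + v for v in row] for row in ch] for ch, g in zip(feat, gates)]
```

Because of the residual link, a gate close to zero leaves a channel nearly unchanged instead of suppressing it, which matches the stated goal of preserving complementary information.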
3.3 Deep Supervision
As illustrated in Figure 2, four individual prediction vectors are obtained from three ALM groups and one global branch. We apply the deep supervision [lee2015deeply, wang2018resource] mechanism for training, where the four individual predictions are directly supervised by the ground-truth labels. During inference, the multiple prediction vectors are aggregated through an effective voting scheme that takes the maximum responses across different feature levels. The intuition behind this design is that each ALM should directly receive feedback about whether the localized region is accurate. If we only supervise the fused predictions (by maximum or averaging), the gradients are not informative enough about how each level performs, so some branches are trained insufficiently. The maximum voting scheme is applied to choose, for each attribute, the prediction from the level with the most accurate attribute region.
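The difference between the two fusion rules mentioned above can be shown on a toy example. The function names and the sample logits are illustrative only; since the sigmoid is monotonic, taking the maximum over logits or over probabilities selects the same branch.

```python
def vote_max(branch_logits):
    """Element-wise maximum over the branches, one value per attribute."""
    return [max(column) for column in zip(*branch_logits)]


def vote_mean(branch_logits):
    """Element-wise average over the branches, one value per attribute."""
    return [sum(column) / len(column) for column in zip(*branch_logits)]


if __name__ == "__main__":
    # Four branches, two attributes: each attribute has a single confident branch.
    logits = [[2.0, -1.0], [-3.0, -2.0], [-1.0, 0.5], [-2.0, -1.5]]
    print(vote_max(logits))   # positive logits survive
    print(vote_mean(logits))  # the confident branch is averaged away
```

In this example each attribute is detected by exactly one branch. The maximum keeps that response, while averaging pulls both attributes below zero.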
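Specifically, we adopt the weighted binary cross-entropy loss function [deepmar] at each stage, formulated as follows:

$$\mathcal{L}_i(\hat{y}_i, y) = -\frac{1}{M} \sum_{m=1}^{M} \gamma_m \left( y^m \log \sigma(\hat{y}_i^m) + (1 - y^m) \log\left(1 - \sigma(\hat{y}_i^m)\right) \right)$$

where $\gamma_m$ is the loss weight for the $m$-th attribute, determined by $a_m$, the prior class distribution of the $m$-th attribute; $M$ is the number of attributes; $i$ represents the $i$-th branch, where $i \in \{1, 2, 3, 4\}$; and $\sigma$ refers to the sigmoid activation. The total training loss is calculated by summing over the four individual losses: $\mathcal{L} = \sum_{i=1}^{4} \mathcal{L}_i$.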
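A minimal sketch of a weighted binary cross-entropy with deep supervision is given below. The per-attribute weights are passed in by the caller because the mapping from the prior class distribution to the weight is not spelled out here; the averaging over attributes and all function names are assumptions of the sketch.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def weighted_bce(logits, labels, weights):
    """Weighted binary cross-entropy of one branch, averaged over attributes."""
    total = 0.0
    for z, y, w in zip(logits, labels, weights):
        # -log(sigmoid(z)) = softplus(-z) and -log(1 - sigmoid(z)) = softplus(z).
        total += w * (y * softplus(-z) + (1 - y) * softplus(z))
    return total / len(labels)


def deep_supervision_loss(branch_logits, labels, weights):
    """Sum of the per-branch losses; every branch sees the same labels."""
    return sum(weighted_bce(logits, labels, weights) for logits in branch_logits)
```

Summing the branch losses means every branch receives its own gradient signal from the ground truth, which is the point of deep supervision as described above.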
4.1 Datasets and Evaluation Metrics
The proposed method is evaluated on three publicly available pedestrian attribute datasets: (1) The PETA dataset [deng2014pedestrian] consists of 19,000 images with 61 binary attributes and 4 multi-class attributes. Following previous works [deng2014pedestrian, deepview], the whole dataset is randomly partitioned into three subsets: 9,500 images for training, 1,900 for validation, and 7,600 for testing. We choose 35 attributes for evaluation according to their positive ratio. (2) The RAP dataset [rap] contains 41,585 images collected from 26 indoor surveillance cameras, where each image is annotated with 72 fine-grained attributes. Following the official protocol [rap], we split the whole dataset into 33,268 training images and 8,317 test images. Only 51 binary attributes, selected according to their positive ratio, are used for evaluation. (3) The PA-100K dataset [hpnet] is to date the largest dataset for pedestrian attribute recognition, containing 100,000 pedestrian images in total collected from outdoor surveillance cameras. Each image is annotated with 26 commonly used attributes. According to the official setting [hpnet], the whole dataset is randomly split into 80,000 training images, 10,000 validation images, and 10,000 test images.
We adopt two types of metrics for evaluation [rap]: (1) Label-based: we calculate the mean accuracy (mA) as the mean of the positive accuracy and negative accuracy for each attribute, averaged over all attributes. The mA criterion can be formulated as:

$$\mathrm{mA} = \frac{1}{2M} \sum_{i=1}^{M} \left( \frac{TP_i}{P_i} + \frac{TN_i}{N_i} \right)$$

where $M$ is the number of attributes; $P_i$ and $TP_i$ are the numbers of positive examples and correctly predicted positive examples of the $i$-th attribute, respectively; $N_i$ and $TN_i$ are defined similarly for negative examples. (2) Instance-based: we adopt four well-known criteria: accuracy, precision, recall, and F1 score; details are omitted.
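The label-based metric can be computed directly from binary predictions and labels, as in the sketch below. The function name and the handling of an attribute with no positive (or no negative) examples are choices made for this illustration.

```python
def mean_accuracy(preds, labels):
    """Label-based mean accuracy (mA).

    preds and labels are lists of per-example binary vectors, one entry
    per attribute.
    """
    num_attrs = len(labels[0])
    total = 0.0
    for i in range(num_attrs):
        pos = sum(1 for y in labels if y[i] == 1)
        neg = len(labels) - pos
        tp = sum(1 for p, y in zip(preds, labels) if y[i] == 1 and p[i] == 1)
        tn = sum(1 for p, y in zip(preds, labels) if y[i] == 0 and p[i] == 0)
        # Choice of this sketch: a class without examples contributes 0.
        total += (tp / pos if pos else 0.0) + (tn / neg if neg else 0.0)
    return total / (2 * num_attrs)
```

Because positive and negative accuracy are weighted equally for every attribute, a classifier that always predicts the majority class scores only 0.5 on a rare attribute, which is why mA is informative on class-imbalanced data.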
4.2 Effectiveness of Critical Components
As shown in Table 1, starting with the BN-Inception baseline, we gradually append each component and meanwhile compare it with several variants.
(1) Attribute Localization Module: We first evaluate the contribution of the simplified ALM (without the channel-attention sub-network) by embedding ALMs at the final layer (incep_5b).
The increased mA and F1 scores demonstrate the effectiveness of attribute-specific localization.
Based on this fact, we further embed multiple ALMs at different feature levels (incep_3b, 4d, 5b), and a greater improvement is achieved (78.89 mA and 79.50 F1, compared with 77.45 and 79.14 at a single level).
Considering the model complexity, we limit the number of levels to three in our framework.
(2) Top-down Guidance: Secondly, we evaluate the impact of the proposed feature pyramid architecture by comparing three variants, which differ in how they combine features from different levels.
The first one is implemented by element-wise adding the features from different levels, like the original FPN [fpn], but the performance decreases.
The poor results suggest that some essential information may disappear if we disregard the feature mismatching problem.
The improved concatenation version achieves better results in mA, which shows the success of high-level top-down guidance.
Moreover, the introduced channel-attention sub-network further improves mA to 80.61 by modulating the inter-channel dependencies.
(3) Deep Supervision: As mentioned in Section 3.3, the gradients obtained with only the supervision of the fused predictions are not informative enough about how each level performs, so some branches are trained insufficiently.
To address this problem, ALMs at different levels are trained with the deep supervision mechanism.
For inference, the experimental results suggest that the element-wise maximum is a better ensemble method than averaging, since some weaker responses are ignored in averaging.
Removing all ALMs while keeping the other components unchanged results in a significant drop (last row in Table 1), which further confirms the effectiveness of ALMs. Compared with the baseline, the final model achieves a remarkable performance, reaching 81.87 mA and 80.16 F1. Figure 4 shows the attribute-wise mA comparison between the proposed method and the baseline model on the RAP dataset. As shown, the proposed method achieves significant improvement on a number of attributes, especially some fine-grained attributes such as BaldHead, Hat, and Muffler. The accurate recognition of these attributes shows the effectiveness of the proposed attribute-specific localization module.
| Method | mA | F1 |
|---|---|---|
| ALM at Single Level (5b) | 77.45 | 79.14 |
| ALM at Multiple Levels (3b,4d,5b) | 78.89 | 79.50 |
| Top-down (Channel Attention) | 80.61 | 79.98 |
| Deep Supervision (Averaging) | 80.70 | 80.04 |
| Deep Supervision (Maximum) (Ours) | 81.87 | 80.16 |
| Ours w/o ALMs | 78.91 | 79.55 |
4.3 Visualization of Attribute Localization
Through the above quantitative evaluation, we can observe significant improvements on some fine-grained attributes. In this subsection, we visualize the localized attribute regions from different feature levels for qualitative analysis. In our implementation, the attribute regions are located within the feature maps, while the correspondence between a feature map pixel and an image pixel is not unique. For a relatively coarse visualization, we simply map a feature-level pixel to the center of its receptive field on the input image, as in SPPNet [he2015spatial]. As shown in Figure 5, we display several examples belonging to six different attributes, covering both abstract and concrete attributes. As we can see, the proposed ALMs can successfully localize the concrete attributes, such as Backpack, PlasticBag, and Hat, to the corresponding informative regions, despite extreme occlusions (a, c) or pose variations (e). When recognizing the more abstract attributes Clerk and BodyFat, the ALMs tend to explore larger regions, since these attributes often require high-level semantics from the whole image. In addition, a failure case is provided in Figure 5 (d). The ALMs fail to localize the expected regions at the two lower levels when recognizing BaldHead. We believe that this problem originates from the highly imbalanced data distribution, where only a very small fraction of images are annotated with BaldHead in the RAP dataset. Although the localized attribute regions are relatively coarse, they are still acceptable for recognizing attributes because they indeed capture the most discriminative regions with large overlap.
4.4 Different Attribute-Specific Methods
The most significant contribution of this work is the idea of localizing an individual informative region for each attribute, which we call attribute-specific localization and which was not well investigated in previous works. In this subsection, we conduct experiments to demonstrate the advantages of our proposed method by comparing it with other attribute-specific localization methods based on visual attention and predefined parts. Different from the attribute-agnostic attention masks and body parts illustrated in Figure 1, we extend them to attribute-specific versions for comparison. Firstly, we replace the proposed ALM with a spatial attention module while keeping the other components unchanged for a fair comparison. In detail, we generate individual attention masks for each attribute through a global cross-channel averaging layer and a convolutional layer, as in HA-CNN [li2018harmonious]. For the other comparison model, we divide the whole image into three rigid parts (head, torso, and legs) and extract part-based features with an RoI pooling layer, then manually define the attribute-part relations, e.g., recognizing Hat only from the head part. More details about the compared methods are given in Appendix B. Experimental results are listed in Table 2. As expected, the proposed method largely outperforms the other two methods in mA.
To better understand the differences, we visualize these localization results in Figure 6. As we can see, the attribute regions generated by ALMs are the most accurate and discriminative ones. Although the attention-based model achieves reasonable results, the generated attention masks may attend to irrelevant or biased regions. When recognizing Box, the attention masks fail to cover the expected regions, and we also observe that they tend to localize almost the same regions wherever the boxes are. By contrast, the proposed method can successfully handle location uncertainties and pose variations. We provide more visualization results in Figure S4.
To some extent, the methods relying on attention masks and rigid parts are at two extremes. The former attempts to completely cover the informative pixels in a highly adaptive way, but mostly fails since we have only image-level annotations. The latter totally discards the adaptive factors, which makes it less robust to pose variations. Therefore, the proposed method attempts to achieve a balance between these two extremes by constraining the attentional regions to several bounding boxes, which are relatively coarse but more interpretable and controllable.
Table 3: Quantitative comparisons against previous methods on the PETA and RAP datasets. We divide these methods into four groups: holistic methods, relation-based methods, attention-based methods, and part-based methods, from top to bottom. JRL* is the single-model version of JRL. The precision and recall metrics are not so reliable on class-imbalanced datasets, while the mA and F1 scores are more convincing. Best results are in bold. For the RAP dataset, we further provide comparisons on the number of parameters (#P) and complexity (GFLOPs).
4.5 Comparison with State-of-the-art Methods
In this subsection, we compare the performance of our proposed method against several state-of-the-art methods. As mentioned in Section 2, we divide these methods into four categories: (1) Holistic methods including ACN [sudowe2015person] and DeepMar [deepmar], which first adopted CNNs to jointly learn multiple attributes. (2) Relation-based methods including JRL [wang2017attribute] and GRL [zhao2018grouping], which both exploit the semantic relations with a CNN-RNN based model. (3) Attention-based methods including HP-Net [hpnet] and DIAA [deepimb], which rely on multi-scale attention mechanisms, and VeSPA [deepview], which performs view-specific attribute prediction through a coarse view predictor. (4) Part-based methods including the recently proposed PGDM [li2018pose] and LG-Net [liu2018localization], which rely on an external pose estimation or region proposal module.
Table 3 and Table 4 show the comparison results on three different datasets. The results suggest that our proposed method achieves superior performance compared with existing works under both label-based and instance-based metrics on all three datasets. Compared with the previous methods relying on attribute-agnostic attention or extra part localization mechanisms, the proposed method achieves a significant improvement across all datasets, which demonstrates the effectiveness of attribute-specific localization. Although our mA score on the PETA dataset is slightly lower than that of the relation-based method GRL, which uses a stronger Inception-v3 backbone network (with twice as many parameters as ours), we still outperform it on the other metrics and datasets. On the more challenging PA-100K dataset, the proposed method largely outperforms all previous works in both mA and F1. Notably, the proposed method surpasses the baseline model by a significant margin on all three datasets, especially on the label-based metric mA. Note that the proposed method often achieves a lower precision but a higher recall; these two metrics are not so reliable, especially on class-imbalanced datasets. Moreover, the two metrics are inversely correlated, i.e., an increase in one metric usually leads to a decrease in the other (e.g., by modulating the class weights in the loss function). The mA and F1 metrics are more appropriate for measuring the performance of an attribute recognition model. Our method consistently achieves the best results in these two metrics.
We provide a comparison of the computational cost of different methods (rightmost columns in Table 3) on the RAP dataset. In terms of the number of parameters, the trainable parameters of each ALM come from the STN module and the channel-attention module, and their number is determined by the number of input channels. As shown, the proposed model has far fewer trainable parameters than previous models. In terms of model complexity, even with 51 attributes, the proposed model is still lightweight, as only 0.17 GFLOPs are added to the backbone network. The reason is that ALM contains only FC layers (or 1×1 Conv), which involve far fewer FLOPs than 3×3 Conv layers. In general, the entire model is much more efficient than previous models.
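The cost argument above can be checked with a simple count of multiply-accumulate operations (MACs). The sketch assumes stride 1, output size equal to input size, and no bias terms; the function names and the sizes used in the example are hypothetical, not values from the paper.

```python
def conv_macs(height, width, c_in, c_out, kernel):
    """MACs of a stride-1 convolution that preserves the spatial size."""
    return height * width * c_in * c_out * kernel * kernel


def fc_macs(c_in, c_out):
    """MACs of a fully-connected layer (bias ignored)."""
    return c_in * c_out


if __name__ == "__main__":
    # Hypothetical sizes, chosen only to illustrate the ratio.
    h, w, c = 8, 4, 256
    print(conv_macs(h, w, c, c, 3) // conv_macs(h, w, c, c, 1))  # 3x3 vs 1x1
    print(fc_macs(c, 4))  # a layer predicting four box parameters
```

At equal channel counts a 3×3 convolution costs nine times as many MACs as a 1×1 convolution, and a fully-connected layer applied to a pooled vector has no spatial factor at all.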
We propose an end-to-end framework for pedestrian attribute recognition, which can automatically localize the attribute-specific regions at multiple feature levels. Moreover, we apply a feature pyramid architecture to enhance the attribute localization and region-based feature learning in a mutually reinforcing manner. Experimental results on PETA, RAP, and PA-100K datasets show that the proposed method can significantly outperform most of the existing methods. The extensive analysis suggests that the proposed method can successfully localize the most informative region for each attribute in a weakly-supervised manner.
Acknowledgements This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0700904, in part by the National Natural Science Foundation of China under Grant 61836014, and Grant 61620106010.
Appendix A Implementation Details
We adopt the BN-Inception model pretrained on ImageNet as the backbone network. The proposed framework is implemented in PyTorch and trained end-to-end with only image-level annotations. We adopt the Adam optimizer with momentum and weight decay, since it converges faster than SGD in our experiments. For the RAP and PA-100K datasets, we train the model for 30 epochs, and the learning rate is decayed by a constant factor at fixed epoch intervals. For the smaller PETA dataset, we double the training epochs. For data preprocessing, we resize the input pedestrian images to a fixed resolution and apply random horizontal mirroring and data shuffling for data augmentation.
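A step-wise learning rate schedule of the kind described above can be written as a one-line rule. The function name and parameter names are hypothetical, and the numbers in the example are placeholders rather than the settings used in the experiments.

```python
def step_decay(initial_lr, gamma, step, epoch):
    """Learning rate after multiplying by `gamma` once every `step` epochs."""
    return initial_lr * gamma ** (epoch // step)


if __name__ == "__main__":
    # Placeholder values, not the paper's hyperparameters.
    for epoch in (0, 9, 10, 25):
        print(epoch, step_decay(1.0, 0.5, 10, epoch))
```

In PyTorch the same behaviour is available through `torch.optim.lr_scheduler.StepLR`, which takes the optimizer, a `step_size`, and a `gamma`.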
Appendix B Different Attribute-Specific Methods
In Section 4.4, we compare the proposed method against the other two attribute-specific localization methods, including visual attention and rigid parts. Different from most existing attribute-agnostic attention-based and part-based methods, we build two attribute-specific models based on these ideas for comparison. Here we show the details of the compared models.
Attention Masks Model. We replace the proposed ALM with a spatial attention module while keeping the other components unchanged for a fair comparison. The spatial attention module is implemented by a tiny 3-layer sub-network, as shown in Figure S2, which is inspired by HA-CNN [li2018harmonious]. The input features at the $i$-th level (a certain layer in the backbone network, with three levels in total) are first fed into a cross-channel averaging layer, followed by a Conv-BatchNorm-ReLU block that generates the expected attention mask, which is used for localizing the $m$-th attribute at the $i$-th level. All channels share the identical spatial attention mask. Subsequently, the attentive features are obtained by channel-wise multiplying the attention mask with the input features, and the corresponding prediction is computed from the attentive features by a fully-connected layer. Each spatial attention module only serves one attribute at a single level, the same as in Figure 3.
Rigid Parts Model. For the attribute-specific part-based model, we replace ALM with a body-parts guided module, as shown in Figure S3. The key idea is to associate each attribute with a predefined body region, including the head, torso, legs, and the whole image; e.g., the LongHair attribute is associated with the head part. Since body-part annotations are unavailable in most pedestrian attribute datasets, we adopt an external pose estimation model to localize the body parts, which is inspired by SpindleNet [Zhao_2017_CVPR]. Specifically, we localize 14 human body keypoints for each pedestrian image using a pretrained pose estimation model [Zhao_2017_CVPR]. The pedestrian image is then divided into three body-part regions based on these keypoints, as shown in Figure S1. In the body-parts guided module (Figure S3), the body-part-based local features are extracted from the input features through an RoI pooling layer [girshick2015fast]. For attribute prediction, the most relevant features are selected according to the attribute-region correspondence listed in Table S1, e.g., recognizing Hat using features only from the head part.
We provide more localization results belonging to different attributes in Figure S4.