Improving Pedestrian Attribute Recognition With Weakly-Supervised Multi-Scale Attribute-Specific Localization

10/10/2019 ∙ by Chufeng Tang, et al. ∙ Beihang University

Pedestrian attribute recognition has been an emerging research topic in the area of video surveillance. To predict the existence of a particular attribute, it is necessary to localize the regions related to that attribute. However, region annotations are not available in this task, so carving out these attribute-related regions remains challenging. Existing methods apply attribute-agnostic visual attention or heuristic body-part localization mechanisms to enhance the local feature representations, while neglecting to employ attributes to define local feature areas. We propose a flexible Attribute Localization Module (ALM) that adaptively discovers the most discriminative regions and learns the regional features for each attribute at multiple levels. Moreover, a feature pyramid architecture is introduced to enhance the attribute-specific localization at low levels with high-level semantic guidance. The proposed framework does not require additional region annotations and can be trained end-to-end with multi-level deep supervision. Extensive experiments show that the proposed method achieves state-of-the-art results on three pedestrian attribute datasets: PETA, RAP, and PA-100K.




1 Introduction

Recognition of pedestrian attributes, such as gender, age, and clothing style, has drawn extensive attention because of its great potential in video surveillance applications, such as face verification [facever], person retrieval [retrieval2, retrieval1], and person re-identification [layne2012pedestrian, Peng2016JointL, wang2018transferable]. Recently, methods based on Convolutional Neural Networks (CNN) [resnet, bn] have achieved great success in pedestrian attribute recognition by learning powerful features from images. Some existing works [deepmar, sudowe2015person] treat pedestrian attribute recognition as a multi-label classification problem and extract feature representations only from the whole input image. These holistic methods usually rely on global features, but regional features are more significant for fine-grained attribute classification.

Intuitively, attributes can be localized into some relevant regions in a pedestrian image. As illustrated in Figure 1 (b), when recognizing Longhair, it is reasonable to focus on the head-related regions.

Figure 1: Attentive regions generated by different methods when recognizing the attribute Longhair. (a) The original input image. (b) Attribute-specific region generated by our proposed method, which is indeed localized into a head-related region. (c) Attention mask generated by attribute-agnostic attention methods [hpnet, deepimb, zhuspatialreg], which covers a broad region but not specific to Longhair. (d) Body parts generated by part-based methods [li2018pose, liu2018localization, yang2016attribute, zhang2014panda], which extract features from these body parts.

Recent methods attempt to leverage attention localization to promote learning discriminative features for attribute recognition. A popular solution [hpnet, deepimb, zhuspatialreg] is to employ the visual attention mechanism to capture the most relevant features. These methods usually generate attention masks from certain layers and then multiply them with the corresponding feature maps to extract the attentive features. However, it is ambiguous which mask encodes a given attribute's location, and there is no specific mechanism that guarantees the correspondence between attributes and attention masks. As shown in Figure 1 (c), the learned attention mask attends to a broad region which is not specific to the required attribute Longhair. An alternative is to leverage predefined rigid parts [zhu2015multi] or external part localization modules [li2018pose, liu2018localization, yang2016attribute, zhang2014panda]. Some works apply body-part detection [zhang2014panda], pose estimation [li2018pose, yang2016attribute], and region proposals [liu2018localization] to learn part-based local features. As shown in Figure 1 (d), these methods extract local features from the localized body parts (head, torso, and legs). However, most of them simply fuse the part-based features with global features, which still fails to indicate the attribute-region correspondence while requiring extra computational resources for sophisticated part localization.

Different from these methods, we propose a flexible Attribute Localization Module (ALM) that can automatically discover the discriminative regions and extract region-based feature representations in an attribute-specific manner. Specifically, the ALM consists of a tiny channel-attention sub-network that exploits the inter-channel dependencies of the input features, followed by a spatial transformer [stn] that localizes the attribute-specific regions adaptively. Moreover, we embed multiple ALMs at different feature levels and introduce a feature pyramid architecture that integrates high-level semantics to reinforce the attribute localization at low levels. In addition, ALMs at different feature levels are trained with the same set of attribute supervisions, called deep supervision [lee2015deeply, wang2018resource], and the final predictions are obtained through a voting scheme that outputs the maximum responses across different feature levels. The rationale for this voting scheme is that the best prediction for an attribute comes from the feature level with the most accurate attribute region, so taking the maximum avoids interference from features of inappropriate regions. The proposed framework is end-to-end trainable and requires only image-level annotations. The contributions of this work can be summarized as follows:

  • We propose an end-to-end trainable framework which performs attribute-specific localization at multiple scales to discover the most discriminative attribute regions in a weakly-supervised manner.

  • We propose a feature pyramid architecture by leveraging both low-level details and high-level semantics to enhance the multi-scale attribute localization and region-based feature learning in a mutually reinforcing manner. The multi-scale attribute predictions are further fused by an effective voting scheme.

  • We conduct extensive experiments on three publicly available pedestrian attribute datasets (PETA [deng2014pedestrian], RAP [rap], and PA-100K [hpnet]) and achieve significant improvement over the previous state-of-the-art methods.

Figure 2: Overview of the proposed framework. The input pedestrian image is fed into the main network with both bottom-up and top-down pathways. Features combined from different levels are fed into multiple Attribute Localization Modules (Figure 3), which perform attribute-specific localization and region-based feature learning. Outputs from different branches are trained with deep supervision and aggregated through an element-wise maximum operation for inference. $M$ is the total number of attributes. Best viewed in color.
Figure 3: Details of the proposed Attribute Localization Module (ALM), which consists of a tiny channel-attention sub-network and a simplified spatial transformer. The ALM takes the combined features as input and produces an attribute-specific prediction. Each ALM only serves one attribute at a single level.

2 Related Works

Pedestrian Attribute Recognition. Earlier pedestrian attribute recognition methods [deng2014pedestrian, layne2012pedestrian, zhu2013pedestrian] rely on hand-crafted features such as color and texture histograms, with attributes trained separately. However, the performance of these traditional methods is far from satisfactory. More recently, methods based on Convolutional Neural Networks have achieved great success in pedestrian attribute recognition. Wang et al. [wang2019PARSurvey] give a brief review of these methods. Sudowe et al. [sudowe2015person] propose a holistic CNN model to jointly learn different attributes. Li et al. [deepmar] formulate pedestrian attribute recognition as a multi-label classification problem and propose an improved cross-entropy loss function. However, the performance of these holistic methods is limited because they do not consider the prior information in attributes. Some recent approaches attempt to exploit the spatial relations and semantic relations among attributes to further improve the recognition performance. These methods can be classified into three basic categories:

(1) Relation-based: Some works [wang2017attribute, zhao2018grouping] exploit semantic relations to assist attribute recognition. Wang et al. [wang2017attribute] propose a CNN-RNN based framework to exploit the interdependency and correlation among attributes. Zhao et al. [zhao2018grouping] divide the attributes into several groups and attempt to explore the intra-group and inter-group relationships. However, these methods require manually defined rules, such as the prediction order and the attribute groups, which are hard to determine in real applications.

(2) Attention-based: Some researchers [hpnet, deepimb, deepview, zhuspatialreg] introduce the visual attention mechanism in attribute recognition. Liu et al. [hpnet] propose a multi-directional attention model to learn multi-scale attentive features for pedestrian analysis. Sarafianos et al. [deepimb] extend the spatial regularization module [zhuspatialreg] to learn effective attention maps at multiple scales. Although recognition accuracy has been improved, these methods are attribute-agnostic and fail to take the attribute-specific information into consideration.

(3) Part-based: The part-based methods usually extract features from some localized body parts. Zhu et al. [zhu2015multi] divide the whole image into 15 rigid patches and fuse features from different patches. Yang et al. [yang2016attribute] and Li et al. [li2018pose] leverage an external pose estimation module to localize body parts. Liu et al. [liu2018localization] also explore attribute regions in a weakly supervised manner, but they assign attribute regions to fixed proposals generated in advance by EdgeBoxes [Zitnick2014EdgeBL], so the method is neither fully adaptive nor end-to-end trainable. These methods rely either on predefined rigid parts or on sophisticated part localization mechanisms, which are less robust to pose variances and require extra computational resources. By contrast, the proposed method localizes the most discriminative regions in an attribute-specific manner, which is not considered in most of the existing works.

Weakly Supervised Attention Localization. In addition to pedestrian attribute recognition, the idea of performing attention localization without region annotations is also extensively investigated in other visual tasks. Jaderberg et al. [stn] propose the well-known Spatial Transformer Network (STN), which can extract attentional regions with any spatial transformation in an end-to-end trainable manner. Some recent works [li2017learning, li2018harmonious] adopt STN to localize body parts for person re-identification. Fu et al. [fu2017look] attempt to recursively learn discriminative regions for fine-grained image recognition. Wang et al. [wang2017multi] search for discriminative regions with STN and LSTM for multi-label classification, but not in a label-specific manner. The proposed method is inspired by these works but can adaptively localize an individual informative region for each attribute.

Feature Pyramid Architecture. Several works exploit top-down or skip connections that incorporate features across levels, such as U-Net [ronneberger2015u] and the stacked hourglass network [newell2016stacked]. The proposed feature pyramid architecture is similar to Feature Pyramid Networks (FPN) [fpn], which have been studied in various object detection and segmentation models [shrivastava2016beyond, zhu2018bidirectional]. To the best of our knowledge, this work is the first attempt to employ these ideas to localize attentive regions for pedestrian attribute recognition.

3 Proposed Method

The overview of the proposed framework is illustrated in Figure 2. As shown, the proposed framework consists of a main network with feature pyramid structures, and a group of Attribute Localization Modules (ALM) applied to different feature levels. The input pedestrian image is first fed into the main network without additional region annotations, and a prediction vector is obtained at the end of the bottom-up pathway. The details of ALM are shown in Figure 3. Each ALM only performs attribute localization and region-based feature learning for one attribute at a single feature level. The ALMs at different feature levels are trained in a deep supervision manner. Formally, we are given an input pedestrian image $I$ along with its corresponding attribute labels $y = [y^1, y^2, \dots, y^M]^T$, where $M$ is the total number of attributes in the dataset and $y^m$, $m \in \{1, \dots, M\}$, is a binary label that indicates the presence of the $m$-th attribute: $y^m = 1$ if the attribute is present, and $y^m = 0$ otherwise. We adopt the BN-Inception [bn] architecture as the backbone network in our framework. In principle, the backbone can be replaced with any other CNN architecture. Implementation details are shown in Appendix A.

3.1 Network Architecture

The key idea of this work is to perform attribute-specific localization for improving attribute recognition. It is well known that features in deeper CNN layers have coarser resolutions. Even though we can precisely localize the attribute regions based on semantically stronger features, it is still difficult to extract region-based discriminative features since some finer details may disappear. In contrast, features in lower layers capture rich details but poor contextual information, resulting in unreliable attribute localization. Low-level details and high-level semantics are therefore complementary to each other. We propose a feature pyramid architecture, inspired by FPN-like models [fpn, zhu2018bidirectional], to enhance the attribute localization and region-based feature learning in a mutually reinforcing manner. As illustrated in Figure 2, the proposed feature pyramid architecture consists of a bottom-up pathway and a top-down pathway.

The bottom-up pathway, implemented by the BN-Inception network, consists of multiple inception blocks at different feature levels. In this paper, we conduct attribute localization with bottom-up features generated from three different levels: the incep_3b, incep_4d, and incep_5b blocks, which have strides of $\{8, 16, 32\}$ pixels with respect to the input image, respectively. Each selected block is the last block of its stage (blocks of the same stage keep the same feature map resolution), since we believe the last block of a stage should have the strongest features. Given an input image $I$, we denote the bottom-up features generated from the above blocks as $\phi_i(I) \in \mathbb{R}^{H_i \times W_i \times C_i}$, $i \in \{1, 2, 3\}$. For $256 \times 128$ RGB input images, the spatial size $H_i \times W_i$ equals $32 \times 16$, $16 \times 8$, and $8 \times 4$, respectively.

In addition, the top-down pathway contains three lateral connections and two top-down connections, as shown in Figure 2. The lateral connections are simply used to reduce the dimensionality of the bottom-up features to $d$, where $d = 256$ in our implementation. The higher-level features are transmitted through the top-down connections and meanwhile go through an upsampling operation. Afterward, features from adjacent levels are concatenated as follows:

$$X_i = \left[ f(\phi_i(I)),\; g(X_{i+1}) \right], \quad i \in \{1, 2\}$$

where $f$ is a $1 \times 1$ convolutional layer for dimensionality reduction, $g$ refers to upsampling with nearest neighbor interpolation, and $[\cdot, \cdot]$ denotes channel-wise concatenation. Since the highest-level features have no top-down connection, we only conduct dimensionality reduction for $\phi_3(I)$:

$$X_3 = f(\phi_3(I))$$

The channel size of $X_i$ equals $\{3d, 2d, d\}$ for $i \in \{1, 2, 3\}$. The combined features $X_i$ are used for attribute-specific localization.
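The top-down fusion described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the helper names (`lateral`, `upsample_nearest`, `build_pyramid`) are hypothetical, the 1×1 convolution is modeled as a per-pixel linear map with random weights, the toy channel counts are arbitrary, and each level is assumed to concatenate its reduced features with the upsampled combined features of the level above.

```python
import numpy as np

def lateral(x, w):
    """Model a 1x1 convolution as a per-pixel linear map: (H, W, C_in) @ (C_in, d)."""
    return x @ w

def upsample_nearest(x, out_hw):
    """Nearest-neighbor upsampling of an (H, W, C) map to out_hw = (H_out, W_out)."""
    h, w = x.shape[:2]
    rows = np.arange(out_hw[0]) * h // out_hw[0]
    cols = np.arange(out_hw[1]) * w // out_hw[1]
    return x[rows][:, cols]

def build_pyramid(feats, weights):
    """feats: bottom-up maps ordered from low to high level.

    Returns the combined maps, one per level, in the same order.
    """
    combined = [None] * len(feats)
    # The highest level has no top-down input: dimensionality reduction only.
    combined[-1] = lateral(feats[-1], weights[-1])
    for i in range(len(feats) - 2, -1, -1):
        top = upsample_nearest(combined[i + 1], feats[i].shape[:2])
        combined[i] = np.concatenate([lateral(feats[i], weights[i]), top], axis=-1)
    return combined

rng = np.random.default_rng(0)
d = 4  # small stand-in for the reduced channel size
shapes = [(32, 16, 6), (16, 8, 10), (8, 4, 12)]  # toy (H, W, C) shapes, assumed
feats = [rng.standard_normal(s) for s in shapes]
weights = [rng.standard_normal((s[2], d)) for s in shapes]
pyramid = build_pyramid(feats, weights)
print([x.shape for x in pyramid])
```

Under this recursion the channel count grows by `d` at every step down the pyramid, which is why the lowest level ends up with the widest combined feature map.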

3.2 Attribute Localization Module

As mentioned in Section 1, several existing methods attempt to extract local features through attribute-agnostic visual attention, predefined rigid parts, or external part localization modules. However, these methods are not optimal since they overlook the significance of attribute-specific localization. As shown in Figure 1 (c,d), attentive regions belonging to different attributes are mixed together, which is inconsistent with the original intention of narrowing the attentive region to improve attribute recognition. We believe that attribute-specific localization is a better choice since it can disentangle the confused attention masks into several individual regions, each corresponding to a specific attribute. Moreover, the learned attribute-specific regions are more interpretable since we can observe the attribute-region correspondence intuitively. What we need is a mechanism that can learn an individual bounding box, representing the discriminative region, in the feature maps for a given attribute. The well-known RoI pooling technique [girshick2015fast] is inappropriate since it requires region annotations, which are not available in pedestrian attribute datasets. Inspired by the recent success of the Spatial Transformer Network (STN) [stn], we propose a flexible Attribute Localization Module (ALM) to automatically discover the discriminative regions for each attribute in a weakly-supervised manner. The overview of the proposed ALM is illustrated in Figure 3.

As shown, each ALM contains a spatial transformer layer originating from STN. STN is a differentiable module which is capable of applying a spatial transformation to a feature map, such as cropping, translation, and scaling. In this paper, we adopt a simplified version of STN since we treat the attribute region as a simple bounding box, which can be realized through the following transformation:

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \begin{bmatrix} s_x & 0 & t_x \\ 0 & s_y & t_y \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$

where $s_x$, $s_y$ are scaling parameters and $t_x$, $t_y$ are translation parameters; the expected bounding box can be obtained through these four parameters. $(x_i^s, y_i^s)$ and $(x_i^t, y_i^t)$ are the source coordinates and target coordinates of the $i$-th pixel. To some extent, this simplified spatial transformer can be viewed as a differentiable RoI pooling, which is end-to-end trainable without region annotations. To accelerate convergence, we simply constrain $s_x$, $s_y$ to $(0, 1)$ and $t_x$, $t_y$ to $(-1, 1)$ by a sigmoid and a tanh activation, respectively.
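A minimal NumPy sketch of such a box-shaped spatial transformer is given below. It is an illustration under assumptions rather than the paper's code: the function names (`box_theta`, `sample_region`) are hypothetical, coordinates are assumed to be normalized to [-1, 1] as in the original STN formulation, and out-of-range source coordinates are clamped to the border, which is a simplification chosen here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def box_theta(raw):
    """Map four unconstrained values to [[s_x, 0, t_x], [0, s_y, t_y]]."""
    s_x, s_y = sigmoid(raw[0]), sigmoid(raw[1])   # scales constrained to (0, 1)
    t_x, t_y = np.tanh(raw[2]), np.tanh(raw[3])   # translations constrained to (-1, 1)
    return np.array([[s_x, 0.0, t_x], [0.0, s_y, t_y]])

def sample_region(feat, theta, out_hw):
    """Bilinearly sample an (H, W, C) map on the grid defined by theta."""
    h, w = feat.shape[:2]
    # Regular grid of target coordinates in normalized [-1, 1] space.
    yt, xt = np.meshgrid(np.linspace(-1, 1, out_hw[0]),
                         np.linspace(-1, 1, out_hw[1]), indexing="ij")
    target = np.stack([xt.ravel(), yt.ravel(), np.ones(xt.size)])  # (3, N)
    xs, ys = theta @ target                                        # source coords, (N,) each
    # Normalized source coordinates -> pixel coordinates, clamped to the map (assumption).
    px = np.clip((xs + 1) * (w - 1) / 2, 0, w - 1)
    py = np.clip((ys + 1) * (h - 1) / 2, 0, h - 1)
    x0, y0 = np.floor(px).astype(int), np.floor(py).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = (px - x0)[:, None], (py - y0)[:, None]
    out = (feat[y0, x0] * (1 - wx) * (1 - wy) + feat[y0, x1] * wx * (1 - wy)
           + feat[y1, x0] * (1 - wx) * wy + feat[y1, x1] * wx * wy)
    return out.reshape(out_hw[0], out_hw[1], -1)

feat = np.arange(32, dtype=float).reshape(8, 4, 1)
theta = box_theta(np.zeros(4))            # s_x = s_y = 0.5, t_x = t_y = 0: a centered box
region = sample_region(feat, theta, (4, 2))
print(region.shape)
```

Because every step is a composition of differentiable operations on the four parameters, the same computation written in an autodiff framework would let the classification loss adjust the box without any region labels.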

In addition, we also introduce a tiny channel-attention sub-network, as shown in Figure 3. As mentioned above, the ALM takes the features combined from adjacent levels as input, where finer details and strong semantics take the same proportion (both have $d$ channels), which means they contribute equally to attribute localization. However, the expected proportion should vary from attribute to attribute. For example, more attention should be paid to details when recognizing finer attributes. Therefore, we introduce this channel-attention sub-network, similar to SE-Net [hu2018squeeze], to modulate the inter-channel dependencies.

Specifically, the input features $X_i$ (the combined features at the $i$-th level) pass through a series of linear and nonlinear layers, producing a weight vector for feature recalibration across channels. The reweighted features are obtained by channel-wise multiplication of the weight vector with $X_i$, and an extra residual link is applied to preserve the complementary information. Subsequently, a fully-connected layer is applied to estimate the transformation matrix of the spatial transformer, and the region-based features sampled by bilinear interpolation are used for attribute classification. We simply formulate the prediction for the $m$-th attribute at the $i$-th level as:

$$\hat{y}_i^m = \mathrm{ALM}_i^m(X_i), \quad 1 \le m \le M, \; 1 \le i \le 3$$
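The channel recalibration step can be sketched as follows. The text only says "a series of linear and nonlinear layers", so the exact layer stack here (global average pooling, a bottleneck linear layer with ReLU, a second linear layer with sigmoid) is an assumption borrowed from the usual squeeze-and-excitation design; the function name `channel_attention` and the weight shapes are hypothetical.

```python
import numpy as np

def channel_attention(x, w1, w2):
    """Recalibrate an (H, W, C) feature map channel by channel.

    w1: (C, C // r) and w2: (C // r, C) are the two linear layers of an
    SE-style bottleneck (assumed structure, with reduction ratio r).
    """
    squeezed = x.mean(axis=(0, 1))                   # global average pooling -> (C,)
    hidden = np.maximum(squeezed @ w1, 0.0)          # linear + ReLU
    weights = 1.0 / (1.0 + np.exp(-(hidden @ w2)))   # linear + sigmoid, values in (0, 1)
    return x * weights + x                           # channel-wise reweighting + residual link

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 6))
out = channel_attention(x, rng.standard_normal((6, 2)), rng.standard_normal((2, 6)))
print(out.shape)
```

With the residual link, each channel is scaled by a factor between 1 and 2, so no channel is ever fully suppressed; the attention only shifts the relative emphasis between detail-heavy and semantics-heavy channels.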


3.3 Deep Supervision

As illustrated in Figure 2, four individual prediction vectors are obtained from three ALM groups and one global branch. We apply the deep supervision [lee2015deeply, wang2018resource] mechanism for training, where the four individual predictions are directly supervised by ground-truth labels. During inference, the prediction vectors are aggregated through a voting scheme that takes the maximum response across different feature levels. The intuition behind this design is that each ALM should directly receive feedback about whether its localized region is accurate. If we only supervise the fused predictions (maximum or averaging), the gradients are not informative enough about how each level performs, so some branches are trained insufficiently. The maximum voting scheme is applied to choose, for each attribute, the prediction from the level with the most accurate attribute region.

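Specifically, we adopt the weighted binary cross-entropy loss function [deepmar] at each stage, formulated as follows:

$$L_i(\hat{y}_i, y) = -\frac{1}{M} \sum_{m=1}^{M} \gamma^m \left( y^m \log \sigma(\hat{y}_i^m) + (1 - y^m) \log\left(1 - \sigma(\hat{y}_i^m)\right) \right)$$

where $\gamma^m = e^{-a^m}$ is the loss weight for the $m$-th attribute and $a^m$ is the prior class distribution of the $m$-th attribute, $M$ is the number of attributes, $y^m$ is the ground-truth label and $\hat{y}_i^m$ the prediction of the $m$-th attribute, $i$ represents the $i$-th branch, where $i \in \{1, 2, 3, 4\}$, and $\sigma$ refers to the sigmoid activation. The total training loss is calculated by summing over the four individual losses: $L = \sum_{i=1}^{4} L_i$.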
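The training loss and the inference-time voting can be sketched together in NumPy. This is a hedged illustration, not the released training code: the function names are hypothetical, the per-attribute loss weights are passed in as a precomputed vector `gamma`, and the loss is additionally averaged over the batch, which is an assumption made here for convenience.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_bce(logits, labels, gamma):
    """Weighted binary cross-entropy for one branch.

    logits, labels: (batch, M); gamma: (M,) per-attribute loss weights.
    Averages over attributes and, as an assumption, over the batch.
    """
    p = np.clip(sigmoid(logits), 1e-7, 1 - 1e-7)   # clip for numerical stability
    per_attr = -gamma * (labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return per_attr.mean()

def total_loss(branch_logits, labels, gamma):
    """Deep supervision: every branch is supervised by the same labels."""
    return sum(weighted_bce(l, labels, gamma) for l in branch_logits)

def vote_max(branch_logits):
    """Inference: element-wise maximum across branches, per image and attribute."""
    return np.max(np.stack(branch_logits), axis=0)

labels = np.array([[1.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
branches = [np.zeros((2, 3)) for _ in range(4)]   # three ALM groups + one global branch
print(total_loss(branches, labels, np.ones(3)))
print(vote_max([np.array([[0.1, 2.0]]), np.array([[1.5, -1.0]])]))
```

Since the sigmoid is monotonic, taking the maximum over logits and then applying the sigmoid gives the same result as taking the maximum over the per-branch probabilities.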

4 Experiments

4.1 Datasets and Evaluation Metrics

The proposed method is evaluated on three publicly available pedestrian attribute datasets: (1) The PETA dataset [deng2014pedestrian] consists of 19,000 images with 61 binary attributes and 4 multi-class attributes. Following the previous works [deng2014pedestrian, deepview], the whole dataset is randomly partitioned into three subsets: 9,500 for training, 1,900 for validation, and 7,600 for testing. We choose the 35 attributes whose positive ratio is higher than 5% for evaluation. (2) The RAP dataset [rap] contains 41,585 images collected from 26 indoor surveillance cameras, where each image is annotated with 72 fine-grained attributes. Following the official protocol [rap], we split the whole dataset into 33,268 training images and 8,317 test images. Only the 51 binary attributes with a positive ratio higher than 1% are selected for evaluation. (3) The PA-100K dataset [hpnet] is to date the largest dataset for pedestrian attribute recognition, containing 100,000 pedestrian images in total collected from outdoor surveillance cameras. Each image is annotated with 26 commonly used attributes. According to the official setting [hpnet], the whole dataset is randomly split into 80,000 training images, 10,000 validation images, and 10,000 test images.

We adopt two types of metrics for evaluation [rap]: (1) Label-based: we calculate the mean accuracy (mA) as the mean of positive accuracy and negative accuracy for each attribute, averaged over all attributes. The mA criterion can be formulated as:

$$mA = \frac{1}{2M} \sum_{m=1}^{M} \left( \frac{TP_m}{P_m} + \frac{TN_m}{N_m} \right)$$

where $M$ is the number of attributes and the counts are taken over all evaluation examples; $P_m$ and $TP_m$ are the number of positive examples and correctly predicted positive examples of the $m$-th attribute, respectively; $N_m$ and $TN_m$ are defined similarly for negative examples. (2) Instance-based: we adopt four well-known criteria: accuracy, precision, recall, and F1 score; details are omitted.
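As a concrete illustration of the label-based metric, the following NumPy sketch computes mA from binary prediction and ground-truth matrices. The function name `mean_accuracy` is hypothetical, and the sketch assumes every attribute has at least one positive and one negative example in the evaluation set (otherwise a division by zero occurs).

```python
import numpy as np

def mean_accuracy(pred, gt):
    """Label-based mean accuracy.

    pred, gt: binary arrays of shape (num_examples, num_attributes).
    """
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = (pred & gt).sum(axis=0)       # correctly predicted positives per attribute
    tn = (~pred & ~gt).sum(axis=0)     # correctly predicted negatives per attribute
    pos = gt.sum(axis=0)               # positives per attribute
    neg = (~gt).sum(axis=0)            # negatives per attribute
    return float(np.mean((tp / pos + tn / neg) / 2))

gt = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
pred = np.array([[1, 0], [0, 1], [0, 0], [0, 0]])
print(mean_accuracy(pred, gt))
```

In this toy example each attribute has a positive accuracy of 0.5 and a negative accuracy of 1.0, so the metric evaluates to 0.75. Because positives and negatives are weighted equally per attribute, mA is less dominated by the majority class than plain accuracy on imbalanced attributes.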

4.2 Effectiveness of Critical Components

As shown in Table 1, starting with the BN-Inception baseline, we gradually append each component and compare it with several variants.

(1) Attribute Localization Module: We first evaluate the contribution of the simplified ALM (without the channel-attention sub-network) by embedding ALMs at the final layer (incep_5b). The increased mA and F1 scores demonstrate the effectiveness of attribute-specific localization. Based on this fact, we further embed multiple ALMs at different feature levels (incep_3b, 4d, 5b), and a greater improvement is achieved (3.13 and 1.30 points over the baseline in mA and F1, respectively). Considering the model complexity, we limit the number of levels to three in our framework.

(2) Top-down Guidance: Secondly, we evaluate the impact of the proposed feature pyramid architecture by comparing three variants, which differ in how they combine features from different levels. The first one is implemented by element-wise addition of the features from different levels, like the original FPN [fpn], but the performance decreases. The poor results suggest that some essential information may disappear if we disregard the feature mismatching problem. The improved concatenation version achieves better results (improving mA by 1.04 points over the multi-level ALMs without top-down connections), which shows the success of high-level top-down guidance. Moreover, the introduced channel-attention sub-network further improves mA to 80.61 by modulating the inter-channel dependencies.

(3) Deep Supervision: As mentioned in Section 3.3, the gradients obtained with only the supervision of fused predictions are not informative enough about how each level performs, so some branches are trained insufficiently. To address this problem, ALMs at different levels are trained with the deep supervision mechanism. For inference, the experimental results suggest that element-wise maximum is a better ensemble method than averaging, since some weaker responses are suppressed by averaging.

Removing all ALMs while keeping the other components unchanged results in a significant drop (last row in Table 1), which further confirms the effectiveness of ALMs. Compared with the baseline, the final model improves mA and F1 by 6.11 and 1.96 points, respectively. Figure 4 shows the attribute-wise mA comparison between the proposed method and the baseline model on the RAP dataset. As shown, the proposed method achieves significant improvement on a number of attributes, especially some fine-grained attributes such as BaldHead, Hat, and Muffler. The accurate recognition of these attributes shows the effectiveness of the proposed attribute-specific localization module.

| Component | mA | F1 |
|---|---|---|
| Baseline | 75.76 | 78.20 |
| ALM at Single Level (5b) | 77.45 | 79.14 |
| ALM at Multiple Levels (3b,4d,5b) | 78.89 | 79.50 |
| Top-down (Addition) | 78.51 | 79.42 |
| Top-down (Concatenation) | 79.93 | 79.91 |
| Top-down (Channel Attention) | 80.61 | 79.98 |
| Deep Supervision (Averaging) | 80.70 | 80.04 |
| Deep Supervision (Maximum) (Ours) | 81.87 | 80.16 |
| Ours w/o ALMs | 78.91 | 79.55 |
Table 1: Performance comparisons on RAP dataset when gradually adding each proposed component to the baseline model (except the last row). Variants of the same component lie in the same group. Bold means the setting adopted in our final framework.
Figure 4: Attribute-wise mA comparison on RAP dataset between our proposed method and the baseline model. The bars are sorted in descending order according to the larger mA between the two models. We can observe significant improvements on some fine-grained attributes, such as BaldHead, Hat, and Muffler.
Figure 5: Visualization of attribute localization results at different feature levels. Best viewed in color.

4.3 Visualization of Attribute Localization

Through the above quantitative evaluation, we can observe significant improvements on some fine-grained attributes. In this subsection, we visualize the localized attribute regions from different feature levels for qualitative analysis. In our implementation, the attribute regions are located within the feature maps, while the correspondence between a feature map pixel and an image pixel is not unique. For a relatively coarse visualization, we simply map a feature-level pixel to the center of its receptive field on the input image, like SPPNet [he2015spatial]. As shown in Figure 5, we display several examples belonging to six different attributes, covering both abstract and concrete attributes. As we can see, the proposed ALMs successfully localize concrete attributes, such as Backpack, PlasticBag, and Hat, into the corresponding informative regions, despite extreme occlusions (a, c) or pose variances (e). When recognizing the more abstract attributes Clerk and BodyFat, the ALMs tend to explore larger regions, since these attributes often require high-level semantics from the whole image. In addition, a failure case is provided in Figure 5(d). The ALMs fail to localize the expected regions at the two lower levels when recognizing BaldHead. We believe that this problem originates from the highly imbalanced data distribution, where only a very small percentage of images are annotated with BaldHead in the RAP dataset. Although the localized attribute regions are relatively coarse, they are still acceptable for recognizing attributes because they capture the most discriminative regions with large overlap.

4.4 Different Attribute-Specific Methods

The most significant contribution of this work is the idea of localizing an individual informative region for each attribute, which we call attribute-specific localization and which was not well investigated in previous works. In this subsection, we conduct experiments to demonstrate the advantages of our proposed method by comparing it with other attribute-specific localization methods based on visual attention and predefined parts. Different from the attribute-agnostic attention masks and body parts illustrated in Figure 1, we extend them to attribute-specific versions for comparison. Firstly, we replace the proposed ALM with a spatial attention module while keeping everything else unchanged for a fair comparison. In detail, we generate an individual attention mask for each attribute through a global cross-channel averaging layer and a convolutional layer, like HA-CNN [li2018harmonious]. For the other comparison model, we divide the whole image into three rigid parts (head, torso, and legs) and extract part-based features with an RoI pooling layer, then manually define the attribute-part relations, for example recognizing Hat only from the head part. More details about the compared methods are shown in Appendix B. Experimental results are listed in Table 2. As expected, the proposed method largely outperforms the other two methods, improving mA by 3.52 points over the attention-mask model and by 5.31 points over the rigid-part model.

To better understand the differences, we visualize these localization results in Figure 6. As we can see, the attribute regions generated by ALMs are the most accurate and discriminative. Although the attention-based model achieves reasonable results, the generated attention masks may attend to irrelevant or biased regions. When recognizing Box, the attention masks fail to cover the expected regions, and we also observe that they tend to localize almost the same regions wherever the boxes are. By contrast, the proposed method can successfully handle location uncertainties and pose variances. We provide more visualization results in Figure S4.

To some extent, the methods relying on attention masks and rigid parts are at two extremes. The former attempts to completely cover the informative pixels in a highly adaptive way, but mostly fails since we have only image-level annotations. The latter totally discards the adaptive factors and is therefore less robust to pose variances. The proposed method attempts to achieve a balance between these two extremes by constraining the attentional regions to several bounding boxes, which are relatively coarse but more interpretable and controllable.

| Method | mA | F1 |
|---|---|---|
| Rigid Part | 76.56 | 78.84 |
| Attention Mask | 78.35 | 79.51 |
| Attribute Region | 81.87 | 80.16 |
Table 2: Experimental results of different attribute-specific localization methods evaluated on RAP dataset.
Figure 6: Case studies of different attribute-specific localization methods on three different attributes: Boots (Top), Glasses (Middle), and Box (Bottom). Different from Figure 1, the attention masks and body-parts are applied in an attribute-specific manner.
Dataset                       PETA                                    RAP
Method                        mA      Accu    Prec    Recall  F1      mA      Accu    Prec    Recall  F1      #P      GFLOPs
ACN [sudowe2015person]        81.15   73.66   84.06   81.26   82.64   69.66   62.61   80.12   72.26   75.98   -       -
DeepMar [deepmar]             82.89   75.07   83.68   83.14   83.41   73.79   62.02   74.92   76.21   75.56   58.5M   0.72
JRL [wang2017attribute]       85.67   -       86.03   85.34   85.42   77.81   -       78.11   78.98   78.58   -       -
JRL* [wang2017attribute]      82.13   -       82.55   82.12   82.02   74.74   -       75.08   74.96   74.62   -       -
GRL [zhao2018grouping]        86.70   -       84.34   88.82   86.51   81.20   -       77.70   80.90   79.29   50M     10
HP-Net [hpnet]                81.77   76.13   84.92   83.24   84.07   76.12   65.39   77.33   78.79   78.05   -       -
VeSPA [deepview]              83.45   77.73   86.18   84.81   85.49   77.70   67.35   79.51   79.67   79.59   17.0M
DIAA [deepimb]                84.59   78.56   86.79   86.12   86.46   -       -       -       -       -       -       -
PGDM [li2018pose]             82.97   78.08   86.86   84.68   85.76   74.31   64.57   78.86   75.90   77.35   87.2M   1
LG-Net [liu2018localization]  -       -       -       -       -       78.68   68.00   80.36   79.82   80.09   20M
BN-Inception                  82.66   77.73   86.68   84.20   85.57   75.76   65.57   78.92   77.49   78.20   10.3M   1.78
Ours                          86.30   79.52   85.65   88.09   86.85   81.87   68.17   74.71   86.48   80.16   17.1M   1.95
Table 3: Quantitative comparisons against previous methods on the PETA and RAP datasets. We divide these methods into four groups, from top to bottom: holistic, relation-based, attention-based, and part-based methods. JRL* is the single-model version of JRL. The precision and recall metrics are not very reliable on class-imbalanced datasets, while the mA and F1 scores are more convincing. Best results are in bold. For the RAP dataset, we further compare the number of parameters (#P) and complexity (GFLOPs).
Dataset                       PA-100K
Method                        mA      Accu    Prec    Recall  F1
DeepMar [deepmar]             72.70   70.39   82.24   80.42   81.32
HP-Net [hpnet]                74.21   72.19   82.97   82.09   82.53
PGDM [li2018pose]             74.95   73.08   84.36   82.24   83.29
VeSPA [deepview]              76.32   73.00   84.99   81.49   83.20
LG-Net [liu2018localization]  76.96   75.55   86.99   83.17   85.04
BN-Inception                  77.47   75.05   86.61   85.34   85.97
Ours                          80.68   77.08   84.21   88.84   86.46
Table 4: Quantitative comparisons on PA-100K dataset.

4.5 Comparison with State-of-the-art Methods

In this subsection, we compare the performance of our proposed method against several state-of-the-art methods. As mentioned in Section 2, we divide these methods into four categories: (1) Holistic methods, including ACN [sudowe2015person] and DeepMar [deepmar], which first used a CNN to jointly learn multiple attributes. (2) Relation-based methods, including JRL [wang2017attribute] and GRL [zhao2018grouping], which both exploit semantic relations with a CNN-RNN based model. (3) Attention-based methods, including HP-Net [hpnet] and DIAA [deepimb], which rely on multi-scale attention mechanisms, and VeSPA [deepview], which performs view-specific attribute prediction through a coarse view predictor. (4) Part-based methods, including the recently proposed PGDM [li2018pose] and LG-Net [liu2018localization], which rely on an external pose estimation or region proposal module.

Table 3 and Table 4 show the comparison results on three different datasets. The results suggest that our proposed method achieves superior performance compared with existing works under both label-based and instance-based metrics on all three datasets. Compared with the previous methods relying on attribute-agnostic attention or an extra part localization mechanism, the proposed method achieves a significant improvement across all datasets, which demonstrates the effectiveness of attribute-specific localization. Although our mA score on the PETA dataset is slightly lower than that of the relation-based method GRL, which uses a stronger Inception-v3 backbone network (with twice as many parameters as ours), we still outperform it in F1 on PETA and in both mA and F1 on RAP. On the more challenging PA-100K dataset, the proposed method largely outperforms all previous works, improving mA and F1 by 3.72 and 1.42 points, respectively, over the second-best previous results (LG-Net in Table 4). Notably, the proposed method surpasses the baseline model by a significant margin, especially on the label-based metric mA (3.64, 6.11, and 3.21 points on PETA, RAP, and PA-100K, respectively). Note that the proposed method often achieves a lower precision but a higher recall; these two metrics are not very reliable, especially on class-imbalanced datasets. Moreover, the two metrics are inversely correlated, i.e., an increase in one usually leads to a decrease in the other (e.g., by modulating the class weights in the loss function). The mA and F1 metrics are more appropriate for measuring the performance of an attribute recognition model. Our method achieves the best F1 on all three datasets and the best mA on RAP and PA-100K.
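The precision/recall trade-off discussed above can be checked against the reported numbers. The following is a minimal sketch, assuming the instance-based F1 in Table 4 is the harmonic mean of the reported precision and recall (an assumption that the table values appear consistent with); the helper name `f1_score` is ours, not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from Table 4 (PA-100K).
PA100K = {
    "LG-Net": (86.99, 83.17, 85.04),
    "BN-Inception": (86.61, 85.34, 85.97),
    "Ours": (84.21, 88.84, 86.46),
}

for name, (prec, rec, reported) in PA100K.items():
    # Compare the recomputed F1 with the value listed in the table.
    print(f"{name}: recomputed F1 = {f1_score(prec, rec):.2f}, reported F1 = {reported:.2f}")
```

Under this assumption, the proposed method gives up some precision relative to the baseline but gains more in recall, which is why its F1 is still higher.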

We provide a comparison of the computational cost of different methods (rightmost columns in Table 3) on the RAP dataset. For the number of parameters, theoretically, there are totally trainable parameters in each ALM: from the STN module, from the channel-attention module, where is the number of input channels. As shown, the proposed model has far fewer trainable parameters than most previous models. In terms of model complexity, even with 51 attributes, the proposed model is still light-weight, as only 0.17 GFLOPs are added to the backbone network. The reason is that ALM contains only FC layers (or 1×1 Conv), which involve far fewer FLOPs than 3×3 Conv layers. In general, the entire model is more efficient than most previous models.
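To illustrate why 1×1 convolutions are cheap relative to 3×3 convolutions, here is a rough sketch of the usual operation count for a convolution layer. It assumes stride 1, same padding, no bias, and counts one multiply-accumulate as one operation; the feature-map size and channel counts are hypothetical and are not taken from the paper. The last two lines only restate the overhead over the backbone as a difference of the Table 3 entries.

```python
def conv_ops(height: int, width: int, c_in: int, c_out: int, kernel: int) -> int:
    """Multiply-accumulates of a kernel x kernel convolution (stride 1, same padding, no bias)."""
    return height * width * c_in * c_out * kernel * kernel


# Hypothetical feature map: 16 x 8 spatial positions, 512 input and output channels.
ops_1x1 = conv_ops(16, 8, 512, 512, kernel=1)
ops_3x3 = conv_ops(16, 8, 512, 512, kernel=3)
print(f"3x3 / 1x1 cost ratio: {ops_3x3 / ops_1x1:.0f}")

# Overhead of the full model over the BN-Inception baseline on RAP (values from Table 3).
added_gflops = round(1.95 - 1.78, 2)
added_params_m = round(17.1 - 10.3, 1)
print(f"added GFLOPs: {added_gflops}, added parameters: {added_params_m}M")
```

For a fixed feature-map size and channel configuration, the cost grows with the square of the kernel size, so a 3×3 layer costs nine times as much as a 1×1 layer.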

5 Conclusion

We propose an end-to-end framework for pedestrian attribute recognition, which can automatically localize the attribute-specific regions at multiple feature levels. Moreover, we apply a feature pyramid architecture to enhance the attribute localization and region-based feature learning in a mutually reinforcing manner. Experimental results on PETA, RAP, and PA-100K datasets show that the proposed method can significantly outperform most of the existing methods. The extensive analysis suggests that the proposed method can successfully localize the most informative region for each attribute in a weakly-supervised manner.

Acknowledgements  This work was supported in part by the National Key Research and Development Program of China under Grant 2017YFA0700904, in part by the National Natural Science Foundation of China under Grant 61836014, and Grant 61620106010.



Appendix A Implementation Details

We adopt the BN-Inception model pretrained on ImageNet as the backbone network. The proposed framework is implemented in PyTorch and trained end-to-end with only image-level annotations. We adopt the Adam optimizer, since it converges faster than SGD in our experiments, with momentum set to and a weight decay equal to . The initial learning rate equals and the batch size is set to . For the RAP and PA-100K datasets, we train the model for 30 epochs and the learning rate decays by every epochs. For the smaller PETA dataset, we double the number of training epochs. For data preprocessing, we resize the input pedestrian images to and apply random horizontal mirroring and data shuffling for data augmentation.
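The step-wise learning-rate decay and the horizontal mirroring described above can be sketched as follows. This is only an illustration in plain Python: the function names are ours, and the numeric values in the usage lines (`base_lr`, `gamma`, `step_size`) are placeholders chosen for the example, not the settings used in the paper.

```python
import random


def step_decay_lr(base_lr: float, gamma: float, step_size: int, epoch: int) -> float:
    """Learning rate after multiplying by `gamma` once every `step_size` epochs."""
    return base_lr * gamma ** (epoch // step_size)


def random_horizontal_mirror(image, p: float = 0.5, rng=random):
    """Flip an image (a list of pixel rows) left-to-right with probability `p`."""
    if rng.random() < p:
        return [row[::-1] for row in image]
    return image


# Placeholder hyperparameters, for illustration only (NOT the paper's values).
for epoch in (0, 10, 20, 29):
    print(epoch, step_decay_lr(base_lr=1.0, gamma=0.5, step_size=10, epoch=epoch))

# Mirroring with p=1.0 always flips each row of the toy 2 x 3 image.
print(random_horizontal_mirror([[1, 2, 3], [4, 5, 6]], p=1.0))
```

Doubling the number of epochs for a smaller dataset, as done for PETA, only changes the range of `epoch` passed to the schedule.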

Appendix B Different Attribute-Specific Methods

In Section 4.4, we compare the proposed method against the other two attribute-specific localization methods, including visual attention and rigid parts. Different from most existing attribute-agnostic attention-based and part-based methods, we build two attribute-specific models based on these ideas for comparison. Here we show the details of the compared models.

Attention Masks Model. We replace the proposed ALM with a spatial attention module while keeping everything else unchanged for a fair comparison. The spatial attention module is implemented by a tiny 3-layer sub-network, as shown in Figure S2, which is inspired by HA-CNN [li2018harmonious]. The input features at the -th level (a certain layer in the backbone network, three levels in total) are first fed into a cross-channel averaging layer. A Conv-BatchNorm-ReLU block follows to generate the expected attention mask , which is used for localizing the -th attribute at the -th level. All channels share the identical spatial attention mask. Subsequently, the attentive features are obtained by channel-wise multiplying the attention mask with the input features, and the corresponding prediction is calculated as follows:


where denotes a fully-connected layer. Each spatial attention module serves only one attribute at a single level, the same as in Figure 3.
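The computation described above (cross-channel averaging, a mask-generating block, channel-wise multiplication, and a fully-connected prediction) can be sketched in plain Python. Several details here are assumptions made for illustration and are not confirmed by the text: the Conv-BatchNorm-ReLU block is reduced to a 1×1 scale-and-shift followed by ReLU, the masked features are globally average-pooled before the fully-connected layer, and a sigmoid turns the logit into a probability. All function and parameter names are hypothetical.

```python
import math


def cross_channel_average(features):
    """Average a C x H x W feature tensor (nested lists) over channels -> H x W map."""
    channels = len(features)
    height, width = len(features[0]), len(features[0][0])
    return [[sum(features[c][i][j] for c in range(channels)) / channels
             for j in range(width)] for i in range(height)]


def attention_mask(avg_map, weight=1.0, bias=0.0):
    """Assumed stand-in for Conv-BN-ReLU: per-pixel scale and shift, then ReLU."""
    return [[max(0.0, weight * v + bias) for v in row] for row in avg_map]


def apply_mask(features, mask):
    """Multiply every channel by the same spatial mask."""
    return [[[v * m for v, m in zip(f_row, m_row)]
             for f_row, m_row in zip(channel, mask)]
            for channel in features]


def predict_attribute(features, fc_weights, fc_bias=0.0, weight=1.0, bias=0.0):
    """Probability of one attribute at one level under the assumptions stated above."""
    mask = attention_mask(cross_channel_average(features), weight, bias)
    attended = apply_mask(features, mask)
    # Assumed pooling step: global average over the spatial positions of each channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in attended]
    logit = sum(w * p for w, p in zip(fc_weights, pooled)) + fc_bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy input: 2 channels, 2 x 2 spatial positions.
toy = [[[1.0, 0.0], [0.0, 0.0]], [[3.0, 0.0], [0.0, 0.0]]]
print(predict_attribute(toy, fc_weights=[1.0, -1.0]))
```

In the compared model, one such module would be instantiated per attribute and per level, which is what makes the attention attribute-specific rather than shared.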

Rigid Parts Model. For the attribute-specific part-based model, we replace ALM with a body-parts guided module, as shown in Figure S3. The key idea is to associate each attribute with a predefined body region (head, torso, legs, or the whole image); e.g., the LongHair attribute is associated with the head part. Since body-part annotations are unavailable on most pedestrian attribute datasets, we adopt an external pose estimation model to localize the body parts, which is inspired by SpindleNet [Zhao_2017_CVPR]. Specifically, we localize 14 human body keypoints for each pedestrian image using a pretrained pose estimation model [Zhao_2017_CVPR]. The pedestrian image is then divided into three body-part regions based on these keypoints, as shown in Figure S1. In the body-parts guided module (Figure S3), the body-part-based local features are extracted from the input features through an RoI pooling layer [girshick2015fast]. For attribute prediction, the most relevant features are selected according to the attribute-region correspondence listed in Table S1, e.g., recognizing Hat using features only from the head part.
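The selection step of the rigid-parts model can be sketched as a lookup from attribute to region followed by pooling inside that region's box. This is a simplified illustration, not the compared model itself: the region boxes are given directly instead of being derived from keypoints, RoI pooling is reduced to a single max over the box (a 1×1 output), and the mapping below lists only a few attributes. LongHair and Hat are assigned to the head as stated in the text; the Boots and Female entries are our assumption.

```python
# Hypothetical subset of the attribute-region correspondence.
ATTRIBUTE_REGION = {
    "LongHair": "head",
    "Hat": "head",
    "Boots": "legs",     # assumed
    "Female": "whole",   # assumed
}


def roi_max_pool(feature_map, box):
    """Max over an H x W map inside box = (top, left, bottom, right), end-exclusive."""
    top, left, bottom, right = box
    return max(feature_map[i][j] for i in range(top, bottom) for j in range(left, right))


def select_part_features(features, part_boxes, attribute):
    """Pool each channel of a C x H x W tensor inside the region tied to `attribute`."""
    box = part_boxes[ATTRIBUTE_REGION[attribute]]
    return [roi_max_pool(channel, box) for channel in features]


# Toy example: one channel, 4 x 2 feature map, boxes given in feature-map coordinates.
toy_features = [[[1, 2], [3, 4], [5, 6], [7, 8]]]
toy_boxes = {
    "head": (0, 0, 1, 2),
    "torso": (1, 0, 3, 2),
    "legs": (3, 0, 4, 2),
    "whole": (0, 0, 4, 2),
}
print(select_part_features(toy_features, toy_boxes, "Hat"))
```

Because the boxes are fixed per body part, an attribute whose location varies across images, such as a carried box, cannot be followed by this scheme, which is the limitation discussed in Section 4.4.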

Region       Attributes
Head         BaldHead, LongHair, BlackHair, Hat, Glasses, Muffler, Calling
Torso        Shirt, Sweater, Vest, TShirt, Cotton, Jacket, Suit-Up, Tight, ShortSleeve, LongTrousers, Skirt, ShortSkirt, Dress, Jeans, TightTrousers, CarryingbyArm, CarryingbyHand
Legs         LeatherShoes, SportShoes, Boots, ClothShoes
Whole image  Female, AgeLess16, Age17-30, Age31-45, BodyFat, BodyNormal, BodyThin, Customer, Clerk, Backpack, SSBag, HandBag, Box, PlasticBag, PaperBag, HandTrunk, OtherAttchment, Talking, Gathering, Holding, Pushing, Pulling
Table S1: Attribute-region correspondence in RAP dataset.
Figure S1: Illustration of body-parts generation. We divide a pedestrian image into three body-part regions (head, torso, and legs) based on 14 human body keypoints.

We provide more localization results for different attributes in Figure S4.

Figure S2: Details of the spatial attention module for one attribute at a single level. The expected attention mask is produced by a cross-channel averaging layer followed by a Conv-BN-ReLU block.
Figure S3: Details of the body-parts guided module for one attribute at a single level. The three body-part regions are calculated based on several human body keypoints predicted by a pretrained pose estimation model. The local features belonging to different body parts are extracted by an RoI pooling layer. The most relevant features are selected for attribute classification according to the predefined attribute-region correspondence (Table S1).
Figure S4: Case studies of different attribute-specific localization methods for five different attributes.