Multi-Task Learning via Co-Attentive Sharing for Pedestrian Attribute Recognition

04/07/2020 ∙ by Haitian Zeng, et al. ∙ Tsinghua University

Learning to predict multiple attributes of a pedestrian is a multi-task learning problem. To share feature representations between two individual task networks, conventional methods such as Cross-Stitch and Sluice networks learn a linear combination of features or feature subspaces. However, a linear combination cannot capture the complex interdependency between channels. Moreover, the exchange of spatial information has received little attention. In this paper, we propose a novel Co-Attentive Sharing (CAS) module which extracts discriminative channels and spatial regions for more effective feature sharing in multi-task learning. The module consists of three branches, which leverage different channels for between-task feature fusing, attention generation and task-specific feature enhancing, respectively. Experiments on two pedestrian attribute recognition datasets show that our module outperforms the conventional sharing units and achieves superior results compared to the state-of-the-art approaches on multiple metrics.




1 Introduction

Recognizing person attributes has attracted great interest recently. Given an image of a single person, the aim is to recognize a series of semantic attributes such as age, gender and clothing style. The task has a wide range of application scenarios, since it creates a rich profile of human traits which can facilitate person retrieval [sun2017svdnet] or re-identification [lin2019improving].

Compared to conventional image classification, where an image belongs to a single class, recognizing person attributes can be regarded as a multi-task (MT) learning problem, since a person is usually described by multiple attributes. To recognize these attributes simultaneously, one popular approach is to build multiple attribute classifiers upon a shared backbone [li2015multi], as shown in Fig. 1(a), which is known as the hard parameter sharing structure [ruder2017overview]. However, this structure is prone to the negative transfer problem [pan2009survey, he2017adaptively, wang2019characterizing], i.e., the feature representation of one attribute may be impeded by dissimilar attributes if they are learned together. To alleviate this problem, a vanilla approach is to ensemble two independent networks, as in Fig. 1(b), where each network predicts a more closely related subset of attributes [hand2017attributes, cao2018partially]. Nevertheless, in this structure there is no communication between the two networks, so some useful correlations may be lost. A more holistic paradigm is the soft parameter sharing structure [ruder2017overview], illustrated in Fig. 1(c), which combines the advantages of the hard-sharing and vanilla structures. It uses a dedicated module to decide, at each layer, what to share with the other task and what to keep private.

(a) Hard-Sharing structure
(b) Vanilla structure
(c) Soft-Sharing structure
Figure 1: Different structures for multi-attribute recognition.
Figure 2: Framework of the Co-Attentive Sharing module. It takes as input the features of the same layer from both networks. For each input feature, three channel attention vectors are obtained with global average pooling (GAP) and fully connected (FC) layers, and are used in the three branches. The synergetic branch produces enhanced features and spatial attention maps from the concatenated selected features of the two networks, while the attentive branch outputs a global attention and the task-specific branch outputs a task-specific feature. Finally, the outcomes of the three branches are aggregated to form the input of the next layer.

The sharing module is clearly the most important component of the soft-sharing structure. Previous works such as the Cross Stitch [misra2016cross] module and the Sluice [ruder122017sluice] module use linear interactions to enable feature sharing. The Cross Stitch module calculates the feature for the next layer as a learnable linear combination of the two input features. The Sluice module further divides the features into subspaces and obtains the new feature as a linear combination of these subspaces. However, in these methods the interaction of features from different tasks is simply a weighted element-wise summation, so the selection of discriminative channels is neglected. Moreover, attributes usually correlate with different spatial locations of an image [li2018pose, liu2018localization]. In other words, spatial information matters for recognizing pedestrian attributes, yet element-wise summation fails to exploit it. These two factors limit the performance of feature sharing.
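As a rough illustration of the linear sharing discussed above, a cross-stitch style unit can be sketched as a 2×2 mixing matrix applied to the two task features. The function name `cross_stitch`, the array shapes and the example weights are assumptions made for illustration only; in the cited method the mixing weights are learned during training.

```python
import numpy as np

def cross_stitch(feat_a, feat_b, alpha):
    """Mix two task features with a 2x2 weight matrix (illustrative sketch).

    alpha[0] holds the weights that produce the new Task-A feature,
    alpha[1] those that produce the new Task-B feature.
    """
    alpha = np.asarray(alpha, dtype=float)
    new_a = alpha[0, 0] * feat_a + alpha[0, 1] * feat_b
    new_b = alpha[1, 0] * feat_a + alpha[1, 1] * feat_b
    return new_a, new_b

# Toy features of shape (height, width, channels); the values are arbitrary.
feat_a = np.ones((2, 2, 3))
feat_b = np.full((2, 2, 3), 3.0)
new_a, new_b = cross_stitch(feat_a, feat_b, [[0.9, 0.1], [0.2, 0.8]])
```

Every channel and every spatial position is scaled by the same scalar weight in this sketch, which is the limitation the paragraph above points out.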

To better handle these challenges, we propose a novel Co-Attentive Sharing (CAS) module which extracts discriminative channels and spatial regions for more effective feature sharing between two task networks in pedestrian attribute recognition. It consists of three branches: the synergetic branch, the attentive branch and the task-specific branch. They exploit three different channel attentions generated from a shared intermediate vector and play different roles. The synergetic branch fuses the selected features from each task to generate enhanced features and spatial attention maps. The attentive branch computes a global feature attention, and the task-specific branch highlights important channels within each task. Finally, the results of the three branches are combined as the output of the module. Experiments on two large pedestrian attribute recognition datasets demonstrate the effectiveness of the CAS module.

2 Related Works

2.1 CNNs for Image Classification

After the success of convolutional neural networks (CNNs) such as AlexNet [krizhevsky2012imagenet] and ResNet [he2016deep] on image classification and numerous other vision tasks, SENet [hu2018squeeze] further improves the performance of CNNs by introducing a channel attention branch which models sophisticated channel inter-dependency. Moreover, BAM [park2018bam] and CBAM [woo2018cbam] propose methods to implement spatial attention in a convolutional module integrated into the network. As recognizing the attributes of a pedestrian is essentially a classification problem, our work can benefit from their insights, such as emphasizing discriminative feature channels and spatial regions.

2.2 Multi-task Learning for Facial Attribute Recognition

In multi-task learning [caruana1997multitask], models trained for correlated tasks share complementary information with each other through some sharing mechanism. Previous works such as Cross Stitch [misra2016cross] and Sluice [ruder122017sluice] exploit automatically learned linear combinations to fuse features. Multi-task learning has gained popularity in facial attribute recognition, since attributes are often semantically and spatially correlated. Hand et al. [hand2017attributes] propose a multi-task network which has shared bottom layers and bifurcates into branches for facial attribute classification. Cao et al. [cao2018partially] introduce a Partially-Shared structure for multi-task learning, where four task-specific networks exchange complementary information with each other through one shared network. However, in most existing works features from different tasks interact linearly, so the complex channel interdependency is rarely considered. Moreover, their sharing mechanisms are not capable of properly handling spatial information.

2.3 Pedestrian Attribute Recognition

Early studies on pedestrian attribute recognition adopt hand-crafted features such as color and texture channels [deng2015learning]. With the success of CNNs, recent works are usually based on deep models. DeepMAR [li2015multi] predicts several person attributes simultaneously with a shared backbone network. Wang et al. [wang2017attribute] propose an encoder-decoder architecture called JRL, where one label is predicted given the context of all previously predicted labels and attentive features. HydraPlus-Net [liu2017hydraplus] introduces a mechanism to select multi-scale and multi-semantic-level features for pedestrian attribute recognition. PGDM [li2018pose] uses auxiliary pose estimation to better localize body parts and enhance the performance. In these works, the semantic correlation between attributes is well modeled, and spatial information is also shown to be helpful for attribute recognition. Nevertheless, a shared backbone is widely adopted and the influence of negative transfer [pan2009survey, he2017adaptively, wang2019characterizing] has not received enough attention.

3 Methodology

In this section, we first introduce the Co-Attentive Sharing module, then describe the instantiated multi-task network.

3.1 Co-Attentive Sharing Module

Consider two individual networks trained for Task-A and Task-B, respectively. The output features of the same layer of each network are used as the inputs of the CAS module. For each input, an intermediate vector is first obtained and serves as the basis for the following three branches, so that each branch can generate its own channel attention vector from it. Next, the synergetic branch combines the selected features from both tasks to obtain enhanced features and spatial attention maps. The attentive branch computes a global feature attention based on the spatial attention maps, and the task-specific branch highlights important channels within each task. Finally, the results of the three branches are aggregated as the input features for the next layer of each task. In the following parts we introduce the details and, for simplicity, omit the task and layer superscripts unless they are necessary.

Intermediate Vector. The input convolutional feature is first pooled into a channel vector using global average pooling (GAP). This vector is then fed into a linear (fully connected, FC) layer followed by a ReLU function to acquire the intermediate vector, in which the number of channels is reduced by a reduction ratio. Note that this process is similar to the squeeze step in [hu2018squeeze], which produces a powerful channel descriptor for the following branches. In short, the intermediate vector is obtained by applying GAP, the FC layer and ReLU to the input feature in sequence.
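The computation of the intermediate vector can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: the function name `intermediate_vector`, the (height, width, channels) layout, the random weights and the sizes `C = 64` and `r = 16` are all chosen for the example and are not taken from the paper's implementation.

```python
import numpy as np

def intermediate_vector(feature, weight, bias):
    """GAP -> FC -> ReLU, producing a reduced channel descriptor.

    feature: (H, W, C) convolutional feature map
    weight:  (C, C // r) FC weight, r being the reduction ratio
    bias:    (C // r,) FC bias
    """
    pooled = feature.mean(axis=(0, 1))   # global average pooling -> (C,)
    hidden = pooled @ weight + bias      # fully connected layer -> (C // r,)
    return np.maximum(hidden, 0.0)       # ReLU

rng = np.random.default_rng(0)
C, r = 64, 16                            # assumed sizes for the example
feature = rng.standard_normal((8, 4, C))
weight = rng.standard_normal((C, C // r)) * 0.1
bias = np.zeros(C // r)
z = intermediate_vector(feature, weight, bias)   # shape (4,)
```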


Synergetic Branch. This branch aims to extract discriminative features and spatial attention maps given the selected information from both tasks. Towards this goal, a channel attention vector is calculated by applying a linear layer followed by the sigmoid function to the intermediate vector. The selected feature for cross-task sharing is then computed as the element-wise product of the input feature and this attention vector. Note that the input feature has two spatial dimensions and one channel dimension, whereas the attention vector has only one entry per channel; during multiplication the attention vector is broadcast (namely copied) along the first two dimensions, and other operations are handled similarly.
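The channel selection step and its broadcasting behavior can be illustrated with the sketch below. The helper names `channel_attention` and `select_channels`, the tensor layout and the random stand-in inputs are assumptions made for this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(z, weight, bias):
    """FC + sigmoid: maps an intermediate vector (C // r,) to channel weights (C,)."""
    return sigmoid(z @ weight + bias)

def select_channels(feature, attention):
    """Scale each channel of an (H, W, C) feature by its attention weight.

    NumPy broadcasts the (C,) vector along the first two (spatial) dimensions.
    """
    return feature * attention

rng = np.random.default_rng(0)
H, W, C, r = 8, 4, 64, 16
feature = rng.standard_normal((H, W, C))
z = np.maximum(rng.standard_normal(C // r), 0.0)   # stand-in intermediate vector
attn = channel_attention(z, rng.standard_normal((C // r, C)) * 0.1, np.zeros(C))
selected = select_channels(feature, attn)          # same shape as feature
```

Because the sigmoid output lies strictly between 0 and 1, each channel is attenuated rather than amplified, and the same weight is applied at every spatial position.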

Next, the selected features from the two networks are concatenated along the channel dimension. In order to fully utilize the information in the concatenated feature, two components are extracted from it for each network. The first one is an enhanced feature, which contains discriminative feature representations from both networks. It is calculated by applying a convolution layer to the concatenated feature. The second one is a spatial attention map. We follow Woo et al. [woo2018cbam] to produce this map: the mean and the maximum across the channel dimension are computed at every spatial location, and a convolution layer is applied to the two resulting maps.
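The fusion step can be sketched as below. The kernel sizes are not legible in this copy of the text, so the sketch assumes a 1×1 convolution for the enhanced feature and a 7×7 convolution (the CBAM default) followed by a sigmoid for the spatial map; the function names and random weights are likewise illustrative. Only one set of outputs is computed here, whereas the text extracts the two components separately for each network.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pointwise_conv(x, weight):
    """1x1 convolution: a per-pixel linear map from C_in to C_out channels."""
    return x @ weight                    # (H, W, C_in) @ (C_in, C_out)

def spatial_attention(x, kernel):
    """CBAM-style map: channel-wise mean and max -> k x k conv -> sigmoid.

    x: (H, W, C) feature; kernel: (k, k, 2) filter with odd k.
    Returns an (H, W, 1) map with values in (0, 1).
    """
    stats = np.stack([x.mean(axis=2), x.max(axis=2)], axis=2)   # (H, W, 2)
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(stats, ((pad, pad), (pad, pad), (0, 0)))    # zero padding
    H, W = x.shape[:2]
    out = np.empty((H, W, 1))
    for i in range(H):
        for j in range(W):
            out[i, j, 0] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return sigmoid(out)

rng = np.random.default_rng(0)
H, W, C = 8, 4, 16
sel_a = rng.standard_normal((H, W, C))   # selected feature from network A
sel_b = rng.standard_normal((H, W, C))   # selected feature from network B
fused = np.concatenate([sel_a, sel_b], axis=2)                  # (H, W, 2C)
enhanced = pointwise_conv(fused, rng.standard_normal((2 * C, C)) * 0.1)
attn_map = spatial_attention(fused, rng.standard_normal((7, 7, 2)) * 0.1)
```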

Attentive Branch. Even though a spatial attention map is informative, intuitively it may be useful only for certain channels which need spatial regularization. Thus, in this branch, the spatial attention map from the synergetic branch and a channel weight are used together to obtain a global attention. The channel weight is another attention based on the intermediate vector, calculated from it with another linear layer. The global attention is then formed from the spatial attention map and this channel weight.
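One plausible way to combine the two attentions is a broadcast product, sketched below. The exact combination rule is not recoverable from this copy of the text, so the product form, the function name `global_attention` and the toy values are assumptions rather than the authors' formula.

```python
import numpy as np

def global_attention(spatial_map, channel_weight):
    """Combine an (H, W, 1) spatial map with a (C,) channel weight into (H, W, C).

    The product form is an assumption; broadcasting expands both operands.
    """
    return spatial_map * channel_weight

spatial_map = np.full((2, 2, 1), 0.5)          # toy spatial attention
channel_weight = np.array([0.0, 0.5, 1.0])     # toy per-channel weight
g = global_attention(spatial_map, channel_weight)
```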


Task-specific Branch. Besides the synergetic branch, where the two networks exchange complementary information, this branch further improves the feature by strengthening each task's own feature. A channel attention vector is obtained in the same way as before, using yet another linear layer, and the outcome of the task-specific branch is the input feature reweighted by this vector.


Final Aggregation. In order to get the final enhanced feature, we aggregate the results of the branches by first summing up the feature outputs of the synergetic and task-specific branches with element-wise addition, and then multiplying the sum by the global attention.
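The aggregation can be sketched as a sum followed by an element-wise reweighting. Which terms enter the sum is not fully legible in this copy of the text; the sketch assumes the enhanced feature of the synergetic branch plus the task-specific feature, multiplied by the global attention. The function name `aggregate` and the toy values are illustrative.

```python
import numpy as np

def aggregate(enhanced, task_specific, global_attention):
    """Sum two (H, W, C) branch features, then reweight by an (H, W, C) attention."""
    return (enhanced + task_specific) * global_attention

enhanced = np.ones((2, 2, 3))                 # toy synergetic-branch feature
task_specific = np.full((2, 2, 3), 2.0)       # toy task-specific feature
global_attn = np.full((2, 2, 3), 0.5)         # toy global attention
out = aggregate(enhanced, task_specific, global_attn)
```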

Figure 3: Examples of the global-local grouping scheme.

3.2 Instantiation

We instantiate the multi-task network using two pretrained ResNet-34 [he2016deep] networks. The CAS module is inserted at layers 1 to 4 for soft parameter sharing, as depicted in Fig. 1(c). The output of each backbone network is fed into a linear layer after global average pooling to predict its group of attributes.
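The control flow of such a two-network instantiation can be sketched as follows. The function `soft_sharing_forward` and the scalar stand-ins for the stages and sharing modules are hypothetical; real stages would be the four ResNet-34 layers and each sharing module would be a CAS module.

```python
def soft_sharing_forward(x, stages_a, stages_b, sharing_modules):
    """Run two task networks stage by stage, exchanging features after each stage."""
    feat_a, feat_b = x, x
    for stage_a, stage_b, share in zip(stages_a, stages_b, sharing_modules):
        feat_a, feat_b = stage_a(feat_a), stage_b(feat_b)
        feat_a, feat_b = share(feat_a, feat_b)
    return feat_a, feat_b

# Toy stand-ins: each "stage" is a scalar function and each sharing module
# mixes in 10% of the other task's feature.
stages_a = [lambda v: v + 1.0] * 4
stages_b = [lambda v: v * 2.0] * 4
mix = lambda a, b: (0.9 * a + 0.1 * b, 0.1 * a + 0.9 * b)
out_a, out_b = soft_sharing_forward(1.0, stages_a, stages_b, [mix] * 4)

# With an identity sharing module the two networks stay fully independent,
# which corresponds to the vanilla structure of Fig. 1(b).
no_share = lambda a, b: (a, b)
plain_a, plain_b = soft_sharing_forward(1.0, stages_a, stages_b, [no_share] * 4)
```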

4 Experiments

4.1 Experiment Setup

Datasets. We evaluate our method on two pedestrian attribute recognition datasets: (1) PA-100K [liu2017hydraplus] contains 100,000 pedestrian images with 26 annotated attributes in total and is, as far as we know, the largest public pedestrian attribute recognition dataset. The dataset is split 8:1:1 for training, validation and testing. (2) PETA [deng2014pedestrian] is a large-scale person attribute dataset with 19,000 images and 35 attributes. There are 9,500, 1,900 and 7,600 images in the training, validation and test sets, respectively.

Implementation. We divide the attributes of both datasets into one global group and one local group for Task-A and Task-B respectively, as shown in Fig. 3. The grouping scheme does not substantially affect the performance, see Sec. 4.4. We adopt the label-based metric mA and the instance-based metrics accuracy, precision, recall and F1, following [li2015multi]. The network is trained for 70 epochs with cross-entropy loss and stochastic gradient descent. The learning rate is 0.02 and decays to 0.002 after 40 epochs. The reduction rate is set to 16.
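For reference, the five scores reported in the result tables can be computed as in the sketch below. It uses the definitions commonly adopted in pedestrian attribute recognition (mA as the per-attribute mean of positive and negative recall, instance-based scores computed per image and averaged, F1 as the harmonic mean of the averaged precision and recall); that these match the cited protocol exactly is an assumption, and the function name and toy labels are illustrative.

```python
import numpy as np

def attribute_metrics(y_true, y_pred):
    """Label-based mA and instance-based Acc/Prec/Recall/F1 for binary attributes.

    y_true, y_pred: (num_images, num_attributes) arrays of 0/1.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    eps = 1e-12  # guards against division by zero

    # Label-based: average of positive and negative recall, per attribute.
    tpr = (y_true & y_pred).sum(0) / (y_true.sum(0) + eps)
    tnr = (~y_true & ~y_pred).sum(0) / ((~y_true).sum(0) + eps)
    mean_acc = float(np.mean((tpr + tnr) / 2))

    # Instance-based: compare predicted and true label sets image by image.
    inter = (y_true & y_pred).sum(1)
    union = (y_true | y_pred).sum(1)
    acc = float(np.mean(inter / (union + eps)))
    prec = float(np.mean(inter / (y_pred.sum(1) + eps)))
    rec = float(np.mean(inter / (y_true.sum(1) + eps)))
    f1 = 2 * prec * rec / (prec + rec + eps)
    return {"mA": mean_acc, "Acc": acc, "Prec": prec, "Recall": rec, "F1": f1}

# Three toy images with two attributes each.
y_true = [[1, 0], [0, 1], [1, 1]]
y_pred = [[1, 0], [0, 1], [1, 0]]
scores = attribute_metrics(y_true, y_pred)
```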

4.2 Baselines and Competitors

We set up four strong baselines for comparison: (1) Hard-Sharing: a single ResNet-34 network which predicts all the attributes simultaneously. (2) MT-Vanilla: the vanilla structure with two independent networks that do not communicate with each other. (3) MT-Cross-Stitch: a soft-sharing structure with the Cross Stitch module, integrated at the same places as in our approach. (4) MT-Sluice: a soft-sharing structure with the Sluice module. Our method is also compared to other pedestrian attribute recognition approaches [wang2017attribute, liu2018localization, sarfraz2017deep, li2015multi, liu2017hydraplus, li2018pose, ji2019image].

Method mA Acc. Prec. Recall F1
DeepMAR [li2015multi] 72.70 70.39 82.24 80.42 81.32
HP-Net [liu2017hydraplus] 74.21 72.19 82.97 82.09 82.53
VeSPA [sarfraz2017deep] 76.32 73.00 84.99 81.49 83.20
PGDM [li2018pose] 74.95 73.08 84.36 82.24 83.29
LG-Net [liu2018localization] 76.96 75.55 86.99 83.17 85.04
Hard-Sharing 75.43 76.53 87.85 83.38 85.56
MT-Vanilla 75.74 76.95 88.35 83.43 85.82
MT-Cross-Stitch [misra2016cross] 76.55 77.30 88.39 83.83 86.05
MT-Sluice [ruder122017sluice] 76.26 77.35 88.34 84.10 86.17
MT-CAS(ours) 77.20 78.09 88.46 84.86 86.62
Table 1: Results on PA-100K.
Method mA Acc. Prec. Recall F1
DeepMAR [li2015multi] 82.89 75.07 83.68 83.14 83.41
HP-Net [liu2017hydraplus] 81.77 76.13 84.92 83.24 84.07
JRL [wang2017attribute] 85.67 - 86.03 85.34 85.42
VeSPA [sarfraz2017deep] 83.45 77.73 86.18 84.81 85.49
PGDM [li2018pose] 82.97 78.08 86.86 84.68 85.76
IA-Net [ji2019image] 84.13 78.62 85.73 86.07 85.88
Hard-Sharing 81.63 76.99 86.89 83.56 85.20
MT-Vanilla 82.54 77.27 87.21 83.79 85.47
MT-Cross-Stitch [misra2016cross] 82.02 77.66 87.17 84.40 85.76
MT-Sluice [ruder122017sluice] 82.35 78.04 87.19 84.60 85.87
MT-CAS(ours) 83.17 78.78 87.49 85.35 86.41
Table 2: Results on PETA.

4.3 Experiment Results

The results are shown in Table 1 and Table 2. The highest score of each metric is marked in bold, and the second best one is underlined. It is worth noting that our baselines already surpass a number of competitors. We also provide qualitative results by visualizing the spatial attention from each layer and each network, as shown in Fig. 4.

Comparison with Hard-Sharing. The MT-CAS model surpasses the Hard-Sharing baseline by 1.06% on PA-100K and 1.21% on PETA in F1, which demonstrates the effectiveness of the proposed framework.

Comparison with MT-Vanilla. Compared to MT-Vanilla, the MT-CAS model achieves an F1 that is 0.80% higher on PA-100K and 0.94% higher on PETA. These results verify the effectiveness of CAS. We also notice that MT-Vanilla slightly outperforms Hard-Sharing, but it does not fundamentally improve the results.

Comparison with MT-Cross-Stitch and MT-Sluice. Our module also outperforms Cross-Stitch and Sluice by a margin of about 0.5% in F1 on PA-100K and about 0.6% on PETA. This shows that by exploiting channel and spatial attention, our module is able to share more discriminative features than previous methods.
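The margins quoted in these comparisons can be checked against the F1 columns of Table 1 and Table 2 with a few lines of Python. The dictionary layout and the helper name `gains` are illustrative; the scores are copied from the two tables.

```python
# F1 scores copied from Table 1 (PA-100K) and Table 2 (PETA).
f1 = {
    "PA-100K": {"Hard-Sharing": 85.56, "MT-Vanilla": 85.82,
                "MT-Cross-Stitch": 86.05, "MT-Sluice": 86.17, "MT-CAS": 86.62},
    "PETA": {"Hard-Sharing": 85.20, "MT-Vanilla": 85.47,
             "MT-Cross-Stitch": 85.76, "MT-Sluice": 85.87, "MT-CAS": 86.41},
}

def gains(table):
    """F1 improvement of MT-CAS over every baseline, in percentage points."""
    ours = table["MT-CAS"]
    return {name: round(ours - score, 2)
            for name, score in table.items() if name != "MT-CAS"}

for dataset, table in f1.items():
    print(dataset, gains(table))
```

On PA-100K the gains over Cross-Stitch and Sluice are 0.57 and 0.45 points, and on PETA they are 0.65 and 0.54 points, consistent with the rounded margins of about 0.5% and 0.6%.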

4.4 Ablation Study

To understand the effectiveness of each key component of the CAS module and the influence of other factors, we conduct further analysis on the PA-100K dataset.

Synergetic Branch. We replace the original feature exchange operation with two alternatives. One uses element-wise summation instead of concatenation to aggregate the selected features of the two networks. The other further removes the convolution, so that the summed feature is used directly. These two variants correspond to the two Synergetic rows of Table 3. The results reveal that both the concatenation and the convolution operation are important for the improvement.

Attentive Branch. We remove the channel weight and use the spatial attention map as the global attention, through broadcasting, in the final aggregation. In another case, we remove the entire attentive branch from the module. These two modifications correspond to the two Attentive rows of Table 3. We observe that without the attentive branch, the performance drops by about 0.3%. We also notice that the spatial attention does not improve the performance if it is not used together with channel attention, which indicates that channel attention is critical for spatial regularization.

Task-specific Branch. We ablate the whole task-specific branch (row TS in Table 3). Removing this branch costs about 0.2% in F1, as shown in Table 3.

Method mA Acc. Prec. Recall F1
Synergetic 76.77 77.70 88.24 84.53 86.35
Synergetic 76.53 77.64 88.26 84.38 86.27
Attentive 76.90 77.62 88.33 84.40 86.32
Attentive 77.02 77.67 88.25 84.51 86.34
TS 77.18 77.87 88.27 84.63 86.41
Channel 76.99 77.61 88.02 84.52 86.23
Full Module 77.20 78.09 88.46 84.86 86.62
Table 3: Ablation study results of key components.

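Channel Attention. We remove all three channel attentions, so that features are no longer reweighted per channel, and use the broadcast spatial attention map as the global attention. This case corresponds to the Channel row in Table 3, which demonstrates the effectiveness of selecting discriminative channels.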

Grouping Scheme. We set up three additional schemes following different principles mentioned in [wang2017attribute]: rare-frequent, top-down and random. In the first two, well-designed schemes, the attributes in each group are more relevant to each other, while in the random scheme dissimilar attributes are more likely to be put into one group. The results of the first two schemes are close to the original MT-CAS, while the random one shows a decline of about 0.3% in F1, indicating that a proper grouping scheme is helpful.

Grouping Scheme mA Acc. Prec. Recall F1
rare-frequent 77.34 78.07 88.61 84.64 86.58
top-down 76.96 77.96 88.40 84.69 86.51
random 77.01 77.71 88.26 84.53 86.35
Table 4: Influence of grouping scheme.

Reduction rate. The reduction rate controls the dimension reduction of the intermediate vector with respect to the original channel number. We try values from 2 to 32. The results in Table 5 show that generally the most appropriate choice is 16.

Reduction rate mA Acc. Prec. Recall F1
2 77.39 78.02 88.55 84.64 86.55
4 77.24 77.71 88.40 84.33 86.32
8 77.11 77.92 88.46 84.63 86.50
16 77.20 78.09 88.46 84.86 86.62
32 77.16 77.82 88.18 84.70 86.41
Table 5: Influence of reduction rate.

Module integration. We vary the positions at which the module is integrated. As the number of combinations is large, we only experiment with cases where the modules are placed consecutively. From Table 6, we see that the most effective positions for integration are layers 2 and 3. Generally, the more CAS modules are inserted in a network, the higher the score it achieves.

Layer 1 Layer 2 Layer 3 Layer 4 F1
Table 6: Influence of module integration positions.
Figure 4: Visualization of the spatial attention maps. The maps of the two networks are similar in layers 1 and 2, and diverge in layers 3 and 4. In layer 3, Task A (upper) focuses on the whole body, while Task B (lower) emphasizes various smaller regions. In layer 4, the upper-body region is discriminative for the final prediction of Task A. All predictions are correct.

5 Conclusion

In this paper, we introduce the CAS module, which enables soft feature sharing between two networks for person attribute recognition. The CAS module consists of three branches with different functions. It leverages channel and spatial information in two-task feature sharing, which has rarely been considered in previous works. The experimental results on two large pedestrian attribute recognition datasets show that the module outperforms the hard-sharing structure and two representative soft-sharing structures. Furthermore, extensive ablation studies verify the effectiveness of each key component of the module.

6 Acknowledgement

This work was supported by the Natural Science Foundation of China (Project Number 61521002 and 60673107).