Learning Semantic Neural Tree for Human Parsing

12/20/2019 ∙ by Ruyi Ji, et al. ∙ Institute of Software, Chinese Academy of Sciences University at Albany 6

The majority of existing human parsing methods formulate the task as semantic segmentation, which regard each semantic category equally and fail to exploit the intrinsic physiological structure of human body, resulting in inaccurate results. In this paper, we design a novel semantic neural tree for human parsing, which uses a tree architecture to encode physiological structure of human body, and designs a coarse to fine process in a cascade manner to generate accurate results. Specifically, the semantic neural tree is designed to segment human regions into multiple semantic subregions (e.g., face, arms, and legs) in a hierarchical way using a new designed attention routing module. Meanwhile, we introduce the semantic aggregation module to combine multiple hierarchical features to exploit more context information for better performance. Our semantic neural tree can be trained in an end-to-end fashion by standard stochastic gradient descent (SGD) with back-propagation. Several experiments conducted on four challenging datasets for both single and multiple human parsing, i.e., LIP, PASCAL-Person-Part, CIHP and MHP-v2, demonstrate the effectiveness of the proposed method. Code can be found at https://isrc.iscas.ac.cn/gitlab/research/sematree.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 4

page 6

page 7

page 10

page 11

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Human parsing aims to recognize each semantic part, e.g., arms, legs and clothes, which is one of the most fundamental and critical problems in analyzing human with various applications, such as video surveillance, human-computer interaction, and person re-identification.

With the development of convolutional neural networks (CNN) on semantic segmentation task, human parsing has obtained significant accuracy improvement recently. However, the majority of existing algorithms

[36, 10, 4, 28] formulate the task as semantic segmentation, i.e., assign each pixel with a class label, such as arm and leg, which regards each category equally and fails to exploit the intrinsic physiological structure of human body, leading to inaccurate results. For example, it is difficult to simultaneously distinguish the torso, upper-arms, lower-arms, and background pixels, especially in the cluttered scenarios.

Figure 1: Category hierarchy used in the PASCAL-Person-Part dataset [5]. Seven labels are annotated in the dataset and the labels in blue font are defined as the virtual categories.

Inspired from human perception [18], we argue that the coarse to fine process in a hierarchical design is helpful to improve the performance of human parsing. As an example in Figure 1, we introduce a virtual category upper-body, and first distinguish the upper-body from the head and lower-body pixels. After that, we segment the torso, upper-arms, and lower-arms regions from the segmented upper-body region. In this way, the hierarchical design in the cascade manner can generate more accurate results.

In this paper, we design a novel semantic neural tree (SNT) for human parsing, which uses a tree architecture to encode physiological structure of human body and design a coarse to fine process in a cascade manner. As shown in Figure 1, it is natural to divide the virtual category human into three parts including head, upper-body, and lower-body, each of which shares similar semantic context information. According to the topology structure of annotations in different datasets, we can design different tree architecture in a similar spirit. For the lead node of each path in the tree, our goal is to distinguish just a few categories. In general, the proposed semantic neural tree consists of four components, i.e.

, the backbone network for feature extraction, attention routing modules for sub-category partition, attention aggregation modules for discriminative feature representation and prediction modules for generating parsing result, laid in several levels. That is, we segment human regions into multiple semantic subregions in a hierarchical way using the attention routing module. After that, we introduce the semantic aggregation module to combine multiple hierarchical features to exploit rich context information. We generate the parsing result by aggregating the discriminative feature maps from each leaf node. Our SNT is trained in an end-to-end fashion using the standard stochastic gradient descent (SGD) with back-propagation

[20].

Several experiments are conducted on four challenging datasets, i.e., LIP [23], Pascal-Person-Part [5], CIHP [10] and MHP-v2 [37], demonstrating that our SNT method outperforms the state-of-the-art methods for both single and multiple human parsing. Meanwhile, we also carry out ablation experiments to validate the effectiveness of the components in our SNT. The main contributions are summarized as follows. (1) We propose a semantic neural tree for human parsing, which integrates the physiological structure of human body into a tree architecture, and design a coarse to fine process in a cascade manner to generate accurate results. (2) We introduce the semantic aggregation module to combine multiple hierarchical features to exploit rich context information. (3) The experimental results on several challenging single and multiple human parsing datasets demonstrate that the proposed method surpasses the state-of-the-art methods.

2 Related Work

Semantic segmentation. Semantic segmentation is one of the most relevant research directions to human parsing, which aims to assign each pixel with a class label, such as car, flower, and person. Several previous methods [36, 26, 3, 1, 4] use the fully convolutional network (FCN) to generate accurate segmentation results. Specifically, Zhao et al. [36] propose the pyramid scene parsing network (PSPNet) to capture the capability of global context information by different-region based on context aggregation. In [26]

, the multi-path refinement network is developed to extract all the information available along the down-sampling process to enable high-resolution prediction using long-range residual connections. Besides, Chen

et al. [3] introduce atrous spatial pyramid pooling (ASPP) to segment objects at multiple scales accurately. Improved from [3], they apply the depth-wise separable convolution to both ASPP and decoder modules to refine the segmentation results especially along object boundaries [4]. Recently, the meta-learning technique is applied in image prediction focused on the tasks of scene parsing, person-part segmentation, and semantic image segmentation, resulting in better performance [2]. Bilinski and Prisacariu [1] propose a cascaded architecture with feature-level long-range skip connections, which incorporates the structure of ResNeXt’s residual building blocks. However, these semantic segmentation methods are constructed without considering the relations among semantic sub-categories, resulting in limited performance for human parsing with fine-grained sub-categories.

Human parsing. Furthermore, human parsing can be regarded as a fine-grained semantic segmentation task. To adapt to the human parsing task, more useful modules are proposed and combined in the semantic segmentation methods. Ruan et al. [28] improve the PSPNet [36] by using the global context embedding module for multi-scale context information. Zhao et al. [37] employ three Generative Adversarial Network-like networks to perform semantic saliency prediction, instance-agnostic parsing and instance-aware clustering respectively. However, the aforementioned methods prefer to construct complex network for more discriminative representation, but consider little about semantic structure of human body when designing the network.

The semantic structure information is essential in human parsing. Gong et al. [10] consider instance-aware edge detection to group semantic parts into distinct person instances. Liang et al. [23]

propose a novel joint human parsing and pose estimation network, which imposes human pose structures into the parsing results without resorting to extra supervision. In 

[9]

, the hierarchical graph transfer learning is incorporated upon the parsing network to encode the underlying label semantic structures and propagate relevant semantic information. Different from them without exploring human hierarchy, we take full use of the category label hierarchy and propose a new tree architecture to learn semantic regions in a coarse to fine process.

Neural tree.

The decision tree (DT) is an effective model and widely applied in machine learning tasks. As the inherent of the interpretability, it is usually regard as an auxiliary tool to insight into the mechanism of neural network. However, the simplicity of identity function used in these methods means that input data is never transformed and thus each path from root to leaf node on the tree does not perform representation learning, limiting their performance. To integrate non-linear transformations into DTs, Kontschieder 

et al. [19] propose the stochastic and differentiable decision tree model based neural decision forest. Similarly, Xiao et al. [34]

develop a neural decision tree with a multilayer perceptron network at the root transformer. Different from above methods, our model pays more attention to the “topology structure” of annotations (see Figure

1). That is, the proposed model have a flexible semantic topology depending on certain dataset. Moreover, we introduce the semantic aggregation module to combine multiple hierarchical features for more robustness.

Figure 2: The tree architecture of our SNT model used on the LIP dataset [23], which consists of four modules, i.e., the backbone network, (b) the attention routing module, (c) the semantic aggregation module, (d) the prediction module. The blue dashed lines indicate that the semantic aggregation modules in each level aggregate the features from different layers in the backbone. The purple double arrows denote the supervision for the attention routing and prediction modules. Best view in color.

3 Methodology

The goal of the proposed Semantic Neural Tree (SNT) method is to classify local parts of human along the path from root to leaf, and then fuse the feature maps before each leaf node to form the global representation for parsing prediction. We depart each sample

with the parsing label . Notably, our model is not a full binary tree, because the topology of model is determined by the semantics of dataset. Based on our tree architecture, we group the parsing label into category label hierarchy. For example, as shown in Figure 2(a), the virtual category label head consists of several child category labels face, hair and hat in the LIP dataset [23]. Our model consists of four modules, the backbone network, the attention routing module, the semantic aggregation module, and the prediction module. We describe each module in detail in the following sections.

3.1 Architecture

Backbone network. Similar to the previous works, we rely on residual blocks of ResNet-101 network [14] to extract discriminative features of human in each sub-category. Our SNT can also work on other pre-trained networks, such as DenseNet [16] and Inception [31].

Specifically, we remove the global average pooling and fully connected layers from the network and use the truncated ResNet-101 network [14], i.e., Res-, as the backbone. Meanwhile, followed by the backbone, we add one convolutional layer with the kernel size

and stride size

to reduce the channels of feature maps Res-. Notably, as shown in Figure 2, we employ multi-scale feature representation as a powerful tool to improve the ResNet-101 backbone in the dense prediction task with highly localized discriminative regions in fine-grained categories.

Attention routing module. After the backbone network, we need to solve how to split the tree structure. Given the sample , in each level of the tree architecture, we employ the attention routing module to split the higher-level category labels and output the corresponding intermediate masks. That is, the -th attention routing module at the -th level is fed with the feature maps at the -th level. To this end, we supervise based on the labels of pre-set virtual categories.

As shown in Figure 2(b), the attention rounting module starts from one convolutional layer with the kernel size and one Squeeze-and-Excitation (SE) layer [15]. Thus we can reduce the computational complexity and enforce the model to pay more attention to discriminative regions. After that, we use one dropout layer with the drop rate , one convolutional layer with the kernel size

and one softmax layer to output the mask of the pixel-level human parts

such that . Notably, the channels of the mask consists of foreground channels and background channel, where denotes the channel number of . The foreground channels denote the sub-category labels at node while background channel is defined as the other labels excluded from the sub-category labels at node . With supervision on the masks, we can guide and split the feature maps at the -th level into several semantic sub-categories, i.e., .

Semantic aggregation module. Followed by the attention routing module , our goal is to extract discriminative feature representation for sub-categories. To this end, multi-scale feature representation is an important and effective strategy, e.g., skip-connections in the U-Net architecture [4]. On the other hand, the convolution with stride larger than one and the pooling operations will shrink feature maps, resulting in information loss in details such as the edge or small parts.

To alleviate these issues, we introduce the semantic aggregation module to deal with the feature maps . Specifically, we first adapt atrous spatial pyramid pooling (ASPP) [3] to concatenate the features from multiple atrous convolutional layers with different dilation rates arranged in parallel. Specifically, the ASPP module is built to deal with the guided feature maps after the semantic router with dilation rates to form multi-scale features. To aggregate multi-scale feature, we also use the upsampling layer to increase the spatial size of feature while halve the number of channels. After that, we use the addition operation to fuse the multi-scale features from the ASPP module and the residual features of the backbone Res- at the -th stage (see Figure 2(c)). Thus we can learn more discriminative feature maps for prediction.

Prediction module. Based on the feature maps after semantic aggregation , we use the prediction modules in different levels to generate the parsing result for each sub-category. As shown in Figure 2(d), the prediction module includes one deformable convolutional layer [39] with the kernel size , one SE layer [15]

, one batch normalization layer, one dropout layer with drop rate

and another deformable convolutional layer [39] with the kernel size . Finally, the softmax layer is used to output an estimate for conditional distribution for each pixel. For each leaf node at the -th level, we can predict the local part parsing result .

Moreover, we combines all the feature maps of each leaf node . Specifically, we remove the background channel in every leaf feature map and then concatenate the rest foreground channels, i.e., background, head, torso, arms, legs and shoes in Figure 2(a), such that the overall number of channels is equal to the number of categories. Thus we can predict the final parsing result by using the prediction module .

3.2 Loss function

As discussed above, we use three loss terms on the attention routing module, each leaf node, and the final output after prediction modules to train the whole network in an end-to-end manner, which is computed as:

(1)

where denotes the cross-entropy loss between the masks generated by the attention routing module and the corresponding ground-truth at the -th level. denotes the cross-entropy loss between the output map by the leaf node and the corresponding ground-truth map at the -th level. denotes the cross-entropy loss between the final parsing result and the global parsing label . It is worth noting that the channel number of is equal to the number of sub-category labels of node at the -th level, and the channel number of is equal to the total number of labels.

3.3 Handling multiple human parsing

To handle multiple human parsing, we integrate our method with the off-the-shelf instance segmentation framework, as similar as in [28]. Specifically, we first employ the Mask R-CNN [13] pre-trained on MS-COCO dataset [27] to segment human instances from images. Then, we train three SNT sub-models to obtain global and local human parsing results with different size of input images, i.e., one global sub-model and two local sub-models. Specifically, the global sub-model is trained on the whole images without distinguishing each instance; while the other two local sub-models are input by segmented instance patches from Mask R-CNN [13]

and ground-truth respectively. Notably, we use the same architecture for the three sub-models. Finally, both the global and local results from these sub-models are combined to output multiple human parsing results by late fusion. That is, we concatenate the feature maps before leaf node on each sub-branches in our network. Followed by the prediction module, we can estimate the categories for each pixel under the supervision of cross-entropy loss function.

4 Experiment

Following the previous works [28, 10, 38, 9], we compare our method with other state-of-the-arts on the validation set of two single human parsing datasets (i.e., LIP [23] and Pascal-Person-Part [5]) and two multiple human parsing datasets (i.e., CIHP [10] and MHP-v2 [37]

). First of all, we introduce the implementation details of our method and the evaluation metrics as follows. Then, we conduct the ablation study to demonstrate the effectiveness of the proposed modules in the tree architecture.

4.1 Implementation Details

We implement the proposed framework in PyTorch. The source code of the proposed method will be made publicly available after the paper is accepted. All models are trained on a workstation with a 3.26 GHz Intel processor, 32 GB memory, and one Nvidia V100 GPU.

Following the previous works, we adopt the ResNet-101 [14]

that is pre-trained on the ImageNet dataset

[6] as the backbone network. For a fair comparison, we set input size of images for single person parsing while for multiple person parsing. For data argumentation, we adopt the strategy of random scaling (from to ), random rotation, random croping and left-right flipping the training data. We use the SGD algorithm to train the network with momentum, and weight decay. The learning rate is initialized to and declined by in every epochs. Notably, the warming up policy is applied for training. That is, we use the learning rate of to warm up the model in the first epochs, and then increase learning rate up to linearly. The model is optimized in epochs, where the dropout operation is valid only in the training phase. The topology of the proposed network is designed based on the sub-categories in different datasets, which is described in the appendix in detail.

4.2 Metrics

First, we employ the mean IoU metric (mIOU) to evaluate the global-level predictions in single human parsing datasets (i.e., LIP [23] and Pascal-Person Part [5]). Then, we use three metrics (i.e., , and PCP) to evaluate the instance-level predictions in multiple human parsing. The score denotes the area under the precision-recall curve based on the limitation of different IoU thresholds (e.g., , , ) [12]. PCP elaborates how many body parts are correctly predicted of a certain person [21]. computes the pixel-level IoU of semantic part categories within a person. Similar to the previous works, we use the metrics of mIoU and to evaluate the performance on the CIHP dataset [10] while PCP and to evaluate the performance on the MHP-v2 dataset [37].

4.3 Single Human Parsing

We compare the performance of single human parsing of our proposed method with other state-of-the-arts on the LIP [23] and Pascal-Person-Part [5] datasets. The qualitative human parsing results are visualized in Figure 3.

Figure 3: Some visualized examples for single human parsing: (a) the LIP dataset [23] and (b) the Pascal-Person-Part dataset [5].

Evaluation on LIP Dataset.

The LIP dataset defines body parts and clothes categories, including images with pixel-level annotations. Specifically, there are full-body images, upper-body images, lower-body images, head-missing images, backview images and images with occlusions. training and validation images are provided with publicly available annotations. As shown in Figure 2, we construct the tree architecture in -level.

As presented in Table 1, we can conclude that our method achieves the best performance in terms of all the three metrics. Since semantic segmentation methods (e.g., DeepLab [3] and PSPNet [36]) consider little about fine-grained classification in the human parsing task, they perform not well. Moreover, the CE2P method [28] improves PSPNet [36] by adding the context embedding branch, achieving mIOU score. Our method exceeds the current state-of-the-art CE2P [28] by in terms of mIOU score. It indicates that our method can learn discriminative representation of each sub-category for human parsing. Moreover, as shown in Table 2, our method obtain the best mIOU score in each sub-category. Notably, our method achieves considerable accuracy improvement compared with the other methods in some ambiguous sub-categories, e.g., glove, j-suit, and shoe.

Method pixel acc. mean acc. mIoU
Attention+SSL [11] - - 44.73
DeepLab [3] 84.09 55.62 44.80
MMAN [29] - - 46.81
SS-NAN [35] 87.60 56.00 47.92
MuLA [30] 88.50 60.50 49.30
PSPNet [36] 86.23 61.33 50.56
JPPNet [22] 86.39 62.32 51.37
CE2P [28] - - 52.56
CE2P(w flip) [28] 87.37 63.20 53.10
Ours 88.05 66.42 54.73
Table 1: The evaluation results on the validation set of LIP [23].
Method bkg. hat hair glove glasses u-clothes dress coat socks pants j-suit scarf skirt face l-arm r-arm l-leg r-leg l-shoe r-shoe mIoU
Attention+SSL [11] 84.6 59.8 67.3 29.0 21.6 65.3 29.5 51.9 38.5 68.0 24.5 14.9 24.3 71.0 52.6 55.8 40.2 38.8 28.1 29.0 44.7
DeepLab [3] 84.1 59.8 66.2 28.8 23.9 65.0 33.7 52.9 37.7 68.0 26.1 17.4 25.2 70.0 50.4 53.9 39.4 38.3 27.0 28.4 44.8
PSPNet [36] 86.1 63.5 68.0 39.1 23.8 68.1 31.7 56.2 44.5 72.7 28.7 15.7 25.7 70.8 59.7 62.3 54.9 54.5 42.3 42.9 50.6
MMAN [29] 84.8 57.7 65.6 30.1 20.0 64.2 28.4 52.0 41.5 71.0 23.6 9.7 23.2 69.5 55.3 58.1 51.9 52.2 38.6 39.0 46.8
JPPNet [22] 86.3 63.6 70.2 36.2 23.5 68.2 31.4 55.7 44.6 72.2 28.4 18.8 25.1 73.4 62.0 63.9 58.2 58.0 44.0 44.1 51.4
CE2P [28] 87.4 64.6 72.1 38.4 32.2 68.9 32.2 55.6 48.8 73.5 27.2 13.8 22.7 74.9 64.0 65.9 59.7 58.0 45.7 45.6 52.6
Ours 88.2 66.9 72.2 42.7 32.3 70.1 33.8 57.5 48.9 75.2 32.5 19.4 27.4 74.9 65.8 68.1 60.3 59.8 47.6 48.1 54.7
Table 2: The evaluation results on the validation set of LIP [23] in each category.

Evaluation on Pascal-Person-Part Dataset.

The PASCAL-Person-Part dataset [5] is originally from the PASCAL VOC-2010 dataset [7], and then extended for human parsing with coarse body part labels (i.e., head, torso, upper-/lower-arms, and upper-/lower-legs). It consists of training images and testing images ( images in total). As shown in Figure 1, we construct the tree architecture in -level. Specifically, the virtual category human consists of three sub-categories, i.e., head, upper-body including torso, upper-arms and lower-arms and lower-body including upper-legs and lower-legs.

We report the performance on the Pascal-Person-Part dataset in Table 3. Similar to the trend in the LIP dataset [23], the semantic segmentation methods, e.g., DeepLab [3] and DeepLab v3+ [4], perform inferior mIoU score, i.e., less than . Moreover, the Graphonomy method [9] learns and propagates compact high-level graph representation among the labels within one dataset, resulting in better mIoU score. Besides, DPC [2] achieves state-of-the-art performance with mIoU score. This is because it employs meta-learning to search optimal efficient multi-scale network for human parsing. Our SNT method obtains the best overall mIoU score of and best mIoU scores in terms of u-arms, u-legs and l-legs among all the compared methods, which indicates the effectiveness of our proposed tree network.

Method head torso u-arms l-arms u-legs l-legs bkg. mIoU
HAZN [32] 80.79 80.76 45.65 43.11 41.21 37.74 93.78 57.54
Attention+SSL [11] 83.26 62.40 47.80 45.58 42.32 39.48 94.68 59.36
Graph LSTM [25] 82.69 62.68 46.88 47.71 45.66 40.93 94.59 60.16
SE LSTM [24] 82.89 67.15 51.42 48.72 51.72 45.91 97.18 63.57
Part FCN [33] 85.50 67.87 54.72 54.30 48.25 44.76 95.32 64.39
DeepLab [3] - - - - - - - 64.94
MuLA [30] - - - - - - - 65.10
SAN [17] 86.12 73.49 59.20 56.20 51.39 49.58 96.01 64.72
WSHP [8] 87.15 72.28 57.07 56.21 52.43 50.36 97.72 67.60
DeepLab v3+ [4] - - - - - - - 67.84
PGN [10] 90.89 75.12 55.83 64.61 55.42 41.57 95.33 68.40
Bilinski et al. [1] - - - - - - - 68.60
Graphonomy [9] - - - - - - - 69.12
DPC [2] 88.81 74.54 63.85 63.73 57.24 54.55 96.66 71.34
Ours 89.15 74.76 63.90 63.95 57.53 54.62 96.84 71.59
Table 3: The evaluation results on the validation set of Pascal-Person-Part [5].

4.4 Multiple Human Parsing

Furthermore, we evaluate the proposed method on two large-scale multiple human parsing datasets, i.e., CIHP [10] and MHP-v2 [37]. For a fair comparison, we apply same Mask R-CNN model to output instance segmentation masks. Then, we use the global parsing and two local parsing models for human parsing as in [28]. Following the [28], final results are obtained by fusing the results from three branch models with a refinement process. Some visual results are shown in Figure 4, which indicates that our method can also generate precise and fine-grained results in multiple human parsing scenes.

Figure 4: Some visualized examples for multiple human parsing: (a) the CIHP dataset [10] and (b) the MHP-v2 dataset [37].

Evaluation on CIHP Dataset. The CIHP dataset [10] is the largest multi-person human parsing dataset with diverse human images, i.e., training, validation and test images. It is labeled with pixel-wise annotations on categories and instance-level identification. We use the same topology (i.e., -level tree structure as shown in Figure 2) in the LIP dataset [23] to perform human parsing because the two datasets share the same sub-category semantic annotations.

As shown in Table 4, our method outperforms other compared methods (i.e., PGN [10] and M-CE2P [28]), achieving score of . It is worth mentioning that SNT outperforms M-CE2P [28] in terms of score by considerable improvement, i.e., vs. . It indicates that our method facilitates improving the segmentation accuracy of human instances.

Method mIoU
PGN [10] 55.89 35.80 28.60 20.50 33.60
M-CE2P [28] 59.50 48.69 40.13 29.74 42.83
Ours 60.87 49.27 41.98 33.00 43.96
Table 4: The evaluation results on the validation set of CIHP [10]. denotes the mean value.

Evaluation on MHP-v2 Dataset. The MHP-v2 dataset [37] includes elaborately annotated images with fine-grained semantic category labels, involving persons per image and captured in real-world scenes from various viewpoints, poses, occlusion, interactions and background. Since this dataset has more labels than the LIP dataset [23], we construct the tree architecture in -level.

As shown in Table 5, the semantic segmentation method Mask R-CNN [13] only obtains score of and score of on the challenging multiple human parsing dataset. NAN [38] achieves the best score of , but much inferior performance in both and scores. Our method achieves comparable state-of-the-art performance whth M-CE2P [28] in terms of three metrics. It indicates that the coarse to fine process in a hierarchical design can facilitate improving the accuracy.

Method
Mask R-CNN [13] 25.12 14.50 -
MH-Parser [24] 26.91 18.05 -
NAN [38] 34.37 24.87 42.77
M-CE2P [28] 43.77 34.47 42.70
Ours 43.50 34.36 42.51
Table 5: The performance on the validation set of MHP-v2 [37]. denotes the mean value.

4.5 Ablation study

We study the influence of some important parameters and components of our SNT method, i.e., the height of the tree, the attention routing module, the semantic aggregation module and the predition module. The experiment is conducted on the LIP dataset [23].

Height of the tree. The height of the tree indicates the complexity of the network. To explore the optimal height, we design five variants with different heights of the tree, see Figure 2(a). If the height is equal to , only the ResNet-101 backbone is used for human parsing. As presented in Table 6, we can observe there is a sharp decline in mean accuracy and mIoU score. We find that our method with -level achieves the best performance, i.e., mIoU score. This is attributed to two reasons. First, the model is not fine enough to predict the labels based on limited number of parameters in our model when the height is less than , resulting in limited performance. Second, too deep tree (i.e., ) corresponds to many parameters. However, training on limited data may cause over-fitting of our model to decrease the accuracy slightly.

height of the tree pixel acc. (%) mean acc. (%) mIoU (%)
0 84.81 57.12 46.34
1 86.84 64.03 52.15
2 87.42 65.58 53.32
3 88.05 66.42 54.73
4 86.92 64.34 51.42
Table 6: Effect of the height of the tree on the LIP dataset [23].

Effectiveness of prediction module. To analyze prediction module in the proposed network, we construct two variants of our method, i.e., “ours w/o dconv” and “ours w/o pred”. As shown in Figure 2(d), the “ours w/o dconv” method indicates that we use traditional convolutional layers instead of deformable convolutional layers in the prediction module; while the “ours w/o pred” method indicates that we combine the prediction results of each leaf node for final parsing result without the prediction module.

From Table 7, our method performs better than the “ours w/o dconv” method with improvement in terms of mIoU. It indicates that the deformable convolutional layers can facilitate align the semantic information in different channels of feature maps. If we do not use the prediction module to generate the final parsing result, we can observe a sharp decrease in mIoU score, i.e., vs. . It is essential to achieve accurate parsing result based on the context information among every sub-categories.

Effectiveness of semantic aggregation module. To verify the effectiveness of the semantic aggregation module, we construct the “ours w/o skip” method, which indicates that we do not combine the residual blocks from the backbone in attention aggregation (see Figure 2(c)). Based on the comparison between our method and the “ours w/o skip” method, we can conclude that the skip-connection from the backbone (see the dashed blue lines in Figure 2(a)) can bring mIoU improvement. This is because the skip-connection in our network can exploit multi-scale representation for sub-categories.

Effectiveness of attention routing module. To study the effect of the attention routing module, the “ours w/o mask” indicates that we further remove the attention mask in the attention routing module from the “ours w/o skip” method (see Figure 2(b)). That is, we directly split the feature maps into several semantic maps for the next level. As presented in Table 7, the “ours w/o skip” method achieves improvement in mIoU score compared with the “ours w/o mask” method. It demonstrates the attention mask can enforce the tree network focus on discriminative representation for specific sub-category semantic information.

variant pixel acc. (%) mean acc. (%) mIoU (%)
Ours w/o mask 86.84 64.03 52.15
Ours w/o skip 87.42 65.58 53.32
Ours w/o pred 85.34 63.22 50.02
Ours w/o dconv 87.61 66.05 54.42
Ours 88.05 66.42 54.73
Table 7: Variants of the SNT method on the LIP dataset [23].

5 Conclusion

In this paper, we propose a novel semantic tree network for human parsing. Specifically, the proposed tree architecture can encode physiological structure of human body and segment multiple semantic subregions in a hierarchical way. Extensive experiment on four challenging single and multiple human parsing datasets indicates the effectiveness of the proposed semantic tree structure. Our method can learn discriminative feature representation and exploit more context information for sub-categories effectively. For future work, we plan to optimize the tree architecture for better performance by neural architecture search techniques.

Appendix A The Category Label Definition in the LIP and CIHP Datasets

Since the LIP [23] and CIHP [10] datasets use the same annotations, we adopt the same architecture of our neural tree, shown in Figure 1. We report the category label definition in our neural tree in Table 8.

Leaf Label
Background
Hat, Hair, Sunglasses, Face
Scarf, Upper-clothes, Coat, Dress
Left-arm, Right-arm, Glove
Skirt, Pants, Jumpsuit, Left-leg, Right-leg
Socks, Left-shoe, Right-shoe
Table 8: The category label definition used in the LIP [23] and CIHP [10] datasets.

Appendix B The Architecture of Our Semantic Neural Tree in the Pascal-Person-Part Dataset

The PASCAL-Person-Part dataset [5] is originally from the PASCAL VOC-2010 dataset [7], and is extended for human parsing with coarse body part labels (i.e., head, torso, upper-/lower-arms, and upper-/lower-legs). As shown in Figure 5, we construct a neural tree with -level. We summarize the category label definition in Table 9.

Figure 5: The architecture of our semantic neural tree used in the Pascal-Person-Part dataset [5].
Leaf Label
Background
Head
Torso
U-arms
L-arms
U-legs
L-legs
Table 9: The category label definition used in the Pascal-Person-Part dataset.

Appendix C The Architecture of Our Semantic Neural Tree in the MHP-v2 Dataset

The MHP-v2 dataset [37] includes elaborately annotated images with fine-grained semantic category labels, involving persons per image and captured in real-world scenes from various viewpoints, poses, occlusion, interactions and background. As shown in Figure 6, we construct a neural tree with -level. We summarize the category label definition in Table 10.

Figure 6: he architecture of our semantic neural tree used in the MHP-v2 dataset [37].
Leaf Label
Background
Backpack, Protector, Ball, Bats, Bottle, Carrybag, Cases,
Umbrella, Wallet/Purse
Other-full-body-clothes, Other-accessary, Other-upper-body-clothes,
Other-lower-body-clothes
Cap/Hat, Helmet, Hair, Sunglasses, Face, Headwear, Eyewear
Bikini/Bra, Jacket/Windbreaker/Hoodie, Tee-shirt, Polo-shirt,
Sweater, Singlet, Torso-skin, Robe, Coat, Dress, Tie, Scarf, Belt
Glove, Watch, Wristband, Left-arm, Right-arm, Left-hand, Right-hand
Left-leg, Right-leg, Jumpsuit, Pants, Shorts/Swim-shorts, Skirt
Stockings, Socks, Left-boot, Right-boot, Left-shoe, Right-shoe
Left-higheel, Right-higheel, Left-sandal, Right-sandal, Left-foot,
Right-foot
Table 10: The category label definition used in the MHP-v2 dataset.

References

  • [1] P. Bilinski and V. Prisacariu (2018) Dense decoder shortcut connections for single-pass semantic segmentation. In CVPR, pp. 6596–6605. Cited by: §2, Table 3.
  • [2] L. Chen, M. D. Collins, Y. Zhu, G. Papandreou, B. Zoph, F. Schroff, H. Adam, and J. Shlens (2018) Searching for efficient multi-scale architectures for dense image prediction. In NeurIPS, Vol. abs/1809.04184. Cited by: §2, §4.3, Table 3.
  • [3] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. TPAMI 40 (4), pp. 834–848. Cited by: §2, §3.1, §4.3, §4.3, Table 1, Table 2, Table 3.
  • [4] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, pp. 833–851. Cited by: §1, §2, §3.1, §4.3, Table 3.
  • [5] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. L. Yuille (2014) Detect what you can: detecting and representing objects using holistic models and body parts. In CVPR, pp. 1979–1986. Cited by: Figure 5, Appendix B, Figure 1, §1, Figure 3, §4.2, §4.3, §4.3, Table 3, §4.
  • [6] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In CVPR, Vol. , pp. 248–255. Cited by: §4.1.
  • [7] M. Everingham, L. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman (2010) The pascal visual object classes (VOC) challenge. IJCV 88 (2), pp. 303–338. Cited by: Appendix B, §4.3.
  • [8] H. Fang, G. Lu, X. Fang, J. Xie, Y. Tai, and C. Lu (2018) Weakly and semi supervised human body part parsing via pose-guided knowledge transfer. CoRR abs/1805.04310. Cited by: Table 3.
  • [9] K. Gong, Y. Gao, X. Liang, X. Shen, M. Wang, and L. Lin (2019) Graphonomy: universal human parsing via graph transfer learning. In CVPR, pp. 7450–7459. Cited by: §2, §4.3, Table 3, §4.
  • [10] K. Gong, X. Liang, Y. Li, Y. Chen, M. Yang, and L. Lin (2018) Instance-level human parsing via part grouping network. In ECCV, pp. 805–822. Cited by: Table 8, Appendix A, §1, §1, §2, Figure 4, §4.2, §4.4, §4.4, §4.4, Table 3, Table 4, §4.
  • [11] K. Gong, X. Liang, D. Zhang, X. Shen, and L. Lin (2017) Look into person: self-supervised structure-sensitive learning and a new benchmark for human parsing. In CVPR, pp. 6757–6765. Cited by: Table 1, Table 2, Table 3.
  • [12] B. Hariharan, P. A. Arbeláez, R. B. Girshick, and J. Malik (2014) Simultaneous detection and segmentation. In ECCV, pp. 297–312. Cited by: §4.2.
  • [13] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick (2017) Mask R-CNN. In ICCV, pp. 2980–2988. Cited by: §3.3, §4.4, Table 5.
  • [14] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §3.1, §3.1, §4.1.
  • [15] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In CVPR, pp. 7132–7141. Cited by: §3.1, §3.1.
  • [16] G. Huang, Z. Liu, and K. Q. Weinberger (2016) Densely connected convolutional networks. CoRR abs/1608.06993. Cited by: §3.1.
  • [17] Z. Huang, C. Wang, X. Wang, W. Liu, and J. Wang (2019) Semantic image segmentation by scale-adaptive networks. TIP. Cited by: Table 3.
  • [18] R. Kimchi (1992) Primacy of wholistic processing and global/local paradigm: a critical review.. Psychological bulletin 112 (1), pp. 24. Cited by: §1.
  • [19] P. Kontschieder, M. Fiterau, A. Criminisi, and S. R. Bulò (2015) Deep neural decision forests. In ICCV, Vol. , pp. 1467–1475. Cited by: §2.
  • [20] Y. LeCun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel (1989) Backpropagation applied to handwritten zip code recognition. Neural Computation 1 (4), pp. 541–551. Cited by: §1.
  • [21] J. Li, J. Zhao, Y. Wei, C. Lang, Y. Li, and J. Feng (2017) Towards real world human parsing: multiple-human parsing in the wild. CoRR abs/1705.07206. Cited by: §4.2.
  • [22] X. Liang, K. Gong, X. Shen, and L. Lin (2018)

    Look into person: joint body parsing & pose estimation network and A new benchmark

    .
    CoRR abs/1804.01984. Cited by: Table 1, Table 2.
  • [23] X. Liang, K. Gong, X. Shen, and L. Lin (2019) Look into person: joint body parsing & pose estimation network and a new benchmark. TPAMI 41 (4), pp. 871–885. Cited by: Table 8, Appendix A, §1, Figure 2, §2, §3, Figure 3, §4.2, §4.3, §4.3, §4.4, §4.4, §4.5, Table 1, Table 2, Table 6, Table 7, §4.
  • [24] X. Liang, L. Lin, X. Shen, J. Feng, S. Yan, and E. P. Xing (2017) Interpretable structure-evolving LSTM. CoRR abs/1703.03055. Cited by: Table 3, Table 5.
  • [25] X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan (2016) Semantic object parsing with graph LSTM. In ECCV, pp. 125–143. Cited by: Table 3.
  • [26] G. Lin, A. Milan, C. Shen, and I. D. Reid (2017) RefineNet: multi-path refinement networks for high-resolution semantic segmentation. In CVPR, pp. 5168–5177. Cited by: §2.
  • [27] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In ECCV, pp. 740–755. Cited by: §3.3.
  • [28] T. Liu, T. Ruan, Z. Huang, Y. Wei, S. Wei, Y. Zhao, and T. Huang (2018) Devil in the details: towards accurate single and multiple human parsing. CoRR abs/1809.05996. Cited by: §1, §2, §3.3, §4.3, §4.4, §4.4, §4.4, Table 1, Table 2, Table 4, Table 5, §4.
  • [29] Y. Luo, Z. Zheng, L. Zheng, T. Guan, J. Yu, and Y. Yang (2018) Macro-micro adversarial network for human parsing. In ECCV, Cited by: Table 1, Table 2.
  • [30] X. Nie, J. Feng, and S. Yan (2018) Mutual learning to adapt for joint human parsing and pose estimation. In ECCV, pp. 519–534. Cited by: Table 1, Table 3.
  • [31] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2015)

    Rethinking the inception architecture for computer vision

    .
    CoRR abs/1512.00567. Cited by: §3.1.
  • [32] F. Xia, P. Wang, L. Chen, and A. L. Yuille (2016) Zoom better to see clearer: human and object parsing with hierarchical auto-zoom net. In ECCV, pp. 648–663. Cited by: Table 3.
  • [33] F. Xia, P. Wang, X. Chen, and A. L. Yuille (2017) Joint multi-person pose estimation and semantic part segmentation. In CVPR, pp. 6080–6089. Cited by: Table 3.
  • [34] H. Xiao (2017) NDT: neual decision tree towards fully functioned neural graph. CoRR abs/1712.05934. Cited by: §2.
  • [35] R. Zhang, S. Tang, Y. Zhang, J. Li, and S. Yan (2017) Scale-adaptive convolutions for scene parsing. In ICCV, pp. 2050–2058. Cited by: Table 1.
  • [36] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia (2017) Pyramid scene parsing network. In CVPR, pp. 6230–6239. Cited by: §1, §2, §2, §4.3, Table 1, Table 2.
  • [37] J. Zhao, J. Li, Y. Cheng, T. Sim, S. Yan, and J. Feng (2018) Understanding humans in crowded scenes: deep nested adversarial learning and A new benchmark for multi-human parsing. In ACM MM, pp. 792–800. Cited by: Figure 6, Appendix C, §1, §2, Figure 4, §4.2, §4.4, §4.4, Table 5, §4.
  • [38] J. Zhao, J. Li, Y. Cheng, L. Zhou, T. Sim, S. Yan, and J. Feng (2018) Understanding humans in crowded scenes: deep nested adversarial learning and A new benchmark for multi-human parsing. CoRR abs/1804.03287. Cited by: §4.4, Table 5, §4.
  • [39] X. Zhu, H. Hu, S. Lin, and J. Dai (2019) Deformable convnets V2: more deformable, better results. In CVPR, pp. 9308–9316. Cited by: §3.1.