Human parsing can be regarded as semantic segmentation of the human body: the model must classify every pixel of the body, as illustrated in Fig. 1. Human parsing is critical for understanding humans and supports other applications such as dressing style recognition and human behavior recognition. Human parsing has improved significantly with the development of deep learning and fully convolutional neural networks. Recently, graph neural networks (GNNs) have been applied in computer vision [14, 15, 6], and some works use graph convolution to capture the relationships between human parts [3, 13]. In general, image semantic segmentation tasks can be divided into two types: 1) low-level tasks, such as road extraction from satellite images, for which lower-level features are typically sufficient, and 2) high-level tasks, such as human parsing, which require the model to extract more semantic information. These two types of tasks call for different models. For example, DLinkNet, which was proposed for road extraction from satellite images, uses a smaller backbone and introduces the Dblock to enlarge the receptive field. PSPNet was proposed for a high-level task and uses a larger backbone, and the ASPP module was proposed to blend multi-level features.
Because of its potential for widespread application, human parsing has received increasing research attention. Human parsing can be thought of as a fine-grained semantic segmentation task. Early works in this field addressed the task with Conditional Random Fields (CRFs) combined with pose estimation information. Liang et al. proposed a Contextualized Convolutional Neural Network architecture that integrates cross-layer, global image-level, within-super-pixel, and cross-super-pixel neighborhood context into a unified network. Gong et al. explored a self-supervised structure-sensitive learning method that requires no additional supervision and improves parsing results by exploiting high-level knowledge from a global perspective. Liang et al. proposed a joint human parsing and pose estimation network that produces high-quality predictions for both tasks. Liu et al. identified several useful properties, such as feature resolution, global context information, and edge information, that lead to better results in human parsing.
Although the above works have verified the effectiveness of some modules for human segmentation, there has been no in-depth analysis of the connections and differences between common semantic segmentation tasks and human segmentation. In this paper, we explore the critical properties of human parsing and verify them experimentally. First, we argue that human segmentation requires the network to extract features that are more robust to deformation and scale changes. For example, in the left part of Fig. 1, the two people differ in action and in photographing angle, so the relative sizes of the body parts differ significantly. Second, human segmentation has higher semantic requirements than common semantic segmentation. As shown in the right part of Fig. 1, the upper and lower parts of the person's clothing have no obvious texture or color differences but are still labeled as two separate parts. Thus, the network should pay more attention to embedding semantic information into its features.
Based on the above observations, we propose a unified model, C-DLinkNet, based on DLinkNet, which is itself a modified version of LinkNet. We evaluate the model under several configurations. Our main contributions can be summarized as follows:
We replace the central block with ASPP, which aggregates contextual information effectively.
We add supervision to the output of each Decoder layer, so that the Decoder features contain more semantic information.
We propose a Smooth module that concatenates all Decoder outputs into a hyper feature, which provides more information about scale.
II Proposed Method
We present a unified high-accuracy network for human parsing. The network consists of three parts: Encoder, Decoder, and Refiner. As shown in Fig. 2, the Encoder is a feature extractor that produces robust features F, and the Decoder restores F to the original input size. We improve the Decoder by adding an auxiliary loss to every Decoder layer, and we propose a refinement process that effectively aggregates the Decoder outputs.
We use ResNet as the backbone of the feature extractor. The network has five layers, E1 to E5, each consisting of a different number of Bottleneck modules. The residual module in ResNet adds its input to its output, which effectively alleviates the vanishing gradient problem. Similar to DeepLab, to preserve the output size of the Encoder while retaining the receptive field, we use atrous convolution with dilation rate 2 in the final E5 layer, which keeps the output at 1/16 of the original image size and yields the feature F.
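The effect of the dilated E5 layer on feature resolution can be sketched with simple stride bookkeeping. The cumulative strides below (2, 4, 8, 16, 32 for E1 to E5) are an assumption based on the usual ResNet layout and are not stated in the text; the function name is hypothetical.

```python
def encoder_feature_sizes(height, width, dilate_last_stage=True):
    """Return the (H, W) feature size after each encoder layer E1..E5.

    Assumes the input size is divisible by 32 and that the layers have
    cumulative strides 2, 4, 8, 16, 32 (usual ResNet layout, an assumption).
    """
    strides = [2, 4, 8, 16, 32]
    if dilate_last_stage:
        # Using dilation 2 instead of stride 2 in E5 keeps the output at 1/16.
        strides[-1] = strides[-2]
    return {f"E{i + 1}": (height // s, width // s) for i, s in enumerate(strides)}


sizes = encoder_feature_sizes(256, 192)
print(sizes["E5"])  # (16, 12)
```

With the 256×192 input used in the experiments, the feature F would be 16×12 under these assumptions, instead of 8×6 without dilation.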
Atrous Spatial Pyramid Pooling
Global information is very helpful for fine-grained semantic segmentation. Atrous spatial pyramid pooling (ASPP) applies atrous convolutions with different dilation rates to obtain outputs at various scales, which are then concatenated to enhance the Encoder output.
Our ASPP module reduces the number of E5 output channels to 1/2 and uses atrous convolutions with dilation rates 12, 24, and 36. Each ASPP branch output is then reduced to 1/5 of the original input. After that, we apply bilinear interpolation to restore the features to their original size and concatenate them. A 1×1 convolution then mixes the features across scales and restores the number of channels to its value before the module. The number of channels of feature maps F varies:
There are two differences between our ASPP module and that of DeepLab:
To reduce the amount of computation, we reduce the number of channels of the input feature.
The number of output channels of our ASPP module equals the number of input channels. In this way, the outputs of the Decoder and the Encoder are symmetric, which makes it easy to fuse their feature maps.
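To give a sense of how much context each atrous branch covers, the sketch below computes the spatial extent of a dilated kernel for the dilation rates 12, 24, and 36. It assumes 3×3 kernels in the atrous branches, which the text does not state; the helper name is hypothetical.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by a dilated convolution kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


# Assumed 3x3 kernels; the dilation rates are the ones used in the ASPP module.
for rate in (12, 24, 36):
    print(rate, effective_kernel_size(3, rate))
# 12 25
# 24 49
# 36 73
```

On a small feature map, the larger rates already span the whole map, which is one way to see why these branches act as carriers of near-global context.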
The main purpose of the Decoder is to restore the Encoder output to the original image size, which alleviates the information loss caused by down-sampling in the Encoder. We use four Decoder layers: D5, D4, D3, and D2. They gradually increase the size of the feature maps and reduce the number of channels so that the outputs of corresponding Encoder and Decoder layers have the same size and can be blended by simple summation. Each Decoder block first passes the features through a 1×1 convolution that reduces the number of channels to 1/4 of the original, then applies a deconvolution layer with stride 2, followed by another 1×1 convolution. This design effectively reduces the computational cost of the model.
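The shape changes inside one Decoder block can be traced with a small sketch. The deconvolution hyperparameters (kernel 3, padding 1, output padding 1) and the example channel counts (2048 in, 1024 out) are assumptions chosen so that stride 2 exactly doubles the resolution; only the 1/4 channel reduction and the stride come from the text.

```python
def deconv_output_size(size, kernel_size=3, stride=2, padding=1, output_padding=1):
    """Output length of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * padding + kernel_size + output_padding


def decoder_block_shapes(channels, height, width, out_channels):
    """Return (C, H, W) after each of the three steps of a Decoder block."""
    mid = channels // 4  # 1x1 convolution: reduce channels to 1/4
    up_h = deconv_output_size(height)  # stride-2 deconvolution: upsample
    up_w = deconv_output_size(width)
    return [
        (mid, height, width),
        (mid, up_h, up_w),
        (out_channels, up_h, up_w),  # final 1x1 convolution sets output channels
    ]


# Hypothetical example: a 2048-channel 16x12 input mapped to 1024 channels.
print(decoder_block_shapes(2048, 16, 12, 1024))
# [(512, 16, 12), (512, 32, 24), (1024, 32, 24)]
```

Running the deconvolution on the reduced channel count is what keeps the block cheap: the most expensive operation sees only a quarter of the input channels.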
To add more semantic information to the Decoder output features, we apply intermediate supervision to each Decoder output. Each Decoder output first passes through two 3×3 convolution layers, and a 1×1 convolution then produces the prediction. We upsample this prediction for each Decoder layer to the ground-truth size and compute the loss, as shown in Fig. 3.
Intermediate supervision not only adds semantic information to the features, but also makes them more robust to objects of different scales. The loss can be formulated as:
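As an illustrative sketch only, one common way to write a loss with intermediate supervision is given below. The weights $\lambda_i$, the notation $\hat{Y}_i$ for the prediction from Decoder layer $D_i$, and the upsampling operator $\mathrm{Up}$ are assumptions introduced here, and the exact weighting used in this work is not specified.

```latex
\mathcal{L}
  = \mathcal{L}_{\mathrm{CE}}\bigl(\hat{Y},\, Y\bigr)
  + \sum_{i \in \{2,3,4,5\}} \lambda_i \,
    \mathcal{L}_{\mathrm{CE}}\bigl(\mathrm{Up}(\hat{Y}_i),\, Y\bigr)
```

Here $\mathcal{L}_{\mathrm{CE}}$ is the pixel-wise cross-entropy, $Y$ is the ground truth, and $\hat{Y}$ is the final prediction of the network.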
Feature quality matters. The Decoder directly adds the outputs of the Encoder and the Decoder, which causes aliasing in the feature maps.
Inspired by HyperNet, we use the Decoder block to project the output of each Decoder layer to the size of the D2 feature maps. After concatenating them, 3×3 convolutions blend these features and reduce the number of channels to 512. This operation effectively blends multi-scale features and makes the network more robust to changes in human motion. Unlike the hyper feature in HyperNet, we concatenate features from the Decoder. In addition to integrating multi-scale features, this also effectively combines high-level and low-level features.
In this section, we evaluate our method on the Look into Person (LIP) dataset and compare it with several representative methods. We also compare different modules in an ablation study.
III-A LIP Dataset
Liang et al. proposed the Look into Person (LIP) benchmark dataset and related competitions to further advance semantic segmentation, with a focus on fine-grained understanding of the human body.
The images are person instances cropped from the Microsoft COCO dataset. The dataset contains more than 50,000 images: more than 30,000 training images, 10,000 validation images, and 10,000 testing images. LIP provides fine pixel-level annotations with 19 semantic body part labels plus background.
LIP images are collected from real scenes and include people with challenging poses, viewing angles, occlusions, varied appearances, and varied resolutions. As shown in Fig. 1, the actions and scenes are diverse and the scale varies noticeably.
III-B Ablation Studies
In the ablation study, we implement the proposed framework on top of ResNet-101 pre-trained on ImageNet. The network is trained on the training set and evaluated on the validation set. The input size of the network is 256×192 during both training and testing. We use training strategies similar to those of DeepLab, i.e., the “Poly” learning rate policy with a base learning rate of 0.002. We train the networks for approximately 120 epochs on 2 GTX 1080 GPUs with a batch size of 16. For data augmentation, we apply random scaling (from 0.5 to 1.5), cropping, rotation, and left-right flipping during training, and we also use flipping for better performance. Cross-entropy is used as the loss function.
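The “Poly” learning rate policy mentioned above can be sketched as follows. The exponent `power=0.9` is the value commonly used with this policy and is an assumption here, since the text only gives the base learning rate; the function name and the iteration counts in the example are hypothetical.

```python
def poly_learning_rate(base_lr, iteration, max_iterations, power=0.9):
    """Polynomial ("Poly") decay from base_lr at iteration 0 to 0 at the end."""
    return base_lr * (1.0 - iteration / max_iterations) ** power


base_lr = 0.002          # base learning rate from the text
max_iterations = 1000    # hypothetical schedule length
for it in (0, 250, 500, 750, 1000):
    print(it, poly_learning_rate(base_lr, it, max_iterations))
```

The schedule starts at the base learning rate and decreases smoothly to zero at the last iteration.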
To explore the effectiveness of the modified modules, we report the performance of several variants in Tab. I. We use DLinkNet, which was proposed for satellite image segmentation, as our baseline; it reaches 50.92% mIoU. Analyzing the per-class scores of the baseline reveals two problems: 1) small objects such as hats, socks, and sunglasses are poorly recognized, and 2) the model tends to confuse the left and right parts of the body. We first replace the D-Block module with ASPP. As shown in Tab. I, this brings an improvement of about 0.5% in mIoU, which demonstrates that multi-scale context information helps fine-grained parsing. In particular, it gives significant boosts on smaller objects such as glove (2.3%) and sunglasses (6.5%).
TABLE I: Ablation results on the LIP validation set (per-class IoU followed by mIoU in the last column, in %). B: baseline, A: ASPP, S: Smooth module, L: intermediate supervision loss.
| B + A | 86.87 | 63.9 | 70.33 | 37.13 | 27.56 | 67.29 | 33.91 | 54.95 | 44.93 | 73.23 | 28.52 | 17.28 | 26.95 | 73.34 | 62.09 | 64.11 | 56.44 | 55.81 | 41.44 | 42.23 | 51.42 |
| B + S | 86.9 | 64.45 | 70.74 | 36.8 | 27.39 | 67.04 | 32.63 | 54.34 | 45.35 | 72.91 | 28.02 | 15.6 | 26.28 | 74.01 | 62.25 | 64.45 | 59.3 | 57.84 | 45.55 | 46.62 | 51.93 |
| B + A + S | 87.22 | 63.64 | 71.35 | 38.95 | 31.33 | 68.08 | 34.13 | 55.07 | 46.47 | 73.15 | 28.74 | 20.28 | 24.11 | 74.12 | 63.12 | 65.38 | 57.39 | 56.96 | 42.49 | 42.52 | 52.22 |
| B + S + A + L | 87.04 | 63.29 | 70.48 | 40.06 | 31.69 | 68.29 | 39.89 | 55.37 | 49.06 | 73.17 | 31.49 | 22.44 | 25.59 | 73.4 | 62.23 | 65.14 | 59.1 | 58 | 44.21 | 44.14 | 53.05 |
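As a quick sanity check on how the table rows are read, the sketch below treats the first 20 values of the B + A row as per-class IoU scores and averages them. Interpreting the last column as the mean of the preceding 20 values is an assumption, though it agrees with the reported number; the function name is hypothetical.

```python
def mean_iou(per_class_iou):
    """Mean intersection-over-union across classes."""
    return sum(per_class_iou) / len(per_class_iou)


# Per-class values copied from the "B + A" row (20 classes).
b_plus_a = [
    86.87, 63.9, 70.33, 37.13, 27.56, 67.29, 33.91, 54.95, 44.93, 73.23,
    28.52, 17.28, 26.95, 73.34, 62.09, 64.11, 56.44, 55.81, 41.44, 42.23,
]

print(round(mean_iou(b_plus_a), 2))  # 51.42
```

Because the per-class values are themselves rounded, the recomputed mean of other rows can differ from the reported mIoU in the last decimal place.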
We then test the Smooth module on the baseline model. As shown in Tab. I, adding the Smooth module increases mIoU by 1.01 points. The module also improves the accuracy on small objects and the scores for left and right parts, which demonstrates that it not only makes the model more robust to multi-scale objects but also better combines low-level and high-level features.
In the third experiment, we use the ASPP module together with the Smooth module, which is 1.3 percentage points higher than the baseline. In the last experiment of the ablation study, we add a supervised loss to each Decoder output layer, which not only makes the model more robust to multi-scale objects but also adds semantic information to the Decoder output features, helping the model mix high-level and low-level features. As shown in Tab. I, the multi-scale loss increases mIoU by 0.8 points, with notable gains for socks and for left-right parts such as left-leg and right-leg. This means that the model not only classifies small objects more accurately but also has stronger semantic features.
III-C Comparison with State of the Art
We evaluate the performance of C-DLinkNet on the LIP validation set and compare it with other state-of-the-art approaches. As shown in Tab. II, without any bells and whistles and with almost half the input size, we achieve performance comparable to CE2P, whose input size is 473, and our method outperforms JPPNet by 1.68% in terms of mIoU. It is worth noting that we do not use any additional annotation information. For example, as shown in the upper part of Fig. 4, our method segments the coat better than CE2P.
|Method||pixel acc.||mean acc.||mIoU|
In this paper, we explore the differences and connections between human parsing and general semantic segmentation models, and we use these observations to build C-DLinkNet, a model that extracts features with richer semantic information and is more robust to multi-scale objects. We achieve comparable performance with a smaller input size and without any external annotation information.
[1] (2012) Parsing clothing in fashion photographs.
[2] (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In IEEE.
[3] Graphonomy: universal human parsing via graph transfer learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[4] (2015) Deep residual learning for image recognition.
[5] (2017) Look into person: self-supervised structure-sensitive learning and a new benchmark for human parsing.
[6] (2019) Can GCNs go as deep as CNNs?. CoRR abs/1904.03751.
[7] (2018) Look into person: joint body parsing and pose estimation network and a new benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[8] (2015) Human parsing with contextualized convolutional neural network. In IEEE International Conference on Computer Vision.
[9] (2014) Fully convolutional networks for semantic segmentation. In IEEE Transactions on Pattern Analysis and Machine Intelligence.
[10] Devil in the details: towards accurate single and multiple human parsing. In the Association for the Advance of Artificial Intelligence.
[11] (2014) A high performance CRF model for clothes parsing.
[12] LIP, look into person. http://sysu-hcp.net/lip/index.php. Accessed May 24, 2019.
[13] (2019-10) Learning compositional neural information fusion for human parsing. In The IEEE International Conference on Computer Vision (ICCV).
[14] (2019) Multi-graph transformer for free-hand sketch recognition. arXiv preprint arXiv:1912.11258.
[15] (2020) Deep learning for free-hand sketch: a survey. arXiv preprint arXiv:2001.02600.
[16] (2017) Pyramid scene parsing network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[17] (2017-07) Self-supervised neural aggregation networks for human parsing. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
[18] (2018-06) D-LinkNet: LinkNet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.