Code repository for Self-supervised Structure-sensitive Learning, CVPR'17
Human parsing and pose estimation have recently received considerable interest due to their substantial application potentials. However, the existing datasets have limited numbers of images and annotations and lack a variety of human appearances and coverage of challenging cases in unconstrained environments. In this paper, we introduce a new benchmark named "Look into Person (LIP)" that provides a significant advancement in terms of scalability, diversity, and difficulty, which are crucial for future developments in human-centric analysis. This comprehensive dataset contains over 50,000 elaborately annotated images with 19 semantic part labels and 16 body joints, which are captured from a broad range of viewpoints, occlusions, and background complexities. Using these rich annotations, we perform detailed analyses of the leading human parsing and pose estimation approaches, thereby obtaining insights into the successes and failures of these methods. To further explore and take advantage of the semantic correlation of these two tasks, we propose a novel joint human parsing and pose estimation network to explore efficient context modeling, which can simultaneously predict parsing and pose with extremely high quality. Furthermore, we simplify the network to solve human parsing by exploring a novel self-supervised structure-sensitive learning approach, which imposes human pose structures into the parsing results without resorting to extra supervision. The dataset, code and models are available at http://www.sysu-hcp.net/lip/.
Code repository for Joint Body Parsing & Pose Estimation Network, T-PAMI 2018
Comprehensive human visual understanding of scenarios in the wild, which is regarded as one of the most fundamental problems in computer vision, could have a crucial impact in many higher-level application domains, such as person re-identification, video surveillance, human behavior analysis [16, 22] and automatic product recommendation. Human parsing (also named semantic part segmentation) aims to segment a human image into multiple parts with fine-grained semantics (e.g., body parts and clothing) and provides a more detailed understanding of image contents, whereas human pose estimation focuses on determining the precise locations of important body joints. Human parsing and pose estimation are two critical and correlated tasks in analyzing images of humans by providing both pixel-wise understanding and high-level joint structures.
Recently, convolutional neural networks (CNNs) have achieved exciting success in human parsing [23, 26, 25] and pose estimation [33, 44]. Nevertheless, as demonstrated in many other problems, such as object detection  and semantic segmentation , the performance of such CNN-based approaches heavily relies on the availability of annotated images for training. To train a human parsing or pose network with potential practical value in real-world applications, it is highly desired to have a large-scale dataset that is composed of representative instances with varied clothing appearances, strong articulation, partial (self-)occlusions, truncation at image borders, diverse viewpoints and background clutters. Although training sets exist for special scenarios, such as fashion pictures [47, 14, 23, 26] and people in constrained situations (e.g., upright) , these datasets are limited in their coverage and scalability, as shown in Fig. 2. The largest public human parsing dataset  to date only contains 17,000 fashion images, while others only include thousands of images. The MPII Human Pose dataset  is the most popular benchmark for evaluating articulated human pose estimation methods, and this dataset includes approximately 25K images that contain over 40K people with annotated body joints. However, all these datasets only focus on addressing different aspects of human analysis by defining discrepant annotation sets. There are no available unified datasets with both human parsing and pose annotations for holistic human understanding, until our work fills this gap.
Furthermore, to the best of our knowledge, no attempt has been made to establish a standard representative benchmark that aims to cover a wide range of challenges for the two human-centric tasks. The existing datasets do not provide an evaluation server with a secret test set to avoid potential dataset over-fitting, which hinders further development on this topic. With the new benchmark named “Look into Person (LIP)”, we provide a public server for automatically reporting evaluation results. Our benchmark significantly advances the state-of-the-art in terms of appearance variability and complexity, and it includes 50,462 human images with pixel-wise annotations of 19 semantic parts and 16 body joints.
Recent progress in human parsing has been achieved by improving the feature representations using CNNs and recurrent neural networks. To capture rich structure information, these approaches combine CNNs and graphical models (e.g., conditional random fields (CRFs)), similar to the general object segmentation approaches [52, 6, 43]. However, when evaluated on the new LIP dataset, the results of some existing methods [3, 31, 6, 8] are unsatisfactory. Without imposing human body structure priors, these general approaches based on bottom-up appearance information occasionally produce unreasonable results (e.g., right arm connected with left shoulder), as shown in Fig. 1. Human body structural information has previously been well explored in human pose estimation [49, 10], where dense joint annotations are provided. However, since human parsing requires more extensive and detailed predictions than pose estimation, it is difficult to directly utilize joint-based pose estimation models in pixel-wise predictions to incorporate the complex structure constraints. We demonstrate that the human joint structure can facilitate the pixel-wise parsing prediction by incorporating higher-level semantic correlations between human parts.
For pose estimation, increasing research efforts [11, 4, 12, 35] have been devoted to learning the relationships between human body parts and joints. Some studies [11, 4] explored encoding part constraints and contextual information for guiding the network to focus on informative regions (human parts) to predict more precise locations of the body joints, which achieved state-of-the-art performance. Conversely, some pose-guided human parsing methods [13, 46] also sufficiently utilized the peculiarity and relationships of these two correlated tasks. However, previous works generally solve these two problems separately or alternatively.
| Dataset | #Total | #Training | #Validation | #Test | Parsing Categories | Body Joints |
|---|---|---|---|---|---|---|
| Fashionista (Parsing) | 685 | 456 | - | 229 | 56 | - |
| PASCAL-Person-Part (Parsing) | 3,533 | 1,716 | - | 1,817 | 7 | - |
| ATR (Parsing) | 17,700 | 16,000 | 700 | 1,000 | 18 | - |
| LSP (Pose) | 2,000 | 1,000 | - | 1,000 | - | 14 |
| MPII (Pose) | 24,987 | 18,079 | - | 6,908 | - | 16 |
| J-HMDB (Pose) | 31,838 | - | - | - | 2 | 13 |
In this work, we aim to seamlessly integrate human parsing and pose estimation under a unified framework. We use a shared deep residual network for feature extraction, after which there are two distinct small networks to encode and predict the contextual information and results. Then, a simple yet efficient refinement network tailored for both parsing and pose prediction is proposed to explore efficient context modeling, which makes human parsing and pose estimation mutually beneficial. In our unified framework, we propose a scheme to incorporate multi-scale feature combinations and iterative location refinement together, which are often posed as two different coarse-to-fine strategies that are widely investigated for human parsing and pose estimation separately. To highlight the merit of unifying the two highly correlated and complementary tasks within an end-to-end framework, we name our framework the joint human parsing and pose estimation network (JPPNet).
However, note that previous human-centric datasets do not annotate both pixel-wise labeling maps and pose joints. Therefore, in this work, we also design a simplified network suited to general human parsing datasets and networks with no need for pose annotations. To explicitly enforce the produced parsing results to be semantically consistent with the human pose and joint structures, we propose a novel structure-sensitive learning approach for human parsing. In addition to using the traditional pixel-wise part annotations as the supervision, we introduce a structure-sensitive loss to evaluate the quality of the predicted parsing results from a joint structure perspective. This means that a satisfactory parsing result should be able to preserve a reasonable joint structure (e.g., the spatial layouts of human parts). We generate approximated human joints directly from the parsing annotations and use them as the supervision signal for the structure-sensitive loss. This self-supervised structure-sensitive network is a simplified version of our JPPNet, denoted as SS-JPPNet, which is appropriate for the general human parsing datasets without pose annotations.
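To make the idea of deriving joints from parsing annotations concrete, the following is a minimal sketch, assuming that each approximated joint is the centroid of one part region and that the joint-structure term is the mean squared distance between predicted and annotated centroids. The function names, the toy label maps, and the way missing parts are skipped are all illustrative assumptions; the text above does not specify how the loss is computed or how it is combined with the pixel-wise loss.

```python
def part_centers(label_map, part_labels):
    """Approximate one 'joint' per part as the centroid of that part's pixels.

    label_map: 2-D list of integer labels (rows of pixels).
    Returns {label: (row, col)} for the parts that appear in the map.
    """
    sums = {p: [0.0, 0.0, 0] for p in part_labels}
    for r, row in enumerate(label_map):
        for c, lab in enumerate(row):
            if lab in sums:
                s = sums[lab]
                s[0] += r
                s[1] += c
                s[2] += 1
    return {p: (s[0] / s[2], s[1] / s[2]) for p, s in sums.items() if s[2] > 0}


def joint_structure_loss(pred_map, gt_map, part_labels):
    """Mean squared distance between predicted and annotated part centers.

    Parts missing from either map are skipped (a simplifying assumption).
    """
    pred_c = part_centers(pred_map, part_labels)
    gt_c = part_centers(gt_map, part_labels)
    shared = [p for p in part_labels if p in pred_c and p in gt_c]
    if not shared:
        return 0.0
    total = 0.0
    for p in shared:
        dr = pred_c[p][0] - gt_c[p][0]
        dc = pred_c[p][1] - gt_c[p][1]
        total += dr * dr + dc * dc
    return total / len(shared)


# Toy 3x4 maps: 0 = background, 1 and 2 are two hypothetical part labels.
gt_map = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 2, 2, 0],
]
pred_map = [
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 2, 2],
]
loss = joint_structure_loss(pred_map, gt_map, part_labels=[1, 2])
```

In this toy case part 1 is predicted exactly and part 2 is shifted one pixel to the right, so only part 2 contributes to the loss. No pose annotation is used at any point, which is what makes the supervision signal self-generated.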
Our contributions are summarized in the following three aspects. 1) We propose a new large-scale benchmark and an evaluation server to advance the human parsing and pose estimation research, in which 50,462 images with pixel-wise annotations on 19 semantic part labels and 16 body joints are provided. 2) By experimenting on our benchmark, we present detailed analyses of the existing human parsing and pose estimation approaches to obtain some insights into the successes and failures of these approaches and thoroughly explore the relationship between the two human-centric tasks. 3) We propose a novel joint human parsing and pose estimation network, which incorporates the multi-scale feature connections and iterative location refinement in an end-to-end framework to investigate efficient context modeling and then enable parsing and pose tasks that are mutually beneficial to each other. This unified framework achieves state-of-the-art performance for both human parsing and pose estimation tasks. The simplified network for human parsing task with self-supervised structure-sensitive learning also significantly surpasses the previous methods on both the existing PASCAL-Person-Part dataset  and our new LIP dataset.
Human parsing and pose datasets: The commonly used publicly available datasets for human parsing and pose are summarized in Table I. For human parsing, the previous datasets were labeled with a limited number of images or categories. The largest dataset to date only contains 17,000 fashion images with mostly upright fashion models. These small datasets are unsuitable for training models with complex appearance representations and multiple components [20, 39, 12], which could perform better. For human pose, the LSP dataset only contains sports people, and it fails to cover real-life challenges. The MPII dataset has more images and a wider coverage of activities that cover real-life challenges, such as truncation, occlusions, and variability of imaging conditions. However, this dataset only provides 2D pose annotations. J-HMDB provides densely annotated image sequences and a larger number of videos for 21 human actions. Although the puppet mask and human pose are annotated in all 31,838 frames, detailed part segmentations are not labeled.
Our proposed LIP benchmark dataset is the first effort that focuses on the two human-centric tasks. Containing 50,462 images annotated with 20 parsing categories and 16 body joints, our LIP dataset is the largest and most comprehensive human parsing and pose dataset to date. Some other datasets in the vision community were dedicated to the tasks of clothes recognition, retrieval and fashion modeling, whereas our LIP dataset only focuses on human parsing and pose estimation.
Human parsing approaches: Recently, many research efforts have been devoted to human parsing [26, 48, 47, 37, 29, 45, 8]. For example, Liang et al.  proposed a novel Co-CNN architecture that integrates multiple levels of image contexts into a unified network. In addition to human parsing, there has also been increasing research interest in the part segmentation of other objects, such as animals or cars [41, 43, 32]. To capture the rich structure information based on the advanced CNN architecture, common solutions include the combination of CNNs and CRFs [7, 52] and adopting multi-scale feature representations [7, 8, 45]. Chen et al.  proposed an attention mechanism that learns to weight the multi-scale features at each pixel location.
Pose estimation approaches: Articulated human poses are generally modeled using a combination of a unary term and pictorial structures  or graph model, e.g., mixture of body parts [50, 10, 34]. With the introduction of DeepPose , which formulates the pose estimation problem as a regression problem using a standard convolutional architecture, research on human pose estimation began to shift from classic approaches to deep networks. For example, Wei et al.  incorporated the inference of the spatial correlations among body parts within ConvNets. Newell et al. proposed a stacked hourglass network  using a repeated pooling down and upsampling process to learn the spatial distribution.
Some previous works [13, 46] explored human pose information to guide human parsing by generating “pose-guided” part segment proposals. Additionally, some works [4, 11] generated attention maps of the body part to guide pose estimation. To further utilize the advantages of parsing and pose and their relationships, our focus is a joint human parsing and pose estimation network, which can simultaneously predict parsing and pose with extremely high quality. Additionally, to leverage the human joint structure more effortlessly and efficiently, we simplify the network and propose a self-supervised structure-sensitive learning approach.
The rest of this paper is organized as follows. We present the analysis of existing human parsing and pose estimation datasets and then introduce our new LIP benchmark in Section 3. In Section 4, we present the empirical study of current methods based on our LIP benchmark and discuss the limitations of these methods. Then, to address the challenges raised by LIP, we propose a unified framework for simultaneous human parsing and pose estimation in Section 5. Finally, more detailed comparisons between our approach and state-of-the-art methods are presented in Section 6.
In this section, we introduce our new “Look into Person (LIP)” dataset, which is a new large-scale dataset that focuses on semantic understanding of human bodies and that has several appealing properties. First, with 50,462 annotated images, LIP is an order of magnitude larger and more challenging than previous similar datasets [48, 9, 26]. Second, LIP is annotated with elaborate pixel-wise annotations with 19 semantic human part labels and one background label for human parsing and 16 body joint locations for pose estimation. Third, the images collected from the real-world scenarios contain people appearing with challenging poses and viewpoints, heavy occlusions, various appearances and in a wide range of resolutions. Furthermore, the background of the images in the LIP dataset is also more complex and diverse than that in previous datasets. Some examples are shown in Fig. 2. With the LIP dataset, we propose a new benchmark suite for human parsing and pose estimation together with a standard evaluation server where the test set will be kept secret to avoid overfitting.
The images in the LIP dataset are cropped person instances from Microsoft COCO  training and validation sets. We defined 19 human parts or clothes labels for annotation, which are hat, hair, sunglasses, upper clothes, dress, coat, socks, pants, gloves, scarf, skirt, jumpsuit, face, right arm, left arm, right leg, left leg, right shoe, and left shoe, as well as a background label. Similarly, we provide rich annotations for human poses, where the positions and visibility of 16 main body joints are annotated. Following , we annotate joints in a “person-centric” manner, which means that the left/right joints refer to the left/right limbs of the person. At test time, this requires pose estimation with both a correct localization of the limbs of a person along with the correct match to the left/right limb.
We implemented an annotation tool and generated multi-scale superpixels of the images to speed up the annotation. More than 100 students were trained to carry out the annotation work, which lasted five months. We supervised the whole annotation process and checked the results periodically to control the annotation quality. Finally, we conducted a second-round check of each annotated image and strictly selected 50,000 usable and well-annotated images from over 60,000 submitted images.
In total, there are 50,462 images in the LIP dataset, including 19,081 full-body images, 13,672 upper-body images, 403 lower-body images, 3,386 head-missing images, 2,778 back-view images and 21,028 images with occlusions. We split the images into separate training, validation and test sets. Following random selection, we arrive at a unique split that consists of 30,462 training and 10,000 validation images with publicly available annotations, as well as 10,000 test images with annotations withheld for benchmarking purposes.
Furthermore, to stimulate research on multiple-human parsing, we collect the images with multiple person instances in the LIP dataset to establish the first standard and comprehensive benchmark for multiple-human parsing and pose estimation. Our LIP multiple-human parsing and pose dataset contains 5,147 multiple-person images in total, split into 4,192 training, 497 validation and 458 test images.
In this section, we analyze the images and categories in the LIP dataset in detail. In general, face, arms, and legs are the most remarkable parts of a human body. However, human parsing aims to analyze every detailed region of a person, including different body parts and different categories of clothes. We therefore define 6 body parts and 13 clothes categories. Among these 6 body parts, we divide arms and legs into the left side and right side for a more precise analysis, which also increases the difficulty of the task. For clothes classes, we not only have common clothes such as upper clothes, pants, and shoes but also have infrequent categories, such as skirts and jumpsuits. Furthermore, small-scale accessories such as sunglasses, gloves, and socks are also taken into account. The numbers of images for each semantic part label are presented in Fig. 3.
In contrast to other human image datasets, the images in the LIP dataset contain diverse human appearances, viewpoints, and occlusions. Additionally, more than half of the images suffer from occlusions of different degrees. An occlusion is considered to have occurred if any of the semantic parts or body joints appear in the image but are occluded or invisible. In more challenging cases, the images contain person instances in a back view, which gives rise to more ambiguity in the left and right spatial layouts. The numbers of images of different appearances (i.e., occlusion, full body, upper body, head missing, back view and lower body) are summarized in Fig. 4.
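Table II: Human parsing results on the LIP validation set.

| Method | Overall accuracy | Mean accuracy | Mean IoU |
|---|---|---|---|
| DeepLab (VGG-16) | 82.66 | 51.64 | 41.64 |
| DeepLab (ResNet-101) | 84.09 | 55.62 | 44.80 |
| JPPNet (with pose info) | 86.39 | 62.32 | 51.37 |

Table III: Human parsing results on the LIP test set.

| Method | Overall accuracy | Mean accuracy | Mean IoU |
|---|---|---|---|
| DeepLab (VGG-16) | 82.89 | 51.53 | 41.56 |
| DeepLab (ResNet-101) | 84.25 | 55.64 | 44.96 |
| JPPNet (with pose info) | 86.48 | 62.25 | 51.36 |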
In this section, we analyze the performance of leading human parsing or semantic object segmentation and pose estimation approaches on our benchmark. We take advantage of our rich annotations and conduct a detailed analysis of the various factors that influence the results, such as appearance, foreshortening, and viewpoints. The goal of this analysis is to evaluate the robustness of the current approaches in various challenges for human parsing and pose estimation and identify the existing limitations to stimulate further research advances.
In our analysis, we consider fully convolutional networks  (FCN-8s), a deep convolutional encoder-decoder architecture  (SegNet), deep convolutional nets with atrous convolution and multi-scale  (DeepLab (VGG-16), DeepLab (ResNet-101)) and an attention mechanism  (Attention), which all have achieved excellent performance on semantic image segmentations in different ways and have completely available codes. For a fair comparison, we train each method on our LIP training set until the validation performance saturates, and we perform evaluation on the validation set and the test set. For the DeepLab methods, we remove the post-processing, dense CRFs. Following [8, 45], we use the standard intersection over union (IoU) criterion and pixel-wise accuracy for evaluation.
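The three scores reported for human parsing (overall pixel accuracy, mean per-class accuracy, and mean IoU) can all be derived from a confusion matrix. The sketch below is illustrative only: the function names are hypothetical, and details such as skipping classes that never occur are assumptions, not a description of the benchmark's evaluation server.

```python
def confusion_matrix(gt, pred, num_classes):
    """Count pixels per (true label, predicted label) pair; rows are ground truth."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for g, p in zip(gt, pred):
        cm[g][p] += 1
    return cm


def parsing_scores(cm):
    """Return (overall accuracy, mean accuracy, mean IoU) as fractions in [0, 1]."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[c][c] for c in range(n))
    accs, ious = [], []
    for c in range(n):
        gt_c = sum(cm[c])                          # pixels whose true label is c
        pred_c = sum(cm[r][c] for r in range(n))   # pixels predicted as c
        union = gt_c + pred_c - cm[c][c]
        if gt_c > 0:
            accs.append(cm[c][c] / gt_c)
        if union > 0:
            ious.append(cm[c][c] / union)
    return correct / total, sum(accs) / len(accs), sum(ious) / len(ious)


# Toy example: flattened label maps with 3 classes.
gt = [0, 0, 0, 1, 1, 2]
pred = [0, 0, 1, 1, 1, 0]
overall, mean_acc, mean_iou = parsing_scores(confusion_matrix(gt, pred, 3))
```

In the toy example the single pixel of class 2 is missed entirely, which lowers mean accuracy and mean IoU far more than overall accuracy. The same effect explains why small labels such as accessories weigh heavily on mean IoU even when overall accuracy is high.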
We begin our analysis by reporting the overall human parsing performance of each approach, and the results are summarized in Table II and Table III. On the LIP validation set, among the five approaches, DeepLab (ResNet-101) with the deepest networks achieves the best result of 44.80% mean IoU. Benefitting from the attention model that softly weights the multi-scale features, Attention also performs well with 42.92% mean IoU, whereas both FCN-8s (28.29%) and SegNet (18.17%) perform significantly worse. Similar performance is observed on the LIP test set. The interesting outcome of this comparison is that the achieved performance is substantially lower than the current best results on other segmentation benchmarks, such as PASCAL VOC. This result suggests that detailed human parsing, with its small parts and diverse fine-grained labels, is more challenging than object-level segmentation and deserves more attention in the future.
We further analyze the performance of each approach with respect to the following five challenging factors: occlusion, full body, upper body, head missing and back view (see Fig. 5). We evaluate the above five approaches on the LIP validation set, which contains 4,277 images with occlusions, 5,452 full-body images, 793 upper-body images, 112 head-missing images and 661 back-view images. As expected, the performance varies when the approaches are affected by different factors. Back view is clearly the most challenging case. For example, the IoU of Attention decreases from 42.92% to 33.50%. The second most influential factor is the appearance of the head. The scores of all approaches are considerably lower on head-missing images than the average score on the entire set. The performance also greatly suffers from occlusions. The results of full-body images are the closest to the average level. By contrast, upper body is relatively the easiest case, where fewer semantic parts are present and the part regions are generally larger. From these results, we can conclude that the head (or face) is an important cue for the existing human parsing approaches. The probability of ambiguous results will increase if the head part disappears in the images or in the back view. Moreover, the parts or clothes on the lower body are more difficult than those on the upper body because of the existence of small labels, such as shoes or socks. In this case, the body joint structure can play an effective role in guiding human parsing.
To discuss and analyze each of the 20 labels in the LIP dataset in more detail, we further report the performance of per-class IoU on the LIP validation set, as shown in Table IV. We observe that the results with respect to labels with larger regions such as face, upper clothes, coats, and pants are considerably better than those on the small-region labels, such as sunglasses, scarf, and skirt. DeepLab (ResNet-101)  and Attention  perform better on small labels thanks to the utilization of deep networks and multi-scale features.
The qualitative comparisons of the five approaches on our LIP validation set are visualized in Fig. 6. We present example parsing results of the five challenging factor scenarios. For the upper-body image (a) with slight occlusion, the five approaches perform well with few errors. For the back-view image (b), all five methods mistakenly label the right arm as the left arm. The worst results occur for the head-missing image (c). SegNet  and FCN-8s  fail to recognize arms and legs, whereas DeepLab (VGG-16)  and Attention  have errors on the right and left arms, legs and shoes. Furthermore, severe occlusion (d) also greatly affects the performance. Moreover, as observed from (c) and (d), some of the results are unreasonable from the perspective of human body configuration (e.g., two shoes on one foot) because the existing approaches lack the consideration of body structures. In summary, human parsing is more difficult than general object segmentation. Particularly, human body structures should receive more attention to strengthen the ability to predict human parts and clothes with more reasonable configurations. Consequently, we consider connecting human parsing results and body joint structure to determine a better approach for human parsing.
| DeepLab (VGG-16) | 57.94 | 66.11 | 28.50 | 18.40 | 60.94 | 23.17 | 47.03 | 34.51 | 64.00 | 22.38 | 14.29 | 18.74 | 69.70 | 49.44 | 51.66 | 37.49 | 34.60 | 28.22 | 22.41 | 83.25 | 41.64 |
| DeepLab (ResNet-101) | 59.76 | 66.22 | 28.76 | 23.91 | 64.95 | 33.68 | 52.86 | 37.67 | 68.05 | 26.15 | 17.44 | 25.23 | 70.00 | 50.42 | 53.89 | 39.36 | 38.27 | 26.95 | 28.36 | 84.09 | 44.80 |
| JPPNet (with pose info) | 63.55 | 70.20 | 36.16 | 23.48 | 68.15 | 31.42 | 55.65 | 44.56 | 72.19 | 28.39 | 18.76 | 25.14 | 73.36 | 61.97 | 63.88 | 58.21 | 57.99 | 44.02 | 44.09 | 86.26 | 51.37 |

| JPPNet (with parsing info) | 93.3 | 89.3 | 84.4 | 82.5 | 70.0 | 78.3 | 77.7 | 82.7 |

| JPPNet (with parsing info) | 93.2 | 89.3 | 84.6 | 82.2 | 69.9 | 78.0 | 77.3 | 82.5 |
Similarly, we consider three state-of-the-art methods for pose estimation, including a sequential convolutional architecture (CPM) and a repeated bottom-up, top-down network (Hourglass). ResNet-101 with atrous convolutions is also taken into account, for which we retain the entire network and change the output layer to generate pose heatmaps. These approaches achieve top performance on the MPII and LSP datasets and can be trained on our LIP dataset using publicly available codes. Again, we train each method on our LIP training set and evaluate on the validation set and the test set. Following MPII, the evaluation metric that we use is the percentage of correct keypoints with respect to head (PCKh). PCKh considers a candidate keypoint to be localized correctly if it falls within the matching threshold, which is 50% of the head segment length.
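The PCKh criterion described above can be sketched in a few lines. This is an illustrative implementation under stated assumptions: the function name and arguments are hypothetical, joints are given as pixel coordinates for a single person, and joints flagged as not visible are simply skipped, which may differ from the official evaluation protocol.

```python
import math


def pckh(pred, gt, head_size, visible=None, alpha=0.5):
    """Fraction of joints predicted within alpha * head_size of the ground truth.

    pred, gt: lists of (x, y) joint coordinates for one person.
    head_size: head segment length in pixels.
    visible: optional list of booleans; joints marked False are skipped (an assumption).
    """
    if visible is None:
        visible = [True] * len(gt)
    threshold = alpha * head_size
    hits, counted = 0, 0
    for (px, py), (gx, gy), vis in zip(pred, gt, visible):
        if not vis:
            continue
        counted += 1
        if math.hypot(px - gx, py - gy) <= threshold:
            hits += 1
    return hits / counted if counted else 0.0


gt_joints = [(10.0, 10.0), (20.0, 30.0), (40.0, 50.0), (60.0, 60.0)]
pred_joints = [(11.0, 10.0), (20.0, 36.0), (70.0, 50.0), (60.0, 65.0)]
score = pckh(pred_joints, gt_joints, head_size=10.0)
```

With a head size of 10 pixels the threshold is 5 pixels, so the first and last joints count as correct and the other two do not. Because the threshold scales with the head segment, the metric is undefined for head-missing images.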
We again begin our analysis by reporting the overall pose estimation performance of each approach, and the results are summarized in Table V and Table VI. On the LIP validation set, Hourglass  achieves the best result of 77.5% total PCKh, benefiting from their multiple hourglass modules and intermediate supervision. With a sequential composition of convolutional architectures to learn implicit spatial models, CPM  also obtains comparable performance. Interestingly, the achieved performance is substantially lower than the current best results on other pose estimation benchmarks, such as MPII . This wide gap reflects the higher complexity and variability of our LIP dataset and the significant development potential of pose estimation research. Similar performance on the LIP test set is again consistent with our analysis.
We further analyze the performance of each approach with respect to the four challenging factors (see Fig. 7). We leave head-missing images out because the PCKh metric depends on the head size of the person. In general and as expected, the performance decreases as the complexity increases. However, there are interesting differences. The back-view factor clearly influences the performance of all approaches the most, as the scores of all approaches decrease nearly 10% compared to the average score on the entire set. The second most influential factor is occlusion. For example, the PCKh of Hourglass  is 4.60% lower. These two factors are related to the visibility and orientation of heads in the images, which indicates that similar to human parsing, the existing pose estimation methods strongly depend on the contextual information of the head or face. In this case, exploring and leveraging the correlation and complementation of human parsing and pose estimation is advantageous for reducing this type of dependency.
The qualitative comparisons of the pose estimation results on our LIP validation set are visualized in Fig. 8. We select some challenging images with unusual appearances, truncations and occlusions to analyze the failure cases and obtain some inspiration. First, for the persons standing or sitting sideways, the existing approaches typically failed to predict their occluded body joints, such as the right arm in Col 1, the right leg in Col 2, and the right leg in Col 7. Second, for the persons in back view or head missing, the left and right arms (legs) of the persons are always improperly located, as with those in Cols 3 and 4. Moreover, for some images with strange appearances where some limbs of the person are very close (Cols 5 and 6), ambiguous and irrational results will be generated by these methods. In particular, the performance of the gray images (Col 8) is also far from being satisfactory. Learning from these failure cases, we believe that pose estimation should fall back on more instructional contextual information, such as the guidance from human parts with reasonable configurations.
Summarizing the analysis of human parsing and pose estimation, it is clear that these two human-centric tasks are strongly connected and that their intrinsic consistency allows each to benefit the other. Consequently, we present a unified framework for simultaneous human parsing and pose estimation to explore this intrinsic correlation.
In this section, we first summarize some insights about the limitations of existing approaches and then illustrate our joint human parsing and pose estimation network in detail. From the above detailed analysis, we obtain some insights into the human parsing and pose estimation tasks. 1) A major limitation of the existing human parsing approaches is the lack of consideration of human body configuration, which is mainly investigated in the human pose estimation problem. Meanwhile, the part maps produced by the detection network contain multiple contextual information and structural part relationships, which can effectively guide the regression network to predict the locations of the body joints. Human parsing and pose estimation aim to label each image with different granularities, that is, pixel-wise semantic labeling versus joint-wise structure prediction. The pixel-wise labeling can address more detailed information, whereas joint-wise structure provides more high-level structure, which means that the two tasks are complementary. 2) As learned from the existing approaches, the coarse-to-fine scheme is widely used in both parsing and pose networks to improve accuracy. For coarse-to-fine, there are two different definitions for parsing and pose tasks. For parsing or segmentation, it means using the multi-scale segmentation or attention-to-zoom scheme  for more precise pixel-wise classification. Conversely, for the pose task, it indicates iterative displacement refinement, which is widely used in pose estimation . It is reasonable to incorporate these two distinct coarse-to-fine schemes together in a unified network to further improve the parsing and pose results.
To utilize the coherent representation of human parsing and pose to promote each task, we propose a joint human parsing and pose estimation network, which also incorporates two distinct coarse-to-fine schemes, i.e., multi-scale features and iterative refinement, together. The framework architecture is illustrated in Fig. 9 and the detailed configuration is presented in Table VII. We denote our joint human parsing and pose estimation network as JPPNet.
In general, the basic network of the parsing framework is a deep residual network, while the pose framework prefers a stacked hourglass network. In our joint framework, we use a shared residual network to extract human image features, which is more efficient and concise. Then, we have two distinct networks to generate parsing and pose features and results. They are followed by a refinement network, which takes features and results as input to produce more accurate segmentation and joint localization.
Feature extraction. We employ convolution with upsampled filters, or “atrous convolution”, as a powerful tool to repurpose ResNet-101 in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within DCNNs. It also effectively enlarges the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. The first four stages of ResNet-101 (i.e., Res-1 to Res-4) are shared in our framework, whereas the deeper convolutional layers are learned separately for each task.
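As a minimal illustration of why atrous convolution enlarges the field of view without adding parameters, the following one-dimensional sketch spaces the filter taps `rate` samples apart. The function names and the toy signal are assumptions made for this example; they are not taken from the released code, which operates on 2-D feature maps.

```python
def effective_kernel_size(kernel_size, rate):
    """Span covered by a filter whose taps are `rate` samples apart."""
    return (kernel_size - 1) * rate + 1


def atrous_conv1d(signal, weights, rate=1):
    """'Valid' 1-D correlation with holes: same weights, wider field of view."""
    span = effective_kernel_size(len(weights), rate)
    outputs = []
    for start in range(len(signal) - span + 1):
        # Each tap j reads the sample located j * rate positions after `start`.
        outputs.append(sum(w * signal[start + j * rate]
                           for j, w in enumerate(weights)))
    return outputs


signal = list(range(10))           # toy ramp signal
weights = [1, 0, -1]               # three weights in both cases
dense = atrous_conv1d(signal, weights, rate=1)    # covers 3 samples
dilated = atrous_conv1d(signal, weights, rate=2)  # covers 5 samples
```

With the same three weights, the rate-2 filter compares samples four positions apart instead of two, which is the 1-D analogue of incorporating larger context at no extra parameter cost.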
Parsing and pose subnet. We use ResNet-101 with atrous convolution as the basic parsing subnet, which contains atrous spatial pyramid pooling (ASPP) as the output layer to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields of view, thus capturing objects and image context at multiple scales. Furthermore, to generate the context used in the refinement stage, there are two convolutions following Res-5. For the pose subnet, we simply add several convolutional layers to the fourth stage (Res-4) of ResNet-101 to generate pose features and heatmaps.
|Pose refinement||remap-1||pose maps||128|
|Parsing refinement||remap-1||pose maps||128|
Refinement network. We also design a simple but efficient refinement network, which is able to iteratively refine both parsing and pose results. We reintegrate the intermediate parsing and pose predictions back into the feature space by mapping them to a larger number of channels with an additional convolution. Then, we have four convolutional layers with an incremental kernel size that varies from 3 to 9 to capture a sufficient local context and to increase the receptive field size, which is crucial for learning long-term relationships. Next is another convolution to generate the features for the next refinement stage. To refine pose, we concatenate the remapped pose and parsing results and the pose features from the last stage. For parsing, we concatenate the two remapped results and parsing features and use ASPP again to generate parsing predictions. The entire joint network with refinement can be trained end-to-end, feeding the output of the former stage into the next. Following other pose estimation methods that have demonstrated strong performance with multiple iterative stages and intermediate supervision [44, 33, 5], we apply a loss upon the prediction of intermediate maps.
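To make the effect of the incremental kernel sizes concrete, the following sketch computes the receptive field of a stack of convolutions. It assumes that the four layers use kernel sizes 3, 5, 7 and 9 with stride 1 and no dilation; the text only states that the sizes vary from 3 to 9, so these exact values are an assumption of the example.

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field (in input pixels) of a stack of convolutions."""
    strides = strides or [1] * len(kernel_sizes)
    field, jump = 1, 1
    for kernel, stride in zip(kernel_sizes, strides):
        # Each layer widens the field by (kernel - 1) steps of the current jump.
        field += (kernel - 1) * jump
        jump *= stride
    return field


# Assumed kernel sizes for the four refinement convolutions.
refinement_field = receptive_field([3, 5, 7, 9])
```

Under these assumptions, each output location of the refinement stage sees a 21 x 21 window of its input, compared with 9 x 9 for four stacked 3 x 3 convolutions.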
The joint human parsing and pose estimation network (JPPNet) leverages both pixel-wise supervision from human part annotations and high-level structural guidance from joint annotations. However, in some cases, e.g., in previous human parsing datasets, the joint annotations may not be available. In this section, we show that high-level human structure cues can still help the human parsing task even without explicit supervision from manual annotations. We simplify our JPPNet and propose a novel self-supervised structure-sensitive learning for human parsing, which introduces a self-supervised structure-sensitive loss to evaluate the quality of the predicted parsing results from a joint structure perspective, as illustrated in Fig. 10.
Specifically, in addition to using the traditional pixel-wise annotations as the supervision, we generate the approximated human joints directly from the parsing annotations, which can also guide human parsing training. For the purpose of explicitly enforcing the produced parsing results to be semantically consistent with the human joint structures, we treat the joint structure loss as a weight of segmentation loss, which becomes our structure-sensitive loss.
Self-supervised Structure-sensitive Loss: Generally, for the human parsing task, no information other than the pixel-wise annotations is provided. Consequently, rather than using additional information, we have to derive structure-sensitive supervision from the parsing annotations themselves. Because the human parsing results are semantic parts with pixel-level labels, we attempt to explore the pose information contained in human parsing results. We define 9 joints to construct a pose structure, which are the centers of the regions of head, upper body, lower body, left arm, right arm, left leg, right leg, left shoe and right shoe. The head regions are generated by merging the parsing labels of hat, hair, sunglasses and face. Similarly, upper clothes, coat and scarf are merged to be upper body, and pants and skirt are merged for lower body. The remaining regions can also be obtained by the corresponding labels. Some examples of generated human joints for different humans are shown in Fig. 11. For each parsing result and the corresponding ground truth, we compute the center points of the regions to obtain joints, which are represented as heatmaps to make training smoother. Then, we use the Euclidean metric to evaluate the quality of the generated joint structures, which also reflects the structure consistency between the predicted parsing results and the ground truth. Finally, the pixel-wise segmentation loss is weighted by the joint structure loss, which becomes our structure-sensitive loss. Consequently, the overall human parsing networks become self-supervised with the structure-sensitive loss.
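The joint generation step can be sketched as follows for the head region. The integer label ids, the Gaussian shape of the heatmap and its `sigma` are assumptions made for illustration only; the actual LIP label ids and heatmap settings may differ.

```python
import math

# Hypothetical label ids, for illustration only.
HAT, HAIR, SUNGLASSES, FACE = 1, 2, 3, 4
HEAD_LABELS = {HAT, HAIR, SUNGLASSES, FACE}  # labels merged into "head"


def region_center(label_map, label_ids):
    """(row, col) centroid of all pixels labelled with one of `label_ids`."""
    row_sum = col_sum = count = 0
    for r, line in enumerate(label_map):
        for c, label in enumerate(line):
            if label in label_ids:
                row_sum += r
                col_sum += c
                count += 1
    if count == 0:
        return None  # the region is absent from this image
    return row_sum / count, col_sum / count


def joint_heatmap(center, height, width, sigma=1.0):
    """Gaussian heatmap around `center`; all zeros when the joint is missing."""
    if center is None:
        return [[0.0] * width for _ in range(height)]
    cr, cc = center
    return [[math.exp(-((r - cr) ** 2 + (c - cc) ** 2) / (2 * sigma ** 2))
             for c in range(width)] for r in range(height)]


label_map = [
    [0, 2, 2, 0],   # hair
    [0, 4, 4, 0],   # face
    [0, 0, 0, 0],
]
head_center = region_center(label_map, HEAD_LABELS)
head_map = joint_heatmap(head_center, height=3, width=4)
```

The same two functions would be applied to the predicted parsing map and to the ground-truth map, once per merged region, to obtain the two sets of joint heatmaps that are compared by the joint structure loss.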
Formally, given an image $I$, we define a list of joint configurations $C_I^{P} = \{c_i^{p} \mid i \in [1, N]\}$, where $c_i^{p}$ is the heatmap of the $i$-th joint computed according to the parsing result map. Similarly, $C_I^{GT} = \{c_i^{gt} \mid i \in [1, N]\}$ is obtained from the corresponding parsing ground truth. Here, $N$ is a variable decided by the human bodies in the input images, and it is equal to 9 for a full-body image. For the joints missed in the image, we simply replace the heatmaps with maps filled with zeros. The joint structure loss is the Euclidean (L2) loss, which is calculated as follows:

$$L_{Joint} = \frac{1}{2N} \sum_{i=1}^{N} \left\| c_i^{p} - c_i^{gt} \right\|_2^2$$

The final structure-sensitive loss, denoted as $L_{Structure}$, is the combination of the joint structure loss and the parsing segmentation loss, and it is calculated as follows:

$$L_{Structure} = L_{Joint} \cdot L_{Parsing}$$

where $L_{Parsing}$ is the pixel-wise softmax loss calculated based on the parsing annotations.
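A minimal sketch of this loss is given below. It assumes that the heatmaps are nested lists of floats, that the Euclidean loss carries the usual 1/(2N) scaling over the N joints, and that the parsing loss is already available as a scalar; the function names are hypothetical.

```python
def joint_structure_loss(pred_maps, gt_maps):
    """Euclidean (L2) loss between predicted and ground-truth joint heatmaps."""
    num_joints = len(pred_maps)
    squared_error = 0.0
    for pred, gt in zip(pred_maps, gt_maps):
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                squared_error += (p - g) ** 2
    # Assumed scaling: 1 / (2N), with N the number of joints.
    return squared_error / (2 * num_joints)


def structure_sensitive_loss(pred_maps, gt_maps, parsing_loss):
    """Pixel-wise parsing loss weighted by the joint structure loss."""
    return joint_structure_loss(pred_maps, gt_maps) * parsing_loss


# Two joints with 1 x 2 heatmaps; only the first joint is mislocalized.
pred_maps = [[[1.0, 0.0]], [[0.0, 0.0]]]
gt_maps = [[[0.0, 0.0]], [[0.0, 0.0]]]
loss = structure_sensitive_loss(pred_maps, gt_maps, parsing_loss=2.0)
```

Because the joint structure loss acts as a multiplicative weight, parsing errors that displace a region center are penalized more heavily than errors that leave the joint structure intact.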
We name this learning strategy “self-supervised” because the above structure-sensitive loss can be generated from existing parsing results without any extra information. Our self-supervised structure-sensitive JPPNet (SS-JPPNet) thus has excellent adaptability and extensibility, which can be injected into any advanced network to help incorporate rich high-level knowledge about human joints from a global perspective.
Network architecture: We utilize the publicly available model DeepLab (ResNet-101) as the basic architecture of our JPPNet, which employs atrous convolution, multi-scale inputs with max-pooling to merge the results from all scales, and atrous spatial pyramid pooling. For SS-JPPNet, our basic network is Attention due to its leading accuracy and competitive efficiency.
Training: To train JPPNet, the input image is first rescaled to a fixed size. We first train ResNet-101 on the human parsing task for 30 epochs, using pre-trained models and their accompanying network settings. Then, we train the joint framework end-to-end for another 30 epochs. During training, we apply data augmentation, including random scaling of the input images (from 0.75 to 1.25), random cropping and random left-right flipping.
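The augmentation described above can be sketched as follows on an image stored as a nested list. The crop size, the nearest-neighbour resizing and the zero padding are assumptions of this example, and in practice the same geometric transform would also have to be applied to the label map and joint coordinates.

```python
import random


def resize_nearest(img, scale):
    """Nearest-neighbour rescaling of a 2-D nested list."""
    h, w = len(img), len(img[0])
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    return [[img[min(h - 1, int(r / scale))][min(w - 1, int(c / scale))]
             for c in range(new_w)] for r in range(new_h)]


def random_crop(img, crop_h, crop_w, rng, pad_value=0):
    """Random crop; pads on the right and bottom if the image is too small."""
    h, w = len(img), len(img[0])
    padded = [row + [pad_value] * max(0, crop_w - w) for row in img]
    padded += [[pad_value] * max(w, crop_w) for _ in range(max(0, crop_h - h))]
    top = rng.randint(0, len(padded) - crop_h)
    left = rng.randint(0, len(padded[0]) - crop_w)
    return [row[left:left + crop_w] for row in padded[top:top + crop_h]]


def flip_lr(img):
    """Mirror the image horizontally."""
    return [row[::-1] for row in img]


def augment(img, crop_h, crop_w, rng):
    """Random scale in [0.75, 1.25], random crop, random left-right flip."""
    out = resize_nearest(img, rng.uniform(0.75, 1.25))
    out = random_crop(out, crop_h, crop_w, rng)
    if rng.random() < 0.5:
        out = flip_lr(out)
    return out


rng = random.Random(0)                      # fixed seed for reproducibility
image = [[r * 8 + c for c in range(8)] for r in range(8)]
sample = augment(image, crop_h=6, crop_w=6, rng=rng)  # hypothetical crop size
```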
When training SS-JPPNet, we use the pre-trained models and network settings provided by DeepLab. The input images are rescaled to a fixed size for training networks based on Attention. Two training steps are employed to train the networks. First, we train the basic network on our LIP dataset for 30 epochs. Then, we apply the “self-supervised” strategy to fine-tune our model with the structure-sensitive loss for approximately 20 epochs. We use both human parsing and pose annotations to train JPPNet and only parsing labels for SS-JPPNet.
Inference: To stabilize the predictions, we perform inference on multi-scale inputs (with scales = 0.75, 0.5, 1.25) and also left-right flipped images. In particular, the final result is the average of the probabilities predicted at each scale and on the flipped images, for both parsing and pose. The difference is that we utilize the predictions of all stages for parsing, whereas for pose we use only the results of the last stage.
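The averaging step can be sketched as follows, with probability maps stored as `[channel][row][col]` nested lists. The sketch assumes that all predictions have already been resized to a common resolution, and that predictions on a flipped image are mirrored back with their left/right channels swapped before averaging; the text does not describe this last step, so it is an assumption of the example, as are the function names.

```python
def flip_back(prob, swap_pairs):
    """Undo a left-right flip: mirror each channel, then swap left/right channels."""
    restored = [[row[::-1] for row in channel] for channel in prob]
    for a, b in swap_pairs:
        restored[a], restored[b] = restored[b], restored[a]
    return restored


def average_maps(probs):
    """Element-wise mean of several probability maps of identical shape."""
    n = len(probs)
    channels, height, width = len(probs[0]), len(probs[0][0]), len(probs[0][0][0])
    return [[[sum(p[k][r][c] for p in probs) / n for c in range(width)]
             for r in range(height)] for k in range(channels)]


# Toy example with two channels: 0 = left arm, 1 = right arm.
original_pred = [[[0.9, 0.1]], [[0.1, 0.7]]]
flipped_pred = [[[0.7, 0.1]], [[0.1, 0.9]]]   # prediction on the mirrored image
aligned = flip_back(flipped_pred, swap_pairs=[(0, 1)])
final = average_maps([original_pred, aligned])
```

In the toy example, the prediction on the mirrored image agrees with the original prediction once it has been flipped back, so the average leaves the maps unchanged.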
|DeepLab (VGG-16) ||48.64||43.97||28.77||40.74||43.02|
|DeepLab (ResNet-101) ||53.28||46.99||31.70||43.14||47.62|
|JPPNet (with pose info)||54.45||53.99||44.56||50.52||52.58|
|JPPNet (with parsing info)||90.4||91.7||86.4||84.0||82.5||76.5||71.3||84.1|
|Joint + MSC||92.9||88.1||82.8||81.0||67.8||74.3||75.7||80.9|
|Joint + S1||93.5||88.8||83.6||81.6||70.2||76.4||76.8||82.1|
|Joint + MSC + S1||93.4||88.9||84.2||82.1||70.8||77.2||77.3||82.5|
|Joint + MSC + S2||93.3||89.3||84.4||82.5||70.0||78.3||77.7||82.7|
We compare our proposed approach with the strong baselines on the LIP dataset, and we further evaluate SS-JPPNet on another public human parsing dataset.
LIP dataset: We report the results and the comparisons with five state-of-the-art methods on the LIP validation set and test set in Table II and Table III. On the validation set, the proposed JPPNet framework improves the best performance from 44.80% to 51.37%. The simplified architecture can also provide a substantial enhancement in average IoU: 3.09% better than DeepLab (VGG-16)  and 1.81% better than Attention . On the test set, the JPPNet also considerably outperforms the other baselines. This superior performance achieved by our methods demonstrates the effectiveness of our joint parsing and pose networks, which incorporate the body joint structure into the pixel-wise prediction.
In Fig. 5, we show the results with respect to the different challenging factors on our LIP validation set. With our unified framework that models the contextual information of body parts and joints, the performance improves for every type of challenging factor, which demonstrates that human joint structure is conducive to the human parsing task.
We further report per-class IoU on the LIP validation set to verify the detailed effectiveness of our approach, as presented in Table IV. With the consideration of human body joints, we achieve the best performance on almost all the classes. As observed from the reported results, the proposed JPPNet significantly improves the performance on labels such as arms, legs, and shoes, which demonstrates its ability to resolve the ambiguity between left and right. Furthermore, labels covering small regions, such as socks and gloves, are predicted with higher IoUs. This improvement also demonstrates the effectiveness of the unified framework, particularly for small labels.
For a better understanding of our LIP dataset, we train all methods on LIP and evaluate them on ATR, as reported in Table VIII (left). As ATR contains 18 categories while LIP has 20, we test the models on the 16 common categories (hat, hair, sunglasses, upper clothes, dress, pants, scarf, skirt, face, right arm, left arm, right leg, left leg, right shoe, left shoe, and background). In general, the performance on ATR is better than that on LIP because the LIP dataset contains instances with more diverse poses, appearance patterns, occlusions and resolution issues, which is more consistent with real-world situations.
Following the MSCOCO dataset, we have conducted an empirical analysis on different object sizes, i.e., small, medium and large. The results of the five baselines and the proposed methods are reported in Table VIII (right). As shown, our methods show substantially superior performance for different sizes of objects, thus further demonstrating the advantage of incorporating the human body structure into the parsing model.
PASCAL-Person-Part dataset: The public PASCAL-Person-Part dataset, with 1,716 images for training and 1,817 for testing, provides human part segmentation annotations. Following [8, 45], the annotations are merged into six person part classes (head, torso, upper / lower arms and upper / lower legs) and one background class. We train and evaluate all methods using the training and testing data of the PASCAL-Person-Part dataset. Table IX shows the performance of our model and comparisons with four state-of-the-art methods on the standard IoU criterion. Our SS-JPPNet significantly outperforms the four baselines. For example, our best model achieves 59.36% IoU, which is 7.58% better than DeepLab-LargeFOV and 2.97% better than Attention. This large improvement demonstrates that our self-supervised strategy is significantly beneficial for the human parsing task.
LIP dataset: Table V and Table VI report the comparison of the PCKh performance of our JPPNet and previous state-of-the-art at a normalized distance of 0.5. On the LIP test set, our method achieves state-of-the-art PCKh scores of 82.7%. In particular, for the most challenging body parts, e.g., hip and ankle, our method achieves 5.0% and 5.5% improvements compared with the closest competitor, respectively. Similar improvements also occur on the validation set.
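For reference, PCKh at a normalized distance of 0.5 counts a predicted joint as correct when it lies within half of the head size of its ground-truth location. The sketch below assumes that a per-person head size (in pixels) is available and that unannotated ground-truth joints are marked as `None`; the function name and data layout are hypothetical.

```python
import math


def pckh(pred, gt, head_sizes, threshold=0.5):
    """Percentage of annotated joints within threshold * head size of the truth."""
    correct = total = 0
    for pred_joints, gt_joints, head in zip(pred, gt, head_sizes):
        for p, g in zip(pred_joints, gt_joints):
            if g is None:
                continue  # joint not annotated for this person
            total += 1
            if math.hypot(p[0] - g[0], p[1] - g[1]) <= threshold * head:
                correct += 1
    return 100.0 * correct / total


pred = [[(0, 0), (10, 0), (5, 5)]]
gt = [[(3, 0), (10, 4), None]]
score_large_head = pckh(pred, gt, head_sizes=[10])   # tolerance of 5 pixels
score_small_head = pckh(pred, gt, head_sizes=[7])    # tolerance of 3.5 pixels
```

Normalizing by the head size makes the tolerance scale with the apparent size of each person, so the same pixel error counts as a miss for a small person and as a hit for a large one.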
We present the results with respect to different challenging factors on our LIP validation set in Fig. 7. As expected, with our unified architecture, the results improve for all the different appearances, thus demonstrating the positive effect of human parsing on pose estimation.
MPII Human Pose dataset: To further validate our approach, we also perform evaluations on the MPII dataset. The MPII dataset contains approximately 25,000 images, where each person is annotated with 16 joints. The images are extracted from YouTube videos, where the contents are everyday human activities. There are 18079 images in the training set, including 11431 single person images. We take the models trained on our LIP training set and test them on these 11431 single person images from MPII, as presented in Table X. The margin between our approach and the others provides evidence of the higher generalization ability of our proposed JPPNet model.
We further evaluate the effectiveness of the two coarse-to-fine schemes of JPPNet, namely the multi-scale features and iterative refinement. “Joint” denotes the JPPNet without multi-scale features (“MSC”) or refinement networks. “S1” denotes one refinement stage and “S2” denotes two stages. The human parsing and pose estimation results are shown in Table XII and Table XI. From the comparisons, we can see that multi-scale features greatly improve human parsing but only slightly improve pose estimation. In contrast, pose estimation benefits considerably from iterative refinement, which is less helpful for human parsing: a second refinement stage decreases the parsing performance.
| Method | Overall accuracy | Mean accuracy | Mean IoU |
|---|---|---|---|
| Joint + MSC | 86.18 | 61.40 | 50.83 |
| Joint + S1 | 86.09 | 57.95 | 49.58 |
| Joint + MSC + S1 | 86.48 | 62.25 | 51.36 |
| Joint + MSC + S2 | 86.42 | 61.12 | 50.64 |
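The three parsing metrics reported here (overall accuracy, mean accuracy and mean IoU) can all be derived from a confusion matrix, as sketched below. The function names are hypothetical, and averaging only over classes that appear in the ground truth is an assumption of this example rather than a documented detail of the evaluation protocol.

```python
def confusion_matrix(pred, gt, num_classes):
    """matrix[g][p] counts pixels of true class g predicted as class p."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for p, g in zip(pred, gt):
        matrix[g][p] += 1
    return matrix


def parsing_metrics(matrix):
    """Return (overall accuracy, mean accuracy, mean IoU) as fractions."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    hits = [matrix[i][i] for i in range(n)]
    gt_count = [sum(matrix[i]) for i in range(n)]
    pred_count = [sum(matrix[r][i] for r in range(n)) for i in range(n)]
    present = [i for i in range(n) if gt_count[i] > 0]
    overall_acc = sum(hits) / total
    mean_acc = sum(hits[i] / gt_count[i] for i in present) / len(present)
    # IoU = intersection / (ground truth + prediction - intersection)
    mean_iou = sum(hits[i] / (gt_count[i] + pred_count[i] - hits[i])
                   for i in present) / len(present)
    return overall_acc, mean_acc, mean_iou


gt_pixels = [0, 0, 1, 1]      # flattened ground-truth labels
pred_pixels = [0, 1, 1, 1]    # flattened predicted labels
matrix = confusion_matrix(pred_pixels, gt_pixels, num_classes=2)
overall_acc, mean_acc, mean_iou = parsing_metrics(matrix)
```

Mean IoU penalizes false positives as well as false negatives for each class, which is why it is usually much lower than overall accuracy when many small classes are present.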
Human parsing: The qualitative comparisons of the parsing results on the LIP validation set are visualized in Fig. 6. As can be observed from these visual comparisons, our methods output more semantically meaningful and precise predictions than the other five methods despite the existence of large appearance and position variations. Taking (b) and (c) for example, our approaches can also successfully handle the confusing labels, such as left arm versus right arm and left leg versus right leg. These regions with similar appearances can be recognized and separated by the guidance from joint structure information. For the most difficult head-missing image (c), the left shoe, right shoe and legs are excellently corrected by our JPPNet approach. In general, by effectively exploiting human body joint structure, our approaches output more reasonable results for confusing labels on the human parsing task.
Pose estimation: The qualitative comparisons of pose results on the LIP validation set are presented in Fig. 8. In Section 4.2.3, we summarize some challenging cases that cause considerable trouble for the previous pose estimation approaches. In contrast, by jointly modeling human parsing and pose estimation, our model can effectively handle difficult cases such as sideways views, occlusion and other erratic postures, thus producing more reasonable results.
Finally, we want to emphasize that our goal is to explore the intrinsic correlation between human parsing and pose estimation. For this purpose, we propose JPPNet, which is a unified model built upon two distinct coarse-to-fine schemes. Separating our framework into different components leads to inferior results, as demonstrated in Table XII and Table XI. Although we use more annotations than methods for individual tasks, the promising results of our framework verify that human parsing and pose estimation are essentially complementary; thus, performing the two tasks simultaneously will enhance the performance of each task.
In this work, we presented “Look into Person (LIP)”, a large-scale human parsing and pose estimation dataset and a carefully designed benchmark to spark progress in human-centric tasks. LIP contains 50,462 images, which are richly labeled with 19 semantic part labels and 16 body joints. It surpasses existing human parsing and pose estimation datasets in terms of scale and richness of annotations. Moreover, we proposed a joint human parsing and pose estimation network to explore the intrinsic connection of the two tasks. The extensive results clearly demonstrate the effectiveness of the proposed approaches. The datasets, code and models are available at http://www.sysu-hcp.net/lip/.
This work was supported by State Key Development Program under Grant 2016YFB1001004, the National Natural Science Foundation of China under Grant 61622214 and Grant U1611461, the Guangdong Natural Science Foundation Project for Research Teams under Grant 2017A030312006, and the Guangdong Science and Technology Planning Program under Grant 2017B010116001.
Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 1014–1021. IEEE, 2009.
Semantic object parsing with local-global long short-term memory. In CVPR, 2016.
Matching-CNN Meets KNN: Quasi-Parametric Human Parsing. In CVPR, 2015.
Multi-source deep learning for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2329–2336, 2014.