Code for CVPR'18 spotlight "Weakly and Semi Supervised Human Body Part Parsing via Pose-Guided Knowledge Transfer"
Human body part parsing, or human semantic part segmentation, is fundamental to many computer vision tasks. In conventional semantic segmentation methods, the ground truth segmentations are provided, and fully convolutional networks (FCN) are trained in an end-to-end scheme. Although these methods have demonstrated impressive results, their performance highly depends on the quantity and quality of training data. In this paper, we present a novel method to generate synthetic human part segmentation data using easily-obtained human keypoint annotations. Our key idea is to exploit the anatomical similarity among human to transfer the parsing results of a person to another person with similar pose. Using these estimated results as additional training data, our semi-supervised model outperforms its strong-supervised counterpart by 6 mIOU on the PASCAL-Person-Part dataset, and we achieve state-of-the-art human parsing results. Our approach is general and can be readily extended to other object/animal parsing task assuming that their anatomical similarity can be annotated by keypoints. The proposed model and accompanying source code are available at https://github.com/MVIG-SJTU/WSHPREAD FULL TEXT VIEW PDF
Code for CVPR'18 spotlight "Weakly and Semi Supervised Human Body Part Parsing via Pose-Guided Knowledge Transfer"
The task of human body part parsing retrieves a semantic segmentation of body parts from the image of a person. Such pixel-level body part segmentations are not only crucial for activity understanding, but might also facilitate various vision tasks such as robotic manipulation , affordances reasoning  and recognizing human-object interactions 
. In recent years, deep learning has been successfully applied to the human part parsing problem[4, 5, 32].
To fully exploit the power of deep convolutional neural networks, large-scale datasets are indispensable. However, semantic labeling of body parts on a pixel-level is labor intensive. For the task of human body part parsing, the largest dataset  contains less than 2,000 labeled images for training, which is order of magnitudes less than the amount of training data in common benchmarks for image classification, semantic segmentation and keypoint estimation [9, 25, 12]. The small amount of data may lead to over-fitting and degrade the performance in real world scenarios.
On the other hand, an abundance of human keypoints annotations  is readily available. Current state-of-the-art pose estimation algorithms  have also performed well in natural scene. The human keypoints encode structural information of the human body and we believe that such high-level human knowledge can be transfered to the task of human body part parsing.
However, despite the availability of keypoints annotations, few methods have investigated to utilize keypoints as augmented training data for human body part parsing. The main problem is that human keypoints are a sparse representation, while the human body part parsing requires an enormous amount of training data with dense pixel-wise annotations. Consequently, a end-to-end method which only relies on keypoint annotations as labels may not achieve high performance.
In this paper, we propose a novel approach to augment training samples for human parsing. Due to physical anatomical constraints, humans that share the same pose will have a similar morphology. As shown in Fig 1, given a person, we can use his/her pose annotation to search for the corresponding parsing results with similar poses. These collected parsing results are then averaged and form a strong part-level prior. In this way, we can convert the sparse keypoint representation into a dense body part segmentation prior. With the strong part-level prior, we combine it with the input image and forward them through a refinement network to generate an accurate part segmentation result. The generated part segmentation can be used as extra data to train a human parsing network.
We conduct exhaustive experiments on the PASCAL-Part dataset . Our semi supervised method achieves state-of-the-art performance using a simple VGG-16  based network, surpassing the performance of ResNet-101  based counterpart trained on limited part segmentation annotations. When utilizing a model based on the deeper ResNet-101, our proposed method outperforms the state-of-the-art results by 3 mAP.
This paper is closely related to the following areas: semantic part segmentation, joint pose and body part estimation, and weakly supervised learning.
In this subtask of semantic segmentation, fully convolutional network (FCN)  and its variants [4, 5, 32] have demonstrated promising results. In , Chen et al. proposed atrous convolution to capture object features at different scales and they further combined the convolutional neural network (CNN) with a Conditional Random Field (CRF) to improve the accuracy. In 
, the authors proposed an attention mechanism that softly combines the segmentation predictions at different scales according to the context. To tackle the problem of scale and location variance, Xiaet al.  developed a model that adaptively zoom the input image into the proper scale to refine the parsing results.
Another approach to human parsing is the usage of recurrent networks with long short term memory (LSTM) units. The LSTM network can innately incorporate local and global spatial dependencies into their feature learning. In , Liang et al. proposed the local-global LSTM network to incorporate spatial dependencies at different distances to improve the learning of features. In , the authors proposed a Graph LSTM network to fully utilize the local structures (e.g., boundaries) of images. Their network takes arbitrary-shaped superpixels as input and propagates information from one superpixel node to all its neighboring superpixel nodes. To further explore the multi-level correlations among image regions, Liang et al.  proposed a structure-evolving LSTM that can learn graph structure during the optimization of LSTM network. These LSTM networks achieved competitive performance on human body part parsing.
Recent works try to utilize the human pose information to provide high-level structure for human body part parsing. Promising methods include pose-guided human parsing , joint pose estimation and semantic part segmentation [10, 20, 33] and self-supervised structure-sensitive learning . These methods focus on using pose information to regularize part segmentation results, and they lie in the area of strong supervision. Our method differs from theirs in that we aim to transfer the part segmentation annotations to unlabeled data based on pose similarity and generate extra training samples, which emphasizes semi-supervision.
In [29, 7, 28, 8, 24, 2], the idea is to utilize weakly supervised methods for semantic segmentation. In particular, Chen et al.  proposed to learn segmentation priors based on visual subcategories. Dai et al.  harnessed easily obtained bounding box annotations to locate the object and generated candidate segmentation masks using unsupervised region proposal methods. Since bounding box annotations cannot generalize well for background (e.g. water, sky, grasses), Li et al.  further explored training semantic segmentation models using the sparse scribbles. In , Bearman et al. trained a neural network using supervision from a single point of each object. Under some time budget, the proposed supervised method may yield improved result compared to other weak supervised counterparts.
Our goal is to utilize pose information to weakly supervise the training of a human parsing network. Consider a semantic part segmentation problem where we have a dataset of labeled training examples. Let denote an input image, denotes its corresponding part segmentation label, denotes its keypoints annotation, and is training example index. Each example contains at most body parts and keypoints. In practice, is usually small since labeling the human semantic part is very labor intensive.
For a standard semantic part segmentation problem, the objective function can be written as:
where is the pixel index of the image , is the semantic label at pixel , is the fully convolutional network, is the per-pixel labeling produced by the network given parameters , and
is the per-pixel loss function.
Similarly, we can consider another dataset of examples with only keypoints annotations where . Our goal is to generate part segmentations for all images in and utilize these as additional training data when training a FCN by minimizing the objective function in Eqn. 1.
To utilize the keypoints annotations, we directly generate the pixel-wise part segmentations based on keypoints annotations. Given a target image and keypoints annotation where , we first find a subset of images in that have the most similar keypoints annotations (Sec. 3.3). Then, for the clustered images, we apply pose-guided morphing to their part segmentations to align them with the target pose (Sec. 3.4). The aligned body part segmentations are averaged to generate the part-level prior for the target image . Finally, a refinement network is applied to estimate the body part segmentation of (Sec. 3.5). The training method of our refinement network will be detailed in Sec. 3.6. The generated part segmentations can be used as extra training data for human part segmentation (Sec. 3.7). Figure 2 gives an overview of our method.
In order to measure similarity between different poses, we first need to normalize and align all the poses. We normalize the pose size by fixing their torsos to the same length. Then, the hip keypoints are used as the reference coordinate to align with the origin. We measure the Euclidean distances between and every keypoint annotations in . The top- persons in with the smallest distances are chosen and form the pose-similar cluster, which serves as the basis for the following part-level prior generation step. The influence of will be evaluated in Sec. 4.3.
Intuitively, given an image with only keypoint annotations , one may think of an alternative solution to obtain the part-parsing prior by solely morphing the part segmentation result of the person that has the closest pose to
. However, due to the differences between human bodies or possible occlusions, a part segmentation result with distinct boundary may not fit well to another one. Thus, instead of finding the person with the most similar pose, we find several persons that have similar poses and generate part-level prior by averaging their morphed part segmentation results. Such averaged part-level prior can be regarded as a probability map for each body part. It denotes the possibility for each pixel of whether it belongs to a body part based on real data distribution. In Sec.4.3, we show that using the averaged prior achieves better performance than using only the parsing result of the closest neighbor.
The discovered pose-similar cluster forms a solid basis for generating the part-level prior. However, for each cluster, there are some inevitable intra-cluster pose variations, which makes the part parsing results misaligned. Thus, we introduce the pose-guided morphing method.
For persons in the same pose-similar cluster, let us denote their part parsing results as , their keypoints annotations as and the morphed part parsing results as . By comparing the poses in with the target pose , we can compute the transformation parameters and then use them to transform to obtain the morphed part parsing results . We use the affine transformation to morph the part segmentations. This procedure can be expressed as:
is the affine transformation with parameters , and computes according to pose and the target pose .
For the part parsing annotations, we represent them as the combination of several binary masks. Each mask represent the appearance of a corresponding body part. The morphing procedure is conducted on each body part independently. Consider the left upper arm as an example. For the left upper arm segment of pose and the same segment of pose , we have transformation relationship
where is the affine transformation matrix we need to calculate. Since both and are known, we can easily compute the result of and . Then, the morphed part segmentation mask can be obtained by
where and are the coordinates before and after transformation. Figure 3 illustrates our pose-guided morphing method.
After the pose-guided morphing, the transformed part segmentations are averaged to form the part-level prior
Finally, we feed forward our part-level prior through a refinement network together with the original image.
With a coarse prior, the search space for our refinement network is significantly reduced, and thus it can achieve superior results than directly making predictions from a single image input. The discriminative power of a CNN can eliminate the uncertainty at the body part boundary of the part parsing prior based on local image evidence, thus leading to high-quality parsing results. For each image with body part prior , we estimate the part segmentation result by
where is the refinement network with parameters . In the next section, we will elaborate on the learning of . This estimated part segmentation result can be used as extra training data to train the FCN. The semi-supervised regime will be further discussed in Section 3.7.
Previous sections are conducted under the assumption that we have a well-trained refinement network. In this section, we will explain the training algorithm. Fig. 4 depicts a schematic overview of the proposed training pipeline.
The refinement network is a variant of “U-Net” proposed in , which is in the form of an auto-encoder network with skip connections. For such a network, the input will be progressively down-sampled until a bottleneck layer and then gradually up-sampled to restore the input size. To preserve the local image evidence, skip connections are introduced between each layer and layer , assuming the network has layers in total. Each skip connection concatenates the feature maps at layer with those in layer . We refer readers to  for the detailed network structures. The input for our network is an image as well as a set of masks. Each mask is a heatmap ranging from 0 to 1 that represents the probability map for a specific body part. The output for this network is also a set of masks that have the same representation. The label is a set of binary masks indicating the appearance of each body part. Our objective function is the L1 distance between the output and the label. Both input and output size are set as .
To train the refinement network, we utilize the data in with both semantic part segmentation annotations and pose annotations. Given an image , similar to the pipeline for generating part-level prior in Sec. 3.4, we first generate the part parsing prior given . The only difference is that when discovering pose-similar cluster, we find nearest neighbours each time and randomly pick of them to generate part-level prior. This can be regarded as a kind of data augmentation to improve the generalization ability of the refinement network. The impact of will be discussed in Sec. 4.3. Formally, with the part-level prior and semantic part segmentation ground truth for image , we train our refinement network by minimizing the cost function:
In previous sections, we have presented our method to generate pixel-wise part segmentation labels based on keypoints annotations. Now we can train the parsing network for part segmentation in a semi-supervised manner. For our parsing network, we use the VGG-16 based model proposed in  due to its effective performance and simple structure. In this network, multi-scale inputs are applied to a shared VGG-16 based DeepLab model  for predictions. A soft attention mechanism is employed to weight the outputs of the FCN over scales. The training for this network follows the standard process of optimizing the per-pixel regression problem, which is formulated as Eqn. 1
. For the loss function, we use the multinomial logistic loss. During training, the input image is resized and padded to.
We consider updating the parameter of network by minimizing the objective function in Eqn. 1 on with ground truth labels and with generated part segmentation labels.
In this section, we first introduce related datasets and implementation details in our experiments. Then, we report our results and comparisons with state-of-the-art performance. Finally, we conduct extensive experiments to validate the effectiveness of our proposed semi-supervision method.
is a dataset for human semantic part segmentation. It contains 1,716 images for training and 1,817 images for testing. The dataset contains detailed pixel-wise annotations for body parts, including hands, ears, etc. The keypoint annotations for this dataset have been made available by .
is a challenging benchmark for person pose estimation. It contains around 25K images and over 40K people with pose annotations. The training set consists of over 28K people and we select those with full body annotations as extra training data for human part segmentation. After filtering, there remain 10K images with single person pose annotations.
is a part segmentation benchmark for horse and cow images. It contains 294 training images and 227 testing images. The keypoint annotations for this dataset are provided by . In addition, 317 extra keypoint annotations are provided by , which have no corresponding pixel-wise part parsing annotations. We take these keypoint annotations as extra training data for our network.
In our experiments, we merge the part segmentation annotations in 
to be Head, Torso, Left/Right Upper Arms, Left/Right Lower Arms, Left/Right Upper Legs and Left/Right Lower Legs, resulting in 10 body parts. The pose-guided morphing is conducted on the mask of each body part respectively. In order to be consistent with previous works, after the morphing, we merge the masks of each left/right pair by max-pooling and get six body part classes.
For our semi-supervision setting, we first train the refinement network and then fix it to generate synthetic part segmentation data using keypoints annotations in . Note that for simplicity, we only synthesize part segmentation labels for the single person case, and it would easy to extend to a multi-person scenario. To train our refinement network with single person data, we crop those people with at least upper body keypoints annotations in the training list of the part parsing dataset  and get 2,004 images with a single person and corresponding part segmentation label. We randomly sample 100 persons as validation set for the refinement network to set hyper-parameters.
To train the refinement network, we apply random jittering by first resizing input images to and then randomly cropping to . The batch size is set to 1. We use the Adam optimizer with an initial learning rate of and of 0.5. To train our parsing network, we follow the same setting as  by setting the learning rate as 1e-3 and decayed by 0.1 after 2,000 iterations. The learning rate for the last layer is 10 times larger than previous layers. The batch size is set to 30. We use the SGD solver with a momentum of 0.9 and weight decay of
. All experiments are conducted on a single Nvidia Titan X GPU. Our refinement network is trained from scratch for 200 epochs, which takes around 18 hours. The parsing network is initialized with a model pre-trained on COCO dataset which is provided by the author. We train it for 15K iterations, and it takes around 30 hours for training.
We first compare the performance of the parsing network trained with a different number of annotations on validation set and the results are reported in Table 1. When all annotations of semantic part segmentations are used, the result is the original implementation of the baseline model , and our reproduction is slightly (0.5 mIOU) higher. Then we gradually add the data with keypoint annotations. As can be seen, the performance increases in line with the number of keypoint annotations. This suggests that our semi supervision method is effective and scalable. Fig. 6 gives some qualitative comparisons between the predictions of the baseline model and our semi-supervised model.
|supervision||mask anno.#||keypoints anno.#||mIoU|
We evaluate the significance of the part-level prior. First, we compare the performance of refinement network trained with or without part-level prior on the cropped single person part segmentation validation set. Without prior, the refinement network is a regular FCN trained with limited training data. The validation accuracy for both cases is shown in Fig. 7. As we can see, the accuracy of the refinement network with part-level prior is much higher than the case without prior. In Fig. 5, we visualize some predictions from our refinement network trained with and without part-level priors, and the corresponding priors are also visualized.
|strategy||cluster size k||pool size n||mIoU|
|skeleton label map ||-||-||58.26|
|part-level prior, w/o aug||1||-||60.10|
|part-level prior, w aug||3||5||62.60|
During our training process, the quality of part-level prior is important and directly affects the refined part segmentation results. We compare different prior generation strategies and report the final results in Table 2. We first explore using the skeleton label map as prior, which draws a stick with width 7 between neighboring joints, and the result is 58.26 mIOU. Comparing to this method, our proposed part-level prior has a considerable improvement, which indicates the importance of knowledge transfer during prior generation.
For part-level prior, we compare the impact of the size of pose-similar cluster. As aforementioned in Sec. 3.3, if we only choose the person with the nearest pose and take his/her morphed part parsing result as our prior, the performance is limited. But if we choose too many people to generate our part-level prior, the quality of the final results would also decline. We claim that this is because our part segmentation dataset is small and the intra-cluster part appearance variance would increase as the cluster size increases. Then, we explore the influence of adding data augmentation during the training of refinement network. As we can see, randomly sample 3 candidates to generate part-level prior from a larger pool with size 5 is beneficial for training since it increases the sample variances. For the remaining experiments, we set as 3 and as 5.
In Table 3, we report comparisons with the state-of-the-art results on the Pascal-Person-Part dataset . With additional training data generated by our refinement network, our VGG16 based model achieves the state-of-the-art performance.
|Graph LSTM ||82.69||62.68||46.88||47.71||45.66||40.93||94.59||60.16|
|Structure-evolving LSTM ||82.89||67.15||51.42||48.72||51.72||45.91||97.18||63.57|
|Joint (VGG-16, +ms) ||80.21||61.36||47.53||43.94||41.77||38.00||93.64||58.06|
|Joint (ResNet-101, +ms) ||85.50||67.87||54.72||54.30||48.25||44.76||95.32||64.39|
|Ours (ResNet-101, +ms)||87.15||72.28||57.07||56.21||52.43||50.36||97.72||67.60|
Although we focus on exploiting human keypoints annotations as extra supervision, our method applies to other baseline models and can benefit from other parallel improvements. To prove that, we replace our baseline model with ResNet-101  based DeepLabv2  model and follow the same training setting. Without additional keypoints annotations, this baseline achieves mAP. Under our semi-supervised training, this model achieves mIOU. Note that this result is obtained by single scale testing. By performing multi-scale testing (scale = 0.5, 0.75, 1), we can further achieve mIOU, which outperforms previous best result by mIOU.
To see how the refinement network can assist the training of the parsing network, we visualize some predictions of both networks in Fig. 8. In this experiment, the parsing network we use has already been trained under the semi supervision setting. As shown in Fig. 8, due to the guidance of the strong part-level prior, the refinement network makes fewer mistakes on the structure predictions (e.g., the upper legs in (a) and the upper arms in (b)), and produces tighter masks (e.g., the legs in (c)). These improvements obtained by using the part-level prior will be transferred to the parsing network during the semi-supervised training. On the other hand, by leveraging a large number of training data, the parsing network can figure out which are important or irrelevant features and produce predictions that are less noisy (e.g.,(b) and (d)).
Since single person pose estimation is mature enough to be deployed, what if we replace the ground-truth keypoint annotations with pose estimation results? To answer that, we use the pose estimator  trained on MPII dataset to estimate human poses in COCO dataset . Same as previous, we crop those people with full body annotations and collect 10K images. The semi-supervised result achieves mIOU, which is on par with the results trained on ground-truth annotations. It shows that our system is robust to noise and suggests a promising strategy to substantially improve the performance of human semantic part segmentation without extra costs.
. Our baseline model, which is the attention model, has an mIOU of 71.55 for the Horse and an IOU of 68.84 for the Cow. By leveraging the keypoint annotations provided by , we gain improvements of 3.14 and 3.39 mIOU for Horse and Cow respectively. The consistent improvements across different categories indicate that our method is general and applicable to other segmentation tasks where their anatomical similarity can be annotated by keypoints. Finally, by replacing our baseline model with the deeper ResNet-101 based model, we achieve the most state-of-the-art result on the Horse-Cow dataset, yielding the final performances of and mIOU for these two categories respectively.
|Graph LSTM ||91.73||72.89||86.34||69.04||53.76||74.75|
|Structure-evolving LSTM ||92.51||74.89||87.55||71.93||57.45||76.87|
|Graph LSTM ||91.54||73.88||85.92||63.67||35.22||70.05|
|Structure-evolving LSTM ||92.88||77.75||87.91||67.60||42.86||73.80|
In this paper, we propose a novel strategy to utilize the keypoint annotations to train deep network for human body part parsing. Our method exploits the constraints on biological morphology to transfer the part parsing annotations among different persons with similar poses. Experimental results show that it is effective and general. By utilizing a large number of keypoint annotations, we achieve the most state-of-the-art result on human part segmentation.
This work is supported in part by the National Natural Science Foundation of China under Grants 61772332. This work is also support in part by Sensetime Ltd