Towards Real World Human Parsing: Multiple-Human Parsing in the Wild

05/19/2017 · by Jianshu Li, et al.

The recent progress of human parsing techniques has been largely driven by the availability of rich data resources. In this work, we demonstrate some critical discrepancies between the current benchmark datasets and real-world human parsing scenarios. For instance, all existing human parsing datasets contain only one person per image, while multiple persons usually appear simultaneously in realistic scenes. Parsing multiple persons simultaneously is in higher practical demand and presents a greater challenge to modern human parsing methods. Unfortunately, the absence of relevant data resources severely impedes the development of multiple-human parsing methods. To facilitate future human parsing research, we introduce the Multiple-Human Parsing (MHP) dataset, in which every image contains multiple persons captured in a real-world scene. The MHP dataset contains various numbers of persons (from 2 to 16) per image, with 18 semantic classes in each parsing annotation. Persons appearing in the MHP images present sufficient variations in pose, occlusion and interaction. To tackle the multiple-human parsing problem, we also propose a novel Multiple-Human Parser (MH-Parser), which considers both the global context and local cues for each person in the parsing process. The model is demonstrated to outperform the naive "detect-and-parse" approach by a large margin; it will serve as a solid baseline and help drive future research in real-world human parsing.


1 Introduction

Human parsing refers to partitioning persons captured in an image into multiple semantically consistent regions, e.g. body parts and clothing items (cf. Fig. 1). As a fine-grained semantic segmentation task, it is more challenging than human segmentation, which only aims to find the silhouettes of persons. Human parsing is very important for human-centric analysis and has many industrial applications, e.g. virtual reality [1], video surveillance [2], and human behavior analysis [3, 4].

Figure 1: Annotation examples from our Multiple Human Parsing (MHP) dataset (c) and from existing human parsing datasets (a: ATR [5]; b: Look into Person (LIP) [6]). In (c), rectangles in different colors indicate distinct person instances. ATR contains images of single persons in upright positions; LIP includes more pose variations, but still contains only a single person per image. The MHP dataset provides images with fine-grained annotations for multiple persons with interaction, occlusion and various poses, aligning better with real-world scenarios.

Remarkable progress has been made in parsing a single person in an image [7, 8, 9]. Single-human parsing features controlled and simplified scenarios without the human interaction, occlusion and pose variations that are common in real scenes. Thus single-human parsing techniques deviate considerably from realistic requirements. Although the multi-human parsing problem could be straightforwardly addressed by applying person detectors as a preprocessing step, standard person detectors work best for upright people with simple poses, such as pedestrians. In more realistic scenarios where multiple persons are close to each other and present intimate interaction and body occlusion, person detectors tend to produce false negatives, which harms multi-human parsing performance. Moreover, although instance semantic segmentation methods [4, 10] consider the presence of multiple humans, they only provide silhouettes of humans without fine-grained sub-category details, which does not fulfill the requirements of human parsing. Other works [11, 12, 6, 13, 14] that look into semantic parts within persons either consider only coarse parts or are agnostic to person instances.

Considering the gap between current human parsing techniques and real-world requirements, we aim to drive research on multi-human parsing. Towards solving this challenging problem, we introduce a new multi-human parsing dataset and a novel multi-human parsing model. In particular, we construct and annotate a new large-scale dataset, named the Multiple Human Parsing (MHP) dataset, providing images of multiple humans in an instance-aware setting with fine-grained pixel-level annotations. The humans in the images are captured in real-world scenarios with challenging poses, heavy occlusion and varied appearances. Annotation examples and a comparison with existing human parsing datasets are illustrated in Fig. 1; see Sec. 3 for more details. The MHP dataset will serve as a valuable data resource for developing multi-human parsing models and as a benchmark for evaluating their performance.

We also propose a novel Multiple Human Parser model, named MH-Parser, to solve the challenging multi-human parsing problem. Unlike most existing methods, which focus on single-human parsing and rely on separate off-the-shelf person detectors to localize persons in images, the proposed MH-Parser tackles multi-human parsing by generating global parsing maps and instance masks for multiple persons simultaneously in a bottom-up fashion, without resorting to any ad-hoc detection models. To better capture human body structure, part configuration and human interaction, the MH-Parser introduces a novel Graph Generative Adversarial Network (Graph-GAN) model that learns to predict graph-structured instance parsing results by developing a graph convolutional discriminative model. The Graph-GAN is also of independent research interest for applying GAN-like models to graph data analysis.

To sum up, we make the following contributions. 1) We introduce the multi-human parsing problem that extends the research scope of human parsing and matches real-world scenarios better. 2) We construct the MHP dataset, a large-scale multi-human parsing benchmark, to advance the development of relevant techniques. 3) We propose a novel model MH-Parser, which serves as a strong baseline method for multi-human parsing in the wild.

2 Related Work

Human Parsing

Previous human parsing methods [15, 9, 16] and datasets [7, 11, 9, 5] mainly focus on single-human parsing, which severely limits their practical use. None of the commonly used human parsing datasets considers instance-aware cases. Moreover, the persons in these datasets are usually in upright positions with limited pose changes, which does not accord with reality. Recently, human parsing in the wild was investigated in [6], where persons present varying clothing appearances and diverse viewpoints, but it only considers the instance-agnostic setting of human parsing. Different from existing human parsing datasets, the proposed MHP dataset considers the simultaneous presence of multiple persons in an instance-aware setting, with challenging pose variations, occlusion and interaction between persons, aligning much better with reality.

Instance-Aware Object/Human Segmentation

Recently, many research efforts have been devoted to instance-aware object/human semantic segmentation, which can be addressed by top-down and bottom-up approaches. In the top-down family, a detector (or a component functioning as a detector) is used to localize each instance, which is further processed to generate a pixel-level segmentation. Multi-task Network Cascades (MNC) [10] consist of three separate networks for differentiating instances, estimating masks and categorizing objects, respectively. The first fully convolutional end-to-end solution to instance-aware semantic segmentation [17] performs instance mask prediction and classification jointly. Mask-RCNN [18] adds a segmentation branch to the state-of-the-art object detector Faster-RCNN [19] to perform instance segmentation. Top-down approaches heavily depend on the detection component and suffer poor performance when instances are close to each other. In the bottom-up family, detection is usually not used. Instead, embeddings of all pixels are learned and later used to cluster pixels into different instances. In [20], embeddings are learned with a grouping loss that performs pairwise comparisons across randomly sampled pixels. In [21, 22], a discriminative loss containing push and pull forces is used to learn an embedding for each pixel. In [4], pixel embeddings are learned with direct supervision from instance locations. Different from methods that operate on pixels, we learn an embedding for each superpixel. Furthermore, Graph-GAN is used to refine the learned embeddings by leveraging high-order information. These instance-aware person segmentation methods, whether top-down or bottom-up, can only predict person-level segmentations without any detailed information on body parts and fashion categories, which is disadvantageous for fine-grained image understanding. In contrast, our MHP dataset is proposed for fine-grained multi-human parsing in the wild, aiming to boost research in real-world human-centric analysis.

Generative Adversarial Networks

Recently proposed GAN-based methods [23, 24, 25] have yielded remarkable performance in generating photo-realistic images [26] and semantic segmentation maps [27] by specifying only a high-level goal like "to make the output indistinguishable from the reality" [28]. A GAN automatically learns a customized loss function that adapts to the data and guides the generation of high-quality images. Different from existing work on image-based GANs, which can only process regular input (e.g. 2D grid images), the proposed Graph-GAN takes a flexible data structure, i.e. graphs, as input. To the best of our knowledge, this is the first exploration of GANs on graph-structured data in the literature.

3 The MHP Dataset

In this section we introduce the Multiple Human Parsing (MHP) dataset designed for multi-human parsing in the wild. Some exemplar images and annotations are shown in Fig. 1.

3.1 Image Collection and Annotation Methodology

As pointed out in [14], in generic recognition datasets like PASCAL [29] or COCO [30], only a small percentage of images contain multiple persons. Also, persons in these generic recognition datasets usually lack fine details compared to human-centric datasets, such as those for people recognition in photo albums [31], human immediacy prediction [32] and interpersonal relation prediction [33]. To benefit the development of new multi-human parsing models, we construct a pool of images from existing human-centric datasets [34, 32, 33, 31] as well as online Creative Commons licensed imagery. From the image pool, we select a subset of images that contain clearly visible persons with intimate interaction, rich fashion items and diverse appearances, and manually annotate them with two operations: 1) counting and indexing the persons in each image, and 2) annotating each person. We implement an annotation tool and generate multi-scale superpixels of images based on [35] to speed up the annotation. For each instance, 18 pre-defined semantic categories (also commonly used in single-human parsing datasets) are annotated: hat, hair, sun glasses, upper clothes, skirt, pants, dress, belt, left shoe, right shoe, face, left leg, right leg, left arm, right arm, bag, scarf and torso skin. Each instance has a complete set of annotations whenever the corresponding category appears in the image. When annotating one instance, the others are regarded as background. Thus, the resulting annotation set for each image consists of $K$ person-level parsing masks, where $K$ is the number of persons in the image.

3.2 Dataset Statistics

The MHP dataset contains various numbers of persons in each image; the distribution is illustrated in Fig. 2 (middle). Real-world human parsing aims to analyze every detailed region of each person of interest, including body parts, clothes and accessories. Thus we define 7 body part categories and 11 clothing and accessory categories. Among the body parts, we divide arms and legs into left and right sides for more precise analysis, which also increases the difficulty of the task. As for clothing categories, we include not only common clothes like upper clothes, pants and shoes, but also confusable categories such as skirt and dress, and infrequent categories such as scarf, sun glasses, belt and bag. The statistics for each semantic part annotation are shown in Fig. 2 (right).

Figure 2: Examples and statistics of the MHP dataset. Left: An annotated example for multi-human parsing. Middle: Statistics on number of persons in one image. Right: The data distribution on 18 semantic part labels in the MHP dataset.

The MHP dataset contains 4,980 images, each with multiple persons, ranging from 2 to 16 persons per image (3 on average). The dataset provides fine-grained pixel-level annotations with 18 different semantic labels for every person instance, whose scales and image resolutions vary widely. For comparison, other human parsing datasets, such as Fashionista [7] (685 person instances), ATR [9] (17,700) and LIP [6] (50,462), all reflect the single-human parsing setting, which deviates from real-world human parsing requirements.

In MHP, the person instances are entangled, with close interaction and occlusion. To verify this, we calculate the mean average Intersection Over Union (IOU) of person bounding boxes in the dataset. That is, we find the average IOU between person instances in each image, and then take the mean value over the whole dataset. The mean average IOU of MHP is considerably higher than that of COCO [30], a widely used human instance segmentation dataset, when measured on its images with multiple persons. It is also higher than that of the Buffy [36] dataset, which is used for person individuation and claims to have multiple closely entangled persons [14]. Thus MHP is a much more challenging dataset in terms of separating closely entangled person instances, and will serve as a more realistic benchmark for human-centric analysis to push the frontier of human parsing research.
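To make this statistic concrete, the following is a minimal Python sketch (our illustration; the helper names are not from the paper) of the mean average IOU over person bounding boxes:

    from itertools import combinations

    def box_iou(a, b):
        # Boxes are (x1, y1, x2, y2) in pixel coordinates.
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def mean_average_iou(images):
        # `images` is a list of per-image lists of person boxes.
        per_image = []
        for boxes in images:
            pairs = list(combinations(boxes, 2))
            if pairs:
                per_image.append(sum(box_iou(a, b) for a, b in pairs) / len(pairs))
        return sum(per_image) / len(per_image)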

4 The MH-Parser

In this section we elaborate on the proposed MH-Parser model for parsing multiple humans. The proposed MH-Parser simultaneously generates a global semantic parsing map and a pairwise affinity map (which is used to construct instance masks). The former represents the union of the instance parsing maps of all persons in the input image, and the latter distinguishes one person from another. The overall architecture of MH-Parser is shown in Fig. 3.

Figure 3: Architecture overview of the proposed Multiple Human Parser (MH-Parser). Here $M$ refers to the global accordance map, $A^{gt}$ refers to the ground truth pairwise affinity map and $\hat{A}$ denotes the predicted one. $A^{gt}$ is obtained by a rule-based mapping from $M$ and the corresponding superpixel map (see Eqn. (4) and (5)), and $\hat{A}$ is the output of the graph generator (consisting of the representation learner and the affinity prediction net). The graph convolution discriminator takes the affinity graph from the graph generator as input and predicts whether it is a ground truth or a prediction. Fusing the predicted instance-agnostic parsing map and the instance masks (constructed from $\hat{A}$) gives the instance-aware parsing results.

4.1 Global Parsing Prediction

The MH-Parser uses a deep representation learner to learn rich and discriminative representations that are shared by the global parsing and affinity map predictions. In particular, the representation learner is a fully convolutional network with a ResNet trunk adopted from DeepLab [12]. It generates features at 1/8 of the spatial dimension of the input image. On top of this learner, a small parsing net consisting of atrous spatial pyramid pooling [12] is used to generate instance-agnostic semantic parsing maps of the whole image, as shown in Fig. 3.

Formally, let $G$ denote the global parsing module. Given an input image $I$ of size $H \times W$, its output $G(I)$ gives an instance-agnostic parsing over $C$ semantic categories at a scaled-down size compared with the input image $I$. The global parsing predictor can be trained by minimizing the following standard parsing loss:

\mathcal{L}_{G} = \ell\big(G(I), Y\big),    (1)

where $\ell$ is the pixel-wise cross-entropy loss and $Y$ is the ground truth labeling of the instance-agnostic semantic parsing map.
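As an illustration, one common way to realize the loss in Eqn. (1) is sketched below in PyTorch; the tensor shapes and the choice to upsample the logits are our assumptions, not the authors' implementation:

    import torch.nn.functional as F

    def global_parsing_loss(logits, target):
        # logits: (B, C, h, w) raw scores from the parsing net;
        # target: (B, H, W) integer labels in [0, C-1].
        logits = F.interpolate(logits, size=target.shape[-2:],
                               mode='bilinear', align_corners=False)
        return F.cross_entropy(logits, target)  # pixel-wise cross-entropy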

4.2 Graph-GAN for Affinity Map Prediction

The global parsing results do not present any instance-level information, which however is essential for multi-human parsing. Different from top-down solutions, we propose a novel Graph-GAN model for learning instance information in a bottom-up fashion, simultaneously with the global parsing prediction.

Global Accordance Map

Global accordance maps distinguish different persons by associating them with different accordance scores. For an input image of size $H \times W$, its global accordance map $M$ is defined as

M(p) = k if pixel $p$ belongs to the $k$-th person instance, and M(p) = 0 if pixel $p$ belongs to the background.    (2)

An example of the global accordance map constructed from the ground truth instance parsing map is shown in Fig. 3.

Predicting global accordance scores accurately is important for separating different person instances and deriving high-quality multi-human parsing results. However, accordance prediction is very challenging, due to the large appearance variance of intra-instance pixels and the subtle differences between some pixels from different instances. The number of persons is unknown and varies across images, making traditional classification approaches inapplicable. Moreover, the accordance scores are expected to be invariant to permutations of the person instance ids. This implies that the learning process of accordance scores is extremely unstable if we directly use the ground truth global accordance map defined in Eqn. (2) as supervision.
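A small NumPy sketch of how such a map can be assembled from per-person binary masks, following Eqn. (2) as reconstructed above (our illustration):

    import numpy as np

    def global_accordance_map(instance_masks):
        # instance_masks: list of (H, W) boolean masks, one per person.
        M = np.zeros(instance_masks[0].shape, dtype=np.int32)  # 0 = background
        for k, mask in enumerate(instance_masks, start=1):
            M[mask] = k  # person ids are arbitrary, hence permutation-variant
        return M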

Pairwise Affinity Graph

Since directly predicting global accordance scores is difficult, the MH-Parser generates a pairwise affinity graph instead. Specifically, the MH-Parser introduces a graph generator that learns to optimize the pairwise distances (or affinities) among regions within input images. In the MH-Parser, superpixels are regarded as the basic region units for computing affinities, for two reasons. First, superpixels are natural low-level representations for delineating boundaries between semantic concepts. Second, superpixels can be regarded as a low-level pixel grouping, so the complexity of affinity computation is greatly reduced compared to pixel-level affinity computation.

Formally, we define the pairwise affinity graph as

\mathcal{G} = (\mathcal{V}, \mathcal{E}).    (3)

In the graph, each vertex is one superpixel within the image. There are $N$ superpixels in total and $v_i$ is the $i$-th superpixel. Each edge represents the connectivity between a pair of vertices ($v_i$, $v_j$), described by the pairwise affinity map $A$. The ground truth pairwise affinity map $A^{gt}$ is derived from a rule-based mapping, which is defined as

A^{gt}_{ij} = \mathbb{1}\big[m_i = m_j\big]    (4)

and

m_i = \operatorname{mv}\big(\{M(p) : p \in \mathcal{P}_i\}\big).    (5)

Here $\mathcal{P}_i$ represents all pixels within $v_i$ and $\operatorname{mv}$ denotes the majority vote operation. Note that although the ground truth $M$ has multiple possible values due to the random assignment of person ids, the corresponding ground truth $A^{gt}$ is unique regardless of how the person ids are assigned.
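The rule-based mapping of Eqn. (4) and (5) can be sketched as follows; the array layouts are assumptions for illustration:

    import numpy as np

    def gt_affinity(M, superpixels):
        # M: (H, W) global accordance map; superpixels: (H, W) ids in 0..N-1.
        n = superpixels.max() + 1
        acc = np.zeros(n, dtype=np.int64)
        for i in range(n):
            vals = M[superpixels == i]
            acc[i] = np.bincount(vals).argmax()  # majority vote, Eqn. (5)
        # Eqn. (4): affinity is 1 iff two superpixels share the same accordance.
        return (acc[:, None] == acc[None, :]).astype(np.float32)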

The pairwise affinity map can be learned directly by taking the ground truth $A^{gt}$ as the regression target. The predicted pairwise affinity map $\hat{A}$ is generated by an affinity prediction net, which draws features from the representation learner. The affinity prediction net first generates a set of features $F$, where $C_f$ is the number of channels of $F$. Then it applies superpixel pooling on $F$, followed by an affinity transformation with a Gaussian kernel, to obtain $\hat{A}$:

\hat{A}_{ij} = \exp\Big(-\frac{\|f_i - f_j\|_2^2}{\sigma^2}\Big),    (6)

where

f_i = \frac{1}{|\mathcal{P}_i|} \sum_{p \in \mathcal{P}_i} F(p).    (7)

Here $\sigma$ is the parameter controlling the sensitivity of $\hat{A}$, and $|\mathcal{P}_i|$ is the number of pixels within superpixel $v_i$.
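A hedged PyTorch sketch of the superpixel pooling of Eqn. (7) and the Gaussian affinity of Eqn. (6); shapes and names are illustrative:

    import torch

    def predicted_affinity(features, superpixels, sigma=1.0):
        # features: (Cf, H, W) float tensor from the affinity prediction net;
        # superpixels: (H, W) long tensor with ids in 0..N-1.
        Cf = features.shape[0]
        n = int(superpixels.max()) + 1
        flat = features.reshape(Cf, -1).t()               # (H*W, Cf)
        ids = superpixels.reshape(-1)
        pooled = torch.zeros(n, Cf).index_add_(0, ids, flat)
        counts = torch.bincount(ids, minlength=n).clamp(min=1).unsqueeze(1)
        f = pooled / counts                               # Eqn. (7)
        d2 = torch.cdist(f, f).pow(2)
        return torch.exp(-d2 / sigma ** 2)                # Eqn. (6)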

The network for predicting $\hat{A}$ can be trained by minimizing the distance between $\hat{A}$ and its ground truth $A^{gt}$. However, when $\hat{A}$ is learned with such direct supervision, its elements are learned independently of each other. The contiguity and relations (reflecting intrinsic human body structures) within $\hat{A}$ are not captured. For example, if node $v_i$ is connected to $v_j$ and $v_j$ is connected to $v_k$, then $v_i$ should also be connected to $v_k$. Such higher-order affinities between regions are not captured under direct supervision.

Predicting Affinity Graph with Graph-GAN

To remedy the potential issues of learning with direct supervision over $\hat{A}$, we propose a novel GAN model, the Graph-GAN, to augment the learning process. Different from existing GAN-based models, which can only process regular input (like 2D grid images), the Graph-GAN can take in and process flexible graph-structured data. It aims to learn high-quality affinity graphs that better capture human body structure, part configuration and human interaction.

In the adversarial learning of the Graph-GAN model, the ground truth affinity graphs use $A^{gt}$ from Eqn. (4) in the edge definition, while the predicted affinity graphs use $\hat{A}$ from Eqn. (6). The generator in the Graph-GAN learns to generate high-quality affinity graphs that are indistinguishable from the ground truth ones. The discriminator in the Graph-GAN aims to tell the predicted affinity graphs apart from the ground truth ones. With the generator and discriminator playing against each other, the discriminator learns to supervise the generator in a way tailored to the graph-structured data.

The representation learner and the affinity prediction net are adopted as the generator in the Graph-GAN model to generate the predicted affinity graph. To handle graph-structured input, we propose a Graph Convolution Network (GCN) based discriminator model. The GCN [37, 38, 39] can effectively model graph-structured data and is thus suitable for classifying input graphs, serving as the discriminator.

In particular, we use a simple form of layer-wise propagation rule [37, 39]:

H^{(l+1)} = \sigma\big(\tilde{A} H^{(l)} W^{(l)} + b^{(l)}\big),    (8)

where $\tilde{A}$ is the normalized adjacency matrix of the graph (the pairwise affinity map in our case), $H^{(l)}$ denotes the hidden activations in the GCN, $W^{(l)}$ and $b^{(l)}$ denote the learnable weights and biases, $\sigma$ is a non-linear activation function, and $l$ is the layer index. Thus $H^{(0)}$ represents the input node features and $H^{(L)}$ represents the output node features, where $L$ is the total number of layers in the GCN. We follow [37] and normalize the adjacency matrix $A$ to make the propagation stable:

\tilde{A} = \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}},    (9)

where $\hat{A} = A + I_N$ with $I_N$ as the identity matrix, and $\hat{D}$ is the diagonal node degree matrix of $\hat{A}$, i.e. $\hat{D}_{ii} = \sum_j \hat{A}_{ij}$.

In GCN, the graph convolution operation effectively diffuses the features across different regions (including body parts and background) based on the connectivities between the regions. With multiple layers of feature propagation within GCN, higher order relations of different regions are captured, which help identify the intrinsic body part structures of multiple humans.
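For concreteness, a minimal PyTorch sketch of the normalization in Eqn. (9) and one propagation step of Eqn. (8), following the standard GCN formulation of [37] (our illustration, not the authors' code):

    import torch

    def normalize_adjacency(A):
        # Eqn. (9): symmetric normalization with self-loops.
        A_hat = A + torch.eye(A.shape[0])
        d_inv_sqrt = torch.diag(A_hat.sum(dim=1).pow(-0.5))
        return d_inv_sqrt @ A_hat @ d_inv_sqrt

    def gcn_layer(A_norm, H, W, b):
        # Eqn. (8): one layer of propagation with a ReLU non-linearity.
        return torch.relu(A_norm @ H @ W + b)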

Since the layer-wise propagation rule in Eqn. (9) only models the transformation of node features, a node pooling operation is needed to obtain a graph-level feature. We define a node pooling layer on top of the final output node features $H^{(L)}$ with an attention mechanism, as commonly used in natural language processing [40, 41]:

h_{\mathcal{G}} = \sum_{i=1}^{N} \operatorname{att}(X)_i \odot H^{(L)}_i.    (10)

Here $\operatorname{att}(\cdot)$ generates an attention weight vector based on the attention layer input feature $X$, and $\odot$ denotes the element-wise product, which applies the attention weights to the features of every node in the graph. We use the attention mechanism as the node feature pooling, resulting in a single descriptor $h_{\mathcal{G}}$ for the whole graph. With the attention pooling operation, the diffused features of the different regions within an image are aggregated into one feature vector. Then $h_{\mathcal{G}}$ is used as the input to a classifier that predicts whether the input affinity graph is a ground truth one or a predicted one in the adversarial training setting. The input feature to the GCN model is a one-hot embedding of each node, i.e. $H^{(0)} = I_N$. The input feature of the node attention layer is the feature from the parsing net prediction with superpixel pooling applied, such that each node corresponds to a $C$-dimensional feature vector and $X \in \mathbb{R}^{N \times C}$.
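A possible PyTorch sketch of the attention-based node pooling of Eqn. (10); the module layout is an assumption for illustration:

    import torch
    import torch.nn as nn

    class AttentionPooling(nn.Module):
        def __init__(self, in_dim):
            super().__init__()
            # att(.) maps per-node features to normalized attention weights.
            self.att = nn.Sequential(nn.Linear(in_dim, 1), nn.Softmax(dim=0))

        def forward(self, X, H_final):
            # X: (N, C) attention input; H_final: (N, D) final GCN features.
            a = self.att(X)              # (N, 1) attention weights
            return (a * H_final).sum(0)  # (D,) graph-level descriptor h_G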

4.3 Training and Inference

We train the generator with the following losses. For the global parsing task, the loss function in Eqn. (1) is used. For the affinity graph prediction task, we minimize the distance between the predicted pairwise affinity map and the ground truth pairwise affinity map with an $\ell_2$ loss:

\mathcal{L}_{A} = \big\| B \odot \big(F_A(I) - A^{gt}\big) \big\|_2^2.    (11)

Here $F_A$ represents the mapping function from the input image to the predicted pairwise affinity map, i.e. $\hat{A} = F_A(I)$. $B$ is a binary mask indicating connections only between foreground nodes, and it is used to set all other connections to $0$. For training the Graph-GAN, the corresponding loss is

\mathcal{L}_{GAN} = \mathbb{E}\big[\log D(\mathcal{G}^{gt})\big] + \mathbb{E}\big[\log\big(1 - D(\hat{\mathcal{G}})\big)\big],    (12)

where $D$ denotes the GCN-based discriminator. Thus the overall objective function is to find $G^{*}$ such that

G^{*} = \arg\min_{G} \max_{D} \; \mathcal{L}_{G} + \lambda_1 \mathcal{L}_{A} + \lambda_2 \mathcal{L}_{GAN},    (13)

where $\lambda_1$ and $\lambda_2$ balance the loss terms. After finding the optimal $G^{*}$, we use it to generate global parsing maps and affinity maps for testing images.
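A hedged PyTorch sketch of the adversarial part of this objective (Eqn. (12)), written as standard discriminator and generator losses; function names are illustrative:

    import torch
    import torch.nn.functional as F

    def discriminator_loss(D, gt_graph, pred_graph):
        # D scores ground truth graphs as real and predicted graphs as fake.
        real, fake = D(gt_graph), D(pred_graph.detach())
        return (F.binary_cross_entropy_with_logits(real, torch.ones_like(real))
                + F.binary_cross_entropy_with_logits(fake, torch.zeros_like(fake)))

    def generator_gan_loss(D, pred_graph):
        # The generator is updated to make its graphs look real to D.
        fake = D(pred_graph)
        return F.binary_cross_entropy_with_logits(fake, torch.ones_like(fake))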

During testing, we use the predicted affinity graph $\hat{\mathcal{G}}$ to perform spectral clustering. Background nodes are identified with the global parsing map and removed from the affinity graph. Then all the foreground nodes are clustered according to the pairwise affinities in $\hat{A}$, and the different person instances are identified from the clustering results. To help clustering, a regression layer built upon the representation learner is used to learn the number of persons during training, and the predicted person number is used in clustering during testing. It is omitted from the network structure and the objective function for brevity.
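For illustration, the clustering step could be realized with scikit-learn as follows; names and data layouts are our assumptions:

    import numpy as np
    from sklearn.cluster import SpectralClustering

    def cluster_instances(A_pred, foreground, n_persons):
        # A_pred: (N, N) predicted affinities; foreground: indices of
        # non-background superpixels; n_persons: predicted person count.
        sub = A_pred[np.ix_(foreground, foreground)]
        sc = SpectralClustering(n_clusters=n_persons, affinity='precomputed')
        labels = sc.fit_predict(sub)  # one cluster per person instance
        return dict(zip(foreground, labels))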

4.4 Instance Mask Refinement

We extend our model with a refinement step to reinforce the prediction of instance masks (obtained from the clustering results) from the superpixel level to the pixel level. We adopt a Conditional Random Field (CRF) [42] in the refinement to associate each pixel in the image with one of the persons (from the clustering results) or the background. The CRF model contains two unary terms, i.e. a person consistency term $\psi_{p}$ and a global term $\psi_{g}$, and a binary term $\psi_{pair}$. With $x_i$ denoting the random variable for the $i$-th pixel in the image, the target of the instance mask refinement is to find the optimal solution for all pixels in the image that minimizes the following energy function:

E(\mathbf{x}) = \sum_i \psi_{p}(x_i) + \sum_i \psi_{g}(x_i) + \sum_{i < j} \psi_{pair}(x_i, x_j).    (14)

We define these terms as follows. Given $K$ persons from the clustering results over the predicted affinity map, and assuming the $k$-th person is represented by a binary mask $S_k$ indicating whether each pixel belongs to the $k$-th person, we define the person consistency term as

\psi_{p}(x_i = k) = -\log\big(p^{fg}_i \, S_k(i)\big).    (15)

Here $p^{fg}_i$ denotes the probability of the $i$-th pixel being foreground. The person consistency term is designed to give strong cues about which person each foreground pixel should belong to. As in [43, 13], the global term is defined as

\psi_{g}(x_i = k) = -\log\Big(\frac{p^{fg}_i}{K}\Big),    (16)

which complements the person consistency term by giving each foreground pixel an equal likelihood of belonging to every person, so as to correct errors in the clustering process. Finally, we define the pairwise term as

\psi_{pair}(x_i, x_j) = \mu(x_i, x_j)\, k(f_i, f_j),    (17)

where $\mu$ is the compatibility function, $k$ is the kernel function and $f_i$ is the feature vector at spatial location $i$. The feature vector contains the $C_f$-dimensional vector $e_i$ from $F$ (obtained by up-sampling $F$ to match the spatial dimension of the input image) in the affinity prediction net, the 3-dimensional color vector $c_i$, and the 2-dimensional position vector $s_i$. Thus the kernel is defined as

k(f_i, f_j) = w_1 \exp\Big(-\frac{\|e_i - e_j\|^2}{2\theta_{f}^2}\Big) + w_2 \exp\Big(-\frac{\|s_i - s_j\|^2}{2\theta_{\alpha}^2} - \frac{\|c_i - c_j\|^2}{2\theta_{\beta}^2}\Big) + w_3 \exp\Big(-\frac{\|s_i - s_j\|^2}{2\theta_{\gamma}^2}\Big).    (18)

In other words, the pairwise kernel consists of the learned features for pairwise distance measurement, in addition to the bilateral term and the spatial term used in [44]. The compatibility function $\mu$ is realized by the simple Potts model.
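A small NumPy sketch of the kernel in Eqn. (18), with illustrative (untrained) weights and bandwidths:

    import numpy as np

    def pairwise_kernel(e_i, e_j, c_i, c_j, s_i, s_j,
                        w=(1.0, 1.0, 1.0), theta=(1.0, 1.0, 1.0, 1.0)):
        # e: learned feature vectors; c: RGB colors; s: 2-D positions.
        w1, w2, w3 = w
        tf, ta, tb, tg = theta
        learned = w1 * np.exp(-np.sum((e_i - e_j) ** 2) / (2 * tf ** 2))
        bilateral = w2 * np.exp(-np.sum((s_i - s_j) ** 2) / (2 * ta ** 2)
                                - np.sum((c_i - c_j) ** 2) / (2 * tb ** 2))
        spatial = w3 * np.exp(-np.sum((s_i - s_j) ** 2) / (2 * tg ** 2))
        return learned + bilateral + spatial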

With the above CRF model, we find the optimal solution that minimizes the energy function in Eqn. (14) using the approximation algorithm in [44], and obtain the final prediction of person instance masks for each pixel of the input image. A standard CRF is also applied to the instance-agnostic parsing maps, as in [12].

5 Experiments

5.1 Experimental Setup

Performance Evaluation Metrics

We use the following performance evaluation metrics for multi-human parsing.

Average Precision based on Part (AP^p). Different from the region-based Average Precision (AP^r) used in instance segmentation [4, 45], AP^p uses the part-level Intersection Over Union (IOU) of the different semantic part categories within a person to determine whether an instance is a true positive. Specifically, when comparing a predicted semantic part parsing map with a ground truth parsing map, we compute the IOU of each semantic part category between them and use the average as the measure of overlap. We refer to AP under this condition as AP^p. We prefer AP^p over AP^r because we focus on human-centric evaluation and care about how well a person as a whole is parsed. Similarly, we use AP^p_vol to denote the average of the AP^p values at IOU thresholds from 0.1 to 0.9 with a step size of 0.1.

Percentage of Correctly Parsed Body Parts (PCP). As AP^p averages the IOUs of the part categories, it cannot reflect how many parts are correctly predicted. Thus we adopt PCP, originally used in human pose estimation [46, 11], to evaluate parsing quality on the semantic parts within person instances. For each true-positive person instance, we find all the part categories (excluding background) with a pixel-level IOU larger than a threshold, which are regarded as correctly parsed. The PCP of one person is the ratio between the number of correctly parsed categories and the total number of categories present for that person. Missed person instances are assigned a PCP of 0. The overall PCP is the average PCP over all person instances. Note that PCP is also a human-centric evaluation metric.
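A minimal NumPy sketch of the part-level overlap measure underlying AP^p, as we understand it from the description above (an illustration, not the official evaluation code):

    import numpy as np

    def mean_part_iou(pred, gt, n_classes):
        # pred, gt: (H, W) label maps; class 0 is background and is excluded.
        ious = []
        for c in range(1, n_classes):
            p, g = pred == c, gt == c
            union = np.logical_or(p, g).sum()
            if union > 0:
                ious.append(np.logical_and(p, g).sum() / union)
        return float(np.mean(ious)) if ious else 0.0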

Datasets

We perform experiments on the MHP dataset. From all the images in MHP, we randomly choose 980 images to form the testing set. The rest form a training set of 3,000 images and a validation set of 1,000 images. Since we are interested in real-world situations where different people are near each other with close interaction, we also perform experiments on the Buffy [36] dataset as suggested in [14], which contains entangled people in almost all testing images.

Implementation Details

Due to space limits, please see the supplementary material for further architecture and implementation details.

5.2 Experimental Analysis

Method          |         All          |       Top 20%        |        Top 5%
                | AP^p  AP^p_vol  PCP  | AP^p  AP^p_vol  PCP  | AP^p  AP^p_vol  PCP
Detect+Parse    | 29.81   38.83  43.78 | 12.08   30.22  25.44 |  9.76   30.37  18.36
Mask RCNN [18]  | 52.68   49.81  51.87 | 31.49   40.16  37.31 | 24.25   35.63  28.77
DL [21]         | 47.76   47.73  49.21 | 34.81   44.06  40.59 | 29.52   43.52  33.70
MH-Parser       | 50.10   48.96  50.70 | 41.67   46.70  44.74 | 33.69   46.57  37.01
Table 1: Results from different methods on the MHP test set. The results of Mask RCNN and DL are obtained by using them to predict the instance masks and combining those with the same instance-agnostic parsing map produced by MH-Parser, for fair comparison. All denotes the entire test set, and Top 20% and Top 5% denote two subsets of testing images with the top 20% and top 5% largest overlaps between person instances, respectively.

5.2.1 Comparison with State-of-the-Art Methods

Note that standard instance segmentation methods can only generate silhouettes of person instances and cannot produce the desired person part parsing. Thus we use them to generate instance masks, as the graph generator in MH-Parser does, and combine the instance masks with the instance-agnostic parsing to produce final multi-human parsing results. Here we use Mask-RCNN [18], the state-of-the-art top-down model, and Discriminative Loss (DL) [21], a well-established bottom-up model, to generate instance masks. For Mask-RCNN, we use the segmentation prediction in each detection with high confidence to form the instance masks. DL generates instance masks directly as the outputs of the model. We also consider the Detect+Parse baseline used in traditional single-human parsing, where a person detector first detects person instances and a parser then parses each detected instance.

The performance of these methods in terms of AP^p, AP^p_vol and PCP on the MHP test set is listed in Tab. 1. In the table, the overlap thresholds for AP^p and PCP are both set to 0.5. The MH-Parser, DL and the parser in Detect+Parse are trained on the MHP training set with the same trunk network (ResNet). In particular, DL is trained with the official code [21, 22] using the suggested settings. The Mask-RCNN model and the detector in Detect+Parse are the top-performing models with ResNet as the trunk from the official Detectron [47].

We can see that the proposed MH-Parser achieves competitive performance with Mask RCNN and DL on the MHP dataset, and outperforms the Detect+Parse baseline. To investigate how these models address the challenge of closely entangled persons, we select two challenging subsets from the MHP test set. For each image in the test set, we perform a pairwise comparison of all the person instances and find the IOU of the person bounding boxes in each pair. The average IOU over all pairs then measures the closeness of the persons in each image. One subset contains the images with the top 20% highest average IOUs, and the other contains the top 5%. They represent images with very close interaction between human instances, reflecting real scenarios. The results on these two subsets are listed in Tab. 1. We can see that on these challenging subsets, MH-Parser outperforms both Mask RCNN and DL. Mask RCNN has difficulty differentiating entangled persons, while MH-Parser, as a bottom-up approach, can handle such cases well. DL only exploits pairwise relations between pixel embeddings, while MH-Parser models high-order relations among different regions and shows better performance.

Comparison with State-of-the-Art Methods on Separating Person Instances

We also evaluate the proposed MH-Parser on the Buffy dataset and compare it with other state-of-the-art methods. On Buffy, the forward score and the backward score are used to evaluate the performance of person individuation [14]. We follow the same evaluation protocol; our average forward and backward scores on the Buffy test episodes are higher than the corresponding scores reported in [14] and the average score reported in [36]. Note that MH-Parser is not trained on Buffy; only evaluation is performed. MH-Parser thus achieves the best performance among these state-of-the-art methods in separating closely entangled persons.

MH-Parser setting     | AP^p  AP^p_vol  PCP
Baseline (ℓ2 loss)    | 41.92   45.21  46.77
  + Graph-GAN         | 44.34   46.43  47.62
  + Refine, w/o PAM   | 49.49   48.98  50.48
  + Refine            | 50.36   49.29  50.57
  w/ GT Person Number | 51.39   49.77  51.32
  w/ GT Affinity      | 55.83   51.28  55.85
  w/ GT Global Seg.   | 91.75   77.29  82.96
Table 2: Results from different settings on the validation set. Refine refers to the instance mask refinement, and "Refine, w/o PAM" means that in the refinement step the CRF is performed without the learned pairwise term from the pairwise affinity map (PAM).

5.2.2 Component Analysis of MH-Parser

In this subsection, we test the proposed MH-Parser under various settings. All the variants of MH-Parser are trained on the MHP training set and evaluated on the validation set. The loss in Eqn. (13) is adjusted to either include or exclude the Graph-GAN term. We also examine the effect of the instance mask refinement. In the refinement, the learned pairwise term in Eqn. (18) is disabled by setting its weight to 0, to investigate whether the learned pairwise term benefits the refinement process. The performance of these variants in terms of AP^p, AP^p_vol and PCP is listed in Tab. 2.

From the results, we can see that compared to the plain ℓ2 loss, the Graph-GAN effectively improves the quality of the predicted pairwise affinity map. The better and finer affinity maps produced with the Graph-GAN yield better grouping of the bottom-level person instance information, leading to increased AP^p and PCP. The instance mask refinement, especially the learned pairwise term, plays a positive role in improving multi-human parsing performance.

We also use the respective ground truth annotations for three components, i.e. the ground truth person number, the ground truth affinity graph and the ground truth segmentation map, to probe the upper limits of MH-Parser in Tab. 2. We can see that the person number prediction and the affinity map prediction are reasonably accurate, while global segmentation remains the major bottleneck for multi-human parsing. Improvements in global segmentation can greatly boost multi-human parsing performance.

5.2.3 Qualitative Comparison

Here we visually compare the results from Mask RCNN, DL and MH-Parser. The input images, global parsing ground truths, parsing predictions, and predicted instance maps from Mask RCNN, DL and MH-Parser are visualized in Fig. 4. We can see that MH-Parser captures both the fine-grained global parsing details and the information needed to differentiate person instances. Mask RCNN has difficulty distinguishing closely entangled persons, especially when their bounding boxes overlap heavily; MH-Parser produces better instance masks in such cases. MH-Parser also produces better person instance masks than DL, especially at the boundary between two close instances. More visualized results are provided in the supplementary materials.

Figure 4: Visualization of parsing results. For each (a) input image, we show the (b) parsing ground truth, (c) global parsing prediction, person instance map predictions from (d) Mask RCNN, (e) DL and (f) MH-Parser. In (b) and (c), each color represents a semantic parsing category. In (d), (e) and (f), each color represents one person instance. We can see the proposed MH-Parser can generate satisfactory global parsing, and outperforms Mask RCNN and DL when persons are closely entangled.

6 Conclusion

In this paper, we tackled the multi-human parsing problem. We contributed a new large-scale MHP dataset and proposed a novel MH-Parser algorithm. We performed detailed evaluations of the proposed method and compared it with current state-of-the-art solutions on the new benchmark dataset. We envision that the proposed MHP dataset and the MH-Parser will help drive human parsing research towards real-world applications. In the future, we will annotate a more comprehensive multiple-human parsing dataset with more images and more detailed semantic labels to further push the frontier of multiple-human parsing research.

Acknowledgments

The work of Jianshu Li was partially funded by National Research Foundation of Singapore. The work of Jian Zhao was partially supported by National University of Defence Technology and China Scholarship Council (CSC) grant 201503170248. The work of Jiashi Feng was partially supported by National University of Singapore startup grant R-263-000-C08-133 and Ministry of Education of Singapore AcRF Tier One grant R-263-000-C21-112.

References