Human parsing refers to partitioning persons captured in an image into multiple semantically consistent regions, e.g. body parts and clothing items (cf. Fig. 1). As a fine-grained semantic segmentation task, it is more challenging than human segmentation, which only aims to find silhouettes of persons. Human parsing is important for human-centric analysis and has many industrial applications, e.g. virtual reality [1], video surveillance [2], and human behavior analysis [3, 4].
Remarkable progress has been made in parsing a single person in an image [7, 8, 9]. Single-human parsing, however, assumes controlled and simplified scenarios without the human interaction, occlusion and pose variation that are common in real scenes. Single-human parsing techniques therefore fall short of realistic requirements. Although multi-human parsing can be approached straightforwardly by applying a person detector as a preprocessing step, standard person detectors work best for upright people with simple poses, such as pedestrians. In more realistic scenarios, where multiple persons are close to each other and present intimate interaction and body occlusion, person detectors tend to produce false negatives, which harms the performance of multi-human parsing. Moreover, although instance semantic segmentation methods [4, 10] consider the presence of multiple humans, they only provide silhouettes of humans without fine-grained sub-category details, which does not fulfill the requirement of human parsing. Other works [11, 12, 6, 13, 14] that look into semantic parts within persons either consider only coarse parts or are agnostic of person instances.
Considering the gap between current human parsing techniques and real-world requirements, we aim to drive the research on multi-human parsing. Towards solving this challenging problem, we introduce a new multi-human parsing dataset and a novel multi-human parsing model. In particular, we construct and annotate a new large-scale dataset, named the Multiple Human Parsing (MHP) dataset, providing images of multiple humans in an instance-aware setting with fine-grained pixel-level annotations. Humans in the images are captured in real-world scenarios with challenging poses, heavy occlusion and various appearances. Some annotation examples as well as comparison with existing human parsing datasets are illustrated in Fig. 1. See more details in Sec. 3. The MHP dataset will serve as a valuable data resource to develop multi-human parsing models and a benchmark to evaluate their performance.
We also propose a novel Multiple Human Parser model named MH-Parser to solve the challenging multi-human parsing problem. Unlike most existing methods, which focus on single-human parsing and rely on separate off-the-shelf person detectors to localize persons in images, the proposed MH-Parser tackles multiple human parsing by generating global parsing maps and instance masks for multiple persons simultaneously in a bottom-up fashion, without resorting to any ad-hoc detection models. To better capture the human body structure, part configuration and human interaction, the proposed MH-Parser introduces a novel Graph Generative Adversarial Network (Graph-GAN) model that learns to predict graph-structured instance parsing results by developing a graph convolutional discriminative model. The Graph-GAN is also of independent research interest for the community, as it applies GAN-like models to graph data analysis.
To sum up, we make the following contributions. 1) We introduce the multi-human parsing problem that extends the research scope of human parsing and matches real-world scenarios better. 2) We construct the MHP dataset, a large-scale multi-human parsing benchmark, to advance the development of relevant techniques. 3) We propose a novel model MH-Parser, which serves as a strong baseline method for multi-human parsing in the wild.
2 Related Work
Previous human parsing methods [15, 9, 16] and datasets [7, 11, 9, 5] mainly focus on single-human parsing, which has severe practical limitations. None of the commonly used human parsing datasets considers instance-aware cases. Moreover, the persons in these datasets are usually in upright positions with limited pose changes, which does not accord with reality. Human parsing in the wild has recently been investigated, with persons presenting varying clothing appearances and diverse viewpoints, but only in the setting of instance-agnostic human parsing. Different from existing datasets on human parsing, the proposed MHP dataset considers the simultaneous presence of multiple persons in an instance-aware setting with challenging pose variations, occlusion and interaction between persons, aligning much better with reality.
Instance-Aware Object/Human Segmentation
Recently, many research efforts have been devoted to instance-aware object/human semantic segmentation, which can be solved by top-down or bottom-up approaches. In the top-down family, a detector (or a component functioning as a detector) is used to localize each instance, which is further processed to generate a pixel-level segmentation. Multi-task Network Cascades (MNC) consists of three separate networks for differentiating instances, estimating masks and categorizing objects, respectively. The first fully convolutional end-to-end solution to instance-aware semantic segmentation performs instance mask prediction and classification jointly. Mask-RCNN adds a segmentation branch to the state-of-the-art object detector Faster-RCNN to perform instance segmentation. The top-down approaches depend heavily on the detection component and perform poorly when instances are close to each other. In the bottom-up family, detection is usually not used. Instead, embeddings of all pixels are learned and later used to cluster pixels into instances. In one approach, embeddings are learned with a grouping loss, which performs pairwise comparisons across randomly sampled pixels. In [21, 22], a discriminative loss containing push forces and pull forces is used to learn embeddings for each pixel. In another approach, the embeddings of pixels are learned with direct supervision of instance locations. Different from these methods, which operate on pixels, we learn an embedding of each superpixel. Furthermore, Graph-GAN is used to refine the learned embedding by leveraging high-order information. These instance-aware person segmentation methods, whether top-down or bottom-up, can only predict person-level segmentation without any detailed information on body parts and fashion categories, which is disadvantageous for fine-grained image understanding. In contrast, our MHP is proposed for fine-grained multi-human parsing in the wild, and aims to boost the research in real-world human-centric analysis.
Generative Adversarial Networks
The recently proposed GAN-based methods [23, 24, 25] have yielded remarkable performance on generating photo-realistic images and semantic segmentation maps by specifying only a high-level goal like “to make the output indistinguishable from the reality”. A GAN automatically learns a customized loss function that adapts to the data and guides the generation of high-quality images. Different from existing works on image-based GANs, which can only process regular input (e.g. 2D grid images), the proposed Graph-GAN takes a flexible data structure, i.e. graphs, as input. This is the first time that such a Graph-GAN has been explored in the literature on GANs.
3 The MHP Dataset
In this section we introduce the Multiple Human Parsing (MHP) dataset designed for multi-human parsing in the wild. Some exemplar images and annotations are shown in Fig. 1.
3.1 Image Collection and Annotation Methodology
As previously observed, in generic recognition datasets like PASCAL or COCO, only a small percentage of images contain multiple persons. Also, persons in these generic recognition datasets usually lack fine details, compared to human-centric datasets such as those for people recognition in photo albums, human immediacy prediction, interpersonal relation prediction, etc. To benefit the development of new multi-human parsing models, we construct a pool of images from existing human-centric datasets [34, 32, 33, 31] and from online Creative Commons licensed imagery. From this image pool, we select a subset of images which contain clearly visible persons with intimate interaction, rich fashion items and diverse appearances, and manually annotate them with two operations: 1) counting and indexing the persons in the images and 2) annotating each person. We implement an annotation tool and generate multi-scale superpixels of the images to speed up the annotation. For each instance, pre-defined semantic categories (also commonly used in single-human parsing datasets) are annotated, including hat, hair, sun glasses, upper clothes, skirt, pants, dress, belt, left shoe, right shoe, face, left leg, right leg, left arm, right arm, bag, scarf and torso skin. Each instance has a complete set of annotations whenever the corresponding category is present in the image. When annotating one instance, the others are regarded as background. Thus, the resulting annotation set for each image consists of one person-level parsing mask per person in the image.
3.2 Dataset Statistics
The MHP dataset contains various numbers of persons in each image, and the distribution is illustrated in Fig. 2 (middle). Real-world human parsing aims to analyze every detailed region of each person of interest, including different body parts, clothes and accessories. Thus we define both body part categories and clothing and accessory categories. Among the body parts, we divide arms and legs into left and right sides for more precise analysis, which also increases the difficulty of the task. As for clothing categories, we have not only common clothes like upper clothes, pants, and shoes, but also easily confused categories such as skirt and dress and infrequent categories such as scarf, sun glasses, belt, and bag. The statistics for each semantic part annotation are shown in Fig. 2 (right).
In the MHP dataset, there are images, each with multiple persons, each with - persons ( on average). The resolution of the images ranges from to , with an average of pixels. Totally there are person instances with fine-grained annotations at pixel-level with different semantic labels. The resolution of each person ranges from to , with an average pixels. For other human parsing datasets, Fashionista  contains person instances, ATR  contains and LIP  contains . However, they all reflect the cases of single-human parsing, which deviates from real-world human parsing requirement.
In MHP, the person instances are entangled with close interaction and occlusion. To verify this, we calculate the mean average Intersection Over Union (IOU) of person bounding boxes in the dataset. That is, we find the average IOU between person instances in each image, and calculate its mean value over the whole dataset. In MHP the mean average IOU is . As a widely used human instance segmentation dataset, COCO  only has mean average IOU of for the images with multiple persons. Even for the Buffy  dataset, which is used in person individuation and claims to have multiple closely entangled persons , the mean average IOU is only . Thus MHP is a much more challenging dataset in terms of separating closely entangled person instances. Therefore, the MHP dataset will serve as a more realistic benchmark on human-centric analysis to push the frontier of human parsing research.
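The entanglement statistic described above can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation script: the box format `(x1, y1, x2, y2)`, the function names and the toy data are assumptions, and averaging over all unordered pairs of persons in an image is one plausible reading of "the average IOU between person instances in each image".

```python
from itertools import combinations

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def average_pairwise_iou(boxes):
    """Average IoU over all unordered pairs of person boxes in one image."""
    pairs = list(combinations(boxes, 2))
    if not pairs:
        return 0.0  # assumption: images with fewer than two persons score zero
    return sum(box_iou(a, b) for a, b in pairs) / len(pairs)

def mean_average_iou(dataset):
    """Mean of the per-image average pairwise IoU over a list of images."""
    per_image = [average_pairwise_iou(boxes) for boxes in dataset]
    return sum(per_image) / len(per_image)

# Hypothetical toy data: two images, each a list of person bounding boxes.
toy_dataset = [
    [(0, 0, 2, 2), (1, 0, 3, 2)],                # one overlapping pair
    [(0, 0, 1, 1), (5, 5, 6, 6), (0, 0, 1, 1)],  # one identical pair, two disjoint pairs
]
toy_score = mean_average_iou(toy_dataset)
```

A higher value indicates that person boxes overlap more on average, which is the sense in which the text calls instances "closely entangled".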
4 The MH-Parser
In this section we elaborate on the proposed MH-Parser model for parsing multiple humans. The proposed MH-Parser simultaneously generates a global semantic parsing map and a pairwise affinity map (which is used to construct instance masks). The former presents the union of the instance parsing maps for all the persons in the input image, and the latter distinguishes one person from another. The overall architecture of MH-Parser is shown in Fig. 3.
4.1 Global Parsing Prediction
The MH-Parser uses a deep representation learner to learn rich and discriminative representations which are shareable for global parsing and affinity map prediction. In particular, the representation learner is a fully convolutional network consisting of layers (ResNet) adopted from DeepLab . It generates features with of the spatial dimension of the input image. On top of this learner, a small parsing net consisting of atrous spatial pyramid pooling  is used to generate instance-agnostic semantic parsing maps of the whole image, as shown in Fig. 3.
Formally, the global parsing module takes an input image and outputs an instance-agnostic parsing map over the semantic categories, at a reduced spatial size compared with the input image. The global parsing predictor is trained by minimizing a standard parsing loss (Eqn. (1)), namely the pixel-wise cross-entropy loss between its output and the ground truth labeling of the instance-agnostic semantic parsing map.
4.2 Graph-GAN for Affinity Map Prediction
The global parsing results do not present any instance-level information, which is essential for multi-human parsing. Different from top-down solutions, we propose a novel Graph-GAN model for learning instance information in a bottom-up fashion, simultaneously with the global parsing prediction.
Global Accordance Map
Global accordance maps distinguish different persons by associating them with different accordance scores. For an input image with size , its global accordance map is defined as
An example of the global accordance map constructed from the ground truth instance parsing map is shown in Fig. 3.
Predicting global accordance scores accurately is important for separating different person instances and deriving high-quality multi-human parsing results. However, accordance prediction is very challenging, due to the large appearance variance of intra-instance pixels and subtle difference of some pixels from different instances. The number of persons is unknown and varies for different images, making traditional classification approaches inapplicable. Moreover, the accordance scores are expected to be invariant to permutation over person instance ids. This implies that the learning process of accordance score is extremely unstable if we directly use the ground truth global accordance map defined in Eqn. (2) as supervision.
Pairwise Affinity Graph
Since directly predicting global accordance scores is difficult, MH-Parser generates a pairwise affinity graph instead. Specifically, the MH-Parser introduces a graph generator that learns to optimize the pairwise distances (or affinities) among regions within input images. In MH-Parser, superpixels are regarded as the basic units of regions for calculating the affinities, for the following two reasons. First, superpixels are natural low-level representations to delineate boundaries between semantic concepts. Second, superpixels can be regarded as a low-level pixel grouping, so the complexity of affinity computation is greatly reduced compared to pixel-level affinity computation.
Formally, we define the pairwise affinity graph over the superpixels of the image: each vertex is one superpixel, and each edge represents the connectivity between a pair of vertices, described by the corresponding entry of the pairwise affinity map. The ground truth pairwise affinity map is derived from the ground truth global accordance map by a rule-based mapping (Eqn. (4)): a majority vote over all pixels within each superpixel assigns the superpixel to a person, and two superpixels are connected when they are assigned to the same person. Note that although the ground truth global accordance map has multiple possible values due to the random assignment of person ids, the corresponding ground truth pairwise affinity map is unique regardless of how the person ids are assigned.
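The rule-based mapping and its invariance to person-id permutation can be illustrated with a small sketch. The data layout (an instance map as nested lists of person ids, superpixels as lists of pixel coordinates), the function names and the treatment of background as just another id are assumptions made for illustration; the paper handles foreground connections separately through a mask in the loss.

```python
from collections import Counter

def majority_label(instance_map, pixels):
    """Most frequent person id among the given (row, col) pixels."""
    return Counter(instance_map[r][c] for r, c in pixels).most_common(1)[0][0]

def ground_truth_affinity(instance_map, superpixels):
    """Binary affinity: 1 when two superpixels share the same majority person id."""
    ids = [majority_label(instance_map, px) for px in superpixels]
    n = len(ids)
    return [[1 if ids[i] == ids[j] else 0 for j in range(n)] for i in range(n)]

# Hypothetical toy example: 0 is background, 1 and 2 are person ids.
instance_map = [[1, 1, 2, 2],
                [1, 1, 2, 0]]
# The same persons with their ids swapped (a different but equivalent labeling).
permuted_map = [[2, 2, 1, 1],
                [2, 2, 1, 0]]
superpixels = [
    [(0, 0), (1, 0)],
    [(0, 1), (1, 1), (0, 2)],
    [(1, 2), (0, 3), (1, 3)],
]
affinity = ground_truth_affinity(instance_map, superpixels)
affinity_permuted = ground_truth_affinity(permuted_map, superpixels)
```

Both labelings yield the same affinity matrix, which is why the affinity map is a more stable supervision target than the accordance map itself.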
The pairwise affinity maps can be learned directly by taking the ground truth as the regression target. The predicted pairwise affinity map is generated by an affinity prediction net, which draws features from the representation learner. The affinity prediction net first generates a multi-channel feature map. It then applies superpixel pooling, averaging the features over the pixels within each superpixel, followed by an affinity transformation with a Gaussian kernel to obtain the predicted affinity between each pair of superpixels (Eqn. (6)). A kernel parameter controls the sensitivity of the predicted affinity to the distance between pooled features.
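A minimal NumPy sketch of superpixel pooling followed by a Gaussian-kernel affinity is given below. The exact kernel form (squared Euclidean distance divided by `2 * sigma ** 2`), the mean pooling and all names are assumptions for illustration, since the original equation is not reproduced here.

```python
import numpy as np

def superpixel_pool(features, sp_labels, num_sp):
    """Average the features of all pixels inside each superpixel.

    features: (H, W, C) float array; sp_labels: (H, W) ints in [0, num_sp).
    Returns a (num_sp, C) array with one pooled vector per superpixel.
    """
    flat = features.reshape(-1, features.shape[-1])
    labels = sp_labels.reshape(-1)
    sums = np.zeros((num_sp, flat.shape[1]))
    np.add.at(sums, labels, flat)  # unbuffered scatter-add per superpixel
    counts = np.bincount(labels, minlength=num_sp).reshape(-1, 1)
    return sums / np.maximum(counts, 1)  # guard against empty superpixels

def gaussian_affinity(pooled, sigma=1.0):
    """Affinity in (0, 1]: close to 1 for similar pooled features (assumed form)."""
    diff = pooled[:, None, :] - pooled[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Hypothetical toy input: a 2x2 image with 2 feature channels and 2 superpixels.
features = np.array([[[0.0, 0.0], [0.0, 0.0]],
                     [[3.0, 4.0], [3.0, 4.0]]])
sp_labels = np.array([[0, 0], [1, 1]])
pooled = superpixel_pool(features, sp_labels, num_sp=2)
affinity = gaussian_affinity(pooled, sigma=5.0)
```

A smaller `sigma` makes the affinity drop faster with feature distance, which matches the role of the sensitivity parameter described in the text.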
The network for predicting the pairwise affinity map can be trained by minimizing the distance between the prediction and its ground truth. However, when learning with direct supervision, the elements of the affinity map are learned independently of each other, and the contiguity and relations within it (reflecting intrinsic human body structures) are not captured. For example, if one node is connected to a second node and the second node is connected to a third, then the first node is also connected to the third. This higher-order affinity between regions is not captured under direct supervision.
Predicting Affinity Graph with Graph-GAN
To remedy the potential issues in learning with direct supervision over the affinity map, we propose a novel GAN model, Graph-GAN, to augment the learning process. Different from existing GAN-based models, which can only process regular input (like 2D grid images), the Graph-GAN can take in and process flexible graph-structured data. It aims to learn high-quality affinity graphs to better capture the human body structure, part configuration and human interaction.
In the adversarial learning of the Graph-GAN model, the ground truth affinity graphs use the ground truth affinity map from Eqn. (4) in the edge definition, and the predicted affinity graphs use the predicted affinity map from Eqn. (6). The generator in Graph-GAN learns to generate high-quality affinity graphs that are indistinguishable from the ground truth. The discriminator in Graph-GAN aims to tell the predicted affinity graphs apart from the ground truth ones. With the generator and discriminator playing against each other, the discriminator learns to supervise the generator in a way tailored for graph-structured data.
The representation learner and the affinity prediction net are adopted as the generator in the Graph-GAN model to generate the predicted affinity graph. In order to handle graph-structured input, we propose a Graph Convolution Network (GCN) based discriminator model. The GCN [37, 38, 39] can effectively model graph-structured data, and is thus suitable for classifying input graphs and serving as the discriminator. Each GCN layer propagates node features over the graph: the hidden activations of the previous layer are multiplied by the adjacency matrix of the graph (the pairwise affinity map in our case), transformed by learnable weights and biases, and passed through a non-linear activation function. The input to the first layer is the input node features, and the output of the last layer is the output node features. We normalize the adjacency matrix to make the propagation stable, using the identity matrix and the diagonal node degree matrix.
In GCN, the graph convolution operation effectively diffuses the features across different regions (including body parts and background) based on the connectivities between the regions. With multiple layers of feature propagation within GCN, higher order relations of different regions are captured, which help identify the intrinsic body part structures of multiple humans.
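The layer-wise propagation can be sketched as follows. This is a hedged illustration that assumes the widely used GCN formulation, i.e. adding self-loops to the adjacency matrix and normalizing it symmetrically by the node degrees, with ReLU as the non-linearity; the function names and toy values are hypothetical and the discriminator in the paper may differ in detail.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} (assumed form)."""
    A_tilde = A + np.eye(A.shape[0])   # add self-loops
    deg = A_tilde.sum(axis=1)          # node degrees including self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W, b):
    """One propagation step: ReLU(A_hat @ H @ W + b)."""
    return np.maximum(A_hat @ H @ W + b, 0.0)

# Toy graph with two connected nodes; one-hot node features as in the text.
A = np.array([[0.0, 1.0],
              [1.0, 0.0]])
A_hat = normalize_adjacency(A)
H0 = np.eye(2)        # one-hot embedding of each node
W0 = np.eye(2)        # hypothetical weights, identity for readability
b0 = np.zeros(2)
H1 = gcn_layer(A_hat, H0, W0, b0)
```

Stacking several such layers lets information travel along multi-hop paths, which is how higher-order relations between regions enter the node features.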
Since the layer propagation rule in Eqn. (9) only models the transformation of node features, a node pooling operation is needed to obtain a graph-level feature. We define a node pooling layer on top of the final output node features with an attention mechanism, as commonly used in natural language processing [40, 41]. An attention layer generates an attention weight vector from its input feature, and an element-wise product applies the attention weights to the features of every node in the graph. We use the attention mechanism as the node feature pooling, resulting in a single descriptor for the whole graph. With the attention pooling operation, the diffused features of different regions within an image are aggregated into one feature vector. The graph-level descriptor is then used as the input to a classifier that predicts whether the input affinity graph is a ground truth one or a predicted one in the adversarial training setting. The input feature to the GCN model is a one-hot embedding of each node. The input feature of the node attention layer is the parsing net prediction with superpixel pooling applied to it, such that each node corresponds to one pooled feature vector.
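One way to realize attention-based node pooling is sketched below. The linear scoring layer, the softmax over nodes and the weighted sum are assumptions chosen for illustration; the text only states that an attention weight vector is applied element-wise to the node features to obtain a single graph descriptor.

```python
import numpy as np

def attention_pool(node_feats, att_inputs, w):
    """Pool (N, D) node features into one D-dim graph descriptor.

    att_inputs: (N, C) per-node input to the attention layer.
    w: (C,) hypothetical weights of a linear scoring layer.
    """
    scores = att_inputs @ w
    scores = scores - scores.max()                 # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over nodes
    graph_feat = (weights[:, None] * node_feats).sum(axis=0)
    return graph_feat, weights

# Toy example: three nodes with identical attention inputs get equal weights.
node_feats = np.array([[1.0, 0.0],
                       [0.0, 1.0],
                       [1.0, 1.0]])
att_inputs = np.zeros((3, 4))
w = np.ones(4)
graph_feat, weights = attention_pool(node_feats, att_inputs, w)
```

The resulting vector has a fixed size regardless of the number of superpixels, so it can be fed to an ordinary classifier that decides whether the graph is real or generated.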
4.3 Training and Inference
We train the generator by introducing the following losses. For the global parsing task, the loss function in Eqn. (1) is used. For the affinity graph prediction task, we minimize the distance between the predicted pairwise affinity map and the ground truth pairwise affinity map. Here the predicted pairwise affinity map is produced by the generator from the input image, and a binary mask that indicates connections between foreground nodes is used to mask out all other connections. For training the Graph-GAN, an adversarial loss is defined with the GCN-based discriminator. The overall objective (Eqn. (13)) combines these losses, and we seek the generator that optimizes it. After finding the optimal generator, we use it to generate global parsing maps and affinity maps for testing images.
During testing, we perform spectral clustering on the predicted affinity graph. Background nodes are identified with the global parsing map and removed from the affinity graph. All the foreground nodes are then clustered according to their pairwise affinities, and different person instances are identified from the clustering results. To help clustering, a regression layer built upon the representation learner is trained to predict the number of persons, and the predicted person number is used in clustering during testing. This layer is omitted from the network structure and the objective function for brevity.
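The inference step can be sketched with a small NumPy-only spectral clustering routine. This is an illustrative variant (normalized affinity, top eigenvectors, row normalization, then k-means with a deterministic farthest-point initialization) and not necessarily the exact procedure used in the paper; the function names, the `-1` label for background and the toy affinity matrix are assumptions.

```python
import numpy as np

def spectral_cluster(A, k, n_iter=20):
    """Cluster nodes of a symmetric non-negative affinity matrix into k groups."""
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]  # normalized affinity
    _, vecs = np.linalg.eigh(L)                        # eigenvalues ascending
    emb = vecs[:, -k:]                                 # top-k eigenvectors
    emb = emb / np.maximum(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12)
    # Deterministic farthest-point initialization, then Lloyd iterations.
    centers = [emb[0]]
    for _ in range(1, k):
        d = np.min([((emb - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(emb[int(np.argmax(d))])
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((emb[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = emb[labels == c].mean(axis=0)
    return labels

def cluster_persons(A, is_foreground, num_persons):
    """Drop background nodes, cluster the rest; background gets label -1."""
    fg = np.flatnonzero(is_foreground)
    labels = np.full(A.shape[0], -1)
    labels[fg] = spectral_cluster(A[np.ix_(fg, fg)], num_persons)
    return labels

# Toy affinity: nodes 0-2 form one person, 3-5 another, node 6 is background.
A = np.full((7, 7), 0.01)
A[:3, :3] = 1.0
A[3:6, 3:6] = 1.0
is_fg = np.array([True] * 6 + [False])
labels = cluster_persons(A, is_fg, num_persons=2)
```

In the setting described above, `num_persons` would come from the regression layer and `is_fg` from the global parsing map.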
4.4 Instance Mask Refinement
We extend our model with a refinement step that refines the instance masks (obtained from the clustering results) from superpixel level to pixel level. We adopt a Conditional Random Field (CRF) in refinement to associate each pixel in the image with one of the persons (from the clustering results) or with the background. The CRF model contains two unary terms, i.e. a person consistency term and a global term, and a binary (pairwise) term. With one random variable per pixel in the image, the target of the instance mask refinement is to find the assignment for all pixels in the image that minimizes the energy function in Eqn. (14), which combines these three terms.
We define these terms as follows. Given the persons obtained from clustering over the predicted affinity map, each represented by a binary mask indicating which pixels belong to that person, the person consistency term is defined from these masks together with the probability of each pixel being foreground. It is designed to give strong cues about which person each foreground pixel should belong to. As in [43, 13], the global term complements the person consistency term by giving each foreground pixel equal likelihood of belonging to any of the persons, which helps to correct errors in the clustering process. Finally, the pairwise term is defined by a compatibility function and a kernel function over the feature vectors at pairs of spatial locations. The feature vector at each location contains the learned feature vector from the affinity prediction net (obtained by up-sampling to match the spatial dimension of the input image), the color vector, and the position vector. In other words, the pairwise kernel (Eqn. (18)) consists of the learned features for pairwise distance measurement, in addition to the commonly used bilateral and spatial terms. The compatibility function is realized by the simple Potts model.
With the above CRF model, we find the solution that minimizes the energy function in Eqn. (14) with an approximate inference algorithm and obtain the final prediction of person instance masks for each pixel in input images. Standard CRF is also applied to the instance-agnostic parsing maps.
5 Experiments
5.1 Experimental Setup
Performance Evaluation Metrics
We use the following performance evaluation metrics for multi-human parsing.
Average Precision based on Part ($AP^p$). Different from the region-based Average Precision ($AP^r$) used in instance segmentation [4, 45], $AP^p$ uses the part-level Intersection Over Union (IOU) of different semantic part categories within a person to determine whether an instance is a true positive. Specifically, when comparing one predicted semantic part parsing map with one ground truth parsing map, we find the IOU of all the semantic part categories between them and use the average as the measure of overlap. We refer to AP under this condition as $AP^p$. We prefer $AP^p$ over $AP^r$, as we focus on human-centric evaluation and pay attention to how well a person as a whole is parsed. Similarly, we use $AP^p_{vol}$ to denote the average of the $AP^p$ values over a range of IOU thresholds sampled with a fixed step size.
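The part-level overlap measure can be sketched as follows. This is an illustration only: parsing maps are represented as nested lists of category ids, background is assumed to be category 0 and excluded, and the average is taken over categories that appear in either map, which is one possible reading of "all the semantic part categories between them". The strict comparison in the true-positive test is likewise an assumption.

```python
def part_ious(pred, gt, background=0):
    """Per-category IoU between two parsing maps of the same size."""
    inter, union = {}, {}
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            for c in {p, g}:                    # categories present at this pixel
                if c != background:
                    union[c] = union.get(c, 0) + 1
            if p == g and p != background:
                inter[p] = inter.get(p, 0) + 1
    return {c: inter.get(c, 0) / union[c] for c in union}

def mean_part_iou(pred, gt, background=0):
    """Average of the per-category IoUs, used as the overlap measure."""
    ious = part_ious(pred, gt, background)
    return sum(ious.values()) / len(ious) if ious else 0.0

def is_true_positive(pred, gt, threshold):
    """A prediction counts as a true positive when the overlap exceeds the threshold."""
    return mean_part_iou(pred, gt) > threshold

# Hypothetical toy parsing maps for one person (0 = background).
pred_map = [[1, 1, 2],
            [0, 2, 2]]
gt_map = [[1, 2, 2],
          [0, 2, 0]]
overlap = mean_part_iou(pred_map, gt_map)
```

Averaging over parts means that a prediction with a good silhouette but wrong part labels is penalized, unlike a purely region-based overlap.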
Percentage of Correctly Parsed Body Parts (PCP). As $AP^p$ averages the IOU of each part category, it cannot reflect how many parts are correctly predicted. Thus we propose to adopt PCP, originally used in human pose estimation [46, 11], to evaluate parsing quality on the semantic parts within person instances. For each true-positive person instance, we find all the categories (excluding background) with pixel-level IOU larger than a threshold, which are regarded as correctly parsed. The PCP of one person is the ratio between the number of correctly parsed categories and the total number of categories of that person. Missed person instances are assigned zero PCP. The overall PCP is the average PCP over all person instances. Note that PCP is also a human-centric evaluation metric.
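A compact sketch of the PCP computation is shown below. The input format (a per-category IoU dictionary plus the list of ground truth categories for each matched person), the default threshold and the score of zero for missed persons are assumptions made for illustration.

```python
def person_pcp(part_iou, gt_categories, threshold=0.5):
    """Fraction of a person's ground truth categories parsed with IoU above threshold."""
    correct = sum(1 for c in gt_categories if part_iou.get(c, 0.0) > threshold)
    return correct / len(gt_categories)

def overall_pcp(matched, num_missed, threshold=0.5):
    """Average PCP over all persons; missed persons contribute zero (assumption)."""
    scores = [person_pcp(iou, cats, threshold) for iou, cats in matched]
    scores += [0.0] * num_missed
    return sum(scores) / len(scores)

# Hypothetical toy input: two matched persons and one missed person.
matched_persons = [
    ({1: 0.9, 2: 0.6, 3: 0.4}, [1, 2, 3, 4]),  # 2 of 4 categories correct
    ({1: 0.8, 2: 0.7}, [1, 2]),                # 2 of 2 categories correct
]
pcp = overall_pcp(matched_persons, num_missed=1)
```

Unlike the averaged part IoU, this score counts how many parts pass the threshold, so a single very good part cannot compensate for several poorly parsed ones.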
We perform experiments on the MHP dataset. From all the images in MHP, we randomly choose images to form the testing set. The rest form a training set of images and a validation set of images. Since we are interested in the real-world situation where different people are near to each other with close interaction, we also perform experiments on the Buffy  dataset as suggested in , which contains entangled people in almost all testing images.
Due to space limit, please see supplementary material for more architecture and implementation details.
5.2 Experimental Analysis
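Table 1: Results on the MHP test set for all images and for the Top 20% and Top 5% subsets.
| Method | All: $AP^p$ | All: $AP^p_{vol}$ | All: PCP | Top 20%: $AP^p$ | Top 20%: $AP^p_{vol}$ | Top 20%: PCP | Top 5%: $AP^p$ | Top 5%: $AP^p_{vol}$ | Top 5%: PCP |
| Mask RCNN | 52.68 | 49.81 | 51.87 | 31.49 | 40.16 | 37.31 | 24.25 | 35.63 | 28.77 |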
5.2.1 Comparison with State-of-the-Arts
Note that standard instance segmentation methods can only generate silhouettes of person instances and cannot produce person part parsing as desired. Thus we use them to generate instance masks, as the graph generator in MH-Parser does, and combine the instance masks with the instance-agnostic parsing to produce the final multi-human parsing results. Here we use Mask-RCNN, which is the state-of-the-art top-down model, and Discriminative Loss (DL), a well established bottom-up model, to generate instance masks. For Mask RCNN, we use the segmentation prediction of each high-confidence detection to form the instance masks. DL generates instance masks directly as the outputs of the model. We also consider the Detect+Parse baseline method as used in traditional single-human parsing, where a person detector is used to detect person instances and a parser is used to parse each detected instance.
The performance of these methods in terms of $AP^p$, $AP^p_{vol}$ and PCP on the MHP test set is listed in Tab. 1. In the table, $AP^p$ and PCP use the same overlap threshold. The MH-Parser, DL, and the parser in Detect+Parse are trained on the MHP training set with the same trunk network (ResNet). In particular, DL is trained with the official code [21, 22] with the suggested setting. The Mask-RCNN model and the detector in Detect+Parse are the top performing models with ResNet as the trunk from the official Detectron.
We can see that the proposed MH-Parser achieves competitive performance with Mask RCNN and DL on the MHP dataset, and outperforms the Detect+Parse baseline. To investigate how these models address the challenge of closely entangled persons, we select two challenging subsets from the MHP test set. For each image in the test set, we perform a pairwise comparison of all the person instances and find the IOU of the person bounding boxes in each pair. The average IOU over all the pairs is then used to measure the closeness of the persons in each image. One subset contains the images with the top 20% highest average IOUs, and the other contains the top 5%. They represent images with very close interaction of human instances, reflecting real scenarios. The results on these two subsets are listed in Tab. 1. We can see that on these challenging subsets, MH-Parser outperforms both Mask RCNN and DL. Mask RCNN has difficulty differentiating entangled persons, while MH-Parser, as a bottom-up approach, can handle such cases well. DL only exploits pairwise relations between embeddings of pixels, while MH-Parser models high-order relations among different regions and shows better performance.
Comparison with State-of-the-Arts on Separating Person Instances
We also evaluate the proposed MH-Parser on the Buffy dataset and compare it with other state-of-the-art methods. On Buffy forward score and backward score are used to evaluate the performance of person individuation . We follow the same evaluation metric, and our average forward and backward scores for Episode , and on the Buffy dataset are and , respectively. In  the average forward and backward scores are and on the same dataset, and  reports an average score of . Note that MH-Parser is not trained on Buffy, only evaluation is performed. We can see MH-Parser achieves the best performance compared with other state-of-the-art methods in separating closely entangled persons.
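Table 2: Component analysis of MH-Parser on the MHP validation set.
| Variant | $AP^p$ | $AP^p_{vol}$ | PCP |
| + Refine, w/o PAM | 49.49 | 48.98 | 50.48 |
| w/ GT Person Number | 51.39 | 49.77 | 51.32 |
| w/ GT Affinity | 55.83 | 51.28 | 55.85 |
| w/ GT Global Seg. | 91.75 | 77.29 | 82.96 |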
5.2.2 Components Analysis for MH-Parser
In this subsection, we test the proposed MH-Parser in various settings. All the variants of MH-Parser are trained on the MHP training set and evaluated on the validation set. The loss in Eqn. (13) is adjusted to either include or exclude the Graph-GAN term. We also demonstrate the effects of the instance mask refinement. In the refinement, the pairwise term in Eqn. (18) is disabled in one variant to investigate whether the learned pairwise term is beneficial to the refinement process. The performance of these variants in terms of $AP^p$, $AP^p_{vol}$ and PCP is listed in Tab. 2.
From the results, we can see that, compared to using the direct supervision loss alone, the Graph-GAN effectively improves the quality of the predicted pairwise affinity map. The better and finer affinity maps resulting from Graph-GAN help generate a better grouping of the bottom-level person instance information, leading to increased AP and PCP. The instance mask refinement, especially the learned pairwise term, plays a positive role in improving the performance of multi-human parsing.
We also use the respective ground truth annotations of the three components, i.e. ground truth person number, ground truth affinity graph and ground truth segmentation map, to probe the upper limits of MH-Parser in Tab. 2. We can see that the person number prediction and affinity map prediction are reasonably accurate, while the global segmentation is still the major hindrance of the problem of multi-human parsing. Improvement on global segmentation can greatly boost the performance of multi-human parsing.
5.2.3 Qualitative Comparison
Here we visually compare the results from Mask RCNN, DL and MH-Parser. The input images, global parsing ground truths, parsing predictions, predicted instance maps from Mask RCNN, DL and MH-Parser are visualized in Fig. 4. We can see that the MH-Parser captures both the fine-grained global parsing details and the information to differentiate person instances. For Mask RCNN, it has difficulties distinguishing closely entangled persons, especially when the bounding boxes of persons have large overlaps. The MH-Parser has better instance masks in such cases. MH-Parser also has better person instance masks than DL, especially at the boundary between two close instances. More visualized results are deferred to supplementary materials.
In this paper, we tackled the multi-human parsing problem. We contributed a new large-scale MHP dataset and proposed a novel MH-Parser algorithm. We performed detailed evaluations of the proposed method and compared it with current state-of-the-art solutions on the new benchmark dataset. We envision that the proposed MHP dataset and MH-Parser will help drive human parsing research towards real-world applications. In the future, we will annotate a more comprehensive multi-human parsing dataset with more images and more detailed semantic labels to further push the frontier of multi-human parsing research.
The work of Jianshu Li was partially funded by the National Research Foundation of Singapore. The work of Jian Zhao was partially supported by the National University of Defence Technology and China Scholarship Council (CSC) grant 201503170248. The work of Jiashi Feng was partially supported by National University of Singapore startup grant R-263-000-C08-133 and Ministry of Education of Singapore AcRF Tier One grant R-263-000-C21-112.
-  Lin, J., Guo, X., Shao, J., Jiang, C., Zhu, Y., Zhu, S.C.: A virtual reality platform for dynamic human-scene interaction. In: Proceedings of the ACM SIGGRAPH Conference and Exhibition on Computer Graphics and Interactive Techniques in Asia (SIGGRAPH Asia) Virtual Reality meets Physical Reality: Modelling and Simulating Virtual Humans and Environments Workshop. (2016) 11
-  Collins, R.T., Lipton, A.J., Kanade, T., Fujiyoshi, H., Duggins, D., Tsin, Y., Tolliver, D., Enomoto, N., Hasegawa, O., Burt, P., et al.: A system for video surveillance and monitoring. (2000)
-  Gan, C., Lin, M., Yang, Y., de Melo, G., Hauptmann, A.G.: Concepts not alone: Exploring pairwise relationships for zero-shot video activity recognition. In: Proceedings of the Association for the Advancement of Artificial Intelligence Conference on Artificial Intelligence (AAAI). (2016) 3487
-  Liang, X., Wei, Y., Shen, X., Yang, J., Lin, L., Yan, S.: Proposal-free network for instance-level object segmentation. arXiv preprint arXiv:1509.02636 (2015)
-  Liang, X., Liu, S., Shen, X., Yang, J., Liu, L., Dong, J., Lin, L., Yan, S.: Deep human parsing with active template regression. IEEE transactions on pattern analysis and machine intelligence 37(12) (2015) 2402–2414
-  Gong, K., Liang, X., Shen, X., Lin, L.: Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing. arXiv preprint arXiv:1703.05446 (2017)
-  Yamaguchi, K., Kiapour, M.H., Ortiz, L.E., Berg, T.L.: Parsing clothing in fashion photographs.
-  Dong, J., Chen, Q., Xia, W., Huang, Z., Yan, S.: A deformable mixture parsing model with parselets. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). (2013) 3408–3415
-  Liang, X., Xu, C., Shen, X., Yang, J., Liu, S., Tang, J., Lin, L., Yan, S.: Human parsing with contextualized convolutional neural network. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). (2015) 1386–1394
-  Dai, J., He, K., Sun, J.: Instance-aware semantic segmentation via multi-task network cascades. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016) 3150–3158
-  Chen, X., Mottaghi, R., Liu, X., Fidler, S., Urtasun, R., Yuille, A.: Detect what you can: Detecting and representing objects using holistic models and body parts. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2014) 1971–1978
-  Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. arXiv preprint arXiv:1606.00915 (2016)
-  Li, Q., Arnab, A., Torr, P.H.: Holistic, instance-level human parsing. arXiv preprint arXiv:1709.03612 (2017)
-  Jiang, H., Grauman, K.: Detangling people: Individuating multiple close people and their body parts via region assembly. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2017) 6021–6029
-  Liu, S., Liang, X., Liu, L., Shen, X., Yang, J., Xu, C., Lin, L., Cao, X., Yan, S.: Matching-cnn meets knn: Quasi-parametric human parsing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2015) 1419–1427
-  Liang, X., Shen, X., Xiang, D., Feng, J., Lin, L., Yan, S.: Semantic object parsing with local-global long short-term memory. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2016) 3185–3193
-  Li, Y., Qi, H., Dai, J., Ji, X., Wei, Y.: Fully convolutional instance-aware semantic segmentation. arXiv preprint arXiv:1611.07709 (2016)
-  He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). (2017) 2980–2988
-  Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Proceedings of the Advances in Neural Information Processing Systems (NIPS). (2015) 91–99
-  Newell, A., Huang, Z., Deng, J.: Associative embedding: End-to-end learning for joint detection and grouping. arXiv preprint arXiv:1611.05424 (2016)
-  De Brabandere, B., Neven, D., Van Gool, L.: Semantic instance segmentation with a discriminative loss function. arXiv preprint arXiv:1708.02551 (2017)
-  Neven, D., De Brabandere, B., Georgoulis, S., Proesmans, M., Van Gool, L.: Fast scene understanding for autonomous driving. arXiv preprint arXiv:1708.02550 (2017)
-  Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Proceedings of the Advances in Neural Information Processing Systems (NIPS). (2014) 2672–2680
-  Radford, A., Metz, L., Chintala, S.: Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434 (2015)
-  Arjovsky, M., Chintala, S., Bottou, L.: Wasserstein generative adversarial networks. In: Proceedings of the International Conference on Machine Learning (ICML). (2017) 214–223
-  Huang, R., Zhang, S., Li, T., He, R.: Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis. arXiv preprint arXiv:1704.04086 (2017)
-  Luc, P., Couprie, C., Chintala, S., Verbeek, J.: Semantic segmentation using adversarial networks. arXiv preprint arXiv:1611.08408 (2016)
-  Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. arXiv preprint arXiv:1611.07004 (2016)
-  Everingham, M., Eslami, S.A., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes challenge: A retrospective. International journal of computer vision 111(1) (2015) 98–136
-  Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Proceedings of the European Conference on Computer Vision (ECCV). (2014) 740–755
-  Ning, Z., Manohar, P., Yaniv, T., Rob, F., Lubomir, B.: Beyond frontal faces: Improving person recognition using multiple cues. (2015)
-  Chu, X., Ouyang, W., Yang, W., Wang, X.: Multi-task recurrent neural network for immediacy prediction. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). (2015) 3352–3360
-  Zhanpeng, Z., Ping, L., Change Loy, C., Xiaoou, T.: From facial expression recognition to interpersonal relation prediction. arXiv preprint arXiv:1609.06426v2 (2016)
-  Sapp, B., Taskar, B.: Modec: Multimodal decomposable models for human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2013)
-  Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. The IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 33(5) (2011) 898–916
-  Vineet, V., Warrell, J., Ladicky, L., Torr, P.H.: Human instance segmentation from video using detector-based conditional random fields. In: BMVC. Volume 2. (2011) 12–15
-  Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
-  Defferrard, M., Bresson, X., Vandergheynst, P.: Convolutional neural networks on graphs with fast localized spectral filtering. In: Proceedings of the Advances in Neural Information Processing Systems (NIPS). (2016) 3844–3852
-  Manessi, F., Rozza, A., Manzo, M.: Dynamic graph convolutional networks. arXiv preprint arXiv:1704.06199 (2017)
-  Lin, Z., Feng, M., Santos, C.N.d., Yu, M., Xiang, B., Zhou, B., Bengio, Y.: A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130 (2017)
-  Li, Y., Tarlow, D., Brockschmidt, M., Zemel, R.: Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493 (2015)
-  Lafferty, J., McCallum, A., Pereira, F.C.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proceedings of the International Conference on Machine Learning (ICML). (2001)
-  Arnab, A., Torr, P.H.: Pixelwise instance segmentation with a dynamically instantiated network. arXiv preprint arXiv:1704.02386 (2017)
-  Krähenbühl, P., Koltun, V.: Efficient inference in fully connected crfs with gaussian edge potentials. In: Proceedings of the Advances in Neural Information Processing Systems (NIPS). (2011) 109–117
-  Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Simultaneous detection and segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV). (2014) 297–312
-  Ferrari, V., Marin-Jimenez, M., Zisserman, A.: Progressive search space reduction for human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (2008) 1–8
-  Ross, G., Ilija, R., Georgia, G., Piotr, D., Kaiming, H.: Detectron. https://github.com/facebookresearch/detectron (2018)