Cross-domain Human Parsing via Adversarial Feature and Label Adaptation

01/04/2018 · by Si Liu, et al. · National University of Singapore · JD.com, Inc.

Human parsing has been extensively studied recently due to its wide applications in many important scenarios. Mainstream fashion parsing models focus on parsing high-resolution, clean images. However, directly applying parsers trained on benchmarks to a particular application scenario in the wild, e.g., a canteen, airport or workplace, often yields unsatisfactory performance due to domain shift. In this paper, we explore a new and challenging cross-domain human parsing problem: taking the benchmark dataset with extensive pixel-wise labeling as the source domain, how can one obtain a satisfactory parser on a new target domain without requiring any additional manual labeling? To this end, we propose a novel and efficient cross-domain human parsing model that bridges the cross-domain differences in terms of visual appearance and environment conditions and fully exploits commonalities across domains. Our proposed model explicitly learns a feature compensation network, which is specialized for mitigating the cross-domain differences. A discriminative feature adversarial network is introduced to supervise the feature compensation and effectively reduce the discrepancy between the feature distributions of the two domains. Besides, our model also introduces a structured label adversarial network to guide the parsing results of the target domain to follow the high-order relationships of the structured labels shared across domains. The proposed framework is end-to-end trainable, practical and scalable in real applications. Extensive experiments are conducted in which the LIP dataset serves as the source domain and 4 different datasets, including surveillance videos, movies and runway shows, are evaluated as target domains. The results consistently confirm the data efficiency and performance advantages of the proposed method for the cross-domain human parsing problem.

Introduction

Recently, human parsing [Liu et al.2015] has been receiving increasing attention owing to its wide applications, such as person re-identification [Zhao, Ouyang, and Wang2014], people search [Li et al.2017], and fashion synthesis [Zhu et al.].

Figure 1: Cross-domain human parsing: the upper panel is the source domain with a large amount of training data, e.g., the LIP dataset; the lower panel shows the target domain, e.g., canteen and road, without any manual labeling.

Existing human parsing algorithms can be divided into the following two categories. The first is constrained human parsing. More specifically, clean images of well-posed persons are collected from fashion sharing websites, e.g., Chictopia.com, for training and testing. Representative datasets include the Fashionista dataset [Yamaguchi et al.2012] with 685 images, the Colorful-Fashion dataset [Liu et al.2014] and the ATR dataset [Liang et al.2015a]. Each image in these datasets contains only one person, with relatively simple poses (mostly standing), against relatively clean backgrounds. Human parsers trained in such a strictly constrained scenario often fail when applied to images captured in real-life, more complicated environments. The second category is unconstrained human parsing. Representative datasets include the Pascal human part dataset [Chen et al.2014b] and the LIP dataset [Gong et al.2017]. The images in these datasets present humans with varying clothing appearances, strong articulation, partial (self-)occlusions, truncation at image borders, diverse viewpoints and background clutter. Although they are closer to real environments than the constrained datasets, when a human parser trained on these unconstrained datasets is applied to a real application scenario, such as a shop or an airport, its performance is still worse than that of a parser trained on that particular scenario, even with far fewer training samples, due to domain shift in visual features.

In this paper, we explore a new cross-domain human parsing problem: taking the unconstrained benchmark dataset with rich pixel-wise labeling as the source domain, how can one obtain a satisfactory parser for a totally different target domain without any additional manual labeling? As shown in Figure 1, the source domain (shown in the upper panel of Figure 1) is a set of labeled data, while the target domain training set (shown in the lower panel of Figure 1) is a set of images without any annotations. We believe investigation of this challenging problem will push human parsing models toward more practical applications.

From Figure 1, we observe the following differences and commonalities across the two domains, e.g., the source domain and the first target domain, canteen. On the one hand, they have very different illumination, viewpoints, image scale, resolution, degree of motion blur, etc. For example, the lighting in the canteen scenario is much darker than in the source domain. On the other hand, the persons to parse in both domains share intrinsic commonalities: the high-order relationships among labels (reflecting human body structure) are similar. For example, in both domains, the arms are below the head but above the legs. Therefore, the cross-domain human parsing problem can be solved by minimizing the differences between the features and maximizing the commonality of the structured labels.

A typical semantic segmentation network [Long, Shelhamer, and Darrell2015, Chen et al.2014a] is composed of a feature extractor and a pixel-wise labeler. In this work, we propose to introduce a new, learnable feature compensation network that transforms the features from different domains into a common space where the cross-domain difference can be effectively alleviated. In this way, the pixel-wise labeler can be readily applied to perform parsing on the compensated features. The feature compensation network is trained under the joint supervision of a feature adversarial network and a structured label adversarial network. More specifically, the feature adversarial network serves as a supervisor and provides guidance for the feature compensation learning, like the discriminator of Generative Adversarial Networks (GANs) [Goodfellow et al.2014, Radford, Metz, and Chintala2015]; it is trained to differentiate target and compensated source feature representations. Similarly, the structured label adversarial network differentiates the ground-truth structured labels and the predicted target domain labels. Supervised by these two levels of information, the cross-domain shift issues can be effectively addressed. We evaluate our approach using LIP [Gong et al.2017] as the source domain and four datasets as target domains. Extensive experiments demonstrate the effectiveness of our method on all domain adaptation tasks.

The contributions of this paper can be summarized as follows. Firstly, we are the first to explore the cross-domain human parsing problem. Since no manual labeling of the target domain is needed, the method is very practical. Secondly, we propose a cross-domain human parsing framework with novel feature adaptation and structured label adaptation networks. It is the first cross-domain work to consider both feature invariance and label structure regularization. Thirdly, we will release the source code of our implementation to the academic community to facilitate future studies.

Related Work

Human parsing and cross-domain feature transformation have been studied for decades. However, they have generally developed independently, and few works consider solving cross-domain human parsing by combining these directions. In this section, we briefly review recent techniques on human parsing as well as feature adaptation.

Human parsing: Yamaguchi et al. [Yamaguchi, Kiapour, and Berg2013] tackle the clothing parsing problem using a retrieval-based approach. Simo-Serra et al. [Simo-Serra et al.2014] propose a Conditional Random Field (CRF) model that is able to leverage many different image features. Luo et al. [Luo, Wang, and Tang2013] propose a Deep Decompositional Network for parsing pedestrian images into semantic regions. Liang et al. [Liang et al.2015b] propose a Contextualized Convolutional Neural Network to tackle the problem and achieve very impressive results. Xia et al. [Xia et al.2015] propose the "Auto-Zoom Net" for human parsing. Wei et al. [Wei et al.2016, Wei et al.2017] propose several weakly supervised parsing methods to reduce the human labeling burden. Existing human parsing methods work well on benchmark datasets. However, when applied to new application scenarios, their performance is unsatisfactory. Cross-domain human parsing therefore becomes a significant problem for making the technology practical.

Figure 2: The cross-domain parsing model. It contains a feature adaptation component to minimize the feature differences between two domains, and a structured label adaptation component to maximize the label map commonalities across the domains.

Feature Adaptation: There has been extensive prior work on domain transfer learning [Gretton et al.2009]. Recent works have focused on transferring deep neural network representations from a labeled source dataset to a target domain where labeled data is sparse. In the case of unlabeled target domains (the focus of this paper), the main strategy has been to guide feature learning by minimizing the differences between the source and target feature distributions [Ganin and Lempitsky2015, Liu and Tuzel2016, Long et al.2015]. Different from existing feature adaptation methods, we explicitly model the cross-domain differences via a feature compensation network.

Structured Label Adaptation: Few works consider label structure adaptation during domain adaptation. Some pioneering pose estimation works take the geometric constraints of human joint connectivity into consideration. For example, Chen et al. [Chen et al.2017] propose Adversarial PoseNet, a structure-aware convolutional network that implicitly takes such priors into account during training of the deep network. Chou et al. [Chou, Chien, and Chen2017] employ GANs as the pose estimator, which enables learning plausible human body configurations. Our proposed cross-domain human parsing method differs from existing domain adaptation methods in that we consider both feature and structured label adaptation simultaneously.

Proposed Cross-domain Adaptation Model

Suppose the source domain images and labels are denoted as $X_s$ and $Y_s$ respectively, and the target domain images are represented as $X_t$. Typical human parsing models are composed of a feature extractor $F$ and a pixel-wise labeler $P$. However, the parsing model trained on the source domain does not perform well on the target domain in the presence of significant domain shift.

Our proposed cross-domain adaptation model to address this issue is shown in Figure 2. It includes a novel feature compensation component supervised by two components, namely an adversarial feature adaptation component and an adversarial structured label adaptation component. The feature adaptation component aims to minimize the feature differences between the two domains, while the structured label adaptation maximizes the label map commonalities across the domains. The whole model therefore introduces three novel components (shown as purple rectangles) on top of conventional human parsing models: a feature compensation network $C$, a feature adversarial network $D_f$ and a structured label adversarial network $D_s$. Next, we introduce the two adversarial learning components and explain how they help feature compensation mitigate the domain difference.

Adversarial Feature Adaptation

The feature compensation network $C$ and the feature adversarial network $D_f$ collaboratively contribute to the feature adaptation: $C$ maps the feature representation of the source domain toward the target domain under the supervision of $D_f$. Alternately updating them gradually narrows the cross-domain feature differences.

The feature compensation network $C$, as shown in Figure 2, takes as input the features extracted by $F_1$ from the source domain, where $F_1$ is a part of the feature extractor $F$. Its output is the feature difference (shift) between the source and target domains. $F$ is composed of the convolutional layers of the VGG-16 net [Simonyan and Zisserman2014], from conv1 till pool5; the first several layers (from conv1 till pool1) form $F_1$. The feature compensation network is a ResNet-like [He et al.2016] network with a 7×7 convolution filter followed by 6 residual blocks with an identical layout consisting of two 3×3 convolution filters, each followed by a batch-normalization layer and a ReLU activation layer. Every three blocks are followed by a max pooling layer and a 3×3 convolution filter to reduce the feature maps' sizes. The output of the feature compensation network is pixel-wise added to that of the feature extractor to produce the compensated source domain feature $\hat{f}_s = F(x_s) + C(F_1(x_s))$.
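To make this layout concrete, the following is a minimal PyTorch sketch of such a compensation network. The channel widths (a 64-channel pool1-like input, a 512-channel pool5-sized output) and the exact down-sampling schedule are illustrative assumptions, not values given in the text.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block: two 3x3 convolutions, each followed by batch norm and ReLU."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.body(x)

class FeatureCompensation(nn.Module):
    """ResNet-like compensation network C: one 7x7 convolution, then 6 residual
    blocks; after every 3 blocks a max pooling layer and a 3x3 convolution reduce
    the feature map size. Channel widths and the down-sampling factor are
    assumptions; the output must match the pool5 feature shape it is added to."""
    def __init__(self, in_ch=64, mid_ch=256, out_ch=512):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, mid_ch, kernel_size=7, padding=3)
        self.stage1 = nn.Sequential(*[ResBlock(mid_ch) for _ in range(3)])
        self.down1 = nn.Sequential(nn.MaxPool2d(2),
                                   nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1))
        self.stage2 = nn.Sequential(*[ResBlock(mid_ch) for _ in range(3)])
        self.down2 = nn.Sequential(nn.MaxPool2d(2),
                                   nn.Conv2d(mid_ch, out_ch, kernel_size=3, padding=1))

    def forward(self, pool1_feat):
        x = self.stem(pool1_feat)
        x = self.down1(self.stage1(x))
        return self.down2(self.stage2(x))  # feature shift added to F's pool5 output
```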

The feature adversarial network $D_f$ is introduced to guide the cross-domain feature adaptation. Different from traditional adversarial learning models (e.g., the vanilla GAN [Goodfellow et al.2014]) that perform judgment over raw images, our proposed feature adversarial network is defined upon the high-level feature space (pool5), which incorporates the essential feature information for human parsing; this accelerates both training and inference. The architecture of $D_f$ is composed of the same fc6-fc7 layers as the Atrous Spatial Pyramid Pooling (ASPP) scheme in DeepLab [Chen et al.2016]. A convolution layer with 3×3 convolution filters is then appended to create a 1-channel probability map, which is used to calculate the pixel-wise least-squares feature adversarial loss, similar to the local LSGANs [Shrivastava et al.2016].
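A hedged PyTorch sketch of such a feature adversarial network is given below; the branch channel widths and dilation rates are assumptions patterned after the ASPP design, not values specified in the text.

```python
import torch.nn as nn

class FeatureDiscriminator(nn.Module):
    """Feature adversarial network D_f: ASPP-style fc6/fc7 branches over pool5
    features, followed by a 3x3 convolution that outputs a 1-channel map scored
    with a pixel-wise least-squares loss. Channel widths and dilation rates are
    illustrative assumptions."""
    def __init__(self, in_ch=512, mid_ch=1024, rates=(6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=r, dilation=r),  # fc6
                nn.ReLU(inplace=True),
                nn.Conv2d(mid_ch, mid_ch, kernel_size=1),                          # fc7
                nn.ReLU(inplace=True),
            ) for r in rates
        ])
        self.score = nn.Conv2d(mid_ch, 1, kernel_size=3, padding=1)  # 1-channel map

    def forward(self, feat):
        fused = sum(branch(feat) for branch in self.branches)  # fuse ASPP branches
        return self.score(fused)
```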

The optimization of $C$ and $D_f$ is iterative. More specifically, the objective for updating $D_f$ is:

$\mathcal{L}_{D_f} = \mathbb{E}_{x_t \sim X_t}\big[\|D_f(F(x_t)) - \mathbf{1}\|_2^2\big] + \mathbb{E}_{x_s \sim X_s}\big[\|D_f(\hat{f}_s) - \mathbf{0}\|_2^2\big] \quad (1)$

where $\mathbf{1}$ is an all-one tensor and $\mathbf{0}$ is an all-zero tensor. The feature adversarial network adopts the least-squares loss function, regressing the features of the target domain to $\mathbf{1}$ while regressing the features of the compensated source domain to $\mathbf{0}$. It thus distinguishes the target features from the compensated source domain features, while the feature compensation network tries to transform them into indistinguishable ones.

The learning target of the feature compensation network $C$ is to mitigate the difference between the source and target features. It is trained by optimizing the following objective function:

$\mathcal{L}_{C} = \mathbb{E}_{x_s \sim X_s}\big[\|D_f(\hat{f}_s) - \mathbf{1}\|_2^2\big] \quad (2)$

The target of $C$ is to transform the source domain features into ones similar to the target domain features by trying to confuse $D_f$. More concretely, $C$ tries to generate features that persuade $D_f$ to predict that they come from the target domain (output prediction of 1). It implicitly maps the source domain features toward the target domain by encoding lighting conditions, environment factors, etc. By iteratively boosting the abilities of $C$ and $D_f$ through alternating training, the gap between the two domains is gradually narrowed.
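The following sketch shows how the two least-squares objectives in Equations (1) and (2) could be computed in PyTorch; the function and argument names (feat_net for $F$, feat_low for $F_1$, comp_net for $C$, disc_f for $D_f$) and the placement of detach are our own illustrative choices rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as nnf

def feature_adaptation_losses(feat_net, feat_low, comp_net, disc_f, x_s, x_t):
    """Hedged sketch of Equations (1) and (2): pixel-wise least-squares losses."""
    f_t = feat_net(x_t)                                  # target-domain pool5 features
    f_s_hat = feat_net(x_s) + comp_net(feat_low(x_s))    # compensated source features

    # Eq. (1): D_f regresses target features to 1 and compensated source features to 0.
    pred_t = disc_f(f_t)
    pred_s = disc_f(f_s_hat.detach())                    # detach: this term updates D_f only
    loss_Df = nnf.mse_loss(pred_t, torch.ones_like(pred_t)) + \
              nnf.mse_loss(pred_s, torch.zeros_like(pred_s))

    # Eq. (2): C tries to fool D_f into scoring the compensated features as target (1).
    pred_adv = disc_f(f_s_hat)
    loss_C = nnf.mse_loss(pred_adv, torch.ones_like(pred_adv))
    return loss_Df, loss_C
```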

Input: Source domain images $X_s$; source domain labels $Y_s$; target domain images $X_t$; feature extractor $F$; feature compensation network $C$; feature adversarial network $D_f$; structured label adversarial network $D_s$; pixel-wise labeler $P$; number of training iterations $T$; a constant $K$.
1 for $t = 1, \dots, T$ do
2      sample a mini-batch of source images $x_s$ with labels $y_s$ and a mini-batch of target images $x_t$.
3      update $F$, $P$ by minimizing the source-domain parsing loss $\mathcal{L}_{parse}^{s}$.
4      update $C$ by minimizing Equation (2).
5      update $D_f$ by minimizing Equation (1).
6      if $t \bmod K = 0$ then
7           update $F$, $P$ by minimizing Equation (4).
8           update $D_s$ by minimizing Equation (3).
9      end if
10     update $C$, $P$ by minimizing the compensated-feature parsing loss $\mathcal{L}_{parse}^{c}$.
11 end for
Algorithm 1 Training details of the integrated cross-domain human parsing framework.

Adversarial Structured Label Adaptation

Feature compensation alone cannot fully utilize the valuable information about human body structure and leads to suboptimal parsing performance. Therefore, we also propose a structured label adversarial network that learns to capture the commonalities of parsing labels across domains. Such information is learnable from the source domain data for the following reasons. Firstly, the labels have very strong spatial priors. For example, in daily-life scenarios, the head almost always lies at the top, while the shoes appear at the bottom in most cases. Moreover, the relative positions between labels are consistent across domains. For example, the arms lie on both sides of the body, while the head is at the top of the body. Finally, the shapes of certain parts are basically similar in both domains. For example, faces are usually round or oval while legs are often long strips. The pixel-wise labeler $P$ and the structured label adversarial network $D_s$ collaboratively adapt the structured label prediction.

The pixel-wise labeler $P$ is composed of the fc6, fc7 and fc8 layers of DeepLab [Chen et al.2016], a fully convolutional variant of the VGG-16 net [Simonyan and Zisserman2014] that uses atrous (dilated) convolutions to increase the field-of-view. Depending on the properties of the input, two losses are defined upon the network; a short sketch of both follows the list below.

  • $\mathcal{L}_{parse}^{s}$: the pixel-wise cross-entropy loss defined upon the source domain images $x_s$ and their labels $y_s$.

  • $\mathcal{L}_{parse}^{c}$: the pixel-wise cross-entropy loss defined upon the compensated source domain features $\hat{f}_s$ and the labels $y_s$.
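A minimal sketch of these two parsing losses, assuming the modules sketched earlier and labels resized to the logits' resolution:

```python
import torch.nn.functional as nnf

def parsing_losses(feat_net, feat_low, comp_net, labeler, x_s, y_s):
    """Hedged sketch of the two cross-entropy terms above (loss names are our shorthand)."""
    loss_src = nnf.cross_entropy(labeler(feat_net(x_s)), y_s)       # on F(x_s)
    f_s_hat = feat_net(x_s) + comp_net(feat_low(x_s))               # compensated features
    loss_cmp = nnf.cross_entropy(labeler(f_s_hat), y_s)             # on compensated features
    return loss_src, loss_cmp
```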

The structured label adversarial network $D_s$ is used to distill the high-order relationships of the labels from the source domain ground-truth pixel-wise labels and to transfer them to guide the parsing of target domain images. The architecture of $D_s$ is as follows. LeakyReLU activations and batch normalization are used for all layers except the output. All layers use stride-2 convolution filters except the last layer, which contains a single stride-1 convolution filter to produce the confidence map. All convolution filters in the network are 5×5.
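Below is a hedged PyTorch sketch of such a discriminator; the number of down-sampling layers and the channel widths are assumptions, since the text does not specify them.

```python
import torch.nn as nn

class LabelDiscriminator(nn.Module):
    """Structured label adversarial network D_s: stride-2 5x5 convolutions with
    batch norm and LeakyReLU, followed by a stride-1 convolution that outputs the
    confidence map. Depth (four down-sampling layers) and channel widths are
    assumptions; num_classes is the number of part labels fed in as a label map."""
    def __init__(self, num_classes=12, widths=(64, 128, 256, 512)):
        super().__init__()
        layers, in_ch = [], num_classes
        for w in widths:
            layers += [nn.Conv2d(in_ch, w, kernel_size=5, stride=2, padding=2),
                       nn.BatchNorm2d(w),
                       nn.LeakyReLU(0.2, inplace=True)]
            in_ch = w
        layers.append(nn.Conv2d(in_ch, 1, kernel_size=5, stride=1, padding=2))  # confidence map
        self.net = nn.Sequential(*layers)

    def forward(self, label_map):  # label_map: (B, num_classes, H, W)
        return self.net(label_map)
```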

The optimization is conducted jointly through a minimax scheme that alternates between optimizing the parsing network and the adversarial network. $D_s$ takes either the ground-truth label map or the predicted parsing result, and outputs an estimate of whether the input is the ground truth (with training target 1) or the segmentation network's prediction (with training target 0). The learning target is:

$\mathcal{L}_{D_s} = \mathbb{E}_{y_s \sim Y_s}\big[\|D_s(y_s) - \mathbf{1}\|_2^2\big] + \mathbb{E}_{x_t \sim X_t}\big[\|D_s(P(F(x_t))) - \mathbf{0}\|_2^2\big] \quad (3)$

$D_s$ can help refine the feature extractor $F$ and the pixel-wise labeler $P$ via:

$\mathcal{L}_{adv} = \mathbb{E}_{x_t \sim X_t}\big[\|D_s(P(F(x_t))) - \mathbf{1}\|_2^2\big] \quad (4)$

Both $F$ and $P$ collaboratively confuse $D_s$ into producing the output 1, i.e., into judging that the parsing results are drawn from the ground-truth labels.
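A sketch of how Equations (3) and (4) could be computed is given below. Following the discussion section, a least-squares loss is used; feeding one-hot ground-truth maps and softmax predictions to $D_s$, as well as the helper name, are our own assumptions.

```python
import torch
import torch.nn.functional as nnf

def label_adaptation_losses(feat_net, labeler, disc_s, x_t, y_s, num_classes):
    """Hedged sketch of Equations (3) and (4) with a pixel-wise least-squares loss."""
    pred = torch.softmax(labeler(feat_net(x_t)), dim=1)             # predicted target label map
    gt = nnf.one_hot(y_s, num_classes).permute(0, 3, 1, 2).float()  # source GT as a label map

    # Eq. (3): D_s regresses ground-truth maps to 1 and predicted maps to 0.
    s_gt, s_pred = disc_s(gt), disc_s(pred.detach())
    loss_Ds = nnf.mse_loss(s_gt, torch.ones_like(s_gt)) + \
              nnf.mse_loss(s_pred, torch.zeros_like(s_pred))

    # Eq. (4): F and P try to make target predictions indistinguishable from ground truth.
    s_adv = disc_s(pred)
    loss_adv = nnf.mse_loss(s_adv, torch.ones_like(s_adv))
    return loss_Ds, loss_adv
```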

Model Learning and Inference

Training details of the integrated cross-domain human parsing framework are summarized in Algorithm 1. Generally speaking, all the model parameters are alternately updated. Note that before every update of $D_s$, the networks $F$, $P$, $C$ and $D_f$ are updated $K$ times. Experiments show that this different update scheduling between $D_s$ and the remaining networks facilitates model convergence.
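The following is a condensed PyTorch-style sketch of Algorithm 1, assuming the loss helpers sketched around Equations (1)-(4); the optimizer grouping, data loading and gradient bookkeeping are simplified and should be treated as illustrative rather than the exact training recipe.

```python
import torch.nn.functional as nnf

def train(source_loader, target_loader, feat_net, feat_low, comp_net, labeler,
          disc_f, disc_s, opt_FP, opt_C, opt_Df, opt_Ds, opt_CP, K, num_classes):
    """Sketch of Algorithm 1; labels are assumed resized to the logits' resolution."""
    for t, ((x_s, y_s), x_t) in enumerate(zip(source_loader, target_loader), start=1):
        # line 3: update F, P with the source-domain parsing loss.
        opt_FP.zero_grad()
        nnf.cross_entropy(labeler(feat_net(x_s)), y_s).backward()
        opt_FP.step()

        # lines 4-5: feature adaptation, Equations (2) then (1).
        loss_Df, loss_C = feature_adaptation_losses(feat_net, feat_low, comp_net,
                                                    disc_f, x_s, x_t)
        opt_C.zero_grad(); loss_C.backward(retain_graph=True); opt_C.step()
        opt_Df.zero_grad(); loss_Df.backward(); opt_Df.step()

        # lines 6-9: every K iterations, structured label adaptation, Equations (4) then (3).
        if t % K == 0:
            loss_Ds, loss_adv = label_adaptation_losses(feat_net, labeler, disc_s,
                                                        x_t, y_s, num_classes)
            opt_FP.zero_grad(); loss_adv.backward(retain_graph=True); opt_FP.step()
            opt_Ds.zero_grad(); loss_Ds.backward(); opt_Ds.step()

        # line 10: update C, P with the parsing loss on compensated source features.
        f_s_hat = feat_net(x_s) + comp_net(feat_low(x_s))
        opt_CP.zero_grad()
        nnf.cross_entropy(labeler(f_s_hat), y_s).backward()
        opt_CP.step()
```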

During inference, the parsing label of a test sample $x$ is obtained by $P(F(x))$. Note that the feature compensation network and the two adversarial networks are not involved in the inference stage. Therefore, the complexity of our algorithm is the same as that of conventional human parsing methods.
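A minimal sketch of the inference step:

```python
import torch

def parse(feat_net, labeler, x):
    """Inference uses only F and P; C, D_f and D_s are discarded."""
    with torch.no_grad():
        return labeler(feat_net(x)).argmax(dim=1)  # per-pixel part labels
```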

Discussions

In terms of the architecture of the adversarial networks, we originally tried DCGANs [Radford, Metz, and Chintala2015]. However, we found them difficult to optimize (convergence issues) and their performance was unsatisfactory. Therefore, we borrow the architecture of Least Squares Generative Adversarial Networks (LSGANs) [Mao et al.2016] to build our adversarial learning networks, which adopt a least-squares loss function for the discriminator and are more stable during learning. For the feature adversarial network, the adversarial loss is defined pixel-wise on the two-dimensional feature maps; the local LSGANs structure [Shrivastava et al.2016] can thus enhance the capacity of the network. The situation is similar for the structured label adversarial network.

Methods Avg. acc Fg. acc Avg. pre Avg. rec Avg. F1
Target Only 89.50 74.53 60.07 59.55 59.75
Source Only 86.84 68.87 51.12 50.97 49.70
DANN 88.04 71.74 52.23 50.73 50.50
Feat. Adapt 88.16 72.56 53.63 52.23 51.59
Lab. Adapt 88.14 72.82 53.21 51.54 50.95
Feat. + Lab. Adapt 87.98 73.86 50.84 54.49 51.73
Table 1: From LIP to Indoor dataset.
Methods Avg. acc Fg. acc Avg. pre Avg. rec Avg. F1
Target Only 85.96 62.64 58.63 61.07 59.73
Source Only 87.11 63.47 62.05 63.93 62.41
DANN 87.56 63.20 64.28 62.73 62.84
Feat. Adapt 87.64 64.83 63.95 64.25 63.40
Lab. Adapt 87.52 66.53 62.64 65.62 63.62
Feat. + Lab. Adapt 87.88 65.87 64.08 65.97 64.36
Table 2: From LIP to Daily Video dataset.

Experiments

We conduct extensive experiments to evaluate performance of our model for 4 cross-domain human parsing scenarios.

Experimental Setting

Source Domain: We use the LIP dataset [Gong et al.2017] as the source domain. It contains a large number of images with careful pixel-wise annotations of semantic human parts, collected from real-world scenarios; the persons present challenging poses and views, heavy occlusions, various appearances and low resolutions. The original labels are merged into 12 or 4 labels, by discarding or combining categories, to be consistent with the target domains.

Target Domain: The following four target domains are investigated in this paper. Some example images from these target domains are shown in Figure 3.

Indoor dataset [Liu et al.2016] contains labeled images with semantic human part labels as well as unlabeled images. The images are captured in a canteen by surveillance cameras and have dim lighting.

Daily Video dataset is a newly collected dataset, containing labeled images with semantic human part labels as well as unlabeled images. These images are collected from a variety of scenes including shops, roads, etc.

PridA and PridB datasets are selected from camera view A and camera view B of the Person Re-ID 2011 dataset [Roth et al.2014], which consists of images extracted from multiple person trajectories recorded by two different, static surveillance cameras.

Baseline & Evaluation: The proposed method is compared with the following baseline methods.

Target Only: Since all of our target domains have pixel-level annotations, we train and test the parsing model directly on the target domains. We take these results, derived with access to full supervision, as the performance upper bound for cross-domain parsing models. In the following experiments, the basic model is the same as our feature extraction network and label prediction network.

Source Only: We apply the model trained on the source domain directly to the target domain, without any fine-tuning on the target domain datasets. It is a valid performance lower bound of the cross-domain methods.

DANN: A few works investigate cross-domain learning problems following the adversarial learning strategy. Here, we take the most competitive one, proposed in [Ganin et al.2016], which addresses cross-domain classification problems. DANN uses an adversary network to make the features extracted from the source and target domains indistinguishable; the feature extraction network is shared between images from both domains. We adapt this method to the human parsing problem.

Methods Avg. acc Fg. acc Avg. pre Avg. rec Avg. F1
Target Only 89.90 81.44 81.38 82.96 82.12
Source Only 86.10 78.39 72.54 80.60 76.00
DANN 86.17 81.99 73.51 82.18 76.99
Feat. Adapt 86.63 81.51 73.41 82.88 77.39
Lab. Adapt 87.01 79.55 73.74 81.81 77.14
Feat. + Lab. Adapt 87.24 80.81 74.76 82.32 77.92
Table 3: From LIP to PridA dataset.
Methods Avg. acc Fg. acc Avg. pre Avg. rec Avg. F1
Target Only 88.50 79.71 79.83 82.28 81.00
Source Only 84.46 80.01 72.85 80.01 75.63
DANN 83.91 83.06 71.55 82.83 75.83
Feat. Adapt 85.63 82.30 74.47 81.69 77.28
Lab. Adapt 84.62 80.54 73.13 80.42 75.89
Feat. + Lab. Adapt 86.26 82.39 75.20 81.62 77.89
Table 4: From LIP to PridB dataset.
Methods bg face hair U-clothes L-arm R-arm pants L-leg R-leg dress L-shoe R-shoe
Target Only 95.05 66.46 77.30 81.35 50.79 50.29 80.95 38.28 39.342 63.15 37.285 36.68
Source Only 94.17 58.89 59.10 77.51 43.35 43.39 75.06 35.16 32.53 26.55 24.11 26.54
DANN 94.48 61.38 65.26 78.41 42.01 41.74 78.83 32.84 25.53 35.56 23.76 26.19
Feat. Adapt 94.53 58.92 62.99 78.27 41.14 40.11 79.42 41.49 22.90 45.15 26.53 27.69
Lab. Adapt 94.48 57.71 63.32 78.60 41.20 41.22 79.06 38.99 22.64 45.52 25.90 22.70
Feat. + Lab. Adapt 94.49 56.73 67.86 78.81 42.79 42.64 78.97 36.25 22.86 47.00 25.32 27.00
Table 5: F-1 Scores of each category from LIP to Indoor.
Methods bg face hair U-clothes L-arm R-arm pants L-leg R-leg dress L-shoe R-shoe
Target Only 94.50 69.06 57.37 68.16 46.33 42.37 65.01 59.97 60.35 67.06 41.85 44.72
Source Only 95.15 70.28 59.54 69.91 55.25 50.72 72.95 61.52 61.82 60.32 45.55 45.88
DANN 95.18 70.98 58.87 71.13 54.73 50.64 73.23 61.84 61.16 64.16 46.55 45.61
Feat. Adapt 95.35 72.13 55.99 73.01 56.55 52.38 73.08 60.60 62.91 61.72 48.77 48.24
Lab. Adapt 95.37 70.99 59.66 72.18 55.94 52.33 72.76 62.68 63.60 63.08 46.51 48.31
Feat. + Lab. Adapt 95.38 70.88 57.11 73.04 57.05 53.92 73.34 64.80 64.73 64.80 48.34 48.97
Table 6: F-1 Scores of each category from LIP to Daily Video dataset.

For ablation studies, we consider three variants of our method to evaluate the contribution of each sub-network. Feat. Adapt: our method with the Feature Adversarial network alone. Lab. Adapt: our method with the Structured Label Adversarial network alone. Feat. + Lab. Adapt: our method with both the Feature Adversarial network and the Structured Label Adversarial network.

We adopt five popular evaluation metrics, i.e., accuracy, foreground accuracy, average precision, average recall, and average F-1 score over pixels [Yamaguchi, Kiapour, and Berg2013]. All these scores are obtained on the testing sets of the target domains. The annotations of the target domains are only used in the "Target Only" method.

Implementation details: The feature extractor $F$ and the pixel-wise labeler $P$ use the DeepLab model, with models pre-trained on PASCAL VOC. The other networks are initialized from a normal distribution.

During training of the feature adversarial adaptation component, the "Adam" optimizer is used with a learning rate of $10^{-5}$. When training the structured label adaptation component, we also use the "Adam" optimizer, with a learning rate of $10^{-8}$. The remaining networks are optimized via the "SGD" optimizer with a momentum of 0.9, a learning rate of $10^{-8}$ and a weight decay of 0.0005. The whole framework is implemented in PyTorch and trained with a fixed mini-batch size and input image size. The experiments are run on a single NVIDIA GeForce GTX TITAN X GPU with 12GB memory. The constant $K$ is fixed in our experiments.
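For concreteness, a hedged sketch of the optimizer setup is given below; the Adam betas and several magnitudes are not fully specified above, so the values and parameter grouping are placeholders that mirror Algorithm 1.

```python
import torch

def build_optimizers(feat_net, labeler, comp_net, disc_f, disc_s):
    """Hedged sketch of the optimizer configuration described in this section."""
    opt_C  = torch.optim.Adam(comp_net.parameters(), lr=1e-5)   # feature adaptation, Eq. (2)
    opt_Df = torch.optim.Adam(disc_f.parameters(), lr=1e-5)     # feature adversary, Eq. (1)
    opt_Ds = torch.optim.Adam(disc_s.parameters(), lr=1e-8)     # label adversary, Eq. (3)
    sgd_args = dict(lr=1e-8, momentum=0.9, weight_decay=0.0005)
    opt_FP = torch.optim.SGD(list(feat_net.parameters()) + list(labeler.parameters()), **sgd_args)
    opt_CP = torch.optim.SGD(list(comp_net.parameters()) + list(labeler.parameters()), **sgd_args)
    return opt_FP, opt_C, opt_Df, opt_Ds, opt_CP
```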

Methods bg head U-body L-body
Target Only 93.90 76.63 81.83 76.11
Source Only 92.06 69.05 74.49 68.38
DANN 92.01 71.49 75.65 68.80
Feat. Adapt 92.28 71.61 76.71 68.96
Lab. Adapt 92.78 70.24 76.11 69.43
Feat. + Lab. Adapt 92.83 72.01 76.72 70.14
Table 7: F-1 Scores of each category from LIP to PridA dataset.
Methods bg head U-body L-body
Target Only 92.88 77.23 80.38 73.51
Source Only 90.75 71.79 74.56 65.41
DANN 90.16 72.73 74.79 65.66
Feat. Adapt 91.30 72.59 76.95 68.29
Lab. Adapt 90.88 71.92 74.69 66.07
Feat. + Lab. Adapt 91.59 72.71 78.27 68.99
Table 8: F-1 Scores of each category from LIP to PridB dataset.

Quantitative Results

Tables 1 to 4 show the quantitative comparison of the proposed method with the baseline methods. The best scores, excluding those of "Target Only" (the upper bound), are shown in bold.

From these results, we observe that the "Feat. + Lab. Adapt" method consistently outperforms the "Source Only" and "DANN" methods in terms of "Avg. F-1", which verifies the effectiveness of the proposed cross-domain method. Note that the "Avg. F-1" score of "Feat. + Lab. Adapt" is even higher than that of "Target Only" on the Daily Video dataset. We believe the reason is that the number of images in the training set of this dataset is quite limited, and our proposed model is effective at transferring useful knowledge to address the sample-insufficiency issue. Besides, "Feat. Adapt" performs better than "Lab. Adapt" on the Indoor, PridA and PridB datasets. This stems from the fact that the features output by the "pool5" layer contain richer characteristics, so the adversarial network defined on these features has more influence on the overall performance.

The detailed F-1 scores of each category are shown in Tables 5 to 8, which further verify the effectiveness of our method.

Figure 3: Qualitative Results on 4 target domains. “GT” stands for the groundtruth labels.

Qualitative Results

Some qualitative comparisons on the four target domains are shown in Figure 3.

For the Indoor dataset, back-view persons appear more frequently and the illumination is poor. Therefore, the predictions of left and right arms/shoes are often incorrect, and hair may be mis-predicted as background. For the persons in the 1st and 3rd rows of the Indoor dataset, the left and right arms are confused by "Source Only". DANN performs slightly better, but our model is able to predict the left and right arms correctly. The hair of the second person is missed by both the "Source Only" and "DANN" methods, due to the dim lighting of the image. The dress of the 4th person appears smaller because the camera is mounted much higher than the person, so the "Source Only" and "DANN" methods wrongly predict it as upper clothes.

For the Daily Video dataset, cameras are placed at ordinary positions but the poses of the persons are more challenging. People usually appear in frontal view, but they are often moving fast, e.g., the 2nd person, or under non-uniform illumination, e.g., the 3rd and 4th persons. In these cases, the proposed model performs better, benefiting from the structured label adversarial network. Our method also performs better in predicting the classes of clothes, e.g., the 1st person.

The resolutions of the PridA and PridB datasets are very low. As shown in Figure 3, our model and its variants also perform better in predicting the details of the persons.

Conclusion

In this paper, we explored a new cross-domain human parsing problem: making use of a benchmark dataset with extensive labeling, how can one build a human parser for a new scenario without additional labels? To this end, an adversarial feature and structured label adaptation method was developed that learns to minimize the cross-domain feature differences and maximize the label commonalities across the two domains. In the future, we plan to explore unsupervised domain adaptation when the target domain consists of unlabeled videos. Videos provide rich temporal context and can facilitate cross-domain adaptation. Moreover, we would like to try other types of GANs, such as WGAN [Arjovsky, Chintala, and Bottou2017], in our framework.

Acknowledgments

This work was supported by Natural Science Foundation of China (Grant U1536203, Grant 61572493, Grant 61572493), the Open Project Program of the Jiangsu Key Laboratory of Big Data Analysis Technology, Fundamental theory and cutting edge technology Research Program of Institute of Information Engineering, CAS(Grant No. Y7Z0241102) and Grant No. Y6Z0021102.

References