Unpaired Pose Guided Human Image Generation

by Xu Chen, et al.

This paper studies the task of full generative modelling of realistic images of humans, guided only by a coarse sketch of the pose, while providing control over the specific instance or type of outfit worn by the user. This is a difficult problem because the input and output domains are very different and direct image-to-image translation becomes infeasible. We propose an end-to-end trainable network under the generative adversarial framework that provides detailed control over the final appearance while not requiring paired training data, and hence allows us to forgo the challenging problem of fitting 3D poses to 2D images. The model can generate novel samples conditioned on either an image taken from the target domain or a class label indicating the style of clothing (e.g., t-shirt). We thoroughly evaluate the architecture and the contributions of the individual components experimentally. Finally, we show in a large-scale perceptual study that our approach can generate realistic-looking images and that participants struggle to distinguish fake images from real samples, especially if faces are blurred.






1 Introduction

In this paper we explore full generative modelling of people in clothing given only a sketch as input. This is a compelling problem motivated by the following questions. First, humans can imagine people in particular poses, wearing specific types of clothing; can machines learn to do the same? If so, how well can generative models perform this task? This is clearly a difficult problem since the input, a sketch of the pose, and the output, a detailed image of a dressed person, are drastically different in complexity, rendering direct image-to-image translation infeasible. The availability of such a generative model would make many new application scenarios possible: cheap and easy-to-control generation of imagery for e-commerce applications such as fashion shopping, or synthesis of training data for discriminative approaches in person detection, identification or pose estimation.

Figure 1: Generating humans in clothing: Our network takes a pose sketch as input and generates realistic images, giving users (a) instance control via a reference image, or (b) class-level control via conditional sampling, leading to images of variations of a class of outfits.

Existing approaches typically cast the problem of generating people in clothing as a two-stage paired image-to-image translation task. Lassner et al. [16] require pairs of corresponding input images, 3D body poses and so-called parsing images. While this allows for the generation of images with control over the depicted pose, such approaches do not provide any control over the type or even specific instance of clothing. However, many application scenarios would require control over all three.

In this paper we propose a new task that, to the best of our knowledge, isn’t covered by any related work: we seek to generate images of people in clothing with (T1) control over the pose, and (T2) exact instance control (the type of clothing), or (T3) conditional sampling via class label, and with sufficient variance (e.g., blue or black suit).

For example, provided an image of a person, a model should be able to generate a new image of the person wearing a specific clothing item (e.g., red shirt) in a particular pose (Fig. 1, a). Or, provided a pose and class label (e.g., dress), it should synthesize images of different variants of the clothing in the target pose (Fig. 1, b).

To tackle this problem, we furthermore contribute a new dataset, and a network architecture that greatly reduces the amount of annotations required for training which in turn drastically reduces human effort and thus allows for faster, cheaper and easier acquisition of large training data sets. Specifically, we only require unpaired sets of 3D meshes with part annotations and images of people in clothing. These sets only need to contain roughly similar distributions of poses, allowing for reuse of existing datasets.

Furthermore, our work contributes on the technical level in that we propose a single end-to-end trainable network under the generative adversarial framework that provides active control over (i) the pose of the depicted person, (ii) the type of clothing worn via conditional synthesis, and even (iii) the specific instance of a clothing item. Note that in our formulation applying clothing is not performed via image-to-image transfer and hence allows for generation of novel images via sampling from a variational latent space, conditioned on a style label (e.g., t-shirt). Finally, our approach allows for the construction of a single inference model that makes the two-stage approach of prior work unnecessary.

We evaluate our method qualitatively in an in-depth analysis and show via a large-scale user study that our approach produces images that are perceived as more realistic, indicated by a higher “fool-rate”, than prior work.

2 Related Work

We consider the problem of generating images of people in clothing, with fine-grained user control. This is a relatively new area in the computer vision literature. Here we review the most closely related work and briefly summarize work that leverages deep generative models in adjacent areas of computer vision.

Generating people in clothing Synthesizing realistic images of people in different types of clothing is a challenging task and of great importance in many areas such as e-commerce, gaming and as a potential source of training data for computer vision tasks. The computer graphics literature has dedicated a lot of attention to this problem including skinning and articulation of 3D meshes, simulation of physically accurate clothing deformation and the associated rendering problems [7, 4, 5, 23, 22, 15]. Despite much progress, generating photo-realistic renderings remains difficult and is computationally expensive.

Circumventing the graphics pipeline entirely, image-based generative modeling of people in clothing has been proposed as an emerging task in the computer vision and machine learning literature [16, 18, 8, 33].

Ma et al. [18, 19] propose a two-stage pose-guided image generation network that can synthesize images depicting a specific given person in arbitrary 2D poses. [8, 36, 28] propose methods for so-called virtual try-on. Han et al. [8] use a coarse-to-fine framework to transfer garment appearance from a high-quality photograph to the corresponding body part(s) in the destination 2D image.

The work most closely related in spirit to ours is [16], which proposes a generative model learned directly from images. Their model can generate images of people by sampling from a learned space. This work also relies on a two-stage process and requires expensive annotations and paired data which are non-trivial to obtain.

Figure 2: Schematic architecture overview. (a): network pipeline for training; (b): inset shows the encoder in detail, consisting of a VAE and class label code; (c): the inference network generates synthetic images, given a part model as input and using either a style class label or a style depiction to condition the generation process.

Deep generative models Deep generative models that can synthesize novel samples have lately seen a lot of attention in the machine learning and computer vision literature. In particular, generative adversarial networks (GANs) [6, 26, 24, 21] and variational autoencoders (VAEs) have been successfully applied to the task of image synthesis. A particular problem that needs to be addressed in real tasks is that of control over the synthesis process. In their most basic form neither GANs nor VAEs allow for direct control over the output or individual parameters. To address this challenge, a number of recent studies have investigated the problem of generating images conditioned on a given image or vector embedding [13, 11, 31], such that they predict the output given a condition, where the condition is a reference image or a similar means of defining the desired output (e.g., one-hot encoded class labels).

Image-to-image translation Generating people in clothing given a target pose can be regarded as an instance of the image-to-image translation problem. In [11], automatic image-to-image translation is first tackled based on conditional GANs. While conditional GANs have shown promise when conditioned on one image to generate another image, the generation quality has been attained at the expense of diversity in the translated outputs. To alleviate this problem, many works propose to simultaneously generate multiple outputs given the same input and encourage them to be distinct [35, 3, 2].

In [35], the ambiguity of the many-to-one mapping is distilled into a low-dimensional latent vector, which can be randomly sampled at test time and encourages generation of more diverse results. To train all of the above methods, paired training data is needed. However, for many tasks, paired information will not be available or will be very expensive to obtain. In [34], an approach called CycleGAN has been introduced to translate an image from a source domain X to a target domain Y in the absence of paired examples. Similar approaches have been proposed that enforce domain translation with a cycle consistency loss [12, 32].

In our paper, we build on this prior work but leverage a formulation that gives control over individual dimensions when generating images and the capability to generate entirely novel samples via incorporation of a variational auto-encoder that can be conditioned on a particular pose. A couple of concurrent works also recognize these limitations and propose architectures for the somewhat easier task of unsupervised multimodal image domain translation [1, 10, 17]. We show that our architecture yields comparable results even though this task is not the main focus of our work.

3 Method

To conditionally synthesize images of people in clothing, while providing user control over the generation process, we propose a network combining the benefits of generative adversarial networks with cycle consistency and variational autoencoders in a single end-to-end trainable network. The aim is to generate high-quality images and to allow for the conditional generation of samples where the appearance is controlled via either an image (e.g., a specific suit) or a class label indicating the style of clothing (e.g., suits in general). Importantly, our architecture can be trained on unpaired training data. That is, 3D poses and real images do not need to correspond directly. The architecture, illustrated in Fig. 2 and dubbed Unpaired Pose-Guided GAN (UPG-GAN), combines elements of the GAN and VAE frameworks with several novel consistency losses to enable fine-grained appearance control. In this section, we briefly introduce the basics of CycleGANs and VAEs and then detail our proposed architecture and training scheme.

3.1 Preliminaries: CycleGAN and VAE

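To enable the desired unpaired training scheme, we build our framework on the basis of CycleGAN [34]. In the CycleGAN framework two generators $G_{P \to H}$ and $G_{H \to P}$ translate images from one domain to another (here between part models $P$ and human images $H$) and attempt to fool the corresponding discriminators $D_H$ and $D_P$, which classify samples into real and synthetic data. This is expressed via the following minimax game:

$$\min_{G_{P \to H},\, G_{H \to P}} \; \max_{D_H,\, D_P} \; \mathcal{L}_{GAN}(G_{P \to H}, D_H) + \mathcal{L}_{GAN}(G_{H \to P}, D_P),$$

where $\mathcal{L}_{GAN}$ is the standard GAN loss:

$$\mathcal{L}_{GAN}(G_{P \to H}, D_H) = \mathbb{E}_{H}\left[\log D_H(H)\right] + \mathbb{E}_{P}\left[\log\left(1 - D_H(G_{P \to H}(P))\right)\right],$$

and analogously for $\mathcal{L}_{GAN}(G_{H \to P}, D_P)$. To prevent $G_{P \to H}$ and $G_{H \to P}$ from neglecting the underlying information of the input source images, a cyclic reconstruction loss is added:

$$\mathcal{L}_{cyc} = \mathbb{E}_{P}\left[\lVert G_{H \to P}(G_{P \to H}(P)) - P \rVert_1\right] + \mathbb{E}_{H}\left[\lVert G_{P \to H}(G_{H \to P}(H)) - H \rVert_1\right].$$

The overall objective of CycleGAN is then given by:

$$\mathcal{L}_{CycleGAN} = \mathcal{L}_{GAN}(G_{P \to H}, D_H) + \mathcal{L}_{GAN}(G_{H \to P}, D_P) + \lambda_{cyc}\, \mathcal{L}_{cyc}.$$

This is optimized by alternating between updating the generators to minimize and the discriminators to maximize this objective.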
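
The CycleGAN terms described above can be illustrated with a minimal sketch. It operates on plain Python lists instead of image tensors, and all function names as well as the default weight `lambda_cyc=10.0` are assumptions made for this example rather than values taken from the paper.

```python
import math

def gan_loss(d_real, d_fake):
    """Standard GAN objective computed from discriminator probabilities in (0, 1).

    d_real: discriminator outputs on real samples
    d_fake: discriminator outputs on generated samples
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

def l1_distance(a, b):
    """Mean absolute difference between two flattened images."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def cycle_loss(part, part_rec, human, human_rec):
    """Cyclic reconstruction loss: both round trips should return their input."""
    return l1_distance(part, part_rec) + l1_distance(human, human_rec)

def cyclegan_objective(gan_h, gan_p, cyc, lambda_cyc=10.0):
    """Sum of both adversarial terms and the weighted cycle term.

    lambda_cyc is a hypothetical weight chosen for illustration.
    """
    return gan_h + gan_p + lambda_cyc * cyc
```

A discriminator that outputs 0.5 everywhere (fully confused) yields `gan_loss([0.5], [0.5]) = 2 * log(0.5)`, which is the value the generators push towards while the discriminators try to raise it.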

Furthermore, we would like to endow the model with the ability to stochastically generate samples of natural human images (to attain natural variation in appearance), given a specified pose. For this task, we leverage VAEs so that a latent code $z$ can be sampled from a prior distribution $p(z)$. In the VAE setting, $p(z)$ is commonly chosen to be an isotropic normal distribution $\mathcal{N}(0, I)$ with zero mean and unit variance. Applying the re-parametrization trick, the corresponding KL divergence ($\mathcal{L}_{KL}$) is minimized during training to regularize the latent distribution to be close to $p(z)$, where $\mu$ and $\sigma$ are encoded from the human image $H$:

$$\mathcal{L}_{KL} = D_{KL}\left(\mathcal{N}(\mu(H), \sigma^2(H)) \,\Vert\, \mathcal{N}(0, I)\right).$$
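
As an illustration of the re-parametrization trick and the KL regularizer, the following sketch uses the well-known closed form of the KL divergence between a diagonal Gaussian and the standard normal. Parameterizing the encoder output as a log-variance is a common convention and an assumption here, as are the function names.

```python
import math
import random

def reparameterize(mu, log_var, rng=random):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), where sigma = exp(log_var / 2)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))
```

The divergence vanishes exactly when the encoder predicts zero mean and unit variance, so minimizing it pulls the encoded distribution towards the prior from which codes are sampled at inference time.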


3.2 Unpaired Pose Guided GANs (UPG-GAN)

We now discuss (1) how to endow the base CycleGAN architecture with control over the specific instance of clothing (provided via a reference image) and (2) how to allow for conditional generation of a diverse set of images via sampling, where the style is controlled via a class label $c$.

Instance control. To allow for control over the clothing in the output image, we first introduce an additional encoder $E$ (dashed box in Fig. 2 (a)) to extract information from the input human image $H$ into a latent code $z$. This code then serves as guidance to reconstruct a human image from the generated synthetic part model $\hat{P} = G_{H \to P}(H)$, where $G_{H \to P}$ and $G_{P \to H}$ denote the generators mapping human images to part models and vice versa. If $z$ is simply fed to $G_{P \to H}$, the generation process of $\hat{H}$ suffers from mode collapse due to information hiding. During training, $G_{H \to P}$ embeds a nearly imperceptible, high-frequency signal into $\hat{P}$. At inference time $P$, the input to $G_{P \to H}$, is void of the hidden signal and the generation of $\hat{H}$ converges to a single mode. This problem is especially severe when one domain, in our case the human images $H$, represents significantly richer and more diverse information than the other, in our case the part models $P$ (see Fig. 2). To prevent this, $\hat{H}$ is enforced to reflect the information encoded in $z$ via the introduction of a latent code consistency loss:

$$\mathcal{L}_{lcc} = \mathbb{E}\left[\lVert E(\hat{H}) - z \rVert\right], \quad \hat{H} = G_{P \to H}(P, z).$$

Here the encoder $E$ produces a style code $\hat{z} = E(\hat{H})$ from $\hat{H}$.
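
The effect of the latent code consistency loss can be sketched with toy stand-ins. In this hypothetical example an "image" is a flat list of floats, the generators and the encoder are trivial functions, and the L1 distance is an assumed choice of norm.

```python
def latent_consistency_loss(generator, encoder, part_model, z):
    """Distance between the code fed to the generator and the code
    re-extracted from the generated image (L1 is an assumption here)."""
    fake_human = generator(part_model, z)
    z_hat = encoder(fake_human)
    return sum(abs(a - b) for a, b in zip(z, z_hat)) / len(z)

# Hypothetical toy networks: the "image" is the part model followed by a style slot.
def generator_uses_code(part_model, z):
    return list(part_model) + list(z)          # style code is visible in the output

def generator_ignores_code(part_model, z):
    return list(part_model) + [0.0] * len(z)   # style code is dropped (single mode)

def toy_encoder(image, code_size=2):
    return image[-code_size:]                  # read the style slot back

part = [0.2, 0.4, 0.6]
code = [1.0, -1.0]
loss_ok = latent_consistency_loss(generator_uses_code, toy_encoder, part, code)
loss_collapsed = latent_consistency_loss(generator_ignores_code, toy_encoder, part, code)
```

A generator that ignores the code cannot drive this loss to zero, which is why the term penalizes the collapsed solution described above.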

Conditional sampling. In cases where only the type of clothing matters but diverse image generation is desired (e.g., to triage different designs), we extend our architecture to accept a class label $c$ for implicit guidance. To be able to generate varied images via sampling from a prior distribution, we propose a scheme similar to the VAE framework, albeit without an explicit decoder (see Fig. 2, inset). Conditioning on a specific type of clothing requires disentanglement of style-type information from other dimensions of the latent code such as color and texture of the clothing. Following the approach in ss-InfoGAN [27], we leverage easy-to-obtain style class labels while we do not provide labels for any of the other dimensions. The latent code $z$ is then decomposed into a supervised part $z_s$, controlling the style class, and an unsupervised part $z_u$, where $z = (z_s, z_u)$. The following regularization term encourages the VAE’s encoder to learn a disentangled representation

$$\mathcal{L}_{cls} = \mathbb{E}\left[\lVert z_s - c \rVert\right]$$

via enforcing similarity between $z_s$ and its corresponding ground truth label $c$. The inference network (Fig. 2, c) can then be fed with labels to control style classes while producing samples with sufficient variability (Fig. 1, b).
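
A minimal sketch of the latent code decomposition follows. It assumes that the supervised part occupies the first entries of the code and is regressed onto a one-hot label with a squared error; the layout, the loss form and all helper names are illustrative assumptions. The class list mirrors the clothing categories named in the dataset description.

```python
import random

CLASSES = ["t-shirt", "dress-shirt", "dress", "suit"]

def one_hot(label, classes=CLASSES):
    """One-hot encoding of a clothing class label."""
    return [1.0 if name == label else 0.0 for name in classes]

def split_code(z, n_classes=len(CLASSES)):
    """Split a latent code into (supervised, unsupervised) parts."""
    return z[:n_classes], z[n_classes:]

def class_supervision_loss(z_s, label):
    """Mean squared error between the supervised part and the one-hot label."""
    target = one_hot(label)
    return sum((a - b) ** 2 for a, b in zip(z_s, target)) / len(target)

def sample_code(label, n_unsupervised, rng=random):
    """Inference-time code: fixed class part, unsupervised part drawn from N(0, I)."""
    return one_hot(label) + [rng.gauss(0.0, 1.0) for _ in range(n_unsupervised)]
```

Calling `sample_code("dress", 8)` repeatedly keeps the class part fixed while the remaining dimensions vary, which corresponds to sampling different variants within one clothing class.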

3.3 Training Schedule

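Our overall objective function:

$$\mathcal{L} = \mathcal{L}_{CycleGAN} + \lambda_{lcc}\, \mathcal{L}_{lcc} + \lambda_{KL}\, \mathcal{L}_{KL} + \lambda_{cls}\, \mathcal{L}_{cls},$$

where $\mathcal{L}_{lcc}$ is the latent code consistency loss, $\mathcal{L}_{KL}$ the KL-divergence loss and $\mathcal{L}_{cls}$ the class supervision loss, is optimized by the iterative training procedure given as pseudo-code in Alg. 1. At the beginning of each iteration, one part model $P$ and one human image $H$ are randomly drawn from the corresponding datasets, and the ground truth label $c$ for $H$ is read, if available. The VAE encodes $H$ to get the latent code $z$. Synthetic samples $\hat{H}$ and $\hat{P}$ are generated from $(P, z)$ and $H$, and at the same time a latent code $\hat{z}$ is extracted from $\hat{H}$. The last step of the forward pass is to reconstruct $P$ from $\hat{H}$ and $H$ from $(\hat{P}, z)$, respectively. The overall loss is calculated and back-propagated through the network to update the weights.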

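  $P$: part model, $\mathcal{S}_P$: part model dataset
  $H$: human image, $\mathcal{S}_H$: human image dataset
  $c$: class label (clothing), $z$: latent code
  $G_{P \to H}$, $G_{H \to P}$: generators, $E$: encoder
  for each training iteration do
     draw $P \sim \mathcal{S}_P$ and $H \sim \mathcal{S}_H$; read $c$ for $H$ if available
     $z \leftarrow \mathrm{VAE}(H)$
     $\hat{H} \leftarrow G_{P \to H}(P, z)$, $\hat{P} \leftarrow G_{H \to P}(H)$, $\hat{z} \leftarrow E(\hat{H})$
     reconstruct $P$ from $\hat{H}$ and $H$ from $(\hat{P}, z)$
     compute the overall loss and update the network weights
  end for
Algorithm 1 Unpaired Pose Guided GAN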
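
The forward pass of one training iteration, as described in the text, can be sketched as follows. The four callables are hypothetical placeholders for the networks; loss computation and the weight update are omitted.

```python
def forward_pass(part, human, vae_encode, g_p2h, g_h2p, encode):
    """One forward pass following the textual description of the training loop.

    vae_encode, g_p2h, g_h2p and encode are placeholder callables standing in
    for the VAE encoder, the two generators and the style encoder.
    """
    z = vae_encode(human)              # latent code from the human image
    fake_human = g_p2h(part, z)        # synthetic human image from part model + code
    fake_part = g_h2p(human)           # synthetic part model from the human image
    z_hat = encode(fake_human)         # code re-extracted from the synthetic image
    part_rec = g_h2p(fake_human)       # cycle: part model reconstructed
    human_rec = g_p2h(fake_part, z)    # cycle: human image reconstructed
    return {"z": z, "z_hat": z_hat, "fake_human": fake_human,
            "fake_part": fake_part, "part_rec": part_rec, "human_rec": human_rec}

# Toy stand-ins: a human "image" is [pose, style], a part model is [pose].
out = forward_pass(
    part=[1.0],
    human=[2.0, 5.0],
    vae_encode=lambda h: [h[1]],
    g_p2h=lambda p, z: [p[0], z[0]],
    g_h2p=lambda h: [h[0]],
    encode=lambda h: [h[1]],
)
```

With these idealized toy networks both cycles close and the re-extracted code matches the input code, so the cycle and latent consistency terms would be zero.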

4 Experiments

Evaluating generative models is a difficult task since the main goal, that of synthesizing novel samples, implies that no ground-truth information is available. Prior work on generating images of people in clothing has reported reconstruction accuracy. However, in our work this is not possible since we do not train on pairs of images. Furthermore, for the final task the two most important aspects are degree of control over the content and the final image quality. For these reasons we report mostly qualitative results but compare the various aspects of our proposed architecture in an ablative manner against the underlying building blocks (e.g., CycleGAN only). Finally, we report results from a large scale user study in which we asked participants to discriminate between randomly sampled real and fake images.

4.1 Dataset

One of the contributions of our work is the removal of the requirement for a task-specific dataset. Training of the proposed architecture relies only on two separate sources of data: images of humans with known 3D pose (e.g., from the 3D pose estimation literature) and images of real people wearing varying types of clothing. To obtain samples of part models we use the dataset of [16]. To obtain the real human images, we crawled 1500 images from an online fashion store (https://www.zalando.ch/), including t-shirts, dress-shirts, dresses and suits. The label was extracted from the online shop’s categorization. To ensure rough correspondences between the body models and the images, we separated the datasets into those depicting full bodies and upper bodies only. Importantly, these two data sources need not be directly paired and hence there is no need to fit the 3D body models to the 2D imagery. We will release the dataset and code for network training and inference.

Figure 3: Several images produced by our full model and by the variant without the classification loss, conditioned on target human images and poses. Every four images in a row form a group. From left to right in a group: target style image, target pose, our result and the result without the classification loss.

4.2 Implementation Details

We set the network input to a fixed size of for computational reasons. The two generators (cf. Fig. 2) share the same architecture and we use a standard Downsampling-ResNetChain-Upsampling configuration. The VAE block and the classifier share the convolutional layers of a ResNet [9] architecture. During training, the learning rate is set to . The weight of the supervised regression loss is set to while the weight of the KL-divergence loss is set to . For more details please refer to the supplementary materials.

4.3 Ablation Study

To understand the effect of each component that we add to the base CycleGAN architecture, we conduct an ablation study. We contribute four novel aspects, namely the clothing encoder, the latent code consistency loss, the KL divergence loss and the class supervision loss. The clothing encoder and the KL divergence loss are necessary to perform instance control (via source image) and conditional sampling (via class label), respectively. Therefore we focus our study on the latent code consistency loss and the class supervision loss. We train one network without the latent code consistency loss and one network without the class supervision loss, and evaluate their performance on both instance control and conditional sampling, compared to our full network. Note that we cannot compare directly to the base architecture (without both) since it suffers from severe mode collapse and fails to provide control over the depicted outfit.

Figure 4: Conditional sampling comparison. (a): without the class supervision loss. (b): ours, with the class supervision loss.

Setup: During evaluation of instance control, the realism of the generated human images and the similarity of style between the generated image and the target are considered. To measure the realism we use a pre-trained Faster R-CNN [25] to detect people in the generated images and report the detection accuracy and average confidence. To evaluate whether the target style persists in the output, we use a pre-trained person re-identification network [30]. If the generated human image has a similar style to the target, the re-identification network should be able to re-identify it in the output image. To evaluate conditional sampling, we sample 20 human images for each input body part model and again compute the person detection accuracy and the average confidence. Moreover, we randomly pick 19 sample pairs for each input body part model, and measure the diversity of the generated images with the average LPIPS score as proposed in [35].
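
The bookkeeping behind these metrics can be sketched as below. The detection threshold, averaging the confidence over detected images only, and the `distance` callback are assumptions made for this example; in the paper the pairwise distance is LPIPS, which requires a pre-trained network and is replaced here by a simple placeholder.

```python
import random

def detection_stats(scores, threshold=0.5):
    """Detection accuracy and average confidence.

    scores: best person-detection confidence per generated image (0.0 if none).
    threshold and averaging over detections only are illustrative assumptions.
    """
    detected = [s for s in scores if s >= threshold]
    accuracy = len(detected) / len(scores)
    avg_conf = sum(detected) / len(detected) if detected else 0.0
    return accuracy, avg_conf

def diversity_score(samples, distance, n_pairs=19, seed=0):
    """Average pairwise distance over randomly drawn pairs of distinct samples."""
    rng = random.Random(seed)
    pairs = [rng.sample(range(len(samples)), 2) for _ in range(n_pairs)]
    return sum(distance(samples[i], samples[j]) for i, j in pairs) / n_pairs

def l1(a, b):
    """Placeholder distance standing in for a perceptual metric such as LPIPS."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)
```

If a model ignores its latent code, all samples for one pose are nearly identical and the diversity score approaches zero, regardless of which distance is plugged in.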

Results: We first discuss the instance control results shown in Tab. 1. As indicated by the re-identification confidence, without the latent code consistency loss the network cannot follow the target style image well. The classification loss helps to slightly improve the style preservation. In addition, both the latent code consistency loss and the classification loss help to improve realism, as reflected by the decrease in detection accuracy and confidence when either of the two is absent. Fig. 3 shows that even though the version without the classification loss can produce satisfactory human images and preserves the color correctly, it does not maintain the clothing type.

We now analyze the results in terms of conditional sampling. The low diversity score (0.0001) for the network trained without the latent code consistency loss indicates that the provided latent codes are ignored during inference. By adding the latent code consistency loss, our full network can produce diverse samples with a much higher diversity score of 0.0700. Notably, the network without the classification loss achieves an even higher diversity score (0.1240). However, this is due to the unrealistic samples produced by this network, which is confirmed in the qualitative results and is also reflected in the low person detection scores. Fig. 4 shows that without the classification loss the network can produce diverse results, but these are not consistent with the desired type of clothing.

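Method                             Detection accuracy   Detection confidence   Re-identification confidence
w/o latent code consistency loss   61.4%                86.6%                  0.382
w/o classification loss            90.7%                91.9%                  0.642
Ours                               94.3%                94.4%                  0.665
Table 1: Ablation study for instance control.

Method                             Detection accuracy   Detection confidence   Diversity (LPIPS)
w/o latent code consistency loss   61.4%                86.4%                  0.0001
w/o classification loss            80.7%                81.3%                  0.1240
Ours                               93.9%                94.2%                  0.0700
Table 2: Ablation study for conditional sampling.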

4.4 Latent Space Visualization

Fig. 5 visualizes the learned latent space, indicating that the latent codes indeed capture clothing information. We encode all of our training images that depict t-shirts into latent codes and project them to 2D via t-SNE [20] for visualization. The plot illustrates that latent codes cluster by clothing and texture, and not by pose, which indicates that pose and clothing are indeed disentangled.

Figure 5: Visualization of the latent space.

4.5 Nearest Neighbor Analysis

A nearest neighbor analysis based on the structural similarity (SSIM) [29] metric demonstrates that our learned model does not simply memorize the training data (see Fig. 6).
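
A nearest neighbor lookup of this kind can be sketched as follows. For brevity the example computes a single global SSIM value over flattened grayscale images, whereas the usual SSIM implementation averages the same expression over local windows; the constants follow the commonly used defaults, and the function names are assumptions.

```python
def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM over two flattened grayscale images of equal length."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx * mx + my * my + c1) * (vx + vy + c2))

def nearest_training_image(generated, training_set):
    """Index and SSIM of the training image most similar to a generated image."""
    scores = [global_ssim(generated, t) for t in training_set]
    best = max(range(len(training_set)), key=scores.__getitem__)
    return best, scores[best]
```

A best score clearly below 1 for a generated sample suggests that it is not a copy of any training image, which is the property the analysis is meant to check.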

Figure 6: Nearest training images to our generations.
Figure 7: User ratings, without and with blurred faces.

4.6 User Ratings

Finally, we assess the potentially most important metric to quantify the overall performance of the proposed approach. To better understand if the generated images look realistic we conduct a large-scale perceptual study. In this experiment we randomly sample images from the training dataset (real) and from synthetic samples generated by our model (synth). To isolate the influence of the facial region (which we do not treat specifically), we separate the participants into two groups, where the first group judges images with full facial information and the second group judges images with blurred faces.

In total we asked participants to judge images ( real / synth). In a forced-choice setting, participants had to decide whether an image is real or synthetic. The participants did not know the true distribution of synthetic and real images and were not given any other instructions. Participants were recruited via mailing-lists. Fig 7 summarizes the results. With faces visible, synthetic images were rated as real with a fool-rate of which is significantly higher than previous work [16]. When removing the influence of the facial region via face blurring, this result improves to a fool-rate of . This indicates that our model indeed synthesizes realistic samples and improves the perceived image quality over prior work.
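
The fool-rate used above, i.e., the fraction of synthetic images that participants rated as real, can be computed as in the following sketch. The response format and the toy data are assumptions for illustration and do not reproduce the study's results.

```python
def fool_rate(responses):
    """Fraction of synthetic images that were judged to be real.

    responses: iterable of (is_synthetic, judged_real) boolean pairs,
    one per rating in the forced-choice study.
    """
    judged = [judged_real for is_synthetic, judged_real in responses if is_synthetic]
    return sum(judged) / len(judged)

# Toy ratings (made up): four synthetic images, two of them judged real.
toy_responses = [
    (True, True), (True, False), (True, True), (True, False),
    (False, True), (False, True),
]
toy_rate = fool_rate(toy_responses)
```

Ratings of real images are ignored by this measure; they only matter for checking that participants do not simply label everything as real.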

Figure 8: Samples drawn from the latent space. Each row is conditioned on a particular pose and clothing style.

4.7 Qualitative Results

As discussed in Section 3.2 and experimentally verified above, the proposed architecture can generate synthetic images, conditioned on either an example image of the target style or a class label. Fig. 8 illustrates that the architecture generates images of sufficient quality and is capable of producing samples with significant intra-class variation. Fig. 9 illustrates both concepts via a number of example sequences where each row shows images generated conditioned on a target pose and a specific clothing style. The samples contain significant diversity in both color and texture, while adhering to the pose guidance given by the input. In terms of image quality, we can see that the synthetic samples accurately capture the body pose information and generally appear realistic. The main challenge stems from the facial regions where neither the part model nor the class label provides any guidance and simultaneously the training data contains a lot of variation. One possibility to alleviate this issue could be to employ facial landmark annotations as shown in Fig. 10.

Figure 9: More examples of instance control: given an image of a person, the model generates a new image of a person wearing this specific clothing in a particular pose.
Figure 10: Improving synthetic faces via landmarks.
Figure 11: Qualitative results of photo2painting.

4.8 Limitation and Failure Cases

Fig. 12 depicts several challenging examples and failure cases. The main reasons for unnatural synthetic images are uncommon or entirely unseen poses or viewing angles. Due to the unpaired nature of the proposed training regime, this kind of problem may be alleviated via the addition of more diverse training data.

Figure 12: Failure cases due to uncommon poses or viewing angles.

4.9 Multi-modal Conditional Image Translation

The focus of our work is generative modelling of human images. However, the architecture is general and can also be utilized in other application domains that need to synthesize images controlled by guidance that is specified in a different domain. Recent work has tackled the related problem of multi-modal image translation [10]. We demonstrate a photo-to-painting example in Fig. 11. We use two styles of paintings and real photos from the dataset proposed in [34]. Whereas [10] train one network per style, we train only a single network for style translation. Furthermore, we show that we can control the style and texture of our generated paintings by maintaining content and texture from the input while translating the style from the reference (Fig. 11, a). In addition, we demonstrate the capability to sample from the latent space, varying the texture of the generated image within the desired style (Fig. 11, b). Neither of these two functionalities can be achieved with direct image-to-image approaches such as CycleGAN.

5 Conclusion

In this paper we have introduced the task of purely pose-guided generative modelling of humans in clothing. This task is challenging because it goes beyond the traditional setting of image-to-image translation. Given only a sketch of the human pose as input, we seek to generate realistic looking images that accurately depict the desired pose, and either a specific outfit (red dress), or to generate a number of images that depict the pose and variations of a class of outfit (different dresses). We have contributed a novel architecture that can generate high-quality images in an unpaired setting, while providing either direct instance or class-level control over the depicted style of clothing. We have experimentally shown that the proposed architecture is capable of creating realistic images and an ablative study showed the contributions of the different components of the architecture. A large-scale user-study shows that the generated images are seen as convincing, especially if the faces are blurred. Finally, we have demonstrated that the architecture can also be leveraged for other multi-modal image translation tasks.

Interesting directions for future work include handling of uncommon poses and generation of images under novel viewpoints. Furthermore, it would be interesting to extend the task to the temporal domain, which would also require modelling of the dynamics of the human body and its interactions with the non-rigid deformation of clothing.