360-Degree Textures of People in Clothing from a Single Image

08/20/2019 ∙ by Verica Lazova, et al. ∙ Max Planck Society

In this paper we predict a full 3D avatar of a person from a single image. We infer texture and geometry in the UV-space of the SMPL model using an image-to-image translation method. Given partial texture and segmentation layout maps derived from the input view, our model predicts the complete segmentation map, the complete texture map, and a displacement map. The predicted maps can be applied to the SMPL model in order to naturally generalize to novel poses, shapes, and even new clothing. In order to learn our model in a common UV-space, we non-rigidly register the SMPL model to thousands of 3D scans, effectively encoding textures and geometries as images in correspondence. This turns a difficult 3D inference task into a simpler image-to-image translation one. Results on rendered scans of people and images from the DeepFashion dataset demonstrate that our method can reconstruct plausible 3D avatars from a single image. We further use our model to digitally change pose and shape, swap garments between people and edit clothing. To encourage research in this direction we will make the source code available for research purposes.


1 Introduction

3D models of humans, comprising personalized surface geometry and full texture, are required for numerous applications including VR/AR, gaming, entertainment and human tracking for surveillance. Methods capable of recovering such 3D models from a single image would democratize the acquisition process and allow people to easily recover avatars of themselves.

While there is extensive work on 3D human pose and shape recovery (surface geometry) from a single image, almost no work addresses the problem of predicting a complete texture of a person from a single image. We humans can do it to some degree, because we can guess what the person might look like from another view. We make these guesses effortlessly because we have seen people from many angles, and we have built a mental model of typical correlations. For example, the texture patterns within a garment are often repetitive; skin color and hair are roughly homogeneous; and the appearance of the left and right shoe is typically the same.

In this work, we introduce a model that automatically learns such correlations from real appearances of people. Specifically, our model predicts a full (360°) texture map of the person in the UV space of the SMPL model [41] given a partial texture map derived from the visible part in an image, see Fig. 1. Instead of learning an image-based model that has to generalize to every possible pose, shape and viewpoint [30, 37, 43], our idea is to learn to complete the full 3D texture and generalize to new poses, shapes and viewpoints with SMPL and traditional rendering (Figure 9). Learning in a common UV-space requires texture maps of people in clothing in good correspondence, which is a highly non-trivial task. To that end, we non-rigidly deform the SMPL body model to 4541 static 3D scans of people in different poses and clothing. This effectively brings the 3D appearances of people into a common 2D UV-map representation, see Fig. 2. Learning from this data has several advantages. First, since every pixel in a UV-map corresponds to one location on the body surface, it is easier for a neural network to implicitly learn localized and specialized models for particular parts of the body surface. For example, our model learns to complete different textures for the different garments, the face region and the skin, with clearly defined boundaries. Second, at test time, once the texture map is recovered, we can apply it to SMPL and animate it directly in 3D with any pose, shape and viewpoint to generate a coherent movement and appearance, without the risk of failing to generalize to these factors.

Additionally, our model can predict, on the SMPL UV map, a full clothing segmentation layout, from which we predict a plausible clothing geometry of the garments worn by the subject. The latter is a highly multi-modal problem where many clothing geometries can explain the same segmentation layout. However, as observed for other tasks [33], an image translation network is capable of producing a plausible output. To allow for further control, we can additionally edit the predicted segmentation layout in order to control the shape and appearance of the clothing. In particular, we can modify the length of the sleeves in shirts and t-shirts, and interpolate from shorts to pants and the other way around.

We train and evaluate our method on our newly created dataset (360 People Textures) consisting of 4541 scans registered to a common SMPL body template. Our results generalize to real images and show that our model is capable of completing textures of people, often reproducing clothing boundaries and completing garment textures and skin. In summary, our model can reconstruct a full 3D animatable avatar from a single image, including texture, geometry and segmentation layout for further control.

2 Related work

Pose guided image and video generation. Given a source image and a target 2D pose, image-based methods [43, 44, 22, 59, 49] produce an image with the source appearance in the target pose. To deal with pixel misalignments, it is helpful to transform the pixels in the original image to match the target pose within the network [60, 11, 42]. Following similar ideas to growing GANs [36], high resolution anime characters can be generated [29]. Disocclusions and out-of-plane rotations are a problem for such methods.

Sharing our goal of incorporating more information about the human body during the generation process, some works [37, 49] also leverage the SMPL body model. By conditioning on a posed SMPL rendering, realistic images of people in the same pose can be generated [37]; the model, however, cannot maintain the appearance of the person when the pose changes. Recent approaches [49, 26] leverage DensePose [27] to map a source appearance to the SMPL surface, and condition the generated image on the target DensePose mask. Although their goal is generating an image, the model generates a texture map on the SMPL surface as an internal representation. During training, multiple views of the same subject are mapped to a custom designed UV-map, and the network is forced to complete the texture on the visible parts. While training from images is practical, DensePose was not designed to accommodate hair and clothing deviating from the body, and significant misalignments might occur in the UV-map when DensePose fails or is inaccurate. Consequently, texture map completion results are limited by the ability of DensePose to parse images during training. We use DensePose solely to create the partial texture map from the input view, but train from high quality aligned and complete texture maps, which are obtained by registering SMPL to 3D scans. This has several advantages. By virtue of the registration, the textures are very well aligned in the SMPL-UV space, and cover the full extent of the appearance including clothing and hair. This allows us to generalize beyond poses seen in a particular dataset such as DeepFashion [40]. Furthermore, we go beyond [49, 26] and predict a full 3D textured avatar including geometry and 3D segmentation layout for further control.

While image-based methods are able to generate plausible images, many applications (VR/AR, gaming) require a full 3D representation of the person. Traditional graphics rendering could potentially be replaced by querying such models for every desired novel viewpoint and pose. Unfortunately, image-based models do not produce temporally coherent results for sequences of target poses. Recent works focus on video generation [18, 11, 73, 58], but these models are often trained specifically for a particular subject. Furthermore, the formation of more complex scenes with objects or multiple people would be a challenge for image-based methods.

Try-on and conditional clothing synthesis. A growing number of recent works focus on conditioning the person image generation on a particular clothing item [30, 66], or a description [78]. Other works demonstrated models capable of swapping the appearances between two different subjects [54, 75].

Multi-view texture generation. Texture generation is challenging even in the case of multi-view images. The difficulty lies in combining the partial textures created from different views by using blending [13, 20, 51, 61], mosaicing [12, 39, 50, 55], graph cuts [38], or flow based registration [21, 14, 65, 23, 76, 7] in order to reduce ghosting and stitching artifacts. A learning based model that leverages multi-view images for training [49, 26] will suffer similar problems. Hence, in this work, we learn from complete texture maps obtained from 3D registrations.

3D person reconstruction from images. While promising, recent methods for 3D person reconstruction either require video [6, 7, 8] or scans [74] as input, do not allow control over pose, shape and clothing [48, 56], focus only on faces [72, 32, 63, 57, 47, 62], or only on garments [68].

3 Method

We introduce an image-to-image translation method for inferring texture and geometry of a person from one image. Our method predicts a full texture map, a displacement map and a full clothing segmentation for additional control. Our models are trained on a large number of high quality, diverse 3D scans of people in clothing.

Figure 2: 360 People Textures: Example registrations and texture maps used to train our model. After registering SMPL to the scans their appearances are encoded as texture maps in a common UV-space.

3.1 Synthetic data generation

To generate our training set, we take 4541 static scans and non-rigidly register the SMPL template to them to obtain complete textures aligned in the SMPL UV space. We briefly review SMPL and explain our non-rigid registration procedure in Section 3.1.1.

3.1.1 SMPL Body Model with Clothing

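SMPL is a function $M(\theta, \beta)$ that takes as input pose parameters $\theta$ and shape parameters $\beta$ and outputs the vertices of a mesh. SMPL applies deformations to the mean shape $\bar{T}$ using a learned skinning function:

$$M(\theta, \beta) = W\big(T(\theta, \beta),\, J(\beta),\, \theta,\, \mathcal{W}\big) \tag{1}$$

$$T(\theta, \beta) = \bar{T} + B_s(\beta) + B_p(\theta) \tag{2}$$

where $W$ is a linear blend-skinning function with blend weights $\mathcal{W}$ applied to the morphed shape $T(\theta, \beta)$ based on the skeleton joints $J(\beta)$; the morphed shape is obtained by applying pose-dependent deformations $B_p(\theta)$ and shape-dependent deformations $B_s(\beta)$. The shape space was learned from undressed scans of people and therefore cannot accommodate clothing. Since our goal is registering the SMPL surface to scans of people in clothing, we modify SMPL by adding a set of offsets $D$ to the template:

$$M(\theta, \beta, D) = W\big(T(\theta, \beta, D),\, J(\beta),\, \theta,\, \mathcal{W}\big) \tag{3}$$

$$T(\theta, \beta, D) = \bar{T} + B_s(\beta) + B_p(\theta) + D \tag{4}$$

which are responsible for explaining clothing, hair and details beyond the shape space of SMPL.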
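As a rough illustration of the offset idea described above, the sketch below builds the clothed template as a per-vertex sum of the mean shape, the shape-dependent and pose-dependent deformations, and the additional offsets. It stops before skinning, and every name and number in it (`morphed_template`, the two-vertex toy mesh) is a hypothetical stand-in rather than part of the SMPL code base.

```python
def morphed_template(mean_shape, shape_def, pose_def, offsets):
    """Per-vertex sum: mean shape + shape deformation + pose deformation + offsets.

    Every argument is a list of (x, y, z) tuples with one entry per vertex.
    """
    n = len(mean_shape)
    if not (len(shape_def) == len(pose_def) == len(offsets) == n):
        raise ValueError("all inputs need one 3D entry per vertex")
    return [
        tuple(
            t + s + p + d
            for t, s, p, d in zip(mean_shape[i], shape_def[i], pose_def[i], offsets[i])
        )
        for i in range(n)
    ]


# Toy mesh with two vertices; the values are made up for illustration only.
mean_shape = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shape_def = [(0.0, 0.1, 0.0), (0.0, 0.1, 0.0)]
pose_def = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.05)]
offsets = [(0.0, 0.0, 0.02), (0.0, 0.0, 0.0)]  # clothing pushes vertex 0 outwards
clothed = morphed_template(mean_shape, shape_def, pose_def, offsets)
```

Setting all offsets to zero recovers the unclothed morphed shape, which is why the offsets can be read as "everything SMPL's shape space cannot explain".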

Non-rigid registration of humans (Figure 3) is challenging even for scans without clothing [16, 53]. The scans in 360 People Textures include subjects in clothing scanned in a variety of poses. Registration without a good initialization of pose and shape would fail. To obtain a good initialization, we render the scans in $K$ camera viewpoints around the vertical axis and run a 2D pose joint detector [17] in each of the views. Then, we minimize the re-projection error between SMPL joint positions and the detected joints with respect to pose and shape:

$$E_{\text{init}}(\theta, \beta) = \sum_{k=1}^{K} \big\lVert \pi_{c_k}\big(J_{3D}(\theta, \beta)\big) - j_k \big\rVert^2 + E_{\text{prior}}(\theta, \beta) \tag{5}$$

$$(\theta^{*}, \beta^{*}) = \arg\min_{\theta, \beta} E_{\text{init}}(\theta, \beta) \tag{6}$$

where $\pi_{c_k}$ projects the 3D joints $J_{3D}(\theta, \beta)$ onto the view of camera $c_k$, $j_k$ are the 2D joints detected in that view, and $E_{\text{prior}}$ is a Mahalanobis distance prior that we use to regularize pose and shape. This brings the model close to the scan, but it is not registered at this point because SMPL cannot explain details such as clothing, hair or jewellery. Let $S$ denote a scan we want to register to, and $R$ be the registration mesh with free-form vertices $r_i$ and faces defining the same topology as SMPL. We obtain a registration by minimizing the following objective function:

$$E(R, \theta, \beta) = \sum_{s \in S} \rho\big(\operatorname{dist}(s, R)\big) + \sum_{i} w_i \big\lVert r_i - M_i(\theta, \beta) \big\rVert^2 + E_{\text{prior}}(\theta, \beta) \tag{7}$$

where the first term measures the distance from every point $s$ in the scan to the surface of the registration $R$, the second term forces the registration vertices $r_i$ to remain close to the model vertices $M_i(\theta, \beta)$, and the remaining term is the aforementioned prior. Here, $\rho$ denotes a Geman-McClure robust cost function, and $w_i$ are the weights that penalize deviations from the model more heavily for the vertices on the hands and the feet. The first term in Eq. (7) pulls the registration to the scan, while the second penalizes deviations from the body model, ensuring the registration looks like a human. This allows us to accurately bring all scans into correspondence.
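To make the two competing registration terms concrete, here is a minimal sketch of a Geman-McClure cost together with a toy registration energy. It rests on several assumptions: the scaled form `sigma^2 * r^2 / (sigma^2 + r^2)` is one common variant of the robust cost, the point-to-surface distance is approximated by the distance to the nearest registration vertex, the priors are left out, and all names and numbers are hypothetical.

```python
import math


def geman_mcclure(residual, sigma=1.0):
    """Robust cost: roughly quadratic near zero, saturating towards sigma**2."""
    r2 = residual * residual
    s2 = sigma * sigma
    return s2 * r2 / (s2 + r2)


def registration_energy(scan_points, reg_vertices, model_vertices, weights, sigma=0.1):
    """Data term (scan to registration) plus coupling term (registration to model)."""
    # Data term: robustified distance from each scan point to the registration.
    # Nearest-vertex distance is used here as a crude proxy for point-to-surface.
    data = sum(
        geman_mcclure(min(math.dist(s, r) for r in reg_vertices), sigma)
        for s in scan_points
    )
    # Coupling term: weighted squared distance between corresponding vertices.
    coupling = sum(
        w * math.dist(r, m) ** 2
        for w, r, m in zip(weights, reg_vertices, model_vertices)
    )
    return data + coupling


model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
scan = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.0)]
weights = [1.0, 5.0]  # hypothetical: the second vertex is treated as a hand/foot vertex
energy_at_model = registration_energy(scan, model, model, weights)
```

Because the robust cost saturates, a few scan points that lie far from the body (for example loose hair or scanning noise) contribute a bounded penalty instead of dominating the fit.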

Figure 3: Registration: we bring all scan appearances into correspondence by non-rigidly deforming the SMPL model. This allows us to learn in a pose invariant space.

3.2 3D Model Generation

Our full pipeline for 3D model generation from an image is shown in Figure 1 and consists of three stages: texture completion, segmentation completion and geometry prediction. Given an image of a person and its garment segmentation, we extract a partial texture map as well as a partial segmentation map. We combine inpainting models to complete the garment segmentation and the texture in UV-space. The texture completion part recovers the full appearance of the person in the image. The completed segmentation is further used to predict the geometry. To obtain geometry details we predict a displacement map from the segmentation, which gives us a set of offsets from the average SMPL [41] body model that correspond to clothing geometry. All three parts are trained separately and allow us to go from a single image to a 3D model. Additionally, the complete segmentation map allows us to edit the final model by changing garment properties such as length or texture, or to completely redress the person.

3.2.1 Texture Completion

The texture completion part recovers the full appearance of the person in the image. We use DensePose [27] to find pixel correspondences between the image and the SMPL body model, which we use to remap a partial texture map of the person in the SMPL texture template (Figure 1). We use an inpainting network based on image-to-image translation [33, 19, 67] that learns to complete the partial texture map. The inaccuracies of texture extraction with DensePose introduce distortions in the partial texture map. As a result, the input and the output image for the texture completion network are not in perfect alignment. This makes our task different from classical image inpainting, where the network can learn to copy the visible parts of the image to the target. We use an architecture for the generator based on residual blocks similar to [19, 77, 67]. We refer the reader to the supplementary material for the detailed description of the network layers.
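The remapping step can be pictured as a scatter operation: every foreground pixel that has a surface correspondence writes its colour into the texture image at the corresponding UV location. The sketch below assumes a single UV atlas with coordinates in [0, 1] and `None` for background pixels; actual DensePose output is organised per body part, so the function name, the input layout and the toy image are simplifying assumptions.

```python
def extract_partial_texture(image, uv_coords, tex_size):
    """Scatter image colours into a tex_size x tex_size UV texture.

    image:     H x W nested list of RGB tuples.
    uv_coords: H x W nested list of (u, v) in [0, 1], or None for background.
    Returns the partial texture and a visibility mask of the filled texels.
    """
    texture = [[(0, 0, 0)] * tex_size for _ in range(tex_size)]
    visible = [[False] * tex_size for _ in range(tex_size)]
    for row_px, row_uv in zip(image, uv_coords):
        for color, uv in zip(row_px, row_uv):
            if uv is None:  # pixel does not belong to the person
                continue
            u, v = uv
            col = min(int(u * tex_size), tex_size - 1)
            row = min(int(v * tex_size), tex_size - 1)
            texture[row][col] = color
            visible[row][col] = True
    return texture, visible


# Toy 2 x 2 image: three person pixels and one background pixel.
image = [[(255, 0, 0), (0, 255, 0)], [(0, 0, 255), (9, 9, 9)]]
uv = [[(0.0, 0.0), (0.99, 0.0)], [None, (0.5, 0.5)]]
partial_texture, mask = extract_partial_texture(image, uv, tex_size=4)
```

Texels whose mask entry stays `False` are exactly the ones the inpainting network has to fill in.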

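We train the inpainting network using the conditional Generative Adversarial Network (GAN) approach with a fully-convolutional patch discriminator to discriminate between real and fake texture maps. The partial texture map $x$ is fed through the generator $G$ to infer the full texture map $G(x)$. The conditional discriminator $D$ takes a pair of partial and complete texture maps and classifies them as real or fake, i.e. $D(x, y) = 1$ and $D(x, G(x)) = 0$, where $y$ is the ground-truth complete texture map. We train the generator by optimizing the compound objective $\mathcal{L}_G$ consisting of a GAN, reconstruction, perceptual and dissimilarity term:

$$\mathcal{L}_G = \lambda_1 \mathcal{L}_{GAN} + \lambda_2 \mathcal{L}_{recon} + \lambda_3 \mathcal{L}_{perc} + \lambda_4 \mathcal{L}_{DSSIM} \tag{8}$$

where $\lambda_1, \dots, \lambda_4$ are the weights for each of the terms. We use the Wasserstein GAN loss [10, 28] as our adversarial loss, which produced visually sharper results:

$$\mathcal{L}_{GAN} = \mathbb{E}_{x, y}\big[D(x, y)\big] - \mathbb{E}_{x}\big[D(x, G(x))\big] \tag{9}$$

The reconstruction loss is an $L_1$ distance between the target $y$ and the reconstructed texture map $G(x)$:

$$\mathcal{L}_{recon} = \mathbb{E}_{x, y}\big[\lVert y - G(x) \rVert_1\big] \tag{10}$$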
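The structure of such a compound objective, a weighted sum of an adversarial term and a reconstruction term, can be sketched in a few lines. This is only an illustration under assumptions: the critic scores are plain numbers instead of network outputs, the gradient penalty commonly paired with the Wasserstein loss is omitted, the reconstruction term is taken to be a mean absolute difference, and the weights are hypothetical values rather than the ones used for training.

```python
def mean(values):
    return sum(values) / len(values)


def wgan_critic_loss(scores_real, scores_fake):
    """Loss minimized by the critic: raise scores on real pairs, lower them on fakes."""
    return mean(scores_fake) - mean(scores_real)


def wgan_generator_loss(scores_fake):
    """Loss minimized by the generator: make the critic score fakes highly."""
    return -mean(scores_fake)


def l1_loss(target, prediction):
    """Mean absolute difference between flattened target and prediction."""
    return mean([abs(t - p) for t, p in zip(target, prediction)])


def generator_objective(terms, weights):
    """Weighted sum of the individual loss terms, keyed by name."""
    return sum(weights[name] * value for name, value in terms.items())


target = [0.0, 1.0, 2.0]
prediction = [0.0, 2.0, 4.0]
terms = {
    "gan": wgan_generator_loss([0.0, 1.0]),
    "recon": l1_loss(target, prediction),
}
weights = {"gan": 1.0, "recon": 10.0}  # hypothetical weights
total = generator_objective(terms, weights)
```

Further terms, such as a perceptual or a structural dissimilarity loss, would enter the same dictionary with their own weights.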

Additionally, we use a perceptual loss, which involves matching deep features between the target $y$ and the reconstructed image $G(x)$. The features are extracted from the convolutional layers of a pretrained VGG-19 neural network, $\phi_l(y)$ and $\phi_l(G(x))$ for each layer $l$. The perceptual loss is the distance between features extracted from the target and the generated texture image:

$$\mathcal{L}_{perc} = \sum_{l} \big\lVert \phi_l(y) - \phi_l(G(x)) \big\rVert \tag{11}$$

Finally, as another perceptual term we minimize the dissimilarity index between the target and generated texture map:

$$\mathcal{L}_{DSSIM} = \frac{1 - \mathrm{MSSIM}\big(y, G(x)\big)}{2} \tag{12}$$

where $\mathrm{MSSIM}$ is a Multiscale Structural Similarity index [69, 70], a structural similarity computed at 5 different scales. Please check the supplementary material for more details on the effect of each of the loss functions.
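For intuition about the dissimilarity term, the sketch below computes a structural dissimilarity from a single, global SSIM value. This is a deliberate simplification: the index described above is multiscale and SSIM is normally evaluated over local windows, whereas here the statistics are taken over the whole signal at one scale. The constants follow the usual SSIM convention (0.01 and 0.03 times the dynamic range, squared), and the halved form `(1 - SSIM) / 2` is the common definition of structural dissimilarity, assumed here.

```python
def ssim_global(x, y, dynamic_range=1.0):
    """Single-scale SSIM computed from global statistics of two flat signals."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1 = (0.01 * dynamic_range) ** 2  # stabilizes the luminance comparison
    c2 = (0.03 * dynamic_range) ** 2  # stabilizes the contrast/structure comparison
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def dssim(x, y):
    """Structural dissimilarity: 0 for identical signals, larger when they differ."""
    return (1.0 - ssim_global(x, y)) / 2.0


texture_a = [0.1, 0.5, 0.9, 0.3]
texture_b = [1.0 - value for value in texture_a]  # intensity-inverted copy
same = dssim(texture_a, texture_a)
inverted = dssim(texture_a, texture_b)
```

Unlike a per-pixel distance, this term reacts to changes in local structure (here the inverted pattern) rather than to small uniform intensity shifts.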

3.2.2 Segmentation completion

Similar to the texture completion stage, the segmentation completion starts with a partial clothing segmentation obtained from a single image, and the task is to obtain a valid full body clothing segmentation in UV space. An image-to-image translation model is trained to inpaint the missing part of the segmentation. The architecture for generator and discriminator is the same as in the texture completion network. We train the segmentation completion network by minimizing the loss:

$$\mathcal{L}_{seg} = \lambda_1 \mathcal{L}_{GAN} + \lambda_2 \mathcal{L}_{recon} \tag{13}$$

where $\mathcal{L}_{GAN}$ and $\mathcal{L}_{recon}$ are the adversarial and reconstruction losses defined in Equations 9 and 10, and $\lambda_1$, $\lambda_2$ are their weights. Finally, the inpainted segmentation is discretized by assigning a label to each pixel in a nearest-neighbour fashion.
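The nearest-neighbour discretization can be sketched as follows: each predicted pixel, which may be an in-between colour after inpainting, is snapped to the label whose reference colour is closest. The palette, its label names and the squared Euclidean distance in RGB are assumptions made for this example.

```python
def discretize_segmentation(pred, palette):
    """Assign to every pixel the label with the closest reference colour.

    pred:    H x W nested list of RGB tuples (floats).
    palette: dict mapping label name to its reference RGB tuple.
    """
    def nearest_label(color):
        return min(
            palette,
            key=lambda label: sum((c - p) ** 2 for c, p in zip(color, palette[label])),
        )

    return [[nearest_label(px) for px in row] for row in pred]


# Hypothetical palette and a 1 x 2 "soft" prediction.
palette = {
    "skin": (1.0, 0.0, 0.0),
    "shirt": (0.0, 0.0, 1.0),
    "background": (0.0, 0.0, 0.0),
}
pred = [[(0.9, 0.1, 0.0), (0.1, 0.2, 0.8)]]
labels = discretize_segmentation(pred, palette)
```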

3.2.3 Displacement map prediction

The final step of our pipeline is geometry prediction, whose purpose is to generate geometry that corresponds to the clothes the person is wearing. To capture the clothing shape, we learn to generate vertex offsets from the average SMPL body conditioned on garment type, i.e., the segmentation. The offsets are stored as a displacement map, where each pixel location corresponds to a point on the SMPL model, and each pixel stores the normalized offset at that body location. This allows us to treat the displacements as images and apply the image-to-image translation method we described previously. A similar approach is used in [9], where the authors learn full displacement and normal maps from a partial texture. Our intent is to learn a model which is able to sample plausible geometry that fits the clothing segmentation. The predicted geometry might not exactly coincide with the image (e.g., it is impossible to know from a segmentation in UV-space whether a T-shirt is tight or loose). Nevertheless, relying on the segmentation has advantages: it provides additional flexibility, making editing and adjusting the final 3D model straightforward. We discuss some of the model editing possibilities in Section 4.

As in the previous stages, the model for displacement prediction consists of a generator that takes a segmentation map as input and produces a displacement map at the output. The generator and the discriminator have the same architecture as in the texture completion scenario. We train the model by minimizing the objective:

$$\mathcal{L}_{disp} = \lambda_1 \mathcal{L}_{GAN} + \lambda_2 \mathcal{L}_{recon} + \lambda_3 \mathcal{L}_{DSSIM} \tag{14}$$

where $\mathcal{L}_{GAN}$, $\mathcal{L}_{recon}$ and $\mathcal{L}_{DSSIM}$ are defined in Eq. 9, 10 and 12, and $\lambda_1, \lambda_2, \lambda_3$ are their weights.

The generated displacement map applied to the average SMPL body yields the full untextured 3D model. Applying the generated texture from the texture completion part to the model gives us the full 3D avatar (Figure 1).
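Applying a displacement map to a body mesh amounts to a texture lookup per vertex followed by an addition. The sketch below assumes that each texel stores an offset normalized to [0, 1] with 0.5 meaning "no displacement", that `max_offset` is the largest absolute offset used for normalization, and that nearest-texel lookup is good enough; these conventions and all names are hypothetical.

```python
def apply_displacement_map(vertices, uvs, disp_map, max_offset):
    """Add to each vertex the de-normalized offset stored at its UV location.

    vertices: list of (x, y, z) body vertices.
    uvs:      list of (u, v) texture coordinates in [0, 1], one per vertex.
    disp_map: H x W nested list of (dx, dy, dz) values normalized to [0, 1].
    """
    height, width = len(disp_map), len(disp_map[0])
    displaced = []
    for (x, y, z), (u, v) in zip(vertices, uvs):
        col = min(int(u * width), width - 1)
        row = min(int(v * height), height - 1)
        # Map [0, 1] back to [-max_offset, +max_offset].
        dx, dy, dz = ((c - 0.5) * 2.0 * max_offset for c in disp_map[row][col])
        displaced.append((x + dx, y + dy, z + dz))
    return displaced


# Toy 1 x 2 displacement map: the first texel moves the vertex, the second does not.
disp_map = [[(1.0, 0.5, 0.0), (0.5, 0.5, 0.5)]]
body = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
uvs = [(0.0, 0.0), (0.9, 0.0)]
clothed_body = apply_displacement_map(body, uvs, disp_map, max_offset=0.1)
```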

4 Experiments and Results

In the following section we give more details on our 360 People Textures dataset and describe our experiments. For details regarding the training process please check the supplementary material. We attempted a quantitative evaluation; however, the typical metrics for evaluating GAN models known in the literature did not correspond to the human-perceived quality of the generated texture. Therefore, here we only present our qualitative results.

Figure 4: Generalization of our method to real images. Left to right: real image, segmented image, complete texture map, complete segmentation map, generated untextured avatar, generated textured avatar.
Figure 5: Comparison of our single view method (bottom row) to the monocular video method of [6] (top row)

4.1 Dataset

We trained our models on a dataset of 4541 3D scans: 230 come from Renderpeople [1]; 116 from AXYZ [2]; 45 are courtesy of Treedy’s [4] and the remaining 4150 are courtesy of Twindom [3]. We used all of them to train the texture inpainting part, plus an additional 242 texture maps from the SURREAL [64] dataset. Each of the scans is rendered from 10 roughly frontal views, with a random horizontal rotation of the body. Out of these we used 40 scans for validation and 10 for testing. Additionally, we test on the People Snapshot Dataset [8] and on 526 preselected images from the DeepFashion [40] dataset that satisfy the criteria of a fully visible person seen from a roughly frontal view, without overly complex clothing or poses. We use DensePose to extract the partial texture maps, which we aim to complete.

In order to obtain a full segmentation map, we first render the scan from 20 different views. For each of the rendered images we run the clothing segmentation method of [25]. Each of the segmented images is projected back to the scan and remapped into a partial segmentation UV-map. All the partial segmentation maps are then stitched together using a graph-cut based method to finally obtain the full segmentation map. The segmentation maps obtained in this way are used as ground truth for the segmentation completion method. The partial segmentation maps used as input are extracted with DensePose.
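To give a feel for the stitching step, the sketch below merges several partial UV segmentation maps with a per-texel majority vote. This is a swapped-in, much simpler technique than the graph-cut based stitching described above: it ignores spatial smoothness entirely and is shown only to illustrate how overlapping partial maps can be fused. Function and label names are hypothetical.

```python
from collections import Counter


def stitch_by_majority(partial_maps, unknown=None):
    """Fuse partial label maps by taking, per texel, the most frequent known label.

    partial_maps: list of H x W nested lists; texels not seen in a view hold `unknown`.
    """
    height, width = len(partial_maps[0]), len(partial_maps[0][0])
    full = [[unknown] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            votes = Counter(m[r][c] for m in partial_maps if m[r][c] != unknown)
            if votes:
                full[r][c] = votes.most_common(1)[0][0]
    return full


# Three toy views of a 1 x 2 UV map; None marks texels not visible in a view.
views = [
    [["shirt", None]],
    [["shirt", "pants"]],
    [["skin", "pants"]],
]
stitched = stitch_by_majority(views)
```

A graph-cut formulation would additionally penalize label changes between neighbouring texels, which tends to produce cleaner garment boundaries than independent voting.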

For the geometry prediction part we have selected a subset of 2056 scans for which we could obtain good ground truth segmentation and displacement maps. Out of these 30 scans were taken out for validation. We obtain the segmentation as explained above and the ground truth displacement maps are obtained directly from the scan registrations.

4.2 Results

Figure 4 shows sample results of going from real images to 3D models on the People Snapshot and DeepFashion datasets. Each row shows the original image, its garment segmentation, the recovered full texture and segmentation, the predicted geometry and the fully textured avatar. Predicting the body shape is not part of our contribution, but an off-the-shelf method such as [52, 35, 15] could easily be deployed to obtain a more faithful representation of the body shape. Additionally, in Figure 5 we show a comparison of our single-view method to the monocular video method of [6]. While our results look comparable and visually pleasing, it should be noted that the method in [6] produces higher resolution reconstructions. The advantage of our method is that it requires just a single view.

The full clothing segmentation map gives us the flexibility to edit and adjust the model in different ways. Using the segmentation, one can change the texture of a specific garment and have it completed in a different style. Figure 6 shows examples where the original T-shirt of the subject is edited with samples of different textures. The network hallucinates wrinkles to make the artificially added texture look more cloth-like.

Figure 6: Segmentation-based texture editing: the 3D model reconstructed from an image is shown in the top row. The subsequent rows show the edited partial texture and the reconstructed model with different texture for the t-shirt

Additionally, the segmentation can be used for clothes re-targeting. Given two partial texture maps from two images, one can rely on the segmentation to exchange garments between the subjects. The edited partial texture and segmentation are then completed, new geometry is produced, and the final result is a redressed avatar. An example of this application is shown in Figure 7.
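One plausible way to set up such a garment exchange in UV space is sketched below: texels labelled with the chosen garment in the donor are copied into the receiver, and the receiver's own texels of that garment that are not covered by the donor are marked as unknown so that the completion networks can fill them in. The exact editing procedure is not spelled out in the text, so this function, its `unknown` marker and the toy maps are assumptions.

```python
def swap_garment(tex_a, seg_a, tex_b, seg_b, label, unknown=None):
    """Return copies of A's partial texture and segmentation wearing B's `label` garment."""
    new_tex = [row[:] for row in tex_a]
    new_seg = [row[:] for row in seg_a]
    for r in range(len(seg_a)):
        for c in range(len(seg_a[0])):
            if seg_b[r][c] == label:
                # Donor garment texel: copy appearance and label.
                new_tex[r][c] = tex_b[r][c]
                new_seg[r][c] = label
            elif seg_a[r][c] == label:
                # Receiver's old garment not covered by the donor: leave to inpainting.
                new_tex[r][c] = unknown
                new_seg[r][c] = unknown
    return new_tex, new_seg


# Toy 1 x 3 maps; strings stand in for texel colours.
tex_a = [["a0", "a1", "a2"]]
seg_a = [["shirt", "shirt", "pants"]]
tex_b = [["b0", "b1", "b2"]]
seg_b = [["shirt", "skin", "shirt"]]
edited_tex, edited_seg = swap_garment(tex_a, seg_a, tex_b, seg_b, "shirt")
```

The inputs are left untouched, so the same pair of subjects can be reused to try several garments.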

Figure 7: Garment re-targeting results: Four subjects are shown in the top row with their 3D reconstruction. In each of the subsequent rows the subjects are reconstructed wearing the selected garments from the person in the first image.

Finally, one can use the segmentation to edit the length of a garment, transforming shorts to pants or a blouse to a T-shirt. We train a separate network that only completes the arms and the legs of the person conditioned on the segmentation. Therefore, once we have reconstructed a model, we can run it through the editing network to change the length of the sleeves or the pants. An example of this scenario is shown in Figure 8.

Figure 8: Model editing results: Models reconstructed from an image are shown in the top row. In each of the subsequent rows the models are edited such that the length of the sleeves and pants matches the segmentation on the left.
Figure 9: Image-based reposing methods cannot handle challenging target poses outside of the training set. First row shows the original scan reposed. The second row shows image based reposing using [60]. In the last row is the full model obtained with our method, reposed. The input image is the first one in each row.

For additional results and a better preview, please check the supplementary material and the video (https://youtu.be/h-HyFq2rYO0).

5 Conclusions and Future Work

We presented a method for completion of full texture and geometry from a single image. Our key idea is to turn a hard 3D inference problem into an image-to-image translation problem that is amenable to CNNs by encoding appearance, geometry and clothing layout in a common SMPL UV-space. By learning in such a space we obtain visually pleasing and realistic 3D model reconstructions. To allow for control we decompose the predictions into factors: texture maps to control appearance, displacement maps for geometry, and a segmentation layout to control clothing shape and appearance. We have demonstrated how to swap clothing between 3D avatars just from the image and how to edit the garment geometry and texture.

There are many avenues for future work. First, our model cannot predict clothing with a topology vastly different from that of the human body, e.g., skirts or dresses. Implicit function based representations [71, 56, 46, 48, 31] might be beneficial to deal with different topologies, but they do not allow control. Although it is remarkable that our model can predict the occluded appearance of the person, the model struggles to predict high frequency detail and complex texture patterns. We have extensively experimented with style-based losses and models [24, 34], but the results are either not photo-realistic, lack control or need to be re-trained per texture. In the future, we would like to investigate models capable of producing photo-realistic textures and complex clothing without losing control.

There is a strong trend in the community towards data-driven image-based models for novel viewpoint synthesis [44, 43], virtual try-on [30, 66], reposing [18] and reshaping with impressive results, suggesting 3D modelling can be by-passed. In this paper, we have demonstrated that full 3D completion and 3D reasoning have several advantages, particularly in terms of control and the ability to generalize to novel poses and views. In the future, we plan to explore hybrid methods that combine powerful 3D representations with neural-based rendering [45].

Acknowledgements: We thank Christoph Lassner, Bernt Schiele and Christian Theobalt for providing insight and expertise at the early stages of the project; Thiemo Alldieck and Bharat Lal Bhatnagar for the in-depth discussions and assistance in our dataset preparation. Special thanks go to Mohamed Omran for contributing an abundance of valuable ideas and most honest critiques.

This work is partly funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 409792180 (Emmy Noether Programme, project: Real Virtual Humans) and the Google Faculty Research Award.

References