Dressing 3D Humans using a Conditional Mesh-VAE-GAN

07/31/2019 ∙ by Qianli Ma, et al. ∙ 0

Three-dimensional human body models are widely used in the analysis of human pose and motion. Existing models, however, are learned from minimally-clothed humans and thus do not capture the complexity of dressed humans in common images and videos. To address this, we learn a generative 3D mesh model of clothing from 3D scans of people with varying pose. Going beyond previous work, our generative model is conditioned on different clothing types, giving the ability to dress different body shapes in a variety of clothing. To do so, we train a conditional Mesh-VAE-GAN on clothing displacements from a 3D SMPL body model. This generative clothing model enables us to sample various types of clothing, in novel poses, on top of SMPL. With a focus on clothing geometry, the model captures both global shape and local structure, effectively extending the SMPL model to add clothing. To our knowledge, this is the first conditional VAE-GAN that works on 3D meshes. For clothing specifically, it is the first such model that directly dresses 3D human body meshes and generalizes to different poses.




1 Introduction

Existing human body models like SMPL [1] successfully capture the statistics of human body shape and pose, but lack something important: clothing. Since most images of people show them in clothing, this causes several problems. For example, body models are used to generate synthetic training data [2], but the lack of clothing leads to a significant domain gap between synthetic and real images of humans. Deep learning methods train a regressor from image pixels to the parameters of the SMPL body model [3, 4, 5, 6, 7, 8], but SMPL does not match the complexity of people in images. Generative approaches that use SMPL for training or for analysis-by-synthesis suffer when the model does not explain the image evidence well. Given the interest and value in 3D body models and the problems above, there is a clear need for a 3D model of clothed bodies.

A good model of a clothed human should be low-dimensional, easy to pose, differentiable, able to represent clothing on different body shapes and poses, and relatively realistic. To our knowledge, no current approach to modeling clothing satisfies these properties. Instead of extracting standalone garments, we choose to model clothing together with the human body, so as to serve the perception tasks mentioned above. Consequently, we use SMPL as a foundation and learn to model a clothing layer as displacements from the SMPL mesh in a canonical pose (Fig. 1). This is similar in spirit to ClothCap [9], which also models clothing as a displacement from the body, but we go further and learn a model of clothing deformation that is conditioned on pose and clothing type. DRAPE [10] is also similar in that clothing is a learned function; in DRAPE, however, the model is learned from simulations, is based on principal component analysis (PCA), is overly smooth, and separate models need to be trained for each garment. We need, instead, a generative model that can be learned (like DRAPE) from real scan data (like ClothCap), can be added to SMPL, captures different types of clothing, and can be animated and fit to data.

To address these needs, we train a novel deep neural network from scan data to capture clothing displacements conditioned on pose and clothing type. Extending deep networks to 3D meshes is a relatively new and active field. Recent advances in convolutional neural networks for graphs [11, 12, 13, 14, 15, 16] enable direct learning on 3D meshes [17, 18, 19] by deforming a pre-defined template mesh. Ranjan et al. [17] define a variational autoencoder (VAE) on meshes [20, 21] using graph convolutions [11]. However, a drawback of VAEs is that they tend to produce overly smoothed results, making them ill-suited to modeling clothing, which has high-frequency wrinkles. Methods such as [18, 19] try to improve over-smoothed results with upsampling techniques, and others use a different representation for local details, such as normal maps [22]. Instead, we exploit generative adversarial networks (GANs) [23, 24], which have proven successful in the generation of realistic images. We introduce a novel Mesh-VAE-GAN for end-to-end learning of a generative model of 3D clothing (see [25] for concurrent work that proposes a similar idea). Generating other types of 3D meshes with our model is also possible by using the respective template meshes, making it a generic mesh-deep-learning framework.

Our Mesh-VAE-GAN uses graph convolutions [11] and mesh sampling [17] as its backbone layers. We use the topology of the 3D SMPL human body model [1] to represent the 3D geometry of clothes. Similarly to [26] and [9], we find it expressive enough to register scans of people in clothing that roughly follows the topology of the body. We condition our model on clothing type and SMPL body pose parameters, and train it on 4D captured data of people wearing different types of clothing in a variety of pose and motion sequences. The result is a parametric generative 3D clothing model, called AutoEnClother, with a low-dimensional latent space from which we can sample and on which we can condition, enabling the generation of a wide variety of clothing. The method is designed to be “plug and play” for many applications that already use SMPL. Dressing SMPL with AutoEnClother yields 3D meshes of people in clothing, which can be used to generate training data, as the output of a network that predicts body pose, as a clothing “prior”, or as part of a generative analysis-by-synthesis approach. In summary, our key contributions are: (1) we introduce a novel conditional Mesh-VAE-GAN architecture; (2) using this, we build a 3D generative clothing model with controlled conditioning based on human pose and clothing type; (3) we augment a popular, minimally-clothed 3D human body model with clothing generated by our model.

2 Related Work

Models for clothing and clothed humans. There is a large literature on clothing simulation [27], which is beyond our scope. Traditional physics simulation requires significant artist or designer involvement and is not practical in the inner loop of a deep learning framework. The most relevant work has focused on computing off-line physics-based simulations and learning efficient data-driven approximations from them [10, 28, 29, 30, 31, 32]. DRAPE [10] learns a model of clothing that allows changing the pose and shape, and Wang et al. [31] allow manipulation of clothing with sketches. Learning from physics simulation is convenient because many examples can be synthesized, they are free of noise, and registration and the factorization of shape into pose, body shape and clothing are given. However, models learned from synthetic data often look unrealistic [10]. Learning from real scans of people in motion is difficult, but opens the door to modeling more realism and diversity.

The first problem with real scans is estimating the shape and pose under clothing, which is required to model how garments deviate from the body. Typically, a single shape is optimized over multiple poses to fit inside multiple clothing silhouettes [33] or a sequence of 3D scans [26, 34]. Here, we use a similar approach [26] to factor a scan into a minimally clothed shape and a clothing layer. Pons-Moll et al. [9] proposed ClothCap, a method to capture a sequence of dynamic scans, encode clothing as a displacement from the SMPL body, and retarget it to new body shapes. Similar to ClothCap, Alldieck et al. [35, 36] represent clothing as an offset from SMPL to reconstruct people in clothing from images. Combining dynamic fusion ideas [37] with SMPL, cloth capture and retargeting have been demonstrated from a single depth camera [38, 39]. Still other work captures garments in isolation from multi-view [40] or single images using a CNN [41]. These are essentially capture methods and cannot generate novel clothing. While a model of static clothing is implicitly learned in [42] from scans, it requires images as input.

Few clothing models learned from real data have been shown to generalize to new poses. Neophytou and Hilton [43] learn a layered garment model on top of SCAPE [44] from dynamic sequences, but generalization to novel poses is not demonstrated. Yang et al. [45] train a neural network to regress a PCA-based representation of clothing, but generalization is only shown on the same sequence or the same subject. Lähner et al. [22] learn a garment-specific pose deformation model by regressing low-frequency PCA components and high-frequency normal maps. While the achieved quality is good, the model is garment-specific and does not provide a solution for full-body clothing. Most importantly, the aforementioned models are regressors that produce single point estimates. In contrast, our model is generative, which allows us to sample clothing. Our motivation for learning a generative model is that clothing shape is intrinsically probabilistic: conditioned on a single pose, multiple clothing deformations are possible. A conceptually different approach to ours is to infer the parameters of a physical model from 3D scan sequences [46], which has been shown to generalize to novel poses. However, the inference problem is difficult and, unlike our model, the resulting physics simulator is not differentiable.

Several recent works recover 3D meshes of clothed people from single- or multi-view images using neural networks [47, 48, 49]. While preserving a high level of detail, the recovered 3D clothing is not parametrized and hence not manipulable. In contrast, our model provides control over pose and clothing type, and can be extended to control other factors.

Generative models for 3D meshes. Generative models for 3D shapes are usually based on PCA [50] or its robust versions [51]. Alternatively, deep learning methods such as variational autoencoders (VAEs) [20] and generative adversarial networks (GANs) [23] have shown state-of-the-art results in generating 2D images [52] and voxels [53]. However, a voxel representation is not well suited to modeling clothing surfaces. Compared to voxels [54, 55, 56, 57] and point clouds [58, 59, 60, 61, 62], meshes are more suitable for 3D clothing data because of their computational efficiency and their flexibility in modeling both global and local information. Generalizing GANs to irregular structures such as graphs and meshes, however, is not trivial.

To deal with irregular structures like graphs, Bruna et al. [13] introduce graph convolutions. Follow-up work [11, 12] extends these graph convolutions, which have been used successfully [17, 63] to learn representations defined on meshes. Verma et al. [63] use feature-steered graph convolutions for 3D shape analysis. Based on this, Litany et al. [64] use mesh VAEs for mesh completion. Ranjan et al. [17] learn a convolutional mesh-VAE using graph convolutions [11] with mesh down- and up-sampling layers [65]. Although it works well for faces, the mesh sampling layer makes it difficult to capture the high-frequency wrinkles that are key to clothing. In our work, we capture high-frequency wrinkles by extending the PatchGAN [66] architecture to handle 3D meshes.

3 Probabilistic Clothing Mesh Generation

3.1 Clothing Representation

SMPL: Layerwise Body Model. SMPL [1] is a generative model of human bodies that factors the surface of the body into shape (β) and pose (θ) parameters. The architecture of SMPL starts with a triangulated template mesh, T̄, in rest pose, defined by N = 6890 vertices. Given shape and pose parameters (β, θ), 3D offsets are added, which correspond to shape-dependent deformations (B_S(β)) and pose-dependent deformations (B_P(θ)). The shape blend shape function, B_S(β), is a weighted sum of linear shape components, learned from data, that models the individual body shape. The pose blend shape function, B_P(θ), models pose-dependent deformations with respect to the template mesh in the rest pose. The resulting mesh is then posed using the skinning function W. Mathematically one can write:

T(β, θ) = T̄ + B_S(β) + B_P(θ),    (1)

M(β, θ) = W(T(β, θ), J(β), θ, 𝒲),    (2)

where the blend skinning function W rotates the rest-pose vertices around the 3D joints J(β) (computed from β), linearly smooths them with the blend weights 𝒲, and returns the posed vertices M(β, θ). The pose θ is represented by a vector of relative 3D rotations of the 23 joints in axis-angle representation, plus one “joint” for global rotation.

The core idea of SMPL is to start with an initial body shape, and then add linear deformation layers to it. The current layers of SMPL are the shape and pose dependent deformations. Following this spirit, we treat clothing as an additional offset layer from the body and add it on top of the SMPL mesh.
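This layered construction can be sketched numerically. The sketch below uses random toy stand-ins for the template, bases and dimensions (not the learned SMPL parameters), and omits the skinning step:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20                  # toy vertex count (SMPL itself uses 6890)
n_shape, n_pose = 10, 72

T_bar = rng.normal(size=(N, 3))        # rest-pose template vertices
S = rng.normal(size=(n_shape, N, 3))   # toy shape blend-shape basis
P = rng.normal(size=(n_pose, N, 3))    # toy pose blend-shape basis

def deformed_template(beta, theta):
    # T(beta, theta) = T_bar + B_S(beta) + B_P(theta): additive offset layers
    B_S = np.tensordot(beta, S, axes=1)   # weighted sum of shape components
    B_P = np.tensordot(theta, P, axes=1)  # pose-dependent corrective offsets
    return T_bar + B_S + B_P

T = deformed_template(rng.normal(size=n_shape), np.zeros(n_pose))
```

Because every layer is additive and linear in its parameters, a zero pose vector contributes nothing, and new layers (such as the clothing layer introduced below) can be stacked the same way.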

Clothing as Offsets from SMPL Body. Representing clothing as offsets from the body [9, 10] is practical for a range of clothing types such as pants and shirts. While it does not account for the real physics of cloth, it makes the process of dressing people a simple addition of these offsets to the minimally-clothed body shape.

One key design choice of our method is that we compute the displacements from the minimally-clothed body to the clothing in a canonical space corresponding to the rest-pose configuration (or zero-pose space). We name this space the unposed space. Here we use a 3D scan of the person in minimal clothing to accurately capture their shape. To compute the displacements, we first register the SMPL model to both scans using [26], considering all vertices as “skin”. We obtain an SMPL mesh capturing the geometry of the scan, the corresponding pose parameters and the unposed mesh, living in the unposed space. For the mathematical details of registration and the unposing operation, we refer the reader to [26]. In the rest of the paper, T^b denotes the unposed “minimal” body mesh vertices, T^c the unposed clothed mesh vertices, and θ the pose of the clothed mesh. The pose of the minimally-clothed body is not used once it is put in the rest pose. The clothing displacements are d = T^c − T^b and correspond to the SMPL mesh topology. We exploit this fixed topology and neighborhood structure during learning. d has non-zero values only on body parts covered with clothes.
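In code, this representation reduces dressing to a masked subtraction and addition in the unposed space. The arrays below are toy stand-ins; registration and unposing are assumed to have happened upstream:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 20                                             # toy vertex count
T_minimal = rng.normal(size=(N, 3))                # unposed minimally-clothed mesh
T_clothed = T_minimal + 0.01 * rng.random((N, 3))  # unposed clothed registration

clothed_mask = np.zeros(N, dtype=bool)
clothed_mask[:12] = True                           # vertices covered by clothing

# Clothing layer: per-vertex displacements, zero outside clothed regions
d = np.where(clothed_mask[:, None], T_clothed - T_minimal, 0.0)

# Dressing the minimal body is then a simple addition of the offsets
T_dressed = T_minimal + d
```

Because the displacements share the SMPL topology, the same `d` can be added to a different minimal body to retarget the clothing.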

Towards “SMPL-Clothed”. Our goal is to effectively extend the SMPL body model with a new clothing layer that controls both the global clothing shape and the local clothing structure. Specifically, we extend the deformed template mesh from Eq. (1) with the new additive clothing layer:

T_clo(β, θ, c, z) = T(β, θ) + C(θ, c, z),    (3)

where C is our clothing function. Our clothing layer is parametrized by the body pose θ, the clothing type c, and a latent variable z which defines a point in a learned low-dimensional space that encodes clothing shape and structure. Following Eq. (2), a posed, clothed human body mesh is:

M_clo(β, θ, c, z) = W(T_clo(β, θ, c, z), J(β), θ, 𝒲).    (4)
The key remaining questions are: (1) how do we learn a well-structured, low-dimensional latent space of clothing shape and structure? (2) how do we use this latent space to dress SMPL meshes in a coherent and controllable manner?

3.2 Mesh VAE-GAN

To model the clothing term, C(θ, c, z), in Eq. (3), we introduce AutoEnClother, a novel generative model for 3D meshes. It has the architecture of a conditional VAE-GAN [67, 68], consisting of an encoder E, a generator G, a condition module F and a discriminator D, with mesh convolution and mesh sampling layers as building blocks. Specifically, the model is conditioned on the clothing type c and pose parameters θ, and is trained adversarially to allow the capture of realistic local clothing structure.

In this section we review the mesh convolutions, present the new architecture and detail the learning strategy.

Preliminaries: Convolution on Meshes. As described in Sec. 3.1, we compute clothing displacements and associate them with the SMPL mesh topology. In order to learn features on the clothing mesh, we use convolutions on 3D meshes, which can be defined as filtering in the spectral domain [13, 14] and approximated using Chebyshev polynomials [11] of the graph Laplacian [69]. Formally, the Chebyshev convolution at a layer with F_in input and F_out output features is given by:

y_j = Σ_{i=1}^{F_in} Σ_{k=0}^{K−1} θ_{i,j,k} T_k(L̃) x_i,    j = 1, …, F_out,

where x_i refers to the i-th input feature over the N input vertices and y_j is the j-th output feature. A mesh has 3 features per vertex, corresponding to its position in 3D Euclidean space. The learnable weights θ_{i,j,k} are the coefficients of the Chebyshev polynomials T_k as defined in Defferrard et al. [11]; the parameters θ_{i,j} map the i-th input feature to the j-th output feature using the scaled Laplacian, L̃, of the input mesh. The convolution kernel filters features from the K-ring neighborhood of each vertex, which allows control of the receptive field of the kernels through the parameter K. For more details about mesh convolution and sampling, we refer the reader to [11, 12, 17].
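A minimal dense NumPy version of this filter, for a single input and output feature on a toy path graph (the real layers use sparse Laplacians and a learned weight tensor mapping F_in to F_out features), might look like:

```python
import numpy as np

def normalized_laplacian(A):
    """Symmetric normalized graph Laplacian I - D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    D = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    return np.eye(len(A)) - D @ A @ D

def cheb_conv(x, L, theta):
    """y = sum_k theta_k T_k(L_tilde) x, via T_k = 2 L_tilde T_{k-1} - T_{k-2}."""
    L_tilde = L - np.eye(len(L))        # shift eigenvalues into [-1, 1] (lmax ~ 2)
    Tx = [x, L_tilde @ x]               # T_0 x = x, T_1 x = L_tilde x
    for _ in range(2, len(theta)):
        Tx.append(2 * L_tilde @ Tx[-1] - Tx[-2])
    return sum(t * Txk for t, Txk in zip(theta, Tx))

# 5-vertex path graph; one scalar feature per vertex
A = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
L = normalized_laplacian(A)
x = np.arange(5, dtype=float)[:, None]
y = cheb_conv(x, L, np.array([0.5, 0.3, 0.2]))   # K = 3 Chebyshev terms
```

Increasing the length of `theta` widens the neighborhood each output vertex can see, which is how the receptive field is controlled.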

3.2.1 Model Architecture

The VAE-GAN model proposed by [67] combines a VAE [20] and a GAN [23] and consists of an encoder E, a generator G and a discriminator D. The encoder maps the data x into a low-dimensional latent code z. The generator tries to reconstruct the data from its latent code: x̂ = G(z). A multivariate Gaussian prior of zero mean and unit variance, p(z) = N(0, I), is imposed on the latent code z. The discriminator D is used for adversarial training and is trained to discriminate generated samples x̂ from the real samples x.

We extend the VAE-GAN to the 3D mesh domain by: (1) building the encoder and generator with mesh-residual blocks so as to learn deep features; (2) introducing a novel mesh-patchwise discriminator for capturing fine local structures. The entire network is trained in an end-to-end manner.

Figure 2 shows the overall network design; architecture details are provided in Appendix B. The following notation is used in the rest of the paper unless otherwise noted: input mesh x, its reconstruction x̂, latent code z, and condition c.

Encoder-Generator Module. The generator G of our model takes both the clothing condition (pose θ and clothing type c) and the latent code z. It is essentially the decoder part of the VAE. It is combined with the encoder E during training: the latent code z from E is concatenated with the condition and then fed to G. At test time, the encoder is discarded and z is sampled from the Gaussian prior distribution, N(0, I). Using an entire VAE instead of only a decoder as the generator is a deliberate choice, as we find in our experiments that training becomes unstable without the encoder.
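The two data paths can be sketched with linear maps standing in for the mesh-convolutional encoder and generator; all dimensions and weights below are toy stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
D_x, D_z, D_c = 60, 8, 4        # flattened mesh, latent and condition sizes (toy)

W_enc = rng.normal(0, 0.1, size=(D_x, 2 * D_z))    # encoder E: x -> (mu, logvar)
W_gen = rng.normal(0, 0.1, size=(D_z + D_c, D_x))  # generator G: [z, c] -> x_hat

def encode(x):
    h = x @ W_enc
    return h[:D_z], h[D_z:]                        # mu, logvar

def generate(z, c):
    return np.concatenate([z, c]) @ W_gen          # condition concatenated with z

def reconstruct(x, c):
    """Training path: z comes from E via the reparameterization trick."""
    mu, logvar = encode(x)
    z = mu + np.exp(0.5 * logvar) * rng.normal(size=D_z)
    return generate(z, c)

def sample(c):
    """Test path: E is discarded; z is drawn from the N(0, I) prior."""
    return generate(rng.normal(size=D_z), c)

x_hat = reconstruct(rng.normal(size=D_x), np.eye(D_c)[0])
```

The reparameterization step keeps the sampling differentiable during training; at test time only `sample` is used, exactly as described above.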

Both E and G are feed-forward neural networks built with mesh convolutional layers. A fully-connected layer sits at the end of E and at the beginning of G, respectively, transforming the feature maps to and from the latent code vector.

Stacking mesh convolution layers can, however, cause the smoothing of local features [70], which is undesirable for mesh generation because fine details are likely to disappear. We replace the standard mesh convolution layers with mesh residual blocks [71]. The shortcut connections within a residual block enable the use of the low-level features from its input if necessary.

Patchwise Discriminator Module.

We introduce a patchwise discriminator for meshes, whose counterpart for images has shown success in image-to-image translation problems [66, 72]. Instead of looking at the entire generated mesh, the discriminator classifies whether a patch is real or fake based only on its local structure. Intuitively, this encourages the discriminator to focus on fine details, while the global shape is taken care of by the reconstruction loss.

In practice this is achieved by repeating the convolution-downsampling block several times to obtain a real/fake prediction map. Following [17], we use the quadric-error-based downsampling method [65].

Condition Module. The condition module transforms c through a small fully-connected network to obtain its compressed representation c̃ = F(c). This condition representation is then fed to the inputs of G and D.

Figure 2: Architecture overview of our model. x: input mesh, x̂: reconstruction, c: condition, z: latent code; E: encoder, F: condition encoder, G: generator, D: discriminator. Details are provided in Appendix B.

3.2.2 Learning

For the reconstruction loss, we use an L1 distance over the vertices of the mesh, because it encourages less smoothing compared to L2:

L_recon = (1/N) Σ_{v=1}^{N} ‖x_v − x̂_v‖_1.
Furthermore, we apply a loss on the face normals to encourage the generation of wrinkles instead of smooth surfaces. Let n_f be the unit normal corresponding to the face f in the set of mesh faces F of the mesh x. We penalize the angle between all corresponding face normals of x and x̂:

L_normal = Σ_{f ∈ F} (1 − ⟨n_f, n̂_f⟩),

where ⟨·, ·⟩ is the vector inner product.

The loss on the latent distribution is formulated as the KL divergence between the posterior and the prior:

L_KL = KL( q(z | x) ‖ p(z) ),  with p(z) = N(0, I).
The generator and discriminator are trained as opponents with the adversarial loss:

L_adv = E_x [log D(x, c̃)] + E_z [log(1 − D(G(z, c̃), c̃))],

where G tries to minimize this loss against D, which aims to maximize it.

Overall Objective. All the losses are weighted by coefficients λ that balance the quality of generation and sampling. The overall objective is written as:

L = L_recon + λ_normal L_normal + λ_KL L_KL + λ_adv L_adv.
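The four terms can be sketched in plain NumPy as follows; the `lam` weights and all inputs below are placeholders, not the values used in training:

```python
import numpy as np

def recon_l1(x, x_hat):
    return np.abs(x - x_hat).mean()        # L1: encourages less smoothing than L2

def normal_loss(n, n_hat):
    # n, n_hat: (F, 3) unit face normals; 1 - <n_f, n_hat_f> penalizes the angle
    return np.sum(1.0 - np.sum(n * n_hat, axis=1))

def kl_loss(mu, logvar):
    # Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)

def adv_loss(d_real, d_fake):
    # D maximizes this objective; G minimizes it through d_fake
    return np.log(d_real) + np.log(1.0 - d_fake)

def total_loss(x, x_hat, n, n_hat, mu, logvar, d_real, d_fake,
               lam=(1.0, 1.0, 1.0)):
    return (recon_l1(x, x_hat) + lam[0] * normal_loss(n, n_hat)
            + lam[1] * kl_loss(mu, logvar) + lam[2] * adv_loss(d_real, d_fake))
```

Note that identical meshes, identical normals and a standard-normal posterior each drive their respective term to zero, which is a quick sanity check on the formulation.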
3.3 AutoEnClother dresses SMPL

Now we can model the clothing layer in Eq. (3) with AutoEnClother. We train AutoEnClother using the clothing displacement mesh (associated with the SMPL mesh topology, Sec. 3.1) as the data x, and for the condition we choose the pose θ and clothing type c. The latent variable z encodes information that is influenced by all other factors. Consequently, the actual input to the generator, (z, θ, c), is essentially disentangled into z, pose and clothing type. Note that, as a unified framework, AutoEnClother is not limited to these two factors. Instead, it can scale to take more conditions (e.g., tightness and fabric), provided that the data for training is available. When all such factors are controllable, the latent distribution degenerates to a Dirac delta function and the model becomes a regressor.

The clothing type c is a 4-dimensional one-hot vector corresponding to the 4 types of clothing (see Sec. 4) in our dataset. Given data for other clothing types, the extension is straightforward, as long as the garment does not largely differ from the geometry of the human body. The 24 × 3-dimensional SMPL pose parameter representation is flattened into a 72-dimensional vector to be used in AutoEnClother.
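Concretely, assembling the condition might look like the sketch below; the clothing-type ordering and the concatenation layout are illustrative assumptions:

```python
import numpy as np

CLOTHING_TYPES = ["shortlong", "shortshort", "longshort", "longlong"]

def clothing_onehot(name):
    c = np.zeros(len(CLOTHING_TYPES))
    c[CLOTHING_TYPES.index(name)] = 1.0
    return c

theta = np.zeros((24, 3))        # 24 joint rotations in axis-angle (incl. global)
pose_vec = theta.reshape(-1)     # flattened to 72 dimensions
condition = np.concatenate([pose_vec, clothing_onehot("shortlong")])
```

Adding a new clothing type would simply extend `CLOTHING_TYPES` and the one-hot dimension.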

Finally, the clothing term from Eq. (3) can be written concretely as:

C(θ, c, z) = G(z, θ, c; ω),

where G is the generator of our network and ω denotes the learned parameters of the model.

AutoEnClother effectively dresses SMPL. For a specific pose θ and clothing type c, one can randomly sample z in the latent space and generate varied and unique samples of 3D clothing that satisfy the given condition. Similarly, for a fixed latent code z, one can obtain varied clothing shapes as a function of pose and clothing type.

4 Data and Training

We demonstrate the characteristics of our model in two respects: representation capacity and generation performance. The model is trained and tested on a real-world scan dataset, and the generated clothing meshes are evaluated perceptually via a user study.

4D Clothing Scan Dataset. We captured temporal sequences of 3D body scans with a high-resolution body scanner (3dMD LLC, Atlanta, GA). Each subject was scanned once in a minimal-clothing condition. Then, subjects in the clothed condition performed several predefined motion sequences in four different clothing types: short-sleeve T-shirt with long pants (shortlong), short-sleeve T-shirt with short pants (shortshort), as well as their opposite versions longshort and longlong. This allowed us to capture a variety of wrinkles in different poses and clothing types. The minimal setting, in which subjects were scanned wearing tight-fitting sports underwear, was used to obtain an estimate of the “naked” body shape. This scan could also be replaced by an automatic body shape estimation method [26, 34, 73]. The scans were registered using the single-mesh alignment method from [9], and the clothing displacements were computed as described in Sec. 3.1.

Our dataset contains 31,402 examples (frames) from 117 motion sequences by 7 male subjects in 4 types of clothing, as well as 4,318 examples from 18 sequences by 5 female subjects in shortlong. It contains unposed registrations (aligned to the scans) and the clothing displacements, as well as the pose parameters and clothing type label. The subjects gave informed written consent to participate.

Implementation Details. We train our model separately for males and females. We split the male dataset into 22,082 training examples; the remaining 9,320 examples are left out for different test scenarios (see Sec. 5.1). The female dataset is split into a training set with 4,093 examples and a test set with 225 examples. The following quantitative results are shown on the male dataset due to its higher diversity in subjects, clothing types and motion sequences.

Note that, while we train our model with 4D scan data, it can take meshes from other sources (e.g., clothing simulations), as long as they have the SMPL mesh topology.

The model is trained for 150 epochs using stochastic gradient descent with momentum, with an initial learning rate that decays after every epoch. The Chebyshev convolutions use a fixed polynomial order K. An L2 weight decay is used as regularization. The pose and clothing type conditions are each compressed to compact vectors. Batch normalization is not used in our network, as we noticed that it leads to unstable training.

5 Experiments

Evaluation Metrics. Typical evaluation metrics for generative models, such as the Inception Score [74] and FID [75], are not available for 3D meshes. Therefore, we evaluate our model’s representation power by computing the per-vertex Euclidean auto-encoding error on clothing meshes. When averaging the error over vertices, we exclude the vertices of the head, fingers, toes, hands and feet, as they are not covered by clothing.
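A sketch of this metric, with a hypothetical boolean mask selecting the clothing-relevant vertices:

```python
import numpy as np

def per_vertex_error(x, x_hat, mask):
    """Mean per-vertex Euclidean distance over the masked (clothed) vertices."""
    dist = np.linalg.norm(x - x_hat, axis=1)   # (N,) distances in 3D
    return dist[mask].mean()

N = 10
mask = np.ones(N, dtype=bool)
mask[:3] = False                 # e.g. head/hand/feet vertices excluded
x = np.zeros((N, 3))
x_hat = np.zeros((N, 3))
x_hat[:, 0] = 0.01               # uniform offset along one axis
err = per_vertex_error(x, x_hat, mask)
```

For the uniform toy offset above, the masked mean distance is simply the offset magnitude, 0.01.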

To test whether the conditionally generated results of our method look realistic, we performed a user study on Amazon Mechanical Turk (AMT). Virtual avatars were dressed in 3D and rendered into front-view images. Following the protocol of [66], raters were presented with a series of “real vs fake” trials. On each trial, the rater is presented with a “real” mesh rendering (randomly picked from our dataset, i.e., alignments of the scan data) and a “fake” rendering (a mesh generated by our algorithm). Both images are shown side by side, and the raters are asked to pick the one they think is real. Each pair of renderings is evaluated by 10 raters. Unlike [66], we present the real and fake renderings simultaneously, set no time limit for the raters, and allow zooming in for detailed comparison. In this setting, the best score a method can obtain is 50%, meaning that real and fake examples are indistinguishable.

5.1 Model Capability

                    seen subj      unseen subj    unseen subj
                    unseen seq     seen seq       unseen seq
Baseline Comparison
PCA                 0.59±0.51      1.19±0.77      1.31±0.85
CoMA-1              0.63±0.56      1.38±0.86      1.51±0.94
CoMA-4              0.69±0.58      1.51±0.85      1.66±0.87
AutoEnClother-full  0.56±0.51      1.14±0.72      1.19±0.77
Ablation Study
no D                0.57±0.53      1.16±0.73      1.23±0.78
global D            0.58±0.53      1.24±0.78      1.33±0.83
no res-block        0.59±0.52      1.20±0.79      1.28±0.86
no normal-loss      0.57±0.51      1.16±0.71      1.24±0.76

Table 1: Quantitative evaluation on three test scenarios. Per-vertex reconstruction errors are reported in cm. Best in bold, second best underlined. Methods in comparison: AutoEnClother-full: our model; CoMA-4 and CoMA-1: mesh autoencoder with (4×) and without downsampling; PCA: principal component analysis. The ablation study: no D: remove the discriminator; global D: use a non-patchwise discriminator; no res-block: use plain convolutions rather than residual blocks; no normal-loss: exclude the normal loss from training.

Reconstruction and Generalization. The reconstruction accuracy reflects the capability of the model to encode meshes while preserving the original geometry. We compare against the state-of-the-art convolutional mesh autoencoder CoMA [17] and a PCA model. For a fair comparison, we compare to both the original CoMA with 4× downsampling (denoted “CoMA-4”) and a variant without downsampling (as in our model), denoted “CoMA-1”. We use the same latent space dimension (number of principal components in the case of PCA) and the same hyper-parameter settings, where applicable, for all models.

We evaluate the reconstruction accuracy in three scenarios: (1) seen subject, unseen motion sequence: the training set includes the subject, but not the test motion sequence. Analogously, we define the scenarios (2) unseen subject, seen motion sequence, and (3) unseen subject, unseen motion sequence.

The seen subject, unseen motion setting measures the pose generalization capability of the models. The model is tested on three sequences excluded from training: ballerina-spin, pose-like-a-model and soccer-kick, which include challenging clothing deformations from extreme poses such as kicking and raising the arms. The unseen subject, seen sequence scenario is more challenging, because even with a pre-defined motion sequence, different subjects perform it in their own way. That is, the model is tested on similar, but not identical, poses to those seen in training. The models need to simultaneously interpolate between poses and extrapolate to new body shapes. This scenario was evaluated on two subjects that were left out of training, using the motions hips-shake, shoulders-mill and ATU-squat, which cover a range of difficulty in terms of clothing deformations. The unseen subject, unseen motion case is the most challenging, requiring the models to extrapolate both pose deformations and body shapes.

Reconstruction errors are reported in Table 1. While our model focuses on mesh generation, it also outperforms the baselines in the reconstruction task. Figure 3 shows an example from a ballerina-spin test pose. CoMA with downsampling produces smoothed results; improvements are seen when the downsampling layer is removed (CoMA-1). The PCA model keeps wrinkles and boundaries, but long-range correlations are missing: the rising hem on the left side completely disappears. Our model manages to capture both local structures and global correlations.

Figure 3: Example of reconstruction by our model and baselines. Our model is able to recover both long-range correlations and local details.

Ablation Study. In Table 1, we also present an ablation study of our architecture, removing one component from the model at a time and evaluating its performance. We observe that both the discriminator and the residual blocks play an important role in the performance of the model. Further, there is an improvement when using the patchwise discriminator on meshes instead of a global discriminator. We also note that the normal loss leads to a slight improvement in accuracy.

5.2 Conditional Generation of Clothing

Next, we evaluate the conditional generation characteristics of our model. Our clothing generation model has three parameters: z, θ and c (see Eq. (3)). We present different scenarios and show how the conditioning affects the generated clothing shape.

Sample in clothing type and latent space. We first fix θ, and for each clothing type (shortlong, shortshort, longlong and longshort) we sample z ~ N(0, I). Examples of generated clothing are applied to an unseen body. Figure 4 presents examples obtained on a body in an A-pose. Each row represents one of the four clothing types and each column has a different z sampled from N(0, I). Examples in the same column correspond to the same z value and hence have a similar clothing style (tightness and wrinkles).

In the AMT user study, we tested 400 generated examples (100 per category). On average, of the time the raters preferred our synthesized results over renderings from real data.

Figure 4: Generated clothing conditioned on clothing type, shown on an unseen body shape. Each row: same clothing type (from above: shortlong, shortshort, longshort, longlong). Each column: same z.

Sample in pose and latent space. We now fix the clothing type, to shortlong for example, and generate clothing conditioned on poses that are not used in training. Again we generate variation by sampling latent codes from the prior. The generated clothing displacements are again applied to a subject not present in the training data. As shown in Figure 5, our model captures long-range correlations within a mesh, such as the elevated hem as the subject raises the arms and the lateral wrinkle on the back as he shrugs. The model also synthesizes local details such as wrinkles in the armpit area and boundaries at cuffs and collars. Here we carried out an AMT evaluation with 300 generated meshes in various poses; of the time the raters rated our results as real.

Figure 5: Generated clothing conditioned on a specific pose, shown on an unseen body shape. Each row: same pose. Each column: same latent code. All poses come from motion sequences unseen in training. The rightmost column shows the undressed body shapes.

Animating a pose sequence. This scenario represents a practical use case: a user takes an initial body (not seen in training), selects one of the available clothing types, and then generates an animated sequence reproducing a motion (also unseen in training). The challenge is to produce a clothing shape that is consistent across poses yet deforms plausibly.

In this case we fix the latent code and the clothing type, and use the poses from an entire motion sequence unseen in training (see Appendix A for more detail). This scenario is difficult because accurate temporal consistency requires modeling dynamics. Since our model does not account for temporal dynamics, small flickering artifacts can be seen in the generated sequences, so this scenario is not included in the perceptual study. Although modeling dynamics is out of the scope of this work, our model nevertheless produces coherent (if improvable) clothing deformations over time.
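Keeping the latent code and clothing type frozen while only the pose condition varies per frame can be sketched as below; the decoder is again a stand-in linear map with illustrative dimensions, not the paper's network:

```python
import numpy as np

rng = np.random.default_rng(1)

def animate_sequence(decoder, z, clothing_onehot, pose_sequence):
    """Decode one clothing displacement map per frame.

    The latent code z and the clothing type stay fixed across frames,
    which keeps the garment style consistent; only the pose changes.
    """
    return [decoder(z, np.concatenate([pose, clothing_onehot]))
            for pose in pose_sequence]

# Stand-in decoder over hypothetical dimensions.
N_VERTS, LATENT, POSE_DIM, N_TYPES = 50, 8, 72, 4
W = rng.standard_normal((N_VERTS * 3, LATENT + POSE_DIM + N_TYPES)) * 0.01
decoder = lambda z, c: (W @ np.concatenate([z, c])).reshape(N_VERTS, 3)

z = rng.standard_normal(LATENT)
shortlong = np.array([1.0, 0.0, 0.0, 0.0])
poses = [np.full(POSE_DIM, 0.01 * t) for t in range(10)]  # 10-frame motion
frames = animate_sequence(decoder, z, shortlong, poses)
print(len(frames), frames[0].shape)  # 10 (50, 3)
```

Because the decoder is applied independently per frame, nothing enforces smoothness between consecutive outputs, which is exactly the source of the flickering noted above.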

Interpolate between clothing types. Although the clothing type is represented as a discrete one-hot vector, it is possible to interpolate between types and obtain intermediate results. Figure 6 shows an example going from longshort to shortlong.

Figure 6: Clothing interpolation between longshort and shortlong.
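The interpolation is a linear blend of the two condition vectors; intermediate vectors are no longer one-hot, but the decoder still accepts them. A minimal sketch (the one-hot ordering is our assumption):

```python
import numpy as np

def interpolate_condition(c_a, c_b, steps=5):
    """Linearly interpolate between two one-hot clothing-type vectors."""
    return [(1 - t) * c_a + t * c_b for t in np.linspace(0.0, 1.0, steps)]

# Hypothetical one-hot layout: [shortlong, shortshort, longlong, longshort]
longshort = np.array([0.0, 0.0, 0.0, 1.0])
shortlong = np.array([1.0, 0.0, 0.0, 0.0])
for c in interpolate_condition(longshort, shortlong, steps=3):
    print(c)   # endpoints are one-hot; the midpoint is [0.5, 0, 0, 0.5]
```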

AutoEnClother enhanced by SMPL texture. Since our model shares the SMPL mesh topology, it is compatible with all existing SMPL texture maps. Figure 7 shows an example texture applied to the standard “naked” SMPL model (as in the SURREAL dataset [2]) and to our clothed body model. Although the texture creates an illusion of clothing on the SMPL body, the overall shape remains skinny and over-smoothed, and hence unrealistic. In contrast, our model, with its improved clothing geometry, matches the clothing texture more naturally when the correct clothing type is given. As a future line of research, we plan to align the clothing texture boundaries with the underlying geometry by learning a clothing appearance model coupled to shape.

Figure 7: Front row: a clothing texture applied to the SMPL body and to one of our generated clothed bodies. Back row: the respective underlying geometry.

6 Conclusions, Limitations, Future Work

We have introduced a novel generative model for 3D meshes that supports conditioning, sampling, and detail preservation. We apply it to clothing deformations of a 3D body mesh and condition the latent space on body pose and clothing type. The generated clothing mesh is added as an extra layer on top of the SMPL body model, making it possible to dress any body shape with realistic clothing and animate it in motion. This capability is widely applicable in computer vision and provides a practical extension to current body-modeling technology.

There are several limitations of our approach that point to future work. Because our clothing model is anchored to the SMPL mesh topology, it can only handle garments whose geometry is similar to the body's; skirts, open jackets, and multiple clothing layers need to be represented in other forms. Similar to what was done for body shape [76], we plan to build a dynamic clothing model in which the clothing deformation depends on the state at the previous time step. Here we showed how multiple garments can be captured, but we plan to extend this to a much wider wardrobe. With more data, we could also condition on fabric type, clothing size, and body shape.

Acknowledgements: We thank J. Tesch for rendering the results, P. Karasik for the help with Amazon Mechanical Turk, T. Alexiadis and A. Keller for building the dataset. We thank P. Ghosh, T. Bolkart and Y. Zhang for useful discussions. Q. Ma and S. Tang acknowledge funding by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) Projektnummer 276693517 SFB 1233. G. Pons-Moll is funded by the Emmy Noether Programme, Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 409792180.

Disclosure: Michael J. Black has received research gift funds from Intel, Nvidia, Adobe, Facebook, and Amazon. While he is a part-time employee of Amazon and has financial interests in Amazon and Meshcapade GmbH, his research was performed solely at, and funded solely by, MPI.


  • [1] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. Smpl: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):248, 2015.
  • [2] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid. Learning from synthetic humans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 109–117, 2017.
  • [3] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik. End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7122–7131, 2018.
  • [4] M. Omran, C. Lassner, G. Pons-Moll, P. Gehler, and B. Schiele. Neural body fitting: Unifying deep learning and model based human pose and shape estimation. In 2018 International Conference on 3D Vision (3DV), pages 484–494. IEEE, 2018.
  • [5] R. Alp Güler, N. Neverova, and I. Kokkinos. Densepose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7297–7306, 2018.
  • [6] C. Lassner, J. Romero, M. Kiefel, F. Bogo, M. J. Black, and P. V. Gehler. Unite the people: Closing the loop between 3d and 2d human representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6050–6059, 2017.
  • [7] G. Pavlakos, L. Zhu, X. Zhou, and K. Daniilidis. Learning to estimate 3d human pose and shape from a single color image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 459–468, 2018.
  • [8] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. A. Osman, D. Tzionas, and M. J. Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [9] G. Pons-Moll, S. Pujades, S. Hu, and M. J. Black. Clothcap: Seamless 4d clothing capture and retargeting. ACM Transactions on Graphics (TOG), 36(4):73, 2017.
  • [10] P. Guan, L. Reiss, D. A. Hirshberg, A. Weiss, and M. J. Black. Drape: Dressing any person. ACM Trans. Graph., 31(4):35–1, 2012.
  • [11] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
  • [12] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  • [13] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
  • [14] M. Henaff, J. Bruna, and Y. LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015.
  • [15] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015.
  • [16] J. Atwood and D. Towsley. Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1993–2001, 2016.
  • [17] A. Ranjan, T. Bolkart, S. Sanyal, and M. J. Black. Generating 3d faces using convolutional mesh autoencoders. In Proceedings of the European Conference on Computer Vision (ECCV), pages 704–720, 2018.
  • [18] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang. Pixel2mesh: Generating 3d mesh models from single rgb images. In ECCV, 2018.
  • [19] E. J. Smith, S. Fujimoto, A. Romero, and D. Meger. Geometrics: Exploiting geometric structure for graph-encoded objects. arXiv preprint arXiv:1901.11461, 2019.
  • [20] D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [21] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491, 2015.
  • [22] Z. Lähner, D. Cremers, and T. Tung. Deepwrinkles: Accurate and realistic clothing modeling. In European Conference on Computer Vision, pages 698–715. Springer, 2018.
  • [23] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • [24] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [25] S. Cheng, M. Bronstein, Y. Zhou, I. Kotsia, M. Pantic, and S. Zafeiriou. Meshgan: Non-linear 3d morphable models of faces. arXiv preprint arXiv:1903.10384, 2019.
  • [26] C. Zhang, S. Pujades, M. J. Black, and G. Pons-Moll. Detailed, accurate, human shape estimation from clothed 3d scan sequences. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [27] N. Magnenat-Thalmann, H. Seo, and F. Cordier. Automatic modeling of virtual humans and body clothing. Journal of Computer Science and Technology, 19(5):575–584, 2004.
  • [28] E. de Aguiar, L. Sigal, A. Treuille, and J. K. Hodgins. Stable spaces for real-time clothing. ACM Trans. Graph., 29(4):106:1–106:9, July 2010.
  • [29] L. Sigal, M. Mahler, S. Diaz, K. McIntosh, E. Carter, T. Richards, and J. Hodgins. A perceptual control space for garment simulation. ACM Transactions on Graphics (TOG), 34(4):117, 2015.
  • [30] D. Kim, W. Koh, R. Narain, K. Fatahalian, A. Treuille, and J. F. O’Brien. Near-exhaustive precomputation of secondary cloth effects. ACM Transactions on Graphics, 32(4):87:1–7, July 2013. Proceedings of ACM SIGGRAPH 2013, Anaheim.
  • [31] T. Y. Wang, D. Ceylan, J. Popović, and N. J. Mitra. Learning a shared shape space for multimodal garment design. In SIGGRAPH Asia 2018 Technical Papers, page 203. ACM, 2018.
  • [32] I. Santesteban, M. A. Otaduy, and D. Casas. Learning-based animation of clothing for virtual try-on. arXiv preprint arXiv:1903.07190, 2019.
  • [33] A. O. Bălan and M. J. Black. The naked truth: Estimating body shape under clothing. In European Conference on Computer Vision, pages 15–29. Springer, 2008.
  • [34] J. Yang, J.-S. Franco, F. Hétroy-Wheeler, and S. Wuhrer. Estimation of human body shape in motion with wide clothing. In European Conference on Computer Vision, pages 439–454. Springer, 2016.
  • [35] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll. Video based reconstruction of 3D people models. In IEEE Conf. on Computer Vision and Pattern Recognition, 2018.
  • [36] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll. Detailed human avatars from monocular video. In International Conf. on 3D Vision, sep 2018.
  • [37] R. A. Newcombe, D. Fox, and S. M. Seitz. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 343–352, 2015.
  • [38] Y. Tao, Z. Zheng, K. Guo, J. Zhao, D. Quionhai, H. Li, G. Pons-Moll, and Y. Liu. Doublefusion: Real-time capture of human performance with inner body shape from a depth sensor. In IEEE Conf. on Computer Vision and Pattern Recognition, 2018.
  • [39] Y. Tao, Z. Zheng, Y. Zhong, J. Zhao, D. Quionhai, G. Pons-Moll, and Y. Liu. Simulcap : Single-view human performance capture with cloth simulation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), jun 2019.
  • [40] D. Bradley, T. Popa, A. Sheffer, W. Heidrich, and T. Boubekeur. Markerless garment capture. In ACM Transactions on Graphics (TOG), volume 27, page 99. ACM, 2008.
  • [41] R. Daněřek, E. Dibra, C. Öztireli, R. Ziegler, and M. Gross. Deepgarment: 3d garment shape estimation from a single image. In Computer Graphics Forum, volume 36, pages 269–280. Wiley Online Library, 2017.
  • [42] T. Alldieck, M. Magnor, B. L. Bhatnagar, C. Theobalt, and G. Pons-Moll. Learning to reconstruct people in clothing from a single RGB camera. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), jun 2019.
  • [43] A. Neophytou and A. Hilton. A layered model of human body and garment deformation. In International Conference on 3D Vision, 2014.
  • [44] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis. Scape: shape completion and animation of people. In ACM transactions on graphics (TOG), volume 24, pages 408–416. ACM, 2005.
  • [45] J. Yang, J.-S. Franco, F. Hétroy-Wheeler, and S. Wuhrer. Analyzing clothing layer deformation statistics of 3d human motions. In Proceedings of the European Conference on Computer Vision (ECCV), pages 237–253, 2018.
  • [46] C. Stoll, J. Gall, E. d. Aguiar, S. Thrun, and C. Theobalt. Video-based reconstruction of animatable human characters. In ACM SIGGRAPH ASIA, 2010.
  • [47] H. Zhu, X. Zuo, S. Wang, X. Cao, and R. Yang. Detailed human shape estimation from a single image by hierarchical mesh deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4491–4500, 2019.
  • [48] R. Natsume, S. Saito, Z. Huang, W. Chen, C. Ma, H. Li, and S. Morishima. Siclope: Silhouette-based clothed people. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
  • [49] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. arXiv preprint arXiv:1905.05172, 2019.
  • [50] S. Wold, K. Esbensen, and P. Geladi. Principal component analysis. Chemometrics and intelligent laboratory systems, 2(1-3):37–52, 1987.
  • [51] F. De la Torre and M. J. Black. Robust principal component analysis for computer vision. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 1, pages 362–369. IEEE, 2001.
  • [52] A. Brock, J. Donahue, and K. Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
  • [53] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3d generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
  • [54] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3d shapenets: A deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1912–1920, 2015.
  • [55] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for real-time object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.
  • [56] C. Häne, S. Tulsiani, and J. Malik. Hierarchical surface prediction for 3d object reconstruction. In 3D Vision (3DV), 2017 International Conference on, pages 412–420. IEEE, 2017.
  • [57] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for high-resolution 3d outputs. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), volume 2, page 8, 2017.
  • [58] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.
  • [59] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
  • [60] M. Atzmon, H. Maron, and Y. Lipman. Point convolutional neural networks by extension operators. arXiv preprint arXiv:1803.10091, 2018.
  • [61] R. Klokov and V. Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3d point cloud models. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 863–872. IEEE, 2017.
  • [62] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon. Dynamic graph cnn for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018.
  • [63] N. Verma, E. Boyer, and J. Verbeek. Feastnet: Feature-steered graph convolutions for 3d shape analysis. In CVPR 2018-IEEE Conference on Computer Vision & Pattern Recognition, 2018.
  • [64] O. Litany, A. Bronstein, M. Bronstein, and A. Makadia. Deformable shape completion with graph convolutional autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1886–1895, 2018.
  • [65] M. Garland and P. S. Heckbert. Surface simplification using quadric error metrics. In Proceedings of the 24th annual conference on Computer graphics and interactive techniques, pages 209–216. ACM Press/Addison-Wesley Publishing Co., 1997.
  • [66] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  • [67] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
  • [68] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. Cvae-gan: fine-grained image generation through asymmetric training. In Proceedings of the IEEE International Conference on Computer Vision, pages 2745–2754, 2017.
  • [69] F. R. Chung and F. C. Graham. Spectral graph theory. Number 92. American Mathematical Soc., 1997.
  • [70] Q. Li, Z. Han, and X.-M. Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [71] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [72] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision, 2017.
  • [73] S. Wuhrer, L. Pishchulin, A. Brunton, C. Shu, and J. Lang. Estimation of human body shape and posture under clothing. Computer Vision and Image Understanding, 127:31–42, 2014.
  • [74] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In Advances in neural information processing systems, pages 2234–2242, 2016.
  • [75] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
  • [76] G. Pons-Moll, J. Romero, N. Mahmood, and M. J. Black. Dyna: A model of dynamic human shape in motion. ACM Transactions on Graphics (TOG), 34(4):120, 2015.

Appendix A Experiment: Animate a Pose Sequence

Here we elaborate on the third experiment described in Section 5.2. We compare two conditions:

  1. Clothing retargeting using ClothCap [9], followed by animation with SMPL [1]: we add the clothing displacements from one frame to the avatar and then change the pose along a motion sequence using SMPL pose deformations, keeping the garment static.

  2. Results produced by our AutoEnClother model: we fix the sampled latent code and the clothing type, and change only the pose by feeding SMPL pose parameters from the sequence.

For the first condition, despite the sharp wrinkles (directly taken from a captured mesh), the result looks as if the subject were wearing an elastic “swimsuit”: no matter how the avatar moves, the clothing remains tightly fitted to the body and stretches as the body stretches.

In contrast, our AutoEnClother model produces small pose-dependent corrective offsets that deform the mesh more naturally, reducing this effect, as shown in Figure 8. When the subject opens his arms, the clothing below the armpits typically “inflates”; similarly, when the subject raises his arms, the cloth on the back drops naturally with gravity. Note that we do not explicitly integrate physics constraints into the model; such corrections are learned from data.
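The layering itself is simple: since the clothing shares SMPL topology, dressing amounts to adding per-vertex displacements to the body template, and SMPL's usual pose and shape deformation then carries the clothing along. A minimal sketch with toy values (function name is ours):

```python
import numpy as np

def dress_body(template_verts: np.ndarray,
               displacements: np.ndarray) -> np.ndarray:
    """Add clothing displacements to body template vertices.

    Both arrays have shape (num_vertices, 3) in the same (T-pose)
    template space, so dressing is a per-vertex offset.
    """
    return template_verts + displacements

body = np.zeros((6890, 3))            # SMPL template (toy: all zeros)
d = np.full((6890, 3), 0.005)         # uniform 5 mm clothing offset (toy)
clothed = dress_body(body, d)
print(clothed.max())  # → 0.005
```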

Figure 8: Examples of pose-corrected retargeting: with naïve ClothCap, the clothing is “glued” to the body under all poses, and stretches unnaturally. With AutoEnClother (our model), the cloth drapes in a more natural way with pose change.

Appendix B Detailed Network Architecture

We use the following notation:

  • data, condition, latent code, and the prediction map from the discriminator;

  • Chebyshev mesh convolution layers with a given number of filters;

  • linear mesh downsampling layers with a given downsampling rate;

  • FC: fully connected layer;

  • residual blocks built from Chebyshev convolution layers.

Condition Module: separate branches process the pose condition and the clothing-type condition.

Residual Block: the output of the block's convolution layers is combined with the block input by element-wise addition.
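The residual block can be sketched with the Chebyshev graph convolution of Defferrard et al. [11]: features are filtered with Chebyshev polynomials of the rescaled mesh Laplacian, and the block input is added element-wise to the output. The toy graph, shapes, and weight initialization below are illustrative, not the paper's configuration:

```python
import numpy as np

def chebyshev_conv(X, L_scaled, W):
    """Chebyshev spectral graph convolution [11].

    X:        (N, Fin) vertex features
    L_scaled: (N, N) rescaled graph Laplacian, 2L/lambda_max - I
    W:        (K, Fin, Fout) filter weights over K Chebyshev orders
    """
    K = W.shape[0]
    Tx = [X, L_scaled @ X]                         # T0, T1
    for _ in range(2, K):
        Tx.append(2 * L_scaled @ Tx[-1] - Tx[-2])  # Chebyshev recurrence
    out = np.zeros((X.shape[0], W.shape[2]))
    for k in range(K):
        out += Tx[k] @ W[k]
    return out

def residual_block(X, L_scaled, W1, W2):
    """Two Chebyshev convolutions plus an element-wise skip connection.

    Requires Fin == Fout so that x + F(x) is well defined.
    """
    h = np.maximum(chebyshev_conv(X, L_scaled, W1), 0.0)  # ReLU
    return X + chebyshev_conv(h, L_scaled, W2)            # x + F(x)

# Toy 3-vertex path graph standing in for a mesh.
A = np.array([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
L = np.diag(A.sum(1)) - A
L_scaled = 2 * L / np.linalg.eigvalsh(L).max() - np.eye(3)
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))
W1 = rng.standard_normal((3, 4, 4)) * 0.1   # K=3, Fin=Fout=4
W2 = rng.standard_normal((3, 4, 4)) * 0.1
print(residual_block(X, L_scaled, W1, W2).shape)  # (3, 4)
```

With all filter weights set to zero the block reduces to the identity, which is the usual motivation for residual connections [71].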