1 Introduction
Existing human body models like SMPL [1] successfully capture the statistics of human body shape and pose, but lack something important: clothing. Since most images of people show them in clothing, this causes several problems. For example, body models are used to generate synthetic training data [2], but the lack of clothing leads to a significant domain gap between synthetic and real images of humans. Deep learning methods train a regressor from image pixels to the parameters of the SMPL body model
[3, 4, 5, 6, 7, 8], but SMPL does not match the complexity of people in images. Generative approaches that use SMPL for training or for analysis-by-synthesis suffer when the model does not explain the image evidence well. Given the interest and value in 3D body models and the problems above, there is a clear need for a 3D model of clothed bodies.

A good model of a clothed human should be low-dimensional, easy to pose, differentiable, able to represent clothing on different body shapes and poses, and relatively realistic. To our knowledge, no current approach to modeling clothing satisfies these properties. Instead of extracting standalone garments, we choose to model clothing together with the human body, so as to serve the perception tasks mentioned above. Consequently, we use SMPL as a foundation and learn to model a clothing layer as displacements from the SMPL mesh in a canonical pose (Fig. 1). This is similar in spirit to ClothCap [9], which also models clothing as a displacement from the body, but we go further and learn a model of clothing deformation that is conditioned on pose and clothing type. DRAPE [10] is also similar in that clothing is a learned function but, in DRAPE, the model is learned from simulations, is based on principal component analysis (PCA), is overly smooth, and separate models need to be trained for each garment. We need, instead, a generative model that can be learned (like DRAPE) from real scan data (like ClothCap), can be added to SMPL, captures different types of clothing, and can be animated and fit to data.
To address these needs, we train a novel deep neural network from scan data to capture clothing displacements conditioned on pose and clothing type. Extending deep networks to 3D meshes is a relatively new and active field. Recent advances in convolutional neural networks for graphs [11, 12, 13, 14, 15, 16] enable direct learning on 3D meshes [17, 18, 19] by deforming a predefined template mesh. Ranjan et al. [17] define a variational autoencoder (VAE) on meshes [20, 21] using graph convolutions [11]. However, a drawback of VAEs is that they tend to produce overly smoothed results, making them ill-suited to modeling clothing, which has high-frequency wrinkles. Methods such as [18, 19] try to improve over-smoothed results with upsampling techniques, and others use a different representation for local details, such as normal maps [22]. Instead, we exploit generative adversarial networks (GANs) [23, 24], which have proven successful in the generation of realistic images. We introduce a novel Mesh VAE-GAN for end-to-end learning of a generative model of 3D clothing (see [25] for concurrent work that proposes a similar idea). Generating other types of 3D meshes with our model is also possible by using the respective template meshes, making it a generic mesh deep-learning framework.

Our Mesh VAE-GAN uses graph convolutions [11] and mesh sampling [17] as the backbone layers. We use the topology of the SMPL human body model [1] to represent the 3D geometry of clothes. Similarly to [26] and [9], we find it expressive enough to register scans of people in clothing that roughly matches the topology of the body. We condition our model on clothing type and SMPL body pose parameters, and train it on 4D captured data of people wearing different types of clothing in a variety of pose and motion sequences. The result is a parametric generative 3D clothing model, called AutoEnClother, with a low-dimensional latent space from which we can sample and condition, enabling the generation of a wide variety of clothing. The method is designed to be “plug and play” for many applications that already use SMPL. Dressing SMPL with AutoEnClother yields 3D meshes of people in clothing, which can be used to generate training data, as the output of a network that predicts body pose, as a clothing “prior”, or as part of a generative analysis-by-synthesis approach.
In summary, our key contributions are: (1) We introduce a novel conditional Mesh VAE-GAN architecture. (2) Using it, we build a 3D generative clothing model with controlled conditioning based on human pose and clothing type. (3) We augment a popular 3D human body model, which is minimally clothed, with clothing generated by our model.
2 Related Work
Models for clothing and clothed humans. There is a large literature on clothing simulation [27], which is beyond our scope. Traditional physics simulation requires significant artist or designer involvement and is not practical in the inner loop of a deep learning framework. The most relevant work has focused on computing offline physics-based simulations and learning efficient data-driven approximations from them [10, 28, 29, 30, 31, 32]. DRAPE [10] learns a model of clothing that allows changing the pose and shape, and Wang et al. [31] allow manipulation of clothing with sketches. Learning from physics simulation is convenient because many examples can be synthesized, they contain no noise, and registration and the factorization of shape into pose, body shape and clothing are already given. However, models learned from synthetic data often look unrealistic [10]. Learning from real scans of people in motion is difficult, but opens the door to modeling more realism and diversity.
The first problem with real scans is to estimate the shape and pose under clothing, which is required to model how garments deviate from the body. Typically, a single shape is optimized over multiple poses to fit inside multiple clothing silhouettes [33], or a sequence of 3D scans [26, 34]. Here, we use a similar approach [26] to factor a scan into a minimally clothed shape and a clothing layer. Pons-Moll et al. [9] proposed ClothCap, a method to capture a sequence of dynamic scans, encoding clothing as displacements from the SMPL body, and retargeting it to new body shapes. Similar to ClothCap, Alldieck et al. [35, 36] represent clothing as an offset from SMPL to reconstruct people in clothing from images. Combining dynamic fusion ideas [37] with SMPL, cloth capture and retargeting have been demonstrated from a single depth camera [38, 39]. Still other work captures garments in isolation from multi-view [40] or single images using a CNN [41]. These are essentially capture methods and cannot generate novel clothing. While a model of static clothing is implicitly learned in [42] from scans, it requires images as input.

Few clothing models learned from real data have been shown to generalize to new poses. Neophytou and Hilton [43] learn a layered garment model on top of SCAPE [44] from dynamic sequences, but generalization to novel poses is not demonstrated. Yang et al. [45] train a neural network to regress a PCA-based representation of clothing, but generalization is only shown on the same sequence or on the same subject. Lähner et al. [22] learn a garment-specific pose deformation model by regressing low-frequency PCA components and high-frequency normal maps. While the achieved quality is good, the model is garment-specific and does not provide a solution for full-body clothing. Most importantly, the aforementioned models are regressors and produce single point estimates. In contrast, our model is generative, which allows us to sample clothing.
Our motivation for learning a generative model is that clothing shape is intrinsically probabilistic; conditioned on a single pose, multiple clothing deformations are possible. A conceptually different approach to ours infers the parameters of a physical model from 3D scan sequences [46] and generalizes to novel poses. However, the inference problem is difficult, and, unlike our model, the resulting physics simulator is not differentiable.
Several recent works recover 3D meshes of clothed people from single or multi-view images using neural networks [47, 48, 49]. While preserving a high level of detail, the recovered 3D clothing is not parametrized and, hence, not manipulable. In contrast, our model provides control over pose and clothing type, and can be extended to control other factors.
Generative models for 3D meshes. Generative models for 3D shapes are usually based on PCA [50] or its robust versions [51]. Alternatively, deep learning methods such as Variational Autoencoders (VAEs) [20] and Generative Adversarial Networks (GANs) [23] have shown state-of-the-art results in generating 2D images [52] and voxels [53]. However, a voxel representation is not well suited to modeling clothing surfaces. Compared to voxels [54, 55, 56, 57] and point clouds [58, 59, 60, 61, 62], meshes are more suitable for 3D clothing data because of their computational efficiency and flexibility in modeling both global and local information. Generalizing GANs to irregular structures, such as graphs and meshes, is, however, not trivial.
To deal with irregular structures like graphs, Bruna et al. [13] introduce graph convolutions. Follow-up work [11, 12] extends these graph convolutions, which have been successfully used [17, 63] to learn representations defined on meshes. Verma et al. [63] use feature-steered graph convolutions for 3D shape analysis. Based on this, Litany et al. [64] use mesh VAEs for mesh completion. Ranjan et al. [17] learn a convolutional mesh VAE using graph convolutions [11] with mesh down- and up-sampling layers [65]. Although it works well for faces, the mesh sampling layer makes it difficult to capture the high-frequency wrinkles that are key in clothing. In our work, we capture high-frequency wrinkles by extending the PatchGAN [66] architecture to handle 3D meshes.
3 Probabilistic Clothing Mesh Generation
3.1 Clothing Representation
SMPL: Layer-wise Body Model. SMPL [1] is a generative model of human bodies that factors the surface of the body into shape (β) and pose (θ) parameters. The architecture of SMPL starts with a triangulated template mesh, T̄, in rest pose, defined by N = 6890 vertices. Given shape and pose parameters (β, θ), 3D offsets are added to the template, corresponding to shape dependent deformations (B_S(β)) and pose dependent deformations (B_P(θ)). The shape blend shape function, B_S(β), is a weighted sum of linear shape components learned from data, and models the individual body shape. The pose blend shape function, B_P(θ), models pose-dependent deformations with respect to the template mesh in the rest pose. The resulting mesh is then posed using the skinning function W(·). Mathematically one can write:
T(β, θ) = T̄ + B_S(β) + B_P(θ)    (1)
M(β, θ) = W(T(β, θ), J(β), θ, 𝒲)    (2)
where the blend skinning function W(·) rotates the rest pose vertices T(β, θ) around the 3D joints J(β) (computed from β), linearly smoothes them with the blend weights 𝒲, and returns the posed vertices M(β, θ). The pose θ is represented by a vector of relative 3D rotations of the 23 joints in axis-angle representation, plus one “joint” for the global rotation.

The core idea of SMPL is to start with an initial body shape and then add linear deformation layers to it. The current layers of SMPL are the shape and pose dependent deformations. Following this spirit, we treat clothing as an additional offset layer from the body and add it on top of the SMPL mesh.
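The layered construction above can be sketched numerically. The following is a minimal numpy sketch under our own (hypothetical) function names and toy tensor shapes, not the actual SMPL implementation; the per-joint rigid transforms are assumed given:

```python
import numpy as np

def shaped_template(t_bar, shape_dirs, pose_dirs, betas, pose_feat):
    # Eq. (1): T(beta, theta) = T_bar + B_S(beta) + B_P(theta).
    # Both blend-shape terms are linear in their inputs.
    b_s = np.einsum('vck,k->vc', shape_dirs, betas)      # shape blend shapes
    b_p = np.einsum('vck,k->vc', pose_dirs, pose_feat)   # pose blend shapes
    return t_bar + b_s + b_p

def linear_blend_skinning(verts, blend_weights, joint_transforms):
    # Eq. (2): each posed vertex is a blend-weighted combination of the
    # per-joint rigid transforms (4x4 matrices) applied to its rest position.
    homo = np.concatenate([verts, np.ones((len(verts), 1))], axis=1)  # (V, 4)
    per_joint = np.einsum('jab,vb->vja', joint_transforms, homo)      # (V, J, 4)
    posed = np.einsum('vj,vja->va', blend_weights, per_joint)
    return posed[:, :3]
```

With identity joint transforms, skinning returns the rest-pose vertices unchanged, which is a quick sanity check for the blend-weight normalization.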
Clothing as Offsets from the SMPL Body. Representing clothing as offsets from the body [9, 10] is practical for a range of clothing types such as pants and shirts. While it does not account for the real physics of cloth, it makes the process of dressing people a simple addition of these offsets to the minimally-clothed body shape.
One key design choice of our method is that we compute the displacements from the minimally-clothed body to the clothing in a canonical space, corresponding to the rest pose configuration (or zero-pose space). We name this space the unposed space. Here we use a 3D scan of the person in minimal clothing to accurately capture their shape. In order to compute the displacements, we first register the SMPL model to both scans using [26], considering all vertices as “skin”. We obtain an SMPL mesh capturing the geometry of the scan, the corresponding pose parameters and the unposed mesh, living in the unposed space. For the mathematical details of registration and the unposing operation, we refer the reader to [26]. In the rest of the paper, T_minimal denotes the unposed “minimal” mesh vertices, T_cloth the unposed clothed mesh vertices, and θ the pose of the clothed mesh. The pose of the minimally-clothed body is not used once it is put in the rest pose. The clothing displacements are d = T_cloth − T_minimal, and correspond to the SMPL mesh topology. We exploit this fixed topology and neighborhood structure during learning. d has non-zero values only on body parts covered with clothes.
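The displacement computation in the unposed space can be sketched as follows; `clothing_mask` is a hypothetical per-vertex boolean array marking the clothed vertices:

```python
import numpy as np

def clothing_displacements(unposed_cloth, unposed_minimal, clothing_mask):
    """Per-vertex displacements d = (clothed vertices) - (minimal vertices),
    computed in the unposed (zero-pose) space. Vertices not covered by
    clothing (head, hands, feet) are zeroed via the boolean mask."""
    d = unposed_cloth - unposed_minimal
    d[~clothing_mask] = 0.0
    return d
```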
Towards “SMPL-Clothed”. Our goal is to effectively extend the SMPL body model with a new clothing layer that controls both the global clothing shape and the local clothing structure. Specifically, we extend the deformed template mesh from Eq. (1) with the new additive clothing layer:
T_clo(β, θ, z, c) = T(β, θ) + clo(z, θ, c)    (3)
where clo(·) is our clothing function. The clothing layer is parametrized by the body pose θ, the clothing type c, and a latent variable z, which defines a point in a learned low-dimensional space that encodes clothing shape and structure. Following Eq. (2), a posed, clothed human body mesh is:
M(β, θ, z, c) = W(T_clo(β, θ, z, c), J(β), θ, 𝒲)    (4)
The key remaining questions are: (1) how to learn a well-structured, low-dimensional latent space of clothing shape and structure; and (2) how to use this latent space to dress SMPL meshes in a coherent and controllable manner.
3.2 Mesh VAE-GAN
To model the clothing term, clo(·), in Eq. (3), we introduce AutoEnClother, a novel generative model for 3D meshes. It has the architecture of a conditional VAE-GAN [67, 68], consisting of an encoder E, a generator G, a condition module and a discriminator D, with mesh convolution and mesh sampling layers as building blocks. Specifically, the model is conditioned on the clothing type and pose parameters and is trained in an adversarial way to allow the capture of realistic local clothing structure.
In this section we review the mesh convolutions, present the new architecture and detail the learning strategy.
Preliminaries: Convolution on Meshes. As described in Sec. 3.1, we compute clothing displacements and associate them with the SMPL mesh topology. In order to learn features on the clothing mesh, we use convolutions on 3D meshes, which can be defined as filtering in the spectral domain [13, 14] and approximated using Chebyshev polynomials [11] of the graph Laplacian [69]. Formally, the Chebyshev convolution is given by:
y_j = Σ_{i=1..F_in} Σ_{k=0..K−1} θ_{i,j,k} T_k(L) x_i    (5)
where x ∈ R^{N×F_in} refers to the input features of the N input vertices, with F_in features at each vertex, and y ∈ R^{N×F_out} are the output features. A mesh has F_in = 3 features per vertex, corresponding to its position in 3D Euclidean space. The learnable weights are θ_{i,j,k}, and T_k are the Chebyshev polynomials as defined in Defferrard et al. [11]. The parameter θ_{i,j,k} maps the i-th input feature to the j-th output feature using the Laplacian, L, of the input mesh. The convolution kernel filters features from the K-ring neighborhood of each vertex. This allows control of the receptive field of the kernels through the parameter K. For more details about mesh convolution and sampling, we refer the reader to [11, 12, 17].
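As a concrete illustration, a single-channel Chebyshev convolution can be sketched as below. A real layer additionally sums over the F_in input channels with separate coefficients θ_{i,j,k} per input/output pair, and uses a rescaled Laplacian; this simplified sketch takes the (rescaled) Laplacian as given:

```python
import numpy as np

def chebyshev_conv(x, lap, theta):
    """Single-channel Chebyshev graph convolution (cf. Eq. 5).
    x: (V,) per-vertex feature, lap: (V, V) rescaled graph Laplacian,
    theta: (K,) polynomial coefficients. Computes
    y = sum_k theta[k] * T_k(lap) @ x via the Chebyshev recurrence
    T_k = 2 * lap @ T_{k-1} - T_{k-2}."""
    t_prev, t_curr = x, lap @ x          # T_0(L) x and T_1(L) x
    y = theta[0] * t_prev
    if len(theta) > 1:
        y = y + theta[1] * t_curr
    for k in range(2, len(theta)):
        t_next = 2.0 * (lap @ t_curr) - t_prev
        y = y + theta[k] * t_next
        t_prev, t_curr = t_curr, t_next
    return y
```

Because T_k(L) is a degree-k polynomial in the Laplacian, the K coefficients correspond directly to the K-ring receptive field mentioned above.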
3.2.1 Model Architecture
The VAE-GAN model proposed by [67] is obtained by combining a VAE [20] and a GAN [23] and consists of an encoder E, a generator G and a discriminator D. The encoder maps the data x into a low-dimensional latent code z. The generator tries to reconstruct the data from its latent code: x̂ = G(z). A multivariate Gaussian prior of zero mean and unit variance is imposed on the latent code z. The discriminator D is used for adversarial training and is trained to discriminate generated samples x̂ from the real samples x.

We extend the VAE-GAN to the 3D mesh domain by: (1) building the encoder and generator with mesh-residual blocks so as to learn deep features; and (2) introducing a novel mesh patch-wise discriminator for capturing fine local structures. The entire network is trained in an end-to-end manner.
Figure 2 shows the overall network design, and architecture details are provided in Appendix B. The following notation is used in the rest of the paper unless otherwise noted: input mesh x, its reconstruction x̂, latent code z, and condition c.
Encoder-Generator Module. The generator G of our model takes both the clothing condition c (pose and clothing type) and the latent code z. It is essentially the decoder part of the VAE. It is combined with the encoder E during training: the latent code z from E is concatenated with the condition c and then fed to G. At test time, the encoder is discarded and z is sampled from the Gaussian prior distribution N(0, I). Using an entire VAE instead of only the decoder as the generator is a deliberate choice, as we find in our experiments that training becomes unstable without the encoder.
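The encoder-to-generator hand-off can be sketched as below: a minimal numpy illustration of the standard reparameterized latent code and the condition concatenation, not our actual network code (the encoder producing `mu` and `log_var` is assumed):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    # VAE sampling trick: z = mu + sigma * eps with eps ~ N(0, I),
    # which keeps z differentiable w.r.t. the encoder outputs.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def generator_input(z, c_embed):
    # Training: z comes from the encoder; test: z ~ N(0, I).
    # Either way, z is concatenated with the compressed condition
    # before entering the generator G.
    return np.concatenate([z, c_embed], axis=-1)
```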
Both E and G are feed-forward neural networks built with mesh convolutional layers. A fully-connected layer sits at the end of E and at the beginning of G, respectively, transforming the feature maps to/from the latent code vector.

Stacking mesh convolution layers can, however, smooth out local features [70], which is undesirable for mesh generation because fine details are likely to disappear. We therefore replace the standard mesh convolution layers with mesh residual blocks [71]. The shortcut connections within a residual block enable the use of the low-level features from its input where necessary.
Patch-wise Discriminator Module. We introduce a patch-wise discriminator for meshes, whose counterpart for images has shown success in image-to-image translation problems [66, 72]. Instead of looking at the entire generated mesh, the discriminator only classifies whether a patch is real or fake based on its local structure. Intuitively, this encourages the discriminator to focus on fine details, while the global shape is taken care of by the reconstruction loss. In practice, this is achieved by repeating the convolution-downsampling block several times to obtain a real/fake prediction map. Following [17], we use the quadric-error-based [65] downsampling method.
Condition Module. The condition module transforms the condition c through a small fully-connected network to obtain its compressed representation. This condition representation is then fed to the inputs of G and D.
3.2.2 Learning
For the reconstruction loss, we use an L1 distance over the vertices of the mesh x, because it encourages less smoothing compared to L2:
L_recon = ‖x̂ − x‖_1    (6)
Furthermore, we apply a loss on the face normals to encourage the generation of wrinkles instead of smooth surfaces. Let n_f be the unit normal corresponding to the face f in the set F of mesh faces of the mesh x. We penalize the angle between all corresponding face normals of x̂ and x:
L_normal = Σ_{f ∈ F} (1 − ⟨n̂_f, n_f⟩)    (7)
where ⟨·, ·⟩ is the vector inner product.
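A possible numpy realization of these two reconstruction-side losses is sketched below; the normal term is written as one minus the inner product of corresponding unit normals, which is zero when the normals coincide and grows with their angle:

```python
import numpy as np

def l1_loss(x_hat, x):
    # Eq. (6): L1 distance over all vertex coordinates.
    return float(np.abs(x_hat - x).sum())

def face_normals(verts, faces):
    # Unit normal of each triangle via the cross product of two edges.
    v0, v1, v2 = verts[faces[:, 0]], verts[faces[:, 1]], verts[faces[:, 2]]
    n = np.cross(v1 - v0, v2 - v0)
    return n / np.linalg.norm(n, axis=1, keepdims=True)

def normal_loss(verts_hat, verts, faces):
    # Eq. (7): penalize misaligned face normals, 1 - <n_hat, n> per face
    # (averaged here; the paper sums over the face set).
    n_hat = face_normals(verts_hat, faces)
    n = face_normals(verts, faces)
    return float(np.mean(1.0 - np.sum(n_hat * n, axis=1)))
```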
The loss on the latent distribution is formulated as the KL divergence between the posterior and the prior:
L_KL = KL(q(z | x) ‖ p(z))    (8)
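For a diagonal-Gaussian posterior and a standard-normal prior, this KL term has the usual closed form, sketched here:

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    # Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ):
    # 0.5 * sum( sigma^2 + mu^2 - 1 - log sigma^2 ).
    return float(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))
```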
The generator G and discriminator D are trained as opponents with the adversarial loss:
L_adv = E_{x∼p(x)}[log D(x, c)] + E_{z∼p(z)}[log(1 − D(G(z, c), c))]    (9)
where G tries to minimize this loss against the discriminator D, which aims to maximize it.
Overall Objective. All the losses are weighted by coefficients that balance the quality of generation and sampling. The overall objective is written as:
L = L_recon + λ_normal L_normal + λ_KL L_KL + λ_adv L_adv    (10)
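The weighting scheme is a plain weighted sum; the coefficient values in this sketch are illustrative placeholders, not the ones used in our experiments:

```python
def total_loss(l_recon, l_normal, l_kl, l_adv,
               w_normal=1.0, w_kl=1e-3, w_adv=0.1):
    # Eq. (10): weighted sum of the four loss terms. The default weights
    # here are hypothetical; in practice they are tuned to balance
    # reconstruction quality against sampling quality.
    return l_recon + w_normal * l_normal + w_kl * l_kl + w_adv * l_adv
```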
3.3 AutoEnClother dresses SMPL
Now we can model the clothing layer in Eq. (3) with AutoEnClother. We train AutoEnClother using the clothing displacements d (associated with the SMPL mesh topology, Sec. 3.1) as the data x, and for the condition we choose the pose θ and the clothing type c. The latent variable z encodes information that is influenced by all other factors. Consequently, the actual input to the generator, (z, θ, c), is essentially disentangled into z, pose and clothing type. Note that, as a unified framework, AutoEnClother is not limited to these two factors. It can scale to take more conditions (e.g., tightness and fabric), provided that data for training is available. When all such factors are controllable, the latent distribution degenerates to a Dirac delta function and the model becomes a regressor.
The clothing type c is a 4-dimensional one-hot vector corresponding to the 4 types of clothing (see Sec. 4) in our dataset. Given data for other clothing types, the extension is straightforward, as long as the garment does not differ largely from the geometry of the human body. The 24 × 3 dimensional SMPL pose parameter representation is flattened into a 72-dimensional vector to be used in AutoEnClother.
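Constructing this condition vector can be sketched as follows; the label ordering below is our own assumption, since only the dimensionality of the one-hot encoding is specified above:

```python
import numpy as np

# Hypothetical label order: the paper only fixes that clothing type
# is a 4-dimensional one-hot vector.
CLOTHING_TYPES = ['short-long', 'short-short', 'long-short', 'long-long']

def condition_vector(pose_axis_angle, clothing_type):
    # Pose: 24 joints x 3 axis-angle values, flattened to a 72-D vector.
    pose = np.asarray(pose_axis_angle, dtype=float).reshape(72)
    one_hot = np.zeros(len(CLOTHING_TYPES))
    one_hot[CLOTHING_TYPES.index(clothing_type)] = 1.0
    return np.concatenate([pose, one_hot])
```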
Finally, the clothing term from Eq. (3) can be concretely written as:
clo(z, θ, c) = G(z, θ, c; ω)    (11)
where G is the generator of our network, and ω denotes the learned parameters of the model.
AutoEnClother effectively dresses SMPL. For a specific pose θ and clothing type c, one can randomly sample z in the latent space and generate varied and unique samples of 3D clothing that satisfy the given condition. Similarly, for a fixed latent code z, one can obtain varied clothing shapes as a function of pose and clothing type.
4 Data and Training
We demonstrate the characteristics of our model in two respects: representation capacity and generation performance. The model is trained and tested on a real-world scan dataset, and the generated clothing meshes are evaluated perceptually via a user study.
4D Clothing Scan Dataset. We captured temporal sequences of 3D body scans with a high-resolution body scanner (3dMD LLC, Atlanta, GA). Each subject was scanned once in the minimal clothing condition. Then, subjects in the clothed condition performed several predefined motion sequences, in four different clothing types: short-sleeve T-shirt with long pants (short-long), short-sleeve T-shirt with short pants (short-short), as well as the opposite combinations long-short and long-long. This allowed us to capture a variety of wrinkles in different poses and clothing types. The minimal setting was used to obtain an estimate of the “naked” body shape; the subjects were scanned in tight-fitting sports underwear. This scan could also be replaced by an automatic body shape estimation method [26, 34, 73]. The scans were registered using the single-mesh alignment method from [9], and the clothing displacements were computed as described in Sec. 3.1.
Our dataset contains 31,402 examples (frames) from 117 motion sequences by 7 male subjects in 4 types of clothing, as well as 4,318 examples from 18 sequences by 5 female subjects in short-long. It contains the unposed registrations (aligned to the scans) and the clothing displacements, as well as the pose parameters and clothing type labels. The subjects gave informed written consent to participate.
Implementation Details. We train our model for males and females separately. We split the male dataset into 22,082 training examples; the remaining 9,320 examples are held out for different test scenarios (see Sec. 5.1). The female dataset is split into a training set with 4,093 examples and a test set with 225 examples. The following quantitative results are reported on the male dataset due to its higher diversity in subjects, clothing types and motion sequences.
Note that, while we train our model with 4D scan data, it can take meshes from other sources (e.g., clothing simulations), as long as these have the SMPL mesh topology.
The model is trained for 150 epochs using stochastic gradient descent with momentum, an initial learning rate, and a learning-rate decay applied after every epoch. The Chebyshev convolutions use a fixed polynomial order K. An L2 weight decay is used as regularization. The pose and clothing type conditions are compressed to compact vectors. Batch normalization is not used in our network, as we noticed that it leads to unstable training.
5 Experiments
Evaluation Metrics. Typical evaluation metrics for generative models, such as the Inception Score [74] and FID [75], are not available for 3D meshes. Therefore, we evaluate our model’s representation power by computing the per-vertex Euclidean auto-encoding error on clothing meshes. When averaging the error over vertices, we exclude the vertices of the head, fingers, toes, hands and feet, as they are not covered with clothing.

To test whether the conditionally generated results of our method look realistic, we performed a user study on Amazon Mechanical Turk (AMT). Virtual avatars were dressed in 3D and rendered into front-view images. Following the protocol from [66], raters were presented with a series of “real vs. fake” trials. On each trial, the rater is presented with a “real” mesh rendering (randomly picked from our dataset, i.e., alignments of the scan data) and a “fake” rendering (a mesh generated by our algorithm). Both images are shown side-by-side. The raters are asked to pick the one that they think is real. Each pair of renderings is evaluated by 10 raters. Unlike [66], we present both real and fake renderings simultaneously, do not set a time limit for the raters, and allow zooming in for detailed comparison. In this setting, the best score that a method can obtain is 50%, meaning that the real and fake examples are indistinguishable.
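The masked per-vertex error metric can be sketched as follows; `clothing_mask` is a hypothetical boolean array excluding the unclothed vertices:

```python
import numpy as np

def pervertex_error(pred, gt, clothing_mask):
    """Mean per-vertex Euclidean distance between a reconstruction and the
    ground truth, restricted to vertices covered by clothing (head,
    fingers, toes, hands and feet are excluded via the boolean mask)."""
    dist = np.linalg.norm(pred - gt, axis=1)
    return float(dist[clothing_mask].mean())
```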
5.1 Model Capability
Table 1: Per-vertex reconstruction errors (mean ± std).

                         | seen subj,  | unseen subj, | unseen subj,
                         | unseen seq  | seen seq     | unseen seq
  Baseline Comparison
  PCA                    | 0.59 ± 0.51 | 1.19 ± 0.77  | 1.31 ± 0.85
  CoMA-1                 | 0.63 ± 0.56 | 1.38 ± 0.86  | 1.51 ± 0.94
  CoMA-4                 | 0.69 ± 0.58 | 1.51 ± 0.85  | 1.66 ± 0.87
  AutoEnClother (full)   | 0.56 ± 0.51 | 1.14 ± 0.72  | 1.19 ± 0.77
  Ablation Study
  no discriminator       | 0.57 ± 0.53 | 1.16 ± 0.73  | 1.23 ± 0.78
  global discriminator   | 0.58 ± 0.53 | 1.24 ± 0.78  | 1.33 ± 0.83
  no res-block           | 0.59 ± 0.52 | 1.20 ± 0.79  | 1.28 ± 0.86
  no normal loss         | 0.57 ± 0.51 | 1.16 ± 0.71  | 1.24 ± 0.76
Reconstruction and Generalization. The reconstruction accuracy reflects the capability of the model to encode clothing while preserving the original geometry. We compare against the state-of-the-art convolutional mesh autoencoder CoMA [17] and a PCA model. For a fair comparison, we compare to both the original CoMA with 4× downsampling (denoted “CoMA-4”) and a variant without downsampling (as in our model), denoted “CoMA-1”. We use the same latent space dimension (number of principal components in the case of PCA) and hyperparameter settings, where applicable, for all models.

We evaluate the reconstruction accuracy in three scenarios: (1) seen subject, unseen motion sequence: the training set includes the subject, but does not include the test motion sequence. Analogously, we define the scenarios (2) unseen subject, seen motion sequence, and (3) unseen subject, unseen motion sequence.
The seen subject, unseen motion setting measures the pose generalization capability of the models. The model is tested on three sequences excluded from training: ballerina-spin, pose-like-a-model and soccer-kick, which include challenging clothing deformations from extreme poses such as kicking and raising the arms. The unseen subject, seen sequence scenario is more challenging, because even with a predefined motion sequence, different subjects will perform it in their own way. That is, the model is tested on poses similar, but not identical, to those in training. The models need to simultaneously interpolate between poses and extrapolate to new body shapes. This scenario was evaluated on two subjects that were left out of training, using the motions hips-shake, shoulders-mill and ATU-squat, which cover a range of difficulty in terms of clothing deformations. The unseen subject, unseen motion case is the most challenging, as it requires the models to extrapolate both pose deformations and body shapes.

Reconstruction errors are reported in Table 1. While our model focuses on mesh generation, it also outperforms the baselines in the reconstruction task. Figure 3 shows an example from a ballerina-spin test pose. CoMA with downsampling produces smoothed results; improvements are seen when the downsampling layer is removed (CoMA-1). The PCA model keeps wrinkles and boundaries, but long-range correlations are missing: the rising hem on the left side completely disappears. Our model manages to capture both local structures and global correlations.
Ablation Study. In Table 1, we also present an ablation study of our architecture. We remove one component of our model at a time and evaluate the resulting performance. We observe that both the discriminator and the residual blocks play an important role in improving the performance of the model. Further, there is an improvement when using the patch-wise discriminator on meshes instead of a global discriminator. We also note that the normal loss leads to a slight improvement in accuracy.
5.2 Conditional Generation of Clothing
Next, we evaluate the conditional generation characteristics of our model. Our clothing generation model has three parameters: z, θ and c (see Eq. (3)). We present different scenarios and show how the conditioning affects the generated clothing shape.
Sample in clothing type and latent space. We first fix θ, and for each clothing type (short-long, short-short, long-long and long-short) we sample z. Examples of generated clothing are applied to an unseen body. Figure 4 presents examples obtained on a body in an A-pose. Each row represents one of the four clothing types and each column has a different z sampled from N(0, I). Examples from the same column correspond to the same z value and hence have similar clothing style (tightness and wrinkles).
In the AMT user study, we tested 400 generated examples (100 per category). On average, of the time the raters preferred our synthesized results over renderings from real data.
Sample in pose and latent space. Now we fix the clothing type, to short-long for example, and generate clothing conditioned on different poses that are not used in training. Again we generate different clothing by sampling z from N(0, I). The generated clothing displacements are again applied to a subject not present in the training data. As shown in Figure 5, our model manages to capture long-range correlations within a mesh, such as the elevated hem as the subject raises the arms, and the lateral wrinkle on the back as he shrugs. The model also synthesizes local details, such as wrinkles in the armpit area and boundaries at cuffs and collars. Here we carry out the AMT evaluation with 300 generated meshes in various poses; of the time the raters rated our results as real.
Animating a pose sequence. This scenario represents the use case where a user has an initial body (not seen in training) and selects a clothing type from the available ones. The user then wants to generate an animated sequence reproducing a motion (unseen in training). The challenge here is to have a clothing shape that is consistent across poses, yet deforms plausibly.

In this case we fix z and use the poses from an entire unseen motion sequence (see Appendix A for more detail). This scenario is difficult because dynamics should be taken into account in the model in order to achieve accurate temporal consistency. As our model does not account for temporal consistency, small flickering artifacts can be seen in the generated trajectories; we therefore do not evaluate this scenario in the perceptual study. However, although modeling dynamics is out of the scope of this work, our model manages to produce coherent, if still improvable, clothing deformations over time.
Interpolate between clothing types. The clothing type is represented as a discrete one-hot vector; it is, however, possible to interpolate between different states and obtain intermediate results. Figure 6 shows an example transitioning from long-short to short-long.
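This interpolation is a simple linear blend of the two condition vectors; a sketch:

```python
import numpy as np

def interpolate_clothing_type(c_a, c_b, t):
    # Linear interpolation between two one-hot clothing-type vectors;
    # intermediate soft labels (t in (0, 1)) yield blended garment geometry.
    return (1.0 - t) * np.asarray(c_a, dtype=float) + t * np.asarray(c_b, dtype=float)
```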
AutoEnClother enhanced by SMPL textures. As our model has the same topology as SMPL, it is compatible with all existing SMPL texture maps. Figure 7 shows an example texture applied to the standard “naked” SMPL model (as done in the SURREAL dataset [2]) and to our clothed body model, respectively. Although the texture creates an illusion of clothing on the SMPL body, the overall shape remains skinny, over-smoothed, and hence unrealistic. In contrast, our model, with its improved clothing geometry, matches the clothing texture more naturally when the correct clothing type is given. As a future line of research, we plan to model the alignment between the clothing texture boundaries and the underlying geometry by learning a clothing appearance model that is coupled to shape.
6 Conclusions, Limitations, Future Work
We have introduced a novel generative model for 3D meshes that enables us to condition, sample, and preserve detail. We apply this model to clothing deformations from a 3D body mesh and condition the latent space on body pose and clothing type. The generated clothing mesh is added as an extra layer to the SMPL body model, making it possible to dress any body shape with realistic clothing and then animate it in motion. This capability is of wide applicability in computer vision and provides a practical extension to current body modeling technology.
There are several limitations of our approach that point to future work. As our clothing model is anchored to the SMPL mesh topology, it can only handle garments whose geometry is similar to that of the human body; skirts, open jackets, and multiple layers need to be represented in other forms. Similar to what was done for body shape [76], we will build a dynamic clothing model in which the clothing deformation depends on the state of the previous time step. Here we showed how multiple garments can be captured, but we plan to extend this to a much wider wardrobe. With more data, we could also condition on fabric type, clothing size, and body shape.
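The planned dynamics extension could be sketched as a first-order recurrence; the function and its mixing coefficient are hypothetical, not part of the current model:

```python
import numpy as np

def dyn_step(prev_disp, pose, alpha=0.9):
    """Hypothetical first-order dynamics: the new clothing displacement
    depends on the previous time step and the current pose condition."""
    static = 0.01 * np.tanh(pose[:3]).mean() * np.ones_like(prev_disp)
    return alpha * prev_disp + (1.0 - alpha) * static

rng = np.random.default_rng(2)
disp = np.zeros((6890, 3))                     # rest-state clothing layer
for pose_t in rng.standard_normal((10, 72)):   # 10 frames of SMPL poses
    disp = dyn_step(disp, pose_t)
```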
Acknowledgements: We thank J. Tesch for rendering the results, P. Karasik for the help with Amazon Mechanical Turk, and T. Alexiadis and A. Keller for building the dataset. We thank P. Ghosh, T. Bolkart and Y. Zhang for useful discussions. Q. Ma and S. Tang acknowledge funding by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), Projektnummer 276693517 SFB 1233. G. Pons-Moll is funded by the Emmy Noether Programme, Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), 409792180.
Disclosure: Michael J. Black has received research gift funds from Intel, Nvidia, Adobe, Facebook, and Amazon. While he is a part-time employee of Amazon and has financial interests in Amazon and Meshcapade GmbH, his research was performed solely at, and funded solely by, MPI.
References
[1] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):248, 2015.

[2] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid. Learning from synthetic humans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 109–117, 2017.
[3] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik. End-to-end recovery of human shape and pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7122–7131, 2018.
[4] M. Omran, C. Lassner, G. Pons-Moll, P. Gehler, and B. Schiele. Neural body fitting: Unifying deep learning and model-based human pose and shape estimation. In 2018 International Conference on 3D Vision (3DV), pages 484–494. IEEE, 2018.
[5] R. Alp Güler, N. Neverova, and I. Kokkinos. DensePose: Dense human pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7297–7306, 2018.
 [6] C. Lassner, J. Romero, M. Kiefel, F. Bogo, M. J. Black, and P. V. Gehler. Unite the people: Closing the loop between 3d and 2d human representations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6050–6059, 2017.
 [7] G. Pavlakos, L. Zhu, X. Zhou, and K. Daniilidis. Learning to estimate 3d human pose and shape from a single color image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 459–468, 2018.
 [8] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. A. Osman, D. Tzionas, and M. J. Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019.
[9] G. Pons-Moll, S. Pujades, S. Hu, and M. J. Black. ClothCap: Seamless 4D clothing capture and retargeting. ACM Transactions on Graphics (TOG), 36(4):73, 2017.
 [10] P. Guan, L. Reiss, D. A. Hirshberg, A. Weiss, and M. J. Black. Drape: Dressing any person. ACM Trans. Graph., 31(4):35–1, 2012.
 [11] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
[12] T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
 [13] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.
 [14] M. Henaff, J. Bruna, and Y. LeCun. Deep convolutional networks on graphstructured data. arXiv preprint arXiv:1506.05163, 2015.
[15] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.
 [16] J. Atwood and D. Towsley. Diffusionconvolutional neural networks. In Advances in Neural Information Processing Systems, pages 1993–2001, 2016.
 [17] A. Ranjan, T. Bolkart, S. Sanyal, and M. J. Black. Generating 3d faces using convolutional mesh autoencoders. In Proceedings of the European Conference on Computer Vision (ECCV), pages 704–720, 2018.
[18] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang. Pixel2Mesh: Generating 3D mesh models from single RGB images. In ECCV, 2018.
[19] E. J. Smith, S. Fujimoto, A. Romero, and D. Meger. GEOMetrics: Exploiting geometric structure for graph-encoded objects. arXiv preprint arXiv:1901.11461, 2019.
[20] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 [21] K. Sohn, H. Lee, and X. Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pages 3483–3491, 2015.
[22] Z. Lähner, D. Cremers, and T. Tung. DeepWrinkles: Accurate and realistic clothing modeling. In European Conference on Computer Vision, pages 698–715. Springer, 2018.
[23] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 [24] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[25] S. Cheng, M. Bronstein, Y. Zhou, I. Kotsia, M. Pantic, and S. Zafeiriou. MeshGAN: Non-linear 3D morphable models of faces. arXiv preprint arXiv:1903.10384, 2019.
[26] C. Zhang, S. Pujades, M. J. Black, and G. Pons-Moll. Detailed, accurate, human shape estimation from clothed 3D scan sequences. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
[27] N. Magnenat-Thalmann, H. Seo, and F. Cordier. Automatic modeling of virtual humans and body clothing. Journal of Computer Science and Technology, 19(5):575–584, 2004.
[28] E. de Aguiar, L. Sigal, A. Treuille, and J. K. Hodgins. Stable spaces for real-time clothing. ACM Trans. Graph., 29(4):106:1–106:9, July 2010.
 [29] L. Sigal, M. Mahler, S. Diaz, K. McIntosh, E. Carter, T. Richards, and J. Hodgins. A perceptual control space for garment simulation. ACM Transactions on Graphics (TOG), 34(4):117, 2015.
[30] D. Kim, W. Koh, R. Narain, K. Fatahalian, A. Treuille, and J. F. O’Brien. Near-exhaustive precomputation of secondary cloth effects. ACM Transactions on Graphics, 32(4):87:1–7, July 2013. Proceedings of ACM SIGGRAPH 2013, Anaheim.
 [31] T. Y. Wang, D. Ceylan, J. Popović, and N. J. Mitra. Learning a shared shape space for multimodal garment design. In SIGGRAPH Asia 2018 Technical Papers, page 203. ACM, 2018.
[32] I. Santesteban, M. A. Otaduy, and D. Casas. Learning-based animation of clothing for virtual try-on. arXiv preprint arXiv:1903.07190, 2019.
 [33] A. O. Bălan and M. J. Black. The naked truth: Estimating body shape under clothing. In European Conference on Computer Vision, pages 15–29. Springer, 2008.
[34] J. Yang, J.-S. Franco, F. Hétroy-Wheeler, and S. Wuhrer. Estimation of human body shape in motion with wide clothing. In European Conference on Computer Vision, pages 439–454. Springer, 2016.
[35] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll. Video based reconstruction of 3D people models. In IEEE Conf. on Computer Vision and Pattern Recognition, 2018.
[36] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll. Detailed human avatars from monocular video. In International Conf. on 3D Vision, September 2018.
[37] R. A. Newcombe, D. Fox, and S. M. Seitz. DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In IEEE Conf. on Computer Vision and Pattern Recognition, pages 343–352, 2015.
[38] T. Yu, Z. Zheng, K. Guo, J. Zhao, Q. Dai, H. Li, G. Pons-Moll, and Y. Liu. DoubleFusion: Real-time capture of human performance with inner body shape from a depth sensor. In IEEE Conf. on Computer Vision and Pattern Recognition, 2018.
[39] T. Yu, Z. Zheng, Y. Zhong, J. Zhao, Q. Dai, G. Pons-Moll, and Y. Liu. SimulCap: Single-view human performance capture with cloth simulation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 [40] D. Bradley, T. Popa, A. Sheffer, W. Heidrich, and T. Boubekeur. Markerless garment capture. In ACM Transactions on Graphics (TOG), volume 27, page 99. ACM, 2008.
[41] R. Daněřek, E. Dibra, C. Öztireli, R. Ziegler, and M. Gross. DeepGarment: 3D garment shape estimation from a single image. In Computer Graphics Forum, volume 36, pages 269–280. Wiley Online Library, 2017.
[42] T. Alldieck, M. Magnor, B. L. Bhatnagar, C. Theobalt, and G. Pons-Moll. Learning to reconstruct people in clothing from a single RGB camera. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
 [43] A. Neophytou and A. Hilton. A layered model of human body and garment deformation. In International Conference on 3D Vision, 2014.
 [44] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis. Scape: shape completion and animation of people. In ACM transactions on graphics (TOG), volume 24, pages 408–416. ACM, 2005.
[45] J. Yang, J.-S. Franco, F. Hétroy-Wheeler, and S. Wuhrer. Analyzing clothing layer deformation statistics of 3D human motions. In Proceedings of the European Conference on Computer Vision (ECCV), pages 237–253, 2018.
[46] C. Stoll, J. Gall, E. de Aguiar, S. Thrun, and C. Theobalt. Video-based reconstruction of animatable human characters. In ACM SIGGRAPH ASIA, 2010.
 [47] H. Zhu, X. Zuo, S. Wang, X. Cao, and R. Yang. Detailed human shape estimation from a single image by hierarchical mesh deformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4491–4500, 2019.
[48] R. Natsume, S. Saito, Z. Huang, W. Chen, C. Ma, H. Li, and S. Morishima. SiCloPe: Silhouette-based clothed people. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[49] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. arXiv preprint arXiv:1905.05172, 2019.
[50] S. Wold, K. Esbensen, and P. Geladi. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3):37–52, 1987.
 [51] F. De la Torre and M. J. Black. Robust principal component analysis for computer vision. In Proceedings Eighth IEEE International Conference on Computer Vision. ICCV 2001, volume 1, pages 362–369. IEEE, 2001.
[52] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[53] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a probabilistic latent space of object shapes via 3D generative-adversarial modeling. In Advances in Neural Information Processing Systems, pages 82–90, 2016.
[54] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao. 3D ShapeNets: A deep representation for volumetric shapes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1912–1920, 2015.
 [55] D. Maturana and S. Scherer. Voxnet: A 3d convolutional neural network for realtime object recognition. In Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on, pages 922–928. IEEE, 2015.
 [56] C. Häne, S. Tulsiani, and J. Malik. Hierarchical surface prediction for 3d object reconstruction. In 3D Vision (3DV), 2017 International Conference on, pages 412–420. IEEE, 2017.
 [57] M. Tatarchenko, A. Dosovitskiy, and T. Brox. Octree generating networks: Efficient convolutional architectures for highresolution 3d outputs. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), volume 2, page 8, 2017.
[58] C. R. Qi, H. Su, K. Mo, and L. J. Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 1(2):4, 2017.
[59] C. R. Qi, L. Yi, H. Su, and L. J. Guibas. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Advances in Neural Information Processing Systems, pages 5099–5108, 2017.
[60] M. Atzmon, H. Maron, and Y. Lipman. Point convolutional neural networks by extension operators. arXiv preprint arXiv:1803.10091, 2018.
[61] R. Klokov and V. Lempitsky. Escape from cells: Deep kd-networks for the recognition of 3D point cloud models. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 863–872. IEEE, 2017.
[62] Y. Wang, Y. Sun, Z. Liu, S. E. Sarma, M. M. Bronstein, and J. M. Solomon. Dynamic graph CNN for learning on point clouds. arXiv preprint arXiv:1801.07829, 2018.
[63] N. Verma, E. Boyer, and J. Verbeek. FeaStNet: Feature-steered graph convolutions for 3D shape analysis. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 [64] O. Litany, A. Bronstein, M. Bronstein, and A. Makadia. Deformable shape completion with graph convolutional autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1886–1895, 2018.
[65] M. Garland and P. S. Heckbert. Surface simplification using quadric error metrics. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, pages 209–216. ACM Press/Addison-Wesley Publishing Co., 1997.

[66] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1125–1134, 2017.
[67] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
[68] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. CVAE-GAN: Fine-grained image generation through asymmetric training. In Proceedings of the IEEE International Conference on Computer Vision, pages 2745–2754, 2017.
 [69] F. R. Chung and F. C. Graham. Spectral graph theory. Number 92. American Mathematical Soc., 1997.

[70] Q. Li, Z. Han, and X.-M. Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[71] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[72] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision, 2017.
 [73] S. Wuhrer, L. Pishchulin, A. Brunton, C. Shu, and J. Lang. Estimation of human body shape and posture under clothing. Computer Vision and Image Understanding, 127:31–42, 2014.
[74] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
[75] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
[76] G. Pons-Moll, J. Romero, N. Mahmood, and M. J. Black. Dyna: A model of dynamic human shape in motion. ACM Transactions on Graphics (TOG), 34(4):120, 2015.
Appendix A Experiment: Animate a Pose Sequence
Here we elaborate on the third experiment described in Section 5.2. We compare two conditions:

Clothing displacements taken from a single captured mesh and applied unchanged to every frame of the pose sequence.

Results produced by our AutoEnClother model: we fix the sampled latent code and the clothing type, and only change the pose condition by feeding SMPL pose parameters from the sequence.
For the first condition, despite the sharp wrinkles (which are directly taken from a captured mesh), the results look as if the subject were wearing an elastic “swimsuit”: no matter how the avatar moves, the clothing remains tightly fitted to the body and stretches as the body stretches.
In contrast, our AutoEnClother model produces small pose-corrective offsets that change the mesh more naturally, reducing this effect, as shown in Figure 8. When the subject opens his arms, the clothing below the armpit area “inflates”. A similar change occurs when the subject raises his arms: the cloth on the back drops naturally with gravity. Note that we do not explicitly integrate physics constraints into our model; such corrections are learned from data.
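The two conditions can be sketched as follows, with a hypothetical `pose_corrective` standing in for our learned pose-dependent offsets:

```python
import numpy as np

rng = np.random.default_rng(3)
body = rng.standard_normal((6890, 3))        # posed SMPL vertices (toy values)
captured_disp = 0.01 * np.ones((6890, 3))    # offsets taken from one scan

# Condition 1: the same captured offsets on every frame ("swimsuit" effect).
frame_fixed = body + captured_disp

def pose_corrective(pose):
    """Hypothetical pose-dependent corrective offsets learned from data."""
    return captured_disp * (1.0 + 0.1 * np.tanh(pose[:1]))

# Condition 2 (ours): offsets are regenerated for each frame's pose.
pose = rng.standard_normal(72)
frame_ours = body + pose_corrective(pose)
```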
Appendix B Detailed Network Architecture
We use the following notation:

x: data; c: condition; z: latent code; D(x): the prediction map from the discriminator;

Conv(k): Chebyshev mesh convolution layer with k filters;

DS(r): linear mesh downsampling layer with rate r;

FC: fully connected layer;

Res(k): residual block that uses Conv(k) as its filters;

Condition Module:
for pose:
for clothing type:
Encoder:
Generator:
Discriminator:
Residual Block:
where ⊕ denotes element-wise addition.
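A minimal sketch of the Chebyshev mesh convolution and the residual block, with scalar filter coefficients standing in for the per-channel weights of the actual layers:

```python
import numpy as np

def cheb_conv(x, L, theta):
    """Chebyshev graph convolution of order K = len(theta) (sketch):
    y = sum_k theta_k * T_k(L) x, where T_0(L)x = x, T_1(L)x = Lx, and
    T_k(L)x = 2L T_{k-1}(L)x - T_{k-2}(L)x. L is the (rescaled) graph
    Laplacian of the mesh; x holds per-vertex features."""
    Tx = [x, L @ x]
    for k in range(2, len(theta)):
        Tx.append(2 * L @ Tx[-1] - Tx[-2])
    return sum(t * tk for t, tk in zip(theta, Tx))

def res_block(x, L, theta1, theta2):
    """Residual block: two Chebyshev convolutions with a ReLU in between,
    plus a skip connection (the element-wise addition in the notation)."""
    h = np.maximum(cheb_conv(x, L, theta1), 0.0)   # ReLU
    return x + cheb_conv(h, L, theta2)
```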