Existing generative human models [6, 23, 34, 38] are becoming increasingly expressive and successfully capture the statistics of human shape and pose deformation, but still miss an important component: clothing. This leads to several problems in various applications. For example, when body models are used to generate synthetic training data [21, 44, 45, 51], the minimal body geometry results in a significant domain gap between synthetic and real images of humans. Deep learning methods reconstruct human shape from images based on minimally dressed human models [5, 24, 27, 28, 31, 37, 38, 39]. Although the body pose matches the image observation, the minimal body geometry does not match clothed humans in most cases. These problems motivate the need for a parametric clothed human model.
Our goal is to create a generative model of clothed human bodies that is low-dimensional, easy to pose, differentiable, can represent different clothing types on different body shapes and poses, and produces visually plausible results. To achieve this, we extend SMPL and factorize clothing shape from the undressed body, treating clothing as an additive displacement on the canonical pose (see Fig. 2). The learned clothing layer is compatible with the SMPL body model by design, enabling easy re-posing and animation. We formulate the problem of modeling the clothing layer as a conditional generative task. Since the clothing shape is intrinsically stochastic, for a single pose and body shape, multiple clothing deformations can be sampled from our model. Our model, called CAPE for “Clothed Auto Person Encoding”, is conditioned on clothing types and body poses, so that it captures different types of clothing, and can generate pose-dependent deformations, which are important to realistically model clothing.
We illustrate the key elements of our model in Fig. 1. Given a SMPL body shape, pose and clothing type, CAPE can generate different structures of clothing by sampling a learned latent space. The resulting clothing layer plausibly adapts to different body shapes and poses.
We represent clothing as a displacement layer using a graph that inherits the topology of SMPL. Each node in this graph represents the 3-dimensional offset vector from its corresponding vertex on the underlying body. To learn a generative model for such graphs, we build a graph convolutional neural network (Sec. 4), under the framework of a VAE-GAN [7, 30], using graph convolutions and mesh sampling as the backbone layers. This addresses the problem with existing generative models designed for 3D meshes of human bodies [33, 52] or faces that tend to produce over-smoothed results, which is not suitable for clothing where high-frequency wrinkles matter. Specifically, the GAN module in our system encourages visually plausible wrinkles. We model the GAN using a patch-wise discriminator for mesh-like graphs, and show that it effectively improves the quality of the generated fine structures.
Dataset. We introduce a dataset of 4D captured people performing a variety of pose sequences, in different types of clothing (Sec. 5). Our dataset consists of over 80K frames of 8 male and 3 female subjects captured using a 4D scanner. We use this dataset to train our network, resulting in a parametric generative model of the clothing layer.
Versatility. CAPE is designed to be “plug-and-play” for many applications that already use SMPL. Dressing SMPL with CAPE yields 3D meshes of people in clothing, which can be used for several applications such as generating training data, parametrizing body pose in a deep network, having a clothing “prior”, or as part of a generative analysis-by-synthesis approach [21, 45, 51]. We demonstrate this on the task of image fitting by extending SMPLify  with our model. We show that using CAPE together with SMPLify can significantly improve the quality of reconstructed human bodies in clothing.
In summary, our key contributions are: (1) We formulate the problem of 3D clothing / clothed human modeling as a conditional generative task. (2) For this task, we learn a conditional Mesh-VAE-GAN that captures both global shape and local detail of a mesh, with controlled conditioning based on human pose and clothing types. (3) The learned model can generate pose-dependent deformations of clothing, and generalizes to a variety of garments. (4) We augment the 3D human body model SMPL with our clothing model, and show an application of the enhanced “clothed-SMPL”. (5) We contribute a dataset of 4D scans of clothed humans performing a variety of motion sequences.
2 Related Work
|Yang et al.||Yes||Yes||Yes||No||Yes||No||No|
|Clothing||Wang et al.||Yes||No||No||No||No||Yes||Yes|
|Santesteban et al.||Yes||Yes||Yes||Yes||No||No||No|
Group 1: BodyNet, DeepHuman, SiCloPe, PIFu, MouldingHumans. Group 2: Octopus, MGN, Tex2Shape.
Table 1: Two main classes of 3D clothing methods exist: (1) image reconstruction and capture methods, and (2) clothing models that predict deformation as a function of pose. Within each class, methods differ according to the criteria in the columns.
The capture, reconstruction and modeling of clothing have been widely studied. There are two main classes of methods: (1) reconstruction and capture approaches, which predict 3D clothing from images or point clouds, and (2) parametric models of clothing, which predict how clothing deforms as a function of pose and body shape. Table 1 shows recent methods classified according to these criteria.
Reconstructing 3D humans.
Reconstruction of 3D humans from 2D images and videos is a classical computer vision problem. Most approaches [9, 18, 24, 27, 28, 31, 37, 38, 48] output 3D body meshes from images, but not clothing. This ignores image evidence that may be useful. To reconstruct clothed bodies, methods use volumetric [35, 46, 50, 59] or bi-planar depth representations to model the body and garments as a whole. We refer to these as Group 1 in Table 1. While these methods deal with arbitrary clothing topology and preserve a high level of detail, the reconstructed clothed body is not parametric, which means the pose, shape, and clothing of the reconstruction cannot be controlled or animated.
Another group of methods are based on SMPL [1, 2, 3, 4, 8, 60]. They represent clothing as an offset layer from the underlying body as proposed in ClothCap . We refer to these approaches as Group 2 in Table 1. These methods can change the pose and shape of the reconstruction using the deformation model of SMPL. This assumes clothing deforms like an undressed human body; i.e. that clothing shape and wrinkles do not change as a function of pose. We also use a body-to-cloth offset representation to learn our model, but critically, we learn a neural function mapping from pose to multi-modal clothing offset deformations. Hence, our work differs from these methods in that we learn a parametric model of how clothing deforms with pose.
Parametric models for 3D bodies and clothes. Statistical 3D human body models learned from 3D body scans [6, 23, 34, 38] capture body shape and pose and are an important building block for multiple applications. Most of the time, however, people are dressed and these models do not represent clothing. In addition, clothes deform as we move, producing changing wrinkles at multiple spatial scales. While clothing models learned from real data exist, few generalize to new poses. For example, Neophytou and Hilton learn a layered garment model on top of SCAPE from dynamic sequences, but generalization to novel poses is not demonstrated. Yang et al. train a neural network to regress a PCA-based representation of clothing, but show generalization only on the same sequence or on the same subject. Lähner et al. learn a garment-specific pose-deformation model by regressing low-frequency PCA components and high-frequency normal maps. While the visual quality is good, the model is garment-specific and does not provide a solution for full-body clothing. Similarly, Alldieck et al. use displacement maps with a UV-parametrization to represent surface geometry, but the result is only static. Wang et al. allow manipulation of clothing with sketches in a static pose. The Adam model can be considered clothed, but the shape is very smooth and not pose-dependent. Clothing models have been learned from physics simulation of clothing [17, 19, 47], but visual fidelity is limited by the quality of the simulations. Furthermore, the above methods are regressors that produce single point estimates. In contrast, our model is generative, which allows us to sample clothing.
A conceptually different approach infers the parameters of a physical clothing model from 3D scan sequences. This generalizes to novel poses, but the inference problem is difficult and, unlike our model, the resulting physics simulator is not differentiable with respect to the parameters.
Generative models on 3D meshes. Our generative model predicts clothing displacements on the graph defined by the SMPL mesh using graph convolutions. There is an extensive recent literature on methods and applications of graph convolutions [11, 26, 33, 43, 52]. Most relevant here, Ranjan et al. learn a convolutional autoencoder using graph convolutions with mesh down- and up-sampling layers. Although it works well for faces, the mesh sampling layer makes it difficult to capture the high-frequency wrinkles, which are key in clothing. In our work, we capture high-frequency wrinkles by extending the PatchGAN architecture to 3D meshes.
3 Additive Clothed Human Model
To model clothed human bodies, we factorize them into the minimally-clothed body and a clothing layer represented as displacements from the body. This enables us to naturally extend SMPL to a class of clothing types by treating clothing as an additional additive shape term. Since SMPL is in wide use, our goal is to extend it in a way that is consistent with current uses, making it effectively a “drop in” replacement for SMPL.
3.1 Dressing SMPL
SMPL is a generative model of human bodies that factors the surface of the body into shape ($\beta$) and pose ($\theta$) parameters. As shown in Fig. 2 (a), (b), the architecture of SMPL starts with a triangulated template mesh, $\bar{T}$, in rest pose, defined by $N = 6890$ vertices. Given shape and pose parameters $(\beta, \theta)$, 3D offsets are added to the template, corresponding to shape-dependent deformations ($B_S(\beta)$) and pose-dependent deformations ($B_P(\theta)$). The resulting mesh is then posed using the skinning function $W$. Formally:
$$T(\beta, \theta) = \bar{T} + B_S(\beta) + B_P(\theta) \tag{1}$$
$$M(\beta, \theta) = W(T(\beta, \theta), J(\beta), \theta, \mathcal{W}) \tag{2}$$
where the blend skinning function $W(\cdot)$ rotates the rest pose vertices $T$ around the 3D joints $J$ (computed from $\beta$), linearly smoothes them with the blend weights $\mathcal{W}$, and returns the posed vertices $M$. The pose $\theta$ is represented by a vector of relative 3D rotations of the $K = 23$ joints and the global rotation in axis-angle representation.
SMPL adds linear deformation layers to an initial body shape. Following this, we define clothing as an additional offset layer from the body and add it on top of the SMPL mesh, Fig. 2 (d). In this work, we parametrize the clothing layer by the body pose $\theta$, clothing type $c$ and a low-dimensional latent variable $z$ that encodes clothing shape and structure.
Let $S_{clo}(z, \theta, c)$ be the clothing displacement layer. We extend Eq. (1) to a clothed body template in the rest pose:
$$T_{clo}(\beta, \theta, c, z) = T(\beta, \theta) + S_{clo}(z, \theta, c) \tag{3}$$
Note that the clothing displacements, $S_{clo}(z, \theta, c)$, are pose-dependent. The final clothed template is then posed with the SMPL skinning function, Eq. (2):
$$M(\beta, \theta, c, z) = W(T_{clo}(\beta, \theta, c, z), J(\beta), \theta, \mathcal{W}) \tag{4}$$
This differs from simply applying blend skinning with fixed displacements, as done in e.g. [1, 8]. Here, we train the model such that pose-dependent clothing displacements in the template pose are correct once posed by blend skinning.
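As an illustration of the additive formulation, the following is a minimal NumPy sketch that composes a clothed rest-pose template from body and clothing offsets and then poses it with linear blend skinning. It is a sketch under stated assumptions, not the SMPL implementation: the function names `clothed_template` and `blend_skinning` are hypothetical, the per-joint rigid transforms are passed in directly (SMPL derives them from the kinematic chain and the joint locations), and the toy mesh has only four vertices and two joints.

```python
import numpy as np

def clothed_template(template, shape_offsets, pose_offsets, clothing_offsets):
    """Rest-pose clothed template: body template plus additive offset layers.

    All arguments are (N, 3) arrays defined on the same mesh topology.
    """
    return template + shape_offsets + pose_offsets + clothing_offsets

def blend_skinning(verts, joint_transforms, weights):
    """Linear blend skinning (hypothetical helper, not the SMPL code).

    verts:            (N, 3) rest-pose vertices
    joint_transforms: (K, 4, 4) rigid transform of each joint, rest -> posed
    weights:          (N, K) blend weights, each row sums to 1
    """
    homo = np.concatenate([verts, np.ones((verts.shape[0], 1))], axis=1)
    # Blend the per-joint transforms for every vertex: (N, 4, 4).
    per_vertex = np.einsum("nk,kij->nij", weights, joint_transforms)
    posed = np.einsum("nij,nj->ni", per_vertex, homo)
    return posed[:, :3]

# Toy example: 4 vertices, 2 joints.
rng = np.random.default_rng(0)
template = rng.normal(size=(4, 3))
zeros = np.zeros((4, 3))
clothing = np.full((4, 3), 0.01)  # constant toy clothing offset

identity = np.eye(4)
shift = np.eye(4)
shift[:3, 3] = [0.0, 1.0, 0.0]    # joint 1 translates by +1 along y
transforms = np.stack([identity, shift])
weights = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.0, 1.0]])

t_clo = clothed_template(template, zeros, zeros, clothing)
posed = blend_skinning(t_clo, transforms, weights)
```

Because the clothing offsets are added before skinning, the same blend weights pose both body and garment, which is what makes re-posing a clothed mesh as easy as re-posing the body.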
3.2 Clothing representation
Our representation of clothing as vertex displacements is not a physical model and cannot represent all types of clothing, but this approach achieves a balance between expressiveness and simplicity, and has been widely used in deformation modeling , 3D clothing capture  and recent work that reconstructs clothed humans from images [1, 8, 60].
Our displacement layer is a graph $G_d = (V_d, E_d)$ that inherits the SMPL topology: the edges $E_d = E_{SMPL}$. $V_d \in \mathbb{R}^{3 \times N}$ is the set of vertices, and the feature on each vertex is the 3-dimensional offset vector, $(d_x, d_y, d_z)$, from its corresponding vertex on the underlying body mesh.
We train our model using 3D scans of people in clothing. To acquire data for the displacements, pairs of $(V_{clothed}, V_{minimal})$ are needed, where $V_{clothed}$ stands for the vertices of a clothed human mesh, and $V_{minimal}$ the vertices of a minimally-clothed mesh. To do so, we first scan subjects in both clothed and minimally-clothed conditions, and then use the SMPL model + free deformation [1, 58] to register the scans. As a result, we obtain SMPL meshes capturing the geometry of the scans, the corresponding pose parameters, and vertices of the unposed meshes that live in the zero-pose space (footnote 1: We follow SMPL and use the T-pose as the zero-pose. For the mathematical details of registration and unposing, we refer the reader to .). For each $(V_{clothed}, V_{minimal})$ pair, the displacements are then calculated as $V_d = V_{clothed} - V_{minimal}$, where the subtraction is performed per-vertex along the feature dimension. Ideally, $V_d$ has non-zero values only on body parts covered with clothes.
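The per-vertex subtraction described above can be sketched in a few lines of NumPy. The function name `clothing_displacements` and the optional `clothed_mask` argument are assumptions made for illustration; the mask simply zeroes residual offsets on vertices that are known not to be covered by clothing, and is not claimed to be part of the original pipeline.

```python
import numpy as np

def clothing_displacements(v_clothed, v_minimal, clothed_mask=None):
    """Per-vertex clothing offsets between two registered, unposed meshes.

    v_clothed, v_minimal: (N, 3) vertices with identical topology
    clothed_mask:         optional (N,) boolean array (hypothetical);
                          False marks vertices not covered by clothing
    """
    displacements = v_clothed - v_minimal
    if clothed_mask is not None:
        displacements = displacements * clothed_mask[:, None]
    return displacements

# Toy example with three vertices; the last one is bare skin with
# a small registration residual that the mask removes.
v_minimal = np.zeros((3, 3))
v_clothed = np.array([[0.0, 0.0, 0.02],
                      [0.01, 0.0, 0.0],
                      [0.0, 0.001, 0.0]])
clothed_mask = np.array([True, True, False])

d = clothing_displacements(v_clothed, v_minimal, clothed_mask)
```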
To summarize, our method extends the SMPL body model to clothed bodies. Compared to volumetric representation of clothed people [35, 46, 50, 59], our combination of the body model and the garment layer is superior in the ease of re-posing and garment retargeting: the former uses the same blend skinning as the body model, while the latter is a simple addition of the displacements to a minimally-clothed body shape. In contrast to similar models that also dress SMPL with offsets [1, 8], our garment layer is parametrized, low-dimensional, and pose-dependent.
4 CAPE
Our clothing term $S_{clo}(z, \theta, c)$ in Eq. (3) is a function of $z$, a code in a learned low-dimensional latent space that encodes the shape and structure of clothing, body pose $\theta$, and clothing type $c$. The function outputs the clothing displacement graph $G_d$ as described in Sec. 3.2. We parametrize this function using a graph convolutional neural network (Graph-CNN) as a VAE-GAN framework [16, 25, 30].
4.1 Network architecture
As shown in Fig. 3, our model consists of a generator $G$ with an encoder-decoder architecture and a discriminator $D$. We also use auxiliary networks to handle the conditioning. The network is differentiable and is trained end-to-end.
For simplicity, we use the following notation in this section. $x$: the vertices of the input displacement graph; $\hat{x}$: vertices of the reconstructed graph; $\theta$ and $c$: the pose and clothing type condition vector; $z$: the latent code.
Graph generator. We build the graph generator following the VAE-GAN framework. At training, an encoder takes in the displacement $x$, extracts its features through multiple graph convolutional layers, and maps it to the low-dimensional latent code $z$. A decoder is trained to reconstruct the input graph from $z$. Both the encoder and decoder are feed-forward neural networks built with mesh convolutional layers. Fully-connected layers are used at the end of the encoder and the beginning of the decoder. The full architecture is shown in Appendix A.1.
Stacking graph convolution layers causes a loss of local features in the deeper layers. This is undesirable for clothing generation because fine details, corresponding to wrinkles, are likely to disappear. Therefore, we improve the standard graph convolution layers with residual connections, which enable the use of low-level features from the layer input if necessary. Within a residual block, a convolution is performed that aggregates features only from the vertex itself.
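To make the idea of a residual graph block concrete, here is a minimal NumPy sketch. It is not the layer used in the paper: for brevity it swaps in a first-order, symmetrically normalized adjacency aggregation instead of the spectral graph convolutions the model builds on, and the names `normalized_adjacency`, `residual_graph_block`, `w_conv` and `w_skip` are hypothetical. The skip path is a per-vertex linear map that looks at no neighbors, so low-level features from the block input can bypass the neighborhood averaging.

```python
import numpy as np

def normalized_adjacency(edges, num_vertices):
    """Symmetric normalized adjacency with self-loops, shape (N, N)."""
    a = np.zeros((num_vertices, num_vertices))
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    a += np.eye(num_vertices)                  # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def residual_graph_block(x, a_hat, w_conv, w_skip):
    """One residual block on per-vertex features x of shape (N, F_in).

    conv path: aggregate neighbor features, project, apply ReLU
    skip path: per-vertex projection only (no neighbor aggregation)
    """
    conv = np.maximum(a_hat @ x @ w_conv, 0.0)
    skip = x @ w_skip
    return conv + skip

# Toy graph: 4 vertices connected in a square, 3-D offsets as input features.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
a_hat = normalized_adjacency(edges, 4)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))
w_conv = rng.normal(size=(3, 8))
w_skip = rng.normal(size=(3, 8))
out = residual_graph_block(x, a_hat, w_conv, w_skip)
```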
At test time, the encoder is not needed. Instead, $z$ is sampled from the Gaussian prior distribution, and the decoder serves as the graph generator. We detail different use cases below.
Patchwise discriminator. To further enhance fine details in the reconstructions, we introduce a patchwise discriminator for graphs, which has shown success in the image domain [22, 61]. Instead of looking at the entire generated graph, the discriminator only classifies whether a graph patch is real or fake based on its local structure. Intuitively this encourages the discriminator to only focus on fine details, and the global shape is taken care of by the reconstruction loss.
We implement the graph patchwise-discriminator using four graph convolution-downsampling blocks . We add a discriminative real / fake loss for each of the output vertices. This enables the discriminator to capture a patch of neighboring nodes in the reconstructed graph and classify them as real / fake (see Fig. 3).
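The per-vertex real / fake loss can be illustrated with a small NumPy sketch. It assumes the discriminator has already produced one logit per output vertex (each summarizing a local patch), and uses a standard binary cross-entropy on logits; the function name `patchwise_bce` and the choice of cross-entropy are assumptions for illustration rather than the exact loss of the paper.

```python
import numpy as np

def patchwise_bce(logits, is_real):
    """Mean binary cross-entropy over per-patch logits of shape (P,).

    Uses the numerically stable form
    max(x, 0) - x * t + log(1 + exp(-|x|)) with target t in {0, 1}.
    """
    target = 1.0 if is_real else 0.0
    loss = (np.maximum(logits, 0.0) - logits * target
            + np.log1p(np.exp(-np.abs(logits))))
    return loss.mean()

# Toy logits for 5 patches of a real graph and of a generated graph.
real_logits = np.array([2.0, 1.5, 3.0, 0.5, 2.5])
fake_logits = np.array([-1.0, -2.0, 0.5, -0.5, -3.0])

# Discriminator objective: real patches labeled 1, generated patches 0.
d_loss = patchwise_bce(real_logits, True) + patchwise_bce(fake_logits, False)
```

Averaging over patches rather than emitting a single score per mesh means every local neighborhood contributes its own gradient, which matches the intuition that the discriminator focuses on fine details.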
Conditional model. We condition the network with body pose $\theta$ and clothing type $c$. The SMPL pose parameters are in axis-angle representation, and are difficult for the neural network to learn [28, 31]. Therefore, following previous work [28, 31], we transform the pose parameters into rotation matrices using the Rodrigues equation. The clothing types are discrete by nature, and we represent them using one-hot labels.
Each condition is first passed through its own small fully-connected embedding network, so as to balance the dimensionality of learned graph features and of the condition features. We also experiment with different ways of conditioning the mesh generator: concatenation in the latent space; appending the condition features to the graph features at all nodes in the generator; and the combination of the two. We find that the combined strategy works better in terms of network capability and the effect of conditioning.
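The two condition encodings can be sketched as follows. The Rodrigues formula itself is standard; the helper names `rodrigues`, `pose_condition` and `one_hot`, the flattening of the rotation matrices into one vector, and the use of four clothing classes (mirroring the four outfit types of the dataset) are assumptions made for this example.

```python
import numpy as np

def rodrigues(axis_angle):
    """Axis-angle vector (3,) -> rotation matrix (3, 3)."""
    angle = np.linalg.norm(axis_angle)
    if angle < 1e-8:
        return np.eye(3)
    k = axis_angle / angle
    # Skew-symmetric cross-product matrix of the unit axis.
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def pose_condition(pose):
    """Flattened per-joint rotation matrices: (J * 3,) -> (J * 9,)."""
    joints = pose.reshape(-1, 3)
    return np.concatenate([rodrigues(j).ravel() for j in joints])

def one_hot(index, num_classes):
    """One-hot label for a discrete clothing type."""
    label = np.zeros(num_classes)
    label[index] = 1.0
    return label

# A quarter turn about the z-axis maps the x-axis onto the y-axis.
R = rodrigues(np.array([0.0, 0.0, np.pi / 2]))
pose_vec = pose_condition(np.zeros(6))      # two joints, both at rest
clothing_vec = one_hot(2, num_classes=4)
```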
4.2 Losses and learning
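For reconstruction, we use an L1 loss over the vertices of the mesh $x$, because it encourages less smoothing compared to L2, given by
$$\mathcal{L}_r = \mathbb{E}_{x \sim p(x),\, z \sim q(z|x)} \left[ \| G(z, \theta, c) - x \|_1 \right] \tag{5}$$
Furthermore, we apply a loss on the mesh edges to encourage the generation of wrinkles instead of smooth surfaces. Let $e$ be an edge in the set of edges, $E$, of the ground truth graph, and $\hat{e}$ the corresponding edge in the generated graph. We penalize the mismatch of all corresponding edges by
$$\mathcal{L}_E = \mathbb{E}_{e \in E,\, \hat{e} \in \hat{E}} \left[ \| e - \hat{e} \|_2 \right] \tag{6}$$
We also apply a KL divergence loss between the distribution of latent codes and the Gaussian prior
$$\mathcal{L}_{KL} = \mathbb{E}_{x \sim p(x)} \left[ KL\left( q(z|x) \,\|\, \mathcal{N}(0, I) \right) \right] \tag{7}$$
Moreover, the generator $G$ and the discriminator $D$ are trained using an adversarial loss
$$\mathcal{L}_{GAN} = \mathbb{E}_{x \sim p(x)} \left[ \log D(x) \right] + \mathbb{E}_{z \sim q(z|x)} \left[ \log \left( 1 - D(G(z, \theta, c)) \right) \right] \tag{8}$$
where $G$ tries to minimize this loss against the $D$ that aims to maximize it.
The overall objective is a weighted sum of these loss terms given by
$$\mathcal{L} = \mathcal{L}_r + \gamma_E \mathcal{L}_E + \gamma_{kl} \mathcal{L}_{KL} + \gamma_{gan} \mathcal{L}_{GAN} \tag{9}$$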
Network training details are provided in Appendix A.2.
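The non-adversarial loss terms described in this section can be sketched in NumPy as below. This is an illustrative sketch, not the training code: the function names, the choice of an L2 norm on edge vectors, the diagonal-Gaussian posterior parametrized by `mu` and `log_var`, and the default weights `w_edge`, `w_kl`, `w_gan` are all assumptions. The adversarial term is passed in as a precomputed scalar.

```python
import numpy as np

def l1_vertex_loss(x_hat, x):
    """Mean over vertices of the L1 distance between (N, 3) vertex arrays."""
    return np.abs(x_hat - x).sum(axis=1).mean()

def edge_loss(x_hat, x, edges):
    """Mean mismatch between corresponding edge vectors.

    edges: (E, 2) integer array of vertex index pairs shared by both graphs.
    """
    e = x[edges[:, 0]] - x[edges[:, 1]]
    e_hat = x_hat[edges[:, 0]] - x_hat[edges[:, 1]]
    return np.linalg.norm(e_hat - e, axis=1).mean()

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent code."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def total_loss(x_hat, x, edges, mu, log_var, gan_term,
               w_edge=1.0, w_kl=1.0, w_gan=1.0):
    """Weighted sum of the loss terms; the weights here are placeholders."""
    return (l1_vertex_loss(x_hat, x)
            + w_edge * edge_loss(x_hat, x, edges)
            + w_kl * kl_to_standard_normal(mu, log_var)
            + w_gan * gan_term)

# Toy check: a rigid shift changes vertex positions but not edge vectors.
x = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
edges = np.array([[0, 1], [1, 2], [2, 0]])
x_shifted = x + np.array([0.1, 0.0, 0.0])
vertex_term = l1_vertex_loss(x_shifted, x)
edge_term = edge_loss(x_shifted, x, edges)
```

The toy check shows why the two terms are complementary: the vertex loss reacts to the global shift, while the edge loss only reacts to changes in local structure such as wrinkles.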
5 CAPE Dataset
We build a dataset of 3D clothing by capturing temporal sequences of 3D human body scans with a high-resolution body scanner (3dMD LLC, Atlanta, GA). Approximately 80K 3D scan frames are captured at 60 FPS, and a mesh with SMPL model topology is registered to each scan to get surface correspondences. We also scanned the subjects in a minimally-clothed condition to obtain an accurate estimate of their body shape under clothing. We extract the clothing as displacements from the minimally-clothed body as described in Sec. 3.2. Noisy frames and failed registrations are removed through manual inspection.
The dataset consists of 8 male subjects and 3 female subjects, performing a wide range of motion. The subjects gave informed written consent to participate and to release the data for research purposes. There are four types of outfits, which cover a wide range of common garments: T-shirts, long-sleeve shirts, long jerseys, shorts, long pants, etc.; see Appendix D for details and examples from the dataset.
Compared to existing datasets of 3D clothed humans, our dataset provides captured data and alignments of SMPL to the scans, separates the clothing from body, and provides accurate, captured ground truth body shape under clothing. For each subject and outfit, our dataset contains large pose variations, which induces a wide variety of wrinkle patterns. Since our dataset of 3D meshes has a consistent topology, it can be used for quantitative evaluation of different Graph-CNN architectures.
6 Experiments
We first show the representation capability of our model and then demonstrate the model’s ability to generate new examples by probabilistic sampling. We then show an application to human pose and shape estimation.
6.1 Representation power
3D mesh auto-encoding errors. As our network is based on a VAE, we can use the reconstruction accuracy to measure the capability of the model to encode while preserving the original geometry. We compare with a recent convolutional mesh autoencoder, CoMA, and a linear (PCA) model. We compare to both the original CoMA with 4× downsampling (denoted “CoMA-4”) and a variant without downsampling (denoted “CoMA-1”) to study the effect of downsampling on over-smoothing. We use the same latent space dimension (number of principal components in the case of PCA) and hyper-parameter settings, where applicable, for all models.
Table 2 shows the result of per-vertex Euclidean error when using our network to reconstruct the clothing displacement graphs from a held-out test set of 5852 examples in our CAPE dataset. Vertices from the head, fingers, toes, hands and feet are excluded from accuracy computation as they are not covered with clothing.
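A masked per-vertex Euclidean error of this kind could be computed as in the sketch below. The function name `masked_vertex_error_mm`, the boolean `clothed_mask`, and the assumption that vertex coordinates are stored in meters (hence the factor of 1000) are illustrative choices, not details confirmed by the text.

```python
import numpy as np

def masked_vertex_error_mm(pred, gt, clothed_mask):
    """Mean per-vertex Euclidean error in millimeters over masked vertices.

    pred, gt:     (N, 3) vertex arrays, assumed to be in meters
    clothed_mask: (N,) boolean array, False for excluded vertices
                  (for example head, hands and feet)
    """
    per_vertex = np.linalg.norm(pred - gt, axis=1)
    return 1000.0 * per_vertex[clothed_mask].mean()

# Toy example: vertex 0 is excluded, so its large error is ignored.
rng = np.random.default_rng(0)
gt = rng.normal(size=(5, 3))
pred = gt + np.array([0.003, 0.0, 0.0])   # 3 mm error on every vertex
pred[0] += 1.0                            # large error on an excluded vertex
clothed_mask = np.array([False, True, True, True, True])

error_mm = masked_vertex_error_mm(pred, gt, clothed_mask)
```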
Our model outperforms the baselines in the auto-encoding task, while the shape reconstructed by our model is also probabilistic and pose-dependent. Note that CoMA here is a deterministic auto-encoder with a focus on reconstruction. Although PCA performs comparably to our method, PCA cannot be used directly in the inference phase with a pose parameter as input. Furthermore, PCA cannot produce probabilistic samples of shapes without knowing the data distribution. Our method tackles both of these scenarios.
|Baseline Comparison||Ablation Study|
|Method||Error (mm)||Removed||Error (mm)|
Ablation study. We remove key components from our model while keeping all the others, and evaluate the model performance; see Table 2. We observe that the discriminator, residual block and edge loss all play important roles in the model performance. Comparing the performance of CoMA-4 and CoMA-1, we find that the mesh downsampling layer causes a loss of fidelity. However, even without any spatial downsampling, CoMA-1 still underperforms our model. This shows the benefits of adding the discriminator, residual block and edge loss in our model.
Fig. 4 shows a qualitative comparison of the methods. PCA keeps wrinkles and boundaries, but long-range correlations are missing: the rising hem on the left side disappears. CoMA-1 and CoMA-4 are able to capture global correlation, but the wrinkles tend to be smoothed. By incorporating all the key components, our model manages to model both local structures and global correlations.
6.2 Conditional generation of clothing
As a generative model, CAPE can be sampled to generate new data. The model has three parameters: $z$, $c$ and $\theta$ (see Eq. (3)). By sampling one of them while keeping the other two fixed, we show how the conditioning affects the generated clothing shape.
Sampling. Fig. 5 presents the sampled clothing dressed on unseen bodies, in a variety of poses that are not used in training. For each subject, we fix the pose $\theta$ and clothing type $c$, and sample $z$ several times to generate varied clothing shapes. The sampling trick in  is used. Here we only show untextured rendering to highlight the variation in the generated geometry. As CAPE inherits the SMPL topology, the generated clothed body meshes are compatible with all existing SMPL texture maps. See Appendix C.2 for a comparison between a CAPE sample and a SMPL sample applied with the same texture.
As shown in the figure, our model manages to capture long-range correlations within a mesh, such as the elevated hem as the subject raises the arms, and the lateral wrinkle on the back as he raises arms. The model also synthesizes local details such as wrinkles in the armpit area, and boundaries at cuffs and collars.
Pose-dependent clothing deformation. Another practical use case of CAPE is to animate an existing clothed body. This corresponds to fixing the clothing shape variable $z$ and clothing type $c$, and reposing the body by changing the pose $\theta$. The challenge here is to have a clothing shape that is consistent across poses, yet deforms plausibly. We demonstrate the pose-dependent effect on a test pose in Fig. 6. The difference of the clothing layer between the two poses is calculated in the zero-pose space, and shown in color coding. The result shows that the clothing type is consistent while local deformation changes along with pose.
User study of generated examples. To test the realism of the conditionally generated results of our method, we performed a user study on Amazon Mechanical Turk (AMT). In the study, virtual avatars are dressed in 3D and rendered into front-view images. Following the protocol from , raters are presented with a series of “real vs fake” trials. On each trial, the rater is presented with a “real” mesh render (randomly picked from our dataset, i.e. alignments of the scan data) and a “fake” render (mesh generated by our model). Both images are shown side-by-side. The raters are asked to pick the one that they think is real. Each pair of renderings is evaluated by 10 raters. More strictly than , we present both real and fake renderings simultaneously, do not set a time limit for the raters and allow zoom-in for detailed comparison. In this setting, the best score that a method can obtain is 50%, meaning that the real and fake examples are indistinguishable.
We carry out AMT evaluation with two test cases. In Test Case 1, we fix the clothing type to be “shortlong” (the most common clothing type in training), and generate 300 clothed body meshes with various poses for the evaluation. In Test Case 2, we fix the pose to be an A-pose (the most frequent pose in training), and sample 100 examples per clothing type for evaluation. The percentage of the raters that label our synthesized data as “real” is shown in Table 3. On average, in the direct comparison with real data, our synthesized data “fools” and participants, for the two test cases respectively.
|Test Case 1||Test Case 2|
6.3 Image fitting
CAPE is fully differentiable with respect to the clothing shape variable $z$, body pose $\theta$ and clothing type $c$. Therefore, it can also be used in optimization frameworks. We show an application of CAPE on the task of reconstructing a body mesh from a single image, by enhancing a popular optimization-based method, SMPLify. Specifically, we dress the minimally-clothed output mesh from SMPLify using CAPE, project it back to the image using a differentiable renderer, and optimize the model parameters with respect to the silhouette discrepancy.
We evaluate our image fitting pipeline on renderings of 120 meshes from the CAPE dataset. To compare, we measure the reconstruction error on SMPLify and our results against ground truth mesh using mean square vertex error. To eliminate the error introduced by the ambiguity of human scale and distance to the camera, we optimize the global scaling and translation of predictions for both methods on each test sample. A mask is applied to exclude error in the non-clothed regions such as head, hands and feet. We report the errors of both methods in Table 4. Our model performs 18% better than SMPLify due to its ability to capture clothing shape. More details about the objective function, experimental setup and qualitative results of the image fitting experiment are provided in Appendix B.
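The evaluation protocol above (optimal global scale and translation, then a masked vertex error) can be sketched with a closed-form least-squares alignment. The function names `align_scale_translation` and `masked_mean_square_error`, the use of all vertices for the alignment, and the definition of the error as the mean squared Euclidean distance per vertex are assumptions made for this example.

```python
import numpy as np

def align_scale_translation(pred, gt):
    """Least-squares global scale s and translation t so that s * pred + t ~ gt.

    Closed form: center both point sets, then
    s = <pred_c, gt_c> / <pred_c, pred_c> and t = mean(gt) - s * mean(pred).
    Returns the aligned prediction, shape (N, 3).
    """
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p_c, g_c = pred - mu_p, gt - mu_g
    scale = (p_c * g_c).sum() / (p_c * p_c).sum()
    return scale * p_c + mu_g

def masked_mean_square_error(pred, gt, clothed_mask):
    """Mean squared per-vertex distance over the vertices kept by the mask."""
    sq_dist = ((pred - gt) ** 2).sum(axis=1)
    return sq_dist[clothed_mask].mean()

# Toy check: a prediction that differs from the ground truth only by
# a global scale and translation has zero error after alignment.
rng = np.random.default_rng(0)
pred = rng.normal(size=(6, 3))
gt = 2.0 * pred + np.array([1.0, 2.0, 3.0])
clothed_mask = np.array([True, True, True, True, False, False])

aligned = align_scale_translation(pred, gt)
error = masked_mean_square_error(aligned, gt, clothed_mask)
```

Aligning before measuring removes the scale and depth ambiguity of single-image fitting, so the remaining error reflects shape differences such as missing clothing geometry.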
Furthermore, once a clothed human is reconstructed from an image, our model can repose and animate it, as well as change the subject’s clothes by re-sampling $z$ or changing the clothing type $c$. This shows the potential for a wide range of applications.
7 Conclusions, Limitations, Future Work
We have introduced a novel graph-CNN-based generative shape model that enables us to condition, sample, and preserve fine shape detail in 3D meshes. We use this to model clothing deformations from a 3D body mesh and condition the latent space on body pose and clothing type. The training data represents 3D displacements from the SMPL body model for varied clothing and poses. This design means that our generative model is compatible with SMPL in that clothing is an additional additive term applied to the SMPL template mesh. This makes it possible to sample clothing, dress SMPL with it, and then animate the body with pose-dependent clothing wrinkles. A clothed version of SMPL has wide applicability in computer vision. As shown, we can apply it to fitting the body to images of clothed humans. Another application would use the model to generate training data of 3D clothed people to train regression-based pose-estimation methods.
There are a few limitations of our approach that point to future work. As with recent methods that use a displacement-based clothing representation [1, 8], our model only handles garments whose geometry is similar to that of the human body. A different representation is needed for skirts and open jackets. While our generated clothing depends on pose, it does not depend on dynamics. This is fine for most slow motions but does not generalize to faster motions like those in sports. Future work will address modeling clothing deformation conditioned on the state of previous time steps. Learning a model of clothing texture (albedo) is another promising direction.
Acknowledgements: We thank Daniel Scharstein for several revisions of the manuscript. We thank Joachim Tesch for rendering the results, Pavel Karasik for the help with Amazon Mechanical Turk, Tsvetelina Alexiadis and Andrea Keller for building the dataset. We thank Partha Ghosh, Timo Bolkart and Yan Zhang for useful discussions. Qianli Ma and Siyu Tang acknowledge funding by Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) Projektnummer 276693517 SFB 1233. Gerard Pons-Moll is funded by the Emmy Noether Programme, Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 409792180.
Disclosure: Michael J. Black has received research gift funds from Intel, Nvidia, Adobe, Facebook, and Amazon. While he is a part-time employee of Amazon and has financial interests in Amazon and Meshcapade GmbH, his research was performed solely at, and funded solely by, MPI.
References
- [1] Thiemo Alldieck, Marcus Magnor, Bharat Lal Bhatnagar, Christian Theobalt, and Gerard Pons-Moll. Learning to reconstruct people in clothing from a single RGB camera. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Jun 2019.
- [2] Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. Detailed human avatars from monocular video. In International Conference on 3D Vision (3DV), Sep 2018.
- [3] Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. Video based reconstruction of 3D people models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
- [4] Thiemo Alldieck, Gerard Pons-Moll, Christian Theobalt, and Marcus Magnor. Tex2Shape: Detailed full human body geometry from a single image. In The IEEE International Conference on Computer Vision (ICCV). IEEE, Oct 2019.
- [5] Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. DensePose: Dense human pose estimation in the wild. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7297–7306, 2018.
- [6] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. SCAPE: Shape completion and animation of people. In ACM Transactions on Graphics (TOG), volume 24, pages 408–416. ACM, 2005.
- [7] Jianmin Bao, Dong Chen, Fang Wen, Houqiang Li, and Gang Hua. CVAE-GAN: Fine-grained image generation through asymmetric training. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2745–2754, 2017.
- [8] Bharat Lal Bhatnagar, Garvita Tiwari, Christian Theobalt, and Gerard Pons-Moll. Multi-Garment Net: Learning to dress 3D people from images. In The IEEE International Conference on Computer Vision (ICCV). IEEE, Oct 2019.
- [9] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In The European Conference on Computer Vision (ECCV), pages 561–578. Springer, 2016.
- [10] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann Lecun. Spectral networks and locally connected networks on graphs. In International Conference on Learning Representations (ICLR), 2014.
- [11] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.
- [12] Valentin Gabeur, Jean-Sebastien Franco, Xavier Martin, Cordelia Schmid, and Gregory Rogez. Moulding humans: Non-parametric 3D human shape estimation from single images. In The IEEE International Conference on Computer Vision (ICCV), Oct 2019.
- [13] Michael Garland and Paul S Heckbert. Surface simplification using quadric error metrics. In Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, pages 209–216. ACM Press/Addison-Wesley Publishing Co., 1997.
- [14] Partha Ghosh, Mehdi SM Sajjadi, Antonio Vergari, Michael Black, and Bernhard Schölkopf. From variational to deterministic autoencoders. arXiv preprint arXiv:1903.12436, 2019.
- [15] Ke Gong, Xiaodan Liang, Yicheng Li, Yimin Chen, Ming Yang, and Liang Lin. Instance-level human parsing via part grouping network. In The European Conference on Computer Vision (ECCV), pages 770–785, 2018.
- [16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
- [17] Peng Guan, Loretta Reiss, David A Hirshberg, Alexander Weiss, and Michael J Black. DRAPE: DRessing Any PErson. ACM Trans. Graph., 31(4):35–1, 2012.
- [18] Peng Guan, Alexander Weiss, Alexandru O Balan, and Michael J Black. Estimating human shape and pose from a single image. In The IEEE International Conference on Computer Vision (ICCV), pages 1381–1388. IEEE, 2009.
- [19] Erhan Gundogdu, Victor Constantin, Amrollah Seifoddini, Minh Dang, Mathieu Salzmann, and Pascal Fua. GarNet: A two-stream network for fast and accurate 3D cloth draping. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8739–8748, 2019.
-  Paul Henderson and Vittorio Ferrari. Learning single-image 3D reconstruction by generative modelling of shape, pose and shading. International Journal of Computer Vision, 2019.
-  David T. Hoffmann, Dimitrios Tzionas, Michael J. Black, and Siyu Tang. Learning to train with synthetic humans. In German Conference on Pattern Recognition (GCPR), Sept. 2019.
-  Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1125–1134, 2017.
-  Hanbyul Joo, Tomas Simon, and Yaser Sheikh. Total capture: A 3D deformation model for tracking faces, hands, and bodies. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8320–8329, 2018.
-  Angjoo Kanazawa, Michael J Black, David W Jacobs, and Jitendra Malik. End-to-end recovery of human shape and pose. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7122–7131, 2018.
-  Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
-  Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
-  Nikos Kolotouros, Georgios Pavlakos, Michael J. Black, and Kostas Daniilidis. Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In The IEEE International Conference on Computer Vision (ICCV), Oct. 2019.
-  Nikos Kolotouros, Georgios Pavlakos, and Kostas Daniilidis. Convolutional mesh regression for single-image human shape reconstruction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4501–4510, 2019.
-  Zorah Lähner, Daniel Cremers, and Tony Tung. DeepWrinkles: Accurate and realistic clothing modeling. In The European Conference on Computer Vision (ECCV), pages 698–715. Springer, 2018.
-  Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In International Conference on Machine Learning (ICML), pages 1558–1566, 2016.
-  Christoph Lassner, Javier Romero, Martin Kiefel, Federica Bogo, Michael J Black, and Peter V Gehler. Unite the people: Closing the loop between 3D and 2D human representations. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6050–6059, 2017.
-  Qimai Li, Zhichao Han, and Xiao-Ming Wu. Deeper insights into graph convolutional networks for semi-supervised learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
-  Or Litany, Alex Bronstein, Michael Bronstein, and Ameesh Makadia. Deformable shape completion with graph convolutional autoencoders. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1886–1895, 2018.
-  Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. SMPL: A skinned multi-person linear model. ACM Transactions on Graphics (TOG), 34(6):248, 2015.
-  Ryota Natsume, Shunsuke Saito, Zeng Huang, Weikai Chen, Chongyang Ma, Hao Li, and Shigeo Morishima. SiCloPe: Silhouette-based clothed people. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
-  Alexandros Neophytou and Adrian Hilton. A layered model of human body and garment deformation. In International Conference on 3D Vision (3DV), 2014.
-  Mohamed Omran, Christoph Lassner, Gerard Pons-Moll, Peter Gehler, and Bernt Schiele. Neural body fitting: Unifying deep learning and model based human pose and shape estimation. In 2018 International Conference on 3D Vision (3DV), pages 484–494. IEEE, 2018.
-  Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3D hands, face, and body from a single image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
-  Georgios Pavlakos, Luyang Zhu, Xiaowei Zhou, and Kostas Daniilidis. Learning to estimate 3D human pose and shape from a single color image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 459–468, 2018.
-  Leonid Pishchulin, Eldar Insafutdinov, Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, Peter V Gehler, and Bernt Schiele. Deepcut: Joint subset partition and labeling for multi person pose estimation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4929–4937, 2016.
-  Gerard Pons-Moll, Sergi Pujades, Sonny Hu, and Michael J Black. Clothcap: Seamless 4D clothing capture and retargeting. ACM Transactions on Graphics (TOG), 36(4):73, 2017.
-  Albert Pumarola, Jordi Sanchez-Riera, Gary P. T. Choi, Alberto Sanfeliu, and Francesc Moreno-Noguer. 3DPeople: Modeling the geometry of dressed humans. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
-  Anurag Ranjan, Timo Bolkart, Soubhik Sanyal, and Michael J Black. Generating 3D faces using convolutional mesh autoencoders. In The European Conference on Computer Vision (ECCV), pages 704–720, 2018.
-  Anurag Ranjan, David T. Hoffmann, Dimitrios Tzionas, Siyu Tang, Javier Romero, and Michael J. Black. Learning multi-human optical flow. International Journal of Computer Vision (IJCV), 12 2019.
-  Anurag Ranjan, Javier Romero, and Michael J Black. Learning human optical flow. In British Machine Vision Conference (BMVC), 2018.
-  Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. PIFu: Pixel-aligned implicit function for high-resolution clothed human digitization. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
-  Igor Santesteban, Miguel A Otaduy, and Dan Casas. Learning-based animation of clothing for virtual try-on. arXiv preprint arXiv:1903.07190, 2019.
-  David Smith, Matthew Loper, Xiaochen Hu, Paris Mavroidis, and Javier Romero. FACSIMILE: Fast and accurate scans from an image in less than a second. In The IEEE International Conference on Computer Vision (ICCV), October 2019.
-  Carsten Stoll, Jürgen Gall, Edilson de Aguiar, Sebastian Thrun, and Christian Theobalt. Video-based reconstruction of animatable human characters. In ACM SIGGRAPH ASIA, 2010.
-  Gul Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, and Cordelia Schmid. Bodynet: Volumetric inference of 3D human body shapes. In The European Conference on Computer Vision (ECCV), pages 20–36, 2018.
-  Gul Varol, Javier Romero, Xavier Martin, Naureen Mahmood, Michael J Black, Ivan Laptev, and Cordelia Schmid. Learning from synthetic humans. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 109–117, 2017.
-  Nitika Verma, Edmond Boyer, and Jakob Verbeek. Feastnet: Feature-steered graph convolutions for 3D shape analysis. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
-  Daniel Vlasic, Ilya Baran, Wojciech Matusik, and Jovan Popović. Articulated mesh animation from multi-view silhouettes. In ACM Transactions on Graphics (TOG), volume 27, page 97. ACM, 2008.
-  Tuanfeng Y Wang, Duygu Ceylan, Jovan Popović, and Niloy J Mitra. Learning a shared shape space for multimodal garment design. In SIGGRAPH Asia 2018 Technical Papers, page 203. ACM, 2018.
-  Yuxin Wu and Kaiming He. Group normalization. In The European Conference on Computer Vision (ECCV), pages 3–19, 2018.
-  Jinlong Yang, Jean-Sébastien Franco, Franck Hétroy-Wheeler, and Stefanie Wuhrer. Estimation of human body shape in motion with wide clothing. In The European Conference on Computer Vision (ECCV), pages 439–454. Springer, 2016.
-  Jinlong Yang, Jean-Sébastien Franco, Franck Hétroy-Wheeler, and Stefanie Wuhrer. Analyzing clothing layer deformation statistics of 3D human motions. In The European Conference on Computer Vision (ECCV), pages 237–253, 2018.
-  Chao Zhang, Sergi Pujades, Michael J. Black, and Gerard Pons-Moll. Detailed, accurate, human shape estimation from clothed 3D scan sequences. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  Zerong Zheng, Tao Yu, Yixuan Wei, Qionghai Dai, and Yebin Liu. DeepHuman: 3D human reconstruction from a single image. In The IEEE International Conference on Computer Vision (ICCV), 2019.
-  Hao Zhu, Xinxin Zuo, Sen Wang, Xun Cao, and Ruigang Yang. Detailed human shape estimation from a single image by hierarchical mesh deformation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4491–4500, 2019.
-  Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
Appendix A Implementation Details
A.1 CAPE network architecture
Here we provide the details of the CAPE architecture, as described in Sec. 4.1. The architecture is specified in terms of the following quantities and layers:
the data, the output (reconstruction) from the decoder, the latent code, and the prediction map from the discriminator;
the Chebyshev graph convolution layer, parameterized by its number of filters;
the conditional residual block, which uses graph convolutions as its filters;
the linear graph downsampling layer, parameterized by its spatial downsample rate;
the linear graph upsampling layer, parameterized by its spatial upsample rate;
FC: fully connected layer;
Condition module: for the pose, we remove the parameters that are not related to clothing, e.g. head, hands, fingers, feet and toes, resulting in 14 valid joints from the body. The pose parameters of each joint are represented by the flattened rotation matrix (see Sec. 4.1, “Conditional model”). With nine matrix entries per joint, this results in an overall pose parameter of 14 × 9 = 126 values, which we feed into a small fully-connected network.
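As an illustration, the construction of the pose conditioning input can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the pose is taken to be given in SMPL axis-angle form with 24 joints, the joint indices in `CLOTHING_JOINTS` are hypothetical placeholders (the actual selection is not listed in the text), and the fully-connected network that consumes the result is omitted because its layer sizes are not given here.

```python
import numpy as np

def axis_angle_to_matrix(rotvec):
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    rotvec = np.asarray(rotvec, dtype=float)
    angle = np.linalg.norm(rotvec)
    if angle < 1e-8:
        return np.eye(3)
    x, y, z = rotvec / angle
    # Skew-symmetric cross-product matrix of the unit rotation axis.
    K = np.array([[0.0, -z, y],
                  [z, 0.0, -x],
                  [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

# Hypothetical indices of the 14 clothing-related joints (assumption).
CLOTHING_JOINTS = list(range(14))

def pose_condition_input(pose_axis_angle, joints=CLOTHING_JOINTS):
    """Flatten the rotation matrices of the selected joints into one vector."""
    pose = np.asarray(pose_axis_angle, dtype=float).reshape(-1, 3)
    mats = [axis_angle_to_matrix(pose[j]) for j in joints]
    return np.concatenate([m.ravel() for m in mats])

# Zero pose for all 24 SMPL joints: every joint rotation is the identity.
features = pose_condition_input(np.zeros((24, 3)))
print(features.shape)  # (126,)
```

Rotation matrices avoid the discontinuities of the axis-angle representation, at the cost of a larger input (nine values per joint instead of three).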
The clothing type refers to the type of “outfit”, i.e. a combination of upper body clothing and lower body clothing. There are four types of outfits in our training data: longlong: long sleeve shirt / T-shirt / jersey with long pants; shortlong: short sleeve shirt / T-shirt / jersey with long pants; and their short-pants counterparts, longshort and shortshort. As the types of clothing are discrete by nature, we represent them using a four-dimensional one-hot vector and feed it into a linear layer.
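The one-hot encoding followed by a linear layer can be sketched as below. The ordering of the outfit types, the embedding size `EMBED_DIM` and the random weights are assumptions made purely for illustration; the text does not specify them.

```python
import numpy as np

# The four outfit types named in the text; their order here is an assumption.
OUTFIT_TYPES = ["longlong", "shortlong", "shortshort", "longshort"]

def one_hot_outfit(name):
    """Return the one-hot vector for an outfit type."""
    vec = np.zeros(len(OUTFIT_TYPES))
    vec[OUTFIT_TYPES.index(name)] = 1.0
    return vec

def linear_layer(x, weight, bias):
    """Plain affine map: weight @ x + bias."""
    return weight @ x + bias

EMBED_DIM = 8  # hypothetical size of the clothing-type embedding
rng = np.random.default_rng(0)
W = rng.normal(size=(EMBED_DIM, len(OUTFIT_TYPES)))
b = np.zeros(EMBED_DIM)

clothing_embedding = linear_layer(one_hot_outfit("shortlong"), W, b)
```

Because the input is one-hot, the linear layer simply selects one column of the weight matrix (plus the bias), so it acts as a learned embedding table with one entry per outfit type.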
Conditional residual block: We adopt the graph residual block from Kolotouros et al., which comprises Group Normalization, a non-linearity, a graph convolutional layer and a graph linear layer (i.e. a Chebyshev convolution restricted to its lowest-order term). Right after the input to the residual block, we append the condition vector to every input node along the feature channel. Our CResBlock thus takes an input defined on the mesh nodes, with a fixed number of features on each node, concatenates the condition vector to the features of every node, and passes the result through ResBlock, the graph residual block from Kolotouros et al., which outputs the desired number of features on each node.
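The conditioning step of the CResBlock, appending the same condition vector to the features of every node, can be sketched as follows. The function names are hypothetical, and the residual block itself is replaced by a stand-in per-node linear map, since the exact ResBlock layers and sizes are not reproduced here.

```python
import numpy as np

def append_condition(node_features, condition):
    """Tile a condition vector over all nodes and concatenate along features.

    node_features: (num_nodes, in_features)
    condition:     (cond_dim,)
    returns:       (num_nodes, in_features + cond_dim)
    """
    num_nodes = node_features.shape[0]
    tiled = np.broadcast_to(condition, (num_nodes, condition.shape[0]))
    return np.concatenate([node_features, tiled], axis=1)

def c_res_block(node_features, condition, res_block):
    """Conditional block: condition every node, then apply the residual block."""
    return res_block(append_condition(node_features, condition))

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 3))        # 6 nodes, 3 features per node
cond = np.array([1.0, -1.0])       # condition vector, e.g. pose and clothing type
W = rng.normal(size=(3 + 2, 4))    # stand-in for the real ResBlock (assumption)
out = c_res_block(x, cond, lambda h: h @ W)
print(out.shape)  # (6, 4)
```

Concatenating at the input of every block, rather than only at the network input, keeps the condition directly visible at every resolution of the mesh hierarchy.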
A.2 Training details
The convolutions use the Chebyshev polynomial of order for the generator, and of order for the discriminator. An L2-weight decay with strength is used as regularization.
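For readers unfamiliar with Chebyshev graph convolutions, the following NumPy sketch shows how the polynomial order enters the computation. It follows the standard recurrence of Defferrard et al. on a dense Laplacian; the convention that a polynomial of degree K uses K + 1 weight matrices is an assumption of this sketch, and conventions differ between implementations.

```python
import numpy as np

def scaled_laplacian(adjacency):
    """Normalized graph Laplacian rescaled so its eigenvalues lie in [-1, 1]."""
    deg = adjacency.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    n = len(adjacency)
    lap = np.eye(n) - d_inv_sqrt[:, None] * adjacency * d_inv_sqrt[None, :]
    lam_max = np.linalg.eigvalsh(lap).max()
    return 2.0 * lap / lam_max - np.eye(n)

def chebyshev_conv(x, lap_scaled, weights):
    """Chebyshev graph convolution.

    x:       (num_nodes, in_features)
    weights: (degree + 1, in_features, out_features), one matrix per term
    """
    t_prev = x                          # T_0(L) x = x
    out = t_prev @ weights[0]
    if len(weights) > 1:
        t_curr = lap_scaled @ x         # T_1(L) x = L x
        out = out + t_curr @ weights[1]
        for k in range(2, len(weights)):
            # Recurrence: T_k = 2 L T_{k-1} - T_{k-2}
            t_next = 2.0 * lap_scaled @ t_curr - t_prev
            out = out + t_next @ weights[k]
            t_prev, t_curr = t_curr, t_next
    return out
```

With a single weight matrix the convolution reduces to a per-node linear map, and each additional term extends the receptive field by one ring of mesh neighbors.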
We train our model for males and females separately. We split the male dataset into a training set of 26,574 examples and a test set of 5,852 examples. The female dataset is split into a training set of 41,472 examples and a test set of 12,656 examples. Training takes approximately 15 minutes per epoch on the male dataset and 20 minutes per epoch on the female dataset. The results reported in Table 2 are based on the male dataset due to its larger variation in subjects and poses. We also provide results on the female dataset in Table 5.
Appendix B Image Fitting
Here we detail the objective function, experimental setup and extended results of the image fitting experiments, as described in Sec. 6.3.
B.1 Objective function
Similar to previous work, we introduce a silhouette term to encourage the shape of the clothed body to match the image evidence. The silhouette is the set of all pixels that belong to a body’s projection onto the image. We compare the rendered silhouette of a clothed body mesh (see Eq. (4)) with the ground truth silhouette. The silhouette objective is defined by the bi-directional distance between the two silhouettes.
The distance of a point x to a silhouette is the L1 distance from x to the closest point in that silhouette, and is zero if the point lies inside it. The camera parameters are used to render the mesh to the silhouette on the image plane. The clothing type is derived from the upstream pipeline and is therefore not optimized.
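A bi-directional silhouette distance of this kind can be evaluated on binary masks as in the sketch below. It is a brute-force illustration suitable only for small masks: summing (rather than averaging) the per-pixel distances is an assumption, the function name is hypothetical, and the sketch only evaluates the term, whereas the fitting itself needs a differentiable formulation.

```python
import numpy as np

def silhouette_loss(rendered, target):
    """Bi-directional L1 distance between two binary silhouette masks.

    Every pixel of one silhouette contributes its L1 distance to the closest
    pixel of the other silhouette; pixels inside the other silhouette
    contribute zero.
    """
    rendered = np.asarray(rendered, dtype=bool)
    target = np.asarray(target, dtype=bool)
    if not rendered.any() or not target.any():
        raise ValueError("both silhouettes must be non-empty")
    r_pts = np.argwhere(rendered)   # (num_rendered, 2) pixel coordinates
    t_pts = np.argwhere(target)     # (num_target, 2) pixel coordinates
    # Pairwise L1 distances, shape (num_rendered, num_target).
    dist = np.abs(r_pts[:, None, :] - t_pts[None, :, :]).sum(axis=2)
    rendered_to_target = dist.min(axis=1).sum()
    target_to_rendered = dist.min(axis=0).sum()
    return rendered_to_target + target_to_rendered

rendered = np.array([[1, 1, 0, 0, 0]])
target = np.array([[0, 1, 1, 0, 0]])
print(silhouette_loss(rendered, target))  # 2
```

The two directions penalize different errors: the first term punishes rendered pixels that fall outside the target, the second punishes target pixels left uncovered by the rendering.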
For our rendered scan data, the ground truth silhouette and clothing type are acquired for free during rendering. For in-the-wild images, this information can be acquired using human-parsing networks, e.g. the part grouping network of Gong et al.
After the standard SMPLify optimization pipeline, we apply the clothing layer to the body, and apply an additional optimization step on body shape, pose and clothing structure, with respect to the overall objective.
The overall objective is a weighted sum of the silhouette loss and the standard SMPLify energy terms: the joint term, a weighted 2D distance between the projected SMPL joints and the detected 2D points; the mixture of Gaussians pose prior term; the shape prior term; the penalty term that discourages unnatural joint bending; and an L2 regularizer on the clothing structure variable that prevents extreme clothing deformations. For more details about these terms, please refer to Bogo et al.
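One way to write such a weighted sum, in the style of the SMPLify objective of Bogo et al., is given below. All symbol names and the weights \lambda are assumptions introduced for illustration: \beta denotes body shape, \theta pose, z the clothing structure variable, and E_S the silhouette term.

```latex
E(\beta, \theta, z) =
    E_J(\beta, \theta)
  + \lambda_S \, E_S(\beta, \theta, z)
  + \lambda_\theta \, E_\theta(\theta)
  + \lambda_\beta \, E_\beta(\beta)
  + \lambda_a \, E_a(\theta)
  + \lambda_z \, \lVert z \rVert_2^2
```

Here E_J is the 2D joint term, E_\theta the pose prior, E_\beta the shape prior and E_a the joint-bending penalty; the clothing type does not appear as an optimization variable because it is fixed by the upstream pipeline.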
We render 120 textured meshes (aligned to SMPL topology) from the test set of the CAPE dataset that include variations in gender, pose and clothing type, at a resolution of . The ground truth meshes are used for evaluation. Examples of the rendering are shown in Fig. 7.
We re-implement SMPLify by Bogo et al. in TensorFlow, using the gender-neutral SMPL body model. Compared to the original SMPLify, there are two major changes. First, we do not include the interpenetration error term, as it slows down the fitting but brings little performance gain. Second, we use OpenPose for the ground truth 2D keypoint detection instead of DeepCut.
We measure the mean squared error (MSE) between the ground truth vertices and the vertices reconstructed by SMPLify and by our pipeline (Eq. (B.1)), respectively. As discussed in Sec. 6.3, to eliminate the influence of the ambiguity caused by focal length, camera translation and body scale, we estimate the body scale and camera translation for both reconstructions. Specifically, we optimize an alignment energy over the scale and translation for each reconstruction.
This energy runs over the vertex indices in the set of clothing vertices and uses the number of elements in that set. The MSE is then computed using the estimated scale and translation.
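The alignment and evaluation steps can be illustrated with the sketch below. It makes several assumptions: the scale is a single scalar, the alignment energy is a squared vertex distance averaged over the clothing vertices, and the scale and translation are obtained in closed form by least squares rather than by iterative optimization. The function names are hypothetical.

```python
import numpy as np

def fit_scale_translation(pred, gt):
    """Closed-form least squares for s, t minimizing sum ||s * pred_i + t - gt_i||^2."""
    pred_mean, gt_mean = pred.mean(axis=0), gt.mean(axis=0)
    pred_c, gt_c = pred - pred_mean, gt - gt_mean
    scale = (pred_c * gt_c).sum() / (pred_c ** 2).sum()
    translation = gt_mean - scale * pred_mean
    return scale, translation

def aligned_mse(pred, gt, clothing_idx):
    """MSE over clothing vertices after removing scale and translation ambiguity."""
    p, g = pred[clothing_idx], gt[clothing_idx]
    scale, translation = fit_scale_translation(p, g)
    residual = scale * p + translation - g
    return np.mean(np.sum(residual ** 2, axis=1))

rng = np.random.default_rng(0)
pred = rng.normal(size=(10, 3))                   # reconstructed vertices
gt = 2.0 * pred + np.array([1.0, 2.0, 3.0])       # same shape, scaled and shifted
clothing_idx = np.array([0, 2, 4, 6, 8])          # hypothetical clothing vertices
error = aligned_mse(pred, gt, clothing_idx)       # close to zero
```

Because the ground truth in this toy example differs from the prediction only by scale and translation, the aligned error vanishes, which is exactly the ambiguity the evaluation is meant to factor out.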
B.5 Extended image fitting results
Appendix C Extended Experimental Results
C.1 Reconstruction error on the female dataset
We train and evaluate our model on the female dataset, with the same setup and metric as for the male dataset (Sec. A.2). As in Sec. 6.1 of the main manuscript, we compare the result with a PCA model and two variations of the Convolutional Mesh Autoencoder (CoMA), as shown in Table 5. Again, our model outperforms all the baselines.
C.2 CAPE with SMPL texture
As our model has the same topology as SMPL, it is compatible with all existing SMPL texture maps, which are mostly of clothed bodies. Fig. 8 shows an example texture applied to the standard minimally-clothed SMPL model (as done in the SURREAL dataset) and to our clothed body model, respectively. Although the texture creates an illusion of clothing on the SMPL body, the overall shape remains skinny, oversmoothed, and hence unrealistic. In contrast, our model, with its improved clothing geometry, matches the clothing texture more naturally if the correct clothing type is given. This visual contrast becomes even stronger when the texture map has no shading information (albedo map), and when the object is viewed in a 3D setting.
As a future line of research, one can model the alignment between the clothing texture boundaries and the underlying geometry by learning a texture model that is coupled to shape.
Appendix D CAPE Dataset Details
|Dataset|Captured|Body Shape|Registered|Large Pose|Motion|High Quality|
|---|---|---|---|---|---|---|
|Inria dataset|Yes|Yes|No|No|Yes|No|
|Adobe dataset|Yes|No|Yes|No|Yes|No|
|3D People|No|Yes|Yes|Yes|Yes|Yes|
Elaborating on the main manuscript Sec. 5, our dataset consists of:
-  40K registered 3D meshes of clothed human scans for each gender.
-  8 male and 3 female subjects.
-  4 different types of outfits, covering more than 20 pieces of clothing from common clothing types.
-  Large variations in pose.
-  Precise, captured minimally clothed body shape.
Table 6 shows a comparison with public 3D clothed human datasets. Our dataset is distinguished by accurate alignment, consistent mesh topology, ground truth body shape scans, and a large variation of poses. These features make it suitable not only for studies on human body and clothing, but also for the evaluation of various Graph-CNNs. See Fig. 9 for examples of the dataset.