TailorNet: Predicting Clothing in 3D as a Function of Human Pose, Shape and Garment Style

03/10/2020 ∙ by Chaitanya Patel, et al. ∙ Max Planck Society

In this paper, we present TailorNet, a neural model which predicts clothing deformation in 3D as a function of three factors: pose, shape and style (garment geometry), while retaining wrinkle detail. This goes beyond prior models, which are either specific to one style and shape, or generalize to different shapes producing smooth results, despite being style specific. Our hypothesis is that (even non-linear) combinations of examples smooth out high frequency components such as fine-wrinkles, which makes learning the three factors jointly hard. At the heart of our technique is a decomposition of deformation into a high frequency and a low frequency component. While the low-frequency component is predicted from pose, shape and style parameters with an MLP, the high-frequency component is predicted with a mixture of shape-style specific pose models. The weights of the mixture are computed with a narrow bandwidth kernel to guarantee that only predictions with similar high-frequency patterns are combined. The style variation is obtained by computing, in a canonical pose, a subspace of deformation, which satisfies physical constraints such as inter-penetration, and draping on the body. TailorNet delivers 3D garments which retain the wrinkles from the physics based simulations (PBS) it is learned from, while running more than 1000 times faster. In contrast to PBS, TailorNet is easy to use and fully differentiable, which is crucial for computer vision algorithms. Several experiments demonstrate TailorNet produces more realistic results than prior work, and even generates temporally coherent deformations on sequences of the AMASS dataset, despite being trained on static poses from a different dataset. To stimulate further research in this direction, we will make a dataset consisting of 55800 frames, as well as our model publicly available at https://virtualhumans.mpi-inf.mpg.de/tailornet.







1 Introduction

Animating digital humans in clothing has numerous applications in 3D content production, games, entertainment and virtual try-on. The predominant approach is still physics based simulation (PBS). However, the typical PBS pipeline requires editing the garment shape in 2D with patterns, manually placing it on the digital character, and fine-tuning parameters to achieve the desired results, which is laborious, time consuming and requires expert knowledge. Moreover, high quality PBS methods are computationally expensive, complex to implement and control, and are not trivially differentiable. For computer vision tasks, generative models of clothed humans need to be differentiable, easy to deploy, easy to integrate within CNNs, and easy to fit to image and video data.

In order to make animation easy, several works learn efficient approximate models from PBS computed off-line. At least three factors influence clothing deformation: body pose, shape and garment style (by style we mean the garment geometry). Existing methods either model deformation due to pose [9, 29] for a fixed shape, shape and pose [46, 17] for a fixed style, or style [58] for a fixed pose. The aforementioned methods inspire our work; however, they do not model the effects of pose, shape and style jointly, even though they are intertwined. Different garments deform differently as a function of pose and shape, and garment specific models [46, 29, 16, 9] (one style) have limited use. Furthermore, existing joint models of pose and shape [16, 7, 17] often produce over-smooth results (even for a fixed style), and they are not publicly available. Consequently, none of the existing approaches has materialized in a model which can be used to solve computer vision and graphics problems.

What is lacking is a unified model capable of generating different garment styles, and animating them on any body shape in any pose, while retaining wrinkle detail. To that end, we introduce TailorNet, a mixture of neural networks (NN) learned from physics based simulations, which decomposes clothing deformations into style, shape and pose. This effectively approximates the physical clothing deformation, allowing intuitive control of synthesized animations. In the same spirit as SMPL [32] for bodies, TailorNet learns deformations as displacements to a garment template in a canonical pose, while the articulated motion is driven by skinning. TailorNet can either take a real garment as input, or generate it from scratch, and drape it on top of the SMPL body for any shape and pose. In contrast to [16], our model predicts how the garment would fit in reality, e.g., a medium size garment is predicted tight on a large body, and loose on a thin body.

Learning TailorNet required addressing several technical challenges. To generate different garment styles in a static pose, we compute a PCA subspace using the publicly available digital wardrobe of real static garments [5]. To generate more style variation while satisfying garment-human physical constraints, we sample from the PCA subspace, run PBS for each sample, and recompute PCA, to obtain a static style subspace. Samples from this subspace produce variation in sleeve length, size and fit in a static pose. To learn deformation as a function of pose and shape, we generated a semi-real dataset by animating garments (real ones, or samples from the static style subspace) using PBS on top of the SMPL [32] body for static SMPL poses and different shapes.

Our first observation is that, for a fixed style (garment instance) and body shape, predicting high frequency clothing deformations as a function of pose is possible. Perhaps surprisingly, our experiments show that, for this task, a simple multi-layer perceptron (MLP) performs as well as or better than Graph Neural Networks [31, 28] and an image decoder in UV-space [29]. In stark contrast, straightforward prediction of deformation as a function of style, shape and pose results in overly smooth, unrealistic results. We hypothesize that any attempt to combine training examples smoothes out high frequency components, which explains why previous models [46, 16, 17], even for a single style, lack fine scale wrinkles and folds.

These key observations motivate the design of TailorNet: we predict the low frequency clothing geometry with a simple MLP. High frequency geometry is predicted with a mixture of high frequency style-shape specific models, where each specific model consists of an MLP which predicts deformation as a function of pose, and the weights of the mixture are obtained using a kernel which evaluates similarity in style and shape. A kernel with a very narrow bandwidth prevents smoothing out fine scale wrinkles. Several experiments demonstrate that our model generalizes well to novel poses, predicts garment fit dependent on body shape, retains wrinkle detail, can produce style variations for a garment category (e.g., for T-shirts it produces different sizes, sleeve lengths and fit types), and importantly is easy to implement and control. To summarize, the main contributions of our work are:

  • The first joint model of clothing style, pose and shape variation, which is simple, easy to deploy and fully differentiable for easy integration with deep learning.

  • A simple yet effective decomposition of mesh deformations into low and high-frequency components, which, coupled with a mixture model, allows us to retain high-frequency wrinkles.

  • A comparison of different methods (MLP, Graph Neural Nets and image-to-image translation on UV-space) to predict pose dependent clothing deformation.

  • To stimulate further research, we make available a dataset of aligned real static garments, simulated in a large number of poses and body shapes, totaling 55,800 frames.

Several experiments show that our model generalizes to completely new poses of the AMASS dataset (even though we did not use AMASS to train our model), and produces variations due to pose, shape and style, while being more detailed than previous style specific models [46, 17]. Furthermore, despite being trained on static poses, TailorNet produces smooth continuous animations.

2 Related Work

There are two main approaches to animation of clothing: physics based simulation (PBS), and efficient data-driven models learned from offline PBS, or real captures.

Physics Based Simulation (PBS).

Super realistic animations require simulating millions of triangles [53, 47, 22], which is computationally expensive. Many works focus on making simulation efficient [15] by adding wrinkles to low-resolution simulations [14, 25, 26, 55], or using simpler mass-spring models [43] and position based dynamics [37, 38] compromising accuracy and physical correctness for speed. Tuning the simulation parameters is a tedious task that can take weeks. Hence, several authors attempted to infer physical parameters mechanically [35, 57], from multi-view video [45, 62, 50], or from perceptual judgments of humans [49], but they only work in controlled settings and still require running PBS. PBS approaches typically require designing garments in 2D, grading them, adjusting them to the 3D body, and fine tuning parameters, which can take hours if not weeks, even for trained experts.

Data-driven cloth models.

One way to achieve realism is to capture real clothing on humans from images [5, 4, 2, 3, 18], dynamic scans [41, 39] or RGBD [52, 51] and re-target it to novel shapes, but this is limited to re-animating the same motion on a novel shape. While learning pose-dependent models from real captures [61, 29, 33] is definitely an exciting route, accurately capturing sufficient real data is still a major challenge. Hence, these models [61, 29, 33] only demonstrate generalization to motions similar to the training data.

Another option is to generate data with PBS, and learn efficient data-driven models. Early methods relied on linear auto-regression or efficient nearest neighbor search to predict pose [9, 26, 56], or pose and shape dependent clothing deformation [16]. Recent ones are based on deep learning, and vary according to the factor modeled (style, or pose and shape), the data representation, and the architecture used. Style variation is predicted from a user sketch with a VAE [58], but the pose is fixed. For a single style, pose and shape variation is regressed with MLPs and RNNs [46, 61], or Graph-NNs that process body and garment [17]. Pose effects are predicted with a normal map [29] or a displacement map [23] in UV-space. These models [9, 17, 16] tend to produce over-smooth results, with the exception of [29], but that model is trained for a single garment and subject, and, as mentioned before, generalization to in-the-wild motions (for example CMU [10]) is not demonstrated. Since there is no consensus on what representation and learning model is best suited for this task, for a fixed style and shape, we compare representations and architectures, and find that MLPs perform as well as more sophisticated Graph-NNs or image-to-image translation in UV-space. While we draw inspiration from previous works, unlike our model, none of them can jointly model style, pose and shape variation.

Pixel based models.

An alternative to 3D cloth modeling is to retrieve and warp images and videos [60, 8]. Recent methods use deep learning to generate pixels [19, 13, 12, 63, 65, 67, 20, 44, 54, 66], often guided by 2D warps, or smooth 3D models [30, 64, 59, 48], or learn to transfer texture from cloth images to 3D models [36]. They produce (at best) photo-realistic images in controlled settings, but do not capture the 3D shape of wrinkles, and cannot easily control motion, viewpoint, illumination, shape and style.

3 Garment Model Aligned with SMPL

Figure 1: Overview of our model to predict the draped garment with style γ on the body with pose θ and shape β. The low frequency component of the deformations is predicted using a single model. The high frequency, pose dependent deformations for K prototype shape-style pairs are computed separately and mixed using an RBF kernel to obtain the final high frequency component. The low and high frequency predictions are added to obtain the unposed garment, which is posed using standard skinning to get the final garment.
Method                  | Static/Dynamic | Pose Variations | Shape Variations | Style Variations | Model Public | Dataset Public
Santesteban et al. [46] | Dynamic        | ✓ | ✓ | ✗ | ✗ | ✗
Wang et al. [58]        | Static         | ✗ | ✗ | ✓ | ✗ | ✗
DeepWrinkles [29]       | Dynamic        | ✓ | ✗ | ✗ | ✗ | ✗
DRAPE [16]              | Dynamic        | ✓ | ✓ | ✗ | ✗ | ✗
GarNet [17]             | Static         | ✓ | ✓ | ✗ | ✗ | ✗
Ours                    | Static         | ✓ | ✓ | ✓ | ✓ | ✓
Table 1: Comparison of our method with other works. Ours is the first method to model the garment deformations as a function of pose, shape and style. We also make our model and dataset public for further research.

Our base garment template is aligned with SMPL [32] as done in [41, 5]. SMPL represents the human body as a parametric function of pose θ and shape β,

M(β, θ) = W(T(β, θ), J(β), θ, 𝒲),   (1)

T(β, θ) = T̄ + B_s(β) + B_p(θ),   (2)

composed of a linear function T(β, θ), which adds displacements to the base mesh vertices T̄ in a T-pose, followed by learned skinning W. Specifically, B_p(θ) adds pose-dependent deformations, and B_s(β) adds shape dependent deformations. 𝒲 are the blend weights of a skeleton with joints J(β).

We obtain the garment template topology as a submesh (with m vertices) of the SMPL template, which has n vertices. Formally, the indicator matrix I ∈ {0, 1}^{m×n} evaluates to I_ij = 1 if garment vertex i is associated with body vertex j. The particular garment style draped over the body in a 0-pose is encoded as displacements D over the unposed body shape. Since the garment is associated with the underlying SMPL body, previous works [5, 41] deform every clothing vertex with its associated SMPL body vertex function. For a given style D, shape β and pose θ, they deform clothing using the un-posed SMPL function

T^G(β, θ, D) = I T(β, θ) + D,   (3)

followed by the SMPL skinning function in Eq. 1,

G(β, θ, D) = W(T^G(β, θ, D), J(β), θ, 𝒲).   (4)

Since D is fixed, this assumes that clothing deforms in the same way as the body, which is a practical but clearly over-simplifying assumption. Like previous work, we also decompose deformation into a non-rigid component (Eq. 3) and an articulated component (Eq. 4), but unlike previous work, we learn non-rigid deformation as a function of pose, style and shape. That is, we learn true clothing variations, as we explain in the next section.

4 Method

In this section, we describe our decomposition of clothing deformation into a non-rigid component (due to pose, shape and style) and an articulated component, which we refer to as un-posing (Section 4.1). The first component of our model is a subspace of garment styles which generates variation in an A-pose (Section 4.2). As explained in the introduction, pose models specific to a fixed shape and style (Section 4.3) do preserve high frequencies, but models that combine different styles and shapes produce overly smooth results. Hence, at the heart of our technique is a model which predicts the low frequency component with a straightforward MLP, and the high frequency component with a mixture model (Sections 4.2–4.4). An overview of our method is shown in Fig. 1.

4.1 Un-posing Garment Deformation

Given a simulated garment G for a given pose θ and shape β, we first disentangle non-rigid deformation from articulation by un-posing: we invert the skinning function

D = W^{-1}(G, J(β), θ, 𝒲) − I T(β, θ),   (5)

and subtract the body shape in a canonical pose, obtaining the un-posed non-rigid deformation D as displacements from the body. Since the joints are known, computing Eq. 5 entails un-posing every vertex as (Σ_j w_ij G_j)^{-1} v_i, where G_j are the part transformation matrices, w_ij the blend weights, and v_i the posed vertex. Non-rigid deformation in the unposed space is affected by body pose, shape and garment style (size, sleeve length, fit). Hence, we propose to learn deformation as a function of shape β, pose θ and style γ, i.e. D(β, θ, γ). The model should be realistic, easy to control and differentiable.
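The per-vertex inversion in Eq. 5 can be sketched as follows. This is an illustrative numpy implementation, not the paper's code; the function name and array layout are our own, and we assume the blend weights and part transforms are given as plain arrays.

```python
import numpy as np

def unpose_vertices(posed, weights, G):
    """Invert linear blend skinning per vertex (sketch of Eq. 5):
    each posed vertex is mapped back to the canonical pose by the
    inverse of its blended part transform.
    posed: (V, 3) posed vertices, weights: (V, J) blend weights,
    G: (J, 4, 4) homogeneous part transformation matrices."""
    unposed = np.empty_like(posed)
    for i, v in enumerate(posed):
        T = np.tensordot(weights[i], G, axes=1)      # (4,4) blended transform
        vh = np.linalg.solve(T, np.append(v, 1.0))   # T^{-1} v in homogeneous coords
        unposed[i] = vh[:3]
    return unposed
```

Subtracting the associated body vertices of the unposed SMPL mesh from the result then yields the displacements D.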

4.2 Generating Parametric Model of Style

We generate style variation in an A-pose by computing a PCA subspace using the public 3D garments of [5]. Although the method of Bhatnagar et al. [5] allows transferring a 3D garment to a body in a canonical pose and shape (β = 0), physical constraints might be violated: garments are sometimes slightly flying on top of the body, and wrinkles do not correspond to an A-pose. In order to generate style variation while satisfying physical constraints, we alternate sampling from the PCA space and running PBS on the samples. We already find good results alternating PBS and PCA two times, obtaining a subspace of style variation in a static A-pose, with PCA coefficients γ. Fig. 2 shows the parametric style space for a T-shirt. This is, however, a model in a fixed A-pose and shape. Different garment styles will deform differently as a function of pose and shape. In Section 4.3 we explain our style-shape specific model of pose, and in Section 4.4 we describe our joint mixture model for varied shapes and styles.
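A minimal sketch of such a style subspace, assuming the draped garments are given as flattened displacement vectors. The SVD-based PCA and the function names are our own illustration, not the paper's implementation:

```python
import numpy as np

def fit_style_space(D, n_comp=2):
    """PCA on stacked garment displacements D of shape (N, V*3);
    the paper keeps 2 components for the T-shirt style space."""
    mean = D.mean(axis=0)
    U, S, Vt = np.linalg.svd(D - mean, full_matrices=False)
    return mean, Vt[:n_comp], S[:n_comp]

def sample_style(mean, basis, gamma):
    """Decode style coefficients gamma back to garment displacements."""
    return mean + gamma @ basis
```

Alternating this PCA fit with PBS on decoded samples, as described above, yields the final static style subspace.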

4.3 Single Style-Shape Model

For a fixed garment style and body shape pair (γ, β), we take paired simulated data of body poses θ and garment displacements D, computed according to Eq. 5, and train a model to predict pose dependent displacements. In particular, we train a multi-layer perceptron (MLP) to minimize the L1 loss between the predicted displacements and the ground-truth displacements. We observe that the model predicts a reasonable garment fit along with fine-scale wrinkles and generalizes well to unseen poses, but it is specific to a particular shape and style.
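The style-shape specific pose model is simply an MLP from pose parameters to flattened displacements. A minimal inference-time sketch in numpy follows; the layer sizes are illustrative, and the L1 training loss and dropout described in Section 6 are omitted here:

```python
import numpy as np

def init_params(sizes, rng):
    """Random (W, b) pairs for an MLP with the given layer sizes."""
    return [(rng.standard_normal((a, b)) * 0.01, np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def mlp_forward(theta, params):
    """Two-hidden-layer MLP mapping pose parameters theta to
    flattened garment displacements (ReLU hidden activations)."""
    h = theta
    for W, b in params[:-1]:
        h = np.maximum(0.0, h @ W + b)   # ReLU
    W, b = params[-1]
    return h @ W + b                      # linear output layer
```

With SMPL's 72 pose parameters as input, the output dimension would be 3 times the number of garment vertices.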

Figure 2: Overview of the T-shirt style space. The first component (top) changes the overall size and the second component (middle) changes the sleeve length. Sampling from this style space generates a wide range of T-shirt styles (bottom). In the first two rows, the corresponding component varies from left to right.

4.4 TailorNet

For the sake of simplicity, it is tempting to directly regress all non-rigid clothing deformations as a function of pose, shape and style jointly, i.e., learning D(β, θ, γ) directly with an MLP. However, our experiments show that, while this produces accurate predictions quantitatively, qualitatively the results are overly smooth and lack realism. We hypothesize that any attempt to combine the geometry of many samples with varying high frequency detail (fine wrinkles) smoothes out the details. This might be a potential explanation for the smooth results obtained in related works [46, 17, 16], even for single style models. Our idea is to decompose the garment mesh vertices in the unposed space into a smooth low-frequency shape and a high frequency shape with diffusion flow. Let f be a function on the garment surface; it is smoothed with the diffusion equation

∂f/∂t = λ Δf,   (6)

which states that the function changes over time by a scalar diffusion coefficient λ times its spatial Laplacian Δf. In order to smooth mesh geometry, we apply the diffusion equation to the vertex coordinates (which are interpreted as a discretized function on the surface)

v_i ← v_i + λ Δ(v_i),   (7)

where Δ(v_i) is the discrete Laplace-Beltrami operator applied at vertex v_i, and λ and the number of iterations control the level of smoothing. Eq. 7 is also known as Laplacian smoothing [11, 6]. We run 80 iterations to obtain a smooth low-frequency mesh G_LF and a high-frequency residual G_HF = G − G_LF. We then subtract the body shape as in Eq. 5 and predict the displacement components separately as

D(β, θ, γ) = D_LF(β, θ, γ) + D_HF(β, θ, γ),   (8)
where the low-frequency component D_LF is predicted with an MLP (smooth but accurate), whereas the high frequency component D_HF is predicted with a mixture of style-shape specific models of pose. As we show in the experiments, style-shape specific high frequency models retain details. We generalize to new shape-styles beyond the prototypes with a convex combination of the K specific models f_k^HF(θ). The mixture weights are computed with a kernel with a narrow bandwidth σ, to combine only similar wrinkle patterns:

D_HF(β, θ, γ) = Σ_k Ψ_k(β, γ) f_k^HF(θ),   Ψ_k(β, γ) ∝ exp(−‖φ(β, γ) − φ(β_k, γ_k)‖² / (2σ²)),   (9)

where φ is a shallow MLP which maps a style-shape pair to the garment displacements in a canonical A-pose. Ideally, the kernel should measure similarity in style and shape directly, but this would require simulating training data for every possible pose-shape-style combination, which is inefficient and resource intensive. Our key simplifying assumption is that two garments on two different people will deform similarly if their displacements to their respective bodies are similar; this measures clothing fit similarity. While this is an approximation, it works well in practice.
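The low/high frequency decomposition of Eqs. 6–8 can be sketched as below. For brevity we use a uniform graph Laplacian as a stand-in for the cotangent Laplace-Beltrami operator; `neighbors` lists the one-ring of each vertex, and the parameter values are illustrative:

```python
import numpy as np

def laplacian_smooth(verts, neighbors, lam=0.5, iters=80):
    """Iterative Laplacian smoothing (sketch of Eq. 7) with a
    uniform graph Laplacian: each vertex moves toward the mean
    of its one-ring neighbors, scaled by lam."""
    v = verts.copy()
    for _ in range(iters):
        avg = np.stack([v[n].mean(axis=0) for n in neighbors])
        v = v + lam * (avg - v)   # v <- v + lam * (Delta v)
    return v

def decompose(verts, neighbors, lam=0.5, iters=80):
    """Split mesh vertices into a low-frequency shape and a
    high-frequency residual, so that low + high == verts."""
    low = laplacian_smooth(verts, neighbors, lam, iters)
    return low, verts - low
```

The low-frequency part is regressed by a single MLP, while the residual is handled by the mixture of prototype models described above.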

The bandwidth σ is a free parameter of our model, allowing us to generate varying high-frequency detail. We find qualitatively good results by keeping σ small in order to combine only nearby samples. We post-process the output to remove garment intersections with the body.
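A sketch of the narrow-bandwidth RBF mixture weights, assuming the canonical-pose displacements of the query and of the K prototypes are given as flattened vectors (the function name and bandwidth value are our own):

```python
import numpy as np

def mixture_weights(d_query, d_protos, sigma=0.1):
    """RBF-kernel weights over K style-shape prototypes.
    d_query: (V*3,) canonical-pose displacements of the query
    style-shape; d_protos: (K, V*3) prototype displacements.
    A narrow bandwidth sigma keeps only nearby prototypes active."""
    dists = np.linalg.norm(d_protos - d_query, axis=1)
    w = np.exp(-dists**2 / (2.0 * sigma**2))
    return w / w.sum()   # normalized convex-combination weights
```

The final high-frequency prediction is then the weighted sum of the prototype pose-model outputs, as in Eq. 9.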

Choosing K Style-Shape Prototypes

For each chosen style-shape pair, we simulate a wide variety of poses, which is time consuming. Hence, we find K prototype style-shape pairs such that they cover the space of static displacements in an A-pose. Specifically, we want to choose K style-shape pairs such that any other style-shape can be approximated as a convex combination of the prototypes with least error; this is a non-convex problem with no global optimum guarantees. While iterative methods like K-SVD [1] might yield better coverage, here we use a simple but effective greedy approach to choose good prototypes.

We take a dataset of garments in different styles draped on different body shapes in the canonical pose. We start with a pool of body shapes in one common style and try to fit each of the other style-shape pairs as a convex combination of this pool. Then we take the style-shape with the highest approximation error and add it to the pool. We repeat this process until the pool contains K style-shape pairs. We find that all style-shapes can be well approximated with a small K, which we use here.
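The greedy selection can be sketched as follows; for brevity we use unconstrained least squares as a stand-in for the convex-combination fit described above, and the function names are our own:

```python
import numpy as np

def fit_error(pool, target):
    """Residual of approximating target as a linear combination of
    the pool rows (least squares stands in for the convex fit)."""
    coef, *_ = np.linalg.lstsq(pool.T, target, rcond=None)
    return np.linalg.norm(pool.T @ coef - target)

def greedy_prototypes(D, init_idx, K):
    """Greedily grow a pool of K prototype rows of D: repeatedly
    add the sample the current pool approximates worst."""
    pool = list(init_idx)
    while len(pool) < K:
        errs = [(-1.0 if i in pool else fit_error(D[pool], D[i]), i)
                for i in range(len(D))]
        pool.append(max(errs)[1])   # worst-approximated sample
    return pool
```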

5 Dataset

We learn our model from data pairing pose, shape and style parameters with the corresponding garment obtained from simulation. We use a commercial 3D cloth design and simulation tool, Marvelous Designer [21], to simulate these garments. Instead of designing and positioning garments manually, we use the publicly available digital wardrobe [5] captured from real world data, which includes variations in style, shape and pose. Although we describe our dataset generation with reference to the T-shirt, the method remains the same for all garments.

5.1 Garments to Generate Styles

We first retarget [41] the 3D garments from the digital wardrobe [5] to the canonical shape (β = 0) and slowly simulate them into the canonical pose. This yields a dataset of garments from which we learn the style space as in Section 4.2. However, the learnt style space may not be intuitive due to the limited variation in the data, and may contain irregular patterns and distortions owing to the registration process. So we sample 300 styles from this PCA model and simulate them again to get a larger dataset with fewer distortions. With 2 iterations of PCA and simulation, we generate consistent and meaningful style variations. We find the first 2 PCA components enough to represent the style parameters γ. See Figure 2.

5.2 Shape, Pose and Style Variations

We choose the shapes manually as follows. We sample the first two shape components at equally spaced intervals, while setting the other components to zero. To these shapes, we add the canonical zero shape β = 0. We choose styles in a similar way, by sampling the first two style components.

We simulate all combinations of these shapes and styles in the canonical pose. From the resulting style-shape pairs, we choose K prototypes as training style-shape pairs using the approach described in Section 4.4. We randomly choose additional style-shape pairs for testing.

5.3 Simulation details

For pose variations, we use static SMPL poses covering a wide range, including extreme ones. For a given style-shape pair, we simulate all its poses in one sequence. The body starts in the canonical pose and greedily transitions to the nearest remaining pose until all poses are traversed. To avoid dynamic effects (in this paper we are interested in quasi-static deformation), we insert linearly interpolated intermediate poses, and let the garment relax for a few frames. PBS sometimes fails, and hence we remove frames with self-interpenetration from the dataset.
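The greedy pose traversal with interpolated in-between poses can be sketched as below; pose vectors are treated as plain arrays, and measuring distance directly in pose-parameter space is an illustrative choice:

```python
import numpy as np

def greedy_pose_order(poses, start):
    """Order poses by greedy nearest-neighbor traversal from a
    starting pose, as used to build each simulation sequence."""
    remaining = list(range(len(poses)))
    order, cur = [], start
    while remaining:
        nxt = min(remaining, key=lambda i: np.linalg.norm(poses[i] - cur))
        order.append(nxt)
        cur = poses[nxt]
        remaining.remove(nxt)
    return order

def interpolate(a, b, n):
    """n linearly interpolated intermediate poses between a and b,
    inserted to suppress dynamic effects during simulation."""
    ts = np.linspace(0.0, 1.0, n + 2)[1:-1]
    return [(1 - t) * a + t * b for t in ts]
```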

For each training style-shape pair, we simulate the training SMPL poses with interpolated in-between poses. For each testing style-shape pair, we simulate a subset of SMPL poses with a mix of seen and unseen poses. Finally, we get 4 splits as follows: (1) train-train: training style-shape pairs, each with training poses. (2) train-test: training style-shape pairs, each with unseen poses. (3) test-train: testing style-shape pairs, each with training poses. (4) test-test: testing style-shape pairs, each with unseen poses.

6 Experiments

We evaluate our method on the T-shirt quantitatively and qualitatively, and compare it to our baseline and previous works. Our baseline and TailorNet use several MLPs; each has two hidden layers with ReLU activations and a dropout layer. We arrive at the optimal hyperparameters by tuning our baseline, and then keep them constant to train all other MLPs. See suppl. for more details.

6.1 Results of Single Style-Shape Model

We train single style-shape models on several style-shape pairs separately. The mean per-vertex test error of each model varies depending on the difficulty of the particular style-shape: the maximum error occurs for a loose fit, and the minimum for a tight fit.

For the single style-shape model, in the initial stages of this work, we experimented with different modalities of 3D mesh learning: a UV map, similar to [29], and graph convolutions (Graph CNN) on the garment mesh. For the UV map, we train a decoder from the input pose parameters to the UV map of the garment displacements. For the Graph CNN [27], we provide the pose parameters as input at each node of the garment graph. We report results on two style-shape pairs in Table 2, which shows that a simple MLP is sufficient to learn pose dependent deformations for a fixed style-shape.

Figure 3: Results on unseen pose, shape and style. Prediction by [46] (left), our mixture model (right). Note that our mixture model is able to retain more meaningful folds and details on the garment.


Style-shape MLP UV Decoder Graph CNN
Loose-fit 14.5 15.9 16.1
Tight-fit 10.1 11.4 11.7


Table 2: Mean per vertex error in mm for the pose dependent prediction by MLP, UV Decoder and Graph CNN for two style-shapes.

6.2 Results of TailorNet

Figure 4: Baseline method (left) smooths out the fine details over the garment. TailorNet (middle) is able to retain as many wrinkles as PBS groundtruth (right).
Figure 5: (left) Mixing the outputs of prototypes directly without decomposition smooths out the fine details. (right) The decomposition into high and low frequencies allows us to predict good garment fit with details.
Figure 6: Left: Predictions of TailorNet for multiple garments and completely unseen poses. Right: TailorNet prediction with a real texture.

We define our baseline as a single MLP which predicts the displacements directly. Table 3 shows that our mixture model outperforms the baseline by a slight margin on the 3 testing splits.


Split No. | Style-shape set | Pose set | Our Baseline | Our Mixture Model
2         | train           | test     | 10.6         | 10.2
3         | test            | train    | 11.7         | 11.4
4         | test            | test     | 11.6         | 11.4


Table 3: Mean per vertex error in mm for 3 testing splits in Section 5.3. Our mixture performs slightly better than the baseline quantitatively, and significantly better qualitatively.

Qualitatively, TailorNet outperforms the baseline and previous works. Figure 3 shows the garment predicted by Santesteban et al. [46] and our mixture model. Figure 4 shows the qualitative difference between the output by TailorNet and our baseline trained on the same dataset.

To validate our choice to decompose high and low frequency details, we consider a mixture model where the individual MLPs predict style-shape dependent displacements directly, without decomposition. Figure 5 shows that although it approximates the overall shape well, it loses the fine wrinkles. TailorNet retains the fine details and provides intuitive control over high frequency detail.

Sequences of AMASS [34]: TailorNet generalizes (some poses are shown in Fig. 6) despite being trained on completely different poses. Notably, the results are temporally coherent, even though we do not model dynamics. See suppl. video for visualization.

6.3 Multiple Garments

To show the generalizability of TailorNet, we train it separately for 3 more garments: shirt, pants and skirt. Since a skirt does not follow the topology of the template body mesh, we attach a skirt template to the root joint of SMPL [41]. For each of these garments, we simulate the dataset for 9 shapes with a common style and train the model. Figure 6 shows the detailed predictions by TailorNet. Since we use base garments which come from a real digital wardrobe [5], we can also transfer textures from real scans onto our predictions.

6.4 Runtime Performance

We implement TailorNet using PyTorch [40]. On a gaming laptop with an NVIDIA GeForce GTX 1060 GPU and an Intel i7 CPU, our approach runs in 1 to 2 ms per frame, which is 1000 times faster than the PBS it is trained from. On the CPU alone, it still runs 100 times faster than PBS.

7 Discussion and Conclusion

TailorNet is the first data-driven clothing model of pose, shape and style. The experiments show that a simple MLP approximates clothing deformations as accurately as more sophisticated graph-CNNs or displacement prediction in UV-space, but shares the same limitation as existing methods: when it is trained with different body shapes (and styles in our case), the results are overly smooth and lack realism. To address this, we proposed a model which predicts low and high frequencies separately. Several experiments show that our narrow bandwidth mixture model for high frequencies preserves significantly more detail than existing models, despite having a much harder task, that is, modelling pose dependent deformation as a function of shape and style jointly.

In future work, we plan to fit the model to scans, images and videos, and will investigate refining it using video data in a self-supervised fashion, or using real 3D cloth captures [41]. We also plan to make our model sensitive to the physical properties of the cloth fabric and human soft-tissue [42].

TailorNet runs more than 1000 times faster than PBS, and allows easy control over style, shape and pose without manual editing. Furthermore, the model is differentiable, which is crucial for computer vision and learning applications. We will make TailorNet and our dataset publicly available, and will continuously add more garment categories and styles to make it even more widely usable. TailorNet fills a key missing component in existing body models like SMPL and ADAM [32, 24]: realistic clothing, which is necessary for the animation of virtual humans, and to explain the complexities of dressed people in scans, images, and video.

Acknowledgements. This work is partly funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) - 409792180 (Emmy Noether Programme, project: Real Virtual Humans) and Google Faculty Research Award. We thank Bharat Lal Bhatnagar and Garvita Tiwari for insightful discussions, and Dan Casas for providing us test results of [46].


  • [1] M. Aharon, M. Elad, and A. Bruckstein (2006) K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing 54 (11), pp. 4311–4322. Cited by: §4.4.
  • [2] T. Alldieck, M. Magnor, B. L. Bhatnagar, C. Theobalt, and G. Pons-Moll (2019) Learning to reconstruct people in clothing from a single RGB camera. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [3] T. Alldieck, M. Magnor, W. Xu, C. Theobalt, and G. Pons-Moll (2018) Video based reconstruction of 3D people models. In IEEE Conf. on Computer Vision and Pattern Recognition, Cited by: §2.
  • [4] T. Alldieck, G. Pons-Moll, C. Theobalt, and M. Magnor (2019-10) Tex2Shape: detailed full human body geometry from a single image. In IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • [5] B. L. Bhatnagar, G. Tiwari, C. Theobalt, and G. Pons-Moll (2019-10) Multi-garment net: learning to dress 3d people from images. In IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2, §3, §3, §4.2, §5.1, §5, §6.3.
  • [6] M. Botsch, L. Kobbelt, M. Pauly, P. Alliez, and B. Lévy (2010) Polygon mesh processing. A K Peters. External Links: Link, ISBN 978-1-56881-426-1 Cited by: §4.4.
  • [7] D. Casas, M. Volino, J. Collomosse, and A. Hilton (2014) 4d video textures for interactive character appearance. Computer Graphics Forum 33 (2), pp. 371–380. Cited by: §1.
  • [8] D. Casas, M. Volino, J. Collomosse, and A. Hilton (2014) 4D Video Textures for Interactive Character Appearance. Computer Graphics Forum (Proceedings of EUROGRAPHICS) 33 (2), pp. 371–380. Cited by: §2.
  • [9] E. de Aguiar, L. Sigal, A. Treuille, and J. K. Hodgins (2010) Stable spaces for real-time clothing. ACM Trans. Graph. 29 (4), pp. 106:1–106:9. External Links: Link, Document Cited by: §1, §2.
  • [10] F. De la Torre, J. Hodgins, J. Montano, S. Valcarcel, R. Forcada, and J. Macey (2009) Guide to the carnegie mellon university multimodal activity (cmu-mmac) database. Robotics Institute, Carnegie Mellon University 5. Cited by: §2.
  • [11] M. Desbrun, M. Meyer, P. Schröder, and A. H. Barr (1999) Implicit fairing of irregular meshes using diffusion and curvature flow. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pp. 317–324. Cited by: §4.4.
  • [12] H. Dong, X. Liang, X. Shen, B. Wang, H. Lai, J. Zhu, Z. Hu, and J. Yin (2019-10) Towards multi-pose guided virtual try-on network. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • [13] H. Dong, X. Liang, X. Shen, B. Wu, B. Chen, and J. Yin (2019-10) FW-gan: flow-navigated warping gan for video virtual try-on. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • [14] R. Gillette, C. Peters, N. Vining, E. Edwards, and A. Sheffer (2015) Real-time dynamic wrinkling of coarse animated cloth. In Proc. Symposium on Computer Animation, Cited by: §2.
  • [15] R. Goldenthal, D. Harmon, R. Fattal, M. Bercovier, and E. Grinspun (2007) Efficient simulation of inextensible cloth. ACM Transactions on Graphics (Proceedings of SIGGRAPH 2007) 26 (3). Cited by: §2.
  • [16] P. Guan, L. Reiss, D. A. Hirshberg, A. Weiss, and M. J. Black (2012) DRAPE: dressing any person. ACM Trans. Graph. 31 (4), pp. 35:1–35:10. External Links: Link, Document Cited by: §1, §1, §1, §2, Table 1, §4.4.
  • [17] E. Gundogdu, V. Constantin, A. Seifoddini, M. Dang, M. Salzmann, and P. Fua (2018) GarNet: A two-stream network for fast and accurate 3d cloth draping. CoRR abs/1811.10983. External Links: Link, 1811.10983 Cited by: §1, §1, §1, §2, Table 1, §4.4.
  • [18] M. Habermann, W. Xu, M. Zollhoefer, G. Pons-Moll, and C. Theobalt (2019-10) LiveCap: real-time human performance capture from monocular video. Transactions on Graphics (ToG) 2019. Cited by: §2.
  • [19] X. Han, Z. Wu, Z. Wu, R. Yu, and L. S. Davis (2018) VITON: an image-based virtual try-on network. In CVPR, Cited by: §2.
  • [20] W. Hsiao, I. Katsman, C. Wu, D. Parikh, and K. Grauman (2019-10) Fashion++: minimal edits for outfit improvement. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • [21] Marvelous Designer. Note: https://www.marvelousdesigner.com/ Cited by: §5.
  • [22] C. Jiang, T. Gast, and J. Teran (2017) Anisotropic elastoplasticity for cloth, knit and hair frictional contact. ACM Transactions on Graphics (TOG) 36 (4), pp. 152. Cited by: §2.
  • [23] N. Jin, Y. Zhu, Z. Geng, and R. Fedkiw (2018) A pixel-based framework for data-driven clothing. CoRR abs/1812.01677. External Links: Link, 1812.01677 Cited by: §2.
  • [24] H. Joo, T. Simon, and Y. Sheikh (2018) Total capture: a 3d deformation model for tracking faces, hands, and bodies. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8320–8329. Cited by: §7.
  • [25] L. Kavan, D. Gerszewski, A. W. Bargteil, and P. Sloan (2011-07) Physics-inspired upsampling for cloth simulation in games. ACM Trans. Graph. 30 (4), pp. 93:1–93:10. External Links: ISSN 0730-0301, Link, Document Cited by: §2.
  • [26] D. Kim, W. Koh, R. Narain, K. Fatahalian, A. Treuille, and J. F. O’Brien (2013-07) Near-exhaustive precomputation of secondary cloth effects. ACM Transactions on Graphics 32 (4), pp. 87:1–7. Note: Proceedings of ACM SIGGRAPH 2013, Anaheim External Links: Link Cited by: §2, §2.
  • [27] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), Cited by: §6.1.
  • [28] N. Kolotouros, G. Pavlakos, and K. Daniilidis (2019) Convolutional mesh regression for single-image human shape reconstruction. In CVPR, Cited by: §1.
  • [29] Z. Lahner, D. Cremers, and T. Tung (2018) Deepwrinkles: accurate and realistic clothing modeling. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 667–684. Cited by: §1, §1, §2, §2, Table 1, §6.1.
  • [30] C. Lassner, G. Pons-Moll, and P. V. Gehler (2017-10) A generative model of people in clothing. In Proceedings IEEE International Conference on Computer Vision (ICCV), Piscataway, NJ, USA. Cited by: §2.
  • [31] O. Litany, A. Bronstein, M. Bronstein, and A. Makadia (2018) Deformable shape completion with graph convolutional autoencoders. In CVPR, Cited by: §1.
  • [32] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black (2015-10) SMPL: a skinned multi-person linear model. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) 34 (6), pp. 248:1–248:16. Cited by: §1, §1, §3, §7.
  • [33] Q. Ma, J. Yang, A. Ranjan, S. Pujades, G. Pons-Moll, S. Tang, and M. Black (2020-06) Learning to dress 3d people in generative clothing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [34] N. Mahmood, N. Ghorbani, N. F. Troje, G. Pons-Moll, and M. J. Black (2019-10) AMASS: archive of motion capture as surface shapes. In IEEE International Conference on Computer Vision (ICCV), Cited by: TailorNet: Predicting Clothing in 3D as a Function of Human Pose, Shape and Garment Style, §6.2.
  • [35] E. Miguel, D. Bradley, B. Thomaszewski, B. Bickel, W. Matusik, M. A. Otaduy, and S. Marschner (2012-05) Data-driven estimation of cloth simulation models. Comput. Graph. Forum 31 (2pt2), pp. 519–528. External Links: ISSN 0167-7055, Link, Document Cited by: §2.
  • [36] A. Mir, T. Alldieck, and G. Pons-Moll (2020-06) Learning to transfer texture from clothing images to 3d humans. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [37] M. Müller, B. Heidelberger, M. Hennix, and J. Ratcliff (2007) Position based dynamics. Journal of Visual Communication and Image Representation 18 (2), pp. 109–118. Cited by: §2.
  • [38] M. Müller (2008) Hierarchical position based dynamics. Cited by: §2.
  • [39] A. Neophytou and A. Hilton (2014) A layered model of human body and garment deformation. In 2014 2nd International Conference on 3D Vision, Vol. 1, pp. 171–178. Cited by: §2.
  • [40] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, Cited by: §6.4.
  • [41] G. Pons-Moll, S. Pujades, S. Hu, and M. Black (2017) ClothCap: seamless 4D clothing capture and retargeting. ACM Transactions on Graphics, (Proc. SIGGRAPH) 36 (4). External Links: Link Cited by: §2, §3, §3, §5.1, §6.3, §7.
  • [42] G. Pons-Moll, J. Romero, N. Mahmood, and M. J. Black (2015-08) Dyna: a model of dynamic human shape in motion. ACM Transactions on Graphics, (Proc. SIGGRAPH) 34 (4), pp. 120:1–120:14. Cited by: §7.
  • [43] X. Provot et al. (1995) Deformation constraints in a mass-spring model to describe rigid cloth behaviour. In Graphics interface, pp. 147–147. Cited by: §2.
  • [44] A. Raj, P. Sangkloy, H. Chang, J. Hays, D. Ceylan, and J. Lu (2018) Swapnet: image based garment transfer. In European Conference on Computer Vision, pp. 679–695. Cited by: §2.
  • [45] B. Rosenhahn, U. Kersting, K. Powell, R. Klette, G. Klette, and H. Seidel (2007) A system for articulated tracking incorporating a clothing model. Machine Vision and Applications 18 (1), pp. 25–40. Cited by: §2.
  • [46] I. Santesteban, M. A. Otaduy, and D. Casas (2019) Learning-based animation of clothing for virtual try-on. Comput. Graph. Forum 38 (2), pp. 355–366. External Links: Link, Document Cited by: §1, §1, §1, §2, Table 1, §4.4, Figure 3, §6.2, §7.
  • [47] A. Selle, J. Su, G. Irving, and R. Fedkiw (2009) Robust high-resolution cloth using parallelism, history-based collisions, and accurate friction. IEEE Transactions on Visualization and Computer Graphics 15 (2), pp. 339–350. Cited by: §2.
  • [48] A. Shysheya, E. Zakharov, K. Aliev, R. Bashirov, E. Burkov, K. Iskakov, A. Ivakhnenko, Y. Malkov, I. Pasechnik, D. Ulyanov, et al. (2019) Textured neural avatars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2387–2397. Cited by: §2.
  • [49] L. Sigal, M. Mahler, S. Diaz, K. McIntosh, E. Carter, T. Richards, and J. Hodgins (2015-07) A perceptual control space for garment simulation. ACM Trans. Graph. 34 (4), pp. 117:1–117:10. External Links: ISSN 0730-0301, Link, Document Cited by: §2.
  • [50] C. Stoll, J. Gall, E. de Aguiar, S. Thrun, and C. Theobalt (2010-12) Video-based reconstruction of animatable human characters. ACM Trans. Graph. 29 (6), pp. 139:1–139:10. External Links: ISSN 0730-0301, Link, Document Cited by: §2.
  • [51] Y. Tao, Z. Zheng, K. Guo, J. Zhao, D. Quionhai, H. Li, G. Pons-Moll, and Y. Liu (2018-06) DoubleFusion: real-time capture of human performance with inner body shape from a depth sensor. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [52] Y. Tao, Z. Zheng, Y. Zhong, J. Zhao, D. Quionhai, G. Pons-Moll, and Y. Liu (2019-06) SimulCap : single-view human performance capture with cloth simulation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [53] D. Terzopoulos, J. Platt, A. Barr, and K. Fleischer (1987) Elastically deformable models. ACM Siggraph Computer Graphics 21 (4), pp. 205–214. Cited by: §2.
  • [54] B. Wang, H. Zheng, X. Liang, Y. Chen, and L. Lin (2018) Toward characteristic-preserving image-based virtual try-on network. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 589–604. Cited by: §2.
  • [55] H. Wang, F. Hecht, R. Ramamoorthi, and J. F. O’Brien (2010-07) Example-based wrinkle synthesis for clothing animation. ACM Transactions on Graphics 29 (4), pp. 107:1–8. Note: Proceedings of ACM SIGGRAPH 2010, Los Angles, CA External Links: Link Cited by: §2.
  • [56] H. Wang, F. Hecht, R. Ramamoorthi, and J. F. O’Brien (2010) Example-based wrinkle synthesis for clothing animation. In Acm Transactions on Graphics (TOG), Vol. 29, pp. 107. Cited by: §2.
  • [57] H. Wang, J. F. O’Brien, and R. Ramamoorthi (2011-07) Data-driven elastic models for cloth: modeling and measurement. ACM Transactions on Graphics, Proc. SIGGRAPH 30 (4), pp. 71:1–11. Cited by: §2.
  • [58] T. Y. Wang, D. Ceylan, J. Popovic, and N. J. Mitra (2018) Learning a shared shape space for multimodal garment design. ACM Trans. Graph. 37 (6), pp. 203:1–203:13. External Links: Link, Document Cited by: §1, §2, Table 1.
  • [59] C. Weng, B. Curless, and I. Kemelmacher-Shlizerman (2019) Photo wake-up: 3d character animation from a single photo. In IEEE Conf. on Computer Vision and Pattern Recognition, Cited by: §2.
  • [60] F. Xu, Y. Liu, C. Stoll, J. Tompkin, G. Bharaj, Q. Dai, H. Seidel, J. Kautz, and C. Theobalt (2011-07) Video-based characters: creating new human performances from a multi-view video database. ACM Trans. Graph. 30 (4), pp. 32:1–32:10. External Links: ISSN 0730-0301, Link, Document Cited by: §2.
  • [61] J. Yang, J. Franco, F. Hétroy-Wheeler, and S. Wuhrer (2018) Analyzing clothing layer deformation statistics of 3d human motions. In European Conf. on Computer Vision, pp. 237–253. Cited by: §2, §2.
  • [62] S. Yang, Z. Pan, T. Amert, K. Wang, L. Yu, T. Berg, and M. C. Lin (2018) Physics-inspired garment recovery from a single-view image. ACM Transactions on Graphics (TOG) 37 (5), pp. 170. Cited by: §2.
  • [63] R. Yu, X. Wang, and X. Xie (2019-10) VTNFP: an image-based virtual try-on network with body and clothing feature preservation. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • [64] M. Zanfir, A. Popa, A. Zanfir, and C. Sminchisescu (2018) Human appearance transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5391–5399. Cited by: §2.
  • [65] B. Zhao, X. Wu, Z. Cheng, H. Liu, Z. Jie, and J. Feng (2018) Multi-view image generation from a single-view. In 2018 ACM Multimedia Conference on Multimedia Conference, pp. 383–391. Cited by: §2.
  • [66] S. Zhu, S. Fidler, R. Urtasun, D. Lin, and C. C. Loy (2017) Be your own prada: fashion synthesis with structural coherence. In ICCV, Cited by: §2.
  • [67] S. Zhu, R. Urtasun, S. Fidler, D. Lin, and C. Change Loy (2017) Be your own prada: fashion synthesis with structural coherence. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1680–1688. Cited by: §2.