DeePSD: Automatic Deep Skinning And Pose Space Deformation For 3D Garment Animation

09/06/2020 ∙ by Hugo Bertiche, et al.

We present a novel approach to the garment animation problem through deep learning. Previous approaches propose learning a single model for one or a few garment types, or alternatively, extend a human body model to represent multiple garment types. These works are not able to generalize to the arbitrarily complex outfits we commonly find in real life. Our proposed methodology is able to work with any topology, complexity and multiple layers of cloth. Because of this, it is also able to generalize to completely unseen outfits with complex details. We design our model such that it can be efficiently deployed on portable devices and achieve real-time performance. Finally, we present an approach for unsupervised learning.




1 Introduction

Dressed human animation has been a topic of interest for decades due to its numerous applications in the entertainment and videogame industries and, more recently, in virtual reality and augmented reality. Depending on the application, we find two main classical computer graphics approaches. Physically Based Simulation (PBS) [baraff1998large, liu2017quasi, provot1997collision, provot1995deformation, tang2013gpu, vassilev2001fast, zeller2005cloth] approaches are able to obtain highly realistic cloth dynamics at the expense of a huge computational cost. Additionally, PBS needs accurate parameter fine-tuning to obtain the desired results, making it a time-consuming task that requires expertise in the field. On the other hand, Linear Blend Skinning (LBS) [kavan2008geometric, kavan2005spherical, le2012smooth, Magnenat-thalmann88joint-dependentlocal, wang2007real, wang2002multi] and Pose Space Deformation (PSD) [allen2002articulated, anguelov2005scape, lewis2000pose, loper2015smpl] models are suitable for environments where computational resources are limited or real-time performance is necessary. In exchange, realism is highly compromised. In conclusion, classical computer graphics approaches present a trade-off between realism and performance.

Deep learning has already proven successful in complex 3D tasks [arsalan2017synthesizing, han2017deepsketch2face, madadi2020smplr, omran2018neural, qi2017pointnet, richardson20163d, socher2012convolutional]. Due to the interest in the topic and the recently available 3D datasets on garments, we see the scientific community pushing this research line [alldieck2018video, alldieck2019tex2shape, bertiche2019cloth3d, bhatnagar2019multi, guan2012drape, lahner2018deepwrinkles, patel2020tailornet, santesteban2019learning]. Most proposals are built as non-linear PSD models learnt through deep learning. These methods yield models describing one or a few garment types and, therefore, lack generalization capabilities. To overcome this, some works propose encoding garment types as a subset of body vertices. This allows generalizing to more garments, but these works are still limited to single simplified garments, as this approach does not allow working with multiple layers of cloth or arbitrary topology.

In this paper we propose learning a mapping from the space of template garments to the space of PSD models that describe their motion. We will show how this allows generalization to completely unseen garments with arbitrary topology. Our method can work with multiple garments at once (whole outfits), multiple layers of cloth and multiple resolutions, and it also allows arbitrarily complex details. Additionally, we propose a deep model of small size that can achieve real-time performance. The list of our contributions is as follows:

  • Outfit Generalization. Our proposal is the only current work that is able to generalize to completely unseen outfits without additional training. This overcomes the main drawbacks of the current literature. It greatly increases applicability in scenarios with a huge and growing number of outfits, such as virtual try-ons and videogames, where customization is key.

  • Automatic Garment Skinning. To define valid PSD models, it is required to obtain a skinning consistent w.r.t. the deformations. We are the first to propose an automatic deep learning based approach for garment skinning.

  • Efficient Deployment. We demonstrate that our new approach to the problem obtains appealing results with a relatively small model (around MB) and achieves real-time performance. This aspect is also key for deployment on portable devices or in scenarios where computational resources are limited (virtual try-ons and VR/AR goggles).

  • Unsupervised learning. With the findings presented throughout this paper, we prove it is possible to train the proposed model without the need for PBS data.

The paper is structured as follows. In Sec. 2, we describe the current state-of-the-art on garment animation. Next, in Sec. 3, we present a novel approach to the problem. In Sec. 4 we describe the methodology. Later, in Sec. 5 we show experiments and analyses. In Sec. 6 we describe the unsupervised approach. Finally, in Sec. 7 we present the conclusions and set the future directions of research.

2 State-of-the-art

The garment animation domain has been widely explored by the computer graphics community due to its numerous applications in the entertainment industry. With the recent developments of deep learning in the 3D domain and the increasing amount of available data, neural networks are beginning to show promising results.

2.1 Classical Computer Graphics Approaches

Obtaining realistic cloth behaviour is possible through PBS (Physically Based Simulation), commonly through the well-known mass-spring model. The literature on the topic is quite extensive. It focuses on improving the efficiency and stability of the simulation by simplifying and/or specializing on specific setups [baraff1998large, provot1997collision, provot1995deformation, vassilev2001fast], or, alternatively, on proposing new energy-based algorithms that enhance robustness and realism and generalize to other soft bodies [liu2017quasi]. Other authors propose leveraging the parallel computation capabilities of modern GPUs [tang2013gpu, zeller2005cloth]. These approaches can achieve high realism at the expense of a great computational cost, even with the aforementioned improvements. For this reason, PBS is not an appropriate solution when real-time performance is required or computational capacity is limited (portable devices). Additionally, simulations require exhaustive fine-tuning of hyperparameters through trial and error.

On the other hand, for applications that prioritize performance, we find LBS (Linear Blend Skinning). This is the standard approach in computer graphics for the animation of 3D models. Each vertex of the object to animate is attached to a skeleton through a set of weights that are used to linearly combine joint transformations. In the garment domain, outfits are attached to the skeleton driving body motion. The approach has also been widely studied and its drawbacks addressed [kavan2008geometric, kavan2005spherical, le2012smooth, Magnenat-thalmann88joint-dependentlocal, wang2007real, wang2002multi]. While it is possible to achieve real-time performance, cloth dynamics are highly non-linear, which results in a significant loss of realism when LBS is applied to garments. Finally, we can also find hybrid approaches, where tighter parts of the outfit are rigged to a skeleton and loose parts are simulated in a simplified manner and with a low vertex count. The garment interacts only with the body, not with the environment. It is common to find this approach in modern video games, as it increases realism without excessively hurting performance.

2.2 Learning-Based Approaches

Due to the drawbacks found in the classical LBS approach, PSD (Pose Space Deformation) models appeared [lewis2000pose]. To avoid artifacts due to skinning, corrective deformations are applied to the mesh in rest pose. Additionally, PSD handles pose-dependent high frequency details of 3D objects. While hand-crafted PSD is possible, in practice it is learnt from data. We find applications of this technique for body models [allen2002articulated, anguelov2005scape, loper2015smpl], where deformation bases are computed through linear decomposition of registered body scans. Similarly, in the garment domain, Guan et al. [guan2012drape] apply the same techniques to a few template garments on data obtained through simulation. Lähner et al. [lahner2018deepwrinkles] also propose linearly learnt PSD for garments, but conditioned on temporal features processed by an RNN to achieve a non-linear mapping. Later, Santesteban et al. [santesteban2019learning] propose an explicit non-linear mapping for PSD through an MLP for a single template garment. The main drawback of these approaches is that PSD must be learnt for each template garment, which in turn requires new simulations to obtain the corresponding data.

To address this issue, many researchers propose an extension of a human body model (SMPL [loper2015smpl]), encoding garments as additional displacements and topology as subsets of vertices [alldieck2018video, alldieck2019tex2shape, bertiche2019cloth3d, bhatnagar2019multi, patel2020tailornet]. Alldieck et al. [alldieck2018video, alldieck2019tex2shape] propose a single model for body and clothes, first as vertex displacements and later as texture displacement maps, to infer 3D shape from single RGB images. Similarly, Bhatnagar et al. [bhatnagar2019multi] also learn a space for body deformations to encode outfits, plus an additional segmentation to separate body and clothes, also to infer 3D garments from RGB. Patel et al. [patel2020tailornet] encode a few different garment types as subsets of body vertices and propose a strategy to explicitly deal with cloth high frequency pose-dependent details for different body shapes and garment styles. Bertiche et al. [bertiche2019cloth3d] encode thousands of garments on top of the human body by masking its vertices. They learn a continuous space for garment types, on which they later condition, along with the pose, the vertex deformations. Using a body model to represent garments allows handling multiple types with a single model. Nonetheless, it is still limited to single garments, as it cannot work with multiple layers of cloth. For the same reason, it cannot handle complex garment details. This means these approaches are constrained to deal with single simplified garments, reducing their applicability in the real world. Our proposed methodology allows working with an arbitrary topology, number of layers and complex details.

3 Automatic Skinning and PSD

Classical computer graphics approaches resort to manual template skinning and/or costly simulation, which compromises realism, performance or applicability. On the other hand, learning-based approaches, both non-deep and deep, propose a manual skinning followed by a data-driven method to compute Pose Space Deformations. Alternatively, some state-of-the-art works adapt existing body models (usually SMPL) to represent garments. Nonetheless, these strategies yield skinned models with a learnt PSD as well, as the support body model also belongs to this family of models. More formally, we can describe previous approaches as:

$$V_\theta = W(T + D(\theta), J, \theta, \mathcal{W}) \tag{1}$$

where $V_\theta$ is the posed garment, $W$ is the skinning function with blend weights $\mathcal{W}$, $D$ deforms the template garment $T$ as a function of the pose $\theta$ (learnt PSD) and $J$ represents the articulated skeleton that drives the motion as a set of joints. It is very common to find the function $D$ conditioned on shape coefficients to represent static deformations (e.g. body shape for SMPL), as well as on additional variables or temporal pose information to provide more complex deformations. While these approaches work well in practice, they lack generalization, as they can only animate a single template garment. Additional template garments need to be designed, simulated, skinned and trained. Therefore, these approaches are not suitable to represent the real-world outfit distribution.

We propose an alternative methodology that overcomes the lack of generalization in terms of garment types and styles. Instead of training or designing a single or a few PSD models, we propose learning a model able to automatically provide valid blend weights and PSD for different template garments. More formally:

$$f : T \mapsto \{\mathcal{W}, D(\theta)\} \tag{2}$$

where $f$ is a function that maps the template garment $T$ to its corresponding skinning function with blend weights $\mathcal{W}$ and PSD $D$. Throughout this paper, we will see that this new approach to the problem has several interesting properties, the main one being the possibility to generalize to new template garments without training. Additionally, we will present a deep model architecture which is efficient in training and, especially, at deployment time. This makes it suitable for portable devices such as smartphones and VR/AR goggles, as well as scenarios where computing power might be limited, such as virtual try-ons.

4 Methodology

Figure 2: Proposed model architecture. Three main components: 1) $\Phi$ computes learnt local geometric descriptors for each template vertex, 2) $\Omega$ estimates blend weights from geometric descriptors and 3) $\Psi$ obtains vertex deformations as a function of geometric descriptors concatenated with body pose (broadcast across vertices). For deployment, $\Phi$ and $\Omega$ need to be computed only once per template (static pass) and are implemented as a set of graph convolutions. $\Psi$ needs to be updated for each pose (dynamic pass) and is implemented as a set of fully connected layers applied to vertex features.

In this section we formally describe the problem and present the chosen architecture for the model along with detailed explanations of each of its components.

4.1 Defining the problem

Our goal is to implement a model able to define an automatic skinning and PSD for any template garment. Given the standard formulation for this family of models (eq. 1), the transformed $i$-th vertex $v_i$ is obtained as:

$$v_i = \sum_{k} w_{i,k}\, G_k(\theta, J)\, (t_i + d_i) \tag{3}$$

where $w_{i,k}$ is the blend weight for vertex $i$ and joint $k$, $G_k(\theta, J)$ yields the corresponding linear transformation for the joint $k$, $t_i$ is the $i$-th vertex in the template mesh and $d_i$ is the PSD for the $i$-th vertex. We reformulate the problem as:

$$v_i = \sum_{k} \Omega_k(\Phi_i(T))\, G_k(\theta, J)\, \big(t_i + \Psi(\Phi_i(T), \theta)\big) \tag{4}$$

where $\Phi_i$ is a function over the template garment $T$ applied to the local neighbourhood of the $i$-th vertex, and $\Omega$ and $\Psi$ are functions that compute the blend weight $w_{i,k}$ and the PSD $d_i$, respectively, for the $i$-th vertex from its local neighbourhood features. This formulation presents interesting properties:

  • Vertex order and topology invariance. Skinning blend weights and deformations do not depend on vertex index. They are computed from local geometric information, which is invariant to the vertex order under which the template has been defined. For the same reason, it is also invariant to topology.

  • Automatic skinning. Classical approaches compute an estimation by proximity to joints, which is used as an initialization and requires subsequent human refinement. We are the first to propose a fully automatic learnt skinning for garment animation.

  • Compatibility. Animating skinned 3D models with an articulated skeleton is the current standard in computer graphics. This means that a model that generates blend weights and a PSD can be efficiently integrated into existing computer graphics pipelines.

The most important property is the invariance to vertex order and topology. This allows training with heterogeneous and unstructured meshes, eliminating the need to make data dimensionality and topology uniform, as well as working with whole outfits instead of single garments (which we refer to as template outfits for the rest of the paper). This property also allows the model to generalize to new templates of arbitrary dimensionality and topology. This is especially useful for the videogame industry, CGI artists and virtual try-ons, as it removes the need for manual design or training for each outfit, yielding automatic scalability for these applications.
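The standard skinning formulation this section builds on (a blend-weighted sum of joint transformations applied to each template vertex plus its deformation) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `skin_vertices`, the use of 4×4 homogeneous matrices for the joint transformations and the toy values are all assumptions made for the example.

```python
import numpy as np

def skin_vertices(template, psd, weights, transforms):
    """Linear blend skinning with additive pose space deformations.

    template:   (N, 3) rest-pose vertices
    psd:        (N, 3) per-vertex corrective offsets
    weights:    (N, K) blend weights, each row summing to 1
    transforms: (K, 4, 4) homogeneous joint transformations (assumed layout)
    """
    deformed = template + psd
    ones = np.ones((deformed.shape[0], 1))
    homogeneous = np.concatenate([deformed, ones], axis=1)       # (N, 4)
    # Blend the joint transformations per vertex: (N, K) x (K, 4, 4) -> (N, 4, 4)
    blended = np.einsum("nk,kab->nab", weights, transforms)
    posed = np.einsum("nab,nb->na", blended, homogeneous)        # (N, 4)
    return posed[:, :3]

# Toy example: two joints, the second one translated by +1 along x.
template = np.array([[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
psd = np.zeros_like(template)
weights = np.array([[1.0, 0.0], [0.5, 0.5]])
transforms = np.stack([np.eye(4), np.eye(4)])
transforms[1, 0, 3] = 1.0
posed = skin_vertices(template, psd, weights, transforms)
```

Each output row depends only on the matching rows of `template`, `psd` and `weights`, so reordering the vertices simply reorders the result. This is the property that makes a per-vertex prediction of weights and deformations independent of vertex indexing.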

4.2 Architecture

Following the formulation presented in eq. 4, we define an architecture able to estimate the functions $\Phi$, $\Omega$ and $\Psi$. Eq. 4 uses per-vertex notation; in a more general way, we define them as:

$$\Phi : \mathbb{R}^{N \times 3} \rightarrow \mathbb{R}^{N \times F} \tag{5}$$

$$\Omega : \mathbb{R}^{N \times F} \rightarrow \mathbb{R}^{N \times K} \tag{6}$$

$$\Psi : \mathbb{R}^{N \times (F + |\theta|)} \rightarrow \mathbb{R}^{N \times 3} \tag{7}$$

where $\Phi$ is the function that maps each vertex of the input template outfit to its corresponding geometric descriptor, $\Omega$ computes the blend weights for each template vertex as a function of its geometric descriptor and $\Psi$ obtains each vertex deformation based on its geometric descriptor and body pose $\theta$. Here $N$ is the number of template vertices, $K$ the number of joints, $|\theta|$ the dimensionality of the pose and $F$ the dimensionality chosen for the learnt geometric descriptors. Note that $\Phi$ and $\Omega$ depend only on the template outfit, while $\Psi$ requires pose parameters as input. This choice of architecture allows understanding the model as two meaningful subcomponents: the static pass ($\Phi$ and $\Omega$), which corresponds to the skinning, and the dynamic pass ($\Psi$), which represents the PSD. Fig. 2 depicts the pipeline of our method.

4.2.1 Static pass

We define the static pass as those model components that do not depend on the current pose of the body. Keeping in mind the model applicability in garment animation domain, we know that the static pass shall be run once per outfit. This means that we can arbitrarily increase its complexity (number of parameters and layers) without hurting the animation performance. The static pass corresponds to the function defined in eq. 2. It yields the blend weights and a function that describes the PSD for the input template outfit.

In order to obtain vertex order and topology invariance, we define both $\Phi$ and $\Omega$ as a set of graph convolutions. Applying graph convolutions to a mesh returns another mesh with the same topology but different features, as described in eq. 5 and 6. For our experiments, we use four graph convolution layers for both components $\Phi$ and $\Omega$ and a dimensionality of for the intermediate geometric descriptor.
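To illustrate why graph convolutions give vertex order invariance, the sketch below builds a toy static pass and checks that relabelling the vertices only reorders its outputs. It is a hedged example, not the paper's architecture: it uses a single mean-aggregation layer per component instead of four, takes raw vertex coordinates as input features, and applies a softmax so the blend weights sum to one. The layer type, the softmax, the sizes `N`, `F`, `K` and the random weights are all assumptions.

```python
import numpy as np

def normalized_adjacency(num_vertices, edges):
    """Row-normalised adjacency with self-loops for an undirected mesh graph."""
    adj = np.eye(num_vertices)
    for i, j in edges:
        adj[i, j] = 1.0
        adj[j, i] = 1.0
    return adj / adj.sum(axis=1, keepdims=True)

def graph_conv(features, adj, weight, bias):
    """Average neighbour features, then apply a shared linear map and ReLU."""
    return np.maximum(adj @ features @ weight + bias, 0.0)

def softmax(x):
    shifted = x - x.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
N, F, K = 5, 8, 4  # hypothetical numbers of vertices, descriptor channels, joints
verts = rng.normal(size=(N, 3))
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
W1, b1 = rng.normal(size=(3, F)), np.zeros(F)
W2, b2 = rng.normal(size=(F, K)), np.zeros(K)

def static_pass(verts, edges):
    adj = normalized_adjacency(len(verts), edges)
    descriptors = graph_conv(verts, adj, W1, b1)           # stands in for Phi
    blend_weights = softmax(adj @ descriptors @ W2 + b2)   # stands in for Omega
    return descriptors, blend_weights

descriptors, blend_weights = static_pass(verts, edges)

# Relabel the vertices: new index p holds old vertex perm[p].
perm = rng.permutation(N)
inverse = np.argsort(perm)  # old index -> new index
perm_edges = [(inverse[i], inverse[j]) for i, j in edges]
perm_desc, perm_weights = static_pass(verts[perm], perm_edges)
```

The permuted outputs match the original ones up to the same reordering (`perm_desc` equals `descriptors[perm]`), because the layer weights are shared across vertices and aggregation only follows mesh edges.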

4.2.2 Dynamic pass

The dynamic pass consists of the only component that depends on the body pose $\theta$, that is, $\Psi$. Knowing that this part of the model needs to be computed once per frame at deployment time (outfit animation), we prioritize efficiency in its design. For this reason, instead of implementing it as graph convolutions, we apply fully connected layers to vertex features. The features of each vertex are the geometric descriptors obtained with $\Phi$ concatenated with the body pose $\theta$. Note that graph convolutions would perform many redundant information passes, as $\theta$ is shared across vertices. This module can also be understood as a set of graph convolutions over a graph with no edges. By choosing this architecture, we maintain vertex order and topology invariance. For our experiments, we implement $\Psi$ as a set of four fully connected layers.
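A per-vertex network of this kind can be sketched as below: the pose vector is broadcast to every vertex, concatenated with that vertex's descriptor, and passed through four fully connected layers shared by all vertices. The function name `dynamic_pass`, the ReLU activations, the hidden width and the 72-value pose (24 SMPL joints times 3 axis-angle values) are assumptions for illustration; the source specifies only that four fully connected layers are applied to vertex features.

```python
import numpy as np

def dynamic_pass(descriptors, pose, layers):
    """Per-vertex MLP: descriptors (N, F) and pose (P,) -> deformations (N, 3)."""
    n = descriptors.shape[0]
    pose_tiled = np.broadcast_to(pose, (n, pose.shape[0]))  # same pose for every vertex
    x = np.concatenate([descriptors, pose_tiled], axis=1)   # (N, F + P)
    for idx, (weight, bias) in enumerate(layers):
        x = x @ weight + bias
        if idx < len(layers) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x

rng = np.random.default_rng(0)
N, F, P, H = 6, 8, 72, 16  # hypothetical sizes
sizes = [F + P, H, H, H, 3]  # four fully connected layers
layers = [(0.1 * rng.normal(size=(a, b)), np.zeros(b))
          for a, b in zip(sizes[:-1], sizes[1:])]
descriptors = rng.normal(size=(N, F))
pose = rng.normal(size=P)
deformations = dynamic_pass(descriptors, pose, layers)
```

Since no information is exchanged between vertices here, the cost per frame is a handful of dense matrix products, and the output for a vertex is the same whether it is processed alone or as part of the whole outfit.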

4.3 Training

The chosen loss for the model is:

$$\mathcal{L} = \mathcal{L}_{data} + \mathcal{L}_{cloth} + \mathcal{L}_{collision} \tag{8}$$

The data term $\mathcal{L}_{data}$ is implemented as a standard L2 loss between the predicted outfit and the ground truth. Since predictions are obtained through eq. 4 with the estimated blend weights and deformations, backpropagating the loss through the model implicitly unposes the outfit. This is a significant advantage over other state-of-the-art approaches, which require heavy and noisy pre-processing to obtain unposed deformations.

The cloth term is formulated as a prior distribution for the output. While the data term will enforce proximity between predicted vertices and ground truth, it does not guarantee cloth-consistent meshes. Inspired by mass-spring models (the standard model for cloth simulation), we implement this prior as:

$$\mathcal{L}_{cloth} = \mathcal{L}_E + \lambda \mathcal{L}_B = \sum_{e \in E} \lVert e - e_T \rVert^2 + \lambda\, \Delta(\mathcal{N})^2 \tag{9}$$

where $\mathcal{L}_E$ is the edge term and $\mathcal{L}_B$ is the bending term. $E$ is the set of edges of the given outfit, $e$ is the predicted edge length and $e_T$ is the edge length on the template outfit $T$. $\Delta$ is the Laplace-Beltrami operator applied to the vertex normals $\mathcal{N}$ of the predicted outfit and $\lambda$ balances both losses. $\mathcal{L}_E$ enforces the output meshes to have the same edge lengths as the input template outfit, while $\mathcal{L}_B$ helps to yield locally smooth surfaces, as it penalizes differences between neighbouring vertex normals. To avoid excessive flattening, we choose .
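The two parts of the cloth prior can be sketched as below. This is an illustrative approximation under stated assumptions: the edge term is taken as a sum of squared length differences, the Laplace-Beltrami operator is replaced by a uniform graph Laplacian over mesh edges, vertex normals are passed in precomputed, and the weight `lam` is a hypothetical value since the one used in the experiments is not given here.

```python
import numpy as np

def edge_lengths(verts, edges):
    return np.linalg.norm(verts[edges[:, 0]] - verts[edges[:, 1]], axis=1)

def edge_loss(pred, template, edges):
    """Squared difference between predicted and template edge lengths."""
    diff = edge_lengths(pred, edges) - edge_lengths(template, edges)
    return np.sum(diff ** 2)

def bending_loss(normals, edges):
    """Squared uniform graph Laplacian of the vertex normals (assumed discretisation)."""
    neighbour_sum = np.zeros_like(normals)
    degree = np.zeros(normals.shape[0])
    for a, b in edges:
        neighbour_sum[a] += normals[b]
        neighbour_sum[b] += normals[a]
        degree[a] += 1
        degree[b] += 1
    laplacian = neighbour_sum / np.maximum(degree, 1)[:, None] - normals
    return np.sum(laplacian ** 2)

def cloth_loss(pred, template, normals, edges, lam):
    return edge_loss(pred, template, edges) + lam * bending_loss(normals, edges)

# Toy example: a unit square (two triangles) stretched to twice its width.
template = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
edges = np.array([[0, 1], [1, 2], [2, 3], [3, 0], [0, 2]])
stretched = template * np.array([2.0, 1.0, 1.0])
flat_normals = np.tile([0.0, 0.0, 1.0], (4, 1))
bent_normals = flat_normals.copy()
bent_normals[2] = [1.0, 0.0, 0.0]
total = cloth_loss(stretched, template, flat_normals, edges, lam=0.1)
```

In the toy example the stretched square is still flat, so only the edge term contributes to `total`, while `bent_normals` gives a non-zero bending term without changing any edge length.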

Finally, we want to avoid predictions that present interpenetration with the human body. The collision term can also be considered a prior and is implemented as:

$$\mathcal{L}_{collision} = \sum_{(i,j) \in A} \min(r_{j,i} \cdot n_j - \epsilon,\ 0)^2 \tag{10}$$

where $A$ represents the set of correspondences $(i, j)$ between predicted outfit and body, respectively, obtained through nearest neighbour, $r_{j,i}$ is the vector going from the $j$-th vertex of the body to the $i$-th vertex of the outfit, $n_j$ is the $j$-th vertex normal of the body and $\epsilon$ is a small positive threshold to increase robustness. This loss is a simplified formulation that assumes cloth is close to the skin, and penalizes outfit vertices placed inside the skin. In our experiments, we choose mm.
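The collision term described here can be sketched as follows: each outfit vertex is matched to its nearest body vertex, the body-to-outfit vector is projected onto the body normal, and only values below the threshold are penalized. The brute-force nearest-neighbour search, the squared penalty, the toy geometry and the threshold value `eps=0.004` (in metres) are assumptions for the example, not values taken from the source.

```python
import numpy as np

def collision_loss(outfit, body, body_normals, eps):
    """Penalise outfit vertices lying less than eps outside the body surface."""
    # Brute-force nearest body vertex for every outfit vertex: (N, M) distances.
    dists = np.linalg.norm(outfit[:, None, :] - body[None, :, :], axis=2)
    nearest = np.argmin(dists, axis=1)
    offsets = outfit - body[nearest]                      # body -> outfit vectors
    signed = np.sum(offsets * body_normals[nearest], axis=1) - eps
    penetration = np.minimum(signed, 0.0)                 # zero when outside
    return np.sum(penetration ** 2)

# Toy example: two body vertices with normals pointing up (+z).
body = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
body_normals = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
outfit = np.array([[0.0, 0.0, 0.01],    # 1 cm above the skin: no penalty
                   [1.0, 0.0, -0.02]])  # 2 cm below the skin: penalised
loss = collision_loss(outfit, body, body_normals, eps=0.004)
```

Only the second vertex contributes, with a signed value of -0.02 - 0.004 = -0.024, so the loss is 0.024 squared. For real meshes a spatial index such as a k-d tree would replace the dense distance matrix.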

5 Ablation study

In this section we describe the data used in the experimental part, the setup for the experiments and analyses of the results.

5.1 Data

Among the current public datasets on garments, only CLOTH3D [bertiche2019cloth3d] contains the necessary outfit variability to implement this approach and achieve proper generalization. It contains k sequences, each with a different template outfit in rest pose plus up to frames where the outfit is simulated on top of an animated 3D human, each with a different body shape. Additionally, we test our approach on the TailorNet [patel2020tailornet] dataset. Nonetheless, due to its low template garment variability, we consider this data suboptimal for leveraging the capabilities of our approach. The human model used for the generation of both datasets is SMPL. For this reason, we use the SMPL skeleton in eq. 4, which means it drives the motion of the outfits. Note that the SMPL skeleton is defined as a function of shape parameters $\beta$ and gender $g$.

5.2 Experimental setup

For the ablation study, we subsample training frames and validation frames from CLOTH3D in a stratified manner w.r.t. sequences, without outfit overlap between both sets. Since this data is purely 3D, we understand frames as instants of time, and we refer to them as such through the rest of the paper. Each sample consists of: template outfit mesh in rest pose (vertices $T$ and triangulated faces $\mathcal{F}$), garment vertex locations $V$, pose $\theta$, body shape $\beta$ and gender $g$. Rest outfit vertices $T$, faces $\mathcal{F}$, body shape $\beta$ and gender $g$ remain constant through the frames of a given sequence. Given a sample $(T, \mathcal{F}, V, \theta, \beta, g)$, we follow eq. 4 to obtain an estimation of $V$ with the model described. Since the SMPL skeleton depends on gender and body shape, we have $J = J(\beta, g)$. We train for epochs with an initial batch size of outfits, doubled every epochs. The data loss term is used in every experiment, and each extra loss term is tested individually along with the data term. Additionally, a final experiment with all loss terms is performed.

5.3 Results

Figure 3: Qualitative comparison of the ablation study. Ground truth (blue). Predictions obtained by training with: only L2 loss (green), L2 plus edge loss (brown) and L2 plus bending loss (pink).
Figure 4: Visualization of the effect on the predictions of training without (blue) and with (green) the collision loss.
Figure 5: Qualitative samples of the model corresponding to the last row of Tab. 1.

To evaluate the performance of the model and the impact of each loss term, we associate an error metric to each one. These metrics are computed outfit-wise.

  • Euclidean Error. L2-norm between predicted and ground truth vertex locations. Expressed in millimeters.

  • Edge Length. Length difference between predicted and rest outfit edges. Expressed in millimeters.

  • Bend Angle. Cosine distance for pairs of neighbouring vertex normals.

  • Collision. Ratio of collided vertices.
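The four metrics listed above can be sketched as below. How the values are aggregated per outfit is an assumption (here, a plain mean over vertices or edges), vertex normals and signed skin distances are taken as precomputed inputs, and all function names and toy data are hypothetical.

```python
import numpy as np

def euclidean_error(pred, gt):
    """Mean per-vertex distance between prediction and ground truth."""
    return np.mean(np.linalg.norm(pred - gt, axis=1))

def edge_error(pred, rest, edges):
    """Mean absolute difference between predicted and rest edge lengths."""
    def lengths(verts):
        return np.linalg.norm(verts[edges[:, 0]] - verts[edges[:, 1]], axis=1)
    return np.mean(np.abs(lengths(pred) - lengths(rest)))

def bend_angle(normals, edges):
    """Mean cosine distance between normals of neighbouring vertices."""
    unit = normals / np.linalg.norm(normals, axis=1, keepdims=True)
    cosine = np.sum(unit[edges[:, 0]] * unit[edges[:, 1]], axis=1)
    return np.mean(1.0 - cosine)

def collision_ratio(signed_distances):
    """Fraction of outfit vertices with a negative signed distance to the skin."""
    return np.mean(signed_distances < 0.0)

# Toy example: a triangle translated rigidly by (3, 4, 0), coordinates in mm.
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
pred = gt + np.array([3.0, 4.0, 0.0])
edges = np.array([[0, 1], [1, 2]])
normals = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
signed = np.array([0.01, -0.02, 0.0, 0.03])
```

The rigid translation gives a Euclidean error of 5 mm but an edge error of zero, which shows how a prediction can be far from the ground truth while remaining perfectly cloth-consistent.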

Tab. 1 shows the results of the ablation study. To further illustrate the results, Fig. 3 shows a qualitative comparison of output outfits. Each loss term minimizes its corresponding metric w.r.t. the other setups. The first row of the table shows the performance closest to ground truth. Nonetheless, we observe a significant edge error and bend angle. This translates to locally distorted regions, where the model sacrificed cloth-consistency to minimize the error. Additionally, the ratio of collided vertices is too high for this model to have practical applications. On the next row, we see that including $\mathcal{L}_E$ yields the desired cloth-consistency w.r.t. edges. Looking at Fig. 3, we notice an interesting behaviour: the model is capable of creating wrinkles not present in the data to allocate the required edge length, as can be seen near the belly and the hip for the sample shown. It also implicitly improves surface quality. Next, the third row shows a similar effect for the bend angle. As can be seen in Fig. 3, $\mathcal{L}_B$ yields locally smooth surfaces. Note that a value of 0 for bend angle means a totally flat surface; therefore, lower does not necessarily mean better. Surface smoothness must be evaluated qualitatively. Empirically, a value for this metric shows a reasonably smooth surface without excessive flattening. On the following row, we observe that $\mathcal{L}_{collision}$ has a massive impact on the ratio of collided vertices, proving this loss is crucial for the applicability of the approach. Qualitative results are shown in Fig. 4. Note how, with almost collision-free predictions, it is possible to use the formulation of $\mathcal{L}_{collision}$ for fast collision solving by moving each collided vertex along the normal of its corresponding body vertex until it lies at a distance $\epsilon$ outside the skin, with a small $\epsilon$ to guarantee a margin between skin and cloth. Finally, in the last row we see how applying all loss terms at once can achieve a compromise between accurate and physically consistent (cloth-like and collision-free) predictions. More qualitative results are shown in Fig. 5.
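The fast collision solving mentioned above can be sketched as a single vectorized post-processing step: every outfit vertex that lies less than a margin outside its nearest body vertex (measured along the body normal) is pushed out along that normal until it reaches the margin. The function name `resolve_collisions`, the brute-force nearest-neighbour search and the margin value are assumptions for illustration.

```python
import numpy as np

def resolve_collisions(outfit, body, body_normals, eps):
    """Push penetrating outfit vertices out along the nearest body normal."""
    dists = np.linalg.norm(outfit[:, None, :] - body[None, :, :], axis=2)
    nearest = np.argmin(dists, axis=1)
    normals = body_normals[nearest]
    signed = np.sum((outfit - body[nearest]) * normals, axis=1)
    correction = np.maximum(eps - signed, 0.0)  # zero for vertices already outside
    return outfit + correction[:, None] * normals

# Toy example: two body vertices with normals pointing up (+z).
body = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
body_normals = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
outfit = np.array([[0.0, 0.0, 0.01], [1.0, 0.0, -0.02]])
fixed = resolve_collisions(outfit, body, body_normals, eps=0.004)
```

The first vertex is left untouched and the second ends up exactly `eps` above the skin. Such a correction is only reasonable when few vertices collide and the cloth is close to the body, since it ignores edge lengths and deeper interpenetration.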

Quantitative vs. Qualitative Evaluation. Tab. 1 shows how the edge and collision losses significantly hurt accuracy. Nonetheless, without these terms we obtain distorted outfits with a lot of body penetration, which renders them useless for practical applications. This means that while predictions become less accurate, they also become more valid. These results show that, for this domain, Euclidean error is misleading. A proper evaluation should be performed qualitatively, assessing cloth and physical consistency.

Loss terms         Error (mm)   Edge (mm)   Bend    Collision (%)
Data only          30.11        1.69        0.078   13.37
Data + edge        33.93        0.45        0.067   19.80
Data + bending     30.14        1.35        0.059   13.80
Data + collision   32.11        2.15        0.081   0.71
All terms          32.66        0.69        0.060   2.40
Table 1: Ablation study. Shows the impact of each loss term. The error shows the accuracy of the model in predicting the ground truth vertex locations. Both edge and bend metrics are related to the cloth-consistency of the predictions. Collision shows the percentage of vertices placed within the body.

5.4 Model components

Figure 6: Comparison of SMPL body blend weights (first row) and automatic outfit deep skinning. Second row: Jumpsuit. Third row: Dress. Irrelevant joints are omitted. Blue = 0, red = 1.

Skinning. Fig. 6 shows the blend weights assigned to two sample template outfits. We compare them to SMPL body blend weights. We observe that, even without any supervision on these weights, they resemble the distribution of SMPL. This means the model is able to correctly leverage the articulated skeleton to explain most of the outfit motion. Nevertheless, we see some differences in the skinning for skirts. This garment type breaks the assumption that cloth closely follows body motion. We can see that most of the skirt follows the root joint orientation only, as opposed to trousers. The model learns to explain skirt motion through PSD instead.

Figure 7: Visualization of maximum deformation magnitudes through a sequence for a few sample outfits. Normalized separately, they cannot be compared.
Figure 8: Effect of learnt PSD. Two samples depicted. Each one, from left to right: template outfit, outfit with deformations in rest pose, posed outfit without deformations, posed outfit with deformations.

PSD. Fig. 7 shows the distributions of maximum vertex deformation magnitudes through the sequences corresponding to each template outfit shown. Results are normalized per outfit, meaning visualizations cannot be directly compared among themselves. We observe a higher deformation intensity at the looser parts of each outfit. This is consistent with the fact that loose clothes do not closely follow body motion and, therefore, linear transformations (skinning) are not enough to explain their variability. For the same reason, we can also see high magnitudes at the skirt of the dresses. Fig. 8 shows the impact of PSD on two samples. For each sample we show the template outfit and the outfit with deformations applied in rest pose. Then, we visualize both meshes posed using the SMPL skeleton and their predicted blend weights. The first sample shows how deformations can provide the outfit with wrinkles not present in the template. In the next sample, we see a significant deformation on the right leg due to its pose. Note how deformations are also the main tool for the network to avoid collisions with the body.

5.5 Generalization

Figure 9: Generalization capabilities of our model. Trained only on CLOTH3D, we evaluate on unseen template outfits. First row: TailorNet dataset sample outfit. Second row: high resolution outfit ( times more vertices). Third row: low resolution outfit ( of vertices). Fourth row: multiple layers of cloth. Last row: high complexity outfit. Leftmost figures show the different resolutions and connectivity of each template.

We evaluate the generalization properties of our model by computing predictions for completely unseen outfits after training only on CLOTH3D. Fig. 9 shows the results of this analysis. On the first row, we gathered a sample outfit from the TailorNet [patel2020tailornet] dataset. This template has different connectivity and resolution than the ones used for training. Nonetheless, we see the model is able to achieve appealing results. For the second and third rows, we increased and reduced, respectively, the number of vertices of a sample outfit by a factor of . We observe the model is capable of working at different resolutions, even if they were not present at training time. This has potential applications in videogames and 3D design, as models can be replaced on-the-fly by simpler or more complex ones depending on the available resources and desired performance. The next row shows an outfit with multiple layers: shirt, top and trousers. Near the hip we find up to three overlapping layers of cloth. Note how the model is able to properly articulate the outfit without excessive interpenetration. Training data does not contain layered outfits like this one. Finally, on the last row, we designed an outfit with several layers and complex details (large bow tie below the back). Again, the predictions move the way we would expect the outfit to move. Note how the trousers correctly follow the position of the legs, but the skirt-like piece of cloth follows the orientation of the hips. It is interesting to note how the large bow tie also follows the hips. The capability of properly skinning completely unseen outfit details and ornaments has many potential applications in the videogame and VR/AR industries. See Fig. 1 for more samples of this kind. Character customization can hugely benefit from this, and designers can be relieved of skinning-related work.

Method    CLOTH3D [bertiche2019cloth3d]   TailorNet [patel2020tailornet]
Authors   29.0                            11.4
Ours      28.3                            19.9
Table 2: Comparison against state-of-the-art works. For each dataset, we compare against the approaches of their authors. Results in millimeters.

In Tab. 2 we compare our method, trained on different datasets, against the approaches of their corresponding authors. Our model is the only one able to work with full outfits, so for the sake of the comparison, we present per-garment performance in this table instead. In CLOTH3D, for fairness, we compare ourselves against their best result without temporal information. Our approach not only obtains more accurate predictions, but also shows much more appealing qualitative results. Note that we train only on K out of the M samples. Moreover, our model is much smaller and more efficient, and does not require pre-processing or post-processing. Against TailorNet, we observe that the style-specific learnt mixture models for high frequency details proposed by their authors outperform our model. Nonetheless, note that new styles would require fitting additional mixture models, which means their approach does not generalize to new garments. We also observe that the TailorNet dataset is less challenging than CLOTH3D. This is due to CLOTH3D's huge outfit variability and rich dynamics, while TailorNet contains just a few garment types in static poses. Training our model on the TailorNet dataset yields poor generalization to unseen outfits.

5.6 Performance

As described in Sec. 4, each model component is composed of merely four layers that process vertex features individually. This allows using a relatively small number of parameters, around k, which translates to MB of memory. A model of this size can be handled by most current portable devices, which increases its applicability. At test time, without any specific code optimization, we measure times of s and s for the static and the dynamic passes, respectively, for outfits with k-k vertices. In deployment, outfit animation speed is given by , which means it achieves real-time performance. For videogames and VR/AR, both the model and the scene rendering run on the GPU, which eliminates the need for costly I/O operations between CPU and GPU.
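To make the link between per-vertex layers and model size concrete, here is a minimal sketch of a four-layer network whose weights are shared across vertices. The layer widths, the ReLU activation and the helper names are hypothetical choices for illustration only. The actual layers are those described in Sec. 4.

```python
import numpy as np

def init_mlp(sizes, seed=0):
    """Create (weight, bias) pairs for consecutive layer sizes."""
    rng = np.random.default_rng(seed)
    return [(rng.standard_normal((i, o)) * np.sqrt(2.0 / i), np.zeros(o))
            for i, o in zip(sizes[:-1], sizes[1:])]

def per_vertex_mlp(params, x):
    """Apply the same layers to every row (vertex) of x with shape (V, F)."""
    for k, (w, b) in enumerate(params):
        x = x @ w + b
        if k < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x

def count_parameters(params):
    return sum(w.size + b.size for w, b in params)

# Four layers; the widths below are hypothetical, not the paper's.
sizes = [3, 32, 32, 32, 24]
params = init_mlp(sizes)

# The same weights process a coarse and a dense mesh.
low = per_vertex_mlp(params, np.zeros((500, 3)))
high = per_vertex_mlp(params, np.zeros((20000, 3)))
n_params = count_parameters(params)
```

Because the weights are shared across vertices, the parameter count depends only on the layer widths and not on the mesh resolution.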

5.7 Applications

Figure 10: A simple virtual try-on with the proposed model. SMPL parameters are estimated with an off-the-shelf CNN.

We mentioned the applicability of the proposed approach in videogames and VR/AR as a real-time alternative, with generalization capacity, to classical realistic cloth simulations. We also find interesting applications for 3D designers, as it removes the workload of outfit skinning and PSD learning, as well as time-consuming trial-and-error simulation processes. Moreover, interactive edits to the template outfit are automatically reflected in whole sequences without re-simulation. It also allows changing the mesh resolution on-the-fly. Additionally, since deep models are differentiable, the model can be used to further develop research in this or related domains.

We developed a proof-of-concept application for our model. We use an off-the-shelf CNN to regress SMPL parameters from RGB images. With the estimated parameters and template outfits, we obtain a working virtual try-on. Fig. 10 shows qualitative samples of this.
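The posing step of such a try-on can be sketched with standard linear blend skinning applied to a template outfit corrected by per-vertex deformations. In this sketch the blend weights and deformations stand in for the model outputs, and the per-joint 4x4 transforms are assumed to have been computed upstream from the estimated SMPL parameters. All names are illustrative and not taken from the authors' code.

```python
import numpy as np

def linear_blend_skinning(verts, deformations, weights, joint_transforms):
    """Pose a template outfit with linear blend skinning.

    verts:            (V, 3) template vertices in rest pose
    deformations:     (V, 3) per-vertex corrective offsets (PSD-style)
    weights:          (V, K) blend weights, each row summing to 1
    joint_transforms: (K, 4, 4) rigid transform of each joint
    """
    corrected = verts + deformations                      # apply offsets in rest pose
    ones = np.ones((len(corrected), 1))
    homo = np.concatenate([corrected, ones], axis=1)      # (V, 4) homogeneous coords
    # Blend the joint transforms per vertex, then apply them.
    blended = np.einsum("vk,kij->vij", weights, joint_transforms)  # (V, 4, 4)
    posed = np.einsum("vij,vj->vi", blended, homo)        # (V, 4)
    return posed[:, :3]

# Toy example: joint 0 stays fixed, joint 1 moves one unit along y.
T0 = np.eye(4)
T1 = np.eye(4)
T1[1, 3] = 1.0
transforms = np.stack([T0, T1])
verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
weights = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
posed = linear_blend_skinning(verts, np.zeros_like(verts), weights, transforms)
```

In the toy example the middle vertex, weighted equally between both joints, moves half as far as the vertex bound entirely to the moving joint.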

6 Unsupervised training

Figure 11: Qualitative results for the unsupervised experiment. Note how all of them are valid outfit predictions.
| Training | Error | Edge | Bend | Collision (%) |
| --- | --- | --- | --- | --- |
| Unsupervised | 41.58 | 0.55 | 0.059 | 1.86 |
Table 3: Results of the unsupervised experiment. Note how the Euclidean error is larger, yet cloth consistency and collisions are still handled properly.

We have seen how the edge, bending and collision losses increase the consistency of predictions, while a higher accuracy does not mean better predictions. Given that the edge, bending and collision losses are defined as priors (they do not use ground truth), removing the data term allows unsupervised training. While the edge and bending losses can guarantee cloth-consistent predictions, the collision loss works under the assumption that cloth and skin are close to each other. Without a data term guiding the learning, this setup is unlikely to achieve good performance. Nonetheless, we observed that the learnt skinning is similar to the SMPL body blend weights. This observation allows us to define a prior for the blend weights based on proximity to the human body in rest pose: we assign as labels the weights of the closest body vertex. To avoid artifacts on skirts, we also apply the Laplace-Beltrami operator to the predicted weights as regularization, enforcing them to be smooth. Finally, we also regularize the predicted deformations.
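The two ingredients of this prior can be sketched as follows. This is a minimal illustration and not the authors' implementation: the nearest-neighbour search is brute force, and the smoothness term uses a uniform graph Laplacian as a simple stand-in for the Laplace-Beltrami operator, which is normally discretized with cotangent weights. Function names are hypothetical.

```python
import numpy as np

def nearest_body_weights(garment_verts, body_verts, body_weights):
    """Label each garment vertex with the blend weights of its closest body vertex.

    Brute-force search with O(G * B) memory, which is fine for a sketch.
    """
    diff = garment_verts[:, None, :] - body_verts[None, :, :]  # (G, B, 3)
    nearest = (diff ** 2).sum(axis=-1).argmin(axis=1)          # (G,) body indices
    return body_weights[nearest]

def graph_laplacian_penalty(weights, edges):
    """Smoothness penalty: squared difference between each vertex's weights
    and the mean of its neighbours (uniform graph Laplacian, an assumption)."""
    weights = np.asarray(weights, dtype=float)
    neighbour_sum = np.zeros_like(weights)
    degree = np.zeros(len(weights))
    for i, j in edges:
        neighbour_sum[i] += weights[j]
        neighbour_sum[j] += weights[i]
        degree[i] += 1
        degree[j] += 1
    degree = np.maximum(degree, 1)                  # guard isolated vertices
    lap = weights - neighbour_sum / degree[:, None]
    return float((lap ** 2).sum(axis=1).mean())

# Toy example: two body vertices, each bound to a different joint.
body = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
body_w = np.array([[1.0, 0.0], [0.0, 1.0]])
garment = np.array([[0.1, 0.0, 0.0], [0.9, 0.0, 0.0], [0.4, 0.1, 0.0]])
edges = [(0, 1), (1, 2)]

labels = nearest_body_weights(garment, body, body_w)
rough = graph_laplacian_penalty(labels, edges)
smooth = graph_laplacian_penalty(np.full((3, 2), 0.5), edges)
```

Hard nearest-neighbour labels change abruptly between neighbouring vertices, which is exactly what the smoothness term penalizes.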


Here, the prior is defined in terms of the blend weights assigned through nearest neighbour, the predicted weights and the predicted deformations. This term is applied during the first epochs to guide the learning. Its influence is reduced every epoch until it can be completely removed. More formally:


where a weighting factor controls the influence of the blend weights prior and the rest is as defined previously. Tab. 3 shows the quantitative metrics of this experiment. As expected, the error against ground truth is significantly higher than in the supervised experiments. On the other hand, the physical consistency metrics are actually better than those of the corresponding supervised counterpart (last row of Tab. 1). Fig. 11 shows qualitative results of this experiment. While predictions are likely to differ from ground truth, all of them are valid outfits.
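A prior whose influence is reduced every epoch until it vanishes can be sketched as below. The linear decay, the default values and the unweighted sum of the physical losses are assumptions made for illustration. The text does not specify the schedule or the relative weights of the individual losses.

```python
def prior_weight(epoch, initial=1.0, decay_epochs=10):
    """Linearly decay the prior weight to zero over `decay_epochs` epochs.

    The linear shape and the defaults are hypothetical choices.
    """
    if decay_epochs <= 0:
        return 0.0
    return max(0.0, initial * (1.0 - epoch / decay_epochs))

def unsupervised_loss(edge, bend, collision, prior, epoch, **schedule):
    """Combine the physical losses with the decaying blend-weight prior.

    The physical terms are summed without individual weights for brevity.
    """
    return edge + bend + collision + prior_weight(epoch, **schedule) * prior

# The prior dominates early on and disappears after the decay period.
start = unsupervised_loss(1.0, 2.0, 3.0, 4.0, epoch=0)
late = unsupervised_loss(1.0, 2.0, 3.0, 4.0, epoch=10)
```

After the decay period only the edge, bending and collision terms remain, so training continues without any label-like signal.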

Being able to train without supervision opens interesting possibilities. By defining physical constraints as losses, along with a prior, the model is able to learn valid outfit positions. Furthermore, noting that the edge and bend losses resemble the elastic potential energy of a mass-spring model, we demonstrate that deep neural networks are capable of predicting low-energy, stable configurations for physical systems. From a more technical point of view, unsupervised training allows splitting samples into and , and training on every possible combination of these splits. Additionally, it can be combined with supervised training to benefit from both strategies.
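The analogy with a mass-spring model can be made explicit with a short sketch that treats every mesh edge as a spring whose rest length comes from the template. The stiffness constant, the 1/2 factor and the use of a sum rather than a mean are the usual textbook form of spring energy and are assumptions here. The edge loss used in the paper may be normalized differently.

```python
import numpy as np

def edge_lengths(verts, edges):
    """Length of every edge, with edges given as (E, 2) vertex index pairs."""
    edges = np.asarray(edges)
    return np.linalg.norm(verts[edges[:, 0]] - verts[edges[:, 1]], axis=1)

def edge_energy(pred_verts, rest_verts, edges, stiffness=1.0):
    """Elastic potential energy of springs placed on the mesh edges.

    Each edge contributes 0.5 * k * (length - rest_length)^2.
    """
    stretch = edge_lengths(pred_verts, edges) - edge_lengths(rest_verts, edges)
    return 0.5 * stiffness * float((stretch ** 2).sum())

# Toy example: a single right triangle.
rest = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
edges = [(0, 1), (0, 2), (1, 2)]

moved = rest + np.array([0.3, -0.2, 1.0])  # rigid translation, no stretching
scaled = 2.0 * rest                        # every edge doubles in length

energy_moved = edge_energy(moved, rest, edges)
energy_scaled = edge_energy(scaled, rest, edges)
```

A rigid motion leaves the energy at zero while stretching increases it, which is why minimizing this term favours cloth-consistent predictions regardless of pose.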

7 Conclusions and Future Work

We presented a novel approach to the garment animation problem. Instead of learning a PSD for a single skinned garment (or a few), we learn a mapping from template outfits to their corresponding skinning and PSD. We demonstrated that a model trained with this approach can generalize to completely unseen outfits with multiple layers of cloth and complex geometric details and ornaments. None of the previous works in the current literature achieves this level of generalization.

To demonstrate the advantages of this approach, we trained a simple baseline. We have shown that a small model is enough to obtain appealing results, close to the level required by final applications, while achieving real-time performance. Future work may consider dealing with high-frequency deformations and leveraging temporal information. Finally, we also showed the possibility of unsupervised learning. We believe this approach can be further improved by handling motion, gravity or friction, among others.

Acknowledgements. This work has been partially supported by the Spanish project PID2019-105093GB-I00 (MINECO/FEDER, UE) and CERCA Programme/Generalitat de Catalunya. This work is partially supported by ICREA under the ICREA Academia programme.