1 Introduction

The efficient modeling of digital garments is an active area of research due to the large number of applications, including fashion design, e-commerce, virtual try-on, and video games. The traditional approach to this problem is through physics-based simulation [nealen2006physically], but the high computational cost required at run time hinders the deployment of these techniques to real-world applications. Recently, learning-based methods [santesteban2019virtualtryon, patel2020tailor, gundogdu2019garnet, ma2020dressing3d, vidaurre2020fcgnn, tiwari20sizer, wang2018multimodalspace, shen2020garmentgeneration] have demonstrated that it is possible to closely approximate the accuracy of physics-based solutions. These methods use supervised learning strategies to find a function that outputs a deformed garment given an input body descriptor. During the training phase, the supervision is enforced by directly minimizing, at the vertex level, the difference between the predicted garment and ground-truth 3D meshes. Despite requiring hours of training, learning-based methods are highly efficient to evaluate at run time, and therefore offer an attractive alternative to traditional physics-based solutions.
However, the need for large datasets in current supervised methods is far from ideal. Ground-truth meshes must be obtained –for each combination of garment, body shape, and pose– via computationally expensive simulations [narain2012arcsim] or complex 3D scanning setups [pons2017clothcap], which heavily hinders the scalability of current learning-based methods. We observe that for similar image-based problems, self-supervised strategies have shown that it is possible to learn complex tasks without requiring ground-truth data [raj2018swapnet, wu2019mm]. Unfortunately, self-supervision for dynamic 3D clothing has not been explored.
In this work, we present a self-supervised method to learn dynamic deformations of 3D garments worn by parametric human bodies. The key to our success is the realization that the solution to the equations of motion used in current physics-based methods can also be formulated as an optimization problem [Martin2011]. More specifically, we show that the per-time-step numerical integration scheme used to update the vertex positions in physics-based simulators (e.g., backward Euler) can be recast as an optimization problem, and demonstrate that the objective function of this minimization can become the central ingredient of a self-supervised learning scheme. Since this objective function includes both an inertial term and a static term directly derived from the equations of motion, we are able to learn time-dependent and pose-dependent deformations without any ground-truth data.
The advantages of self-supervision go beyond removing the need for ground-truth data. By reformulating the learning task in terms of physics-based intrinsic properties instead of explicit 3D surface similarity, we also mitigate the smoothing artifacts common in supervised methods where L2 losses are used directly at the vertex level [patel2020tailor]. Additionally, self-supervised approaches also generalize better to test sequences outside the distribution of the training set. Finally, we also show how different material models can be easily formulated in our self-supervised framework, bringing the generalization capabilities of physics-based solutions (i.e., the ability to deform any material) to learning-based methods, without requiring any precalculation or offline step.
All in all, our main contribution is a novel learning-based method capable of learning to dynamically deform garments using a self-supervised strategy. We demonstrate the superiority of our approach in terms of data requirements, training time, and inference time, and we quantitatively and qualitatively compare our results with state-of-the-art supervised methods.
2 Related Work
Existing methods that model how cloth and garment deform can be categorized into two groups: physics-based models and learning-based models.
Physics-based simulation of cloth is to date a very mature field. Over the years, many methods have been developed to solve the most relevant challenges. These include the design of deformation models such as in-plane and bending energies [Kim2020, Grinspun2003], robust implicit solvers [Baraff1998], rich and efficient contact handling [Bridson2002, tang2018gpu], or adaptive discretizations [narain2012arcsim]. Recent efforts also include the design of differentiable physics simulators [hu2019difftaichi], including specific problems of cloth simulation such as continuous collision detection and constraint-based solvers [liang2019differentiable], and physics-based objectives for tracking and reconstruction of garments [yu2019simulcap, li3DVphysicsaware]. While the majority of the cloth simulation models represent the fabric as a continuum, a recent line of research uses yarn-level representations for high-resolution detail [Kaldor2008, cirio2014yarn]. Some works also show two-way coupling between garments and soft-body avatars [romero2020skinmechanics, Montes2020]. While we do not tackle this level of detail in our paper, more accurate methods could be used to replace our cloth and body models.
In contrast to physics-based models, which typically require solving large systems of nonlinear equations at each time step, learning-based methods aim at estimating a single function that directly outputs the desired deformation for any input. Inspired by early works on Pose Space Deformation [lewis2000pose], a common strategy is to learn parametric garment deformations, which are added to a mesh template, as a function of pose [guan2012drape, wang2019intrisicspace], shape [vidaurre2020fcgnn], pose-and-shape [santesteban2019virtualtryon, bertiche2020cloth3d], design [patel2020tailor, wang2018multimodalspace, ma2020dressing3d], or garment size [tiwari20sizer].
To this end, state-of-the-art methods for garments use supervised strategies that require large datasets of ground-truth data for the specific task to be learned. This methodology has been recently explored for many use cases, including 3D reconstruction [alldieck19cvpr, alldieck2018detailed, saito2019pifu, zhu2020deep], garment design [shen2020garmentgeneration, vidaurre2020fcgnn, wang2018multimodalspace], animation [bertiche2021ICCV, huang2020arch, wang2019intrisicspace, patel2020tailor, gundogdu2019garnet, ma2020dressing3d], and virtual try-on [Zhao2021M3DVTONAM, bhatnagar2019multi, santesteban2019virtualtryon, guan2012drape]. To efficiently tackle the learning task, and depending on the goal of each method, different supervision terms and domains have been used. Most methods use direct 3D supervision at the vertex level [santesteban2019virtualtryon, patel2020tailor, vidaurre2020fcgnn, gundogdu2019garnet], but image-based 2D supervision in the form of UV maps [lahner2018deepwrinkles, shen2020garmentgeneration, jin2020pixel], point clouds [Saito:CVPR:2021, ma2021power], or sketches [wang2018multimodalspace] also exists. Very recently, implicit representations have shown impressive results on learning to deform humans [deng2020nasa, leapCVPR21, alldieck2021imghum] and dress avatars [Saito:CVPR:2021, tiwari21neuralgif, corona2021smplicit, MetaAvatar:NeurIPS:2021].
Datasets are a fundamental piece to enable supervision, and most methods [santesteban2019virtualtryon, patel2020tailor, wang2018multimodalspace, bertiche2020cloth3d] opt for synthetic data generated with physics-based simulators such as ARCSim [narain2012arcsim] or Argus [li2018implicit]. Alternatively, other methods [lahner2018deepwrinkles, tiwari20sizer, ma2020dressing3d, Saito:CVPR:2021] use high-quality 3D scans obtained in expensive multi-camera setups [zhang2017detailed, pons2017clothcap]. Despite the success of all these supervised methods for learning-based garments, relying on ground-truth data to train the models is a major limitation due to the associated costs, and hinders the creation of new datasets.
Self-supervised strategies are the ideal alternative to circumvent the need for ground-truth data in learning-based methods [stewart2017label]. Instead of relying on losses that evaluate prediction error based on the difference with respect to ground-truth samples, self-supervised methods use implicit properties of the training data (or domain) as a supervision signal [zhu2019physics]. This strategy is nowadays very popular in data-driven methods for image-based problems [zhu2017unpaired, umr2020, raj2018swapnet]; however, almost all state-of-the-art approaches to learn 3D garment deformations rely on ground-truth data [patel2020tailor, santesteban2019virtualtryon, gundogdu2019garnet]. For 3D deformation tasks not related to garments, many works use physical laws or constraints as a supervision signal [zhu2019physics, tompson2017accelerating, xie2018tempoGAN]. For example, Tompson et al. [tompson2017accelerating] enforce incompressibility constraints to learn to solve the system of equations required in physics-based fluid simulation, Xie et al. [xie2018tempoGAN] enforce temporal coherence of consecutive frames in fluid simulations to enhance detail, and Zhu et al. [zhu2019physics] incorporate the governing equations of the physical model (i.e., Partial Differential Equations, PDEs) in the loss to learn image-based flow simulations.
Despite the significant progress in self-supervised learning, no previous work addresses the learning of 3D garments with a self-supervised strategy, with the notable and very recent exception of PBNS [bertiche2021pbns]. PBNS proposes to learn pose space deformations for garments by enforcing static physical consistency during the training of the model. We follow a similar underlying idea, but propose to use a full physics-based deformation scheme, recast as an optimization problem, to learn, for the first time, a model for dynamic garment deformations with self-supervision only. Additionally, our approach learns shape-dependent effects and is able to cope with a material model that produces highly realistic and finer wrinkles.
3 Method

Our goal is to find a function that deforms a 3D garment given the underlying body parameters and motion. To this end, in Sec. 3.1, we first describe the garment model used by our approach, which is based on per-vertex dynamic 3D displacements that are added to a rigged template mesh. Then, in Sec. 3.2, we direct our attention to an optimization-based formulation of dynamic deformations. Based on this formulation, in Sec. 3.3, we introduce our main contribution and describe a physics-based deformation model that allows us to train a regressor for 3D garment displacements. Importantly, our loss is driven by fundamental physical properties of deformable objects, not by the reconstruction of ground-truth garments, and therefore it enables self-supervised learning. In Sec. 3.4 we specify the material model used in the different terms of our loss, and define the relevant energies, such as the strain and bending energies. Finally, in Sec. 3.5 we describe the recurrent architecture used to implement the regressor. See Figure 2 for an overview of our method.
3.1 Garment Model
Similar to state-of-the-art methods for data-driven garments [santesteban2019virtualtryon, gundogdu2019garnet, patel2020tailor, bertiche2020cloth3d, vidaurre2020fcgnn], we leverage and extend existing human body models [loper2015smpl, feng2015avatar] to encode garment deformations. More specifically, we build our representation on top of the popular SMPL human model [loper2015smpl]. SMPL encodes bodies by deforming a rigged human template according to shape- and pose-dependent deformations that are learned from data. Following this idea, we define our garment model as

\[ \mathbf{G}(\beta, \gamma) = W\!\left(\bar{\mathbf{G}}(\beta, \gamma),\, \mathbf{J}(\beta),\, \gamma,\, \mathcal{W}\right), \tag{1} \]

where \(W\) is a skinning function (e.g., linear blend skinning or dual quaternion) with skinning weights \(\mathcal{W}\), joint locations \(\mathbf{J}\), and motion parameters \(\gamma\) that articulate an unposed deformed garment mesh \(\bar{\mathbf{G}}\). The latter is computed as

\[ \bar{\mathbf{G}}(\beta, \gamma) = \mathbf{T}_{G} + f(\beta, \gamma), \tag{2} \]

i.e., a garment template mesh \(\mathbf{T}_{G}\) deformed by a function \(f\) that outputs per-vertex 3D displacements to encode dynamic deformations conditioned on the underlying body shape \(\beta\) and body motion \(\gamma\). The body motion \(\gamma\) contains the current body pose as well as the global velocity of the root joint.
Assuming that the garment template is correctly located on top of the mean SMPL body mesh [loper2015smpl], we define the garment skinning weights by borrowing the SMPL skinning weights of the closest body vertex in rest pose. In the remainder of this section we introduce our novel strategy to learn the 3D displacement regressor.
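As an illustration, this nearest-vertex weight transfer can be sketched in a few lines of NumPy (a minimal brute-force sketch; the array names and shapes are our own assumptions, not part of the released code):

```python
import numpy as np

def transfer_skinning_weights(garment_verts, body_verts, body_weights):
    """For each garment vertex, borrow the skinning weights of the
    closest body vertex in rest pose (brute-force nearest neighbor).

    garment_verts: (G, 3) garment template positions in rest pose
    body_verts:    (B, 3) body template positions in rest pose
    body_weights:  (B, J) SMPL skinning weights per body vertex
    """
    # Pairwise squared distances between garment and body vertices: (G, B)
    d2 = ((garment_verts[:, None, :] - body_verts[None, :, :]) ** 2).sum(-1)
    nearest = d2.argmin(axis=1)   # index of the closest body vertex
    return body_weights[nearest]  # (G, J) garment skinning weights
```

In practice, a spatial acceleration structure (e.g., a k-d tree) avoids the quadratic memory cost of the brute-force distance matrix for high-resolution meshes.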
3.2 Optimization-Based Dynamic Deformation
Our goal is to learn the 3D displacement regressor in Equation 2 using a self-supervised strategy. To this end, our first task is to find a set of physics-based properties that describe how cloth behaves. Physics-based simulators traditionally solve dynamics by applying a numerical integration scheme, e.g., backward Euler, to the differential equations of motion, and finding the roots of the resulting nonlinear discrete equations [nealen2006physically]. This formulation is applied independently at each simulation frame to iteratively update the positions and velocities of garment vertices. Our key observation is that the solution to the equations of motion discretized with backward Euler can also be formulated as an optimization problem [Martin2011], and that the objective function of this minimization can become the central ingredient of a self-supervised learning scheme. Optimization-based dynamics have been used in the Computer Graphics literature to increase the efficiency and robustness of dynamics solvers, through quasi-Newton schemes and step-size selection [gast2015optimization, Liu2017]. Instead, we propose to leverage such an optimization-based formulation to define a loss for training a neural network that generalizes well to any input (i.e., any body shape and motion).
The equations of motion can be discretized with backward Euler as

\[ \mathbf{M}\,\frac{\dot{\mathbf{x}}_{t+1} - \dot{\mathbf{x}}_{t}}{\Delta t} = \mathbf{F}(\mathbf{x}_{t+1}), \qquad \dot{\mathbf{x}}_{t+1} = \frac{\mathbf{x}_{t+1} - \mathbf{x}_{t}}{\Delta t}, \tag{3} \]

where \(\mathbf{M}\) is the mass matrix, \(\mathbf{F}\) are the forces, and \(\mathbf{x}\) and \(\dot{\mathbf{x}}\) are the positions and velocities of the garment nodes. The solution to these equations can be recast as an optimization problem [Martin2011, gast2015optimization]:

\[ \mathbf{x}_{t+1} = \operatorname*{arg\,min}_{\mathbf{x}} \; \frac{1}{2\Delta t^{2}} \left\lVert \mathbf{x} - \hat{\mathbf{x}} \right\rVert_{\mathbf{M}}^{2} + \Phi(\mathbf{x}), \qquad \hat{\mathbf{x}} = \mathbf{x}_{t} + \Delta t\, \dot{\mathbf{x}}_{t}, \tag{4} \]

where \(\hat{\mathbf{x}}\) is a tentative (explicit) position update, and \(\Phi\) is the potential energy due to the internal and external forces of the system.
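The optimization objective described above can be evaluated in a few lines. The sketch below is our own minimal NumPy formulation (with a lumped per-vertex mass vector standing in for the full mass matrix, and the potential energy passed in as a callable); it makes explicit how the inertial term and the potential combine:

```python
import numpy as np

def inertial_objective(x, x_prev, v_prev, masses, potential, dt):
    """Backward-Euler step recast as a minimization: the solution
    x_{t+1} minimizes  1/(2 dt^2) ||x - y||_M^2 + Phi(x),
    where y = x_t + dt * v_t is the tentative (explicit) update.

    x, x_prev, v_prev: (N, 3) positions and velocities
    masses:            (N,) lumped vertex masses
    potential:         callable Phi(x) -> float
    """
    y = x_prev + dt * v_prev          # tentative inertial position
    diff = x - y
    inertia = 0.5 / dt**2 * (masses[:, None] * diff * diff).sum()
    return inertia + potential(x)
```

With a zero potential, the minimum of this objective is exactly the explicit update y; the potential term pulls the solution toward statically consistent configurations.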
3.3 Turning Dynamics into Self-Supervision
The key to our method is to define a set of losses based on Equation 4 to train the regressor. To this end, we propose a loss with two terms,

\[ \mathcal{L} = \mathcal{L}_{\text{inertia}} + \mathcal{L}_{\text{static}}, \tag{5} \]
where \(\mathcal{L}_{\text{inertia}}\) models the inertia of the garment and is defined analogously to the first term of Equation 4:

\[ \mathcal{L}_{\text{inertia}} = \frac{1}{2\Delta t^{2}} \left\lVert \mathbf{x} - \hat{\mathbf{x}} \right\rVert_{\mathbf{M}}^{2}. \tag{6} \]
Intuitively, this term penalizes changes in garment velocity over time; however, the underlying body motion inevitably changes the garment velocities, which is what makes dynamic and wrinkle effects appear.
The second term of our loss, \(\mathcal{L}_{\text{static}}\), models the potential energy \(\Phi\) of Equation 4, which represents the internal and external forces that affect the garment. Inspired by works from the cloth simulation literature [narain2012arcsim, sifakis2012fem], we define it as the sum of physics-based terms that model the energies that emerge in deformable solids, including strain, bending, gravity, and collisions:

\[ \mathcal{L}_{\text{static}} = \mathcal{L}_{\text{strain}} + \mathcal{L}_{\text{bending}} + \mathcal{L}_{\text{gravity}} + \mathcal{L}_{\text{collision}}. \tag{7} \]

This formulation of \(\mathcal{L}_{\text{static}}\) is general, and the definition of each term depends on the material model used, which we detail in the next section.
3.4 Material Model
The literature on the simulation of elastic solids characterizes materials using equations that relate stimuli (e.g., deformations) to material response (e.g., energies) [sifakis2012fem]. Inspired by this, and with the goal of learning physically correct garment behaviors, we define the terms of our static loss based on equations of state-of-the-art cloth simulators [narain2012arcsim] to model the following energies:
Membrane Strain Energy.
The membrane strain term models the response of the material to in-plane deformation. Given a deformed position and an undeformed position (i.e., the garment template), it defines an internal energy based on a first-order deformation metric, typically the deformation gradient \(\mathbf{F}\). In our loss we implement it using the Saint Venant-Kirchhoff (StVK) elastic material model, which defines the membrane strain energy density as

\[ \Psi = \frac{\lambda}{2} \operatorname{tr}(\mathbf{E})^{2} + \mu \operatorname{tr}(\mathbf{E}^{2}), \]

where \(\lambda\) and \(\mu\) are the Lamé constants, and

\[ \mathbf{E} = \frac{1}{2}\left(\mathbf{F}^{\top}\mathbf{F} - \mathbf{I}\right) \]

is the Green strain tensor. The membrane strain energy of the mesh is computed as

\[ \mathcal{L}_{\text{strain}} = \sum_{t \in \mathcal{T}} V_{t}\, \Psi(\mathbf{F}_{t}), \]

where \(V_{t}\) is the volume of each triangle \(t\) (i.e., area times thickness).
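A minimal NumPy sketch of this energy (our own illustration; for cloth, the deformation gradient F maps 2D rest-plane coordinates to 3D positions, so F is 3×2 and the Green strain E is 2×2):

```python
import numpy as np

def stvk_energy_density(F, lam, mu):
    """StVK energy density: Psi = mu * tr(E @ E) + lam/2 * tr(E)^2,
    with Green strain E = 1/2 (F^T F - I). For symmetric E,
    (E * E).sum() equals tr(E @ E)."""
    E = 0.5 * (F.T @ F - np.eye(F.shape[1]))
    return mu * (E * E).sum() + 0.5 * lam * np.trace(E) ** 2

def strain_loss(Fs, volumes, lam, mu):
    """Membrane strain energy of the mesh: sum over triangles of
    triangle volume (area x thickness) times the energy density."""
    return sum(v * stvk_energy_density(F, lam, mu) for F, v in zip(Fs, volumes))
```

Note that the energy vanishes for any rigid transformation of the rest state, since the Green strain is invariant to rotations of F.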
Bending Energy.
The bending term models the energy due to the angle between two adjacent faces, and we model it as

\[ \mathcal{L}_{\text{bending}} = \sum_{e \in \mathcal{E}} k_{\text{bending}}\, \theta_{e}^{2}, \]

where \(\theta_{e}\) is the dihedral angle between the two faces adjacent to edge \(e\), and \(k_{\text{bending}}\) is a bending stiffness.
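As an illustration, a discrete bending penalty of this kind can be sketched as follows (our own minimal version: we penalize the angle between adjacent face normals with a per-edge quadratic term; the paper's exact per-edge weighting may differ):

```python
import numpy as np

def dihedral_angle(n1, n2):
    """Angle between the unit normals of two adjacent faces
    (zero when the surface is flat across the shared edge)."""
    return np.arccos(np.clip(np.dot(n1, n2), -1.0, 1.0))

def bending_loss(face_normals, interior_edges, k_bending):
    """Quadratic penalty on the fold angle at each interior edge;
    interior_edges lists the pairs of adjacent face indices."""
    return sum(k_bending * dihedral_angle(face_normals[i], face_normals[j]) ** 2
               for i, j in interior_edges)
```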
Gravity.
To model the effect of gravity on the learned deformations, we add a loss term with the gravitational potential energy of each cloth vertex,

\[ \mathcal{L}_{\text{gravity}} = -\sum_{i} m_{i}\, \mathbf{g}^{\top} \mathbf{x}_{i}, \]

where \(m_{i}\) is the mass of vertex \(i\), and \(\mathbf{g}\) is the gravitational acceleration.
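This term is a one-liner in code; the sketch below (our own, with lumped vertex masses and the last coordinate taken as the vertical axis) computes the total gravitational potential:

```python
import numpy as np

def gravity_loss(x, masses, g=9.81):
    """Total gravitational potential U = sum_i m_i * g * h_i,
    taking the last coordinate of each vertex as its height.
    Minimizing this term pulls the garment downwards."""
    return float((masses * g * x[:, -1]).sum())
```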
Collision.
This term is crucial to learn plausible deformations, enforcing the garment to follow the underlying body motion without penetrating the body. We implement it as a penalty on garment vertices that come closer to the body than a safety margin: a distance function computes the distance of each vertex to the body surface, a collision stiffness scales the penalty, and the safety margin prevents the garment from overlapping with the body surface.
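A possible implementation of this penalty is sketched below (our own illustration; the cubic penalty exponent and the signed-distance interface are assumptions, not the paper's exact formulation):

```python
import numpy as np

def collision_loss(garment_verts, body_sdf, k_collision, margin):
    """Penalize garment vertices closer to the body than the safety
    margin. body_sdf returns the signed distance to the body surface
    (positive outside); vertices farther than `margin` contribute zero."""
    d = np.array([body_sdf(v) for v in garment_verts])
    return float(k_collision * (np.maximum(margin - d, 0.0) ** 3).sum())
```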
To highlight the realism of the proposed material model, in Figure 3 we compare a ground-truth simulation of our model with the simpler material model used in PBNS [bertiche2021pbns], which is based on a traditional mass-spring formulation. Overall, our model is capable of reproducing more complex behaviors typically present in garments, including wrinkles and folds at different scales.
3.5 Regressing Garment Deformations
With our novel self-supervised loss defined in Section 3.3, we are ready to train the garment displacement regressor from Equation 2 without requiring ground-truth data. To model the time dependencies of the inertial term, we implement the regressor using 4 Gated Recurrent Units (GRU), each with an output of size 256, and tanh as the activation function (see Figure 2). However, the recurrent nature of GRUs, combined with the lack of ground-truth values to guide the training process, makes the regressor converge to bad solutions if a naive recurrent training protocol is used. We need to take special care in how the hidden states of the GRUs are initialized and updated.
Intuitively, the model should be able to learn dynamics from just 3 frames, since the inertia term in Equation 6 depends only on the vertex positions and velocities of the previous step. Therefore, we train our network using sub-sequences of 3 frames. Interestingly, we found that training on longer sub-sequences also minimizes the loss correctly, but the learned deformations do not model true dynamics.
At runtime, the network supports sequences of arbitrary length, but results can degrade noticeably for sequences longer than those used in training if the initialization of the GRU hidden states is not handled well. More specifically, we observe that deterministically setting the initial hidden states of each training sub-sequence hinders the ability of the network to generalize to sequences longer than 3 frames. We address this issue by sampling the initial state of each GRU from a Gaussian distribution with empirically chosen parameters, which allows the model to generalize well even for sequences with thousands of frames. Notice that at runtime the hidden state depends on an arbitrarily large number of previous frames, not just the last 3; hence, the use of noise to initialize the states of training sub-sequences is fundamental to augment the variance of the states.
To self-supervise the training process of our regressor we need to feed it with human motions and shapes. To this end, we use a set of 52 sequences from the AMASS dataset [AMASS:ICCV:2019], totaling 6,519 frames, which we split into sub-sequences of 3 frames as described in Section 3.5. We set aside 4 full sequences for validation purposes. To provide body shape variety at train time, each sub-sequence is assigned a different, randomly sampled body shape at each epoch. Notice that, enabled by our self-supervised approach, this strategy allows us to train using thousands of different body shapes, while competing supervised methods are limited to a dramatically smaller shape sample ([patel2020tailor] uses 9 shapes, [santesteban2019virtualtryon] uses 17) due to the computational restrictions caused by the need for a ground-truth database.
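The data preparation described above can be sketched as follows (our own minimal version; whether the 3-frame windows overlap, and the exact noise parameters for the hidden states, are assumptions on our part):

```python
import numpy as np

def make_subsequences(sequence, length=3):
    """Split a motion sequence (frames x features) into short
    training windows of `length` frames (overlapping here)."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]

def init_hidden(rng, num_grus=4, hidden_size=256, sigma=0.1):
    """Sample the initial GRU hidden states from Gaussian noise
    instead of zeros, so training sees varied starting states."""
    return rng.normal(0.0, sigma, size=(num_grus, hidden_size))
```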
Regarding the network hyper-parameters, we use a batch size of 16, initially train for 10 epochs using a learning rate of 0.001, and then resume learning with a learning rate of 0.0001 until convergence. This schedule is fast, works for all garments, and avoids erroneous states. The rest of the material and training parameters do not affect stability. Larger learning rates can introduce instabilities due to energy spikes from which the training struggles to recover (i.e., the predicted mesh has collisions that are too large to be resolved). Small body-garment collisions are not a problem; e.g., we can handle pants despite self-collisions between the legs in some poses.
Our approach does not require balancing loss terms; we just need to set the material properties of the garment. To this end, we tune material parameters to produce a desired fabric behavior, hence the parameters of the loss have a physical meaning – they are not arbitrary hyperparameters. To compute the mass matrix we use real measurements of the thickness and density of 100% cotton fabric. The rest of the material parameters (the Lamé constants, the bending stiffness, the collision stiffness, and the collision margin) are set to fixed values, and we use the same parameters for all our garments.
4 Evaluation

4.1 Setup

To thoroughly validate our model, in addition to comparisons to state-of-the-art methods, in this section we also include ablations and comparisons that use a ground-truth simulated dataset. For evaluations that are as fair as possible, this dataset is created using the same motions, and the same train-test split, that we use to train SNUG.
We implement our method on a regular desktop PC equipped with an AMD Ryzen 7 2700 CPU, an Nvidia GTX 1080 Ti GPU, and 32GB of RAM.
Table 2 (ablation) columns: W/o bending | W/o strain | W/o gravity | W/o inertia | Full
| Method | Dataset generation | Training | Inference | Model size |
| TailorNet [patel2020tailor] | 29 h | 6.5 h | 10.1 ms | 2114 MB |
| Santesteban et al. [santesteban2019virtualtryon] | 180 h | 17 h | 2.5 ms | 109 MB |
| SNUG (Ours) | 0 h | 2 h | 2.2 ms | 19 MB |
4.2 Quantitative Evaluation
To quantitatively evaluate our approach, we measure the physics-based terms of our loss on test motions and compare them with the predictions of PBNS [bertiche2021pbns]. Notice that the original PBNS method uses a different (and simpler) material model but, in order to get a meaningful quantitative comparison, we extended and re-trained the publicly available PBNS implementation with our material model defined in Section 3.4. Also, notice that we cannot provide this comparison for supervised state-of-the-art methods (e.g., [patel2020tailor, santesteban2019virtualtryon, gundogdu2019garnet]) because the simulation schemes, material models, and parameters used to build their datasets are different and, therefore, the ground-truth physics properties (i.e., our loss terms) might differ significantly.
Figure 4 shows the quantitative evaluation for the most important terms of our loss, and compares it with the extended implementation of PBNS [bertiche2021pbns] using our material, in the test sequence 01_01 of AMASS [AMASS:ICCV:2019]. Notice how our method consistently produces lower error values across all terms (strain, bending, and inertia), indicating that test samples processed with SNUG better match the behavior of physics-based solutions (i.e., the minimization of the terms). Table 1 presents a quantitative evaluation of both methods in our full test set (4 sequences, 598 frames unseen at train time), which further demonstrates that our approach improves upon the method of PBNS.
To validate each term of our formulation, in Table 2 we show an ablation of the mean-curvature error, evaluated on the test set of our ground-truth simulated dataset, when leaving out some of the terms.
Finally, in Table 3 we also evaluate the memory requirements, training time, and runtime performance of our approach and compare to existing state-of-the-art supervised methods. Even if these methods do not address exactly the same problem (e.g., TailorNet [patel2020tailor] models garment variations and SNUG does not, but the latter models dynamics), SNUG outperforms supervised methods by a large margin in all metrics, resulting in a compact model of only 19 MB, trained in just 2 h, which opens the door to scalable learning-based garment models.
4.3 Qualitative Evaluation
We qualitatively evaluate our method in Figure 7 and, more extensively, in the supplementary video. To this end, notice that we always use body shapes and motions unseen during training. Additionally, we provide comparisons to the state-of-the-art supervised methods of Santesteban et al. [santesteban2019virtualtryon] and TailorNet [patel2020tailor], as well as to the recent work PBNS [bertiche2021pbns], which uses physics constraints as supervision. To ease the assessment of the realism of each method, we also show results computed with a physics-based simulator [narain2012arcsim], but notice that this is a traditional offline method, several orders of magnitude slower.
These results demonstrate that our self-supervised method SNUG produces garment deformations that are, at least, on par with the state-of-the-art supervised methods [santesteban2019virtualtryon, patel2020tailor], while we do not require any ground-truth dataset. For PBNS [bertiche2021pbns], we use a mean body shape because it does not generalize to different bodies. Because PBNS does not model an inertial term and is limited to a simpler material model, the garment deformations are generally stiffer, less realistic, and do not change naturally as a function of body pose. This is visible in rows 1 and 3 for PBNS in Figure 7, where the overall wrinkles are the same despite the significant change in body pose.
To further validate our model, we use the ground-truth simulated dataset (described in Section 4.1, used for validation purposes only) to retrain our neural network in a per-vertex supervised manner. In Figure 5 and in the supplementary we qualitatively demonstrate that the self-supervised method learns more detailed wrinkles than the supervised counterpart trained with exactly the same motions.
Additionally, in Figure 6 we show more results for a variety of garments learned with our approach, including t-shirts, tops, sleeveless shirts, pants, and shorts, worn by different body shapes. Notice how our approach produces different wrinkles for each garment type, pose, and shape combination, demonstrating the generalization capabilities of our self-supervised approach. For this figure, we trained one regressor for each garment type. In the supplementary video you can see animated results of these garments, showcasing for the first time realistic dynamic deformations of self-supervised learning-based garments.
5 Limitations and Conclusion
We believe SNUG makes an important step towards efficient learning-based models for 3D garments. To improve the state-of-the-art, instead of following the standard route of training with more data, adding more explicit supervision, or designing more complex architectures, we show that self-supervision based on physical properties of deformable solids leads to simpler and smaller yet highly-realistic models.
While our physics-based loss terms are the fundamental key to self-supervision, we also want to point out that our strategy of exploiting optimization-based schemes (originally derived for simulation problems) to train a neural network carries a few weaknesses and important considerations.
Specifically, we notice that the self-supervised network tends to converge to simpler solutions than a traditional simulator. For example, although our approach is capable of learning pose- and shape-dependent wrinkles and overall dynamics, we struggle to predict fine-level dynamics. We hypothesize that this limitation arises from a fundamental difference in how our method works: while standard simulators solve physics for one frame at a time, our model optimizes thousands of frames simultaneously during training. This makes our approach more prone to converge to simpler local minima. Nevertheless, we want to highlight that, despite this limitation, the cloth dynamics learned by our method are on par with other data-driven approaches.
Another aspect open to future research is collision handling. Although our loss penalizes collisions between the garment and the body in training samples, we found noticeable collisions in test motions. These collisions can be efficiently solved with a postprocessing step, but we believe it would be valuable to explore ways to enforce this constraint in the network. Addressing self-collisions of the garment would also be worth considering.
To foster future research in this field, our trained models and the code to run them are available at http://mslab.es/projects/SNUG.
This work was funded in part by the European Research Council (ERC Consolidator Grant no. 772738 TouchDesign) and the Spanish Ministry of Science (RTI2018-098694-B-I00 VizLearning).