1 Introduction
Cloth is particularly challenging for neural networks to model due to the complex physical processes that govern how cloth deforms. In physical simulation, cloth deformation is typically modeled via a partial differential equation that is discretized with finite element models ranging in complexity from variational energy formulations to basic masses and springs, see e.g. baraff1998large; bridson2002robust; bridson2003simulation; grinspun2003discrete; baraff2003untangling; selle2008robust. Mimicking these complex physical processes and numerical algorithms with machine learning inference has shown promise, but still struggles to capture high-frequency folds/wrinkles. PCA-based methods de2010stable; hahn2014subspace remove important high variance details and struggle with nonlinearities emanating from joint rotations and collisions. Alternatively, guan2012drape; neophytou2014layered; pons2017clothcap; lahner2018deepwrinkles; yang2018analyzing; jin2018pixel leverage body skinning magnenat1988joint; lander1998skin; lewis2000pose to capture some degree of the nonlinearity; the cloth is then represented via learned offsets from a codimension one skinned body surface. Building on this prior work, we propose replacing the skinned codimension one body surface parameterization with a skinned (fully) three-dimensional parameterization of the volume surrounding the body.
We parameterize the three-dimensional space corresponding to the volumetric region of air surrounding the body with a tetrahedral mesh. In order to do this, we leverage the work of lee2018skinned; lee2019robust, which proposed a number of techniques for creating and deforming such a tetrahedral mesh using a variety of skinning and simulation techniques. The resulting kinematically deforming skinned mesh (KDSM) was shown to be beneficial for both hair animation/simulation lee2018skinned and water simulation lee2019robust. Here, we only utilize the most basic version of the KDSM, assigning skinning weights to its vertices so that it deforms with the underlying joints similar to a skinned body surface (alternatively, one could train a neural network to learn more complex KDSM deformations). This allows us to make a very straightforward and fair comparison between learning offsets from a skinned body surface and learning offsets from a skinned parameterization of three-dimensional space. Our experiments showed an overall reduction in error of approximately 50% (see Table 3 and Figure 8) as well as the removal of visual/geometric artifacts (see e.g. Figure 9) that can be directly linked to the usage of the body surface mesh, and thus we advocate the KDSM for further study.
In order to further illustrate the efficacy of our approach, we show that the KDSM is amenable to being used with recently proposed works on texture sliding for better three-dimensional reconstruction wu2020recovering as well as in conjunction with networks that use a postprocess for better physical accuracy in the L norm geng2020coercing (see Figure 10).
In summary, our specific contributions are: 1) a novel three-dimensional parameterization for virtual cloth adapted from the KDSM, 2) an extension (enabling plastic deformation) of the KDSM to accurately model cloth deformation, and 3) a learning framework to efficiently infer such deformations from body pose. The mean error of the cloth predicted in jin2018pixel is five standard deviations higher than the mean error of our results.
2 Related Work
Cloth: Data-driven cloth prediction using deep learning has shown significant promise in recent years. To generate clothing on the human body, a common approach is to reconstruct the cloth and body jointly alldieck2018detailed; alldieck2018video; xu2018monoperfcap; alldieck2019learning; alldieck2019tex2shape; habermann2019livecap; natsume2019siclope; saito2019pifu; yu2019simulcap; bhatnagar2019multi. In such cases, human body models such as SCAPE anguelov2005scape and SMPL loper2015smpl can be used to reduce the dimensionality of the output space. To predict cloth shape, a number of works have proposed learning offsets from the body surface guan2012drape; neophytou2014layered; pons2017clothcap; lahner2018deepwrinkles; yang2018analyzing; jin2018pixel; gundogdu2019garnet such that body skinning can be leveraged. There are a variety of skinning techniques used in animation; the most popular approach is linear blend skinning (LBS) magnenat1988joint; lander1998skin. Though LBS is efficient and computationally inexpensive, it suffers from well-known artifacts addressed in kavan2005spherical; kavan2007skinning; jacobson2011stretchable; le2016real. Since regularization often leads to overly smooth cloth predictions, others have focused on adding additional wrinkles/folds to initial network inference results popa2009wrinkling; mirza2014conditional; robertini2014efficient; lahner2018deepwrinkles; wu2020recovering.
3D Parameterization: Parameterizing the air surrounding deformable objects is a way of treating collisions during physical simulation sifakis2008globally; muller2015air; wu2016real. For hair simulation in particular, previous works have parameterized the volume enclosing the head or body using tetrahedral meshes lee2018skinned; lee2019robust or lattices volino2004animating; volino2006real. These volumes are animated such that the embedded hairs follow the body as it deforms, enabling efficient hair animation, simulation, and collisions. Interestingly, deforming a low-dimensional reference map that parameterizes high-frequency details has been explored in computational physics as well, particularly for fluid simulation, see e.g. bellotti2019coupled.
3 Skinning a 3D Parameterization
We generate a KDSM using red/green tetrahedralization molino2003crystalline; teran2005adaptive to parameterize a three-dimensional volume surrounding the body. Starting with the body in the T-pose, we surround it with an enlarged bounding box containing a three-dimensional Cartesian grid. As is typical for collision bodies in computer graphics bridson2003simulation, we generate a level set representation separating the inside of the body from the outside (see e.g. osher2002level). See Figure 0(a). Next, a thickened level set is computed by subtracting a constant value from the current level set values (Figure 0(b)). Then, we use red/green tetrahedralization as outlined in molino2003crystalline; teran2005adaptive to generate a suitable tetrahedral mesh (Figure 0(c)). Optionally, this mesh could be compressed to the level set boundary using either physics or optimization, but we forgo this step because the outer boundary is merely where our parameterization ends and does not represent an actual surface as in molino2003crystalline; teran2005adaptive.
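As a hedged illustration of the thickening step, the sketch below uses an analytic sphere as a stand-in for the body level set. This is an assumption made purely for brevity, since the actual body level set is sampled on a Cartesian grid. The function names and the thickness value are hypothetical. With the convention that the level set is negative inside the body, subtracting a positive constant moves the zero isocontour outward by that amount.

```python
import math

def sphere_level_set(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, positive outside.
    The sphere is only a placeholder for the body (an assumption)."""
    return math.dist(point, center) - radius

def thicken(level_set_value, thickness):
    """Subtract a constant so the zero isocontour moves outward by `thickness`."""
    return level_set_value - thickness

# A point slightly outside the original surface ...
point = (1.2, 0.0, 0.0)
phi = sphere_level_set(point)
# ... lies inside the thickened region once the constant is subtracted.
phi_thick = thicken(phi, 0.5)
print(phi > 0.0, phi_thick < 0.0)
```

Points whose thickened value is negative belong to the region that would be handed to the tetrahedral mesher.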
Skinning weights are assigned to the KDSM using linear blend skinning (LBS) magnenat1988joint; lander1998skin, just as one would skin a codimension one body surface parameterization. In order to skin the KDSM so that it follows the body as it moves, each vertex $v_k$ is assigned a nonzero weight $w_{kj}$ for each joint $j$ it is associated with. Then, given a pose $\theta$ with joint transformations $T_j(\theta)$, the world space position of each vertex is given by $v_k(\theta) = \sum_j w_{kj} T_j(\theta) v_k^j$ where $v_k^j$ is the untransformed location of vertex $v_k$ in the local reference space of joint $j$. See Figure 0(d). Importantly, it can be quite difficult to significantly deform tetrahedral meshes without having some tetrahedra invert irving2004invertible; teran2005robust; thus, we address inversion and robustness issues/details in Section 5.
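The linear blend skinning rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: joint transformations are assumed to be 3x4 affine matrices, and the names `transform_point` and `skin_vertex` are hypothetical.

```python
def transform_point(transform, point):
    """Apply a 3x4 affine transform (rows of [R | t]) to a 3D point."""
    return tuple(
        sum(transform[r][c] * point[c] for c in range(3)) + transform[r][3]
        for r in range(3)
    )

def skin_vertex(weights, transforms, local_positions):
    """Linear blend skinning for one vertex.

    weights:         {joint: weight}, only joints with nonzero weight
    transforms:      {joint: 3x4 matrix for the current pose}
    local_positions: {joint: vertex location in that joint's local space}
    """
    world = [0.0, 0.0, 0.0]
    for joint, weight in weights.items():
        moved = transform_point(transforms[joint], local_positions[joint])
        for axis in range(3):
            world[axis] += weight * moved[axis]
    return tuple(world)

# Toy pose: joint 0 stays put, joint 1 translates by 2 along x.
identity = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
shift_x = [[1.0, 0.0, 0.0, 2.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
v = skin_vertex(
    weights={0: 0.5, 1: 0.5},
    transforms={0: identity, 1: shift_x},
    local_positions={0: (1.0, 0.0, 0.0), 1: (1.0, 0.0, 0.0)},
)
print(v)  # (2.0, 0.0, 0.0): halfway between (1, 0, 0) and (3, 0, 0)
```

The same routine applies unchanged to body surface vertices and to KDSM vertices, which is what makes the comparison between the two parameterizations direct.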
4 Embedding cloth in the KDSM
In continuum mechanics, deformation is defined as a mapping from a material space to the world space, and one typically decomposes this mapping into purely rigid components and geometric strain measures, see e.g. bonet1997nonlinear. Similar in spirit, we envision the T-pose KDSM as the material space and the skinned KDSM as being defined by a deformation mapping to world space for each pose $\theta$. As such, we denote the position of each cloth vertex in the material space (i.e. T-pose, see Figure 1(a)) as $u_i^m$. We embed each cloth vertex $u_i^m$ into the tetrahedron that contains it via barycentric weights $\lambda_{ik}^m$, which are only nonzero for the parent tetrahedron’s vertices. Then, given a pose $\theta$, a cloth vertex’s world space location is defined as $u_i(\theta) = \sum_k \lambda_{ik}^m v_k(\theta)$, where $v_k(\theta)$ are the skinned KDSM vertex positions, so that it is constrained to follow the KDSM deformation, assuming linearity in each tetrahedron (see Figure 1(b)). Technically, this is an indirect skinning of the cloth with its skinning weights computed as a linear combination of the skinning weights of its parent tetrahedron’s vertices, and leads to the obvious errors one would expect (see e.g. Figure 3, second row).
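The embedding described above can be illustrated with a small sketch: barycentric weights are computed once in the material space tetrahedron and then reused with the deformed tetrahedron's vertices. The helper names are hypothetical, and the barycentric solve uses Cramer's rule on a single tetrahedron (an illustrative choice, not necessarily what the authors use).

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def det3(a, b, c):
    """Scalar triple product a . (b x c)."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))

def barycentric(point, tet):
    """Barycentric weights of `point` with respect to the four tet vertices."""
    x0, x1, x2, x3 = tet
    e1, e2, e3, r = sub(x1, x0), sub(x2, x0), sub(x3, x0), sub(point, x0)
    volume = det3(e1, e2, e3)
    l1 = det3(r, e2, e3) / volume
    l2 = det3(e1, r, e3) / volume
    l3 = det3(e1, e2, r) / volume
    return (1.0 - l1 - l2 - l3, l1, l2, l3)

def interpolate(weights, tet):
    """Weighted combination of the tet vertices."""
    return tuple(sum(w * x[axis] for w, x in zip(weights, tet)) for axis in range(3))

# Material space (T-pose) tetrahedron and an embedded cloth vertex.
rest_tet = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
weights = barycentric((0.25, 0.25, 0.25), rest_tet)

# The same tetrahedron after skinning (here simply stretched along x).
posed_tet = ((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
world = interpolate(weights, posed_tet)
print(weights, world)
```

Because the weights are fixed, the cloth vertex moves linearly with its parent tetrahedron, which is exactly why bending and wrinkling cannot be captured without the additional displacements introduced next.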
The KDSM approximates a deformation mapping for the region surrounding the body. This approximation could be improved via physical simulation (see e.g. lee2018skinned; lee2019robust), which is computationally expensive but could be made more efficient using a neural network. However, the tetrahedral mesh is only well suited to capture deformations of a volumetric three-dimensional space and as such struggles to capture deformations intrinsic to codimension one surfaces/shells including the bending, wrinkling, and folding important for cloth. Thus, we take further motivation from constitutive mechanics (see e.g. bonet1997nonlinear) and allow the cloth vertices to move in material space (the T-pose) akin to plastic deformation. That is, we use plastic deformation in the material space in order to recapture elastic deformations (e.g. bending) lost/recovered when embedding cloth into a tetrahedral mesh. These elastic deformations are encoded as a pose-dependent plastic displacement for each cloth vertex, i.e. $d_i(\theta)$; then, the pose-dependent, plastically deformed material space position of each cloth vertex is given by $u_i^m(\theta) = u_i^m + d_i(\theta)$, where $u_i^m$ is the undeformed material space position.
Given a pose $\theta$, the plastically deformed position $u_i^m(\theta)$ will not necessarily have the same parent tetrahedron or barycentric weights as the undeformed position $u_i^m$; thus, a new embedding is computed for $u_i^m(\theta)$, obtaining new barycentric weights $\lambda_{ik}^m(\theta)$. Using this new embedding, the position of the cloth vertex in pose $\theta$ will be $u_i(\theta) = \sum_k \lambda_{ik}^m(\theta) v_k(\theta)$, where $v_k(\theta)$ are the skinned KDSM vertices. Ideally, if the displacements $d_i(\theta)$ are computed correctly, $u_i(\theta)$ will agree with the ground truth location of cloth vertex $i$ in pose $\theta$. The second row of Figure 4 shows cloth in the material space T-pose plastically deformed such that its skinned location in pose $\theta$ (Figure 4, first row) well matches the ground truth shown in the first row of Figure 3. Learning $d_i(\theta)$ for each vertex can be accomplished in exactly the same fashion as learning displacements from the skinned body surface mesh, and thus we use the same approach as proposed in jin2018pixel. Afterwards, an inferred $d_i(\theta)$ is used to compute $u_i^m(\theta)$ followed by $\lambda_{ik}^m(\theta)$, and finally $u_i(\theta)$. Addressing efficiency, note that only the vertices of the parent tetrahedra of $u_i^m(\theta)$ need to be skinned, not the entire tetrahedral mesh.






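In order to compute each training example $(\theta, d(\theta))$, we examine the ground truth cloth in pose $\theta$, i.e. $u^G(\theta)$. For each cloth vertex $u_i^G(\theta)$, we find the deformed tetrahedron it is located in and compute barycentric weights $\lambda_{ik}^G(\theta)$ resulting in $u_i^G(\theta) = \sum_k \lambda_{ik}^G(\theta) v_k(\theta)$, where $v_k(\theta)$ are the deformed tetrahedral mesh vertices. Then, that vertex’s material space (T-pose) location is given by $u_i^m(\theta) = \sum_k \lambda_{ik}^G(\theta) v_k^m$ where $v_k^m$ are the material space (T-pose) positions of the tetrahedral mesh (which are the same for all poses, and thus not a function of $\theta$). Finally, we define $d_i(\theta) = u_i^m(\theta) - u_i^m$, where $u_i^m$ is the undeformed material space position of the cloth vertex.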
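The training example construction above amounts to pulling a ground truth cloth vertex back from the deformed tetrahedron to the material space tetrahedron and subtracting its undeformed material space position. A minimal sketch for a single vertex and a single known parent tetrahedron follows. All function names are hypothetical, and the search for the parent tetrahedron is omitted.

```python
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def det3(a, b, c):
    """Scalar triple product a . (b x c)."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            - a[1] * (b[0] * c[2] - b[2] * c[0])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))

def barycentric(point, tet):
    """Barycentric weights of `point` in `tet` via Cramer's rule."""
    x0, x1, x2, x3 = tet
    e1, e2, e3, r = sub(x1, x0), sub(x2, x0), sub(x3, x0), sub(point, x0)
    volume = det3(e1, e2, e3)
    l1 = det3(r, e2, e3) / volume
    l2 = det3(e1, r, e3) / volume
    l3 = det3(e1, e2, r) / volume
    return (1.0 - l1 - l2 - l3, l1, l2, l3)

def combine(weights, tet):
    return tuple(sum(w * x[axis] for w, x in zip(weights, tet)) for axis in range(3))

def training_displacement(ground_truth, deformed_tet, rest_tet, rest_cloth_position):
    """Displacement in material space that reproduces `ground_truth` after skinning."""
    weights = barycentric(ground_truth, deformed_tet)   # weights in the posed tet
    material_position = combine(weights, rest_tet)      # same weights, T-pose tet
    return sub(material_position, rest_cloth_position)

rest_tet = ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
deformed_tet = ((0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
d = training_displacement(
    ground_truth=(1.0, 0.25, 0.25),
    deformed_tet=deformed_tet,
    rest_tet=rest_tet,
    rest_cloth_position=(0.25, 0.25, 0.25),
)
print(d)
```

In this toy case the ground truth vertex sits at material position (0.5, 0.25, 0.25), so the displacement is 0.25 along x and zero elsewhere.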
5 Inversion and robustness
Unfortunately, the deformed KDSM will generally contain both inverted and overlapping tetrahedra, both of which can cause a ground truth cloth vertex to be contained in more than one deformed tetrahedron, leading to multiple candidates for its material space position $u_i^m(\theta)$ and displacement $d_i(\theta)$. Although physical simulation can be used to reverse some of these inverted elements irving2004invertible; teran2005robust as was done in lee2018skinned; lee2019robust, it is typically not feasible to remove all inverted tetrahedra. Additionally, overlapping tetrahedra occur quite frequently between the arm and the torso, especially because the KDSM needs to be thick enough to ensure that it contains the cloth as it deforms.
Before resolving which parent tetrahedron each vertex with multiple potential parents should be embedded into, we first robustly assemble a list of all such candidate parent tetrahedra as follows. Given a deformed tetrahedral mesh in pose $\theta$, we create a bounding box hierarchy acceleration structure hahn1988realistic; webb1992using; barequet1996boxtree; gottschalk1996obbtree; lin1998collision for the tetrahedral mesh built from a slightly thickened bounding box around each tetrahedron. Then, given a ground truth cloth vertex $u_i^G(\theta)$, we robustly find all tetrahedra containing (or almost containing) it using a minimum barycentric weight of $-\epsilon$ with $\epsilon$ a small positive tolerance. We prune this list of tetrahedra, keeping only the most robust tetrahedron near each element boundary where numerical precision could cause a vertex to erroneously be identified as inside multiple or no tetrahedra. This is done by first sorting the tetrahedra on the list based on their largest minimum barycentric weight, i.e. preferring tetrahedra the vertex is deeper inside. Starting with the first tetrahedron on the sorted list, we identify the face across from the vertex with the smallest barycentric weight and prune all of that face’s vertex neighbors (and thus face/edge neighbors too) from the remainder of the list. Then, the next (nondeleted) tetrahedron on the list is considered, and the process is repeated, etc.
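One possible reading of the pruning procedure above is sketched below. It assumes each candidate is given as a tuple of four mesh vertex indices together with the cloth vertex's barycentric weights in that tetrahedron, and it interprets "vertex neighbors" of a face as tetrahedra sharing at least one vertex with it. The function name and data layout are hypothetical.

```python
def prune_candidates(candidates):
    """Keep the most robust tetrahedron near each element boundary.

    candidates: list of (vertex_ids, barycentric_weights) pairs for ONE cloth vertex.
    Returns the surviving candidates, deepest first.
    """
    # Prefer tetrahedra the vertex is deeper inside (largest minimum weight).
    ordered = sorted(candidates, key=lambda cand: min(cand[1]), reverse=True)
    removed = [False] * len(ordered)
    kept = []
    for i, (ids, weights) in enumerate(ordered):
        if removed[i]:
            continue
        kept.append((ids, weights))
        # Face opposite the tet vertex with the smallest barycentric weight,
        # i.e. the face the cloth vertex is closest to.
        smallest = min(range(4), key=lambda k: weights[k])
        face = {ids[k] for k in range(4) if k != smallest}
        # Prune later candidates that touch this face at any vertex.
        for j in range(i + 1, len(ordered)):
            if not removed[j] and face & set(ordered[j][0]):
                removed[j] = True
    return kept

candidates = [
    ((0, 1, 2, 3), (0.4, 0.3, 0.299, 0.001)),      # barely inside, near a face
    ((1, 2, 3, 4), (-0.001, 0.3, 0.3, 0.401)),     # face neighbor, barely outside
    ((10, 11, 12, 13), (0.25, 0.25, 0.25, 0.25)),  # overlapping tet elsewhere
]
kept = prune_candidates(candidates)
print([ids for ids, _ in kept])
```

In this toy input, the second tetrahedron shares a face with the first and is discarded as a numerical duplicate, while the third (a genuinely different overlapping element) survives as a second candidate parent.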
Method 1: Any of the parent tetrahedra that remain on the list may be chosen to obtain training examples with zero error as compared to the ground truth, although different choices lead to higher/lower variance in the displacements $d(\theta)$ and thus higher/lower demands on the neural network. To establish a baseline, we first take the naive approach of choosing randomly when multiple candidates exist. This can lead to high variance in $d(\theta)$ and subsequent ringing artifacts during inference. See Figure 5.
Method 2: Aiming for lower variance in the training data, we leverage the method of jin2018pixel where UV texture space and normal direction offsets from the skinned body surface are calculated for each pose in the training examples. These same offsets can be used in any pose, since the UVN coordinate system is still defined (albeit deformed) in every pose. Thus, we utilize these UVN offsets in our material space (T-pose) in order to define the plastically deformed material space positions $u_i^m(\theta)$ and subsequently the displacements $d_i(\theta)$. In particular, given the shrinkwrapped cloth in the T-pose, we apply UVN offsets corresponding to pose $\theta$. Although this results in lower variance than that obtained from Method 1, the resulting skinned cloth positions $u_i(\theta)$ do not exactly recover the ground truth cloth $u^G(\theta)$. See Figure 4(c).
Hybrid Method: When a vertex has only one candidate parent tetrahedron, Method 1 is used. When there is more than one candidate parent tetrahedron, we choose the parent that gives an embedding closest to the result of Method 2 (in the T-pose) as long as the disagreement is below a threshold (1 cm). As shown (for a particular training example) in Figure 6(g), this can leave a number of potentially high variance vertices undefined. Aiming for smoothness, we use the Poisson morph from cong2015fully to morph from the low variance results of Method 2 to the partially defined cloth mesh shown in Figure 6(g), utilizing the already defined/valid vertices as Dirichlet boundary conditions. See Figure 6(h). Although smooth, the resulting predictions may contain significant errors, and thus we only validate those that are within a threshold (1 cm) of the results of Method 2. See Figure 6(i). The Poisson equation morph guarantees smoothness, while only utilizing the morphed vertices close to the results of Method 2 limits errors (as compared to the ground truth) to some degree. This process is repeated until no newly morphed vertices are within the threshold (1 cm). At that point, the remaining vertices are assigned their morphed values despite any errors they might contain. See Figure 6(j).
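The first stage of the hybrid method (choosing among candidate parents before any Poisson morphing) can be sketched as follows. This is an illustrative reading only: positions are assumed to be in meters so that the 1 cm threshold becomes 0.01, the function name is hypothetical, and `None` stands for a vertex left undefined for the subsequent morph.

```python
import math

def choose_material_position(candidates, method2_position, threshold=0.01):
    """Pick a material space (T-pose) position for one cloth vertex.

    candidates:       positions obtained from each candidate parent tetrahedron
    method2_position: low variance position from the UVN offset approach
    threshold:        maximum allowed disagreement (assumed meters, i.e. 1 cm)
    """
    if len(candidates) == 1:
        # A single candidate parent: use it directly (Method 1).
        return candidates[0]
    closest = min(candidates, key=lambda c: math.dist(c, method2_position))
    if math.dist(closest, method2_position) < threshold:
        return closest
    return None  # left undefined; to be filled in by the morph

method2 = (0.0, 0.0, 0.0)
accepted = choose_material_position([(0.0, 0.0, 0.004), (0.3, 0.0, 0.0)], method2)
undefined = choose_material_position([(0.0, 0.0, 0.05), (0.3, 0.0, 0.0)], method2)
single = choose_material_position([(0.3, 0.0, 0.0)], method2)
print(accepted, undefined, single)
```

The iterative Poisson morph and its validation against the same threshold are not reproduced here, since they depend on the mesh connectivity and the solver from cong2015fully.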
6 Experiments
Dataset Generation: Our cloth dataset consists of T-shirt meshes corresponding to about 10,000 poses for a particular body physbamcloth (the same as in jin2018pixel). We applied an 80-10-10 split to obtain training, validation, and test datasets, respectively. Table 1 compares the maximum L and L norms as compared to the ground truth for each of the three methods used to generate training examples. While Method 1 minimizes cloth vertex errors, the resulting displacements $d(\theta)$ contain high variance. Method 2 has significant vertex errors, but significantly lower variance in $d(\theta)$. We leverage the advantages of both using the hybrid method in our experiments.
Method  Max Vertex Error  Avg Vertex Error  Max  Avg 

Method 1  136.5  9.35  
Method 2  12.7  0.549  14.9  0.75 
Hybrid Method  11.6  0.021  14.7  0.79 
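For concreteness, an 80-10-10 split of the kind mentioned above could be produced as in the sketch below. Whether the original split was shuffled, and with what seed, is not stated, so the shuffling, the seed, and the function name are all assumptions.

```python
import random

def split_indices(num_examples, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle example indices and cut them into train/validation/test lists."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility (assumption)
    n_train = int(fractions[0] * num_examples)
    n_val = int(fractions[1] * num_examples)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train, val, test = split_indices(10000)
print(len(train), len(val), len(test))
```

Keeping the test indices disjoint from the training indices is what allows the inference errors to be reported on poses not used in training.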
Network Training: We adapt the network architecture from jin2018pixel for learning the displacements $d(\theta)$, i.e. by storing the displacements as pixel-based cloth images for the front and back sides of the T-shirt. Given joint transformation matrices of shape for pose $\theta$, the network applies transpose convolution, batch normalization, and ReLU activation layers. The output of the network is , where the first three dimensions represent the predicted displacements for the front side of the T-shirt, and the last three dimensions represent those for the back side. We train with an loss on the difference between the ground truth displacements $d(\theta)$ and network predictions $\hat{d}(\theta)$, using the Adam optimizer kingma2014adam with a learning rate in PyTorch paszke2017automatic.
Network Inference: From the network output $\hat{d}(\theta)$, we define the predicted material space cloth positions $\hat{u}^m(\theta)$ by adding $\hat{d}(\theta)$ to the undeformed material space positions; the result is then embedded into the material space (T-pose) tetrahedral mesh and subsequently skinned to world space to obtain the cloth mesh prediction $\hat{u}(\theta)$. Table 3 summarizes the network inference results on the test dataset (not used in training). While all three methods detailed in Section 5 outperform the method proposed in jin2018pixel, the hybrid method achieved the lowest average vertex error and standard deviation. Figure 8 shows histograms of the average vertex error over all examples in the test dataset for the hybrid method and jin2018pixel. Note that the mean error of jin2018pixel is five standard deviations above the mean of the hybrid method. Table 3 shows the errors in volume enclosed by the cloth (after capping the neck/sleeves/torso). There are significant visual improvements as well, see e.g. Figure 9. In addition, we evaluate the hybrid method network on a motion capture sequence from cmumocap and compare the inferred cloth to the results in jin2018pixel. The hybrid method is able to achieve greater temporal consistency; see http://physbam.stanford.edu/~fedkiw/animations/clothkdsm_mocap.mp4. To demonstrate the efficacy of our approach in conjunction with other approaches, we apply texture sliding from wu2020recovering and the physical post process from geng2020coercing to the results of the hybrid method network predictions, see Figure 10.



7 Discussion
In this paper, we presented a framework for learning cloth deformation using a volumetric parameterization of the air surrounding the body. This parameterization was implicitly defined via a tetrahedral mesh that was skinned to follow the body as it animates, i.e. the KDSM. A neural network was used to predict offsets in material space (the T-pose) such that the result well matched the ground truth after skinning the KDSM. The cloth predicted using the hybrid method detailed in Section 5 exhibits half the error of the state of the art; in fact, the mean error from jin2018pixel is five standard deviations above the mean resulting from our hybrid approach. Our results demonstrate that the KDSM is a promising foundation for learning virtual cloth and potentially for hair and solid/fluid interactions as well. Moreover, the KDSM should prove useful for treating cloth collisions, multiple garments, and interactions with external physics.
The KDSM intrinsically provides a more robust parameterization of three-dimensional space, since it contains a true extra degree of freedom as compared to the degenerate codimension one body surface. In particular, embedding cloth into a tetrahedral mesh has stability guarantees that do not exist when computing offsets from the body surface. See Figure 11. We believe that the significant decrease in network prediction errors is at least partially attributable to increased stability from using a volumetric parameterization.
8 Acknowledgements
Research supported in part by ONR N000141310346, ONR N000141712174, and JD.com. We would like to thank Reza and Behzad at ONR for supporting our efforts into machine learning, as well as Rev Lebaredian and Michael Kass at NVIDIA for graciously loaning us a GeForce RTX 2080Ti to use for running experiments. We would also like to thank Zhengping Zhou for contributing to the early stages of this work.