High-fidelity animation of clothed humans is key to a wide range of applications in AR/VR, 3D content production and virtual try-on. One of the main challenges when generating these animations is to create realistic cloth deformations with plausible wrinkles, creases, pleats, and folds. Such simulations are typically carried out by physics engines that model clothes as meshes whose neighboring vertices are connected by spring-mass systems. Unfortunately, these simulators need to be fine-tuned by a human expert and rely on computationally intensive collision computations between vertices. These two limitations prevent their deployment as a layer of a larger deep learning architecture.
With the advent of deep learning there have been a number of learning-based approaches that attempt to emulate physics engines using differentiable networks [16, 10, 37, 24]. These approaches, however, propose models that focus on specific clothes and need to be retrained to simulate novel garments. The most recent approaches [24, 37, 17] build upon an MLP architecture conditioned on body pose, shape and garment style. While the style parameter allows some control over cloth attributes (sleeve length, size and fit), the range of variation is still fairly limited.
In this paper, we present PhysXNet, a method to predict cloth dynamics of dressed people that is adaptable to different clothing types, styles and topologies without needing to be retrained. For this purpose we build upon a simple but powerful representation based on UV maps encoding cloth displacements. These UV maps are carefully designed to simultaneously encapsulate many different cloth types (upper body, lower body and dresses) and cloth styles (from long-sleeve to sleeveless T-shirts). Given this representation, we then formulate the problem as a mapping between the human body kinematic space and the cloth deformation space. The input human kinematics are similarly represented as UV maps, in this case encoding body velocities and accelerations. Therefore, the problem boils down to learning a mapping between two different UV maps, from the human to the clothing, which we do using a conditional GAN.
In order to train our system we build a synthetic dataset with the Blender physics engine, consisting of 50 skeletal actions and a human wearing three different garment templates: tops, bottoms and dresses. The results show that PhysXNet is able to predict very accurate cloth deformations for clothes seen during training, while also being adaptable to clothes with other topologies through a simple UV mapping.
Our key contributions can be summarized as follows:
- A model that is able to simultaneously predict deformations on three garment templates.
- A garment template representation by means of UV maps that allows us to easily map different cloth topologies onto these templates.
- A differentiable network that can be integrated into larger deep learning pipelines.
- A new dataset of physically plausible cloth deformations with 50 human actions and 3 garment templates: tops, bottoms, and dresses.
2 Related Work
Estimating the deformation of a piece of cloth when a human is striking a pose or performing an action is a very difficult task. While estimating cloth deformation has traditionally been addressed by model-based approaches [23, 28, 22], recent deep learning techniques build upon data-driven methods. The datasets these methods rely on, however, usually ignore cloth deformation physics, producing unrealistic renders. This problem is generally addressed by obtaining the data from registered scans or by including cloth simulation engines in the data generation process. Below, we briefly review different methods and datasets used to achieve realistic cloth and body reconstructions.
Synthetic Datasets. One of the main problems when generating a dataset is to obtain natural cloth deformations when a human is performing an action. Scan-based approaches [41, 11, 35] have the advantage that they can capture every cloth detail without having to worry about physical cloth models; their main drawback is that they need dedicated hardware and software to process all the data. On the other hand, synthetic approaches [36, 26, 2] can be easily annotated and modified, but struggle to obtain realistic cloth deformations. Cloth behaviour has to be tuned for every different 3D model and for each action, which requires some professional expertise. Recent cloth physics engines can achieve very natural cloth behaviors [15, 33, 34], even for complex meshes, which makes synthetic simulation a good competitor to scanned data. In our work, we generate a synthetic dataset with a parametric human 3D model and use the Blender cloth engine as the cloth simulator. We create high quality cloth deformations for three garment templates over motion sequences.
Data driven cloth deformations. Using datasets generated either from scans or from synthetic data, a large part of the research concentrates on achieving highly detailed cloth deformations with tailor-made networks [19, 40, 10, 3, 16, 42], GANs [32, 14] or, even more recently, implicit functions. These methods assume that each cloth deformation frame is independent of the others and concentrate on obtaining reliable reconstructions in still images. Other methods go one step further and try to infer the cloth deformation given a human pose and shape [13, 7, 25], obtaining very convincing results. In some other cases [37, 39, 31], the cloth size and style can also be adjusted by changing some statistical model weights, which allows more flexible simulations. However, these methods are designed to deal with a single garment at a time, and cloth dynamics generated by body movements are ignored.
Physics based cloth deformations. The above methods reason about cloth geometry to obtain plausible cloth deformations, but ignore the underlying physics of the cloth, which can help to achieve more natural deformations. This is especially true when the cloth deformations are affected by the motion of the body. Using the physics information obtained from a dataset, different networks [9, 38, 30] are able to simulate cloth wrinkles and deformations given a body pose and shape. TailorNet extends this work and allows for cloth style variations obtained from a base template model. While these methods are designed to be optimal on a T-shirt, other garments can also be estimated [17, 31]. All simulations are achieved using a dedicated network per garment, which makes these methods not very flexible when the cloth mesh differs from the one used for training. Moreover, a human model usually wears more than a single garment, which means that these methods need a different network for each garment, making them more difficult to integrate into a larger pipeline.
3.1 Problem Formulation
Physics-based engines model clothes using spring-mass models. As an oversimplification of how a simulation is performed, we can understand that the force (and hence the displacement) received by each of these spring-mass elements is estimated as a function of the skeleton velocities and accelerations. Building upon this intuitive idea, we formulate the problem of predicting cloth dynamics as a regression from current and past body velocities and accelerations to cloth-to-body offset displacements. We encapsulate all this information by means of UV maps.
Concretely, assume we have a sequence of training frames, consisting of 3D body meshes and their corresponding cloth meshes. For each frame, we know the body and cloth UV maps that transform 3D points on the surface to 2D points on a planar domain. Two body UV maps of the same size encode the velocity (v) and acceleration (a) of the body surface points. Similarly, a cloth UV map of that same size encodes the offset (o) of the cloth surface points w.r.t. the body surface. We consider three different cloth templates: tops, bottoms and dresses.
Given these definitions, we formulate our problem as that of learning the mapping from the velocities and accelerations of the body surface points in the current frame and the two previous frames to the garment offsets at the current frame.
Fig. 2 shows a schematic of the PhysXNet pipeline. Given a sequence of human body motions, the UV maps for body velocities and accelerations are computed in triplets and passed to the network in order to infer the UV maps of the cloth offsets for the currently evaluated frame. Then, the vertices of a given garment are projected onto the corresponding garment UV map to obtain, for each vertex, the offset with respect to its body surface point and hence the final position of the garment.
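The last step of the pipeline, reading an offset for each garment vertex from the predicted UV map and adding it to the corresponding body surface point, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the names `sample_nearest` and `reconstruct_garment`, the nearest-pixel lookup, and the convention that `v = 0` maps to the first row are all assumptions.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
UVMap = Sequence[Sequence[Vec3]]  # rows of pixels, each pixel holding a 3D offset


def sample_nearest(uv_map: UVMap, u: float, v: float) -> Vec3:
    """Nearest-pixel lookup for (u, v) in [0, 1]; v = 0 maps to row 0 (assumed)."""
    height, width = len(uv_map), len(uv_map[0])
    col = min(width - 1, max(0, round(u * (width - 1))))
    row = min(height - 1, max(0, round(v * (height - 1))))
    return uv_map[row][col]


def reconstruct_garment(
    body_points: Sequence[Vec3],
    uv_coords: Sequence[Tuple[float, float]],
    offset_map: UVMap,
) -> List[Vec3]:
    """Place each garment vertex at its body surface point plus the sampled offset."""
    vertices = []
    for point, (u, v) in zip(body_points, uv_coords):
        offset = sample_nearest(offset_map, u, v)
        vertices.append(tuple(p + o for p, o in zip(point, offset)))
    return vertices


# Toy 2x2 offset map and two garment vertices (hypothetical values).
offset_map = [
    [(0.1, 0.0, 0.0), (0.0, 0.2, 0.0)],
    [(0.0, 0.0, 0.3), (1.0, 1.0, 1.0)],
]
garment = reconstruct_garment(
    body_points=[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    uv_coords=[(0.0, 0.0), (1.0, 1.0)],
    offset_map=offset_map,
)
```

A bilinear lookup would give smoother results than the nearest-pixel lookup used here; the simpler variant is shown only to keep the sketch short.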
PhysXNet is trained as two separate models: a generator that produces samples of the garment UV maps, and a discriminator that tries to determine whether these samples are real or fake. Training then proceeds as a minimax game in which the generator tries to "fool" the discriminator while the discriminator tries to "catch" the generator's fake samples.
Thus, the discriminator is trained in a supervised manner, where data coming from the generator should be classified as fake and real data should be classified as real. The loss of the discriminator is a sigmoid cross entropy over the real and generated classes.
The generator is trained to produce outputs as similar as possible to the ground truth data. Its loss includes a regularization term, weighted by a scalar parameter, that keeps the generated garment UV maps close to the ground truth garment UV maps. The metric used in this term was chosen because it produces slightly better results and more stable training than the alternative metric we considered.
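For reference, the two losses described above can be written in the standard conditional GAN form. This is a sketch under assumed notation, not a reconstruction of the paper's equations: $x$ stands for the stacked body velocity and acceleration UV maps, $y$ for the ground truth garment UV maps, $G$ and $D$ for the generator and discriminator (with $D$ returning logits), $\sigma$ for the sigmoid, $\lambda$ for the regularization weight and $\lVert\cdot\rVert_p$ for whichever norm is chosen. Conditioning $D$ on $x$ is the usual conditional GAN choice and is also an assumption.

```latex
\begin{align}
\mathcal{L}_D &= -\,\mathbb{E}_{x,y}\big[\log \sigma\big(D(x, y)\big)\big]
                 -\,\mathbb{E}_{x}\big[\log\big(1 - \sigma\big(D(x, G(x))\big)\big)\big], \\
\mathcal{L}_G &= -\,\mathbb{E}_{x}\big[\log \sigma\big(D(x, G(x))\big)\big]
                 + \lambda\,\mathbb{E}_{x,y}\big[\lVert y - G(x) \rVert_p\big].
\end{align}
```

The first expression is the sigmoid cross entropy with target "real" for ground truth pairs and target "fake" for generated pairs; the second adds the regularization term to the adversarial term of the generator.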
Architecture. The generator is designed as an encoder-decoder network. The encoder receives the body velocity and acceleration UV maps for the current and previous two frames of the motion sequence. This "body" encoder is then connected to a "garment" decoder, one for each garment template, that returns the offset positions of the garment with respect to the body. As the garment offsets behave differently depending on the template, it is necessary to have a different decoder for each garment template.
Each encoder layer is composed of a 2D convolution that sub-samples the input to half its size, a batch normalization and a ReLU function. Each decoder layer is composed of a transposed 2D convolution that doubles the size of the input, and a batch normalization layer. We use four encoder and four decoder layers with skip connections, as in the UNet network. The discriminator is a PatchGAN with a binary output. The use of a discriminator helps the network to produce smoother UV maps with no abrupt changes between neighboring pixels.
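The resolution bookkeeping implied by this architecture (four halving encoder layers, four doubling decoder layers, and UNet-style skip connections between feature maps of equal size) can be checked with a few lines of Python. The input size of 256 and the function names are hypothetical; the text does not state them.

```python
from typing import List, Tuple


def encoder_decoder_sizes(input_size: int, num_layers: int = 4) -> Tuple[List[int], List[int]]:
    """Spatial size after each encoder (halving) and decoder (doubling) layer."""
    if input_size % (2 ** num_layers) != 0:
        raise ValueError("input_size must be divisible by 2 ** num_layers")
    encoder = [input_size]
    for _ in range(num_layers):
        encoder.append(encoder[-1] // 2)
    decoder = [encoder[-1]]  # the decoder starts from the bottleneck
    for _ in range(num_layers):
        decoder.append(decoder[-1] * 2)
    return encoder, decoder


def skip_pairs(num_layers: int = 4) -> List[Tuple[int, int]]:
    """(encoder index, decoder index) pairs whose feature maps share a resolution."""
    return [(num_layers - j, j) for j in range(1, num_layers)]


encoder_sizes, decoder_sizes = encoder_decoder_sizes(256)  # 256 is an assumed input size
pairs = skip_pairs()
```

Each pair returned by `skip_pairs` identifies an encoder output that can be concatenated with a decoder output of the same spatial size, which is the usual way UNet skip connections are wired.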
4 Generate training data
The dataset is generated using the Avatar Blender add-on. This add-on is based on the Makehuman open source 3D human model and is completely integrated into the rendering software Blender. It includes a parametric body model for pose and shape and a library of clothes that are ready to use in a single click, which allows us to accelerate the generation of a physics-based cloth dataset. From the cloth library we select three different 3D models (shirt, pants, dress) that are used as our garment templates (tops, bottoms and dresses). Each selected 3D model is run with physical simulation activated over the 50 actions of the dataset, taken from CMU and Mixamo mocap files.
Each simulation is designed bearing in mind that we want to capture the dynamics of a cloth during a long action sequence. In the current physics simulator, based on a spring-mass model, the cloth behavior is influenced by different parameters that can be grouped in three main areas: 1) garment parameters, 2) world parameters and 3) external force parameters. World parameters such as gravity and air friction are unchanged for all simulations. External forces such as velocity and acceleration parameters are constrained by the action defined in the motion files, and garment parameters such as bending, stiffness, compression and shear are adjusted to match a cotton fabric style for each one of the cloth templates. The fabric style needs to be adjusted for every cloth mesh used, as these parameters depend on the number of vertices of the mesh. The simulations are run with collisions and self-collisions activated.
4.2 Generate train UV maps
The synthetic dataset is generated from the 3D mesh models for body and clothes. The body mesh is a parametric model with a fixed number of vertices and a set of parameters that control shape and pose at each frame of the sequence. This body 3D model wears each one of the three cloth mesh templates (tops, bottoms and dresses), each with its own number of vertices. For simplicity, we refer to the cloth mesh template models by their state at a given sequence frame. As there is no direct correspondence between the vertices of the body mesh and the vertices of the cloth templates, we define a transference matrix that relates the body vertices with a point on the cloth surface (see Fig. 3) and relates cloth vertices with a point on the body surface (see Fig. 4).
Hence, each cloth vertex position at a given frame can be expressed as a point on the body surface plus an offset (Eq. 3), where the offset is given by a function of a set of parameters such as body forces, world scene and garment fabric.
Body UV maps. Neural networks are more efficient when using 2D image representations, and for that reason we represent the surface of our 3D models by means of UV maps. Each pixel of the UV layout has a direct correspondence with a point on the mesh surface, stored in the transference matrix. Therefore, the body mesh surface is represented by the body UV map. From the body UV map positions we can easily obtain the UV maps for velocity and acceleration.
The original body UV map layout is modified to occupy as many pixels as possible inside the layout and therefore, get a better sampling of the surface of the body. Thus, face, hands and feet are removed from the layout, and upper and lower limbs are stretched and resized, see Fig. 3.
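One simple way to obtain velocity and acceleration maps from per-frame position maps is to take backward finite differences over time. The text does not state which scheme is used, so the sketch below is an assumption; the function name, the flattened pixel layout and the time step `dt` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Frame = Sequence[Vec3]  # one 3D value per valid UV pixel, flattened


def _differences(frames: Sequence[Frame], dt: float) -> List[List[Vec3]]:
    """Backward difference between consecutive frames, pixel by pixel."""
    result = []
    for previous, current in zip(frames, frames[1:]):
        result.append([
            tuple((c - p) / dt for p, c in zip(prev_px, curr_px))
            for prev_px, curr_px in zip(previous, current)
        ])
    return result


def velocities_and_accelerations(positions: Sequence[Frame], dt: float):
    """velocities[i] belongs to frame i + 1, accelerations[i] to frame i + 2."""
    velocities = _differences(positions, dt)
    accelerations = _differences(velocities, dt)
    return velocities, accelerations


# A single pixel moving along x with position t**2 at frames t = 0..3 (toy data).
positions = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(4.0, 0.0, 0.0)], [(9.0, 0.0, 0.0)]]
velocities, accelerations = velocities_and_accelerations(positions, dt=1.0)
```

With this scheme the first frame has no velocity and the first two frames have no acceleration, so a sequence loses two frames at its start.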
Garment UV maps. The garment UV maps contain the offset vectors from the body surface to the cloth surface points for each pixel in the transference matrix, Eq. 8. This matrix is calculated with the dressed body in a T-pose. Then, for every valid pixel in the body UV map, a ray is traced along the normal of the body surface and the impact point is stored as the corresponding cloth point. This process is illustrated in Fig. 3.
The case of the dress garment is slightly different, since in the lower part of the dress there are parts of the mesh that have no body correspondence: the rays cast along the surface normals of the inner part of the legs never hit the center of the garment. Therefore, another body mesh is created in which the legs are joined by an ellipsoid.
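The ray test behind this correspondence search can be illustrated with the Möller-Trumbore ray-triangle intersection. The text does not say which intersection routine is used, so this is a swapped-in standard technique, and all function names are hypothetical. The ray starts at a body surface point and follows its normal; the nearest hit over the garment triangles is kept.

```python
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Triangle = Tuple[Vec3, Vec3, Vec3]
EPS = 1e-9


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def ray_triangle(origin: Vec3, direction: Vec3, tri: Triangle) -> Optional[float]:
    """Distance t along the ray to the triangle, or None if there is no hit."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    h = cross(direction, e2)
    a = dot(e1, h)
    if abs(a) < EPS:  # ray is parallel to the triangle plane
        return None
    f = 1.0 / a
    s = sub(origin, v0)
    u = f * dot(s, h)
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = f * dot(direction, q)
    if v < 0.0 or u + v > 1.0:
        return None
    t = f * dot(e2, q)
    return t if t > EPS else None  # ignore hits behind the origin


def closest_hit(origin: Vec3, normal: Vec3, triangles: Sequence[Triangle]) -> Optional[Vec3]:
    """Nearest intersection point of the normal ray with the garment triangles."""
    hits = [t for t in (ray_triangle(origin, normal, tri) for tri in triangles) if t is not None]
    if not hits:
        return None
    t = min(hits)
    return tuple(o + t * d for o, d in zip(origin, normal))


# Two parallel toy triangles at z = 2 and z = 1; the nearer one must win.
far_tri = ((0.0, 0.0, 2.0), (1.0, 0.0, 2.0), (0.0, 1.0, 2.0))
near_tri = ((0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0))
hit = closest_hit((0.25, 0.25, 0.0), (0.0, 0.0, 1.0), [far_tri, near_tri])
miss = closest_hit((2.0, 2.0, 0.0), (0.0, 0.0, 1.0), [far_tri, near_tri])
```

A pixel whose ray returns `None` has no cloth correspondence, which is the situation described for the inner legs under the dress.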
4.3 Evaluate different garments
The main advantage of PhysXNet over other methods is that we can easily use garments from different sources without the need to retrain the network. These garments need to fit into one of the three cloth templates, but there is no condition on either the number of vertices or the topology.
Thus, given a garment model with an arbitrary number of vertices, we need to find the transference matrix that relates each vertex of the model with the garment templates. This process is illustrated in Fig. 4 and consists of casting a ray from the cloth vertex along its normal direction to the body surface in a T-pose, in order to find the body UV map coordinate. The body UV map has a direct correspondence with each one of the estimated garment templates through the transference matrix computed previously. This coordinate gives us the offset with respect to the body, from which we can reconstruct the mesh for a given frame.
In our case, the vertices of the evaluated cloth meshes fall a few pixels away from the UV map boundaries. This avoids discrepancies that could arise when the same vertex corresponds to UV map coordinates on opposite sides of a boundary. If a vertex falls on the UV map boundary, the best solution would be to average the pixel values corresponding to that vertex.
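The averaging suggested for vertices on a UV boundary amounts to a component-wise mean over all offsets sampled for the same vertex. A minimal sketch, with a hypothetical function name and toy values:

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def average_offsets(samples: Sequence[Vec3]) -> Vec3:
    """Component-wise mean of the offsets read for one vertex at a UV seam."""
    if not samples:
        raise ValueError("at least one sample is required")
    count = len(samples)
    return tuple(sum(sample[i] for sample in samples) / count for i in range(3))


# Offsets read on the two sides of a seam for the same vertex (toy values).
seam_offset = average_offsets([(1.0, 0.0, 0.0), (3.0, 2.0, 0.0)])
```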
4.4 Implementation details
The PhysXNet network is trained with actions with a total of frames and tested with actions with a total of frames. All data UV maps, , , are normalized independently from to . The network discriminator is trained with soft labels, using random uniform sampling from to for estimated labels, and from to for ground truth labels. Moreover, a random fraction of the training data in each epoch has flipped labels. Image UV map sizes are, and learning rate . The architecture is trained up to epochs for days on a single NVIDIA GeForce GTX 1080 GPU and inference mean time per frame is (load data, run, save files).
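The soft-label and label-flipping scheme for the discriminator can be sketched as below. The sampling ranges and the flip probability are hypothetical placeholders, since the actual values are not given in this text; only the overall mechanism follows the description above.

```python
import random
from typing import Tuple


def soft_label(
    is_real: bool,
    rng: random.Random,
    fake_range: Tuple[float, float] = (0.0, 0.1),  # assumed range for generated data
    real_range: Tuple[float, float] = (0.9, 1.0),  # assumed range for ground truth data
    flip_prob: float = 0.05,                       # assumed fraction of flipped labels
) -> float:
    """Draw a noisy discriminator target, occasionally flipping the class."""
    if rng.random() < flip_prob:
        is_real = not is_real
    low, high = real_range if is_real else fake_range
    return rng.uniform(low, high)


rng = random.Random(0)
real_targets = [soft_label(True, rng, flip_prob=0.0) for _ in range(100)]
fake_targets = [soft_label(False, rng, flip_prob=0.0) for _ in range(100)]
flipped_targets = [soft_label(True, rng, flip_prob=1.0) for _ in range(100)]
```

Soft and occasionally flipped targets are a common way to keep the discriminator from becoming overconfident early in training.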
We next evaluate PhysXNet through several quantitative and qualitative experiments. In the quantitative experiments, we compare our method with Linear Blend Skinning (LBS) as a baseline. The LBS method calculates the displacement of each vertex as a weighted linear combination of the transformations of its assigned skeleton segments. Results are given by comparing the estimated garment UV maps with the ground truth UV maps for each vertex of the garment template and also for each pixel in the garment UV map. In the qualitative results, we compare our method with LBS and TailorNet. We also show the results of PhysXNet with different body shapes and with garment meshes other than the ones used for training.
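The LBS baseline can be illustrated with a small sketch: each rest-pose vertex is transformed by every bone and the results are blended with per-vertex weights. The 3x4 matrix layout (rotation plus translation) and the function names are assumptions made for this illustration.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Matrix3x4 = Sequence[Sequence[float]]  # rows of [r0, r1, r2, translation]


def transform_point(matrix: Matrix3x4, point: Vec3) -> Vec3:
    """Apply a rigid 3x4 bone transform to a point."""
    return tuple(
        row[0] * point[0] + row[1] * point[1] + row[2] * point[2] + row[3]
        for row in matrix
    )


def linear_blend_skinning(
    rest_vertices: Sequence[Vec3],
    bone_transforms: Sequence[Matrix3x4],
    weights: Sequence[Sequence[float]],  # weights[i][k]: influence of bone k on vertex i
) -> List[Vec3]:
    """Blend the per-bone transformed positions with the skinning weights."""
    skinned = []
    for vertex, vertex_weights in zip(rest_vertices, weights):
        blended = [0.0, 0.0, 0.0]
        for weight, matrix in zip(vertex_weights, bone_transforms):
            moved = transform_point(matrix, vertex)
            for axis in range(3):
                blended[axis] += weight * moved[axis]
        skinned.append(tuple(blended))
    return skinned


identity = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
shift_x = [[1.0, 0.0, 0.0, 2.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
skinned = linear_blend_skinning([(1.0, 0.0, 0.0)], [identity, shift_x], [[0.5, 0.5]])
```

Because the result depends only on the current bone transforms, LBS has no notion of velocity or acceleration, which is why it serves as a purely pose-driven baseline.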
5.1 Quantitative results
We provide two different measures for the quantitative results. First, we calculate the mean squared error (MSE) between each valid pixel of the UV map templates estimated by PhysXNet and the ground truth UV maps obtained from the synthetic dataset. Then, we calculate the MSE for the vertices of the cloth meshes used to generate the dataset. Note that the mesh vertices are a subset of the UV map pixels.
The actions evaluated in Fig. 6 are, in order: jump, walking, moon walk, Chinese dance, punch, balancing, ballet, stretch arms, salsa dance, jogging, side step, and strong gesture. The first bar, in cyan, corresponds to the tops template, the second bar, in gray, to the bottoms template, and the third bar, in purple, to the dress template. Some of these actions, such as moon walk, balancing and walking, have very soft movements, which results in small velocities and accelerations. In others, such as strong gesture, jump and punch, the pose changes greatly within very few frames, which produces large velocities and accelerations.
Errors for the tops and bottoms templates are very similar, while errors for the dress template are almost double those of the other two. The dress template errors are larger because the network needs to hallucinate the region around the legs, as there are parts of the dress that have no direct correspondence with the input body UV maps.
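The pixel-wise measure can be sketched as an MSE restricted to a validity mask. The exact normalization is not stated in the text; the version below averages the squared Euclidean distance between predicted and ground truth offsets over valid pixels, which is an assumption, as is the function name.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
UVMap = Sequence[Sequence[Vec3]]


def masked_mse(predicted: UVMap, target: UVMap, valid: Sequence[Sequence[bool]]) -> float:
    """Mean squared offset error over the pixels flagged as valid."""
    total, count = 0.0, 0
    for pred_row, target_row, valid_row in zip(predicted, target, valid):
        for pred_px, target_px, is_valid in zip(pred_row, target_row, valid_row):
            if is_valid:
                total += sum((p - t) ** 2 for p, t in zip(pred_px, target_px))
                count += 1
    if count == 0:
        raise ValueError("the mask contains no valid pixels")
    return total / count


# One valid pixel with error 1 and one invalid pixel that must be ignored (toy data).
predicted = [[(1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]]
target = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
error = masked_mse(predicted, target, [[True, False]])
```

The vertex-wise measure follows the same pattern, with the mask selecting only the pixels that correspond to mesh vertices.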
5.2 Qualitative results
In the qualitative results we compare PhysXNet with the LBS method and, when possible, with TailorNet. Moreover, we show its performance under changes of human shape and with cloth topologies that have never been seen by the network.
Results on continuous actions. We show several frames for the sequence Jump in Fig. 5. In the top row the model wears the tops and bottoms templates, and in the bottom row, the model wears the dress template. We can observe that when the body goes up it generates several forces that push the cloth also up. This kind of behaviour is not possible to obtain with other methods that only use human pose to deform the cloth.
Different garment topologies and body shapes. PhysXNet is also able to deal with different body shapes, as the outputs of the network are the offsets of each garment with respect to the body, and with garments that contain a different number of vertices and different topologies. This is possible because the output UV map templates encode the surface of the garment: when using a different mesh topology, it is only necessary to project the vertices of the mesh onto the UV map, without retraining the network. Results for several 3D garments (t-shirt, shorts, shirt2, skirt) are shown in Fig. 9.
Comparisons. We compare PhysXNet with the LBS method and TailorNet in the case of the tops and bottoms templates, and only with the LBS method in the case of the dress template, as TailorNet does not have a dress model. The human models used in our case (Makehuman) and in TailorNet (SMPL) are different, so the represented actions are not exactly the same due to differences in internal bone structure and bone lengths.
In Fig. 8 we can observe the differences between the three methods. With LBS and TailorNet, the bottom part of the shirt does not move while the punch action is performed, whereas with our method the shirt reflects the movement produced by the body. The main reason for this behavior is that our method takes into account the current and past body motion and is able to apply it to the cloth, while the other two methods are static and only use the current body pose. A similar effect can be observed in Fig. 7 with the dress template.
There are also other differences between the three methods, shown in Table 1. The first concerns the garment mesh models themselves. While TailorNet uses very heavy models, up to 14GB per model, our cloth models and those of LBS are very light, at around 1MB. The downside is that our model is coarsely sampled, which makes it difficult to capture small wrinkles. The second difference concerns the network weights: for a single garment the network weights are similar in size, but in TailorNet a user who wants to use more models needs to download more weights, whereas in our method the same weights are used for the three templates, which can be applied to a large variety of garments. The last difference concerns execution time: as expected, larger models come with longer execution times. In PhysXNet a single pass of the network yields the outputs of the three garment templates, while in TailorNet it is necessary to perform inference for each one of the desired garments.
We presented a network, PhysXNet, that generates physical cloth dynamics for three totally different garment templates at the same time. The network is able to generalize to unseen body actions, different body shapes and different 3D cloth models, making it suitable for integration into a larger pipeline. Our network can simulate the cloth physics behavior of any 3D cloth mesh downloaded from the internet that fits any of the three garment templates, without being retrained. The proposed method is compared quantitatively with the synthetic dataset ground truth and qualitatively with a baseline, LBS, and with a state-of-the-art method, TailorNet.
This work is supported partly by the Spanish government under project MoHuCo PID2020-120049RB-I00, the ERA-Net Chistera project IPALM PCI2019-103386 and María de Maeztu Seal of Excellence MDM-2016-0656.
- (1998) Large steps in cloth simulation. In SIGGRAPH ’98: Proceedings of the 25th annual conference on Computer graphics and interactive techniques, pp. 43–54.
- (2020) CLOTH3D: clothed 3d humans.
- Multi-garment net: learning to dress 3d people from images. IEEE International Conference on Computer Vision (ICCV).
- Blender. https://www.blender.org. Accessed: 2020-09-30.
- CMU. http://mocap.cs.cmu.edu. Accessed: 2020-09-30.
- SMPLicit: topology-aware generative model for clothed people. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- Coercing machine learning to output physically accurate results. Journal of Computational Physics 406.
- (2014) Generative adversarial nets. In NIPS.
- (2012) DRAPE: dressing any person. ACM Trans. Graph. 31 (4).
- (2019) Garnet: a two-stream network for fast and accurate 3d cloth draping. In IEEE International Conference on Computer Vision (ICCV).
- (2020-06) DeepCap: monocular human performance capture using weak supervision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2017) Image-to-image translation with conditional adversarial networks. In CVPR.
- (2018) A pixel-based framework for data-driven clothing.
- (2019) GANs-based clothes design: pattern maker is all you need to design clothing. In Augmented Human International Conference (AH).
- (2015) View-dependent adaptive cloth simulation with buckling compensation. IEEE Transactions on Visualization and Computer Graphics 21 (10).
- (2018) DeepWrinkles: accurate and realistic clothing modeling. In European Conference on Computer Vision (ECCV).
- (2020) Deep physics-aware inference of cloth deformation for monocular human performance capture.
- (2015) SMPL: a skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia) 34 (6).
- (2020-06) Learning to dress 3d people in generative clothing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Makehuman. http://www.makehumancommunity.org. Accessed: 2020-09-30.
- Mixamo software. https://www.mixamo.com. Accessed: 2020-09-30.
- (2013) Stochastic exploration of ambiguities for nonrigid shape recovery. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 35 (2), pp. 463–475.
- (2009) Capturing 3d stretchable surfaces from single images in closed form. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1842–1849.
- (2020-06) TailorNet: predicting clothing in 3d as a function of human pose, shape and garment style. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- (2021) D-nerf: neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- (2019) 3DPeople: Modeling the Geometry of Dressed Humans. In International Conference in Computer Vision (ICCV).
- (2015) U-net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), Vol. 9351.
- (2010) Simultaneous pose, correspondence and non-rigid shape. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1189–1196.
- (2021) AVATAR: blender add-on for fast creation of 3d human models.
- (2019) Learning-Based Animation of Clothing for Virtual Try-On. Computer Graphics Forum.
- (2021) Self-Supervised Collision Handling via Generative 3D Garment Models for Virtual Try-On. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
- (2020) GAN-based garment generation using sewing pattern images. In Proceedings of the European Conference on Computer Vision (ECCV).
- (2016-05) CAMA: contact-aware matrix assembly with unified collision handling for gpu-based cloth simulation. Computer Graphics Forum 35, pp. 511–521.
- (2018) I-cloth: incremental collision handling for gpu-based interactive cloth simulation. ACM Transactions on Graphics 37 (6).
- (2020) SIZER: a dataset and model for parsing 3d clothing and learning size sensitive 3d clothing. In European Conference on Computer Vision (ECCV).
- (2017) Learning from synthetic humans. In CVPR.
- (2020) Fully Convolutional Graph Neural Networks for Parametric Virtual Try-On. Computer Graphics Forum (Proc. SCA).
- (2018) Learning a shared shape space for multimodal garment design. ACM Trans. Graph. 37 (6).
- (2019) Learning an intrinsic garment space for interactive authoring of garment animation. ACM Trans. Graph. 38 (6).
- (2018) Physics-inspired garment recovery from a single-view image. ACM Transactions on Graphics 37 (5).
- (2019-06) SimulCap: single-view human performance capture with cloth simulation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2020) Deep detail enhancement for any garment. arXiv preprint arXiv:2008.04367.