1 Introduction
Cloth animation plays an important role in many applications, such as movies, video games and virtual try-on. With the rapid development of physics-based simulation techniques terzopoulos87elastically; bridson03wrinkles; BW98; provot97collision, garment animations with remarkably realistic and detailed folding patterns can be achieved. However, these techniques require high-resolution meshes to represent fine details, and therefore need considerable computation time to solve velocity-updating equations and resolve collisions. Moreover, it is labor-intensive to tune simulation parameters for a desired wrinkling behavior.
Recently, data-driven methods wang10example; zurdo2013wrinkles; santesteban2019learning have provided alternative solutions to these problems, as they offer fast production and create wrinkling effects that closely resemble the training data. Relying on precomputed data and data-driven techniques, a high-resolution (HR) mesh is either directly synthesized or super-resolved from a physically simulated low-resolution (LR) mesh. Nevertheless, existing data-driven methods either depend on human body poses wang10example; santesteban2019learning; Feng2010transfer; deAguiar10Stable, and are thus not suitable for loose garments, or lack dynamic modeling of wrinkle behaviors zurdo2013wrinkles; kavan11physics; laehner2018deepwrinkles; chen2018synthesizing; oh2018hierarchical for the general case of free-flowing cloth. When used for wrinkle synthesis, data-driven methods may suffer from cloth penetrations zurdo2013wrinkles; santesteban2019learning; kavan11physics, even though the coarse meshes are guaranteed to be collision-free in the simulation. The penetrations in synthesized meshes are pre-existing collisions and should be resolved by an untangling scheme. However, even the state-of-the-art untangling scheme is notoriously not robust ye17unified.
To tackle these challenges, we propose a framework that synthesizes cloth wrinkles with a deep learning based method. We create three datasets from physics-based simulation as the training data. The simulation is assumed to be independent of human bodies and is not limited to tight garments. Each dataset is generated from a pair of LR and HR meshes with synchronized simulations. Given the simulated mesh pairs, we aim to map the LR meshes to the HR domain by a detail enhancement method, which is essentially a super-resolution (SR) operation. Deep SR networks have proven to be powerful and fast machine learning tools for image detail enhancement lim2017enhanced; ledig2016photo; zhang2018residual. Yet for surface meshes, which usually have irregular structures, it is not straightforward to apply traditional convolutional operations as for images. Chen et al. chen2018synthesizing proposed converting manifold meshes into geometry images gu2002geometry to solve this issue. Inspired by their work, we design a multi-feature super-resolution network (MFSR) to improve the synthesized results and model the dynamic wrinkle behaviors. The LR and HR image pairs, encoding three features (the displacement, the normal and the velocity), are fed into the network for training. Our MFSR jointly learns upsampling synthesizers with a multi-task architecture, consisting of a shared network and three task-specific networks, instead of combining all features with a single SR network. The proposed spatial and temporal losses also contribute to the generation of dynamic wrinkles and further maintain frame-to-frame consistency. At runtime, we convert the super-resolved geometry images generated by MFSR back into HR meshes. A refinement step is proposed to obtain realistic-looking and collision-free results. As our approach is based on deep neural networks, it reduces the computational cost significantly, even with the refinement step. In summary, the main contributions of our work are as follows:

We propose a novel framework for cloth wrinkle synthesis, including a multi-feature super-resolution network (MFSR), a synchronized simulation, and a collision handling method.

We learn both shared and task-specific representations of garment shapes with multiple features.

We generate dynamic wrinkles and consistent mesh sequences thanks to the spatial and temporal loss functions.
We qualitatively and quantitatively evaluate our method for various cloth types (tablecloths and long skirts) and motion sequences. Experimental results show that the quality of the synthesized garments is comparable with that from a physics-based simulation, while the computation cost is significantly reduced. To the best of our knowledge, this is the first approach to employ a multi-feature learning model for 3D dynamic wrinkle synthesis.
2 Related work
2.1 Cloth animation
A physics-based simulation for realistic fabrics includes velocity updating by physical energies terzopoulos87elastically; bridson03wrinkles; Grinspun03shell, time integration BW98; Harmon09asynchronous, collision detection and collision response provot97collision; volino95collision, etc. These modules are solved separately and are time-consuming. To improve the efficiency of this system, researchers have explored many algorithms such as implicit or semi-implicit time integration BW98; choi02stable; hauth03analysis, adaptive remeshing Narain2012AAR; weidner2018eulerian and iterative optimization liu13fast; Wang2016Descent. Nevertheless, these algorithms still require expensive computation to produce rich wrinkles, and tuning mechanical parameters for desired wrinkling behaviors remains labor-intensive. Recently, data-driven methods have drawn much attention as they offer faster cloth animations than physics-based methods. Based on precomputed data and data-driven techniques, an HR mesh is either directly synthesized or super-resolved from a physically simulated LR mesh. In the first stream of work, with precomputed data, researchers have investigated many techniques to accelerate the process for new animations, such as a linear conditional model deAguiar10Stable; Guan12DRAPE and a secondary motion graph Kim2013near; Kim2008drivenshape. Additionally, deep learning-based methods gundogdu2018garnet; Wang2018garmentdesign are also used to generate static garments on human bodies. In the other line of work, researchers have proposed to combine coarse mesh simulations with learned geometric details from paired mesh databases, to generalize the performance to complicated testing scenes.
This stream of methods includes wrinkle synthesis depending on bone clusters Feng2010transfer or human poses wang10example for fitted clothes, and linear upsampling operators kavan11physics or low-dimensional subspaces with bases zurdo2013wrinkles; Hahn2014subspace for the general case of free-flowing cloth. Inspired by these data-driven methods, we propose a deep learning based approach to synthesize wrinkles on coarse simulated meshes; our approach is independent of poses or skeletons and is not limited to tight garments. Because retrieving data from the real world is expensive and yields low-quality results, most data-driven methods use training data generated from physics-based simulation. In our experiments, we use the ARCSim system Narain2012AAR for its speed and stability.
2.2 Representation in 3D learning
Due to their irregular topology and high dimensionality, 3D meshes are more difficult for neural networks to process than images. Nevertheless, some approaches have been proposed in recent years. Wang et al. wang2017cnn propose to represent 3D shapes as voxel grids handled by an octree-based convolutional neural network (CNN) for 3D shape analysis, such as object classification, shape retrieval and part segmentation. Su et al. Su2015mvcnn learn to recognize 3D shapes from a collection of their rendered views on 2D images with standard CNNs. Li et al. li20193d use a learning-based SR framework to retrieve HR texture maps from multiple view points with the geometric information via normal maps. Representations based upon voxel grids or multi-view images are extrinsic to the shapes, which means that they may naturally fail to recognize a shape under isometric deformations. To encode intrinsic or extrinsic descriptors for CNN-based learning, a technique called geometry images gu2002geometry is used in chen2018synthesizing; sinha2016deep; Sinha2017surfnet; Maron2017convolutional for mesh classification or generation. We adopt this representation, embedding multiple features of meshes.
Feature-based methods aim for proper descriptions of irregular 3D meshes, for synthesizing detailed and realistic objects. Conventional data-driven methods zurdo2013wrinkles; Wu1996Simulation; Rohmer2010wrinkling simplify the calculation of wrinkle features by formulating the strain or stress in an LR mesh. As for deep learning, several algorithms have also investigated robust shape descriptors for wrinkle deformation. Sinha et al. sinha2016deep use geometry images encoding principal curvatures and heat kernel signatures for rigid and non-rigid shape descriptors, respectively. Geometry images embedding the position feature are proposed by Chen et al. chen2018synthesizing for cloth wrinkle synthesis. Santesteban et al. santesteban2019learning use two displacements as descriptors, one from overall deformation in the form of stretch or relaxation, and the other from additional global deformation and small-scale wrinkles. Wang et al. wang2019learning learn a shape feature descriptor from vertex positions using multi-layer perceptrons. In addition to the position or the displacement, Lähner et al. laehner2018deepwrinkles propose a wrinkle generation method learning high-frequency details from normal maps. In our approach, we cascade multiple geometric features as shape descriptors embedded in geometry images, including spatial information of the displacement and the normal, and temporal information of the velocity.
2.3 Deep CNN-based image SR
Image SR is a challenging and ill-posed problem because no prior information is available. In recent years, learning-based methods have made great progress with huge training data. Dong et al. dong2016image first use a simple 3-layer CNN to map the bicubic interpolated LR images to HR ones. Deeper networks kim2016accurate; kim2016deeply; zhang2017learning are proposed to improve the performance of SR networks. The above methods need a preprocessing step that interpolates LR images to the high resolution, because the size of images cannot be upscaled within the network. Upsampled input increases computational complexity and blurs details. To accelerate networks with more layers, Dong et al. dong2016accelerating propose a transposed convolutional layer that uses LR images as input and upscales them to the HR space. A common layer used in recent SR approaches lim2017enhanced; ledig2016photo is an efficient sub-pixel layer shi2016real that directly transforms LR features into HR images. However, all of these methods build basic blocks in a chain, ignoring information from early convolutional layers. Zhang et al. zhang2018residual propose residual dense networks (RDN) to extract and adaptively fuse features from all hierarchical layers efficiently. In our MFSR, RDN are used as basic networks.
In video SR kappeler2016video; caballero2017real; liu2017robust, consecutive frames are concatenated together as inputs to a CNN that generates HR images as outputs. Therefore, temporal coherence is addressed in the training. Yet it is difficult to decide the optimal sequence length for every motion in a video. Our data consist of animated mesh sequences, so the frame-to-frame consistency needs to be considered in synthesis. Previous work on 3D cloth synthesis paid little attention to the consistency between frames, while our work tries to address this issue by introducing the kinematics-based loss function.
3 Overview
In this paper, we propose a deep learning based method for synthesizing realistic and consistent HR cloth animations, taking physically simulated LR meshes as input. We construct three datasets: DRAPING, HITTING and SKIRT, and train each synthesizing model separately. The pipeline of our approach is illustrated in Figure 1. To generate training data, a pair of LR and HR meshes are simulated synchronously by virtual spring constraints and multi-resolution dynamic models (§ 4.2). Consequently, the LR and HR meshes are well aligned at the level of large-scale deformation and differ in the wrinkles. Given the simulated mesh pairs, we convert them into dual-resolution geometry images (§ 4.1), with each sample encoding three features: the displacement, the normal and the velocity. A multi-feature super-resolution network (MFSR) with shared layers and task-specific modules is proposed to super-resolve these images with details (§ 5). The sharing module takes multiple features as input to learn low-dimensional representations from the corresponding SR tasks simultaneously, and the task-specific module focuses on various high-level semantics for each specific task. Based on these features, we design the spatial and temporal loss functions (§ 5.2) to train our MFSR for detailed and consistent results. After training MFSR with synchronized data, the testing LR geometry images (converted from the input LR mesh) are upsampled into HR geometry images, which are then converted to a detailed HR mesh. To further improve the synthesized results, we also add a refinement step, which includes a fast normal-guided filtering for global smoothing and a collision solving step to untangle cloth penetrations (§ 6).
4 Data preparation
4.1 Data representation and conversion
Dual-resolution meshes. Before executing cloth simulation for training data generation, we need to set the initial rest state of the LR and HR meshes. As shown in the examples in Figure 2, a piece of cloth is initially defined by a polygon and then triangulated into an initial LR mesh. To obtain the corresponding HR mesh, we subdivide the LR one by progressively dividing edges until it reaches the desired resolution. In this work, the number of faces in the HR mesh is 16 times that of the LR mesh. With the rest state LR/HR meshes, we create two sets of dual-resolution frame data by physics-based simulation. The correspondence between them is maintained during the simulation, so that they exhibit identical or similar large-scale folding behavior but differ only in the fine-level wrinkles. More details about the synchronized simulation are given in § 4.2.
Dual-resolution geometry images. We convert the paired meshes into dual-resolution geometry images, embedding spatial and temporal features. The feature descriptors include the displacement, the normal and the velocity of sampling points. For any sample point inside a triangle face, these features are interpolated from the three vertices using barycentric coordinates. Different from the original geometry image paper gu2002geometry, we encode displacements instead of positions, since we are only interested in the intrinsic shape of the mesh, not its absolute spatial location. The displacement of a vertex is defined as the difference between its position in the current frame and its starting position. The vertex normal is computed as the area-weighted average of the normals of the faces adjacent to this vertex. Because the physics-based simulation uses a fixed time step, the velocity is naturally calculated from the positions in two frames (the complete calculation of our feature descriptors is provided in the supplemental materials). Since these features are not rotation invariant, we calculate a rigid motion transformation consisting of a rotation and a translation. We then apply both the rotation and the translation to the displacement, and only the rotation to the normal and the velocity. This transformation is computed by the Kabsch algorithm kabsch1978discussion, which finds an alignment between a newly deformed mesh and a unique reference mesh in the rest state (please refer to the supplementary material for the details of rigid motion invariant features). To reduce the computation cost, we only compute the rigid motion of LR meshes and apply the same transformation to the corresponding features of HR meshes. To alleviate the internal covariate shift glorot2010understanding, these features are normalized into a range of [0, 1] for our network.
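The per-vertex features described above can be illustrated with a minimal sketch. It assumes one coordinate channel at a time, a backward finite difference for the velocity, and per-channel min-max scaling for the [0, 1] normalization; these choices and the function names are illustrative assumptions, since the exact calculation is deferred to the supplemental materials.

```python
def displacement(pos, rest):
    """Displacement: current position minus starting (rest) position."""
    return [p - r for p, r in zip(pos, rest)]


def velocity(pos_curr, pos_prev, dt):
    """Velocity from positions in two frames (assumed backward difference)."""
    return [(c - p) / dt for c, p in zip(pos_curr, pos_prev)]


def normalize_unit(values):
    """Scale one feature channel into [0, 1] (assumed min-max scaling)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant channel carries no variation; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

In a full pipeline the same three functions would be applied to each of the x, y and z channels of every vertex before rasterizing the features into a geometry image.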
Mesh-to-image conversion. We convert the deformed meshes into geometry images of 9 channels. For a mesh in its rest state, we find the smallest bounding box in the 2D material space. Inside the bounding box, we then sample an array of points uniformly. For each sample point inside the mesh, we find the triangle in which it is located and compute its barycentric coordinates (BC) w.r.t. the three triangle vertices. The BC remain unchanged even though a triangle deforms during simulation. When computing features for sample points, the BC are used as weights for interpolating feature values from triangle vertices. For a mesh whose boundary coincides with the bounding box edge, we do the padding operation along boundaries. For sample points outside the mesh but inside the bounding box, their feature values are first set to zero. Similar to replicate padding, our operation changes those zero vectors into the nearest nonzero pixel values. A long skirt example is given in Figure 3.
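The barycentric sampling step can be sketched as follows. This is a minimal illustration using the standard closed-form barycentric coordinates of a 2D point; the function names and the tuple-based data layout are assumptions made for the example, not part of the described pipeline.

```python
def barycentric(p, a, b, c):
    """Barycentric coordinates of 2D point p w.r.t. triangle (a, b, c)."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    if det == 0:
        raise ValueError("degenerate triangle")
    w0 = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    w1 = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return (w0, w1, 1.0 - w0 - w1)


def interpolate(weights, fa, fb, fc):
    """Blend the feature vectors of the three vertices with the weights."""
    w0, w1, w2 = weights
    return tuple(w0 * x + w1 * y + w2 * z for x, y, z in zip(fa, fb, fc))
```

Because the coordinates are computed once in the rest-state material space, the same weights can be reused for every frame and every feature channel.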
Image-to-mesh conversion. After an HR image is synthesized, values in the displacement channel are used to restore the vertex positions of the detailed mesh, while the original topology of that mesh is retained. Due to the padding operation, every vertex in 2D material space, whether it is on the boundaries, on the seam lines, or internal to the mesh, has four nearest surrounding nonzero sample points. We restore displacements of the vertices of the detailed mesh by bilinear interpolation. In particular, each vertex on a seam line has two or more corresponding vertices in 2D material space. We restore its displacement as the weighted average of the corresponding ones. The weight for a 2D vertex is the ratio of the number of faces incident to this vertex to the number incident to the 3D vertex. These computed displacements are added to the positions of subdivided mesh vertices in the rest state to obtain wrinkle-enhanced positions. In the end, we apply the inverse of the rigid transformation, computed in the mesh-to-image phase, to the new positions. As shown on the right of Figure 3, almost no visual differences can be seen. In our quantitative experiments, the geometric reconstruction error is smaller than 1e-4, measured by the vertex mean square error (VMSE).
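A minimal sketch of the two restoration steps above, for a single scalar channel. It assumes the image is a list of rows with at least 2 x 2 pixels, that a vertex location is given in pixel units, and that the faces incident to a 3D seam vertex are exactly the union of the faces incident to its 2D copies; the function names are illustrative.

```python
def bilinear_sample(image, u, v):
    """Bilinearly interpolate image[row][col] at column u and row v."""
    rows, cols = len(image), len(image[0])
    # Clamp to the valid pixel range, then pick the top-left neighbour.
    u = min(max(u, 0.0), cols - 1.0)
    v = min(max(v, 0.0), rows - 1.0)
    c0 = min(int(u), cols - 2)
    r0 = min(int(v), rows - 2)
    fu, fv = u - c0, v - r0
    top = image[r0][c0] * (1.0 - fu) + image[r0][c0 + 1] * fu
    bottom = image[r0 + 1][c0] * (1.0 - fu) + image[r0 + 1][c0 + 1] * fu
    return top * (1.0 - fv) + bottom * fv


def seam_average(displacements, face_counts):
    """Weighted average over the 2D copies of one seam vertex.

    Each weight is the face count of a 2D copy divided by the face count
    of the 3D vertex (assumed equal to the sum over all copies).
    """
    total = sum(face_counts)
    return sum(d * n for d, n in zip(displacements, face_counts)) / total
```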
4.2 Synchronized simulation
A high-quality training dataset is equally important for data-driven approaches. In our case, we need to generate corresponding LR/HR mesh pairs in animation sequences by physics-based simulation. In image super-resolution tasks ledig2016photo; dong2016image; kim2016deeply, one way to generate a training dataset is to downsample HR images to obtain their corresponding LR ones. However, downsampling an HR cloth mesh could cause collisions, even though the HR mesh is collision-free. Therefore, it is preferred that the two meshes are simulated individually, with all collisions being properly resolved. However, as mentioned in previous works kavan11physics; zurdo2013wrinkles, if there are no constraints between the two simulations, they will bifurcate to different behaviors because of the accumulated higher frequencies generated by finer meshes and numerical errors. To solve this problem, we enforce virtual spring constraints and use multi-resolution dynamic models to construct synchronized simulation for HR meshes.
As shown in Figure 2, our dual-resolution meshes are well aligned in the initial state, because we only add vertices on the edges without changing the mesh shape. The vertices in an LR mesh, called feature vertices, show up in the HR mesh and are used as constraints for synchronized simulation. We first run the coarse cloth simulation and record the positions $\mathbf{x}^{L}$ of all feature vertices at all frames, where the superscript $L$ stands for the LR. While simulating the HR mesh, virtual springs are added to connect each pair of corresponding feature vertices between the LR mesh and the HR mesh. To pull the HR feature vertex position $\mathbf{x}^{H}$ towards its LR counterpart $\mathbf{x}^{L}$, we define an internal force following Hooke’s law halliday2013fundamentals as
$\mathbf{f} = -k\,(\mathbf{x}^{H} - \mathbf{x}^{L})$ (1)
where $k$ is a spring stiffness constant that can be adjusted depending on how tight the user wants the tracking to be. A large $k$ results in tight tracking of the feature vertices, but not of the other vertices. As a side effect, the simulated HR mesh has many annoying “spikes”.
To tackle the above issue, we propose another tracking mechanism with multi-resolution dynamic models. Given an HR mesh at the fine level (shown as the solid lines in Figure 4), we construct an LR triangle mesh at the coarse level (the dashed triangle in Figure 4). The coarse-level mesh connects the feature vertices by retaining the topology of the LR mesh. In finite-element simulations, the constitutive model includes internal cloth forces supporting behaviors such as anisotropic stretch or compression muller2004interactive and surface bending bridson03wrinkles; Grinspun03shell with damping Narain2012AAR. For a triangle in the coarse-level mesh, the in-plane stretching forces at its three vertices are measured by a corotational finite-element approach muller2004interactive, while the bending forces for two adjacent triangles are added using a discrete hinge model based on dihedral angles bridson03wrinkles; Grinspun03shell. The triangles in the fine level have the same force patterns, imposed on all particles (including feature vertices). All stretching and bending forces are accompanied by damping forces. In addition, our two-level dynamic models are independent of the force implementations, and would also work with other triangular finite-element methods. As a result, the feature vertices in the multi-resolution dynamic models receive stretching forces from both the coarse and the fine levels, and the same holds for bending forces. The remaining vertices receive forces only at the fine level. With the two-level dynamics model, modest virtual spring coefficients can make the HR mesh keep pace with the LR mesh in simulation.
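The virtual spring constraint described in this section (Hooke's law, Equation (1)) can be sketched in a few lines. The function name, the argument names and the tuple layout are illustrative assumptions; the sketch only shows that the force on an HR feature vertex is proportional to its offset from the recorded LR position and points back towards it.

```python
def spring_force(x_hr, x_lr, k):
    """Hooke's-law force pulling the HR feature vertex towards the LR one.

    x_hr, x_lr: 3D positions of a paired feature vertex (hypothetical names).
    k: spring stiffness constant; a larger k gives tighter tracking.
    """
    return tuple(-k * (h - l) for h, l in zip(x_hr, x_lr))
```

For example, an HR vertex displaced by one unit along x from its LR counterpart, with k = 2, receives a force of magnitude 2 directed back along x.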
5 Multifeature superresolution network
In this section, we introduce our MFSR architecture based on the RDN, as well as the loss functions taking spatial and temporal features into account to improve wrinkle synthesis capability.
5.1 MFSR architecture
We now introduce our MFSR architecture for the image SR tasks of multiple features. Given pairs of LR/HR images, our MFSR learns the mappings of different features by image SR networks. One standard methodology is single-task learning, which means learning one task at a time. However, it ignores a potentially rich source of information available in other tasks. Another option is multi-task learning, which achieves inductive transfer between tasks, with the goal of leveraging additional sources to improve the performance of the target task caruana1997multitask. Our MFSR is a multi-task architecture consisting of two components: a single shared network and three task-specific networks. The shared network is designed based on the SR task, whilst each task-specific network consists of a set of convolutional modules that link with the shared network. Therefore, the features in the shared network and the task-specific networks can be learned jointly to maximise the generalisation of the shared representation across multiple SR tasks, while simultaneously maximising the task-specific performance.
Figure 5 shows a detailed visualisation of our MFSR based on residual dense blocks (RDB) zhang2018residual. In the shared network, the image SR model consists of four parts: shallow feature extraction, basic blocks, dense feature fusion, and finally upsampling. We use two convolutional layers to extract shallow features, followed by the RDB zhang2018residual as the basic blocks, then dense feature fusion to extract hierarchical features, and lastly one bilinear upsampling layer to upscale the height and width of the LR feature maps by 4 times. Different from general SR tasks, we find that pixel shuffle and deconvolution methods cause apparent checkerboard artifacts, so we use the bilinear method. For basic blocks in our SR network, we employ RDB instead of the residual blocks used in chen2018synthesizing. As shown on the left of Figure 6, a residual block learns a mapping function with reference to its input, and can therefore be used to build deep networks that address the problem of vanishing gradients. However, in the residual block a convolutional layer only has a direct connection to its preceding layer, neglecting to make full use of all preceding layers. To exploit all the hierarchical features, we choose RDB (see the right of Figure 6), which consist of densely connected layers for global feature combination and local feature fusion with local residual learning. More details about RDB are given in zhang2018residual. In each task-specific network, we utilize one convolutional layer to map the extracted local and global features to the corresponding upsampled descriptor, i.e. the displacement, the normal and the velocity, respectively.
5.2 Spatial and temporal losses
In order to learn the spatial details and temporal consistency of the underlying HR meshes, our MFSR is trained by minimizing the following loss functions for mesh features. A baseline mean square error (MSE) reconstruction loss on the displacement $\mathbf{d}$ is defined as
$\mathcal{L}_{d} = \lVert \mathbf{d}^{H} - \mathbf{d}^{S} \rVert_{2}^{2}$ (2)
where the superscripts $H$ and $S$ stand for the ground truth HR and the synthesized SR, respectively. This displacement loss term is able to obtain a smooth HR result with given low-frequency information.
To extend the loss into wrinkle feature space, a novel loss for normal is introduced:
(3) 
The normal feature is directly related to the bending behavior of cloth meshes. This loss term encourages our model to learn the fine-level wrinkle features so that the outputs can stay as close to the ground truth as possible. In our experiments, it aids the networks in creating realistic details.
The above two loss terms are utilized to reconstruct high-frequency details exclusively from spatial statistics. To improve the consistency of animation sequences, we should also take the temporal coherence into account. The vertex velocities of every animation frame contribute a velocity loss of the form
(4) 
In addition, we minimize a kinematics-based loss in the training stage, to constrain the relationship between synthesized velocities and displacements (please refer to the supplementary material for the detailed derivation) as
(5) 
This loss involves the number of frames associated with the input frame and the time step between consecutive frames. This kinematics-inspired loss term can improve the consistency between the generated cloth animations.
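The overall loss of our MFSR is defined as
$\mathcal{L} = \mathcal{L}_{d} + \lambda_{n}\,\mathcal{L}_{n} + \lambda_{v}\,\mathcal{L}_{v} + \lambda_{k}\,\mathcal{L}_{k}$ (6)
where $\mathcal{L}_{d}$, $\mathcal{L}_{n}$, $\mathcal{L}_{v}$ and $\mathcal{L}_{k}$ are the displacement, normal, velocity and kinematics-based losses of Equations (2) to (5). It is a linear combination of the spatial smoothness, detail similarity, temporal consistency and kinematic loss terms with the weight factors $\lambda_{n}$, $\lambda_{v}$ and $\lambda_{k}$. As for back propagation, each loss term propagates backwards through its task-specific layer independently. In the shared layers, parameters are updated according to the total loss $\mathcal{L}$. As a result, the gradients of the loss functions from multiple SR tasks pass through the shared layers directly, and the shared layers learn a common representation for all related tasks.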
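How the loss terms combine can be sketched on flat lists of values. The exact forms of the normal, velocity and kinematics-based losses are not reproduced here: using a mean square error for every feature and a one-step finite-difference residual for the kinematic term are assumptions made for illustration, as are the function names and the weight arguments.

```python
def mse(a, b):
    """Mean square error between two equally sized lists of values."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def kinematic_residual(d_curr, d_next, v_curr, dt):
    """Assumed kinematic check: displacement change should match v * dt."""
    change = [n - c for n, c in zip(d_next, d_curr)]
    expected = [v * dt for v in v_curr]
    return mse(change, expected)


def total_loss(l_d, l_n, l_v, l_k, w_n, w_v, w_k):
    """Linear combination of the four terms with three weight factors."""
    return l_d + w_n * l_n + w_v * l_v + w_k * l_k
```

In a training framework each term would be computed on the output of its task-specific branch, and the summed value would drive the update of the shared layers.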
6 Refinement
Given a new LR mesh, we convert it to geometry images and our MFSR predicts the corresponding upsampled images. After converting the HR images to a detailed cloth mesh, we find that the results may suffer from unexpected rough effects and penetrations. To solve this problem, we design a refinement step including a fast feature-preserving filtering for global smoothness and a collision solving step to untangle cloth penetrations.
Feature-preserving filtering. We observe that our MFSR facilitates the generation of abundant wrinkles. However, these wrinkles are accompanied by some unexpected rough effects. This is because our simulated HR meshes have abundant wrinkles, which may introduce “noise” to the networks, and because of some numerical instabilities in physics-based simulation. We tried to follow prior works on image super-resolution aly2005image; zhang2010non that use a total variation loss function as a regularization mechanism to encourage spatial smoothness; however, this did not work in our case. To deal with this problem, we adapt a feature-preserving mesh denoising method sun2007fast, which can remove noise effectively without smoothing out wrinkle features. Super-resolved normals are used as the guidance to update the positions of vertices in the corresponding mesh. For each mesh, updating vertex positions for only one step is good enough to obtain a visually much smoother result (see Figure 7). In our experiment, we set the number of iterations to 5 and it takes about 0.01s per frame, which is efficient for our application.
Collision handling. For many data-driven methods, penetrations cannot be completely avoided in the synthesized mesh. For the simulation of either the LR mesh or the HR mesh in the training stage, penetrations are avoided by enforcing continuous collision detection and response Harmon08robust. This scheme also guarantees the runtime simulation of an LR mesh to be collision-free. However, the synthesized mesh may penetrate itself or other obstacles. Considering each detailed mesh alone, these penetrations are actually pre-existing collisions and should be resolved by an untangling scheme. However, even the state-of-the-art untangling scheme is notoriously not robust ye17unified. We propose a collision response method that is guaranteed to resolve all collisions in our cloth synthesis.
The state of a cloth mesh, when embedded in 3D space, can be denoted by the positions of its vertices. Given a simulated LR mesh in the collision-free state, we can divide the edges twice to obtain a subdivided version with vertex positions $\mathbf{x}^{sub}$. The subdivided mesh has the same shape as the LR mesh, without penetrations. As for the synthesized mesh with vertex positions $\mathbf{x}^{syn}$, collisions happen but plausible wrinkles are augmented. Our collision handling problem then turns into properly interpolating between the subdivided mesh and the synthesized mesh, so that the new mesh with vertex positions $\mathbf{x}^{new}$ is collision-free and has as many wrinkles as possible. The interpolation can be expressed as
$\mathbf{x}^{new} = (\mathbf{I} - \mathbf{W})\,\mathbf{x}^{sub} + \mathbf{W}\,\mathbf{x}^{syn}$ (7)
where $\mathbf{I}$ is an identity matrix, and $\mathbf{W}$ is the diagonal weight matrix to be solved. In implementation, we use the bisection method burden1985bisection to search for a close-to-optimal interpolation weight. We iteratively bisect in the range of $[0, 1]$ for the elements in $\mathbf{W}$, which means the mesh is collision-free at the beginning of the range but collides at the end. For each bisection, if the middle point is collision-free, it is set as the beginning of the new range; otherwise it is set as the end of the new range. We do the bisection three or four times and then take the last collision-free state as the interpolation result (see Figure 7(d)).
In addition, the above collision resolving process can be further optimized. It is not necessary to let all vertices of the whole mesh get involved in the position interpolation. Instead, only the vertices involved in the intersections are of interest. These vertices can be identified by a discrete collision detection process and grouped into impact zones as done in ye17unified. Position interpolations are performed per zone, and each zone has different interpolation weights. In this way, the synthesized meshes are least affected by the collision handling.
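The bisection search for an interpolation weight can be sketched as below, for a single impact zone with one scalar weight. The collision test is passed in as a callable because the actual detection routine is outside the scope of this sketch; `is_collision_free`, `bisect_weight` and `blend` are hypothetical names, and the weight range of 0 (subdivided mesh) to 1 (synthesized mesh) is an assumption consistent with the description above.

```python
def bisect_weight(is_collision_free, iterations=4):
    """Search for the largest tested weight that keeps the zone collision-free.

    is_collision_free(w) -> bool is a user-supplied predicate (assumed).
    """
    lo, hi = 0.0, 1.0  # lo: known collision-free, hi: known colliding
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if is_collision_free(mid):
            lo = mid  # middle point becomes the beginning of the new range
        else:
            hi = mid  # middle point becomes the end of the new range
    return lo  # last collision-free state


def blend(x_sub, x_syn, w):
    """Interpolate between subdivided and synthesized coordinates."""
    return [(1.0 - w) * a + w * b for a, b in zip(x_sub, x_syn)]
```

With three or four iterations the weight is resolved to within 1/8 or 1/16 of the full range, which matches the small iteration count reported in the text.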
7 Implementation
We describe the details of the data generation and the network architecture in this section.
Data generation. To generate data for our MFSR training, we construct three datasets using a tablecloth model and a skirt model with character motions. The two different models are of regular and irregular garment shapes, respectively. The meshes in each dataset are simulated from a template model to ensure a fixed size.
For the tablecloths, the LR and HR meshes have 749 and 11,393 vertices respectively. Using the tablecloth model we generate two datasets, called DRAPING and HITTING (see Figure 8). The DRAPING dataset is created by randomly handling one of the topmost vertices of the tablecloth and letting the fabric fall freely. It contains 13 simulation sequences, each with 400 frames. Ten sequences are randomly selected for training and the remaining three sequences are used for testing. In addition to simulating a piece of tablecloth in a free environment, we also construct a HITTING dataset where a sphere interacts with the tablecloth. Specifically, we select spheres of different radii to hit the tablecloth back and forth at different locations, and obtain a total of 35 simulation sequences, with 1,000 frames for each sequence.
We also generate a dataset called SKIRT, for the long skirt garments worn by animated characters (shown in Figure 8). The LR and HR skirt meshes have 1,303 and 19,798 vertices, respectively. The mannequin has rigid parts as in Narain2012AAR and is driven by publicly available motion capture data from CMU hodgins2015cmu. Specifically, we select 7 dancing motion sequences (30,000 frames in total), of which 5 are randomly selected for training and the remaining 2 are used for testing. Since some dancing motions are too fast for physics-based simulation, we slow them down by interpolating 8 times between each two adjacent motion frames of the original CMU data.
We use the ARCSim cloth simulation engine Narain2012AAR to produce all simulations, but without the remeshing operation. ARCSim requires material parameters for simulation. In our experiments, we choose the Gray Interlock, for its anisotropic behavior, from a library of measured cloth materials Wang2011DEM. Another requirement of ARCSim is a collision-free initial state between the garment and the obstacles. For the tablecloth simulation, we can easily set up a rectangular sheet and place the obstacles without collision. As for the long skirts, we first manually put the skirt on a template mannequin (T pose) to ensure a collision-free state. Then, we interpolate 80 motion frames between the T pose and the initial pose of each motion sequence. With these interpolated motions, we run the simulations of the long skirts worn by the mannequins in various poses, starting from the collision-free initial state. In addition, for synchronized simulation, we set the spring stiffness constant in the equation (1).
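The in-between frames mentioned above (80 frames from the T pose to the first pose of a sequence, and the extra frames used to slow down fast motions) can be illustrated with a small sketch. It assumes, purely for illustration, that a pose is a flat list of joint parameters blended linearly; the function name `inbetween_poses` is hypothetical, and a production pipeline would more likely interpolate joint rotations with quaternions.

```python
from typing import List, Sequence


def inbetween_poses(pose_a: Sequence[float],
                    pose_b: Sequence[float],
                    n_frames: int) -> List[List[float]]:
    """Return n_frames poses strictly between pose_a and pose_b.

    Assumption: linear blending of the raw pose parameters is acceptable.
    """
    frames = []
    for k in range(1, n_frames + 1):
        t = k / (n_frames + 1)  # evenly spaced, excluding both endpoints
        frames.append([a + t * (b - a) for a, b in zip(pose_a, pose_b)])
    return frames


if __name__ == "__main__":
    t_pose = [0.0, 0.0, 0.0]        # hypothetical joint parameters
    first_pose = [0.8, -0.4, 1.2]   # hypothetical joint parameters
    warmup = inbetween_poses(t_pose, first_pose, 80)
    print(len(warmup), warmup[0], warmup[-1])
```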
Network architecture. For the three simulated datasets, we train three MFSR networks, one per dataset. Our proposed MFSR consists of shared and task-specific layers. The shared network has 16 identical RDBs zhang2018residual, each with six densely connected layers, and the growth rate is set to 32. The basic network settings, such as the convolutional kernel and activation function, are set according to zhang2018residual. For the upscaling operation, i.e., from the coarse-resolution features to fine ones, we consider several different mechanisms, e.g., the pixel shuffle module shi2016real, deconvolution, nearest-neighbor and bilinear upscaling, and finally choose the bilinear upscaling layer because it can prevent checkerboard artifacts in the generated meshes. In our upsampling network, the upscale factor is set to 4. The upscale factor (in one dimension) for the corresponding meshes is set to be as close to 4 as possible. For example, the LR and HR tablecloth meshes have 749 and 11,393 vertices, respectively, the latter being roughly 16 times as many as the former. Converting meshes to images, we set the size of LR images in tablecloth to be , and HR ones . The image aspect ratio is the same as the uv proportion in material space to achieve uniform sampling.
We implement our network using PyTorch 1.0.0. In each training batch, we randomly extract 16 LR/HR pairs with the size of and as input. The Adam optimizer kingma2014adam is used to train our network, and its $\beta_1$ and $\beta_2$ are both set to 0.9. The base learning rate is initialized to 1e-4 and is divided by 10 every 20 epochs. To prevent the learning rate from becoming too small, we fix it after 60 epochs. The training procedure stops after 120 epochs. The training of each model takes about a day and a half on an NVIDIA® GeForce® GTX 1080Ti GPU. In all our experiments, we set the length of the input frames for the kinematics-based loss in the equation (6). Besides, we set the weights and in the equation (6).
Table 1: Average per-frame execution time of the tracked simulation and of our method. The last four columns break down our method into its components.
| Benchmark | #verts LR | #verts HR | tracked sim. | ours | speedup | coarse sim. | mesh/image conversion | synthesizing (GPU) | refinement |
|---|---|---|---|---|---|---|---|---|---|
| DRAPING | 749 | 11,393 | 4.27 | 0.345 | 12 | 0.129 | 0.089 | 0.0553 | 0.0718 |
| HITTING | 749 | 11,393 | 4.38 | 0.341 | 13 | 0.135 | 0.109 | 0.0531 | 0.0434 |
| SKIRT | 1,303 | 19,798 | 10.23 | 0.709 | 14 | 0.227 | 0.18 | 0.0281 | 0.274 |
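The learning-rate schedule from the training setup above (base rate 1e-4, divided by 10 every 20 epochs, fixed after 60 epochs) can be written as a small function. This is a sketch under one assumption that the text leaves open: the rate is frozen at the value reached at epoch 60, i.e. after three decays. The function name `learning_rate` and its parameters are hypothetical.

```python
def learning_rate(epoch: int,
                  base_lr: float = 1e-4,
                  step: int = 20,
                  gamma: float = 0.1,
                  freeze_after: int = 60) -> float:
    """Step decay that stops changing once `freeze_after` epochs have passed.

    Assumption: the decay at epoch 60 is still applied, then the rate is fixed.
    """
    decays = min(epoch, freeze_after) // step
    return base_lr * gamma ** decays


if __name__ == "__main__":
    for epoch in (0, 19, 20, 40, 60, 119):
        print(epoch, learning_rate(epoch))
```

If the rate were instead frozen at the value in use just before epoch 60, `freeze_after` would be 59 under this formulation; the difference is a single factor of 10 in the final 60 epochs.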
8 Results
In this section, we evaluate the results obtained with our method both quantitatively and qualitatively. The runtime performance and visual fidelity are demonstrated with various scenes: draping and hitting tablecloths, and long skirts worn by an animated character. We compare our results against simulation methods and demonstrate the benefits of our method for cloth wrinkle synthesis. The effectiveness of our network components is also analyzed for various loss functions and network architectures.
8.1 Runtime performance
We run our method on a 2.50 GHz 4-core Intel CPU for coarse simulation and mesh/image conversion, and an NVIDIA® GeForce® GTX 1080Ti GPU for image synthesizing. Table 1 shows the average per-frame execution time of our method for the different garment resolutions. The execution time contains four parts: coarse simulation, mesh/image conversion, image synthesizing, and refinement. For reference, we also report the simulation timings of a CPU-based implementation of tracked high-resolution simulation using ARCSim Narain2012AAR. On average, our algorithm is 13 times faster than the tracked simulation. This low computational cost makes our method suitable for interactive applications.
8.2 Wrinkle synthesis results and comparisons
Generalization to new hanging. We use the training data in the DRAPING dataset to learn a synthesizer, then evaluate the generalization to new hanging vertices. Figure 9 shows the deformations of tablecloths in three test sequences of the DRAPING dataset. The row labeled “GT” in Figure 9 shows the HR meshes of the tracked physics-based simulation for reference, while the row labeled “Ours” shows the result of our data-driven method. We find that our approach successfully produces realistic and abundant wrinkles in different deformation sequences; in particular, many medium- and small-scale wrinkles appear as the tablecloths fall from different directions.
Generalization to new balls. In Figure 10, we visually evaluate the quality of our algorithm on the HITTING dataset, which illustrates the performance when generalizing to new crashing balls of various sizes and initial positions. We show four test examples comparing the ground-truth HR meshes of the tracked simulation with our method. For testing, the initial positions of the balls are set to four different places that are unseen in the training data. Additionally, in the third and fourth columns of Figure 10, the diameter of the ball is set to 0.5, which is also a new size not used for training. When balls of various sizes crash into the cloth at different positions, our method successfully predicts plausible wrinkles, while running 12 times faster than physics-based simulation.
Generalization to new motions. In Figure 11, we show the deformed long skirt produced by our approach on the mannequin as its pose changes over time. The human poses are from two testing motion sequences in the subject of modern dance and in the subject of lambada dance hodgins2015cmu. We visually compare the results of our algorithm with the ground-truth simulation. The mid-scale wrinkles are successfully predicted by our approach when generalizing to various dancing motions not in the training set. For instance, in the first column of Figure 11, the skirt slides forward and forms plausible wrinkles due to the extended, straight leg of the character in a sideways arabesque pose. Please see the accompanying video for more animated results and further comparisons on the dancing sequences.
Comparison with other methods. To accelerate conventional physically-based simulations, chen2018synthesizing and oh2018hierarchical use deep learning based methods to generate cloth animations. Oh et al. oh2018hierarchical introduce a fast and reliable hierarchical cloth animation algorithm, simulating the coarsest level with a physics-based method and generating the more detailed levels by inference of deep neural networks. However, it relies on a fully connected network to model the detailed levels for each triangle separately, and is thus not appropriate for learning wrinkling behaviors. We compare our method with mesh SR chen2018synthesizing, a CNN-based method to synthesize cloth wrinkles. The performance is evaluated on the tablecloth datasets, both DRAPING and HITTING. The training settings of our network are described in § 7, and mesh SR is trained using the settings reported in their paper chen2018synthesizing.
The peak signal-to-noise ratio (PSNR) is a widely used metric for quantitatively evaluating image restoration quality, while the vertex-wise mean square error (VMSE), which is computed as the per-vertex error averaged over all vertices and frames, is usually used for evaluating the quality of reconstructed meshes. In this work, we choose these two metrics to quantitatively compare our method with chen2018synthesizing, and the corresponding results are reported in Table 2. Compared with mesh SR, our MFSR improves the performance significantly and obtains a higher PSNR. This gain is attributable to the RDBs, which address a drawback of plain residual blocks, namely that they do not use all preceding features. With better super-resolved images and the refinement step, our MFSR further reaches a lower VMSE, indicating better performance and generalization on those datasets.
In Figure 12, we show visual results of our MFSR and mesh SR. Given the same LR meshes in the testing stage, our MFSR successfully produces rich and consistent wrinkles thanks to its multiple features, whereas mesh SR, relying on the position alone, only approximates the wrinkles inaccurately. The velocity and kinematics-based loss functions also lead to more stable results than mesh SR (please refer to the accompanying video). It has been reported that mesh SR can generate large-scale folds when the resolution of the training pairs is low (i.e., the number of vertices of the LR meshes is decreased to 200). However, in our experiments we find that mesh SR is unable to converge close to the ground truth for complicated wrinkle styles. Its results suffer from small, unrealistic wrinkles that resemble noise artifacts. In contrast, our method synthesizes the HR meshes in a physically realistic manner. The differences between the results and the ground truth are highlighted in Figure 12 using color coding. For mesh SR, the color coding clearly highlights the bottom-left and bottom-right corners and the wrinkle lines, whereas our results are closer to the ground truth.
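Table 2: PSNR/VMSE of mesh SR (Chen et al. 2018) and of our method.
| Benchmark | Chen et al. 2018 (PSNR/VMSE) | Ours (PSNR/VMSE) |
|---|---|---|
| DRAPING | 59.07/4.19e-4 | 68.91/7.09e-5 |
| HITTING | 59.15/1.17e-4 | 72.25/4.69e-5 |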
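For reference, the two metrics used in Table 2 can be computed as in the following sketch. Two assumptions are made here and are not stated in the text: the PSNR peak value is 1.0 (images normalized to [0, 1]), and the per-vertex error in VMSE is the squared Euclidean distance between predicted and ground-truth positions. The function names `psnr` and `vmse` are illustrative.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two flattened images.

    Assumption: `peak` is the maximum possible pixel value (1.0 here).
    """
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / mse)


def vmse(pred_frames, gt_frames) -> float:
    """Squared per-vertex error averaged over all vertices and frames.

    Each frame is a list of vertices; each vertex is a list of coordinates.
    Assumption: the per-vertex error is the squared Euclidean distance.
    """
    total, count = 0.0, 0
    for pred, gt in zip(pred_frames, gt_frames):
        for p, g in zip(pred, gt):
            total += sum((a - b) ** 2 for a, b in zip(p, g))
            count += 1
    return total / count


if __name__ == "__main__":
    print(psnr([0.0, 0.5, 1.0], [0.1, 0.4, 1.0]))
    print(vmse([[[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]],
               [[[0.1, 0.0, 0.0], [1.0, 1.0, 1.0]]]))
```

A higher PSNR and a lower VMSE both indicate a closer match to the ground truth, which is the direction of improvement reported for our method.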
In addition, we discuss the improvements of our method over state-of-the-art data-driven methods that do not use deep networks. As mentioned in zurdo2013wrinkles, their method handles only quasi-static wrinkle formation without dynamics information, so it cannot capture the richness of wrinkles in a flag. They use only the edge ratio between the current and the rest state as mesh descriptors; in contrast, our algorithm enhances the LR deformation using displacement, normal and velocity descriptors, covering both spatial and dynamic information. As shown in Figure 9, our technique can reproduce dynamic effects such as travelling waves. Another limitation mentioned in their work is possible cloth interpenetration: although the penetrations in the LR cloth are resolved, the generated HR meshes may still suffer from collisions. At a controllable cost (see Table 1), we solve this problem using discrete collision detection and an interpolation-based collision response algorithm.
8.3 Ablation study
Next, we study the effect of different components of our proposed network, including loss function and network architecture.
Loss function. To demonstrate the effectiveness of our proposed loss functions, we conduct experiments with different loss combinations on the three datasets, i.e., DRAPING, HITTING, and SKIRT. The training and testing datasets are selected as described in § 7. We use the displacement loss as the baseline and progressively add the remaining loss terms of our MFSR to obtain comparative results for the different loss terms.
Table 3 reports the PSNR between the generated displacement images and the ground truth for various settings of the loss functions. Red text indicates the best performance and blue text indicates the second-best. The results show that our proposed algorithm, which combines all our proposed loss terms in a multi-task learning framework, achieves either the best or the second-best performance. Notice that without the constraints of the velocity and kinematics-based loss functions, the normal loss may decrease the final testing PSNR, although it encourages wrinkle generation in the SR results.
| Benchmark |  |  |  |  |  |
|---|---|---|---|---|---|
| DRAPING | 67.90 | 62.83 | 67.92 | 63.47 | 68.68 |
| HITTING | 67.11 | 68.23 | 71.41 | 69.75 | 71.26 |
| SKIRT | 62.48 | 60.76 | 62.48 | 61.31 | 62.88 |
Network architecture. To further investigate the performance of different SR networks (SRResNet and RDN), we conduct an experiment on the DRAPING dataset. In particular, we validate our cloth animation results on 800 randomly selected pairs of LR/HR meshes from the DRAPING dataset, which are excluded from the training set and cover different complex hanging motions in pendulum movement. In Figure 13, we depict the convergence curves of three different features on this validation set. The curves show that RDN achieves better performance than SRResNet and also stabilizes the training process for all three features. The improved performance and stability result from the contiguous memory, local residual learning and global feature fusion in RDN. In SRResNet, local convolutional layers do not have direct access to the subsequent layers, so the network does not fully use the information of each convolutional layer.
9 Conclusions and future work
In this paper, we have presented a novel deep learning based framework to synthesize cloth animations with abundant wrinkles. Our evaluations show that the spatial and temporal features can be augmented with high-frequency details using a multi-feature super-resolution network. The proposed network consists of a sharing module that jointly learns the low-level representation and task-specific modules that focus on high-level semantics. We add a kinematics-based loss to the network objective to maintain frame-to-frame consistency across time. Quantitative and qualitative results reveal that our method can synthesize realistic-looking wrinkles in various scenes, such as draping cloth and garments interacting with moving balls and human bodies. We also give details on how to create paired meshes using a synchronized simulation, which helps in constructing large 3D datasets. These aligned coarse and fine meshes can also be used for other related applications such as 3D shape matching of incompatible shape structures. In addition, our collision handling algorithm is independent of the wrinkle synthesis implementation and can therefore be combined with other data-driven methods. To the best of our knowledge, our approach is the first to consider multiple features, including geometry and time consistency, for 3D dynamic wrinkle synthesis, and it can be conveniently generalized to have more tasks cascaded together.
Nevertheless, several limitations remain open for future work. Since our data are simulated sequences, we plan to investigate recurrent SR networks to capture the dynamics and potentially improve the temporal consistency. In our work, the training data consists of paired LR/HR meshes generated by a synchronized simulation. While tracking the LR cloth, the HR cloth cannot show the dynamic properties of a full simulation. We would like to address this limitation by applying unsupervised learning to unpaired data. In addition, the dataset could be further expanded to include more scenes, motion sequences, and garment shapes.