Multi-feature super-resolution network for cloth wrinkle synthesis

04/09/2020 ∙ by Lan Chen, et al. ∙ 0

Existing physical cloth simulators suffer from expensive computation and from the difficulty of tuning mechanical parameters to obtain desired wrinkling behaviors. Data-driven methods provide an alternative solution. They typically synthesize cloth animation at a much lower computational cost, and also create wrinkling effects that closely resemble the more controllable training data. In this paper we propose a deep learning based method for synthesizing cloth animation with high resolution meshes. To do this we first create a dataset for training: a pair of low and high resolution meshes are simulated and their motions are synchronized. As a result the two meshes exhibit similar large-scale deformation but different small wrinkles. Each simulated mesh pair is then converted into a pair of low and high resolution "images" (a 2D array of samples), where each sample can be interpreted as any of three features: the displacement, the normal and the velocity. With these image pairs, we design a multi-feature super-resolution (MFSR) network that jointly trains an upsampling synthesizer for the three features. The MFSR architecture consists of two key components: a sharing module that takes multiple features as input to learn low-level representations from the corresponding super-resolution tasks simultaneously, and task-specific modules focusing on various high-level semantics. Frame-to-frame consistency is well maintained thanks to the proposed kinematics-based loss function. Our method achieves realistic results at high frame rates: 12-14 times faster than traditional physical simulation. We demonstrate the performance of our method with various experimental scenes, including a dressed character with sophisticated collisions.




1 Introduction

Cloth animation plays an important role in many applications, such as movies, video games and virtual try-on. With the rapid development of physics-based simulation techniques terzopoulos87elastically; bridson03wrinkles; BW98; provot97collision, garment animations with remarkably realistic and detailed folding patterns can be achieved. However, these techniques require high resolution meshes to represent fine details, and therefore need much computational time to solve velocity-updating equations and resolve collisions. Moreover, it is labor-intensive to tune simulation parameters for a desired wrinkling behavior.

Recently, data-driven methods wang10example; zurdo2013wrinkles; santesteban2019learning have provided alternative solutions to these problems, as they offer fast production and also create wrinkling effects that closely resemble the training data. Relying on precomputed data and data-driven techniques, a high resolution (HR) mesh is either directly synthesized, or super-resolved from a physically simulated low resolution (LR) mesh. Nevertheless, existing data-driven methods either depend on human body poses wang10example; santesteban2019learning; Feng2010transfer; deAguiar10Stable and are thus not suitable for loose garments, or lack dynamic modeling of wrinkle behaviors zurdo2013wrinkles; kavan11physics; laehner2018deepwrinkles; chen2018synthesizing; oh2018hierarchical for the general case of free-flowing cloth. When used for wrinkle synthesis, data-driven methods may suffer from cloth penetrations zurdo2013wrinkles; santesteban2019learning; kavan11physics, even though the coarse meshes are guaranteed to be collision-free in the simulation. The penetrations in synthesized meshes are pre-existing collisions and should be resolved by an untangling scheme. However, even state-of-the-art untangling schemes are notoriously not robust ye17unified.

To tackle these challenges, we propose a framework that synthesizes cloth wrinkles with a deep learning based method. We create three datasets from physics-based simulation as the training data. The simulation is assumed to be independent of human bodies and is not limited to tight garments. Each dataset is generated from a pair of LR and HR meshes with synchronized simulations. Given the simulated mesh pairs, we aim to map the LR meshes to the HR domain by a detail enhancement method, which is essentially a super-resolution (SR) operation. Deep SR networks have proven to be powerful and fast machine learning tools for image detail enhancement lim2017enhanced; ledig2016photo; zhang2018residual. Yet for surface meshes, which usually have irregular structures, it is not straightforward to apply traditional convolutional operations as for images. Chen et al. chen2018synthesizing proposed converting manifold meshes into geometry images gu2002geometry to solve this issue. Inspired by their work, we design a multi-feature super-resolution network (MFSR) to improve the synthesized results and model the dynamic wrinkle behaviors. The LR and HR image pairs, encoding three features (the displacement, the normal and the velocity), are fed into the network for training. Our MFSR jointly learns upsampling synthesizers with a multi-task architecture, consisting of a shared network and three task-specific networks, instead of combining all features in a single SR network. The proposed spatial and temporal losses also contribute to the generation of dynamic wrinkles and further maintain frame-to-frame consistency. At runtime, we convert the super-resolved geometry images generated by MFSR back into HR meshes. A refinement step is proposed to obtain realistic-looking and collision-free results. As our approach is based on deep neural networks, it reduces the computational cost significantly, even with the refinement step. In summary, the main contributions of our work are as follows:

  • We propose a novel framework for cloth wrinkle synthesis, including a multi-feature super-resolution network (MFSR), a synchronized simulation, and a collision handling method.

  • We learn both shared and task-specific representations of garment shapes with multiple features.

  • We generate dynamic wrinkles and consistent mesh sequences thanks to the spatial and temporal loss functions.

We qualitatively and quantitatively evaluate our method on various cloth types (tablecloths and long skirts) and motion sequences. Experimental results show that the quality of the synthesized garments is comparable with that of a physics-based simulation, while the computation cost is significantly reduced. To the best of our knowledge, this is the first approach to employ a multi-feature learning model for 3D dynamic wrinkle synthesis.

2 Related work

2.1 Cloth animation

A physics-based simulation of realistic fabrics includes velocity updating by physical energies terzopoulos87elastically; bridson03wrinkles; Grinspun03shell, time integration BW98; Harmon09asynchronous, collision detection and collision response provot97collision; volino95collision, etc. These modules are solved separately and are time-consuming. To improve the efficiency of this system, researchers have explored many algorithms, such as implicit or semi-implicit time integration BW98; choi02stable; hauth03analysis, adaptive remeshing Narain2012AAR; weidner2018eulerian and iterative optimization liu13fast; Wang2016Descent. Nevertheless, these algorithms still require expensive computation to produce rich wrinkles, and tuning mechanical parameters for desired wrinkling behaviors remains laborious. Recently, data-driven methods have drawn much attention as they offer faster cloth animation than physics-based methods. Based on precomputed data and data-driven techniques, an HR mesh is either directly synthesized, or super-resolved from a physically simulated LR mesh. In the first stream of work, researchers have investigated many techniques that use precomputed data to accelerate the production of new animations, such as a linear conditional model deAguiar10Stable; Guan12DRAPE and a secondary motion graph Kim2013near; Kim2008drivenshape. Additionally, deep learning-based methods gundogdu2018garnet; Wang2018garmentdesign are also used to generate static garments on human bodies. In the other line of work, researchers have proposed to combine coarse mesh simulations with geometric details learned from paired mesh databases, in order to generalize to complicated testing scenes. This stream of methods includes wrinkle synthesis depending on bone clusters Feng2010transfer or human poses wang10example for fitted clothes, and linear upsampling operators kavan11physics or low-dimensional subspaces with bases zurdo2013wrinkles; Hahn2014subspace for the general case of free-flowing cloth. Inspired by these data-driven methods, we propose a deep learning based approach to synthesize wrinkles on coarse simulated meshes; our approach is independent of poses or skeletons and is not limited to tight garments. Because data captured from the real world is expensive to acquire and of low quality, most data-driven methods use training data generated by physics-based simulation. In our experiments, we use the ARCSim system Narain2012AAR for its speed and stability.

Figure 1: Pipeline of our MFSR for cloth wrinkle synthesis. We generate three datasets of LR and HR mesh sequences by synchronized simulation. In the training stage, meshes are converted into geometry images (GI) encoding the displacement, the normal and the velocity of the sampled points. Then they are fed into our MFSR for training. For three datasets, we train three models separately. At runtime stage, LR geometry images (converted from the input LR mesh) are upsampled into HR geometry images, which are converted to a detailed mesh.

2.2 Representation in 3D learning

Due to their irregular topology and high dimension, 3D meshes are more difficult for neural networks to process than images. Nevertheless, some approaches have been proposed in recent years. Wang et al. wang2017cnn propose to represent 3D shapes as voxel grids and process them with octree-based convolutional neural networks (CNNs) for 3D shape analysis, such as object classification, shape retrieval and part segmentation. Su et al. Su2015mvcnn learn to recognize 3D shapes from a collection of their rendered views on 2D images with standard CNNs. Li et al. li20193d use a learning-based SR framework to retrieve HR texture maps from multiple viewpoints, with the geometric information provided via normal maps. Representations based upon voxel grids or multi-view images are extrinsic to the shapes, which means that they may naturally fail to recognize a shape under isometric deformations. To encode intrinsic or extrinsic descriptors for CNN-based learning, a technique called geometry images gu2002geometry is used in chen2018synthesizing; sinha2016deep; Sinha2017surfnet; Maron2017convolutional for mesh classification or generation. We adopt this representation and embed multiple mesh features in it.

Feature-based methods aim for proper descriptions of irregular 3D meshes, in order to synthesize detailed and realistic objects. Conventional data-driven methods zurdo2013wrinkles; Wu1996Simulation; Rohmer2010wrinkling simplify the calculation of wrinkle features by formulating the strain or stress in an LR mesh. As for deep learning, several algorithms have also investigated robust shape descriptors for wrinkle deformation. Sinha et al. sinha2016deep use geometry images encoding principal curvatures and heat kernel signatures as rigid and non-rigid shape descriptors, respectively. Geometry images embedding the position feature are proposed by Chen et al. chen2018synthesizing for cloth wrinkle synthesis. Santesteban et al. santesteban2019learning use two displacements as descriptors, one from the overall deformation in the form of stretch or relaxation, and the other from additional global deformation and small-scale wrinkles. Wang et al. wang2019learning learn a shape feature descriptor from vertex positions using multilayer perceptrons. In addition to the position or the displacement, Lähner et al. laehner2018deepwrinkles propose a wrinkle generation method that learns high-frequency details from normal maps. In our approach, we cascade multiple geometric features as shape descriptors embedded in geometry images, including the spatial information of the displacement and the normal, and the temporal information of the velocity.

2.3 Deep CNN-based image SR

Image SR is a challenging and ill-posed problem because no prior information is available. In recent years, learning-based methods have made great progress with large amounts of training data. Dong et al. dong2016image first use a simple 3-layer CNN to map bicubically interpolated LR images to HR ones. Deeper networks kim2016accurate; kim2016deeply; zhang2017learning are proposed to improve the results of SR networks. These methods require a preprocessing step that interpolates the LR images to the high resolution, because the images cannot be upscaled within the network. Upsampled input increases computational complexity and blurs details. To accelerate networks with more layers, Dong et al. dong2016accelerating propose a transposed convolutional layer that takes LR images as input and upscales them to the HR space. A common layer used in recent SR approaches lim2017enhanced; ledig2016photo is the efficient sub-pixel layer shi2016real, which directly transforms LR features into HR images. However, all of these methods chain their basic blocks sequentially, ignoring information from early convolutional layers. Zhang et al. zhang2018residual propose residual dense networks (RDN) to extract and adaptively fuse features from all hierarchical layers efficiently. In our MFSR, RDNs are used as the basic networks.

In video SR kappeler2016video; caballero2017real; liu2017robust, consecutive frames are concatenated together as inputs to a CNN that generates HR images as outputs. Temporal coherence is therefore addressed during training. Yet it is difficult to decide the optimal sequence length for every motion in a video. Our data consists of animated mesh sequences, so frame-to-frame consistency needs to be considered in the synthesis. Previous work on 3D cloth synthesis paid little attention to the consistency between frames, while our work addresses this issue by introducing the kinematics-based loss function.

3 Overview

In this paper, we propose a deep learning based method for synthesizing realistic and consistent HR cloth animations, taking physically simulated LR meshes as input. We construct three datasets (DRAPING, HITTING and SKIRT) and train a synthesizing model on each separately. The pipeline of our approach is illustrated in Figure 1. To generate training data, a pair of LR and HR meshes are simulated synchronously by virtual spring constraints and multi-resolution dynamic models (§ 4.2). Consequently, the LR and HR meshes are well aligned at the level of large-scale deformation and differ in the wrinkles. Given the simulated mesh pairs, we convert them into dual-resolution geometry images (§ 4.1), with each sample encoding three features: the displacement, the normal and the velocity. A multi-feature super-resolution network (MFSR) with shared layers and task-specific modules is proposed to super-resolve these images with details (§ 5). The sharing module takes multiple features as input to learn low-dimensional representations from the corresponding SR tasks simultaneously, and each task-specific module focuses on the high-level semantics of its own task. Based on these features, we design the spatial and temporal loss functions (§ 5.2) to train our MFSR for detailed and consistent results. After training MFSR with synchronized data, the testing LR geometry images (converted from the input LR mesh) are upsampled into HR geometry images, which are then converted to a detailed HR mesh. To further improve the synthesized results, we also add a refinement step, which includes a fast normal-guided filtering for global smoothing and a collision-solving step to untangle cloth penetrations (§ 6).

4 Data preparation

4.1 Data representation and conversion

Dual-resolution meshes. Before executing the cloth simulation for training data generation, we need to set the initial rest state of the LR and HR meshes. As shown in the examples of Figure 2, a piece of cloth is initially defined by a polygon, and then triangulated into an initial LR mesh. To obtain the corresponding HR mesh, we subdivide the LR one by progressively dividing edges until it reaches the desired resolution. In this work, the HR mesh has 16 times as many faces as the LR mesh. With the rest-state LR/HR meshes, we create two sets of dual-resolution frame data by physics-based simulation. The correspondence between them is maintained during the simulation, so that they exhibit identical or similar large-scale folding behavior and differ only in the fine-level wrinkles. More details about the synchronized simulation are given in § 4.2.
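The 16-fold face count can be checked with a small sketch. It assumes a plain 1-to-4 midpoint split applied twice, which matches "divide the edges twice" in § 6 but is not confirmed as the exact scheme used here; `subdivide` and the two-triangle square are hypothetical illustrations.

```python
def subdivide(vertices, faces):
    """Split every triangle into four by inserting one vertex per edge midpoint.

    Assumption: a plain midpoint split; the source only says edges are divided.
    """
    vertices = list(vertices)
    midpoint_index = {}  # undirected edge -> index of its midpoint vertex

    def midpoint(a, b):
        key = (min(a, b), max(a, b))
        if key not in midpoint_index:
            pa, pb = vertices[a], vertices[b]
            vertices.append(tuple((x + y) / 2.0 for x, y in zip(pa, pb)))
            midpoint_index[key] = len(vertices) - 1
        return midpoint_index[key]

    new_faces = []
    for a, b, c in faces:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_faces += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
    return vertices, new_faces


# A unit square made of two triangles in 2D material space (toy LR mesh).
lr_vertices = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
lr_faces = [(0, 1, 2), (0, 2, 3)]

# Two rounds of subdivision: 4 x 4 = 16 times as many faces.
hr_vertices, hr_faces = subdivide(*subdivide(lr_vertices, lr_faces))
```

Because new vertices are only appended, the original LR vertices keep their indices in the HR mesh, which is what allows them to serve as shared feature vertices.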

Figure 2: Our dual-resolution cloth model. The LR meshes on the left are initially defined by triangulated polygon meshes. Then the HR meshes on the right are obtained by subdividing edges of the LR meshes several times.

Figure 3: The original long skirt model on the left is converted to geometry images encoding three descriptors: the displacement, the normal and the velocity. For irregular garments, the feature values of sample points outside the mesh but inside the bounding box are first set to zero, shown as black pixels in the 2nd column. Those zero vectors are then replaced with the nearest non-zero pixel values, similar to replicate padding, to form a padded geometry image (the 3rd column). The model on the right is reconstructed from the geometry image of the displacement.

Dual-resolution geometry images. We convert the paired meshes into dual-resolution geometry images, embedding spatial and temporal features. The feature descriptors include the displacement, the normal and the velocity of the sampling points. For any sample point inside a triangle face, these features are interpolated from the three vertices using barycentric coordinates. Different from the original geometry image paper gu2002geometry, we encode displacements instead of positions, since we are only interested in the intrinsic shape of the mesh, not its absolute spatial location. The displacement of a vertex is defined as the difference between its position in the current frame and its starting position. The vertex normal is computed as the area-weighted average of the normals of the faces adjacent to the vertex. Because the physics-based simulation uses a fixed time step, the velocity is calculated from the positions in two frames (the complete calculation of our feature descriptors is provided in the supplemental materials). Since these features are not rotation invariant, we calculate a rigid motion transformation with a rotation $R$ and a translation $t$. We then apply $(R, t)$ to the displacement, and $R$ alone to the normal and the velocity. This transformation is computed by the Kabsch algorithm kabsch1978discussion, which finds an alignment between a newly deformed mesh and a unique reference mesh in the rest state (please refer to the supplementary material for the details of the rigid-motion-invariant features). To reduce the computation cost, we only compute the rigid motion of the LR meshes and apply the same transformation to the corresponding features of the HR meshes. To reduce internal covariate shift glorot2010understanding, these features are normalized into the range [0, 1] for our network.
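A minimal sketch of the three per-vertex features follows. It assumes a finite-difference velocity over one time step and omits the rigid alignment and the [0, 1] normalization; all function names are hypothetical, and the full calculation is the one in the supplemental materials.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def displacement(current, rest):
    """Current position minus starting (rest) position, per vertex."""
    return [sub(p, q) for p, q in zip(current, rest)]


def velocity(current, previous, dt):
    """Assumed finite difference between two frames with fixed time step dt."""
    return [tuple(c / dt for c in sub(p, q)) for p, q in zip(current, previous)]


def vertex_normals(positions, faces):
    """Area-weighted average of adjacent face normals.

    The unnormalised cross product has length 2 * area, so summing it
    weights each face by its area.
    """
    sums = [[0.0, 0.0, 0.0] for _ in positions]
    for a, b, c in faces:
        n = cross(sub(positions[b], positions[a]), sub(positions[c], positions[a]))
        for v in (a, b, c):
            for k in range(3):
                sums[v][k] += n[k]
    normals = []
    for s in sums:
        length = math.sqrt(sum(x * x for x in s)) or 1.0
        normals.append(tuple(x / length for x in s))
    return normals


# Toy data: one triangle lifted along z over two frames.
rest = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
previous = [(x, y, z + 0.25) for x, y, z in rest]
current = [(x, y, z + 0.5) for x, y, z in rest]
faces = [(0, 1, 2)]

d = displacement(current, rest)
v = velocity(current, previous, dt=0.25)
n = vertex_normals(current, faces)
```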

Mesh-to-image conversion. We convert the deformed meshes into geometry images of 9 channels. For a mesh in its rest state, we find the smallest bounding box in the 2D material space. Inside the bounding box, we then sample an array of points uniformly. For each sample point inside the mesh, we find the triangle it is located in, and compute its barycentric coordinate (BC) w.r.t. three triangle vertices. BC is unchanged even though a triangle deforms during simulation. When computing features for sample points, BCs are used as weights for interpolating feature values from triangle vertices. For a mesh whose boundary coincides with the bounding box edge, we do the padding operation along boundaries. For sample points outside the mesh but inside the bounding box, their feature values are firstly set to zero. Similar to replicate padding, our operation changes those zero vectors into the nearest non-zero pixel values. A long skirt example is given in Figure 3.
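The barycentric interpolation step can be sketched as follows. The helper names and the toy triangle are hypothetical; the weights are computed once in 2D material space and then reused for every frame, as described above.

```python
def barycentric(p, a, b, c):
    """Barycentric coordinates of 2D point p w.r.t. triangle (a, b, c)."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    w_a = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    w_b = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return w_a, w_b, 1.0 - w_a - w_b


def inside(weights):
    """A sample point lies in the triangle when no weight is negative."""
    return all(w >= 0.0 for w in weights)


def interpolate(weights, fa, fb, fc):
    """Blend the feature vectors of the three vertices with the weights."""
    return tuple(weights[0] * x + weights[1] * y + weights[2] * z
                 for x, y, z in zip(fa, fb, fc))


# Triangle in material space and per-vertex features (e.g. displacements).
a, b, c = (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)
feat_a, feat_b, feat_c = (0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 8.0, 0.0)

w = barycentric((0.25, 0.25), a, b, c)
sample = interpolate(w, feat_a, feat_b, feat_c)
```

Samples for which `inside` is false would be left at zero and later filled by the padding step.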

Image-to-mesh conversion. After an HR image is synthesized, the values in the displacement channels are used to restore the vertex positions of the detailed mesh, while the original topology of that mesh is retained. Thanks to the padding operation, every vertex in 2D material space, whether it is on a boundary, on a seam line, or internal to the mesh, has four nearest surrounding non-zero sample points. We restore the displacements of the vertices of the detailed mesh by bilinear interpolation. In particular, each vertex on a seam line has two or more corresponding vertices in 2D material space, and we restore its displacement as a weighted average of the corresponding ones. The weight for a 2D vertex is the ratio of the number of faces incident to this vertex to the number incident to the 3D vertex. The computed displacements are added to the positions of the subdivided mesh vertices in the rest state to obtain wrinkle-enhanced positions. In the end we apply the inverse of the rigid transformation, computed in the mesh-to-image phase, to the new positions. As shown on the right of Figure 3, almost no visual differences can be seen. In our quantitative experiments, the geometric reconstruction error is smaller than 1e-4, measured by the vertex mean square error (VMSE).
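The bilinear lookup can be illustrated with a minimal sketch. It assumes the image is stored as rows of feature tuples and that vertex coordinates are already expressed in pixel units; the seam-line averaging and the inverse rigid transformation are left out, and both function names are hypothetical.

```python
def bilinear_sample(image, u, v):
    """Bilinearly interpolate image[row][col] at column u and row v.

    Assumes the image has at least 2 x 2 samples.
    """
    rows, cols = len(image), len(image[0])
    u = min(max(u, 0.0), cols - 1.0)  # clamp to the image domain
    v = min(max(v, 0.0), rows - 1.0)
    c0 = min(int(u), cols - 2)        # top-left sample of the enclosing cell
    r0 = min(int(v), rows - 2)
    fu, fv = u - c0, v - r0
    p00, p01 = image[r0][c0], image[r0][c0 + 1]
    p10, p11 = image[r0 + 1][c0], image[r0 + 1][c0 + 1]
    return tuple(
        (1 - fv) * ((1 - fu) * a + fu * b) + fv * ((1 - fu) * c + fu * d)
        for a, b, c, d in zip(p00, p01, p10, p11)
    )


def restore_position(rest_position, disp):
    """Rest-state position plus interpolated displacement."""
    return tuple(r + d for r, d in zip(rest_position, disp))


# A 2 x 2 displacement image (toy values) and one vertex at its centre.
image = [
    [(0.0, 0.0, 0.0), (0.0, 0.0, 2.0)],
    [(0.0, 0.0, 4.0), (0.0, 0.0, 6.0)],
]
disp = bilinear_sample(image, 0.5, 0.5)
new_position = restore_position((0.5, 0.5, 0.0), disp)
```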

4.2 Synchronized simulation

A high-quality training dataset is essential for data-driven approaches. In our case, we need to generate corresponding LR/HR mesh pairs in animation sequences by physics-based simulation. In image super-resolution tasks ledig2016photo; dong2016image; kim2016deeply, one way to generate a training dataset is to down-sample HR images to obtain their corresponding LR ones. However, down-sampling an HR cloth mesh could cause collisions, even though the HR mesh is collision-free. Therefore, it is preferable to simulate the two meshes individually, with all collisions properly resolved. However, as mentioned in previous works kavan11physics; zurdo2013wrinkles, if there are no constraints between the two simulations, they will diverge to different behaviors because of the accumulated higher frequencies generated by finer meshes and numerical errors. To solve this problem, we enforce virtual spring constraints and use multi-resolution dynamic models to construct a synchronized simulation for HR meshes.

As shown in Figure 2, our dual-resolution meshes are well aligned in the initial state, because we only add vertices on the edges without changing the mesh shape. The vertices of an LR mesh, called feature vertices, also appear in the HR mesh and are used as constraints for the synchronized simulation. We first run the coarse cloth simulation and record the positions of all feature vertices at every frame. While simulating the HR mesh at a given frame, virtual springs are added to connect each pair of corresponding feature vertices in the LR and HR meshes. To pull an HR feature vertex towards its LR counterpart, we define an internal force following Hooke's law halliday2013fundamentals: the force is proportional to the offset between the two positions, scaled by a spring stiffness constant that can be adjusted depending on how tight the user wants the tracking to be. A large stiffness results in tight tracking of the feature vertices, but not of the other vertices. As a side effect, the simulated HR mesh has many annoying "spikes".
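The virtual spring can be sketched in a few lines. The names `x_hr`, `x_lr` and `k` are assumed notation, and the exact frame indexing of the two positions is not recoverable from the text, so this is only the generic Hooke's-law form with zero rest length.

```python
def spring_force(x_hr, x_lr, k):
    """Force on an HR feature vertex pulling it towards its LR counterpart.

    Hooke's law with zero rest length: f = -k * (x_hr - x_lr).
    k is the user-chosen stiffness (hypothetical parameter name).
    """
    return tuple(-k * (h - l) for h, l in zip(x_hr, x_lr))


# The HR vertex sits one unit to the right of the LR target,
# so the spring pulls it back to the left.
force = spring_force((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), k=2.0)
```

A larger `k` tightens the tracking of the feature vertices only, which is the source of the "spikes" mentioned above.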

To tackle the above issue, we propose another tracking mechanism with multi-resolution dynamic models. Given an HR mesh at the fine level (shown as the solid lines in Figure 4), we construct an LR triangle mesh at the coarse level (the dashed triangle in Figure 4). The coarse-level mesh connects the feature vertices by retaining the topology of the LR mesh. In finite-element simulations, the constitutive model includes internal cloth forces supporting behaviors such as anisotropic stretch or compression muller2004interactive and surface bends bridson03wrinkles; Grinspun03shell with damping Narain2012AAR. For a triangle in the coarse-level mesh, the in-plane stretching forces at its three vertices are measured by a corotational finite-element approach muller2004interactive. The bending forces for two adjacent triangles are added using a discrete hinge model based on dihedral angles bridson03wrinkles; Grinspun03shell. The triangles at the fine level have the same force patterns, imposed on all particles (including feature vertices). All stretching and bending forces are accompanied by damping forces. In addition, our two-level dynamic models are independent of the force implementations, and would also work with other triangular finite-element methods. As a result, the feature vertices in the multi-resolution dynamic models receive stretching forces from both the coarse and the fine level, and likewise for bending forces. The remaining vertices only receive the forces at the fine level. With this two-level dynamic model, modest virtual spring coefficients are enough for the HR mesh to keep pace with the LR mesh in simulation.

Figure 4: The multi-resolution dynamic model for tracking. Forces of stretching (left) and bending (right).

5 Multi-feature super-resolution network

In this section, we introduce our MFSR architecture based on the RDN, as well as the loss functions taking spatial and temporal features into account to improve wrinkle synthesis capability.

5.1 MFSR architecture

Figure 5: The architecture of MFSR. The input and the output are LR/HR images where each pixel is represented in a 9-dimensional feature vector embedding the displacement, the normal and the velocity in order. Conv and Concat refer to convolutional and concatenation layers, respectively. The MFSR upscales the LR features with shared and unshared layers to recover HR features with detailed information.
Figure 6: Two network structures used in image super-resolution. The left is residual block in ledig2016photo. The right is residual dense block in zhang2018residual used for our MFSR.

We now introduce our MFSR architecture for the image SR tasks of multiple features. Given paired LR/HR geometry images, our MFSR learns the mappings of the different features with image SR networks. One standard methodology is single-task learning, which means learning one task at a time. However, it ignores a potentially rich source of information available in other tasks. Another option is multi-task learning, which achieves inductive transfer between tasks, with the goal of leveraging additional sources to improve the performance of the target task caruana1997multitask. Our MFSR is a multi-task architecture consisting of two components: a single shared network, and three task-specific networks. The shared network is designed based on the SR task, whilst each task-specific network consists of a set of convolutional modules that link with the shared network. Therefore, the features in the shared network and the task-specific networks can be learned jointly to maximise the generalisation of the shared representation across multiple SR tasks, while simultaneously maximising the task-specific performance.

Figure 5 shows a detailed visualisation of our MFSR based on residual dense blocks (RDB) zhang2018residual. In the shared network, the image SR model consists of four parts: shallow feature extraction, basic blocks, dense feature fusion, and finally upsampling. We use two convolutional layers to extract shallow features, followed by RDBs zhang2018residual as the basic blocks, then dense feature fusion to extract hierarchical features, and lastly one bilinear upsampling layer to upscale the height and width of the LR feature maps by 4 times. Unlike in general SR tasks, we find that pixel shuffle and deconvolution methods cause apparent checkerboard artifacts, so we use the bilinear method. For the basic blocks in our SR network, we employ RDBs instead of the residual blocks used in chen2018synthesizing. As shown on the left of Figure 6, a residual block learns a mapping function with reference to its input, and can therefore be used to build deep networks that address the problem of vanishing gradients. However, in the residual block a convolutional layer only has a direct connection to its preceding layer, and does not make full use of all preceding layers. To exploit all the hierarchical features, we choose RDBs (see the right of Figure 6), which consist of densely connected layers for global feature combination and local feature fusion with local residual learning. More details about RDB are given in zhang2018residual. In each task-specific network, we use one convolutional layer to map the extracted local and global features to the upsampled displacement, normal and velocity descriptors, respectively.

5.2 Spatial and temporal losses

In order to learn the spatial details and temporal consistency of the underlying HR meshes, our MFSR is trained by minimizing the following loss functions for mesh features. A baseline reconstruction loss is defined as the mean square error (MSE) between the displacement images of the ground-truth HR mesh and of the synthesized SR result. This displacement loss term yields a smooth HR result from the given low-frequency information.

To extend the loss into wrinkle feature space, a novel loss for the normal is introduced, which penalizes the difference between the ground-truth and the synthesized normal images. The normal feature is directly related to the bending behavior of cloth meshes. This loss term encourages our model to learn the fine-level wrinkle features so that the outputs stay as close to the ground truth as possible. In our experiments it aids the networks in creating realistic details.

The above two loss terms reconstruct high-frequency details exclusively from spatial statistics. To improve the consistency of animation sequences, we should also take temporal coherence into account. The vertex velocities of every animation frame contribute a velocity loss, defined between the ground-truth and the synthesized velocity images.

In addition, we minimize a kinematics-based loss in the training stage, to constrain the relationship between synthesized velocities and displacements (please refer to the supplementary material for the detailed derivation). It depends on the number of frames associated with the input frame and on the time step between consecutive frames. This kinematics-inspired loss term improves the consistency between the generated cloth animations.

The overall loss of our MFSR is a linear combination of the spatial smoothness, detail similarity, temporal consistency and kinematic loss terms, with three weight factors. As for back propagation, each loss term propagates backwards through its task-specific layer independently. In the shared layers, parameters are updated according to the total loss. As a result, the gradients of the loss functions from multiple SR tasks pass through the shared layers directly, and the shared layers learn a common representation for all related tasks.
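How the four terms might combine is sketched below on flattened feature lists. Several things here are assumptions rather than the method as published: MSE is used for every term, the displacement term has unit weight, the weights are named `w_n`, `w_v` and `w_k`, and the kinematic term uses a single explicit Euler step instead of the multi-frame form derived in the supplementary material.

```python
def mse(a, b):
    """Mean square error between two equally long lists of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def kinematic_loss(disp_now, vel_now, disp_next, dt):
    """Assumed one-step form: d(t) + v(t) * dt should match d(t + dt)."""
    predicted = [d + v * dt for d, v in zip(disp_now, vel_now)]
    return mse(predicted, disp_next)


def total_loss(sr, hr, sr_next_disp, dt, w_n=1.0, w_v=1.0, w_k=1.0):
    """Weighted sum of displacement, normal, velocity and kinematic terms.

    sr / hr are dicts with keys "d", "n", "v" (synthesized / ground truth).
    """
    loss_d = mse(sr["d"], hr["d"])
    loss_n = mse(sr["n"], hr["n"])
    loss_v = mse(sr["v"], hr["v"])
    loss_k = kinematic_loss(sr["d"], sr["v"], sr_next_disp, dt)
    return loss_d + w_n * loss_n + w_v * loss_v + w_k * loss_k


# Toy example: only the displacement of the second sample is off by one.
hr = {"d": [0.0, 1.0], "n": [0.0, 1.0], "v": [1.0, 1.0]}
sr = {"d": [0.0, 2.0], "n": [0.0, 1.0], "v": [1.0, 1.0]}
loss = total_loss(sr, hr, sr_next_disp=[0.5, 2.5], dt=0.5)
```

In this toy case the synthesized velocities are consistent with the synthesized displacements, so the kinematic term vanishes and only the displacement error contributes.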

6 Refinement

Given a new LR mesh, we convert it to geometry images and our MFSR predicts the corresponding upsampled images. After converting the HR images to a detailed cloth mesh, we find that the results may suffer from unexpected roughness and penetrations. To solve this problem, we design a refinement step that includes a fast feature-preserving filtering for global smoothness and a collision-solving step to untangle cloth penetrations.

Feature-preserved filtering. We observe that our MFSR generates abundant wrinkles. However, these wrinkles are accompanied by some unexpected roughness. This is because our simulated HR meshes have abundant wrinkles, which may introduce "noise" to the networks, and because of some numerical instabilities in the physics-based simulation. We tried to follow prior works on image super-resolution aly2005image; zhang2010non that use a total variation loss as a regularization mechanism to encourage spatial smoothness, but it did not work in our setting. To deal with this problem, we adapt a feature-preserving mesh denoising method sun2007fast, which can remove noise effectively without smoothing wrinkle features. Super-resolved normals are used as the guidance to update the positions of vertices in the corresponding mesh. For each mesh, a single update of the vertex positions already gives a visually much smoother result (see Figure 7). In our experiments, we set the number of iterations to 5, which takes about 0.01s per frame and is efficient for our application.
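The normal-guided vertex update can be sketched as below, following the vertex-updating rule of sun2007fast in which each vertex moves along the target normals of its incident faces towards their centroid planes. How the super-resolved per-sample normals are mapped to per-face normals is not specified here, so `face_normals` is an assumed input and the function name is hypothetical.

```python
def update_vertices(positions, faces, face_normals, iterations=1):
    """Move each vertex so that incident faces agree with their target normals.

    Per iteration: x_i += mean over incident faces f of n_f * (n_f . (c_f - x_i)),
    where c_f is the current centroid of face f and n_f its target unit normal.
    """
    positions = [tuple(p) for p in positions]
    incident = [[] for _ in positions]
    for f, face in enumerate(faces):
        for v in face:
            incident[v].append(f)

    for _ in range(iterations):
        centroids = [
            tuple(sum(positions[v][k] for v in face) / 3.0 for k in range(3))
            for face in faces
        ]
        updated = []
        for i, p in enumerate(positions):
            shift = [0.0, 0.0, 0.0]
            for f in incident[i]:
                n = face_normals[f]
                dist = sum(n[k] * (centroids[f][k] - p[k]) for k in range(3))
                for k in range(3):
                    shift[k] += n[k] * dist
            count = len(incident[i]) or 1
            updated.append(tuple(p[k] + shift[k] / count for k in range(3)))
        positions = updated
    return positions


# One noisy triangle whose target normal says it should be horizontal.
noisy = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.3)]
smoothed = update_vertices(noisy, [(0, 1, 2)], [(0.0, 0.0, 1.0)], iterations=1)
```

For this single triangle one step already flattens the face onto the plane through its centroid, while the x and y coordinates are untouched.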

Figure 7: Comparison of results with and without refinement. (a) The super-resolved cloth mesh, which suffers from unexpected roughness and collisions; (b, c) results after updating vertex positions for one step and five steps, respectively, using the feature-preserving filtering; (d) our result after refinement, with noise smoothed and penetrations resolved.

Collision handling. For many data-driven methods, penetrations cannot be completely avoided in the synthesized mesh. When simulating either the LR mesh or the HR mesh in the training stage, penetrations are avoided by enforcing continuous collision detection and response Harmon08robust. This scheme also guarantees that the runtime simulation of an LR mesh is collision-free. However, the synthesized mesh may still penetrate itself or other obstacles. Considering each detailed mesh alone, these penetrations are pre-existing collisions and should be resolved by an untangling scheme. However, even state-of-the-art untangling schemes are notoriously not robust ye17unified. We propose a collision response method that is guaranteed to resolve all collisions in our cloth synthesis.

The state of a cloth mesh, when embedded in 3D space, can be denoted by the positions of its vertices $\mathbf{x}$. Given a simulated LR mesh in a collision-free state, we can divide the edges twice to obtain a subdivided version $\mathbf{x}_{sub}$. The subdivided mesh has the same shape as the LR mesh and is free of penetrations. The synthesized mesh $\mathbf{x}_{syn}$, in contrast, adds plausible wrinkles but may contain collisions. Our collision handling problem thus becomes one of properly interpolating between the subdivided mesh $\mathbf{x}_{sub}$ and the synthesized mesh $\mathbf{x}_{syn}$, so that the new mesh $\mathbf{x}_{new}$ is collision-free and retains as many wrinkles as possible. The interpolation can be expressed as

$$\mathbf{x}_{new} = (\mathbf{I} - \mathbf{W})\,\mathbf{x}_{sub} + \mathbf{W}\,\mathbf{x}_{syn},$$

where $\mathbf{I}$ is an identity matrix and $\mathbf{W}$ is the diagonal weight matrix to be solved. In our implementation, we use the bisection method burden1985bisection to search for a close-to-optimal interpolation weight. We iteratively bisect the range $[0, 1]$ for the elements of $\mathbf{W}$; the start of this range is collision-free, while the end collides. At each bisection, if the midpoint is collision-free, it becomes the start of the new range; otherwise it becomes the end of the new range. We perform the bisection three or four times and then take the last collision-free state as the interpolation result (see Figure 7(d)).
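The bisection search described above can be sketched as follows. This is a simplified illustration under stated assumptions: a single scalar weight is shared by all vertices (that is, the diagonal weight matrix is a multiple of the identity), `x_sub` and `x_syn` stand for the vertex positions of the subdivided and the synthesized mesh, and `has_collision` is a hypothetical predicate supplied by the caller that stands in for a discrete collision detection pass.

```python
from typing import Callable, List

Positions = List[List[float]]  # one [x, y, z] entry per vertex


def blend(x_sub: Positions, x_syn: Positions, w: float) -> Positions:
    """Return (1 - w) * x_sub + w * x_syn, vertex by vertex."""
    return [
        [(1.0 - w) * a + w * b for a, b in zip(p, q)]
        for p, q in zip(x_sub, x_syn)
    ]


def bisect_weight(
    x_sub: Positions,
    x_syn: Positions,
    has_collision: Callable[[Positions], bool],
    steps: int = 4,
) -> float:
    """Search [0, 1] for a large weight whose blended mesh is collision-free.

    Assumes w = 0 (the subdivided mesh) is collision-free and w = 1
    (the synthesized mesh) collides. Returns the last collision-free weight.
    """
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if has_collision(blend(x_sub, x_syn, mid)):
            hi = mid  # midpoint collides: it becomes the new end
        else:
            lo = mid  # midpoint is safe: it becomes the new start
    return lo
```

Returning `lo` rather than the final midpoint ensures that the result is always a state that was tested and found collision-free, or the subdivided mesh itself if every tested midpoint collided.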

In addition, the above collision resolving process can be further optimized. It is not necessary to involve all vertices of the mesh in the position interpolation. Instead, only the vertices involved in intersections are of interest. These vertices can be identified by a discrete collision detection process and grouped into impact zones, as done in ye17unified. Position interpolation is then performed per zone, with each zone having its own interpolation weight. In this way, the synthesized meshes are affected as little as possible by the collision handling.
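One simple way to form such zones is to treat every detected collision as a link between two vertices and to take connected components. The sketch below does this with a union-find structure; it is an assumption made for illustration and may differ from the zone construction in ye17unified. The input `pairs` is a hypothetical list of vertex index pairs reported by discrete collision detection.

```python
from typing import Dict, Iterable, List, Tuple


def group_impact_zones(pairs: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Group colliding vertices into zones (connected components)."""
    parent: Dict[int, int] = {}

    def find(v: int) -> int:
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    # Merge the two components touched by each colliding pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the vertices of each component under its root.
    zones: Dict[int, List[int]] = {}
    for v in list(parent):
        zones.setdefault(find(v), []).append(v)
    return sorted(sorted(zone) for zone in zones.values())
```

Each returned zone could then be given its own interpolation weight, leaving all vertices outside the zones at their synthesized positions.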

7 Implementation

We describe the details of the data generation and the network architecture in this section.

Figure 8: Visualization of the three datasets: DRAPING (left), HITTING (middle) and SKIRT (right). We generate the DRAPING dataset by hanging the tablecloth from a randomly chosen topmost vertex and letting the fabric fall freely. For the HITTING dataset, we use spheres of various sizes to hit the hanging tablecloth at different locations, generating a total of 35 simulation sequences. The SKIRT dataset contains simulation sequences of long skirts worn by animated characters performing different motions. The top row shows the low-resolution cloth simulations and the bottom row shows the high-resolution ones.

Data generation. To generate data for our MFSR training, we construct three datasets using a tablecloth model and a skirt model with character motions. The two different models are of regular and irregular garment shapes, respectively. The meshes in each dataset are simulated from a template model to ensure a fixed size.

For the tablecloths, the LR and HR meshes have 749 and 11,393 vertices, respectively. Using the tablecloth model we generate two datasets, called DRAPING and HITTING (see Figure 8). The DRAPING dataset is created by hanging the tablecloth from a randomly chosen topmost vertex and letting the fabric fall freely. It contains 13 simulation sequences, each with 400 frames. Ten sequences are randomly selected for training and the remaining three are used for testing. In addition to simulating a tablecloth in a free environment, we also construct a HITTING dataset in which a sphere interacts with the tablecloth. Specifically, we select spheres of different radii to hit the tablecloth back and forth at different locations, and obtain a total of 35 simulation sequences with 1,000 frames each.

We also generate a dataset called SKIRT, for long skirts worn by animated characters (shown in Figure 8). The LR and HR skirt meshes have 1,303 and 19,798 vertices, respectively. The mannequin has rigid parts, as in Narain2012AAR, and is driven by publicly available motion capture data from CMU hodgins2015cmu. Specifically, we select dancing motions comprising 7 sequences (30,000 frames in total), of which 5 sequences are randomly selected for training and the remaining 2 are used for testing. Since some dancing motions are too fast for physics-based simulation, we slow them down by interpolating 8 times between two adjacent motions from the original CMU data.

We use the ARCSim cloth simulation engine Narain2012AAR to produce all simulations, but without the remeshing operation. ARCSim requires material parameters for simulation. In our experiment, we choose Gray Interlock for its anisotropic behavior, from a library of measured cloth materials Wang2011DEM. ARCSim also requires a collision-free initial state between the garment and the obstacles. For the tablecloth simulation, we can easily set up a rectangular sheet and place the obstacles so that they do not collide with it. As for the long skirts, we first manually put the skirt on a template mannequin (T pose) to ensure a collision-free state. Then, we interpolate 80 motion frames between the T pose and the initial pose of each motion sequence. With these interpolated motions, we run the simulations of the long skirts worn by the mannequins in various poses, starting from the collision-free initial state. In addition, for synchronized simulation, we set the spring stiffness constant in the equation (1).

Network architecture. We train a separate MFSR for each of the three simulated datasets. Our proposed MFSR consists of shared and task-specific layers. The shared network has 16 identical RDBs zhang2018residual, each with six densely connected layers and a growth rate of 32. The basic network settings, such as the convolutional kernel and the activation function, follow zhang2018residual. For the upscaling operation, i.e., from coarse-resolution features to fine ones, we consider several mechanisms, e.g., the pixel shuffle module shi2016real, deconvolution, nearest-neighbor and bilinear upscaling, and finally choose the bilinear upscaling layer because it prevents checkerboard artifacts in the generated meshes. In our upsampling network, the upscale factor is set to 4. The upscale factor (in one dimension) of the corresponding meshes is set to be as close to 4 as possible. For example, the LR and HR tablecloth meshes have 749 and 11,393 vertices, respectively, so the latter has roughly 15 times as many vertices as the former (11,393 / 749 ≈ 15.2, close to 4 × 4 = 16). Converting meshes to images, we set the size of LR images in tablecloth to be , and HR ones . The image aspect ratio is the same as the uv proportion in material space, to achieve uniform sampling.
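To make the upscaling step concrete, the following dependency-free sketch upsamples a single-channel 2D image by a factor of 4 with bilinear interpolation. It only illustrates the operation and is not the PyTorch layer used in the network; in particular, the corner-aligned sampling convention is an assumption, and the function name is hypothetical.

```python
from typing import List

Grid = List[List[float]]


def bilinear_upscale(img: Grid, factor: int = 4) -> Grid:
    """Upsample a 2D grid by `factor` using bilinear interpolation.

    Corner-aligned: the four corner samples of the output coincide with
    the four corner samples of the input.
    """
    h, w = len(img), len(img[0])
    out_h, out_w = h * factor, w * factor
    out: Grid = []
    for oy in range(out_h):
        # Source row coordinate of this output row.
        y = oy * (h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        ty = y - y0
        row = []
        for ox in range(out_w):
            x = ox * (w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            tx = x - x0
            # Interpolate along x on both rows, then along y.
            top = (1.0 - tx) * img[y0][x0] + tx * img[y0][x1]
            bottom = (1.0 - tx) * img[y1][x0] + tx * img[y1][x1]
            row.append((1.0 - ty) * top + ty * bottom)
        out.append(row)
    return out
```

Because every output sample is a smooth blend of its four nearest input samples, this kind of upscaling does not produce the periodic patterns that deconvolution can introduce.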

We implement our network using PyTorch 1.0.0. In each training batch, we randomly extract 16 LR/HR pairs with the size of and as input. The Adam optimizer kingma2014adam is used to train our network, and its $\beta_1$ and $\beta_2$ are both set to 0.9. The base learning rate is initialized to 1e-4 and is divided by 10 every 20 epochs. To prevent the learning rate from becoming too small, we fix it after 60 epochs. The training procedure stops after 120 epochs. Training each model takes about a day and a half on an NVIDIA® GeForce® GTX 1080Ti GPU. In all our experiments, we set the length of the input frames for the kinematics-based loss in the equation (6). Besides, we set the weights and in the equation (6).
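The learning rate schedule can be written as a small function. The text leaves open whether the decay scheduled at epoch 60 is still applied before the rate is frozen; the sketch below assumes that it is, so the rate stays at 1e-7 from epoch 60 onward. The function and parameter names are hypothetical.

```python
def learning_rate(
    epoch: int,
    base_lr: float = 1e-4,
    step: int = 20,
    gamma: float = 0.1,
    freeze_epoch: int = 60,
) -> float:
    """Step decay every `step` epochs, frozen from `freeze_epoch` onward.

    Assumption: the decay due at `freeze_epoch` itself is still applied.
    """
    decays = min(epoch, freeze_epoch) // step
    return base_lr * gamma ** decays
```

Under this assumption the rate is 1e-4 for epochs 0 to 19, 1e-5 for epochs 20 to 39, 1e-6 for epochs 40 to 59, and 1e-7 for the remaining epochs up to 120. Setting `freeze_epoch=59` gives the other reading, in which the rate stays at 1e-6.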

| Benchmark | #verts LR | #verts HR | tracked sim. | ours | speedup | coarse sim. | mesh/image conversion | synthesizing (GPU) | refinement |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DRAPING | 749 | 11,393 | 4.27 | 0.345 | 12 | 0.129 | 0.089 | 0.0553 | 0.0718 |
| HITTING | 749 | 11,393 | 4.38 | 0.341 | 13 | 0.135 | 0.109 | 0.0531 | 0.0434 |
| SKIRT | 1,303 | 19,798 | 10.23 | 0.709 | 14 | 0.227 | 0.18 | 0.0281 | 0.274 |

Table 1: Statistics and timing (sec/frm) of the tablecloth and skirt testing examples. The last four columns are the components of our method.

8 Results

In this section, we evaluate the results obtained with our method both quantitatively and qualitatively. The runtime performance and visual fidelity are demonstrated on various scenes: draping and hitting tablecloths, and long skirts worn by animated characters. We compare our results against simulation methods and demonstrate the benefits of our method for cloth wrinkle synthesis. The effectiveness of our network components is also analyzed for various loss functions and network architectures.

8.1 Runtime performance

We run our method on a 2.50 GHz 4-core Intel CPU for coarse simulation and mesh/image conversion, and on an NVIDIA GeForce® GTX 1080Ti GPU for image synthesizing. Table 1 shows the average per-frame execution time of our method for the different garment resolutions. The execution time consists of four parts: coarse simulation, mesh/image conversion, image synthesizing, and refinement. For reference, we also report the timings of a CPU-based implementation of tracked high-resolution simulation using ARCSim Narain2012AAR. On average, our algorithm is 13 times faster than the tracked simulation. The low computational cost of our method makes it suitable for interactive applications.
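The totals and speedups in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below copies the timings from Table 1; the dictionary layout and the function name are illustrative only.

```python
# Per-frame timings in seconds, copied from Table 1.
# "parts" = coarse sim., mesh/image conversion, synthesizing (GPU), refinement.
TABLE_1 = {
    "DRAPING": {"tracked": 4.27, "ours": 0.345,
                "parts": [0.129, 0.089, 0.0553, 0.0718]},
    "HITTING": {"tracked": 4.38, "ours": 0.341,
                "parts": [0.135, 0.109, 0.0531, 0.0434]},
    "SKIRT": {"tracked": 10.23, "ours": 0.709,
              "parts": [0.227, 0.18, 0.0281, 0.274]},
}


def summarize(row: dict) -> dict:
    """Return the summed component time and the speedup over tracked sim."""
    return {
        "component_sum": sum(row["parts"]),
        "speedup": row["tracked"] / row["ours"],
    }


if __name__ == "__main__":
    for name, row in TABLE_1.items():
        s = summarize(row)
        print(f"{name}: sum={s['component_sum']:.4f} s, "
              f"speedup={s['speedup']:.2f}x")
```

The component sums agree with the "ours" column to within rounding, and the unrounded speedups are about 12.4, 12.8 and 14.4, which round to the reported 12, 13 and 14.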

8.2 Wrinkle synthesis results and comparisons

Generalization to new hanging. We use the training data in the DRAPING dataset to learn a synthesizer, then evaluate its generalization to new hanging vertices. Figure 9 shows the deformations of tablecloths in three test sequences of the DRAPING dataset. The “GT” row in Figure 9 shows the HR meshes of the tracked physics-based simulation for reference, while the “Ours” row shows the result of our data-driven method. We find that our approach successfully produces realistic and abundant wrinkles in different deformation sequences; in particular, many mid-scale and small wrinkles appear on the tablecloths as they fall from different directions.

Figure 9: Comparison between the ground-truth HR tracked simulation (top) and the super-resolved results of our method (bottom), on a test sequence of the DRAPING dataset.
Figure 10: Comparison between ground-truth tracked simulation (top) and our super-resolved meshes (bottom), on testing animation sequences in the HITTING dataset. Our method succeeds in predicting the small and mid-scale wrinkles of the garment while running more than 12 times faster than physics-based simulation.
Figure 11: Comparison between ground-truth tracked simulation (top) and our super-resolved meshes (bottom), on testing animation sequences in SKIRT dataset.

Generalization to new balls. In Figure 10, we visually evaluate the quality of our algorithm on the HITTING dataset, which illustrates the performance when generalizing to new crashing balls of various sizes and initial positions. We show four test examples comparing the ground-truth HR meshes of the tracked simulation with our method. For testing, the initial positions of the balls are set to four different places that are unseen in the training data. Additionally, in the third and fourth columns of Figure 10, the diameter of the ball is set to 0.5, which is also a new size not used for training. When balls of various sizes crash into the cloth at different positions, our method successfully predicts plausible wrinkles, while running more than 12 times faster than physics-based simulation.

Generalization to new motions. In Figure 11, we show the deformed long skirt produced by our approach on the mannequin as its pose changes over time. The human poses are from two testing motion sequences in the subject of modern dance and in the subject of lambada dance hodgins2015cmu. We visually compare the results of our algorithm with the ground-truth simulation. The mid-scale wrinkles are successfully predicted by our approach when generalizing to various dancing motions not in the training set. For instance, in the first column of Figure 11, the skirt slides forward and forms plausible wrinkles due to an extended, straight leg in the sideways arabesque pose of the character. For the dancing sequences, please see the accompanying video for more animated results and further comparisons.

Comparison with other methods. In contrast to conventional physically-based simulations, chen2018synthesizing and oh2018hierarchical use deep learning based methods to accelerate the generation of cloth animations. Oh et al. oh2018hierarchical introduce a fast and reliable hierarchical cloth animation algorithm, which simulates the coarsest level with a physics-based method and generates the more detailed levels by inference with deep neural networks. However, it relies on a fully connected network to model the detailed levels for each triangle separately, and is thus not appropriate for learning wrinkling behaviors. We compare our method with mesh SR chen2018synthesizing, a CNN-based method to synthesize cloth wrinkles. The performance is evaluated on the tablecloth datasets, both DRAPING and HITTING. The training settings of our network are described in § 7, and mesh SR is trained using the same settings reported in their paper chen2018synthesizing.

The peak signal-to-noise ratio (PSNR) is a widely used metric for quantitatively evaluating image restoration quality, while the vertex-wise mean square error (VMSE), computed as the per-vertex error averaged over all vertices and frames, is usually used for evaluating the quality of reconstructed meshes. In this work, we choose these two metrics to quantitatively compare our method with chen2018synthesizing, and the corresponding results are reported in Table 2. Compared with mesh SR, our MFSR improves the performance significantly and obtains a higher PSNR. This is owed to the RDBs, which overcome the drawback of residual blocks that neglect to use all preceding features. With better super-resolved images and the refinement step, our MFSR also reaches a lower VMSE, indicating better performance and generalization on these datasets.
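The two metrics can be sketched in a few lines. The peak value used for PSNR and the use of the squared Euclidean distance per vertex for VMSE are assumptions made here, since the exact normalization is not spelled out in the text; the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def psnr(pred: Sequence[Sequence[float]],
         target: Sequence[Sequence[float]],
         peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two 2D images.

    `peak` is the maximum possible sample value (assumed to be 1.0).
    """
    squared_error = 0.0
    count = 0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            squared_error += (p - t) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)


def vmse(pred_frames: Sequence[Sequence[Vec3]],
         target_frames: Sequence[Sequence[Vec3]]) -> float:
    """Squared per-vertex distance averaged over all vertices and frames."""
    total = 0.0
    count = 0
    for frame_p, frame_t in zip(pred_frames, target_frames):
        for p, t in zip(frame_p, frame_t):
            total += sum((a - b) ** 2 for a, b in zip(p, t))
            count += 1
    return total / count
```

A higher PSNR and a lower VMSE both indicate a result closer to the ground truth, which is how the entries of Table 2 should be read.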

In Figure 12, we show visual results of our MFSR and mesh SR. Given the same LR meshes in the testing stage, our MFSR produces rich and consistent wrinkles thanks to the multiple features, while mesh SR, relying on the position only, produces inaccurate wrinkles. The velocity and kinematics-based loss functions also lead to more stable results than mesh SR (please refer to the accompanying video). It has been reported that mesh SR can generate large-scale folds when the resolution of the training pairs is low (i.e., the LR meshes are reduced to 200 vertices). However, in our experiments we find that mesh SR is unable to converge close to the ground truth for complicated wrinkle styles. Its results suffer from small, unrealistic wrinkles that resemble noise artifacts. In contrast, our method synthesizes the HR meshes in a physically realistic manner. The differences between each result and the ground truth are highlighted in Figure 12 using color coding. In the results of mesh SR, the color coding clearly highlights the bottom-left and bottom-right corners and the wrinkle lines, where our results are closer to the ground truth.

| Benchmark | Chen et al. 2018 (PSNR/VMSE) | Ours (PSNR/VMSE) |
| --- | --- | --- |
| DRAPING | 59.07/4.19e-4 | 68.91/7.09e-5 |
| HITTING | 59.15/1.17e-4 | 72.25/4.69e-5 |

Table 2: Comparison of pixel-wise and vertex-wise error values (PSNR/VMSE) of our method and chen2018synthesizing.

In addition, we discuss the improvements of our method over state-of-the-art data-driven methods that do not use deep networks. As noted in zurdo2013wrinkles, their method handles only quasistatic wrinkle formation without dynamics information, so it cannot capture the richness of wrinkles in a flag. They use only the edge ratio between the current and the rest state as the mesh descriptor; in contrast, our algorithm enhances the LR deformation using descriptors built from displacement, normal and velocity, covering both spatial and dynamic information. As shown in Figure 9, our technique can reproduce dynamic effects such as travelling waves. Another limitation mentioned in their work is possible cloth interpenetration: although penetrations in the LR cloth are resolved, the generated HR meshes may still suffer from collisions. At a controllable cost (see Table 1), we solve this problem using discrete collision detection and our interpolation-based collision response algorithm.

Figure 12: Qualitative comparison of the reconstruction results on unseen data in the DRAPING and HITTING datasets with mesh SR chen2018synthesizing. Panel titles, left to right: Ground Truth, Chen et al. 2018, Ours, Chen et al. 2018, Ours. The first column is the physics-based HR simulation, the second column shows the results of the method in chen2018synthesizing, and the third column shows our results. The reconstruction accuracy is shown qualitatively as difference maps rendered side by side. Reconstruction errors are color-coded, with warmer colors indicating larger errors. Our method leads to significantly lower reconstruction errors.
Figure 13: Convergence analysis of (a) the displacement, (b) the normal and (c) the velocity with different super-resolution networks of RDN and SRResNet.

8.3 Ablation study

Next, we study the effect of different components of our proposed network, including loss function and network architecture.

Loss function. To demonstrate the effectiveness of our proposed loss functions, we conduct experiments with different loss combinations on each of the three datasets, i.e., DRAPING, HITTING, and SKIRT. The training and testing datasets are selected as described in § 7. We use the displacement loss as the baseline and progressively add the remaining loss terms of our MFSR to obtain comparative results for the different loss terms.

Table 3 reports the PSNR between the generated displacement images and the ground truth under various settings of the loss function. Red text indicates the best performance and blue text indicates the second-best. The results show that our proposed algorithm, which combines all our proposed loss terms in a multi-task learning framework, achieves either the best or the second-best performance. Notice that without the constraints of the velocity and kinematics-based loss functions, the normal loss may decrease the final testing PSNR, although it encourages wrinkle generation in the SR results.

DRAPING 67.90 62.83 67.92 63.47 68.68
HITTING 67.11 68.23 71.41 69.75 71.26
SKIRT 62.48 60.76 62.48 61.31 62.88
Table 3: Comparison of PSNR of displacement images using different training loss terms.

Network architecture. To further investigate the performance of different SR networks (SRResNet and RDN), we conduct an experiment on the DRAPING dataset. In particular, we validate our cloth animation results on 800 randomly selected pairs of LR/HR meshes from the DRAPING dataset, which are excluded from the training set and cover different complex hanging motions in pendulum movement. In Figure 13, we plot the convergence curves of the three features on this validation set. The curves show that RDN achieves better performance than SRResNet and also stabilizes the training process for all three features. The improved performance and stability result from the contiguous memory, local residual learning and global feature fusion in RDN. In SRResNet, local convolutional layers do not have direct access to the subsequent layers, so the network fails to fully use the information of each convolutional layer.

9 Conclusions and future work

In this paper, we have presented a novel deep learning based framework to synthesize cloth animations with abundant wrinkles. Our evaluations show that the spatial and temporal features can be augmented with high-frequency details using a multi-feature super-resolution network. The proposed network consists of a sharing module that jointly learns the low-level representation and task-specific modules that focus on high-level semantics. We add a kinematics-based loss to the network objective to maintain frame-to-frame consistency over time. Quantitative and qualitative results show that our method can synthesize realistic-looking wrinkles in various scenes, such as draping cloth and garments interacting with moving balls and human bodies. We also give details on how to create paired meshes using a synchronized simulation, which helps in constructing large 3D datasets. These aligned coarse and fine meshes can also be used for other related applications, such as 3D shape matching of incompatible shape structures. In addition, our collision handling algorithm is independent of the wrinkle synthesis implementation and can therefore be combined with other data-driven methods. To the best of our knowledge, our approach is the first to consider multiple features, including geometry and time consistency, for 3D dynamic wrinkle synthesis, and it can be conveniently generalized to cascade more tasks together.

Nevertheless, several limitations remain open for future work. Since our data are simulated sequences, we plan to investigate recurrent SR networks to capture the dynamics and thereby potentially improve the consistency. In our work, the training data consists of paired LR/HR meshes generated by a synchronized simulation. Because it tracks the LR cloth, the HR cloth cannot show the dynamic properties of a full simulation. We would like to address this limitation by applying unsupervised learning to unpaired data. In addition, the dataset could be further expanded to include more scenes, motion sequences, and garment shapes.