Deep Detail Enhancement for Any Garment

08/10/2020 ∙ by Meng Zhang, et al. ∙ 0

Creating fine garment details requires significant effort and substantial computational resources. In contrast, a coarse shape may be easy to acquire in many scenarios (e.g., via low-resolution physically-based simulation, linear blend skinning driven by skeletal motion, portable scanners). In this paper, we show how to enhance, in a data-driven manner, rich yet plausible details starting from a coarse garment geometry. Once the parameterization of the garment is given, we formulate the task as a style transfer problem over the space of associated normal maps. To facilitate generalization across garment types and character motions, we introduce a patch-based formulation that hallucinates high-resolution geometric details (i.e., wrinkle density and shape) by matching a Gram matrix based style loss. We extensively evaluate our method on a variety of production scenarios and show that our method is simple, light-weight, efficient, and generalizes across underlying garment types, sewing patterns, and body motion.




1. Introduction

High fidelity garments are a vital component of many applications including virtual try-on and realistic 3D character design for games, VR, and movies. In real life, garments, as they undergo motion, develop folds and rich details, characteristic of the underlying materials. Naturally, we desire to carry over such realistic details to the digital world as well.

Directly capturing such fine geometric details requires professional capture setups along with computationally heavy processing pipelines [Bradley et al., 2008; Chen et al., 2015; Pons-Moll et al., 2017], which may not be applicable in low-budget scenarios. An alternative is physically-based cloth simulation [Narain et al., 2012; Tang et al., 2018; Liang et al., 2019], which virtually replaces the need for physical setups. However, despite many research advances, achieving high quality and robust simulation remains challenging. The simulation is often sensitive to the choice of parameters and initial conditions and may require a considerable amount of manual work to set up correctly. Even once set up, it remains a computationally expensive process, especially as the resolution of the garment geometry increases (see Figure 2). Further, with increasing complexity of the garment geometry, e.g., garments with many layers or with non-canonical UV layout, it may be difficult to obtain any stable result (see Figure 1(c)). Finally, for each new garment type and/or motion sequence, the majority of the setup and simulation steps have to be repeated.

We propose a different approach. As a garment undergoes motion, its geometry can be analyzed in two steps. First, the corresponding motion of the body (i.e., the body pose) changes the global shape of the garment, e.g., its boundary and coarse folds. We observe that the global shape is consistent across different resolutions of a garment (see Figure 2). Such a global geometry can either be simulated efficiently and reliably using a coarsened version of the garment, or obtained by utilizing a skinning based deformation method. Second, given the coarse geometry, the high frequency wrinkles can be perceived by considering only local neighborhood information. As shown in Figure 2, the highlighted local patches in the coarse garment geometry provide sufficient cues to hallucinate plausible high frequency wrinkles in the same region. Hence, we hypothesize that given the coarse geometry of a garment, the dynamics of the detailed wrinkles can be learned from many such local patches observed from different types of garments undergoing different motions.

Figure 2. We show a garment simulated with different mesh resolutions, i.e., varying particle distances (PD), and report the computation time for simulating a single frame. While wrinkle details are captured well when simulating high resolution meshes, i.e., lower PD, computation time significantly increases. Our method captures statistically similar quality details by enhancing a coarse simulation much more efficiently.

In this work, we present a deep learning based method that enhances an input coarse garment geometry by synthesizing realistic and plausible wrinkle details. Inspired by the recent success of image-based artistic style transfer methods [Gatys et al., 2016], we represent the garment geometry as a 2D normal map, interpreting the coarse normal map as the content and the fine geometric details as a style. This style is characteristic of different materials and can be well captured by the correlation between VGG features in the form of Gram matrices. Furthermore, in order to tackle various garment materials within a universal framework, we adopt the conditional instance normalization technique [Dumoulin et al., 2016] proposed in the context of universal style transfer. Finally, we design our detail enhancement network to operate at a patch level, enabling generalization across different (i) garment types, (ii) sewing patterns, and (iii) underlying motion sequences. In fact, we show that our network trained on a dataset generated from only a single sheet of cloth by applying random motion sequences (e.g., applying wind force, being hit by a ball) can already reasonably generalize to various garment types such as a skirt, a shirt, or a dress. See Figure 11 for a comparison.

We evaluate our approach on a challenging set of examples demonstrating generalization both across complex garment types (e.g., with layering) and across a wide range of motion sequences. In addition to enriching low resolution garment simulations with accurate wrinkles, we also demonstrate that our method can enhance coarse garment geometries obtained by skinning methods, as often used in computer games (see Figure 1). To our knowledge, we present the first learning-based method for generating garment details that generalizes to unseen garment types, sewing patterns, and body motions all at the same time.

2. Related Work

Realistic and physically based cloth simulation has been extensively studied in the graphics community [Choi and Ko, 2005; Nealen et al., 2006; Narain et al., 2012; Liang et al., 2019; Yu et al., 2019; Tang et al., 2018]. However, due to the associated computational cost and stability concerns as the complexity of the garments increases, several alternative avenues have been explored. For example, in industrial settings, skinning based deformation of garments is often favored for its simplicity, robustness, and efficiency. This, however, comes at the expense of losing geometric details. Finally, there have been various efforts to augment coarse simulation output with fine details using optimization, data-driven, and learning-based methods.

Optimization-based methods.

Müller et al. [2010] use a constraint-based optimization method to compute a high resolution mesh from a base coarse mesh. Rohmer et al. [2010] use the stretch tensor of the coarse garment animation as a guide to place wrinkles in a post-processing stage. However, they assume the underlying garment motion is smooth, which is not always the case with character body motion. Gillette et al. [2015] extend similar ideas to real time animation of cloth.

Data-driven methods.

Feng et al. [2010] present a deformation transfer system that learns a non-linear mapping from a coarse garment shape to per-vertex displacements that represent fine details. Zurdo et al. [2012] share a similar idea to learn the mapping from a low-resolution simulation of a cloth to high resolution where plausible wrinkles are defined as displacements. Given a set of coarse and mid-scale mesh pairs, Kavan et al. [2011] learn linear upsampling operators. Wang et al. [2010] construct a database of high resolution simulation examples generated by moving each body joint separately. Given a coarse mesh, wrinkle regions for each joint are synthesized by interpolating between database samples and blended into a final mesh. However, this method is limited to tight fitting clothing. One of the most comprehensive systems in this direction is the DRAPE [Guan et al., 2012] system that learns a linear subspace model of garment deformation driven by pose variation. Hahn et al. [2014] extend this idea by using adaptive bases to generate richer deformations. Xu et al. [2014] introduce a pose-dependent rigging solution to extract nearby examples of a cloth and blend them in a sensitivity-based scheme to generate the final drape.

Learning-based methods.

With the recent success of deep learning methods in various imaging and 3D geometry tasks, a recent research direction is to learn garment deformations under body motion using neural networks. A popular approach has been to extend parametric human body models with a per-vertex displacement map to capture the deforming garment geometry [Alldieck et al., 2019a; Bhatnagar et al., 2019; Pons-Moll et al., 2017; Jin et al., 2018]. While this is a very efficient representation, it only works well for tight clothes such as t-shirts or pants that are sufficiently close to the body surface. Gundogdu et al. [2019] present GarNet, which learns features of garment deformation as a function of body pose and shape at different levels (point-wise, patch-wise, and global features) to reconstruct the detailed draping output. Santesteban et al. learn a garment deformation model via recurrent neural networks that enables real-time virtual try-on. Their approach learns the coarse garment shape based on the body shape and the detailed wrinkles based on the pose dynamics. A similar two-stage strategy is proposed by Patel et al., which decomposes the deformation caused by body pose, shape, and garment style into high-frequency and low-frequency components. Wang et al. [2019] also consider garment deformations due to material properties in their motion-driven network architecture. Others [Yang et al., 2018; Wang et al., 2018; Holden et al., 2019] investigate subspace techniques such as principal component analysis to reduce the number of variables of the high dimensional garment shape space. While these learning-based methods are efficient and fully differentiable compared to standard physically-based simulation, they are hard to generalize across different garment types and associated sewing patterns. In this work, we seek to combine the best of both worlds: physically-based simulation and learning-based approaches. Specifically, we rely on physically-based simulation for fast coarse garment generation and use a neural network to synthesize fine geometry conditioned on the coarse shape. By utilizing a patch-based approach and training on synthetic data, we demonstrate that our trained model generalizes across different garment types as well as different material configurations, and also extends to coarse geometry obtained using linear blend skinning.

One of the closest learning based methods to our approach is DeepWrinkles [Lahner et al., 2018], which augments a low-resolution normal map of a garment captured by RGB-D sensors with high-frequency details using PatchGAN. However, their model is garment-specific and falls short when applied to garments unseen during training (e.g., garments with a different UV parameterization or material properties). In Section 5, we compare against DeepWrinkles to contrast its generalization ability with that of our method, which uses conditional instance normalization to further adapt to material specifications and directly matches Gram matrix based features instead of matching style using a GAN.

Generalizing learning based methods to different garment types with potentially different numbers of vertices, connectivity, and material properties is an important challenge. To address this challenge, the recent work of Ma et al. [2019] adopts a one-hot vector to encode the garment type and trains a graph convolutional neural network with a generative framework to deform the body mesh into the garment surface. Zhu et al. [2020] train an adaptable template for image based garment reconstruction and introduce a large dataset of real 3D garment scans spanning over 10 different categories. This method generates plausible results if an initial template with a reasonable topology is provided. Alldieck et al. [2019b] use normal and displacement maps based on a fixed body UV parameterization to represent the surface geometry of static garments. The fixed parameterization assumption makes it difficult to generalize the method to new garment designs, especially more complex garments such as multi-layer dresses. We also utilize a normal map representation; however, our patch-based method focuses on local shape variations and is independent of the underlying UV parameterization.

Image-to-image transfer.

We draw inspiration from the recent advances in image-to-image transfer methods [Liao et al., 2017; Fišer et al., 2016; Huang and Belongie, 2017; Isola et al., 2017] as we represent the garment geometry as a normal map and cast the problem as detail synthesis in this 2D domain. Specifically, we follow the framework proposed by Gatys et al. [2015, 2016] that captures the style of an image by the Gram matrices of its VGG features [Simonyan and Zisserman, 2014]. Similarly, we assume that a coarse normal map can be enhanced with fine scale wrinkles by matching the feature statistics of Gram matrices. Our work can be loosely related to image super-resolution, where the aim is to synthesize high-resolution images from blurry counterparts [Ledig et al., 2017; Wang et al., 2015]. In Section 5, we provide a comparison with a direct image space super-resolution approach.

Figure 3. Given a coarse garment geometry represented as a 2D normal map, we present a method to generate an enhanced normal map and the corresponding high resolution geometry. At the core of our method is a detail enhancement network that enhances local garment patches in the coarse normal map, conditioned on the material type of the garment. We combine such enhanced local patches to generate the enhanced normal map, which captures fine wrinkle details. We lift the enhanced normal map to 3D to generate the high resolution geometry by an optimization method that avoids interpenetrations between the garment and the underlying body surface. In case the garment material is not known in advance, we present a material classification network that operates on the local patches cropped from the coarse normal map.

3. Overview

Our goal is to augment a coarse garment geometry source sequence with fine details such as plausible and realistic wrinkles. We assume that the coarse sequence is obtained either by a physically-based simulation run on a low-resolution version of the garment or by computationally fast approaches such as skinning based deformation. The output of our pipeline is a fine-detailed surface sequence, which depicts the source garment with realistic wrinkles in a temporally-consistent manner. The fine wrinkles in the output are synthesized either by sharpening the original coarse folds in the input or by generating new wrinkles conditioned on nearby coarse folds.

Motivated by the recent advances in image style transfer [Gatys et al., 2016], we cast the 3D wrinkle synthesis task as a 2D detail transfer problem defined on the normal maps of the garments. Specifically, we present a learning-based method that trains an image-to-image transfer network to synthesize a detail-rich normal map from the coarse normal map of the input geometry by matching feature statistics captured by Gram matrices (Section 4.1).

However, unlike previous work [Lahner et al., 2018], which focuses on learning garment-specific models using PatchGAN, our goal is to train a universal detail transfer network that can generalize over different types of garments (e.g., dress, shirt, skirt) made of different materials (e.g., silk, leather, matte) and undergoing different body motions (e.g., dancing, physical activities). First, in order to ensure generalization across different materials and body motions, we adopt the conditional instance normalization (CIN) approach, introduced by Dumoulin et al. [2016] in the context of artistic style transfer. Specifically, given the material type of a garment, we shift and scale the learned network parameters to adapt to different materials and motions. Given a sequence of coarse garment geometry, we predict its material type via a classifier, and do not require the material specification as part of the input. Second, in order to generalize across different types of garments and the different UV parameterizations utilized to generate the normal maps of the garments, we propose a novel patch-based approach. Specifically, our network works at a patch level by synthesizing details on overlapping patches randomly cropped from the normal map of the garment. We merge the resulting detailed patches to obtain the final normal map, which is lifted to 3D via a gradient-based optimization scheme to yield the detailed geometry (Section 4.3). Our pipeline is illustrated in Figure 3.

4. Algorithm Stages

Given a coarse 3D garment geometry sequence, for each frame, our goal is to synthesize a detailed geometry that enhances the coarse garment with fine wrinkle details, subject to the underlying body motion and the assigned garment material (see Section 4.2). Note that given the nature of garments, any garment can be semantically cut into a few developable pieces, i.e., garment patterns. We assume such a parameterization is provided for the first frame and is applicable to the remaining frames in the sequence.

4.1. Detail Enhancement

Using the garment-specific parameterization, we represent the coarse geometry at each frame by a normal map. We treat the detail enhancement problem as generating an enhanced normal map conditioned on the coarse normal map and the material characteristics. While the enhanced normal map preserves the content of the coarse one (e.g., the density and the position of the wrinkles), the appearance of the details (e.g., shape, granularity) is learned from a collection of normal maps depicting the detailed geometry of various garments of the given material undergoing different body motions. The detail enhancement step operates at each frame; hence, for simplicity, we omit the frame subscript in the following.

Detail Enhancement Network.

We adopt a U-net architecture for our detail enhancement network (see Figure 3). The encoder projects the input into a learned latent space. The latent representation is then used to construct the output via the decoder. Although garments made of different materials may share similar coarse geometry, the fine details can vary, conditioned on the material. To adapt our approach to different garment materials and capture fine details, we use a one-hot vector as input to indicate the type of material that the input and output normal maps are associated with. We design the decoder using conditional instance normalization (CIN) [Dumoulin et al., 2016] to adapt to different material types. Specifically, given a material type, the activation of a layer is converted to a normalized activation specific to that material: the mean of the activation is subtracted, the result is divided by its standard deviation, and the normalized value is then scaled and shifted by material-specific parameters learned for the corresponding layer.
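The normalization step described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the released implementation: the function name, the `(C, H, W)` layout, the per-channel statistics, and the `eps` stabilizer are assumptions, and the scale and shift tables would be learned during training rather than set by hand.

```python
import numpy as np

def conditional_instance_norm(x, material, gamma, beta, eps=1e-5):
    """Conditional instance normalization for one activation tensor.

    x        : (C, H, W) activation of a decoder layer (layout is an assumption)
    material : integer index of the material type
    gamma    : (num_materials, C) material-specific scales (learned in practice)
    beta     : (num_materials, C) material-specific shifts (learned in practice)
    """
    # Per-channel mean and standard deviation over the spatial dimensions.
    mu = x.mean(axis=(1, 2), keepdims=True)
    sigma = x.std(axis=(1, 2), keepdims=True)
    x_hat = (x - mu) / (sigma + eps)
    # Select the scale/shift row for this material and broadcast over H, W.
    g = gamma[material][:, None, None]
    b = beta[material][:, None, None]
    return g * x_hat + b

# Toy usage: 5 materials, 4 channels, hand-set parameters for material 2.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 8))
gamma = np.ones((5, 4))
beta = np.zeros((5, 4))
gamma[2], beta[2] = 2.0, 0.5
out = conditional_instance_norm(x, 2, gamma, beta)
```

Because only the scale and shift rows depend on the material, all other network weights are shared across materials, which is what allows a single network to serve all of them.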

Loss Function.

In general, the loss function for a style transfer task is composed of two parts: (i) a content term, which in our case penalizes the difference between the enhanced and the coarse normal maps with respect to the coarse geometry features; and (ii) a style term, which in our case penalizes the perceptual difference between the enhanced normal map and a set of training examples that depict detailed garment geometries of the same material. As indicated by Gatys et al. [2016], content and style losses can be formulated using the outputs of two predefined sets of layers, one for content and one for style, of a pre-trained image classification network, e.g., VGG19 [Simonyan and Zisserman, 2014]. Note that the definition of content and style in our problem is not the same as in a classical image style transfer method. We performed a qualitative study to find the appropriate content and style layers. Specifically, we pass examples of two sets of normal maps that depict coarse and fine geometry of garments of a particular material through VGG19. For candidate layers, we embed the output features in 2D space using a spectral embedding method and identify the layers where the two sets are well separated (see Figure 4).

Figure 4. We pass a given set of normal map patches cropped from coarse and high resolution simulation results through VGG19. Here, we show the spectral embedding in 2D of the feature outputs of candidate layers, and select the layers where the two sets differ significantly to compute the style loss. We show the feature plots of two such layers. This difference is largely responsible for the drop in visual fidelity going from high resolution to coarse simulation.

Given the selected content layers, we define the content loss as the difference between the VGG features of the enhanced normal map and those of the coarse normal map at these layers.

We define the style loss based on the correlation of the VGG features across different channels by using Gram matrices. Since the input to our network is a normal map that is generated based on a UV parameterization of the garment, it contains background regions which do not correspond to any UV patch. In order to avoid such empty regions impacting the correlation measure, we adopt a spatial control neural style transfer method by using a UV mask that masks out the background regions. We capture style by a set of Gram matrices, one for each selected VGG layer, computed from features masked by the background mask downsampled to the resolution of that layer. The style loss then compares the Gram matrices of the enhanced normal map with those of the detailed training examples, normalized by the number of valid pixels in the downsampled mask.

The final loss function we use to train the network is a weighted sum of the content and style losses, with a separate weight for each term.
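To make the structure of the loss concrete, the following NumPy sketch computes a masked Gram matrix and combines a content and a style term. It is an illustration under stated assumptions rather than the exact formulation: the squared-error form of both terms, the way the Gram matrix is divided by the valid pixel count, the function names, and the default weights of 1.0 are all placeholders, and the feature arrays stand in for VGG19 activations that would come from a pre-trained network.

```python
import numpy as np

def masked_gram(features, mask):
    """Gram matrix of (C, H, W) features restricted to valid (mask == 1) pixels."""
    c = features.shape[0]
    # Zero out background pixels so they do not contribute to the correlations.
    f = (features * mask[None, :, :]).reshape(c, -1)
    n_valid = max(float(mask.sum()), 1.0)  # number of valid pixels in the mask
    return (f @ f.T) / n_valid

def content_loss(feat_out, feat_coarse):
    """Mean squared difference between features (assumed form)."""
    return float(np.mean((feat_out - feat_coarse) ** 2))

def style_loss(feat_out, feat_ref, mask):
    """Mean squared difference between masked Gram matrices (assumed form)."""
    diff = masked_gram(feat_out, mask) - masked_gram(feat_ref, mask)
    return float(np.mean(diff ** 2))

def total_loss(content_pairs, style_triples, w_content=1.0, w_style=1.0):
    """Weighted sum over the chosen content and style layers.

    content_pairs : list of (output_features, coarse_features), one per content layer
    style_triples : list of (output_features, reference_features, mask), one per style layer
    The weights are placeholders, not the values used in the paper.
    """
    lc = sum(content_loss(a, b) for a, b in content_pairs)
    ls = sum(style_loss(a, b, m) for a, b, m in style_triples)
    return w_content * lc + w_style * ls
```

One property worth noting is that changing feature values only in background pixels leaves the masked Gram matrix unchanged, which is the purpose of the UV mask.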

Patch-based generalization.

Generalizing a CNN based neural network to different types of clothes with unknown UV parameterization is challenging. Particularly, similar 3D garments (e.g., a short and a long skirt) might potentially have very different parameterizations due to different sewing patterns that vary in the number of 2D patterns and their shape. In order to generalize our approach across such varying 2D parameterizations, we adopt a patch-based approach. Specifically, instead of operating with complete normal maps, we use patches cut from the normal maps as input and output of our network. While we use randomly cut patches during training, at test time we use overlapping patches sampled regularly on the image plane. We resize each input patch to ensure the absolute size of one pixel remains unchanged. The output patches that are enhanced by our network are then combined to generate the final detailed normal map by averaging the RGB values in the overlapping regions of the output patches.
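The test-time procedure, with regularly sampled overlapping patches whose outputs are averaged, can be sketched as follows. This is a minimal sketch: `enhance_fn` is a hypothetical stand-in for the trained network, the patch size and stride are illustrative values, and the per-patch resizing step mentioned above is omitted.

```python
import numpy as np

def patch_origins(size, patch, stride):
    """Regularly spaced patch start indices that cover the range [0, size)."""
    starts = list(range(0, max(size - patch, 0) + 1, stride))
    # Add a final patch flush with the border if the regular grid stops short.
    if size > patch and starts[-1] != size - patch:
        starts.append(size - patch)
    return starts

def enhance_by_patches(normal_map, enhance_fn, patch=128, stride=64):
    """Apply enhance_fn to overlapping patches and average the overlaps.

    normal_map : (H, W, C) array
    enhance_fn : callable mapping a patch to an output patch of the same shape
                 (hypothetical placeholder for the detail enhancement network)
    """
    h, w, c = normal_map.shape
    acc = np.zeros((h, w, c), dtype=float)
    weight = np.zeros((h, w, 1), dtype=float)
    for y in patch_origins(h, patch, stride):
        for x in patch_origins(w, patch, stride):
            out = enhance_fn(normal_map[y:y + patch, x:x + patch])
            acc[y:y + patch, x:x + patch] += out
            weight[y:y + patch, x:x + patch] += 1.0
    # Every pixel is covered by at least one patch, so weight is never zero.
    return acc / weight
```

With an identity `enhance_fn`, the merged result reproduces the input, which is a convenient sanity check for the tiling and averaging logic before plugging in a network.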

Figure 5. For an input coarse normal map, we crop 42 patches as the input of our material classification network. For each material type in our dataset, we plot the histogram of patches based on the probability of them belonging to that material. As shown, material 1 has the highest distribution among the five materials. We also color code the patches based on their probability of belonging to material 1, where darker color means a higher probability.

4.2. Material Classification

When the material of the input garment is not known in advance, we use a patch-based material classification method to predict it. Specifically, we train a patch-based material classification network, which is composed of fully-connected layers as shown in Figure 3. The input to this network is the flattened VGG feature vector calculated from a patch cropped from the coarse normal map. The dimension of the last layer equals the number of materials captured in our dataset. The last layer is followed by a sigmoid function and a normalization layer to output the probability of the input patch belonging to each material. During training, we use a cross-entropy loss between the predicted probabilities and the ground truth material type.

Humans perceive the characteristics of a material when they observe a garment in motion. Similarly, we observe that performing material classification on a single patch is not sufficiently accurate [Yang et al., 2017]. Hence, for a given input garment sequence, we choose randomly cropped patches and run material classification on each of them sampled over time. We use a voting strategy to decide the final material type from all such individual predictions as shown in Figure 5. Note that in application scenarios, e.g., if the coarse garment geometry is obtained by deforming a template via linear blend skinning, the material classification frees the user from having to guess material parameters.
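The per-patch prediction and the voting step can be illustrated with a short NumPy sketch. Two details are assumptions here: the normalization layer is taken to divide the sigmoid outputs by their sum, and the vote is taken to be a simple majority over the per-patch arg-max predictions. The function names and the toy logits are hypothetical.

```python
import numpy as np

def patch_probabilities(logits):
    """Sigmoid followed by sum-normalization (assumed form of the normalization layer)."""
    s = 1.0 / (1.0 + np.exp(-logits))
    return s / s.sum(axis=-1, keepdims=True)

def vote_material(logits_per_patch):
    """Majority vote over per-patch predictions.

    logits_per_patch : (num_patches, num_materials) outputs of the last
                       fully-connected layer, one row per cropped patch.
    Returns the winning material index and the vote count per material.
    """
    probs = patch_probabilities(np.asarray(logits_per_patch, dtype=float))
    votes = probs.argmax(axis=-1)
    counts = np.bincount(votes, minlength=probs.shape[-1])
    return int(counts.argmax()), counts

# Toy usage: three patches, five materials; two patches favor material 1.
logits = [[0.1, 2.0, -1.0, 0.3, 0.0],
          [0.0, 1.5, 0.2, 0.1, -0.5],
          [0.2, 0.1, 0.0, 3.0, 0.4]]
material, counts = vote_material(logits)
```

Aggregating over many patches sampled across frames makes the final label robust to individual patches that are ambiguous, e.g., nearly flat regions.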

4.3. 3D Recovery

Once the enhanced normal map is calculated, we finally lift the details into 3D. Since the coarse geometry is known, one common solution in the CG industry is to bake the normal map onto the original coarse surface during rendering. The skinning based examples shown in Figures 7 and 8 are rendered using this approach. On the other hand, real 3D results are useful in many follow-up applications, such as animation, texture painting, etc. We adopt a two-stage approach to recover the detailed 3D geometry for such applications.

Normal map guided deformation.

We perform upsampling over the surface of the coarse geometry and deform the resulting set of vertices based on the enhanced normal map. Our method uses an iterative gradient-based optimization scheme to find the deformed vertex positions that minimize an energy with three terms. The first term is defined over each vertex and its one-ring neighborhood and uses the vertex normal read from the enhanced normal map. The second term applies the Laplacian operator to the vertex positions, and the third term measures the deviation from the initial vertex positions obtained by upsampling the coarse geometry. The last two terms are weighted regularizers: while the Laplacian term acts as a smoothness term to avoid sharp deformations, the last term penalizes the final shape for deviating from the original geometry (e.g., by shifting or scaling).
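For orientation, one plausible form of such an energy is written out below. This is an assumption made for illustration and not necessarily the exact expression used: the data term asks every edge in the one-ring of a vertex to be perpendicular to the target normal of that vertex, the symbols $v_i$, $v_i^0$, $n_i$, $\mathcal{N}(i)$, $\Delta$, $\lambda$, and $\mu$ are chosen here, and the precise norms and the form of the Laplacian term may differ.

```latex
% Hypothetical energy; symbols and exact form are assumptions.
%   v_i     : deformed position of vertex i
%   v_i^0   : initial position from upsampling the coarse geometry
%   n_i     : target normal of vertex i read from the enhanced normal map
%   N(i)    : one-ring neighborhood of vertex i
%   \Delta  : Laplacian operator on the mesh
%   \lambda, \mu : weights of the two regularization terms
E(V) = \sum_{i} \sum_{j \in \mathcal{N}(i)} \bigl( n_i \cdot (v_i - v_j) \bigr)^2
     + \lambda \sum_{i} \bigl\| \Delta(v_i) \bigr\|^2
     + \mu \sum_{i} \bigl\| v_i - v_i^0 \bigr\|^2
```

Under this form all three terms are quadratic in the vertex positions, so plain gradient descent with a small step size is sufficient to decrease the energy at every iteration.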

Avoiding interpenetration.

The above deformation approach may cause interpenetrations between the garment and the underlying body surface. Hence, we adopt a post-processing step to resolve such penetrations, as introduced by Guan et al. [2012]. Specifically, we first find the set of garment vertices that are inside the body mesh. We project each such vertex to the nearest location on the body surface, and then update the vertex positions by minimizing an objective function defined with respect to these projections.

The objective functions are optimized by a standard gradient descent algorithm.

4.4. Data Generation

Our goal is to generalize our method to a large variety of garment types and motion sequences. To ensure such a generalization capability, we generate our training data by performing physically-based simulation over 3 different garment-motion pairs (see Figure 6). These are:

  1. a hanging sheet of cloth interacting with 2 balls and blowing wind;

  2. a long skirt over a virtual character performing a Samba dance;

  3. a pleated skirt over a virtual character performing a modern dance.

For the last two cases, the underlying virtual character is created with Adobe Fuse and the motion sequences are obtained from Mixamo. We use Marvelous Designer to drape the garments over the 3D character body and perform the physically-based simulation. We simulate each garment with five different material samples (silk chamuse, denim lightweight, knit terry, wool melton, silk chiffon) and in two resolution settings: a larger particle distance yields the coarse geometry, and a smaller particle distance yields the fine scale geometry. The normal map of the 3D garment surface is then calculated: for each example, we generate the corresponding normal map using a given UV parameterization for each garment type. We scale the normal maps of different examples to ensure a single pixel corresponds to the same unit in 3D, randomly cut patches from the normal maps, and use each pair of non-empty coarse and high resolution patches as the input and target of our network. Finally, we also augment our dataset by rotating and reflecting the patches.
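The rotation and reflection augmentation can be sketched as below. This is a minimal sketch assuming square patches and 90-degree rotations, which gives eight variants per pair; the function name is hypothetical. It only rearranges pixels. Whether the encoded normal vectors themselves should also be rotated or mirrored is not specified in the text, so that step is deliberately left out and would need to be decided by the implementer.

```python
import numpy as np

def augment_pair(coarse, fine):
    """Return the 8 rotated/reflected variants of a (coarse, fine) patch pair.

    coarse, fine : (P, P, 3) square patches cropped at the same location.
    The same transform is applied to both so the pair stays aligned.
    Note: only pixel positions are permuted; the normal vectors stored in
    the channels are not re-oriented (an open design choice).
    """
    pairs = []
    for k in range(4):
        # Rotate by k * 90 degrees in the image plane (first two axes).
        c, f = np.rot90(coarse, k), np.rot90(fine, k)
        pairs.append((c, f))
        # Mirror the rotated patch horizontally.
        pairs.append((np.fliplr(c), np.fliplr(f)))
    return pairs

# Toy usage with a patch whose pixels are all distinct.
patch = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)
variants = augment_pair(patch, patch + 100.0)
```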

Figure 6. Our base training data is simulated from a single sheet of cloth interacting with two balls and blowing wind. We also simulate a long skirt and a pleated skirt under different body motions. For each training example, we simulate both a coarse and a fine resolution mesh and obtain the corresponding normal map pairs.
Figure 7. We evaluate our method both on coarse garment simulations and garments deformed by linear blend skinning. Our method can generalize across different garment types and motions. The materials for the simulated garments are as follows: t-shirt (knit terry), pants (wool melton), skirt (silk chamuse), skirt laces (silk chiffon). For the skinning based example, our material classification network predicts it as silk chiffon. Please see supplementary video.
Figure 8. We evaluate our method both on coarse garment simulations and garments deformed by linear blend skinning. The materials for the simulated garments are as follows: hanfu (silk chamuse), hood (denim lightweight), pants (knit terry), vest (wool melton). For the skinning based example, our material classification network predicts it as knit terry. Please see supplementary video.

We observe that, sometimes due to instabilities in the physical simulation, simulation results with different particle distances may reveal geometric variations, e.g., in the position and the shape of the wrinkles. This results in input and output patch pairs that are not well aligned. To avoid this problem, we instead downsample the normal maps of the high resolution mesh to generate the coarse normal maps that are provided as input to our network during training. During testing, the inputs to our network are obtained from the coarse simulation results. Our experiments demonstrate that training with the downsampled normal maps while predicting with the coarse normal maps as input at test time is a successful strategy.
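One way to derive an aligned coarse input from a fine normal map is block averaging followed by re-normalization, sketched below. The choice of filter, the downsampling factor, and the assumption that normals are stored as unit vectors in [-1, 1] (rather than as 8-bit RGB) are all illustrative; the text does not state which downsampling scheme was used or whether the result is resized back to the original resolution.

```python
import numpy as np

def coarse_from_fine(normal_map, factor):
    """Downsample a fine normal map by averaging factor x factor blocks.

    normal_map : (H, W, 3) array of unit normals in [-1, 1]; background
                 pixels are assumed to be all zeros.
    factor     : integer downsampling factor (illustrative parameter).
    """
    h, w, _ = normal_map.shape
    # Crop so that both dimensions are divisible by the factor.
    h2, w2 = h - h % factor, w - w % factor
    blocks = normal_map[:h2, :w2].reshape(h2 // factor, factor, w2 // factor, factor, 3)
    mean = blocks.mean(axis=(1, 3))
    # Averaged normals are shorter than unit length, so re-normalize them.
    norm = np.linalg.norm(mean, axis=-1, keepdims=True)
    return mean / np.maximum(norm, 1e-8)
```

Since the coarse map is computed from the fine one, every wrinkle in the input sits exactly under its detailed counterpart in the target, which removes the misalignment caused by simulating the two resolutions independently.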

Network architecture.

Our network broadly consists of an encoder and a decoder. The encoder has 4 layers that project the input patch image to a compact latent space by down-sampling. Each layer consists of blocks of 2D convolution, ReLU, and max pooling, in that order. The decoder that follows up-samples the patch tensor back to the original size, mainly using 2D transposed convolutions, with a 2D convolution in the last layer. Because different materials share one network, we add a CIN layer after every 2D transposed convolution to scale and shift the latent code to the space associated with the specified material. Finally, we use skip connections, as we regard our problem as constructing the residual for the fine image based on the coarse one. In our experiments, we use the Adam optimizer. Project code and data will be released.

Figure 9. We demonstrate how our method works on garments made of different materials and undergoing the same motion. Note how the coarse inputs vary with material change and the synthesized details that align well with the input. The two different materials used are denim light (DL) and silk chiffon (SC).

5. Results

We evaluate our method on the product space of a variety of garments undergoing various body motions. We present qualitative and quantitative evaluations, comparisons with baseline methods, and an ablation study.

5.1. Dataset

As input, we test on coarse garment geometries obtained either by running a physically-based simulation on a low resolution mesh of the garment or by deforming the garment via a skeleton rig using linear blend skinning (LBS). Similar to the dataset generation, we use motion sequences from Mixamo and utilize Marvelous Designer as our physical simulator. For skinning-based examples, we test our method on a character from Mixamo, as well as a character from the Microsoft Rocketbox dataset. We show qualitative examples in Figures 7 and 8.

5.2. Evaluation

Qualitative evaluation

The examples demonstrate the generalization capability of our method to garments and motions unseen during training. Even for garment types that are significantly different from our training data, e.g., the multi-layer skirt in Figures 1 and 7, and the hijab in Figure 8, our network synthesizes plausible geometric details an order of magnitude faster than running a physically based simulation with a high resolution mesh (see Figure 2). When evaluating our method on skinning-based rigged characters, since no garment material information is provided, we use our material classification method to first predict plausible material parameters based on the input normal maps. We then utilize this material information to synthesize vivid wrinkles that enhance the input garment geometry. Empirically, even when the material specification was available, we found that training end-to-end with the material classification network along with CIN resulted in faster convergence [Huang and Belongie, 2017].

The material properties of a garment play an important role in the resulting folds and wrinkles. Our method captures such deformations by taking the material parameters as an additional input. In Figure 9, we show examples of garments made from different materials undergoing the same body motion. In all cases, our method generates plausible details that respect the input coarse geometries.

Quantitative evaluation

In order to evaluate our approach quantitatively, when ground truth data is available, we define an improvement score that compares an input patch, an output patch, and the corresponding ground truth patch.

Mean ± Std       Dress A          Dress B          Dress C

Silk chamuse
  Motion 1       98.41 ± 0.87     96.22 ± 1.01     91.87 ± 3.02
  Motion 2       97.55 ± 1.53     95.39 ± 0.92     91.25 ± 4.56

Denim light
  Motion 1       98.35 ± 0.57     95.28 ± 0.92     92.45 ± 2.03
  Motion 2       97.39 ± 1.08     95.25 ± 1.42     92.44 ± 3.25

Table 1. We report the mean and standard deviation of improvement scores on three dresses made of two different materials and undergoing two different motion sequences. While Dress A and Motion 1 were seen during training, the remaining dresses and the motion sequence are unseen. Both motion sequences are composed of 200 frames.

A higher improvement score, 100 being perfect, indicates that the output patch is closer to the ground truth in terms of the style of the wrinkles. Table 1 reports the mean and standard deviation of the improvement score for three different garment types made from two different materials and undergoing two different motion types. While one garment and one motion type were seen during training (Dress A, Motion 1), the remaining garment types and the motion sequence are unseen. Not surprisingly, the method performs well on seen garment types and motions. More importantly, it generalizes to unseen data, achieving high improvement scores with only a slight degradation in performance.

When ground truth is not available, we evaluate the performance from a statistics perspective. Given a sequence of animated garments, we randomly crop patches from each frame and generate the corresponding samples of Gram matrices of style features. We collect such samples from three sources: (a) patches from the coarse input; (b) patches generated by our method; and (c) patches from high resolution simulation (not for corresponding frames). We embed the samples using spectral embedding and measure the distances between the distributions of the resultant point sets using the Chamfer distance (CD): one distance between the coarse input and the high resolution simulation, and one between our result and the high resolution simulation. Table 2 reports both distances together with the resulting improvement score. As shown in the table, the Chamfer distance between our result and the ground truth is significantly lower than the distance between the coarse input and the ground truth. This indicates that our method not only enhances the details of individual frames, but also improves overall plausibility for a motion sequence.
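For reference, a Chamfer distance between two embedded point sets can be computed as in the sketch below. It assumes one common symmetric variant (mean squared nearest-neighbor distance in both directions); the exact variant used for Table 2 is not stated in the text, and the function name is hypothetical.

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, D) and b (M, D).

    Assumption: squared Euclidean distances, averaged in each direction.
    """
    # Pairwise squared distances, shape (N, M).
    diff = a[:, None, :] - b[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)
    # Nearest neighbor in b for every point of a, and vice versa.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())
```

Since the measure compares sets rather than corresponding points, it can be applied even though the high resolution patches do not come from the same frames as the coarse ones.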

Each cell lists: CD (coarse input, ground truth) / CD (our result, ground truth) / matching score

                 Dress A                Dress B                Dress C

Silk chamuse
  Motion 1       1.77 / 0.09 / 94.81    0.36 / 0.09 / 74.32    0.15 / 0.12 / 14.95
  Motion 2       1.39 / 0.11 / 92.25    1.09 / 0.12 / 89.20    0.11 / 0.10 / 10.56

Denim light
  Motion 1       1.31 / 0.01 / 92.54    0.89 / 0.13 / 85.86
  Motion 2       1.50 / 0.12 / 92.33    0.93 / 0.11 / 88.23

Table 2. We report the Chamfer distance between (i) the coarse input and the ground truth and (ii) our result and the ground truth. We calculate a matching score to show the relative improvement. Please refer to the text for details. Note that Dress A and Motion 1 are seen during training while the remaining dresses and the motion sequence are unseen. Dress C is tight, as indicated by the low initial Chamfer distance, and therefore leaves little scope for improvement in terms of detail enhancement.

Continuity in time

While we do use time information in the material classification, the main detail enhancement step works at frame level. In earlier versions of our network, we experimented with time-dependent networks, with GRUs and LSTMs. While the results were better on training sets, we found the networks did not generalize across unseen data. In contrast, the proposed per-frame processing still leverages continuity data in the coarse geometry sequence and produces time-continuous details (please see supplementary video), with superior generalization behavior. We quantitatively evaluate our performance over continuous body motion, see Figure 6 (right).

5.3. Ablation Study

In order to ensure that coarse and high resolution patches are aligned during training, we generate the input to our network by downsampling the high resolution normal maps. We validate this choice by visualizing, in Figure 10, the 2D embedding of the output features of the VGG19 layers that we choose as style layers. As shown in the figure, the distribution of the features of the downsampled high resolution normal maps is close to that of the actual coarse normal maps. Further, given the downsampled normal maps as input, our method successfully generates results that share a similar distribution with the ground truth.
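The style statistics compared in this study are Gram matrices of feature maps, as used in the style transfer formulation of Gatys et al. (2016). The sketch below shows how such a matrix is formed from one layer's activations. The normalization by the number of entries is one common convention and is an assumption here, as are the function names.

```python
import numpy as np

def gram_matrix(features: np.ndarray) -> np.ndarray:
    """Turn a (C, H, W) activation map into a (C, C) Gram matrix.

    Assumption: normalized by C * H * W; other conventions exist.
    """
    c, h, w = features.shape
    # Flatten the spatial dimensions so each row is one channel's response.
    flat = features.reshape(c, h * w)
    # Channel-by-channel correlations, independent of spatial location.
    return flat @ flat.T / (c * h * w)

def style_distance(f1: np.ndarray, f2: np.ndarray) -> float:
    """Mean squared difference between the Gram matrices of two feature maps."""
    return float(np.mean((gram_matrix(f1) - gram_matrix(f2)) ** 2))
```

Because the Gram matrix discards where a response occurs and keeps only how channels co-activate, it describes wrinkle style (density and shape) rather than exact wrinkle position.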

Figure 10. Given an input normal map obtained by running a coarse simulation with a particle distance of 30 mm (PD30), our network produces a result that is close to the ground truth (high resolution simulation run with a particle distance of 10 mm) in terms of the Gram matrix distribution of the output features of the VGG19 layers that we use as style layers. We note that when training our network, we use as input downsampled high resolution normal maps (PD10 ds), which have a feature distribution similar to that of the coarse simulation normal maps (PD30) used at test time.
Figure 11. We show results of our method when trained with (i) data generated from a single sheet only and (ii) all the three datasets (see Figure 6).
Figure 12. Given a coarse garment geometry, we compare our results with image based methods, i.e., the sharpening feature in Photoshop and image super-resolution [Yifan et al., 2018], as well as our implementation of the DeepWrinkles [Lahner et al., 2018] method. Our results are closest to the target ground truth visually. We also outperform the alternative approaches in terms of the improvement score.

We also evaluate the performance of our method trained with different portions of our dataset. As shown in Figure 11, even when trained with a single sheet of cloth, our method generalizes reasonably well to different garment types. The quality of the results improves when the complete dataset is utilized.

5.4. Comparison

In Figure 12, we compare our method to alternative approaches. Specifically, we compare to two image-based enhancement methods, namely the image sharpening feature in Adobe Photoshop [Adobe, 2020] and a state-of-the-art image super-resolution method [Yifan et al., 2018]. We observe that image-based approaches cannot hallucinate wrinkle details and are therefore not suitable for our problem.

We further compare our method to the recent neural network-based approach of DeepWrinkles [Lahner et al., 2018]. We implement their method and train both their generative model and our network with a portion of our dataset, specifically with training data obtained from the long skirt. While DeepWrinkles, which uses a patch-based GAN setup, achieves similar performance during the training stage, it falls short when generalizing to unseen cases. We also report the improvement score of each method across the whole motion sequence, which demonstrates the superiority of our approach quantitatively.

Figure 13. We enhance the same coarse normal map respectively with right (silk chamuse) and wrong (wool melton) materials. The method cannot recover subtle wrinkle details of the silk material when wrongly assigned a different material.

Finally, given the same coarse normal map as input, we also provide the result of our method when run with incorrect material parameters and report the respective improvement scores in Figure 13. Using the wrong material parameters produces less plausible results and a lower improvement score.

6. Conclusions and Future Work

We have presented a deep learning based approach to synthesize plausible wrinkle details on coarse garment geometries. Our approach draws inspiration from the recent success of artistic image style transfer methods and casts the problem as detail transfer across the normal maps of the garments. We have shown that our method generalizes across different garment types, sewing patterns, and motion sequences. We have also trained and tested our method with different material parameters under a universal framework. We have demonstrated results on inputs obtained from running physical simulation on low-resolution garment geometries as well as garments deformed by skinning.

Limitations and Future work.

While our method shows impressive generalization capability, it has several limitations that can be addressed in future work. First of all, in order to generalize the detail enhancement network across different materials, we provide the material parameters as input. Hence, at run time, the network can be tested only on material types that have been seen during training. Generalizing our method to unseen materials, e.g., via learning an implicit material basis, is an interesting future direction. The current bottleneck is obtaining diverse simulations that sample a variety of realistic materials, because commercial simulation frameworks make it challenging to produce robust and realistic simulations in reasonable time. In our present implementation, our method is trained with regularly cropped 2D patches on the normal maps. An exciting direction is to instead crop geodesic patches over the 3D surface, which would minimize averaging errors across overlapping regions, and to integrate the post-processing step into the network. Furthermore, we assume that the coarse and high-resolution garments share the same set of garment pieces. In a possible garment design scenario, the designer might add some accessories to an existing garment, e.g., adding a belt or a layer to a skirt. It would be interesting to hallucinate the wrinkle details after such edits using the coarse simulation of the base garment only.
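The regular 2D cropping and the averaging over overlapping regions mentioned above can be illustrated with the following sketch. It is a hypothetical minimal version: the patch size, the stride, and the uniform averaging are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def crop_patches(image, size, stride):
    """Crop square patches on a regular grid; returns ((top, left), patch) pairs."""
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            patches.append(((top, left), image[top:top + size, left:left + size]))
    return patches

def blend_patches(patches, shape):
    """Paste patches back and average wherever they overlap."""
    acc = np.zeros(shape, dtype=float)
    # One weight per pixel, broadcastable over any trailing channel axis.
    weight = np.zeros(shape[:2] + (1,) * (len(shape) - 2), dtype=float)
    for (top, left), patch in patches:
        ph, pw = patch.shape[:2]
        acc[top:top + ph, left:left + pw] += patch
        weight[top:top + ph, left:left + pw] += 1.0
    # Pixels covered by no patch keep the value zero.
    return acc / np.maximum(weight, 1.0)
```

When the network outputs for two overlapping patches disagree, this uniform average blurs the disagreement, which is the averaging error that geodesic patches are meant to reduce. Full coverage requires that the image size minus the patch size be a multiple of the stride.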


References

  • Adobe (2020) Adobe Photoshop CC. Cited by: §5.4.
  • T. Alldieck, M. Magnor, B. L. Bhatnagar, C. Theobalt, and G. Pons-Moll (2019a) Learning to reconstruct people in clothing from a single rgb camera. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1175–1186. Cited by: §2.
  • T. Alldieck, G. Pons-Moll, C. Theobalt, and M. Magnor (2019b) Tex2shape: detailed full human body geometry from a single image. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2293–2303. Cited by: §2.
  • B. L. Bhatnagar, G. Tiwari, C. Theobalt, and G. Pons-Moll (2019) Multi-garment net: learning to dress 3d people from images. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5420–5430. Cited by: §2.
  • D. Bradley, T. Popa, A. Sheffer, W. Heidrich, and T. Boubekeur (2008) Markerless garment capture. In ACM SIGGRAPH 2008 papers, pp. 1–9. Cited by: §1.
  • X. Chen, B. Zhou, F. Lu, L. Wang, L. Bi, and P. Tan (2015) Garment modeling with a depth camera.. ACM Trans. Graph. 34 (6), pp. 203–1. Cited by: §1.
  • K. Choi and H. Ko (2005) Research problems in clothing simulation. Comput. Aided Des. 37 (6), pp. 585–592. Cited by: §2.
  • V. Dumoulin, J. Shlens, and M. Kudlur (2016) A learned representation for artistic style. arXiv preprint arXiv:1610.07629. Cited by: §1, §3, §4.1.
  • W. Feng, Y. Yu, and B. Kim (2010) A deformation transformer for real-time cloth animation. ACM Trans. Graph. 29 (4). Cited by: §2.
  • J. Fišer, O. Jamriška, M. Lukáč, E. Shechtman, P. Asente, J. Lu, and D. Sỳkora (2016) StyLit: illumination-guided example-based stylization of 3d renderings. ACM Transactions on Graphics (TOG) 35 (4), pp. 1–11. Cited by: §2.
  • L. A. Gatys, A. S. Ecker, and M. Bethge (2016) Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2414–2423. Cited by: §1, §2, §3, §4.1.
  • L. Gatys, A. S. Ecker, and M. Bethge (2015) Texture synthesis using convolutional neural networks. In Advances in neural information processing systems, pp. 262–270. Cited by: §2.
  • R. Gillette, C. Peters, N. Vining, E. Edwards, and A. Sheffer (2015) Real-time dynamic wrinkling of coarse animated cloth. In SCA, SCA ’15. Cited by: §2.
  • P. Guan, L. Reiss, D. A. Hirshberg, A. Weiss, and M. J. Black (2012) Drape: dressing any person. ACM Transactions on Graphics (TOG) 31 (4), pp. 1–10. Cited by: §2, §4.3.
  • E. Gundogdu, V. Constantin, A. Seifoddini, M. Dang, M. Salzmann, and P. Fua (2019) Garnet: a two-stream network for fast and accurate 3d cloth draping. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8739–8748. Cited by: §2.
  • F. Hahn, B. Thomaszewski, S. Coros, R. W. Sumner, F. Cole, M. Meyer, T. DeRose, and M. Gross (2014) Subspace clothing simulation using adaptive bases. ACM Transactions on Graphics (TOG) 33 (4), pp. 1–9. Cited by: §2.
  • D. Holden, B. C. Duong, S. Datta, and D. Nowrouzezahrai (2019) Subspace neural physics: fast data-driven interactive simulation. In Proceedings of the 18th annual ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pp. 1–12. Cited by: §2.
  • X. Huang and S. Belongie (2017) Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1501–1510. Cited by: §2, §5.2.
  • P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1125–1134. Cited by: §2.
  • N. Jin, Y. Zhu, Z. Geng, and R. Fedkiw (2018) A pixel-based framework for data-driven clothing. arXiv preprint arXiv:1812.01677. Cited by: §2.
  • L. Kavan, D. Gerszewski, A. W. Bargteil, and P. Sloan (2011) Physics-inspired upsampling for cloth simulation in games. ACM Trans. Graph. 30 (4). Cited by: §2.
  • Z. Lahner, D. Cremers, and T. Tung (2018) Deepwrinkles: accurate and realistic clothing modeling. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 667–684. Cited by: §2, §3, Figure 12, §5.4.
  • C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690. Cited by: §2.
  • J. Liang, M. Lin, and V. Koltun (2019) Differentiable cloth simulation for inverse problems. In Advances in Neural Information Processing Systems, pp. 771–780. Cited by: §1, §2.
  • J. Liao, Y. Yao, L. Yuan, G. Hua, and S. B. Kang (2017) Visual attribute transfer through deep image analogy. arXiv preprint arXiv:1705.01088. Cited by: §2.
  • Q. Ma, J. Yang, A. Ranjan, S. Pujades, G. Pons-Moll, S. Tang, and M. J. Black (2019) Learning to dress 3d people in generative clothing. arXiv preprint arXiv:1907.13615. Cited by: §2.
  • M. Müller and N. Chentanez (2010) Wrinkle meshes. In Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, SCA ’10, Goslar, DEU, pp. 85–92. Cited by: §2.
  • R. Narain, A. Samii, and J. F. O’brien (2012) Adaptive anisotropic remeshing for cloth simulation. ACM transactions on graphics (TOG) 31 (6), pp. 1–10. Cited by: §1, §2.
  • A. Nealen, M. Müller, R. Keiser, E. Boxerman, and M. Carlson (2006) Physically based deformable models in computer graphics. Computer Graphics Forum 25 (4), pp. 809–836. Cited by: §2.
  • C. Patel, Z. Liao, and G. Pons-Moll (2020) The virtual tailor: predicting clothing in 3d as a function of human pose, shape and garment style. arXiv preprint arXiv:2003.04583. Cited by: §2.
  • G. Pons-Moll, S. Pujades, S. Hu, and M. J. Black (2017) ClothCap: seamless 4d clothing capture and retargeting. ACM Transactions on Graphics (TOG) 36 (4), pp. 1–15. Cited by: §1, §2.
  • D. Rohmer, T. Popa, M. Cani, S. Hahmann, and A. Sheffer (2010) Animation wrinkling: augmenting coarse cloth simulations with realistic-looking wrinkles. ACM Trans. Graph. 29 (6). Cited by: §2.
  • I. Santesteban, M. A. Otaduy, and D. Casas (2019) Learning-based animation of clothing for virtual try-on. In Computer Graphics Forum, Vol. 38, pp. 355–366. Cited by: §2.
  • K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §2, §4.1.
  • M. Tang, T. Wang, Z. Liu, R. Tong, and D. Manocha (2018) I-cloth: incremental collision handling for gpu-based interactive cloth simulation. ACM Transactions on Graphics (TOG) 37 (6), pp. 1–10. Cited by: §1, §2.
  • H. Wang, F. Hecht, R. Ramamoorthi, and J. F. O’Brien (2010) Example-based wrinkle synthesis for clothing animation. ACM Trans. Graph. 29 (4). Cited by: §2.
  • T. Y. Wang, D. Ceylan, J. Popovic, and N. J. Mitra (2018) Learning a shared shape space for multimodal garment design. arXiv preprint arXiv:1806.11335. Cited by: §2.
  • T. Y. Wang, T. Shao, K. Fu, and N. J. Mitra (2019) Learning an intrinsic garment space for interactive authoring of garment animation. ACM Transactions on Graphics (TOG) 38 (6), pp. 1–12. Cited by: §2.
  • Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang (2015) Deep networks for image super-resolution with sparse prior. In Proceedings of the IEEE international conference on computer vision, pp. 370–378. Cited by: §2.
  • W. Xu, N. Umetani, Q. Chao, J. Mao, X. Jin, and X. Tong (2014) Sensitivity-optimized rigging for example-based real-time clothing synthesis.. ACM Trans. Graph. 33 (4), pp. 107–1. Cited by: §2.
  • J. Yang, J. Franco, F. Hétroy-Wheeler, and S. Wuhrer (2018) Analyzing clothing layer deformation statistics of 3d human motions. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 237–253. Cited by: §2.
  • S. Yang, J. Liang, and M. C. Lin (2017) Learning-based cloth material recovery from video. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4383–4393. Cited by: §4.2.
  • W. Yifan, F. Perazzi, B. McWilliams, A. Sorkine-Hornung, O. Sorkine-Hornung, and C. Schroers (2018) A fully progressive approach to single-image super-resolution. In CVPR Workshops. Cited by: Figure 12, §5.4.
  • T. Yu, Z. Zheng, Y. Zhong, J. Zhao, Q. Dai, G. Pons-Moll, and Y. Liu (2019) Simulcap: single-view human performance capture with cloth simulation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5499–5509. Cited by: §2.
  • H. Zhu, Y. Cao, H. Jin, W. Chen, D. Du, Z. Wang, S. Cui, and X. Han (2020) Deep fashion3d: a dataset and benchmark for 3d garment reconstruction from single images. arXiv preprint arXiv:2003.12753. Cited by: §2.
  • J. S. Zurdo, J. P. Brito, and M. A. Otaduy (2012) Animating wrinkles by example on non-skinned cloth. IEEE Transactions on Visualization and Computer Graphics 19 (1), pp. 149–158. Cited by: §2.