Realistic garment reconstruction is a notoriously complex problem, and its importance is undeniable in many research works and applications, such as accurate body shape and pose estimation in the wild (i.e., from observations of clothed humans), realistic AR/VR experiences, movies, video games, and virtual try-on.
For the past decades, physics-based simulations have set the standard in the movie and video game industries, even though they require hours of labor by experts. More recently, methods for full clothing reconstruction using multi-view videos or 3D scan systems have also been proposed. Global deformations can be reconstructed with high fidelity semi-automatically. Nevertheless, accurately recovering geometric details such as fine cloth wrinkles has remained a challenge.
In this paper, we present DeepWrinkles (see Fig. 1), a novel framework to generate accurate and realistic clothing deformation from real data capture. It consists of two complementary modules: (1) A statistical model is learned from 3D scans of clothed people in motion, to which clothing templates are precisely non-rigidly aligned. Clothing shape deformations are then modeled using a linear subspace model, where human body shape and pose are factored out, hence enabling body retargeting. (2) Fine geometric details are added to normal maps generated using a conditional adversarial network whose architecture is designed to enforce realism and temporal consistency.
To our knowledge, this is the first method that tackles 3D surface geometry refinement using a deep neural network on normal maps. With DeepWrinkles, we obtain unprecedented high-quality rendering of clothing deformation, where global shape as well as fine wrinkles from (real) high resolution observations can be recovered, using an entirely data-driven approach. Figure 2 gives an overview of our framework with a T-shirt as an example. Additional materials contain videos of results. We show how the model can be applied to virtual human animation, with body shape and pose retargeting.
2 Related Work
Cloth modeling and garment simulation have a long history that dates back to the mid 80s. A general overview of fundamental methods is given in . There are two mostly opposing approaches to this problem. One is using physics-based simulations to generate realistic wrinkles, and the other captures and reconstructs details from real-world data.
2.0.1 Physics-based simulation.
For the past decades, models relying on Newtonian physics have been widely applied to simulate cloth behavior. They usually model various material properties such as stretch (tension), stiffness, and weight. For certain types of applications (e.g., involving the human body), additional models or external forces have to be taken into account, such as body kinematics, body surface friction, interpenetration, etc. [9, 8, 4, 16]. Note that several models have been integrated in commercial solutions (e.g., Unreal Engine APEX Cloth/Nvidia NvCloth, Unity Cloth, Maya nCloth, MarvelousDesigner, OptiTex, etc.). Nevertheless, it typically requires hours or days, if not weeks, of computation, retouching work, and parameter tuning by experts to obtain realistic cloth deformation effects.
2.0.2 3D capture and reconstruction.
Vision-based approaches have explored ways to capture cloth surface deformation under stress, and to estimate material properties through visual observations for simulation purposes [53, 34, 33, 29]. Several methods also directly reconstruct the whole object surface from real-world measurements. One approach uses texture patterns to track and reconstruct garments from a video, while 3D reconstruction can also be obtained from multi-view videos without markers [48, 33, 29, 41, 49, 7]. However, without a sufficient prior, the reconstructed geometry can be quite crude. When the target is known (e.g., clothing type), templates can improve the reconstruction quality. Also, more details can be recovered by applying super-resolution techniques on input images [47, 17, 45, 7], or using photometric stereo and information about lighting [22, 50]. Naturally, depth information can lead to further improvement [36, 14].
In recent work, cloth reconstruction is obtained by clothing segmentation and template registration from 4D scan data. Captured garments can be retargeted to different body shapes. However, the method has limitations regarding fine wrinkles.
2.0.3 Coarse-to-fine approaches.
To reconstruct fine details, and consequently handle the bump in resolution at runtime (i.e., higher resolution meshes or more particles for simulation), methods based on dimension reduction (e.g., linear subspace models) [2, 21] or coarse-to-fine strategies are commonly applied [52, 35, 25]. DRAPE automates the process of learning linear subspaces from simulated data and applying them to different subjects. The model factors out body shape and pose to produce a global cloth shape and then applies the wrinkles of seen garments. However, deformations are applied per triangle, which is not optimal for online applications. Additionally, for all these methods, simulated data is tedious to generate, and accuracy and realism are limited.
2.0.4 Learning methods.
The previously mentioned methods focus on efficient simulation and representation of previously seen data. Going a step further, several methods have attempted to generalize this knowledge to unseen cases. One approach learns bags of dynamical systems to represent and recognize repeating patterns in wrinkle deformations. In DeepGarment, the global shape and low frequency details are reconstructed from a single segmented image using a CNN, but no retargeting is possible.
Only sparse work has been done on learning to add realistic details to 3D surfaces with neural networks, but several methods to enrich facial scans with texture exist [40, 37]. In particular, Generative Adversarial Networks (GANs) are suitable to enhance low dimensional information with details. They have been used to create realistic images of clothed people given a (possibly random) pose.
Outside of clothing, SR-GAN solves the super-resolution problem of recovering photo-realistic textures from heavily downsampled images on public benchmarks. The task has similarities to ours in generating high frequency details for coarse inputs, but it uses a content loss motivated by perceptual similarity instead of similarity in pixel space. Chu and Thuerey use a data-driven approach with a CNN to simulate highly detailed smoke flows. Instead, pix2pix proposes a conditional GAN that creates realistic images from sketches or annotated regions, or vice versa. This design suits our problem better as we aim at learning and transferring underlying image structure.
In order to represent the highest possible level of detail at runtime, we propose to revisit the traditional rendering pipeline of 3D engines with computer vision. Our contributions take advantage of the normal mapping technique [26, 12, 11]. Note that displacement maps have been used to create wrinkle maps using texture information [5, 15]. However, while results are visually good on faces, they still require a high resolution mesh, and no temporal consistency is guaranteed. (Also, faces are arguably less difficult to track than clothing, which is looser and prone to occlusions.)
In this work, we claim the first entirely data-driven method that uses a deep neural network on normal maps to leverage 3D geometry of clothing.
3 Deformation Subspace Model
We model cloth deformations by learning a linear subspace model that factors out body pose and shape, as in DRAPE. However, our model is learned from real data, and deformations are applied per vertex for speed and flexibility regarding graphics pipelines. Our strategy ensures deformations are represented compactly and with high realism. First, we compute robust template-based non-rigid registrations from a 4D scan sequence (Sect. 3.1), then a clothing deformation statistical model is derived (Sect. 3.2), and finally a regression model is learned for pose retargeting (Sect. 3.3).
3.1 Data preparation
For each type of clothing, we capture 4D scan sequences at 60 fps (e.g., 10.8k frames for 3 min) of a subject in motion, dressed in a full-body suit with one piece of clothing with colored boundaries on top. Each frame consists of a 3D surface mesh with around 200k vertices, yielding very detailed folds on the surface but partially corrupted by holes and noise (see Fig. 1a). This setup allows a simple color-based 3D clothing extraction. In addition, capturing only one garment prevents occlusions where clothing normally overlaps (e.g., waistbands), and garments can be freely combined with each other.
3D body pose is estimated at each frame using a method in the spirit of . We define a skeleton with joints described by parameters representing transformation and bone length. Joint parameters are also adjusted to body shape, which is estimated using [31, 55]. The posed human body is obtained using a linear blend skinning function that transforms (any subset of) vertices of a 3D deformable human template in normalized pose (e.g., T-pose) to a pose defined by skeleton joints.
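As an illustration of the linear blend skinning step, the following is a minimal NumPy sketch. The function name, the array layouts, and the use of one 4x4 transform per joint are assumptions made for this example; the text does not specify how the skinning function is implemented.

```python
import numpy as np

def linear_blend_skinning(vertices, weights, transforms):
    """Pose template vertices with linear blend skinning (illustrative sketch).

    vertices:   (V, 3) template vertices in normalized pose (e.g., T-pose)
    weights:    (V, J) skinning weights, each row summing to 1
    transforms: (J, 4, 4) homogeneous transform of each joint
    """
    num_vertices = vertices.shape[0]
    homogeneous = np.concatenate([vertices, np.ones((num_vertices, 1))], axis=1)
    # Position of every vertex under every joint transform: (J, V, 4)
    per_joint = np.einsum('jab,vb->jva', transforms, homogeneous)
    # Weighted average of the per-joint positions: (V, 4)
    blended = np.einsum('vj,jva->va', weights, per_joint)
    return blended[:, :3]

# Two vertices, two joints; the second joint is translated by 2 along y.
verts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
skin_weights = np.array([[1.0, 0.0], [0.5, 0.5]])
joint_transforms = np.stack([np.eye(4), np.eye(4)])
joint_transforms[1, :3, 3] = [0.0, 2.0, 0.0]
posed = linear_blend_skinning(verts, skin_weights, joint_transforms)
```

Because the function accepts any subset of vertices together with the matching weight rows, the same call could pose a clothing template that shares vertices with the body template.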
We define a template of clothing by choosing a subset of the human template with consistent topology. The clothing template should contain enough vertices to model deformations (e.g., 5k vertices for a T-shirt), as shown in Fig. 3. The clothing template is then registered to the 4D scan sequence using a variant of non-rigid ICP based on grid deformation [30, 20]. The following objective function, which aims at optimizing affine transformations of grid nodes, is iteratively minimized using the Gauss-Newton method:
where the data term aligns template vertices with their nearest neighbors on the target scans, the rigidity term encourages each triangle deformation to be as rigid as possible, and the smoothness term penalizes inconsistent deformation of neighboring triangles. In addition, we introduce a boundary energy term to ensure alignment of boundary vertices, which is unlikely to occur otherwise (see below for details). The weights of the rigidity, smoothness, and boundary terms are set by experiments. One template registration takes around 15 s (using CPU only).
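Read together with the term descriptions above, the registration objective is presumably a weighted sum of four energies. The following is only a sketch of that structure: the symbol names are placeholders chosen for this illustration, and neither they nor the exact form are taken from the text.

```latex
E_{\mathrm{reg}} = E_{\mathrm{data}}
  + \omega_{r}\, E_{\mathrm{rigid}}
  + \omega_{s}\, E_{\mathrm{smooth}}
  + \omega_{b}\, E_{\mathrm{bound}}
```

Here $\omega_{r}$, $\omega_{s}$, and $\omega_{b}$ stand for the three experimentally chosen weights.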
During data capture, the boundaries of the clothing are marked in a distinguishable color, and the corresponding scan points are assigned to a scan boundary set. The boundary points on the template form the template boundary set. Matching point pairs between the two sets should be distributed equally among the scan and template, and ideally capture all details in the folds. As this is not the case if each template boundary point is simply paired with the closest scan boundary point (see Fig. 4), we select instead a match for each template boundary point via the following formula:
Notice that the set of candidates for a given point might be empty. This ensures consistency along the boundary and better captures high frequency details (which are potentially further away).
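The matching formula itself is not reproduced here. The sketch below shows one rule that is consistent with the description (candidate sets that may be empty, and a preference for points that are further away); it is an assumption made for illustration and not necessarily the rule used by the authors. Each scan boundary point is first assigned to its nearest template boundary point, and each template point then takes the farthest of its assigned scan points.

```python
import numpy as np

def match_boundary(template_pts, scan_pts):
    """Hypothetical boundary matching: farthest scan point among those
    whose nearest template boundary point is the query point.

    template_pts: (T, 3) boundary points on the template
    scan_pts:     (S, 3) boundary points on the scan
    Returns a dict {template index: scan index}; template points with an
    empty candidate set are left unmatched.
    """
    # Pairwise distances between template and scan boundary points: (T, S)
    dist = np.linalg.norm(template_pts[:, None, :] - scan_pts[None, :, :], axis=2)
    # For every scan point, the index of its nearest template point
    nearest_template = dist.argmin(axis=0)
    matches = {}
    for t in range(template_pts.shape[0]):
        candidates = np.flatnonzero(nearest_template == t)
        if candidates.size == 0:
            continue  # empty candidate set: no match for this point
        matches[t] = int(candidates[dist[t, candidates].argmax()])
    return matches

template_boundary = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
scan_boundary = np.array([[0.0, 1.0, 0.0], [0.0, 3.0, 0.0], [1.0, 0.0, 0.0]])
pairs = match_boundary(template_boundary, scan_boundary)
```

In this toy case all three scan points are closest to the first template point, so the second template point has no candidates and stays unmatched.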
3.2 Statistical model
The statistical model is computed using linear subspace decomposition by PCA. Poses of all registered meshes are factored out from the model by pose-normalization using inverse skinning. In what remains, meshes in normalized pose are marked with a bar. Each registration can be represented by a mean shape and vertex offsets, such that the registration is the sum of the two, where the mean shape is obtained by averaging vertex positions over all registrations. The principal directions of the matrix of stacked vertex offsets are obtained by singular value decomposition. Ordered by the largest singular values, the corresponding singular vectors contain information about the most dominant deformations.
Finally, each registration can be compactly represented by a small set of shape parameters (instead of its vertex coordinates), with a linear blend shape function that, given a pose, combines the mean shape with a weighted sum of the leading singular vectors. In practice, a truncated basis is sufficient to represent all registrations with a negligible error (less than 5 mm).
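The subspace construction described in this section can be sketched with NumPy as follows. The function names, array shapes, and the random stand-in data are assumptions for illustration, and the final skinning of the reconstructed shape to a given pose is omitted.

```python
import numpy as np

def fit_subspace(meshes, k):
    """PCA of pose-normalized registrations via SVD (illustrative sketch).

    meshes: (n, V, 3) registered meshes in normalized pose
    k:      number of principal directions to keep
    """
    n = meshes.shape[0]
    flat = meshes.reshape(n, -1)          # one row per registration
    mean = flat.mean(axis=0)              # mean shape
    offsets = flat - mean                 # vertex offsets from the mean
    # Rows of vt are principal directions, ordered by decreasing singular value
    _, _, vt = np.linalg.svd(offsets, full_matrices=False)
    basis = vt[:k]
    params = offsets @ basis.T            # shape parameters per registration
    return mean, basis, params

def reconstruct(mean, basis, params):
    """Mean shape plus a weighted sum of principal directions."""
    return (mean + params @ basis).reshape(params.shape[0], -1, 3)

rng = np.random.default_rng(0)
toy_meshes = rng.normal(size=(20, 50, 3))   # random stand-in for registrations
mean_shape, directions, coeffs = fit_subspace(toy_meshes, k=20)
rebuilt = reconstruct(mean_shape, directions, coeffs)
max_error = float(np.abs(rebuilt - toy_meshes).max())
```

With as many directions as registrations the toy data is reproduced up to floating-point error; lowering `k` trades reconstruction error for compactness.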
3.3 Pose-to-shape prediction
We now learn a predictive model that takes joint poses as inputs and outputs a set of shape parameters. This allows powerful applications where deformations are induced by pose. To take into account deformation dynamics that occur during human motion, the model is also trained with pose velocity, acceleration, and shape parameter history. These inputs are concatenated in a control vector, and the predictive model can be obtained using autoregressive models [2, 39, 31].
In our experiments with clothing, we solved for the model in a straightforward way by linear regression, where the model is represented as a matrix obtained using the Moore-Penrose inverse of the matrix of control vectors. While this allows for (limited) pose retargeting, we observed a loss in reconstruction details. One reason is that under motion, the same pose can give rise to various configurations of folds depending on the direction of movement, speed, and previous fold configurations.
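A minimal sketch of the linear regression step, assuming that control vectors and shape parameters are stored column-wise (one column per frame). The matrix orientation and the names are choices made for this example rather than details given in the text.

```python
import numpy as np

def fit_linear_regressor(controls, shape_params):
    """Least-squares linear map from control vectors to shape parameters.

    controls:     (d, n) one control vector per column
    shape_params: (k, n) one shape parameter vector per column
    Returns a (k, d) matrix F such that F @ controls approximates shape_params.
    """
    return shape_params @ np.linalg.pinv(controls)

rng = np.random.default_rng(0)
true_map = rng.normal(size=(4, 6))
toy_controls = rng.normal(size=(6, 50))
toy_params = true_map @ toy_controls        # noise-free synthetic targets
estimated_map = fit_linear_regressor(toy_controls, toy_params)
```

On noise-free synthetic data the underlying map is recovered exactly; on real sequences the same fit averages over fold configurations that share a pose, which is in line with the loss of detail noted above.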
To obtain a non-linear mapping, we consider the components of the control vectors and shape parameters as multivariate time series, and train a deep multi-layer recurrent neural network (RNN). A sequence-to-sequence encoder-decoder architecture with Long Short-Term Memory (LSTM) units is well suited as it allows continuous predictions, while being easier to train than RNNs and outperforming shallow LSTMs. We compose the control vector with joint parameter poses, plus velocity and acceleration of the joint root. The MSE values compared to linear regression are reported in Sect. 5.3.
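To make the composition of the control vector concrete, the sketch below concatenates per-frame joint poses with the velocity and acceleration of the joint root. Estimating the derivatives by finite differences with `np.gradient`, the 60 fps default, and all names are assumptions for this example; the text does not state how velocity and acceleration are computed.

```python
import numpy as np

def build_control_vectors(joint_poses, root_positions, fps=60.0):
    """Concatenate joint poses with root velocity and acceleration.

    joint_poses:    (T, P) joint parameters per frame
    root_positions: (T, 3) position of the joint root per frame
    Returns an array of shape (T, P + 6).
    """
    dt = 1.0 / fps
    # Finite-difference estimates along the time axis (assumed, not from the text)
    velocity = np.gradient(root_positions, dt, axis=0)
    acceleration = np.gradient(velocity, dt, axis=0)
    return np.concatenate([joint_poses, velocity, acceleration], axis=1)

# Root moving at constant velocity (1, 2, 3) units per second over 5 frames.
frame_times = np.arange(5) / 60.0
toy_root = np.outer(frame_times, [1.0, 2.0, 3.0])
toy_poses = np.zeros((5, 4))
control = build_control_vectors(toy_poses, toy_root)
```

Each row of `control` could then serve as one time step of the multivariate series fed to a sequence model.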
4 Fine Wrinkle Generation
Our goal is to recover all observable geometric details. As previously mentioned, template-based methods and subspace-based methods [21, 19] cannot recover every detail such as fine cloth wrinkles due to resolution and data scaling limitations, as illustrated in Fig. 5.
Assuming the finest details are captured at sensor image pixel resolution, and are reconstructed in 3D (e.g., using a 4D scanner as in [6, 38]), all existing geometric details can then be encoded in a normal map of the 3D scan surface at lower resolution (see Fig. 6). To automatically add fine details on the fly to reconstructed clothing, we propose to leverage normal maps using a generative adversarial network. See Fig. 8 for the architecture. In particular, our network induces temporal consistency on the normal maps to increase realism in animation applications.
4.1 Data preparation
We take as inputs a 4D scan sequence and a sequence of corresponding reconstructed garments. The latter can be obtained by registration, by reconstruction using blend shapes, or by regression, as detailed in Sect. 3. Clothing template meshes are equipped with UV maps, which are used to project any pixel from an image to a point on a mesh surface, hence assigning a property encoded in a pixel to each point. Therefore, normal coordinates can be normalized and stored as pixel colors in normal maps. Our training dataset then consists of pairs of normal maps (see Fig. 7): low resolution (LR) normal maps obtained by blend shape reconstruction, and high resolution (HR) normal maps obtained from the scans. For LR normal maps, the normal at a surface point (lying in a face) is linearly interpolated from vertex normals. For HR normal maps, per-pixel normals are obtained by projection of the high resolution observations (i.e., 4D scan) onto triangles of the corresponding low resolution reconstruction, and then the normal information is transferred using the UV map of the template. Note that normal maps cannot be directly calculated from scans because the exact area of the garment is not defined on them, and they are not equipped with a UV map. Also, our normals are represented in global coordinates, as opposed to tangent space coordinates as is standard for normal maps. The reason is that LR normal maps contain no additional information to the geometry and are therefore constant in tangent space. This makes them unfit for conditioning our adversarial neural network.
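The two operations described above, interpolating a normal inside a face and storing normals as pixel colors, can be sketched as follows. The mapping of each coordinate from [-1, 1] to [0, 255] is the common normal-map convention and is assumed here; the function names and the 8-bit storage are likewise choices made for this example.

```python
import numpy as np

def interpolate_normal(corner_normals, barycentric):
    """Normal at a point inside a triangle, linearly interpolated from the
    three vertex normals with barycentric weights, then renormalized."""
    normal = barycentric @ corner_normals      # (3,) @ (3, 3) -> (3,)
    return normal / np.linalg.norm(normal)

def encode_normals(normals):
    """Map unit normals with coordinates in [-1, 1] to 8-bit RGB colors."""
    return np.round((normals + 1.0) * 0.5 * 255.0).astype(np.uint8)

def decode_normals(rgb):
    """Inverse mapping from 8-bit RGB colors back to unit normals."""
    normals = rgb.astype(np.float64) / 255.0 * 2.0 - 1.0
    return normals / np.linalg.norm(normals, axis=-1, keepdims=True)

corners = np.array([[0.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
center_normal = interpolate_normal(corners, np.array([1 / 3, 1 / 3, 1 / 3]))

# A tiny 1x2 "normal map" in global coordinates
toy_map = np.array([[[0.0, 0.0, 1.0], [0.6, 0.0, 0.8]]])
pixels = encode_normals(toy_map)
recovered = decode_normals(pixels)
```

The round trip loses a small amount of precision to 8-bit quantization, which is one reason a higher bit depth may be preferable in practice.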
4.2 Network architecture
Due to the nature of our problem, it is natural to explore network architectures designed to enhance images (i.e., super-resolution applications). From our experiments, we observed that models trained on natural images, including those containing a perceptual loss term (e.g., SR-GAN), fail. On the other hand, cloth deformations exhibit smooth patterns (wrinkles, creases, folds) that deform continuously in time. In addition, at a finer level, materials and fabric texture also contain high frequency details.
Our proposed network is based on a conditional Generative Adversarial Network (cGAN) inspired by image transfer. We also use a convolution-batchnorm-ReLU structure and a U-Net in the generative network, since we want latent information to be transferred across the network layers and the overall structure of the image to be preserved. This happens thanks to the skip connections. The discriminator only penalizes structure at the scale of patches, and works as a texture loss. Our network is conditioned by low-resolution normal map images, which will be enhanced with fine details learned from our real data normal maps. See Fig. 8 for the complete architecture.
Temporal consistency is achieved by extending the network loss term. For compelling animations, it is not only important that each frame looks realistic, but also that no sudden jumps occur in the rendering. To ensure smooth transitions between consecutively generated images across time, we introduce an additional loss to the GAN objective that penalizes discrepancies between generated images at a given time step and expected images (from the training dataset) at the adjacent time step:
where the data term helps to generate images near to ground truth in an L1 sense (for less blurring). The temporal consistency term is meant to capture global fold movements over the surface. If something appears somewhere, most of the time it should have disappeared close by, and vice versa. Our term does not take spatial proximity into account though. We also tried temporal consistency terms based on two different norms, and report the results in Table 1. See Fig. 9 for a comparison of results with and without the temporal consistency term.
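The loss equation is not reproduced here, so the following sketch only illustrates one form that matches the description: an L1 data term plus a temporal term that takes the absolute value of the summed signed difference, which makes it insensitive to where on the surface a change occurs. The exact form and normalization of the temporal term, the use of the previous frame as the adjacent time step, and the assignment of the weights 100, 50, and 1 to the data, temporal, and GAN terms are assumptions of this example.

```python
import numpy as np

def data_loss(generated, target):
    """Mean absolute (L1) difference to the ground-truth image."""
    return float(np.abs(generated - target).mean())

def temporal_loss(generated_t, target_prev):
    """Absolute value of the summed signed difference between the generated
    image and the expected image of the adjacent frame (assumed form)."""
    return float(np.abs((generated_t - target_prev).sum()))

def generator_objective(gan_term, generated_t, target_t, target_prev,
                        w_data=100.0, w_temp=50.0, w_gan=1.0):
    """Weighted sum of the adversarial, data, and temporal terms."""
    return (w_gan * gan_term
            + w_data * data_loss(generated_t, target_t)
            + w_temp * temporal_loss(generated_t, target_prev))

# A feature that moves from one pixel to another between frames
frame_a = np.zeros((2, 2))
frame_a[0, 0] = 1.0
frame_b = np.zeros((2, 2))
frame_b[1, 1] = 1.0
moved_temporal = temporal_loss(frame_a, frame_b)
moved_data = data_loss(frame_a, frame_b)
total = generator_objective(0.0, frame_a, frame_b, frame_b)
```

In this toy case the temporal term is zero because what appears in one pixel disappears in another, while the L1 data term still registers the change.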
This section evaluates the results of our reconstruction. 4D scan sequences were captured using a temporal-3dMD system (4D). Sequences are captured at 60 fps. Each frame consists of a colored mesh with 200k vertices. Here, we show results on two different shirts (for female and male). We trained the cGAN network on a dataset of 9213 consecutive frames. The first 8000 images compose the training data set, the next 1000 images the test data set, and the remaining 213 images the validation set. Test and validation sets contain poses and movements not seen in the training set. The U-Net auto-encoder is constructed with layers, and 64 filters in each of the first convolutional layers. The discriminator uses patches of size . The data term weight is set to 100, the temporal term weight is 50, while the GAN weight is 1. The images have a resolution of , although our early experiments also showed promising results on .
5.1 Comparison of approaches
We compare our results to different approaches (see Fig. 10). A physics-based simulation done by a 3D artist using MarvelousDesigner returns a mesh imitating similar material properties as our scan and with a comparable amount of folds but containing vertices (i.e., an order of magnitude more). A linear subspace reconstruction with coefficients derived from the registrations (Sect. 3) produces a mostly flat surface, while the registration itself shows smooth approximations of the major folds in the scan. Our method, DeepWrinkles, adds all high frequency details seen in the scan to the reconstructed surface. These three methods use a mesh with vertices. DeepWrinkles is shown with a normal map image.
Table 1. Columns: temp, temp, Eq. 4, no temp, Registr., BS 500, BS 200, Regre.
5.2 Importance of reconstruction details in input
Our initial experiments showed promising results reconstructing details from the original registration normal maps. To show the efficacy of the method it is not only necessary to reconstruct details from registration, but also from blend shapes, and after regression. We replaced the input images in the training set by normal maps constructed from the blend shapes with 500, 200 and 100 basis functions and one set from the regression reconstruction. The goal is to determine the amount of detail that is necessary in the input to obtain realistic detailed wrinkles. Table 1 shows the error rates in each experiment.
basis functions seem sufficient for a reasonable amount of detail in the result. Probably because the reconstruction from regression is noisier and bumpier, the neural network is not capable of reconstructing long, well-defined folds and instead produces a lot of higher frequency wrinkles (see Fig. 11). This is an indicator that the structures of the inputs are only refined by the net, and important folds have to be visible in the input.
The final goal is to be able to scan one piece of clothing in one or several sequences and then transfer it to new persons with new movements on the fly.
We experimented with various combinations of control vectors, including pose, shape, joint root velocity, and acceleration history. It turns out most formulations in the literature are difficult to train or unstable [2, 39, 31]. We restrict the joint parameters to those directly related to each piece of clothing to reduce the dimensionality. In the case of shirts, this leaves the parameters related to the upper body. In general, linear regression generalized best but smoothed out a lot of overall geometric details, even in the training set. We evaluated on 9213 frames for 500 and 1000 blend shapes: and .
On the other hand, we trained an encoder-decoder with LSTM units (4 layers with dimension 256), using inputs and outputs equally of length 3 (see Sect. 3.3). We obtained promising results: . Supplemental materials show visually good reconstructed sequences.
In Sect. 3.2, we represented clothing with folds as offsets of a mean shape. The same can be done with a human template for persons with different body shapes. Each person in normalized pose can be represented as an average template plus a vertex-wise offset. Given that the clothing mean shape contains a subset of vertices of the human template, it can be adjusted to any deformation of the template through a restriction operator that limits the vertices of the human template to those used for clothing. Then the mean in the blend shape can simply be replaced by the adjusted mean shape. Equation 3 becomes:
Replacing the mean shape affects surface normals. Hence, it is necessary to use normal maps in tangent space at rendering time. This makes them applicable to any body shape (see Fig. 12).
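The body shape retargeting described above can be illustrated with a short sketch. It assumes that the clothing mean shape is adjusted by adding the body offsets of the vertices it shares with the human template, and that the restriction is a simple index selection; both are interpretations made for this example, and the final skinning to a pose is omitted.

```python
import numpy as np

def retarget_mean_shape(cloth_mean, body_offsets, cloth_vertex_ids):
    """Adjust the clothing mean shape to a new body shape (assumed form).

    cloth_mean:       (Vc, 3) clothing mean shape in normalized pose
    body_offsets:     (Vb, 3) vertex-wise offsets of the person from the
                      average human template
    cloth_vertex_ids: (Vc,) indices of the human template vertices that
                      are used for clothing (the restriction)
    """
    return cloth_mean + body_offsets[cloth_vertex_ids]

def blend_shape(mean, basis, params):
    """Mean shape plus a weighted sum of principal directions."""
    return mean + (params @ basis).reshape(-1, 3)

toy_cloth_mean = np.zeros((2, 3))
toy_body_offsets = np.arange(12, dtype=float).reshape(4, 3)
toy_ids = np.array([1, 3])
new_mean = retarget_mean_shape(toy_cloth_mean, toy_body_offsets, toy_ids)

toy_basis = np.zeros((1, 6))
toy_basis[0, 0] = 1.0                      # one direction moving vertex 0 along x
shaped = blend_shape(new_mean, toy_basis, np.array([2.0]))
```

Since only the mean changes, previously learned shape parameters can be reused unchanged for the new body shape.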
We present DeepWrinkles, an entirely data-driven framework to capture and reconstruct clothing in motion from 4D scan sequences. Our evaluations show that high frequency details can be added to low resolution normal maps using a conditional adversarial neural network. We introduce an additional temporal loss to the GAN objective that preserves geometric consistency across time, and show qualitative and quantitative evaluations on different datasets. We also give details on how to create low resolution normal maps from registered data, as it turns out registration fidelity is crucial for the cGAN training. The two presented modules are complementary to achieve accurate and realistic rendering of global shape and details of clothing. To the best of our knowledge, our method exceeds the level of detail of the current state of the art in both physics-based simulation and data-driven approaches by far. Additionally, the space requirement of a normal map is negligible in comparison to increasing the resolution of the clothing mesh, which makes our pipeline suitable for standard 3D engines.
High resolution normal maps can have missing information in areas not seen by cameras, such as armpit areas. Hence, visually disruptive artifacts can occur although the clothing template can fix most of the issues (e.g., by doing a pass of smoothing). At the moment pose retargeting works best when new poses are similar to ones included in the training dataset. Although the neural network is able to generalize to some unseen poses, reconstructing the global shape from a new joint parameter sequence can be challenging. This should be fixed by scaling the dataset.
The scanning setup can be extended to reconstruct all body parts with sufficient detail and without occlusions, and our method can be applied to more diverse types of clothing and accessories, such as coats and scarves. Normal maps could also be used to add fine details like buttons, which are hard to capture in 3D.
Acknowledgements. We would like to thank the FRL teams for their support, and Vignesh Ganapathi-Subramanian for preliminary work on the subspace model.
-  www.3dmd.com, temporal-3dmd systems (4d) (2018)
-  de Aguiar, E., Sigal, L., Treuille, A., Hodgins, J.: Stable spaces for real-time clothing. In: Hart, J.C. (ed.) ACM Transactions on Graphics. Association for Computing Machinery (ACM) (2010)
-  de Aguiar, E., Stoll, C., Theobalt, C., Ahmed, N., Seidel, H., Thrun, S.: Performance capture from sparse multi-view video. In: Hart, J.C. (ed.) ACM Transactions on Graphics. vol. 27, pp. 98:1–98:10. Association for Computing Machinery (ACM) (2008)
-  Baraff, D., Witkin, A., Kass, M.: Untangling cloth. In: Hart, J.C. (ed.) ACM Transactions on Graphics. vol. 22, pp. 862–870. Association for Computing Machinery (ACM), New York, NY, USA (Jul 2003)
-  Beeler, T., Hahn, F., Bradley, D., Bickel, B., Beardsley, P., Gotsman, C., Sumner, R.W., Gross, M.: High-quality passive facial performance capture using anchor frames. In: Alexa, M. (ed.) ACM Transactions on Graphics. vol. 30, pp. 75:1–75:10. Association for Computing Machinery (ACM), New York, NY, USA (August 2011)
-  Bogo, F., Romero, J., Pons-Moll, G., Black, M.J.: Dynamic FAUST: Registering human bodies in motion. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE (Jul 2017)
-  Bradley, D., Popa, T., Sheffer, A., Heidrich, W., Boubekeur, T.: Markerless garment capture. In: Hart, J.C. (ed.) ACM Transactions on Graphics. vol. 27, pp. 99:1–99:9. Association for Computing Machinery (ACM), New York, NY, USA (Aug 2008)
-  Bridson, R., Marino, S., Fedkiw, R.: Simulation of clothing with folds and wrinkles. In: Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation. pp. 28–36. Eurographics Association, San Diego, CA, USA (Jul 2003)
-  Choi, K.J., Ko, H.S.: Stable but responsive cloth. In: Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques. pp. 604–611. SIGGRAPH, ACM, New York, NY, USA (2002)
-  Chu, M., Thuerey, N.: Data-driven synthesis of smoke flows with CNN-based feature descriptors. In: Alexa, M. (ed.) ACM Transactions on Graphics. vol. 36, pp. 69:1–69:14. Association for Computing Machinery (ACM), New York, NY, USA (Jul 2017)
-  Cignoni, P., Montani, C., Scopigno, R., Rocchini, C.: A general method for preserving attribute values on simplified meshes. In: Proceedings of IEEE Conference on Visualization. pp. 59–66. IEEE (1998)
-  Cohen, J.D., Olano, M., Manocha, D.: Appearance-preserving simplification. In: Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH. pp. 115–122. ACM, Orlando, FL, USA (Jul 1998)
-  Danerek, R., Dibra, E., Öztireli, C., Ziegler, R., Gross, M.: Deepgarment : 3d garment shape estimation from a single image. In: Chen, M., Zhang, R. (eds.) Computer Graphics Forum. vol. 36, pp. 269–280. Eurographics Association (2017)
-  Dou, M., Khamis, S., Degtyarev, Y., Davidson, P., Fanello, S.R., Kowdle, A., Escolano, S.O., Rhemann, C., Kim, D., Taylor, J., Kohli, P., Tankovich, V., Izadi, S.: Fusion4d: Real-time performance capture of challenging scenes. In: Alexa, M. (ed.) ACM Transactions on Graphics. vol. 35, pp. 114:1–114:13. Association for Computing Machinery (ACM), New York, NY, USA (Jul 2016)
-  Fyffe, G., Nagano, K., Huynh, L., Saito, S., Busch, J., Jones, A., Li, H., Debevec, P.: Multi-view stereo on consistent face topology. Computer Graphics Forum 36(2), 295–309 (May 2017)
-  Goldenthal, R., Harmon, D., Fattal, R., Bercovier, M., Grinspun, E.: Efficient simulation of inextensible cloth. In: Hart, J.C. (ed.) ACM Transactions on Graphics. vol. 26, p. 49. Association for Computing Machinery (ACM) (2007)
-  Goldlücke, B., Cremers, D.: Superresolution texture maps for multiview reconstruction. In: IEEE 12th International Conference on Computer Vision, ICCV. pp. 1677–1684. IEEE Computer Society (Sep 2009)
-  Goodfellow, I.J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A.C., Bengio, Y.: Generative adversarial nets. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger., K. (eds.) Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems (NIPS). pp. 2672–2680. Curran Associates, Inc., Montreal, Quebec, Canada (Dec 2014)
-  Guan, P., Reiss, L., Hirshberg, D.A., Weiss, A., Black, M.J.: DRAPE: dressing any person. ACM Transactions on Graphics 31(4), 35:1–35:10 (2012)
-  Guo, K., Xu, F., Wang, Y., Liu, Y., Dai, Q.: Robust non-rigid motion tracking and surface reconstruction using l0 regularization. pp. 3083–3091. IEEE Computer Society (2015)
-  Hahn, F., Thomaszewski, B., Coros, S., Sumner, R.W., Cole, F., Meyer, M., DeRose, T., Gross, M.H.: Subspace clothing simulation using adaptive bases. In: Alexa, M. (ed.) ACM Transactions on Graphics. vol. 33, pp. 105:1–105:9. Association for Computing Machinery (ACM) (2014)
-  Hernández, C., Vogiatzis, G., Brostow, G.J., Stenger, B., Cipolla, R.: Non-rigid photometric stereo with colored lights. In: IEEE 11th International Conference on Computer Vision, ICCV. pp. 1–8. IEEE Computer Society, Rio de Janeiro, Brazil (Oct 2007)
-  Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning - Volume 37. pp. 448–456. ICML, ACM (2015)
-  Isola, P., Zhu, J., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR. pp. 5967–5976. IEEE Computer Society, Honolulu, HI, USA (Jul 2017)
-  Kavan, L., Gerszewski, D., Bargteil, A.W., Sloan, P.P.: Physics-inspired upsampling for cloth simulation in games. In: Hart, J.C. (ed.) ACM Transactions on Graphics. vol. 30, pp. 93:1–93:10. Association for Computing Machinery (ACM), New York, NY, USA (Jul 2011)
-  Krishnamurthy, V., Levoy, M.: Fitting smooth surfaces to dense polygon meshes. In: Proceedings of the 23rd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH. New Orleans, LA, USA (Aug.)
-  Lassner, C., Pons-Moll, G., Gehler, P.V.: A generative model of people in clothing. In: IEEE International Conference on Computer Vision, ICCV. pp. 853–862. IEEE Computer Society, Venice, Italy (Oct 2017)
-  Ledig, C., Theis, L., Huszar, F., Caballero, J., Cunningham, A., Acosta, A., Aitken, A.P., Tejani, A., Totz, J., Wang, Z., Shi, W.: Photo-realistic single image super-resolution using a generative adversarial network. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR. pp. 105–114. IEEE Computer Society, Honolulu, HI, USA (Jul 2017)
-  Leroy, V., Franco, J., Boyer, E.: Multi-view dynamic shape refinement using local temporal integration. In: IEEE International Conference on Computer Vision, ICCV. pp. 3113–3122. Venice, Italy (2017)
-  Li, H., Adams, B., Guibas, L.J., Pauly, M.: Robust single-view geometry and motion reconstruction. In: Hart, J.C. (ed.) ACM Transactions on Graphics. vol. 28, pp. 175:1–175:10. Association for Computing Machinery (ACM), New York, NY, USA (Dec 2009)
-  Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: a skinned multi-person linear model. In: Alexa, M. (ed.) ACM Transactions on Graphics. vol. 34, pp. 248:1–248:16. Association for Computing Machinery (ACM) (2015)
-  MarvelousDesigner: www.marvelousdesigner.com (2018)
-  Matsuyama, T., Nobuhara, S., Takai, T., Tung, T.: 3D Video and Its Applications. Springer (2012)
-  Miguel, E., Bradley, D., Thomaszewski, B., Bickel, B., Matusik, W., Otaduy, M.A., Marschner, S.: Data-driven estimation of cloth simulation models. Computer Graphics Forum 31(2), 519–528 (2012)
-  Müller, M., Chentanez, N.: Wrinkle meshes. In: Proceedings of the 2010 ACM SIGGRAPH/Eurographics Symposium on Computer Animation. pp. 85–92. SCA ’10, Eurographics Association, Aire-la-Ville, Switzerland (2010)
-  Newcombe, R.A., Fox, D., Seitz, S.M.: DynamicFusion: Reconstruction and tracking of non-rigid scenes in real-time. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR. pp. 343–352. IEEE Computer Society, Boston, MA, USA (Jun 2015)
-  Olszewski, K., Li, Z., Yang, C., Zhou, Y., Yu, R., Huang, Z., Xiang, S., Saito, S., Kohli, P., Li, H.: Realistic dynamic facial textures from a single image using GANs. In: IEEE International Conference on Computer Vision, ICCV. pp. 5439–5448. IEEE Computer Society, Venice, Italy (Oct 2017)
-  Pons-Moll, G., Pujades, S., Hu, S., Black, M.: ClothCap: Seamless 4D clothing capture and retargeting. ACM Transactions on Graphics (Proc. SIGGRAPH) 36(4) (2017)
-  Pons-Moll, G., Romero, J., Mahmood, N., Black, M.J.: Dyna: A model of dynamic human shape in motion. In: Alexa, M. (ed.) ACM Transactions on Graphics. vol. 34, pp. 120:1–120:14. Association for Computing Machinery (ACM) (Aug 2015)
-  Saito, S., Wei, L., Hu, L., Nagano, K., Li, H.: Photorealistic facial texture inference using deep neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR. pp. 2326–2335 (Jul 2017)
-  Starck, J., Hilton, A.: Surface capture for performance-based animation. IEEE Computer Graphics and Applications 27(3), 21–31 (2007)
-  Sumner, R.W., Popović, J.: Deformation transfer for triangle meshes. In: Hart, J.C. (ed.) ACM Transactions on Graphics. vol. 23, pp. 399–405. Association for Computing Machinery (ACM), New York, NY, USA (Aug 2004)
-  Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K. (eds.) Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems (NIPS). pp. 3104–3112. Curran Associates, Inc., Montreal, Quebec, Canada (Dec 2014)
-  Taylor, J., Shotton, J., Sharp, T., Fitzgibbon, A.: The Vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation. In: Computer Vision and Pattern Recognition (CVPR). pp. 103–110. IEEE Computer Society (Jul 2012)
-  Tsiminaki, V., Franco, J., Boyer, E.: High resolution 3D shape texture from multiple videos. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR. pp. 1502–1509. IEEE Computer Society, Columbus, OH, USA (Jun 2014)
-  Tung, T., Matsuyama, T.: Intrinsic characterization of dynamic surfaces. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 233–240. IEEE Computer Society, Portland, OR, USA (Jun 2013)
-  Tung, T., Nobuhara, S., Matsuyama, T.: Simultaneous super-resolution and 3D video using graph-cuts. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, Anchorage, Alaska, USA (Jun 2008)
-  Tung, T., Nobuhara, S., Matsuyama, T.: Complete multi-view reconstruction of dynamic scenes from probabilistic fusion of narrow and wide baseline stereo. In: IEEE 12th International Conference on Computer Vision, ICCV. pp. 1709–1716. IEEE Computer Society, Kyoto, Japan (Sep 2009)
-  Vlasic, D., Baran, I., Matusik, W., Popović, J.: Articulated mesh animation from multi-view silhouettes. In: Hart, J.C. (ed.) ACM Transactions on Graphics. pp. 97:1–97:9. Association for Computing Machinery (ACM), New York, NY, USA (2008)
-  Vlasic, D., Peers, P., Baran, I., Debevec, P., Popović, J., Rusinkiewicz, S., Matusik, W.: Dynamic shape capture using multi-view photometric stereo. In: Proceedings of ACM SIGGRAPH Asia. pp. 174:1–174:11. ACM, New York, NY, USA (2009)
-  Volino, P., Magnenat-Thalmann, N.: Virtual clothing - theory and practice. Springer (2000)
-  Wang, H., Hecht, F., Ramamoorthi, R., O’Brien, J.F.: Example-based wrinkle synthesis for clothing animation. In: Alexa, M. (ed.) ACM Transactions on Graphics. vol. 29, pp. 107:1–8. Association for Computing Machinery (ACM), Los Angeles, CA (Jul 2010)
-  Wang, H., Ramamoorthi, R., O’Brien, J.F.: Data-driven elastic models for cloth: Modeling and measurement. In: Alexa, M. (ed.) ACM Transactions on Graphics. vol. 30, pp. 71:1–11. Association for Computing Machinery (ACM), Vancouver, BC, Canada (Jul 2011)
-  White, R., Crane, K., Forsyth, D.A.: Capturing and animating occluded cloth. In: Hart, J.C. (ed.) ACM Transactions on Graphics. vol. 26, p. 34. Association for Computing Machinery (ACM) (2007)
-  Zhang, C., Pujades, S., Black, M.J., Pons-Moll, G.: Detailed, accurate, human shape estimation from clothed 3D scan sequences. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5484–5493. IEEE Computer Society, Honolulu, HI, USA (Jul 2017)