Recovering Geometric Information with Learned Texture Perturbations

01/20/2020 ∙ by Jane Wu, et al. ∙ Stanford University

Regularization is used to avoid overfitting when training a neural network; unfortunately, this reduces the attainable level of detail, hindering the ability to capture high-frequency information present in the training data. Even though various approaches may be used to re-introduce high-frequency detail, the result typically does not match the training data and is often not time coherent. In the case of network-inferred cloth, these problems manifest themselves as either a lack of detailed wrinkles or unnatural-looking and/or time-incoherent surrogate wrinkles. Thus, we propose a general strategy whereby high-frequency information is procedurally embedded into low-frequency data so that when the latter is smeared out by the network the former still retains its high-frequency detail. We illustrate this approach by learning texture coordinates which, when smeared, do not in turn smear out the high-frequency detail in the texture itself but merely smoothly distort it. Notably, we prescribe perturbed texture coordinates that are subsequently used to correct the over-smoothed appearance of inferred cloth, and correcting the appearance from multiple camera views naturally recovers lost geometric information.




1 Introduction

Since neural networks are trained to generalize to unseen data, regularization is important for reducing overfitting; see [26, 61]. However, regularization also removes some of the high variance characteristic of much of the physical world. Even though high-quality ground truth data can be collected or generated to reflect the desired complexity of the outputs, regularization will inevitably smooth network predictions. Rather than attempting to directly infer high-frequency features, we alternatively propose to learn a low-frequency space in which such features can be embedded.

Figure 1: Texture coordinate perturbations (texture sliding) reduce shape inference errors: ground truth (blue), prediction (orange). Panels: (a) inferred cloth, (b) texture sliding.

We focus on the specific task of adding high-frequency wrinkles to virtual clothing, noting that the idea of learning a low-frequency embedding may be generalized to other tasks. Because cloth wrinkles/folds are high-frequency features, existing deep neural networks (DNNs) trained to infer cloth shape tend to predict overly smooth meshes [1, 16, 27, 28, 35, 40, 48, 59, 68]. Rather than attempting to amend such errors directly, we perturb texture so that the rendered cloth mesh appears to more closely match the ground truth. See Figure 1. Then given texture perturbations from at least two unique camera views, 3D geometry can be accurately reconstructed [32] to recover high-frequency wrinkles. Similarly, for AR/VR applications, correcting visual appearance from two views (one for each eye) is enough to allow the viewer to accurately discern 3D geometry.

Our proposed texture coordinate perturbations are highly dependent on the camera view. Thus, we demonstrate that one can train a separate texture sliding neural network (TSNN) for each of a finite number of cameras laid out in an array and use nearby networks to interpolate results valid for any view enveloped by the array. Although an approach similar in spirit might be pursued for various lighting conditions, this limitation is left as future work since there are many applications where the light is ambient, diffuse, non-directional, etc. In such situations, this further complication may be ignored without significant repercussion.

2 Related Work

Cloth: While physically-based cloth simulation has matured as a field over the last few decades [7, 8, 12, 13, 63], data-driven methods are attractive for many applications. There is a rich body of work in reconstructing cloth from multiple views or 3D scans, see [11, 22, 66]. More recently, optimization-based methods have been used to generate higher resolution reconstructions [33, 53, 69, 71]. Some of the most interesting work focuses on reconstructing the body and cloth separately [6, 49, 72, 74].

Cloth and Learning: With advances in deep learning, one can aim to reconstruct 3D cloth meshes from single views. A number of approaches reconstruct a joint cloth/body mesh from a single RGB image [1, 4, 48, 57], RGB-D image [73], or video [2, 3, 30, 70]. To reduce the dimensionality of the output space, DNNs are often trained to predict the pose/shape parameters of human body models such as SCAPE [5] or SMPL [41] (see also [52]). [1, 2, 3] predict SMPL model parameters along with per-vertex offsets to add details, and [4] refines the mesh using the network proposed in [34]. [30, 48, 65] leverage predicted pose information to infer shape. Estimating shape from silhouettes given an RGB image has also been explored [18, 19, 48]. When only the garment shape is predicted, a number of recent works output predictions in UV space to represent geometric information as pixels [16, 35, 40], although others [28, 59] define loss functions directly in terms of the 3D cloth vertices.

Wrinkles and Folds: Cloth realism can be improved by introducing wrinkles and folds. In the graphics community, researchers have explored both procedural and data-driven methods for generating wrinkles [17, 27, 31, 47, 56, 67]. Other works add real-world wrinkles as a postprocessing step to improve smooth captured cloth: [54] extracts the edges of cloth folds and then applies space-time deformations, [55] solves for shape deformations directly by optimizing over all frames of a video sequence. Recently, [40] used a conditional Generative Adversarial Network [45] to generate normal maps as proxies for wrinkles on captured cloth.

Geometry: More broadly, deep learning on 3D meshes falls under the umbrella of geometric deep learning, which was coined by [14] to characterize learning in non-Euclidean domains. [60] was one of the earliest works in this area and introduced the notion of a Graph Neural Network (GNN) in relation to CNNs. Subsequent works similarly extend the CNN architecture to graphs and manifolds [9, 42, 44, 46]. [39] introduces a latent representation that explicitly incorporates the Dirac operator to detect principal curvature directions. [64] trains a mesh generative model to generate novel meshes outside an original dataset. Returning to the specific application of virtual cloth, [35] embeds a non-Euclidean cloth mesh into a Euclidean pixel space, making it possible to directly use CNNs to make non-Euclidean predictions.

Texture: In the computer graphics community, textures have historically been used to capture both geometric and material details lost by using simplified models [21, 43], which is similar in spirit to our approach. However, to the best of our knowledge, we are the first to propose learning texture coordinate perturbations to facilitate the accurate reconstruction of lost geometric details. For completeness, we briefly note a few works that use learning for texture synthesis and/or style transfer [20, 23, 24, 29, 36, 58].

3 Methods

We define texture sliding as the changing of texture coordinates on a per-camera basis such that any point which is visible from some stereo pair of cameras can be triangulated back to its ground truth position. Other stereo reconstruction techniques can also be used in place of triangulation because the images we generate are consistent with the ground truth geometry. See [10, 32, 62].

3.1 Per-Vertex Discretization

Since the cloth mesh is discretized into vertices and triangles, we take a per-vertex, not a per-point, approach to texture sliding. Our proposed method (see Section 4.1) computes per-vertex texture coordinates on the inferred cloth that match those of the ground truth as seen by the camera under consideration. Then during 3D reconstruction, barycentric interpolation is used to find the subtriangle locations of the texture coordinates corresponding to ground truth cloth vertices. This assumes linearity, which is only valid when the triangles are small enough to capture the inherent nonlinearities in a piecewise linear sense; moreover, folds and wrinkles can create significant nonlinearity. See Figure 2.

Figure 2: Consider an extreme case, where the inferred cloth has a quite large triangle (shown in red). That triangle should encompass the nonlinear texture region outlined in yellow (shown in pattern space). Note: the yellow curve was generated by sampling the ground truth cloth’s texture coordinates along the projected edges of the red triangle. The linearity assumption implied by barycentric interpolation instead uses the region outlined in green.

3.2 Occlusion Boundaries

Accurate 3D reconstruction requires that a vertex of the ground truth mesh be visible from at least two cameras and that camera projections of the vertex to the inferred cloth exist and are valid. However, occlusions can derail these assumptions.

First, consider things from the standpoint of the inferred cloth. For a given camera view, some inferred cloth triangles will not contain any visible pixels, and we denote a vertex as occluded when none of its incident triangles contain any visible pixels. Although we do not assign perturbed texture coordinates to occluded vertices (they keep their original texture coordinates, i.e., a perturbation of zero), we do aim to keep the texture coordinate perturbation function smooth (see Section 4.2). In addition, there will be so-called non-occluded vertices in the inferred cloth that do not project to visible pixels of the ground truth cloth. This often occurs near silhouette boundaries where the inferred cloth silhouette is sometimes wider than the ground truth cloth silhouette. These vertices are also treated as occluded, similar to those around the back side of the cloth behind the silhouette; in effect, some extra vertices near occlusion boundaries are treated as occluded as well. See Figure 3(a).

Next, consider things from the standpoint of the ground truth cloth. For example, consider the case where all the cameras are in the front, and vertices on the back side of the ground truth cloth are not visible from any camera. The best one can do in reconstructing these occluded vertices is to use the inferred cloth vertex positions; however, care should be taken near occlusion boundaries to smoothly taper between our texture sliding 3D reconstruction and the inferred cloth prediction. A simple approach is to extrapolate/smooth the geometric difference between our texture sliding 3D reconstruction and the inferred cloth prediction to occluded regions of the mesh. Once again, the definition of occluded vertices needs to be broadened to account for silhouettes. Not only will vertices not visible from at least two cameras have to be considered occluded, but vertices that do not project to the interior of an inferred cloth triangle with valid texture coordinate perturbations will also have to be considered occluded. See Figure 3(b).

Figure 3: The method discussed in Section 4.1 can fail near silhouettes of the inferred and ground truth cloth meshes, in which case smoothness assumptions are used (see Section 4.2). In (a), inferred triangles with at least one vertex falling outside the silhouette of the ground truth mesh are colored red. In (b), ground truth triangles with at least one vertex falling outside the silhouette of the inferred mesh are colored blue.

4 Dataset Generation

Consider a triangulated cloth surface defined by its vertex positions and per-vertex texture coordinates. We assume that mesh connectivity remains fixed throughout. The ground truth cloth mesh depends on the pose. Given a pre-trained DNN (we use the network from [35]), the inferred cloth is also a function of the pose. Our objective is to replace the ground truth texture coordinates with perturbed texture coordinates, i.e., to compute an inferred cloth mesh whose perturbed texture coordinates depend on both the pose and the view. Even though the perturbed texture coordinates are in principle valid for all views using interpolation (see Section 6.3), training data is only required for a finite number of camera views. For each camera, we also only require training data for a finite number of poses; i.e., for each training pose and camera, we require the perturbed texture coordinates, which are computed from the ground truth texture coordinates using the inferred cloth vertices, the camera view, and the ground truth cloth vertices.

4.1 Texture Coordinate Projection

We project texture coordinates to the inferred cloth vertices from the ground truth cloth mesh using ray intersection. For each inferred cloth vertex, we cast a ray from the camera's aperture through the vertex and find the first intersection with the ground truth mesh; subsequently, the ground truth texture coordinate is barycentrically interpolated to the point of intersection and assigned to the inferred cloth vertex as its perturbed texture coordinate value. See Figure 4. Rays are only cast for inferred cloth vertices that have at least one incident triangle with a nonzero area subregion visible to the camera. Also, a ground truth texture coordinate value is only assigned to an inferred cloth vertex when the point of intersection with the ground truth mesh is visible to the camera. We store and learn texture coordinate displacements, i.e., the differences between the assigned and original texture coordinates. After this procedure, any remaining vertices of the inferred cloth that have not been assigned values are treated as occluded and handled via smoothness considerations as discussed in Section 4.2.
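The projection step can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`ray_triangle`, `project_texture_coordinate`) and the list-based mesh layout are hypothetical, the intersection test is the standard Möller-Trumbore algorithm, and the visibility checks described above (for example, occlusion by the body) are omitted.

```python
# Illustrative sketch (hypothetical names): cast a ray from the camera through
# an inferred cloth vertex, find the first hit on the ground truth mesh, and
# barycentrically interpolate the ground truth texture coordinates there.

def sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def ray_triangle(origin, direction, p0, p1, p2, eps=1e-12):
    """Moller-Trumbore test. Returns (t, w0, w1, w2) or None on a miss."""
    e1, e2 = sub(p1, p0), sub(p2, p0)
    h = cross(direction, e2)
    det = dot(e1, h)
    if abs(det) < eps:              # ray is parallel to the triangle plane
        return None
    s = sub(origin, p0)
    u = dot(s, h) / det
    q = cross(s, e1)
    v = dot(direction, q) / det
    t = dot(e2, q) / det
    if u < 0 or v < 0 or u + v > 1 or t <= eps:
        return None
    return t, 1.0 - u - v, u, v     # ray parameter and barycentric weights

def project_texture_coordinate(camera, vertex, gt_positions, gt_uvs, gt_triangles):
    """Return the interpolated ground truth UV at the first ray hit, or None."""
    direction = sub(vertex, camera)
    best = None
    for (i, j, k) in gt_triangles:
        hit = ray_triangle(camera, direction,
                           gt_positions[i], gt_positions[j], gt_positions[k])
        if hit is not None and (best is None or hit[0] < best[0][0]):
            best = (hit, (i, j, k))  # keep the closest intersection
    if best is None:
        return None                  # vertex stays unassigned (see Section 4.2)
    (_, w0, w1, w2), (i, j, k) = best
    return [w0 * gt_uvs[i][c] + w1 * gt_uvs[j][c] + w2 * gt_uvs[k][c]
            for c in range(2)]
```

Subtracting the vertex's original texture coordinate from the returned value would give the stored displacement; a vertex for which the function returns `None` would be left for the smoothing step.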

Figure 4: Illustration of the ray intersection method for transferring texture coordinates to the inferred cloth from the ground truth cloth. Texture coordinates for the inferred cloth vertex (red cross) are interpolated from the ground truth mesh to the point of ray intersection (red circle).

4.2 Occlusion Handling

Some vertices of the inferred cloth mesh remain unassigned (with zero displacement) after executing the algorithm outlined in Section 4.1. This creates a discontinuity in the displacements, which excites high frequencies that require a more complex network architecture to capture. In order to alleviate demands on the network, we smooth the displacements as follows. First, we use the Fast Marching Method on triangulated surfaces [37] to generate a signed distance field. Then, we extrapolate the displacements in the direction normal to the distance field into the unassigned region, see [50]. Finally, a bit of averaging is used to provide smoothness, while keeping the assigned displacement values unchanged. Alternatively, one could solve a Poisson equation as in [15] while using the assigned displacements as Dirichlet boundary conditions.
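To make the idea of filling unassigned vertices while holding assigned values fixed more concrete, the following sketch uses repeated neighbor averaging (Jacobi iterations of a graph Laplace equation), which is closest in spirit to the Poisson alternative mentioned above. It is a swapped-in simplification, not the Fast Marching extrapolation the text describes, and the function name `fill_unassigned`, the adjacency-list input, and the iteration count are all assumptions made for illustration.

```python
# Illustrative sketch (hypothetical names): fill unassigned per-vertex UV
# displacements by repeatedly averaging over mesh neighbors, keeping the
# assigned values fixed (they act like Dirichlet boundary conditions).

def fill_unassigned(displacements, assigned, neighbors, iterations=200):
    """displacements: list of [du, dv]; assigned: list of bool;
    neighbors: list of lists of adjacent vertex indices."""
    d = [list(v) for v in displacements]
    for _ in range(iterations):
        new_d = [list(v) for v in d]
        for i, nbrs in enumerate(neighbors):
            if assigned[i] or not nbrs:
                continue            # assigned values are never modified
            for c in range(2):
                new_d[i][c] = sum(d[j][c] for j in nbrs) / len(nbrs)
        d = new_d
    return d
```

On a simple chain of vertices with assigned values at both ends, this converges to a linear ramp between them, which removes the jump discontinuity that zero-valued unassigned vertices would otherwise introduce.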

5 Network Architecture

A separate texture sliding neural network (TSNN) is trained for each camera; thus, we drop the camera index from the notation in this section. The loss is defined over all poses in the training set to minimize the difference between the desired displacements and the predicted displacements.

The inferred cloth data we chose to correct are predictions of the T-shirt meshes from [35], each of which contains about 3,000 vertices. The dataset spans about 10,000 different poses generated from a scanned garment using physically-based simulation. To improve the resolution, we up-sampled each cloth mesh by subdividing each triangle into four subtriangles. Notably, our texture sliding approach can be used to augment the results of any dataset for which ground truth and inferred training examples are available. Moreover, it is trivial to increase the resolution of any such dataset simply by subdividing triangles. Note that perturbations of the subdivided geometry are unnecessary, as we merely desire more sample points (to address Figure 2). Finally, we applied an 80-10-10 training-validation-test set split.
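The up-sampling step, in which each triangle is split into four subtriangles without perturbing the geometry, might look like the sketch below. It assumes edge-midpoint subdivision on a list-based mesh; the function name `subdivide` and the data layout are hypothetical and chosen only for illustration.

```python
# Illustrative sketch (hypothetical names): split every triangle into four by
# inserting one new vertex at each edge midpoint. Positions and texture
# coordinates of new vertices are plain averages, so the surface is unchanged
# and only the number of sample points grows.

def subdivide(positions, uvs, triangles):
    positions = [list(p) for p in positions]
    uvs = [list(t) for t in uvs]
    midpoint_index = {}             # shared edges reuse the same new vertex

    def midpoint(a, b):
        key = (min(a, b), max(a, b))
        if key not in midpoint_index:
            positions.append([(positions[a][c] + positions[b][c]) / 2.0
                              for c in range(3)])
            uvs.append([(uvs[a][c] + uvs[b][c]) / 2.0 for c in range(2)])
            midpoint_index[key] = len(positions) - 1
        return midpoint_index[key]

    new_triangles = []
    for a, b, c in triangles:
        ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
        new_triangles += [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]
    return positions, uvs, new_triangles
```

Applying the function twice would correspond to two levels of subdivision; texture sliding would then assign texture coordinates at the new vertices as well.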

Similar to [35], the displacements are stored as pixel-based cloth images for the front and back sides of the T-shirt, though we still output per-vertex texture coordinate displacements in UV space. See Figure 5 for an overview of the network architecture. Given input joint transformation matrices, the TSNN applies a series of transpose convolution, batch normalization, and ReLU activation layers to upsample the input to the output cloth image tensor. The first two channels of the output tensor represent the predicted displacements for the front side of the T-shirt, and the remaining two channels represent those for the back side.

Figure 5: Texture sliding neural network (TSNN) architecture.

6 Experiments

In Section 6.1, we quantify the data generation approach of Section 4 and highlight the advantages of mesh subdivision for up-sampling. In Section 6.2, we evaluate the predictions made by our trained texture sliding neural network (TSNN). In Section 6.3, we demonstrate the interpolation of texture sliding results to novel views between a finite number of cameras. Finally, in Section 6.4, we use multi-view texture sliding to reconstruct 3D geometry.

6.1 Dataset Generation and Evaluation

We aim to have the material coordinates of the cloth be in the correct locations as viewed by multiple cameras, so that the material can be accurately 3D reconstructed with point-wise accuracy. As such, our error metric is a bit more stringent than that commonly used because our aim is to reproduce the actual material behavior, not merely to mimic its look (e.g., by perturbing normal vectors to create shading consistent with wrinkles in spite of the cloth being smooth, as in [40]). In order to elucidate this, consider a two-step approach where one first approximates a smooth cloth mesh and then perturbs that mesh to add wrinkles (similar to [59]). In order to preserve area and achieve the correct material behavior, material in the vicinity of a newly forming wrinkle should slide laterally towards that wrinkle as it is formed. Merely non-physically stretching the material in order to create a wrinkle may look plausible, but does not admit the correct material behavior. In fact, the texture would be unrealistically stretched as well, although this is less apparent visually when using simple textures.

Since texture coordinates provide a proxy surface parameterization for material coordinates, we measure texture coordinate errors in a per-pixel fashion, comparing the ground truth and inferred cloth at the center of each pixel. Figure 6(a) shows results typical for cloth inferred using the network from [35], and Figure 6(b) shows the highly improved results obtained on the same inferred geometry using our texture sliding approach (with 1 level of subdivision). Note that the vast majority of the errors in Figure 6(b) occur near the wrinkles where the nonlinearities illustrated in Figure 2 are most prevalent. In Figure 6(c), we deform the vertices of the inferred cloth mesh so that they lie exactly on the ground truth mesh in order to mimic a two-step approach (as discussed above). Note how our error metric captures the still rather large errors in the material coordinates (and thus cloth vertex positions) in spite of the mesh in Figure 6(c) appearing to have the same wrinkles and folds as the ground truth mesh. Figure 7 compares the local compression and extension energies of the ground truth mesh (Figure 7(a)), the inferred cloth mesh (Figure 7(b)), and the result of this two-step process (Figure 7(c)). In spite of the untextured mesh in Figure 7(c) bearing visual similarity to the ground truth in Figure 7(a), it still has rather large errors in deformation energy.

Figure 6: Per-pixel texture coordinate errors before (a) and after (b) applying texture sliding to the inferred cloth output by the network of [35]. The result of a two-step process (c) may well match the ground truth in a visual sense, whilst still having quite large errors in material coordinates. Errors are color-coded from blue to red.
Figure 7: Local compression (blue) and extension (red) energies for a sample pose, comparing the ground truth cloth (a), the inferred cloth (b), and the result of a two-step process (c). In spite of the cloth mesh in (c) bearing visual resemblance to the ground truth in (a), it still has quite erroneous deformation energies.

Figure 8 illustrates the efficacy of subdividing the cloth mesh to get more samples for texture sliding. The particular ground truth cloth wrinkle shown in Figure 8(e) is not captured by the inferred cloth geometry shown in Figure 8(a). The texture sliding result shown in Figure 8(b) better represents the ground truth cloth. Figures 8(c) and 8(d) show how subdividing the inferred cloth mesh one and two times (respectively) progressively alleviates errors emanating from the linearity assumption illustrated in Figure 2. Table 1 shows quantitative results comparing the inferred cloth to texture sliding with and without subdivision.

Method SqrtMSE ()
Jin et al. [35] 24.496 6.9536
TS 5.2662 2.2320
TS + subdivision 3.5645 1.6617
Table 1: Per-pixel square root of mean squared error (SqrtMSE) for the entire dataset.
Figure 8: As the inferred cloth mesh (a) is subdivided, texture sliding (b-d) moves the appearance of the inferred mesh closer to the ground truth (e).

6.2 Network Training and Inference

The network was trained in PyTorch [51] using the Adam optimizer [38]. As mentioned earlier, we subdivided the mesh triangles once. Figure 9 shows a typical prediction on a test set example, including the per-pixel errors in predicted texture coordinates. While the TSNN is able to recover the majority of the shirt, it struggles near wrinkles. Figure 10 highlights a particular wrinkle, comparing the inferred cloth (Figure 10(a)) and the results of the TSNN before (Figure 10(b)) and after (Figure 10(c)) subdivision to the ground truth (Figure 10(d)). Table 2 shows quantitative results comparing the inferred cloth to TSNN results with and without subdivision.

Figure 9: A typical test set example prediction. The per-pixel errors are shown in (c), color-coded from blue to red.
Network SqrtMSE ()
Jin et al. [35] 24.871 7.0613
TSNN 13.335 4.2924
TSNN + subdivision 13.591 4.5194
Table 2: Per-pixel SqrtMSE for the test set. In spite of Table 1 demonstrating that subdivision improves the ground truth TS data, the improvements are not uniformly realized by the TSNN (which we discuss in the appendix).
Figure 10: The results of the TSNN before (b) and after (c) subdivision, as compared to the ground truth (d). In spite of Table 2, some wrinkles are better resolved by the TSNN after subdivision. The inferred mesh with ground truth texture coordinates is shown in (a).
Figure 11: Given two camera views (far left and far right images), texture sliding can be linearly interpolated to novel views between them. The top row shows per-pixel errors (color-coded from blue to red), and the bottom row shows the cloth from a fixed front-facing view to illustrate how the interpolated texture changes as a function of the chosen novel view.

6.3 Interpolating to Novel Views

Given a finite number of camera views, one can specify a new view enveloped by the array using a variety of interpolation methods. For the sake of demonstration, we take a simple approach, assuming that one can interpolate the new view as a weighted combination of the camera views, and then use these same weights to compute the perturbed texture coordinates for the new view as the correspondingly weighted combination of the per-camera perturbed texture coordinates. This same equation is also used for the TSNN predictions. Figure 11 shows the results obtained by linearly interpolating between two camera views. Note how the largest errors appear near areas occluded by wrinkles, where one (or both) of the cameras has no valid texture sliding results and instead uses the inferred cloth textures. This can be alleviated by using more cameras placed closer together. Figure 12 quantifies these results for the inferred cloth, texture sliding, and the results of the TSNN. In Figure 13, we repeat these comparisons, except using bilinear interpolation between four camera views.
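As a minimal sketch of this interpolation, assuming the per-camera texture coordinate displacements are available as lists of per-vertex pairs and that the interpolation weights sum to one, the same weights used to express the new view can be applied directly to the per-camera results. The function names and the corner ordering used for the bilinear weights are assumptions made for illustration only.

```python
# Illustrative sketch (hypothetical names): blend per-camera texture
# coordinate displacements with the weights that express the novel view as a
# combination of the camera views.

def interpolate_displacements(weights, per_camera_displacements):
    """weights: one weight per camera (summing to 1);
    per_camera_displacements: for each camera, a list of [du, dv] per vertex."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights should sum to 1")
    num_vertices = len(per_camera_displacements[0])
    return [[sum(w * cam[i][c]
                 for w, cam in zip(weights, per_camera_displacements))
             for c in range(2)]
            for i in range(num_vertices)]

def bilinear_weights(s, t):
    """Weights for four cameras at the corners of a square, with (s, t) in
    [0, 1]^2. Assumed corner order: (0,0), (1,0), (0,1), (1,1)."""
    return [(1 - s) * (1 - t), s * (1 - t), (1 - s) * t, s * t]
```

With two cameras, `interpolate_displacements([1 - a, a], [d0, d1])` gives linear interpolation along the segment between them; with four cameras, `bilinear_weights(s, t)` supplies the weights for a view inside the square.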

Figure 12: Per-pixel SqrtMSE for interpolating between two cameras (using a test set example). Note that the inferred cloth does not use any view based information, but that our error metric does depend on the view.
Figure 13: Per-pixel SqrtMSE for interpolating between four cameras (one at each corner of the square). The pose is the same as in Figure 12, which plots the values along the bottom edge of the square.

6.4 3D Reconstruction

In order to reconstruct the 3D position of a vertex of the ground truth mesh, we take the usual approach of finding rays that pass through that vertex and the camera aperture for a number of cameras. Then given at least two rays, one can triangulate a 3D point that is minimal distance from all the rays. We can do this without solving the typical image to image correspondence problem because we know the ground truth texture coordinates for any given vertex. Thus, we merely have to find the ray that passes through the camera aperture and the ground truth texture coordinate for the vertex under consideration.
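One common way to triangulate a point that is a minimal distance from all rays is a small least-squares solve, sketched below in plain Python. This is an assumption about how the step could be implemented rather than a description of the authors' code: rays are treated as infinite lines, the function names are hypothetical, and the 3x3 system is solved with Cramer's rule to keep the example dependency-free.

```python
# Illustrative sketch (hypothetical names): find the 3D point minimizing the
# sum of squared distances to a set of rays (treated as infinite lines).
# For a ray with origin o and unit direction d, the squared distance from x is
# |(I - d d^T)(x - o)|^2, which leads to the normal equations
#   sum_i (I - d_i d_i^T) x = sum_i (I - d_i d_i^T) o_i.

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(a, b):
    """Solve a 3x3 linear system with Cramer's rule."""
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("rays are (nearly) parallel; point is not unique")
    solution = []
    for k in range(3):
        a_k = [[b[r] if c == k else a[r][c] for c in range(3)]
               for r in range(3)]
        solution.append(det3(a_k) / d)
    return solution

def closest_point_to_rays(origins, directions):
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        norm = sum(x * x for x in d) ** 0.5
        d = [x / norm for x in d]               # normalize the direction
        for r in range(3):
            for c in range(3):
                m = (1.0 if r == c else 0.0) - d[r] * d[c]
                a[r][c] += m
                b[r] += m * o[c]
    return solve3(a, b)
```

With exactly two rays that intersect, the result is their intersection point; with noisy rays from several cameras, it is the least-squares compromise between them.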

To find a ground truth texture coordinate on a texture corrected inferred cloth mesh (built from either texture sliding data or TSNN predictions), we first find the triangle containing that texture coordinate. This can be done quickly by using a hierarchical bounding box structure where the base level boxes around each triangle are defined using the min/max texture coordinates at the three vertices. Then one can write the barycentric interpolation formula that interpolates the triangle vertex texture coordinates to obtain the given ground truth texture coordinate, and subsequently invert the matrix to solve for the weights. These weights determine the sub-triangle position of the vertex under consideration (taking care to note that different answers are obtained in 3D space versus screen space, since the camera projection is nonlinear). Figure 14 shows the 3D reconstruction of a test set example using texture sliding (Figure 14(c)) and the TSNN (Figure 14(d)). Figure 15 compares the per-pixel errors and local compression/extension energies of Figures 14(c) and 14(d).
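The barycentric inversion described above reduces to a 2x2 linear solve in texture space. The sketch below is a simplified illustration under stated assumptions: it scans all triangles with a per-triangle bounding box rejection instead of the hierarchical bounding box structure, and the function names and the tolerance are hypothetical.

```python
# Illustrative sketch (hypothetical names): locate the triangle whose texture
# coordinates contain a target UV and solve for its barycentric weights.

def barycentric_weights_uv(uv, uv0, uv1, uv2, eps=1e-12):
    """Solve w0*uv0 + w1*uv1 + w2*uv2 = uv with w0 + w1 + w2 = 1."""
    a, b = uv1[0] - uv0[0], uv2[0] - uv0[0]
    c, d = uv1[1] - uv0[1], uv2[1] - uv0[1]
    det = a * d - b * c
    if abs(det) < eps:
        return None                 # degenerate triangle in texture space
    rx, ry = uv[0] - uv0[0], uv[1] - uv0[1]
    w1 = (rx * d - b * ry) / det
    w2 = (a * ry - rx * c) / det
    return (1.0 - w1 - w2, w1, w2)

def locate_uv(uv, uvs, triangles, tol=1e-9):
    """Return (triangle, weights) for the first triangle containing uv."""
    for tri in triangles:
        corners = [uvs[i] for i in tri]
        # Cheap rejection using the min/max texture coordinates of the triangle.
        if any(uv[c] < min(p[c] for p in corners) - tol or
               uv[c] > max(p[c] for p in corners) + tol for c in range(2)):
            continue
        w = barycentric_weights_uv(uv, *corners)
        if w is not None and min(w) >= -tol:
            return tri, w
    return None
```

Because perturbed texture coordinates can make more than one triangle contain the same value, returning only the first match is a simplification; a fuller version would collect every candidate.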

Figure 14: Comparison of the ground truth cloth (a) and inferred cloth (b) to the 3D reconstructions obtained using texture sliding (c) and the TSNN (d). To remove reconstruction noise generated by network inference errors in (d), we used the postprocess from [25]; although, there are many other smoothing options in the literature that one might also consider.
Figure 15: Per-pixel errors (top) and local compression/extension energies (bottom) for Figure 14(c) (a) and Figure 14(d) (b).

7 Discussion and Future Work

There are many disparate applications for clothing including, for example, video games, AR/VR, Hollywood special effects, virtual try-on and shopping, scene acquisition and understanding, and even bullet proof vests and soft armor. Various scenarios define accuracy or fidelity in vastly different ways. So while it is typical to state that one cares about more than just the visual appearance (or “graphics”), often those aiming for predictive capability still make concessions. For example, whereas [59] proposes a network that predicts wrinkles mapped to new body types well, the discussion in [40] implies that the horizontal wrinkles predicted by [59] are more characteristic of inaccurate physical simulation than real-world behavior. Instead, [40] strives for more vertical wrinkles to better match their data, but they accomplish this by predicting lighting to match an image while accepting overly smooth geometry. And as we have shown in Figure 7(c), predicting the correct geometry still allows for rather large errors in the deformation (see [25]).

In light of this, we state the problem of most interest to us: Our aim is to study the efficacy of using deep neural networks to aid in the modeling of material behavior, especially for those materials for which predictive methods do not currently exist because of various unknowns including friction, material parameters (for cloth and body), etc. Given this goal, we focus on the accurate prediction of material coordinates, which are a superset of deformation, geometry, lighting, visual plausibility, etc.

As demonstrated by the remarkably accurate 3D reconstruction in Figure 14(c) (see Figure 15(a)), our approach to encoding high frequency wrinkles into lower frequency texture coordinates (texture sliding) works quite well. It can be used as a post-process to any existing neural network to capture lost details (as long as ground truth and inferred training examples are available); moreover, we showed that trivial subdivision could be used to increase the sampling resolution to limit linearization artifacts. The main drawback of our approach is that it relies on triangulation or multi-view stereo in order to construct the final 3D geometry, although this step is not required for AR/VR applications. This means that one needs to take care when training the texture sliding neural network (TSNN) since inference errors can cause reconstruction noise. Thus, as future work, we plan on experimenting with the network architecture, the size of the image used in the CNN, the smoothing methods near occlusion boundaries, the amount of subdivision, etc. In addition, it would be interesting to consider more savvy multiview 3D reconstruction methods (particularly ones that employ DNNs; then, one might train the whole process end-to-end). Our current solution to addressing multiview reconstruction noise is to simply use the method from [25] as a postprocess to the triangulation of the TSNN results. As can be seen in Figure 14(d), this leads to a high quality reconstruction with many high frequency wrinkles faithful to the ground truth; however, an improved TSNN would lead to more accurate per-pixel texture coordinates than those in Figure 15(b) (top).


Research supported in part by ONR N00014-13-1-0346, ONR N00014-17-1-2174, and We would like to thank Reza and Behzad at ONR for supporting our efforts into machine learning, as well as Rev Lebaredian and Michael Kass at NVIDIA for graciously loaning us a GeForce RTX 2080Ti to use for running experiments. We would also like to thank Matthew Cong and Yilin Zhu for their insightful discussions, and Congyue Deng for generating realistic cloth textures.


Appendix A Dataset Generation

A.1 Topological Considerations

There are some edge cases that require additional topological consideration. In particular, the collar, sleeves, and waist are areas where a ray cast to an inferred cloth vertex can intersect with a back-facing triangle on the inside of the ground truth shirt. We aim to define texture coordinates on inferred cloth vertices so that barycentric interpolation can be used to find the texture coordinates of a ground truth vertex for 3D reconstruction. However, mixing texture coordinates from the inside and outside of the shirt in a single triangle causes dramatic interpolation error. In fact, as shown in Figure 2, large errors may occur for any triangle that mixes texture coordinates from geodesically far-away regions. Thus, we omit such triangles from consideration by omitting a vertex from any edge that connects two geodesically far-away regions.

As a further improvement to our method, one can treat the inside and outside of the shirt as separate meshes, applying texture sliding twice and training two separate networks; moreover, one may take a patch-based approach, applying TS and training a TSNN for each (slightly overlapping) patch of the shirt.

A.2 Smoothness Considerations

When training a neural network, more predictable results are obtained when the inferred cloth vertex data is smoother. Thus, there exist tradeoffs between smoothness and accuracy when assigning texture coordinates. An edge that connects two geodesically far-away regions introduces a jump discontinuity in the texture coordinates, leading to high frequencies in the ground truth data that place increased demands on the network. Although subdivision adds degrees of freedom along such edges to better sample the high frequency, it is often better to delete such edges entirely by removing one of the edge’s vertices. Recall that any vertex not assigned a ground truth texture coordinate is instead defined via smoothness considerations (see Section 4.2), reducing demands on the network.


Figure 16: Comparisons of the ground truth cloth (a) and inferred cloth (b) to the 3D reconstructions obtained using texture sliding (c) and the TSNN (d) for three test set examples. Note that the postprocess in [25] is only applied to (d).

Appendix B 3D Reconstruction

There are a couple of issues with finding the texture coordinates of the ground truth vertices on an inferred cloth mesh, whether it be TS or TSNN data. Firstly, there could be seams in the texture, in which case smoothing would be needed near the seam, as discussed above, in order to avoid degrading the data. A patch-based approach can be used to alleviate any such seams. Secondly, seams, smoothing, and non-linearity along the lines of Figure 2 may all contribute to more than one inferred cloth triangle containing the texture coordinates of a ground truth vertex. This ambiguity can be treated similarly to how correspondence uncertainties are addressed in standard multi-view stereo algorithms. The straightforward approach is to consider each distinct possibility for each camera in all possible combinations and choose the set of rays that have the least disagreement for triangulation; furthermore, one may also consider the 3D reconstruction of neighboring vertices, material deformation, etc. Overall, reliance on multi-view stereo does require careful attention when utilizing our method. As such, we provide a few more 3D reconstruction results from the test set in order to demonstrate the efficacy of our approach. See Figure 16.
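The "least disagreement" selection can be illustrated with a small sketch. It assumes each camera contributes one or more candidate rays (origin and direction), measures disagreement as the sum of squared distances from the least-squares point to the rays, and exhaustively tries one candidate per camera. The function names and the choice of residual are assumptions made for illustration; the paper does not specify them. At least two non-parallel rays are required, otherwise the 3x3 system is singular.

```python
import itertools
import math

def _normalize(d):
    n = math.sqrt(sum(c * c for c in d))
    return tuple(c / n for c in d)

def _det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def triangulate(rays):
    """Least-squares point closest to all rays; returns (point, residual).

    rays: list of (origin, direction) 3-tuples. The residual is the sum of
    squared point-to-line distances, used here as the 'disagreement'.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    unit = [(o, _normalize(d)) for o, d in rays]
    for o, d in unit:
        for r in range(3):
            for c in range(3):
                p = (1.0 if r == c else 0.0) - d[r] * d[c]  # entry of I - d d^T
                A[r][c] += p
                b[r] += p * o[c]
    det = _det3(A)
    x = []
    for k in range(3):  # Cramer's rule on the 3x3 normal equations
        Ak = [[b[r] if c == k else A[r][c] for c in range(3)] for r in range(3)]
        x.append(_det3(Ak) / det)
    res = 0.0
    for o, d in unit:
        w = [x[i] - o[i] for i in range(3)]
        along = sum(w[i] * d[i] for i in range(3))
        res += sum(w[i] ** 2 for i in range(3)) - along ** 2
    return tuple(x), res

def best_combination(candidates_per_camera):
    """Try one candidate ray per camera in all combinations; keep the best."""
    return min((triangulate(list(combo))
                for combo in itertools.product(*candidates_per_camera)),
               key=lambda pr: pr[1])

# Camera 1 has one candidate; camera 2 has a consistent and a spurious one.
cam1 = [((2.0, 0.0, 0.0), (-1.0, 0.0, 0.0))]
cam2 = [((0.0, 2.0, 0.0), (0.0, -1.0, 0.0)),   # passes through the origin
        ((0.0, 2.0, 0.0), (0.0, -1.0, 1.0))]   # spurious candidate
point, residual = best_combination([cam1, cam2])
```

The exhaustive search grows exponentially with the number of ambiguous cameras, which is one reason the additional cues mentioned above (neighboring vertices, material deformation) may be preferable in practice.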

Instead of applying a standard smoothing algorithm to the somewhat noisy results of the 3D reconstructions of the TSNN data, we used the postprocess from [25]. This choice was made because of our desire to use neural networks to bridge the gap between physical simulations and real-world material behavior. In order to quantify the impact of the postprocess from [25] on the final results, Figure 17 shows the results obtained when applying the postprocess directly to the inferred cloth as compared to applying it to TS and TSNN data.

Figure 17: Comparison of the postprocess from [25] applied to (a) the inferred cloth, (b) the 3D reconstruction from TS data, and (c) the 3D reconstruction from TSNN data. Per-pixel errors (top) and local compression/extension energies (bottom) are shown.

Appendix C Novel View Interpolation

Interpolating between two cameras, each with TS or TSNN data, has the effect of following a straight-line path. However, by choosing the camera array and subsequent interpolation carefully one can interpolate along curved paths. For example in Figure 18, one can interpolate between the 12 cameras (represented by blue dots) in order to follow the curved camera path.
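As a rough sketch of this idea, the function below blends per-vertex texture coordinates between the two cameras that bracket a position along an ordered camera array, so that a sequence of straight-line segments approximates a curved path. The function name, the uniform spacing of cameras in the path parameter `t`, and the use of simple linear blending are assumptions made for illustration only.

```python
def interpolate_along_cameras(per_camera_uv, t):
    """Blend per-vertex texture coordinates along a camera polyline.

    per_camera_uv : list (one entry per camera, in path order, at least two)
                    of lists of (u, v), one pair per vertex
    t             : path parameter in [0, 1]; 0 is the first camera, 1 the last
    """
    n = len(per_camera_uv)
    s = min(max(t, 0.0), 1.0) * (n - 1)   # position in units of camera spacing
    k = min(int(s), n - 2)                # index of the left bracketing camera
    w = s - k                             # blend weight toward camera k + 1
    left, right = per_camera_uv[k], per_camera_uv[k + 1]
    return [((1 - w) * ul + w * ur, (1 - w) * vl + w * vr)
            for (ul, vl), (ur, vr) in zip(left, right)]

# Three cameras, one vertex each: halfway between the first two cameras.
cams = [[(0.0, 0.0)], [(1.0, 0.0)], [(1.0, 1.0)]]
halfway = interpolate_along_cameras(cams, 0.25)
end = interpolate_along_cameras(cams, 1.0)
```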

Figure 18: Let the red dot represent the center of the cloth mesh. One can interpolate between the 12 cameras (blue dots) on the trapezoid in order to follow the curved camera path.

Appendix D Error Analysis (for Patches)

In this section, we consider each step of the ray intersection algorithm, carefully illustrating the sources of error. This is done for a single patch consisting of the entire front half of the shirt in order to ensure continuous and unique texture coordinates. Additionally, this section highlights our patch-based approach, noting that we would utilize this approach on a number of overlapping patches and blend the final results together. In fact, when considering only a single patch, we modify our nodes from the inferred cloth to only include that patch, ignoring the rest of the vertices and triangles in the mesh. Similarly, the ground truth cloth is assumed to only consider the data for that patch. Note that any existing network that predicts cloth vertex positions can be adapted to this patch-based approach as a postprocess applied to their training examples, and that one may readily apply the predicted texture separately to each patch.

Along the lines of Section 4.1 and Appendix A, a ray between the camera aperture and each inferred cloth vertex of the patch is intersected with the ground truth cloth, in order to find the ground truth texture coordinates to assign to the inferred cloth vertex. Recall that the inferred cloth vertex remains unassigned when occluded; however, we modify our definition of occlusion to only consider the inferred cloth patch under consideration. This allows, for example, one to reconstruct the back half of the shirt with cameras from the front, since the front half of the shirt would not be considered and thus would not occlude the back half. Since we only consider the front half of both the inferred and ground truth cloth, one also does not compute ground truth texture coordinates to be assigned to the inferred vertex when the ray does not intersect the front half patch of the ground truth cloth. Separating the front and back of the shirt guarantees the inferred cloth patch is assigned texture coordinates from a continuous texture. This leads to a sub-mesh of assigned texture coordinates. As usual, we remove any edge (by deleting an inferred vertex) that connects geodesically far-away regions, as indicated by differing texture coordinate values. See Figure 19.
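A single ray-triangle step of this procedure might look like the sketch below, which uses the standard Möller–Trumbore intersection test and barycentric interpolation of the triangle's texture coordinates. It assumes counter-clockwise winding for front faces and rejects back-facing hits, in the spirit of Appendix A.1; a full implementation would loop over the triangles of the ground truth patch and keep the nearest hit. The function name and argument layout are hypothetical.

```python
def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_triangle_uv(camera, vertex, tri, tri_uv, eps=1e-12):
    """Cast a ray from `camera` through `vertex` at one ground truth triangle.

    tri    : three 3D points (counter-clockwise when seen from the front)
    tri_uv : the (u, v) texture coordinates at those three points
    Returns the interpolated (u, v), or None on a miss or a back-facing hit.
    """
    d = _sub(vertex, camera)
    e1, e2 = _sub(tri[1], tri[0]), _sub(tri[2], tri[0])
    p = _cross(d, e2)
    det = _dot(e1, p)
    if det < eps:                       # parallel or back-facing triangle
        return None
    s = _sub(camera, tri[0])
    b1 = _dot(s, p) / det               # barycentric weight of tri[1]
    q = _cross(s, e1)
    b2 = _dot(d, q) / det               # barycentric weight of tri[2]
    t = _dot(e2, q) / det               # ray parameter (1.0 at `vertex`)
    if b1 < 0 or b2 < 0 or b1 + b2 > 1 or t <= 0:
        return None
    b0 = 1.0 - b1 - b2
    return tuple(b0 * tri_uv[0][k] + b1 * tri_uv[1][k] + b2 * tri_uv[2][k]
                 for k in range(2))

tri = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
tri_uv = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
front = ray_triangle_uv((0.25, 0.25, 1.0), (0.25, 0.25, 0.5), tri, tri_uv)
back = ray_triangle_uv((0.25, 0.25, -1.0), (0.25, 0.25, -0.5), tri, tri_uv)
```

Here the front-facing ray returns interpolated coordinates, while the ray arriving from behind the triangle is rejected and the inferred vertex would stay unassigned.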

Figure 19: The maximum texture coordinate displacement before (a) and after (b) removing vertices which connect geodesically far away regions. The inferred cloth vertices are drawn in blue, and the ground truth ray intersection points are drawn in red. The wireframe of the inferred cloth is in blue, and the ground truth cloth is in white.

D.1 Texture Coordinates

To quantify the worst-case texture sliding scenarios, we first consider the assigned texture coordinates for every pose and camera view used in training. The edge with the largest difference in texture coordinates (as a proxy for geodesic distances) is shown in Figure 20. Figure 21 shows the corresponding edge for the Euclidean distance between the ground truth intersection points of connected vertices.

Figure 20: The edge over the entire training set with the largest change in texture coordinates.
Figure 21: The edge over the entire training set with the largest ground truth intersection Euclidean distance.

D.2 Texture Coordinate Displacements

In order to fill in unassigned vertices for the patch under consideration, we show the most extreme behavior of texture sliding over the entire training set. First, we compute the maximal texture coordinate displacement among all pairs, which is where maximal texture sliding occurs in our training set. See Figure 19(b). We also compute the change in texture coordinate displacement along each assigned edge in order to ascertain the biggest jump (indicating high frequency) that would be seen by the TSNN. The edge with the maximal change over all pairs is shown in Figure 22. Note that one may obtain better smoothness when training the TSNN by not assigning vertices where the displacement is too large, or one of the vertices of an edge where the change in displacement is too large.
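For a single training example, the two statistics discussed here (the largest displacement and the largest change in displacement along an assigned edge) can be gathered as in the sketch below. The data layout and the function name are assumptions for illustration; over a training set one would run this per example and keep the overall maxima.

```python
import math

def displacement_extrema(disp, edges):
    """Find the largest displacement and the largest jump along an edge.

    disp  : dict vertex -> (du, dv) texture coordinate displacement,
            containing assigned vertices only
    edges : iterable of (i, j) vertex index pairs
    Returns (max magnitude, its vertex, max edge jump, its edge).
    """
    v_max = max(disp, key=lambda v: math.hypot(*disp[v]))
    # Only edges with both endpoints assigned contribute a jump.
    assigned = [(i, j) for i, j in edges if i in disp and j in disp]

    def jump(e):
        (ai, bi), (aj, bj) = disp[e[0]], disp[e[1]]
        return math.hypot(ai - aj, bi - bj)

    e_max = max(assigned, key=jump)
    return math.hypot(*disp[v_max]), v_max, jump(e_max), e_max

# Vertex 3 is unassigned, so edge (2, 3) is ignored.
disp = {0: (0.0, 0.0), 1: (0.03, 0.04), 2: (0.3, 0.4)}
edges = [(0, 1), (1, 2), (2, 3)]
max_disp, v_max, max_jump, e_max = displacement_extrema(disp, edges)
```

Thresholding either quantity then amounts to leaving `v_max` (or one endpoint of `e_max`) unassigned whenever the corresponding value exceeds a chosen limit.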

Figure 22: The edge over the entire training set with the largest change in texture coordinate displacements.

D.3 Smoothness Considerations

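As long as extrapolation is done smoothly to assign texture coordinates to the remaining vertices, there should be no new extrema in the texture coordinate displacements or in their changes along edges. After applying smoothing, we verify that the largest displacement and the largest change along an edge are the same as before.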
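The absence of new extrema can be illustrated with a simple stand-in for smooth extrapolation: repeatedly replacing each unassigned value by the average of its neighbors while keeping assigned values fixed. Every update is an average of values already inside the range of the assigned data, so the filled values never leave that range. This neighbor averaging is an assumption made for illustration and is not claimed to be the formulation of Section 4.2; it is shown for one scalar component and would be applied to each component of the displacement.

```python
def fill_unassigned(values, neighbors, iterations=500):
    """Fill unassigned vertices by Jacobi-style neighbor averaging.

    values    : dict vertex -> float for assigned vertices (kept fixed)
    neighbors : dict vertex -> list of adjacent vertices (every vertex is a key)
    """
    mean = sum(values.values()) / len(values)
    # Start unassigned vertices inside the range of the assigned data.
    filled = {v: values.get(v, mean) for v in neighbors}
    free = [v for v in neighbors if v not in values]
    for _ in range(iterations):
        prev = dict(filled)
        for v in free:
            filled[v] = sum(prev[n] for n in neighbors[v]) / len(neighbors[v])
    return filled

# A path of four vertices with only the two end vertices assigned.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
assigned = {0: 0.0, 3: 0.3}
filled = fill_unassigned(assigned, neighbors)
```

On this path the filled values converge to a linear ramp between the two assigned end values, so both the largest value and the largest change along an edge are bounded by what the assigned data already contains.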

Appendix E TSNN – Additional Experiments

Table 3 shows additional TSNN results after applying a displacement threshold to the TS dataset. In addition, in Table 4 we decompose the TSNN errors based on whether vertices were assigned via our ray intersection method or extrapolation. Results indicate that training separate networks for smooth and wrinkled regions of the cloth may be a promising avenue for future work.

Network               SqrtMSE ()
TSNN                  15.058 6.5256
TSNN + subdivision    14.926 6.5918
Table 3: Comparisons of per-pixel SqrtMSE for the test set after applying a threshold to the ground truth TS displacements.

                    TSNN (original)    TSNN (threshold)
Ray Intersection    11.670 3.2160      10.958 2.9056
Extrapolation       39.564 27.087      95.200 48.406
Combination         14.279 4.5970      15.405 5.5923
Table 4: Breakdown of the TSNN errors () in Tables 2 and 3 based on whether each pixel contains vertices assigned via ray intersection, diffusion, or a combination of both.


  • [1] Thiemo Alldieck, Marcus Magnor, Bharat Lal Bhatnagar, Christian Theobalt, and Gerard Pons-Moll. Learning to reconstruct people in clothing from a single rgb camera. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1175–1186, 2019.
  • [2] Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. Detailed human avatars from monocular video. In 2018 International Conference on 3D Vision (3DV), pages 98–109. IEEE, 2018.
  • [3] Thiemo Alldieck, Marcus Magnor, Weipeng Xu, Christian Theobalt, and Gerard Pons-Moll. Video based reconstruction of 3d people models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8387–8397, 2018.
  • [4] Thiemo Alldieck, Gerard Pons-Moll, Christian Theobalt, and Marcus Magnor. Tex2shape: Detailed full human body geometry from a single image. In Proceedings of the International Conference on Computer Vision (ICCV). IEEE, 2019.
  • [5] Dragomir Anguelov, Praveen Srinivasan, Daphne Koller, Sebastian Thrun, Jim Rodgers, and James Davis. Scape: shape completion and animation of people. In ACM transactions on graphics (TOG), volume 24, pages 408–416. ACM, 2005.
  • [6] Alexandru O Bălan and Michael J Black. The naked truth: Estimating body shape under clothing. In European Conference on Computer Vision, pages 15–29. Springer, 2008.
  • [7] David Baraff and Andrew Witkin. Large steps in cloth simulation. In Proceedings of the 25th annual conference on Computer graphics and interactive techniques, pages 43–54. ACM, 1998.
  • [8] David Baraff, Andrew Witkin, and Michael Kass. Untangling cloth. In ACM Transactions on Graphics (TOG), volume 22, pages 862–870. ACM, 2003.
  • [9] Davide Boscaini, Jonathan Masci, Emanuele Rodolà, and Michael Bronstein. Learning shape correspondence with anisotropic convolutional neural networks. In Advances in Neural Information Processing Systems, pages 3189–3197, 2016.
  • [10] Derek Bradley, Tamy Boubekeur, and Wolfgang Heidrich. Accurate multi-view reconstruction using robust binocular stereo and surface meshing. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
  • [11] Derek Bradley, Tiberiu Popa, Alla Sheffer, Wolfgang Heidrich, and Tamy Boubekeur. Markerless garment capture. In ACM Transactions on Graphics (TOG), volume 27, page 99. ACM, 2008.
  • [12] Robert Bridson, Ronald Fedkiw, and John Anderson. Robust treatment of collisions, contact and friction for cloth animation. In ACM Transactions on Graphics (ToG), volume 21, pages 594–603. ACM, 2002.
  • [13] Robert Bridson, Sebastian Marino, and Ronald Fedkiw. Simulation of clothing with folds and wrinkles. In Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 28–36. ACM, 2003.
  • [14] Michael M Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
  • [15] Matthew Cong, Michael Bao, Jane L E, Kiran S Bhat, and Ronald Fedkiw. Fully automatic generation of anatomical face simulation models. In Proceedings of the 14th ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 175–183. ACM, 2015.
  • [16] R Daněřek, Endri Dibra, Cengiz Öztireli, Remo Ziegler, and Markus Gross. Deepgarment: 3d garment shape estimation from a single image. In Computer Graphics Forum, volume 36, pages 269–280. Wiley Online Library, 2017.
  • [17] Edilson De Aguiar, Leonid Sigal, Adrien Treuille, and Jessica K Hodgins. Stable spaces for real-time clothing. In ACM Transactions on Graphics (TOG), volume 29, page 106. ACM, 2010.
  • [18] Endri Dibra, Himanshu Jain, Cengiz Öztireli, Remo Ziegler, and Markus Gross. Hs-nets: Estimating human body shape from silhouettes with convolutional neural networks. In 2016 Fourth International Conference on 3D Vision (3DV), pages 108–117. IEEE, 2016.
  • [19] Endri Dibra, Himanshu Jain, Cengiz Oztireli, Remo Ziegler, and Markus Gross. Human shape from silhouettes using generative hks descriptors and cross-modal neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4826–4836, 2017.
  • [20] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. arXiv preprint arXiv:1610.07629, 2016.
  • [21] James D Foley, Foley Dan Van, Andries Van Dam, Steven K Feiner, John F Hughes, J Hughes, and Edward Angel. Computer graphics: principles and practice, volume 12110. Addison-Wesley Professional, 1996.
  • [22] Jean-Sébastien Franco, Marc Lapierre, and Edmond Boyer. Visual shapes of silhouette sets. In Third International Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT’06), pages 397–404. IEEE, 2006.
  • [23] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
  • [24] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2414–2423, 2016.
  • [25] Zhenglin Geng, Dan Johnson, and Ronald Fedkiw. Coercing machine learning to output physically accurate results. arXiv preprint arXiv:1910.09671, 2019.
  • [26] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learning. MIT press, 2016.
  • [27] Peng Guan, Loretta Reiss, David A Hirshberg, Alexander Weiss, and Michael J Black. Drape: Dressing any person. ACM Trans. Graph., 31(4):35–1, 2012.
  • [28] Erhan Gundogdu, Victor Constantin, Amrollah Seifoddini, Minh Dang, Mathieu Salzmann, and Pascal Fua. Garnet: A two-stream network for fast and accurate 3d cloth draping. In Proceedings of the IEEE International Conference on Computer Vision, pages 8739–8748, 2019.
  • [29] Agrim Gupta, Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Characterizing and improving stability in neural style transfer. In Proceedings of the IEEE International Conference on Computer Vision, pages 4067–4076, 2017.
  • [30] Marc Habermann, Weipeng Xu, Michael Zollhoefer, Gerard Pons-Moll, and Christian Theobalt. Livecap: Real-time human performance capture from monocular video. ACM Transactions on Graphics (TOG), 38(2):14, 2019.
  • [31] Fabian Hahn, Bernhard Thomaszewski, Stelian Coros, Robert W Sumner, Forrester Cole, Mark Meyer, Tony DeRose, and Markus Gross. Subspace clothing simulation using adaptive bases. ACM Transactions on Graphics (TOG), 33(4):105, 2014.
  • [32] Richard I Hartley and Peter Sturm. Triangulation. Computer vision and image understanding, 68(2):146–157, 1997.
  • [33] Peng Huang, Margara Tejera, John Collomosse, and Adrian Hilton. Hybrid skeletal-surface motion graphs for character animation from 4d performance capture. ACM Transactions on Graphics (ToG), 34(2):17, 2015.
  • [34] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1125–1134, 2017.
  • [35] Ning Jin, Yilin Zhu, Zhenglin Geng, and Ronald Fedkiw. A pixel-based framework for data-driven clothing. arXiv preprint arXiv:1812.01677, 2018.
  • [36] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In European conference on computer vision, pages 694–711. Springer, 2016.
  • [37] Ron Kimmel and James A Sethian. Computing geodesic paths on manifolds. Proceedings of the national academy of Sciences, 95(15):8431–8435, 1998.
  • [38] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [39] Ilya Kostrikov, Zhongshi Jiang, Daniele Panozzo, Denis Zorin, and Joan Bruna. Surface networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2540–2548, 2018.
  • [40] Zorah Lahner, Daniel Cremers, and Tony Tung. Deepwrinkles: Accurate and realistic clothing modeling. In Proceedings of the European Conference on Computer Vision (ECCV), pages 667–684, 2018.
  • [41] Matthew Loper, Naureen Mahmood, Javier Romero, Gerard Pons-Moll, and Michael J Black. Smpl: A skinned multi-person linear model. ACM transactions on graphics (TOG), 34(6):248, 2015.
  • [42] Haggai Maron, Meirav Galun, Noam Aigerman, Miri Trope, Nadav Dym, Ersin Yumer, Vladimir G Kim, and Yaron Lipman. Convolutional neural networks on surfaces via seamless toric covers. ACM Trans. Graph., 36(4):71–1, 2017.
  • [43] Steve Marschner and Peter Shirley. Fundamentals of computer graphics. CRC Press, 2015.
  • [44] Jonathan Masci, Davide Boscaini, Michael Bronstein, and Pierre Vandergheynst. Geodesic convolutional neural networks on riemannian manifolds. In Proceedings of the IEEE international conference on computer vision workshops, pages 37–45, 2015.
  • [45] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [46] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodola, Jan Svoboda, and Michael M Bronstein. Geometric deep learning on graphs and manifolds using mixture model cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5115–5124, 2017.
  • [47] Matthias Müller and Nuttapong Chentanez. Wrinkle meshes. In Proceedings of the 2010 ACM SIGGRAPH/Eurographics symposium on computer animation, pages 85–92. Eurographics Association, 2010.
  • [48] Ryota Natsume, Shunsuke Saito, Zeng Huang, Weikai Chen, Chongyang Ma, Hao Li, and Shigeo Morishima. Siclope: Silhouette-based clothed people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4480–4490, 2019.
  • [49] Alexandros Neophytou and Adrian Hilton. A layered model of human body and garment deformation. In 2014 2nd International Conference on 3D Vision, volume 1, pages 171–178. IEEE, 2014.
  • [50] Stanley Osher and Ronald Fedkiw. Level Set Methods and Dynamic Implicit Surfaces. Springer, New York, 2002.
  • [51] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017.
  • [52] Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed AA Osman, Dimitrios Tzionas, and Michael J Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10975–10985, 2019.
  • [53] Gerard Pons-Moll, Sergi Pujades, Sonny Hu, and Michael J Black. Clothcap: Seamless 4d clothing capture and retargeting. ACM Transactions on Graphics (TOG), 36(4):73, 2017.
  • [54] Tiberiu Popa, Quan Zhou, Derek Bradley, Vladislav Kraevoy, Hongbo Fu, Alla Sheffer, and Wolfgang Heidrich. Wrinkling captured garments using space-time data-driven deformation. In Computer Graphics Forum, volume 28, pages 427–435. Wiley Online Library, 2009.
  • [55] Nadia Robertini, Edilson De Aguiar, Thomas Helten, and Christian Theobalt. Efficient multi-view performance capture of fine-scale surface detail. In 2014 2nd International Conference on 3D Vision, volume 1, pages 5–12. IEEE, 2014.
  • [56] Damien Rohmer, Tiberiu Popa, Marie-Paule Cani, Stefanie Hahmann, and Alla Sheffer. Animation wrinkling: augmenting coarse cloth simulations with realistic-looking wrinkles. In ACM Transactions on Graphics (TOG), volume 29, page 157. ACM, 2010.
  • [57] Shunsuke Saito, Zeng Huang, Ryota Natsume, Shigeo Morishima, Angjoo Kanazawa, and Hao Li. Pifu: Pixel-aligned implicit function for high-resolution clothed human digitization. In Proceedings of the International Conference on Computer Vision (ICCV). IEEE, 2019.
  • [58] Artsiom Sanakoyeu, Dmytro Kotovenko, Sabine Lang, and Bjorn Ommer. A style-aware content loss for real-time hd style transfer. In Proceedings of the European Conference on Computer Vision (ECCV), pages 698–714, 2018.
  • [59] Igor Santesteban, Miguel A Otaduy, and Dan Casas. Learning-based animation of clothing for virtual try-on. In Computer Graphics Forum, volume 38, pages 355–366. Wiley Online Library, 2019.
  • [60] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The graph neural network model. IEEE Transactions on Neural Networks, 20(1):61–80, 2008.
  • [61] Bernhard Scholkopf and Alexander J Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT press, 2001.
  • [62] Steven M Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), volume 1, pages 519–528. IEEE, 2006.
  • [63] Andrew Selle, Jonathan Su, Geoffrey Irving, and Ronald Fedkiw. Robust high-resolution cloth using parallelism, history-based collisions, and accurate friction. IEEE transactions on visualization and computer graphics, 15(2):339–350, 2008.
  • [64] Qingyang Tan, Lin Gao, Yu-Kun Lai, and Shihong Xia. Variational autoencoders for deforming 3d mesh models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5841–5850, 2018.
  • [65] Gul Varol, Duygu Ceylan, Bryan Russell, Jimei Yang, Ersin Yumer, Ivan Laptev, and Cordelia Schmid. Bodynet: Volumetric inference of 3d human body shapes. In Proceedings of the European Conference on Computer Vision (ECCV), pages 20–36, 2018.
  • [66] Daniel Vlasic, Ilya Baran, Wojciech Matusik, and Jovan Popović. Articulated mesh animation from multi-view silhouettes. In ACM Transactions on Graphics (TOG), volume 27, page 97. ACM, 2008.
  • [67] Huamin Wang, Florian Hecht, Ravi Ramamoorthi, and James F O’Brien. Example-based wrinkle synthesis for clothing animation. In Acm Transactions on Graphics (TOG), volume 29, page 107. ACM, 2010.
  • [68] Tuanfeng Y Wang, Duygu Ceylan, Jovan Popović, and Niloy J Mitra. Learning a shared shape space for multimodal garment design. In SIGGRAPH Asia 2018 Technical Papers, page 203. ACM, 2018.
  • [69] Chenglei Wu, Kiran Varanasi, and Christian Theobalt. Full body performance capture under uncontrolled and varying illumination: A shading-based approach. In European Conference on Computer Vision, pages 757–770. Springer, 2012.
  • [70] Weipeng Xu, Avishek Chatterjee, Michael Zollhöfer, Helge Rhodin, Dushyant Mehta, Hans-Peter Seidel, and Christian Theobalt. Monoperfcap: Human performance capture from monocular video. ACM Transactions on Graphics (ToG), 37(2):27, 2018.
  • [71] Jinlong Yang, Jean-Sébastien Franco, Franck Hétroy-Wheeler, and Stefanie Wuhrer. Estimation of human body shape in motion with wide clothing. In European Conference on Computer Vision, pages 439–454. Springer, 2016.
  • [72] Jinlong Yang, Jean-Sébastien Franco, Franck Hétroy-Wheeler, and Stefanie Wuhrer. Analyzing clothing layer deformation statistics of 3d human motions. In Proceedings of the European Conference on Computer Vision (ECCV), pages 237–253, 2018.
  • [73] Tao Yu, Zerong Zheng, Yuan Zhong, Jianhui Zhao, Qionghai Dai, Gerard Pons-Moll, and Yebin Liu. Simulcap: Single-view human performance capture with cloth simulation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
  • [74] Chao Zhang, Sergi Pujades, Michael J Black, and Gerard Pons-Moll. Detailed, accurate, human shape estimation from clothed 3d scan sequences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4191–4200, 2017.