SHARP: Shape-Aware Reconstruction of People In Loose Clothing

by Sai Sagar Jinka, et al.
IIIT Hyderabad

3D human body reconstruction from monocular images is an interesting and ill-posed problem in computer vision with wide applications in multiple domains. In this paper, we propose SHARP, a novel end-to-end trainable network that accurately recovers the detailed geometry and appearance of 3D people in loose clothing from a monocular image. We propose a sparse and efficient fusion of a parametric body prior with a non-parametric peeled depth map representation of clothed models. The parametric body prior constrains our model in two ways: first, the network retains geometrically consistent body parts that are not occluded by clothing, and second, it provides a body shape context that improves prediction of the peeled depth maps. This enables SHARP to recover fine-grained 3D geometrical details with just L1 losses on the 2D maps, given an input image. We evaluate SHARP on the publicly available Cloth3D and THuman datasets and report superior performance to state-of-the-art approaches.




1 Introduction

Image-based modeling of loosely clothed 3D humans is an interesting and challenging open problem in computer vision. It has several applications in the domains of entertainment, AR/VR, sports, and healthcare. Traditional solutions based on stereo/multi-view reconstruction [13, 9, 11, 15, 40] require studio environments with multiple synchronized calibrated cameras, yet have limitations in recovering high-frequency geometrical details. Advancements in commercial depth sensing over the last decade [39, 5, 48, 31] have partly helped in overcoming this limitation. Nevertheless, these sensors suffer from motion blur in dynamic scenes. Additionally, both RGB and depth camera-based methods are susceptible to severe self-occlusion artifacts caused by loose clothing and articulated body poses.

With the advent of deep learning models, 3D reconstruction from a monocular image [20, 43, 16], which is an ill-posed problem, has garnered significant interest. Challenges like self-occlusions, viewpoint variations, and clothing obstacles make the scenario more difficult. One class of existing deep learning solutions attempts to fit a parametric body model like SMPL [26] to a monocular input image using global image features [20, 14, 33]. Recently, several approaches [1, 8, 34, 2] have proposed parametric clothing over the SMPL body. However, these methods can only accommodate relatively tight clothing styles as they primarily estimate local displacement/deformation of original SMPL vertices.

The other class of non-parametric body reconstruction techniques imposes no such body prior constraints [37, 38, 30, 43, 7, 45] and hence can potentially handle loose clothing scenarios. In particular, the recent implicit function learning models, PIFu [37] and PIFuHD [38], estimate voxel occupancy by utilizing pixel-aligned RGB image features computed by projecting 3D points onto the input image. However, the pixel-aligned features suffer from depth ambiguity as multiple 3D points are projected to the same pixel. As an alternate representation for 3D objects/scenes, some recent works model scenes as multiplane images (MPIs) [42]. An MPI is a layered representation of depth in which each layer lies at a fixed depth in the camera frustum. 3D human body reconstruction has also been attempted in the same vein by predicting front and back depth maps in [12]. However, this approach fails to handle self-occlusions by body parts. Another interesting work [19] attempted to address the self-occlusion problem by predicting multiple peeled depth maps. Their sparse representation of the human body was obtained by encoding the 3D body surface using an inverse ray-tracing formulation. Nevertheless, these approaches fail to take into account the fine-grained local geometric structure and do not enforce global consistency on the body shape/pose to encourage physically plausible shapes and poses of human body parts.

The aforementioned problems can be addressed by introducing a body prior while reconstructing humans in loose clothing. The volume-to-volume translation network proposed in [49] attempts to fuse image features with the SMPL prior in a volumetric representation. However, volumetric convolutions are computationally expensive and limited in resolution. Recently, ARCH [18] proposed to induce a human body prior by sampling points around a template SMPL mesh before evaluating occupancy labels for each point. However, sampling around the canonical body surface is not sufficient to reconstruct humans with complex articulated poses in loose clothing. Another interesting work [17] attempted to refine implicit function estimation by fusing volumetric features and pixel-aligned features to resolve local feature ambiguity. However, volumetric feature estimation is still computationally expensive and their method is not end-to-end trainable. Thus, existing methods either lack computational efficiency or compute only small deformations around the SMPL body prior. However, in loose clothing scenarios, garments can also undergo geometrical deformations that are independent of the underlying body shape and pose. Hence, there is an acute need to learn these deformations apart from body shape/pose deformation. This can be achieved by efficiently combining the complementary strengths of the parametric and non-parametric reconstruction paradigms.

In this paper, we propose an efficient, sparse, and robust 3D body reconstruction method that can successfully handle significantly loose clothing and skewed viewpoints. Our method combines the SMPL body prior with peeled depth predictions to achieve geometrically rich 3D body reconstruction from a monocular image. Starting with a good estimate of the SMPL body from a monocular image (e.g., using [41]), we encode this mesh into the sparse 2D peeled depth map representation proposed in [19] to obtain SMPL peeled depth maps. The SMPL prior, along with the monocular RGB image, is fed as input to our network, which then predicts a dense additive residual deformation for each pixel in the SMPL peeled depth maps. This additive depth-based deformation accounts for clothing-specific geometrical deformations at each location near the body surface. However, such modeling does not focus on reconstructing regions where the clothing is far from the body surface. To reconstruct such regions, we propose to predict peeled depth maps in a separate prediction branch of our network. These predicted peeled maps are later fused with the residual deformation maps along with the initial SMPL peeled depth maps to obtain the final fused peeled depth maps. A separate decoder is used to predict peeled RGB maps. The peeled RGB maps are superimposed on the fused peeled depth maps to recover a colored point-cloud and subsequently a vertex-colored mesh. The SMPL prior constrains our model in two ways: first, the network retains the geometrically consistent body parts that are not occluded by clothing, and second, it provides a body shape context that improves prediction of peeled depth maps. This enables our network to learn with only L1 losses, unlike other existing methods that use a GAN loss and a 3D Chamfer loss.

Figure 2: Pipeline: We predict an SMPL prior from the input image and later convert it to peeled depth maps. These, along with the input image, are fed to an encoder. Subsequently, three separate decoder branches predict peeled RGB, peeled depth, and Residual Deformation maps, respectively. Finally, a layer-wise fusion of the SMPL peeled depth maps, the Residual Deformation maps, and the predicted peeled depth maps is performed to obtain the fused peeled depth maps, which are then back-projected along with the peeled RGB maps to obtain a per-vertex colored point-cloud. (The Ground Truth mesh is shown for comparison only.)

Our technical contributions are listed below:

  • We propose SHARP, a novel end-to-end trainable encoder-decoder architecture that uses only L1 losses on 2D maps to predict a 3D body model with loose clothing from a monocular image.

  • SHARP achieves sparse and efficient fusion of a parametric body prior with non-parametric peeled depth representation.

  • We evaluate our method on various publicly available datasets and report superior qualitative & quantitative results as compared to state-of-the-art methods.

2 Related Work

Parametric Body Fitting. Estimating the 3D human body pose [27, 32] using deep neural networks has achieved great success with robust performance. The naked animatable 3D body is represented by SMPL [26], SMPL-X [35], or SCAPE [3]. These models can be estimated from a single image by directly regressing the 3D joint locations. HMR [20] proposes to regress SMPL parameters while minimizing the re-projection loss with the known 2D joints. Different priors have been used to refine the parametric estimates, as in [44, 33, 23, 21]. Although these approaches are computationally efficient, they lack person-specific details. SMPL vertex offset estimates have been proposed to capture tight clothing details [8, 47, 24, 46].

Non-parametric Body Reconstruction. Recovering the 3D human body from multi-camera setups employs voxel carving, triangulation, multi-view stereo, and shape-from-X techniques [4, 11, 9, 29]. With the advent of deep learning, voxel methods initially gained popularity, as voxels are a natural extension of 2D pixels [45, 43, 49]. SiCloPe [30] estimates silhouettes in novel views to recover the underlying 3D shape. Recent implicit function learning methods for human body reconstruction use a locally-aligned pixel feature space [37, 38]. However, they suffer from sampling limitations and do not model an explicit shape representation. Peeled maps [19] proposed a sparse representation by estimating only surface intersections, posing the problem as an extension of ray tracing. A similar idea is used for view synthesis by NeRF [28], which samples points along the camera ray to evaluate RGB on these samples.

Prior-based Non-Parametric Body Reconstruction. ARCH [18] combines the SMPL parametric body model with implicit functions to assign skinning weights for the reconstructed mesh. However, the method cannot reconstruct loose clothing in complex poses. Geo-PIFu [17] learns deep implicit functions by utilizing structure-aware 3D voxel features along with 2D pixel features. DeepHuman [49] leverages dense semantic representations from SMPL as an additional input. Nevertheless, both Geo-PIFu and DeepHuman are volumetric-regression based approaches that incur high computational costs.

3 Method

In this section, we first outline peeled representations for 3D shapes followed by the details of our proposed method.

3.1 Peeled Representation

Peeled representation is a sparse, non-parametric encoding of 3D shapes [19]. Each textured 3D body is represented as a set of 2D maps - four depth maps and four RGB maps. The depth and color values are recorded at each intersection of a camera ray with the 3D surface. This representation is more efficient than voxels and implicit functions as it only stores ray-surface intersection in a 2D multi-layered layout. Although this representation is able to handle severe self-occlusions, it lacks the ability to predict plausible human body shapes for complex poses as there is no inductive bias on the human body structure.
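To make the layered ray-surface encoding concrete, the following is a minimal toy sketch of depth peeling. It is not the rendering pipeline used in the paper: it assumes orthographic rays along the z-axis and a scene made of analytic spheres, and the function name `peel_spheres` and the convention that 0 marks "no intersection" are illustrative assumptions.

```python
import numpy as np

def peel_spheres(spheres, size=9, extent=1.0, layers=4):
    """Toy depth peeling: orthographic rays along +z through a scene of spheres.

    spheres: iterable of (cx, cy, cz, radius), all placed at positive depth.
    Returns an array of shape (layers, size, size); 0 marks 'no intersection'.
    """
    coords = np.linspace(-extent, extent, size)
    peeled = np.zeros((layers, size, size))
    for row, y in enumerate(coords):
        for col, x in enumerate(coords):
            hits = []
            for cx, cy, cz, radius in spheres:
                d2 = radius ** 2 - (x - cx) ** 2 - (y - cy) ** 2
                if d2 > 0:  # the ray pierces this sphere: record entry and exit
                    half = np.sqrt(d2)
                    hits.extend([cz - half, cz + half])
            # The k-th nearest intersection goes into the k-th peeled layer.
            for k, depth in enumerate(sorted(hits)[:layers]):
                peeled[k, row, col] = depth
    return peeled

# Two spheres, one behind the other: the central ray crosses four surfaces.
maps = peel_spheres([(0.0, 0.0, 2.0, 0.5), (0.0, 0.0, 4.0, 0.5)])
print(maps[:, 4, 4])
```

Only the ray-surface intersections are stored, so the memory cost is a handful of 2D maps regardless of how much empty space surrounds the shape.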

3.2 SHARP Overview

Given an image of a person in an arbitrary pose with loose clothing, we aim to reconstruct a textured 3D body surface with high-frequency geometrical details in clothing and a plausible body part structure, as shown in Figure 2. There are three main components to our method: i) an input peeled SMPL prior, ii) additive Residual Deformation (RD) prediction, and iii) peeled map fusion. First, we get the SMPL body mesh fitted to the input image and convert it to a peeled depth representation to obtain a peeled shape prior, as shown in Figure 2(b). This, along with the input image, is fed as input to the shared encoder in our network. The network then predicts three outputs through different decoder branches, namely, peeled RGB maps, peeled depth maps, and peeled residual deformations, as shown in Figure 2(c-e). The topmost decoder branch predicts only three peeled RGB maps, as the input image naturally acts as the first peeled RGB map. The predicted peeled depth maps and residual deformation maps are subsequently combined using the SMPL peeled depth maps in a peeled map fusion module to get the final fused peeled depth maps shown in Figure 2(f). Subsequently, a colored point-cloud representation is obtained by back-projecting the fused peeled depth maps and the peeled RGB maps to 3D coordinates using known camera intrinsics, as shown in Figure 2(g). This point-cloud is further post-processed to obtain a mesh reconstruction using the Poisson surface reconstruction method [22]. A detailed discussion of these network components and the training protocol, including the loss functions, is provided below.
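The back-projection step can be sketched as follows, assuming a standard pinhole camera with focal lengths `fx`, `fy` and principal point `cx`, `cy`, and assuming that a depth of 0 marks pixels without a surface hit. The function name and these conventions are illustrative and not taken from the paper.

```python
import numpy as np

def backproject(depth_layers, fx, fy, cx, cy):
    """Back-project (L, H, W) peeled depth maps into an (N, 3) point cloud.

    Assumes a pinhole camera; pixels with depth 0 are treated as empty.
    """
    points = []
    for depth in depth_layers:
        v, u = np.nonzero(depth > 0)   # pixel rows (v) and columns (u) with a hit
        z = depth[v, u]
        x = (u - cx) * z / fx          # invert u = fx * x / z + cx
        y = (v - cy) * z / fy          # invert v = fy * y / z + cy
        points.append(np.stack([x, y, z], axis=1))
    return np.concatenate(points, axis=0)

# Two peeled layers with one valid pixel each.
layers = np.zeros((2, 4, 4))
layers[0, 2, 3] = 2.0
layers[1, 1, 1] = 4.0
cloud = backproject(layers, fx=2.0, fy=2.0, cx=2.0, cy=2.0)
print(cloud)
```

Per-point colors could be gathered the same way by indexing the peeled RGB maps with the same `(v, u)` pairs.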

3.2.1 Peeled Shape Prior

We estimate the SMPL model parameters (pose, shape, and global translation/rotation) for an input monocular image using [41]. The predicted SMPL mesh is then converted to the sparse peeled depth map representation.

Figure 3: Residual Deformation rendered as point-cloud. (a) Input SMPL peeled depth prior, shown in blue. (b) Predicted residual deformation, shown in red. (c) Superimposition of (a) and (b).

The input image concatenated with the SMPL peeled depth maps forms the input to our network.

3.2.2 Residual Deformation (RD)

To estimate image-specific deformations from the naked SMPL prior input, we propose to predict Residual Deformation maps as pixel-wise offsets from the SMPL peeled depth maps.

More specifically, this models deformations of the SMPL prior in accordance with the clothing present in the input image. On pixels where there is no clothing (e.g., face and hands), it predicts the offsets to be zero, thereby retaining the geometrical structure of body parts, as shown in Figure 3.
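As a rough illustration of what such an offset map looks like, the sketch below computes a ground-truth residual as the difference between a clothed peeled depth map and the SMPL peeled depth map. It assumes that the offset is defined only where the SMPL prior has a surface hit and that a depth of 0 means "no hit"; the function name is hypothetical.

```python
import numpy as np

def residual_deformation(clothed_depth, smpl_depth):
    """Pixel-wise offset from the SMPL peeled depth to the clothed depth.

    Assumption: the offset is only defined where the SMPL prior has a hit
    (depth > 0); everywhere else it is set to 0.
    """
    on_body = smpl_depth > 0
    return np.where(on_body, clothed_depth - smpl_depth, 0.0)

smpl = np.array([[2.0, 2.0, 0.0]])     # last pixel: ray misses the body
clothed = np.array([[2.0, 1.9, 3.0]])  # first pixel: bare skin, no offset
print(residual_deformation(clothed, smpl))
```

In this toy row, the unclothed pixel gets a zero offset, the clothed pixel gets a small negative offset (cloth slightly in front of the body), and the pixel that misses the body carries no residual at all.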

3.2.3 Peeled Map Fusion

The RD map modeling does not focus on regions where loose clothing is far from the body surface. Hence, we compute the final fused peeled depth maps by fusing the RD maps with the predicted peeled depth maps using the SMPL prior. The fusion is performed independently for each peeled layer and uses element-wise multiplication.
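One plausible reading of this layer-wise fusion is sketched below. It rests on an assumption that is not confirmed by the text: a binary per-layer mask that is 1 where the SMPL prior has a surface hit, so that the SMPL depth plus residual is used on those pixels and the directly predicted peeled depth is used elsewhere. The names `fuse_peeled_maps` and `mask` are illustrative.

```python
import numpy as np

def fuse_peeled_maps(smpl_depth, rd, pred_depth):
    """Fuse (L, H, W) maps layer by layer with element-wise multiplication.

    Assumed rule: mask * (SMPL depth + residual) + (1 - mask) * predicted depth,
    where mask is 1 on pixels covered by the SMPL prior in that layer.
    """
    mask = (smpl_depth > 0).astype(pred_depth.dtype)
    return mask * (smpl_depth + rd) + (1.0 - mask) * pred_depth

smpl = np.array([[[2.0, 0.0]]])    # second pixel: loose cloth far from the body
rd = np.array([[[-0.1, 0.3]]])     # residual is ignored where the mask is 0
pred = np.array([[[1.7, 3.0]]])
print(fuse_peeled_maps(smpl, rd, pred))
```

Under this assumption, body-adjacent geometry stays anchored to the parametric prior, while regions the prior cannot reach are filled in by the non-parametric branch.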

3.3 Training

SHARP's learning objective is a weighted combination of the individual loss terms described below, where three regularization parameters balance the terms.

The peeled depth loss captures the sum of the L1 norm between the ground truth and predicted peeled depth maps over all peeled map layers.

The RD loss constrains the predicted residual deformations of the SMPL prior to match the ground truth offsets. We also enforce per-layer gradient smoothness between the predicted and ground truth maps.

Additionally, we train our network with an L1 loss between the predicted and ground truth peeled RGB maps.
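A minimal NumPy sketch of such an objective is given below. The weight names (`w_rd`, `w_rgb`, `w_sm`), the use of per-layer means, the finite-difference form of the smoothness term, and the choice to apply smoothness to the residual maps are all assumptions made for illustration; the text does not pin these details down.

```python
import numpy as np

def l1_over_layers(pred, target):
    """Sum over peeled layers of the per-layer mean absolute error."""
    return float(sum(np.abs(p - t).mean() for p, t in zip(pred, target)))

def gradient_smoothness(pred, target):
    """L1 difference of finite-difference image gradients, summed over layers."""
    loss = 0.0
    for p, t in zip(pred, target):
        loss += np.abs(np.diff(p, axis=0) - np.diff(t, axis=0)).mean()
        loss += np.abs(np.diff(p, axis=1) - np.diff(t, axis=1)).mean()
    return float(loss)

def total_loss(pred, gt, w_rd=1.0, w_rgb=1.0, w_sm=1.0):
    """pred and gt are dicts with 'depth' and 'rd' (L, H, W) and 'rgb' (L', H, W, 3)."""
    return (l1_over_layers(pred["depth"], gt["depth"])
            + w_rd * l1_over_layers(pred["rd"], gt["rd"])
            + w_rgb * l1_over_layers(pred["rgb"], gt["rgb"])
            + w_sm * gradient_smoothness(pred["rd"], gt["rd"]))

shapes = {"depth": (4, 2, 2), "rd": (4, 2, 2), "rgb": (3, 2, 2, 3)}
pred = {k: np.zeros(s) for k, s in shapes.items()}
gt = {k: np.ones(s) for k, s in shapes.items()}
print(total_loss(pred, gt, w_rd=0.5, w_rgb=2.0, w_sm=1.0))
```

Every term operates on 2D maps only, which is what allows training without 3D supervision such as a Chamfer loss on point clouds.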

3.4 Discussion on Architectural Choices

Existing prior-based non-parametric body reconstruction methods represent the body prior in 3D space (either voxels or meshes). However, this leads to computational inefficiencies as they need to densely probe the 3D space. Therefore, we adopted a sparse 2D peeled representation [19] of 3D shapes that can be efficiently processed by 2D convolution layers. Moreover, it enables us to deform the body prior in order to approximate the loose clothing using residual deformation maps. Additionally, the peeled map representation also inherently addresses the self-occlusion issue.

Figure 4: (a) Clothed mesh and SMPL body. (b) Casting rays inside clothing volume. (c) Faces of the body which intersect with the rays are removed. (d) final fused mesh.

In terms of network architecture, we use a shared encoder that encodes the peeled shape prior along with the input image using ResNet blocks into a shared latent representation. This enables a joint encoding of both geometry (coming from SMPL prior) and appearance and further regularizes the predicted peeled maps.

Regarding our peeled map fusion strategy, we model deformations of the SMPL prior surface as a pixel-wise residual offset in peeled layer representation. Unlike other existing methods ([24, 18, 49]) that model 3D deformations directly (in either volumetric or surface representations), our method is more efficient as we treat it as a simple pixel-wise offset. Additionally, this enables a more effective and natural fusion of predicted geometry of loose clothing with that of the underlying body.

Figure 5: Qualitative results: Given (a) an input image and (b) corresponding SMPL prior mesh, we render point-cloud of (c) predicted residual deformations (in red) added to the SMPL prior (in blue), (d) point-cloud obtained from the predicted peeled maps, and (e), (f) the reconstructed mesh from two views.

4 Experiments & Results

4.1 Implementation Details

Our multi-branch encoder-decoder network is trained end-to-end for epochs. The shared encoder consists of downsampling layers followed by ResNet blocks. Sigmoid activation is used in last layer of the and decoder branches while a tanh activation is used for the decoder branch. The output values are scaled to a range, which approximately maps to metric scale range and holds empirical validity for cloth variations present in the datasets. Here, we use only layers of peeled representation, although the method is generalized to more number of peeled layers.

We use the Adam optimizer with an exponentially decaying learning rate starting from . Our network takes 30 hrs to train on Nvidia GTX Ti GPUs with a batch size of and , and are set to and , respectively. We use [10] for rendering the peeled maps.

4.2 Datasets

We perform both qualitative and quantitative evaluations on the following publicly available datasets.

Cloth3D [6] is a collection of sequences of draped SMPL meshes simulated with MoCap data. Each frame of a sequence contains garment and body-specific deformation information. Garment styles range from skirts to very loose robes. We augment this data by capturing SMPL texture maps with minimal clothing to simulate realistic body textures using [1]. For each sequence, five frames are randomly sampled and the naked body is subtracted from the garment, as shown in Figure 4, to obtain the clothed mesh sans occluded body as ground truth for training. We use the ground truth SMPL parameters to get the initial peeled map prior for our training. We also augment this data by rotating each mesh by several angles about the yaw axis to increase viewpoint variations.

THuman [49] consists of 6800 human meshes registered with SMPL body in varying poses and garments. The dataset was obtained using consumer RGBD sensors. Although the dataset has diverse poses and shapes, it has relatively tight clothing examples with low-quality textures.

Clothing style     CD       P2S
JumpSuit           0.0003   0.00872
Dress              0.0013   0.0221
Top+Trousers       0.0006   0.0108
Table 1: Quantitative evaluation of SHARP on different types of clothing styles in Cloth3D.

Method             CD       P2S
PIFu [37]          0.0064   0.0515
ARCH [18]          0.0034   0.0357
Geo-PIFu [17]      0.0012   0.0256
PeeledHuman [19]   0.0016   0.0291
Ours (baseline)    0.0010   0.01417
Ours               0.0006   0.01204
Table 2: Quantitative comparison with state-of-the-art methods on Cloth3D data.

4.3 Qualitative Results

Our method is able to predict high-fidelity human reconstructions with very loose clothing, as shown in Figure 5 and Figure 6. We demonstrate qualitative results for different clothing styles in Figure 6, where we successfully reconstruct complex body poses (bottom row). Please refer to the supplementary material for additional results.

Figure 6: Qualitative results: SHARP predicts consistent body parts along with realistic clothing deformations from a monocular input image. Note that these are per-frame reconstructions.

4.4 Quantitative Evaluation

To quantitatively evaluate performance of SHARP, we compute Point-to-Surface (P2S) distance and Chamfer Distance (CD) using the point-cloud obtained from the predicted fused maps. P2S calculates the distance between this point-cloud and the ground truth mesh. CD calculates the distance between this point-cloud and the point-cloud sampled from the ground truth mesh. Table 1 summarizes quantitative analysis on Cloth3D dataset where we evaluate these metrics on different styles of clothing to indicate the robustness of our method.
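For reference, a brute-force sketch of the two metrics is shown below. Conventions for Chamfer Distance vary across papers (squared versus unsquared distances, sums versus means), and the exact variant used here is not stated, so the symmetric mean of squared nearest-neighbour distances is an assumption. The one-sided function approximates P2S by replacing the ground truth surface with a dense set of points sampled from it, which is also an assumption.

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """(N, M) matrix of squared Euclidean distances between two point sets."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)

def chamfer_distance(a, b):
    """Symmetric Chamfer Distance using mean squared nearest-neighbour distances."""
    d2 = pairwise_sq_dists(a, b)
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())

def one_sided_distance(pred, surface_samples):
    """Mean distance from each predicted point to its nearest surface sample."""
    d2 = pairwise_sq_dists(pred, surface_samples)
    return float(np.sqrt(d2.min(axis=1)).mean())

pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]])
print(chamfer_distance(pred, gt), one_sided_distance(pred, gt))
```

The pairwise matrix costs O(N·M) memory, so for full-resolution point clouds a spatial index such as a k-d tree would be the more practical choice.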

4.5 Comparisons with the state-of-the-art

Method             CD        P2S
DeepHuman [49]     0.00119   0.00112
Geo-PIFu [17]      0.00017   0.00019
Ours               0.00016   0.00019
Table 3: Quantitative comparisons on THuman dataset

We evaluate the aforementioned metrics on the Cloth3D dataset for PIFu [37], ARCH [18], Geo-PIFu [17] and PeeledHuman [19]. Note that we retrained all these competing methods on Cloth3D data. We consider predicting peeled depth maps (before fusion) as our baseline.

Figure 7: Reconstruction Results (Cloth3D): Our method yields superior geometrical reconstruction in comparison with PeeledHuman [19], Geo-PIFu [17], ARCH [18] and PIFu [37] on Cloth3D dataset.
Figure 8: Inference results of our model on DeepFashion dataset, trained on Cloth3D dataset.

Unlike [19], which uses a GAN with a CD loss, we use a simple encoder-decoder architecture without a CD loss on the predicted depth maps. To perform a fair comparison with ARCH, instead of sampling 3D points around the SMPL body in the canonical pose, we directly sample from the final pose for evaluation of labels. As in other methods, we transform the models predicted by all methods to the canonical coordinates of the ground truth mesh.

Table 2 summarizes the quantitative results where we outperform existing methods on Cloth3D data by a good margin for both CD and P2S metrics. Figure 7 shows that our method consistently outperforms other existing methods in terms of quality of reconstructed geometrical details around both body and cloth region, over varying body shape and clothing styles.

We also evaluate our method on the THuman dataset with the evaluation metric code provided by [17] and report the results in Table 3. In this case, we perform on par with Geo-PIFu, as the THuman dataset has shapes with primarily tight clothing, in contrast to the Cloth3D dataset, which has a wide range of loose clothing styles and on which we outperform Geo-PIFu [17]. Finally, we show reconstruction results on the DeepFashion [25] dataset in Figure 8.

Metric   ResNet blocks   Ours (baseline)   Ours
CD       6               0.0013            0.0008
CD       9               0.00011           0.0006
CD       18              0.00010           0.0006
P2S      6               0.01720           0.01452
P2S      9               0.0155            0.01338
P2S      18              0.01417           0.01204
Table 4: Effect of varying the number of ResNet blocks.

4.6 Ablation Study

Impact of Peeled Map Fusion: In Figure 10, it can be observed that fusion of the RD maps with the predicted peeled maps helps to retain the body parts in complex and self-occluded poses. On the other hand, in the case of our baseline setup (without fusion module), we obtain distorted body parts.

Impact of Architecture Variations: We also evaluate the performance of our method by varying the number of ResNet blocks in the encoder, as shown in Table 4. We can observe that even with a much smaller network size (6 ResNet blocks), our network achieves better performance than the baseline setup and all other existing methods compared in Table 2.

Robustness to Noise in SMPL: We demonstrate the robustness of our method when the initial SMPL prediction is noisy. We add random Gaussian noise with zero mean and three different variances to the ground-truth SMPL pose parameters, as shown in Figure 11 (top row). We compute and plot the P2S distance between the point-clouds predicted with noisy input SMPL and the point-cloud predicted without noise, as shown in Figure 11 (bottom row). As we can observe, our method can recover from small Gaussian noise in SMPL parameters, as we fuse our final depth maps (Section 3.2.3) from the RD and predicted peeled depth maps. Also, note that the region near the legs in Figure 11 is largely unaffected, as the noise in the SMPL prior is compensated by the peeled maps in the clothed region during fusion.
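The perturbation itself is simple to reproduce. The sketch below adds zero-mean Gaussian noise of a given variance to a 72-dimensional SMPL pose vector (24 joints with 3 axis-angle values each). The variance values in the loop are placeholders chosen for illustration, not the ones used in the experiment, and `perturb_pose` is a hypothetical helper name.

```python
import numpy as np

def perturb_pose(pose, variance, seed=0):
    """Add zero-mean Gaussian noise with the given variance to SMPL pose parameters."""
    rng = np.random.default_rng(seed)
    return pose + rng.normal(loc=0.0, scale=np.sqrt(variance), size=pose.shape)

pose = np.zeros(72)  # rest pose: 24 joints x 3 axis-angle components
for variance in (0.001, 0.01, 0.1):  # placeholder values, not from the paper
    noisy = perturb_pose(pose, variance)
    print(variance, float(np.abs(noisy - pose).max()))
```

Note that NumPy's `normal` takes a standard deviation, so the variance has to be converted with a square root before sampling.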

Figure 9: Upsampling: (a) predicted point-cloud, (b) simplified point-cloud, (c) upsampled point-cloud, (d) superimposed point-clouds.

Effect of Upsampling on the Predicted Point-cloud: The predicted point-cloud, generated from the fused depth maps, can sometimes be sparse and contain holes. This issue arises because the perspective projection causes loss of information in the peeled maps, and this loss is reflected in the back-projected point-cloud. To tackle this issue, we create a simplified point-cloud by uniformly sampling points from the predicted point-cloud. Subsampling helps avoid amplifying isolated noisy points, which come mainly from the parts that are not visible in the input image. Subsequently, we upsample the simplified point-cloud using [36] to obtain a dense, upsampled point-cloud. The upsampling closes most of the potential holes but might lose the fine-grained geometrical details of the predicted point-cloud. Hence, we superimpose the predicted and upsampled point-clouds, which closes a significant number of holes while retaining the geometrical details, as shown in Figure 9.
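The subsample, upsample, and superimpose steps can be sketched as below. The learned upsampler of [36] is not reproduced: `midpoint_upsample` is a deliberately naive stand-in that inserts the midpoint between each point and its nearest neighbour, and both function names are assumptions made for this illustration.

```python
import numpy as np

def midpoint_upsample(points):
    """Stand-in upsampler: add the midpoint between each point and its nearest neighbour."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbour
    nearest = d2.argmin(axis=1)
    midpoints = 0.5 * (points + points[nearest])
    return np.concatenate([points, midpoints], axis=0)

def densify(points, n_keep, upsample=midpoint_upsample, seed=0):
    """Uniformly subsample, upsample the simplified cloud, then superimpose on the original."""
    rng = np.random.default_rng(seed)
    n_keep = min(n_keep, len(points))
    keep = rng.choice(len(points), size=n_keep, replace=False)  # uniform subsampling
    upsampled = upsample(points[keep])
    # The original points keep fine detail; the upsampled points fill holes.
    return np.concatenate([points, upsampled], axis=0)

cloud = np.random.default_rng(1).random((50, 3))
dense = densify(cloud, n_keep=20)
print(dense.shape)
```

Passing the upsampler as a callable keeps the sketch independent of any particular network; a learned model could be dropped in through the `upsample` argument.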

Figure 10: Our peeled map fusion retains body parts in the final output (b) as compared to the output of the baseline setup (a).
Figure 11: Effect of noise in SMPL prior. (a) Noisy SMPL (orange) and ground-truth SMPL (purple). (b) P2S distance between respective noisy and noise-free reconstruction with our method.

5 Conclusion

We introduced SHARP, a novel shape-aware peeled representation for the reconstruction of human bodies with loose clothing. Our method achieves sparse and efficient fusion of a parametric body prior with a non-parametric peeled depth representation. We evaluated our method on various publicly available datasets and reported superior qualitative and quantitative results as compared to state-of-the-art methods. As part of future work, it would be interesting to introduce temporal consistency in the proposed solution.

References


  • [1] T. Alldieck, M. Magnor, B. L. Bhatnagar, C. Theobalt, and G. Pons-Moll (2019) Learning to reconstruct people in clothing from a single RGB camera. In CVPR, Cited by: §1, §4.2.
  • [2] T. Alldieck, G. Pons-Moll, C. Theobalt, and M. Magnor (2019) Tex2Shape: detailed full human body geometry from a single image. In ICCV, Cited by: §1.
  • [3] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis (2005) SCAPE: shape completion and animation of people. ACM Transactions on Graphics (TOG) 24 (3), pp. 408–416. Cited by: §2.
  • [4] T. C. Azevedo, J. M. R. Tavares, and M. A. Vaz (2009) 3D object reconstruction from uncalibrated images using an off-the-shelf camera. In Advances in Computational Vision and Medical Image Processing: Methods and Applications, pp. 117–136. Cited by: §2.
  • [5] A. Baak, M. Müller, G. Bharaj, H. Seidel, and C. Theobalt (2011) A data-driven approach for real-time full body pose reconstruction from a depth camera. In ICCV, Cited by: §1.
  • [6] H. Bertiche, M. Madadi, and S. Escalera (2020) CLOTH3D: clothed 3d humans. In ECCV, Cited by: §4.2.
  • [7] B. L. Bhatnagar, C. Sminchisescu, C. Theobalt, and G. Pons-Moll (2020) LoopReg: self-supervised learning of implicit surface correspondences, pose and shape for 3D human mesh registration. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1.
  • [8] B. L. Bhatnagar, G. Tiwari, C. Theobalt, and G. Pons-Moll (2019) Multi-Garment Net: learning to dress 3D people from images. In ICCV, Cited by: §1, §2.
  • [9] F. Bogo, J. Romero, G. Pons-Moll, and M. J. Black (2017) Dynamic FAUST: registering human bodies in motion. In CVPR, Cited by: §1, §2.
  • [10] Trimesh. Cited by: §4.1.
  • [11] M. Dou, S. Khamis, Y. Degtyarev, P. Davidson, S. R. Fanello, A. Kowdle, S. O. Escolano, C. Rhemann, D. Kim, J. Taylor, et al. (2016) Fusion4D: real-time performance capture of challenging scenes. ACM Transactions on Graphics (TOG) 35 (4), pp. 1–13. Cited by: §1, §2.
  • [12] V. Gabeur, J. Franco, X. Martin, C. Schmid, and G. Rogez (2019) Moulding humans: non-parametric 3d human shape estimation from single images. In ICCV, Cited by: §1.
  • [13] J. Gall, C. Stoll, E. De Aguiar, C. Theobalt, B. Rosenhahn, and H. Seidel (2009) Motion capture using joint skeleton tracking and surface estimation. In CVPR, Cited by: §1.
  • [14] R. A. Güler, N. Neverova, and I. Kokkinos (2018) DensePose: dense human pose estimation in the wild. In CVPR, Cited by: §1.
  • [15] K. Guo, P. Lincoln, P. Davidson, J. Busch, X. Yu, M. Whalen, G. Harvey, S. Orts-Escolano, R. Pandey, J. Dourgarian, et al. (2019) The Relightables: volumetric performance capture of humans with realistic relighting. ACM Transactions on Graphics (TOG) 38 (6), pp. 1–19. Cited by: §1.
  • [16] M. Habermann, W. Xu, M. Zollhofer, G. Pons-Moll, and C. Theobalt (2020) DeepCap: monocular human performance capture using weak supervision. In CVPR, Cited by: §1.
  • [17] T. He, J. Collomosse, H. Jin, and S. Soatto (2020) Geo-PIFu: geometry and pixel aligned implicit functions for single-view human reconstruction. In Advances in Neural Information Processing Systems, Cited by: §1, §2, Figure 7, §4.5, §4.5, Table 2, Table 3.
  • [18] Z. Huang, Y. Xu, C. Lassner, H. Li, and T. Tung (2020) ARCH: animatable reconstruction of clothed humans. In CVPR, Cited by: §1, §2, §3.4, Figure 7, §4.5, Table 2.
  • [19] S. S. Jinka, R. Chacko, A. Sharma, and P. Narayanan (2020) PeeledHuman: robust shape representation for textured 3D human body reconstruction. In 3DV, Cited by: §1, §1, §2, §3.1, §3.4, Figure 7, §4.5, §4.5, Table 2.
  • [20] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik (2018) End-to-end recovery of human shape and pose. In CVPR, Cited by: §1, §2.
  • [21] A. Kanazawa, J. Y. Zhang, P. Felsen, and J. Malik (2019) Learning 3D human dynamics from video. In CVPR, Cited by: §2.
  • [22] M. Kazhdan, M. Bolitho, and H. Hoppe Poisson surface reconstruction. Cited by: §3.2.
  • [23] N. Kolotouros, G. Pavlakos, M. J. Black, and K. Daniilidis (2019) Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In ICCV, Cited by: §2.
  • [24] N. Kolotouros, G. Pavlakos, and K. Daniilidis (2019) Convolutional mesh regression for single-image human shape reconstruction. In CVPR, Cited by: §2, §3.4.
  • [25] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang (2016) DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.5.
  • [26] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black (2015) SMPL: a skinned multi-person linear model. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) 34 (6), pp. 248:1–248:16. Cited by: §1, §2.
  • [27] D. Mehta, S. Sridhar, O. Sotnychenko, H. Rhodin, M. Shafiei, H. Seidel, W. Xu, D. Casas, and C. Theobalt (2017) VNect: real-time 3D human pose estimation with a single rgb camera. ACM Transactions on Graphics (TOG) 36 (4), pp. 1–14. Cited by: §2.
  • [28] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng (2020) NeRF: representing scenes as neural radiance fields for view synthesis. In ECCV, Cited by: §2.
  • [29] A. Y. Mulayim, U. Yilmaz, and V. Atalay (2003) Silhouette-based 3-D model reconstruction from multiple images. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 33 (4), pp. 582–591. Cited by: §2.
  • [30] R. Natsume, S. Saito, Z. Huang, W. Chen, C. Ma, H. Li, and S. Morishima (2019) SiCloPe: silhouette-based clothed people. In CVPR, Cited by: §1, §2.
  • [31] R. A. Newcombe, D. Fox, and S. M. Seitz (2015) DynamicFusion: reconstruction and tracking of non-rigid scenes in real-time. In CVPR, Cited by: §1.
  • [32] A. Newell, K. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. In ECCV, Cited by: §2.
  • [33] M. Omran, C. Lassner, G. Pons-Moll, P. Gehler, and B. Schiele (2018) Neural body fitting: unifying deep learning and model-based human pose and shape estimation. In 3DV, Cited by: §1, §2.
  • [34] C. Patel, Z. Liao, and G. Pons-Moll (2020) TailorNet: predicting clothing in 3D as a function of human pose, shape and garment style. In CVPR, Cited by: §1.
  • [35] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. Osman, D. Tzionas, and M. J. Black (2019) Expressive body capture: 3D hands, face, and body from a single image. In CVPR, Cited by: §2.
  • [36] Y. Qian, J. Hou, S. Kwong, and Y. He (2020) PUGeo-net: a geometry-centric network for 3d point cloud upsampling. In European Conference on Computer Vision, pp. 752–769. Cited by: §4.6.
  • [37] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li (2019) PIFu: pixel-aligned implicit function for high-resolution clothed human digitization. In ICCV, Cited by: §1, §2, Figure 7, §4.5, Table 2.
  • [38] S. Saito, T. Simon, J. Saragih, and H. Joo (2020) PIFuHD: multi-level pixel-aligned implicit function for high-resolution 3D human digitization. In CVPR, Cited by: §1, §2.
  • [39] J. Shotton, A. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake (2011) Real-time human pose recognition in parts from single depth images. In CVPR, Cited by: §1.
  • [40] C. Stoll, N. Hasler, J. Gall, H. Seidel, and C. Theobalt (2011) Fast articulated motion tracking using a sums of gaussians body model. In ICCV, Cited by: §1.
  • [41] F. Tan, H. Zhu, Z. Cui, S. Zhu, M. Pollefeys, and P. Tan (2020) Self-supervised human depth estimation from monocular videos. In CVPR, Cited by: §1, §3.2.1.
  • [42] R. Tucker and N. Snavely (2020) Single-view view synthesis with multiplane images. In CVPR, Cited by: §1.
  • [43] G. Varol, D. Ceylan, B. Russell, J. Yang, E. Yumer, I. Laptev, and C. Schmid (2018) BodyNet: volumetric inference of 3D human body shapes. In ECCV, Cited by: §1, §1, §2.
  • [44] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid (2017) Learning from synthetic humans. In CVPR, Cited by: §2.
  • [45] A. Venkat, S. S. Jinka, and A. Sharma (2018) Deep textured 3D reconstruction of human bodies. In BMVC, Cited by: §1, §2.
  • [46] A. Venkat, C. Patel, Y. Agrawal, and A. Sharma (2019) HumanMeshNet: polygonal mesh recovery of humans. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Cited by: §2.
  • [47] A. Venkat, C. Patel, Y. Agrawal, and A. Sharma (2019) HumanMeshNet: polygonal mesh recovery of humans. In ICCV-W, Cited by: §2.
  • [48] X. Wei, P. Zhang, and J. Chai (2012) Accurate realtime full-body motion capture using a single depth camera. ACM Transactions on Graphics (TOG) 31 (6), pp. 1–12. Cited by: §1.
  • [49] Z. Zheng, T. Yu, Y. Wei, Q. Dai, and Y. Liu (2019) DeepHuman: 3D human reconstruction from a single image. In ICCV, Cited by: §1, §2, §3.4, §4.2, Table 3.