PeelNet: Textured 3D reconstruction of human body using single view RGB image

02/16/2020
by Sai Sagar Jinka, et al., IIIT Hyderabad

Reconstructing human shape and pose from a single image is a challenging problem due to issues like severe self-occlusions, clothing variations, and changes in lighting, to name a few. Many applications in the entertainment industry, e-commerce, health-care (physiotherapy), and mobile-based AR/VR platforms can benefit from recovering 3D human shape, pose, and texture. In this paper, we present PeelNet, an end-to-end generative adversarial framework that tackles the problem of textured 3D reconstruction of the human body from a single RGB image. Motivated by ray tracing for generating realistic images of a 3D scene, we represent the human body as a set of peeled depth and RGB maps, obtained by extending rays beyond their first intersection with the 3D object. This formulation allows us to handle self-occlusions efficiently. Current parametric model-based approaches are designed for the underlying naked human body and fail to model loose clothing and surface-level details. The majority of non-parametric approaches are either computationally expensive or produce unsatisfactory results. We present a simple non-parametric solution in which the peeled maps are generated from a single RGB image. The predicted peeled depth maps are back-projected to 3D space to obtain a complete 3D shape, and the corresponding RGB maps provide vertex-level texture details. We compare our method against current state-of-the-art methods in 3D reconstruction and demonstrate its effectiveness on the BUFF and MonoPerfCap datasets.



1 Introduction

Reconstruction of a textured 3D model of the human body from images is a pivotal problem in computer vision and graphics, with widespread applications ranging from the entertainment industry, e-commerce and health-care to AR/VR platforms. Reconstructing human body models from a monocular image is an ill-posed problem, and traditionally multi-view images captured by a calibrated setup were needed. However, recent advancements in deep learning have enabled the reconstruction of plausible textured 3D body models from a monocular image in an in-the-wild setup.

This problem is particularly challenging as the object geometry of non-rigid human shapes varies over time, yielding a large space of complex articulated body poses as well as shape variations. In addition to this, there are several other challenges such as self-occlusions, obstructions due to free form clothing and large viewpoint variations, especially while attempting to reconstruct from a monocular image.

Figure 1:

At inference time, our proposed model (PeelNet) estimates the 3D body by generating the peeled Depth & RGB maps and then back-projecting them into a 3D camera coordinate frame.

Existing deep learning solutions for monocular 3D human reconstruction can be broadly categorized into two classes. The first class of model-based approaches uses a parametric body representation, such as SMPL [11], to recover a 3D surface model from a single image (e.g., [10]). Such model-based methods efficiently capture the pose of the underlying naked body (and its approximate shape, sans texture) but fail to reconstruct fine surface details of the body and wrapped clothing. A recent effort [1] models clothing by estimating a vertex displacement field w.r.t. the SMPL template in a multi-image setup, but cannot handle very loose clothing. Another approach [6] predicts UV coordinates for every foreground pixel, which can be used for texture generation over an SMPL model; however, clothing details are not captured by this method.

The second class of model-free approaches does not assume any parametric model of the body. One set of model-free approaches performs image-to-volume regression for human body recovery [19], which is a natural extension of 2D convolutions. A parallel work [21] proposed to obtain textured body models by combining volumetric regression for 3D reconstruction with a generative model that predicts multiple orthographic RGB maps, which are then projected onto the regressed voxel grid. However, volumetric regression is known to be memory intensive, as it involves redundant 3D convolutions on empty voxels. Recently, deep networks that learn implicit functions have been used to recover 3D human models under loose clothing [15]. However, their inference time is high, as one needs to exhaustively sample points in the 3D volume and evaluate binary occupancy for each point. Very recently, [4] modeled 3D humans as a pixel-wise regression of two independent depth maps, similar to capturing depth maps with two virtual RGBD cameras separated by 180°. However, this formulation is limited to predicting the front and back depth maps of the body completely independently and does not involve any joint loss that imposes 3D body structure on the regression objective. Additionally, since only two depth maps are generated, it fails to handle large self-occlusions in human body models, especially in the presence of large variations in camera view (e.g., images capturing side views of the body). To summarize, model-based methods cannot handle fine surface details, and model-free approaches are either computationally intensive or do not handle large self-occlusions.

In this paper, we tackle the problem of reconstructing a 3D human body model, along with its texture, from a single RGB image using a generative model. Our approach is motivated by the idea of ray tracing in computer graphics, which produces realistic 2D renderings of 3D objects/scenes. Each pixel in the 2D image corresponds to the projection of a real-world point in 3D space where the ray first intersects the object surface (mesh facet). We pose 3D reconstruction from a monocular image as an inverse ray-tracing problem, where the 3D surface geometry is recovered by estimating the depths at which a set of rays emanating from the virtual camera center (passing through individual pixels in the image) intersect the 3D body surface. This can be interpreted as depth peeling of the body surface, yielding a multi-layered representation hereinafter called Peeled Depth maps. Such a layered depth representation enables us to address severe self-occlusions in complex body poses, as we can now recover 3D points from multiple surfaces that project to the same pixel during imaging (see Figure 3).

Additionally, this layered representation is extended to Peeled RGB maps that capture a discrete sampling of the continuous surface texture. Unlike the RGBD maps of commercial depth sensors, which store the depth and color of only the 3D point closest to each pixel, our peeled representation encodes multiple depths and their corresponding RGB values. Thus, we reformulate monocular textured 3D body reconstruction as predicting/generating the respective peeled maps for both RGB and depth. To achieve this dual prediction task, we propose a multi-task generative adversarial network that generates the sets of depth and RGB maps in two different branches, as shown in Figure 1. Instead of using only an L1 norm on each depth map independently, we propose to add a Chamfer loss over the reconstructed point cloud (in the image/camera-centric coordinate system) obtained from all four depth maps. This enables the network to implicitly impose a 3D body shape regularization while generating the peeled depth maps. Additionally, we use a depth consistency loss to enforce the strict ordering constraint (imposed by the inverse ray-tracing formulation), which further regularizes the solution. An occlusion-aware depth loss further emphasizes pixels that suffer from self-occlusion. Thus, our network is able to hallucinate plausible body parts even when the corresponding part is self-occluded in the image. We evaluate our approach on the publicly available BUFF [24] and MonoPerfCap [23] datasets and report superior quantitative and qualitative results compared with other state-of-the-art methods.

To summarize our contributions in this paper:

  • We present a novel approach to represent a 3D human body in terms of peeled depth and RGB maps which is robust to severe self-occlusions.

  • We propose a complete end-to-end pipeline to reconstruct a 3D human body with texture given a single RGB image using an adversarial approach.

  • Our peeled depth and RGB representation is computationally efficient in terms of network size and feed-forward time, yielding fast, high-quality reconstructions.

The remainder of the paper is organized as follows: we review existing works on human body reconstruction in section 2. Our method is explained in section 3. In section 4, we evaluate our method on different datasets and compare it against state-of-the-art methods.


Figure 2: PeelNet overview. The input to our network is a single-view RGB image captured from an arbitrary view. The image is fed to a sequence of residual blocks. This latent representation is then decoded in two separate multi-task branches: one branch generates peeled RGB maps while the other generates peeled depth maps. The peeled RGB maps are learned by minimizing an L1 loss, and the peeled depth maps are learned by minimizing L1 and Chamfer losses. We concatenate the generated maps with the input image and pass them to two different discriminators, one for the RGB maps and the other for the depth maps. The generated peeled depth and RGB maps can be back-projected to obtain a colored point cloud.

2 Related Work

3D human body reconstruction is broadly categorized into parametric and non-parametric methods. Traditionally, voxel carving and triangulation methods were employed for recovering a 3D human body from calibrated multi-camera setups [2]. Another approach proposed in [3] attempts to solve real-time performance capture of challenging scenes. However, they use RGBD images from multiple cameras as input.

Most existing deep learning methods that recover 3D shape from an RGB image build on the parametric SMPL [11] model, as in [10], where a re-projection loss on 2D keypoints is minimized. A discriminator is trained for each joint of the parametric model, making training computationally expensive. Segmentation masks [20] are used to further improve the fitting of the 3D model to the available 2D image evidence. However, these parametric body estimation methods yield a smooth, naked mesh, missing the high-frequency surface-level details. Some works use a voxel grid, i.e., a binary occupancy map, and employ volumetric regression to recover human bodies from a single RGB image [21, 19, 7]; they propose volumetric 3D and multi-view reprojection losses, which are computationally expensive as well. The volumetric representation poses a serious computational disadvantage due to the sparsity of voxel grids, and surface quality is limited by the voxel grid resolution. Deformation-based approaches built on parametric models incorporate these details to an extent. Constraints from body joints, silhouettes, and per-pixel shading information are utilized in [25, 21] to produce per-vertex displacements away from the SMPL model. The authors of [22] estimate vertex displacements by regressing to SMPL vertices. Additionally, researchers have explored parametric clothing on top of the parametric human body model to incorporate tight clothing details over SMPL [1]. This method, however, predicts the mesh using eight input RGB images and fails for complex clothing topologies such as skirts and dresses.

A few approaches directly regress a point cloud [14], but these essentially work only for rigid objects. To address these issues in 3D human body reconstruction, interest has recently grown in non-parametric approaches. Deep generative models have been proposed [13] that, taking inspiration from the visual hull algorithm, synthesize 2D silhouettes from inferred 3D joints; the silhouettes are then back-projected to obtain variations of clothed models and shape complexity. Implicit representations of 3D objects have been employed in deep learning based approaches [12, 15], which represent the 3D surface as a continuous decision boundary of a deep neural network classifier. One drawback of this representation is its slow inference. Moreover, self-occlusions are not handled by these approaches, as they do not impose a human body shape prior. Multi-layer approaches have also been used for 3D scene understanding: Layered Depth Images were proposed in [16] for efficient rendering applications, view synthesis has been used as a proxy task in [18], and, recently, transformer networks were proposed to transfer features to a novel view [17].

3 Problem Formulation

We represent a textured 3D body model as a set of peeled depth maps D_i and corresponding peeled RGB maps R_i. Specifically, given an input RGB image I captured from an arbitrary view, we aim to generate n peeled depth maps D_i, i = 1, ..., n, and n peeled RGB maps R_i, where R_i is the peeled RGB map corresponding to D_i.

3.1 Peeled Representation:

Given a camera and a 3D human body in a scene, we shoot rays from the virtual camera into the 3D world. We treat the human body as a non-convex object and record observations at the first hit of each ray. The depth and RGB evidence of this first hit, D_1 and R_1, are the visible surface details closest to the camera; they correspond to standard RGBD data typically obtained from commercially available Kinect-like sensors. We then peel away this occluding layer and extend the rays beyond the first bounce to hit the next intersecting surface. The corresponding depth and RGB values are iteratively recorded in the subsequent layers as D_i and R_i, respectively. In this work, we consider four intersections per ray (n = 4), which suffices to reconstruct a human performing most daily actions.
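To make the peeling operation concrete, the following is a minimal sketch of how peeled depth maps could be computed from a mesh by multi-hit ray casting with trimesh (the library the paper reports using for its ground-truth maps). The camera placement, focal length and image size below are illustrative assumptions, not the paper's exact ground-truth pipeline.

```python
import numpy as np
import trimesh

def peeled_depth_maps(mesh, H=256, W=256, n_peels=4, f=250.0, cam_z=-2.0):
    """Cast one ray per pixel and keep the first `n_peels` hit depths per ray."""
    cx, cy = W / 2.0, H / 2.0
    origin = np.array([0.0, 0.0, cam_z])               # assumed camera position
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    # Pinhole ray directions (see Eq. (1) below): ((x - cx)/f, (y - cy)/f, 1)
    dirs = np.stack([(xs - cx) / f, (ys - cy) / f, np.ones_like(xs, dtype=float)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    dirs = dirs.reshape(-1, 3)
    origins = np.tile(origin, (dirs.shape[0], 1))

    # One call returns *all* intersections of every ray with the non-convex body.
    locs, ray_idx, tri_idx = mesh.ray.intersects_location(origins, dirs, multiple_hits=True)
    z_depth = locs[:, 2] - cam_z                        # depth along the viewing axis

    depth = np.zeros((n_peels, H * W), dtype=np.float32)
    for r in np.unique(ray_idx):
        hits = np.sort(z_depth[ray_idx == r])[:n_peels]  # nearest-first peels
        depth[: len(hits), r] = hits                     # D_1 ... D_n for this pixel
    # Peeled RGB maps can be filled analogously by looking up colors for the
    # returned triangle indices (tri_idx) from the mesh's texture/visuals.
    return depth.reshape(n_peels, H, W)
```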

Our aim is to predict the closest plausible shape that is consistent with the available image evidence. If the camera intrinsics are known, i.e., the focal lengths f_x and f_y along the camera's x and y axes and its center (c_x, c_y), then the direction of the ray in the camera coordinate system corresponding to pixel (x, y) is given by:

r(x, y) = ( (x − c_x)/f_x , (y − c_y)/f_y , 1 )        (1)

The ray is extended beyond the first hit. Conversely, if RGBD data is available and the camera intrinsics are provided, we can locate the 3D point corresponding to each pixel. For a pixel (x, y) with depth d, its location P in the camera coordinate system is given by:

P(x, y) = ( (x − c_x)·d/f_x , (y − c_y)·d/f_y , d )        (2)

where c_x = W/2 and c_y = H/2, given the width W and height H of an image I; i.e., we assume the camera center is at the center of the image. The point P is coloured with the RGB value at pixel (x, y).
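For concreteness, a small NumPy sketch of back-projecting peeled depth and RGB maps into a colored point cloud using Eq. (2); the array layout and intrinsics handling are our assumptions:

```python
import numpy as np

def backproject_peeled_maps(depth, rgb, fx, fy):
    """depth: (n, H, W) peeled depth maps, rgb: (n, H, W, 3) peeled RGB maps.
    Returns an (M, 3) point cloud and its (M, 3) per-point colors."""
    n, H, W = depth.shape
    cx, cy = W / 2.0, H / 2.0                      # principal point at image center
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    points, colors = [], []
    for i in range(n):                             # each peel layer is a partial shape
        d = depth[i]
        valid = d > 0                              # zero depth = ray missed the body
        X = (xs[valid] - cx) * d[valid] / fx       # Eq. (2)
        Y = (ys[valid] - cy) * d[valid] / fy
        points.append(np.stack([X, Y, d[valid]], axis=-1))
        colors.append(rgb[i][valid])
    return np.concatenate(points), np.concatenate(colors)
```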

Relation to Implicit Representation: The peeled representation of the 3D body can be considered the inverse problem of an implicit function f, where f(P) = 1 if a 3D point P lies inside the body and f(P) = 0 otherwise. The underlying surface is implicitly obtained as an isosurface of f. For a ray r(x, y), we estimate where in the 3D volume the body is present. This can be written in relation to the implicit function as:

(3)
Figure 3: Ray tracing view. From a camera placed at an arbitrary location, we shoot rays that pass through the human body. For each ray, we record depth and RGB evidence at multiple intersections.

3.2 PeelNet

In this section, we introduce PeelNet, our proposed end-to-end framework for 3D reconstruction of the human body. Figure 2 gives an overview of the pipeline. The network takes as input a single RGB image captured from an arbitrary viewpoint and outputs the peeled depth maps D_i and RGB maps R_i described in Section 3. For this task, we use a conditional generative adversarial network and formulate 3D reconstruction as generating sets of peeled RGB and depth maps.

We build upon the generator and discriminator architectures proposed in [8]. The input RGB image is encoded and then fed to a series of residual blocks. The convolutional layers use normalization layers and ReLU activations. Even though the peeled RGB and depth maps are spatially aligned, they are sampled from different distributions; hence, we decode these maps in two different branches, as shown in Figure 2. The network finally produces the peeled RGB maps and depth maps, which are back-projected to a 3D volume to reconstruct the 3D human body. We also use two discriminators, one for the RGB maps and one for the depth maps. We use a Markovian discriminator as proposed in [8], which enforces correctness of high-frequency structure. We train our network with the following loss function:

L_total = L_gan + λ_depth·L_depth + λ_rgb·L_rgb + λ_cham·L_cham + λ_dc·L_dc        (4)

where λ_depth, λ_rgb, λ_cham and λ_dc are the weights for the occlusion-aware depth loss, the RGB loss, the Chamfer loss and the depth consistency loss, respectively. Each loss term is explained in detail below.

To discriminate the generated peeled RGB maps, we concatenate them with the available image evidence, i.e., the input RGB image, making a 12-channel fake input, while the ground-truth RGB maps form the real input. For the peeled depth maps, we concatenate the input image with the 4 generated peeled depth maps, making a 7-channel fake input, while the ground-truth depth maps form the real input.

We denote our generator as G, the discriminator for the peeled RGB maps as D_rgb, and the discriminator for the peeled depth maps as D_depth. The ground-truth RGB and depth maps are denoted by R_i and D_i, respectively.
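For illustration, a minimal PyTorch sketch of a shared-encoder, two-branch generator of the kind described above; the layer widths, number of residual blocks, normalization choice and output channel counts (for n = 4 peels) are assumptions for exposition, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))
    def forward(self, x):
        return torch.relu(x + self.body(x))

def decoder(ch, out_ch):
    # Two upsampling steps mirror the two downsampling steps in the encoder.
    return nn.Sequential(
        nn.ConvTranspose2d(ch, ch // 2, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        nn.ConvTranspose2d(ch // 2, ch // 4, 4, stride=2, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(ch // 4, out_ch, 3, padding=1))

class PeelNetGenerator(nn.Module):
    """Shared encoder + residual blocks, then two task-specific decoder branches."""
    def __init__(self, n_peels=4, ch=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, ch // 4, 7, stride=1, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 4, ch // 2, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // 2, ch, 4, stride=2, padding=1), nn.ReLU(inplace=True))
        self.res_blocks = nn.Sequential(*[ResBlock(ch) for _ in range(6)])
        self.rgb_branch = decoder(ch, 3 * n_peels)      # peeled RGB maps
        self.depth_branch = decoder(ch, n_peels)        # peeled depth maps
    def forward(self, img):
        feat = self.res_blocks(self.encoder(img))
        return self.rgb_branch(feat), self.depth_branch(feat)
```

A PatchGAN-style discriminator per output modality would then receive the input image concatenated with either the generated RGB or depth maps, as described above.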


Figure 4: Qualitative results on the MonoPerfCap dataset (top row) and the BUFF dataset (middle and bottom rows). For each subject, from left to right: input monocular image; the four peeled RGB and depth maps; the 3D point cloud (colored red, blue, green and yellow according to depth order); and the textured/colored point cloud (two views shown in the last two columns).

Adversarial loss (L_gan): The objective of the generator in the conditional GAN is to produce peeled RGB and depth maps that cannot be classified as fake by the discriminators. Conversely, the discriminators D_rgb and D_depth are trained to reject the fake RGB and depth samples produced by the generator G. Hence, both D_rgb and D_depth are trained to predict 1 when provided with ground-truth data and 0 when provided with fake data. This can be written as:

(5)

RGB loss (L_rgb): The generator minimizes the L1 loss between the ground-truth peeled RGB maps R_i and the generated peeled RGB maps R̂_i, where n is the number of peeled layers in the 3D representation. Note that the gradient propagating through the branch generating these maps is a sum of only the RGB loss and the adversarial loss.

L_rgb = Σ_{i=1..n} ‖ R_i − R̂_i ‖_1        (6)

Occlusion-aware depth loss (L_depth): We minimize a masked L1 loss over the ground-truth and generated peeled depth maps, as an L2 loss is known to produce blurry artifacts:

(7)

where the mask weight for occluded pixels is larger than that for non-occluded pixels, emphasizing regions that suffer from self-occlusion. We consider pixels that have non-zero depth values in the third and fourth peel maps to be occluded pixels.
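A minimal PyTorch sketch of such a masked L1 depth loss; the particular weight given to occluded pixels (w_occ below) is an assumption, since the exact mask values are not specified here:

```python
import torch

def occlusion_aware_depth_loss(d_pred, d_gt, w_occ=5.0):
    """d_pred, d_gt: (B, 4, H, W) peeled depth maps.
    Pixels with non-zero ground-truth depth in the 3rd/4th peel maps are occluded."""
    occluded = (d_gt[:, 2:] > 0).any(dim=1, keepdim=True)          # (B, 1, H, W)
    weight = torch.where(occluded, torch.full_like(d_gt, w_occ), torch.ones_like(d_gt))
    return (weight * (d_pred - d_gt).abs()).mean()
```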
Chamfer Loss (L_cham): Each peeled depth map represents a partial 3D shape. To enable the network to capture the underlying 3D structure of the predicted depth maps, we minimize the Chamfer distance between the reconstructed point cloud P_rc and the ground-truth point cloud P_gt:

L_cham = Σ_{p ∈ P_rc} min_{q ∈ P_gt} ‖p − q‖² + Σ_{q ∈ P_gt} min_{p ∈ P_rc} ‖p − q‖²        (8)
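As a sketch, a brute-force symmetric Chamfer term in PyTorch over the back-projected point clouds; torch.cdist keeps it short but is memory-hungry, and the paper's actual implementation is not specified here:

```python
import torch

def chamfer_distance(p_rc, p_gt):
    """p_rc: (N, 3) reconstructed points, p_gt: (M, 3) ground-truth points."""
    d = torch.cdist(p_rc, p_gt)                 # (N, M) pairwise Euclidean distances
    # nearest-neighbour term in each direction, averaged over points
    return d.min(dim=1).values.pow(2).mean() + d.min(dim=0).values.pow(2).mean()
```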

Depth Consistency Loss (L_dc): Motivated by [18], we additionally penalize the network when the depth ordering is not maintained, i.e., when D_i > D_{i+1}, since the i-th peeled layer is closer to the camera than the (i+1)-th. We enforce a constant penalty of 1 whenever this criterion is violated:

(9)
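A short PyTorch sketch of this ordering penalty; restricting the check to pixels where the next ground-truth peel exists, and the differentiable hinge variant, are our assumptions rather than the paper's exact formulation:

```python
import torch

def depth_consistency_loss(d_pred, d_gt):
    """d_pred, d_gt: (B, 4, H, W) peeled depth maps ordered front-to-back.
    Literal reading: a penalty of 1 wherever the ordering D_i <= D_{i+1} is violated."""
    valid = d_gt[:, 1:] > 0                                  # next peel actually exists
    violations = (d_pred[:, :-1] > d_pred[:, 1:]) & valid
    return violations.float().mean()

def depth_consistency_hinge(d_pred):
    # Differentiable alternative one might use in practice (an assumption).
    return torch.relu(d_pred[:, :-1] - d_pred[:, 1:]).mean()
```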

4 Experiments

We implement our proposed framework in PyTorch. We use ResNet-18 as our generator and PatchGAN as our discriminator. We compute the ground-truth peeled maps using trimesh (www.trimesh.org). The resolution of our input and output images is 256×256. We use the Adam optimizer, with the loss weighting hyper-parameters set to 10, 100, 500, 500 and 10, respectively.

4.1 Datasets and Preprocessing

We evaluate our method on two datasets: (i) the BUFF dataset [24] and (ii) MonoPerfCap [23]. The BUFF dataset consists of 5 subjects with 2 clothing styles. We use 3 subjects entirely as training data and 1 subject entirely for testing, and split the frames of the remaining subject equally between the training and test sets. The MonoPerfCap dataset consists of daily human motion sequences in tight and loose clothing styles; we use 2 subjects as testing data and the remaining subjects as training data. For each frame, we scale the subject down to a unit box and capture it from 4 different camera angles, the first being the canonical view. We compute 4 peeled depth and RGB maps for each frame.

4.2 Qualitative results

We evaluate our approach on 3 human actions each from the BUFF and MonoPerfCap datasets, as shown in Figure 4. Our approach is able to accurately recover the 3D human shape from previously unseen views. Even for severely occluded views, the network is able to predict the hidden body parts.

Figure 5: Reconstruction with a noisy depth prior.

Figure 6: Performance of our method on in-the-wild images. (a) Input RGB image, (b) Segmented RGB image, (c) Generated RGB and depth peel maps, (d) Reconstructed point-cloud.

To realistically simulate the output of commercially available RGBD sensors, we add random Gaussian noise to the depth map and train with RGBD input. This increases the robustness of our system: the network learns to ignore the noise and reconstructs the body accurately, as shown in Figure 5. We also show results on an in-the-wild image not present in any dataset. The captured image is segmented using Graphonomy [5], passed through our model, and the resulting reconstruction is shown in Figure 6.
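A minimal sketch of this depth-noise augmentation for RGBD input; the noise scale and the clamping are assumptions, as the exact parameters are not given here:

```python
import torch

def noisy_rgbd_input(rgb, depth, sigma=0.01):
    """rgb: (B, 3, H, W) image, depth: (B, 1, H, W) first-peel depth used as a prior.
    Adds Gaussian noise only where a depth measurement exists."""
    noise = sigma * torch.randn_like(depth)
    noisy_depth = torch.where(depth > 0, (depth + noise).clamp(min=0), depth)
    return torch.cat([rgb, noisy_depth], dim=1)        # 4-channel RGBD input
```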

Figure 7: Reconstruction using 2 depth maps
Figure 8: Comparison with HMR, PIFu and our method for MonoPerfCap (Top row) and BUFF (Middle and bottom row) datasets. We are able to recover surface-level detail unlike in HMR. Our approach has better reconstruction quality compared to PIFu.

4.3 Comparison with other approaches

We compare our approach against both parametric and non-parametric approaches. We test against HMR [10], which fits the SMPL body model, and against PIFu [15], which evaluates occupancy information over points sampled in 3D space. To compare with [4], whose code and data are not available, we train PeelNet under our settings to output two nearly symmetric depth maps, analogous to capturing depth maps from two cameras separated by 180°. For everyday human motions, severe self-occlusions are very frequent when seen from a single view, and this two-map setting produces inaccurate reconstructions, as shown in Figure 7. We compare our results in Figure 8, showing the robustness of our approach to severe self-occlusions. Our network is trained with additional losses that are not considered by current approaches but have a significant effect on reconstruction quality. We quantitatively evaluate our method using Chamfer distance against PIFu [15], BodyNet [19], SiCloPe [13] and VRN [9] in Table 1. Even with a lower input image resolution, our method achieves a significantly lower Chamfer distance than BodyNet, SiCloPe and VRN, and remains competitive with PIFu, which operates at twice the input resolution.

Method Chamfer Distance Image Resolution
BodyNet 4.52 256
SiCloPe 4.02 256
VRN 2.48 256
PIFu 1.14 512
Ours 2.34 256
Table 1: Quantitative evaluation of Chamfer distance using single view RGB image

4.4 Analysis

The inclusion of Chamfer distance as a loss allows the network to infer the 3D structure inherent in the peeled depth maps, and the network is able to accurately predict the presence of occluded body parts. Training the network without the Chamfer loss results in the reconstructions shown in Figure 9: in the majority of cases, the network is unable to hallucinate occluded parts in the third and fourth depth maps, which are hence missing in the figure. The absence of the Chamfer loss also produces significant noise in the reconstruction.

Figure 9: Reconstruction without Chamfer loss. Red points indicate both noise and missed occluded regions.

5 Conclusion

In this paper, we present a novel peeled representation for reconstructing human shape, pose and texture from a single RGB image. This generative formulation allows us to efficiently recover self-occluded parts and the texture present in the image. Our end-to-end framework has low inference time and generates robust 3D reconstructions. The peeled representation, however, suffers from the absence of 3D points on surfaces that are tangential to the camera viewpoint. In future work, we will attempt to address this issue by incorporating a template human mesh to recover these 3D points.

References

  • [1] B. L. Bhatnagar, G. Tiwari, C. Theobalt, and G. Pons-Moll (2019-10) Multi-garment net: learning to dress 3d people from images. In ICCV, Cited by: §1, §2.
  • [2] F. Bogo, J. Romero, G. Pons-Moll, and M. J. Black (2017) Dynamic FAUST: Registering human bodies in motion. In CVPR, Cited by: §2.
  • [3] M. Dou, S. Khamis, Y. Degtyarev, P. Davidson, S. R. Fanello, A. Kowdle, S. O. Escolano, C. Rhemann, D. Kim, J. Taylor, et al. (2016) Fusion4d: real-time performance capture of challenging scenes. ACM Transactions on Graphics (TOG). Cited by: §2.
  • [4] V. Gabeur, J. Franco, X. Martin, C. Schmid, and G. Rogez (2019-10) Moulding humans: non-parametric 3d human shape estimation from single images. In ICCV, Cited by: §1, §4.3.
  • [5] K. Gong, Y. Gao, X. Liang, X. Shen, M. Wang, and L. Lin (2019) Graphonomy: universal human parsing via graph transfer learning. CoRR abs/1904.04536. External Links: Link, 1904.04536 Cited by: §4.2.
  • [6] R. A. Güler, N. Neverova, and I. Kokkinos (2018) DensePose: dense human pose estimation in the wild. In CVPR, Cited by: §1.
  • [7] Z. Huang, T. Li, W. Chen, Y. Zhao, J. Xing, C. LeGendre, L. Luo, C. Ma, and H. Li (2018) Deep volumetric video from very sparse multi-view performance capture. In ECCV, Cited by: §2.
  • [8] P. Isola, J. Zhu, T. Zhou, and A. A. Efros (2017) Image-to-image translation with conditional adversarial networks. In CVPR, Cited by: §3.2.
  • [9] A. S. Jackson, C. Manafas, and G. Tzimiropoulos (2018) 3D human body reconstruction from a single image via volumetric regression. In ECCV, Cited by: §4.3.
  • [10] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik (2017) End-to-end recovery of human shape and pose. CoRR abs/1712.06584. External Links: Link, 1712.06584 Cited by: §1, §2, §4.3.
  • [11] M. Loper, N. Mahmood, J. Romero, G. Pons-Moll, and M. J. Black (2015-10) SMPL: a skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia). Cited by: §1, §2.
  • [12] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger (2019) Occupancy networks: learning 3d reconstruction in function space. In CVPR, Cited by: §2.
  • [13] R. Natsume, S. Saito, Z. Huang, W. Chen, C. Ma, H. Li, and S. Morishima (2019) Siclope: silhouette-based clothed people. In CVPR, Cited by: §2, §4.3.
  • [14] C. R. Qi, L. Yi, H. Su, and L. J. Guibas (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In NIPS, Cited by: §2.
  • [15] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li (2019) PIFu: pixel-aligned implicit function for high-resolution clothed human digitization. arXiv preprint arXiv:1905.05172. Cited by: §1, §2, §4.3.
  • [16] J. Shade, S. Gortler, L. He, and R. Szeliski (1998) Layered depth images. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’98, New York, NY, USA, pp. 231–242. External Links: ISBN 0897919998, Link, Document Cited by: §2.
  • [17] D. Shin, Z. Ren, E. B. Sudderth, and C. C. Fowlkes (2019) Multi-layer depth and epipolar feature transformers for 3d scene reconstruction. In CVPR Workshops, Cited by: §2.
  • [18] S. Tulsiani, R. Tucker, and N. Snavely (2018) Layer-structured 3d scene inference via view synthesis. In ECCV, Cited by: §2, §3.2.
  • [19] G. Varol, D. Ceylan, B. C. Russell, J. Yang, E. Yumer, I. Laptev, and C. Schmid (2018) BodyNet: volumetric inference of 3d human body shapes. CoRR abs/1804.04875. External Links: Link, 1804.04875 Cited by: §1, §2, §4.3.
  • [20] G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev, and C. Schmid (2017) Learning from synthetic humans. In CVPR, Cited by: §2.
  • [21] A. Venkat, S. S. Jinka, and A. Sharma (2018) Deep textured 3d reconstruction of human bodies. CoRR abs/1809.06547. External Links: Link, 1809.06547 Cited by: §1, §2.
  • [22] A. Venkat, C. Patel, Y. Agrawal, and A. Sharma (2019) HumanMeshNet: polygonal mesh recovery of humans. In ICCV Workshops, Cited by: §2.
  • [23] W. Xu, A. Chatterjee, M. Zollhöfer, H. Rhodin, D. Mehta, H. Seidel, and C. Theobalt (2018-05) MonoPerfCap: human performance capture from monocular video. ACM Trans. Graph. 37 (2), pp. 27:1–27:15. External Links: ISSN 0730-0301, Link, Document Cited by: §1, §4.1.
  • [24] C. Zhang, S. Pujades, M. J. Black, and G. Pons-Moll (2017) Detailed, accurate, human shape estimation from clothed 3d scan sequences. In CVPR, Cited by: §1, §4.1.
  • [25] H. Zhu, X. Zuo, S. Wang, X. Cao, and R. Yang (2019) Detailed human shape estimation from a single image by hierarchical mesh deformation. CoRR abs/1904.10506. External Links: Link, 1904.10506 Cited by: §2.