Reconstruction of a textured 3D model of the human body from images is a pivotal problem in computer vision and graphics. It has widespread applications ranging from entertainment and e-commerce to health care and AR/VR platforms. Reconstruction of human body models from a monocular image is an ill-posed problem, and traditionally multi-view images captured by a calibrated setup were needed. However, recent advances in deep learning have enabled the reconstruction of plausible textured 3D body models from a single monocular image (the in-the-wild setup).
This problem is particularly challenging as the object geometry of non-rigid human shapes varies over time, yielding a large space of complex articulated body poses as well as shape variations. In addition to this, there are several other challenges such as self-occlusions, obstructions due to free form clothing and large viewpoint variations, especially while attempting to reconstruct from a monocular image.
Existing deep learning solutions for monocular 3D human reconstruction can be broadly categorized into two classes. The first class of model-based approaches use a parametric body representation, like the SMPL , to recover the 3D surface model from a single image (e.g., ). Such model-based methods efficiently capture the pose of the underlying naked body (and approximate shape, sans texture) but fail to reconstruct fine surface details of the body and wrapped clothing. A recent effort in  has tried to model clothing by estimating the vertex displacement field w.r.t. the SMPL template in a multi-image setup but cannot handle very loose clothing. Another approach by  predicts a UV map for every foreground pixel which can be used in texture generation over an SMPL model. However, clothing details are not captured in this method.
The second class of model-free approaches does not assume any parametric model of the body. One set of model-free approaches performs image-to-volume regression for human body recovery  which is a natural extension of 2D convolutions. A parallel work in  proposed to obtain textured body models by performing volumetric regression for 3D reconstruction along with a generative model that predicts multiple orthographic RGB maps, which are then projected onto the regressed voxel grid. However, volumetric regression is known to be memory intensive as it involves redundant 3D convolutions on empty voxels. Recently, deep networks for learning implicit functions have been used to recover 3D human models under loose clothing . However, their inference time is higher, as one needs to exhaustively sample points in the 3D volume and evaluate the binary occupancy of each point. Very recently,  modeled 3D humans as a pixel-wise regression of two independent depth maps, which is similar to capturing depth maps with two virtual RGBD cameras separated by 180°. However, this formulation is limited to predicting the front and back depth maps of the body completely independently and does not involve any joint loss that imposes 3D body structure on the regression. Additionally, because it generates just two depth maps, it fails to handle large self-occlusions in human body models, especially in the presence of large variations in camera view (e.g., images capturing side views of the human body). To summarize, model-based methods cannot handle fine surface details, and model-free approaches are either computationally intensive or do not handle large self-occlusions.
In this paper, we tackle the problem of reconstructing the 3D human body model along with texture from a single RGB image using a generative model. Our approach is motivated by the idea of ray tracing in computer graphics, which produces realistic 2D rendered images of 3D objects/scenes. Each pixel in the 2D image corresponds to the projection of a real-world point in 3D space where the ray intersects the object surface (mesh facet) for the first time. We pose the problem of 3D reconstruction from a monocular image as an inverse ray tracing problem, where the 3D surface geometry can be recovered by estimating the depths at which a set of rays emanating from the virtual camera center (passing through individual pixels in the image) intersect the 3D body surface. This can be interpreted as depth peeling of the body surface model, yielding a multi-layered representation hereinafter called Peeled Depth maps. Such a layered representation of depth makes it possible to address severe self-occlusion in complex body poses, as we can now recover 3D points from multiple surfaces that project to the same pixel during imaging (see Figure 3).
Additionally, this layered representation is extended to obtain Peeled RGB maps, which capture a discrete sampling of the continuous surface texture. Unlike RGBD maps (of commercial depth sensors) that store the depth and color of the 3D point closest to each pixel, our peeled representation encodes multiple depths and their corresponding RGB values. Thus, we reformulate monocular 3D textured body model reconstruction as the task of predicting/generating the respective peeled maps for both RGB and depth. To achieve this dual prediction task, we propose a multi-task generative adversarial network that generates a set of depth and RGB maps in two different branches of the network, as shown in Figure 1. Instead of using only a norm-based loss on each depth map independently, we propose to add a Chamfer loss over the reconstructed point cloud (in the image/camera-centric coordinate system) obtained from all four depth maps. This enables the network to implicitly impose a 3D body shape regularization while generating peeled depth maps. Additionally, we use a depth consistency loss to enforce the strict ordering constraint (imposed by the inverse ray-tracing formulation), which further regularizes the solution. An occlusion-aware depth loss further emphasizes pixels that suffer from self-occlusion. Thus, our network is able to hallucinate plausible parts of the body even if the corresponding part is self-occluded in the image. We evaluate our approach on the publicly available BUFF  and MonoPerfCap  datasets and report superior quantitative and qualitative results compared with other state-of-the-art methods.
To summarize our contributions in this paper:
We present a novel approach to represent a 3D human body in terms of peeled depth and RGB maps which is robust to severe self-occlusions.
We propose a complete end-to-end pipeline to reconstruct a 3D human body with texture given a single RGB image using an adversarial approach.
We note that our peeled depth and texture representation computation is efficient in terms of network size and feed-forward time yielding a high rate and quality of reconstructions.
2 Related Work
3D human body reconstruction is broadly categorized into parametric and non-parametric methods. Traditionally, voxel carving and triangulation methods were employed for recovering a 3D human body from calibrated multi-camera setups . Another approach proposed in  attempts to solve real-time performance capture of challenging scenes. However, they use RGBD images from multiple cameras as input.
Most existing deep learning methods that recover 3D shape from an RGB image build on the parametric SMPL  model, as in  where a re-projection loss is minimized for 2D keypoints. A discriminator is trained for each joint of the parametric model, making the model computationally expensive during training. Segmentation masks  are used to further improve the fitting of the 3D model to the available 2D image evidence. However, these parametric body estimation methods yield a smooth naked mesh that misses high-frequency surface details. Some works use a voxel grid, i.e., a binary occupancy map, and employ volumetric regression to recover human bodies from a single RGB image [21, 19, 7]. These works propose a volumetric 3D loss and a multi-view reprojection loss, which are computationally expensive as well. The volumetric representation poses a serious computational disadvantage due to the sparsity of the voxel grid, and surface quality is limited by the voxel grid resolution. Deformation-based approaches have been proposed over parametric models to incorporate these details to an extent. The constraints from body joints, silhouettes, and per-pixel shading information are utilized in [25, 21] to produce per-vertex displacements away from the SMPL model. The authors of  estimate vertex displacements by regressing to SMPL vertices. Additionally, researchers have explored parametric clothing on top of a parametric human body model to incorporate tight clothing details over the SMPL model . This method, however, predicts the mesh using eight input RGB images. These techniques fail for complex clothing topologies such as skirts and dresses.
Few approaches directly regress the point cloud , and these essentially work only for rigid objects. To address these issues in the reconstruction of 3D human bodies, interest has recently grown in non-parametric approaches. Deep generative models have been proposed  that take inspiration from the visual hull algorithm to synthesize 2D silhouettes that are back-projected from inferred 3D joints. Later, the silhouettes are back-projected to obtain variations of clothed models and shape complexity. Implicit representations of 3D objects have been employed in deep learning based approaches [12, 15], which represent the 3D surface as a continuous decision boundary of a deep neural network classifier. One drawback of this representation is its slow inference. Moreover, self-occlusions are not handled by these approaches as they do not impose a human body shape prior. Multi-layer approaches have been used for 3D scene understanding. Layered Depth Images were proposed in  for efficient rendering applications. View synthesis as a proxy task has been used in . Recently, transformer networks were proposed to transfer features to a novel view in .
3 Problem Formulation
We represent a textured 3D body model as sets of peeled depth maps ($D$) and peeled RGB maps ($R$). Specifically, given an input RGB image $I$ with dimensions $H \times W \times 3$ from an arbitrary view, we aim to generate peeled depth maps $D = \{d_1, \dots, d_n\}$, where $d_i$ is the depth map of the $i$-th peeled layer, and peeled RGB maps $R = \{r_1, \dots, r_n\}$, where $r_i$ is the peeled RGB map corresponding to $d_i$.
3.1 Peeled Representation:
Given a camera and a 3D human body in a scene, we shoot rays originating from the virtual camera into the 3D world. We treat the human body as a non-convex object and record observations for the first hit of each ray. The depth map $d_1$ and RGB map $r_1$ recorded at this first hit capture the visible surface details that are closest to the camera. Here, $d_1$ and $r_1$ correspond to standard RGBD data typically obtained from commercially available Kinect-like sensors. We then peel away the occlusion and extend the rays beyond the first bounce to hit the next intersecting surface. The corresponding depth and RGB values are iteratively recorded in the next layer as $d_{i+1}$ and $r_{i+1}$ respectively. In this work, we consider four intersections of each ray, i.e., four peeled layers, to reconstruct a human body for most everyday actions.
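The peeling procedure can be illustrated on a toy scene. The sketch below is not the trimesh-based pipeline used for the ground truth; it assumes a scene made of non-overlapping spheres so that every intersection can be computed in closed form, and the function name `peel_spheres` and the convention of storing 0 for a missing layer are illustrative assumptions.

```python
import numpy as np

def peel_spheres(origin, direction, spheres, n_layers=4):
    """Return the first n_layers hit distances of one ray against spheres.

    spheres is a list of (center, radius) pairs, assumed non-overlapping.
    Layers with no intersection are filled with 0 (assumed "no hit" value).
    """
    o = np.asarray(origin, dtype=float)
    d = np.asarray(direction, dtype=float)
    d = d / np.linalg.norm(d)  # unit direction, so t is a distance
    hits = []
    for center, radius in spheres:
        oc = o - np.asarray(center, dtype=float)
        b = np.dot(oc, d)
        c = np.dot(oc, oc) - radius ** 2
        disc = b * b - c
        if disc <= 0.0:  # ray misses (or only grazes) this sphere
            continue
        root = np.sqrt(disc)
        for t in (-b - root, -b + root):  # entry and exit points
            if t > 0.0:  # keep only hits in front of the camera
                hits.append(t)
    hits = sorted(hits)[:n_layers]  # nearest surface first
    return np.array(hits + [0.0] * (n_layers - len(hits)))

# Two unit spheres behind each other along the optical axis.
scene = [((0.0, 0.0, 3.0), 1.0), ((0.0, 0.0, 6.0), 1.0)]
layers = peel_spheres((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), scene)
```

For the ray along the optical axis, the four layers are the front and back of the first sphere followed by the front and back of the second, which mirrors how a limb in front of the torso produces four intersections.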
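Our aim is to predict the closest plausible shape that is consistent with the available image evidence. If the camera intrinsics, i.e., the focal lengths $f_x$ and $f_y$ along the $x$ and $y$ axes respectively and the principal point $(c_x, c_y)$, are known, then the direction of the ray $\mathbf{r}$ in the camera coordinate system corresponding to pixel $(x, y)$ is given by:
$$\mathbf{r} = \left[ \frac{x - c_x}{f_x},\; \frac{y - c_y}{f_y},\; 1 \right]^{\top}$$
The ray is extended beyond the first hit. Conversely, if RGBD data is available and the camera intrinsics are provided, we can locate the 3D point corresponding to each pixel. For a pixel $(x, y)$ with depth $d$, its location $(X, Y, Z)$ in the camera coordinate system is given by:
$$X = \frac{(x - c_x)\, d}{f_x}, \qquad Y = \frac{(y - c_y)\, d}{f_y}, \qquad Z = d$$
where $c_x = W/2$ and $c_y = H/2$, given the height $H$ and width $W$ of an image $I$. Here, we assume $(c_x, c_y)$ is at the center of the image.
The point $(X, Y, Z)$ is coloured with the RGB values at pixel $(x, y)$.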
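A minimal sketch of this back-projection for one depth layer is given below. It assumes a pinhole camera with the principal point at the image centre, treats zero depth as "no surface", and uses the hypothetical name `backproject`; none of these details are confirmed by the text beyond the centred principal point.

```python
import numpy as np

def backproject(depth, rgb, fx, fy):
    """Lift a depth map (H, W) and an RGB map (H, W, 3) to coloured 3D points."""
    h, w = depth.shape
    cx, cy = w / 2.0, h / 2.0  # principal point assumed at the image centre
    ys, xs = np.nonzero(depth > 0)  # zero depth marks "no surface" (assumption)
    z = depth[ys, xs]
    x = (xs - cx) * z / fx
    y = (ys - cy) * z / fy
    points = np.stack([x, y, z], axis=1)  # (N, 3) camera-space coordinates
    colours = rgb[ys, xs]                 # (N, 3) colour of each point
    return points, colours

# One valid pixel in a 4x4 image, with hypothetical focal lengths of 2 pixels.
depth = np.zeros((4, 4))
depth[2, 3] = 2.0
rgb = np.zeros((4, 4, 3), dtype=np.uint8)
rgb[2, 3] = (255, 0, 0)
points, colours = backproject(depth, rgb, fx=2.0, fy=2.0)
```

Applying the same function to every peeled layer and concatenating the results gives a single coloured point cloud for the whole body.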
Relation to Implicit Representation: The peeled representation of the 3D body can be considered as an inverse problem of an implicit function where and . The underlying surface is implicitly obtained by an isosurface of . For a , we estimate the space in 3D volume where the body is present. It can be written in relation to implicit function as:
In this section, we introduce the proposed PeelNet, an end-to-end framework for 3D reconstruction of the human body. Figure 2 gives an overview of our pipeline. The input to our network is a single RGB image taken from an arbitrary viewpoint, and the outputs are the peeled depth and RGB maps explained in Section 3. For this task, we propose to use a conditional Generative Adversarial Network and formulate the problem of 3D reconstruction as generating sequences of peeled RGB and depth maps.
We build upon the generator and discriminator architectures proposed in . The input RGB image is fed to a series of residual blocks after encoding. The convolutional layers use normalization layers and ReLU as the activation function. Even though the peeled RGB and depth maps are spatially aligned, they are sampled from different distributions. Hence, we propose to decode these maps in two different branches, as shown in Figure 2. The network finally produces peeled RGB maps and peeled depth maps, which are back-projected to the 3D volume to reconstruct the 3D human body. We also use two discriminators, one each for the RGB and depth maps. We use a Markovian discriminator as proposed in , which enforces correctness at high frequencies by classifying local image patches. We train our network with the following loss function:
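$$\mathcal{L} = \mathcal{L}_{gan} + \lambda_{occ}\,\mathcal{L}_{occ} + \lambda_{rgb}\,\mathcal{L}_{rgb} + \lambda_{cham}\,\mathcal{L}_{cham} + \lambda_{dCon}\,\mathcal{L}_{dCon}$$
where $\lambda_{occ}$, $\lambda_{rgb}$, $\lambda_{cham}$ and $\lambda_{dCon}$ are the weights for the occlusion-aware depth loss, RGB loss, Chamfer loss and depth consistency loss respectively. Each loss term is explained in detail below.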
To discriminate the generated peeled RGB maps, we concatenate them with the available image evidence, i.e., the input RGB image, making a 12-channel fake input, while the ground-truth RGB maps form the real input. For the peeled depth maps, we concatenate the image evidence, i.e., the input RGB image, with the 4 generated peeled depth maps, making a 7-channel fake input, while the ground-truth depth maps form the real input.
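The channel counts can be checked with a small shape-only sketch. It assumes channel-first arrays and that three RGB layers are generated, which is one reading consistent with the stated 12 channels (3 input channels plus 3 layers of 3 channels); the 8x8 resolution is purely illustrative.

```python
import numpy as np

H, W = 8, 8  # tiny illustrative resolution, not the one used in the paper

image = np.zeros((3, H, W))          # input RGB image (image evidence)
fake_rgb = np.zeros((3 * 3, H, W))   # three generated RGB layers (inferred count)
fake_depth = np.zeros((4, H, W))     # four generated depth layers

# Conditional discriminator inputs: image evidence stacked with generated maps.
rgb_disc_input = np.concatenate([image, fake_rgb], axis=0)
depth_disc_input = np.concatenate([image, fake_depth], axis=0)
```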
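We denote our generator as $G$, the discriminator for peeled RGB maps as $\mathcal{D}_r$, and the discriminator for peeled depth maps as $\mathcal{D}_d$.
The ground-truth RGB and depth maps are denoted by $R^{gt}$ and $D^{gt}$ respectively.
Adversarial loss ($\mathcal{L}_{gan}$): The objective of the generator $G$ in the conditional GAN is to produce peeled RGB and depth maps that cannot be classified as fake by the discriminators. On the other hand, the discriminators $\mathcal{D}_r$ and $\mathcal{D}_d$ are trained to reject fake RGB and depth samples generated by $G$. Hence, both $\mathcal{D}_r$ and $\mathcal{D}_d$ are trained to predict 1 when provided with ground-truth data and 0 when fake data is provided as input. It can be written as:
$$\mathcal{L}_{gan} = \mathbb{E}\left[\log \mathcal{D}_r(I, R^{gt})\right] + \mathbb{E}\left[\log\left(1 - \mathcal{D}_r(I, G_r(I))\right)\right] + \mathbb{E}\left[\log \mathcal{D}_d(I, D^{gt})\right] + \mathbb{E}\left[\log\left(1 - \mathcal{D}_d(I, G_d(I))\right)\right]$$
where $I$ is the input image and $G_r(I)$ and $G_d(I)$ denote the peeled RGB and depth maps generated by the two branches of $G$.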
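As a rough illustration of the real/fake labelling described here, the sketch below computes binary cross-entropy terms for one discriminator from its output scores. It assumes scores in (0, 1), averages over all patch scores, and uses the common non-saturating generator objective; these choices and all function names are assumptions, not details taken from the paper.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between scores in (0, 1) and 0/1 targets."""
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(target * np.log(pred)
                          + (1.0 - target) * np.log(1.0 - pred)))

def discriminator_loss(real_scores, fake_scores):
    """Push scores on ground-truth input to 1 and on generated input to 0."""
    return (bce(real_scores, np.ones_like(real_scores))
            + bce(fake_scores, np.zeros_like(fake_scores)))

def generator_adv_loss(fake_scores):
    """Non-saturating generator term: make generated input score as real."""
    return bce(fake_scores, np.ones_like(fake_scores))

d_loss = discriminator_loss(np.array([0.9]), np.array([0.1]))
g_loss = generator_adv_loss(np.array([0.1]))
```

With two discriminators, the same two functions would be evaluated once for the RGB branch and once for the depth branch and the results summed.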
RGB loss ($\mathcal{L}_{rgb}$): The generator minimizes the loss between the ground-truth and generated peeled RGB maps, accumulated over the $n$ layers of the 3D representation. Note that the gradient propagating through the branch generating these maps is a summation of only the RGB loss and the adversarial loss.
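Occlusion-aware depth loss ($\mathcal{L}_{occ}$): We minimize a masked $L_1$ loss between the ground-truth and generated peeled depth maps, as the $L_2$ loss is known to produce blurry artifacts.
$$\mathcal{L}_{occ} = \sum_{i=1}^{n} \left\| m \odot \left( d_i^{gt} - d_i \right) \right\|_1$$
where $d_i^{gt}$ and $d_i$ are the ground-truth and generated depth maps of layer $i$, $n$ is the number of layers, and $m$ is a per-pixel weight mask that takes a lower value for non-occluded pixels and a higher value for occluded pixels. We consider the pixels that have non-zero depth values in the third and fourth peeled maps as occluded pixels.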
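A minimal sketch of such a masked loss follows. The two weights (1 for visible pixels, 2 for occluded pixels) are hypothetical placeholders since the actual values are not recoverable here, the occlusion test treats a hit in either the third or the fourth layer as occluded, and the loss is averaged rather than summed; all three are assumptions.

```python
import numpy as np

def occlusion_aware_l1(pred, gt, w_visible=1.0, w_occluded=2.0):
    """Weighted L1 over peeled depth maps of shape (4, H, W).

    w_visible and w_occluded are hypothetical weights, not the paper's values.
    """
    occluded = (gt[2] > 0) | (gt[3] > 0)  # hit recorded in 3rd or 4th layer
    weights = np.where(occluded, w_occluded, w_visible)  # (H, W) mask
    return float(np.mean(weights[None] * np.abs(pred - gt)))

# Two pixels: the first has two surfaces, the second has four (self-occluded).
gt = np.array([[[1.0, 1.0]], [[2.0, 2.0]], [[0.0, 3.0]], [[0.0, 4.0]]])
pred = gt + 0.5
weighted = occlusion_aware_l1(pred, gt)
uniform = occlusion_aware_l1(pred, gt, w_visible=1.0, w_occluded=1.0)
```

With the same per-pixel error everywhere, the weighted loss is larger than the uniform one, which is the intended emphasis on self-occluded pixels.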
Chamfer Loss ($\mathcal{L}_{cham}$): Each peeled depth map represents a partial 3D shape. To enable the network to capture the underlying 3D structure of the computed depth maps, we minimize the Chamfer distance between the reconstructed (rc) and ground-truth (gt) point clouds.
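For reference, a brute-force version of a symmetric Chamfer distance is sketched below. It assumes the squared-Euclidean, mean-reduced variant, which is one common definition and may differ from the exact form used in training; the dense pairwise matrix is only practical for small point sets.

```python
import numpy as np

def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    diff = a[:, None, :] - b[None, :, :]
    sq = np.sum(diff ** 2, axis=-1)  # (N, M) squared pairwise distances
    # Nearest neighbour in b for each point of a, and vice versa.
    return float(sq.min(axis=1).mean() + sq.min(axis=0).mean())

rc = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])  # reconstructed points
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]])  # ground-truth points
cd = chamfer_distance(rc, gt)
```

Because the distance is computed on the point cloud assembled from all depth layers, an error in one layer is penalized relative to the whole 3D shape rather than per map.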
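Depth Consistency Loss ($\mathcal{L}_{dCon}$): Motivated by , we additionally penalize the network when the depth hierarchy is not maintained, i.e., when the depth in layer $i$ exceeds the depth in layer $i+1$ at the same pixel, as layer $i$ is closer to the camera. We enforce a constant penalty of 1 when this criterion is violated.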
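The ordering check can be sketched as a simple count of violations. The sketch assumes that zero depth means "no surface" and therefore skips pairs where either layer is empty, and it sums the unit penalties over pixels; both are assumptions. As written it is a diagnostic count rather than a differentiable training loss, and how the penalty is made trainable is not stated in the text.

```python
import numpy as np

def depth_consistency_penalty(depths):
    """Count ordering violations in peeled depth maps of shape (n, H, W)."""
    nearer, farther = depths[:-1], depths[1:]      # consecutive layer pairs
    both_valid = (nearer > 0) & (farther > 0)      # ignore empty layers
    violated = both_valid & (nearer > farther)     # layer i behind layer i+1
    return float(violated.sum())                   # penalty of 1 per violation

# Pixel 0 is correctly ordered; pixel 1 swaps its second and third layers.
depths = np.array([[[1.0, 1.0]], [[2.0, 3.0]], [[3.0, 2.0]], [[4.0, 0.0]]])
penalty = depth_consistency_penalty(depths)
```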
We implement our proposed framework in PyTorch. We use ResNet-18 as our generator and PatchGAN as our discriminator. We capture the ground-truth peeled maps using trimesh (www.trimesh.org). The resolution of our input and output images are . We use Adam optimizer with learning rate - and , , , and 10,100,500,500,10 respectively. We trained our model for epochs.
4.1 Datasets and Preprocessing
We evaluate our method on two datasets (i) BUFF dataset  (ii) MonoPerfCap . First, the BUFF dataset consists of 5 subjects with 2 clothing styles. We use 3 subjects completely as training data and 1 subject completely for testing data. We split the frames of one subject equally between train and test data. Second, the MonoPerfCap dataset consists of daily human motion sequences in tight and loose clothing styles. We use 2 subjects as testing data and remaining subjects as training data. For each frame, we capture the subject after scaling it down to unit box from 4 different camera angles : (canonical view), , , . We compute 4 peeled depth and texture maps for each frame.
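The scaling step can be sketched as follows. The text does not define the unit box precisely, so this assumes a uniform scale that makes the longest side of the axis-aligned bounding box equal to 1 and centres the box at the origin; the function name is illustrative.

```python
import numpy as np

def scale_to_unit_box(vertices):
    """Centre vertices (N, 3) and scale uniformly so the longest side is 1."""
    lo, hi = vertices.min(axis=0), vertices.max(axis=0)
    centre = (lo + hi) / 2.0
    extent = (hi - lo).max()  # longest side of the bounding box
    return (vertices - centre) / extent

verts = np.array([[0.0, 0.0, 0.0], [2.0, 4.0, 1.0]])
unit = scale_to_unit_box(verts)
```

A uniform scale keeps body proportions intact, which matters because the peeled depth values are later compared across subjects in the same camera frame.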
4.2 Qualitative results
We evaluate our approach on 3 human actions each from the BUFF and MonoPerfCap datasets as shown in Figure 4. Our approach is able to accurately recover the 3D human shape from previously unseen views.
Even for severely occluded views, the network is able to predict the hidden body parts.
To realistically simulate the output of commercially available RGBD sensors, we introduce random Gaussian noise into the depth map and train with RGBD as input. This helps to increase the robustness of our system. The network is able to ignore the noise and reconstruct the body, as shown in Figure 5. We also show our results on an in-the-wild image not present in any dataset. The captured image is segmented using Graphonomy  and then evaluated with our model; the reconstruction is shown in Figure 6.
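The noise augmentation can be sketched in a few lines. The standard deviation of 0.01 is a hypothetical value, and leaving background pixels (zero depth) untouched is an assumption; neither is specified in the text.

```python
import numpy as np

def add_depth_noise(depth, sigma=0.01, seed=0):
    """Add zero-mean Gaussian noise to valid (non-zero) depth pixels."""
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=depth.shape)
    return np.where(depth > 0, depth + noise, depth)  # background stays 0

clean = np.array([[0.0, 1.0], [2.0, 0.0]])
noisy = add_depth_noise(clean)
```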
4.3 Comparison with other approaches
We compare our approach against parametric and non-parametric approaches. We test our approach against HMR , which fits the SMPL body model. We also test our method against PIFu , which evaluates occupancy information over the voxel space. To test our method against , we train PeelNet with our specifications, as the code and data are not available. The output of this network is two nearly symmetric depth maps, which is analogous to capturing two depth maps from cameras separated by 180°. For everyday human motions, severe self-occlusions are very frequent when seen from a single view, producing inaccurate reconstructions as shown in Figure 7. We compare our results in Figure 8, showing the robustness of our approach to severe self-occlusions. Our network is trained with additional losses that are not considered in current approaches but have a significant effect on reconstruction quality. We quantitatively evaluate our method using the Chamfer distance against PIFu , BodyNet , SiCloPe and VRN  in Table 1. Even with a lower input image resolution, our method achieves significantly lower Chamfer distance scores.
|Method||Chamfer Distance||Image Resolution|
The inclusion of the Chamfer distance as a loss term allows the network to infer the 3D structure inherent in the peeled depth maps. Our network is able to accurately predict the presence of occluded body parts. Training the network without the Chamfer loss results in the reconstruction shown in Figure 9. In the majority of cases, the network is not able to hallucinate the occluded parts in the third and fourth depth maps, and these parts are hence missing in the figure. The absence of the Chamfer loss also produces significant noise in the reconstruction.
In this paper, we present a novel peeled representation to reconstruct human shape, pose and texture from a single RGB image. This generative formulation allows us to efficiently recover self-occluded parts and texture present in the image. Our end-to-end framework has low inference time and generates robust 3D reconstructions. The peeled representation, however, suffers from the absence of 3D points that are tangential to the viewpoint of the image evidence. In future work, we will attempt to solve this issue by incorporating a template human mesh to recover these 3D points.
- [1] (2019-10) Multi-garment net: learning to dress 3d people from images. In ICCV.
- [2] (2017) Dynamic FAUST: Registering human bodies in motion. In CVPR.
- [3] (2016) Fusion4d: real-time performance capture of challenging scenes. ACM Transactions on Graphics (TOG).
- [4] (2019-10) Moulding humans: non-parametric 3d human shape estimation from single images. In ICCV.
- [5] Graphonomy: universal human parsing via graph transfer learning. CoRR.
- [6] Densepose: dense human pose estimation in the wild. In CVPR.
- [7] (2018) Deep volumetric video from very sparse multi-view performance capture. In ECCV.
- [8] (2017) Image-to-image translation with conditional adversarial networks. In CVPR.
- [9] (2018) 3D human body reconstruction from a single image via volumetric regression. In ECCV.
- [10] (2017) End-to-end recovery of human shape and pose. CoRR abs/1712.06584.
- [11] (2015-10) SMPL: a skinned multi-person linear model. ACM Trans. Graphics (Proc. SIGGRAPH Asia).
- [12] (2019) Occupancy networks: learning 3d reconstruction in function space. In CVPR.
- [13] (2019) Siclope: silhouette-based clothed people. In CVPR.
- [14] (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In NIPS.
- [15] (2019) PIFu: pixel-aligned implicit function for high-resolution clothed human digitization. arXiv preprint arXiv:1905.05172.
- [16] (1998) Layered depth images. In Proceedings of the 25th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH ’98, New York, NY, USA, pp. 231–242.
- [17] (2019) Multi-layer depth and epipolar feature transformers for 3d scene reconstruction. In CVPR Workshops.
- [18] (2018) Layer-structured 3d scene inference via view synthesis. In ECCV.
- [19] (2018) BodyNet: volumetric inference of 3d human body shapes. CoRR abs/1804.04875.
- [20] (2017) Learning from synthetic humans. In CVPR.
- [21] (2018) Deep textured 3d reconstruction of human bodies. CoRR abs/1809.06547.
- [22] (2019) HumanMeshNet: polygonal mesh recovery of humans. In ICCV Workshops.
- [23] (2018-05) MonoPerfCap: human performance capture from monocular video. ACM Trans. Graph. 37 (2), pp. 27:1–27:15.
- [24] (2017) Detailed, accurate, human shape estimation from clothed 3d scan sequences. In CVPR.
- [25] (2019) Detailed human shape estimation from a single image by hierarchical mesh deformation. CoRR abs/1904.10506.