This repository contains the code for the paper "PIFu: Pixel-Aligned Implicit Function for High-Resolution Clothed Human Digitization"
We introduce Pixel-aligned Implicit Function (PIFu), a highly effective implicit representation that locally aligns pixels of 2D images with the global context of their corresponding 3D object. Using PIFu, we propose an end-to-end deep learning method for digitizing highly detailed clothed humans that can infer both 3D surface and texture from a single image, and optionally, multiple input images. Highly intricate shapes, such as hairstyles and clothing, as well as their variations and deformations, can be digitized in a unified way. Compared to existing representations used for 3D deep learning, PIFu can produce high-resolution surfaces including largely unseen regions such as the back of a person. In particular, it is memory efficient unlike the voxel representation, can handle arbitrary topology, and the resulting surface is spatially aligned with the input image. Furthermore, while previous techniques are designed to process either a single image or multiple views, PIFu extends naturally to an arbitrary number of views. We demonstrate high-resolution and robust reconstructions on real-world images from the DeepFashion dataset, which contains a variety of challenging clothing types. Our method achieves state-of-the-art performance on a public benchmark and outperforms the prior work for clothed human digitization from a single image.
In an era where immersive technologies and sensor-packed autonomous systems are becoming increasingly prevalent, our ability to create virtual 3D content at scale goes hand-in-hand with our ability to digitize and understand 3D objects in the wild. If digitizing an entire object in 3D were as simple as taking a picture, there would be no need for sophisticated 3D scanning devices, multi-view stereo algorithms, or tedious capture procedures in which a sensor must be moved around.
For certain domain-specific objects, such as faces, human bodies, or known man-made objects, it is already possible to infer relatively accurate 3D surfaces from images with the help of parametric models, data-driven techniques, or deep neural networks. Recent 3D deep learning advances have shown that general shapes can be inferred from very few images and sometimes even a single input. However, the resulting resolution and accuracy are typically limited, due to ineffective model representations, even for domain-specific modeling tasks.
We propose a new Pixel-aligned Implicit Function (PIFu) representation for 3D deep learning and focus on the challenging problem of textured surface inference of clothed 3D humans from a single or multiple input images. While most successful deep learning methods for 2D image processing (e.g., semantic segmentation, 2D joint detection, etc.) take advantage of "fully-convolutional" network architectures that preserve the spatial alignment between the image and the output, this is particularly challenging in the 3D domain. While voxel representations can be applied in a fully-convolutional manner, the memory-intensive nature of the representation inherently restricts its ability to produce fine-scale detailed surfaces. Inference techniques based on global representations [19, 30, 44] are more memory efficient, but cannot guarantee that details of input images are preserved. Similarly, methods based on implicit functions [11, 44, 38] rely on the global context of the image to infer the overall shape, which may not align with the input image accurately. On the other hand, PIFu aligns individual local features at the pixel level to the global context of the entire object in a fully convolutional manner, and does not require high memory usage, as in voxel-based representations. This is particularly relevant for the 3D reconstruction of clothed subjects, whose shape can be of arbitrary topology, highly deformable, and highly detailed.
Specifically, we train an encoder to learn individual feature vectors for each pixel of an image that takes into account the global context relative to its position. Given this per-pixel feature vector and a specified z-depth along the outgoing camera ray from this pixel, we learn an implicit function that can classify whether a 3D point corresponding to this z-depth is inside or outside the surface. In particular, our feature vector spatially aligns the global 3D surface shape to the pixel, which allows us to preserve local details present in the input image while inferring plausible ones in unseen regions.
Our end-to-end and unified digitization approach can directly predict high-resolution 3D shapes of a person with complex hairstyles and wearing arbitrary clothing. Despite the amount of unseen regions, particularly for a single-view input, our method can generate a complete model similar to ones obtained from multi-view stereo photogrammetry or other 3D scanning techniques. As shown in Figure 1, our algorithm can handle a wide range of complex clothing, such as skirts, scarves, and even high heels, while capturing high-frequency details such as wrinkles that match the input image at the pixel level.
By simply adopting the implicit function to regress RGB values at each queried point along the ray, PIFu can be naturally extended to infer per-vertex colors. Hence, our digitization framework also generates a complete texture of the surface, while predicting plausible appearance details in unseen regions. Through additional multi-view stereo constraints, PIFu can also be naturally extended to handle multiple input images, as is often desired for practical human capture settings. Since producing a complete textured mesh is already possible from a single input image, adding more views only improves our results further by providing additional information for unseen regions.
We demonstrate the effectiveness and accuracy of our approach on a wide range of challenging real-world and unconstrained images of clothed subjects. We also show, for the first time, high-resolution examples of monocular and textured 3D reconstructions of dynamic clothed human bodies reconstructed from a video sequence. We provide comprehensive evaluations of our method using ground truth 3D scan datasets obtained with high-end photogrammetry. We compare our method with prior work and demonstrate state-of-the-art performance on a public benchmark for digitizing clothed humans.
Single-View 3D Human Digitization. Single-view digitization techniques require strong priors due to the ambiguous nature of the problem. Thus, parametric models of human bodies and shapes [4, 35] are widely used for digitizing humans from input images. Silhouettes and other types of manual annotations [20, 70] are often used to initialize the fitting of a statistical body model to images. Bogo et al. proposed a fully automated pipeline for unconstrained input data. Recent methods involve deep neural networks to improve the robustness of pose and shape parameter estimation for highly challenging images [30, 46]. Methods that take part segmentation as input [33, 42] can produce more accurate fittings. Despite their capability to capture human body measurements and motions, parametric models only produce a naked human body; the 3D surfaces of clothing, hair, and other accessories are entirely ignored. For skin-tight clothing, a displacement vector for each vertex is sometimes used to model some level of clothing, as shown in [2, 65, 1]. Nevertheless, these techniques fail for more complex topologies such as dresses, skirts, and long hair. To address this issue, template-free methods such as BodyNet learn to directly generate a voxel representation of the person using a deep neural network. Due to the high memory requirements of voxel representations, fine-scale details are often missing in the output. More recently, a multi-view inference approach was introduced that synthesizes novel silhouette views from a single image. While multi-view silhouettes are more memory efficient, concave regions are difficult to infer, as is generating consistent views; consequently, fine-scale details cannot be produced reliably. In contrast, PIFu is memory efficient and is able to capture fine-scale details present in the image, as well as predict per-vertex colors.
Multi-View 3D Human Digitization. Multi-view acquisition methods are designed to produce a complete model of a person and simplify the reconstruction problem, but are often limited to studio settings and calibrated sensors. Early attempts are based on visual hulls [37, 60, 15, 14], which use silhouettes from multiple views to carve out the visible areas of a capture volume. Reasonable reconstructions can be obtained when large numbers of cameras are used, but concavities are inherently challenging to handle. More accurate geometry can be obtained using multi-view stereo constraints [55, 73, 63, 16] or controlled illumination, such as multi-view photometric stereo techniques [61, 66]. Several methods use parametric body models to further guide the digitization process [54, 17, 5, 25, 3, 1]. The use of motion cues has also been introduced as an additional prior [47, 68]. While it is clear that multi-view capture techniques outperform single-view ones, they are significantly less flexible and deployable.
A middle-ground solution consists of using deep learning frameworks to generate plausible 3D surfaces from very sparse views. One approach trains a 3D convolutional LSTM to predict the 3D voxel representation of objects from arbitrary views; another combines information from arbitrary views using differentiable unprojection operations; a similar approach requires at least two views. All of these techniques rely on voxels, which are memory intensive and prevent the capture of high-frequency details. [26, 18] introduced a deep learning approach based on a volumetric occupancy field that can capture dynamic clothed human performances using very sparse views as input, but at least three views are required for these methods to produce reasonable output.
Texture Inference. When reconstructing a 3D model from a single image, the texture can be easily sampled from the input. However, the appearance in occluded regions needs to be inferred in order to obtain a complete texture. Related to the problem of 3D texture inference are view-synthesis approaches that predict novel views from a single image [71, 43]. Within the context of textured mesh inference of clothed human bodies, a view synthesis technique was introduced that predicts the back view from the front one. Both front and back views are then used to texture the final 3D mesh; however, self-occluding regions and side views cannot be handled. Akin to the image inpainting problem, one method inpaints UV images sampled from the output of detected surface points, and [57, 22] infer per-voxel colors, but the output resolution is very limited. Another approach directly predicts RGB values on a UV parameterization, but can only handle shapes with known topology and is therefore not suitable for clothing inference. Our proposed method can predict per-vertex colors in an end-to-end fashion and can handle surfaces with arbitrary topology.
Given single- or multi-view images, our goal is to reconstruct the underlying 3D geometry and texture of a clothed human while preserving the detail present in the image. To this end, we introduce Pixel-Aligned Implicit Functions (PIFu), a memory-efficient and spatially-aligned 3D representation for 3D surfaces. An implicit function defines a surface as a level set of a function $f$, e.g., $f(X) = 0$. This results in a memory-efficient representation of a surface, since the space in which the surface is embedded does not need to be explicitly stored. The proposed pixel-aligned implicit function consists of a fully convolutional image encoder $g$ and a continuous implicit function $f$ represented by multi-layer perceptrons (MLPs), where the surface is defined as a level set of

$$f(F(x), z(X)) = s, \quad s \in \mathbb{R}, \tag{1}$$

where, for a 3D point $X$, $x = \pi(X)$ is its 2D projection, $z(X)$ is the depth value in the camera coordinate space, and $F(x) = g(I(x))$ is the image feature at $x$. We assume a weak-perspective camera, but extending to perspective cameras is straightforward. Note that we obtain the pixel-aligned feature $F(x)$ using bilinear sampling, because the 2D projection of $X$ is defined in a continuous space rather than a discrete one (i.e., a pixel).
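To make the projection-and-sampling step concrete, here is a minimal sketch in plain Python. This is not the authors' code: `project_weak_perspective` and `bilinear_sample` are hypothetical helpers, and the weak-perspective camera is modeled simply as a uniform scale plus a 2D translation.

```python
import math

def project_weak_perspective(X, scale=1.0, tx=0.0, ty=0.0):
    """Weak-perspective projection: uniform scale + 2D translation.
    The depth value is simply the z coordinate in camera space."""
    x, y, z = X
    return (scale * x + tx, scale * y + ty), z

def bilinear_sample(feat, u, v):
    """Sample a feature map feat[row][col] (each cell a list of channels)
    at continuous pixel coordinates (u, v), as required because the 2D
    projection of a 3D point is not restricted to integer pixels."""
    rows, cols = len(feat), len(feat[0])
    u0, v0 = int(math.floor(u)), int(math.floor(v))
    u1, v1 = min(u0 + 1, cols - 1), min(v0 + 1, rows - 1)
    du, dv = u - u0, v - v0
    nch = len(feat[0][0])
    return [
        (1 - du) * (1 - dv) * feat[v0][u0][c]
        + du * (1 - dv) * feat[v0][u1][c]
        + (1 - du) * dv * feat[v1][u0][c]
        + du * dv * feat[v1][u1][c]
        for c in range(nch)
    ]
```

The sampled feature vector and the depth value together form the input to the implicit function.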
The key observation is that we learn an implicit function over the 3D space with pixel-aligned image features rather than global features, which allows the learned functions to preserve the local detail present in the image. The continuous nature of PIFu allows us to generate detailed geometry with arbitrary topology in a memory efficient manner. Moreover, PIFu can be cast as a general framework that can be extended to various co-domains such as RGB colors.
Figure 2 illustrates the overview of our framework. Given an input image, PIFu for surface reconstruction predicts the continuous inside/outside probability field of a clothed human, from which an iso-surface can be easily extracted (Sec. 3.1). Similarly, PIFu for texture inference (Tex-PIFu) outputs an RGB value at 3D positions on the surface geometry, enabling texture inference on self-occluded surface regions and shapes of arbitrary topology (Sec. 3.2). Furthermore, we show that the proposed approach handles single-view and multi-view input naturally, which allows us to produce even higher fidelity results when more views are available (Sec. 3.3).
For surface reconstruction, we represent the ground truth surface as a 0.5 level-set of a continuous 3D occupancy field:

$$f_v^*(X) = \begin{cases} 1, & \text{if } X \text{ is inside the mesh surface} \\ 0, & \text{otherwise.} \end{cases} \tag{2}$$
We train a pixel-aligned implicit function (PIFu) by minimizing the average mean squared error:

$$\mathcal{L}_V = \frac{1}{n} \sum_{i=1}^{n} \left| f_v(F_V(x_i), z(X_i)) - f_v^*(X_i) \right|^2, \tag{3}$$

where $X_i \in \mathbb{R}^3$, $F_V(x_i)$ is the image feature from the image encoder at $x_i = \pi(X_i)$, and $n$ is the number of sampled points. Given a pair of an input image and the corresponding 3D mesh that is spatially aligned with the input image, the parameters of the image encoder and PIFu are jointly updated by minimizing Eq. 3. As Bansal et al. demonstrate for semantic segmentation, training an image encoder with a subset of pixels does not hurt convergence compared with training with all the pixels. During inference, we densely sample the probability field over the 3D space and extract the iso-surface of the probability field at threshold 0.5 using the Marching Cubes algorithm. This implicit surface representation is suitable for detailed objects with arbitrary topology. Aside from PIFu's expressiveness and memory efficiency, we develop a spatial sampling strategy and network architecture that are critical for achieving high-fidelity inference. Please refer to the supplemental materials for our network architecture and training procedure.
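The occupancy objective in Eq. 3 reduces to a mean squared error over sampled points, with inside/outside decided by the iso-surface threshold at inference time. A minimal plain-Python sketch (hypothetical helper names; the standard 0.5 iso-level is assumed):

```python
def occupancy_mse(pred_probs, gt_labels):
    """Average mean-squared error between predicted inside/outside
    probabilities and ground-truth occupancy (1 inside, 0 outside)."""
    n = len(pred_probs)
    return sum((p - g) ** 2 for p, g in zip(pred_probs, gt_labels)) / n

def classify(pred_probs, threshold=0.5):
    """Inside/outside decision used when extracting the iso-surface."""
    return [1 if p > threshold else 0 for p in pred_probs]
```

In practice the predictions come from the MLP evaluated at pixel-aligned features, and the extracted probability grid is passed to a marching-cubes implementation.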
The resolution of the training data plays a central role in achieving the expressiveness and accuracy of our implicit function. Unlike voxel-based methods, our approach does not require discretization of ground truth 3D meshes. Instead, we can directly sample 3D points on the fly from the ground truth mesh at its original resolution using an efficient ray tracing algorithm. Note that this operation requires watertight meshes; for non-watertight meshes, one can use off-the-shelf solutions to make them watertight. Additionally, we observe that the sampling strategy can largely influence the final reconstruction quality. If one uniformly samples points in the 3D space, the majority of points are far from the iso-surface, which would unnecessarily weight the network toward outside predictions. On the other hand, sampling only around the iso-surface can cause overfitting. Consequently, we propose to combine uniform sampling and adaptive sampling based on the surface geometry. We first randomly sample points on the surface geometry and perturb their positions around the surface with offsets drawn from a normal distribution ($\sigma = 5$ cm in our experiments) for the x, y, and z axes. We combine those samples with uniformly sampled points within the bounding box using a ratio of 16:1. We provide an ablation study on our sampling strategy in the supplemental materials.
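The mixed sampling scheme above can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function name and the `n_surface`, `ratio`, and `sigma` parameters are hypothetical, with `sigma` assumed to be in meters (5 cm = 0.05).

```python
import random

def sample_training_points(surface_points, bbox_min, bbox_max,
                           n_surface=1024, ratio=16, sigma=0.05, rng=None):
    """Mixed sampling: Gaussian-perturbed surface points plus uniform
    bounding-box samples at a surface:uniform ratio of `ratio`:1."""
    rng = rng or random.Random()
    near = []
    for _ in range(n_surface):
        px, py, pz = rng.choice(surface_points)
        # perturb each axis independently around the surface point
        near.append((px + rng.gauss(0, sigma),
                     py + rng.gauss(0, sigma),
                     pz + rng.gauss(0, sigma)))
    # a smaller share of uniform samples keeps far-field supervision
    uniform = [tuple(rng.uniform(lo, hi) for lo, hi in zip(bbox_min, bbox_max))
               for _ in range(n_surface // ratio)]
    return near + uniform
```

A real pipeline would sample points on mesh triangles (area-weighted) rather than from a fixed vertex list, but the near-surface/uniform mixture is the key idea.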
While texture inference is often performed on either a 2D parameterization of the surface [31, 21] or in view space, PIFu enables us to directly predict RGB colors on the surface geometry by defining the output $s$ in Eq. 1 as an RGB vector field instead of a scalar field. This supports texturing of shapes with arbitrary topology and self-occlusion. However, extending PIFu to color prediction is a non-trivial task, as RGB colors are defined only on the surface while the 3D occupancy field is defined over the entire 3D space. Here, we highlight the modifications of PIFu in terms of training procedure and network architecture.
Given sampled 3D points on the surface $X \in \mathcal{S}$, the objective function for texture inference is the average L1 error of the sampled colors:

$$\mathcal{L}_C = \frac{1}{n} \sum_{i=1}^{n} \left| f_c(F_C(x_i), z(X_i)) - C(X_i) \right|, \tag{4}$$

where $C(X_i)$ is the ground truth RGB value at the surface point $X_i \in \mathcal{S}$ and $n$ is the number of sampled points. We found that naively training $f_c$ with the loss function above severely suffers from overfitting. The problem is that $f_c$ is expected to learn not only the RGB color on the surface but also the underlying 3D surface of the object, so that it can infer the texture of unseen surfaces with different pose and shape during inference, which poses a significant challenge. We address this problem with the following modifications. First, we condition the image encoder for texture inference on the image features learned for surface reconstruction, $F_V$. This way, the image encoder can focus on color inference for a given geometry even if unseen objects have a different shape, pose, or topology. Additionally, we introduce an offset $\epsilon \sim \mathcal{N}(0, d)$ to the surface points along the surface normal $N$, so that the color is defined not only on the exact surface but also in the 3D space around it. With the modifications above, the training objective function can be rewritten as:

$$\mathcal{L}_C = \frac{1}{n} \sum_{i=1}^{n} \left| f_c(F_C(x_i', F_V), X_{i,z}') - C(X_i) \right|, \tag{5}$$

where $X_i' = X_i + \epsilon \cdot N_i$. We use $d = 1.0$ cm for all the experiments. Please refer to the supplemental material for the network architecture for texture inference.
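A toy version of the texture supervision, assuming unit-length normals; the helper names are hypothetical and the offset magnitude `d` stands in for the per-point Gaussian offset described above.

```python
import random

def offset_along_normal(point, normal, d=0.01, rng=None):
    """Displace a surface point along its unit normal by eps ~ N(0, d),
    so that color supervision covers a thin shell around the surface
    instead of only the exact surface."""
    rng = rng or random.Random()
    eps = rng.gauss(0.0, d)
    return tuple(p + eps * n for p, n in zip(point, normal))

def texture_l1(pred_rgb, gt_rgb):
    """Average per-channel L1 error over sampled surface colors."""
    total, count = 0.0, 0
    for p, g in zip(pred_rgb, gt_rgb):
        for pc, gc in zip(p, g):
            total += abs(pc - gc)
            count += 1
    return total / count
```

Note that the ground-truth color is still taken at the original surface point, while the network is queried at the offset location.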
Additional views provide more coverage of the person and should improve the digitization accuracy. Our formulation of PIFu provides the option to incorporate information from more views for both surface reconstruction and texture inference. We achieve this by using PIFu to learn a feature embedding for every 3D point in space. Specifically, the output domain of Eq. 1 is now an $n$-dimensional vector space that represents the latent feature embedding associated with the specified 3D coordinate and the image feature from each view. Since this embedding is defined in the 3D world coordinate space, we can aggregate the embeddings from all available views that share the same 3D point. The aggregated feature vector can be used to make a more confident prediction of the surface and the texture.
Specifically, we decompose the pixel-aligned function $f$ into a feature embedding network $f_1$ and a multi-view reasoning network $f_2$ as $f = f_2 \circ f_1$. See Figure 3 for an illustration. The first function $f_1$ encodes the image feature $F_i(x_i)$ and depth value $z_i$ from each viewpoint $i$ into a latent feature embedding $\Phi_i$. This allows us to aggregate the corresponding pixel features from all the views. Since the corresponding 3D point $X$ is shared by the different views, each image can project it onto its own image coordinate system by $x_i = \pi_i(X)$ and $z_i = z_i(X)$. Then, we aggregate the latent features $\Phi_i$ by an average pooling operation and obtain the fused embedding $\bar{\Phi} = \operatorname{mean}(\Phi_i)$. The second function $f_2$ maps from the aggregated embedding $\bar{\Phi}$ to our target implicit field $s$ (i.e., the inside/outside probability for surface reconstruction or the RGB value for texture inference). The additive nature of the latent embedding allows us to incorporate an arbitrary number of inputs. Note that a single-view input can also be handled without modification in the same framework, as the average operation simply returns the original latent embedding. For training, we use the same procedure as in the aforementioned single-view cases, including the loss functions and the point sampling scheme. While we train with three random views, our experiments show that the model can incorporate information from more than three views (see Sec. 4).
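The two-stage fusion can be sketched as below. Here `f1` and `f2` stand in for the embedding and reasoning networks (arbitrary callables in this sketch), and the average pooling is what makes the number of views flexible:

```python
def fuse_multiview(per_view_feats, per_view_depths, f1, f2):
    """Multi-view PIFu sketch: f1 embeds each view's (feature, depth)
    pair, the embeddings are average-pooled across views, and f2 decodes
    the fused embedding into the target field (occupancy or RGB)."""
    embeddings = [f1(list(feat) + [z])
                  for feat, z in zip(per_view_feats, per_view_depths)]
    dim = len(embeddings[0])
    # average pooling: works identically for 1, 3, or N views
    fused = [sum(e[d] for e in embeddings) / len(embeddings)
             for d in range(dim)]
    return f2(fused)
```

With a single view the mean is the embedding itself, which is why the same trained networks handle both settings.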
We evaluate our proposed approach on a variety of datasets, including RenderPeople and BUFF, which have ground truth measurements, as well as DeepFashion, which contains a diverse variety of complex clothing.
We quantitatively evaluate our reconstruction accuracy with three different metrics. In the model space, we measure the average point-to-surface Euclidean distance (P2S) in cm from the vertices of the reconstructed surface to the ground truth. We also measure the Chamfer distance between the reconstructed and the ground truth surfaces. In addition, we introduce the normal reprojection error to measure the fineness of reconstructed local details, as well as the projection consistency with the input image. For both reconstructed and ground truth surfaces, we render their normal maps in the image space from the input viewpoint. We then calculate the L2 error between these two normal maps.
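For small point sets, the P2S and Chamfer metrics can be sketched with a brute-force nearest-neighbor search. This is an O(nm) illustration with hypothetical helper names, not suitable for dense meshes, where a spatial index (e.g., a k-d tree) would be used instead:

```python
import math

def _nn_mean(src, dst):
    """Mean nearest-neighbor distance from each point in src to dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

def point_to_surface(recon_pts, gt_pts):
    """P2S proxy: mean distance from reconstructed vertices to a dense
    point sampling of the ground-truth surface."""
    return _nn_mean(recon_pts, gt_pts)

def chamfer(a, b):
    """Symmetric Chamfer distance between two point sets."""
    return 0.5 * (_nn_mean(a, b) + _nn_mean(b, a))
```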
Single-View Reconstruction. In Table 1 and Figure 5, we evaluate the reconstruction errors for each method on both the BUFF and RenderPeople test sets. Note that while the Voxel Regression Network (VRN), IM-GAN, and our method are retrained with the same High-Fidelity Clothed Human dataset we use for our approach, the reconstructions of [39, 58] are obtained from their trained models as off-the-shelf solutions. In contrast to the state-of-the-art single-view reconstruction method using an implicit function (IM-GAN), which reconstructs surfaces from one global feature per image, our method outputs pixel-aligned high-resolution surface reconstructions that capture hairstyles and clothing wrinkles. We also demonstrate the expressiveness of our PIFu representation compared with voxels. Although VRN and our method share the same network architecture for the image encoder, the higher expressiveness of the implicit representation allows us to achieve higher fidelity.
In Figure 6, we also compare our single-view texture inference with a state-of-the-art texture inference method for clothed humans, SiCloPe, which infers a 2D image from the back view and stitches it together with the input front-view image to obtain textured meshes. While SiCloPe suffers from projection distortion and artifacts around the silhouette boundary, our approach predicts textures on the surface mesh directly, removing projection artifacts.
Multi-View Reconstruction. In Table 2 and Figure 7, we compare our multi-view reconstruction with other deep learning-based multi-view methods, including a voxel-based multi-view stereo machine, LSM, and the deep visual hull method proposed by Huang et al. All approaches are trained on the same High-Fidelity Clothed Human dataset using three-view input images. Note that Huang et al. can be seen as a degenerate version of our method where the multi-view feature fusion process relies solely on image features, without explicit conditioning on the 3D coordinate information. To evaluate the importance of conditioning on depth, we denote our network architecture with the depth value $z$ removed from the input of PIFu as Huang et al. in our experiments. We demonstrate that PIFu achieves state-of-the-art reconstruction both qualitatively and quantitatively in our metrics. We also show that our multi-view PIFu allows us to increasingly refine the geometry and texture by incorporating an arbitrary number of views in Figure 8.
In Figure 4, we present our digitization results using real-world input images from the DeepFashion dataset. We demonstrate that PIFu can handle a wide variety of clothing, including skirts, jackets, and dresses. Our method can produce high-resolution local details, while inferring plausible 3D surfaces in unseen regions. Complete textures are also inferred successfully from a single input image, which allows us to view our 3D models from 360 degrees. We refer to the supplemental video (https://youtu.be/S1FpjwKqtPs) for additional static and dynamic results. In particular, we show how dynamic clothed human performances and complex deformations can be digitized in 3D from a single 2D input video.
We introduced a novel pixel-aligned implicit function, which spatially aligns the pixel-level information of the input image with the shape of the 3D object, for deep learning based 3D shape and texture inference of clothed humans from a single input image. Our experiments indicate that highly plausible geometry can be inferred, including largely unseen regions such as the back of a person, while preserving high-frequency details present in the image. Unlike voxel-based representations, our method can produce high-resolution output, since we are not limited by the high memory requirements of volumetric representations. Furthermore, we also demonstrate how this method can be naturally extended to infer the entire texture of a person given partial observations. Unlike existing methods, which synthesize the back regions based on frontal views in an image space, our approach can predict colors in unseen, concave, and side regions directly on the surface. In particular, our method is the first approach that can inpaint textures for shapes of arbitrary topology. Since we are capable of generating textured 3D surfaces of a clothed person from a single RGB camera, we are moving a step closer toward monocular reconstruction of dynamic scenes from video without the need for a template model. Our ability to handle arbitrary additional views also makes our approach particularly suitable for practical and efficient 3D modeling settings using sparse views, where traditional multi-view stereo or structure-from-motion would fail.
While our texture predictions are reasonable and not limited by the topology or parameterization of the inferred 3D surface, we believe that higher resolution appearances can be inferred, possibly using generative adversarial networks. In this work, we focused largely on clothed human surfaces. A natural question is how the approach extends to general object shapes. Our preliminary experiments on the ShapeNet dataset in a class-agnostic setting reveal new challenges, as shown in Figure 9. We speculate that the greater variety of object shapes makes it difficult to learn a globally coherent shape from pixel-level features, which future work can address. Lastly, in all our examples, none of the segmented subjects are occluded by other objects or scene elements. In real-world settings, occlusions often occur, and perhaps only a part of the body is framed in the camera. Being able to digitize and predict complete objects in partially visible settings could be highly valuable for analyzing humans in unconstrained settings. Whether it will be an end-to-end approach or a sophisticated system, we believe that it will eventually be possible to digitize arbitrary 3D objects from a single RGB input, and PIFu represents an important building block toward this goal.
Acknowledgements Hao Li is affiliated with the University of Southern California, the USC Institute for Creative Technologies, and Pinscreen. This research was conducted at USC and was funded in part by the ONR YIP grant N00014-17-S-FO14, the CONIX Research Center, one of six centers in JUMP, a Semiconductor Research Corporation program sponsored by DARPA, the Andrew and Erna Viterbi Early Career Chair, the U.S. Army Research Laboratory under contract number W911NF-14-D-0005, Adobe, and Sony. This project was not funded by Pinscreen, nor was it conducted at Pinscreen or by anyone else affiliated with Pinscreen. Shigeo Morishima is supported by the JST ACCEL Grant Number JPMJAC1602, JSPS KAKENHI Grant Number JP17H06101, and the Waseda Research Institute for Science and Engineering. Angjoo Kanazawa is supported by the Berkeley Artificial Intelligence Research sponsors. The content of the information does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer, 2016.
Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision, pages 2223–2232, 2017.
High-quality video view interpolation using a layered representation. ACM Transactions on Graphics, 23(3):600–608, 2004.
Since there are no large-scale datasets for high-resolution clothed humans, we collected photogrammetry data of high-quality textured human meshes with a wide range of clothing, shapes, and poses from RenderPeople (https://renderpeople.com/3d-people/). We refer to this database as the High-Fidelity Clothed Human dataset. We randomly split the dataset into disjoint training and test sets of subjects. To efficiently render the digital humans, Lambertian diffuse shading with surface normals and spherical harmonics is typically used due to its simplicity and efficiency [59, 39]. However, we found that to achieve high-fidelity reconstructions on real images, the synthetic renderings need to correctly simulate light transport effects resulting from both global and local geometric properties, such as ambient occlusion. To this end, we use a precomputed radiance transfer (PRT) technique that precomputes visibility on the surface using spherical harmonics and efficiently represents global light transport effects by multiplying spherical harmonics coefficients of illumination and visibility. PRT only needs to be computed once per object and can be reused with arbitrary illuminations and camera angles. Together with PRT, we use second-order spherical harmonics of indoor scenes from HDRI Haven (https://hdrihaven.com/) with random rotations around the y axis. We render the images by aligning subjects to the image center, using a weak-perspective camera model and an image resolution of 512 × 512. We also rotate the subjects around the yaw axis to augment training viewpoints. For the evaluation, we render test subjects from RenderPeople and from the BUFF dataset using views sampled uniformly around the yaw axis. Note that we render the images without the background. We also test our approach on real images of humans from the DeepFashion dataset.
In the case of real data, we use an off-the-shelf semantic segmentation network together with GrabCut refinement.
Since the framework of PIFu is not limited to a specific network architecture, one can technically use any fully convolutional neural network as the image encoder. For surface reconstruction, we found that sequential architectures proposed for human pose estimation [64, 41] are effective for human digitization, with better generalization on real images. We believe this is because such an architecture progressively refines the prediction by incorporating long-range geometric structure. We adapt the stacked hourglass network with modifications proposed by . We also replace batch normalization with group normalization, which improves training stability when the batch sizes are small. Similar to , the intermediate features of each stack are fed into PIFu, and the losses from all the stacks are aggregated for the parameter update. We conducted an ablation study on the network architecture design and compare against other alternatives (VGG16, ResNet34) in Appendix II. The image encoder for texture inference adopts the architecture of CycleGAN , consisting of residual blocks . Instead of using transpose convolutions to upsample the latent features, we directly feed the output of the residual blocks to the following Tex-PIFu.
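The batch-to-group normalization swap can be illustrated with a minimal NumPy group normalization (the group count and epsilon below are illustrative, not the encoder's actual configuration):

```python
import numpy as np

def group_norm(x, num_groups, gamma, beta, eps=1e-5):
    """Group normalization over an NCHW tensor.

    Unlike batch norm, statistics are computed per sample over channel
    groups, so the result is independent of the batch size -- which is
    why it behaves better when training with small batches.
    """
    n, c, h, w = x.shape
    g = num_groups
    xg = x.reshape(n, g, c // g, h, w)
    mean = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    xg = (xg - mean) / np.sqrt(var + eps)
    x = xg.reshape(n, c, h, w)
    # Per-channel learnable affine transform, as in batch norm.
    return x * gamma.reshape(1, c, 1, 1) + beta.reshape(1, c, 1, 1)
```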
PIFu for surface reconstruction is based on a multi-layer perceptron, where the number of neurons is , with non-linear activations using leaky ReLU except for the last layer, which uses a sigmoid activation. To effectively propagate the depth information, each layer of the MLP has skip connections from the image feature and depth, in the spirit of . For multi-view PIFu, we simply take the -th layer output as the feature embedding and apply average pooling to aggregate the embeddings from different views. Tex-PIFu takes , together with the image feature for surface reconstruction, by setting the number of neurons in the first layer of the MLP to instead of . We also replace the last layer of PIFu with neurons, followed by a tanh activation to represent RGB values.
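A schematic NumPy version of such an MLP (hidden widths and initialization are placeholders, not the paper's configuration) shows the two details that matter: the conditioning vector of pixel-aligned feature and depth is re-concatenated at every hidden layer, and the last layer squashes to an occupancy value in (0, 1):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SkipMLP:
    """MLP that re-concatenates the conditioning vector at each layer."""
    def __init__(self, feat_dim, hidden=(64, 32, 16), seed=0):
        rng = np.random.default_rng(seed)
        dims_in = [feat_dim] + [h + feat_dim for h in hidden]
        dims_out = list(hidden) + [1]
        self.W = [rng.standard_normal((i, o)) * 0.1
                  for i, o in zip(dims_in, dims_out)]

    def __call__(self, feat):
        h = feat
        for W in self.W[:-1]:
            h = leaky_relu(h @ W)
            # Skip connection from the conditioning input.
            h = np.concatenate([h, feat], axis=-1)
        return sigmoid(h @ self.W[-1])   # occupancy in (0, 1)
```

For the multi-view variant, the embedding after an intermediate layer would be averaged across views before applying the remaining layers; since averaging is permutation-invariant and size-agnostic, any number of views can be fused.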
Since the texture inference module requires pretrained image features from the surface reconstruction module, we first train PIFu for surface reconstruction and then for texture inference, using the learned image features as the condition. We use RMSProp for surface reconstruction following and Adam for texture inference with a learning rate of as in , a batch size of and , a number of epochs of and , and a number of sampled points of and per object in every training batch, respectively. The learning rate of RMSProp is decayed by a factor of at the -th epoch following . The multi-view PIFu is fine-tuned from the models trained for single-view surface reconstruction and texture inference with a learning rate of and epochs. The training of PIFu for single-view surface reconstruction and texture inference takes and days, respectively, and fine-tuning for multi-view PIFu can be completed within a day on a single 1080 Ti GPU.
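The step decay applied to the RMSProp learning rate can be sketched in a few lines; every numeric value below (base rate, decay factor, decay epoch) is a hypothetical placeholder, since the paper's exact numbers are not reproduced here:

```python
# Step learning-rate decay: multiply the rate once a given epoch is
# reached. All numeric defaults are illustrative placeholders.
def rmsprop_lr(epoch, base_lr=1e-3, decay_factor=0.1, decay_epoch=10):
    return base_lr * (decay_factor if epoch >= decay_epoch else 1.0)
```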
In Table 4 and Figure 10, we show the effects of sampling methods for surface reconstruction. The most straightforward way is to sample uniformly inside the bounding box of the target object. Although this helps to remove artifacts caused by overfitting, the decision boundary becomes less sharp, losing all the local details (see Figure 10, first column). To obtain a sharper decision boundary, we propose to sample points around the surface, with distances from the actual surface mesh following a standard deviation of . We use and cm. The smaller the standard deviation becomes, the sharper the decision boundary is, but the result becomes more prone to artifacts outside the decision boundary (second column). We found that combining adaptive sampling with cm and uniform sampling achieves qualitatively and quantitatively the best results (right-most column). Note that each sampling scheme is trained with a setup identical to the training procedure described in Appendix I.
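The combined scheme can be sketched as follows, with `surface_points` standing for points pre-sampled on the ground-truth mesh, and a 50/50 split between near-surface and uniform samples as an illustrative assumption:

```python
import numpy as np

def sample_training_points(surface_points, bbox_min, bbox_max,
                           n_total, sigma, mix=0.5, rng=None):
    """Mix Gaussian perturbations of surface points with uniform
    samples from the bounding box.

    surface_points: (S, 3) points pre-sampled on the mesh surface
    sigma:          std-dev of the offsets (same units as the mesh)
    mix:            fraction of points drawn near the surface
    """
    if rng is None:
        rng = np.random.default_rng()
    n_near = int(n_total * mix)
    idx = rng.integers(0, len(surface_points), n_near)
    # Adaptive samples: surface points displaced by Gaussian noise.
    near = surface_points[idx] + rng.normal(0.0, sigma, (n_near, 3))
    # Uniform samples: anywhere inside the bounding box.
    uniform = rng.uniform(bbox_min, bbox_max, (n_total - n_near, 3))
    return np.concatenate([near, uniform], axis=0)
```

Each sampled point would then be labeled inside/outside against the ground-truth mesh to supervise the occupancy prediction.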
In this section, we show comparisons of different architectures for surface reconstruction and provide insight on the design choices for the image encoders. One option is to use bottleneck features of fully convolutional networks [29, 64, 41]. Due to its state-of-the-art performance in volumetric regression for human faces and bodies, we choose the Stacked Hourglass network with a modification proposed by , denoted as HG. Another option is to aggregate features from multiple layers to obtain a multi-scale feature embedding [6, 26]. Here we use two widely used network architectures for the comparison: VGG16 and ResNet34 . We extract the features from the layers ‘relu1_2’, ‘relu2_2’, ‘relu3_3’, ‘relu4_3’, and ‘relu5_3’ of the VGG network using bilinear sampling based on , resulting in -dimensional features. Similarly, we extract the features before every pooling layer in ResNet, resulting in -D features. We modify the first channel size in PIFu to accommodate these feature dimensions and train the surface reconstruction model using the Adam optimizer with a learning rate of , a number of sampled points of , and batch sizes of and for VGG and ResNet, respectively. Note that VGG and ResNet are initialized with models pretrained on ImageNet . The other hyper-parameters are the same as the ones used for our sequential network based on the Stacked Hourglass.
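The feature extraction "using bilinear sampling" above amounts to bilinearly interpolating each feature map at a continuous pixel location; a minimal NumPy version (clamping the base cell so the four neighbors stay in bounds is an illustrative boundary choice):

```python
import numpy as np

def bilinear_sample(feat, x, y):
    """Sample a (C, H, W) feature map at continuous pixel coords (x, y)."""
    c, h, w = feat.shape
    # Clamp the base cell so (x0 + 1, y0 + 1) remain valid indices.
    x0 = np.clip(np.floor(x).astype(int), 0, w - 2)
    y0 = np.clip(np.floor(y).astype(int), 0, h - 2)
    dx, dy = x - x0, y - y0
    top = (1 - dx) * feat[:, y0, x0] + dx * feat[:, y0, x0 + 1]
    bot = (1 - dx) * feat[:, y0 + 1, x0] + dx * feat[:, y0 + 1, x0 + 1]
    return (1 - dy) * top + dy * bot
```

A multi-scale embedding is obtained by calling this on several layers' feature maps, rescaling (x, y) to each map's resolution, and concatenating the outputs along the channel axis.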
In Table 3 and Figure 11, we show comparisons of the three architectures using our evaluation data. While ResNet has slightly better performance in the same domain as the training data (i.e., the test set of the RenderPeople dataset), we observe that the network suffers from overfitting and fails to generalize to other domains (i.e., the BUFF and DeepFashion datasets). Thus, we adopt a sequential architecture based on the Stacked Hourglass network as our final model.
Table 4 ( cm + Uniform): 0.084, 1.52, 1.50, 0.092, 1.15, 1.14.
Please see the supplementary video for more results.
We provide an additional comparison with the Voxel Regression Network (VRN)  to clarify the advantages of PIFu. Figure 12 demonstrates that the proposed PIFu representation can align the 3D reconstruction with the pixels at higher resolution, while VRN suffers from misalignment due to the limited precision of its voxel representation. Additionally, the generality of PIFu enables texturing of shapes with arbitrary topology and self-occlusion, which is not addressed by the work of VRN. Note that VRN is only able to project the image texture onto the recovered surface and does not provide an approach for texture inpainting on the unseen side.
We also apply our approach to video sequences obtained from . For the reconstruction, video frames are center-cropped and scaled so that the size of the subjects roughly matches that of our training data. Note that the cropping and scaling are fixed for each sequence. Figure 13 demonstrates that our reconstructed results are reasonably temporally coherent, even though the frames are processed independently.
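The per-sequence preprocessing can be sketched as below; the square crop size is chosen once per sequence and reused for every frame, while the nearest-neighbor resize and the default output resolution are illustrative simplifications rather than the paper's exact pipeline:

```python
import numpy as np

def center_crop_and_scale(frame, crop_size, out_size=512):
    """Center-crop a frame to a square and resize it.

    crop_size is picked once per sequence so the subject roughly
    matches the framing of the training renders, then kept fixed.
    Nearest-neighbor index sampling stands in for a proper resampler.
    """
    h, w = frame.shape[:2]
    top, left = (h - crop_size) // 2, (w - crop_size) // 2
    crop = frame[top:top + crop_size, left:left + crop_size]
    idx = np.arange(out_size) * crop_size // out_size
    return crop[idx][:, idx]
```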