H3D-Net: Few-Shot High-Fidelity 3D Head Reconstruction

07/26/2021
by   Eduard Ramon, et al.

Recent learning approaches that implicitly represent surface geometry using coordinate-based neural representations have shown impressive results in the problem of multi-view 3D reconstruction. The effectiveness of these techniques is, however, subject to the availability of a large number (several tens) of input views of the scene, and computationally demanding optimizations. In this paper, we tackle these limitations for the specific problem of few-shot full 3D head reconstruction, by endowing coordinate-based representations with a probabilistic shape prior that enables faster convergence and better generalization when using few input images (down to three). First, we learn a shape model of 3D heads from thousands of incomplete raw scans using implicit representations. At test time, we jointly overfit two coordinate-based neural networks to the scene, one modeling the geometry and another estimating the surface radiance, using implicit differentiable rendering. We devise a two-stage optimization strategy in which the learned prior is used to initialize and constrain the geometry during an initial optimization phase. Then, the prior is unfrozen and fine-tuned to the scene. By doing this, we achieve high-fidelity head reconstructions, including hair and shoulders, and with a high level of detail that consistently outperforms both state-of-the-art 3D Morphable Models methods in the few-shot scenario, and non-parametric methods when large sets of views are available.

1 Introduction

Recent learning based methods have shown impressive results in reconstructing 3D shapes from 2D images. These approaches can be roughly split into two main categories: model-based [3, 11, 24, 30, 34, 35, 36, 42, 43, 47, 50] and model-free [9, 10, 16, 18, 21, 32, 33, 38, 48, 49, 51]. The former incorporate prior knowledge obtained from training data to limit the space of feasible solutions, making these approaches well suited for few-shot and one-shot shape estimation. However, most model-based methods produce shapes that usually lack geometric detail and cannot handle arbitrary topology changes.

On the other hand, model-free approaches based on discrete representations like voxels, meshes or point-clouds have the flexibility to represent a wider spectrum of shapes, although at the cost of being computationally tractable only for small resolutions or being restricted to fixed topologies. These limitations have been overcome by neural implicit representations [8, 23, 26, 31, 37], which can represent both geometry and appearance as a continuum, encoded in the weights of a neural network. [25, 52] have shown the success of such representations in learning detail-rich 3D geometry directly from images, with no 3D ground truth supervision. Unfortunately, the performance of these methods currently depends on the availability of a large number of input views, which leads to time-consuming inference.

In this work we introduce H3D-Net, a hybrid scheme that combines the strengths of model-based and model-free representations by incorporating prior knowledge into neural implicit models for category-specific multi-view reconstruction. We apply this approach to the problem of few-shot full head reconstruction. In order to build the prior, we first use several thousands of raw incomplete scans to learn a space of Signed Distance Functions (SDF) representing 3D head shapes [26]. At inference, this learnt shape prior is used to initialize and guide the optimization of an Implicit Differentiable Renderer (IDR) [52] that, given a potentially reduced number of input images, estimates the full head geometry. The use of the learned prior enables faster convergence during optimization and prevents it from being trapped in local minima, yielding 3D shape estimates that capture fine details of the face, head and hair from just three input images (see the teaser figure).

We exhaustively evaluate our approach on a mid-resolution Multiview-Stereo (MVS) public dataset [28] and on a high-resolution dataset we collected with a structured-light scanner, consisting of 10 3D full-head scans. The results show that we consistently outperform the current state of the art, both in a few-shot setting and when many input views are available. Importantly, the use of the prior also makes our approach very efficient, achieving competitive accuracy about 20× faster than IDR [52]. Our key contributions can be summarized as follows:

  • We introduce a method for reconstructing high quality full heads in 3D from small sets of in-the-wild images.

  • Our method is the first to use implicit functions for reconstructing 3D human heads from multiple images, and also the first to rival parametric and non-parametric models in 3D accuracy at the same time.

  • We devise a guided optimization approach to introduce a probabilistic shape prior into neural implicit models.

  • We collect and will release a new dataset containing high-resolution 3D full head scans, images, masks and camera poses for evaluation purposes, which we dub H3DS.

2 Related work

Model-based. 3D Morphable Models (3DMMs) [5, 6, 20, 27, 29, 44, 45, 46] have become the de facto representation used for few-shot 3D face reconstruction in-the-wild given that they lead to light-weight, fast and robust systems. Adopting 3DMMs as a representation, the 3D reconstruction problem boils down to estimating the small set of parameters that best represent a target 3D shape. This makes it possible to obtain 3D reconstructions from very few images [3, 11, 34, 50] and even a single input [35, 36, 42, 43, 47]. Nevertheless, one of the main limitations of morphable models is their lack of expressiveness, especially for high frequencies. This issue has been addressed by learning a post-processing step that transfers the fine details from the image domain to the 3D geometry [21, 36, 43]. Another limitation of 3DMMs is their inability to represent arbitrary shapes and topologies. Thus, they are not suitable for reconstructing full heads with hair, beard, facial accessories and upper body clothing.

Model-free. Model-free approaches build upon more generic representations, such as voxel-grids or meshes, in order to gain expressiveness and flexibility. Voxel-grids have been extensively used for 3D reconstruction [10, 14, 16, 18, 51] and concretely for 3D face reconstruction [16]. Their main limitation is that memory requirements grow cubically with resolution, and octrees [14] have been proposed to address this issue. On the other hand, meshes [9, 17, 21, 48] are a more efficient representation for surfaces than voxel-grids, and are suitable for graphics applications. Meshes have been proposed for 3D face reconstruction [9, 21] in combination with graph neural networks [7]. However, similarly to 3DMMs, meshes are also usually restricted to fixed topologies and are not suitable for reconstructing other elements beyond the face itself.

Implicit representations. Recently, implicit representations [22, 23, 26] have been proposed to jointly address the memory limitations of voxel grids and the topological rigidity of meshes. Implicit representations model surfaces as a level-set of a coordinate-based continuous function, such as a signed distance function or an occupancy function. These functions, usually implemented as multi-layer perceptrons (MLPs), can theoretically express any shape with infinite resolution and a fixed memory footprint. Implicit methods can be divided into those that, at inference time, perform a single forward pass of a previously trained model [8, 23, 37], and those that overfit a model to a set of input images through an optimization process using implicit differentiable rendering [25, 52]. In the latter, given that the inference is an optimization process, the obtained 3D reconstructions are more accurate. However, they are slow and require a large number of multi-view images, failing in few-shot setups such as those we consider in this work.

Figure 1: Overview of our method. Left. The two configurations of the prior model at training and inference phases. Right. Integration of the pre-trained prior model with the implicit differentiable renderer. During the prior-aided 3D reconstruction process, the geometry network starts off with frozen weights (switch at position A), constraining the predicted shape to lie within its pre-learnt latent space, and is eventually unfrozen (switch at position B) to allow fine-tuning of the fine details.

Priors for implicit representations. Building priors for implicit representations has been addressed with two main purposes. The first consists in speeding up convergence of methods that perform an optimization at inference time [39] using meta-learning techniques [12]. The second is to find a space of implicit functions that represent a certain category using auto-decoders [26, 53]. However,  [26, 53] have been used to solve tasks using 3D supervision, and it is still an open problem how to use these priors when the supervision signal is generated from 2D images.

As done with morphable models, implicit shape models can be used to constrain image-based 3D reconstruction systems to make them more reliable. Drawing inspiration from this idea, in this work we leverage implicit shape models [26] to guide the optimization-based implicit 3D reconstruction method [52] towards more accurate and robust solutions, even under few-shot in-the-wild scenarios.

3 Method

Given a small set of input images $I_v$, $v = 1, \dots, V$, with associated head masks $M_v$ and camera parameters $C_v$, our goal is to recover the 3D head surface $\mathcal{S}$ using only visual cues as supervision. Formally, we aim to approximate the signed distance function (SDF) $f^{sdf}: \mathbb{R}^3 \to \mathbb{R}$ such that $\mathcal{S} = \{\mathbf{x} \in \mathbb{R}^3 \mid f^{sdf}(\mathbf{x}) = 0\}$.

In order to approximate $f^{sdf}$, we propose to optimize a previously learnt probabilistic model $f_{\theta_0}(\mathbf{x}, \mathbf{z})$ that represents a prior distribution over 3D head SDFs. $\mathbf{z}$ and $\theta_0$ are a latent vector encoding specific shapes and the learnt parameters of an auto-decoder [40], respectively. Building on DeepSDF [26], we learn these parameters from thousands of incomplete scans. We describe this process in section 3.1.

At test time, the reconstruction process is reduced to finding the optimal parameters $\{\mathbf{z}^*, \theta^*\}$ such that $f_{\theta^*}(\cdot, \mathbf{z}^*) \approx f^{sdf}$. To that end, we compose the prior model $f_\theta$, which we also refer to as geometry network, with a rendering network $g_\phi$ that models the RGB radiance emitted from a surface point $\mathbf{x}_s$ with normal $\mathbf{n}$ in a viewing direction $\mathbf{d}$, and minimize a photometric error w.r.t. the input images $I_v$, as in [52]. Moreover, we propose a two-step optimization schedule that prevents the reconstruction process from getting trapped in local minima and, as we shall see in the results section, leads to much more accurate, robust and realistic reconstructions. We describe the reconstruction step in section 3.2.

3.1 Learning a prior for human head SDFs

Given a set of scenes with associated raw 3D point clouds, we use the DeepSDF framework to learn a prior distribution of signed distance functions representing 3D heads, $f_{\theta_0}$. While the original DeepSDF formulation requires watertight meshes as training data to use signed distances as supervision, we use the Eikonal loss [13] to learn directly from raw, and potentially incomplete, surface point clouds. In addition, Fourier features are used to overcome the spectral bias of MLPs towards low frequencies in low dimensional tasks [41]. We illustrate the training and inference process of the prior model in figure 1-left.
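As an illustration of the Fourier features mentioned above, the following is a minimal NumPy sketch of a positional encoding with 6 log-linearly spaced frequencies, the number reported in the implementation details. The exact form is an assumption: the frequencies $2^k$, the absence of a $\pi$ factor, and the choice to keep the raw coordinates in the output are common conventions and are not confirmed by the paper. The function name `positional_encoding` is hypothetical.

```python
import numpy as np

def positional_encoding(x, num_freqs=6, include_input=True):
    """Map coordinates to [x, sin(2^k x), cos(2^k x)] for k = 0..num_freqs-1.

    Assumed convention (not taken from the paper): frequencies 1, 2, 4, ...
    with no pi factor, and the raw input kept as the first features.
    """
    x = np.asarray(x, dtype=np.float64)
    freqs = 2.0 ** np.arange(num_freqs)           # log-linearly spaced frequencies
    scaled = x[..., None, :] * freqs[:, None]     # shape (..., num_freqs, dim)
    feats = np.concatenate([np.sin(scaled), np.cos(scaled)], axis=-1)
    feats = feats.reshape(*x.shape[:-1], -1)      # shape (..., num_freqs * 2 * dim)
    return np.concatenate([x, feats], axis=-1) if include_input else feats

points = np.random.default_rng(0).uniform(-1.0, 1.0, size=(5, 3))
encoded = positional_encoding(points)             # 3 + 3 * 2 * 6 = 39 features per point
```

Under these assumptions each 3D point becomes a 39-dimensional vector, which would then be concatenated with the latent vector before entering the decoder.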

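For each scene, indexed by $i$, we sample a subset of points $\mathcal{P}_s^{(i)}$ on the surface, and another set $\mathcal{P}_v^{(i)}$ uniformly taken from a volume containing the scene, and minimize the following objective:

$$\underset{\theta,\,\{\mathbf{z}^{(i)}\}}{\arg\min} \; \sum_i \mathcal{L}_{Surf}^{(i)} + \lambda_0 \mathcal{L}_{Emb}^{(i)} + \lambda_1 \mathcal{L}_{Eik}^{(i)} \tag{1}$$

where $\lambda_0$ and $\lambda_1$ are hyperparameters and $\mathcal{L}_{Surf}^{(i)}$ accounts for the SDF error at surface points:

$$\mathcal{L}_{Surf}^{(i)} = \sum_{\mathbf{x} \in \mathcal{P}_s^{(i)}} \left| f_\theta(\mathbf{x}, \mathbf{z}^{(i)}) \right| \tag{2}$$

$\mathcal{L}_{Emb}^{(i)}$ enforces a zero-mean multivariate-Gaussian distribution with spherical covariance $\sigma^2$ over the space of latent vectors:

$$\mathcal{L}_{Emb}^{(i)} = \frac{1}{\sigma^2} \left\| \mathbf{z}^{(i)} \right\|_2^2 \tag{3}$$

Finally, $\mathcal{L}_{Eik}^{(i)}$ regularizes $f_\theta$ with the Eikonal loss to ensure that it approximates a signed distance function by keeping its gradients close to unit norm:

$$\mathcal{L}_{Eik}^{(i)} = \sum_{\mathbf{x} \in \mathcal{P}_v^{(i)}} \left( \left\| \nabla_{\mathbf{x}} f_\theta(\mathbf{x}, \mathbf{z}^{(i)}) \right\|_2 - 1 \right)^2 \tag{4}$$

This regularization across the whole volume is necessary given that our meshes are not watertight and only a subset of surface points is available as ground truth [13].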
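The three terms of the prior training objective can be sketched numerically as follows. This is a toy illustration, not the training code: the network is replaced by an analytic sphere SDF, gradients are taken by finite differences, and the weights `lambda_0`, `lambda_1` and the value of `sigma` are hypothetical placeholders rather than values from the paper.

```python
import numpy as np

def sphere_sdf(x, radius=1.0):
    """Analytic stand-in for the geometry network: signed distance to a sphere."""
    return np.linalg.norm(x, axis=-1) - radius

def sdf_gradient(sdf, x, eps=1e-4):
    """Central finite differences of sdf at points x of shape (N, 3)."""
    grads = np.zeros_like(x)
    for axis in range(3):
        step = np.zeros(3)
        step[axis] = eps
        grads[:, axis] = (sdf(x + step) - sdf(x - step)) / (2 * eps)
    return grads

def surface_loss(sdf, surface_pts):
    """Absolute SDF value at points known to lie on the surface."""
    return np.abs(sdf(surface_pts)).sum()

def embedding_loss(z, sigma=1.0):
    """Squared norm of the latent code scaled by 1 / sigma^2."""
    return float(z @ z) / sigma**2

def eikonal_loss(sdf, volume_pts):
    """Penalty on gradient norms that deviate from one."""
    grad_norm = np.linalg.norm(sdf_gradient(sdf, volume_pts), axis=-1)
    return ((grad_norm - 1.0) ** 2).sum()

rng = np.random.default_rng(0)
directions = rng.normal(size=(128, 3))
surface_pts = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
volume_pts = rng.uniform(-1.5, 1.5, size=(256, 3))
# Drop samples near the origin, where the sphere SDF is not differentiable.
volume_pts = volume_pts[np.linalg.norm(volume_pts, axis=-1) > 0.1]
z = rng.normal(scale=0.01, size=256)

lambda_0, lambda_1 = 1e-3, 0.1   # hypothetical weights, not taken from the paper
objective = (surface_loss(sphere_sdf, surface_pts)
             + lambda_0 * embedding_loss(z)
             + lambda_1 * eikonal_loss(sphere_sdf, volume_pts))
```

Because the sphere function is a true SDF, both the surface and Eikonal terms are close to zero here, while a rescaled function such as `2 * sphere_sdf` keeps the same zero level set but is penalized by the Eikonal term.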

After training, we have obtained the parameters $\theta_0$ that represent a space of human head SDFs. We can now draw signed distance functions of heads from $f_{\theta_0}$ by sampling the latent space $\mathbf{z}$. We use this pre-trained model as the prior for the 3D reconstruction schedule described in the following section.

3.2 Prior-aided 3D Reconstruction

Given a new scene, for which no 3D information is provided at this point, we aim to approximate the SDF $f^{sdf}$ that implicitly encodes the surface of the head by only supervising in the image domain. To that end, we compose the previously learnt geometry probabilistic model $f_{\theta_0}$ with the rendering network $g_\phi$, and supervise on the photometric error to find the optimal parameters $\mathbf{z}^*$, $\theta^*$ and $\phi^*$. The reconstruction process is illustrated in figure 1-right.

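For every pixel coordinate $p$ of each input image $I_v$, we march a ray $\mathbf{r} = \{\mathbf{c}_v + t\,\mathbf{d} \mid t \geq 0\}$, where $\mathbf{c}_v$ is the position of the associated camera $C_v$, and $\mathbf{d}$ the viewing direction. The intersection point $\mathbf{x}_s$ between the ray and the surface $\{\mathbf{x} \mid f_\theta(\mathbf{x}, \mathbf{z}) = 0\}$ can be efficiently found using sphere tracing [15]. This intersection point can be made differentiable w.r.t. $\mathbf{z}$ and $\theta$ without having to store the gradients corresponding to all the forward passes of the geometry network, as shown in [25] and generalized by [52]. The following expression is exact in value and first derivatives:

$$\hat{\mathbf{x}}_s = \mathbf{x}_s - \frac{\mathbf{d}}{\nabla_{\mathbf{x}} f_{\theta^k}(\mathbf{x}_s, \mathbf{z}^k) \cdot \mathbf{d}} \, f_\theta(\mathbf{x}_s, \mathbf{z}) \tag{5}$$

Here $\theta^k$ and $\mathbf{z}^k$ denote the parameters of $f$ at iteration $k$, and $\hat{\mathbf{x}}_s$ represents the intersection point made differentiable w.r.t. the geometry network parameters.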
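The two steps above, sphere tracing followed by the first-order differentiable intersection of equation 5, can be checked on a toy problem. The sketch below is illustrative only: the geometry network is replaced by a sphere whose radius plays the role of the parameters, and the camera position and ray direction are hypothetical. It compares the derivative predicted by the first-order expression with the one obtained by re-tracing the ray at perturbed radii.

```python
import numpy as np

def sdf(x, radius):
    """Toy geometry: a sphere whose radius stands in for the network parameters."""
    return np.linalg.norm(x) - radius

def sdf_grad(x):
    """Spatial gradient of the sphere SDF (independent of the radius)."""
    return x / np.linalg.norm(x)

def sphere_trace(origin, direction, radius, max_steps=100, tol=1e-12):
    """March along the ray by the SDF value until the surface is reached."""
    t = 0.0
    for _ in range(max_steps):
        dist = sdf(origin + t * direction, radius)
        if dist < tol:
            return origin + t * direction
        t += dist                      # safe step: no surface is closer than dist
    return None                        # the ray did not reach the surface

def differentiable_hit(x_s, direction, radius):
    """First-order intersection, with x_s and the gradient held at the current iterate."""
    denom = sdf_grad(x_s) @ direction
    return x_s - direction / denom * sdf(x_s, radius)

origin = np.array([0.3, 0.0, -3.0])    # hypothetical camera position
direction = np.array([0.0, 0.0, 1.0])  # unit viewing direction
radius_k, h = 1.0, 1e-4                # current parameter value and test step

x_s = sphere_trace(origin, direction, radius_k)
grad_formula = (differentiable_hit(x_s, direction, radius_k + h)
                - differentiable_hit(x_s, direction, radius_k - h)) / (2 * h)
grad_retraced = (sphere_trace(origin, direction, radius_k + h)
                 - sphere_trace(origin, direction, radius_k - h)) / (2 * h)
```

In this toy setting `grad_formula` and `grad_retraced` agree closely, which is the property that lets the optimization avoid back-propagating through the tracing iterations.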

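Next, we evaluate the mapping $g_\phi$ at $\hat{\mathbf{x}}_s$, $\mathbf{n}$ and $\mathbf{d}$ to estimate the color $\hat{I}_v(p)$ for the pixel $p$ in the image $I_v$:

$$\hat{I}_v(p) = g_\phi(\hat{\mathbf{x}}_s, \mathbf{n}, \mathbf{d}) \tag{6}$$

Finally, in order to optimize the surface parameters $\theta$ and $\mathbf{z}$, and the rendering parameters $\phi$, we minimize the following loss [52]:

$$\mathcal{L} = \mathcal{L}_{RGB} + \lambda_2 \mathcal{L}_{Mask} + \lambda_3 \mathcal{L}_{Eik} \tag{7}$$

where $\lambda_2$ and $\lambda_3$ are hyperparameters. We next describe each component of this loss. Let $P$ be a mini-batch of pixels from view $v$, $P_{RGB} \subseteq P$ the subset of pixels whose associated ray intersects $\mathcal{S}$ and which have a nonzero mask value, and $P_{Mask} = P \setminus P_{RGB}$. $\mathcal{L}_{RGB}$ is the photometric error, computed as:

$$\mathcal{L}_{RGB} = \frac{1}{|P|} \sum_{p \in P_{RGB}} \left| I_v(p) - \hat{I}_v(p) \right| \tag{8}$$

$\mathcal{L}_{Mask}$ accounts for silhouette errors:

$$\mathcal{L}_{Mask} = \frac{1}{\alpha |P|} \sum_{p \in P_{Mask}} \mathrm{CE}\left( M_v(p), s_{v,\alpha}(p) \right) \tag{9}$$

where $s_{v,\alpha}(p) = \mathrm{sigmoid}\left( -\alpha \min_{t \geq 0} f_\theta(\mathbf{c}_v + t\,\mathbf{d}, \mathbf{z}) \right)$ is the estimated silhouette, $\mathrm{CE}$ is the binary cross-entropy and $\alpha$ is a hyperparameter. Lastly, $\mathcal{L}_{Eik}$ encourages $f_\theta$ to approximate a signed distance function as in equation 4.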
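To make the photometric and silhouette terms concrete, here is a small NumPy sketch on a hand-made batch of four pixels. It assumes the silhouette is a sigmoid of the negated, scaled minimum SDF value along the ray, following the formulation of [52]; the value `alpha=50.0`, the pixel values and all function names are hypothetical.

```python
import numpy as np

def rgb_loss(pred_rgb, true_rgb, in_rgb_set):
    """L1 color error over pixels whose ray hits the surface inside the mask,
    normalized by the full mini-batch size."""
    diff = np.abs(pred_rgb[in_rgb_set] - true_rgb[in_rgb_set])
    return diff.sum() / len(pred_rgb)

def mask_loss(min_sdf, true_mask, in_rgb_set, alpha=50.0):
    """Binary cross-entropy between the mask and a soft silhouette, evaluated
    on the remaining pixels of the mini-batch."""
    rest = ~in_rgb_set
    silhouette = 1.0 / (1.0 + np.exp(alpha * min_sdf[rest]))   # sigmoid(-alpha * min_sdf)
    silhouette = np.clip(silhouette, 1e-7, 1.0 - 1e-7)         # keep the logs finite
    target = true_mask[rest].astype(np.float64)
    bce = -(target * np.log(silhouette) + (1.0 - target) * np.log(1.0 - silhouette))
    return bce.sum() / (alpha * len(min_sdf))

true_rgb = np.full((4, 3), 0.5)
pred_rgb = true_rgb + 0.1
in_rgb_set = np.array([True, True, False, False])   # pixels with a hit and nonzero mask
true_mask = np.array([1, 1, 0, 1])
min_sdf = np.array([0.0, 0.0, 0.2, 0.01])           # smallest SDF value along each ray

photometric = rgb_loss(pred_rgb, true_rgb, in_rgb_set)
silhouette_term = mask_loss(min_sdf, true_mask, in_rgb_set)
```

The last pixel lies inside the mask but its ray misses the surface, so it dominates the silhouette term; pushing the surface towards that ray (a smaller `min_sdf`) lowers the loss.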

Instead of jointly optimizing all the parameters to minimize $\mathcal{L}$, we introduce a two-step optimization schedule which is more appropriate for auto-decoders like DeepSDF. We begin by initializing the geometry network with the previously learnt prior for human head SDFs, $f_{\theta_0}$, and a randomly sampled $\mathbf{z}$ such that $\|\mathbf{z}\| \approx 0$ to stay near the mean of the latent space. In a first phase, we only optimize $\mathbf{z}$ and $\phi$ as $\arg\min_{\mathbf{z}, \phi} \mathcal{L}$, which is equivalent to the standard auto-decoder inference. By doing so, the resulting surface is forced to stay within the learnt distribution of 3D heads. Once the geometry and the radiance mappings have reached an equilibrium, that is, once the optimization has converged, we unfreeze the decoder parameters $\theta$ to fine-tune the whole model as $\arg\min_{\mathbf{z}, \theta, \phi} \mathcal{L}$.
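The schedule can be summarized as a rule for which parameter groups receive updates at each epoch. The sketch below is a minimal illustration with scalar stand-ins for the parameters. The group names and the fixed `unfreeze_epoch` are hypothetical: the text switches phases once the first phase has converged and does not state a specific epoch.

```python
def trainable_groups(epoch, unfreeze_epoch):
    """Parameter groups updated at a given epoch of the two-step schedule."""
    groups = ["latent_z", "renderer_phi"]      # phase 1: decoder weights frozen
    if epoch >= unfreeze_epoch:
        groups.append("decoder_theta")         # phase 2: fine-tune the whole model
    return groups

def gradient_step(params, grads, epoch, unfreeze_epoch, lr=0.1):
    """Plain gradient descent that skips frozen groups."""
    active = trainable_groups(epoch, unfreeze_epoch)
    return {name: value - lr * grads[name] if name in active else value
            for name, value in params.items()}

# Scalar placeholders for the latent code, renderer weights and decoder weights.
params = {"latent_z": 1.0, "renderer_phi": 1.0, "decoder_theta": 1.0}
grads = {name: 1.0 for name in params}

phase_one = gradient_step(params, grads, epoch=0, unfreeze_epoch=10)
phase_two = gradient_step(params, grads, epoch=10, unfreeze_epoch=10)
```

In a deep learning framework the same effect is usually obtained by toggling whether the decoder parameters require gradients, which is an implementation choice rather than something specified in the text.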

In section 5, we empirically show that by using this optimization schedule instead of optimizing all the parameters at once, the obtained 3D reconstructions are more accurate and less prone to artifacts, especially in few-shot setups.

4 Implementation details

Figure 2: Ablation study of our method in the few-shot setup (3 views). From left to right: (a) Ours with geometric initialization [2] and no schedule, (b) Ours with small prior initialization (500 subjects) and no schedule, (c) Ours with large prior initialization (10,000 subjects) and no schedule, (d) Ours with large prior initialization and schedule and (e) ground truth.

| Metric | (a) | (b) | (c) | (d) |
|---|---|---|---|---|
| Face mean distance [mm] | 4.04 | 2.68 | 1.90 | 1.49 |
| Full-head mean distance [mm] | 16.68 | 17.08 | 14.59 | 12.76 |

Table 1: Ablation study of our method in the few-shot setup (3 views). The face and full-head mean distances are the averages over all the subjects in the H3DS dataset. The configurations (a), (b), (c) and (d) are the same as those described in figure 2.

Our implementation of the prior model closely follows the one proposed in [13], with the addition that we apply a positional encoding to the input coordinates with 6 log-linear spaced frequencies. The encoded 3D coordinates are concatenated with the latent vector of size 256 and set as the input to the decoder. The decoder is an MLP of 8 layers with 512 neurons in each layer and a single skip connection from the input of the decoder to the output of the 4th layer. We use Softplus as the activation function in every layer except the last, where no activation is used. The prior model is trained for 100 epochs using Adam [19] with standard parameters and a learning rate step decay of 0.5 every 15 epochs. The training takes approximately 50 minutes for a small dataset (500 scenes) and 10 hours for a large one (10,000 scenes).

Figure 3: Latent interpolation between different subjects, with $\alpha$ a linear interpolation factor in $\mathbf{z}$ space. (a) uses the small prior model (500 subjects), and (b) the large prior (10,000 subjects).

The 3D reconstruction network is composed of the prior model described above and a rendering mapping that is split into two sub-networks, as shown in figure 1. The first is an MLP implemented exactly as the decoder of the prior model, except for the input layer, which takes in a 3-dimensional vector, and the output layer, which outputs a feature vector. As in [52], the second is a smaller MLP composed of 4 layers, each 512 neurons wide, with no skip connections and ReLU activations except in the output layer, which uses tanh. We also apply positional encodings to the point and viewing-direction inputs, with 6 and 4 log-linear spaced frequencies respectively. Each scene is trained for 2000 epochs using Adam with a learning rate step decay of 0.5 at epochs 1000 and 1500. The scene reconstruction process takes approximately 25 minutes for scenes of 3 views and 4 hours and 15 minutes for scenes of 32.
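The two learning rate schedules described in this section, a decay of 0.5 every 15 epochs for the prior and a decay of 0.5 at epochs 1000 and 1500 for scene reconstruction, can be written as small functions. The sketch below is illustrative; `base_lr` is a hypothetical placeholder because the initial rates are not available in this text.

```python
def step_decay_lr(epoch, base_lr, decay=0.5, every=15):
    """Prior training: multiply the rate by `decay` every `every` epochs."""
    return base_lr * decay ** (epoch // every)

def milestone_decay_lr(epoch, base_lr, decay=0.5, milestones=(1000, 1500)):
    """Scene reconstruction: multiply the rate by `decay` at each milestone epoch."""
    passed = sum(epoch >= m for m in milestones)
    return base_lr * decay ** passed

base_lr = 1e-3   # hypothetical starting rate, not taken from the paper
prior_rates = [step_decay_lr(e, base_lr) for e in (0, 14, 15, 99)]
scene_rates = [milestone_decay_lr(e, base_lr) for e in (0, 999, 1000, 1500, 1999)]
```

Over the 100 epochs of prior training the rate is halved six times, whereas the scene schedule only halves it twice over 2000 epochs.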

All the experiments for both prior and reconstruction models have been performed using a single Nvidia RTX 2080Ti.

5 Experiments

In this section, we evaluate our multi-view 3D reconstruction method quantitatively and qualitatively. We empirically demonstrate that the proposed solution surpasses the state of the art in the few-shot [3, 50] and many-shot [52] scenarios for 3D face and head reconstruction in-the-wild.

| Method | 3DFAW, 3 views, face | H3DS, 3 views, face | H3DS, 3 views, head | H3DS, 4 views, face | H3DS, 4 views, head | H3DS, 8 views, face | H3DS, 8 views, head | H3DS, 16 views, face | H3DS, 16 views, head | H3DS, 32 views, face | H3DS, 32 views, head |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MVFNet [50] | 1.54 | 1.66 | - | - | - | - | - | - | - | - | - |
| DFNRMVS [3] | 1.53 | 1.83 | - | - | - | - | - | - | - | - | - |
| IDR [52] | 3.92 | 3.52 | 17.04 | 2.14 | 8.04 | 1.95 | 8.71 | 1.43 | 5.94 | 1.39 | 5.86 |
| H3D-Net (Ours) | 1.37 | 1.49 | 12.76 | 1.65 | 7.95 | 1.38 | 5.47 | 1.24 | 4.80 | 1.21 | 4.90 |

Table 2: 3D reconstruction method comparison. Average surface error in millimeters computed over all the subjects in the 3DFAW and H3DS datasets. The precise definition of the face/head metrics, as well as a description of the distribution of the views, is given in section 5.2.

Figure 4: 3D reconstruction convergence comparison between H3D-Net and IDR [52] using 32 views. Metrics are computed over all the samples in the H3DS dataset. The dotted lines indicate the time when our method first surpasses the best mean error attained by IDR over the entire optimization. Top. Mean surface error in the face. Bottom. Mean surface error in the full head.

5.1 Datasets

Prior training. In order to train the geometry prior, we use an internal dataset made of 3D head scans from 10,000 individuals. The dataset is perfectly balanced in gender and diverse in age and ethnicity. The raw data is automatically processed to remove internal mesh faces and non-human parts such as background walls. Finally, all the scenes are aligned by registering a template 3D model with non-rigid Iterative Closest Point (ICP) [1].

3DFAW [28]. We evaluate our method on the 3DFAW dataset. This dataset provides videos recorded in front of, and around, the head of a person in a static position, as well as mid-resolution 3D ground truth of the facial region. We select 5 male and 5 female scenes and use them to evaluate only the facial region.

H3DS. We introduce a new dataset called H3DS, the first dataset containing high resolution full head 3D textured scans and 360° images with associated ground truth camera poses and ground truth masks. The 3D geometry has been captured using a structured light scanner, which leads to more precise ground truth geometries than the ones from 3DFAW [28], which were generated using Multi-View Stereo (MVS). The dataset consists of 10 individuals, 50% men and 50% women. We use this dataset to evaluate the accuracy of the different methods in both the full head and the facial regions. We plan to release, maintain and progressively grow the H3DS dataset.

5.2 Experiments setup

We use the 3DMM-based methods MVFNet [50] and DFNRMVS [3], and the model-free method IDR [52] as baselines to compare against H3D-Net.

In the few-shot scenario (3 views), all the methods are evaluated on the 3DFAW and H3DS datasets. To benchmark our method when more than 3 views are available, we compare it against IDR on the H3DS dataset.

Figure 5: Qualitative results obtained for 4 subjects from the 3DFAW dataset [28] with only three input views. First and third rows show the reconstructed geometry and second and fourth rows show the surface error with the color code being in millimeters.

The evaluation criteria have been the same for all methods and in all the experiments. The predicted 3D reconstruction is roughly aligned with the ground truth mesh using manually annotated landmarks, and then refined with rigid ICP [4]. Then, we compute the unidirectional Chamfer distance from the predicted reconstruction to the ground truth. All the distances are computed in millimeters.

We report metrics in two different regions, the face and the full head. For the finer evaluation in the face region, we cut both the reconstructions and the ground truth using a sphere of 95 mm radius and with center at the tip of the nose of the ground truth mesh, and refine the alignment with ICP as in [28, 47]. Then, we compute the Chamfer distance in this sub-region. For the full head evaluation, the ICP alignment is performed using an annotated region that includes the face, the ears, and the neck, since it is a region visible in all view configurations (3, 4, 8, 16 and 32). These configurations are defined by their yaw angles as follow: , and for . In this case, the Chamfer distance is computed for all the vertices of the reconstruction.
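The metric described above can be sketched as follows: crop both point sets with a 95 mm sphere around the nose tip for the face region, then average the distance from each predicted point to its nearest ground-truth point. This is a simplified illustration under stated assumptions: the landmark and ICP alignment steps are omitted, the ground truth is treated as a point set rather than a mesh surface, and the sample coordinates and function names are hypothetical.

```python
import numpy as np

def crop_sphere(points, center, radius_mm=95.0):
    """Keep the points that lie within radius_mm of center."""
    return points[np.linalg.norm(points - center, axis=-1) <= radius_mm]

def unidirectional_chamfer(pred_pts, gt_pts):
    """Mean distance from each predicted point to its nearest ground-truth point."""
    diffs = pred_pts[:, None, :] - gt_pts[None, :, :]      # (N_pred, N_gt, 3)
    return np.linalg.norm(diffs, axis=-1).min(axis=1).mean()

nose_tip = np.zeros(3)                                     # hypothetical landmark, in mm
gt = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [200.0, 0.0, 0.0]])
pred = np.array([[0.0, 3.0, 4.0], [10.0, 0.0, 1.0], [150.0, 0.0, 0.0]])

face_error = unidirectional_chamfer(crop_sphere(pred, nose_tip), crop_sphere(gt, nose_tip))
head_error = unidirectional_chamfer(pred, gt)
```

The brute-force pairwise distance matrix is fine for a small example; for dense scans a spatial index such as a k-d tree would be the usual choice.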

5.3 Ablation study

We conduct an ablation study on the H3DS dataset in the few-shot scenario (3 views) and show the numerical results in table 1, and the qualitative results in figure 2. First, we reconstruct the scenes without prior and without schedule (a). In this case, the geometry network is initialized using geometric initialization [2], representing a sphere of radius one at the beginning of the optimization. Then, we initialize the geometry network with two different priors, a small one trained on 500 subjects (b), and a large one trained on 10,000 subjects (c), and perform the reconstructions without schedule. As can be observed, initializing the geometry network with a previously learnt prior leads to smoother and more plausible surfaces, especially when more subjects have been used to train it. It is important to note that the benefits of the initialization are not only due to a better initial shape, but also to the ability of the initial weights to generalize to unseen shapes, which is greater in the large prior model. Finally, we initialize the geometry network with the large prior and use the proposed optimization schedule during the reconstruction process (d). It can be observed that the resulting 3D heads resemble the ground truth much more closely in terms of finer facial details.

Figure 6: Qualitative results obtained for 2 subjects from the H3DS dataset when varying the number of views. The first and second rows correspond to the results of IDR and the third and fourth to the H3D-Net results (ours). The surface error is represented with the color code in millimeters.

Given the notable effect that the number of samples has on the learnt prior representations and on the resulting 3D reconstructions as well, we visualize latent space interpolations in figure 3. To that end, we optimize the latent vector for two ground truth 3D scans, as shown in figure 1-left-bottom, in order to minimize the loss in equation 1. Then, we interpolate between the two optimal latent vectors. As can be observed, the 3D reconstructions resulting from the interpolation in the latent space of the large prior model are more detailed and plausible than the ones from the small prior model, suggesting that the latter achieves poorer generalization.
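The interpolation itself is a convex combination of the two optimized latent vectors. A minimal sketch, assuming latent codes of size 256 as stated in the implementation details; the number of steps and the placeholder vectors are hypothetical, and decoding each interpolated code into a surface is left out.

```python
import numpy as np

def interpolate_latents(z_a, z_b, num_steps=5):
    """Linearly interpolate between two latent codes, endpoints included."""
    alphas = np.linspace(0.0, 1.0, num_steps)
    return np.stack([(1.0 - a) * z_a + a * z_b for a in alphas])

z_a = np.zeros(256)          # placeholder for the code fitted to the first scan
z_b = np.ones(256)           # placeholder for the code fitted to the second scan
path = interpolate_latents(z_a, z_b)
```

Each row of `path` would be passed to the frozen decoder to extract one intermediate head shape.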

5.4 Quantitative results

Quantitative results in terms of surface error are reported in table 2. Remarkably, H3D-Net outperforms both 3DMM-based methods in the few-shot regime, and the model-free method IDR when the largest number of views (32) are available. It is worth noting how the enhancement due to the prior is more significant as the number of views decreases. Nevertheless, the prior does not prevent the model from becoming more accurate when more views are available, which is a current limitation of model-based approaches.

We also analyze the trade-off between the optimization time and the accuracy in IDR and H3D-Net for the case of 32 views, which we illustrate in figure 4. It can be observed that, despite reaching similar errors asymptotically, on average our method achieves the best performance attained by IDR much faster. In particular, we report convergence gains of 20× for the facial region error and 4× for the full head. Moreover, the smaller variance observed in H3D-Net (blue) indicates that it is a more stable method.

5.5 Qualitative results

Quantitative results show improvements over the baselines in both few-shot and many-shot setups. Here, we study how this is translated into the reconstructed 3D shape.

In figure 5, we qualitatively evaluate the three baselines and H3D-Net for the case of 3 input views. As expected, IDR [52] is the worst performing model in this scenario, generating reconstructions with artifacts and with no resemblance to human faces. On the other hand, 3DMM-based models [3, 50] achieve more plausible shapes, but they struggle to capture fine details. H3D-Net, in contrast, is able to capture much more detail and significantly reduce the errors over the whole face and, in particular, in difficult areas such as the nose, the eyebrows, the cheeks, and the chin.

We also evaluate the impact that varying the number of available views has on the reconstructed surface, and compare our method to IDR [52]. As shown in figure 6, H3D-Net is able to obtain surfaces with less error (greener) with far fewer views, which is consistent with the quantitative results reported in table 2. Notably, it can also be observed that, even when errors are numerically similar (first and third columns), the reconstructions from H3D-Net are much more realistic. In addition, H3D-Net improvements are especially notable within the face region. We attribute this to the fact that the training data used to build the prior model is richer in this area, whereas training examples frequently present holes in other parts of the head.

6 Conclusions

In this work we have presented H3D-Net, a method for high-fidelity 3D head reconstruction from small sets of in-the-wild images with associated head masks and camera poses. Our method combines a pre-trained probabilistic model, which represents a distribution of head SDFs, with an implicit differentiable renderer that allows direct supervision in the image domain. By constraining the reconstruction process with the prior model, we are able to robustly recover detailed 3D human heads, including hair and shoulders, from only three input images. After a thorough quantitative and qualitative evaluation, our experiments show that our method outperforms both model-based methods in the few-shot setup and model-free methods when a large number of views are available. One limitation of our method is that it still requires several minutes to generate 3D reconstructions. An interesting direction for future work is to use more efficient representations for SDFs in order to speed up the optimization process. We also find it promising to introduce texture priors, which could reduce the convergence time and reduce the final overall error.

7 Acknowledgments

This work has been partially funded by the Spanish government with the projects MoHuCo PID2020-120049RB-I00, DeeLight PID2020-117142GB-I00 and Maria de Maeztu Seal of Excellence MDM-2016-0656, and by the Government of Catalonia under the industrial doctorate 2017 DI 028.

References

  • [1] B. Amberg, S. Romdhani, and T. Vetter (2007) Optimal step nonrigid icp algorithms for surface registration. In CVPR, Cited by: §5.1.
  • [2] M. Atzmon and Y. Lipman (2020) Sal: sign agnostic learning of shapes from raw data. In CVPR, Cited by: Figure 2, §5.3.
  • [3] Z. Bai, Z. Cui, J. A. Rahim, X. Liu, and P. Tan (2020) Deep facial non-rigid multi-view stereo. In CVPR, Cited by: §1, §2, §5.2, §5.5, Table 2, §5.
  • [4] P. J. Besl and N. D. McKay (1992) Method for registration of 3-d shapes. In Sensor fusion IV: control paradigms and data structures, Cited by: §5.2.
  • [5] J. Booth, E. Antonakos, S. Ploumpis, G. Trigeorgis, Y. Panagakis, and S. Zafeiriou (2017) 3D face morphable models "in-the-wild". In CVPR, Cited by: §2.
  • [6] J. Booth, A. Roussos, S. Zafeiriou, A. Ponniah, and D. Dunaway (2016) A 3d morphable model learnt from 10,000 faces. In CVPR, Cited by: §2.
  • [7] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst (2017) Geometric deep learning: going beyond euclidean data. IEEE Signal Processing Magazine 34 (4), pp. 18–42. Cited by: §2.
  • [8] Z. Chen and H. Zhang (2019) Learning implicit fields for generative shape modeling. In CVPR, Cited by: §1, §2.
  • [9] S. Cheng, G. Tzimiropoulos, J. Shen, and M. Pantic (2020) Faster, better and more detailed: 3d face reconstruction with graph convolutional networks. In ACCV, Cited by: §1, §2.
  • [10] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese (2016) 3d-r2n2: a unified approach for single and multi-view 3d object reconstruction. In ECCV, Cited by: §1, §2.
  • [11] P. Dou and I. A. Kakadiaris (2018) Multi-view 3d face reconstruction with deep recurrent neural networks. Image and Vision Computing 80, pp. 80–91. Cited by: §1, §2.
  • [12] C. Finn, P. Abbeel, and S. Levine (2017) Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, Cited by: §2.
  • [13] A. Gropp, L. Yariv, N. Haim, M. Atzmon, and Y. Lipman (2020) Implicit geometric regularization for learning shapes. arXiv preprint arXiv:2002.10099. Cited by: §3.1, §3.1, §4.
  • [14] C. Häne, S. Tulsiani, and J. Malik (2017) Hierarchical surface prediction for 3d object reconstruction. In 3DV, Cited by: §2.
  • [15] J. C. Hart (1996) Sphere tracing: a geometric method for the antialiased ray tracing of implicit surfaces. The Visual Computer 12 (10), pp. 527–545. Cited by: §3.2.
  • [16] A. S. Jackson, A. Bulat, V. Argyriou, and G. Tzimiropoulos (2017) Large pose 3d face reconstruction from a single image via direct volumetric cnn regression. In ICCV, Cited by: §1, §2.
  • [17] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik (2018) Learning category-specific mesh reconstruction from image collections. In ECCV, Cited by: §2.
  • [18] A. Kar, C. Häne, and J. Malik (2017) Learning a multi-view stereo machine. In NeurIPS, Cited by: §1, §2.
  • [19] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.
  • [20] T. Li, T. Bolkart, M. J. Black, H. Li, and J. Romero (2017) Learning a model of facial shape and expression from 4d scans. ACM Trans. Graph. 36 (6), pp. 194–1. Cited by: §2.
  • [21] J. Lin, Y. Yuan, T. Shao, and K. Zhou (2020) Towards high-fidelity 3d face reconstruction from in-the-wild images using graph convolutional networks. In CVPR, Cited by: §1, §2, §2.
  • [22] G. Littwin and L. Wolf (2019) Deep meta functionals for shape representation. In ICCV, Cited by: §2.
  • [23] L. Mescheder, M. Oechsle, M. Niemeyer, S. Nowozin, and A. Geiger (2019) Occupancy networks: learning 3d reconstruction in function space. In CVPR, Cited by: §1, §2.
  • [24] F. Moreno-Noguer and P. Fua (2013) Stochastic exploration of ambiguities for nonrigid shape recovery. 35 (2), pp. 463–475. Cited by: §1.
  • [25] M. Niemeyer, L. Mescheder, M. Oechsle, and A. Geiger (2020) Differentiable volumetric rendering: learning implicit 3d representations without 3d supervision. In CVPR, Cited by: §1, §2, §3.2.
  • [26] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove (2019) Deepsdf: learning continuous signed distance functions for shape representation. In CVPR, Cited by: §1, §1, §2, §2, §2, §3.
  • [27] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter (2009) A 3d face model for pose and illumination invariant face recognition. In AVSS, pp. 296–301. Cited by: §2.
  • [28] R. K. Pillai, L. A. Jeni, H. Yang, Z. Zhang, L. Yin, and J. F. Cohn (2019) The 2nd 3d face alignment in the wild challenge (3dfaw-video): dense reconstruction from video. In ICCV Workshops, Cited by: §1, Figure 5, §5.1, §5.1, §5.2.
  • [29] S. Ploumpis, H. Wang, N. Pears, W. A. Smith, and S. Zafeiriou (2019) Combining 3d morphable models: a large scale face-and-head model. In CVPR, Cited by: §2.
  • [30] A. Pumarola, A. Agudo, L. Porzi, A. Sanfeliu, V. Lepetit, and F. Moreno-Noguer (2018) Geometry-aware network for non-rigid shape prediction from a single view. In CVPR, Cited by: §1.
  • [31] A. Pumarola, E. Corona, G. Pons-Moll, and F. Moreno-Noguer (2021) D-nerf: neural radiance fields for dynamic scenes. In CVPR, Cited by: §1.
  • [32] A. Pumarola, S. Popov, F. Moreno-Noguer, and V. Ferrari (2020) C-flow: conditional generative flow models for images and 3d point clouds. In CVPR, Cited by: §1.
  • [33] A. Pumarola, J. Sanchez-Riera, G. Choi, A. Sanfeliu, and F. Moreno-Noguer (2019) 3dpeople: modeling the geometry of dressed humans. In ICCV, Cited by: §1.
  • [34] E. Ramon, J. Escur, and X. Giro-i-Nieto (2019) Multi-view 3d face reconstruction in the wild using siamese networks. In ICCV Workshops, Cited by: §1, §2.
  • [35] E. Richardson, M. Sela, and R. Kimmel (2016) 3D face reconstruction by learning from synthetic data. In 3DV, Cited by: §1, §2.
  • [36] E. Richardson, M. Sela, R. Or-El, and R. Kimmel (2017) Learning detailed face reconstruction from a single image. In CVPR, Cited by: §1, §2.
  • [37] S. Saito, Z. Huang, R. Natsume, S. Morishima, A. Kanazawa, and H. Li (2019) Pifu: pixel-aligned implicit function for high-resolution clothed human digitization. In ICCV, Cited by: §1, §2.
  • [38] M. Sela, E. Richardson, and R. Kimmel (2017) Unrestricted facial geometry reconstruction using image-to-image translation. In ICCV, Cited by: §1.
  • [39] V. Sitzmann, E. R. Chan, R. Tucker, N. Snavely, and G. Wetzstein (2020) Metasdf: meta-learning signed distance functions. arXiv preprint arXiv:2006.09662. Cited by: §2.
  • [40] S. Tan and M. L. Mayrovouniotis (1995) Reducing data dimensionality through optimizing neural network inputs. AIChE Journal 41 (6), pp. 1471–1480. Cited by: §3.
  • [41] M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng (2020) Fourier features let networks learn high frequency functions in low dimensional domains. arXiv preprint arXiv:2006.10739. Cited by: §3.1.
  • [42] A. Tewari, M. Zollhofer, H. Kim, P. Garrido, F. Bernard, P. Perez, and C. Theobalt (2017) Mofa: model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In ICCV, Cited by: §1, §2.
  • [43] A. T. Tran, T. Hassner, I. Masi, E. Paz, Y. Nirkin, and G. G. Medioni (2018) Extreme 3d face reconstruction: seeing through occlusions. In CVPR, Cited by: §1, §2.
  • [44] L. Tran, F. Liu, and X. Liu (2019) Towards high-fidelity nonlinear 3d face morphable model. In CVPR, Cited by: §2.
  • [45] L. Tran and X. Liu (2018) Nonlinear 3d face morphable model. In CVPR, Cited by: §2.
  • [46] L. Tran and X. Liu (2019) On learning 3d face morphable model from in-the-wild images. TPAMI 43 (1), pp. 157–171. Cited by: §2.
  • [47] A. Tuan Tran, T. Hassner, I. Masi, and G. Medioni (2017) Regressing robust and discriminative 3d morphable models with a very deep neural network. In CVPR, Cited by: §1, §2, §5.2.
  • [48] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y. Jiang (2018) Pixel2mesh: generating 3d mesh models from single rgb images. In ECCV, Cited by: §1, §2.
  • [49] H. Wei, S. Liang, and Y. Wei (2019) 3d dense face alignment via graph convolution networks. arXiv preprint arXiv:1904.05562. Cited by: §1.
  • [50] F. Wu, L. Bao, Y. Chen, Y. Ling, Y. Song, S. Li, K. N. Ngan, and W. Liu (2019) Mvf-net: multi-view 3d face morphable model regression. In CVPR, Cited by: §1, §2, §5.2, §5.5, Table 2, §5.
  • [51] X. Yan, J. Yang, E. Yumer, Y. Guo, and H. Lee (2016) Perspective transformer nets: learning single-view 3d object reconstruction without 3d supervision. In NeurIPS, Cited by: §1, §2.
  • [52] L. Yariv, Y. Kasten, D. Moran, M. Galun, M. Atzmon, B. Ronen, and Y. Lipman (2020) Multiview neural surface reconstruction by disentangling geometry and appearance. NeurIPS 33. Cited by: §1, §1, §1, §2, §2, §3.2, §3.2, §3, §4, Figure 4, §5.2, §5.5, §5.5, Table 2, §5.
  • [53] T. Yenamandra, A. Tewari, F. Bernard, H. Seidel, M. Elgharib, D. Cremers, and C. Theobalt (2020) I3DMM: deep implicit 3d morphable model of human heads. arXiv preprint arXiv:2011.14143. Cited by: §2.