1 Introduction
Recent learning-based methods have shown impressive results in reconstructing 3D shapes from 2D images. These approaches can be roughly split into two main categories: model-based [3, 11, 24, 30, 34, 35, 36, 42, 43, 47, 50] and model-free [9, 10, 16, 18, 21, 32, 33, 38, 48, 49, 51]. The former incorporate prior knowledge obtained from training data to limit the space of feasible solutions, making them well suited for few-shot and one-shot shape estimation. However, most model-based methods produce shapes that lack geometric detail and cannot handle arbitrary topology changes.
On the other hand, model-free approaches based on discrete representations like voxels, meshes or point clouds have the flexibility to represent a wider spectrum of shapes, although at the cost of being computationally tractable only at small resolutions or of being restricted to fixed topologies. These limitations have been overcome by neural implicit representations [8, 23, 26, 31, 37], which can represent both geometry and appearance as a continuum, encoded in the weights of a neural network. [25, 52] have shown the success of such representations in learning detail-rich 3D geometry directly from images, with no 3D ground-truth supervision. Unfortunately, the performance of these methods is currently conditioned on the availability of a large number of input views, which leads to a time-consuming inference.
In this work we introduce H3D-Net, a hybrid scheme that combines the strengths of model-based and model-free representations by incorporating prior knowledge into neural implicit models for category-specific multi-view reconstruction. We apply this approach to the problem of few-shot full head reconstruction. In order to build the prior, we first use several thousand raw, incomplete scans to learn a space of Signed Distance Functions (SDFs) representing 3D head shapes [26]. At inference, this learnt shape prior is used to initialize and guide the optimization of an Implicit Differentiable Renderer (IDR) [52] that, given a potentially reduced number of input images, estimates the full head geometry. The use of the learned prior enables faster convergence during optimization and prevents it from being trapped in local minima, yielding 3D shape estimates that capture fine details of the face, head and hair from just three input images (see the teaser figure).
We exhaustively evaluate our approach on a mid-resolution Multi-view Stereo (MVS) public dataset [28] and on a high-resolution dataset we collected with a structured-light scanner, consisting of 10 3D full-head scans. The results show that we consistently outperform the current state of the art, both in a few-shot setting and when many input views are available. Importantly, the use of the prior also makes our approach very efficient, achieving competitive accuracy about 20× faster than IDR [52]. Our key contributions can be summarized as follows:

We introduce a method for reconstructing high-quality full heads in 3D from small sets of in-the-wild images.

Our method is the first to use implicit functions for reconstructing 3D human heads from multiple images, and the first to simultaneously rival parametric and non-parametric models in 3D accuracy.

We devise a guided optimization approach to introduce a probabilistic shape prior into neural implicit models.

We collect, and will release, a new dataset, dubbed H3DS, containing high-resolution 3D full-head scans, images, masks and camera poses for evaluation purposes.
2 Related work
Model-based. 3D Morphable Models (3DMMs) [5, 6, 20, 27, 29, 44, 45, 46] have become the de facto representation for few-shot 3D face reconstruction in-the-wild, given that they lead to lightweight, fast and robust systems. Adopting 3DMMs as a representation, the 3D reconstruction problem boils down to estimating the small set of parameters that best represent a target 3D shape. This makes it possible to obtain 3D reconstructions from very few images [3, 11, 34, 50] and even from a single input [35, 36, 42, 43, 47]. Nevertheless, one of the main limitations of morphable models is their lack of expressiveness, especially for high frequencies. This issue has been addressed by learning a post-processing step that transfers fine details from the image domain to the 3D geometry [21, 36, 43]. Another limitation of 3DMMs is their inability to represent arbitrary shapes and topologies. Thus, they are not suitable for reconstructing full heads with hair, beard, facial accessories and upper-body clothing.
Model-free. Model-free approaches build upon more generic representations, such as voxel grids or meshes, in order to gain expressiveness and flexibility. Voxel grids have been extensively used for 3D reconstruction [10, 14, 16, 18, 51], and concretely for 3D face reconstruction [16]. Their main limitation is that memory requirements grow cubically with resolution; octrees [14] have been proposed to address this issue. On the other hand, meshes [9, 17, 21, 48] are a more efficient representation for surfaces than voxel grids, and are well suited for graphics applications. Meshes have been proposed for 3D face reconstruction [9, 21] in combination with graph neural networks [7]. However, similarly to 3DMMs, meshes are usually restricted to fixed topologies and are not suitable for reconstructing elements beyond the face itself.
Implicit representations. Recently, implicit representations [22, 23, 26] have been proposed to jointly address the memory limitations of voxel grids and the topological rigidity of meshes. Implicit representations model surfaces as a level set of a coordinate-based continuous function, e.g. a signed distance function or an occupancy function. These functions, usually implemented as multi-layer perceptrons (MLPs), can theoretically express any shape with infinite resolution and a fixed memory footprint. Implicit methods can be divided into those that, at inference time, perform a single forward pass of a previously trained model [8, 23, 37], and those that overfit a model to a set of input images through an optimization process using implicit differentiable rendering [25, 52]. In the latter, given that inference is an optimization process, the obtained 3D reconstructions are more accurate. However, these methods are slow and require a substantial number of multi-view images, failing in few-shot setups such as those we consider in this work.
Priors for implicit representations. Building priors for implicit representations has been addressed with two main purposes. The first consists in speeding up the convergence of methods that perform an optimization at inference time [39], using meta-learning techniques [12]. The second is to find a space of implicit functions that represents a certain category, using auto-decoders [26, 53]. However, [26, 53] have been used to solve tasks with 3D supervision, and it is still an open problem how to use these priors when the supervision signal is generated from 2D images.
As is done with morphable models, implicit shape models can be used to constrain image-based 3D reconstruction systems to make them more reliable. Drawing inspiration from this idea, in this work we leverage implicit shape models [26] to guide the optimization-based implicit 3D reconstruction method [52] towards more accurate and robust solutions, even in few-shot in-the-wild scenarios.
3 Method
Given a small set of input images {I_i}, i = 1, ..., N, with associated head masks {M_i} and camera parameters, our goal is to recover the 3D head surface S using only visual cues as supervision. Formally, we aim to approximate the signed distance function f* such that S = {x in R^3 : f*(x) = 0}.
In order to approximate f*, we propose to optimize a previously learnt probabilistic model f(x, z; \theta) that represents a prior distribution over 3D head SDFs. Here z and \theta are, respectively, a latent vector encoding a specific shape and the learnt parameters of an auto-decoder [40]. Building on DeepSDF [26], we learn these parameters from thousands of incomplete scans. We describe this process in section 3.1.
At test time, the reconstruction process reduces to finding the optimal parameters z and \theta such that f(x, z; \theta) approximates f*(x). To that end, we compose the prior model f, which we also refer to as the geometry network, with a rendering network F that models the RGB radiance emitted from a surface point x with normal n in a viewing direction v, and minimize a photometric error w.r.t. the input images, as in [52]. Moreover, we propose a two-step optimization schedule that prevents the reconstruction process from getting trapped in local minima and, as we shall see in the results section, leads to much more accurate, robust and realistic reconstructions. We describe the reconstruction step in section 3.2.
3.1 Learning a prior for human head SDFs
Given a set of scenes with associated raw 3D point clouds, we use the DeepSDF framework to learn a prior distribution over signed distance functions representing 3D heads. While the original DeepSDF formulation requires watertight meshes as training data in order to use signed distances as supervision, we use the Eikonal loss [13] to learn directly from raw, and potentially incomplete, surface point clouds. In addition, Fourier features are used to overcome the spectral bias of MLPs towards low frequencies in low-dimensional tasks [41]. We illustrate the training and inference process of the prior model in figure 1-left.
For each scene, indexed by j, we sample a subset of points S^j on the surface, and another set uniformly from a volume containing the scene, and minimize the following objective:

\mathcal{L}_{prior} = \sum_j \left( \mathcal{L}_{SDF}^j + \lambda_1 \mathcal{L}_z^j + \lambda_2 \mathcal{L}_E^j \right)  (1)

where \lambda_1 and \lambda_2 are hyperparameters and \mathcal{L}_{SDF}^j accounts for the SDF error at surface points:

\mathcal{L}_{SDF}^j = \frac{1}{|S^j|} \sum_{x \in S^j} \left| f(x, z_j; \theta) \right|  (2)

\mathcal{L}_z^j enforces a zero-mean multivariate Gaussian distribution with spherical covariance \sigma^2 over the space of latent vectors:

\mathcal{L}_z^j = \frac{\| z_j \|_2^2}{\sigma^2}  (3)

Finally, \mathcal{L}_E^j regularizes f with the Eikonal loss to ensure that it approximates a signed distance function by keeping its gradients close to unit norm:

\mathcal{L}_E^j = \mathbb{E}_x \left[ \left( \| \nabla_x f(x, z_j; \theta) \|_2 - 1 \right)^2 \right]  (4)
This regularization across the whole volume is necessary given that our meshes are not watertight and only a subset of surface points is available as ground truth [13].
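As an illustration, the three terms of the prior objective (equations 1-4) can be sketched as follows, assuming the SDF values at surface samples, the latent code and the SDF gradients at volume samples have already been computed for one scene; the function name and the weight values are illustrative, not the paper's:

```python
import numpy as np

def prior_loss(f_surface, z, grad_volume, sigma=1.0, lam1=1e-4, lam2=0.1):
    """Sketch of the per-scene prior objective (Eqs. 1-4).

    f_surface:   SDF values at sampled surface points
    z:           latent vector of the scene
    grad_volume: SDF gradients at points sampled in the volume
    """
    # Eq. 2: the SDF should vanish on surface samples
    l_sdf = np.abs(f_surface).mean()
    # Eq. 3: zero-mean Gaussian prior with spherical covariance sigma^2
    l_latent = np.sum(z ** 2) / sigma ** 2
    # Eq. 4: Eikonal term, gradient norms should stay close to 1
    l_eik = ((np.linalg.norm(grad_volume, axis=-1) - 1.0) ** 2).mean()
    return l_sdf + lam1 * l_latent + lam2 * l_eik
```

A scene whose surface samples have zero SDF value, zero latent code and unit-norm gradients incurs zero loss, as expected.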
After training, we obtain the parameters \theta that represent a space of human head SDFs. We can now draw signed distance functions of heads from the model by sampling the latent space. We use this pretrained model as the prior for the 3D reconstruction schedule described in the following section.
3.2 Prior-aided 3D Reconstruction
Given a new scene, for which no 3D information is provided at this point, we aim to approximate the SDF that implicitly encodes the surface of the head by supervising only in the image domain. To that end, we compose the previously learnt probabilistic geometry model f with the rendering network F, and supervise on the photometric error to find the optimal parameters z, \theta and \psi. The reconstruction process is illustrated in figure 1-right.
For every pixel p of each input image I_i, we march a ray r_p(t) = o_i + t v_p, where o_i is the position of the associated camera and v_p the viewing direction. The intersection point x_p between the ray and the surface can be efficiently found using sphere tracing [15]. This intersection point can be made differentiable w.r.t. z and \theta without having to store the gradients corresponding to all the forward passes of the geometry network, as shown in [25] and generalized by [52]. The following expression is exact in value and first derivatives:

\hat{x}_p = x_p - \frac{v_p}{\nabla_x f(x_p, z_t; \theta_t) \cdot v_p} \, f(x_p, z; \theta)  (5)

Here z_t and \theta_t denote the (fixed) parameters of f at the current iteration t, and \hat{x}_p represents the intersection point made differentiable w.r.t. the geometry network parameters.
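Both mechanisms, sphere tracing [15] and the differentiable intersection of equation 5, can be illustrated on a toy sphere SDF standing in for the geometry network; all names, and the choice of the sphere radius as the only shape parameter, are ours:

```python
import numpy as np

def sphere_sdf(x, r=1.0):
    """Toy stand-in for the geometry network: a sphere of radius r."""
    return np.linalg.norm(x) - r

def sphere_trace(sdf, origin, v, max_steps=64, eps=1e-6):
    """Sphere tracing [15]: step along the ray by the SDF value, which is
    a safe bound on the distance to the surface."""
    t = 0.0
    for _ in range(max_steps):
        x = origin + t * v
        d = sdf(x)
        if abs(d) < eps:
            break
        t += d
    return origin + t * v

def diff_intersection(x_t, v, grad_t, sdf, r):
    """Eq. (5): the intersection made differentiable w.r.t. the shape
    parameter r, with x_t and grad_t treated as constants."""
    return x_t - v * sdf(x_t, r) / np.dot(grad_t, v)

o = np.array([0.0, 0.0, -3.0])          # camera center
v = np.array([0.0, 0.0, 1.0])           # unit viewing direction
x_t = sphere_trace(sphere_sdf, o, v)    # hits (0, 0, -1) on the unit sphere
grad_t = x_t / np.linalg.norm(x_t)      # analytic SDF gradient at the hit

x_hat = diff_intersection(x_t, v, grad_t, sphere_sdf, 1.0)   # equals x_t
x_grown = diff_intersection(x_t, v, grad_t, sphere_sdf, 1.1) # ~(0, 0, -1.1)
```

Growing the radius from 1.0 to 1.1 moves the predicted intersection to z = -1.1, i.e. onto the new surface, without re-running the sphere tracer; this first-order correctness is what makes equation 5 useful for backpropagation.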
Next, we evaluate the rendering mapping F at the intersection point \hat{x}_p, with surface normal \hat{n}_p = \nabla_x f(\hat{x}_p, z; \theta) and viewing direction v_p, to estimate the color c_p of the pixel p in the image I_i:

c_p = F(\hat{x}_p, \hat{n}_p, v_p; \psi)  (6)
Finally, in order to optimize the surface parameters z and \theta, and the rendering parameters \psi, we minimize the following loss [52]:

\mathcal{L} = \mathcal{L}_{RGB} + \lambda_1 \mathcal{L}_{mask} + \lambda_2 \mathcal{L}_E  (7)

where \lambda_1 and \lambda_2 are hyperparameters. We next describe each component of this loss. Let P be a mini-batch of pixels from view i, P^{in} the subset of pixels whose associated ray intersects the surface and which have a non-zero mask value, and P^{out} = P \setminus P^{in}. The term \mathcal{L}_{RGB} is the photometric error, computed as:

\mathcal{L}_{RGB} = \frac{1}{|P|} \sum_{p \in P^{in}} \left| I_p - c_p \right|  (8)

\mathcal{L}_{mask} accounts for silhouette errors:

\mathcal{L}_{mask} = \frac{1}{\alpha |P|} \sum_{p \in P^{out}} CE(M_p, S_{p,\alpha})  (9)

where S_{p,\alpha} = \mathrm{sigmoid}(-\alpha \min_t f(r_p(t), z; \theta)) is the estimated soft silhouette, CE is the binary cross-entropy and \alpha is a hyperparameter. Lastly, \mathcal{L}_E encourages f to approximate a signed distance function, as in equation 4.
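A minimal sketch of the reconstruction loss of equations 7-9 for one mini-batch of pixels, assuming the rendered colors, the ray-surface hit flags and the per-ray minimum SDF values are given; the names and weights are illustrative, and the Eikonal term is passed in precomputed:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def reconstruction_loss(I, c, hit_mask, gt_mask, min_sdf, alpha=50.0,
                        lam1=100.0, lam2=0.1, eik=0.0):
    """Sketch of Eq. 7 for one mini-batch. I, c: ground-truth and rendered
    colors; hit_mask: rays intersecting the surface inside the object mask
    (P^in); min_sdf: minimum SDF value along each ray; eik: precomputed
    Eikonal term. Weight values are illustrative."""
    n = len(I)
    # Eq. 8: L1 photometric error on pixels with a valid intersection
    l_rgb = np.abs(I[hit_mask] - c[hit_mask]).sum() / n
    # Eq. 9: BCE between mask and soft silhouette sigmoid(-alpha * min_sdf)
    s = sigmoid(-alpha * min_sdf[~hit_mask])
    m = gt_mask[~hit_mask]
    bce = -(m * np.log(s + 1e-9) + (1 - m) * np.log(1 - s + 1e-9))
    l_mask = bce.sum() / (alpha * n)
    return l_rgb + lam1 * l_mask + lam2 * eik
```

When every ray hits the surface and the rendered colors match the images, both terms vanish; a masked-in ray that misses the surface (positive minimum SDF) is penalized through the silhouette term.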
Instead of jointly optimizing all the parameters to minimize equation 7, we introduce a two-step optimization schedule that is more appropriate for auto-decoders like DeepSDF. We begin by initializing the geometry network with the previously learnt prior for human head SDFs and a randomly sampled latent vector of small norm, so as to stay near the mean of the latent space. In a first phase, we only optimize z and \psi, keeping the decoder parameters \theta frozen, which is equivalent to standard auto-decoder inference. By doing so, the resulting surface is forced to stay within the learnt distribution of 3D heads. Once the geometry and the radiance mappings have reached an equilibrium and the optimization has converged, we unfreeze the decoder parameters to fine-tune the whole model.
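The freeze-then-finetune logic can be illustrated on a deliberately tiny toy problem, a scalar "decoder" f(z; theta) = theta * z fitted to a target y; this is only an analogy for the schedule, not the actual training loop, and all names are ours:

```python
def two_step_fit(theta_prior, y, lr=0.05, steps=200):
    """Toy two-step schedule: phase 1 optimizes only the latent z with the
    decoder weight theta frozen at its prior value; phase 2 unfreezes theta
    and fine-tunes both jointly by gradient descent on (theta*z - y)^2."""
    z, theta = 0.1, theta_prior          # z starts near the latent mean
    for _ in range(steps):               # phase 1: auto-decoder inference
        z -= lr * 2 * (theta * z - y) * theta
    for _ in range(steps):               # phase 2: joint fine-tuning
        err = theta * z - y
        z -= lr * 2 * err * theta
        theta -= lr * 2 * err * z
    return z, theta, (theta * z - y) ** 2
```

Phase 1 already drives the residual close to zero while the "prior" weight stays untouched; phase 2 then only needs small joint corrections, mirroring how the full model is unfrozen late in the schedule.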
In section 5, we empirically show that using this optimization schedule, instead of optimizing all the parameters at once, yields 3D reconstructions that are more accurate and less prone to artifacts, especially in few-shot setups.
4 Implementation details
Table 1: Ablation study on the H3DS dataset with 3 views (see Sec. 5.3): (a) no prior, no schedule; (b) small prior (500 subjects), no schedule; (c) large prior (10,000 subjects), no schedule; (d) large prior with the proposed schedule.

                               (a)      (b)      (c)      (d)
Face mean distance [mm]        4.04     2.68     1.90     1.49
Full-head mean distance [mm]   16.68    17.08    14.59    12.76
Our implementation of the prior model closely follows the one proposed in [13], with the addition that we apply a positional encoding to the input coordinates with 6 log-linear spaced frequencies. The encoded 3D coordinates are concatenated with the latent vector of size 256 and fed as input to the decoder. The decoder is an MLP of 8 layers, with 512 neurons in each layer and a single skip connection from the input of the decoder to the output of the 4th layer. We use Softplus as the activation function in every layer except the last, where no activation is used. The prior model is trained for 100 epochs using Adam [19] with standard parameters and a learning-rate step decay of 0.5 every 15 epochs. Training takes approximately 50 minutes for a small dataset (500 scenes) and 10 hours for a large one (10,000 scenes).

The 3D reconstruction network is composed of the prior model described above and a rendering mapping that is split into two subnetworks, as shown in figure 1. The first is an MLP implemented exactly as the decoder of the prior model, except for the input layer, which takes in a 3-dimensional vector, and the output layer, whose dimensionality differs. As in [52], the second is a smaller MLP composed of 4 layers, each 512 neurons wide, with no skip connections, and ReLU activations except in the output layer, which uses tanh. We also apply positional encodings to its inputs, with 6 and 4 log-linear spaced frequencies respectively. Each scene is trained for 2000 epochs using Adam with a learning-rate step decay of 0.5 at epochs 1000 and 1500. The scene reconstruction process takes approximately 25 minutes for scenes with 3 views and 4 hours and 15 minutes for scenes with 32.

All experiments, for both the prior and the reconstruction models, were performed on a single Nvidia RTX 2080 Ti.
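For reference, a positional encoding with 6 log-linear spaced frequencies maps a 3D point to a 39-dimensional vector (3 + 3 × 2 × 6); the specific choice of frequencies 2^0..2^5 scaled by pi is our assumption, consistent with the common convention of [41]:

```python
import numpy as np

def positional_encoding(x, num_freqs=6):
    # Frequencies 2^0 .. 2^(num_freqs-1), i.e. log-linear spacing
    # (assumed convention, consistent with [41]).
    freqs = 2.0 ** np.arange(num_freqs)
    angles = np.pi * x[..., None] * freqs          # shape (..., 3, num_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    # Concatenate the raw coordinates with their sin/cos encodings.
    return np.concatenate([x, enc.reshape(*x.shape[:-1], -1)], axis=-1)

# A 3D point becomes a 3 + 3*2*6 = 39-dimensional decoder input.
gamma = positional_encoding(np.array([0.1, -0.5, 0.3]))
```

The encoded vector, concatenated with the 256-dimensional latent code, gives the decoder input described above.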
5 Experiments
In this section, we evaluate our multi-view 3D reconstruction method quantitatively and qualitatively. We empirically demonstrate that the proposed solution surpasses the state of the art in both the few-shot [3, 50] and many-shot [52] scenarios for in-the-wild 3D face and head reconstruction.
Table 2: Mean surface error in mm for the face and full-head regions on 3DFAW and H3DS.

                   3DFAW  |                         H3DS
                 3 views  | 3 views      4 views      8 views      16 views     32 views
                    face  | face  head   face  head   face  head   face  head   face  head
MVF-Net [50]        1.54  | 1.66  -      -     -      -     -      -     -      -     -
DFNR-MVS [3]        1.53  | 1.83  -      -     -      -     -      -     -      -     -
IDR [52]            3.92  | 3.52  17.04  2.14  8.04   1.95  8.71   1.43  5.94   1.39  5.86
H3D-Net (Ours)      1.37  | 1.49  12.76  1.65  7.95   1.38  5.47   1.24  4.80   1.21  4.90
5.1 Datasets
Prior training. In order to train the geometry prior, we use an internal dataset consisting of 3D head scans from 10,000 individuals. The dataset is perfectly balanced in gender and diverse in age and ethnicity. The raw data is automatically processed to remove internal mesh faces and non-human parts such as background walls. Finally, all the scenes are aligned by registering a template 3D model with non-rigid Iterative Closest Point (ICP) [1].
3DFAW [28]. We evaluate our method on the 3DFAW dataset. This dataset provides videos recorded in front of, and around, the head of a person in a static position, as well as mid-resolution 3D ground truth of the facial region. We select 5 male and 5 female scenes and use them to evaluate the facial region only.
H3DS. We introduce a new dataset called H3DS, the first dataset containing high-resolution full-head 3D textured scans and 360° images with associated ground-truth camera poses and masks. The 3D geometry has been captured using a structured-light scanner, which yields more precise ground-truth geometry than that of 3DFAW [28], which was generated using Multi-View Stereo (MVS). The dataset consists of 10 individuals, 50% male and 50% female. We use this dataset to evaluate the accuracy of the different methods in both the full-head and facial regions. We plan to release, maintain and progressively grow the H3DS dataset.
5.2 Experiments setup
We use the 3DMM-based methods MVF-Net [50] and DFNR-MVS [3], and the model-free method IDR [52], as baselines to compare against H3D-Net.
In the few-shot scenario (3 views), all the methods are evaluated on the 3DFAW and H3DS datasets. To benchmark our method when more than 3 views are available, we compare it against IDR on the H3DS dataset.
The evaluation criteria are the same for all methods in all the experiments. The predicted 3D reconstruction is first roughly aligned with the ground-truth mesh using manually annotated landmarks, and the alignment is then refined with rigid ICP [4]. We then compute the unidirectional Chamfer distance from the predicted reconstruction to the ground truth. All distances are reported in millimeters.
We report metrics in two different regions: the face and the full head. For the finer evaluation in the face region, we cut both the reconstructions and the ground truth with a sphere of radius 95 mm centered at the tip of the nose of the ground-truth mesh, refine the alignment with ICP as in [28, 47], and compute the Chamfer distance in this subregion. For the full-head evaluation, the ICP alignment is performed using an annotated region that includes the face, the ears and the neck, since this region is visible in all view configurations (3, 4, 8, 16 and 32 views), which are defined by their camera yaw angles. In this case, the Chamfer distance is computed over all the vertices of the reconstruction.
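For small point sets, the unidirectional Chamfer distance used throughout the evaluation can be computed with a brute-force sketch like the following (naming ours):

```python
import numpy as np

def unidirectional_chamfer(pred, gt):
    """Mean distance (in the same units as the inputs, here mm) from each
    predicted vertex to its nearest ground-truth vertex. Brute force;
    practical implementations would use a k-d tree for large meshes."""
    # Pairwise distances between predicted and ground-truth vertices
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    # Nearest ground-truth vertex for each predicted vertex, then average
    return d.min(axis=1).mean()
```

Being unidirectional, the metric penalizes predicted geometry far from the ground truth but not missing regions, which suits partial ground-truth scans.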
5.3 Ablation study
We conduct an ablation study on the H3DS dataset in the few-shot scenario (3 views), and report the numerical results in table 1 and the qualitative results in figure 2. First, we reconstruct the scenes without a prior and without the schedule (a). In this case, the geometry network is initialized using geometric initialization [2], representing a sphere of radius one at the beginning of the optimization. Then, we initialize the geometry network with two different priors, a small one trained on 500 subjects (b) and a large one trained on 10,000 subjects (c), and perform the reconstructions without the schedule. As can be observed, initializing the geometry network with a previously learnt prior leads to smoother and more plausible surfaces, especially when more subjects have been used to train it. It is important to note that the benefits of the initialization are due not only to a better initial shape, but also to the ability of the initial weights to generalize to unseen shapes, which is greater in the large prior model. Finally, we initialize the geometry network with the large prior and use the proposed optimization schedule during the reconstruction process (d). The resulting 3D heads resemble the ground truth much more closely in terms of finer facial details.
Given the notable effect that the number of training samples has on the learnt prior representations, and consequently on the resulting 3D reconstructions, we visualize latent space interpolations in figure 3. To that end, we optimize the latent vector for two ground-truth 3D scans, as shown in figure 1-left-bottom, so as to minimize the loss in equation 1. Then, we interpolate between the two optimal latent vectors. As can be observed, the 3D reconstructions resulting from the interpolation in the latent space of the large prior model are more detailed and plausible than those from the small prior model, suggesting that the latter generalizes more poorly.
5.4 Quantitative results
Quantitative results in terms of surface error are reported in table 2. Remarkably, H3D-Net outperforms both 3DMM-based methods in the few-shot regime, and the model-free method IDR when the largest number of views (32) is available. It is worth noting that the improvement due to the prior becomes more significant as the number of views decreases. Nevertheless, the prior does not prevent the model from becoming more accurate when more views are available, which is a current limitation of model-based approaches.
We also analyze the trade-off between optimization time and accuracy for IDR and H3D-Net in the case of 32 views, which we illustrate in figure 4. It can be observed that, despite reaching similar errors asymptotically, on average our method achieves the best performance attained by IDR much faster. In particular, we report convergence gains of 20× for the facial region error and 4× for the full head. Moreover, the smaller variance observed for H3D-Net (blue) indicates that it is a more stable method.
5.5 Qualitative results
Quantitative results show improvements over the baselines in both few-shot and many-shot setups. Here, we study how this translates into the reconstructed 3D shapes.
In figure 5, we qualitatively evaluate the three baselines and H3D-Net for the case of 3 input views. As expected, IDR [52] is the worst-performing model in this scenario, generating reconstructions with artifacts and little resemblance to human faces. On the other hand, 3DMM-based models [3, 50] achieve more plausible shapes, but they struggle to capture fine details. H3D-Net, in contrast, captures much more detail and significantly reduces the errors over the whole face, particularly in difficult areas such as the nose, eyebrows, cheeks and chin.
We also evaluate the impact that varying the number of available views has on the reconstructed surface, and compare our method to IDR [52]. As shown in figure 6, H3D-Net obtains surfaces with lower error (greener) from far fewer views, which is consistent with the quantitative results reported in table 2. Notably, even when the errors are numerically similar (first and third columns), the reconstructions from H3D-Net are much more realistic. In addition, the improvements of H3D-Net are especially notable within the face region. We attribute this to the training data used to build the prior model being richer in this area, whereas the training examples frequently present holes in other parts of the head.
6 Conclusions
In this work we have presented H3D-Net, a method for high-fidelity 3D head reconstruction from small sets of in-the-wild images with associated head masks and camera poses. Our method combines a pre-trained probabilistic model, which represents a distribution of head SDFs, with an implicit differentiable renderer that allows direct supervision in the image domain. By constraining the reconstruction process with the prior model, we are able to robustly recover detailed 3D human heads, including hair and shoulders, from only three input images. A thorough quantitative and qualitative evaluation shows that our method outperforms model-based methods in the few-shot setup and model-free methods when a large number of views is available. One limitation of our method is that it still requires several minutes to generate 3D reconstructions. An interesting direction for future work is to use more efficient representations for SDFs in order to speed up the optimization process. We also find it promising to introduce texture priors, which could reduce both the convergence time and the final overall error.
7 Acknowledgments
This work has been partially funded by the Spanish government through the projects MoHuCo PID2020-120049RB-I00 and DeeLight PID2020-117142GB-I00, the María de Maeztu Seal of Excellence MDM-2016-0656, and by the Government of Catalonia under the industrial doctorate 2017 DI 028.
References
 [1] Optimal step nonrigid ICP algorithms for surface registration. In CVPR, 2007.
 [2] SAL: sign agnostic learning of shapes from raw data. In CVPR, 2020.
 [3] Deep facial non-rigid multi-view stereo. In CVPR, 2020.
 [4] A method for registration of 3-D shapes. In Sensor Fusion IV: Control Paradigms and Data Structures, 1992.
 [5] 3D face morphable models "in-the-wild". In CVPR, 2017.
 [6] A 3D morphable model learnt from 10,000 faces. In CVPR, 2016.
 [7] Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine 34(4), pp. 18–42, 2017.
 [8] Learning implicit fields for generative shape modeling. In CVPR, 2019.
 [9] Faster, better and more detailed: 3D face reconstruction with graph convolutional networks. In ACCV, 2020.
 [10] 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction. In ECCV, 2016.
 [11] Multi-view 3D face reconstruction with deep recurrent neural networks. Image and Vision Computing 80, pp. 80–91, 2018.
 [12] Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
 [13] Implicit geometric regularization for learning shapes. arXiv preprint arXiv:2002.10099, 2020.
 [14] Hierarchical surface prediction for 3D object reconstruction. In 3DV, 2017.
 [15] Sphere tracing: a geometric method for the antialiased ray tracing of implicit surfaces. The Visual Computer 12(10), pp. 527–545, 1996.
 [16] Large pose 3D face reconstruction from a single image via direct volumetric CNN regression. In ICCV, 2017.
 [17] Learning category-specific mesh reconstruction from image collections. In ECCV, 2018.
 [18] Learning a multi-view stereo machine. In NeurIPS, 2017.
 [19] Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [20] Learning a model of facial shape and expression from 4D scans. ACM Trans. Graph. 36(6), 2017.
 [21] Towards high-fidelity 3D face reconstruction from in-the-wild images using graph convolutional networks. In CVPR, 2020.
 [22] Deep meta functionals for shape representation. In ICCV, 2019.
 [23] Occupancy networks: learning 3D reconstruction in function space. In CVPR, 2019.
 [24] Stochastic exploration of ambiguities for non-rigid shape recovery. TPAMI 35(2), pp. 463–475, 2013.
 [25] Differentiable volumetric rendering: learning implicit 3D representations without 3D supervision. In CVPR, 2020.
 [26] DeepSDF: learning continuous signed distance functions for shape representation. In CVPR, 2019.
 [27] A 3D face model for pose and illumination invariant face recognition. In AVSS, pp. 296–301, 2009.
 [28] The 2nd 3D face alignment in the wild challenge (3DFAW-Video): dense reconstruction from video. In ICCV Workshops, 2019.
 [29] Combining 3D morphable models: a large scale face-and-head model. In CVPR, 2019.
 [30] Geometry-aware network for non-rigid shape prediction from a single view. In CVPR, 2018.
 [31] D-NeRF: neural radiance fields for dynamic scenes. In CVPR, 2021.
 [32] C-Flow: conditional generative flow models for images and 3D point clouds. In CVPR, 2020.
 [33] 3DPeople: modeling the geometry of dressed humans. In ICCV, 2019.
 [34] Multi-view 3D face reconstruction in the wild using siamese networks. In ICCV Workshops, 2019.
 [35] 3D face reconstruction by learning from synthetic data. In 3DV, 2016.
 [36] Learning detailed face reconstruction from a single image. In CVPR, 2017.
 [37] PIFu: pixel-aligned implicit function for high-resolution clothed human digitization. In ICCV, 2019.
 [38] Unrestricted facial geometry reconstruction using image-to-image translation. In ICCV, 2017.
 [39] MetaSDF: meta-learning signed distance functions. arXiv preprint arXiv:2006.09662, 2020.
 [40] Reducing data dimensionality through optimizing neural network inputs. AIChE Journal 41(6), pp. 1471–1480, 1995.
 [41] Fourier features let networks learn high frequency functions in low dimensional domains. arXiv preprint arXiv:2006.10739, 2020.
 [42] MoFA: model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In ICCV, 2017.
 [43] Extreme 3D face reconstruction: seeing through occlusions. In CVPR, 2018.
 [44] Towards high-fidelity nonlinear 3D face morphable model. In CVPR, 2019.
 [45] Nonlinear 3D face morphable model. In CVPR, 2018.
 [46] On learning 3D face morphable model from in-the-wild images. TPAMI 43(1), pp. 157–171, 2019.
 [47] Regressing robust and discriminative 3D morphable models with a very deep neural network. In CVPR, 2017.
 [48] Pixel2Mesh: generating 3D mesh models from single RGB images. In ECCV, 2018.
 [49] 3D dense face alignment via graph convolution networks. arXiv preprint arXiv:1904.05562, 2019.
 [50] MVF-Net: multi-view 3D face morphable model regression. In CVPR, 2019.
 [51] Perspective transformer nets: learning single-view 3D object reconstruction without 3D supervision. In NeurIPS, 2016.
 [52] Multiview neural surface reconstruction by disentangling geometry and appearance. In NeurIPS, 2020.
 [53] i3DMM: deep implicit 3D morphable model of human heads. arXiv preprint arXiv:2011.14143, 2020.