1 Introduction
The reconstruction of faces from visual data has a wide range of applications in vision and graphics, including face tracking, emotion recognition, and interactive image/video editing tasks relevant in multimedia. Facial images and videos are ubiquitous, as smart devices as well as consumer and professional cameras provide a continuous and virtually endless source thereof. When such data is captured without controlled scene location, lighting, or intrusive equipment (e.g. egocentric cameras or markers on actors), one speaks of “in-the-wild” images. Usually, in-the-wild data is of low resolution, noisy, or contains motion and focal blur, making the reconstruction problem much harder than in a controlled setup. 3D face reconstruction from in-the-wild monocular 2D image and video data [71]
deals with disentangling facial shape identity (neutral geometry), skin appearance (or albedo) and expression, as well as estimating the scene lighting and camera parameters. Some of these attributes,
e.g. albedo and lighting, are not easily separable in monocular images. Besides, poor scene lighting, depth ambiguity, and occlusions due to facial hair, sunglasses and large head rotations further complicate 3D face reconstruction.
In order to tackle the difficult monocular 3D face reconstruction problem, most existing methods rely on the availability of strong prior models that serve as regularizers for an otherwise ill-posed problem [6, 20, 68]. Although such approaches achieve impressive facial shape and albedo reconstruction, they introduce an inherent bias due to the used face model. For instance, the 3D Morphable Model (3DMM) by Blanz et al. [6]
is based on a comparably small set of 3D laser scans of Caucasian actors, thus limiting generalization to general real-world identities and ethnicities. With the rise of CNN-based deep learning, various techniques have been proposed, which in addition to 3D reconstruction also perform face model learning from monocular images
[63, 62, 59, 55]. However, these methods heavily rely on a pre-existing 3DMM to resolve the inherent depth ambiguities of the monocular reconstruction setting. Another line of work, where 3DMM-like face models are not required, is based on photo-collections [30, 37, 57]. However, these methods need a very large number of facial images of the same subject, and thus they impose strong demands on the training corpus.
In this paper, we introduce an approach that learns a comprehensive face identity model using clips crawled from in-the-wild Internet videos [19]. This face identity model comprises two components: one component represents the geometry of the facial identity (modulo expressions), and the other represents the facial appearance in terms of the albedo. As we have only weak requirements on the training data (cf. Sec. 3.1), our approach can employ a virtually endless amount of community data and thus obtain a model with better generalization; laser scanning a similarly large group of people for model building would be nearly impossible. Unlike most previous approaches, we do not require a pre-existing shape identity and albedo model as initialization, but instead learn their variations from scratch. As such, our methodology is applicable in scenarios where no existing model is available, or where it is difficult to create such a model from 3D scans (e.g. for faces of babies).
From a technical point of view, one of our main contributions is a novel multi-frame consistency loss, which ensures that the face identity and albedo reconstruction is consistent across frames of the same subject. This way we avoid depth ambiguities present in many monocular approaches and obtain a more accurate and robust model of facial geometry and albedo. Moreover, by imposing orthogonality between our learned face identity model and an existing blendshape expression model [20], our approach automatically disentangles facial expressions from identity-based geometry variations, without resorting to a large set of hand-crafted priors. In summary, our approach is based on the following technical contributions:


- A deep neural network that learns a facial shape and appearance space from a large dataset of unconstrained images that contains multiple images of each subject, e.g. multi-view sequences or even monocular videos.
- Explicit blendshape and identity separation by a projection onto the blendshapes’ null-space, which enables a multi-frame consistency loss.
- A novel multi-frame identity consistency loss based on a Siamese network [67], with the ability to handle monocular and multi-frame reconstruction.
2 Related Work
The literature on 3D model learning is quite vast; we mainly review methods for reconstructing 3D face models from scanner data, monocular video data, photo-collections, and a single 2D image. An overview of the state of the art in model-based face reconstruction is given in [71].
Morphable Models from High-quality Scans:
3DMMs represent deformations in a low-dimensional subspace and are often built from scanner data [7, 8, 36]. Traditional 3DMMs model geometry/appearance variation from limited data via PCA [7, 6, 26]. Recently, richer PCA models have been obtained from large-scale datasets [13, 44]. Multilinear models generalize statistical models by capturing a set of mutually orthogonal variation modes (e.g., global and local deformations) via a tensor decomposition [68, 9, 10]. However, unstructured subspaces or even tensor generalizations are incapable of modeling localized deformations from limited data. In this respect, Neumann et al. [41] and Bernard et al. [5] devise methods for computing sparse localized deformation components directly from mesh data. Lüthi et al. [38] propose the so-called Gaussian Process Morphable Models (GPMMs), which are modeled with arbitrary non-linear kernels, to handle strong non-linear shape deformations. Ranjan et al. [46] learn a non-linear model using a deep mesh autoencoder with fast spectral convolution kernels. Garrido et al. [25] train radial basis function networks to learn a corrective 3D lip model from multi-view data. In an orthogonal direction, Li et al. [36] learn a hybrid model that combines a linear shape space with articulated motions and semantic blendshapes. All these methods mainly model shape deformations and are limited by the availability of scanner data.
Parametric Models from Monocular Data:
Here, we distinguish between personalized, corrective, and morphable model learning. Personalized face models have been extracted from monocular video by first refining a parametric model in a coarse-to-fine manner (e.g., as in [49]) and then learning a mapping from coarse semantic deformations to finer non-semantic detail layers [28, 24]. Corrective models represent out-of-space deformations (e.g., in shape or appearance) which are not modeled by the underlying parametric model. Examples are adaptive linear models customized over a video sequence [15, 27] or non-linear models learned from a training corpus [48, 59]. A number of works have been proposed for in-the-wild 3DMM learning [53, 63, 4, 12]. Such solutions decompose the face into its intrinsic components through encoder-decoder architectures that exploit weak supervision. Tran et al. [63] employ two separate convolutional decoders to learn a non-linear model that disentangles shape from appearance. Similarly, Sengupta et al. [53] propose residual blocks to produce a complete separation of surface normal and albedo features. There also exist approaches that learn 3DMMs of rigid [65] or articulated objects [29] by leveraging image collections. These methods predict an instance of a 3DMM directly from an image [29] or use additional cues (e.g., segmentation and shading) to fit and refine a 3DMM [65].
Monocular 3D Reconstruction:
Optimization-based reconstruction algorithms rely on a personalized model [18, 21, 23, 69] or a parametric prior [2, 15, 35, 24, 54] to estimate 3D geometry from a 2D video. Learning-based approaches regress 3D face geometry from a single image by learning an image-to-parameter or image-to-geometry mapping [42, 48, 60, 59, 52, 64, 32]. These methods require ground truth face geometry [64, 34], a morphable model from which synthetic training images are generated [47, 48, 52, 32], or a mixture of both [39, 33]. Recently, Tewari et al. [60] trained fully unsupervised through an inverse rendering-based loss. However, color and shape variations lie in the subspace of a parametric face prior. Only very recent methods for monocular face reconstruction [59, 63, 62, 12] allow for out-of-space model generalization while training from in-the-wild data.
3D Reconstruction via Photo-collections:
Face reconstruction is also possible by fitting a template model to photo-collections. In [31], an average shape and appearance model is reconstructed from a person-specific photo-collection via low-rank matrix factorization. Suwajanakorn et al. [57] use this model to track detailed facial motion from unconstrained video. Kemelmacher-Shlizerman [30] learns a 3DMM from a large photo-collection of people, grouped into a fixed set of semantic labels. Also, Liang et al. [37] leverage multi-view person-specific photo-collections to reconstruct the full head. In a different line of research, Thies et al. [61] fit a coarse parametric model to user-selected views to recover personalized face shape and albedo. Roth et al. [49] personalize an existing morphable model to an image collection by using a coarse-to-fine photometric stereo formulation. Note that most of these methods do not learn a general face model, e.g. a shape basis that spans the range of facial shapes of an entire population, but instead obtain a single person-specific 3D face instance. Besides, these methods require curated photo-collections. We, on the contrary, build a 3DMM representation that generalizes across multiple face identities and impose weaker assumptions on the training data.
Multi-frame 3D Reconstruction:
Multi-frame reconstruction techniques exploit either temporal information or multiple views to better estimate 3D geometry. Shi et al. [54] globally fit a multilinear model to 3D landmarks at multiple keyframes and enforce temporal consistency of in-between frames via interpolation. In [24], person-specific facial shape is obtained by averaging per-frame estimates of a parametric face model. Ichim et al. [28] employ a multi-view bundle adjustment approach to reconstruct facial shape and refine expressions using actor-specific sequences. Piotraschke et al. [43] combine region-wise reconstructions of a 3DMM from many images using a normal distance function. Garg et al. [22] propose a model-free approach that globally optimizes for dense 3D geometry in a non-rigid structure-from-motion framework. Beyond faces, Tulsiani et al. [66] train a CNN to predict single-view 3D shape (represented as voxels) using multi-view ray consistency.
3 Face Model Learning
Our novel face model learning approach solves two tasks: it jointly learns (i) a parametric face geometry and appearance model, and (ii) an estimator for facial shape, expression, albedo, rigid pose and incident illumination parameters. An overview of our approach is shown in Fig. 1.
Training:
Our network is trained in a self-supervised fashion based on a training set of multi-frame images, i.e., multiple images of the same person sampled from a video clip, see Section 3.1. The network jointly learns an appearance and shape identity model (Section 3.2). It also estimates per-frame parameters for the rigid head pose, illumination, and expression, as well as shape and appearance identity parameters that are shared among all frames. We train the network based on a differentiable renderer that incorporates a per-vertex appearance model and a graph-based shape deformation model (Section 3.3). To this end, we propose a set of training losses that account for geometry smoothness, photo-consistency, sparse feature alignment, and appearance sparsity, see Section 3.4.
Testing:
At test time, our network jointly reconstructs shape, expression, albedo, pose and illumination from an arbitrary number of face images of the same person. Hence, the same trained network is usable both for monocular and multi-frame face reconstruction.
3.1 Dataset
We train our approach using the VoxCeleb2 multi-frame video dataset [19]. This dataset contains over 140k videos of over 6000 celebrities crawled from YouTube. We sample a total of 404k multi-frame images from this dataset, where each multi-frame image comprises M frames of the same person extracted from the same video clip to avoid unwanted variations, e.g., due to aging or accessories. The same person can appear multiple times in the dataset. To obtain these images, we perform several sequential steps. First, the face region is cropped based on automatically detected facial landmarks [50, 51]. Afterwards, we discard images whose cropped region is smaller than a threshold (i.e., 200 pixels) or that have low landmark detection confidence, as provided by the landmark tracker [50, 51]. The remaining crops are rescaled to a fixed resolution. When sampling the M frames, we ensure sufficient diversity in head pose based on the head orientation obtained by the landmark tracker. We split our multi-frame dataset into a training (383k images) and a test set (21k images).
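The sequential filtering and pose-diverse sampling described above can be sketched as follows. The function name, the confidence threshold, and the greedy yaw-diversity heuristic are illustrative assumptions; only the 200-pixel crop threshold comes from the text.

```python
import numpy as np

def select_multi_frame_image(crop_sizes, confidences, yaws, M=4,
                             min_size=200, min_conf=0.9):
    """Pick M pose-diverse frames from one clip (illustrative heuristic)."""
    # Keep frames whose crop is large enough and whose landmark
    # detection confidence is sufficiently high.
    keep = [i for i in range(len(yaws))
            if crop_sizes[i] >= min_size and confidences[i] >= min_conf]
    if len(keep) < M:
        return None  # clip does not yield a valid multi-frame image
    # Greedily maximize head-pose (yaw) diversity among the kept frames.
    chosen = [keep[0]]
    while len(chosen) < M:
        best = max((i for i in keep if i not in chosen),
                   key=lambda i: min(abs(yaws[i] - yaws[j]) for j in chosen))
        chosen.append(best)
    return sorted(chosen)

# Example: 5 candidate frames, select a multi-frame image with M = 3.
frames = select_multi_frame_image([250] * 5, [1.0] * 5,
                                  [0.0, 5.0, 30.0, -30.0, 10.0], M=3)
assert frames == [0, 2, 3]  # picks the most spread-out yaw angles
```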
3.2 Graphbased Face Representation
We propose a multi-level face representation that is based on both a coarse shape deformation graph and a high-resolution surface mesh, where each vertex has a color value that encodes the facial appearance. This representation enables our approach to learn a face model of geometry and appearance based on multi-frame consistency. In the following, we explain the components in detail.
Learnable Graph-based Identity Model:
Rather than learning the identity model directly on the high-resolution mesh, we simplify this task by considering a lower-dimensional parametrization based on deformation graphs [56]. We obtain our (coarse) deformation graph by downsampling the mesh to a sparse set of nodes, see Fig. 2. The network learns a deformation on the graph that is then transferred to the mesh via linear blend skinning. The vector g(α) of the stacked node positions of the 3D graph is defined as
g(α) = ḡ + E_g α ,   (1)
where ḡ denotes the mean graph node positions. We obtain ḡ by downsampling a face mesh with slightly open mouth (to avoid connecting the upper and lower lips). The columns of the learnable matrix E_g span the low-dimensional graph deformation subspace, and α represents the graph deformation parameters.
The vertex positions v(α) of the high-resolution mesh that encode the shape identity are then given as
v(α) = v̄ + S (g(α) − ḡ) .   (2)
Here, v̄ is fixed to the neutral mean face shape as defined in the 3DMM [7]. The skinning matrix S is obtained based on the mean shape v̄ and the mean graph nodes ḡ.
To sum up, our identity model is represented by a deformation graph g(α), where the deformation parameters α are regressed by the network while the deformation subspace basis E_g is learned. We regularize this ill-posed learning problem by exploiting multi-frame consistency.
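As a minimal numerical sketch of the graph-based identity model (Eq. 1) and the skinning transfer to the mesh (Eq. 2): all dimensions are toy values and the random skinning matrix is a placeholder; the real graph and mesh are far larger and S is precomputed from the mean shapes.

```python
import numpy as np

# Illustrative sizes (the real graph/mesh are much larger).
N_nodes, N_verts, d = 4, 6, 2          # graph nodes, mesh vertices, basis size
rng = np.random.default_rng(0)

g_bar = rng.standard_normal(3 * N_nodes)       # mean graph node positions
E_g   = rng.standard_normal((3 * N_nodes, d))  # learnable deformation basis
alpha = rng.standard_normal(d)                 # regressed identity parameters

v_bar = rng.standard_normal(3 * N_verts)       # fixed neutral 3DMM mean shape
S     = rng.random((3 * N_verts, 3 * N_nodes)) # placeholder skinning matrix

g = g_bar + E_g @ alpha        # Eq. (1): deformed graph node positions
v = v_bar + S @ (g - g_bar)    # Eq. (2): skinned high-resolution vertices

# With zero deformation parameters the mesh stays at the mean shape.
assert np.allclose(v_bar + S @ (g_bar + E_g @ np.zeros(d) - g_bar), v_bar)
```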
Blendshape Expression Model:
For capturing facial expressions, we use a linear blendshape model that combines the facial expression models from [3] and [16]. This model is fixed, i.e. not learned. Hence, the expression deformations are directly applied to the high-resolution mesh. The vertex positions of the mesh that account for shape identity as well as facial expression are given by
v(α, δ) = v(α) + E_x δ ,   (3)
where E_x is the fixed blendshape basis and δ is the vector of blendshape parameters; the separation of identity and expression is explained next.
Separating Shape and Expression:
We ensure a separation of shape identity from facial expressions by imposing orthogonality between our learned shape identity basis and the fixed blendshape basis. To this end, we first represent the blendshape basis with respect to the deformation graph domain by solving for the graph-domain blendshape basis B in a least-squares sense; here, E_x is fixed. Then, we orthogonalize the columns of B. We propose the Orthogonal Complement Layer (OCL) to ensure that our learned identity basis E_g fulfills the orthogonality constraint Bᵀ E_g = 0. Our layer is defined in terms of the projection of E_g onto the orthogonal complement of B, i.e.,
E_g ← E_g − B (Bᵀ E_g)   (4)
    = (I − B Bᵀ) E_g .   (5)
The property Bᵀ E_g = 0 of the projected basis can easily be verified.
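The Orthogonal Complement Layer reduces to a single matrix projection. A small sketch with random matrices (all dimensions are illustrative; the orthonormalization of the graph-domain blendshape basis is done here via QR):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k_exp, k_id = 12, 3, 4          # graph dofs, expression dims, identity dims

# Graph-domain blendshape basis with orthonormalized columns (via QR).
B, _ = np.linalg.qr(rng.standard_normal((n, k_exp)))
E = rng.standard_normal((n, k_id))  # raw learnable identity basis

# Orthogonal Complement Layer: project E onto the complement of span(B).
E_ocl = E - B @ (B.T @ E)           # equivalently (I - B B^T) E

# The projected identity basis is now orthogonal to every blendshape.
assert np.allclose(B.T @ E_ocl, 0.0)
```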
Learnable Per-vertex Appearance Model:
The facial appearance is encoded in the vector
r(β) = r̄ + E_r β   (6)
that stacks all per-vertex colors represented as RGB triplets. The mean facial appearance r̄ and the appearance basis E_r are learnable, while the facial appearance parameters β are regressed. Note that we initialize the mean appearance to a constant skin tone and define the reflectance directly on the high-resolution mesh.
3.3 Differentiable Image Formation
To enable end-to-end self-supervised training, we employ a differentiable image formation model that maps 3D model space coordinates onto 2D screen space coordinates. The mapping is implemented as the composition Π ∘ Φ, where Φ and Π denote the rigid head pose transformation and the camera projection, respectively. We also apply a differentiable illumination model that transforms the illumination parameters γ as well as the per-vertex appearance r_i and normal n_i into the shaded per-vertex color c_i. We explain these two models in the following.
Camera Model:
We assume w.l.o.g. that the camera space corresponds to world space. We model the head pose via a rigid mapping Φ(v) = R v + t, defined by the global rotation R and the translation t. After mapping a vertex v_i from model space into camera space, the full perspective camera model Π projects the point into screen space.
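A minimal sketch of the rigid pose followed by a full perspective projection; the focal length and principal point are placeholder intrinsics, not values from the paper.

```python
import numpy as np

def project(v, R, t, f=1000.0, cx=120.0, cy=120.0):
    """Rigid pose Phi then perspective projection Pi (toy intrinsics)."""
    v_cam = R @ v + t                 # model space -> camera space
    x, y, z = v_cam
    # Perspective divide by depth, then shift by the principal point.
    return np.array([f * x / z + cx, f * y / z + cy])

R = np.eye(3)                  # identity rotation
t = np.array([0.0, 0.0, 5.0])  # place the face in front of the camera
p = project(np.zeros(3), R, t)
assert np.allclose(p, [120.0, 120.0])  # origin maps to the principal point
```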
Illumination Model:
Under the assumption of distant smooth illumination and purely Lambertian surface properties, we employ Spherical Harmonics (SH) [45] to represent the incident radiance at a vertex with normal n_i and appearance r_i as
c_i = r_i ⊙ Σ_b γ_b H_b(n_i) .   (7)
The illumination parameters γ stack the SH basis weights γ_b per color channel, i.e., each γ_b controls the illumination w.r.t. the red, green and blue channels.
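A sketch of per-vertex SH shading with the standard nine band-0..2 basis functions. The basis is written up to constant normalization factors, which can be absorbed into the learned illumination weights; the nine-band choice is an assumption consistent with common SH lighting models.

```python
import numpy as np

def sh_basis(n):
    """First 9 real spherical harmonics (bands 0-2), up to constant factors."""
    x, y, z = n
    return np.array([1.0, y, z, x, x * y, y * z,
                     3.0 * z ** 2 - 1.0, x * z, x ** 2 - y ** 2])

def shade(albedo, normal, gamma):
    """Eq. (7): shaded color = albedo * SH irradiance, per color channel.

    gamma has shape (9, 3): one weight per SH band and color channel."""
    irradiance = sh_basis(normal) @ gamma   # (3,) RGB irradiance
    return albedo * irradiance

albedo = np.array([0.8, 0.6, 0.5])
normal = np.array([0.0, 0.0, 1.0])
gamma = np.zeros((9, 3))
gamma[0] = 1.0                              # constant white illumination
assert np.allclose(shade(albedo, normal, gamma), albedo)
```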
3.4 Multiframe Consistent Face Model Learning
We propose a novel network for consistent multi-frame face model learning. It consists of M Siamese towers that simultaneously process the M frames of a multi-frame image in different streams, see Fig. 1. Each tower consists of an encoder that estimates frame-specific parameters and identity feature maps. Note that the jointly learned geometric identity and appearance model, which is common to all faces, is shared across streams.
Regressed Parameters:
We train our network in a self-supervised manner based on the multi-frame images. For each frame of a multi-frame image, we stack the frame-specific parameters regressed by a Siamese tower (see Parameter Estimation in Fig. 1), i.e., rigid pose, illumination γ and expression δ, in one vector. The frame-independent person-specific identity parameters α and β for the whole multi-frame image are pooled from all Siamese towers. In the following, Θ denotes all regressed frame-independent and frame-specific parameters.
Perframe Parameter Estimation Network:
We employ a convolutional network to extract low-level features. Based on these features, we apply a series of convolutions, ReLUs, and fully connected layers to regress the per-frame parameters. We refer to the supplemental document for further details.
Multi-frame Identity Estimation Network:
As explained in Section 3.1, each frame of our multi-frame input exhibits the same face identity under a different head pose and expression. We exploit this information and use a single identity estimation network (see Fig. 1) to impose the estimation of common identity parameters (shape α, appearance β) for all frames. This way, we model a hard constraint by design. More precisely, given the frame-specific low-level features obtained by the Siamese networks, we apply two additional convolution layers to extract medium-level features. The resulting medium-level feature maps are fused into a single multi-frame feature map via average pooling. Note that the average pooling operation allows us to handle a variable number of inputs. As such, we can perform monocular or multi-view reconstruction at test time, as demonstrated in Sec. 4. This pooled feature map is then fed to an identity parameter estimation network that is based on convolution layers, ReLUs, and fully connected layers. For details, we refer to the supplemental.
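The fusion step above amounts to plain average pooling over a variable-length list of per-tower feature maps, which is what makes the network agnostic to the number of input frames. A sketch with illustrative shapes:

```python
import numpy as np

def fuse_identity_features(per_frame_features):
    """Average-pool medium-level feature maps over a variable number of frames.

    per_frame_features: list of (H, W, C) arrays, one per Siamese tower."""
    return np.mean(np.stack(per_frame_features, axis=0), axis=0)

# Works for monocular (M = 1) and multi-frame (M > 1) input alike.
f1 = np.ones((4, 4, 8))
f2 = 3.0 * np.ones((4, 4, 8))
assert np.allclose(fuse_identity_features([f1]), f1)
assert np.allclose(fuse_identity_features([f1, f2]), 2.0)
```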
3.5 Loss Functions
Let Θ denote the regressed parameters, and let the trainable network weights include the face model itself, i.e., the bases E_g and E_r and the mean appearance r̄. Note that the face model is fully learned during training, whereas the network only infers Θ at test time. To measure the reconstruction quality during mini-batch gradient descent, we employ the following loss function:
L = w_photo L_photo + w_land L_land   (8)
  + w_smo L_smo + w_spa L_spa + w_exp L_exp ,   (9)
which is based on two data terms (8) and three regularization terms (9). We found the weights empirically and kept them fixed in all experiments, see the supplemental document for details.
Multiframe Photometric Consistency:
One of the key contributions of our approach is to enforce multi-frame consistency of the shared identity parameters α and β. This can be thought of as solving model-based non-rigid structure-from-motion (NSfM) on each of the multi-frame inputs during training. We do this by imposing the following photometric consistency loss with respect to each frame f:
L_photo = Σ_f Σ_{i ∈ V} ‖ I_f(p_i) − c_i ‖ .
Here, with abuse of notation, we use p_i to denote the projection of the i-th vertex into the screen space of frame f, c_i is its rendered color, and V is the set of all visible vertices, as determined by back-face culling in the forward pass. Note that the identity-related parameters are shared across all frames. This enables a better disentanglement of illumination and appearance, since only the illumination and head pose are allowed to change across the frames.
Multiframe Landmark Consistency:
To better constrain the problem, we also employ a sparse 2D landmark alignment constraint. It is based on a set of automatically detected 2D feature points l_{f,k} [50, 51] in each frame f. Each feature point comes with a confidence c_{f,k}, so that we use the loss
L_land = Σ_f Σ_k c_{f,k} ‖ p_{f,k} − l_{f,k} ‖² .
Here, p_{f,k} is the 2D position of the k-th mesh feature point in the screen space of frame f. We use sliding correspondences, akin to [59]. Note that the positions of the mesh landmarks depend both on the predicted per-frame parameters and the shared identity parameters.
Geometry Smoothness on Graph-level:
We employ a linearized membrane energy [14] to define a first-order geometric smoothness prior on the displacements d_i of the deformation graph nodes:
L_smo = Σ_i Σ_{j ∈ N_i} ‖ d_i − d_j ‖² ,   (10)
where N_i is the set of nodes that have a skinned vertex in common with the i-th node. Note that the graph parameterizes the geometric identity, i.e., it only depends on the shared identity parameters α. This term enforces smooth deformations of the parametric shape and leads to higher-quality reconstruction results.
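A direct transcription of the membrane energy on graph-node displacements, evaluated on a toy three-node graph:

```python
import numpy as np

def smoothness_loss(displacements, neighbors):
    """Linearized membrane energy on graph-node displacements (Eq. 10)."""
    loss = 0.0
    for i, d_i in enumerate(displacements):
        for j in neighbors[i]:
            loss += np.sum((d_i - displacements[j]) ** 2)
    return loss

# A common translation of all nodes costs nothing; moving one node does.
d = [np.array([1.0, 0.0, 0.0])] * 3
nbrs = {0: [1], 1: [0, 2], 2: [1]}
assert smoothness_loss(d, nbrs) == 0.0
assert smoothness_loss([d[0], d[1], d[2] + 1.0], nbrs) > 0.0
```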
Appearance Sparsity:
In our learned face model, skin appearance is parameterized on a per-vertex basis. To further constrain the underlying intrinsic decomposition problem, we employ a local per-vertex spatial reflectance sparsity prior as in [40, 11], defined as follows:
L_spa = Σ_i Σ_{j ∈ N_i} w_ij ‖ r_i − r_j ‖^p .   (11)
The per-edge weights w_ij model the similarity of neighboring vertices in terms of chroma and are defined as
w_ij = exp( −η ‖ h_i − h_j ‖ ) .
Here, h_i is the chroma of r_i, computed from the parameters predicted in the last forward pass. The constants p and η are fixed for training.
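The chroma-based edge weights can be sketched as follows; the specific chroma definition (color normalized by intensity) and the value of η are illustrative assumptions:

```python
import numpy as np

def chroma(rgb):
    """Chromaticity: color normalized by its overall intensity."""
    return rgb / max(np.sum(rgb), 1e-8)

def edge_weight(r_i, r_j, eta=10.0):
    """Similar chroma -> weight near 1 (strong reflectance smoothing);
    a chroma edge -> weight near 0 (the sparsity prior relaxes there)."""
    return np.exp(-eta * np.linalg.norm(chroma(r_i) - chroma(r_j)))

# Equal chroma despite different brightness -> full smoothing weight.
same = edge_weight(np.array([0.4, 0.2, 0.2]), np.array([0.8, 0.4, 0.4]))
# A red/green chroma edge -> the weight collapses toward zero.
diff = edge_weight(np.array([0.8, 0.1, 0.1]), np.array([0.1, 0.8, 0.1]))
assert np.isclose(same, 1.0)
assert diff < same
```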
Expression Regularization:
To prevent overfitting and enable a better learning of the identity basis, we regularize the magnitude of the expression parameters δ:
L_exp = Σ_f Σ_k ( δ_{f,k} / σ_k )² .   (12)
Here, δ_{f,k} is the k-th expression parameter of frame f, and σ_k is the corresponding standard deviation computed based on Principal Component Analysis (PCA).
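The expression regularizer in code, evaluated on toy parameters and standard deviations:

```python
import numpy as np

def expression_reg(deltas, sigmas):
    """Penalize expression parameters relative to their PCA std. dev. (Eq. 12).

    deltas: (F, K) expression parameters for F frames; sigmas: (K,)."""
    return float(np.sum((deltas / sigmas) ** 2))

deltas = np.array([[0.2, 0.0],   # frame 0
                   [0.0, 0.3]])  # frame 1
sigmas = np.array([0.1, 0.3])
# (0.2/0.1)^2 + (0.3/0.3)^2 = 4 + 1
assert np.isclose(expression_reg(deltas, sigmas), 5.0)
```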
4 Results
We show qualitative results of reconstructing geometry, reflectance and scene illumination from monocular images in Fig. 3. As our model is trained on a large corpus of multi-view images, it generalizes well to different ethnicities, even in the presence of facial hair and makeup. We implement and train our networks in TensorFlow [1]. We pretrain the expression model and then train the full network end-to-end. After convergence, the network is fine-tuned using a larger learning rate for the reflectance. We empirically found that this training strategy improves the capture of facial hair, makeup and eyelids, and thus the model’s generalization. Our method can also be applied to multi-frame reconstruction at test time. Fig. 4 shows that feeding two images simultaneously improves the consistency and quality of the obtained 3D reconstructions compared to the monocular case. Please note that we can successfully separate identity and expression due to our novel Orthogonal Complement Layer (OCL). For the experiments shown in the following sections, we trained our network on multi-frame images and used only one input image at test time, unless stated otherwise. Our networks take around 30 hours to train on a Titan V. Inference takes only 5.2 ms on a Titan Xp. More details, results, and experiments can be found in the supplemental document.
4.1 Comparisons to Monocular Approaches
State-of-the-art monocular reconstruction approaches that rely on an existing face model [60] or on synthetically generated data [52, 48] during training do not generalize well to faces outside the span of the model. As such, they cannot handle facial hair, makeup, and unmodeled expressions, see Fig. 6. Since we train our models on in-the-wild videos, we can capture these variations and thus generalize better in such challenging cases. We also compare to the refinement-based approaches of [59, 62]. Tran et al. [62] (see Fig. 7) refine a 3DMM [7] based on in-the-wild data. Our approach produces better geometry without requiring a 3DMM and, contrary to [62], it also separates albedo from illumination. The approach of Tewari et al. [59] (see Fig. 5) requires a 3DMM [7] as input and only learns shape and reflectance correctives. Since they learn from monocular data, their correctives are prone to artifacts, especially under occlusions or extreme head poses. In contrast, our approach learns a complete model from scratch based on multi-view supervision, thus improving robustness and reconstruction quality. We also compare to [12], which only learns a texture model, see Fig. 8. In contrast, our approach learns a model that separates albedo from illumination. Besides, their method needs a 3DMM [7] as initialization, while we start from a single constantly colored mesh and learn all variation modes (geometry and reflectance) from scratch.
4.2 Quantitative Results
         Ours                                                  [59] Fine   [59] Coarse   [60]
Train:   M = 1     M = 2     M = 4     M = 2     M = 4
Test:    M = 1     M = 1     M = 1     M = 2     M = 2
Mean     1.92 mm   1.82 mm   1.76 mm   1.80 mm   1.74 mm      1.83 mm     1.81 mm       3.22 mm
SD       0.48 mm   0.45 mm   0.44 mm   0.46 mm   0.43 mm      0.39 mm     0.47 mm       0.77 mm
         Ours       Others
         Learning   Learning    Learning      Learning   Learning   Optimization   Hybrid
                    [59] Fine   [59] Coarse   [60]       [32]       [24]           [58]
Mean     1.90 mm    1.84 mm     2.03 mm       2.19 mm    2.11 mm    1.59 mm        1.87 mm
SD       0.40 mm    0.38 mm     0.52 mm       0.54 mm    0.46 mm    0.30 mm        0.42 mm
Time     5.2 ms     4 ms        4 ms          4 ms       4 ms       120 s          110 ms
We also evaluate our reconstructions quantitatively on a subset of the BU-3DFE dataset [70], see Tab. 1. This dataset contains images and corresponding ground truth geometry of multiple people performing a variety of expressions, captured from two different viewpoints. We evaluate the importance of multi-frame training in the case of monocular reconstruction using the per-vertex root mean squared error based on a precomputed dense correspondence map. The lowest error is achieved with multi-view supervision during training, in comparison to monocular input data. Multi-view supervision can better resolve the depth ambiguity and thus learn a more accurate model. In addition, multi-view supervision also leads to a better disentanglement of reflectance and shading. We also evaluate the advantage of multi-frame input at test time. When both images corresponding to a shape are given, we consistently obtain better results. Further, our estimates are better than those of the state-of-the-art approach of [59]. Since [59] refines an existing 3DMM using only monocular images during training, it cannot resolve the depth ambiguity well. Thus, it does not improve over its coarse model on the non-frontal poses of BU-3DFE [70]. Similar to previous work, we also evaluate monocular reconstruction on 180 meshes of FaceWarehouse [17], see Tab. 2. We perform similarly to the 3DMM-based state of the art. Note that we do not use a precomputed 3DMM, but learn a model from scratch during training, unlike all other approaches in this comparison. For this test, we employ a model learned starting from an Asian mean face, as FaceWarehouse mainly contains Asian subjects. Our approach is agnostic to the chosen mean face and thus allows us this freedom.
5 Conclusion & Discussion
We have proposed a self-supervised approach for the joint multi-frame learning of a face model and a 3D face reconstruction network. Our model is learned from scratch based on a large corpus of in-the-wild video clips without available ground truth. Although we have demonstrated compelling results by learning from in-the-wild data, such data is often of low resolution, noisy, or blurred, which imposes a bound on the achievable quality. Nevertheless, our approach already matches or outperforms the state of the art in learning-based face reconstruction. We hope that it will inspire follow-up work and that multi-view supervision for learning 3D face reconstruction will receive more attention.
Acknowledgements
We thank TrueVisionSolutions Pty Ltd for providing the 2D face tracker, and the authors of [12, 48, 52, 62] for the comparisons. We also thank Franziska Müller for the video voiceover. This work was supported by the ERC Consolidator Grant 4DReply (770784), the Max Planck Center for Visual Computing and Communications (MPCVCC), and by Technicolor.
References

[1]
M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado,
A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving,
M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg,
D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens,
B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan,
F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and
X. Zheng.
TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
Software available from tensorflow.org. 
[2]
A. Agudo, L. Agapito, B. Calvo, and J. M. M. Montiel.
Good vibrations: A modal analysis approach for sequential non-rigid structure from motion.
In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR ’14, pages 1558–1565. IEEE Computer Society, 2014.
[3] O. Alexander, M. Rogers, W. Lambeth, M. Chiang, and P. Debevec. The Digital Emily Project: photoreal facial modeling and animation. In ACM SIGGRAPH Courses, pages 12:1–12:15. ACM, 2009.
 [4] A. Bas and W. A. P. Smith. Statistical transformer networks: learning shape and appearance models via self supervision. arXiv:1804.02541, 2018.
 [5] F. Bernard, P. Gemmar, F. Hertel, J. Goncalves, and J. Thunberg. Linear shape deformation models with local support using graphbased structured matrix factorisation. In CVPR, 2016.
 [6] V. Blanz, C. Basso, T. Poggio, and T. Vetter. Reanimating faces in images and video. In Computer graphics forum, pages 641–650. Wiley Online Library, 2003.
 [7] V. Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. In Proc. SIGGRAPH, pages 187–194. ACM Press/AddisonWesley Publishing Co., 1999.
 [8] F. Bogo, J. Romero, M. Loper, and M. J. Black. Faust: Dataset and evaluation for 3d mesh registration. In CVPR ’14, pages 3794–3801. IEEE Computer Society, 2014.
 [9] T. Bolkart and S. Wuhrer. A groupwise multilinear correspondence optimization for 3d faces. In ICCV, pages 3604–3612. IEEE Computer Society, 2015.
 [10] T. Bolkart and S. Wuhrer. A robust multilinear model learning framework for 3d faces. In CVPR, pages 4911–4919. IEEE Computer Society, 2016.
 [11] N. Bonneel, K. Sunkavalli, J. Tompkin, D. Sun, S. Paris, and H. Pfister. Interactive intrinsic video editing. ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2014), 33(6), 2014.
 [12] J. Booth, E. Antonakos, S. Ploumpis, G. Trigeorgis, Y. Panagakis, and S. Zafeiriou. 3d face morphable models “in-the-wild”. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [13] J. Booth, A. Roussos, S. Zafeiriou, A. Ponniah, and D. Dunaway. A 3d morphable model learnt from 10,000 faces. In CVPR, 2016.
 [14] M. Botsch and O. Sorkine. On linear variational surface deformation methods. IEEE Transactions on Visualization and Computer Graphics, 14(1):213–230, Jan. 2008.
 [15] S. Bouaziz, Y. Wang, and M. Pauly. Online modeling for real-time facial animation. ACM Trans. Graph., 32(4):40:1–40:10, 2013.
 [16] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. Facewarehouse: A 3D facial expression database for visual computing. IEEE TVCG, 20(3):413–425, 2014.
 [17] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. Facewarehouse: A 3D facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics, 20(3):413–425, Mar. 2014.
 [18] C. Cao, H. Wu, Y. Weng, T. Shao, and K. Zhou. Real-time facial animation with image-based dynamic avatars. ACM Trans. Graph., 35(4):126:1–126:12, 2016.
 [19] J. S. Chung, A. Nagrani, and A. Zisserman. Voxceleb2: Deep speaker recognition. In INTERSPEECH, 2018.
 [20] P. Ekman and E. L. Rosenberg. What the face reveals: Basic and applied studies of spontaneous expression using the Facial Action Coding System (FACS). Oxford University Press, USA, 1997.
 [21] G. Fyffe, A. Jones, O. Alexander, R. Ichikari, and P. Debevec. Driving high-resolution facial scans with video performance capture. ACM Trans. Graph., 34(1):8:1–8:14, 2014.
 [22] R. Garg, A. Roussos, and L. Agapito. Dense variational reconstruction of non-rigid surfaces from monocular video. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23–28, 2013, pages 1272–1279. IEEE Computer Society, 2013.
 [23] P. Garrido, L. Valgaerts, C. Wu, and C. Theobalt. Reconstructing detailed dynamic face geometry from monocular video. In ACM Trans. Graph. (Proceedings of SIGGRAPH Asia 2013), volume 32, pages 158:1–158:10, November 2013.
 [24] P. Garrido, M. Zollhöfer, D. Casas, L. Valgaerts, K. Varanasi, P. Pérez, and C. Theobalt. Reconstruction of personalized 3D face rigs from monocular video. ACM Transactions on Graphics, 35(3):28:1–15, June 2016.
 [25] P. Garrido, M. Zollhöfer, C. Wu, D. Bradley, P. Pérez, T. Beeler, and C. Theobalt. Corrective 3d reconstruction of lips from monocular video. ACM Trans. Graph., 35(6):219:1–219:11, 2016.
 [26] N. Hasler, C. Stoll, M. Sunkel, B. Rosenhahn, and H.-P. Seidel. A statistical model of human pose and body shape. Comput. Graph. Forum, 28(2):337–346, 2009.
 [27] P. Hsieh, C. Ma, J. Yu, and H. Li. Unconstrained real-time facial performance capture. In CVPR, pages 1675–1683. IEEE Computer Society, 2015.
 [28] A. E. Ichim, S. Bouaziz, and M. Pauly. Dynamic 3d avatar creation from handheld video input. ACM Trans. Graph., 34(4):45:1–45:14, 2015.
 [29] A. Kanazawa, S. Tulsiani, A. A. Efros, and J. Malik. Learning category-specific mesh reconstruction from image collections. In ECCV, volume 11219 of Lecture Notes in Computer Science, pages 386–402. Springer, 2018.
 [30] I. Kemelmacher-Shlizerman. Internet-based morphable model. In ICCV, pages 3256–3263, 2013.
 [31] I. Kemelmacher-Shlizerman and S. M. Seitz. Face reconstruction in the wild. In ICCV, pages 1746–1753, 2011.
 [32] H. Kim, M. Zollhöfer, A. Tewari, J. Thies, C. Richardt, and C. Theobalt. InverseFaceNet: Deep Single-Shot Inverse Face Rendering From A Single Image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 [33] M. Klaudiny, S. McDonagh, D. Bradley, T. Beeler, and K. Mitchell. Real-Time Multi-View Facial Capture with Synthetic Training. Comput. Graph. Forum, 2017.

 [34] S. Laine, T. Karras, T. Aila, A. Herva, S. Saito, R. Yu, H. Li, and J. Lehtinen. Production-level facial performance capture using deep convolutional neural networks. In SCA, pages 10:1–10:10. ACM, 2017.
 [35] H. Li, J. Yu, Y. Ye, and C. Bregler. Real-time facial animation with on-the-fly correctives. ACM Trans. Graph., 32(4):42:1–42:10, 2013.
 [36] T. Li, T. Bolkart, M. J. Black, H. Li, and J. Romero. Learning a model of facial shape and expression from 4d scans. ACM Trans. Graph., 36(6):194:1–194:17, 2017.
 [37] S. Liang, L. G. Shapiro, and I. Kemelmacher-Shlizerman. Head reconstruction from internet photos. In European Conference on Computer Vision, pages 360–374. Springer, 2016.
 [38] M. Lüthi, T. Gerig, C. Jud, and T. Vetter. Gaussian process morphable models. PAMI, 40(8):1860–1873, 2018.
 [39] S. McDonagh, M. Klaudiny, D. Bradley, T. Beeler, I. Matthews, and K. Mitchell. Synthetic prior design for real-time face tracking. In 3DV, pages 639–648, 2016.
 [40] A. Meka, M. Zollhöfer, C. Richardt, and C. Theobalt. Live intrinsic video. ACM Transactions on Graphics (Proceedings SIGGRAPH), 35(4), 2016.
 [41] T. Neumann, K. Varanasi, S. Wenger, M. Wacker, M. Magnor, and C. Theobalt. Sparse localized deformation components. ACM Trans. Graph., 32(6):179:1–179:10, 2013.
 [42] K. Olszewski, J. J. Lim, S. Saito, and H. Li. High-fidelity facial and speech animation for VR HMDs. ACM Transactions on Graphics (Proceedings SIGGRAPH Asia 2016), 35(6), 2016.
 [43] M. Piotraschke and V. Blanz. Automated 3d face reconstruction from multiple images using quality measures. In CVPR, pages 3418–3427. IEEE Computer Society, 2016.
 [44] L. Pishchulin, S. Wuhrer, T. Helten, C. Theobalt, and B. Schiele. Building statistical shape spaces for 3d human modeling. Pattern Recognition, 67:276–286, 2017.
 [45] R. Ramamoorthi and P. Hanrahan. A signal-processing framework for inverse rendering. In Proc. SIGGRAPH, pages 117–128. ACM, 2001.
 [46] A. Ranjan, T. Bolkart, S. Sanyal, and M. J. Black. Generating 3d faces using convolutional mesh autoencoders. In ECCV ’18, volume 11207 of Lecture Notes in Computer Science, pages 725–741. Springer, 2018.
 [47] E. Richardson, M. Sela, and R. Kimmel. 3D face reconstruction by learning from synthetic data. In 3DV, 2016.
 [48] E. Richardson, M. Sela, R. OrEl, and R. Kimmel. Learning detailed face reconstruction from a single image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [49] J. Roth, Y. Tong, and X. Liu. Adaptive 3d face reconstruction from unconstrained photo collections. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(11):2127–2141, 2017.
 [50] J. M. Saragih, S. Lucey, and J. F. Cohn. Deformable model fitting by regularized landmark mean-shift. International Journal of Computer Vision, 91(2):200–215, 2011.
 [51] J. M. Saragih, S. Lucey, and J. F. Cohn. Real-time avatar animation from a single image. In FG, pages 213–220. IEEE, 2011.

 [52] M. Sela, E. Richardson, and R. Kimmel. Unrestricted Facial Geometry Reconstruction Using Image-to-Image Translation. In ICCV, 2017.
 [53] S. Sengupta, A. Kanazawa, C. D. Castillo, and D. W. Jacobs. SfSNet: Learning shape, reflectance and illuminance of faces in the wild. In Computer Vision and Pattern Recognition (CVPR), 2018.
 [54] F. Shi, H.-T. Wu, X. Tong, and J. Chai. Automatic acquisition of high-fidelity facial performances using monocular videos. ACM Trans. Graph., 33(6):222:1–222:13, 2014.
 [55] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing with intrinsic image disentangling. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [56] R. W. Sumner, J. Schmid, and M. Pauly. Embedded deformation for shape manipulation. In ACM Transactions on Graphics (TOG), volume 26, page 80. ACM, 2007.
 [57] S. Suwajanakorn, I. Kemelmacher-Shlizerman, and S. M. Seitz. Total moving face reconstruction. In ECCV, pages 796–812, 2014.
 [58] A. Tewari, M. Zollhöfer, F. Bernard, P. Garrido, H. Kim, P. Pérez, and C. Theobalt. High-fidelity monocular face reconstruction based on an unsupervised model-based face autoencoder. PAMI, 2018.
 [59] A. Tewari, M. Zollhöfer, P. Garrido, F. Bernard, H. Kim, P. Pérez, and C. Theobalt. Self-supervised multi-level face model learning for monocular reconstruction at over 250 Hz. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
 [60] A. Tewari, M. Zollhöfer, H. Kim, P. Garrido, F. Bernard, P. Pérez, and C. Theobalt. MoFA: Model-based Deep Convolutional Face Autoencoder for Unsupervised Monocular Reconstruction. In ICCV, 2017.
 [61] J. Thies, M. Zollhöfer, M. Nießner, L. Valgaerts, M. Stamminger, and C. Theobalt. Real-time expression transfer for facial reenactment. ACM Trans. Graph., 34(6):183:1–183:14, 2015.
 [62] L. Tran and X. Liu. Nonlinear 3d face morphable model. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, June 2018.
 [63] L. Tran and X. Liu. On learning 3d face morphable model from in-the-wild images. arXiv:1808.09560, 2018.
 [64] A. Tuan Tran, T. Hassner, I. Masi, and G. Medioni. Regressing robust and discriminative 3d morphable models with a very deep neural network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
 [65] S. Tulsiani, A. Kar, J. Carreira, and J. Malik. Learning category-specific deformable 3d models for object reconstruction. PAMI, 39(4):719–731, 2017.
 [66] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In CVPR, pages 209–217. IEEE Computer Society, 2017.
 [67] O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.
 [68] D. Vlasic, M. Brand, H. Pfister, and J. Popović. Face transfer with multilinear models. ACM Trans. Graph., 24(3):426–433, July 2005.
 [69] C. Wu, D. Bradley, P. Garrido, M. Zollhöfer, C. Theobalt, M. Gross, and T. Beeler. Model-based teeth reconstruction. ACM Trans. Graph., 35(6):220:1–220:13, 2016.
 [70] L. Yin, X. Wei, Y. Sun, J. Wang, and M. J. Rosato. A 3d facial expression database for facial behavior research. In International Conference on Automatic Face and Gesture Recognition (FGR06), pages 211–216, 2006.
 [71] M. Zollhöfer, J. Thies, P. Garrido, D. Bradley, T. Beeler, P. Pérez, M. Stamminger, M. Nießner, and C. Theobalt. State of the Art on Monocular 3D Face Reconstruction, Tracking, and Applications. Comput. Graph. Forum (Eurographics State of the Art Reports 2018), 37(2), 2018.