1 Introduction
During the past few years, we have witnessed significant improvements in various face analysis tasks such as face detection [20, 43] and 2D facial landmark localization on static images [41, 22, 7, 39, 44, 5, 6, 37]. This is primarily attributed to the considerable effort the community has made to collect and annotate facial images captured under unconstrained conditions [25, 46, 10, 33, 32] (commonly referred to as “in-the-wild”), and to the discriminative methodologies that can capitalise on the availability of such large amounts of data. Nevertheless, discriminative techniques cannot be applied to 3D facial shape estimation “in-the-wild”, due to the lack of ground-truth data.
3D facial shape estimation from single images has attracted the attention of many researchers over the past twenty years. The two main lines of research are (i) fitting a 3D Morphable Model (3DMM) [12, 13] and (ii) applying Shape from Shading (SfS) techniques [35, 36, 23]. The 3DMM fitting proposed in the work of Blanz and Vetter [12, 13] was among the first model-based 3D facial recovery approaches. The method requires the construction of a 3DMM, which is a statistical model of facial texture and shape in a space where there are explicit correspondences. The first 3DMM was built using 200 faces captured in well-controlled conditions displaying only the neutral expression. That is the reason why the method was only shown to work on real-world, but not “in-the-wild”, images. State-of-the-art SfS techniques capitalise on special multilinear decompositions that find an approximate spherical harmonic decomposition of the illumination. Furthermore, in order to benefit from the large availability of “in-the-wild” images, these methods jointly reconstruct large collections of images. Nevertheless, even though the results of [35, 23] are quite interesting, given that there is no prior of the facial surface, the methods only recover 2.5D representations of the faces and particular smooth approximations of the facial normals.
3D facial shape recovery from a single image under “in-the-wild” conditions is still an open and challenging problem in computer vision, mainly due to the fact that:

The general problem of extracting the 3D facial shape from a single image is an ill-posed problem which is notoriously difficult to solve without the use of any statistical priors for the shape and texture of faces. That is, without prior knowledge regarding the shape of the object at hand there are inherent ambiguities present in the problem. The pixel intensity at a location in an image is the result of a complex combination of the underlying shape of the object, the surface albedo and normal characteristics, camera parameters and the arrangement of scene lighting and other objects in the scene. Hence, there are potentially infinitely many solutions to the problem.

Learning statistical priors of the 3D facial shape and texture for “in-the-wild” images is currently very difficult using modern acquisition devices. That is, even though there has been a considerable improvement in 3D acquisition devices, they still cannot operate in arbitrary conditions. Hence, all the current 3D facial databases have been captured in controlled conditions.
With the available 3D facial data, it is feasible to learn a powerful statistical model of the facial shape that generalises well for both identity and expression [15, 31, 14]. However, it is not possible to construct a statistical model of the facial texture that generalises well for “in-the-wild” images and is, at the same time, in correspondence with the statistical shape model. That is the reason why current state-of-the-art 3D face reconstruction methodologies rely solely on fitting a statistical 3D facial shape prior on a sparse set of landmarks [3, 17].
In this paper, we make a number of contributions that enable the use of 3DMMs for “in-the-wild” face reconstruction (Fig. 1). In particular, our contributions are:

We propose a methodology for learning a statistical texture model from “in-the-wild” facial images, which is in full correspondence with a statistical shape prior that exhibits both identity and expression variations. Motivated by the success of feature-based (e.g., HOG [16], SIFT [26]) Active Appearance Models (AAMs) [4, 5], we further show how to learn feature-based texture models for 3DMMs. We show that the advantage of using the “in-the-wild” feature-based texture model is that the fitting strategy is simplified, since there is no need to optimize with respect to the illumination parameters.

By capitalising on the recent advancements in fitting statistical deformable models [30, 38, 5, 2], we propose a novel and fast algorithm for fitting “in-the-wild” 3DMMs. Furthermore, we make the implementation of our algorithm publicly available, which we believe can be of great benefit to the community, given the lack of robust open-source implementations for fitting 3DMMs.

Due to the lack of ground-truth data, the majority of 3D face reconstruction papers report only qualitative results. In this paper, in order to provide quantitative evaluations, we collected a new dataset of 3D facial surfaces, using Kinect Fusion [19, 29], which has many “in-the-wild” characteristics, even though it is captured indoors.

We release an open-source implementation of our technique as part of the Menpo Project [1].
The remainder of the paper is structured as follows. In Section 2 we elaborate on the construction of our “in-the-wild” 3DMM, whilst in Section 3 we outline the proposed optimization for fitting “in-the-wild” images with our model. Section 4 describes our new dataset, the first of its kind to provide images with a ground-truth 3D facial shape that exhibit many “in-the-wild” characteristics. We outline a series of quantitative and qualitative experiments in Section 5, and end with conclusions in Section 6.
2 Model Training
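A 3DMM consists of three parametric models: the shape, camera and texture models.
2.1 Shape Model
Let us denote the 3D mesh (shape) of an object with $N$ vertexes as a $3N \times 1$ vector
$$\mathbf{s} = \left[\mathbf{x}_1^T, \ldots, \mathbf{x}_N^T\right]^T = \left[x_1, y_1, z_1, \ldots, x_N, y_N, z_N\right]^T \tag{1}$$
where $\mathbf{x}_i = [x_i, y_i, z_i]^T$ are the object-centered Cartesian coordinates of the $i$-th vertex. A 3D shape model can be constructed by first bringing a set of 3D training meshes into dense correspondence so that each is described with the same number of vertexes and all samples have a shared semantic ordering. The corresponded meshes, $\{\mathbf{s}_i\}$, are then brought into a shape space by applying Generalized Procrustes Analysis, and then Principal Component Analysis (PCA) is performed, which results in $\{\bar{\mathbf{s}}, \mathbf{U}_s\}$, where $\bar{\mathbf{s}} \in \mathbb{R}^{3N}$ is the mean shape vector and $\mathbf{U}_s \in \mathbb{R}^{3N \times n_s}$ is the orthonormal basis after keeping the first $n_s$ principal components. This model can be used to generate novel 3D shape instances using the function $\mathcal{S}: \mathbb{R}^{n_s} \to \mathbb{R}^{3N}$ as
$$\mathcal{S}(\mathbf{p}) = \bar{\mathbf{s}} + \mathbf{U}_s \mathbf{p} \tag{2}$$
where $\mathbf{p} = [p_1, \ldots, p_{n_s}]^T$ are the $n_s$ shape parameters.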
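As a minimal sketch of how a linear PCA shape model generates an instance (mean plus basis times parameters), the snippet below uses plain Python lists. The function name `shape_instance` and the toy two-vertex model are hypothetical and chosen only for illustration; they are not taken from the authors' implementation.

```python
def shape_instance(mean_shape, basis, params):
    """Return mean_shape + basis @ params using plain Python lists.

    mean_shape: list of 3N floats (x1, y1, z1, ..., xN, yN, zN)
    basis:      list of 3N rows, each holding one value per principal component
    params:     list of shape parameters, one per principal component
    """
    if any(len(row) != len(params) for row in basis):
        raise ValueError("basis must have one column per shape parameter")
    return [m + sum(u * p for u, p in zip(row, params))
            for m, row in zip(mean_shape, basis)]


# Toy model (illustrative values only): 2 vertexes, 2 principal components.
mean_shape = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0]
basis = [
    [1.0, 0.0],
    [0.0, 0.0],
    [0.0, 0.0],
    [0.0, 1.0],
    [0.0, 0.0],
    [0.0, 0.0],
]
instance = shape_instance(mean_shape, basis, [0.5, -2.0])
```

Setting all parameters to zero returns the mean shape, which is a quick sanity check for any such model.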
2.2 Camera Model
The purpose of the camera model is to map (project) the object-centered Cartesian coordinates of a 3D mesh instance into 2D Cartesian coordinates on an image plane. In this work, we employ a pinhole camera model, which utilizes a perspective transformation. However, an orthographic projection model can also be used in the same way.
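Perspective projection. The projection of a 3D point $\mathbf{x} = [x, y, z]^T$ into its 2D location in the image plane $\mathbf{x}' = [x', y']^T$ involves two steps. First, the 3D point is rotated and translated using a linear view transformation, under the assumption that the camera is still
$$[v_x, v_y, v_z]^T = \mathbf{R}_v \mathbf{x} + \mathbf{t}_v \tag{3}$$
where $\mathbf{R}_v \in \mathbb{R}^{3 \times 3}$ and $\mathbf{t}_v = [t_x, t_y, t_z]^T$ are the 3D rotation and translation components, respectively. Then, a nonlinear perspective transformation is applied as
$$\mathbf{x}' = \frac{f}{v_z} \left[v_x, v_y\right]^T + \left[c_x, c_y\right]^T \tag{4}$$
where $f$ is the focal length in pixel units (we assume that the $x$ and $y$ components of the focal length are equal) and $[c_x, c_y]^T$ is the principal point that is set to the image center.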
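The two projection steps above (a rigid view transform followed by a perspective divide) can be sketched in a few lines of plain Python. The function names and the numeric values are hypothetical and serve only as an illustration of a generic pinhole camera; sign and axis conventions vary between libraries.

```python
def view_transform(point, rotation, translation):
    """Rigid view transform v = R x + t (rotation is a 3x3 nested list)."""
    return [sum(r * x for r, x in zip(row, point)) + t
            for row, t in zip(rotation, translation)]


def perspective_project(point, rotation, translation, focal, principal_point):
    """Pinhole projection of one 3D point to 2D pixel coordinates."""
    vx, vy, vz = view_transform(point, rotation, translation)
    if vz == 0:
        raise ValueError("point lies on the camera plane (v_z = 0)")
    cx, cy = principal_point
    # Perspective divide, scale by the focal length, shift by the principal point.
    return [focal * vx / vz + cx, focal * vy / vz + cy]


# Illustrative values: identity rotation, camera two units away along z.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixel = perspective_project([0.1, -0.2, 0.0], identity, [0.0, 0.0, 2.0],
                            focal=1000.0, principal_point=(320.0, 240.0))
```

With these numbers the point lands 50 pixels to the right of and 100 pixels before the principal point along the second axis, which makes the effect of the divide by depth easy to check by hand.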
Quaternions. We parametrize the 3D rotation with quaternions [24, 40]. The quaternion uses four parameters $\mathbf{q} = [q_0, q_1, q_2, q_3]^T$ in order to express a 3D rotation as
$$\mathbf{R}_v = 2 \begin{bmatrix} \frac{1}{2} - q_2^2 - q_3^2 & q_1 q_2 - q_0 q_3 & q_1 q_3 + q_0 q_2 \\ q_1 q_2 + q_0 q_3 & \frac{1}{2} - q_1^2 - q_3^2 & q_2 q_3 - q_0 q_1 \\ q_1 q_3 - q_0 q_2 & q_2 q_3 + q_0 q_1 & \frac{1}{2} - q_1^2 - q_2^2 \end{bmatrix} \tag{5}$$
Note that by enforcing a unit norm constraint on the quaternion vector, i.e. $\mathbf{q}^T \mathbf{q} = 1$, the rotation matrix constraints of orthogonality with unit determinant are satisfied. Given the unit norm property, the quaternion can be seen as a three-parameter vector $[q_1, q_2, q_3]^T$ and a scalar $q_0 = \sqrt{1 - q_1^2 - q_2^2 - q_3^2}$. Most existing works on 3DMMs parametrize the rotation matrix using the three Euler angles that define the rotations around the horizontal, vertical and camera axes. Even though Euler angles are more naturally interpretable, they have strong disadvantages when employed within an optimization procedure, most notably the solution ambiguity and the gimbal lock effect. Parametrization based on quaternions overcomes these disadvantages and further ensures computational efficiency, robustness and simpler differentiation.
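To make the quaternion parametrization concrete, the following sketch converts a quaternion (scalar part first) into a rotation matrix using the standard unit-quaternion formula, and recovers the scalar part from the three-parameter vector part. The function names are hypothetical, and the explicit re-normalisation is an added safeguard rather than something stated in the text.

```python
import math


def quaternion_to_rotation(q):
    """Rotation matrix of a quaternion q = (q0, q1, q2, q3), q0 being the scalar part."""
    norm = math.sqrt(sum(v * v for v in q))
    if norm == 0.0:
        raise ValueError("the zero quaternion does not encode a rotation")
    q0, q1, q2, q3 = (v / norm for v in q)  # enforce the unit-norm constraint
    return [
        [1 - 2 * (q2 * q2 + q3 * q3), 2 * (q1 * q2 - q0 * q3), 2 * (q1 * q3 + q0 * q2)],
        [2 * (q1 * q2 + q0 * q3), 1 - 2 * (q1 * q1 + q3 * q3), 2 * (q2 * q3 - q0 * q1)],
        [2 * (q1 * q3 - q0 * q2), 2 * (q2 * q3 + q0 * q1), 1 - 2 * (q1 * q1 + q2 * q2)],
    ]


def scalar_from_vector_part(q1, q2, q3):
    """Recover the non-negative scalar part of a unit quaternion from its vector part."""
    return math.sqrt(max(0.0, 1.0 - (q1 * q1 + q2 * q2 + q3 * q3)))


# A 90-degree rotation about the z axis.
half_angle = math.pi / 4
R = quaternion_to_rotation((math.cos(half_angle), 0.0, 0.0, math.sin(half_angle)))
```

For this example `R` maps the x axis onto the y axis, as expected for a quarter turn about z.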
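Camera function. The projection operation performed by the camera model of the 3DMM can be expressed with the function $\mathcal{P}(\mathbf{s}, \mathbf{c}): \mathbb{R}^{3N} \to \mathbb{R}^{2N}$, which applies the transformations of Eqs. 3 and 4 on the points of the provided 3D mesh $\mathbf{s}$, with
$$\mathbf{c} = \left[f, q_1, q_2, q_3, t_x, t_y, t_z\right]^T \tag{6}$$
being the vector of camera parameters (focal length, vector part of the quaternion and translation) with length $n_c = 7$. For abbreviation purposes, we represent the camera model of the 3DMM with the function $\mathcal{W}: \mathbb{R}^{n_s + n_c} \to \mathbb{R}^{2N}$ as
$$\mathcal{W}(\mathbf{p}, \mathbf{c}) \equiv \mathcal{P}\left(\mathcal{S}(\mathbf{p}), \mathbf{c}\right) \tag{7}$$
where $\mathcal{S}(\mathbf{p})$ is a 3D mesh instance using Eq. 2.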
2.3 “In-the-Wild” Feature-Based Texture Model
The generation of an “in-the-wild” texture model is a key component of the proposed 3DMM. To this end, we take advantage of the existing large facial “in-the-wild” databases that are annotated in terms of sparse landmarks. Assume that for a set of $M$ “in-the-wild” images $\{\mathbf{I}_i\}_{i=1}^{M}$, we have access to the associated camera and shape parameters $\{\mathbf{p}_i, \mathbf{c}_i\}$. Let us also define a dense feature extraction function
$$\mathcal{F}: \mathbb{R}^{H \times W} \to \mathbb{R}^{H \times W \times C} \tag{8}$$
where $C$ is the number of channels of the feature-based image and $H \times W$ is the image size. For each image, we first compute its feature-based representation as $\mathbf{F}_i = \mathcal{F}(\mathbf{I}_i)$ and then use Eq. 7 to sample it at each vertex location to build back a vectorized texture sample $\mathbf{t}_i = \mathbf{F}_i\left(\mathcal{W}(\mathbf{p}_i, \mathbf{c}_i)\right) \in \mathbb{R}^{CN}$. This texture sample will be nonsensical for some regions, mainly due to self-occlusions present in the mesh projected in the image space $\mathcal{W}(\mathbf{p}_i, \mathbf{c}_i)$. To alleviate these issues, we cast a ray from the camera to each vertex and test for self-intersections with the triangulation of the mesh in order to learn a per-vertex occlusion mask $\mathbf{m}_i \in \mathbb{R}^{N}$ for the projected sample.
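Let us create the matrix $\mathbf{X} = [\mathbf{t}_1, \ldots, \mathbf{t}_M] \in \mathbb{R}^{CN \times M}$ by concatenating the $M$ grossly corrupted feature-based texture vectors with missing entries that are represented by the masks $\{\mathbf{m}_i\}_{i=1}^{M}$. To robustly build a texture model based on this heavily contaminated incomplete data, we need to recover a low-rank matrix $\mathbf{L} \in \mathbb{R}^{CN \times M}$ representing the clean facial texture and a sparse matrix $\mathbf{E} \in \mathbb{R}^{CN \times M}$ accounting for gross but sparse non-Gaussian noise such that $\mathbf{X} = \mathbf{L} + \mathbf{E}$. To simultaneously recover both $\mathbf{L}$ and $\mathbf{E}$ from incomplete and grossly corrupted observations, the Principal Component Pursuit with missing values [34] is solved
$$\begin{aligned} \arg\min_{\mathbf{L}, \mathbf{E}} \;\; & \|\mathbf{L}\|_* + \lambda \|\mathbf{E}\|_1 \\ \text{s.t.} \;\; & \mathcal{P}_{\Omega}(\mathbf{X}) = \mathcal{P}_{\Omega}(\mathbf{L} + \mathbf{E}) \end{aligned} \tag{9}$$
where $\|\cdot\|_*$ denotes the nuclear norm, $\|\cdot\|_1$ is the matrix $\ell_1$ norm and $\lambda > 0$ is a regularizer. $\Omega$ represents the set of locations corresponding to the observed entries of $\mathbf{X}$ (i.e., $(i, j) \in \Omega$ if $x_{ij}$ is observed). Then, $\mathcal{P}_{\Omega}(\mathbf{X})$ is defined as the projection of the matrix $\mathbf{X}$ on the observed entries $\Omega$, namely $\mathcal{P}_{\Omega}(\mathbf{X})_{ij} = x_{ij}$ if $(i, j) \in \Omega$ and $\mathcal{P}_{\Omega}(\mathbf{X})_{ij} = 0$ otherwise. The unique solution of the convex optimization problem in Eq. 9 is found by employing an Alternating Direction Method of Multipliers-based algorithm [11].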
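Two of the ingredients of such a problem are easy to illustrate in isolation: the projection onto the observed entries and the entry-wise soft-thresholding operator, which is the proximal operator of the matrix $\ell_1$ norm and is commonly used to update the sparse term in ADMM-style solvers. The sketch below shows only these two building blocks with hypothetical function names; it is not the solver used by the authors and it omits the low-rank (singular value) update entirely.

```python
def project_observed(matrix, observed):
    """Keep the entries flagged as observed and set all other entries to zero."""
    return [[x if keep else 0.0 for x, keep in zip(row, flags)]
            for row, flags in zip(matrix, observed)]


def soft_threshold(matrix, tau):
    """Entry-wise shrinkage: the proximal operator of tau times the matrix l1 norm."""
    def shrink(x):
        if x > tau:
            return x - tau
        if x < -tau:
            return x + tau
        return 0.0
    return [[shrink(x) for x in row] for row in matrix]


# Illustrative 2x2 data matrix with one unobserved (masked) entry.
X = [[3.0, -0.5], [0.2, -4.0]]
observed = [[True, False], [True, True]]
masked = project_observed(X, observed)
sparse_part = soft_threshold(X, 1.0)
```

Small entries are set exactly to zero by the shrinkage, which is what makes the recovered error matrix sparse.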
The final texture model is created by applying PCA on the set of reconstructed feature-based textures acquired from the previous procedure. This results in $\{\bar{\mathbf{t}}, \mathbf{U}_t\}$, where $\bar{\mathbf{t}} \in \mathbb{R}^{CN}$ is the mean texture vector and $\mathbf{U}_t \in \mathbb{R}^{CN \times n_t}$ is the orthonormal basis after keeping the first $n_t$ principal components. This model can be used to generate novel 3D feature-based texture instances with the function $\mathcal{T}: \mathbb{R}^{n_t} \to \mathbb{R}^{CN}$ as
$$\mathcal{T}(\boldsymbol{\lambda}) = \bar{\mathbf{t}} + \mathbf{U}_t \boldsymbol{\lambda} \tag{10}$$
where $\boldsymbol{\lambda} = [\lambda_1, \ldots, \lambda_{n_t}]^T$ are the $n_t$ texture parameters.
Finally, an iterative procedure is used in order to refine the texture. That is, we start with the 3D fits obtained using only the 2D landmarks [21]. Then, a texture model is learned using the above procedure. This texture model is then used with the proposed 3DMM fitting algorithm on the same data, and the texture model is refined.
3 Model Fitting
We propose to fit the 3DMM on an input image using Gauss-Newton iterative optimization. To this end, we first formulate the cost function and then present two optimization procedures.
3.1 Cost Function
The overall cost function of the proposed 3DMM formulation consists of a texture-based term, an optional error term based on sparse 2D landmarks and optional regularization terms on the parameters.
Texture reconstruction cost. The main term of the optimization problem is the one that aims to estimate the shape, texture and camera parameters that minimize the squared $\ell_2$ norm of the difference between the image feature-based texture that corresponds to the projected 2D locations of the 3D shape instance and the texture instance of the 3DMM. Let us denote by $\mathbf{F} = \mathcal{F}(\mathbf{I})$ the feature-based representation with $C$ channels of an input image $\mathbf{I}$ using Eq. 8. Then, the texture reconstruction cost is expressed as
$$\arg\min_{\mathbf{p}, \mathbf{c}, \boldsymbol{\lambda}} \left\| \mathbf{F}\left(\mathcal{W}(\mathbf{p}, \mathbf{c})\right) - \mathcal{T}(\boldsymbol{\lambda}) \right\|^2 \tag{11}$$
Note that $\mathbf{F}\left(\mathcal{W}(\mathbf{p}, \mathbf{c})\right) \in \mathbb{R}^{CN}$ denotes the operation of sampling the feature-based input image on the projected 2D locations of the 3D shape instance acquired by the camera model (Eq. 7).
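Regularization. In order to avoid over-fitting effects, we augment the cost function with two optional regularization terms over the shape and texture parameters. Let us denote as $\boldsymbol{\Sigma}_s \in \mathbb{R}^{n_s \times n_s}$ and $\boldsymbol{\Sigma}_t \in \mathbb{R}^{n_t \times n_t}$ the diagonal matrices with the eigenvalues in their main diagonal for the shape and texture models, respectively. Based on the PCA nature of the shape and texture models, it is assumed that their parameters follow normal prior distributions, i.e. $\mathbf{p} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_s)$ and $\boldsymbol{\lambda} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma}_t)$. We formulate the regularization terms as the squared $\ell_2$ norms of the parameters’ vectors weighted with the corresponding inverse eigenvalues, i.e.
$$\arg\min_{\mathbf{p}, \boldsymbol{\lambda}} \; c_s \|\mathbf{p}\|^2_{\boldsymbol{\Sigma}_s^{-1}} + c_t \|\boldsymbol{\lambda}\|^2_{\boldsymbol{\Sigma}_t^{-1}} \tag{12}$$
where $\|\mathbf{p}\|^2_{\boldsymbol{\Sigma}_s^{-1}} = \mathbf{p}^T \boldsymbol{\Sigma}_s^{-1} \mathbf{p}$, and $c_s$ and $c_t$ are constants that weight the contribution of the regularization terms in the cost function.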
2D landmarks cost. In order to rapidly adapt the camera parameters in the cost of Eq. 11, we further expand the optimization problem with the term
$$\arg\min_{\mathbf{p}, \mathbf{c}} \; c_l \left\| \mathcal{W}_l(\mathbf{p}, \mathbf{c}) - \mathbf{s}_l \right\|^2 \tag{13}$$
where $\mathbf{s}_l = [x_1, y_1, \ldots, x_L, y_L]^T$ denotes a set of $L$ sparse 2D landmark points ($L \ll N$) defined on the image coordinate system and $\mathcal{W}_l(\mathbf{p}, \mathbf{c})$ returns the $2L \times 1$ vector of 2D projected locations of these $L$ sparse landmarks. Intuitively, this term aims to drive the optimization procedure using the selected sparse landmarks as anchors for which we have the optimal locations $\mathbf{s}_l$. This optional landmarks-based cost is weighted with the constant $c_l$.
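Overall cost function. The overall 3DMM cost function is formulated as the sum of the terms in Eqs. 11, 12, 13, i.e.
$$\arg\min_{\mathbf{p}, \mathbf{c}, \boldsymbol{\lambda}} \left\| \mathbf{F}\left(\mathcal{W}(\mathbf{p}, \mathbf{c})\right) - \mathcal{T}(\boldsymbol{\lambda}) \right\|^2 + c_l \left\| \mathcal{W}_l(\mathbf{p}, \mathbf{c}) - \mathbf{s}_l \right\|^2 + c_s \|\mathbf{p}\|^2_{\boldsymbol{\Sigma}_s^{-1}} + c_t \|\boldsymbol{\lambda}\|^2_{\boldsymbol{\Sigma}_t^{-1}} \tag{14}$$
The landmarks term as well as the regularization terms are optional and aim to facilitate the optimization procedure in order to converge faster and to a better minimum. Note that thanks to the proposed “in-the-wild” feature-based texture model, the cost function does not include any parametric illumination model similar to the ones in the relevant literature [12, 13], which greatly simplifies the optimization.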
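As a rough illustration of how the terms of such a cost combine, the sketch below evaluates a texture residual term, a weighted landmark term and two eigenvalue-weighted regularizers from already computed residual vectors. All names, default weights and numbers are hypothetical; sampling the feature image and projecting the landmarks are assumed to have been done elsewhere.

```python
def squared_norm(vector):
    """Squared Euclidean norm of a list of floats."""
    return sum(x * x for x in vector)


def weighted_squared_norm(params, eigenvalues):
    """Squared norm weighted by inverse eigenvalues: sum_i params_i^2 / eigenvalue_i."""
    return sum(p * p / ev for p, ev in zip(params, eigenvalues))


def total_cost(texture_residual, landmark_residual, shape_params, texture_params,
               shape_eigenvalues, texture_eigenvalues, c_l=1.0, c_s=1.0, c_t=1.0):
    """Texture term + c_l * landmark term + c_s * shape prior + c_t * texture prior."""
    return (squared_norm(texture_residual)
            + c_l * squared_norm(landmark_residual)
            + c_s * weighted_squared_norm(shape_params, shape_eigenvalues)
            + c_t * weighted_squared_norm(texture_params, texture_eigenvalues))


# Illustrative numbers only.
cost = total_cost(texture_residual=[1.0, 2.0], landmark_residual=[3.0],
                  shape_params=[2.0], texture_params=[1.0],
                  shape_eigenvalues=[4.0], texture_eigenvalues=[0.5],
                  c_l=0.5, c_s=2.0, c_t=0.25)
```

Dividing by the eigenvalues penalises a parameter more strongly when its principal component explains little variance, which is the usual effect of a Gaussian prior on PCA coefficients.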
3.2 Gauss-Newton Optimization
Inspired by the extensive literature in Lucas-Kanade 2D image alignment [8, 28, 30, 38, 5, 2], we formulate a Gauss-Newton optimization framework. Specifically, given that the camera projection model is applied on the image part of Eq. 14, the proposed optimization has a “forward” nature.
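Parameters update. The shape, texture and camera parameters are updated in an additive manner, i.e.
$$\mathbf{p} \leftarrow \mathbf{p} + \Delta\mathbf{p}, \quad \boldsymbol{\lambda} \leftarrow \boldsymbol{\lambda} + \Delta\boldsymbol{\lambda}, \quad \mathbf{c} \leftarrow \mathbf{c} + \Delta\mathbf{c} \tag{15}$$
where $\Delta\mathbf{p}$, $\Delta\boldsymbol{\lambda}$ and $\Delta\mathbf{c}$ are their increments estimated at each fitting iteration. Note that in the case of the quaternion used to parametrize the 3D rotation matrix, the update is performed as the multiplication
$$\mathbf{q} \leftarrow (\Delta\mathbf{q})\,\mathbf{q} = \begin{bmatrix} \Delta q_0 \\ \Delta\mathbf{q}_{1:3} \end{bmatrix} \begin{bmatrix} q_0 \\ \mathbf{q}_{1:3} \end{bmatrix} = \begin{bmatrix} \Delta q_0\, q_0 - \Delta\mathbf{q}_{1:3}^T \mathbf{q}_{1:3} \\ \Delta q_0\, \mathbf{q}_{1:3} + q_0\, \Delta\mathbf{q}_{1:3} + \Delta\mathbf{q}_{1:3} \times \mathbf{q}_{1:3} \end{bmatrix} \tag{16}$$
However, we will still denote it as an addition for simplicity. Finally, we found that it is beneficial to keep the focal length constant in most cases, due to its ambiguity with $t_z$.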
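The multiplicative quaternion update can be sketched with the standard Hamilton product, with quaternions stored scalar part first. The function names are hypothetical, and the re-normalisation after the product is an added assumption to keep the estimate on the unit sphere under floating-point error.

```python
import math


def quaternion_multiply(a, b):
    """Hamilton product a * b for quaternions stored as (scalar, x, y, z)."""
    a0, a1, a2, a3 = a
    b0, b1, b2, b3 = b
    return (
        a0 * b0 - a1 * b1 - a2 * b2 - a3 * b3,
        a0 * b1 + a1 * b0 + a2 * b3 - a3 * b2,
        a0 * b2 - a1 * b3 + a2 * b0 + a3 * b1,
        a0 * b3 + a1 * b2 - a2 * b1 + a3 * b0,
    )


def update_rotation(q, dq):
    """Compose the increment dq with the current estimate q, then re-normalise."""
    w, x, y, z = quaternion_multiply(dq, q)
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    return (w / norm, x / norm, y / norm, z / norm)


# Two successive quarter turns about z give a half turn about z.
c = math.cos(math.pi / 4)
s = math.sin(math.pi / 4)
quarter_turn = (c, 0.0, 0.0, s)
half_turn = update_rotation(quarter_turn, quarter_turn)
```

Composing rotations by multiplication keeps the result a valid rotation, which a plain addition of quaternion components would not guarantee.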
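Linearization. By introducing the additive incremental updates on the parameters of Eq. 14, the cost function is expressed as
$$\arg\min_{\Delta\mathbf{p}, \Delta\mathbf{c}, \Delta\boldsymbol{\lambda}} \left\| \mathbf{F}\left(\mathcal{W}(\mathbf{p} + \Delta\mathbf{p}, \mathbf{c} + \Delta\mathbf{c})\right) - \mathcal{T}(\boldsymbol{\lambda} + \Delta\boldsymbol{\lambda}) \right\|^2 + c_l \left\| \mathcal{W}_l(\mathbf{p} + \Delta\mathbf{p}, \mathbf{c} + \Delta\mathbf{c}) - \mathbf{s}_l \right\|^2 + c_s \|\mathbf{p} + \Delta\mathbf{p}\|^2_{\boldsymbol{\Sigma}_s^{-1}} + c_t \|\boldsymbol{\lambda} + \Delta\boldsymbol{\lambda}\|^2_{\boldsymbol{\Sigma}_t^{-1}} \tag{17}$$
Note that the texture reconstruction and landmarks constraint terms of this cost function are non-linear due to the camera model operation. We need to linearize them around $(\mathbf{p}, \mathbf{c})$ using a first-order Taylor series expansion at $(\Delta\mathbf{p}, \Delta\mathbf{c}) = \mathbf{0}$. The linearization for the image term gives
$$\mathbf{F}\left(\mathcal{W}(\mathbf{p} + \Delta\mathbf{p}, \mathbf{c} + \Delta\mathbf{c})\right) \approx \mathbf{F}\left(\mathcal{W}(\mathbf{p}, \mathbf{c})\right) + \mathbf{J}_{\mathbf{F},\mathbf{p}} \Delta\mathbf{p} + \mathbf{J}_{\mathbf{F},\mathbf{c}} \Delta\mathbf{c} \tag{18}$$
where $\mathbf{J}_{\mathbf{F},\mathbf{p}} = \nabla\mathbf{F} \frac{\partial \mathcal{W}}{\partial \mathbf{p}}$ and $\mathbf{J}_{\mathbf{F},\mathbf{c}} = \nabla\mathbf{F} \frac{\partial \mathcal{W}}{\partial \mathbf{c}}$ are the image Jacobians with respect to the shape and camera parameters, respectively. Note that most dense feature-extraction functions $\mathcal{F}(\cdot)$ are non-differentiable, thus we simply compute the gradient of the multi-channel feature image $\nabla\mathbf{F}$. Similarly, the linearization on the sparse landmarks projection term gives
$$\mathcal{W}_l(\mathbf{p} + \Delta\mathbf{p}, \mathbf{c} + \Delta\mathbf{c}) \approx \mathcal{W}_l(\mathbf{p}, \mathbf{c}) + \mathbf{J}_{\mathcal{W}_l,\mathbf{p}} \Delta\mathbf{p} + \mathbf{J}_{\mathcal{W}_l,\mathbf{c}} \Delta\mathbf{c} \tag{19}$$
where $\mathbf{J}_{\mathcal{W}_l,\mathbf{p}} = \frac{\partial \mathcal{W}_l}{\partial \mathbf{p}}$ and $\mathbf{J}_{\mathcal{W}_l,\mathbf{c}} = \frac{\partial \mathcal{W}_l}{\partial \mathbf{c}}$ are the camera Jacobians. Please refer to the supplementary material for more details on the computation of these derivatives.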
3.2.1 Simultaneous
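Herein, we aim to simultaneously solve for all parameters’ increments. By substituting Eqs. 18 and 19 in Eq. 17 we get
$$\arg\min_{\Delta\mathbf{p}, \Delta\mathbf{c}, \Delta\boldsymbol{\lambda}} \left\| \mathbf{F}\left(\mathcal{W}(\mathbf{p}, \mathbf{c})\right) + \mathbf{J}_{\mathbf{F},\mathbf{p}} \Delta\mathbf{p} + \mathbf{J}_{\mathbf{F},\mathbf{c}} \Delta\mathbf{c} - \mathcal{T}(\boldsymbol{\lambda} + \Delta\boldsymbol{\lambda}) \right\|^2 + c_l \left\| \mathcal{W}_l(\mathbf{p}, \mathbf{c}) + \mathbf{J}_{\mathcal{W}_l,\mathbf{p}} \Delta\mathbf{p} + \mathbf{J}_{\mathcal{W}_l,\mathbf{c}} \Delta\mathbf{c} - \mathbf{s}_l \right\|^2 + c_s \|\mathbf{p} + \Delta\mathbf{p}\|^2_{\boldsymbol{\Sigma}_s^{-1}} + c_t \|\boldsymbol{\lambda} + \Delta\boldsymbol{\lambda}\|^2_{\boldsymbol{\Sigma}_t^{-1}} \tag{20}$$
Let us concatenate the parameters and their increments as $\mathbf{b} = [\mathbf{p}^T, \mathbf{c}^T, \boldsymbol{\lambda}^T]^T$ and $\Delta\mathbf{b} = [\Delta\mathbf{p}^T, \Delta\mathbf{c}^T, \Delta\boldsymbol{\lambda}^T]^T$. By taking the derivative of the final linearized cost function with respect to $\Delta\mathbf{b}$ and equalizing with zero, we get the solution
$$\Delta\mathbf{b} = -\mathbf{H}^{-1} \left( \mathbf{J}_{\mathbf{F}}^T \mathbf{e}_{\mathbf{F}} + c_l \mathbf{J}_l^T \mathbf{e}_l + \boldsymbol{\Lambda} \mathbf{b} \right) \tag{21}$$
where $\mathbf{H}$ is the Hessian with
$$\begin{aligned} \mathbf{H} &= \mathbf{J}_{\mathbf{F}}^T \mathbf{J}_{\mathbf{F}} + c_l \mathbf{J}_l^T \mathbf{J}_l + \boldsymbol{\Lambda}, \\ \mathbf{J}_{\mathbf{F}} &= \left[ \mathbf{J}_{\mathbf{F},\mathbf{p}}, \; \mathbf{J}_{\mathbf{F},\mathbf{c}}, \; -\mathbf{U}_t \right], \quad \mathbf{J}_l = \left[ \mathbf{J}_{\mathcal{W}_l,\mathbf{p}}, \; \mathbf{J}_{\mathcal{W}_l,\mathbf{c}}, \; \mathbf{0}_{2L \times n_t} \right], \\ \boldsymbol{\Lambda} &= \operatorname{diag}\left( c_s \boldsymbol{\Sigma}_s^{-1}, \; \mathbf{0}_{n_c \times n_c}, \; c_t \boldsymbol{\Sigma}_t^{-1} \right) \end{aligned} \tag{22}$$
and
$$\mathbf{e}_{\mathbf{F}} = \mathbf{F}\left(\mathcal{W}(\mathbf{p}, \mathbf{c})\right) - \mathcal{T}(\boldsymbol{\lambda}), \quad \mathbf{e}_l = \mathcal{W}_l(\mathbf{p}, \mathbf{c}) - \mathbf{s}_l \tag{23}$$
are the residual terms. The computational complexity of the Simultaneous algorithm per iteration is dominated by the texture reconstruction term as $\mathcal{O}\left((n_s + n_c + n_t)^3 + CN(n_s + n_c + n_t)^2\right)$, which in practice is too slow.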
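3.2.2 Project-Out
We propose to use a Project-Out optimization approach that is much faster than the Simultaneous one. The main idea is to optimize on the orthogonal complement of the texture subspace, which eliminates the need to solve for the texture parameters increment at each iteration. By substituting Eqs. 18 and 19 into Eq. 17 and removing the incremental update on the texture parameters as well as the texture parameters regularization term, we end up with the problem
$$\arg\min_{\Delta\mathbf{p}, \Delta\mathbf{c}, \boldsymbol{\lambda}} \left\| \mathbf{F}\left(\mathcal{W}(\mathbf{p}, \mathbf{c})\right) + \mathbf{J}_{\mathbf{F},\mathbf{p}} \Delta\mathbf{p} + \mathbf{J}_{\mathbf{F},\mathbf{c}} \Delta\mathbf{c} - \mathcal{T}(\boldsymbol{\lambda}) \right\|^2 + c_l \left\| \mathcal{W}_l(\mathbf{p}, \mathbf{c}) + \mathbf{J}_{\mathcal{W}_l,\mathbf{p}} \Delta\mathbf{p} + \mathbf{J}_{\mathcal{W}_l,\mathbf{c}} \Delta\mathbf{c} - \mathbf{s}_l \right\|^2 + c_s \|\mathbf{p} + \Delta\mathbf{p}\|^2_{\boldsymbol{\Sigma}_s^{-1}} \tag{24}$$
The solution of Eq. 24 with respect to $\boldsymbol{\lambda}$ is readily given by
$$\boldsymbol{\lambda} = \mathbf{U}_t^T \left( \mathbf{F}\left(\mathcal{W}(\mathbf{p}, \mathbf{c})\right) + \mathbf{J}_{\mathbf{F},\mathbf{p}} \Delta\mathbf{p} + \mathbf{J}_{\mathbf{F},\mathbf{c}} \Delta\mathbf{c} - \bar{\mathbf{t}} \right) \tag{25}$$
By plugging Eq. 25 into Eq. 24, we get
$$\arg\min_{\Delta\mathbf{p}, \Delta\mathbf{c}} \left\| \mathbf{F}\left(\mathcal{W}(\mathbf{p}, \mathbf{c})\right) + \mathbf{J}_{\mathbf{F},\mathbf{p}} \Delta\mathbf{p} + \mathbf{J}_{\mathbf{F},\mathbf{c}} \Delta\mathbf{c} - \bar{\mathbf{t}} \right\|^2_{\bar{\mathbf{P}}} + c_l \left\| \mathcal{W}_l(\mathbf{p}, \mathbf{c}) + \mathbf{J}_{\mathcal{W}_l,\mathbf{p}} \Delta\mathbf{p} + \mathbf{J}_{\mathcal{W}_l,\mathbf{c}} \Delta\mathbf{c} - \mathbf{s}_l \right\|^2 + c_s \|\mathbf{p} + \Delta\mathbf{p}\|^2_{\boldsymbol{\Sigma}_s^{-1}} \tag{26}$$
where $\bar{\mathbf{P}} = \mathbf{E} - \mathbf{U}_t \mathbf{U}_t^T$ is the orthogonal complement of the texture subspace that functions as the “project-out” operator, with $\mathbf{E}$ denoting the identity matrix. Note that in order to derive Eq. 26, we use the properties $\bar{\mathbf{P}}^T = \bar{\mathbf{P}}$ and $\bar{\mathbf{P}}^T \bar{\mathbf{P}} = \bar{\mathbf{P}}$. By differentiating Eq. 26 and equalizing to zero, we get the solution
$$\begin{aligned} \Delta\mathbf{p} &= -\mathbf{H}_{\mathbf{p}}^{-1} \left( \mathbf{J}_{\mathbf{F},\mathbf{p}}^T \bar{\mathbf{P}} \mathbf{e}_{\mathbf{F}} + c_l \mathbf{J}_{\mathcal{W}_l,\mathbf{p}}^T \mathbf{e}_l + c_s \boldsymbol{\Sigma}_s^{-1} \mathbf{p} \right) \\ \Delta\mathbf{c} &= -\mathbf{H}_{\mathbf{c}}^{-1} \left( \mathbf{J}_{\mathbf{F},\mathbf{c}}^T \bar{\mathbf{P}} \mathbf{e}_{\mathbf{F}} + c_l \mathbf{J}_{\mathcal{W}_l,\mathbf{c}}^T \mathbf{e}_l \right) \end{aligned} \tag{27}$$
where
$$\begin{aligned} \mathbf{H}_{\mathbf{p}} &= \mathbf{J}_{\mathbf{F},\mathbf{p}}^T \bar{\mathbf{P}} \mathbf{J}_{\mathbf{F},\mathbf{p}} + c_l \mathbf{J}_{\mathcal{W}_l,\mathbf{p}}^T \mathbf{J}_{\mathcal{W}_l,\mathbf{p}} + c_s \boldsymbol{\Sigma}_s^{-1} \\ \mathbf{H}_{\mathbf{c}} &= \mathbf{J}_{\mathbf{F},\mathbf{c}}^T \bar{\mathbf{P}} \mathbf{J}_{\mathbf{F},\mathbf{c}} + c_l \mathbf{J}_{\mathcal{W}_l,\mathbf{c}}^T \mathbf{J}_{\mathcal{W}_l,\mathbf{c}} \end{aligned} \tag{28}$$
are the Hessian matrices and
$$\mathbf{e}_{\mathbf{F}} = \mathbf{F}\left(\mathcal{W}(\mathbf{p}, \mathbf{c})\right) - \bar{\mathbf{t}}, \quad \mathbf{e}_l = \mathcal{W}_l(\mathbf{p}, \mathbf{c}) - \mathbf{s}_l \tag{29}$$
are the residual terms. The texture parameters can be estimated at the end of the iterative procedure using Eq. 25.
Note that the most expensive operation is $\mathbf{J}_{\mathbf{F},\mathbf{p}}^T \bar{\mathbf{P}}$. However, if we first compute $\mathbf{J}_{\mathbf{F},\mathbf{p}}^T \mathbf{U}_t$ and then multiply this result with $\mathbf{U}_t^T$, the total cost becomes $\mathcal{O}(CN n_t n_s)$. The same stands for $\mathbf{J}_{\mathbf{F},\mathbf{c}}^T \bar{\mathbf{P}}$. Consequently, the cost per iteration is $\mathcal{O}\left((n_s + n_c)^3 + CN n_t (n_s + n_c) + CN(n_s^2 + n_c^2)\right)$, which is much faster than the Simultaneous algorithm.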
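The reordering of the matrix products described above can be checked on a toy example: projecting a transposed Jacobian onto the orthogonal complement of an orthonormal basis gives the same result whether the full projector is formed explicitly or the product is evaluated as the transposed Jacobian minus (transposed Jacobian times basis) times transposed basis. The helper names and the tiny matrices are hypothetical; a real implementation would use an optimised linear algebra library.

```python
def matmul(a, b):
    """Product of two matrices stored as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def project_out_naive(jacobian, basis):
    """J^T (I - U U^T), forming the full d x d projector explicitly."""
    d = len(basis)
    uut = matmul(basis, transpose(basis))
    projector = [[(1.0 if i == j else 0.0) - uut[i][j] for j in range(d)]
                 for i in range(d)]
    return matmul(transpose(jacobian), projector)


def project_out_fast(jacobian, basis):
    """J^T - (J^T U) U^T, never forming the d x d projector."""
    jt = transpose(jacobian)
    correction = matmul(matmul(jt, basis), transpose(basis))
    return [[x - y for x, y in zip(r1, r2)] for r1, r2 in zip(jt, correction)]


# Toy sizes: d = 3 texture entries, 1 texture component, 2 parameters.
basis = [[1.0], [0.0], [0.0]]                      # orthonormal column
jacobian = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # d x 2
naive = project_out_naive(jacobian, basis)
fast = project_out_fast(jacobian, basis)
```

The fast variant only ever multiplies by thin matrices, so its cost grows linearly with the texture dimension instead of quadratically.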
Residual masking. In practice, we apply a mask on the texture reconstruction residual of the Gauss-Newton optimization, in order to speed up the 3DMM fitting. This mask is constructed by first acquiring the set of visible vertexes using z-buffering and then randomly selecting $K$ of them. By keeping the number of vertexes small ($K \ll N$), we manage to greatly speed up the fitting process without any accuracy penalty.
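The random selection step of such a mask can be sketched as follows, assuming the indices of the visible vertexes have already been obtained by a visibility test. The function name `sample_residual_mask` and its `seed` argument are hypothetical conveniences for reproducibility.

```python
import random


def sample_residual_mask(visible_vertices, k, seed=None):
    """Pick at most k of the visible vertex indices uniformly at random."""
    rng = random.Random(seed)
    visible = list(visible_vertices)
    return sorted(rng.sample(visible, min(k, len(visible))))


# Illustrative call: 100 visible vertexes, keep 10 of them.
mask = sample_residual_mask(range(100), 10, seed=0)
```

Only the rows of the residual and Jacobians that correspond to the selected vertexes then need to be evaluated at each iteration.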
4 KF-ITW Dataset
For the evaluation of the 3DMM, we have constructed KF-ITW, the first dataset of 3D faces captured under relatively unconstrained conditions. The dataset consists of different subjects recorded under various illumination conditions performing a range of expressions (neutral, happy, surprise). We employed the KinectFusion [19, 29] framework to acquire a 3D representation of the subjects with a Kinect v1 sensor.
The fused mesh for each subject serves as a 3D face ground-truth against which we can evaluate our algorithm and compare it to other methods. A voxel grid was utilized to get the detailed 3D scans of the faces. In order to accurately reconstruct the entire surface of the faces, a circular motion scanning pattern was carried out. Each subject was instructed to stay still in a fixed pose during the entire scanning process. The frame rate was kept constant for every subject. After getting the 3D scans from the KinectFusion framework, we fit our shape model in a non-rigid manner to get a clean mesh with a fixed number of vertexes for the evaluation process. Finally, each mesh was manually annotated with the iBUG 49 sparse landmark set.
5 Experiments
To train our model, which we label as ITW, we use a variant of the Basel Face Model (BFM) [31] that we trained to contain both identities drawn from the original BFM model along with expressions provided by [15]. We trained the “in-the-wild” texture model on the images of the iBUG, LFPW and AFW datasets [32] as described in Sec. 2.3 using the 3D shape fits provided by [45]. Additionally, we elect to use the project-out formulation throughout our experiments due to its superior run-time performance and equivalent fitting performance to the simultaneous one.
5.1 3D Shape Recovery
Herein, we evaluate our “in-the-wild” 3DMM (ITW) in terms of 3D shape estimation accuracy against two popular state-of-the-art alternative 3DMM formulations. The first one is a classic 3DMM with the original Basel laboratory texture model and full lighting equation, which we term Classic. The second is the texture-less linear model proposed in [17, 18], which we refer to as Linear. For the Linear model we use the Surrey Model with related blendshapes along with the implementation given in [18].
We use the ground-truth annotations provided in the KF-ITW dataset to initialize and fit all three techniques to the “in-the-wild” style images in the dataset. The mean mesh of each model under test is landmarked with the same 49-point markup used in the dataset, and is registered against the ground-truth mesh by performing a Procrustes alignment using the sparse annotations followed by Non-Rigid Iterative Closest Point (N-ICP) to iteratively deform the two surfaces until they are brought into correspondence. This provides a per-model “ground-truth” for the 3D shape recovery problem for each image under test. Our error metric is the per-vertex dense error between the recovered shape and the model-specific corresponded ground-truth fit, normalized by the inter-ocular distance for the test mesh. Fig. 4 shows the cumulative error distribution for this experiment for the three models under test. Table 1 reports the corresponding Area Under the Curve (AUC) and failure rates. The Classic model struggles to fit to the “in-the-wild” conditions present in the test set, and performs the worst. The texture-free Linear model does better, but the ITW model is best able to recover the facial shapes due to its ideal feature basis for the “in-the-wild” conditions.
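A normalized dense error of this kind, and the cumulative error distribution built from it, could be computed along the following lines. This is a sketch under stated assumptions: the per-vertex errors are averaged over the mesh, the inter-ocular distance is taken between two given eye points, and the function names and thresholds are hypothetical; the exact protocol behind Fig. 4 and Table 1 may differ.

```python
import math


def normalised_dense_error(recovered, ground_truth, left_eye, right_eye):
    """Mean per-vertex Euclidean error divided by the inter-ocular distance."""
    if len(recovered) != len(ground_truth):
        raise ValueError("meshes must be in dense correspondence")
    interocular = math.dist(left_eye, right_eye)
    mean_error = sum(math.dist(a, b)
                     for a, b in zip(recovered, ground_truth)) / len(recovered)
    return mean_error / interocular


def cumulative_error_distribution(errors, thresholds):
    """Fraction of test errors at or below each threshold."""
    return [sum(e <= t for e in errors) / len(errors) for t in thresholds]


# Illustrative two-vertex meshes and eye positions.
err = normalised_dense_error(
    recovered=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    ground_truth=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)],
    left_eye=(0.0, 0.0, 0.0), right_eye=(2.0, 0.0, 0.0))
ced = cumulative_error_distribution([0.01, 0.03, 0.05, 0.2], [0.02, 0.05])
```

Normalizing by the inter-ocular distance makes errors comparable across faces of different physical size.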
Figure 6 demonstrates qualitative results on a wide range of fits of “in-the-wild” images drawn from the Helen and 300W datasets [32, 33] that highlight the effectiveness of the proposed technique. We note that in a wide variety of expression, identity, lighting and occlusion conditions our model is able to robustly reconstruct a realistic 3D facial shape that stands up to scrutiny.
Table 1: Area Under the Curve (AUC) and failure rates for the 3D shape recovery experiment.

Method    AUC     Failure Rate (%)
ITW       0.678   1.79
Linear    0.615   4.02
Classic   0.531   13.9
5.2 Quantitative Normal Recovery
As a second evaluation, we use our technique to find per-pixel normals and compare against two well-established Shape-from-Shading (SfS) techniques: PS-NL [9] and IMM [23]. For experimental evaluation we employ images of 100 subjects from the Photoface database [42]. As a set of four illumination conditions is provided for each subject, we can generate ground-truth facial surface normals using calibrated 4-source Photometric Stereo [27]. In Fig. 5 we show the cumulative error distribution in terms of the mean angular error. ITW slightly outperforms IMM even though both IMM and PS-NL use all four available images of each subject.
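A mean angular error between estimated and reference normals can be computed as in the sketch below. The function names are hypothetical, the normals are assumed to be non-zero 3D vectors, and the cosine is clamped to guard against rounding slightly outside the valid range.

```python
import math


def angular_error_degrees(n1, n2):
    """Angle in degrees between two non-zero 3D vectors."""
    dot = sum(a * b for a, b in zip(n1, n2))
    norms = math.sqrt(sum(a * a for a in n1)) * math.sqrt(sum(b * b for b in n2))
    cosine = max(-1.0, min(1.0, dot / norms))  # clamp against rounding error
    return math.degrees(math.acos(cosine))


def mean_angular_error(estimated, reference):
    """Average angular error over corresponding pairs of normals."""
    return sum(angular_error_degrees(a, b)
               for a, b in zip(estimated, reference)) / len(estimated)


# Illustrative normals: one exact match and one orthogonal pair.
error = mean_angular_error(
    estimated=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)],
    reference=[(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)])
```

Because the vectors are normalised inside the function, the result does not depend on the length of the input normals.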
6 Conclusion
We have presented a novel formulation of 3DMMs re-imagined for use in “in-the-wild” conditions. We capitalise on the annotated “in-the-wild” facial databases to propose a methodology for learning an “in-the-wild” feature-based texture model suitable for 3DMM fitting without having to optimise for illumination parameters. Furthermore, we propose a novel optimisation procedure for 3DMM fitting. We show that we are able to recover shapes with more detail than is possible using purely landmark-driven approaches. Our newly introduced “in-the-wild” KinectFusion dataset allows for the first time a quantitative evaluation of 3D facial reconstruction techniques in the wild, and on these evaluations we demonstrate that our “in-the-wild” formulation is state of the art, outperforming classical 3DMM approaches by a considerable margin.
References
 [1] J. Alabort-i-Medina, E. Antonakos, J. Booth, P. Snape, and S. Zafeiriou. Menpo: A comprehensive platform for parametric image alignment and visual deformable models. In Proceedings of the ACM International Conference on Multimedia, MM ’14, pages 679–682, New York, NY, USA, 2014. ACM.
 [2] J. Alabort-i-Medina and S. Zafeiriou. A unified framework for compositional fitting of active appearance models. International Journal of Computer Vision, pages 1–39, 2016.
 [3] O. Aldrian and W. A. Smith. Inverse rendering of faces with a 3d morphable model. IEEE transactions on pattern analysis and machine intelligence, 35(5):1080–1093, 2013.
 [4] E. Antonakos, J. Alabort-i-Medina, G. Tzimiropoulos, and S. Zafeiriou. Hog active appearance models. In Proceedings of IEEE International Conference on Image Processing, pages 224–228. IEEE, 2014.
 [5] E. Antonakos, J. Alabort-i-Medina, G. Tzimiropoulos, and S. Zafeiriou. Feature-based lucas-kanade and active appearance models. IEEE Transactions on Image Processing, 24(9):2617–2632, September 2015.

 [6] E. Antonakos, J. Alabort-i-Medina, and S. Zafeiriou. Active pictorial structures. In Proceedings of IEEE International Conference on Computer Vision & Pattern Recognition, pages 5435–5444, Boston, MA, USA, June 2015. IEEE.
 [7] A. Asthana, S. Zafeiriou, S. Cheng, and M. Pantic. Incremental face alignment in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1859–1866, 2014.
 [8] S. Baker and I. Matthews. Lucas-kanade 20 years on: A unifying framework. International journal of computer vision, 56(3):221–255, 2004.
 [9] R. Basri, D. Jacobs, and I. Kemelmacher. Photometric stereo with general, unknown lighting. IJCV, 72(3):239–257, 2007.
 [10] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar. Localizing parts of faces using a consensus of exemplars. IEEE transactions on pattern analysis and machine intelligence, 35(12):2930–2940, 2013.
 [11] D. P. Bertsekas. Constrained optimization and Lagrange multiplier methods. Academic press, 2014.
 [12] V. Blanz and T. Vetter. A morphable model for the synthesis of 3d faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pages 187–194. ACM Press/AddisonWesley Publishing Co., 1999.
 [13] V. Blanz and T. Vetter. Face recognition based on fitting a 3d morphable model. IEEE Transactions on pattern analysis and machine intelligence, 25(9):1063–1074, 2003.
 [14] J. Booth, A. Roussos, S. Zafeiriou, A. Ponniah, and D. Dunaway. A 3d morphable model learnt from 10,000 faces. In CVPR, 2016.
 [15] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou. Facewarehouse: A 3d facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics, 20(3):413–425, 2014.
 [16] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 886–893. IEEE, 2005.
 [17] P. Huber, Z.H. Feng, W. Christmas, J. Kittler, and M. Rätsch. Fitting 3d morphable face models using local features. In Image Processing (ICIP), 2015 IEEE International Conference on, pages 1195–1199. IEEE, 2015.
 [18] P. Huber, G. Hu, R. Tena, P. Mortazavian, W. P. Koppen, W. Christmas, M. Rätsch, and J. Kittler. A multiresolution 3d morphable face model and fitting framework. In Proceedings of the 11th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, 2016.
 [19] S. Izadi, D. Kim, O. Hilliges, D. Molyneaux, R. Newcombe, P. Kohli, J. Shotton, S. Hodges, D. Freeman, A. Davison, et al. Kinectfusion: real-time 3d reconstruction and interaction using a moving depth camera. In Proceedings of the 24th annual ACM symposium on User interface software and technology, pages 559–568. ACM, 2011.
 [20] V. Jain and E. Learned-Miller. Fddb: A benchmark for face detection in unconstrained settings. Technical Report UM-CS-2010-009, University of Massachusetts, Amherst, 2010.
 [21] A. Jourabloo and X. Liu. Large-pose face alignment via cnn-based dense 3d model fitting. In CVPR, 2016.
 [22] V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1867–1874, 2014.
 [23] I. Kemelmacher-Shlizerman. Internet based morphable model. In Proceedings of the IEEE International Conference on Computer Vision, pages 3256–3263, 2013.
 [24] J. B. Kuipers et al. Quaternions and rotation sequences, volume 66. Princeton university press Princeton, 1999.
 [25] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang. Interactive facial feature localization. In European Conference on Computer Vision, pages 679–692. Springer, 2012.
 [26] D. G. Lowe. Object recognition from local scale-invariant features. In Computer vision, 1999. The proceedings of the seventh IEEE international conference on, volume 2, pages 1150–1157. IEEE, 1999.
 [27] D. Marr and H. K. Nishihara. Representation and recognition of the spatial organization of three-dimensional shapes. Royal Society of London B: Biological Sciences, 200(1140):269–294, 1978.
 [28] I. Matthews and S. Baker. Active appearance models revisited. International Journal of Computer Vision, 60(2):135–164, 2004.
 [29] R. A. Newcombe, S. Izadi, O. Hilliges, D. Molyneaux, D. Kim, A. J. Davison, P. Kohli, J. Shotton, S. Hodges, and A. Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In Mixed and augmented reality (ISMAR), 2011 10th IEEE international symposium on, pages 127–136. IEEE, 2011.
 [30] G. Papandreou and P. Maragos. Adaptive and constrained algorithms for inverse compositional active appearance model fitting. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008.
 [31] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter. A 3d face model for pose and illumination invariant face recognition. In Advanced video and signal based surveillance, 2009. AVSS’09. Sixth IEEE International Conference on, pages 296–301. IEEE, 2009.
 [32] C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge: Database and results. Image and Vision Computing, Special Issue on Facial Landmark Localisation “In-The-Wild”, 47:3–18, 2016.
 [33] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge: The first facial landmark localization challenge. In Proceedings of IEEE Int’l Conf. on Computer Vision (ICCVW 2013), 300 Faces in-the-Wild Challenge (300W), Sydney, Australia, December 2013.
 [34] F. Shang, Y. Liu, J. Cheng, and H. Cheng. Robust principal component analysis with missing data. In Proceedings of the 23rd ACM International Conference on Conference on Information and Knowledge Management, CIKM ’14, pages 1149–1158, New York, NY, USA, 2014. ACM.
 [35] P. Snape, Y. Panagakis, and S. Zafeiriou. Automatic construction of robust spherical harmonic subspaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 91–100, 2015.
 [36] P. Snape and S. Zafeiriou. Kernel-pca analysis of surface normals for shape-from-shading. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1059–1066, 2014.
 [37] G. Trigeorgis, P. Snape, M. Nicolaou, E. Antonakos, and S. Zafeiriou. Mnemonic descent method: A recurrent process applied for end-to-end face alignment. In Proceedings of IEEE International Conference on Computer Vision & Pattern Recognition, Las Vegas, NV, USA, June 2016. IEEE.
 [38] G. Tzimiropoulos and M. Pantic. Optimization problems for fast aam fitting in-the-wild. In Proceedings of the IEEE international conference on computer vision, pages 593–600, 2013.
 [39] G. Tzimiropoulos and M. Pantic. Gauss-newton deformable part models for face alignment in-the-wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1851–1858, 2014.
 [40] M. Wheeler and K. Ikeuchi. Iterative estimation of rotation and translation using the quaternion: School of computer science, 1995.
 [41] X. Xiong and F. De la Torre. Supervised descent method and its applications to face alignment. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 532–539, 2013.
 [42] S. Zafeiriou, M. Hansen, G. Atkinson, V. Argyriou, M. Petrou, M. Smith, and L. Smith. The photoface database. In CVPR, pages 132–139, June 2011.
 [43] S. Zafeiriou, C. Zhang, and Z. Zhang. A survey on face detection in the wild: past, present and future. Computer Vision and Image Understanding, 138:1–24, 2015.
 [44] S. Zhu, C. Li, C. Change Loy, and X. Tang. Face alignment by coarse-to-fine shape searching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4998–5006, 2015.
 [45] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face alignment across large poses: A 3d solution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
 [46] X. Zhu and D. Ramanan. Face detection, pose estimation, and landmark localization in the wild. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2879–2886. IEEE, 2012.