I Introduction
This paper addresses the problem of reconstructing a 3D face from monocular in-the-wild videos. While the problem has been studied in the past, existing algorithms either rely on RGB-D data or have not demonstrated their robustness on realistic in-the-wild videos.
Among the algorithms that work on monocular video sequences, that of Garrido et al. [1] requires manual, subject-specific training and labelling, and has only been evaluated on a limited set of videos under rather controlled conditions, with frontal poses and in HD quality. Ichim et al. [2] also require subject-specific training and manual labelling by an experienced labeller, taking 1 to 7 minutes per subject, which is a tedious process, and their resulting model is person-specific. Jeni et al. [3] train their algorithm on rendered 3D meshes, which do not contain the variations that occur in 2D in-the-wild images; for example, the meshes have to be rendered on random backgrounds. Furthermore, they only evaluate by cross-validation on the same 3D data their algorithm has been trained on. Cao et al. [4, 5] reconstruct only shape, without using texture, and do not evaluate on in-the-wild videos. Cao et al. [6] do not require user-specific training, but present results only for controlled, frontal, high-resolution images and require CUDA to achieve real-time performance.
In contrast to these approaches, we present an approach that requires no subject-specific training and evaluate it on a challenging 2D in-the-wild video data set. We are the first to carry out such an evaluation of a 3D face reconstruction algorithm on in-the-wild data with challenging pose and light variations as well as limited resolution, and we show the robustness of our algorithm. While many of these works focus on face reenactment, we focus on a high-quality texture representation of the subject in front of the camera.
In addition to subject-specific manual training being a tedious step, creating a personalised face model offline is not possible where the subject cannot be seen beforehand, e.g. for face recognition in the wild, customer tracking for behaviour analysis or various HCI scenarios. Our approach runs in near real-time on a CPU.
This paper presents the following contributions. By combining cascaded regression with 3D Morphable Face Model fitting, we obtain real-time face tracking and a semi-dense 3D shape estimate from low-quality consumer 2D webcam videos. We present an approach to fuse the face texture from multiple video frames to yield a holistic textured face model. We demonstrate the applicability of our method to in-the-wild videos on the newly released 300-VW video database, which includes challenging scenarios like speeches and TV shows. Furthermore, we present preliminary results of a median-based super-resolution approach that can be applied when the whole video is available in advance. Finally, our method is available as open-source software on GitHub.
II Method
In general, reconstructing a 3D face from 2D data is an ill-conditioned problem. To make this task feasible, our approach incorporates a 3D Morphable Face Model (3DMM) as prior knowledge about faces.
In this section, we first briefly introduce the 3D Morphable Face Model we use. We then present our 3D face reconstruction approach and the texture fusion.
II-A 3D Morphable Face Model
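A 3D Morphable Model (3DMM) is based on three-dimensional meshes of faces that have been registered to a reference mesh, i.e. are in dense correspondence. A face is represented by a vector $\mathbf{S} \in \mathbb{R}^{3N}$, containing the x, y and z components of the shape, and a vector $\mathbf{T} \in \mathbb{R}^{3N}$, containing the per-vertex RGB colour information. $N$ is the number of mesh vertices. The 3DMM consists of two PCA models, one for the shape and one for the colour information, of which we only use the shape model in this paper. Each PCA model
$$\mathcal{M} := (\bar{\mathbf{v}}, \boldsymbol{\sigma}, \mathbf{V}) \tag{1}$$
consists of the mean of the model, $\bar{\mathbf{v}} \in \mathbb{R}^{3N}$, a set of principal components $\mathbf{V} = [\mathbf{v}_1, \dots, \mathbf{v}_M] \in \mathbb{R}^{3N \times M}$, and the standard deviations $\boldsymbol{\sigma} \in \mathbb{R}^{M}$. $M$ is the number of principal components present in the model. Novel faces can be generated by calculating
$$\mathbf{S} = \bar{\mathbf{v}} + \sum_{i=1}^{M} \alpha_i \sigma_i \mathbf{v}_i \tag{2}$$
for the shape, where $\boldsymbol{\alpha} = [\alpha_1, \dots, \alpha_M]^T$ are a set of 3D face instance coordinates in the shape PCA space.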
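As an illustration of Equation (2), the following minimal sketch generates a shape instance from a PCA model. It uses a randomly generated toy model rather than a real 3DMM; the function name `sample_shape` and the vertex layout (x, y, z stored consecutively per vertex) are assumptions made for this example.

```python
import numpy as np

def sample_shape(mean, basis, sigma, alpha):
    """Return mean + sum_i alpha_i * sigma_i * v_i as a (3N,) vector."""
    return mean + basis @ (alpha * sigma)

# Toy model: N = 4 vertices, M = 2 principal components (random placeholders).
rng = np.random.default_rng(0)
N, M = 4, 2
mean = rng.normal(size=3 * N)
basis, _ = np.linalg.qr(rng.normal(size=(3 * N, M)))  # orthonormal columns
sigma = np.array([2.0, 0.5])                          # standard deviations

mean_face = sample_shape(mean, basis, sigma, np.zeros(M))
novel_face = sample_shape(mean, basis, sigma, np.array([1.0, -1.0]))

# Assumed layout: one (x, y, z) row per vertex after reshaping.
vertices = novel_face.reshape(N, 3)
```

Setting all coefficients to zero returns the mean face; a coefficient of 1 moves the face by one standard deviation along the corresponding principal component.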
II-B Face Tracking
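To track the face in each frame of a video, we use a cascaded-regression based approach, similar to Feng et al. [7], to regress a set of sparse facial landmark points. The goal is to find a regressor:
$$R : \mathbf{f}(\mathbf{I}, \boldsymbol{\theta}) \rightarrow \delta\boldsymbol{\theta} \tag{3}$$
where $\mathbf{f}(\mathbf{I}, \boldsymbol{\theta})$ is a vector of extracted image features from the input image $\mathbf{I}$, given the current model parameters $\boldsymbol{\theta}$, and $\delta\boldsymbol{\theta}$ is the predicted model parameter update. This mapping is learned from a training dataset using a series of linear regressors
$$R_n : \delta\boldsymbol{\theta} = \mathbf{A}_n \mathbf{f}(\mathbf{I}, \boldsymbol{\theta}) + \mathbf{b}_n \tag{4}$$
where $\mathbf{A}_n$ is the projection matrix and $\mathbf{b}_n$ is the offset (bias) of the $n$th regressor, and $\mathbf{f}(\mathbf{I}, \boldsymbol{\theta})$ extracts HOG features from the image.
When run on a video stream, the regression is initialised at the location from the previous frame but with the model's mean landmarks, which acts as a regularisation.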
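The cascade of Equations (3) and (4) can be sketched as a simple loop in which every stage re-extracts features at the current estimate and applies one linear update. This is a minimal sketch, not the trained tracker: `toy_features` is a hypothetical stand-in for the HOG feature extractor, and the regressors are hand-picked rather than learned.

```python
import numpy as np

def run_cascade(image, theta0, regressors, extract_features):
    """Apply linear regressors (A_n, b_n) in series, starting from theta0."""
    theta = theta0.copy()
    for A, b in regressors:
        f = extract_features(image, theta)  # features depend on the current estimate
        theta = theta + A @ f + b           # theta <- theta + delta_theta
    return theta

# Hypothetical feature function: the "image" directly stores the target
# parameters and the feature is the remaining offset (real systems use HOG).
def toy_features(image, theta):
    return image - theta

target = np.array([4.0, -2.0])
regressors = [(0.5 * np.eye(2), np.zeros(2)) for _ in range(5)]
theta = run_cascade(target, np.zeros(2), regressors, toy_features)
```

In this toy setting each stage halves the remaining error, which mirrors how a trained cascade refines the landmark estimate step by step.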
II-C 3D Model Fitting
In subsequent steps, the 3D Morphable Model is fitted to the subject in a frame. This section describes our camera model, the PCA shape fitting, and subsequent refinement using contour landmarks and facial expressions.
Camera model. With the 2D landmark locations and their known correspondences in the 3D Morphable Model, we estimate the pose of the camera. We assume an affine camera model and implement the Gold Standard Algorithm of Hartley & Zisserman [8], which finds a least-squares approximation of a camera matrix given a number of 2D-3D point pairs.
First, the detected 2D landmark points $\mathbf{x}_i \in \mathbb{R}^3$ and the corresponding 3D model points $\mathbf{X}_i \in \mathbb{R}^4$ (both represented in homogeneous coordinates) are normalised by similarity transforms that translate the centroid of the image and model points to the origin and scale them so that the root-mean-square distance from their origin is $\sqrt{2}$ for the landmark and $\sqrt{3}$ for the model points respectively: $\tilde{\mathbf{x}}_i = \mathbf{T}\mathbf{x}_i$ with $\mathbf{T} \in \mathbb{R}^{3 \times 3}$, and $\tilde{\mathbf{X}}_i = \mathbf{U}\mathbf{X}_i$ with $\mathbf{U} \in \mathbb{R}^{4 \times 4}$. Using $\geq 4$ landmark points, we then compute a normalised camera matrix $\tilde{\mathbf{C}} \in \mathbb{R}^{3 \times 4}$ using the Gold Standard Algorithm [8] and obtain the final camera matrix after denormalising: $\mathbf{C} = \mathbf{T}^{-1}\tilde{\mathbf{C}}\mathbf{U}$.
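The normalise, solve, denormalise procedure described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the released implementation; the function names and the toy camera are invented for the example, and the points are passed in inhomogeneous form.

```python
import numpy as np

def normalising_transform(points, rms):
    """Similarity transform: centroid to origin, RMS distance from origin to `rms`."""
    d = points.shape[1]
    centroid = points.mean(axis=0)
    scale = rms / np.sqrt(((points - centroid) ** 2).sum(axis=1).mean())
    T = np.eye(d + 1)
    T[:d, :d] *= scale
    T[:d, d] = -scale * centroid
    return T

def estimate_affine_camera(x2d, X3d):
    """x2d: (n, 2) landmarks, X3d: (n, 3) model points, n >= 4. Returns a 3x4 matrix."""
    n = x2d.shape[0]
    T = normalising_transform(x2d, np.sqrt(2.0))
    U = normalising_transform(X3d, np.sqrt(3.0))
    xh = (T @ np.c_[x2d, np.ones(n)].T).T  # normalised homogeneous 2D points
    Xh = (U @ np.c_[X3d, np.ones(n)].T).T  # normalised homogeneous 3D points
    # The two free rows of an affine camera are linear least-squares problems.
    rows, *_ = np.linalg.lstsq(Xh, xh[:, :2], rcond=None)
    C_norm = np.vstack([rows.T, [0.0, 0.0, 0.0, 1.0]])
    return np.linalg.inv(T) @ C_norm @ U   # denormalise

# Toy check with a made-up affine camera and noise-free correspondences.
rng = np.random.default_rng(0)
C_true = np.array([[2.0, 0.1, 0.0, 30.0],
                   [-0.1, 1.8, 0.3, -12.0],
                   [0.0, 0.0, 0.0, 1.0]])
X3d = rng.normal(size=(8, 3)) * 50.0
x2d = (C_true @ np.c_[X3d, np.ones(8)].T).T[:, :2]
C_est = estimate_affine_camera(x2d, X3d)
```

With noise-free correspondences the estimate reproduces the toy camera; with noisy landmarks it returns the least-squares fit.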
Shape fitting. Given the estimated camera, the 3D shape model is fitted to the sparse set of 2D landmarks to produce an identity-specific semi-dense 3D shape. We find the most likely vector of PCA shape coefficients $\boldsymbol{\alpha}$ by minimising the following cost function:
$$E = \sum_{i=1}^{3K} \frac{(y_{m2D,i} - y_i)^2}{2\sigma_{2D}^2} + \|\boldsymbol{\alpha}\|_2^2 \tag{5}$$
where $K$ is the number of landmarks, $\mathbf{y}$ are detected or labelled 2D landmarks in homogeneous coordinates, $\sigma_{2D}^2$ is an optional variance for these landmark points, and $\mathbf{y}_{m2D}$ is the projection of the 3D Morphable Model shape to 2D using the estimated camera matrix. More specifically, $y_{m2D,i} = \mathbf{P}_i \cdot (\hat{\mathbf{V}}_h \boldsymbol{\alpha} + \bar{\mathbf{v}})$, where $\mathbf{P}_i$ is the $i$th row of $\mathbf{P}$ and $\mathbf{P}$ is a matrix that has copies of the camera matrix $\mathbf{C}$ on its diagonal, and $\hat{\mathbf{V}}_h$ is a modified PCA basis matrix that consists of a sub-selection of the rows that correspond to the landmark points that the shape is fitted to. Additionally, a row of zeros is inserted after every third row to accommodate homogeneous coordinates, and the basis vectors are multiplied with the square root of their respective eigenvalue. The cost function in (5) can be brought into a standard linear least-squares formulation. For details of the algorithm, we refer the reader to [9] and [10].
II-D Expression Fitting
To model expressions, we use a set of expression blendshapes $\mathbf{B}$ that have been computed from 3D expression scans. A linear combination of these blendshapes is added to the PCA model, so a shape is represented as:
$$\mathbf{S} = \bar{\mathbf{v}} + \sum_{i=1}^{M} \alpha_i \sigma_i \mathbf{v}_i + \sum_{j=1}^{L} \psi_j \mathbf{B}_j \tag{6}$$
where $\mathbf{B}_j$ is the $j$th column of $\mathbf{B}$ and $L$ is the number of blendshapes.
To solve for the blendshape coefficients $\boldsymbol{\psi}$, we use a non-negative least squares solver that minimises the distance between the current estimated model projection and the 2D landmarks. Because $\boldsymbol{\alpha}$ and $\boldsymbol{\psi}$ have to be solved for jointly, we alternate between the PCA shape fitting and the expression fitting until both reach stable values; usually they converge within ten iterations.
The result is a set of identity-specific shape coefficients $\boldsymbol{\alpha}$ and expression blendshape coefficients $\boldsymbol{\psi}$. Besides modelling the expressions of the subject in front of the camera, these can be used to remove a facial expression from a subject, or to re-render the face with a different expression. Figure 2 shows a frame with a strong expression, the expression-neutralised face, and a re-rendering with a synthesised expression.
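The alternation between the regularised shape fit of Equation (5) and the non-negative expression fit can be sketched as follows. This is a simplified illustration under several assumptions: the homogeneous rows are dropped, so the landmarks form a vector of length 2K; the camera is applied through a stacked matrix `A` and a translation `t`; the PCA basis `Vs` is assumed to be already scaled by the standard deviations; and a plain projected-gradient loop stands in for a dedicated non-negative least squares solver. All function and variable names are hypothetical.

```python
import numpy as np

def nnls_projected_gradient(G, r, psi0, iters=200):
    """Approximately minimise ||G psi - r||^2 subject to psi >= 0."""
    step = 1.0 / np.linalg.norm(G, 2) ** 2  # inverse Lipschitz constant of the gradient
    psi = psi0.copy()
    for _ in range(iters):
        psi = np.maximum(0.0, psi - step * G.T @ (G @ psi - r))
    return psi

def fit_shape_and_expression(y, A, t, mean, Vs, B, sigma2d=1.0, outer_iters=10):
    """Alternate between identity coefficients alpha and blendshape coefficients psi."""
    alpha = np.zeros(Vs.shape[1])
    psi = np.zeros(B.shape[1])
    Ga, Gp = A @ Vs, A @ B
    base = y - t - A @ mean  # landmark offsets not explained by the mean face
    for _ in range(outer_iters):
        # Identity step: closed-form minimiser of ||Ga a - r||^2 / (2 sigma2d) + ||a||^2.
        r = base - Gp @ psi
        lhs = Ga.T @ Ga + 2.0 * sigma2d * np.eye(len(alpha))
        alpha = np.linalg.solve(lhs, Ga.T @ r)
        # Expression step: non-negative fit to what the identity leaves unexplained.
        psi = nnls_projected_gradient(Gp, base - Ga @ alpha, psi)
    return alpha, psi

# Toy data: K = 10 landmarks, M = 3 shape components, L = 2 blendshapes.
rng = np.random.default_rng(1)
K, M, L = 10, 3, 2
C = np.array([[1.0, 0.0, 0.0, 5.0], [0.0, 1.0, 0.0, -2.0]])  # top rows of an affine camera
A = np.kron(np.eye(K), C[:, :3])
t = np.tile(C[:, 3], K)
mean = rng.normal(size=3 * K)
Vs = rng.normal(size=(3 * K, M))
B = rng.normal(size=(3 * K, L))
y = A @ (mean + Vs @ np.array([0.5, -1.0, 0.3]) + B @ np.array([0.8, 0.0])) + t

alpha, psi = fit_shape_and_expression(y, A, t, mean, Vs, B, sigma2d=1e-3)
initial_residual = np.linalg.norm(y - t - A @ mean)
final_residual = np.linalg.norm(y - t - A @ (mean + Vs @ alpha + B @ psi))
```

Each of the two steps can only lower the joint cost, which is why the alternation settles on stable coefficients.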
II-E Contour Refinement
In general, the outer face contours present in the 2D image do not correspond to unique contours on the 3D model. At the same time, these contours are important for an accurate face reconstruction, as they define the boundary region of the face. This problem has had limited attention in the research community, but for example Bas et al. [11] recently provided an excellent overview describing the problem in more detail.
To deal with this problem of contour correspondences, we introduce a simple contour fitting that fits the front-facing face contour given semi-fixed 2D-3D correspondences. We assume that the front-facing contour (that is, the half of the contour closer to the camera, for example the right face contour when a subject looks to the left) corresponds to the outline of the model. We thus define a set of vertices along the outline of the 3D face model, and then, given an initial fit, search for the closest vertex in that list for each detected 2D contour point.
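Given a 2D contour landmark $\mathbf{y}$, the optimal corresponding 3D vertex $\hat{\mathbf{v}}$ is chosen among the predefined outline vertices as:
$$\hat{\mathbf{v}} = \arg\min_{\mathbf{v}} \|\mathbf{C}\mathbf{v} - \mathbf{y}\|^2 \tag{7}$$
where $\mathbf{C}$ is the currently estimated projection matrix from 3D to 2D.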
Using a whole set of potential 3D contour vertices makes the method robust against varying roll and pitch angles, as well as against vertical inaccuracies of the contour from the landmark regressor. Once these optimal contour correspondences are found, they are used as additional corresponding points in the algorithm described in the previous sections.
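The search of Equation (7) amounts to a nearest-neighbour lookup among the projected outline vertices. The sketch below assumes an affine camera given as a 3x4 matrix and uses a tiny made-up mesh; the function name and all values are illustrative only.

```python
import numpy as np

def closest_contour_vertices(contour_2d, candidate_ids, vertices, camera):
    """For each 2D contour landmark, pick the candidate vertex whose projection is closest."""
    cand = vertices[candidate_ids]                  # (V, 3) outline vertices
    proj = cand @ camera[:2, :3].T + camera[:2, 3]  # (V, 2) projections
    dists = np.linalg.norm(contour_2d[:, None, :] - proj[None, :, :], axis=2)
    return np.asarray(candidate_ids)[dists.argmin(axis=1)]

# Made-up mesh with five vertices, of which 0, 2 and 4 lie on the model outline.
vertices = np.array([[0.0, 0.0, 1.0],
                     [1.0, 0.0, 0.0],
                     [2.0, 1.0, 0.0],
                     [3.0, 3.0, 3.0],
                     [4.0, 2.0, -1.0]])
camera = np.array([[1.0, 0.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 1.0]])
contour_2d = np.array([[3.9, 2.2], [0.1, -0.1]])
matches = closest_contour_vertices(contour_2d, [0, 2, 4], vertices, camera)
```

The returned vertex indices, paired with the 2D contour landmarks, can then be appended to the landmark correspondences used for the camera and shape fitting.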
II-F Texture Reconstruction
Once an accurate model fit is obtained, we remap the image texture from a frame to an isomap that puts each pixel into a globally registered representation. The isomap is a texture map, created by projecting the 3D model triangles to 2D while preserving the geodesic distance between vertices. The mapping is computed only once, so the isomaps of all frames are in dense correspondence with each other.
Inspired by [12], we compute a weighting $\omega$ for each point in the isomap that is given by the angle between the camera viewing direction $\mathbf{d}$ and the normal $\mathbf{n}$ of the 3D mesh triangle that contains the current point: $\omega = \langle \mathbf{d}, \mathbf{n} \rangle$. Thus, vertices that are facing away from the camera receive a lower weighting than vertices directly facing the camera, and self-occluded regions are discarded. In contrast to [12], our approach does not depend on the colour model or on a colour or light model fitting. Figure 3 shows an example image and the resulting weighting for each pixel.
To reconstruct the texture value at each pixel location, we calculate a weighted average over all frames up to the current one, with each pixel weighted by the $\omega$ computed for its triangle in the respective frame. This average can be computed very efficiently, i.e. by adding the values of the current frame to the previous average and normalising accordingly, without having to recompute the values for all previous frames. While more complex fusion techniques could be applied, our method is particularly suited for real-time application in that it allows the computation of an incremental texture model on a video stream, without having knowledge of the whole video in advance.
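The incremental weighted average can be sketched by keeping two running buffers, a weighted colour sum and a weight sum, so that each new frame costs one update regardless of how many frames came before. This is a minimal sketch with assumed conventions: isomaps are float arrays of shape (height, width, 3), the viewing direction points from the surface towards the camera, and negative weights (back-facing or self-occluded pixels) are clipped to zero. The class and function names are hypothetical.

```python
import numpy as np

def view_weights(normals, view_dir):
    """Per-pixel weight <d, n> for unit normals (H, W, 3) and unit direction d (3,)."""
    return normals @ view_dir

class IncrementalTextureFusion:
    def __init__(self, height, width):
        self.weighted_sum = np.zeros((height, width, 3))
        self.weight_total = np.zeros((height, width))

    def add_frame(self, isomap, weights):
        w = np.clip(weights, 0.0, None)  # discard pixels facing away from the camera
        self.weighted_sum += isomap * w[..., None]
        self.weight_total += w
        return self.current()

    def current(self):
        # Pixels that were never observed keep the value 0.
        safe = np.where(self.weight_total > 0, self.weight_total, 1.0)
        return self.weighted_sum / safe[..., None]

# Two toy frames of a 1x2 isomap.
fusion = IncrementalTextureFusion(1, 2)
frame1 = np.array([[[10.0, 10.0, 10.0], [100.0, 100.0, 100.0]]])
frame2 = np.array([[[20.0, 20.0, 20.0], [50.0, 50.0, 50.0]]])
fusion.add_frame(frame1, np.array([[1.0, -0.5]]))
fused = fusion.add_frame(frame2, np.array([[3.0, 1.0]]))
```

In the toy example the first pixel becomes (1 * 10 + 3 * 20) / 4 = 17.5, while the second pixel only takes the value of the second frame because it faced away from the camera in the first.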
III Experiments
III-A Landmark Accuracy
First, we evaluate the proposed approach on the ibug-Helen test set [13], to be able to compare the landmark accuracy to other approaches in the literature. We train a model using the algorithm of Section II-B, using FHOG features and 5 cascaded linear regressors in series. On the official ibug-68 landmark set, we achieve an average error of 0.049, normalised by the distance between the outer eye corners, as defined by the official ibug protocol (which they refer to as inter-eye-distance, IED). The algorithm was initialised with bounding boxes given by the ibug face detector. Table I shows a comparison with recent state-of-the-art methods.
To evaluate the accuracy of our tracking and the landmarks used for the shape reconstruction on in-the-wild videos, we evaluate the proposed approach on the public part of the 300-VW dataset [14]. Across all videos, our tracking achieves an average error of 0.047. Figure 5 shows the accuracy of each individual landmark. Our approach achieves competitive results even on challenging video sequences. Given that all 300-VW data is annotated semi-automatically, and the ground-truth contour landmarks are not well-defined and vary largely along the face contour, we believe this to be very close to the optimum achievable accuracy.
It is noteworthy that all the results in this paper were achieved by training on databases from different sources than 300-VW; no images from 300-VW were used in the training at any point.
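The error measure used above, the mean point-to-point landmark error divided by the distance between the outer eye corners, can be sketched as follows. The zero-based indices 36 and 45 for the outer eye corners in a 68-point annotation are an assumption of this example, as is the function name.

```python
import numpy as np

def normalised_landmark_error(pred, gt, left_outer=36, right_outer=45):
    """Mean Euclidean landmark error divided by the outer-eye-corner distance."""
    ied = np.linalg.norm(gt[left_outer] - gt[right_outer])
    return np.linalg.norm(pred - gt, axis=1).mean() / ied

# Toy annotation: eye corners 100 pixels apart, every prediction off by 5 pixels.
gt = np.zeros((68, 2))
gt[45] = [100.0, 0.0]
pred = gt + np.array([3.0, 4.0])
error = normalised_landmark_error(pred, gt)
```

A value of 0.05 in this toy case means that the landmarks are, on average, off by 5% of the inter-eye distance.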
III-B Face Reconstruction
Our main experiment is concerned with reconstructing the 3D face and texture from in-the-wild video sequences. Since no 3D ground truth is available for such video sequences, we evaluate on the texture map, which accounts for shape as well as texture reconstruction accuracy. We create a ground-truth isomap for ten 300-VW videos by manually merging a left, frontal and right view, generated from accurate manual landmarks. We then compare our fully automatic reconstruction with these reference isomaps.
Figure 4 shows results of ibug 300-VW reconstructions. Our pipeline copes well with changing background, challenging poses, and, to some degree, varying illumination. The weighted fusion works well in these challenging conditions and results in a holistic, visually appealing reconstruction of the full face. Using an average-based fusion results in slight blur, but produces consistent results.
III-C Super-resolution Texture Fusion
To evaluate the future potential of the proposed approach, we experiment with a median-based super-resolution approach to fuse the texture, assuming that we have knowledge of all frames of a video in advance. We employ a simplified version of the technique proposed by Maier et al. [17] for RGB-D data and adapt it to work in our scenario, resulting in a model-based super-resolution approach for texture fusion. Instead of averaging as described in Section II-F, the fused colour value $c$ of a pixel is computed as the weighted median of all observed colour values $c_i$ with their associated weights $\omega_i$:
$$c = \arg\min_{c} \sum_{i} \omega_i \, |c - c_i| \tag{8}$$
At the same time, while remapping the texture from the original frame to the isomap, we apply a super-resolution scale factor. Figure 6 shows an example super-resolved isomap using this approach, computed offline. The approach does not work in real-time yet and requires the whole video to be available in advance. We plan to extend the approach to work in an incremental manner on live video streams.
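A weighted median as in Equation (8) can be computed by sorting the observations and picking the first value at which the cumulative weight reaches half of the total. The sketch below handles a single colour channel of a single pixel; applying it independently per channel and per pixel is an assumption of this example, not a detail taken from [17].

```python
import numpy as np

def weighted_median(values, weights):
    """Return the observed value that minimises sum_i w_i * |c - c_i|."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    idx = np.searchsorted(cum, 0.5 * cum[-1])  # first index with cum >= half the total
    return v[idx]

# Four observations of one pixel channel; the value 200 is a low-weight outlier.
values = np.array([10.0, 200.0, 12.0, 11.0])
weights = np.array([1.0, 0.1, 1.0, 1.0])
fused = weighted_median(values, weights)
average = np.average(values, weights=weights)
```

In this toy case the weighted median stays at 11, whereas the weighted average is pulled towards the outlier. Unlike the running average, the median needs access to all observations of a pixel, which is consistent with this variant requiring the whole video in advance.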
IV Conclusion
We presented an approach for real-time 3D face reconstruction from monocular in-the-wild videos. The algorithm is competitive in landmark tracking and succeeds at reconstructing a shape and textural face representation, fusing different frames and view angles. In comparison with existing work, the proposed algorithm requires no subject-specific or manual training, reconstructs texture as well as a semi-dense shape, and is evaluated on a true in-the-wild video database.
Furthermore, the 3D face model and the fitting library are available at https://github.com/patrikhuber/eos. In future work, we plan to adapt the median-based super-resolution approach to work on real-time video streams.
Acknowledgments
This work is in part supported by the Centre for Vision, Speech and Signal Processing of the University of Surrey, UK. Partial support from the BEAT project (European Union’s Seventh Framework Programme, grant agreement 284989) and the EPSRC Programme Grant EP/N007743/1 is gratefully acknowledged.
References
 [1] P. Garrido, L. Valgaerts, C. Wu, and C. Theobalt, “Reconstructing detailed dynamic face geometry from monocular video,” ACM Trans. Graph., vol. 32, no. 6, pp. 158:1–158:10, Nov. 2013. [Online]. Available: http://doi.acm.org/10.1145/2508363.2508380
 [2] A. E. Ichim, S. Bouaziz, and M. Pauly, “Dynamic 3D avatar creation from hand-held video input,” ACM Trans. Graph., vol. 34, no. 4, pp. 45:1–45:14, Jul. 2015.
 [3] L. Jeni, J. Cohn, and T. Kanade, “Dense 3D face alignment from 2D videos in real-time,” in Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, vol. 1, May 2015, pp. 1–8.
 [4] C. Cao, Y. Weng, S. Lin, and K. Zhou, “3D shape regression for real-time facial animation,” ACM Trans. Graph., vol. 32, no. 4, pp. 41:1–41:10, Jul. 2013. [Online]. Available: http://doi.acm.org/10.1145/2461912.2462012
 [5] C. Cao, Q. Hou, and K. Zhou, “Displaced dynamic expression regression for real-time facial tracking and animation,” ACM Trans. Graph., vol. 33, no. 4, pp. 43:1–43:10, Jul. 2014. [Online]. Available: http://doi.acm.org/10.1145/2601097.2601204
 [6] C. Cao, D. Bradley, K. Zhou, and T. Beeler, “Real-time high-fidelity facial performance capture,” ACM Trans. Graph., vol. 34, no. 4, p. 46, 2015. [Online]. Available: http://doi.acm.org/10.1145/2766943
 [7] Z.-H. Feng, P. Huber, J. Kittler, W. Christmas, and X.-J. Wu, “Random cascaded-regression copse for robust facial landmark detection,” IEEE Signal Processing Letters, vol. 22, no. 1, pp. 76–80, Jan. 2015. [Online]. Available: http://dx.doi.org/10.1109/LSP.2014.2347011
 [8] R. I. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, 2nd ed. Cambridge University Press, 2004.
 [9] O. Aldrian and W. A. P. Smith, “Inverse rendering of faces with a 3D Morphable Model,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 5, pp. 1080–1093, 2013.
 [10] P. Huber, G. Hu, R. Tena, P. Mortazavian, W. P. Koppen, W. Christmas, M. Rätsch, and J. Kittler, “A multiresolution 3D Morphable Face Model and fitting framework,” in International Conference on Computer Vision Theory and Applications (VISAPP), 2016. [Online]. Available: http://dx.doi.org/10.5220/0005669500790086
 [11] A. Bas, W. A. P. Smith, T. Bolkart, and S. Wuhrer, “Fitting a 3D morphable model to edges: A comparison between hard and soft correspondences,” CoRR, vol. abs/1602.01125, 2016. [Online]. Available: http://arxiv.org/abs/1602.01125
 [12] R. T. A. van Rootseler, L. J. Spreeuwers, and R. N. J. Veldhuis, “Using 3D Morphable Models for face recognition in video,” in Proceedings of the 33rd WIC Symposium on Information Theory in the Benelux, 2012.
 [13] C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, “300 faces in-the-wild challenge: Database and results,” Image and Vision Computing, 2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0262885616000147
 [14] J. Shen, S. Zafeiriou, G. G. Chrysos, J. Kossaifi, G. Tzimiropoulos, and M. Pantic, “The first facial landmark tracking in-the-wild challenge: Benchmark and results,” in 2015 IEEE International Conference on Computer Vision Workshop, ICCV Workshops 2015, Santiago, Chile, December 7-13, 2015. IEEE, 2015, pp. 1003–1011. [Online]. Available: http://dx.doi.org/10.1109/ICCVW.2015.132
 [15] X. Xiong and F. De la Torre, “Supervised descent method and its applications to face alignment,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 532–539.
 [16] V. Kazemi and J. Sullivan, “One millisecond face alignment with an ensemble of regression trees,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014. IEEE, 2014, pp. 1867–1874. [Online]. Available: http://dx.doi.org/10.1109/CVPR.2014.241
 [17] R. Maier, J. Stückler, and D. Cremers, “Super-resolution keyframe fusion for 3D modeling with high-quality textures,” in 2015 International Conference on 3D Vision, 3DV 2015, Lyon, France, October 19-22, 2015. IEEE, 2015, pp. 536–544. [Online]. Available: http://dx.doi.org/10.1109/3DV.2015.66