3D head and 3D hair reconstruction models have been popular topics in computer vision over the past years, as their usefulness spans a broad range of applications. Current 3D reconstruction methods [guo2020towards, tran2018nonlinear, tuan2017regressing, zhu2017face, cao2018stabilized, kim2018deep, Alldieck2018, Beeler2010, Agrawal_2020_WACV] rely heavily on facial landmarks, usually leaving the back of the head unaccounted for. Thus, the resulting head reconstructions tend to be very precise in terms of facial features but include only a statistical approximation of the rest of the head. 3D hair reconstruction [Zhou_2018_ECCV, 3DHair_from_video_Ira], which often also relies on these ‘face-centric’ 3D head fitting models, additionally focuses on rendering a photo-realistic appearance of the hair rather than on representing the actual clinical condition of the person’s hair. For clinical applications, the reconstructed shape has to resemble the complete head shape of the person as closely as possible. Furthermore, to map information such as hair or skin variations over the entire scalp and to track changes over time, we need to recover the alignment of the head for the entire pose range, including top and back views where the face is occluded and traditional facial landmark detectors cannot be applied. This has been achieved with high-end multi-camera setups [alexander2013digital, Cao2018, fyffe_multi-view_2017] or 3D scanners [debevec_facialScans, lee1995realistic]. However, reconstructing an accurate head shape in this way requires setups that do not scale easily to clinical practice or to the general population.
To create 3D models of the head that may be used for future augmented or mixed reality applications, we present a 3D morphable model (3DMM) head fitting framework that requires only a video taken with a smartphone or off-the-shelf digital camera to recover the surface geometry of the person’s head. 3DMMs offer a consistent 3D mesh topology that will, in the future, make it easy to compare a specific location on the person’s scalp at different time points as well as between subjects. Solving this problem for any hair volume scenario is very challenging, as it requires estimating the shape of the scalp even when it is entirely occluded. The main focus of this work is on reconstructing the 3D geometry of subjects whose scalp outline is either partially visible due to areas of low hair density or contoured closely by hair with minimal volume. Given a video of a person’s head taken with a single moving camera, we fit a 3DMM of the complete head with a two-stage fitting algorithm, outlined in Figure 1, that uses traditional facial landmarks as well as adaptive scalp features. Both fitting steps are done via an iterative optimization that uses the structure-from-motion (SfM) 3D head reconstruction and camera poses to refine the 3DMM with respect to the camera poses. Stage 1 fits the frontal views using facial landmarks to yield estimated facial landmarks for all frames/camera poses; Stage 2 uses images sampled from the complete pose range, together with the estimated facial landmarks and the scalp features, for all-pose fitting. The output is a 3D head geometry that captures the surface of the complete head and closely contours the scalp region. This model can then be anchored over the entire pose range for future mixed reality applications. Our contributions can be summarized as follows.
We propose: i) a two-stage 3D head fitting algorithm that incorporates face and scalp information by also including new adaptive scalp features that encode the shape of the entire head; ii) a solution that enables 3DMM fitting across the complete pose range by combining traditional 3DMM fitting with SfM to get the best of both worlds: accurate camera poses and reconstruction. As shown in the results section, the approach works on a variety of videos taken by different people, with different smartphones, and in a variety of static environments, from living rooms to outdoor spaces.
2 Related Work
In recent years, there has been significant progress in recovering 3D shapes from 2D images [goel2020shape], both for faces [blanz1999morphable, SMPL:2015] and for animals and objects such as birds, cars, zebras, and airplanes [badger20203d, goel2020shape, kanazawa2018learning, zuffi2019three]. A common approach is to use 3DMMs to represent the 3D shape as an average shape plus a per-instance predicted deformation. There are several popular 3DMM models for the human head [huber2016multiresolution, bfm09]. However, few models exist for the complete head (face and scalp) [cao2013facewarehouse, Dai2019, FLAME:SiggraphAsia2017, ploumpis2020towards]. The Liverpool Head Model (LHM) [Dai2019], which accounts for head shape information from over 1200 subjects, was among the first to also focus on accurately modeling the shape of the scalp portion of the head. Its successor, the Universal Head Model (UHM) [ploumpis2020towards], added more facial details via the existing Large Scale Facial Model [Booth2018_IJCV]. 3DMMs have also been utilized to develop deep learning models for fitting 3D face/head models from individual images [cao2018stabilized, guo2020towards, kim2018deep, tran2018nonlinear, tuan2017regressing, zhu2017face]. However, they are usually limited to pose ranges dictated by state-of-the-art facial landmark detectors. The fitting problem in images where faces cannot be detected has yet to be addressed.
3D hair reconstruction, from strand-accurate geometric reconstruction of the individual hair [Nam2019_hair3D] to overall hair volume reconstruction from single images [Zhou_2018_ECCV] or videos [3DHair_from_video_Ira] also relies on underlying head models and hair masks segmentation. [Zhou_2018_ECCV] present a single-view hair reconstruction by fitting a 3DMM similarly to [hu2017avatar]. To estimate head shape, [3DHair_from_video_Ira] fuse the Basel face model [blanz1999morphable] with the generic full head model [cao2013facewarehouse] to get the complete 3D head. Again, the retrieval of the 3D head is based on facial landmark fitting alone and the back of the head is usually left unaccounted for. Additionally, all of these models tend to focus on outlining the hair rest regardless of scalp boundaries or variations in hair density.
Structure-from-Motion (SfM) [Hartley:2003:MVG:861369] with Multi-View-Stereo (MVS) [furukawa15] aims to reconstruct a 3D model of a scene from a set of images taken from different viewpoints. Meshroom [Meshroom], for example, is a popular photogrammetry software package that combines SfM and MVS [Jancosek2011, Moulon2012] to reconstruct a 3D scene and recover the estimated camera position for each frame. An open-source alternative for SfM+MVS is COLMAP [schonberger2016structure, schoenberger2016mvs]. While these pipelines usually require a static environment, they compute correspondences between images with high accuracy by exploiting camera motion, so they are robust to input variations and can handle arbitrary backgrounds and large pose variations.
Given a video of a person’s head taken with a single moving camera, we want to recover the head geometry (face and scalp without hair volume). As outlined in Figure 1, we start by extracting initial facial landmarks for each applicable input frame. In parallel, we apply an available photogrammetry pipeline, leveraging SfM and MVS, to get a generic dense 3D reconstruction of the scene while recovering camera information for every frame. The 3D reconstruction is used to obtain the silhouettes (the projection of the dense reconstruction on each frame), which in turn can be used to extract scalp features for any image. The extracted keypoints, together with the silhouettes and cameras, are then passed to a two-stage fitting algorithm: Stage 1 fits the 3DMM for frontal views using facial landmarks to yield estimated facial landmarks; Stage 2 then uses images sampled from the complete pose range, the estimated facial landmarks, the scalp features, and the camera poses for all-pose fitting.
We extract uniformly sampled frames from each video. We start by detecting facial landmarks and applying an SfM and MVS pipeline to get a dense 3D reconstruction of the person’s head and recover camera information for every frame.
3D generic reconstruction. We use Meshroom [Meshroom] to create a dense 3D reconstruction of the scene and recover perspective camera matrices for each frame. Since we take close-up videos of people, the mesh structures of the background consist of large triangles and may include several disconnected components. For all videos, the largest connected component always includes the person’s head, neck and torso, so we discard the remaining smaller components. We then use mesh filtering to remove larger triangles and obtain a generic 3D reconstruction of the head, neck and upper torso regions of the person. Finally, we extract the corresponding silhouette by projecting the mesh onto each frame.
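The two mesh clean-up steps described above can be sketched in a few lines. This is only an illustrative sketch, not the pipeline's actual implementation: the function names and the `max_edge` threshold are hypothetical, and the mesh is assumed to be given as a list of vertex coordinates plus a list of triangles (vertex-index triples).

```python
import math
from collections import defaultdict


def largest_component(faces):
    """Keep the faces of the largest connected component.

    Two faces are connected when they share a vertex (union-find over vertices).
    """
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b, c in faces:
        root = find(a)
        parent[find(b)] = root
        parent[find(c)] = root

    groups = defaultdict(list)
    for face in faces:
        groups[find(face[0])].append(face)
    return max(groups.values(), key=len) if groups else []


def drop_large_triangles(vertices, faces, max_edge):
    """Remove triangles whose longest edge exceeds max_edge (hypothetical threshold)."""
    def longest_edge(face):
        a, b, c = (vertices[i] for i in face)
        return max(math.dist(a, b), math.dist(b, c), math.dist(c, a))

    return [f for f in faces if longest_edge(f) <= max_edge]
```

In practice the threshold would have to be chosen relative to the scale of the reconstruction, which the text does not specify.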
Initial facial landmarks. To extract facial landmarks, we start by separating out ‘frontal’ frames (covering approximately the frontal half of the head, thus including also profile ones) in which a face can be detected using the Dlib library [dlib09]. We then apply an open source 3D facial landmark detector [bulat2017far] to determine landmarks on the nose, eyes, mouth, and jawlines. This allows us to obtain a small set of images to encode the geometry of the head using facial landmarks.
3.2 Universal Head Model
To recover the head geometry, we fit a 3DMM of the head. Since we are interested in fitting the complete head (face and scalp), we chose the Universal Head Model (UHM) [ploumpis2020towards]. The UHM is a PCA-based morphable head model where each head $S \in \mathbb{R}^{3V}$, defined as the position vector of its $V$ vertices, can be represented as the average head shape $\bar{S}$ plus a per-instance deformation:
$$S(p) = \bar{S} + U p, \tag{1}$$
where $p \in \mathbb{R}^{n_p}$ is the shape parameters vector associated with the principal components $U \in \mathbb{R}^{3V \times n_p}$ of the UHM model, with $n_p$ the number of shape parameters.
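As a minimal sketch of how a PCA-based morphable model generates a head instance (mean shape plus a linear combination of principal components), assuming the mean is stored as a flat vector of length 3V and the components as a 3V-by-n matrix; the function and argument names are illustrative and not taken from the UHM release:

```python
import numpy as np


def uhm_instance(mean_shape, components, p):
    """Return the (V, 3) vertex array for shape parameters p.

    mean_shape: (3V,) flattened mean head.
    components: (3V, n) principal components, one per column.
    p:          (n,) shape parameters.
    """
    shape = mean_shape + components @ p
    return shape.reshape(-1, 3)
```

Setting `p` to zeros returns the mean head, which is the natural starting point for an iterative fit.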
3.3 Two Stage Fitting Algorithm
Frontal fitting using facial landmarks (Stage 1). Using a frontal frame, we first determine the facial landmark vertices on the SfM mesh $M$ and use them to obtain $S_T$, the aligned UHM mesh with $S_T = T S$, where $S$ is initially set to the mean model $\bar{S}$ (for $p = 0$) and $T$ is the similarity transformation between $S$ and $M$ obtained via [umeyama1991least], as shown in Figure 2. Once the alignment is known, we can then use the perspective cameras estimated during SfM to obtain the projection of the UHM mesh for any image $I_i$. Due to the jagged nature of the SfM mesh, however, we first have to ensure that $T$ is the optimal rigid transformation in the SfM space with respect to the estimated perspective cameras.
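The initial alignment relies on the closed-form least-squares similarity estimate of [umeyama1991least]. A compact NumPy sketch of that estimate is given below for two sets of corresponding 3D landmarks; the function name and the convention (mapping `src` onto `dst`) are choices made here for illustration.

```python
import numpy as np


def umeyama(src, dst):
    """Least-squares similarity (scale s, rotation R, translation t) with dst ~ s * R @ src + t.

    src, dst: (N, 3) arrays of corresponding points.
    """
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                    # cross-covariance of the two point sets
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:  # guard against a reflection
        S[2, 2] = -1.0
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)            # variance of the source points
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * (R @ mu_s)
    return s, R, t
```

Here `src` would hold the UHM landmark vertices and `dst` the matching landmark vertices found on the SfM mesh; the result only initializes the alignment, which is then refined against the cameras as described next.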
Let $X_k(p)$, $k = 1, \dots, K$, be the 3D coordinates of the keypoints of the UHM model and $\pi_i$ be the camera projection function for image $I_i$. To find the optimized similarity matrix $T$ for a given $p$, we minimize the error between the keypoints $x_{ik}$ detected on the images and the corresponding image coordinates of the UHM keypoints:
$$T^{*} = \arg\min_{T} \sum_{i} \sum_{k} \left\| x_{ik} - \pi_i\big(T X_k(p)\big) \right\|^2 \tag{2}$$
using the Levenberg-Marquardt optimization algorithm [more1978levenberg]. The UHM model updated at every iteration via the Umeyama algorithm provides a good initial solution for the optimization algorithm.
Let
$$Q_i = P_i T \tag{3}$$
be the resulting transformation between the UHM canonical coordinate system and the coordinate system of image $I_i$, where $P_i$ is the camera projection matrix and $T$ is the optimized similarity transformation. Note that the image coordinates obtained by $\pi_i(T X_k)$ for a UHM keypoint $X_k$ are the same as the coordinates of $Q_i X_k$ divided by the depth coordinate (z axis) of $Q_i X_k$. We can then solve for the optimal UHM fit by extending the algorithm described by [hu2017efficient] to multiple images and minimizing the distance between 3D points on the UHM mesh and the back-projected keypoints detected in the images:
$$\min_{p} \sum_{i} \sum_{k} \left\| Q_i X_k(p) - \hat{X}_{ik} \right\|^2 + \lambda\, p^{\top} \Lambda^{-1} p, \tag{4}$$
where $\Lambda$ is the diagonal matrix of the eigenvalues of the UHM model, $\lambda$ weights the regularization term that prevents extreme unnatural shapes, and $\hat{X}_{ik} = \pi_i^{-1}(x_{ik}, z_{ik})$ are the back-projected keypoints detected in the images. $\pi_i^{-1}(\cdot)$ is the backprojection function that reconstructs a 3D point in the camera coordinate system for $I_i$ given the pixel coordinates $x_{ik}$ and the corresponding depth $z_{ik}$, estimated from the depth coordinate of the corresponding landmark of the UHM model. Since $X_k(p)$ is linear in $p$, the equation can now be linearized to solve for the optimal parameters $p$:
$$p = \left(A^{\top} A + \lambda \Lambda^{-1}\right)^{-1} A^{\top} b, \tag{5}$$
where $A$ is a matrix of the transformed principal components of the UHM model, $K$ is the number of keypoints (a subset of $V$), $b$ denotes the corresponding stacked vector of back-projected keypoints offset by the transformed mean shape, and $\Lambda$ is a diagonal matrix with the eigenvalues. We iterate this process to obtain the frontal shape parameters $p_f$.
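The linearized shape update is a regularized linear least-squares problem. The sketch below shows one plausible form of that solve, assuming the regularizer penalizes each coefficient by the inverse of its eigenvalue (a common choice for PCA shape priors that the text does not spell out). The names `A`, `b`, `eigenvalues` and `lam` are illustrative: `A` stacks the transformed principal-component rows for every keypoint in every image, and `b` stacks the back-projected keypoints minus the transformed mean-shape keypoints.

```python
import numpy as np


def solve_shape(A, b, eigenvalues, lam):
    """Minimize ||A p - b||^2 + lam * p^T diag(1 / eigenvalues) p over p.

    A:           (m, n) stacked transformed principal components.
    b:           (m,)   stacked target residual vector.
    eigenvalues: (n,)   PCA eigenvalues (assumed strictly positive).
    lam:         regularization weight.
    """
    reg = lam * np.diag(1.0 / eigenvalues)
    # Normal equations of the regularized problem; solve() avoids an explicit inverse.
    return np.linalg.solve(A.T @ A + reg, A.T @ b)
```

With `lam = 0` this reduces to ordinary least squares; larger values pull the solution toward the mean head, more strongly along components with small eigenvalues.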
Figure 3 shows sample output of Stage 1 based only on frontal fitting. While a good starting point, it fails to capture the shape of the scalp for the subject. We thus need additional features that constrain the fitted mesh to pass through them and therefore stay close to the scalp. Note that we can now use the Stage 1 fit to estimate landmark positions on all images regardless of the pose. Hence, we repeat the fitting procedure for the complete pose range using the estimates of the landmark positions from the Stage 1 fit and additional scalp features, as described next.
All-pose fitting with adaptive scalp features (Stage 2). Let $S_u$ denote the UHM region of the head above the ears, as shown in Figure 4. For each image $I_i$ sampled from the complete pose range, we define 3D scalp feature points on the UHM model as the left-most, right-most, and top-most points of $S_u$ as seen in that image, obtained by using the retrieved pose and the mesh fitted with the facial landmarks alone. Sample facial landmarks and scalp feature points are visualized for different views in Figure 4, with blue and green dots respectively. Then, we find the corresponding 2D feature points by taking the silhouette obtained by projecting the generic mesh from Section 3.1 onto each image $I_i$ and selecting the corresponding points lying on the edge of the upper portion of the silhouette mask. We then rerun the align-and-fit optimization described in Equations 2-5, where the keypoints now include both scalp features and facial landmarks, to obtain the final shape parameters, which account for the scalp shape as well. Note that the scalp features detected on the UHM model are iteratively updated according to the alignment from the previous iteration. Finally, unlike [hu2017efficient], we use the perspective cameras from the SfM+MVS pipeline, and we use nonlinear optimization to optimize the 6D pose of the UHM model in a multi-image setup rather than to estimate the camera in a single-image setup. As shown in Figure 5, the final fit now contours the head closely, correcting the meshes obtained from Stage 1 (Figure 3).
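To make the 2D side of the scalp features concrete, the sketch below picks the left-most, right-most and top-most pixels of a binary mask. It is a simplification under stated assumptions: the mask is assumed to already contain only the upper (above-ear) portion of the projected silhouette, image rows grow downward, and the function name is hypothetical.

```python
import numpy as np


def scalp_points_2d(upper_mask):
    """Return the extremal (x, y) pixels of a binary mask.

    upper_mask: (H, W) array, nonzero inside the upper silhouette portion.
    Ties are resolved by taking the first pixel in row-major order.
    """
    ys, xs = np.nonzero(upper_mask)
    if len(xs) == 0:
        raise ValueError("empty mask: no silhouette pixels to select from")

    def pixel(i):
        return int(xs[i]), int(ys[i])

    return {
        "left": pixel(xs.argmin()),
        "right": pixel(xs.argmax()),
        "top": pixel(ys.argmin()),   # smallest row index is the top of the image
    }
```

The matching 3D points would be the corresponding extremal vertices of the above-ear UHM region after projection with the same camera, recomputed at every iteration as the alignment changes.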
Dataset. To test our algorithm, we created an initial dataset of subjects (ages 8-98) spanning multiple ethnicities. We asked the subjects to have a third person take 2 videos of their head at different times and locations with their smartphone devices, setting the resolution to a minimum of 1080p. Instructions for taking the video were given as follows: choose a location with diffuse illumination (such as natural light outdoors or a room with multiple ceiling lights) and, while keeping the head perfectly still, move the phone around the head to cover all sides. Then, by either moving concentrically upward or tracing 90-degree arcs, glide the phone over the top to cover any remaining head surface. For consistency, we recommended phones be kept in portrait mode (although landscape mode was also acceptable) and close to the head to maximize pixel density for the head region. Finally, we asked that anyone with long hair tied it in a low ponytail(s). Because videos were taken by different people at different locations, as can be noticed in the images throughout the paper, the guidelines were only loosely followed, providing us with a range of challenging backgrounds, illuminations, and head poses. All videos were converted to 1080p and saved to image frames for processing.
Implementation details. For pre-processing, we uniformly sample frames per video. Videos for which Meshroom failed to recover a mesh were discarded from further processing. The number of frontal images used for the initial fit varied between depending on the video. We use half for determining and save the rest for evaluation purposes. For UHM fitting, we use all principal components of the original model. We run both stages of the fitting algorithm for iterations with regularization value . To determine the regularization term , we construct a synthetic dataset by sampling random shape vectors () to create 10 synthetic head shapes. Note that, since our method is geometric (we use the keypoints as input), we only need a synthetic mesh and not a photorealistic model. We run our method for and measure the difference between the estimated output shape and actual shape that was initialized.
was chosen empirically as it optimized both cosine similarity ofand , where is the GT shape parameters and is the estimated shape. The frames used for the all-pose fitting are obtained by sampling the images around the head every degrees (setting at the frontal/portrait frame used for the initial alignment) and restricting the absolute value of elevation to lie within degrees. This ensures that only the upper scalp portion is used to extract features (accidentally including the neck area or ponytails). The complete pipeline takes mins for each 1080p video on an Intel Xeon 32-core processor, with 128 GB RAM and 4 2080TI Nvidia GPUs: Meshroom takes hr to process 250 frames. The two-stage fitting algorithm without the Levenberg-Marquardt nonlinear optimization takes 3-5 secs per frame; and the nonlinear optimization takes secs for all the images per iteration.
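The view sampling for the all-pose fit can be illustrated as follows. This is a sketch under several assumptions that are not given in the text: camera centres are expressed in a head-centred frame with the y axis pointing up and the frontal camera on the +z axis, and both `step_deg` and `max_elev_deg` are hypothetical parameters standing in for the angular step and elevation limit.

```python
import math


def select_frames(cam_centers, step_deg, max_elev_deg):
    """Pick, for each target azimuth, the nearest frame within the elevation limit.

    cam_centers: list of (x, y, z) camera positions in a head-centred frame.
    Returns the indices of the selected frames, in order of target azimuth.
    """
    candidates = []
    for idx, (x, y, z) in enumerate(cam_centers):
        r = math.sqrt(x * x + y * y + z * z)
        elev = math.degrees(math.asin(y / r))
        if abs(elev) <= max_elev_deg:
            azim = math.degrees(math.atan2(x, z)) % 360.0  # 0 degrees = frontal view
            candidates.append((azim, idx))
    if not candidates:
        return []

    def circular_gap(a, b):
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)

    chosen = []
    for k in range(int(round(360.0 / step_deg))):
        target = k * step_deg
        _, idx = min(candidates, key=lambda c: circular_gap(c[0], target))
        if idx not in chosen:
            chosen.append(idx)
    return chosen
```

Restricting the elevation in this way keeps steeply tilted views out of the selection, which matches the stated goal of extracting features only from the upper scalp portion.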
Evaluations. Figure 6 shows sample qualitative results. For each person, we visualize the output fitted 3D heads mapped back onto images from various viewpoints (and head poses). The final fit closely contours the shape of the head for all subjects. Recall that our focus is on fitting the head (face and scalp), so neck misalignment may occur since neck bending is not modeled by the UHM. We are unaware of a video dataset that goes around the complete head of each person (while also keeping the person still). Thus, in the absence of a 3D benchmark, we complement the qualitative results with the following quantitative evaluations:
i) 2D anthropometric ground-truth (GT) evaluation. For head measurements, we created a simple anthropometric 2D GT by manually measuring width, length, and height in portrait () and lateral () images to compute head proportions (height over width and length over height, for the portrait and lateral views, respectively) and measure whether the reconstructed shapes adhered to the expected ratios. Measurements for each subject were taken twice for every dimension and averaged to minimize errors before comparing them with the automated ones. For each subject, we recorded the percent difference between GT ratios and the ratios automatically measured for the fits obtained at different steps of our method in Table 1. shows poorer accuracy than , as it fits based on facial landmarks alone while relying on shape priors for the rest of the scalp. significantly improves after Stage 2 using our adaptive scalp features all-pose fitting.
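The ratio comparison in this evaluation amounts to simple arithmetic, sketched below. The function names are illustrative, and the measurements may be in any consistent unit since only ratios are compared.

```python
def head_ratios(width, length, height):
    """Return (height / width, length / height): the portrait and lateral proportions."""
    return height / width, length / height


def percent_difference(gt_ratio, fitted_ratio):
    """Absolute difference between a fitted ratio and the GT ratio, in percent of the GT."""
    return 100.0 * abs(fitted_ratio - gt_ratio) / gt_ratio
```

For example, a GT portrait ratio of 1.5 against a fitted ratio of 1.35 gives a difference of about 10 percent.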
ii) Comparison with state-of-the-art (SOTA) 3D face reconstruction methods on withheld views. We compare our final fits with two SOTA 3DMM fitting models: 3DDFA [3ddfa_cleardusk, guo2020towards, zhu2017face] and Deepface 3D reconstruction [deng2019accurate] on a total of images from the 25 subjects used for training and testing. The goal of this evaluation is to show how the proposed method compares with the SOTA and not to outperform these methods. SOTA methods, while better for the face, cannot be used for the scalp. Table 2 shows the root mean squared (RMS) 2D Euclidean distance (in pixels) between the projected landmarks and the ones detected by the facial detector used in Section 3.1. Both SOTA methods were evaluated on single image fitting (our multi-image one) which generally leads to better reprojection error. RMS for is computed using landmarks projected by on new images sampled from the complete pose ranges, not the original images that were used in the fitting. Additionally, in our videos, though we recommended a neutral expression, people closed their eyes or were smiling. Both Deepface and 3DDFA handle expressions leading to better metrics for the face. We find comparable results for jawline landmarks on their own, since the expressions observed in the videos affect the jawlines to a much lower degree.
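A minimal sketch of the reprojection metric used here: 3D landmarks are projected with a 3x4 perspective camera matrix and compared with detected 2D landmarks through the root mean squared Euclidean distance in pixels. The function names and the array layout are assumptions made for illustration.

```python
import numpy as np


def project(P, X):
    """Project (K, 3) points with a (3, 4) camera matrix; returns (K, 2) pixel coordinates."""
    Xh = np.hstack([X, np.ones((len(X), 1))])  # homogeneous coordinates
    x = Xh @ P.T
    return x[:, :2] / x[:, 2:3]                # divide by the depth coordinate


def rms_pixel_error(projected, detected):
    """Root mean squared 2D Euclidean distance between two (K, 2) landmark sets."""
    sq_dist = ((projected - detected) ** 2).sum(axis=1)
    return float(np.sqrt(sq_dist.mean()))
```

The same two helpers can be restricted to a subset of landmark rows, for instance the jawline, to report per-region errors.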
iii) 3D comparison of fitted scalp. In Table 3, we compute the Chamfer distance to compare the fitted scalp mesh with the scalp portion of the original dense reconstruction. We exclude the facial portion because the high degree of curvature and the edges found on faces can lead to extremely jagged meshes. For each vertex belonging to the fitted UHM scalp, aligned in the SfM canonical space, we measure the distance between this vertex and its closest vertex in the dense reconstruction. We then compute the mean of these distances and divide it by the width of the head of the UHM model. We qualitatively show the benefit of our fitting pipeline over directly using the dense reconstruction in Figure 7. Illumination conditions, image quality, and slight motion can lead to poor 3D reconstructions. In our experiments, we find that the dense reconstruction from the SfM+MVS pipeline can be unreliable for regions of high curvature such as the jawline or nose, while still providing reliable camera information. Our approach is robust to such scenarios, which further justifies the use of nonlinear optimization in the two-stage fitting algorithm.
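The distance described in this evaluation (mean closest-vertex distance from the fitted scalp to the dense reconstruction, normalized by head width) can be sketched as below. This brute-force version builds the full pairwise distance matrix, so it is only meant for illustration on modest vertex counts; a spatial index such as a k-d tree would be the usual choice for full meshes. The function name is hypothetical.

```python
import numpy as np


def normalized_closest_distance(scalp_vertices, dense_vertices, head_width):
    """Mean distance from each scalp vertex to its nearest dense-mesh vertex, over head width.

    scalp_vertices: (N, 3) fitted scalp vertices, already aligned to the SfM space.
    dense_vertices: (M, 3) vertices of the dense reconstruction (scalp portion).
    """
    diff = scalp_vertices[:, None, :] - dense_vertices[None, :, :]
    nearest = np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)
    return float(nearest.mean() / head_width)
```

Dividing by the head width makes the value independent of the arbitrary scale of the SfM reconstruction.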
iv) 3D comparison with 3D LiDAR reconstruction. We include 3D GT evaluation on a small subset of our original subjects by capturing the shape of the scalp with a tight head scarf to minimize further any remaining hair volume. Using an Intel RealSense L515 LiDAR camera, we took videos of four subjects, who were physically located near our lab. We generated the complete 3D LiDAR mesh using the Open3D library [Zhou2018]. We then measured the Chamfer distance between our output and the GT LiDAR mesh . As shown in Table 4, we find similar trends with Table 3 and our final meshes are closer to the LiDAR scans than to .
Finally, we show the full potential of our method by evaluating mesh consistency and demonstrating how once a 3D mesh for a person’s head is obtained, it can be used for other videos of the same person, in particular the ones in which the required assumptions to recover the scalp shape no longer need to hold. This applies to both different hairstyles and larger hair volumes. To measure reconstruction consistency between two meshes from different videos of the same person, the average vertex displacement error over all vertices for the entire head and for the separate face and scalp regions yielded , , and errors on the overall head, face, and scalp areas, respectively. In Figure 8, we qualitatively show that once we have obtained the shape of an individual we need not determine the shape again and can simply determine and for using just a few frontal images to obtain accurate mapping for all poses.
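Because fitted heads share the UHM topology, consistency between two fits of the same person can be measured vertex by vertex. The sketch below assumes both meshes are already expressed in the same coordinate frame and that `region` is an optional list of vertex indices (for example face or scalp); these names are illustrative.

```python
import numpy as np


def mean_vertex_displacement(verts_a, verts_b, region=None):
    """Average Euclidean distance between corresponding vertices of two (V, 3) meshes."""
    if region is not None:
        verts_a, verts_b = verts_a[region], verts_b[region]
    return float(np.linalg.norm(verts_a - verts_b, axis=1).mean())
```

Calling it once without `region` and once per region index list gives the overall, face, and scalp errors.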
5 Conclusion and Future Work
We present a new framework to recover a person’s head geometry from a video taken with a single handheld camera. Results on subjects show that the method works on a variety of videos. In all experiments, results quantitatively indicate that scalp features included in the two-stage algorithm lead to better fitting than the 3DMM prior alone. The current approach suffers from a few limitations. First, the implementation is computationally expensive as each video can take up to 2 hours using high-end hardware. Second, requiring the head to remain still while moving the camera does not allow a person to take the video on their own. Third, it does not include ear keypoints since we did not find a reliable automated method to detect features for the ear. Finally, it relies on the 3D generic reconstruction as well as facial landmark detector as initial inputs. If one or both happen to fail, then the video has to be retaken.
The proposed head fitting method yields a high degree of consistency between fitted heads from different videos of the same person. Because the alignment with the 3D surface mesh can be done starting with any 3D head model, the pipeline presented can also be used to align subsequent videos by using a 3D head fitted from a different video (of the same person). This is particularly useful for certain clinical applications such as mapping hair loss information since specific types of hair loss require having the hair parted in different ways to expose the hair loss. Finally, as a first step towards creating an end-to-end system to solve the all-pose head fitting problem, this approach can be used to generate training data for deep learning models to enable fitting morphable models across all pose ranges. The fitted 3D heads will be used, in future work, for mapping texture information onto the subject’s head surface for tracking clinically valuable information. The consistent mesh topology inherent in the UHM model will simplify tracking as it implicitly provides point-to-point correspondences for the entire head surface, setting the path for many clinical applications.
Acknowledgements. This research was in part funded by a research grant by the Edwin and Fannie Gray Hall Center for Human Appearance, at the University of Pennsylvania, awarded to Drs. Bernardis and Daniilidis.