Single-Image 3D Face Reconstruction under Perspective Projection

05/09/2022
by   Yueying Kao, et al.

In 3D face reconstruction, orthogonal projection has been widely employed as a substitute for perspective projection to simplify the fitting process. This approximation performs well when the distance between the camera and the face is sufficiently large. However, in scenarios where the face is very close to the camera or moving along the camera axis, these methods suffer from inaccurate reconstruction and unstable temporal fitting because of the distortion introduced by perspective projection. In this paper, we aim to address the problem of single-image 3D face reconstruction under perspective projection. Specifically, a deep neural network, the Perspective Network (PerspNet), is proposed to simultaneously reconstruct the 3D face shape in canonical space and learn the correspondence between 2D pixels and 3D points, from which the 6DoF (6 Degrees of Freedom) face pose can be estimated to represent the perspective projection. In addition, we contribute a large ARKitFace dataset to enable the training and evaluation of 3D face reconstruction solutions under perspective projection; it contains 902,724 2D facial images with ground-truth 3D face meshes and annotated 6DoF pose parameters. Experimental results show that our approach outperforms current state-of-the-art methods by a significant margin.



1 Introduction

Figure 1: Orthogonal projection vs. perspective projection. (a) and (b) show images rendered from the same 3D face by varying the pose parameter that controls apparent size under each projection (the isotropic scale for orthogonal projection and the translation along the camera axis for perspective projection). The rendered 2D faces are only zoomed in and out under orthogonal projection, whereas under perspective projection there is obvious distortion as the distance changes, especially at very close range. (c) shows 3D face reconstruction under the two projections. Orthogonal-projection-based methods explain the distortion by shape changes, while perspective-projection methods recover the same shape and explain the distortion by different pose parameters.

3D face reconstruction [29, 13, 3, 12, 42] has drawn much attention recently in the computer vision and computer graphics communities, due to the increasing demand from many applications, such as virtual glasses try-on and make-up in AR, video editing and animation. Most 3D face reconstruction methods [41, 13, 3, 12, 16] employ orthogonal projection [9, 15] to approximate the real-world perspective projection, which works well when the size of the face is small compared to the distance from the camera (roughly 1/20 of the camera distance). However, face-capture scenarios are becoming more complicated with the popularity of selfies, virtual glasses try-on, virtual make-up, etc. When the subject is very close to the camera, the rendered 2D faces under orthogonal projection are only zoomed in, while under perspective projection there is obvious distortion, especially at very close range, as shown in Fig. 1(a) and (b). Under the approximation of orthogonal projection, the distortion caused by perspective projection is explained by shape changes, leading to two significant problems: 1) this distortion is not modeled by the shape models, and the distorted faces often lie outside the shape space, which leads to unstable temporal fitting; 2) when the subject moves along the camera axis, orthogonal-projection-based methods predict different face shapes at different distances because of the distortion, while perspective-projection methods recover a consistent shape across frames, since they explain the distortion with different 6DoF poses, as shown in Fig. 1(c). Although introducing the 6DoF pose clearly improves the accuracy and robustness of 3D face reconstruction, which benefits many VR/AR applications, 6DoF pose estimation for faces remains challenging, since the perspective distortion must be inferred from large variations of facial appearance in complicated environments. Besides, 6DoF pose estimation for other objects [26, 18, 33] typically assumes a pre-defined 3D shape, whereas we only have a face image as input.

To address this problem, we propose a new approach to recover 3D facial geometry under perspective projection by estimating the 6DoF pose, i.e. 3D orientation and translation, simultaneously from a single RGB image (see Section 3). Specifically, 6DoF estimation is decomposed into three sub-tasks: 1) reconstructing the 3D face shape in canonical space, 2) estimating the pixel-wise 2D-3D correspondence between the canonical 3D space and image space, and 3) calculating the 6DoF pose parameters with the Perspective-n-Point (PnP) algorithm [23]. In this paper, we propose a deep-learning-based Perspective Network (PerspNet) to achieve this goal in a single forward pass. For the 3D face shape reconstruction, an encoder-decoder architecture is designed to regress a UV position map [13] from image pixels, which records the position of each point of the 3D face in canonical space. For the 2D-3D correspondence, we construct sophisticated point description features, including aligned-pixel features, 3D features and 2D position features, and map them to a 2D-3D correspondence matrix in which each row records the corresponding 3D points in canonical space of one pixel on the input image. With the 3D shape and the correspondence matrix, the 6DoF face pose can be robustly estimated by PnP.

To realize 6DoF pose estimation of 3D faces, we need face images, 3D shapes and annotated camera parameters for each training sample. However, current datasets cannot satisfy this requirement. Most existing datasets provide only low-quality 3D faces produced by optimization-based fitting in a weakly supervised manner, such as AFLW2000-3D [41]. Other datasets, such as BIWI [11], NoW [31] and FaceScape [37], lack either an exact 3D shape for each 2D image or pose variation. Therefore, we construct a large real-world 3D face dataset, the ARKitFace dataset, with 902,724 2D facial images of 500 subjects covering diverse expressions, ages and poses. For each 2D facial image, a ground-truth 3D face mesh and 6DoF pose annotations are provided, as shown in Fig. 2. Compared with other 3D face datasets [41, 31, 11, 37], ARKitFace is a very large-scale dataset for single-image 3D face reconstruction under perspective projection.

Figure 2: Some examples of ARKitFace dataset. Each sample contains (a) 2D image, (b) 3D mesh and (c) 6DoF pose annotation. (d) shows more examples.

To summarize, the main contributions of this work are:

  • We explore a new problem of 3D face reconstruction under perspective projection, which will provide the exact 3D face shape and location in 3D camera space.

  • We propose PerspNet to reconstruct 3D face shape and 2D-3D correspondence simultaneously, by which the 6DoF face pose can be estimated by PnP.

  • To enable the training and evaluation of PerspNet, we collect ARKitFace dataset, a large-scale 3D dataset with ground-truth 3D face mesh and 6DoF pose annotations for each image.

  • Experimental results on the ARKitFace dataset and the public BIWI dataset show that our approach outperforms the state-of-the-art methods. The code and all data will be released to the public under the authorization of all subjects (code and data will be at www.to-be-released.com).

2 Related Work

2.1 3D Face Reconstruction

3D face reconstruction from a single RGB image is essentially an ill-posed problem. Most methods tackle it by estimating the parameters of a statistical face model [16, 3, 41, 13, 12, 29, 42]. Although these methods achieve remarkable results, they suffer from the same fundamental limitation: an orthogonal or weak-perspective camera model is used when reconstructing the 3D face shape. The deformation caused by perspective projection, especially at close range, has to be compensated by the face shape. Thus, we insist on following the rules of perspective projection and believe there is still substantial room for improvement on the task of 3D face reconstruction.

2.2 Head/Face Pose Estimation

Due to the prevalence of deep learning, great progress has been achieved in head pose estimation. The main idea is to regress the Euler angles of the head pose directly with deep CNNs. QuatNet [20] addresses the non-stationary property of head pose and trains a quaternion regressor to avoid the ambiguity problem of Euler angles. FSA-Net [38] adopts fine-grained structure mapping for spatial feature grouping to improve head pose estimation. EVA-GCN [36] views head pose estimation as a graph regression problem and leverages Graph Convolutional Networks to model the complex nonlinear mappings between graph topologies and head pose angles. All of the above methods assume orthogonal projection, so only 3DoF (3D rotation) is predicted and the error caused by perspective deformation cannot be avoided.

More recently, Chang et al. [8] directly regress the 6DoF parameters of human faces under perspective projection from a face photo. Albiero et al. [1] propose a Faster R-CNN-like [28] framework named img2pose to predict the full 6DoF of human faces under perspective projection. First, the original full image is fed into an RPN module and all faces in the image are detected. Then the local 6DoF of each detected face is predicted by the RoI head. Finally, the local 6DoF is converted to a global 6DoF using the facial bounding box and the camera intrinsic matrix. Since img2pose ignores the influence of perspective deformation, the 6DoF error is amplified during the local-to-global conversion. In addition, the 6DoF pose annotations in their training data are computed from five predicted landmarks and a mean 3D mesh, rather than being real 6DoF poses.

2.3 3D Face Datasets

Although a large number of facial images are available, the corresponding 3D annotations are expensive and difficult to obtain. For 3D face shape reconstruction, several 3D datasets have been built, such as Bosphorus [32], BFM [25], FaceWarehouse [6], MICC [2], 3DFAW [27], BP4D [39] and the NoW dataset [31]. These datasets either contain limited data or lack face pose annotations.

To obtain head/face 6DoF poses, some datasets synthesize the 3D ground truth, i.e. the parameters of a statistical face model, with an optimization-based fitting algorithm in a weakly supervised manner, such as 300W-LP and AFLW2000-3D [41]. The synthesized 3D ground truth is coarse because only the reconstruction error of sparse 2D facial landmarks is considered. Despite the expense of 3D annotations, researchers have collected several 3D face datasets using professional imaging devices to obtain high precision for 3D face reconstruction and head pose estimation. For example, the BIWI dataset [11] is captured by a Kinect sensor with global rotation and translation relative to the RGB camera. However, the number of individuals is limited, and only facial images with neutral expressions are recorded. Most importantly, the BIWI dataset is too small for training deep networks. FaceScape [37] is a large-scale dataset recorded by a multi-view camera system in an extremely constrained environment. It contains sufficient individuals and multiple facial expressions. However, the head pose is fixed by a limited number of camera locations and the lighting condition is constant. Models trained on such datasets do not generalize well to real in-the-wild scenarios. Different from these datasets, we aim to collect a 3D dataset with 3D meshes and 6DoF pose annotations under varying conditions, such as expression, age and 6DoF pose.

3 Proposed Method

In this paper, we propose a novel framework for 3D face reconstruction under perspective projection from a single 2D face image. Previous single-image 3D face reconstruction methods [41, 13, 3, 12, 16] adopt a scaled orthographic camera model to project the 3D face shape into image space. We denote the 3D face shape as $S \in \mathbb{R}^{3 \times n}$, representing the $n$ 3D vertices (points) on the surface of the 3D shape in the world coordinate system (canonical space). This projection process is usually formulated as

$$V_{2d} = f \cdot \mathbf{Pr} \cdot S + t_{2d}, \qquad (1)$$

where $V_{2d}$ denotes the projected 2D coordinates of $S$ in the 2D image, $\mathbf{Pr} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix}$ is the orthographic 3D-2D projection matrix, $f$ is an isotropic scale and $t_{2d}$ denotes the 2D translation. Different from orthogonal projection, under perspective projection the 3D face shape is first transformed from the world coordinate system to the camera coordinate system by the 6DoF face pose $(R, t_{3d})$,

$$S_c = R \cdot S + t_{3d}, \qquad (2)$$

with known intrinsic camera parameters $K$, where $R$, $t_{3d}$ and $S_c$ represent the 3D rotation, the 3D translation and the 3D face vertices in the camera coordinate system, respectively. Then $S_c$ is projected to image space by $V_{2d} = \frac{1}{z} K \cdot S_c$, where $z$ represents the distance from each vertex to the camera. This shows that the per-vertex distance $z$ is the main source of difference between the two projections.
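To make the difference between the two camera models concrete, the following NumPy sketch projects the same canonical vertices with both formulations; the variable names ($S$, $R$, $t_{3d}$, $K$, $f$, $t_{2d}$) follow the notation above, while the array layout, intrinsics and toy face are illustrative assumptions rather than the paper's code.

```python
import numpy as np

def orthographic_project(S, f, t_2d):
    """Scaled orthographic projection (Eq. 1): drop z, scale, translate.
    S: (3, n) canonical vertices, f: isotropic scale, t_2d: (2,) translation."""
    Pr = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])          # orthographic 3D-2D projection matrix
    return f * (Pr @ S) + t_2d[:, None]        # (2, n) image coordinates

def perspective_project(S, R, t_3d, K):
    """Perspective projection (Eq. 2 + pinhole model): rigidly transform to the
    camera frame, then divide by the per-vertex depth z.
    S: (3, n), R: (3, 3), t_3d: (3,), K: (3, 3) intrinsics."""
    S_c = R @ S + t_3d[:, None]                # vertices in camera coordinates
    uvw = K @ S_c                              # homogeneous image coordinates
    return uvw[:2] / uvw[2:3]                  # (2, n); the depth z causes the distortion

if __name__ == "__main__":
    # A toy point cloud standing in for a canonical face (~16 cm across).
    S = np.random.uniform(-0.08, 0.08, size=(3, 1220))
    K = np.array([[1500.0, 0.0, 720.0],
                  [0.0, 1500.0, 540.0],
                  [0.0, 0.0, 1.0]])
    R = np.eye(3)
    near = perspective_project(S, R, np.array([0.0, 0.0, 0.3]), K)   # 0.3 m away
    far = perspective_project(S, R, np.array([0.0, 0.0, 0.9]), K)    # 0.9 m away
    ortho = orthographic_project(S, f=1500.0, t_2d=np.array([720.0, 540.0]))
    print(near.shape, far.shape, ortho.shape)   # (2, 1220) each
```

Under orthographic projection a change of $f$ only rescales the result uniformly, whereas in the perspective sketch the near and far projections differ per vertex because each vertex is divided by its own depth.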

Figure 3: The framework of our proposed method based on PerspNet. 3D features and 2D image features are extracted from a 2D facial image by a shared encoder with two decoder branches. The 3D features are fed into the 3D face shape reconstruction module to predict the 3D face shape in the world coordinate system. The 2D facial image features, 2D position encoding features and 3D features are fused to learn the correspondence between 2D pixels and 3D points of the reconstructed 3D face shape. With the corresponding 2D pixels and 3D points of a face, the 6DoF pose of the face is computed by a PnP algorithm. In addition, the 2D image features are also fed into a 2D face segmentation module, which is used to extract 2D pixels in the face region at test time.

In this work, we focus on 3D face reconstruction under perspective projection, especially at very close range. Given an RGB image, the goal is to recover the 3D face shape $S$ and estimate its 6DoF pose $(R, t_{3d})$. This is achieved by a framework built around a new deep learning network, PerspNet, as shown in Fig. 3. The proposed method contains two sub-tasks: 3D face shape reconstruction and 6DoF face pose estimation. For the 3D face shape reconstruction, we design a UV position map [13] regression method. For 6DoF face pose estimation, we adopt a two-stage pipeline [26, 18, 33]: we first select 2D pixels in the image and learn their corresponding 3D points in the reconstructed 3D face shape, and then compute the 6DoF face pose parameters with a PnP algorithm [23].

3.1 Perspective Network (PerspNet)

Specifically, PerspNet consists of four modules: feature extraction, 3D face shape reconstruction, 2D-3D correspondence learning and 2D face segmentation. Given a 2D facial image, PerspNet predicts the 3D face shape, the 2D-3D correspondence matrix and the face segmentation mask. From the input, an encoder and two decoders are trained to extract 3D features and 2D image features, respectively. The 3D features are fed into the 3D face shape reconstruction module to regress the UV position map, which represents the 3D face shape in canonical space. The 2D features, which include 2D facial image features and encoded 2D position features, are then fused with the 3D features to learn the correspondence between 2D pixels in the image and 3D points of the 3D face shape. With this correspondence, the 6DoF pose of the face can be computed. In addition, the 2D image features are also fed into the 2D face segmentation module, which is used to extract the observed 2D pixels in the face region during testing.

3D Face Shape Reconstruction. Different from the UV formulation in [13], which is defined in the image coordinate system, our UV position map records the 3D coordinates of the facial structure in the canonical pose, so it represents only the facial shape. Specifically,

$$\mathbf{S} = \mathrm{Render}(S, \mathbf{uv}, \mathbf{tri}), \qquad (3)$$

where $\mathbf{S}$ is the rendered UV position map, $\mathbf{uv}$ represents the UV coordinates recording the 2D locations of the 3D vertices in the UV map, and $\mathbf{tri}$ denotes the triangles of the 3D face mesh. A fully convolutional encoder-decoder architecture is utilized to regress the UV position map, as shown in Fig. 3.

To supervise the 3D face shape prediction, a weighted L1 loss is used to measure the difference between the ground-truth position map $\mathbf{S}$ and the network output $\hat{\mathbf{S}}$,

$$L_{uv} = \sum_{(u,v)} \left\| \hat{\mathbf{S}}(u,v) - \mathbf{S}(u,v) \right\|_1 \cdot W(u,v), \qquad (4)$$

where $W$ is a weight matrix for $\mathbf{S}$; we set the entries of $W$ corresponding to the face region in the UV map to 1 and the others to 0. Then we extract the 3D vertices of the face from the UV position map using the UV coordinates.
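A minimal PyTorch sketch of this supervision is given below; it assumes a 3-channel position map, a binary weight matrix and integer UV coordinates, and the function names are illustrative rather than the authors' implementation.

```python
import torch

def weighted_l1_uv_loss(pred_uv, gt_uv, weight):
    """Weighted L1 loss on the UV position map (Eq. 4).
    pred_uv, gt_uv: (B, 3, H, W) position maps; weight: (H, W) binary weights."""
    return (torch.abs(pred_uv - gt_uv) * weight).sum() / weight.sum().clamp(min=1)

def vertices_from_uv_map(uv_map, uv_coords):
    """Read the 3D face vertices back out of a UV position map.
    uv_map: (3, H, W); uv_coords: (N_v, 2) integer (u, v) location of each vertex."""
    u, v = uv_coords[:, 0].long(), uv_coords[:, 1].long()
    return uv_map[:, v, u].T                    # (N_v, 3) canonical 3D vertices
```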

2D-3D Correspondence Learning. To estimate the 6DoF pose of the face in a 2D image, we use a two-stage pipeline that first learns the correspondence between 2D pixels in the image and 3D points of the reconstructed 3D face shape, and then computes the 6DoF face pose parameters with a PnP algorithm. We design a 2D-3D correspondence learning module in PerspNet for the first stage, as shown in Fig. 3. We build a correspondence probability matrix $M \in \mathbb{R}^{N_p \times N_v}$, where each row records the corresponding 3D points in canonical space of one pixel on the input image. Here $N_p$ is the number of face-region pixels selected from the 2D image and $N_v$ is the number of vertices of the 3D face shape. $M$ is estimated as

$$\hat{M} = \mathrm{softmax}\big(\mathrm{MLP}([F_{2d}; F_{3d}])\big), \qquad (5)$$

where $F_{2d}$ and $F_{3d}$ denote the 2D and 3D features described below.

To learn the correspondence between 2D and 3D points, we extract 2D features and 3D features, respectively.

For the 2D features $F_{2d}$, we first extract image features with another fully convolutional decoder attached to the same encoder, applied to the same facial image. For each selected pixel, we also extract 2D position features in the image and fuse them with the aligned image features as its local feature. The position features are encoded by a 2D position encoding method, an extension of the 1D position encoding of [34]. 2D global features are learned by feeding the 2D local features into Multi-Layer Perceptron (MLP) layers followed by a global average pooling layer. The 2D local and global features are then fused to form the 2D features $F_{2d}$.

As for the 3D features $F_{3d}$, since the UV position map contains 3D geometry information, we extract 3D global features from the UV position regression network, followed by MLP layers and a global average pooling layer. The 2D features and 3D features are then fused and fed into MLP layers and a softmax layer. In this way, the correspondence matrix $\hat{M}$ is predicted by the network.
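The following PyTorch sketch illustrates one plausible realization of this correspondence head (per-pixel 2D features, global average pooling of 2D and 3D features, and a softmax over the $N_v$ vertices); the layer sizes and module names are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CorrespondenceHead(nn.Module):
    """Sketch of the 2D-3D correspondence module: fuse per-pixel 2D features
    (aligned image features + 2D position encodings), 2D global features and
    3D global features, then predict a softmax distribution over the N_v mesh
    vertices for every sampled pixel (Eq. 5)."""

    def __init__(self, c2d=64, c3d=64, cg=256, n_vertices=1220):
        super().__init__()
        self.local_mlp = nn.Sequential(nn.Linear(c2d, cg), nn.ReLU())
        self.mlp_2d_global = nn.Sequential(nn.Linear(cg, cg), nn.ReLU())
        self.mlp_3d_global = nn.Sequential(nn.Linear(c3d, cg), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(cg * 3, cg), nn.ReLU(), nn.Linear(cg, n_vertices))

    def forward(self, f2d_local, f3d):
        # f2d_local: (B, N_p, c2d) per-pixel image + position-encoding features
        # f3d:       (B, N_uv, c3d) features from the UV position regression branch
        local = self.local_mlp(f2d_local)                    # (B, N_p, cg)
        g2d = self.mlp_2d_global(local).mean(dim=1)          # global average pooling
        g3d = self.mlp_3d_global(f3d).mean(dim=1)            # (B, cg)
        g = torch.cat([g2d, g3d], dim=-1)                    # fused global feature
        fused = torch.cat(
            [local, g[:, None].expand(-1, local.size(1), -1)], dim=-1)
        return torch.softmax(self.head(fused), dim=-1)       # (B, N_p, N_v) = M
```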

To obtain the ground-truth $M$, for each pixel in the 2D face region we compute its barycentric coordinates [35] with respect to the three vertices of the triangle it lies in. The barycentric coordinates can be taken as the correspondence probabilities between this 2D pixel and the three vertices of the 3D face mesh; the other values in $M$ are set to 0. Each row of the matrix thus represents a distribution over the correspondences between the $i$-th selected pixel and the mesh vertices. We use a Kullback-Leibler (KL) divergence loss for the correspondence matrix. In addition, since the matrix is very sparse, we also minimize the entropy of the prediction to regularize it. The final loss for $M$ is

$$L_{mat} = \mathrm{KL}\big(M \,\|\, \hat{M}\big) + \lambda\, \mathcal{H}\big(\hat{M}\big). \qquad (6)$$

Here $M$ and $\hat{M}$ are the ground-truth and predicted correspondence matrices, and $\lambda$ is a constant weight. With the predicted matrix $\hat{M}$ and the reconstructed 3D face $\hat{S}$, the corresponding 3D points for the selected 2D pixels are obtained by matrix multiplication, and an L1 loss is applied to supervise them:

$$L_{pt} = \big\| \hat{M}\hat{S} - P \big\|_1, \qquad (7)$$

where the ground-truth 3D points $P$ are obtained via perspective projection of the ground-truth mesh.
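A short PyTorch sketch of these two losses is shown below; the weight `lam`, the tensor shapes and the function names are illustrative assumptions consistent with the description above.

```python
import torch

def correspondence_loss(M_pred, M_gt, lam=0.01, eps=1e-8):
    """KL divergence between the GT and predicted correspondence matrices plus
    an entropy regularizer that encourages sparse rows (Eq. 6).
    Both matrices are (B, N_p, N_v) with rows summing to 1; lam is illustrative."""
    kl = (M_gt * (torch.log(M_gt + eps) - torch.log(M_pred + eps))).sum(-1).mean()
    entropy = -(M_pred * torch.log(M_pred + eps)).sum(-1).mean()
    return kl + lam * entropy

def corresponding_points_loss(M_pred, S_pred, P_gt):
    """L1 loss on the per-pixel corresponding 3D points (Eq. 7).
    M_pred: (B, N_p, N_v), S_pred: (B, N_v, 3) reconstructed canonical shape,
    P_gt: (B, N_p, 3) ground-truth 3D point for each sampled pixel."""
    P_pred = torch.bmm(M_pred, S_pred)       # matrix multiplication: (B, N_p, 3)
    return torch.abs(P_pred - P_gt).mean()
```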

2D Face Segmentation. The 2D face segmentation module in the proposed network is trained to select 2D pixels from the face region at inference time. When training the whole network, the image pixels are randomly chosen from the ground-truth face segmentation mask, while in the testing phase they are randomly chosen from the predicted face mask. The 2D face segmentation head follows the 2D image feature extraction network and is trained with a 2-class softmax loss, denoted $L_{seg}$.

Training Objective. We train our whole network with a multi-task loss. The final loss function is

$$L = \lambda_1 L_{uv} + \lambda_2 L_{mat} + \lambda_3 L_{pt} + \lambda_4 L_{seg}, \qquad (8)$$

where $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\lambda_4$ are the weights of the four losses. Experimental results show that jointly training these tasks boosts the performance of each of them.

3.2 6DoF Face Pose Estimation

Based on the output of the network, the final 6DoF pose can be computed. Given the chosen 2D pixel coordinates in the original full 2D image, their corresponding 3D point coordinates from the reconstructed face in the world coordinate system, and the camera intrinsic parameters $K$, we apply a PnP algorithm [23] with Random Sample Consensus (RANSAC) [14] to compute the 6DoF face pose parameters $(R, t_{3d})$. Perspective-n-Point is the problem of estimating the pose of a calibrated camera given a set of 3D points in the world and their corresponding 2D projections in the image. The camera pose consists of 6DoF, made up of the rotation (roll, pitch and yaw) and the 3D translation of the camera with respect to the world. With the estimated 6DoF face pose, the reconstructed 3D face shape can be projected onto the 2D image. It is worth noting that directly regressing the 6DoF pose parameters from a single image with a CNN is also feasible, but it performs much worse than our method due to the nonlinearity of the rotation space [26], as validated in our experiments.
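Since the pose is recovered with EPnP [23] plus RANSAC [14] and the PnP solver is implemented in OpenCV (Section 5.1), a minimal version can be sketched with `cv2.solvePnPRansac` as below; the solver flag and the assumption of undistorted images are our illustrative choices, not necessarily the authors' exact settings.

```python
import cv2
import numpy as np

def solve_face_pose(points_3d, pixels_2d, K):
    """Recover the 6DoF face pose (R, t) from predicted 2D-3D correspondences
    with EPnP + RANSAC. points_3d: (N, 3) canonical 3D points, pixels_2d: (N, 2)
    pixel coordinates in the full image, K: (3, 3) camera intrinsics."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        points_3d.astype(np.float64),
        pixels_2d.astype(np.float64),
        K.astype(np.float64),
        distCoeffs=None,                 # assume undistorted images
        flags=cv2.SOLVEPNP_EPNP,
    )
    if not ok:
        raise RuntimeError("PnP failed to find a valid pose")
    R, _ = cv2.Rodrigues(rvec)           # axis-angle -> 3x3 rotation matrix
    return R, tvec.reshape(3), inliers
```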

| Dataset | Sub. Num | Image Num | 3D Mesh Num | Exp. Num | Camera | Vert. Num | 6DoF Pose |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Bosphorus [32] | 105 | 4,666 | 4,666 | 35 | Mega | 35K | No |
| MICC [2] | 53 | 53 | 203 | 5 | 3DMD | 40K | No |
| 3DFAW [27] | 26 | 26 | 26 | Neutral | DI3D | 20K | No |
| BP4D [39] | 41 | 328 | 328 | 8 | 3DMD | 70K | No |
| AFLW2000-3D [41] | 2,000 | 2,000 | 3DMM | - | - | 53,149 | No |
| BIWI [11] | 20 | 15,678 | 24 | Neutral | Kinect | 6,918 | Yes |
| NoW [31] | 100 | 2,054 | 100 | Neutral | iPhone X | 58,668 | No |
| FaceScape [37] | 938 | 1,275,680 | 18,760 | 20 | DSLR | 2M | fixed pose |
| ARKitFace | 500 | 902,724 | 902,724 | 33 | iPhone 11 | 1,220 | Yes |

Table 1: Comparison of ARKitFace with other 3D face datasets. Exp. Num and Vert. Num are the number of annotated expression categories and the number of vertices of the 3D mesh, respectively.

4 ARKitFace Dataset

The ARKitFace dataset is established in this work to train and evaluate both 3D face shape and 6DoF pose in the setting of perspective projection. A total of 500 volunteers, aged 9 to 60, are invited to record the dataset. They sit in a random environment, and the 3D acquisition equipment is fixed in front of them at a distance ranging from about 0.3 m to 0.9 m. Each subject is asked to perform 33 specific expressions with two head movements (from looking left to looking right / from looking up to looking down). The 3D acquisition equipment is an iPhone 11. The shape and location of the face are tracked by the structured-light sensor, and the triangle mesh and 6DoF information of the RGB images are obtained by the built-in ARKit toolbox. The triangle mesh consists of 1,220 vertices and 2,304 triangles. In total, 902,724 2D facial images (in two resolutions) with ground-truth 3D meshes and 6DoF pose annotations are collected. An example is shown in Fig. 2. Distributions of age, gender and each pose parameter of the 6DoF on the ARKitFace dataset are shown in Fig. 4. We can observe that our dataset has balanced gender, diverse ages and varied 6DoF poses. The comparison between different datasets in Table 1 shows that ARKitFace surpasses the existing datasets in terms of scale, exact 3D shape annotations and diversity of poses.

Authorization: All 500 subjects consent to the use of their data. We will release the 2D facial images, 3D meshes and 6DoF pose annotations of all subjects under their authorization. We will not release their personal information, including age, gender, etc.

Figure 4: Distributions of age, gender, and each pose parameter of 6DoF on ARKitFace dataset.

5 Experiments

5.1 Implementation Details

Our PerspNet is implemented in PyTorch [24]. During training, PerspNet takes as input an image cropped from the full 2D image based on the ground-truth face segmentation mask and resized to a fixed resolution. To augment the data with large poses, we utilize the face profiling method [40] to generate profile views of faces from medium-pose samples for the three Euler angles, enlarging all three angles towards their maximum and minimum values. We also apply online data augmentation, including random cropping, resizing and color jittering, during training. We use a pre-trained ResNet-18 [17] architecture as the encoder. To regress the UV position map, the first decoder is implemented with 5 up-sampling layers, and its output feature map is regarded as the 3D features. The number of points in the 3D face shape is 1,220. To extract 2D image features and segment the 2D face region from the background, the second decoder consists of five up-sampling layers, and each up-sampled feature map is concatenated with the feature map of the same size from the encoder backbone; the 2D image features have the same size as the 3D features. The number of randomly sampled 2D pixels in the face region is 1,024; if the face region contains fewer pixels, we sample with repetition. At the training phase, the 2D pixels are randomly chosen from the ground-truth face mask, and the weights of the four losses are set to fixed values. The initial learning rate is set to 0.0001 and is decayed linearly after 10 epochs. We train all models for 20 epochs. All networks are trained and evaluated on the ARKitFace dataset with ground-truth bounding boxes. At the testing phase, the 2D pixels are randomly chosen from the segmented face region. The PnP algorithm is implemented in OpenCV [4].
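The fixed-size sampling of face-region pixels described above can be sketched as follows; the binary-mask input format and the RNG handling are assumptions for illustration, not the authors' code.

```python
import numpy as np

def sample_face_pixels(mask, n_samples=1024, rng=None):
    """Sample a fixed number of pixel coordinates from a binary face mask.
    If the face region contains fewer than n_samples pixels, sample with
    repetition so the network always receives the same number of points."""
    rng = rng or np.random.default_rng()
    ys, xs = np.nonzero(mask)                     # coordinates of face pixels
    if len(xs) == 0:
        raise ValueError("empty face mask")
    replace = len(xs) < n_samples                 # repeat pixels only if needed
    idx = rng.choice(len(xs), size=n_samples, replace=replace)
    return np.stack([xs[idx], ys[idx]], axis=1)   # (n_samples, 2) as (u, v)
```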

5.2 Dataset

To validate the proposed method, we conduct experiments on our ARKitFace dataset and the public BIWI dataset.

ARKitFace. We randomly select 400 subjects of our dataset as training data, with a total of 717,840 2D facial images and annotations, leaving 100 subjects with 184,884 samples for testing.

BIWI. BIWI [11] contains 24 videos of 20 subjects in an indoor environment. Each video is coupled with a neutral 3D face mesh of a specific person. There are 15,678 frames with a wide range of face poses in this dataset. Paired RGB and depth images are provided for each frame; we only use the RGB images as input in this work. The benchmark provides ground-truth labels for the full 6DoF: rotation (as a rotation matrix) and translation. Since a 3D face mesh is not available for each frame, we cannot evaluate 3D face reconstruction on this dataset and only evaluate the 6DoF pose estimation task. We train our method on the ARKitFace training set and test on the entire BIWI dataset, following previous methods [20, 38, 1].

5.3 Evaluation Metric

For 3D face shape reconstruction, following previous works [12, 31], we use the median and average distances between predicted and ground-truth 3D mesh vertices. For 6DoF face pose estimation, we follow previous head/face pose estimation methods [1, 38, 20]: we convert the rotation matrices into three Euler angles (yaw, pitch, roll) and compute the mean absolute error (MAE) of the 6DoF parameters, i.e. the three angles and the three translation components ($t_x$, $t_y$, $t_z$); the averaged rotational MAE ($\mathrm{MAE}_r$) and translational MAE ($\mathrm{MAE}_t$) are also reported. Furthermore, to evaluate the 6DoF face pose with a single metric, we adopt the average 3D distance (ADD) metric [19] used for object pose evaluation. Given the ground-truth rotation $R$ and translation $t$ and the estimated rotation $\hat{R}$ and translation $\hat{t}$, ADD computes the mean of the pairwise distances between the ground-truth 3D face model points transformed by the ground-truth pose and by the estimated pose: $\mathrm{ADD} = \frac{1}{N_v}\sum_{p \in \mathcal{S}} \big\| (Rp + t) - (\hat{R}p + \hat{t}) \big\|_2$.
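The two pose metrics can be sketched as below; array shapes and function names are illustrative, and the Euler angles are assumed to be given in degrees (angle wrap-around is ignored in this sketch).

```python
import numpy as np

def add_metric(model_points, R_gt, t_gt, R_pred, t_pred):
    """Average 3D distance (ADD): mean pairwise distance between the face model
    points transformed by the ground-truth pose and by the estimated pose.
    model_points: (N_v, 3); R_*: (3, 3); t_*: (3,)."""
    gt = model_points @ R_gt.T + t_gt
    pred = model_points @ R_pred.T + t_pred
    return np.linalg.norm(gt - pred, axis=1).mean()

def pose_mae(euler_gt, euler_pred, t_gt, t_pred):
    """Mean absolute errors of the three Euler angles and translation components,
    plus the averaged rotational and translational MAE reported in the tables.
    euler_*: (N, 3) in degrees; t_*: (N, 3)."""
    rot_err = np.abs(euler_gt - euler_pred).mean(axis=0)     # yaw, pitch, roll
    trans_err = np.abs(t_gt - t_pred).mean(axis=0)           # t_x, t_y, t_z
    return rot_err, rot_err.mean(), trans_err, trans_err.mean()
```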

| Method | Yaw | Pitch | Roll | MAE_r | t_x | t_y | t_z | MAE_t | ADD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| img2pose (retrained) [1] | 5.07 | 7.32 | 4.25 | 5.55 | 1.39 | 3.72 | 15.95 | 7.02 | 20.54 |
| Direct 6DoF Regress | 1.86 | 2.72 | 1.03 | 1.87 | 2.80 | 5.23 | 19.16 | 9.06 | 21.39 |
| PerspNet w/o PE | 1.01 | 1.53 | 0.61 | 1.05 | 1.17 | 2.39 | 11.77 | 5.11 | 12.34 |
| PerspNet w/o [loss] | 1.04 | 1.45 | 0.60 | 1.03 | 1.09 | 2.13 | 10.28 | 4.50 | 10.89 |
| PerspNet (ours) | 0.99 | 1.43 | 0.55 | 0.99 | 0.97 | 2.12 | 9.45 | 4.18 | 10.01 |

Table 2: Comparison of different methods for 6DoF face pose estimation on the ARKitFace test set.
| Method | Yaw | Pitch | Roll | MAE_r | t_x | t_y | t_z | MAE_t | ADD |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Dlib (68 points) [22] | 16.76 | 13.80 | 6.19 | 12.25 | - | - | - | - | - |
| 3DDFA [40] | 36.18 | 12.25 | 8.78 | 19.07 | - | - | - | - | - |
| FAN (12 points) [5] | 8.53 | 7.48 | 7.63 | 7.88 | - | - | - | - | - |
| Hopenet [30] | 4.81 | 6.61 | 3.27 | 4.90 | - | - | - | - | - |
| QuatNet [20] | 4.01 | 5.49 | 2.94 | 4.15 | - | - | - | - | - |
| FSA-NET [38] | 4.56 | 5.21 | 3.07 | 4.28 | - | - | - | - | - |
| HPE [21] | 4.57 | 5.18 | 3.12 | 4.29 | - | - | - | - | - |
| TriNet [7] | 3.05 | 4.76 | 4.11 | 3.97 | - | - | - | - | - |
| RetinaFace R-50 (5 points) [10] | 4.07 | 6.42 | 2.97 | 4.49 | - | - | - | - | - |
| img2pose [1] | 4.57 | 3.55 | 3.24 | 3.79 | - | - | - | - | - |
| Direct 6DoF Regress | 16.49 | 14.03 | 5.81 | 12.11 | 62.36 | 85.01 | 366.52 | 171.30 | 562.38 |
| PerspNet w/o PE | 3.63 | 3.81 | 3.48 | 3.64 | 6.03 | 9.11 | 77.87 | 31.00 | 142.20 |
| PerspNet w/o [loss] | 3.67 | 3.52 | 3.26 | 3.48 | 5.57 | 8.53 | 75.23 | 29.78 | 136.16 |
| PerspNet (ours) | 3.10 | 3.37 | 2.38 | 2.95 | 4.15 | 6.43 | 46.69 | 19.09 | 100.09 |

Table 3: Comparison of different methods for 6DoF face pose estimation on the BIWI dataset.

5.4 Evaluation for 6DoF Face Pose Estimation

Comparison with state-of-the-art methods. To compare with state-of-the-art methods on head/face pose estimation, we first retrain the most recent state-of-the-art method, img2pose [1], on the ARKitFace training set and evaluate it on the test data. As shown in Table 2, our method outperforms it significantly. In addition, the public BIWI dataset is used for cross-dataset evaluation of our final network. Since some faces with large angles cannot be detected, we follow img2pose [1] and test on the 13,219 images with detected bounding boxes. We use the code of [5] to detect 68 facial landmarks and crop the facial region. All results are shown in Table 3. Our method performs much better than previous methods, especially the recent img2pose [1]. This experiment also demonstrates that our method and dataset generalize well to data from a different domain. The inference time of our model is 11.7 ms on a P100 GPU.

Ablation Studies. We build several baselines to evaluate the components that contribute to our performance. Since img2pose [1] directly regresses the 6 pose parameters, we build a baseline, Direct 6DoF Regress, that regresses the 6 pose parameters directly after the backbone network. The other baselines are our PerspNet without the position encoding features (PerspNet w/o PE), which validates the effectiveness of the 2D position encoding, and our PerspNet without one of the losses (PerspNet w/o [loss]), which evaluates the corresponding component. The results are shown in Table 2 and Table 3. Our two-stage method outperforms direct regression significantly, and both the 2D position encoding features and the ablated loss are effective for the face pose estimation task. Moreover, to analyze the influence of the segmentation mask, the ground-truth face mask is used at inference time; the resulting yaw, pitch, roll, MAE_r, t_x, t_y, t_z, MAE_t and ADD are 0.72, 1.10, 0.54, 0.79, 0.92, 1.47, 9.59, 3.99 and 9.99, respectively. The method with the ground-truth face mask performs better than that with the predicted segmentation mask, which indicates that more accurate segmentation would bring further gains.

Figure 5: Qualitative results for 6DoF pose estimation and 3D face shape reconstruction on ARKitFace dataset.
| Method | Median (mm) | Mean (mm) |
| --- | --- | --- |
| PRNet [13] | 1.97 | 2.05 |
| 3DDFA_v2 [16] | 2.35 | 2.31 |
| PerspNet (ours) | 1.72 | 1.76 |

Table 4: Results for 3D face shape reconstruction on ARKitFace.

5.5 Evaluation on 3D Face Shape Reconstruction

To validate the proposed method on the 3D face shape reconstruction task, we report results on the ARKitFace test data in Table 4. For comparison, we also train a single-task PRNet [13], using the same encoder-decoder UV regression network as in our multi-task network, on the ARKitFace training data. As shown in Table 4, our multi-task network outperforms the single-task PRNet, which indicates that the pose estimation task also improves 3D face shape reconstruction. In addition, we compare our method with other state-of-the-art methods such as 3DDFA_v2 [16] in Table 4; our method still achieves the best 3D face reconstruction performance.

5.6 Qualitative Results

We display qualitative results for 6DoF pose estimation and 3D reconstruction on the ARKitFace and BIWI datasets in Fig. 5 and Fig. 6(a), comparing img2pose, our predictions and the ground truth (GT). The predicted face pose with the GT 3D mesh, the predicted face pose with the predicted 3D mesh, and their error maps are shown, respectively. Our results are accurate and outperform img2pose, especially at large poses. We also show qualitative results on in-the-wild images from the WIDER FACE dataset in Fig. 6(b), which indicate that our method also performs well on in-the-wild images.

Figure 6: Qualitative results for 6DoF pose estimation and 3D face shape reconstruction on (a) BIWI dataset and (b) WIDER FACE validation images.

6 Conclusion

We explore 3D face reconstruction under perspective projection from a single RGB image for 3D face AR applications. We introduce a novel framework in which a deep learning network, PerspNet, performs 3D face shape reconstruction, correspondence learning between 2D pixels and 3D points of the 3D face model, and 2D face region segmentation. With 2D pixels in the facial image and their corresponding 3D points on the reconstructed 3D face mesh, the 6DoF face pose is estimated by a PnP method, and this 6DoF pose defines the perspective projection transformation. To enable PerspNet, we build a large-scale 3D face dataset, the ARKitFace dataset, annotated with 2D facial images, 3D face meshes and 6DoF poses. Experiments demonstrate the effectiveness of our approach for 3D face shape reconstruction and 6DoF pose estimation. Like most face analysis methods, our method and data may raise privacy concerns if misused; therefore, the release of the data is fully authorized by the subjects. We hope this work will spur future research on 3D face reconstruction and face pose estimation.

References

  • [1] V. Albiero, X. Chen, X. Yin, G. Pang, and T. Hassner (2021) Img2pose: face alignment and detection via 6dof, face pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7617–7627. Cited by: §0.A.1, §2.2, §5.2, §5.3, §5.4, Table 2, Table 3.
  • [2] A. D. Bagdanov, A. Del Bimbo, and I. Masi (2011) The florence 2d/3d hybrid face dataset. In Proceedings of the 2011 joint ACM workshop on Human Gesture and Behavior Understanding, pp. 79–80. Cited by: §2.3, Table 1.
  • [3] Z. Bai, Z. Cui, X. Liu, and P. Tan (2021) Riggable 3d face reconstruction via in-network optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6216–6225. Cited by: §1, §2.1, §3.
  • [4] G. Bradski and A. Kaehler (2000) OpenCV. Dr. Dobb’s journal of software tools 3. Cited by: §5.1.
  • [5] A. Bulat and G. Tzimiropoulos (2017) How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In Proceedings of the International Conference on Computer Vision, Cited by: §5.4, Table 3.
  • [6] C. Cao, Y. Weng, S. Zhou, Y. Tong, and K. Zhou (2013) Facewarehouse: a 3d facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics 20 (3), pp. 413–425. Cited by: §2.3.
  • [7] Z. Cao, Z. Chu, D. Liu, and Y. Chen (2021) A vector-based representation to enhance head pose estimation. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 1188–1197. Cited by: Table 3.
  • [8] F. Chang, A. T. Tran, T. Hassner, I. Masi, R. Nevatia, and G. Medioni (2019) Deep, landmark-free fame: face alignment, modeling, and expression estimation. International Journal of Computer Vision 127 (6), pp. 930–956. Cited by: §2.2.
  • [9] C. V. CS (2021) The geometry of perspective projection. https://www.cse.unr.edu/~bebis/CS791E/Notes/PerspectiveProjection.pdf. Cited by: §1.
  • [10] J. Deng, J. Guo, E. Ververas, I. Kotsia, and S. Zafeiriou (2020) Retinaface: single-shot multi-level face localisation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5203–5212. Cited by: Table 3.
  • [11] G. Fanelli, M. Dantone, J. Gall, A. Fossati, and L. Van Gool (2013) Random forests for real time 3d face analysis. International Journal of Computer Vision 101 (3), pp. 437–458. Cited by: §1, §2.3, Table 1, §5.2.
  • [12] Y. Feng, H. Feng, M. J. Black, and T. Bolkart (2021) Learning an animatable detailed 3d face model from in-the-wild images. ACM Transactions on Graphics 40 (4), pp. 1–13. Cited by: §1, §2.1, §3, §5.3.
  • [13] Y. Feng, F. Wu, X. Shao, Y. Wang, and X. Zhou (2018) Joint 3d face reconstruction and dense alignment with position map regression network. In Proceedings of the European Conference on Computer Vision, pp. 534–551. Cited by: §1, §1, §2.1, §3.1, §3, §3, §5.5, Table 4.
  • [14] M. A. Fischler and R. C. Bolles (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395. Cited by: §3.2.
  • [15] D. Forsyth and J. Ponce (2011) Computer vision: a modern approach.. Prentice hall. Cited by: §1.
  • [16] J. Guo, X. Zhu, Y. Yang, F. Yang, Z. Lei, and S. Z. Li (2020) Towards fast, accurate and stable 3d dense face alignment. In Proceedings of the European Conference Computer Vision, pp. 152–168. Cited by: §1, §2.1, §3, §5.5, Table 4.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §5.1.
  • [18] Y. He, W. Sun, H. Huang, J. Liu, H. Fan, and J. Sun (2020) Pvn3d: a deep point-wise 3d keypoints voting network for 6dof pose estimation. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 11632–11641. Cited by: §1, §3.
  • [19] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab (2012) Model based training, detection and pose estimation of texture-less 3d objects in heavily cluttered scenes. In Proceedings of the Asian Conference on Computer Vision, pp. 548–562. Cited by: §5.3.
  • [20] H. Hsu, T. Wu, S. Wan, W. H. Wong, and C. Lee (2018) Quatnet: quaternion-based head pose estimation with multiregression loss. IEEE Transactions on Multimedia 21 (4), pp. 1035–1046. Cited by: §2.2, §5.2, §5.3, Table 3.
  • [21] B. Huang, R. Chen, W. Xu, and Q. Zhou (2020) Improving head pose estimation using two-stage ensembles with top-k regression. Image and Vision Computing 93, pp. 103827. Cited by: Table 3.
  • [22] V. Kazemi and J. Sullivan (2014) One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1867–1874. Cited by: Table 3.
  • [23] V. Lepetit, F. Moreno-Noguer, and P. Fua (2009) Epnp: an accurate o (n) solution to the pnp problem. International Journal of Computer Vision 81 (2), pp. 155. Cited by: §1, §3.2, §3.
  • [24] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) Pytorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32, pp. 8026–8037. Cited by: §5.1.
  • [25] P. Paysan, R. Knothe, B. Amberg, S. Romdhani, and T. Vetter (2009) A 3d face model for pose and illumination invariant face recognition. In IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 296–301. Cited by: §2.3.
  • [26] S. Peng, Y. Liu, Q. Huang, X. Zhou, and H. Bao (2019) Pvnet: pixel-wise voting network for 6dof pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4561–4570. Cited by: §1, §3.2, §3.
  • [27] R. K. Pillai, L. A. Jeni, H. Yang, Z. Zhang, L. Yin, and J. F. Cohn (2019) The 2nd 3d face alignment in the wild challenge (3dfaw-video): dense reconstruction from video.. In ICCV Workshops, pp. 3082–3089. Cited by: §2.3, Table 1.
  • [28] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems 28, pp. 91–99. Cited by: §2.2.
  • [29] S. Romdhani and T. Vetter (2005) Estimating 3d shape and texture using pixel intensity, edges, specular highlights, texture constraints and a prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 986–993. Cited by: §1, §2.1.
  • [30] N. Ruiz, E. Chong, and J. M. Rehg (2018) Fine-grained head pose estimation without keypoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2074–2083. Cited by: Table 3.
  • [31] S. Sanyal, T. Bolkart, H. Feng, and M. J. Black (2019) Learning to regress 3d face shape and expression from an image without 3d supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7763–7772. Cited by: §1, §2.3, Table 1, §5.3.
  • [32] A. Savran, N. Alyüz, H. Dibeklioğlu, O. Çeliktutan, B. Gökberk, B. Sankur, and L. Akarun (2008) Bosphorus database for 3d face analysis. In European Workshop on Biometrics and Identity Management, pp. 47–56. Cited by: §2.3, Table 1.
  • [33] M. Tian, M. H. Ang, and G. H. Lee (2020) Shape prior deformation for categorical 6d object pose and size estimation. In Proceedings of the European Conference on Computer Vision, pp. 530–546. Cited by: §1, §3.
  • [34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §3.1.
  • [35] E. W. Weisstein (2003) Barycentric coordinates. https://mathworld.wolfram.com/. Cited by: §0.A.3, §3.1.
  • [36] M. Xin, S. Mo, and Y. Lin (2021) EVA-gcn: head pose estimation based on graph convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1462–1471. Cited by: §2.2.
  • [37] H. Yang, H. Zhu, Y. Wang, M. Huang, Q. Shen, R. Yang, and X. Cao (2020-06) FaceScape: a large-scale high quality 3d face dataset and detailed riggable 3d face prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §2.3, Table 1.
  • [38] T. Yang, Y. Chen, Y. Lin, and Y. Chuang (2019) Fsa-net: learning fine-grained structure aggregation for head pose estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1087–1096. Cited by: §2.2, §5.2, §5.3, Table 3.
  • [39] X. Zhang, L. Yin, J. F. Cohn, S. Canavan, M. Reale, A. Horowitz, P. Liu, and J. M. Girard (2014) Bp4d-spontaneous: a high-resolution spontaneous 3d dynamic facial expression database. Image and Vision Computing 32 (10), pp. 692–706. Cited by: §2.3, Table 1.
  • [40] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li (2016) Face alignment across large poses: a 3d solution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 146–155. Cited by: §5.1, Table 3.
  • [41] X. Zhu, X. Liu, Z. Lei, and S. Z. Li (2017) Face alignment in full pose range: a 3d total solution. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (1), pp. 78–92. Cited by: §1, §1, §2.1, §2.3, Table 1, §3.
  • [42] X. Zhu, F. Yang, D. Huang, C. Yu, H. Wang, J. Guo, Z. Lei, and S. Z. Li (2020) Beyond 3dmm space: towards fine-grained 3d face reconstruction. In Proceedings of the European Conference Computer Vision, pp. 343–358. Cited by: §1, §2.1.

Appendix 0.A Supplementary Materials

0.A.1 Additional Qualitative Results

Fig. 7, Fig. 8 and Fig. 9 present more qualitative results. Fig. 8 shows results constructed from our pose and the ground-truth (GT) mesh, our pose and our mesh, and the GT pose and GT mesh. "Our pose and GT mesh" means the projected faces in silver are rendered with our predicted pose and the GT mesh; "our pose and our mesh" means they are rendered with our predicted pose and our predicted mesh; "GT pose and GT mesh" means they are rendered with the GT pose and the GT mesh. We also show the distance error map alongside each rendered face. We can observe that our predicted results are very close to the GT faces and that errors occur mostly at large poses.

Fig. 7 shows results constructed from img2pose [1], our pose and the GT mesh, our pose and our mesh, and the GT pose and GT mesh. For img2pose, we use their code to render the results. Our method yields better alignment than img2pose, even when using our pose together with our predicted mesh. img2pose simulates the camera field of view narrowing from the whole image to the local bounding box and converts the global 6DoF to a local 6DoF by specific linear transformations. Since the perspective distortion of the local facial appearance is ignored, the local 6DoF is ambiguous. In contrast, our PerspNet selects 2D pixel points from the full image and predicts their corresponding 3D vertices in canonical space; with the known camera intrinsic matrix, the final 6DoF can then be recovered more accurately by PnP. Hence our method achieves better performance. The displayed images also show the robustness of our method across datasets, especially at large poses.

We also show qualitative results on in-the-wild images from the WIDER FACE dataset in Fig. 9, rendered with our predicted pose and predicted mesh. They show that our method also performs well on in-the-wild images.

Figure 7: Qualitative results for 6DoF pose estimation and 3D face shape reconstruction on BIWI dataset.
Figure 8: Qualitative results for 6DoF pose estimation and 3D face shape reconstruction on ARKitFace dataset.
Figure 9: Qualitative results for 6DoF pose estimation and 3D face shape reconstruction on in-the-wild images from the WIDER FACE dataset.
Figure 10: More samples on ARKitFace dataset.

0.A.2 Details about the ARKitFace Dataset

More samples from the ARKitFace dataset are shown in Fig. 10, with the projected 3D face in silver. All the rendered 3D faces are projected by a perspective transformation with our pose annotations.

0.A.3 Implementation Details of the 2D-3D Correspondence Matrix

To supervise the matrix $M$, we compute its ground truth for each face based on barycentric coordinates [35]. In this work, our 3D face mesh is a triangle mesh with 1,220 3D vertices and 2,304 triangles, where each triangle consists of 3 vertices and 3 edges. When a 3D face mesh is projected onto a 2D image, a pixel in the 2D face region belongs to exactly one triangle with three vertices, as shown in Fig. 11. Its barycentric coordinates in this triangle, which satisfy the additional condition that they sum to 1, can be taken as the correspondence probabilities between this 2D pixel and the three vertices of the 3D face mesh.
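A short sketch of this computation (standard barycentric coordinates of a 2D point inside a projected triangle) is given below; it is an illustration rather than the authors' code.

```python
import numpy as np

def barycentric_coordinates(p, v1, v2, v3):
    """Barycentric coordinates (alpha, beta, gamma) of 2D point p with respect
    to the projected triangle (v1, v2, v3); they sum to 1 and serve as the
    ground-truth correspondence probabilities between the pixel and the three
    mesh vertices. All inputs are length-2 NumPy arrays."""
    e1, e2, ep = v2 - v1, v3 - v1, p - v1
    d11, d12, d22 = e1 @ e1, e1 @ e2, e2 @ e2
    dp1, dp2 = ep @ e1, ep @ e2
    denom = d11 * d22 - d12 * d12
    beta = (d22 * dp1 - d12 * dp2) / denom     # weight of v2
    gamma = (d11 * dp2 - d12 * dp1) / denom    # weight of v3
    alpha = 1.0 - beta - gamma                 # weight of v1, enforcing the sum-to-1 condition
    return np.array([alpha, beta, gamma])
```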

Figure 11: A cropped 2D facial image with its projected 3D face triangle mesh in black. The barycentric coordinates of a pixel (green) in the 2D facial image are calculated from the three projected vertices of its triangle.