1 Introduction
3D face reconstruction [29, 13, 3, 12, 42]
has drawn much attention recently in the computer vision and computer graphics communities, due to the increasing demand from many applications, such as virtual glasses try-on and makeup in AR, video editing and animation. Most 3D face reconstruction methods
[41, 13, 3, 12, 16] employ orthogonal projection [9, 15] to approximate the real-world perspective projection, which works well when the size of the face is small compared to the distance from the camera (roughly 1/20 of the camera distance). However, face capture scenarios have become more complicated with the popularity of selfies, virtual glasses try-on, makeup, etc. When the subject is very close to the camera, the rendered 2D faces under orthogonal projection are merely zoomed in, while under perspective projection there is obvious distortion in the rendered 2D faces, especially at very close distances, as shown in Fig. 1(a) and (b). Under the approximation of orthogonal projection, the distortion caused by perspective projection has to be explained by shape changes, leading to two significant problems: 1) This distortion is not modeled in the shape models, and the distorted faces often fall outside the shape space, which leads to unstable temporal fitting. 2) When the subject moves along the camera axis, orthogonal-projection-based methods predict different face shapes at different distances due to the distortion, while perspective-projection methods provide a consistent shape across frames, since they explain the distortions with different 6DoF poses, as shown in Fig. 1(c). Although introducing the 6DoF pose clearly improves the accuracy and robustness of 3D face reconstruction, which benefits many VR/AR applications, 6DoF pose estimation for faces is still a challenging problem, since the perspective distortion must be captured from large variations of facial appearance in complicated environments. Besides, 6DoF pose estimation for other objects [26, 18, 33] usually assumes a predefined 3D shape, whereas we only have a face image as input. To address this problem, we propose a new approach to recover 3D facial geometry under perspective projection by estimating the 6DoF pose, i.e.
3D orientation and translation, simultaneously from a single RGB image (see Section 3). Specifically, 6DoF estimation depends on 3 subtasks: 1) reconstructing the 3D face shape in canonical space; 2) estimating the pixel-wise 2D-3D correspondence between the canonical 3D space and image space; and 3) calculating the 6DoF pose parameters with the Perspective-n-Point (PnP) algorithm [23]. In this paper, we propose a deep-learning-based Perspective Network (PerspNet) to achieve this goal in one forward pass. For 3D face shape reconstruction, an encoder-decoder architecture is designed to regress a UV position map [13] from image pixels, which records the position of a 3D face in canonical space. For the 2D-3D correspondence, we construct sophisticated point description features, including aligned-pixel features, 3D features and 2D position features, and map them to a 2D-3D correspondence matrix, where each row records the corresponding 3D points in canonical space of one pixel of the input image. With the 3D shape and the correspondence matrix, the 6DoF face pose can be robustly estimated by PnP. To realize 6DoF pose estimation of 3D faces, we need face images, 3D shapes and annotated camera parameters for each training sample. However, current datasets cannot satisfy this requirement. Most existing datasets provide low-quality 3D faces obtained by an optimization-based fitting algorithm in a weakly supervised manner, such as AFLW2000-3D [41]. Other datasets, such as BIWI [11], NoW [31] and FaceScape [37], lack either an exact 3D shape for each 2D image or pose variations. Therefore, we construct a large real 3D face dataset, called the ARKitFace dataset, with 902,724 2D facial images of 500 subjects covering diverse expressions, ages and poses. For each 2D facial image, a ground-truth 3D face mesh and 6DoF pose annotations are provided, as shown in Fig. 2. Compared with other 3D face datasets [41, 31, 11, 37], ARKitFace provides a much larger amount of data for single-image 3D face reconstruction under perspective projection.
To summarize, the main contributions of this work are:

We explore a new problem of 3D face reconstruction under perspective projection, which provides the exact 3D face shape and its location in 3D camera space.

We propose PerspNet to reconstruct the 3D face shape and the 2D-3D correspondence simultaneously, by which the 6DoF face pose can be estimated by PnP.

To enable the training and evaluation of PerspNet, we collect the ARKitFace dataset, a large-scale 3D dataset with ground-truth 3D face mesh and 6DoF pose annotations for each image.

Experimental results on the ARKitFace dataset and the public BIWI dataset show that our approach outperforms state-of-the-art methods. The code and all data will be released to the public under the authorization of all the subjects (code and data will be at www.tobereleased.com).
2 Related Work
2.1 3D Face Reconstruction
3D face reconstruction from a single RGB image is essentially an ill-posed problem. Most methods tackle it by estimating the parameters of a statistical face model [16, 3, 41, 13, 12, 29, 42]. Although these methods achieve remarkable results, they suffer from the same fundamental limitation: an orthogonal or weak-perspective camera model is used when reconstructing the 3D face shape, so the deformation caused by perspective projection, especially at near distances, has to be compensated by the face shape. We therefore follow the rule of perspective projection and believe there is still substantial room for improvement on the task of 3D face reconstruction.
2.2 Head/Face Pose Estimation
Due to the prevalence of deep learning, great progress has been achieved in head pose estimation. The main idea is to regress the Euler angles of the head pose directly with deep CNNs. QuatNet [20] addresses the non-stationary property of head pose and trains a quaternion regressor to avoid the ambiguity of Euler angles. FSA-Net [38] adopts fine-grained structure mapping for spatial feature grouping to improve head pose estimation. EVA-GCN [36] views head pose estimation as a graph regression problem and leverages Graph Convolutional Networks to model the complex nonlinear mappings between graph topologies and head pose angles. All of the above methods assume orthogonal projection, so only 3DoF (3D rotation) is predicted and the error caused by perspective deformation cannot be avoided.
More recently, Chang et al. [8] directly regress the 6DoF parameters of human faces under perspective projection from a face photo. Albiero et al. [1] propose a Faster R-CNN [28]-like framework named img2pose to predict the full 6DoF of human faces under perspective projection. First, the original full image is fed into the RPN module and all faces in the image are detected. Then the local 6DoF of each detected face is predicted by the ROI head. Finally, the local 6DoF is converted to a global 6DoF using the facial bounding box and the camera intrinsic matrix. Since img2pose ignores the influence of perspective deformation, the 6DoF error is amplified during the local-to-global conversion. In addition, the 6DoF pose annotations in their training data are computed from five predicted landmarks and a mean 3D mesh, rather than being real 6DoF poses.
2.3 3D Face Datasets
Although a large number of facial images are available, the corresponding 3D annotations are expensive and difficult to obtain. For 3D face shape reconstruction, several 3D datasets have been built, such as Bosphorus [32], BFM [25], FaceWarehouse [6], MICC [2], 3DFAW [27], BP4D [39] and the NoW dataset [31]. These datasets either contain limited data or provide no face pose annotation.
To obtain the head/face 6DoF pose, some datasets synthesize 3D ground truth, i.e., the parameters of a statistical face model, by an optimization-based fitting algorithm in a weakly supervised manner, such as 300W-LP and AFLW2000-3D [41]. The synthesized 3D ground truth is coarse because only the reconstruction error of 2D sparse facial landmarks is considered. Despite the expense of 3D annotations, researchers have collected several 3D face datasets using professional imaging devices for the sake of high precision on 3D face reconstruction and head pose estimation. For example, the BIWI dataset [11] is captured by a Kinect sensor with global rotation and translation relative to the RGB camera. However, the number of individuals is limited, and only facial images with neutral expression are recorded. Most importantly, BIWI is too small for learning deep networks. FaceScape [37] is a large-scale dataset recorded by a multi-view camera system in an extremely constrained environment. It contains sufficient individuals and multiple facial expressions. However, the head pose is fixed by a limited number of camera locations and the lighting is constant; models trained on such a dataset do not generalize well to real in-the-wild scenarios. Different from these datasets, we aim to collect a 3D dataset with 3D meshes and 6DoF pose annotations under varying conditions of expression, age and 6DoF pose.
3 Proposed Method
In this paper, we propose a novel framework for 3D face reconstruction under perspective projection from a single 2D face image. Previous single-image 3D face reconstruction methods [41, 13, 3, 12, 16] adopt a scaled orthographic camera model to project the 3D face shape into image space. We denote the 3D face shape as $S \in \mathbb{R}^{n \times 3}$, representing the $n$ 3D vertices (points) on the surface of the 3D shape in the world coordinate system (canonical space). This projection process is usually formulated as
$$V_{2d} = s \cdot \Pi \cdot S^{\top} + t_{2d}, \qquad (1)$$
where $V_{2d}$ denotes the projected 2D coordinates of $S$ in the 2D image, $\Pi = \left[\begin{smallmatrix}1 & 0 & 0\\ 0 & 1 & 0\end{smallmatrix}\right]$ is the orthographic 3D-2D projection matrix, $s$ is an isotropic scale and $t_{2d}$ denotes a 2D translation. Different from orthogonal projection, in perspective projection the 3D face shape is first transformed from the world coordinate system to the camera coordinate system using the 6DoF face pose $(R, t_{3d})$,
$$S_c = R \cdot S^{\top} + t_{3d}, \qquad (2)$$
with known intrinsic camera parameters $K$, where $R \in SO(3)$, $t_{3d} \in \mathbb{R}^{3}$ and $S_c$ represent the 3D rotation, the 3D translation and the 3D face vertices in the camera coordinate system, respectively. Each vertex $p^c = (x^c, y^c, z^c)^{\top}$ of $S_c$ is then projected to image space by $v = \frac{1}{z^c} K p^c$, where $z^c$ represents the distance from the vertex to the camera. This illustrates that the per-vertex distance $z^c$ is the main source of the difference between the two projections.
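The gap between the two camera models can be seen in a few lines of code. The following sketch (with illustrative intrinsics and toy point values, not taken from the paper) projects the same shape at a near and a far distance; only the perspective model changes the relative layout of the points:

```python
import numpy as np

def orthographic(S, s, t2d):
    """Scaled orthographic projection (Eq. 1): drop depth, scale, translate."""
    return s * S[:, :2] + t2d

def perspective(S, R, t, K):
    """Rigid transform to camera space (Eq. 2), then pinhole projection."""
    Sc = S @ R.T + t                  # (n, 3) vertices in camera coordinates
    uv = Sc @ K.T                     # homogeneous image coordinates
    return uv[:, :2] / uv[:, 2:3]     # divide by the per-vertex depth z

# Toy face-like points (meters) and intrinsics for a 640x480 image.
S = np.array([[0.0, 0.0, 0.0], [0.05, 0.0, 0.02], [-0.05, 0.0, 0.02]])
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R = np.eye(3)

near = perspective(S, R, np.array([0.0, 0.0, 0.3]), K)   # 0.3 m from camera
far = perspective(S, R, np.array([0.0, 0.0, 3.0]), K)    # 3 m from camera
# At 3 m the result is close to a uniform zoom of the shape, while at 0.3 m
# the per-vertex depth differences visibly distort the relative point layout.
```

The orthographic model can only rescale the layout, so any depth-induced distortion has to be absorbed into the shape itself, which is exactly the failure mode discussed above.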
In this work, we focus on 3D face reconstruction under perspective projection, especially at very near distances. Given an RGB image $I$, the goal is to recover the 3D face shape $S$ and estimate its 6DoF pose $(R, t_{3d})$. This is achieved by a framework based on a new deep learning network, PerspNet, as shown in Fig. 3. The proposed method contains two subtasks: 3D face shape reconstruction and 6DoF face pose estimation. For 3D face shape reconstruction, we design a UV position map [13] regression method. For 6DoF face pose estimation, we use a two-stage pipeline [26, 18, 33]: we first choose 2D pixels in the image and learn their corresponding 3D points in the reconstructed 3D face shape, and then the 6DoF face pose parameters are calculated by a PnP [23] algorithm.
3.1 Perspective Network (PerspNet)
Specifically, PerspNet consists of four modules: feature extraction, 3D face shape reconstruction, 2D-3D correspondence learning and 2D face segmentation. Given a 2D facial image, an encoder and two decoders are trained to extract 3D features and 2D image features, respectively. The 3D features are fed into the 3D face shape reconstruction module to regress the UV position map, which represents the 3D face shape in canonical space. The 2D features, which include 2D facial image features and encoded 2D position features, are then fused with the 3D features to learn the correspondence between 2D pixels in the image and 3D points of the 3D face shape. With this correspondence, the 6DoF pose of the face can be computed. In addition, the 2D image features are also fed into the 2D face segmentation module, which is used to extract the observed 2D pixels in face regions during testing.
3D Face Shape Reconstruction. Different from the UV formulation in [13], which is defined in the image coordinate system, our UV position map records the 3D coordinates of the facial structure in the canonical pose, and thus represents only the facial shape. Specifically,
$$S_{uv} = \mathcal{R}(S, uv, \mathcal{T}), \qquad (3)$$
where $S_{uv}$ is the rendered UV position map, $uv$ represents the UV coordinates recording the 2D locations of the 3D vertices in the UV map, $\mathcal{T}$ denotes the triangles of the 3D face mesh, and $\mathcal{R}(\cdot)$ is the rendering operation. A fully convolutional encoder-decoder architecture is utilized to regress the UV position map, as shown in Fig. 3. To supervise the 3D face shape prediction, a weighted L1 loss
$$L_{uv} = \left\| \left( S_{uv}^{*} - \hat{S}_{uv} \right) \odot W \right\|_{1}, \qquad (4)$$
is used to measure the difference between the ground-truth position map $S_{uv}^{*}$ and the network output $\hat{S}_{uv}$, where $W$ is a weight matrix for $S_{uv}$ whose entries are positive on face texels and 0 elsewhere. We then extract the 3D vertices of the face from the UV position map using the UV coordinates.
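As a concrete illustration of the weighted L1 objective in Eq. (4), a minimal PyTorch sketch is given below; the map resolution, batch size and the exact weighting scheme are placeholders rather than the paper's settings:

```python
import torch

def weighted_l1(pred_uv, gt_uv, weight):
    """Eq. (4): L1 distance between UV position maps, weighted per texel.

    pred_uv, gt_uv: (B, 3, H, W) maps storing a canonical xyz per UV texel.
    weight:         (1, 1, H, W) weights, positive on face texels, 0 elsewhere.
    """
    return (torch.abs(pred_uv - gt_uv) * weight).sum() / weight.sum().clamp(min=1.0)

# Hypothetical resolution and mask for illustration.
B, H, W = 2, 64, 64
pred = torch.randn(B, 3, H, W)
gt = torch.randn(B, 3, H, W)
w = torch.zeros(1, 1, H, W)
w[..., 16:48, 16:48] = 1.0     # pretend the central texels are the face region
loss = weighted_l1(pred, gt, w)
```

Because $W$ is zero outside the face region, texels that do not correspond to the face contribute nothing to the gradient.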
2D-3D Correspondence Learning. To estimate the 6DoF pose of the face in 2D images, we use a two-stage pipeline that first learns the correspondence between 2D pixels in the image and 3D points in the reconstructed 3D face shape, and then computes the 6DoF face pose parameters with a PnP algorithm. We design a 2D-3D correspondence learning module in PerspNet for the first stage, as shown in Fig. 3. We build a correspondence probability matrix $M \in \mathbb{R}^{n \times m}$, where each row records the corresponding 3D points in canonical space of one pixel of the input image. Here $n$ is the number of face-region pixels selected from the 2D image and $m$ is the number of vertices of the 3D face shape. $M$ is estimated as
$$M = \mathrm{softmax}\big(g(F_{2d}, F_{3d})\big), \qquad (5)$$
where $F_{2d}$ and $F_{3d}$ are the 2D and 3D features described below, $g(\cdot)$ denotes MLP layers applied to the fused features, and the softmax is applied row-wise.
To learn the correspondence between 2D and 3D points, we extract 2D features $F_{2d}$ and 3D features $F_{3d}$, respectively.
For the 2D features $F_{2d}$, we first extract image features from another fully convolutional decoder following the encoder, applied to the same facial image. For each selected pixel, we also extract 2D position features and fuse them with the image features as its local feature. The position features are encoded by a 2D position encoding, which extends the 1D position encoding of [34]. 2D global features are also learned by feeding the 2D local features into Multi-Layer Perceptron (MLP) layers followed by a global average pooling layer. The 2D local and global features are then fused as the 2D features $F_{2d}$. As for the 3D features $F_{3d}$, since the UV position map contains 3D geometry information, we extract 3D global features from the UV position regression network, followed by MLP layers and a global average pooling layer. The 2D features and 3D features are then fused and fed into MLP layers and a softmax layer; in this way, the correspondence matrix $M$ is predicted by the network. To obtain the ground truth $M^{*}$, for each pixel in the 2D face region we compute its barycentric coordinates [35] with respect to the three vertices of the triangle it falls in. The barycentric coordinates are taken as the correspondence probabilities between this 2D pixel and the three vertices of the 3D face mesh; the other values in $M^{*}$ are set to 0. Each row of the matrix represents the distribution over correspondences between the $i$-th selected pixel and the mesh vertices. We utilize a Kullback-Leibler (KL) divergence loss for the correspondence matrix $M$. In addition, since the matrix is very sparse, we also minimize its entropy to regularize it. The final loss for $M$ is
$$L_{mat} = \mathrm{KL}\left(M^{*} \,\|\, M\right) + \lambda\, E(M), \qquad (6)$$
where $M^{*}$ and $M$ are the ground-truth and predicted correspondence matrices, $E(\cdot)$ denotes the entropy, and $\lambda$ is a constant weight. With the predicted matrix $M$ and the reconstructed 3D face shape $S$, the corresponding 3D point for each 2D pixel is obtained by matrix multiplication, and an L1 loss is also applied to supervise it:
$$L_{pt} = \left\| M S - P^{*} \right\|_{1}, \qquad (7)$$
where the ground truth $P^{*}$ is obtained by perspective projection.
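Assuming Eq. (5) is a row-wise softmax over 2D-3D matching scores, the losses of Eqs. (6) and (7) can be sketched as follows in PyTorch; the score shapes, λ and the toy inputs are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def correspondence_losses(scores, M_gt, shape_3d, pts_gt, lam=0.01):
    """Losses for the 2D-3D correspondence matrix.

    scores:   (n, m) raw matching logits for n pixels vs m mesh vertices.
    M_gt:     (n, m) ground-truth barycentric matrix (rows sum to 1, sparse).
    shape_3d: (m, 3) reconstructed canonical shape.
    pts_gt:   (n, 3) ground-truth 3D point per pixel.
    """
    log_M = F.log_softmax(scores, dim=1)          # Eq. (5): row-wise softmax
    M = log_M.exp()
    # Eq. (6): KL(M_gt || M) plus an entropy term that pushes M toward sparsity.
    kl = (M_gt * ((M_gt + 1e-12).log() - log_M)).sum(dim=1).mean()
    entropy = -(M * log_M).sum(dim=1).mean()
    l_mat = kl + lam * entropy
    # Eq. (7): soft 3D point per pixel via matrix multiplication, L1-supervised.
    pts = M @ shape_3d
    l_pt = (pts - pts_gt).abs().mean()
    return l_mat, l_pt

# Tiny example: near-one-hot scores recover the matched vertices exactly.
n, m = 4, 6
scores = torch.zeros(n, m)
scores[torch.arange(n), torch.arange(n)] = 50.0
M_gt = torch.zeros(n, m)
M_gt[torch.arange(n), torch.arange(n)] = 1.0
shape_3d = torch.randn(m, 3)
l_mat, l_pt = correspondence_losses(scores, M_gt, shape_3d, shape_3d[:n])
```

When the predicted rows concentrate on the correct vertices, both loss terms vanish, and `M @ shape_3d` reduces to a soft lookup of each pixel's matched 3D point.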
2D Face Segmentation. The 2D face segmentation module in the proposed network is trained to select 2D pixels from face regions in the inference phase. When training the whole network, the image pixels are randomly chosen from the ground-truth face segmentation mask, while in the testing phase they are randomly chosen from the predicted face mask. The 2D face segmentation head follows the 2D image feature extraction network, and a 2-class softmax loss $L_{seg}$ is used.
Training Objective. We train the whole network with a multi-task loss. The final loss function is
$$L = \lambda_{1} L_{uv} + \lambda_{2} L_{mat} + \lambda_{3} L_{pt} + \lambda_{4} L_{seg}, \qquad (8)$$
where $L_{uv}$, $L_{mat}$, $L_{pt}$ and $L_{seg}$ are the UV position map loss (Eq. 4), the correspondence matrix loss (Eq. 6), the 3D point loss (Eq. 7) and the segmentation loss, and $\lambda_{1}, \ldots, \lambda_{4}$ are the weights of the four losses, respectively. Experimental results reveal that jointly training these tasks boosts the performance of each task.
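The 2D position encoding used for the correspondence features extends the standard 1D sinusoidal encoding of [34]; one plausible construction, assuming x and y are encoded separately and concatenated (the split is our assumption, not specified above), is:

```python
import numpy as np

def sincos_1d(pos, d):
    """Standard 1D sinusoidal encoding: d/2 sine and d/2 cosine channels."""
    i = np.arange(d // 2)
    freq = 1.0 / (10000.0 ** (2.0 * i / d))
    ang = np.outer(pos, freq)                       # (n, d/2) angles
    return np.concatenate([np.sin(ang), np.cos(ang)], axis=1)

def posenc_2d(xy, d):
    """Encode x and y with d/2 channels each, then concatenate -> (n, d)."""
    half = d // 2
    return np.concatenate([sincos_1d(xy[:, 0], half),
                           sincos_1d(xy[:, 1], half)], axis=1)

xy = np.array([[10.0, 20.0], [128.0, 64.0]])   # pixel coordinates
pe = posenc_2d(xy, 64)                         # one 64-d feature per pixel
```

Such features give each sampled pixel a smooth, translation-sensitive descriptor of where it sits in the image, which the correspondence module can fuse with appearance features.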
3.2 6DoF Face Pose Estimation
Based on the outputs of the network, the final 6DoF pose can be computed. Given the chosen 2D pixel coordinates in the original full 2D image, their corresponding 3D point coordinates from the reconstructed face in world coordinates and the camera intrinsic parameters $K$, we apply a PnP algorithm [23] with Random Sample Consensus (RANSAC) [14] to compute the 6DoF face pose parameters $(R, t_{3d})$. Perspective-n-Point is the problem of estimating the pose of a calibrated camera given a set of 3D points in the world and their corresponding 2D projections in the image. The camera pose consists of 6DoF, made up of the rotation (roll, pitch and yaw) and the 3D translation of the camera with respect to the world. With the estimated 6DoF face pose, the reconstructed 3D face shape can be projected onto the 2D image. It is worth noting that directly regressing the 6DoF pose parameters from a single image with a CNN is also feasible, but it achieves much worse performance than our method due to the nonlinearity of the rotation space [26], as validated in our experiments.
Dataset  Sub. Num  Image Num  3D Mesh Num  Exp. Num  camera  Vert. Num  6DoF Pose 
Bosphorus[32]  105  4,666  4,666  35  Mega  35K  No 
MICC[2]  53  53  203  5  3DMD  40K  No 
3DFAW[27]  26  26  26  Neutral  DI3D  20K  No 
BP4D[39]  41  328  328  8  3DMD  70k  No 
AFLW20003D[41]  2,000  2,000  3DMM      53,149  No 
BIWI[11]  20  15,678  24  Neutral  Kinect  6,918  Yes 
NoW[31]  100  2,054  100  Neutral  iPhone X  58,668  No 
FaceScape[37]  938  1,275,680  18,760  20  DSLR  2M  fixed pose 
ARKitFace  500  902,724  902,724  33  iPhone 11  1,220  Yes 
4 ARKitFace Dataset
The ARKitFace dataset is established in this work in order to train and evaluate both 3D face shape reconstruction and 6DoF pose estimation in the setting of perspective projection. A total of 500 volunteers, aged 9 to 60, are invited to record the dataset. They sit in a random environment, and the 3D acquisition equipment is fixed in front of them at a distance ranging from about 0.3 m to 0.9 m. Each subject is asked to perform 33 specific expressions with two head movements (from looking left to looking right / from looking up to looking down). The 3D acquisition equipment is an iPhone 11: the shape and location of the face are tracked by its structured-light sensor, and the triangle mesh and 6DoF pose for each RGB image are obtained by the built-in ARKit toolbox. The triangle mesh is made up of 1,220 vertices and 2,304 triangles. In total, 902,724 2D facial images (captured at one of two resolutions) with ground-truth 3D mesh and 6DoF pose annotations are collected. An example is shown in Fig. 2. Distributions of age, gender and each of the 6DoF pose parameters on the ARKitFace dataset are shown in Fig. 4. We can observe that our dataset has balanced gender, diverse ages and varied 6DoF poses. The comparison between datasets in Table 1 reveals that ARKitFace surpasses the existing datasets in terms of scale, exact 3D shape annotations and diversity of poses.
Authorization: All 500 subjects consent to the use of their data. We will release the 2D facial images, 3D meshes and 6DoF pose annotations of all subjects under their authorization. We will not release their personal information, including age, gender, etc.
5 Experiments
5.1 Implementation Details
Our PerspNet is implemented in PyTorch [24]. During training, PerspNet takes as input the facial region cropped from the full 2D image, based on the ground-truth face segmentation mask, and resized to a fixed resolution. To augment the data with large poses, we utilize the face profiling method [40] to generate profile views of faces from medium-pose samples, enlarging the range of all three Euler angles. We also apply online data augmentation, including random cropping, resizing and color jittering, during training. We use a pretrained ResNet-18 [17] architecture as the encoder. To regress the UV position map, the first decoder is implemented with 5 upsampling layers, and its output feature map is regarded as the 3D features. The number of vertices in the 3D face shape is 1,220. To extract 2D image features and segment the 2D face region from the background, the second decoder consists of five upsampling layers, and each upsampled feature map is concatenated with the feature map of the same size from the encoder backbone. The number of randomly sampled 2D pixels in the face region is 1,024; if the face region contains fewer pixels, we sample with repetition. The 2D and 3D global features have the same size. At the training phase, the 2D pixels are randomly chosen from the ground-truth face mask, and the loss weights are set as constants. The initial learning rate is set to 0.0001 and is decayed linearly after 10 epochs. We train all models for 20 epochs. All networks are trained and evaluated on the ARKitFace dataset with ground-truth bounding boxes. At the testing phase, the 2D pixels are randomly chosen from the segmented face region. The PnP algorithm is implemented in OpenCV
[4].
5.2 Dataset
To validate our proposed method, we conducted experiments on our collected ARKitFace dataset and a public BIWI dataset.
ARKitFace. We randomly select 400 subjects from our dataset as training data, with a total of 717,840 2D facial images and annotations, leaving the other 100 subjects, with 184,884 samples in total, for testing.
BIWI. BIWI [11] contains 24 videos of 20 subjects in an indoor environment. Each video is coupled with a neutral 3D face mesh of a specific person. There are 15,678 frames in total, with a wide range of face poses. Paired RGB and depth images are provided for each frame; we only use the RGB images as input in this work. This benchmark provides ground-truth labels for the full 6DoF: rotation (as a rotation matrix) and translation. Since a per-frame 3D face mesh is not available, we cannot evaluate our method on the 3D face reconstruction task and only evaluate 6DoF pose estimation. In addition, we train our method on the training set of the ARKitFace dataset and test on the entire BIWI dataset, following previous methods [20, 38, 1].
5.3 Evaluation Metric
For 3D face shape reconstruction, we follow previous works [12, 31] and use the median and average distances between predicted and ground-truth 3D mesh vertices. For 6DoF face pose estimation, we follow previous head/face pose estimation methods [1, 38, 20]: we convert the rotation matrices to the three Euler angles (yaw, pitch and roll) and compute the mean absolute error (MAE) over the 6DoF; the rotational MAE ($MAE_r$) and translational MAE ($MAE_t$) are also computed. Furthermore, to validate the 6DoF face pose in a single metric, we adopt the average 3D distance (ADD) metric [19] used for object pose evaluation. Given the ground-truth rotation $R^{*}$ and translation $t^{*}$ and the estimated rotation $\hat{R}$ and translation $\hat{t}$, ADD computes the mean of the pairwise distances between the ground-truth 3D face model points $x_i$ transformed by the ground-truth pose and by the estimated pose: $\mathrm{ADD} = \frac{1}{m} \sum_{i=1}^{m} \left\| (R^{*} x_i + t^{*}) - (\hat{R} x_i + \hat{t}) \right\|$.
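The ADD metric takes only a few lines to compute; a sketch:

```python
import numpy as np

def add_metric(pts, R_gt, t_gt, R_est, t_est):
    """Average 3D distance (ADD): mean distance between the model points
    transformed by the ground-truth pose and by the estimated pose."""
    gt = pts @ R_gt.T + t_gt
    est = pts @ R_est.T + t_est
    return np.linalg.norm(gt - est, axis=1).mean()

# A pose error that is a pure 3 mm shift yields ADD = 3 mm (units follow pts).
pts = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0]])
err = add_metric(pts, np.eye(3), np.zeros(3), np.eye(3), np.array([0.003, 0.0, 0.0]))
```

Unlike per-angle errors, ADD couples rotation and translation into a single, geometrically meaningful number.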
Method  Yaw  Pitch  Roll  MAE_r  t_x  t_y  t_z  MAE_t  ADD

img2pose(retrain) [1]  5.07  7.32  4.25  5.55  1.39  3.72  15.95  7.02  20.54 
Direct 6DoF Regress  1.86  2.72  1.03  1.87  2.80  5.23  19.16  9.06  21.39 
PerspNet w/o PE  1.01  1.53  0.61  1.05  1.17  2.39  11.77  5.11  12.34 
PerspNet w/o  1.04  1.45  0.60  1.03  1.09  2.13  10.28  4.50  10.89 
PerspNet (ours)  0.99  1.43  0.55  0.99  0.97  2.12  9.45  4.18  10.01 
Method  Yaw  Pitch  Roll  MAE_r  t_x  t_y  t_z  MAE_t  ADD

Dlib (68 points)[22]  16.76  13.80  6.19  12.25           
3DDFA[40]  36.18  12.25  8.78  19.07           
FAN (12 points)[5]  8.53  7.48  7.63  7.88           
Hopenet ()[30]  4.81  6.61  3.27  4.90           
QuatNet[20]  4.01  5.49  2.94  4.15           
FSANET[38]  4.56  5.21  3.07  4.28           
HPE[21]  4.57  5.18  3.12  4.29           
TriNet[7]  3.05  4.76  4.11  3.97           
RetinaFace R50(5 points)[10]  4.07  6.42  2.97  4.49           
img2pose[1]  4.57  3.55  3.24  3.79           
Direct 6DoF Regress  16.49  14.03  5.81  12.11  62.36  85.01  366.52  171.30  562.38 
PerspNet w/o PE  3.63  3.81  3.48  3.64  6.03  9.11  77.87  31.00  142.20 
PerspNet w/o  3.67  3.52  3.26  3.48  5.57  8.53  75.23  29.78  136.16 
PerspNet (ours)  3.10  3.37  2.38  2.95  4.15  6.43  46.69  19.09  100.09 
5.4 Evaluation for 6DoF Face Pose Estimation
Comparison with the state-of-the-art methods. To compare with the state-of-the-art methods on head/face pose estimation, we first retrain the most recent state-of-the-art method, img2pose [1], on the ARKitFace training set and evaluate it on the test data. As shown in Table 2, our method outperforms it significantly. In addition, the public BIWI dataset is used for cross-dataset evaluation of our final network. Since some faces with large angles cannot be detected, we follow img2pose [1] and test on the 13,219 images with a detected bounding box. We use the code of [5] to detect 68 facial landmarks and crop the facial region. All results are shown in Table 3. Our method performs much better than previous methods, especially the recent img2pose [1]. This experiment also demonstrates that our method and dataset generalize well to data from a different domain. The inference time of our proposed model is 11.7 ms on a P100 GPU.
Ablation Studies. We build several baselines to evaluate the components that contribute to our performance. Since img2pose [1] directly regresses the 6 pose parameters, we build a baseline, Direct 6DoF Regress, that regresses the 6 pose parameters directly after the backbone network. Other baselines include our PerspNet without the position encoding features (PerspNet w/o PE), which validates the effectiveness of the 2D position encoding features, and our PerspNet without the corresponding loss term (PerspNet w/o in Tables 2 and 3), which evaluates that component. The results are shown in Table 2 and Table 3. We observe that our two-stage method outperforms the direct regression baseline significantly, and that the 2D position encoding features and the ablated loss are both effective for the face pose estimation task. Moreover, to study the influence of the segmentation mask, we also use the ground-truth face mask at inference time; the corresponding results are 0.72, 1.10, 0.54, 0.79, 0.92, 1.47, 9.59, 3.99 and 9.99, respectively. The method with the ground-truth face mask performs better than that with the predicted segmentation mask, which indicates that more accurate segmentation results would further help.
5.5 Evaluation on 3D Face Shape Reconstruction
To validate the proposed method on the 3D face shape reconstruction task, we report results on the ARKitFace test data in Table 4. For comparison, we also train a single-task PRNet [13], with the same encoder-decoder UV regression network as in our multi-task network, on the ARKitFace training data. As shown in Table 4, our multi-task network outperforms the single-task PRNet, which reveals that the pose estimation task contributes to the 3D face shape reconstruction task. In addition, we compare our method with other state-of-the-art methods such as 3DDFA_v2 [16] in Table 4. Our method still achieves the best 3D face reconstruction performance.
5.6 Qualitative Results
We display qualitative results for 6DoF pose estimation and 3D reconstruction on the ARKitFace and BIWI datasets in Fig. 5 and Fig. 6(a), comparing img2pose, our predicted results and the ground truth. The predicted face pose with the GT 3D mesh, the predicted face pose with the predicted 3D mesh and their error maps are shown, respectively. Our results outperform img2pose, especially for large poses. We also show qualitative results on in-the-wild images from the WIDER FACE dataset in Fig. 6(b); our method also performs well on in-the-wild images.
6 Conclusion
We explore 3D face reconstruction under perspective projection from a single RGB image for 3D face AR applications. We introduce a novel framework built on a deep learning network, PerspNet, which performs 3D face shape reconstruction, correspondence learning between 2D pixels and 3D points of the 3D face model, and 2D face region segmentation. With 2D pixels in facial images and their corresponding 3D points in the reconstructed 3D face mesh, the 6DoF face pose is estimated by a PnP method and used for the perspective projection transformation. To enable our PerspNet, we build a large-scale 3D face dataset, the ARKitFace dataset, with annotated 2D facial images, 3D face meshes and 6DoF poses. Experiments demonstrate the effectiveness of our approach for 3D face shape reconstruction and 6DoF pose estimation. Like most face analysis methods, our method and data may raise privacy concerns when misused; therefore, the release of the data is fully authorized by the subjects. We hope this work will spur future research on 3D face reconstruction and face pose estimation.
References

[1]
(2021)
Img2pose: face alignment and detection via 6dof, face pose estimation.
In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
, pp. 7617–7627. Cited by: §0.A.1, §2.2, §5.2, §5.3, §5.4, §5.4, Table 2, Table 3.  [2] (2011) The florence 2d/3d hybrid face dataset. In Proceedings of the 2011 joint ACM workshop on Human Gesture and Behavior Understanding, pp. 79–80. Cited by: §2.3, Table 1.
 [3] (2021) Riggable 3d face reconstruction via innetwork optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6216–6225. Cited by: §1, §2.1, §3.
 [4] (2000) OpenCV. Dr. Dobb’s journal of software tools 3. Cited by: §5.1.
 [5] (2017) How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In Proceedings of the International Conference on Computer Vision, Cited by: §5.4, Table 3.
 [6] (2013) Facewarehouse: a 3d facial expression database for visual computing. IEEE Transactions on Visualization and Computer Graphics 20 (3), pp. 413–425. Cited by: §2.3.

[7]
(2021)
A vectorbased representation to enhance head pose estimation
. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, pp. 1188–1197. Cited by: Table 3.  [8] (2019) Deep, landmarkfree fame: face alignment, modeling, and expression estimation. International Journal of Computer Vision 127 (6), pp. 930–956. Cited by: §2.2.
 [9] (2021) The geometry of perspective projection. https://www.cse.unr.edu/~bebis/CS791E/Notes/PerspectiveProjection.pdf. Cited by: §1.
 [10] (2020) Retinaface: singleshot multilevel face localisation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5203–5212. Cited by: Table 3.
[11] (2013) Random forests for real-time 3D face analysis. International Journal of Computer Vision 101 (3), pp. 437–458.
[12] (2021) Learning an animatable detailed 3D face model from in-the-wild images. ACM Transactions on Graphics 40 (4), pp. 1–13.
[13] (2018) Joint 3D face reconstruction and dense alignment with position map regression network. In Proceedings of the European Conference on Computer Vision, pp. 534–551.
[14] (1981) Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM 24 (6), pp. 381–395.
[15] (2011) Computer Vision: A Modern Approach. Prentice Hall.
[16] (2020) Towards fast, accurate and stable 3D dense face alignment. In Proceedings of the European Conference on Computer Vision, pp. 152–168.
[17] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[18] (2020) PVN3D: a deep point-wise 3D keypoints voting network for 6DoF pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11632–11641.
[19] (2012) Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In Proceedings of the Asian Conference on Computer Vision, pp. 548–562.
[20] (2018) QuatNet: quaternion-based head pose estimation with multi-regression loss. IEEE Transactions on Multimedia 21 (4), pp. 1035–1046.
[21] (2020) Improving head pose estimation using two-stage ensembles with top-k regression. Image and Vision Computing 93, pp. 103827.
[22] (2014) One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1867–1874.
[23] (2009) EPnP: an accurate O(n) solution to the PnP problem. International Journal of Computer Vision 81 (2), pp. 155.
[24] (2019) PyTorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems 32, pp. 8026–8037.
[25] (2009) A 3D face model for pose and illumination invariant face recognition. In IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 296–301.
[26] (2019) PVNet: pixel-wise voting network for 6DoF pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4561–4570.
[27] (2019) The 2nd 3D face alignment in the wild challenge (3DFAW-Video): dense reconstruction from video. In ICCV Workshops, pp. 3082–3089.
[28] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems 28, pp. 91–99.
[29] (2005) Estimating 3D shape and texture using pixel intensity, edges, specular highlights, texture constraints and a prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 986–993.
[30] (2018) Fine-grained head pose estimation without keypoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 2074–2083.
[31] (2019) Learning to regress 3D face shape and expression from an image without 3D supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7763–7772.
[32] (2008) Bosphorus database for 3D face analysis. In European Workshop on Biometrics and Identity Management, pp. 47–56.
[33] (2020) Shape prior deformation for categorical 6D object pose and size estimation. In Proceedings of the European Conference on Computer Vision, pp. 530–546.
[34] (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
[35] (2003) Barycentric coordinates. https://mathworld.wolfram.com/.
[36] (2021) EVA-GCN: head pose estimation based on graph convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1462–1471.
[37] (2020) FaceScape: a large-scale high quality 3D face dataset and detailed riggable 3D face prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[38] (2019) FSA-Net: learning fine-grained structure aggregation for head pose estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1087–1096.
[39] (2014) BP4D-Spontaneous: a high-resolution spontaneous 3D dynamic facial expression database. Image and Vision Computing 32 (10), pp. 692–706.
[40] (2016) Face alignment across large poses: a 3D solution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 146–155.
[41] (2017) Face alignment in full pose range: a 3D total solution. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (1), pp. 78–92.
[42] (2020) Beyond 3DMM space: towards fine-grained 3D face reconstruction. In Proceedings of the European Conference on Computer Vision, pp. 343–358.
Appendix 0.A Supplementary Materials
0.a.1 Additional Qualitative Results
Fig. 7, Fig. 8 and Fig. 9 present additional qualitative results. Fig. 8 shows results rendered under three settings: our pose with the ground-truth (GT) mesh, our pose with our mesh, and the GT pose with the GT mesh. "Our pose and GT mesh" means the projected faces in silver are rendered with our predicted pose and the GT mesh; "our pose and our mesh" means they are rendered with our predicted pose and our predicted mesh; "GT pose and GT mesh" means they are rendered with the GT pose and the GT mesh. We also show the distance error map alongside each rendered face. Our predicted results are very similar to the GT faces, and the remaining errors mostly occur at large poses.
Fig. 7 compares results produced by img2pose [1], our pose with the GT mesh, our pose with our mesh, and the GT pose with the GT mesh. For img2pose, we render the results with their released code. Even when both the pose and the mesh are predicted by our method, our results are better aligned than those of img2pose. img2pose simulates the camera's field of view narrowing from the whole image to the local bounding box, and converts the global 6DoF pose to a local 6DoF pose by a specific linear transformation. Because this ignores the perspective distortion of the local facial appearance, the local 6DoF pose is ambiguous. In contrast, our proposed PerspNet selects 2D pixels from the full image and predicts their corresponding 3D vertices in canonical space; with the known camera intrinsic matrix, the final 6DoF pose can then be recovered more accurately by PnP. Hence our method achieves better performance. The displayed images also show that our method is robust across datasets, especially at large poses.
We also show some qualitative results on in-the-wild images from the WIDER FACE dataset in Fig. 9, rendered with our predicted pose and our predicted mesh. They show that our method also performs well on in-the-wild images.
0.a.2 Details of the ARKitFace Dataset
More samples from the ARKitFace dataset are shown in Fig. 10, with the projected 3D face rendered in silver. All the rendered 3D faces are projected by a perspective transformation using our pose annotations.
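The perspective transformation used for these visualizations can be sketched as below; the function and variable names are illustrative, and the intrinsics and pose stand in for the per-image annotations:

```python
import numpy as np

def project_vertices(vertices, R, t, K):
    """Project 3D face mesh vertices into the image with a perspective camera.

    vertices: (N, 3) mesh vertices in canonical space.
    R, t:     annotated 6DoF pose (3x3 rotation matrix, 3-vector translation).
    K:        (3, 3) camera intrinsic matrix.
    Returns (N, 2) pixel coordinates.
    """
    cam = vertices @ R.T + t       # rigid transform into camera coordinates
    uv = cam @ K.T                 # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]  # perspective divide by depth
```

Unlike an orthogonal projection, the division by depth in the last line is what produces the distance-dependent distortion discussed in the paper.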
0.a.3 Implementation Details of the 2D-3D Correspondence Matrix
To supervise the 2D-3D correspondence matrix, we compute its ground truth for each face based on barycentric coordinates [35]. In this work, the 3D face mesh is a triangle mesh consisting of 3D vertices and triangular faces, where each triangle has 3 vertices and 3 edges. When the 3D face mesh is projected onto the 2D image, each pixel in the 2D face region falls inside exactly one projected triangle, as shown in Fig. 11. The pixel's barycentric coordinates with respect to that triangle, which are non-negative and sum to 1, can be taken as the correspondence probabilities between the 2D pixel and the triangle's three vertices in the 3D face mesh.
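A minimal sketch of the ground-truth computation for one pixel follows; the helper name is hypothetical, and it uses the standard dot-product formulation of barycentric coordinates [35]:

```python
import numpy as np

def barycentric(p, a, b, c):
    """Barycentric coordinates of 2D point p w.r.t. triangle (a, b, c).

    Returns (w_a, w_b, w_c) with w_a + w_b + w_c = 1; all three weights are
    non-negative iff p lies inside the triangle. For a pixel inside a
    projected triangle with vertex indices (i, j, k), these weights fill the
    pixel's row of the ground-truth 2D-3D correspondence matrix at columns
    i, j, k.
    """
    v0, v1, v2 = b - a, c - a, p - a
    d00, d01, d11 = v0 @ v0, v0 @ v1, v1 @ v1
    d20, d21 = v2 @ v0, v2 @ v1
    denom = d00 * d11 - d01 * d01   # zero only for a degenerate triangle
    w_b = (d11 * d20 - d01 * d21) / denom
    w_c = (d00 * d21 - d01 * d20) / denom
    return 1.0 - w_b - w_c, w_b, w_c
```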