Facial landmark detection is widely used as a preliminary task in face-related computer vision applications Sun et al. (2013); Liu et al. (2017); Fabian Benitez-Quiroz et al. (2016); Li et al. (2017); Blanz and Vetter (2003); Dou et al. (2017); Kittler et al. (2016); Hu et al. (2017); Thies et al. (2016). Although several public datasets with labeled facial landmarks are available Sagonas et al. (2013); Köstinger et al. (2011); Tzimiropoulos (2015); Shen et al. (2015), marking the precise location of every landmark in large-scale image or video collections is very time consuming, which makes fully-supervised training of Deep Neural Network (DNN) based facial landmark detectors tedious and costly Dong and Yang (2019).
This situation has motivated researchers to adopt the semi-supervised learning paradigm, which makes use of both labeled and unlabeled data. Many of these methods operate by propagating labeling/supervision information to unlabeled data via pseudo-labels. For example, Jeong et al. presented a self-training method that predicts labels for the unlabeled data over many iterations, gradually increasing the number of utilized samples and retraining the model. Mao et al. (2009) and Blum and Mitchell (1998) presented multi-model collaboration frameworks that use multiple complementary models to obtain higher-quality pseudo-labels and to keep the optimizer from falling into a local optimum. In general, due to the noise in the unlabeled data, the performance of these approaches varies considerably depending on how the pseudo-labels are generated and utilized.
The teacher-student model has become an effective strategy for semi-supervised learning, where the output of a student model is enforced to be consistent with that of a teacher model for unlabeled input. By asynchronously updating the two models, the consistency constraint helps to mine high-quality pseudo-labels that can bring significant improvement in semi-supervised learning tasks.
This strategy inspired us to propose a facial landmark detection model based on asynchronous learning with a consistency constraint from multi-source supervision signals. The method consists of a teacher-student model in which a radical student is updated with raw multi-source signals and a calm teacher is updated with a more stable gradient. Specifically, the first set of facial landmark targets is predicted through a facial Motion Field Estimation (MFE) module. Second, we use a detection method to obtain the second set of facial landmark targets through Landmark Heatmap Regression (LHR). Third, another set of facial landmark targets is obtained through face center detection and Contour Landmark Offset Regression (CLOR). With the three sets of signals from different sources, the radical student uses all three types of signals to update its parameters, while the calm teacher only uses facial MFE and LHR. To allow the teacher model to accept part of the student’s suggestions, an exponential moving average strategy is additionally used to update the teacher’s parameters again, before the teacher model instructs the student model. In this way, the overall framework can utilize the consistency constraint between sources and models to achieve satisfactory facial landmark detection performance. Fig. 2 shows an overview of our asynchronous-learning framework. In summary, the major contributions of the paper include:
- Three improved supervision sources to train the teacher-student network model, including the facial MFE, the LHR and the CLOR;
- A teacher-student network model with asynchronous learning that effectively smooths the learning and obtains improved performance, as shown in our experiments.
2 Related Work
Fully-supervised Facial Landmark Detection. Fully supervised facial landmark detection can be categorized into two types according to the type of supervision signal: coordinate regression Xiong and De la Torre (2013); Cao et al. (2014) and landmark heatmap regression Wei et al. (2016); Dong et al. (2018); Newell et al. (2016). Zhou et al. leverage the idea of “object as point” to regress the anchor of an object. Based on this idea, we design the CLOR task and obtain another type of facial landmark supervision signal. These supervision signals from different sources are the objects our model operates on.
Semi-supervised Facial Landmark Detection
Semi-supervised facial landmark detection in video aims to improve performance over the entire video while using less annotated data. Most existing methods focus on how to mine pseudo-labels to expand the training set.
Dong et al. use a differentiable optical flow estimation method to obtain pseudo-labels for subsequent frames, which contributes a way to estimate the facial motion field, and complete the whole task through a two-stage training process that performs pseudo-label mining implicitly. When the detector is fine-tuned with tracking-based results in the second stage, the method tries to solve the problem raised by Khan et al. that detection results have no drift but low accuracy.
The fully supervised facial landmark detection methods mentioned above are designed for single-frame images, but they are also applicable to facial landmark detection in video. Based on these fully supervised methods, some self-training or co-training approaches simply leverage a confidence score or an unsupervised loss to mine qualified samples. Because multiple models can complement one another, researchers have proposed to leverage multiple models to promote each other’s performance. Hinton et al. (2015) and Lee et al. (2018), as classic teacher-student models, aim to let the student model fit the teacher’s output. Dong and Yang (2019) use two different networks and train the student with high-quality pseudo-labels evaluated by the teacher. Different from these multi-model methods, our TSAL framework consists of two networks with exactly the same structure. We use supervision signals with different levels of disturbance from different sources to train the student and the teacher, and implicitly mine high-quality pseudo-labels through a mechanism of asynchronously updating network parameters.
Different from methods that regress facial landmark coordinates directly, we design a Contour Landmark Offset Regression (CLOR) task to detect facial landmarks. Inspired by Zhou et al. (2019), we regard the face as a point and use a Gaussian template to represent the face as an isotropic 2D Gaussian distribution. We leverage another parallel branch to directly regress the offsets from the remaining landmarks to the center. However, this simple CLOR task has a large detection variance on consecutive frames. Fortunately, it can maintain a stable facial structure.
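To make the CLOR representation concrete, the following is a minimal sketch, not the implementation used in the paper. It renders the face center as an isotropic 2D Gaussian, stores every other landmark as an offset relative to that center, and recovers the landmarks from the heatmap peak. The function names, the value of `sigma`, and the sign convention of the offsets (landmark minus center) are assumptions made for illustration.

```python
import math

def center_heatmap(height, width, center, sigma=2.0):
    """Render an isotropic 2D Gaussian peaked at center = (x, y)."""
    cx, cy = center
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]

def encode_offsets(landmarks, center):
    """Offsets of each landmark relative to the face center (assumed convention)."""
    cx, cy = center
    return [(x - cx, y - cy) for x, y in landmarks]

def decode_landmarks(heatmap, offsets):
    """Take the heatmap peak as the face center, then add the regressed offsets."""
    positions = ((y, x) for y in range(len(heatmap)) for x in range(len(heatmap[0])))
    peak_y, peak_x = max(positions, key=lambda yx: heatmap[yx[0]][yx[1]])
    return [(peak_x + dx, peak_y + dy) for dx, dy in offsets]

landmarks = [(10.0, 12.0), (22.0, 12.0), (16.0, 24.0)]
center = (16, 16)
heatmap = center_heatmap(32, 32, center)
offsets = encode_offsets(landmarks, center)
recovered = decode_landmarks(heatmap, offsets)
```

Because all landmarks are decoded relative to a single detected center, an error in the center shifts the whole shape rigidly, which is consistent with the stable facial structure noted above.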
Landmark Heatmap Regression (LHR) based methods Dong et al. (2018b); Wu et al. (2018) often first regress an N-channel landmark heatmap and then parse landmark coordinates in post-processing, where N is the number of landmarks. However, when the face is occluded, the response of the landmarks is weakened, and the facial structure is deformed when the heatmap is parsed into coordinates in post-processing. Therefore, we attempt to apply a constraint to correct the coordinates from CLOR, and regard them as a kind of signal for unlabeled images to supervise the coordinates from LHR.
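The post-processing step of LHR can be sketched as follows. This is an illustrative example only: it assumes the simplest parsing rule (taking the peak of each channel), and `parse_heatmaps` and `blank` are hypothetical helper names. The second channel imitates an occluded landmark whose peak response is weak, which is the situation in which the parsed coordinate becomes unreliable.

```python
def parse_heatmaps(heatmaps):
    """Return one (x, y, response) triple per channel, taken at the channel's peak."""
    coords = []
    for channel in heatmaps:
        best = (0, 0, channel[0][0])
        for y, row in enumerate(channel):
            for x, value in enumerate(row):
                if value > best[2]:
                    best = (x, y, value)
        coords.append(best)
    return coords

def blank(height, width):
    """Create an all-zero single-channel heatmap."""
    return [[0.0] * width for _ in range(height)]

clear = blank(8, 8)
clear[3][5] = 0.9      # confident response for a visible landmark
occluded = blank(8, 8)
occluded[6][2] = 0.2   # weakened response for an occluded landmark

points = parse_heatmaps([clear, occluded])
```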
The LHR-based method and the CLOR-based method are both detection-based methods in facial landmark detection. In continuous video frames, if the mutual information between frames is fully utilized, facial landmark detection will be more accurate. Thus we leverage the unsupervised motion field estimation method proposed in Zhu et al. (2019) to estimate the facial motion field. However, just as Khan et al. (2017) mentioned, the tracking-based method drifts despite its high accuracy. Therefore, we try to eliminate this drift by maintaining the consistency between detection-based and tracking-based landmark detection results.
The core of improving the performance of the detector is to make full use of unlabeled images by mining higher-quality pseudo-labels to participate in training. In the whole system, we obtain two supervision signals from different sources: one comes from the detection source and the other from the tracking source. However, the noise in the detection source signal is obvious. To smooth the disturbance in the supervision signal, we let the two supervision signals run on two asynchronously updated models, and use recursive average filtering to filter out the noise in the supervision signal.
3.2 The Same Pipeline of Teacher and Student
The input of the system is the sequence of frames of a video, of which only the first frame is labeled. The teacher and the student of the TSAL framework have exactly the same structure but different back propagation. For ease of description, this section describes the pipeline shared by the teacher and the student without distinguishing between them.
We have an encoder-decoder network to perform facial LHR and facial CLOR. The decoder has three outputs: the facial landmark heatmap, the face center heatmap, and the offsets from the remaining landmarks to the center. Thus we obtain two groups of facial landmark coordinates, one from CLOR and one from LHR. The motion field estimation module outputs the inter-frame motion field estimate, so the tracking-based landmark detection results are obtained by propagating the guide (the landmark coordinates of the first frame) with the estimated motion field.
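The tracking-based branch can be illustrated with a small sketch that moves the guide landmarks through a sequence of inter-frame motion fields. It is a simplified stand-in rather than the paper's module: the motion field is assumed to be a dense grid of (dx, dy) displacements, and it is sampled with nearest-neighbour lookup, whereas a trainable system would normally use a differentiable interpolation. All names are hypothetical.

```python
def track_landmarks(guide, motion_fields):
    """Propagate guide landmarks frame by frame.

    motion_fields[t][y][x] holds the (dx, dy) displacement from frame t to t + 1.
    Returns one list of landmarks per frame, starting with the guide itself.
    """
    tracks = [list(guide)]
    for field in motion_fields:
        height, width = len(field), len(field[0])
        moved = []
        for x, y in tracks[-1]:
            # Nearest-neighbour sampling, clamped to the field boundaries.
            xi = min(max(int(round(x)), 0), width - 1)
            yi = min(max(int(round(y)), 0), height - 1)
            dx, dy = field[yi][xi]
            moved.append((x + dx, y + dy))
        tracks.append(moved)
    return tracks

def uniform_field(height, width, dx, dy):
    """Toy motion field in which every pixel moves by the same displacement."""
    return [[(dx, dy)] * width for _ in range(height)]

guide = [(4.0, 4.0), (10.0, 6.0)]
fields = [uniform_field(16, 16, 1.0, 0.5), uniform_field(16, 16, -2.0, 0.0)]
tracks = track_landmarks(guide, fields)
```

Since each frame's result is built on the previous one, any error in a motion field is carried forward, which is how drift accumulates in tracking-based results.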
Therefore, the total loss function of supervised detection and unsupervised tracking combines the following terms: the facial structure invariant constraint, which we propose to alleviate the large variance of CLOR-based results; the loss of motion field deviation suppression; the multi-channel landmark heatmap regression error of Dong et al.; the landmark coordinate regression error of Zhou et al.; and the facial motion estimation loss proposed in Zhu et al. (2019). A piecewise exponential climbing function of the training iteration controls the proportion of the unsupervised signals and helps the system gradually adapt to them.
Facial Structure Invariant Constraint
Through LHR and CLOR, we have obtained two groups of facial landmark coordinates. When the facial motion range is too large or there is an object occlusion, the LHR-based coordinates deviate when they are parsed from the heatmap, which deforms the facial landmark structure. The CLOR-based coordinates, in contrast, are heatmap-parsing free, so their facial landmark structure is stable, but their variance across adjacent frames is large. We hope to merge the two groups reasonably to alleviate the structure deformation of the LHR-based facial landmarks in complex situations.
Therefore, we assume that on continuous frames the remaining N-1 facial landmarks are always attracted by the continuously changing face center (as shown in Fig. 3); that is, the normalized sum of offset moduli from the landmarks to the face center in the current frame is a constant. This lets the CLOR-based coordinates maintain the facial landmark structure while having a small variance between adjacent frames. The facial structure invariant constraint on the facial landmarks is defined as follows:
where the normalization uses the width of the face bounding box in the i-th frame, which is computed from the landmark coordinates. In the final test session, the results are obtained by interpolating between the CLOR-based and LHR-based coordinates with a fixed interpolation weight.
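One possible reading of the constraint is sketched below. It is an assumption, not the paper's exact formula: the sum of landmark-to-center distances is divided by the face width, and the absolute change of this quantity between adjacent frames is penalized. The function names and the use of an absolute difference are illustrative choices.

```python
import math

def normalized_offset_sum(landmarks, center, face_width):
    """Sum of landmark-to-center distances, normalized by the face width."""
    cx, cy = center
    return sum(math.hypot(x - cx, y - cy) for x, y in landmarks) / face_width

def structure_invariant_loss(frames):
    """Mean absolute change of the normalized offset sum over adjacent frames.

    frames is a list of (landmarks, center, face_width) with at least two entries.
    """
    sums = [normalized_offset_sum(*frame) for frame in frames]
    diffs = [abs(b - a) for a, b in zip(sums, sums[1:])]
    return sum(diffs) / len(diffs)

# Same facial structure, translated and scaled by 2: the quantity is unchanged.
frame_a = ([(3.0, 4.0), (-3.0, -4.0)], (0.0, 0.0), 10.0)
frame_b = ([(16.0, 18.0), (4.0, 2.0)], (10.0, 10.0), 20.0)
# One landmark pulled away from the center: the structure is deformed.
frame_c = ([(6.0, 8.0), (-3.0, -4.0)], (0.0, 0.0), 10.0)

loss_consistent = structure_invariant_loss([frame_a, frame_b])
loss_deformed = structure_invariant_loss([frame_a, frame_c])
```

The normalization by face width is what makes the quantity insensitive to the face moving closer to or farther from the camera, so only changes of shape are penalized.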
Motion Field Deviation Suppression
When the coordinates obtained by LHR are used as a facial motion field guide to generate the facial landmarks of subsequent frames, the inaccuracy of the motion field estimated by the facial motion field estimation module makes the tracking-based results inconsistent with the detection-based results, which is the reason why facial landmarks of subsequent frames predicted by the tracking-based method drift.
In this case, we propose motion field deviation suppression to maintain the consistency between the facial landmark detection results obtained by the tracking-based method and by the detection-based method. Motion field deviation suppression updates the parameters of the facial motion field estimation module, and the motion field estimation deviation constraint loss is calculated by:
Specifically, the same-pipeline loss is computed separately for the student and for the teacher.
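The idea of penalizing disagreement between the two kinds of results can be illustrated as follows. This is a hedged sketch: the choice of the mean Euclidean distance is an assumption, and the function name is hypothetical. In training, such a term would be used to update the motion field estimation module only, as stated above.

```python
import math

def deviation_suppression_loss(tracked, detected):
    """Mean Euclidean distance between tracking-based and detection-based landmarks."""
    distances = [math.hypot(tx - dx, ty - dy)
                 for (tx, ty), (dx, dy) in zip(tracked, detected)]
    return sum(distances) / len(distances)

tracked = [(10.0, 10.0), (20.0, 20.0)]    # propagated from the first frame
detected = [(13.0, 14.0), (20.0, 20.0)]   # parsed from the landmark heatmap
loss = deviation_suppression_loss(tracked, detected)
```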
3.3 Different Pipelines of Teacher and Student
Through the same pipeline described above, the student and the teacher each obtain two groups of fine-tuned supervision signals from different sources. As two types of pseudo-label for unlabeled images, these signals have different levels of disturbance. To strike a balance between the two supervision signals, we use dual-model asynchronous learning and recursive average filtering to filter out the noise in the supervision signals and to effectively ensemble them when updating network parameters, which is what drives the mining of higher-quality pseudo-labels.
The motion field estimation modules in the teacher and the student each output a facial motion field, from which we obtain the tracking-based landmark coordinates of the teacher and of the student. The other supervision signals of the teacher and the student are their CLOR-based coordinates. The teacher uses only the stable tracking-based signal to update its parameters; that is, the final supervision signal for its LHR-based output is its tracking-based coordinates. The loss function of the teacher using pseudo-heatmap supervision can then be defined as:
The supervision signal of the radical student, in contrast, is also affected by its CLOR-based coordinates. We use linear interpolation to merge the tracking-based and CLOR-based coordinates, and the merged result is the final supervision signal for the student's LHR-based output. The loss function of the student using pseudo-heatmap supervision can then be defined as:
Therefore, combining the same pipeline and the different pipelines described above, we express the total losses of the calm teacher and the radical student as follows:
where the weighting term is the function of the training iteration introduced with the same pipeline above, used to control the proportion of unsupervised signals.
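The difference between the two supervision targets can be sketched as below. The interpolation direction follows the hyperparameter study in this paper, where a weight of 0 leaves the student supervised entirely by CLOR-based results; the function and variable names are hypothetical, and rendering the targets into pseudo-heatmaps is omitted.

```python
def teacher_target(tracked):
    """The calm teacher is supervised by the tracking-based coordinates only."""
    return list(tracked)

def student_target(tracked, clor, weight):
    """The radical student blends tracking-based and CLOR-based coordinates.

    weight = 0 uses only the CLOR-based signal; weight = 1 matches the teacher.
    """
    return [(weight * tx + (1.0 - weight) * cx, weight * ty + (1.0 - weight) * cy)
            for (tx, ty), (cx, cy) in zip(tracked, clor)]

tracked = [(10.0, 20.0)]
clor = [(20.0, 10.0)]
blended = student_target(tracked, clor, 0.6)
```

With a weight of 1 the two targets coincide, so the student no longer contributes information that the teacher does not already have.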
3.4 Parameter Updating
The student and the teacher asynchronously update their network parameters with supervision signals from different sources on the same task, each through its own total loss. Since the pseudo-label supervision signal used by the student carries more disturbance, we use a recursive average filtering method to smooth the noise in the student so that the teacher can accept the student’s suggestions reasonably.
Specifically, we maintain a fixed-length queue that stores the student’s parameters from the most recent iterations, and we define the teacher’s parameters at each training step in terms of these successive weights:
Here the smoothing coefficient is a hyperparameter, and the averaging term is the mean of the student’s parameters in the queue. Tarvainen and Valpola (2017) is the special case in which our queue length is 1. Following Tarvainen and Valpola (2017), we let the smoothing coefficient be 0.999. Such a recursive average filtering method helps the teacher absorb suggestions from the student and improves the teacher stably.
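A minimal sketch of this update rule is given below under stated assumptions: the teacher is moved toward the mean of the queued student parameters with an exponential moving average, so a queue of length 1 reduces to the update of Tarvainen and Valpola (2017). Parameters are represented as flat lists of floats, and the class name, the default queue length, and the exact form of the update are illustrative rather than taken from the paper.

```python
from collections import deque

class QueueAveragedTeacher:
    """Teacher parameters smoothed toward the mean of recent student parameters."""

    def __init__(self, init_params, queue_length=4, alpha=0.999):
        self.params = list(init_params)
        self.alpha = alpha                      # smoothing coefficient
        self.queue = deque(maxlen=queue_length)  # oldest entries drop out automatically

    def update(self, student_params):
        self.queue.append(list(student_params))
        count = len(self.queue)
        mean = [sum(p[i] for p in self.queue) / count
                for i in range(len(self.params))]
        self.params = [self.alpha * t + (1.0 - self.alpha) * m
                       for t, m in zip(self.params, mean)]
        return self.params

# A small alpha and short queue make the arithmetic easy to follow by hand.
teacher = QueueAveragedTeacher([0.0], queue_length=2, alpha=0.5)
step1 = teacher.update([2.0])   # queue mean 2.0
step2 = teacher.update([4.0])   # queue mean 3.0
step3 = teacher.update([6.0])   # queue mean 5.0 (the oldest entry was dropped)
```

Averaging over the queue before applying the moving average filters out iteration-to-iteration noise in the student, at the cost of a slightly delayed response to genuine changes.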
The first image dataset used was the 300W dataset Sagonas et al. (2013), which is a combination of five other datasets: LFPW, AFW, HELEN, XM2VTS, and IBUG. Following prior works, our training set included the training subsets of LFPW and HELEN as well as the full set of AFW, with 3148 images in total. The common test subset consisted of 554 test images from LFPW and HELEN, and the challenging test subset consisted of 135 images from IBUG. The full test set was the union of the common and the challenging subsets, with 689 images in total.
The second image dataset used was the AFLW dataset Köstinger et al. (2011) that consists of 25993 faces from 21997 real-world images Lv et al. (2017); Dong et al. (2018b). Following Zhu et al., the dataset was partitioned into two different subsets, AFLW-Full and AFLW-Front respectively. The two subsets have the same training set, but with different testing samples: AFLW-Full contains 4386 test samples, while AFLW-Front only uses 1165 samples from AFLW-Full as testing set.
The video dataset used was the 300VW dataset Shen et al. (2015), which contains 50 training videos with 95192 frames. The test set consists of three subsets (A, B and C) with 62135, 32805 and 26338 frames, respectively, of which subset C is the most challenging. Following Khan et al. (2017), we report the results on subset C.
We selected as our training set 20 videos from the 300VW training dataset (1st, 2nd, 7th, 13th, 19th, 20th, 22nd, 25th, 28th, 33rd, 37th, 41st, 44th, 47th, 57th, 119th, 138th, 160th, 205th and 225th) with different brightness, scenes, facial motion amplitude, face scale, occlusion, and gender. We set a stride of 10 frames, which means that only the first frame in every 10-frame sequence was annotated, and the remaining 9 frames were used as unlabeled data.
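The labeled/unlabeled partition implied by the stride can be sketched as follows; the function name and the zero-based frame indexing are assumptions for illustration.

```python
def split_by_stride(num_frames, stride=10):
    """Mark the first frame of every `stride`-frame sequence as labeled."""
    labeled = [i for i in range(num_frames) if i % stride == 0]
    unlabeled = [i for i in range(num_frames) if i % stride != 0]
    return labeled, unlabeled

labeled, unlabeled = split_by_stride(25)
```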
For the image datasets (300W and AFLW), the Normalized Mean Error (NME), normalized by inter-pupil distance and by face size respectively, was used, while for the video dataset (300VW), the mean Area Under the Curve (AUC)@0.08 error Khan et al. (2017) was employed.
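For reference, the two metrics can be sketched as below. This is a generic illustration and not the evaluation code behind the reported numbers: `norm` stands for whichever normalizer applies (inter-pupil distance or face size), the AUC is computed from the cumulative error distribution up to the threshold with trapezoidal integration, and the function names and the number of integration steps are assumptions.

```python
import math

def nme(pred, gt, norm):
    """Mean point-to-point error of one face, divided by the normalizer."""
    total = sum(math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt))
    return total / (len(gt) * norm)

def auc_at(errors, threshold=0.08, steps=1000):
    """Area under the cumulative error distribution up to `threshold`, scaled to [0, 1]."""
    xs = [threshold * k / steps for k in range(steps + 1)]
    ced = [sum(e <= x for e in errors) / len(errors) for x in xs]
    area = sum((ced[k] + ced[k + 1]) / 2.0 * (xs[k + 1] - xs[k]) for k in range(steps))
    return area / threshold

face_error = nme([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)], norm=10.0)
perfect = auc_at([0.0, 0.0, 0.0])
failed = auc_at([1.0, 1.0, 1.0])
```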
The input image was resized to 256x256. We used the Adam optimizer and trained for 140 epochs, decaying the learning rate at the 90th and 120th epochs. The setting of the power value in Eq. (1), (6) and (7) followed Tarvainen and Valpola (2017): the climbing period was from epoch 1 to 60, the retention period from epoch 61 to 110, and the decay period from epoch 111 to 140. The batch size was set to 8 for both the teacher and the student model, and we used random flip, random translation, random rotation, and color jitter for data augmentation. All our experiments were conducted on a workstation with 2.4GHz Intel Core CPUs and 4 NVIDIA GTX 1080Ti GPUs.
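The three periods of the weighting function can be sketched as follows. Only the epoch boundaries come from the text; the exponential form exp(-5(1 - x)^2) is the ramp commonly used in mean-teacher style training and is assumed here, and the mirrored shape of the decay period is likewise an assumption. The function name is hypothetical.

```python
import math

def unsupervised_weight(epoch, climb_end=60, hold_end=110, total=140):
    """Piecewise weight for unsupervised terms: climb, hold at 1, then decay."""
    if epoch <= climb_end:
        phase = 1.0 - epoch / climb_end          # 1 at the start, 0 at the end of the climb
        return math.exp(-5.0 * phase * phase)
    if epoch <= hold_end:
        return 1.0                                # retention period
    phase = (epoch - hold_end) / (total - hold_end)  # 0 to 1 over the decay period
    return math.exp(-5.0 * phase * phase)

schedule = [unsupervised_weight(e) for e in (1, 30, 60, 85, 110, 120, 140)]
```

Starting with a small weight keeps the early, unreliable pseudo-labels from dominating the supervised loss.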
4.3 Comparison with State-of-the-Art (Sota)
Results on 300W and AFLW
As shown in Table 1, compared with SDM Xiong and De la Torre (2013), TCDCN Zhang et al. (2014), CPM Wei et al. (2016), LAB Wu et al. (2018), Wingloss Feng et al. (2017) and PFLD1x Guo et al. (2019), our model showed a clear improvement on the 300W dataset. Similarly, on the AFLW dataset, our model also outperformed SDM Xiong and De la Torre (2013), LAB Wu et al. (2018), LBF Ren et al. (2014), CCL Zhu et al. (2016), Two stage Lv et al. (2017), SAN Dong et al. (2018a) and DSRN Miao et al. (2018).
Results on 300VW
As shown in Table 2, compared with DGCM Khan et al. (2017), SBR Dong et al. (2018b) and TS Dong and Yang (2019), our model achieved the best accuracy on the 300VW dataset. It is worth noting that, although TS Dong and Yang (2019) also employs the teacher-student architecture, our model, with supervision from multiple sources, uses a simpler structure to achieve even higher detection accuracy.
4.4 Ablation Study
To understand the effectiveness of each key component in our model, including the facial structure invariant constraint, the facial motion field deviation suppression, and the dual-model asynchronous learning strategy, we conducted several ablation experiments as reported in Table 3.
The Facial Structure Invariant Constraint:
Compared with the baseline model accuracy in Table 3, when the structure invariant constraint (SIC) module was enabled in either the single model or the student-teacher dual-model, a clear performance gain was observed. This shows that, by enforcing the facial structure to be consistent in consecutive frames, SIC made the CLOR-based detection results more stable and accurate.
The Facial Motion Field Deviation Suppression:
It can also be seen from Table 3 that, by adding motion field deviation suppression (MFDS) to the MFE module, the NME score on 300W and the mean AUC@0.08 error rate on 300VW were improved. Therefore, the use of LHR-based detection results to supervise the training of facial MFE can decrease the deviation of motion estimation. The quantitative evaluation of pseudo-label quality in Fig. 5 also verifies the effectiveness of motion field deviation suppression.
In Table 3, the performance of the teacher model was always better than that of the student model. This indicates that there was indeed a disturbance in updating the student model after merging the CLOR-based signal and the tracking-based signal, and that the applied recursive average filtering did improve the quality of the teacher model compared with using only the tracking-based signal.
To summarize, it is revealed that: 1) the facial structure invariant constraint helps to boost the landmark detection results from CLOR; 2) motion field deviation suppression successfully maintains the consistency between tracking-based and detection-based detection results, alleviating the problem of field drift; and 3) dual-model asynchronous learning and the recursive filtering method help to mine high-quality pseudo-labels.
4.5 Qualitative Analysis
To further verify the effectiveness of the facial structure invariant constraint and motion field deviation suppression, we visualize the facial landmark detection results for several faces from the 300W test sets in Fig. 5. Compared with the baseline, the two components make the contour of the facial landmarks more compact, which effectively keeps the facial landmark structure invariant and improves the detection accuracy of landmarks. Combining the facial structure invariant constraint with motion field deviation suppression helps to reduce the drift of facial motion estimation.
During training, the hyperparameter in Eq. 5 needs to be pre-defined. It controls the fusion of tracking-based and CLOR-based detection results that generates the multi-source supervision signals of the student. Fig. 6 shows the NME on the 300W test sets for different values of this hyperparameter under the teacher-student asynchronous learning framework, using the facial SIC to correct CLOR-based detection results and MFDS to correct tracking-based detection results. We find that: (1) The worst result comes from a value of 0. In this case, the student’s supervision signal comes entirely from CLOR-based detection results, which means that there is a large disturbance in the CLOR-based detection results. (2) The best result comes from a value of 0.6, where the best balance is achieved between CLOR-based and tracking-based detection results. (3) When the value is too large, NME starts to increase instead, which indicates that making the supervision signals of the teacher and the student identical in the asynchronous learning framework inhibits the improvement of the teacher.
In this paper, we propose a teacher-student asynchronous learning framework for facial landmark detection, which is effective at mining pseudo-labels for unlabeled face video frames. Additionally, we propose a facial structure invariant constraint to fine-tune the Contour Landmark Offset Regression-based coordinates and a motion field deviation suppression method to maintain the consistency between detection-based and tracking-based landmark coordinates, which significantly improves the performance of our model.
- MixMatch: a holistic approach to semi-supervised learning. In NIPS 32, pp. 5049–5059. Cited by: §1.
- Face recognition based on fitting a 3d morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (9), pp. 1063–1074. Cited by: §1.
- Combining labeled and unlabeled data with co-training. New York, NY. Cited by: §1.
- Face alignment by explicit shape regression. International Journal of Computer Vision 107 (2), pp. 177–190. Cited by: §2.
- Style aggregated network for facial landmark detection. In Proceedings of CVPR, Cited by: §2.
- Style aggregated network for facial landmark detection. Cited by: §4.3.
- Teacher supervises students how to learn from partially labeled images for facial landmark detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §1, §2, §4.3.
- Supervision-by-Registration: an unsupervised approach to improve the precision of facial landmark detectors. In Proceedings of CVPR, pp. 360–368. Cited by: §2, §3.1, §3.2, §4.1, §4.3.
- End-to-end 3d face reconstruction with deep neural networks. In Proceedings of CVPR, Cited by: §1.
- EmotioNet: an accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In Proceedings of CVPR, Cited by: §1.
- Wing loss for robust facial landmark localisation with convolutional neural networks. CoRR abs/1711.06753. Cited by: §4.3.
- PFLD: a practical facial landmark detector. Cited by: §4.3.
- Distilling the knowledge in a neural network. Cited by: §2.
- Efficient 3d morphable face model fitting. Pattern Recognition 67, pp. 366 – 379. Cited by: §1.
- Consistency-based semi-supervised learning for object detection. In NIPS 32, pp. 10759–10768. Cited by: §1.
- Synergy between face alignment and tracking via discriminative global consensus optimization. In 2017 ICCV, pp. 3811–3819. Cited by: §2, §3.1, §4.1, §4.2, §4.3.
- 3D morphable face models and their applications. In Articulated Motion and Deformable Objects, F. J. Perales and J. Kittler (Eds.), pp. 185–206. Cited by: §1.
- Annotated facial landmarks in the wild: a large-scale, real-world database for facial landmark localization. In 2011 ICCV Workshops, pp. 2144–2151. Cited by: §1, §4.1.
- Teacher and student joint learning for compact facial landmark detection network. In MultiMedia Modeling, K. Schoeffmann, T. H. Chalidabhongse, C. W. Ngo, S. Aramvith, N. E. O’Connor, Y. Ho, M. Gabbouj, and A. Elgammal (Eds.), pp. 493–504. Cited by: §2.
- Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In Proceedings of CVPR, Cited by: §1.
- SphereFace: deep hypersphere embedding for face recognition. In Proceedings of CVPR, Cited by: §1.
- A deep regression architecture with two-stage re-initialization for high performance facial landmark detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3691–3700. Cited by: §4.1.
- A deep regression architecture with two-stage re-initialization for high performance facial landmark detection. In Proceedings of CVPR, Cited by: §4.3.
- Semi-supervised co-training and active learning based approach for multi-view intrusion detection. In Proceedings of the 2009 ACM Symposium on Applied Computing, pp. 2042–2048. Cited by: §1.
- Direct shape regression networks for end-to-end face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.3.
- Stacked hourglass networks for human pose estimation. In ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), pp. 483–499. Cited by: §2.
- Face alignment at 3000 fps via regressing local binary features. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1685–1692. Cited by: §4.3.
- 300 faces in-the-wild challenge: the first facial landmark localization challenge. In ICCV, Cited by: §1, §4.1.
- The first facial landmark tracking in-the-wild challenge: benchmark and results. In ICCV Workshops, Cited by: §1, §4.1.
- Hybrid deep learning for face verification. In ICCV, Cited by: §1.
- Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS 30, pp. 1195–1204. Cited by: §1, §3.4, §4.2.
- Face2Face: real-time face capture and reenactment of rgb videos. In Proceedings of CVPR, Cited by: §1.
- Project-out cascaded regression with an application to face alignment. In Proceedings of CVPR, Cited by: §1.
- Convolutional pose machines. In Proceedings of CVPR, Cited by: §2, §4.3.
- Look at boundary: a boundary-aware face alignment algorithm. In CVPR, Cited by: §3.1, §4.3.
- Supervised descent method and its applications to face alignment. In Proceedings of CVPR, Cited by: §2, §4.3.
- Facial landmark detection by deep multi-task learning. In ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.), pp. 94–108. Cited by: §4.3.
- Objects as points. In arXiv preprint arXiv:1904.07850, Cited by: §2, §3.1, §3.2.
- Unconstrained face alignment via cascaded compositional learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.1, §4.3.
- Hidden two-stream convolutional networks for action recognition. In Computer Vision – ACCV 2018, C. V. Jawahar, H. Li, G. Mori, and K. Schindler (Eds.), pp. 363–378. Cited by: §3.1, §3.2.