Due to the high annotation cost of large-scale facial landmark detection in videos, researchers have proposed semi-supervised paradigms that use self-training to mine high-quality pseudo-labels for training. However, self-training based methods typically train with a gradually increasing number of samples, and their performance varies greatly depending on how many pseudo-labeled samples are added. In this paper, we propose a teacher-student asynchronous learning (TSAL) framework based on a multi-source supervision-signal consistency criterion, which mines pseudo-labels implicitly through consistency constraints. Specifically, the TSAL framework contains two models with exactly the same structure: a radical student that updates its parameters with multi-source supervision signals from the same task, and a calm teacher that updates its parameters with a single-source supervision signal. To absorb the student's suggestions reasonably, the teacher's parameters are updated again through recursive average filtering. Experimental results show that the asynchronous-learning framework effectively filters the noise in multi-source supervision signals, thereby mining the pseudo-labels that are most significant for updating network parameters. Extensive experiments on the 300W, AFLW, and 300VW benchmarks show that the TSAL framework achieves state-of-the-art performance.
Facial landmark detection is widely used as a preliminary task in face-related computer vision applications Sun et al. (2013); Liu et al. (2017); Fabian Benitez-Quiroz et al. (2016); Li et al. (2017); Blanz and Vetter (2003); Dou et al. (2017); Kittler et al. (2016); Hu et al. (2017); Thies et al. (2016). Although several public datasets with labeled facial landmarks are available Sagonas et al. (2013); Köstinger et al. (2011); Tzimiropoulos (2015); Shen et al. (2015), marking the precise location of every landmark in large-scale image or video collections is very time consuming, which renders fully-supervised training of Deep Neural Network (DNN) based facial landmark detectors tedious and costly Dong and Yang (2019).
This situation has motivated researchers to leverage the semi-supervised learning paradigm, making use of both labeled and unlabeled data. Many such methods propagate supervision to unlabeled data via pseudo-labels. For example, Jeong et al. presented a self-training method that iteratively predicts labels for the unlabeled data, gradually increasing the number of utilized samples, and retrains the model. Mao et al. (2009) and Blum and Mitchell (1998) presented multi-model collaboration frameworks that use multiple complementary models to obtain higher-quality pseudo-labels and to keep the optimizer from falling into local optima. In general, due to the noise in the unlabeled data, the performance of these approaches varies greatly depending on how the pseudo-labels are generated and utilized.
The teacher-student consistency strategy has become effective for semi-supervised learning, where the output of a student model is enforced to be consistent with that of a teacher model on unlabeled input. By asynchronously updating the two models, the consistency constraint helps to mine high-quality pseudo-labels that bring significant improvements in semi-supervised learning tasks.
This strategy inspired us to propose a facial landmark detection model based on asynchronous learning with a consistency constraint over multi-source supervision signals. The method consists of a teacher-student pair in which a radical student is updated with raw multi-source signals and a calm teacher is updated with a more stable gradient. Specifically, the supervision sources are threefold. First, a set of facial landmark targets is predicted through a facial Motion Field Estimation (MFE) module. Second, a detection method yields a second set of facial landmark targets through Landmark Heatmap Regression (LHR). Third, another set of facial landmark targets is obtained through face center detection and Contour Landmark Offset Regression (CLOR). With these three sets of signals from different sources, the radical student uses all three to update its parameters, while the calm teacher uses only the facial MFE and LHR signals. To allow the teacher model to accept part of the student's suggestions, an exponential moving average strategy additionally updates the teacher's parameters again before the teacher instructs the student. In this way, the overall framework exploits the consistency constraints between sources and between models to achieve satisfactory facial landmark detection performance. Fig. 2 shows an overview of our asynchronous-learning framework. In summary, the major contributions of this paper include:
Three improved sets of supervision sources for training the teacher-student network, namely the facial MFE, the LHR, and the CLOR;
A teacher-student network model with asynchronous learning that effectively smooths the learning and obtains improved performance, as shown in our experiments.
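To make the signal routing above concrete, the asynchronous update can be sketched numerically. This is a minimal illustration under our own naming (`tsal_step`, the `grads` dictionary, and the scalar hyperparameters are illustrative, not the authors' code): the student descends on all three gradients, the teacher on the single-source pair, and the teacher then absorbs part of the student's weights via an exponential moving average.

```python
import numpy as np

def tsal_step(theta_s, theta_t, grads, lr=0.1, lam=0.5, alpha=0.999):
    """One asynchronous TSAL step (illustrative sketch, not the authors' code).

    grads maps each supervision source ('lhr', 'clor', 'mfe') to a gradient
    function of the parameters; lam weights the unsupervised tracking term
    and alpha is the moving-average smoothing coefficient.
    """
    # Radical student: all three supervision sources from the same task.
    g_s = grads['lhr'](theta_s) + grads['clor'](theta_s) + lam * grads['mfe'](theta_s)
    theta_s = theta_s - lr * g_s
    # Calm teacher: only the single-source (LHR + MFE) signal.
    g_t = grads['lhr'](theta_t) + lam * grads['mfe'](theta_t)
    theta_t = theta_t - lr * g_t
    # The teacher absorbs part of the student's suggestion.
    theta_t = alpha * theta_t + (1.0 - alpha) * theta_s
    return theta_s, theta_t
```

With the last line alone, this degenerates to a plain mean-teacher update; the gradient steps before it are what make the two models asynchronous.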
Fully-supervised Facial Landmark Detection. Fully supervised facial landmark detection can be categorized into two types according to the supervision signal: coordinate regression Xiong and De la Torre (2013); Cao et al. (2014) and landmark heatmap regression Wei et al. (2016); Dong et al. (2018); Newell et al. (2016). Zhou et al. leverage the idea of "object as point" to regress the anchor of an object. Based on this idea, we design the CLOR task and obtain another type of facial landmark supervision signal. These different sources of supervision signals are the objects our model operates on.
Semi-supervised facial landmark detection in video aims to use less annotated data to improve performance over the entire video. Most of the existing art focuses on how to mine pseudo-labels to expand the training set.
Dong et al. use a differentiable optical flow estimation method to obtain pseudo-labels for subsequent frames, contribute a way to estimate the facial motion field, and complete the whole task through a two-stage training process that performs pseudo-label mining implicitly. When fine-tuning the detector with tracking-based results in the second stage, it addresses the problem raised by Khan et al. that detection results have no drift but low accuracy.
The fully supervised facial landmark detection methods mentioned above operate on single-frame images but are also applicable to facial landmark detection in video. Building on these fully supervised methods, some self-training or co-training approaches simply leverage a confidence score or an unsupervised loss to mine qualified samples. Because multiple models can complement one another, researchers have proposed to let several models promote each other's performance. Classic teacher-student models Hinton et al. (2015); Lee et al. (2018) aim to let the student fit the teacher's output. Dong and Yang (2019) contains two different networks and trains the student with pseudo-labels whose quality is evaluated by the teacher. Different from these multi-model approaches, our TSAL framework consists of two networks with exactly the same structure. We train the student and the teacher with supervision signals that carry different levels of disturbance from different sources, and implicitly mine high-quality pseudo-labels through a mechanism of asynchronously updating network parameters.
Different from methods that regress face coordinates directly, we design a Contour Landmark Offset Regression (CLOR) task to detect facial landmarks. Inspired by Zhou et al. (2019), we regard the face as a point and use a Gaussian template to represent the face center as an isotropic 2D Gaussian distribution. A parallel branch then directly regresses the offsets from the remaining landmarks to the center. However, this simple yet effective CLOR task has a large detection variance on consecutive frames; fortunately, its results maintain a stable facial structure.
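As a concrete illustration of the "face as a point" idea, the center heatmap and the CLOR decoding step could look as follows. This is a sketch with assumed array shapes; `gaussian_center_heatmap` and `decode_clor` are our names, not the paper's.

```python
import numpy as np

def gaussian_center_heatmap(h, w, center, sigma=2.0):
    """Render the face center as an isotropic 2D Gaussian, in the
    CenterNet-style 'face as a point' spirit (an illustrative template,
    not necessarily the authors' exact one)."""
    ys, xs = np.mgrid[0:h, 0:w]
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def decode_clor(heatmap, offsets):
    """CLOR decoding: locate the face center at the heatmap peak, then add
    the regressed per-landmark offsets (shape (N-1, 2)) to recover the
    remaining landmark coordinates."""
    cy, cx = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return np.array([cx, cy]) + offsets
```

Because every landmark is anchored to one shared center, the decoded set keeps its internal structure even when individual responses are weak, which is exactly the stability property exploited below.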
Landmark Heatmap Regression (LHR) based methods Dong et al. (2018b); Wu et al. (2018) first regress an N-channel landmark heatmap, where N is the number of landmarks, and then parse the landmark coordinates in a post-process. However, when the face is occluded, the response of the landmarks is weakened, and parsing the heatmap into coordinates can deform the facial structure. Therefore, we apply a constraint that corrects the LHR results with those from CLOR, and regard the result as a supervision signal for unlabeled images.
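The LHR post-process that parses an N-channel heatmap into coordinates can be sketched as a per-channel argmax (sub-pixel refinement, which real detectors add, is omitted here):

```python
import numpy as np

def parse_landmarks(heatmaps):
    """Parse an N-channel landmark heatmap (shape (N, H, W)) into N (x, y)
    coordinates by taking the per-channel argmax."""
    n, h, w = heatmaps.shape
    flat = heatmaps.reshape(n, -1).argmax(axis=1)
    return np.stack([flat % w, flat // w], axis=1)  # (N, 2) as (x, y)
```

This hard argmax is what makes the parsed coordinates fragile under occlusion: a weakened channel response can jump to an unrelated peak, deforming the facial structure.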
LHR-based and CLOR-based methods are both detection-based facial landmark detectors. In continuous video frames, facial landmark detection becomes more accurate if the mutual information between frames is fully utilized. We therefore leverage the unsupervised motion field estimation method proposed in Zhu et al. (2019) to estimate the facial motion field. However, as Khan et al. (2017) note, tracking-based methods drift despite their high accuracy. We therefore try to eliminate this drift by maintaining consistency between the detection-based and tracking-based landmark detection results.
The key to improving detector performance is to make full use of unlabeled images by mining higher-quality pseudo-labels for training. In the whole system, we obtain supervision signals from two different sources: one from detection and one from tracking. However, the noise in the detection-source signal is obvious. To smooth the disturbance in the supervision signal, we let the two supervision signals drive two asynchronously updated models, and use recursive average filtering to filter out the noise.
The input of the system is a sequence of video frames, of which only the first frame of each sequence is labeled. The teacher and the student of the TSAL framework have exactly the same structure but different back-propagation. For ease of description, we describe the pipeline shared by teacher and student without distinguishing them in this section.
An encoder-decoder network performs facial LHR and facial CLOR. The decoder has three outputs: the facial landmark heatmap, the face center heatmap, and the offsets from the remaining landmarks to the center. From these we obtain two groups of facial landmark coordinates, one from CLOR and one from LHR. A motion field estimation module outputs the inter-frame motion field, so the tracking-based landmark detection results are obtained by propagating a guide (the landmark coordinates of the first frame) through the estimated motion fields.
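Propagating the first-frame guide through the estimated motion field amounts to sampling the flow at each landmark position and adding the displacement. A minimal nearest-neighbour version (our simplification; the actual module is differentiable and would interpolate) might look like:

```python
import numpy as np

def track_landmarks(coords, flow):
    """Propagate landmark coordinates to the next frame with an estimated
    motion field. `flow` has shape (H, W, 2) holding (dx, dy) per pixel;
    we sample it at each rounded landmark position (nearest neighbour)."""
    h, w, _ = flow.shape
    out = []
    for x, y in coords:
        xi = int(np.clip(round(x), 0, w - 1))
        yi = int(np.clip(round(y), 0, h - 1))
        dx, dy = flow[yi, xi]
        out.append((x + dx, y + dy))
    return np.array(out)
```

Any error in the estimated flow accumulates frame by frame, which is the drift the deviation-suppression loss below is designed to counter.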
Therefore, the total loss of supervised detection and unsupervised tracking is defined as follows:

$$\mathcal{L}_{same} = \mathcal{L}_{LHR} + \mathcal{L}_{CLOR} + \lambda(t)\,\big(\mathcal{L}_{SIC} + \mathcal{L}_{MFE} + \mathcal{L}_{MFDS}\big),$$

where $\mathcal{L}_{SIC}$ is the facial structure invariant constraint we propose to alleviate the large variance of the CLOR-based results, and $\mathcal{L}_{MFDS}$ is the motion field deviation suppression loss. $\mathcal{L}_{LHR}$ is the multi-channel landmark heatmap regression error of Dong et al., $\mathcal{L}_{CLOR}$ is the landmark coordinate regression error of Zhou et al., and $\mathcal{L}_{MFE}$ is the facial motion estimation loss proposed in Zhu et al. (2019). $\lambda(t)$ is a piecewise exponential climbing function that helps the system gradually adapt to the unsupervised signals.
Through LHR and CLOR, we obtain two groups of facial landmark coordinates. When the facial motion range is too large or an object occludes the face, the LHR results deviate when coordinates are parsed from the heatmap, which deforms the facial landmark structure. The CLOR results are heatmap-parsing free, so their facial landmark structure is stable, but their variance across adjacent frames is large. We hope to merge the two groups reasonably to alleviate the structural deformation of the LHR-based landmarks in complex situations.
Therefore, we assume that over continuous frames the remaining N-1 facial landmarks are always attracted by the continuously changing face center (as shown in Fig. 3); that is, the normalized sum of offset moduli from the landmarks to the face center in the current frame is a constant. This lets the CLOR results maintain the facial landmark structure while keeping a small variance between adjacent frames. The facial structure invariant constraint is defined as follows:

$$\mathcal{L}_{SIC} = \Big|\, \frac{1}{w_i}\sum_{n=1}^{N-1}\big\|p_n^i - c^i\big\|_2 \;-\; \frac{1}{w_{i+1}}\sum_{n=1}^{N-1}\big\|p_n^{i+1} - c^{i+1}\big\|_2 \,\Big|,$$

where $w_i$ is the width of the face bounding box in the $i$-th frame, $c^i$ is the face center, and $p_n^i$ are the remaining landmark coordinates, so the offsets are $p_n^i - c^i$.
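Under our reading of this constraint, the normalized offset-modulus sum and the resulting penalty between two consecutive frames can be sketched as follows (function names are ours):

```python
import numpy as np

def offset_modulus_sum(landmarks, center, box_width):
    """Sum of landmark-to-center offset norms, normalized by the face
    bounding-box width of that frame."""
    return np.linalg.norm(landmarks - center, axis=1).sum() / box_width

def sic_loss(lm_a, c_a, w_a, lm_b, c_b, w_b):
    """Facial structure invariant constraint between two consecutive frames
    (our reconstruction of the idea): the normalized sum should stay
    constant over time, so we penalize its change."""
    return abs(offset_modulus_sum(lm_a, c_a, w_a) -
               offset_modulus_sum(lm_b, c_b, w_b))
```

Normalizing by the box width makes the quantity scale-invariant, so a face moving toward the camera does not register as a structural change.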
In the final test session, the results are obtained through linear interpolation of the LHR-based and CLOR-based coordinates, with a fixed interpolation coefficient.
When the LHR results of the first frame are used as a guide for the facial motion field to generate the facial landmarks of subsequent frames, inaccuracy in the motion field estimated by the facial motion field estimation module makes the tracked landmarks inconsistent with the detection-based results; this is why the facial landmarks of subsequent frames predicted by the tracking-based method drift.
To address this, we propose motion field deviation suppression (MFDS) to maintain consistency between the facial landmark detection results obtained by the tracking-based and detection-based methods. MFDS updates the parameters of the facial motion field estimation module, and the motion field deviation constraint loss is calculated by:

$$\mathcal{L}_{MFDS} = \frac{1}{N}\sum_{n=1}^{N}\big\| \hat{p}_n - p_n \big\|_2,$$

where $\hat{p}_n$ are the tracking-based landmark coordinates and $p_n$ the detection-based (LHR) coordinates of the same frame.
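A minimal version of this deviation penalty, assuming both sets of coordinates are given as (N, 2) arrays, could be:

```python
import numpy as np

def mfds_loss(tracked, detected):
    """Motion field deviation suppression (our reconstruction): penalize the
    mean distance between tracking-based and detection-based landmark
    coordinates on the same frame, so the estimated motion field cannot
    drift away from the detector."""
    return np.linalg.norm(tracked - detected, axis=1).mean()
```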
Specifically, we denote the losses of this shared pipeline for the student and the teacher as $\mathcal{L}_{same}^{S}$ and $\mathcal{L}_{same}^{T}$, respectively.
Through the shared pipeline described above, the student and the teacher each obtain two groups of fine-tuned supervision signals from different sources. As two types of pseudo-labels for unlabeled images, they carry different levels of disturbance. To seek a balance between the two supervision signals, we use two-model asynchronous learning and recursive average filtering to filter out the noise in the supervision signal and to ensemble the two signals effectively when updating network parameters; this is what lets us mine more significant pseudo-labels.
The motion field estimation modules of the teacher and the student output facial motion fields, from which we obtain the tracking-based landmark coordinates of the teacher and the student, denoted $\hat{P}^{T}$ and $\hat{P}^{S}$. The other supervision signals of the teacher and the student are the CLOR-based coordinates $P^{T}_{CLOR}$ and $P^{S}_{CLOR}$. The teacher uses only the stable tracking-based signal to update its parameters; that is, the final supervision signal for its LHR-based output is $\hat{P}^{T}$. The loss of the teacher under pseudo-heatmap supervision can then be defined as:

$$\mathcal{L}_{pseudo}^{T} = \big\| H(\hat{P}^{T}) - M^{T} \big\|_2^2,$$

where $H(\cdot)$ renders coordinates into a landmark heatmap and $M^{T}$ is the teacher's LHR output.
The supervision signal of the radical student is also affected by the CLOR-based results. We use linear interpolation to merge the tracking-based coordinates $\hat{P}^{S}$ and the CLOR-based coordinates $P^{S}_{CLOR}$, so the final supervision signal for the student's LHR-based output is $A\hat{P}^{S} + (1-A)P^{S}_{CLOR}$. The loss of the student under pseudo-heatmap supervision can then be defined as:

$$\mathcal{L}_{pseudo}^{S} = \big\| H\big(A\hat{P}^{S} + (1-A)P^{S}_{CLOR}\big) - M^{S} \big\|_2^2.$$
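The student's merged pseudo-target is a single interpolation; A = 0.6 is the value the paper's hyperparameter study finds best.

```python
import numpy as np

def student_pseudo_target(track_coords, clor_coords, A=0.6):
    """Linear interpolation of tracking-based and CLOR-based landmark
    coordinates into the student's pseudo-label; A = 0 would use the CLOR
    results alone, A = 1 the tracking results alone."""
    return A * np.asarray(track_coords) + (1.0 - A) * np.asarray(clor_coords)
```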
Therefore, combining the shared pipeline and the source-specific pipelines above, we express the total losses of the calm teacher and the radical student as follows:

$$\mathcal{L}^{T} = \mathcal{L}_{same}^{T} + \lambda(t)\,\mathcal{L}_{pseudo}^{T}, \qquad \mathcal{L}^{S} = \mathcal{L}_{same}^{S} + \lambda(t)\,\mathcal{L}_{pseudo}^{S},$$

where $\lambda(t)$ is the function of the training iteration introduced in the shared pipeline above, used to control the proportion of the unsupervised signals.
The student and the teacher asynchronously update their network parameters with different supervision signals from different sources of the same task, through $\mathcal{L}^{S}$ and $\mathcal{L}^{T}$ respectively. Since the pseudo-label supervision signal used by the student carries more disturbance, we use a recursive average filtering method to smooth the student's noise so that the teacher can accept the student's suggestions reasonably.
Specifically, we maintain a queue $Q$ of length $K$ that stores the student's parameters from the most recent $K$ iterations, denoted $\theta^{S}_{t-K+1}, \ldots, \theta^{S}_{t}$. We define the teacher's parameters at training step $t$ as a recursive average of successive weights:

$$\theta^{T}_{t} = \alpha\,\theta^{T}_{t-1} + (1-\alpha)\,\mathrm{avg}(Q),$$

where $\alpha$ is a smoothing coefficient hyperparameter and $\mathrm{avg}(Q)$ computes the average of the student parameters in the queue $Q$. Tarvainen and Valpola (2017) is the special case of queue length $K=1$. Following Tarvainen and Valpola (2017), we set $\alpha=0.999$. Such recursive average filtering helps the teacher absorb suggestions from the student and improve stably.
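The filtering step above can be sketched as a small stateful helper. This is an illustrative implementation; the queue length K is a placeholder, since the text does not fix its value, and `RecursiveAverageFilter` is our name.

```python
from collections import deque
import numpy as np

class RecursiveAverageFilter:
    """Teacher-parameter update: an exponential moving average toward the
    mean of the last K student checkpoints. With K = 1 this reduces to the
    mean-teacher EMA of Tarvainen and Valpola (2017); the paper uses
    alpha = 0.999."""

    def __init__(self, theta_teacher, K=4, alpha=0.999):
        self.theta = np.asarray(theta_teacher, dtype=float)
        self.queue = deque(maxlen=K)  # recent student parameter snapshots
        self.alpha = alpha

    def update(self, theta_student):
        self.queue.append(np.asarray(theta_student, dtype=float))
        avg = np.mean(self.queue, axis=0)
        self.theta = self.alpha * self.theta + (1.0 - self.alpha) * avg
        return self.theta
```

Averaging over a queue before the EMA step damps single noisy student updates more aggressively than a plain EMA would.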
The first image dataset used was the 300W dataset Sagonas et al. (2013), which is a combination of five other datasets: LFPW, AFW, HELEN, XM2VTS, and IBUG. Following prior works, our training set included the training sets of LFPW and HELEN as well as the full set of AFW, for 3148 images in total. The common test subset consisted of 554 test images from LFPW and HELEN, and the challenging test subset consisted of 135 images from IBUG. The full test set was the union of the common and challenging subsets, with 689 images in total.
The second image dataset used was the AFLW dataset Köstinger et al. (2011), which consists of 25993 faces from 21997 real-world images Lv et al. (2017); Dong et al. (2018b). Following Zhu et al., the dataset was partitioned into two subsets, AFLW-Full and AFLW-Front. The two subsets share the same training set but differ in testing samples: AFLW-Full contains 4386 test samples, while AFLW-Front uses only 1165 samples from AFLW-Full as the testing set.
The video dataset used was the 300VW dataset Shen et al. (2015), which contains 50 training videos with 95192 frames. The test set consists of three subsets (A, B, and C) with 62135, 32805, and 26338 frames, respectively; subset C is the most challenging. Following Khan et al. (2017), we report results on subset C.
We selected as our training set 20 videos from the 300VW training dataset (the 1st, 2nd, 7th, 13th, 19th, 20th, 22nd, 25th, 28th, 33rd, 37th, 41st, 44th, 47th, 57th, 119th, 138th, 160th, 205th, and 225th) with different brightness, scenes, facial motion amplitudes, face scales, occlusion, and gender. We set a stride of 10 frames, meaning that only the first frame in every 10-frame sequence was annotated and the remaining 9 frames were used as unlabeled data.
For the image datasets (300W and AFLW), the Normalized Mean Error (NME) normalized by inter-pupil distance and face size, respectively, was used, while for the video dataset (300VW), the mean Area Under the Curve (AUC)@0.08 error Khan et al. (2017) was employed.
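For reference, the NME metric used for the image datasets is simply the mean point-to-point error divided by the normalizing factor:

```python
import numpy as np

def nme(pred, gt, norm):
    """Normalized Mean Error: mean Euclidean distance between predicted and
    ground-truth landmarks (both (N, 2)), divided by the normalizing factor
    (inter-pupil distance for 300W, face size for AFLW)."""
    return np.linalg.norm(pred - gt, axis=1).mean() / norm
```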
The input image was resized to 256x256. We used the Adam optimizer and trained for 140 epochs, with an initial learning rate that was decayed at the 90th and 120th epochs. The setting of the power value in Eq. (1), (6), and (7) followed Tarvainen and Valpola (2017); the climbing period ran from epoch 1 to 60, the retention period from epoch 61 to 110, and the decay period from epoch 111 to 140. The batch size was set to 8 for both the teacher and the student model, and we used random flip, random translation, random-angle rotation, and color jitter for data augmentation. All experiments were conducted on a workstation with 2.4GHz Intel Core CPUs and 4 NVIDIA GTX 1080Ti GPUs.
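The three-phase schedule for the unsupervised weight can be sketched as follows; we assume the sigmoid-shaped exponential ramp of Tarvainen and Valpola (2017), since the exact ramp shape is not spelled out here.

```python
import math

def unsup_weight(epoch, climb_end=60, hold_end=110, total=140):
    """Piecewise schedule for the unsupervised-loss weight lambda(t):
    exponential ramp-up over epochs 1-60, hold at 1.0 over 61-110, and
    exponential ramp-down over 111-140 (the ramp shape is our assumption,
    following Tarvainen and Valpola, 2017)."""
    if epoch <= climb_end:
        r = epoch / climb_end
        return math.exp(-5.0 * (1.0 - r) ** 2)
    if epoch <= hold_end:
        return 1.0
    r = (epoch - hold_end) / (total - hold_end)
    return math.exp(-5.0 * r ** 2)
```

The slow ramp-up keeps the noisy unsupervised terms from dominating early training, before the detector produces usable pseudo-labels.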
As shown in Table 1, compared with SDM Xiong and De la Torre (2013), TCDCN Zhang et al. (2014), CPM Wei et al. (2016), LAB Wu et al. (2018), Wingloss Feng et al. (2017), and PFLD1x Guo et al. (2019), our model presented a clear improvement on the 300W dataset. Similarly, on the AFLW dataset, our model also outperformed SDM Xiong and De la Torre (2013), LAB Wu et al. (2018), LBF Ren et al. (2014), CCL Zhu et al. (2016), Two-stage Lv et al. (2017), SAN Dong et al. (2018a), and DSRN Miao et al. (2018).
As shown in Table 2, compared with DGCM Khan et al. (2017), SBR Dong et al. (2018b), and TS Dong and Yang (2019), our model achieved the best accuracy on the 300VW dataset. It is worth noting that, although TS Dong and Yang (2019) also employs a teacher-student architecture, with supervision from multiple sources our model uses a simpler structure to achieve even higher detection accuracy.
To understand the effectiveness of each key component of our model, including the facial structure invariant constraint, the facial motion field deviation suppression, and the dual-model asynchronous learning strategy, we conducted several ablation experiments, reported in Table 3.
Compared with the baseline accuracy in Table 3, when the SIC module was enabled in either the single model or the teacher-student dual model, a clear performance gain was observed, which shows that by enforcing the facial structure to be consistent across consecutive frames, SIC made the CLOR-based detection results more stable and accurate.
It can also be seen from Table 3 that adding MFDS to the MFE module improved both the NME on 300W and the mean AUC@0.08 on 300VW. Therefore, using LHR-based detection results to supervise the training of the facial MFE decreases the deviation of the motion estimation. The quantitative evaluation of pseudo-label quality in Fig. 5 also verifies the effectiveness of the field deviation suppression.
In Table 3, the teacher model always performed better than the student model. This indicates that there was indeed a disturbance in the student's update after merging the CLOR-based and tracking-based signals, and that the applied recursive average filtering did improve the teacher beyond what the tracking-based signal alone provides.
To summarize, the ablations reveal that: 1) the facial structure invariant constraint benefits the landmark detection results from CLOR; 2) motion field deviation suppression successfully maintains the consistency between tracking-based and detection-based results, alleviating the field-drift problem; and 3) dual-model asynchronous learning and recursive filtering help mine high-quality pseudo-labels.
To further verify the effectiveness of the facial structure invariant constraint and motion field deviation suppression, we visualize the facial landmark detection results of several faces from the 300W test set in Fig. 5. Compared with the baseline, the two components make the contour of the facial landmarks more compact, which effectively keeps the facial landmark structure invariant and improves the detection accuracy. Combining the facial structure invariant constraint and motion field deviation suppression also helps reduce the drift of the facial motion estimation.
During training, the hyperparameter A in Eq. 5 needs to be pre-defined. A controls the fusion of the tracking-based and CLOR-based detection results when generating the student's multi-source supervision signal. Fig. 6 shows the NME on the 300W test set for different values of A under the teacher-student asynchronous learning framework, using the facial SIC to correct the CLOR-based results and MFDS to correct the tracking-based results. From it we find that: (1) the worst result comes from A = 0, where the student's supervision signal comes entirely from the CLOR-based detection results, confirming that the CLOR-based results carry a large disturbance; (2) the best result comes from A = 0.6, where the best balance between the CLOR-based and tracking-based results is achieved; and (3) when A is too large, the NME starts to increase again, which indicates that making the teacher and student supervision signals identical in the asynchronous learning framework inhibits the improvement of the teacher.
In this paper, we propose a teacher-student asynchronous learning framework for facial landmark detection, which is effective at mining pseudo-labels from unlabeled face video frames. Additionally, we propose a facial structure invariant constraint to fine-tune the Contour Landmark Offset Regression based coordinates and a motion field deviation suppression method to maintain consistency between detection-based and tracking-based landmark coordinates, which improves the performance of our model significantly.
Feng et al. Wing loss for robust facial landmark localisation with convolutional neural networks. CoRR abs/1711.06753.
Mao et al. (2009). Semi-supervised co-training and active learning based approach for multi-view intrusion detection. In Proceedings of the 2009 ACM Symposium on Applied Computing, pp. 2042-2048.
Newell et al. (2016). Stacked hourglass networks for human pose estimation. In ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), pp. 483-499.
Sun et al. Hybrid deep learning for face verification. In ICCV.