Teacher-Student Asynchronous Learning with Multi-Source Consistency for Facial Landmark Detection

by   Rongye Meng, et al.
Xi'an Jiaotong University

Due to the high annotation cost of large-scale facial landmark detection in videos, researchers have proposed a semi-supervised paradigm that uses self-training to mine high-quality pseudo-labels for training. However, self-training based methods often train with a gradually increasing number of samples, and their performance varies considerably with the number of pseudo-labeled samples added. In this paper, we propose a teacher-student asynchronous learning (TSAL) framework based on a multi-source supervision signal consistency criterion, which implicitly mines pseudo-labels through consistency constraints. Specifically, the TSAL framework contains two models with exactly the same structure. The radical student uses multi-source supervision signals from the same task to update its parameters, while the calm teacher uses a single-source supervision signal. To reasonably absorb the student's suggestions, the teacher's parameters are updated again through recursive average filtering. The experimental results show that the asynchronous-learning framework can effectively filter noise in multi-source supervision signals, thereby mining the pseudo-labels that are more significant for network parameter updating. Extensive experiments on the 300W, AFLW, and 300VW benchmarks show that the TSAL framework achieves state-of-the-art performance.




1 Introduction

Facial landmark detection is widely used as a preliminary task in face-related computer vision applications Sun et al. (2013); Liu et al. (2017); Fabian Benitez-Quiroz et al. (2016); Li et al. (2017); Blanz and Vetter (2003); Dou et al. (2017); Kittler et al. (2016); Hu et al. (2017); Thies et al. (2016). Although several public datasets with labeled facial landmarks are available Sagonas et al. (2013); Köstinger et al. (2011); Tzimiropoulos (2015); Shen et al. (2015), marking the precise location of all the landmarks in large-scale image or video collections is very time consuming, which renders fully-supervised training of Deep Neural Network (DNN) based facial landmark detectors tedious and costly Dong and Yang (2019).

Figure 2: Overview of the student and teacher’s pipeline. and are parameters of student and teacher respectively. Facial signal 1 is Contour Landmark Offset Regression-based results, and Facial signal 2 is Landmark Heatmap Regression-based results. Thus , are landmark signals from detection source. And , are landmark signals from tracking source.

The situation has motivated researchers to leverage the semi-supervised learning paradigm by making use of both labeled and unlabeled data. Many of these methods operate by propagating labeling/supervision information to unlabeled data via pseudo-labels. For example, Jeong et al. (2019) presented a self-training method that takes many iterations to predict labels for the unlabeled data, gradually increasing the number of utilized samples and retraining the model. Mao et al. (2009) and Blum and Mitchell (1998) presented multi-model collaboration frameworks that use multiple complementary models to obtain high-quality pseudo-labels and to keep the optimizer from falling into local optima. In general, due to the noise in the unlabeled data, the performance of these approaches varies considerably depending on the way the pseudo-labels are generated and utilized.

Amongst the reported methods, the self-supervision signal generated by the teacher-student model Tarvainen and Valpola (2017); Berthelot et al. (2019) has become an effective strategy for semi-supervised learning, where the output of a student model is enforced to be consistent with a teacher model for unlabeled input. By asynchronously updating the two models, the consistency constraint helps to mine high quality pseudo-labels that can bring significant improvement in semi-supervised learning tasks.

The strategy inspired us to propose a facial landmark detection model based on asynchronous learning with a consistency constraint from multi-source supervision signals. The method consists of a teacher-student model where a radical student is updated with raw multi-source signals and a calm teacher is updated with a more stable gradient. Specifically, the sources of input include, first, a set of facial landmark targets predicted through a facial Motion Field Estimation (MFE) module. Second, we use a detection method to obtain a second set of facial landmark targets through Landmark Heatmap Regression (LHR). Third, another set of facial landmark targets is obtained through face center detection and Contour Landmark Offset Regression (CLOR). With the three sets of signals from different sources, the radical student uses all three types of signals to update its parameters, while the calm teacher only uses facial MFE and LHR. To allow the teacher model to accept part of the student's suggestions, an exponential moving average strategy is additionally used to update the teacher's parameters again, before the teacher model instructs the student model. In this way, the overall framework can utilize the consistency constraint between sources and models to achieve satisfactory facial landmark detection performance. Fig. 2 shows an overview of our asynchronous-learning framework. In summary, the major contributions of the paper include:

  • Three improved sets of supervision source to train the teacher-student network model, including the facial MFE, the LHR and the CLOR;

  • A teacher-student network model with asynchronous-learning to effectively smooth the learning and to obtain improved performance as shown in our experiments.

2 Related Work

Fully-supervised Facial Landmark Detection. Fully supervised facial landmark detection can be categorized into two types according to the type of supervision signal: coordinate regression Xiong and De la Torre (2013); Cao et al. (2014) and landmark heatmap regression Wei et al. (2016); Dong et al. (2018); Newell et al. (2016). Zhou et al. leverage the idea of “object as point” to regress the anchor of an object. Based on this idea, we design the CLOR task and obtain another type of facial landmark supervision signal. These different sources of supervision signals are the objects our model operates on.

Semi-supervised Facial Landmark Detection

Semi-supervised facial landmark detection in video aims to use less annotated data to improve performance over the entire video. Most existing work focuses on how to mine pseudo labels to expand the training set.

Dong et al. use a differentiable optical flow estimation method to obtain pseudo labels of subsequent frames, contribute a way to estimate the facial motion field, and complete the whole task through a two-stage training process, which performs pseudo label mining implicitly. When fine-tuning the detector with tracking-based results in the second stage, the method tries to solve the problem raised by Khan et al. that detection results have no drift but low accuracy.

The fully supervised facial landmark detection methods mentioned above target single-frame images, but are also applicable to facial landmark detection in video. Based on these fully supervised methods, some self-training or co-training approaches simply leverage a confidence score or an unsupervised loss to mine qualified samples. Owing to the complementarity between multiple models, researchers have proposed to leverage multiple models to promote each other's performance. Classic teacher-student models Hinton et al. (2015); Lee et al. (2018) aim to let the student model fit the teacher's output. Dong and Yang (2019) use two different networks and leverage high-quality pseudo-labels, as evaluated by the teacher, to train the student. Different from these multi-model methods, our TSAL framework consists of two networks with exactly the same structure. We use supervision signals with different levels of disturbance from different sources to train the student and the teacher, and implicitly mine high-quality pseudo-labels through a mechanism of asynchronously updating network parameters.

3 Methodology

3.1 Motivation

Different from methods that regress face coordinates directly, we design a Contour Landmark Offset Regression (CLOR) task to detect facial landmarks. Inspired by Zhou et al. (2019), we regard the face as a point, and use a Gaussian template to represent the face as an isotropic 2D Gaussian distribution. We leverage another parallel branch to regress the offsets from the remaining landmarks to the center directly. However, this simple CLOR task has a large detection variance on consecutive frames. Fortunately, it can maintain a stable facial structure.

Landmark Heatmap Regression (LHR) based methods Dong et al. (2018b); Wu et al. (2018) often regress an N-channel landmark heatmap first, and then parse landmark coordinates in post-processing, where N is the number of landmarks. However, when the face is occluded, the response of the landmarks is weakened, and when we parse the heatmap into coordinates in post-processing, the facial structure is deformed. Therefore, we apply a constraint to correct the CLOR-based landmarks, and regard them as a supervision signal on unlabeled images for the LHR-based landmarks.
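The post-processing step described above can be illustrated with a minimal sketch that parses a multi-channel heatmap into coordinates by taking the peak of each channel. The function name `parse_heatmap` and the (N, H, W) array layout are assumptions made for illustration; the paper does not specify how its heatmaps are parsed.

```python
import numpy as np

def parse_heatmap(heatmap):
    """Return an (N, 2) array of (x, y) peak locations from an (N, H, W) heatmap.

    Hypothetical helper: a plain per-channel argmax, not the authors' code.
    """
    n, h, w = heatmap.shape
    # Flatten each channel and find the index of its strongest response.
    flat_idx = heatmap.reshape(n, -1).argmax(axis=1)
    # Convert flat indices back to (row, column) = (y, x).
    ys, xs = np.unravel_index(flat_idx, (h, w))
    return np.stack([xs, ys], axis=1)

# Toy heatmap with two landmark channels on a 4x5 grid.
heatmap = np.zeros((2, 4, 5))
heatmap[0, 1, 3] = 1.0   # landmark 0 peaks at x=3, y=1
heatmap[1, 2, 0] = 0.7   # landmark 1 peaks at x=0, y=2 with a weaker response
coords = parse_heatmap(heatmap)
```

Because each landmark is parsed independently, a weakened response in one channel can move that landmark without regard to the others, which is one way the facial structure can end up deformed.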

The LHR-based method and the CLOR-based method are both detection-based methods for facial landmark detection. In continuous video frames, if the mutual information between frames is fully utilized, facial landmark detection becomes more accurate. Thus we leverage the unsupervised motion field estimation method proposed in Zhu et al. (2019) to estimate the facial motion field. However, just as Khan et al. (2017) mentioned, the tracking-based method drifts, although it has high accuracy. Therefore, we try to eliminate this drift by maintaining the consistency between detection-based and tracking-based landmark detection results.

The core of improving performance of detector is to make full use of unlabeled images by mining higher-quality pseudo labels to participate in training. In the whole system, we obtain two supervision signals from different sources, one comes from detection source and the other comes from tracking source. However, noise of detection source signal is obvious. In order to smooth disturbance in supervision signal, we let the two supervision signals run on two asynchronously updated models, and use recursive average filtering to filter out the noise in supervision signal.

3.2 The Same Pipeline of Teacher and Student

The inputs of the system are the frames of a video, of which only some are labeled. The teacher and the student of the TSAL framework have exactly the same structure, but different back propagation. For ease of description, we describe the pipeline shared by the teacher and the student without distinction in this section.

We use an encoder-decoder network to perform facial LHR and facial CLOR. The decoder has three outputs: the facial landmark heatmap, the face center heatmap, and the offsets from the remaining landmarks to the center. We thus obtain two groups of facial landmark coordinates, from CLOR and LHR respectively. The motion field estimation module outputs the inter-frame motion field estimate, from which the tracking-based landmark detection results are obtained using the guide (landmark coordinates) of the first frame.
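As a rough illustration of how CLOR-based coordinates could be assembled from the decoder outputs, the sketch below takes the peak of the face center heatmap and adds the regressed offsets to it. The function name `decode_clor`, the (dx, dy) offset layout, and the assumption that the offsets have already been read out for the detected center are all hypothetical.

```python
import numpy as np

def decode_clor(center_heatmap, offsets):
    """Decode landmarks from a face center heatmap and center-relative offsets.

    center_heatmap: (H, W) array; offsets: (N-1, 2) array of (dx, dy).
    Hypothetical helper for illustration, not the authors' implementation.
    """
    h, w = center_heatmap.shape
    # Face center = location of the strongest response, as (x, y).
    cy, cx = np.unravel_index(center_heatmap.argmax(), (h, w))
    center = np.array([cx, cy], dtype=float)
    # Remaining landmarks are the center shifted by each regressed offset.
    landmarks = center + offsets
    return center, landmarks

center_heatmap = np.zeros((6, 6))
center_heatmap[2, 3] = 1.0                       # center at x=3, y=2
offsets = np.array([[-1.0, 0.0], [1.5, 2.0]])    # two remaining landmarks
center, landmarks = decode_clor(center_heatmap, offsets)
```

Since every landmark is tied to the same center, the relative layout of the landmarks does not depend on per-landmark heatmap peaks.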

Therefore, the total loss function of supervised detection and unsupervised tracking combines the following terms: the facial structure invariant constraint, which we propose to alleviate the large variance of CLOR-based results; the loss of motion field deviation suppression; the multi-channel landmark heatmap regression error of Dong et al.; the landmark coordinate regression error of Zhou et al.; and the facial motion estimation loss proposed in Zhu et al. (2019). A piecewise exponential climbing function of the training iteration helps the system gradually adapt to the unsupervised signals.

Figure 3: The facial structure invariant constraint ensures that facial landmarks are always attracted by the face center on adjacent frames. Face size is used to normalize the facial structure invariant constraint.

Facial Structure Invariant Constraint

Through LHR and CLOR, we obtain two groups of facial landmark coordinates. When the facial motion range is too large or an object occludes the face, the LHR-based coordinates deviate when they are parsed from the heatmap, which deforms the facial landmark structure. The CLOR-based coordinates are heatmap-parsing free, so their facial landmark structure is stable, but their variance across adjacent frames is large. We hope to merge the two groups reasonably to alleviate the structure deformation of the LHR-based facial landmarks in complex situations.

Therefore, we assume that on continuous frames, the remaining N-1 facial landmarks are always attracted by the continuously changing face center (as shown in Fig. 3); that is, the normalized sum of the offset moduli from the landmarks to the face center in the current frame is a constant, which lets the CLOR-based coordinates maintain the facial landmark structure while having a small variance between adjacent frames. The facial structure invariant constraint on the facial landmarks is defined as follows:


where the normalizing term is the width of the face bounding box in the i-th frame. Through coordinates , we have = - , = - . In the final test session, the results are obtained through interpolation of the CLOR-based and LHR-based coordinates, with the interpolation weight set to a fixed value in our final test.
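One possible reading of the constraint can be sketched as follows: compute, for each frame, the sum of landmark-to-center distances divided by the face width, and penalize the change of this quantity between adjacent frames. The function names and the use of an absolute difference as the penalty are assumptions; the exact loss used in the paper may differ.

```python
import numpy as np

def normalized_offset_sum(landmarks, center, face_width):
    """Sum of landmark-to-center distances, normalized by the face width."""
    offsets = landmarks - center
    return np.linalg.norm(offsets, axis=1).sum() / face_width

def structure_invariance_penalty(lm_a, c_a, w_a, lm_b, c_b, w_b):
    """Assumed penalty: absolute change of the normalized offset sum
    between two adjacent frames (illustrative, not the authors' formula)."""
    return abs(normalized_offset_sum(lm_a, c_a, w_a)
               - normalized_offset_sum(lm_b, c_b, w_b))

# Frame A: two landmarks around a center at the origin, face width 10.
lm_a, c_a, w_a = np.array([[3.0, 4.0], [0.0, 5.0]]), np.array([0.0, 0.0]), 10.0
# Frame B: the same face, twice as large and shifted by (1, 1).
lm_b, c_b, w_b = np.array([[7.0, 9.0], [1.0, 11.0]]), np.array([1.0, 1.0]), 20.0
# Frame C: same size as A, but one landmark has drifted away from the center.
lm_c, c_c, w_c = np.array([[3.0, 4.0], [0.0, 10.0]]), np.array([0.0, 0.0]), 10.0

same_structure = structure_invariance_penalty(lm_a, c_a, w_a, lm_b, c_b, w_b)
deformed = structure_invariance_penalty(lm_a, c_a, w_a, lm_c, c_c, w_c)
```

In this toy case the rescaled and shifted face incurs no penalty, while the frame with a drifted landmark does, which matches the intuition that normalizing by face size makes the quantity depend on structure rather than scale.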

Motion Field Deviation Suppression

When the landmarks obtained by LHR are used as a facial motion field guide to generate the facial landmarks of subsequent frames, the inaccuracy of the motion field estimated by the facial motion field estimation module makes the tracking-based results inconsistent with the detection-based results, which is the reason why the facial landmarks of subsequent frames predicted by the tracking-based method drift.

To address this, we propose motion field deviation suppression to maintain the consistency between the facial landmark detection results obtained by the tracking-based method and the detection-based method. Motion field deviation suppression updates the parameters of the facial motion field estimation module, and the motion field estimation deviation constraint loss is calculated by:


Specifically, we represent the same pipeline loss of student and teacher as and respectively.

3.3 Different Pipelines of Teacher and Student

Through the shared pipeline described above, the student and the teacher each obtain two groups of fine-tuned supervision signals from different sources. As two types of pseudo-labels for unlabeled images, these signals have different levels of disturbance. To balance the two supervision signals, we use asynchronous learning of the two models together with recursive average filtering to filter out the noise in the supervision signal and to effectively ensemble the two signals when updating network parameters, which is what allows more significant pseudo-labels to be mined.

The outputs of the motion field estimation module in the teacher and the student are their respective facial motion fields, from which we obtain the tracking-based landmark coordinates of the teacher and the student. The other supervision signals of the teacher and the student are their CLOR-based landmark coordinates. The teacher merely uses the stable tracking-based signal to update its parameters; that is, the final supervision signal for its LHR-based output is its tracking-based landmark coordinates. The loss function of the teacher using pseudo-heatmap supervision can then be defined as:


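The supervision signal of the radical student, in contrast, is also affected by its CLOR-based landmark coordinates. We use linear interpolation to merge the student's tracking-based and CLOR-based coordinates, and the interpolated coordinates are the final supervision signal for its LHR-based output. The loss function of the student using pseudo-heatmap supervision can then be defined as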


Therefore, from the same pipeline and different pipelines mentioned above, we express total loss of calm teacher and radical student as follows:


where is a function of training iteration mentioned in the same pipeline above, used to control the proportion of unsupervised signals.
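The difference between the two supervision signals can be sketched in a few lines. The names `teacher_target`, `student_target`, and `weight` are hypothetical; the convention that a weight of 0 yields a purely CLOR-based student signal follows the discussion in Section 4.6.

```python
import numpy as np

def teacher_target(tracking_coords):
    """The calm teacher is supervised by the tracking-based coordinates only."""
    return tracking_coords

def student_target(tracking_coords, clor_coords, weight):
    """The radical student is supervised by a linear interpolation of the
    tracking-based and CLOR-based coordinates (weight=0 means CLOR only)."""
    return weight * tracking_coords + (1.0 - weight) * clor_coords

tracking = np.array([[10.0, 10.0]])   # toy tracking-based landmark
clor = np.array([[20.0, 0.0]])        # toy CLOR-based landmark

t_target = teacher_target(tracking)
s_target = student_target(tracking, clor, weight=0.6)
s_clor_only = student_target(tracking, clor, weight=0.0)
```

In the pipeline described above these coordinate targets would then be turned into pseudo-heatmaps for the LHR branch; that rendering step is omitted from the sketch.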

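Input: the teacher's parameters at the current step and the student parameter queue.
Output: the teacher's parameters at the next step.
1 for each parameter of the teacher do
2       set the accumulated student parameter to 0
3       for each student in the queue do
4             add the student's parameter to the accumulated value
5       divide the accumulated value by the queue length and use the resulting average to update the teacher's parameter
6 return the updated teacher;
Algorithm 1 Asynchronous-learning model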

3.4 Parameter Updating

The student and the teacher asynchronously update their network parameters with different supervision signals from different sources in the same task, through their respective losses. Since the pseudo-label supervision signal used by the student carries more disturbance, we use a recursive average filtering method to smooth the noise in the student, so that the teacher accepts the student's suggestions reasonably.

Specifically, we maintain a queue of length to store recent iteration student’s parameters, denoted as ,, …, . And we define at training step as the teacher of successive weights:



Here the smoothing coefficient is a hyperparameter, and the averaging operator computes the mean of the students' parameters in the queue. Tarvainen and Valpola (2017) is a special case in which our queue length is 1. When we set , following Tarvainen and Valpola (2017), we let the smoothing coefficient be 0.999. Such a recursive average filtering method helps the teacher absorb suggestions from the students and improves the teacher stably.
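A minimal sketch of this update is given below, assuming the teacher is blended with the queue average in the same way the mean teacher of Tarvainen and Valpola (2017) is blended with a single student. The function name `update_teacher`, the dictionary representation of parameters, and the exact blending form are assumptions for illustration.

```python
from collections import deque
import numpy as np

def update_teacher(teacher, student_queue, alpha):
    """Blend teacher parameters with the mean of recent student parameters.

    teacher: dict mapping parameter names to arrays.
    student_queue: iterable of dicts with the same keys (recent students).
    alpha: smoothing coefficient in [0, 1]; larger keeps more of the teacher.
    Assumed form: teacher <- alpha * teacher + (1 - alpha) * mean(students).
    """
    new_teacher = {}
    for name, t_param in teacher.items():
        # Average this parameter over all students currently in the queue.
        mean_student = np.mean([s[name] for s in student_queue], axis=0)
        new_teacher[name] = alpha * t_param + (1.0 - alpha) * mean_student
    return new_teacher

# Toy example with a queue of length 2 and a single parameter "w".
queue = deque(maxlen=2)
queue.append({"w": np.array([1.0, 3.0])})
queue.append({"w": np.array([3.0, 5.0])})
teacher = {"w": np.array([0.0, 0.0])}
teacher = update_teacher(teacher, queue, alpha=0.5)
```

With a queue of length 1 this reduces to the usual exponential moving average of a single student, which is consistent with the special case noted above. The value 0.5 is used only to keep the toy arithmetic readable; the paper reports 0.999.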

4 Experiments

4.1 Dataset


The first image dataset used was the 300W dataset Sagonas et al. (2013), which is a combination of five other datasets: LFPW, AFW, HELEN, XM2VTS, and IBUG. Following prior works, our training set included the training sets of LFPW and HELEN as well as the full set of AFW, 3148 images in total. The common test subset consisted of 554 test images from LFPW and HELEN, and the challenging test subset consisted of 135 images from IBUG. The full test set was the union of the common and the challenging subsets, with 689 images in total.

The second image dataset used was the AFLW dataset Köstinger et al. (2011) that consists of 25993 faces from 21997 real-world images Lv et al. (2017); Dong et al. (2018b). Following Zhu et al., the dataset was partitioned into two different subsets, AFLW-Full and AFLW-Front respectively. The two subsets have the same training set, but with different testing samples: AFLW-Full contains 4386 test samples, while AFLW-Front only uses 1165 samples from AFLW-Full as testing set.


The video dataset used was the 300VW dataset Shen et al. (2015), which contains 50 training videos with 95192 frames. The test set consists of three subsets (A, B and C) with 62135, 32805 and 26338 frames, respectively, and subset C is the most challenging one. Following Khan et al. (2017), we report the results on subset C.

4.2 Setup


We selected as our training set 20 videos from the 300VW training dataset (1st, 2nd, 7th, 13th, 19th, 20th, 22nd, 25th, 28th, 33rd, 37th, 41st, 44th, 47th, 57th, 119th, 138th, 160th, 205th and 225th) with different brightness, scenes, facial motion amplitude, face scale, occlusion, and gender. We set a stride of 10 frames, which means that only the first frame in every 10-frame sequence was annotated, and the remaining 9 frames were used as unlabeled data.


For the image datasets (300W and AFLW), the Normalized Mean Error (NME) normalized by inter-pupil distance and face size was used respectively, while for the video dataset (300VW), the mean Area Under the Curve (AUC)@0.08 error Khan et al. (2017) was employed.
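For reference, the NME of a single face can be computed as in the sketch below: the mean Euclidean distance between predicted and ground-truth landmarks, divided by a normalizing distance. The function name `nme` and the toy values are illustrative; the choice of normalizer (inter-pupil distance or face size) follows the dataset convention described above.

```python
import numpy as np

def nme(pred, gt, norm):
    """Normalized Mean Error for one face.

    pred, gt: (N, 2) arrays of landmark coordinates.
    norm: normalizing distance, e.g. inter-pupil distance or face size.
    """
    per_landmark_error = np.linalg.norm(pred - gt, axis=1)
    return per_landmark_error.mean() / norm

pred = np.array([[0.0, 0.0], [3.0, 4.0]])
gt = np.array([[0.0, 0.0], [0.0, 0.0]])
value = nme(pred, gt, norm=50.0)   # per-landmark errors are 0 and 5
```

Benchmarks commonly report this value as a percentage, averaged over all test faces.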


The input image was resized to 256x256. We used the Adam optimizer for training with 140 epochs, with an initial learning rate of

, decayed by and in the 90th and 120th epochs. The setting of the power value in Eq. (1),(6) and (7) followed Tarvainen and Valpola (2017), and the climbing period was from 1 to 60 epochs, the retention period was from 61 to 110 epochs, and the decay period was from 111 to 140 epochs. The batch size was set to 8 for both the teacher and the student model, and we used random flip, random translation, random angle rotate, and color jitter for data augmentation. All our experiments were conducted on a workstation with 2.4GHz Intel Core CPUs and 4 NVIDIA GTX 1080Ti GPUs.
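The three-phase weighting of the unsupervised signals can be sketched as below. The climbing phase uses the exponential ramp-up form of Tarvainen and Valpola (2017) with a power value of -5; the mirrored shape of the decay phase, the function name `unsupervised_weight`, and the peak value of 1 are assumptions, since the schedule is only described by its three periods.

```python
import math

def unsupervised_weight(epoch, climb_end=60, hold_end=110, total=140, power=-5.0):
    """Piecewise weight for unsupervised losses: climb, hold, then decay.

    Illustrative sketch only; the decay shape is assumed to mirror the climb.
    """
    if epoch <= climb_end:
        # Climbing period: exponential ramp-up from near 0 to 1.
        t = epoch / climb_end
        return math.exp(power * (1.0 - t) ** 2)
    if epoch <= hold_end:
        # Retention period: keep the full weight.
        return 1.0
    # Decay period: assumed mirrored exponential ramp-down.
    t = (epoch - hold_end) / (total - hold_end)
    return math.exp(power * t ** 2)

schedule = [unsupervised_weight(e) for e in (1, 30, 60, 100, 125, 140)]
```

A schedule of this kind keeps the noisy pseudo-label terms small while the detector is still poorly trained and gives them full weight only once the supervised terms have stabilized.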

4.3 Comparison with State-of-the-Art (Sota)

Results on 300W and AFLW

As shown in Table 1, compared with SDM Xiong and De la Torre (2013), TCDCN Zhang et al. (2014), CPM Wei et al. (2016), LAB Wu et al. (2018), Wing loss Feng et al. (2017) and PFLD 1x Guo et al. (2019), our model shows a clear improvement on the 300W dataset. Similarly, on the AFLW dataset, our model also outperforms SDM Xiong and De la Torre (2013), LAB Wu et al. (2018), LBF Ren et al. (2014), CCL Zhu et al. (2016), Two stage Lv et al. (2017), SAN Dong et al. (2018a) and DSRN Miao et al. (2018).

Results on 300VW

As shown in Table 2, compared with DGCM Khan et al. (2017), SBR Dong et al. (2018b) and TS Dong and Yang (2019), our model achieved the best accuracy on the 300VW dataset. It is worth noting that, although TS Dong and Yang (2019) also employs the teacher-student architecture, our model, with supervision from multiple sources, uses a simpler structure to achieve even higher detection accuracy.

Method      300W                       AFLW
            Common  Challenge  Full    Front  Full
SDM         5.57    15.40      7.52    2.94   4.05
LBF         4.95    11.98      6.32    2.74   4.25
TCDCN       4.80    8.60       5.54    -      -
LAB         3.42    6.98       4.12    1.62   1.85
CPM         3.39    8.14       4.36    -      -
Wing loss   3.27    7.18       4.04    -      1.65
PFLD 1x     3.32    6.56       3.95    -      1.88
CCL         -       -          -       2.17   2.72
Two stage   -       -          -       -      2.17
SAN         -       -          -       1.85   1.91
DSRN        -       -          -       -      1.86
Ours        3.13    6.02       3.69    1.47   1.65
Table 1: Comparison of NME with the state-of-the-art methods on 300W and AFLW datasets.
Method            DGCM   SBR*   TS     Ours
AUC@0.08 error    59.38  59.39  59.65  59.92
Table 2: Comparisons of mean AUC@0.08 error with the state-of-the-art methods on 300VW dataset.

4.4 Ablation Study

To understand the effectiveness of each key component in our model, including the facial structure invariant constraint, the facial motion field deviation suppression, and the dual-model asynchronous learning strategy, we conducted several ablation experiments as reported in Table 3.

Model         SIC  MFDS  Asynchronous learning  300W                       AFLW         300VW
                                                Common  Challenge  Full    Front  Full
baseline                                        3.73    6.92       4.35    1.98   1.66  55.14
single model                                    3.50    6.63       4.11    1.81   1.57  56.76
                                                3.58    6.72       4.19    1.84   1.57  55.79
                                                3.39    6.44       3.98    1.76   1.53  57.58
Student                                         3.62    6.67       4.21    1.95   1.62  56.21
Teacher                                         3.51    6.52       4.09    1.84   1.57  57.44
Student                                         3.36    6.28       3.93    1.78   1.50  57.97
Teacher                                         3.23    6.19       3.80    1.66   1.46  57.12
Student                                         3.51    6.45       4.08    1.83   1.51  56.94
Teacher                                         3.34    6.32       3.92    1.71   1.49  58.11
Student                                         3.24    6.12       3.79    1.73   1.50  58.89
Teacher                                         3.13    6.02       3.69    1.65   1.47  59.92
Table 3: NME scores with respect to on/off between the facial structure invariant constraint (SIC), the motion field deviation suppression (MFDS) and the teacher-student asynchronous-learning strategy on 300W, AFLW and 300VW datasets.

The Facial Structure Invariant Constraint:

Compared with the baseline model accuracy in Table 3, when the SIC module was on in either the single model or the student-teacher dual-model, a clear performance gain can be observed, which shows that by enforcing the facial structure to be consistent in consecutive frames, SIC made the CLOR-based detection results more stable and accurate.

The Facial Motion Field Deviation Suppression:

It can also be seen from Table 3 that, by adding MFDS to the MFE module, the NME score on 300W and the mean AUC@0.08 error on 300VW were improved. Therefore, using LHR-based detection results to supervise the training of facial MFE can decrease the deviation of motion estimation. The quantitative evaluation of pseudo-label quality in Fig. 5 also verifies the effectiveness of motion field deviation suppression.

Asynchronous-learning Strategy

In Table 3, the performance of the teacher model was always better than that of the student model. This indicates that there was indeed a disturbance in updating the student model after merging the CLOR-based signal and the tracking-based signal, and that the applied recursive average filtering did improve the quality of the teacher model compared with using only the tracking-based signal.

To summarize, the ablation study reveals that: 1) the facial structure invariant constraint helps improve the landmark detection results from CLOR; 2) motion field deviation suppression successfully maintains the consistency between tracking-based and detection-based detection results, alleviating the problem of field drift; and 3) dual-model asynchronous learning together with the recursive filtering method helps mine high-quality pseudo-labels.

4.5 Qualitative Analysis

To further verify the effectiveness of the facial structure invariant constraint and motion field deviation suppression, we visualize facial landmark detection results for several faces from the 300W test datasets in Fig. 4. Compared with the baseline, the facial structure invariant constraint and motion field deviation suppression make the contour of the facial landmarks more compact, which effectively keeps the facial landmark structure invariant and improves the detection accuracy of the landmarks. Combining the facial structure invariant constraint and motion field deviation suppression helps reduce the drift of facial motion estimation.

Figure 4: Qualitative results on several faces in 300W challenge dataset by teacher of our model.
Figure 5: Mean AUC@0.08 error of pseudo-labels estimated by motion field estimation (MFE) module on the unlabeled training samples in the 20 videos used in training.
Figure 6: Analysis of hyper-parameter on 300W test datasets.

4.6 Discussion

During training, we have a hyper-parameter in Eq. 5 that needs to be pre-defined. It controls the fusion of tracking-based and CLOR-based detection results to generate the multi-source supervision signals of the student. Fig. 6 shows the NME on the 300W test datasets for different values of this hyper-parameter under the teacher-student asynchronous learning framework, using the facial SIC to correct CLOR-based detection results and MFDS to correct tracking-based detection results, from which we find that: (1) The worst result comes from a value of 0. In this case, the student's supervision signal comes entirely from CLOR-based detection results, which means that there is a large disturbance in the CLOR-based detection results. (2) The best result comes from a value of 0.6, where the best balance is achieved between CLOR-based and tracking-based detection results. (3) When the value is too large, NME starts to increase instead, which indicates that making the teacher's and the student's supervision signals identical in the asynchronous learning framework inhibits the improvement of the teacher.

5 Conclusion

In this paper, we propose a teacher-student asynchronous learning framework for facial landmark detection, which is effective at mining pseudo-labels from unlabeled face video frames. Additionally, we propose a facial structure invariant constraint to fine-tune Contour Landmark Offset Regression-based coordinates and a motion field deviation suppression method to maintain the consistency between detection-based and tracking-based landmark coordinates, which significantly improves the performance of our model.


  • D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. A. Raffel (2019) MixMatch: a holistic approach to semi-supervised learning. In NIPS 32, pp. 5049–5059. Cited by: §1.
  • V. Blanz and T. Vetter (2003) Face recognition based on fitting a 3d morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence 25 (9), pp. 1063–1074. Cited by: §1.
  • Blum and Mitchell (1998) Combining labeled and unlabeled data with co-training. New York, NY. Cited by: §1.
  • X. Cao, Y. Wei, F. Wen, and J. Sun (2014) Face alignment by explicit shape regression. International Journal of Computer Vision 107 (2), pp. 177–190. Cited by: §2.
  • X. Dong, Y. Yan, W. Ouyang, and Y. Yang (2018) Style aggregated network for facial landmark detection. In Proceedings of CVPR, Cited by: §2.
  • X. Dong, Y. Yan, W. Ouyang, and Y. Yang (2018a) Style aggregated network for facial landmark detection. External Links: 1803.04108 Cited by: §4.3.
  • X. Dong and Y. Yang (2019) Teacher supervises students how to learn from partially labeled images for facial landmark detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §1, §2, §4.3.
  • X. Dong, S. Yu, X. Weng, S. Wei, Y. Yang, and Y. Sheikh (2018b) Supervision-by-Registration: an unsupervised approach to improve the precision of facial landmark detectors. In Proceedings of CVPR, pp. 360–368. Cited by: §2, §3.1, §3.2, §4.1, §4.3.
  • P. Dou, S. K. Shah, and I. A. Kakadiaris (2017) End-to-end 3d face reconstruction with deep neural networks. In Proceedings of CVPR, Cited by: §1.
  • C. Fabian Benitez-Quiroz, R. Srinivasan, and A. M. Martinez (2016) EmotioNet: an accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In Proceedings of CVPR, Cited by: §1.
  • Z. Feng, J. Kittler, M. Awais, P. Huber, and X. Wu (2017) Wing loss for robust facial landmark localisation with convolutional neural networks. CoRR abs/1711.06753. External Links: 1711.06753 Cited by: §4.3.
  • X. Guo, S. Li, J. Yu, J. Zhang, J. Ma, L. Ma, W. Liu, and H. Ling (2019) PFLD: a practical facial landmark detector. External Links: 1902.10859 Cited by: §4.3.
  • G. Hinton, O. Vinyals, and J. Dean (2015) Distilling the knowledge in a neural network. External Links: 1503.02531 Cited by: §2.
  • G. Hu, F. Yan, J. Kittler, W. Christmas, C. H. Chan, Z. Feng, and P. Huber (2017) Efficient 3d morphable face model fitting. Pattern Recognition 67, pp. 366 – 379. Cited by: §1.
  • J. Jeong, S. Lee, J. Kim, and N. Kwak (2019) Consistency-based semi-supervised learning for object detection. In NIPS 32, pp. 10759–10768. Cited by: §1.
  • M. H. Khan, J. McDonagh, and G. Tzimiropoulos (2017) Synergy between face alignment and tracking via discriminative global consensus optimization. In 2017 ICCV, Vol. , pp. 3811–3819. Cited by: §2, §3.1, §4.1, §4.2, §4.3.
  • J. Kittler, P. Huber, Z. Feng, G. Hu, and W. Christmas (2016) 3D morphable face models and their applications. In Articulated Motion and Deformable Objects, F. J. Perales and J. Kittler (Eds.), pp. 185–206. Cited by: §1.
  • M. Köstinger, P. Wohlhart, P. M. Roth, and H. Bischof (2011) Annotated facial landmarks in the wild: a large-scale, real-world database for facial landmark localization. In 2011 ICCV Workshops, Vol. , pp. 2144–2151. Cited by: §1, §4.1.
  • H. J. Lee, W. J. Baddar, H. G. Kim, S. T. Kim, and Y. M. Ro (2018) Teacher and student joint learning for compact facial landmark detection network. In MultiMedia Modeling, K. Schoeffmann, T. H. Chalidabhongse, C. W. Ngo, S. Aramvith, N. E. O’Connor, Y. Ho, M. Gabbouj, and A. Elgammal (Eds.), pp. 493–504. Cited by: §2.
  • S. Li, W. Deng, and J. Du (2017) Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In Proceedings of CVPR, Cited by: §1.
  • W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song (2017) SphereFace: deep hypersphere embedding for face recognition. In Proceedings of CVPR, Cited by: §1.
  • J. Lv, X. Shao, J. Xing, C. Cheng, and X. Zhou (2017) A deep regression architecture with two-stage re-initialization for high performance facial landmark detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. , pp. 3691–3700. Cited by: §4.1.
  • J. Lv, X. Shao, J. Xing, C. Cheng, and X. Zhou (2017) A deep regression architecture with two-stage re-initialization for high performance facial landmark detection. In Proceedings of CVPR, Cited by: §4.3.
  • C. Mao, H. Lee, D. Parikh, T. Chen, and S. Huang (2009) Semi-supervised co-training and active learning based approach for multi-view intrusion detection. In Proceedings of the 2009 ACM Symposium on Applied Computing, pp. 2042–2048. Cited by: §1.
  • X. Miao, X. Zhen, X. Liu, C. Deng, V. Athitsos, and H. Huang (2018) Direct shape regression networks for end-to-end face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.3.
  • A. Newell, K. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. In ECCV 2016, B. Leibe, J. Matas, N. Sebe, and M. Welling (Eds.), pp. 483–499. Cited by: §2.
  • S. Ren, X. Cao, Y. Wei, and J. Sun (2014) Face alignment at 3000 fps via regressing local binary features. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, Vol. , pp. 1685–1692. Cited by: §4.3.
  • C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic (2013) 300 faces in-the-wild challenge: the first facial landmark localization challenge. In ICCV, Cited by: §1, §4.1.
  • J. Shen, S. Zafeiriou, G. G. Chrysos, J. Kossaifi, G. Tzimiropoulos, and M. Pantic (2015) The first facial landmark tracking in-the-wild challenge: benchmark and results. In ICCV Workshops, Cited by: §1, §4.1.
  • Y. Sun, X. Wang, and X. Tang (2013) Hybrid deep learning for face verification. In ICCV, Cited by: §1.
  • A. Tarvainen and H. Valpola (2017) Mean teachers are better role models: weight-averaged consistency targets improve semi-supervised deep learning results. In NIPS 30, pp. 1195–1204. Cited by: §1, §3.4, §4.2.
  • J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Niessner (2016) Face2Face: real-time face capture and reenactment of rgb videos. In Proceedings of CVPR, Cited by: §1.
  • G. Tzimiropoulos (2015) Project-out cascaded regression with an application to face alignment. In Proceedings of CVPR, Cited by: §1.
  • S. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh (2016) Convolutional pose machines. In Proceedings of CVPR, Cited by: §2, §4.3.
  • W. Wu, C. Qian, S. Yang, Q. Wang, Y. Cai, and Q. Zhou (2018) Look at boundary: a boundary-aware face alignment algorithm. In CVPR, Cited by: §3.1, §4.3.
  • X. Xiong and F. De la Torre (2013) Supervised descent method and its applications to face alignment. In Proceedings of CVPR, Cited by: §2, §4.3.
  • Z. Zhang, P. Luo, C. C. Loy, and X. Tang (2014) Facial landmark detection by deep multi-task learning. In ECCV 2014, D. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars (Eds.), pp. 94–108. Cited by: §4.3.
  • X. Zhou, D. Wang, and P. Krähenbühl (2019) Objects as points. In arXiv preprint arXiv:1904.07850, Cited by: §2, §3.1, §3.2.
  • S. Zhu, C. Li, C. Loy, and X. Tang (2016) Unconstrained face alignment via cascaded compositional learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §4.1, §4.3.
  • Y. Zhu, Z. Lan, S. Newsam, and A. Hauptmann (2019) Hidden two-stream convolutional networks for action recognition. In Computer Vision – ACCV 2018, C. V. Jawahar, H. Li, G. Mori, and K. Schindler (Eds.), pp. 363–378. Cited by: §3.1, §3.2.