Several landmark detection algorithms, implemented in PyTorch.
Facial landmark detection aims to localize the anatomically defined points of human faces. In this paper, we study facial landmark detection from partially labeled facial images. A typical approach is to (1) train a detector on the labeled images; (2) generate new training samples using this detector's predictions as pseudo labels of unlabeled images; (3) retrain the detector on the labeled samples and a subset of the pseudo labeled samples. In this way, the detector can learn from both labeled and unlabeled data to become robust. We propose an interaction mechanism between a teacher and two students to generate more reliable pseudo labels for unlabeled data, which are beneficial to semi-supervised facial landmark detection. Specifically, the two students are instantiated as dual detectors. The teacher learns to judge the quality of the pseudo labels generated by the students and filters out unqualified samples before the retraining stage. In this way, the student detectors get feedback from their teacher and are retrained on premium data generated by themselves. Since the two students are trained on different samples, a combination of their predictions is more robust as the final prediction than either prediction alone. Extensive experiments on the 300-W and AFLW benchmarks show that the interactions between teacher and students contribute to better utilization of the unlabeled data and achieve state-of-the-art performance.
Facial landmark detection aims to find some pre-defined anatomical keypoints of human faces [44, 27, 43, 37]. These keypoints include the corners of the mouth, the boundary of the eyes, the tip of the nose, etc. [36, 35, 21]. It is usually a prerequisite of a large number of computer vision tasks [26, 39, 3]. For example, facial landmark coordinates are required to align faces to ease the visualization for users when people would like to sort their faces by time and see the changes over time. Other examples include face morphing, face replacement, etc.
The main challenge in the recent landmark detection literature is how to obtain abundant facial landmark labels. The annotation challenge comes from two perspectives. First, a large number of keypoints are required for a single face image, e.g., 68 keypoints for each face in the 300-W dataset. To precisely depict the facial features for a whole dataset, millions of keypoints are usually required. Second, there is a semantic gap between annotators. There is no universal standard for the annotation of the keypoints, so different annotators give different positions for the same keypoints. A typical way to reduce such semantic deviations among annotators is to merge the labels from several annotators, which further increases the cost of the whole annotation work.
Typical approaches for semi-supervised learning use self-training or similar paradigms to utilize the unlabeled samples. For example, the authors of [23, 17, 28] adopt a heuristic unsupervised criterion to select the pseudo labeled data for the retraining procedure. This criterion is the loss of each pseudo labeled sample, where its predicted pseudo label is treated as the ground truth to calculate the loss [17, 28]. Since no extra supervision is given to train the criterion function, this unsupervised loss criterion has a high probability of passing inaccurate pseudo labeled data to the retraining stage. These inaccurate data then mislead the optimization of the detector and make it more likely to get trapped in a local minimum. A straightforward solution to this problem is to use multiple models that regularize each other through the co-training strategy. Unfortunately, even if co-training performs well in simple tasks such as classification [4, 28], in more complex scenarios such as detection, co-training requires an extremely sophisticated design and careful tuning of many additional hyper-parameters, e.g., more than 10 hyper-parameters for three models.
To better utilize the pseudo labeled data as well as avoid the complicated model tuning for landmark detection, we propose Teacher Supervises StudentS (TS). As illustrated in Figure 1, TS is an interaction mechanism between one teacher network and two (or multiple) student networks. Two student detection networks learn to generate pseudo labels for unlabeled images. The teacher network learns to judge the quality of the pseudo labels generated from students. Consequently, the teacher can select qualified pseudo labeled samples and use them to retrain the students. TS applies these steps in an iterative manner, where students gradually become more robust, and the teacher is adaptively updated with the improved students. Besides, two students can also encourage each other to advance their performances in two ways. First, predictions from two students can be ensembled to further improve the quality of pseudo labels. Second, two students can regularize each other by training on different samples. The interactions between the teacher and students as well as the students themselves help to provide more accurate pseudo labeled samples for retraining and the model does not need careful hyper-parameter tuning.
To highlight our contribution, we propose an easy-to-train interaction mechanism between teacher and students (TS) to provide more reliable pseudo labeled samples in semi-supervised facial landmark detection. To validate the performance of our TS, we do experiments on 300-W, 300-VW, and AFLW benchmarks. TS achieves state-of-the-art semi-supervised performance on all three benchmarks. In addition, using only 30% labels, our TS achieves competitive results compared to supervised methods using all labels on 300-W and AFLW.
We first introduce some supervised facial landmark algorithms in Section 2.1. Then, we compare our algorithm with semi-supervised learning algorithms and semi-supervised facial landmark algorithms in Section 2.2. Lastly, we explain our algorithm from a meta-learning perspective in Section 2.3.
Supervised facial landmark detection algorithms can be categorized into linear regression based methods [44, 7] and heatmap regression based methods [41, 11, 9, 30]. Linear regression based methods learn a function that maps the input face image to the normalized landmark coordinates [44, 7]. Heatmap regression based methods produce one heatmap for each landmark, where the coordinate is the location of the highest response on this heatmap [41, 11, 9, 30, 5]. All of the above algorithms can be readily integrated into our framework, serving as different student detectors.
These supervised algorithms require a large amount of data to train deep neural networks. However, it is tedious to annotate the precise facial landmarks, which need to average different annotations from multiple different annotators. Therefore, to reduce the annotation cost, it is necessary to investigate the semi-supervised facial landmark detection.
Some early semi-supervised learning algorithms have difficulty handling large-scale datasets due to their high complexity. Others exploit pseudo labels of unlabeled data in the semi-supervised scenario [1, 2, 23, 28]. Since most of these algorithms studied their effect on small-scale datasets [8, 1, 23, 28], a question remains open: can they be used to improve large-scale semi-supervised landmark detection? In addition, those self-training or co-training approaches [23, 28, 12] simply leverage the confidence score or an unsupervised loss to select qualified samples. For example, Dong et al. proposed a model communication mechanism to select reliable pseudo labeled samples based on loss and score. However, such a selection criterion does not reflect the real quality of a pseudo labeled sample. In contrast, our teacher directly learns to model the quality, and the selected samples are thus more reliable.
Only a few researchers have studied semi-supervised facial landmark detection algorithms. A recent work presented two techniques to improve landmark localization from partially annotated face images. The first technique is to jointly train the facial landmark network with an attribute network, which predicts the emotion, head pose, etc. In this multi-task framework, the gradient from the attribute network can benefit the landmark prediction. The second technique is a kind of supervision without the need for manual labels, which encourages the landmark prediction to be invariant to transformations. Compared to using the supervision from transformation, our approach leverages a progressive paradigm to learn facial shape information from unlabeled data. Our approach is therefore orthogonal to that work, and these two techniques can complement our approach to further boost the performance.
Radosavovic et al. applied data augmentation to improve the quality of generated pseudo landmark labels. For an unlabeled image, they ensemble predictions from multiple transformations, such as flipping and rotation. This strategy can also be used to improve the accuracy of our pseudo labels and complement our approach. Since data augmentation is not the focus of this paper, we did not apply their algorithm in our approach. Dong et al. proposed a self-supervised loss that exploits the temporal consistency of unlabeled videos to enhance the detector. This is a video-based approach and not the focus of our work. Therefore, we do not further discuss those video-based approaches [20, 11].
From a meta-learning perspective, our TS trains a teacher network to learn which pseudo labeled samples are helpful for training the student detectors. In this sense, our work is related to some recent literature on “learning to learn” [25, 33, 13, 45]. For example, Ren et al. learn to re-weight samples based on gradients of a model on the clean validation set. Xu et al. suggest using meta-learning to tune the optimization schedule of alternative optimization problems. Jiang et al. propose an architecture to learn a data-driven curriculum on corrupted labels. Fan et al. leverage reinforcement learning to learn a policy to select good training samples for a single student model. These algorithms are designed for supervised scenarios and cannot easily be adapted to the semi-supervised scenario.
Difference from other teacher-student frameworks and generative adversarial networks (GAN). Our TS learns to utilize the output (pseudo labels) of the student model, as qualified by the teacher model, to do semi-supervised learning. Other teacher-student methods [38, 15, 10, 24] aim to fit the output of the student model to that of the teacher model. The student and teacher in our work do similar jobs to the generator and discriminator in GAN, but we aim to predict/generate qualified pseudo labels in semi-supervised learning using a different training strategy.
In this section, we will first introduce the scenario of the semi-supervised facial landmark detection in Section 3.1. We explain how to design our student detectors and the teacher network in Section 3.2. Lastly, we demonstrate our overall algorithm in Section 3.3.
We introduce some necessary notations for the presentation of the proposed method. Let $\mathcal{L} = \{(x_i, y_i) \mid 1 \le i \le n\}$ be the labeled data in the training set and $\mathcal{U} = \{x_j \mid 1 \le j \le m\}$ be the unlabeled data in the training set, where $x_i$ denotes the $i$-th image, and $y_i \in \mathbb{R}^{2 \times K}$ denotes the ground-truth landmark label of $x_i$. $K$ is the number of facial landmarks, and the $k$-th column of $y_i$ indicates the coordinate of the $k$-th landmark. $n$ and $m$ denote the number of labeled data and unlabeled data, respectively. Semi-supervised facial landmark detection aims to learn robust detectors from both $\mathcal{L}$ and $\mathcal{U}$.
The Student Detectors. We choose the convolutional pose machine (CPM) and stacked hourglass (HG) models as our student detectors. These two landmark detection architectures are the cornerstone of many facial landmark detection algorithms [30, 9, 6, 37]. Moreover, their architectures are quite different, and can thus complement each other to achieve a better detection performance compared to using two similar neural architectures. Therefore, we integrate these two detectors in our TS approach. In this paragraph, we give a brief overview of these two facial landmark detectors. We illustrate the structures of CPM and HG in Figure 2. Both CPM and HG are heatmap regression based methods and utilize a cascaded structure. Formally, suppose there are $N$ convolutional stages in CPM; the output of CPM is:

$$f_1(x_i \mid w_1) = \{H_i^1, H_i^2, \dots, H_i^N\}, \tag{1}$$

where $f_1$ indicates the CPM student detector whose parameters are $w_1$. $x_i$ is the RGB image of the $i$-th data-point and $H_i^j \in \mathbb{R}^{(K+1) \times h \times w}$ indicates the heatmap prediction of the $j$-th stage. $h$ and $w$ denote the spatial height and width of the heatmap. Similarly, we use $f_2$ to indicate the HG student detector whose parameters are $w_2$. The detection loss function of the CPM student is:

$$\ell(x_i, y_i \mid w_1) = \sum_{j=1}^{N} \left\| H_i^j - p(y_i) \right\|_F^2, \tag{2}$$

where $p$ is a function taking the label $y_i$ as input to generate the ideal heatmap $H_i^* = p(y_i)$. Details of $p$ can be found in [41, 30]. During evaluation, we take the argmax results over the first $K$ channels of the last heatmap $H_i^N$ as the coordinates of the landmarks, and the $(K+1)$-th channel, which corresponds to the background, is omitted.
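The evaluation step described above (argmax over the landmark channels, background channel dropped) can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the function name `decode_landmarks`, the nested-list heatmap layout, and the assumption that the background is the last channel are all hypothetical choices made for the example.

```python
def decode_landmarks(heatmap, num_landmarks):
    """Return one (x, y) peak location per landmark channel.

    heatmap is assumed to be a list of num_landmarks + 1 channels, each an
    h x w nested list; the trailing channel is treated as background.
    """
    coords = []
    for channel in heatmap[:num_landmarks]:  # skip the background channel
        best_val, best_xy = float("-inf"), (0, 0)
        for y, row in enumerate(channel):
            for x, val in enumerate(row):
                if val > best_val:  # keep the first strongest response
                    best_val, best_xy = val, (x, y)
        coords.append(best_xy)
    return coords
```

Coordinates come back in heatmap pixels, so they would still need rescaling to the input image size if the heatmap is smaller than the image.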
The Teacher Network. Since our student detectors are heatmap based, the pseudo label is in the form of a heatmap and the ground truth label is the ideal heatmap. We build our teacher network using the structure of the discriminators adopted in CycleGAN. As shown in Figure 3, the input of this teacher network is the concatenation of a face image and its heatmap prediction (the heatmap is resized to the same spatial size as its face image). The output of this teacher network is a scalar representing the quality of a pseudo labeled facial image. Since we train the teacher on the trustworthy labeled data, we can obtain a supervised detection loss $\ell(x_i, y_i)$ by comparing the predicted heatmap with the ideal heatmap. We consider the negative value of this detection loss as the ground truth label of the quality, because a higher value of the negated detection loss indicates a higher similarity between the predicted heatmap and the ideal heatmap. In other words, a higher quality scalar corresponds to a more accurate pseudo label.
Formally, denoting the teacher network as $g$, we have:

$$q_i = g(x_i \oplus H_i \mid w_g), \tag{3}$$

$$\min_{w_g} \sum_{(x_i, y_i) \in \mathcal{L}} \left\| q_i - \left(-\ell(x_i, y_i)\right) \right\|^2, \tag{4}$$

where the parameters of the teacher are $w_g$. “$\oplus$” first resizes the tensor $H_i$ into the same spatial shape as $x_i$ and then concatenates the resized tensor with $x_i$ to get a new tensor. This new tensor is regarded as a pseudo labeled image and will be qualified by the teacher later. The teacher outputs a scalar $q_i$ representing the quality of the $i$-th sample associated with its pseudo label $H_i$. We optimize the teacher on the trustworthy labeled data by minimizing Eq. (4).
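To make the teacher's supervision signal concrete, the sketch below computes a quality target as the negated detection loss on a labeled sample and a regression error for a predicted quality score. It is only an illustration under stated assumptions: the sum-of-squared-differences heatmap loss, the squared-error regression objective, and the names `detection_loss`, `quality_target`, and `teacher_loss` are choices made for this example and are not confirmed by the text.

```python
def detection_loss(pred, ideal):
    """Sum of squared differences between two heatmaps (c x h x w nested lists).

    Assumption: a squared-error heatmap loss is used for illustration.
    """
    return sum(
        (p - t) ** 2
        for pred_ch, ideal_ch in zip(pred, ideal)
        for pred_row, ideal_row in zip(pred_ch, ideal_ch)
        for p, t in zip(pred_row, ideal_row)
    )


def quality_target(stage_preds, ideal):
    """Negated detection loss summed over stages; larger means a better label."""
    return -sum(detection_loss(pred, ideal) for pred in stage_preds)


def teacher_loss(predicted_quality, stage_preds, ideal):
    """Squared error between the teacher's score and the quality target.

    Assumption: the teacher is fit by least-squares regression.
    """
    return (predicted_quality - quality_target(stage_preds, ideal)) ** 2
```

A perfect prediction yields a target of zero, the maximum possible value, and the target becomes more negative as the predicted heatmap drifts away from the ideal one.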
Our TS aims to progressively improve the performance of the student detectors. The key idea is to learn a teacher network that can teach the students which pseudo labeled samples are reliable and can be used for training. In this procedure, we define the pseudo label of a facial image as follows:

$$\hat{y}_i = \frac{1}{2}\left(H_{i,1}^{N} + H_{i,2}^{N}\right),$$

where $H_{i,1}^{N}$ indicates the heatmap prediction from the first student at the $N$-th (last) stage for the $i$-th sample, and $H_{i,2}^{N}$ is the corresponding prediction from the second student. $\hat{y}_i$ in Eq. (3.3) indicates the ensemble result from both student detection networks. It is also used as the prediction during the inference procedure.
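The ensemble of the two students' last-stage heatmaps can be sketched as an element-wise combination. The example below assumes a plain average of two heatmaps with identical shapes, which is one common way to ensemble and is an assumption here; `ensemble_heatmaps` is a hypothetical helper name.

```python
def ensemble_heatmaps(hm_a, hm_b):
    """Element-wise average of two heatmaps stored as c x h x w nested lists.

    Assumption: both students output heatmaps of the same shape and are
    weighted equally.
    """
    return [
        [[(a + b) / 2.0 for a, b in zip(row_a, row_b)]
         for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(hm_a, hm_b)
    ]
```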
We show our overall algorithm in Algorithm 1. We first initialize the two detectors $f_1$ and $f_2$ on the labeled facial images $\mathcal{L}$. Then, in the first round, our algorithm applies the following procedure: (1) generate pseudo labels on $\mathcal{L}$ via Eq. (3.3) and train the teacher network from scratch with these pseudo labels; (2) generate pseudo labels on $\mathcal{U}$ and estimate the quality of these pseudo labeled samples using the learned teacher; (3) select some high-quality pseudo labeled samples to retrain one student network from scratch; (4) repeat the first three steps to update the other student detection network. In the following rounds, each student improves and generates more accurate pseudo labels, so we select more pseudo labeled samples when retraining the students. As the rounds proceed, the students gradually become better, and the teacher adapts to the improved students. Our interaction mechanism helps to obtain more accurate pseudo labels and to select more reliable pseudo labeled samples. As a result, our algorithm achieves better performance in semi-supervised facial landmark detection.
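The round structure described above can be sketched as a small driver loop. This is a hedged outline, not Algorithm 1 itself: `train_student`, `train_teacher`, and `pseudo_label` are hypothetical callables supplied by the caller, and the linearly growing selection ratio (`base_ratio * round`) is an assumption standing in for the schedule, whose exact values are not given in the text.

```python
def select_top(samples, scores, ratio):
    """Keep the `ratio` fraction of samples with the highest teacher score."""
    k = int(len(samples) * ratio)
    order = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    return [samples[i] for i in order[:k]]


def ts_training(labeled, unlabeled, train_student, train_teacher,
                pseudo_label, rounds, base_ratio):
    """Alternate teacher fitting and student retraining for `rounds` rounds.

    train_student(idx, labeled, pseudo=()) -> student
    train_teacher(triples of (image, label, pseudo_label)) -> scoring callable
    pseudo_label(students, image) -> ensemble pseudo label
    """
    # Initialise both students on the labeled data only.
    students = [train_student(idx, labeled) for idx in (0, 1)]
    for rnd in range(1, rounds + 1):
        ratio = min(1.0, base_ratio * rnd)  # assumed schedule: select more each round
        for idx in (0, 1):
            # (1) Train the teacher from scratch on pseudo labels of labeled images.
            teacher = train_teacher(
                [(x, y, pseudo_label(students, x)) for x, y in labeled])
            # (2) Pseudo-label the unlabeled images and score them.
            candidates = [(x, pseudo_label(students, x)) for x in unlabeled]
            scores = [teacher(x, h) for x, h in candidates]
            # (3) Retrain one student from scratch on labeled + selected samples.
            chosen = select_top(candidates, scores, ratio)
            students[idx] = train_student(idx, labeled, chosen)
    return students
```

Because the teacher is rebuilt before each student update, the second student in a round already benefits from the first student's improved pseudo labels.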
Can this algorithm generalize to other tasks? Our algorithm relies on the design of the teacher network, which requires the input pseudo label to be a structured prediction. Therefore, our algorithm can potentially be applied to tasks with structured predictions, such as segmentation and pose estimation, but is not suitable for other tasks like classification.
Limitation. It is challenging for a teacher to judge the quality of a pseudo label for an image, especially when the spatial size of this image becomes large. Therefore, in this paper, we use an input size of 64×64. If we increase the input size to 256×256, the teacher will fail and needs to be modified accordingly. There are two main reasons: (1) the larger resolution requires a deeper architecture or dilated convolutions for the teacher network and (2) high-resolution faces bring high-dimensional inputs, and consequently, the teacher needs much more training data. This drawback limits the extension of our algorithm to high-resolution tasks, such as segmentation. We will explore solutions to this problem in the future.
Further improvements. (1) In our algorithm, a part of the unlabeled samples is not involved in the retraining procedure. To utilize these unlabeled facial images, we could use self-supervised techniques to improve the detectors. (2) In this framework, we use only two student detectors, while it is easy to integrate more student detectors. More student detectors are likely to improve the prediction accuracy, but they also introduce more computation cost. (3) Specifically designed data augmentation [31, 42] is another direction to improve the accuracy and precision of the pseudo labels.
Will the teacher network over-fit to the labeled data? In Algorithm 1, since the labeled data set $\mathcal{L}$ is used to optimize both the teacher and the students, the teacher’s judgment could suffer from the over-fitting problem. Most of the students’ predictions on the labeled data can be similar to the ground truth labels. In other words, most pseudo labeled samples on $\mathcal{L}$ are “correctly” labeled samples. If the teacher is optimized on $\mathcal{L}$ with those pseudo labels, it might only learn what a good pseudo labeled sample is, but overlook what a bad one is. It would be more reasonable to let the students predict on an unseen validation set, and then train the teacher on this validation set. However, having an additional validation set during training differs from the typical setting of previous semi-supervised facial landmark detection work. We will explore this problem in our future work.
We perform experiments on three benchmark datasets to investigate the behavior of the proposed method. The datasets and experiment settings are introduced in Section 4.1 and Section 4.2. We first compare the proposed semi-supervised facial landmark algorithm with other state-of-the-art algorithms in Section 4.3. We then perform ablation studies in Section 4.4 and finally visualize our results.
The 300-W dataset annotates 68 landmarks from five facial landmark datasets, i.e., LFPW, AFW, HELEN, XM2VTS, and IBUG. Following the common settings [11, 9, 27], we regard all the training samples from LFPW, HELEN and the full set of AFW as the training set, in which there are 3148 training images. The common test subset consists of 554 test images from LFPW and HELEN. The challenging test subset consists of 135 images from IBUG. The full test set is the union of the common and challenging subsets, 689 images in total.
| Methods | SDM | LBF | CCL | Two-Stage | SBR | SAN | DSRN |
| Methods | RCN (5%) | TS (5%) | TS (10%) | TS (20%) |
The AFLW dataset contains 21997 real-world images with 25993 faces in total. It provides at most 21 landmark coordinates for each face, excluding invisible landmarks. Faces in AFLW vary widely in head pose, expression, occlusion, and illumination, which makes it difficult to train a robust detector. Following the same setting as in [27, 47], we do not use the landmarks of the two ears. There are two types of AFLW splits, i.e., AFLW-Full and AFLW-Front, following [47, 9]. AFLW-Full contains 20000 training samples and 4386 test samples. AFLW-Front uses the same training samples as AFLW-Full, but only uses the 1165 samples with a frontal face as the test set.
For the CPM student, the number of cascaded stages is set to three. Its number of parameters is 16.70 MB and its FLOPs are 1720.98 M. To train CPM, we apply the SGD optimizer with a momentum of 0.9 and a weight decay of 0.0005. For each stage, we train the CPM for 50 epochs in total. We start with a learning rate of 0.00005 and halve it at the 20th, 25th, 30th, and 40th epochs.
For the HG student, we use four cascaded stages, where the number of parameters is 24.97 MB and the FLOPs are 1600.85 M. To train HG, we apply the RMSprop optimizer with an alpha of 0.99. For each stage, we train the HG for 110 epochs in total. We start with a learning rate of 0.00025 and halve it at the 50th, 70th, 90th, and 100th epochs.
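The two step-decay learning-rate schedules above can be written down as a small pure-Python helper. This sketch assumes that "reduce it by 0.5" means multiplying the rate by 0.5 at each milestone and that epochs are counted from zero; `step_lr`, `CPM_SCHEDULE`, and `HG_SCHEDULE` are hypothetical names introduced for the example.

```python
def step_lr(base_lr, milestones, epoch, gamma=0.5):
    """Learning rate at `epoch` after multiplying by `gamma` at each milestone passed."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


# Values taken from the training description; epoch indexing is an assumption.
CPM_SCHEDULE = {"base_lr": 0.00005, "milestones": (20, 25, 30, 40)}   # 50 epochs
HG_SCHEDULE = {"base_lr": 0.00025, "milestones": (50, 70, 90, 100)}   # 110 epochs
```

For instance, `step_lr(epoch=45, **CPM_SCHEDULE)` gives the CPM rate after all four halvings, i.e. one sixteenth of the initial rate.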
For both of these detectors, we use a batch size of eight on two GPUs. To generate the heatmap ground truth labels, we apply a Gaussian distribution with a sigma of 3. Each face image is first resized to 64×64, and then randomly rescaled by a factor between 0.9 and 1.1. After the random resize operation, the face image is randomly rotated by up to 30 degrees, and then randomly cropped to 64×64. (Different input image resolutions can cause different detection performance; we choose 64×64 to ease the training of our teacher network.) We set the selection ratio and the maximum step based on cross-validation.
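The Gaussian ground-truth heatmap mentioned above can be illustrated for a single landmark. The sketch assumes an unnormalised Gaussian with peak value 1 centred on the landmark and a 64×64 heatmap; the exact normalisation and heatmap resolution used by the authors are not stated here, and `gaussian_heatmap` is a hypothetical helper.

```python
import math


def gaussian_heatmap(cx, cy, height=64, width=64, sigma=3.0):
    """Return a height x width nested list with a Gaussian bump at (cx, cy).

    Assumption: the peak is left unnormalised (value 1.0 at the landmark).
    """
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]
```

With sigma set to 3, the response falls to about 0.61 three pixels away from the landmark and is close to zero beyond roughly ten pixels.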
Training the teacher network. (Model code is publicly available on GitHub: https://github.com/D-X-Y/landmark-detection.) We build our teacher network using the structure of the discriminators adopted in CycleGAN. Given a 64×64 face image, we first resize the predicted heatmap to the same spatial size of 64×64. We use Adam to train this teacher network. The initial learning rate is 0.01, and the batch size is 128. Random flip, random rotation, random scale and crop are applied as data augmentation.
Evaluation. Normalized Mean Error (NME) is usually applied to evaluate the performance of facial landmark predictions [27, 34, 47, 9]. For the 300-W dataset, we use the inter-ocular distance to normalize the mean error, following the same setting as in [35, 27, 11, 9]. For the AFLW dataset, we use the face size to normalize the mean error. Area Under the Curve (AUC) @ 0.08 error is also employed for evaluation [6, 40]. When training on the partially labeled data, the sets $\mathcal{L}$ and $\mathcal{U}$ are randomly sampled. During evaluation, we use Eq. (3.3) to obtain the final heatmap and follow [41, 30] to generate the coordinate of each landmark. We repeat each experiment three times and report the mean result. The code will be made publicly available upon acceptance.
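The two metrics named above can be sketched as follows. This is an illustrative implementation under stated assumptions: landmarks are `(x, y)` tuples, the inter-ocular distance is taken between the outer eye corners at 0-based indices 36 and 45 of the 68-point markup (a common convention, assumed here), and AUC is computed as the area under the empirical cumulative error curve up to the threshold. The function names are hypothetical.

```python
import math


def nme(pred, gt, norm_dist):
    """Mean point-to-point Euclidean error divided by a normalising distance."""
    errors = [math.hypot(px - gx, py - gy)
              for (px, py), (gx, gy) in zip(pred, gt)]
    return sum(errors) / (len(errors) * norm_dist)


def inter_ocular_distance(gt, left_idx=36, right_idx=45):
    """Distance between the outer eye corners (indices are an assumption)."""
    (lx, ly), (rx, ry) = gt[left_idx], gt[right_idx]
    return math.hypot(lx - rx, ly - ry)


def auc_at(per_image_nme, threshold=0.08):
    """Area under the cumulative error curve on [0, threshold], scaled to [0, 1]."""
    area = sum(max(0.0, threshold - e) for e in per_image_nme)
    return area / (len(per_image_nme) * threshold)
```

Reported NME values are often multiplied by 100, so a value of 0.0349 from `nme` would correspond to 3.49 in that convention.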
| Method | DGCM | SBR | TS |
Comparisons on 300-W. We compare our algorithm with several state-of-the-art algorithms [44, 43, 27, 43, 19, 16], as shown in Table 1. In this table, [9, 22, 11] are very recent methods, which represent the state-of-the-art supervised facial landmark algorithms. By using 100% of the facial landmark labels on the 300-W training set together with unlabeled AFLW, our algorithm achieves 3.49 NME on the 300-W common test set, which is competitive with other state-of-the-art algorithms. In addition, even though our approach utilizes two detectors, the number of parameters is much lower than that of SAN. The robust detection performance of our approach can be mainly attributed to two reasons. First, the proposed teacher network can effectively sample the qualified pseudo labeled data, which enables the model to exploit more useful information. Second, our framework leverages two advanced CNN architectures, which can complement each other.
We also compare our TS with a recent work on semi-supervised facial landmark detection, RCN, in Table 1. When using 10% of the labels, our TS obtains a lower NME on the challenging test set than RCN (5.64 NME vs. 6.32 NME). When using 20% of the labels, our TS is also superior to it (5.03 NME vs. 5.88 NME). Note that RCN utilizes a transformation-invariant auxiliary loss function. This auxiliary loss can also be easily integrated into our framework. Therefore, RCN is orthogonal to our work, and combining the two methods can potentially achieve better performance.
Comparisons on AFLW. We also show the NME comparison on the AFLW dataset in Table 2. Compared to the semi-supervised facial landmark detection algorithm RCN, we achieve a similar performance. RCN can learn transformation-invariant information from a large number of unlabeled images, while ours does not consider this information as it is not our focus. On the AFLW-Full test set, using 20% of the annotations, our framework achieves 1.99 NME, which is competitive with other supervised algorithms. On the AFLW-Front test set, using only 10% of the annotations, our framework achieves competitive NME results. The above results demonstrate that our framework can train a robust detector with much less annotation effort.
Comparisons on 300-VW. We apply our algorithm to leverage a large number of unlabeled facial video frames on 300-VW. We use the labeled 300-W training set and the unlabeled 300-VW training set to train our TS. We evaluate the learned detectors on the 300-VW C test subset w.r.t. AUC @ 0.08. Some video-based facial landmark detection algorithms [20, 11] utilize the labeled 300-VW training data to improve the base detectors. Without using any labels on 300-VW, our TS obtains a higher AUC than these methods, i.e., 59.65 vs. 59.39, as shown in Table 3.
The key contribution of our TS lies in two components: (1) the teacher supervising the training data selection of the students; (2) the complementary effect of the two students. In this subsection, we validate the contribution of these two components to the final detection performance.
The effect of the teacher. Compared to other progressive pseudo label generation strategies [23, 17, 28], our designed teacher can select pseudo labeled samples of higher quality. In Figure 4, we show the detection results after the first five training rounds (only 10% of the labels are used). We use SPL [23, 17] to separately train CPM and HG, and then ensemble them as in Eq. (3.3). We use SPaCo to jointly optimize CPM and HG in a co-training strategy. To make a fair comparison, at each round, we keep the number of pseudo labels the same across these three algorithms. From Figure 4, several conclusions can be drawn: (1) TS obtains the lowest NME, because the quality of its selected pseudo labels is better than that of the others. (2) SPL falls into a local trap and its error increases between rounds, whereas SPaCo and our TS do not. This may be because the interaction between the two students helps them regularize each other. (3) Our TS converges faster than SPaCo and achieves better results. The pseudo labeled data selection in SPaCo is a heuristic unsupervised criterion, whereas our criterion is a supervised teacher. Since no extra supervision is given in SPaCo, its criterion might admit inaccurate pseudo labeled samples. Besides, as discussed in Section 3.4, our TS can utilize a validation set to further improve the performance by avoiding over-fitting, but the compared methods may not be able to utilize a validation set effectively.
The effect of the interaction between students. In Table 4, we show the ablation studies on the complementary effect of multiple students. In these experiments, we use the same teacher structure, while “CPM” and “HG” are trained without the interaction between students. Using 10% of the labels, CPM achieves 8.28 NME, and HG achieves 6.25 NME on 300-W. Leveraging their mutual benefits, our TS reduces the NME to 5.64, a relative improvement of about 30% over CPM and 9% over HG. Under different portions of annotations, we can draw similar conclusions. This ablation study demonstrates the contribution of student interaction to the final performance. Note that our algorithm can be readily applied to multiple students without introducing additional hyper-parameters. In contrast, the number of hyper-parameters in other co-training strategies [28, 12] is quadratic in the number of detectors.
On the 300-W training set, we train our TS using only 10% labeled facial images, and we show some qualitative results on the 300-W test set in Figure 5. The first row shows seven raw input facial images. The second row shows the ground truth background heatmaps, and the third row shows the faces with the ground truth landmarks of these images. We visualize the predicted background heatmap in the fourth row and the predicted coordinates in the fifth row. As we can see, the predicted landmarks of our TS are very close to the ground truth. These predictions are already robust enough that a human may not be able to distinguish between our predictions (the fifth row) and the ground truth (the third row).
In this paper, we propose an interaction mechanism between a teacher and multiple students for semi-supervised facial landmark detection. The students learn to generate pseudo labels for the unlabeled data, while the teacher learns to judge the quality of these pseudo labeled data. After that, the teacher can filter out unqualified samples, and the students get feedback from the teacher and improve themselves using the qualified samples. The teacher adapts along with the improved students. Besides, multiple students can not only regularize each other but also be ensembled to predict more accurate pseudo labels. We empirically demonstrate that the proposed interaction mechanism achieves state-of-the-art performance on three facial landmark benchmarks.
Learning to reweight examples for robust deep learning. In ICML.
Facial landmark detection with tweaked convolutional neural networks. IEEE TPAMI.
Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.