Accurate and robust localization of facial landmarks is a critical step in many existing face processing applications, including tracking, expression analysis, and face identification. Unique localization of such landmarks is severely affected by occlusions, partial face visibility, large pose variations, uneven illumination, or large, non-rigid deformations during more extreme facial expressions [44, 48]. These challenges have to be overcome in order to achieve a low landmark localization error, implying high robustness to appearance changes in faces while guaranteeing high localization accuracy for each landmark.
The recent advances in deep learning techniques, coupled with the availability of large, annotated databases, have allowed steady progress, with localization accuracy on a typical benchmark increasing by 100% (from  to ; see below for more related work). Most approaches use a combination of highly-tuned, supervised learning schemes in order to achieve this performance and are almost always specifically optimized on the particular datasets that are tested, increasing the potential for overfitting to that dataset. Similarly, it has been shown that annotations in datasets can be imprecise and inconsistent (e.g., ).
Given that in addition to the existing annotated facial landmark datasets, there is an even larger number of datasets available for other tasks (face detection, face identification, facial expression analysis, etc.), it should be possible to leverage the implicit knowledge about face shape contained in this pool to ensure both better generalizability across datasets and easier, faster few-shot training of landmark localization. Here, we present such a framework that is based on a two-stage architecture (3FabRec, see Figs. 1, 2). The key to the framework lies in the first, unsupervised stage, in which a (generative) adversarial autoencoder  is trained on a large dataset of faces. This yields a low-dimensional embedding capturing “face knowledge”  from which it is able to reconstruct face images across a wide variety of appearances. With this embedding, the second, supervised stage then trains the landmark localization task on annotated datasets, in which the generator is re-tasked to predict the locations of a set of landmarks by generating probabilistic heatmaps . This two-stage approach is a special case of semi-supervised learning [25, 62] and has been successful in other domains, including general network training , text classification  and translation , and visual image classification .
In the current study, we show that our framework is able to achieve state-of-the-art results running at 300 FPS on the standard benchmark datasets. Most importantly, it yields impressive localization performance already with a few percent of the training data - beating the leading scores in all cases and setting new standards for landmark localization from as few as 10 images. The latter result demonstrates that landmark knowledge has, indeed, been implicitly captured by the unsupervised pre-training. Additionally, the reconstructed autoencoder images are able to ”explain away” extraneous factors (such as occlusions or make-up), yielding a best-fitting face shape for accurate localization and adding to the explainability of the framework.
2 Related Work
provided the state-of-the-art in facial landmark detection. Current models using deep convolutional neural networks, however, quickly became the best-performing approaches, starting with deep alignment networks, fully-convolutional networks , coordinate regression models [29, 49], or multi-task learners , with the deep networks being able to capture the pixel-to-landmark correlations across face appearance variations.
The recent, related work in the context of our approach can be structured into supervised and semi-supervised approaches (for a recent, interesting unsupervised method - at lower performance levels - see ).
2.1 Supervised methods
Several recent, well-performing supervised methods are based on heatmap regression, in which a deep network infers a probabilistic heatmap for each of the facial landmarks, with its maximum encoding the most likely location of that landmark [28, 5, 12] - an approach we also follow here. In order to provide additional geometric constraints, extensions use an active-appearance-based model-fitting step based on PCA , explicit encoding of geometric information from the face boundary , or additional weighting from occlusion probabilities . The currently best-performing method on many benchmarks uses a heatmap-based framework together with optimization of the loss function to foreground versus background pixels. Such supervised methods typically require large amounts of labelled training data in order to generalize across the variability in facial appearance (see  for an architecture using high-resolution deep cascades that tries to address this issue).
2.2 Semi-supervised methods
In addition to changes to the network architecture, the issue of lack of training data and inconsistent labeling quality is addressed in semi-supervised models [25, 62] that augment the training process to make use of partially- or weakly-annotated data. Data augmentation based on landmark perturbation  or from generating additional views from a 3D face model  can be applied to generate more robust pseudo landmark labels.  uses constraints from temporal consistency of landmarks based on optic flow to enhance the training of the landmark detector - see also . In [59, 39], multi-task frameworks are proposed in which attribute-networks tasked with predicting other facial attributes including pose and emotion are trained together with the landmark network, allowing for gradient transfer from one network to the other. Similar to this,  show improvements using data augmentation with style-translated examples during training. In , a teacher-supervises-students (TS) framework is proposed in which a teacher is trained to filter student-generated landmark pseudolabels into ”qualified” and ”unqualified” samples, such that the student detectors can retrain themselves with better-quality data. Similarly, in , a GAN framework produces ”fake” heatmaps that the main branch of the network needs to discriminate, hence improving performance.
3.1 Our approach
Most of the semi-supervised approaches discussed above use data augmentation on the same dataset as done for testing. Our approach (see Figs. 1,2) starts from an unsupervised method in which we leverage the implicit knowledge about face shape contained in large datasets of faces (such as used for face identification ). This knowledge is captured in a low-dimensional latent space of an autoencoder framework. Importantly, the autoencoder also has generative capabilities, i.e., it is tasked during training to reconstruct the face from the corresponding latent vector. This step is done because the following, supervised stage implements a hybrid reconstruction pipeline that uses the generator together with interleaved transfer layers to both reconstruct the face as well as probabilistic landmark heatmaps. Hence, the changes in the latent vector space will be mapped to the position of the landmarks trained on labeled datasets. Given that the first, unsupervised stage has already captured knowledge about facial appearance and face shape, this information will be quickly made explicit during the second, supervised stage allowing for generalization across multiple datasets and enabling low-shot and few-shot training.
3.2 Unsupervised face representation
The unsupervised training step follows the framework of  in which an adversarial autoencoder is trained through four loss functions balancing faithful image reconstruction with the generalizability and smoothness of the embedding space needed for the generation of novel faces. A reconstruction loss penalizes reconstruction errors through a pixel-based error. An encoding feature loss  ensures the creation of a smooth and continuous latent space. An adversarial feature loss pushes the encoder and generator to produce reconstructions with high fidelity, since training generative models using only image reconstruction losses typically leads to blurred images.
As the predicted landmark locations in our method follow directly from the locations of reconstructed facial elements, our main priority in training the autoencoder lies in the accurate reconstruction of such features. Thus, we trade some of the generative power against reconstruction accuracy by replacing the generative image loss, , used in  with a new structural image loss .
Reconstruction loss:
An accurate representation of faces is achieved by a pixel-based loss that ensures a faithful reconstruction of input images:
Adversarial feature loss:
Adversarial autoencoders  achieve the creation of a smooth embedding space by employing a discriminator that forces the encoder to encode images according to a prior distribution. This avoids overfitting of the encoder and increases the generalizability to unseen images.
Adversarial image loss:
Autoencoders trained with image reconstruction losses typically suffer from creating blurry images. We observed the same behavior leading to accurate image reconstructions in terms of pixel differences, yet smoothing out facial elements like eye corners or lip boundaries. To ensure that reconstructed images maintain high image fidelity we, therefore, add an adversarial loss:
The encoder and decoder are optimized to auto-encode images that look similar to the input images from the training set. The discriminator tries to distinguish real images from their reconstructions. It can only be fooled if the same amount of detail is present in the auto-encoded face as in the original image.
Structural image loss:
To penalize reconstructions that do not align facial structures well with input images, we add a structural image loss based on the SSIM image similarity metric, which measures contrast and correlation between two image windows:
The metric is computed from the intensity variances of the two windows and their covariance. A constant adds stability against small denominators; it is set to a fixed value for images with 8-bit channels. The calculation is run for each window across the images:
We obtain the structural image loss by evaluating with the original image and its reconstructions:
This loss improves the alignment of high-frequency image elements and imposes a penalty for high-frequency noise introduced by the adversarial image loss. Hence, the structural image loss serves as a regularizer, stabilizing adversarial training.
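The window score described above can be illustrated with a minimal sketch. It assumes the standard contrast-structure term of SSIM, (2·cov + c) / (var_a + var_b + c), which matches the description (variances, covariance, stabilizing constant) but is not confirmed as the exact form used here. The function names, the default constant `c`, and the use of non-overlapping windows are assumptions made for illustration only.

```python
from statistics import fmean


def contrast_structure(a, b, c=1e-4):
    """Contrast-structure score for two equally sized windows.

    a, b: flat sequences of pixel intensities.
    c:    small stabilizing constant (hypothetical default, not the paper's value).
    """
    if len(a) != len(b) or not a:
        raise ValueError("windows must be non-empty and equally sized")
    mu_a, mu_b = fmean(a), fmean(b)
    var_a = fmean((p - mu_a) ** 2 for p in a)
    var_b = fmean((q - mu_b) ** 2 for q in b)
    cov_ab = fmean((p - mu_a) * (q - mu_b) for p, q in zip(a, b))
    return (2.0 * cov_ab + c) / (var_a + var_b + c)


def structural_score(x, y, k=2, c=1e-4):
    """Average window score over non-overlapping k x k windows.

    x, y: images given as lists of rows with identical shape.
    """
    height, width = len(x), len(x[0])
    scores = []
    for r in range(0, height - k + 1, k):
        for col in range(0, width - k + 1, k):
            win_x = [x[r + i][col + j] for i in range(k) for j in range(k)]
            win_y = [y[r + i][col + j] for i in range(k) for j in range(k)]
            scores.append(contrast_structure(win_x, win_y, c))
    return fmean(scores)
```

A score of 1 indicates identical local structure, and a loss could for example be formed as one minus this score; how the score is turned into a loss term in the original formulation is not recoverable from the text.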
Full autoencoder objective:
The final training objective is a weighted combination of all loss terms:
We set and to . and are selected so the corresponding loss terms yield similarly large values to each other, while at the same time ensuring a roughly 10 times higher weight in comparison to and (given the range of loss terms, we set , ).
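For orientation, the weighted combination described above can be written out as follows. The symbol names are placeholders chosen here for the four loss terms named in the text (reconstruction, structural, encoding feature, adversarial image); the original notation and the weight values are not recoverable from this passage and are deliberately left unspecified.

```latex
\mathcal{L}_{\mathrm{AE}}
  = \lambda_{\mathrm{rec}}\,\mathcal{L}_{\mathrm{rec}}
  + \lambda_{\mathrm{cs}}\,\mathcal{L}_{\mathrm{cs}}
  + \lambda_{\mathrm{enc}}\,\mathcal{L}_{\mathrm{enc}}
  + \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{adv}}
```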
3.3 Supervised landmark discovery
For landmark detection, we are not primarily interested in producing an RGB image but rather a multi-channel image containing one probability map per landmark. This can be seen as a form of style transfer in which the appearance of the generated face is converted to a representation that allows us to read off landmark positions. Hence, information about face shape that was implicitly present in the generation of color images before is now made explicit. Our goal is to create this transfer without losing the face knowledge distilled from the very large set of (unlabeled) images, as the annotated datasets available for landmark prediction are only a fraction of that size and suffer from imprecise and inconsistent human annotations . For this, we introduce additional, interleaved transfer layers into the generator.
3.3.1 Interleaved transfer layers
Training of landmark generation starts by freezing all parameters of the autoencoder. We then interleave the inverted ResNet layers of the generator with convolutional layers. Each of these Interleaved Transfer Layers (ITLs) produces the same number of output channels as the original ResNet layer. Activations produced by a ResNet layer are transformed by these layers and fed into the next higher block. The last convolutional layer mapping to RGB images is replaced by a convolutional layer mapping to heatmap images with one channel per landmark to be predicted. This approach adds just enough flexibility to the generator to produce new heatmap outputs by re-using the pre-trained autoencoder weights.
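The data flow through the interleaved layers can be sketched in a few lines. This is only a structural illustration: plain callables stand in for the frozen generator blocks, the trainable ITLs, and the final heatmap layer, and all names are hypothetical rather than taken from an actual implementation.

```python
def run_generator_with_itls(z, frozen_blocks, transfer_layers, heatmap_head):
    """Pass a latent code through frozen generator blocks with one ITL after each.

    frozen_blocks:   callables standing in for pre-trained inverted ResNet layers.
    transfer_layers: callables standing in for the trainable ITLs.
    heatmap_head:    callable standing in for the final heatmap convolution.
    """
    if len(frozen_blocks) != len(transfer_layers):
        raise ValueError("expected one transfer layer per generator block")
    x = z
    for block, itl in zip(frozen_blocks, transfer_layers):
        x = block(x)  # frozen, pre-trained generator layer
        x = itl(x)    # trainable layer, same channel count in and out
    return heatmap_head(x)
```

In such a setup only the transfer layers and the heatmap head would receive gradient updates during the first training phase, which is what keeps the number of trainable parameters small.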
Given an annotated face image, the ground truth heatmap for each landmark consists of a 2D Normal distribution centered at the annotated landmark position, with a fixed standard deviation. During landmark training and inference, the activations produced by the first inverted ResNet layer for an encoded image are passed to the first ITL. This layer transforms the activations and feeds them into the next, frozen inverted ResNet layer, such that the full cascade of ResNet layers and ITLs can reconstruct a landmark heatmap. The heatmap prediction loss is defined as the distance between the predicted and the ground truth heatmap.
The position of each landmark is given by the location of the maximum of its predicted heatmap.
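The heatmap construction, the prediction loss, and the read-out of the landmark position can be sketched as follows. This sketch assumes an unnormalized Gaussian with peak value 1 and a sum of squared differences as the distance; the grid size, the standard deviation, and the function names are illustrative choices, not the settings used in the experiments.

```python
import math


def gaussian_heatmap(cx, cy, size=8, sigma=1.0):
    """Return a size x size grid with an unnormalized 2D Gaussian centered at (cx, cy)."""
    return [
        [
            math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
            for x in range(size)
        ]
        for y in range(size)
    ]


def squared_l2(pred, target):
    """Sum of squared differences between two heatmaps (one possible distance)."""
    return sum(
        (p - t) ** 2
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )


def heatmap_argmax(heatmap):
    """Return the (x, y) pixel with the highest activation."""
    best = max(
        ((value, x, y) for y, row in enumerate(heatmap) for x, value in enumerate(row)),
        key=lambda item: item[0],
    )
    return best[1], best[2]
```

For example, `heatmap_argmax(gaussian_heatmap(5, 2, size=8, sigma=1.5))` recovers the center `(5, 2)`. Sub-pixel refinement around the maximum is a common extension but is not part of this sketch.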
Once training of the ITL layers reaches convergence we can perform an optional finetuning step. For this, the encoder is unfrozen so that ITL layers and encoder are optimized in tandem (see Fig.2).
Since the updates are only based on landmark errors, this will push the encoder to encode input faces such that facial features are placed more precisely in reconstructed faces. At the same time, other attributes like gender, skin color, or illumination may be removed, as these are not relevant for the landmark prediction task. Overfitting is avoided since the generator remains unchanged, which acts as a regularizer and limits the flexibility of the encoder.
4 Experiments
Footnote 1: For experiments on parameter tuning, cross-database results, and further ablation studies, see supplementary materials.
VGGFace2 & AffectNet
The dataset used for unsupervised training of the generative autoencoder combines two datasets. The first is the VGGFace2 dataset , which contains a total of 3.3 million faces collected with large variability in pose, age, illumination, and ethnicity in mind. From the full dataset, we removed faces with a height of less than 100 pixels, resulting in 1.8 million faces (from 8631 unique identities). In addition, we add the AffectNet dataset , which was designed for capturing a wide variety of facial expressions (hence providing additional variability in face shape) and contains 228k images, yielding a total of 2.1M images for autoencoder training.
300-W
This dataset was assembled by  from several sources, including LFPW , AFW , HELEN , XM2VTS , and own data, and was annotated semi-automatically with 68 facial landmarks. Using the established splits reported in , a total of 3,148 training images and 689 testing images were used in our experiments. The latter is further split into 554 images that constitute the common subset and a further 135 images that constitute the challenging subset. Additionally, 300-W contains 300 indoor and 300 outdoor images that define the private testset of the original 300-W challenge.
AFLW
This dataset  contains 24,386 in-the-wild faces with an especially wide range of face poses in yaw, roll, and pitch. Following common convention, we used splits of 20,000 images for training and 4,386 for testing and trained with only 19 of the 21 annotated landmarks .
WFLW
The newest dataset in our evaluation protocol is from  and contains a total of 10,000 faces with a 7,500/2,500 train/test split. Images were sourced from the WIDER Face dataset  and were manually annotated with a much larger number of 98 landmarks. The dataset contains different (partially overlapping) test subsets for evaluation, where each subset varies in pose, expression, illumination, make-up, occlusion, or blur.
4.2 Experimental settings
4.2.1 Unsupervised autoencoder training
Our implementation is based on  which combines a standard ResNet-18 as encoder with an inverted ResNet-18 (first convolution layers in each block replaced by 4×4 deconvolution layers) as decoder. Encoder and decoder contain 10M parameters each. The encoded feature length is 99 dimensions.
We train the autoencoder for 50 epochs with an input/output size of  and a batch size of 100 images. Upon convergence, we add an additional ResNet layer to both the encoder and decoder and train for another 50 epochs with an image size of  and a batch size of 50 to increase reconstruction fidelity. We use the Adam optimizer  () with a constant learning rate of , which yielded robust settings for adversarial learning. We apply data augmentations of random horizontal flipping (), translation (), resizing ( to ), and rotation ().
4.2.2 Supervised landmark training
Images are cropped using supplied bounding boxes and resized to . For creating ground truth heatmaps, we set . In all experiments we train four ITL layers and generate landmark heatmaps of size by skipping the last generator layer (as detailed in 4.6, higher generator layers contain mostly decorrelated local appearance information). To train from the landmark dataset images, we apply data augmentations of random horizontal flipping (), translation () resizing (), rotation (), and occlusion (at inference time no augmentation is performed). The learning rate during ITL-only training is set to 0.001. During the optional finetuning stage we lower ITL learning rate to 0.0001 while keeping the encoder learning rate the same as during training (=) and resetting Adam’s to the default value of .
Performance of facial landmark detection is reported here using Normalized mean error (NME), failure rate (FR) at 10% NME and area-under-the-curve (AUC) of the Cumulative Error Distribution (CED) curve. For 300-W and WFLW we use the distance between the outer eye-corners as the ”inter-ocular” normalization. Due to the high number of profile faces in AFLW, errors are normalized using the width of the (square) bounding boxes following .
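The three metrics can be made concrete with a small sketch. It assumes per-face NME values expressed as fractions (0.10 corresponds to 10%), counts a failure when the NME strictly exceeds the threshold, and integrates the CED curve with the trapezoidal rule, scaled so that a perfect detector scores 1. The function names and the number of integration steps are illustrative assumptions.

```python
import math


def nme(pred, gt, norm):
    """Mean Euclidean landmark error of one face, divided by a normalization length.

    norm is the inter-ocular distance or the bounding-box width, depending on the dataset.
    """
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / (len(dists) * norm)


def failure_rate(nmes, threshold=0.10):
    """Fraction of faces whose NME exceeds the threshold."""
    return sum(e > threshold for e in nmes) / len(nmes)


def ced_auc(nmes, threshold=0.10, steps=1000):
    """Area under the cumulative error distribution on [0, threshold], scaled to [0, 1]."""
    n = len(nmes)
    # CED value at error e: fraction of faces with NME <= e.
    ys = [sum(v <= threshold * i / steps for v in nmes) / n for i in range(steps + 1)]
    # Trapezoidal rule over the sampled curve.
    area = sum((ys[i] + ys[i + 1]) / 2 for i in range(steps)) * (threshold / steps)
    return area / threshold
```

With these definitions, a lower NME and FR and a higher AUC indicate better performance; the AUC additionally rewards methods whose errors are concentrated well below the failure threshold.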
4.3 Qualitative results
The trained generator is able to produce a wide range of realistic faces from a low-dimensional (99D) latent feature vector - this is shown in Fig. 3 with randomly-generated faces with overlaid, predicted landmark heatmaps. To achieve this, the model must have learned inherent information about the underlying structure of faces. We can further illustrate the implicit face shape knowledge by interpolating between face embeddings and observing that facial structures (such as mouth corners) in produced images are constructed in a highly consistent manner (see Fig. 4 for a visualization). This leads to two insights: First, facial structures are actually encoded in the low-dimensional representation. Second, this information can be transformed into 2D maps of pixel intensities (i.e., a color image) while maintaining high correlation with the originating encoding.
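Interpolating between two face embeddings can be sketched as below. Linear interpolation is assumed here because it is the simplest scheme; the text does not state which scheme was used, and the function name is hypothetical. Each interpolated vector would then be passed through the generator, which is not part of this sketch.

```python
def interpolate_latents(z_start, z_end, steps=5):
    """Linearly interpolate between two latent vectors, endpoints included."""
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same length")
    if steps < 2:
        raise ValueError("steps must be at least 2")
    path = []
    for i in range(steps):
        t = i / (steps - 1)  # runs from 0.0 (start) to 1.0 (end)
        path.append([(1.0 - t) * a + t * b for a, b in zip(z_start, z_end)])
    return path
```

Calling `interpolate_latents(z_a, z_b, steps=5)` with two 99-dimensional embeddings yields five 99-dimensional vectors, the first and last being the original embeddings.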
Further examples of the reconstruction quality on challenging images are shown in Fig. 5. As can be seen, the pipeline will try to reconstruct a full face as much as possible given the input, removing occlusions and make-up and even ”upsampling” the face (Fig. 5, first column) in the process. This is because the databases for training the autoencoder contained mostly unoccluded and non-disguised faces at roughly similar resolutions. Additionally we note that the reconstructed faces will not necessarily preserve the identity as the goal of the fully-trained pipeline is to reconstruct the best-fitting face shape. Although our method is able to handle considerable variations in resolution (Fig. 5, first column), make-up (Fig. 5, second column), lighting (Fig. 5, third column), and pose (Fig. 5, fourth column), it does produce failed predictions in cases when these factors become too extreme, as shown in the fifth column of Fig. 5. Landmark prediction, however, typically degrades gracefully in these cases as the confidence encoded in the heatmaps will also be low.
4.4 Comparison with state-of-the-art
Table 1 shows comparisons of our semi-supervised pipeline with state-of-the-art on the 300-W and the AFLW datasets using the full amount of training. We achieve top-2 accuracy on nearly all test sets with the exception of the common set from 300-W. This demonstrates that our framework is able to reach current levels of performance despite a much lighter, supervised training stage using only a few interleaved transfer layers on top of the generator pipeline.
The results in Table 2 for AUC and FR for the commonly-reported 300-W dataset demonstrate that our framework achieves the lowest failure rate of all methods (our FR=0.17 corresponds to only 1 image out of the full set that has large enough errors to count as a failure). At the same time, the AUC is in the upper range but not quite as good as that of , for example, which means that overall errors across landmarks are low, but more equally distributed compared to the top-performing methods.
|LaplaceKL (70K) ||1.97||-||3.19||6.87||3.91|
|M CSR ||47.52||5.5|
The NME results in Table 3 show that on the newest WFLW dataset, our approach performs at the level of the LAB method  on most subsets, although we perform consistently below the current StyleAlign approach (SA, ; note, however, that this approach could easily be integrated into our framework as well, which would allow us to disentangle the 99D feature vector into style attributes  to generate augmented training data). The main reason for this is that WFLW contains many more heavy occlusions and extreme appearance changes than our training sets, leading to more failure cases (see Fig. 5, fifth column).
|Metric||Method||Testset||Pose||Expression||Illumination||Make-up||Occlusion||Blur|
|NME (%)||SDM ||10.29||24.10||11.45||9.32||9.38||13.03||11.28|
|FR @0.1 (%)||SDM ||29.40||84.36||33.44||26.22||27.67||41.85||35.32|
|AUC @0.1||SDM ||0.300||0.023||0.229||0.324||0.312||0.206||0.239|
|Method||Training set size|
|100%||20%||10%||5%||50 (1.5%)||10 (0.3%)||1 (0.03%)|
4.5 Limited training data and few-shot learning
On 300-W, Table 4 shows that performance is comparable to that of 2-year-old approaches trained on the full dataset (cf. Table 1), although 3FabRec was trained with only 10% of the dataset. In addition, performance does not decrease much when going to lower values of 5% and 1.5% of training set size. Even when training with only 10 images or 1 image, our approach is able to deliver reasonably robust results (see Fig. 1 for landmark reconstruction results from training with 10 images).
For AFLW (Table 5), our approach already starts to come out ahead at 20% of training set size, with little degradation down to 1%. Again, even with only a few images, 3FabRec can make landmark predictions.
For the more challenging WFLW dataset (Table 6), our approach easily outperforms the StyleAlign  method as soon as less than 10% is used for training, while being able to maintain landmark prediction capabilities down to only 10 images in the training set.
|Method||Training set size|
|100%||20%||10%||5%||1%||50 (0.25%)||10 (0.05%)||1 (0.005%)|
|Method||Training set size|
4.6 Ablation studies
4.6.1 Effects of ITLs
In order to see where information about landmarks is learned in the interleaved transfer layers, Figure 6 shows the reconstruction of the landmark heatmap when using all four layers versus decreasing subsets of the upper layers. As can be seen, the highest layer has only very localized information (mostly centered on eyes and mouth), whereas the lower layers are able to add information about the outlines - especially below layer 2.
Localization accuracy is reported on the 300-W dataset (NME of 51 inner landmarks and outlines, as well as FR) in Table 7. As can be expected from the visualization, performance is bad for the upper layers only, but quickly recovers (especially when including the outlines) below layer 2. The reason for this is that the upper layers of the generator will mostly contain localized, de-correlated information at the pixel level, whereas the lower layers are closer to the more global and contextual information necessary to cover highly variable outlines (cf. blue curve in Figure 6, note that all ITLs have 3×3 convolutions). As the gray curve in Figure 6 and Table 7 show as well, the ITLs can achieve this with only very few additional parameters.
4.6.2 Effects of finetuning
Table 8 reports the effects of running the model with and without finetuning on the full testsets of the three evaluated datasets. The additional retraining of the autoencoder allows for better reconstruction of the faces and results in benefits of 10.9% on average (8.9% for 300-W, 15.2% for AFLW, and 8.5% for WFLW, respectively).
|Dataset||300-W||AFLW||WFLW|
|NME before FT||4.16||2.12||6.11|
|NME after FT||3.82||1.84||5.62|
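The reported benefits can be approximately recomputed from the table entries. The sketch below assumes that the three columns correspond to 300-W, AFLW, and WFLW in that order, and that the gain is expressed relative to the NME after finetuning, since this convention reproduces the reported 300-W and AFLW figures. With the rounded table values, WFLW comes out at 8.7% rather than the reported 8.5%, which may be due to rounding of the table entries.

```python
def relative_gain(before, after):
    """Error reduction as a percentage of the error after finetuning."""
    return 100.0 * (before - after) / after


# NME values copied from the table; the dataset assignment is an assumption.
nme_before = {"300-W": 4.16, "AFLW": 2.12, "WFLW": 6.11}
nme_after = {"300-W": 3.82, "AFLW": 1.84, "WFLW": 5.62}

gains = {
    name: round(relative_gain(nme_before[name], nme_after[name]), 1)
    for name in nme_before
}
```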
4.6.3 Autoencoder losses
The adversarial autoencoder is trained through four loss functions balancing faithful image reconstruction with the generalizability and smoothness of the embedding space needed for the generation of novel faces. A reconstruction loss penalizes reconstruction errors through a pixel-based error. An encoding feature loss  ensures the creation of a smooth and continuous latent space. An adversarial feature loss pushes the encoder and generator to produce reconstructions with high fidelity, since training generative models using only image reconstruction losses typically leads to blurred images. As the predicted landmark locations in our method follow directly from the locations of reconstructed facial elements, our main priority in training the autoencoder lies in the accurate reconstruction of such features. Reconstruction accuracy is therefore further enhanced by introducing a structural image loss.
|Model||FT||Global Reconstr.||Local Reconstr.||NME||FR@0.1|
|RMSE||SSIM||Patch SSIM||%||% (#)|
|(HG)||Heatmap HG||✓||-||-||-||5.48||4.21 (29)|
|(A)||Adv. Autoencoder||✓||12.61||0.68||0.64||5.67||4.94 (34)|
|(A-FT)||Adv. Autoencoder (FT)||✓||✓||25.03||0.57||0.55||4.92||2.47 (17)|
|(B)||AE + GAN||✓||✓||15.10||0.60||0.58||5.30||3.77 (26)|
|(B-FT)||AE + GAN (FT)||✓||✓||✓||27.48||0.49||0.50||4.71||2.03 (14)|
|(C)||AE + GAN + Struct.||✓||✓||✓||15.91||0.62||0.64||4.92||2.61 (18)|
|(C-FT)||AE + GAN + Struct. (FT)||✓||✓||✓||✓||27.65||0.50||0.53||4.41||1.45 (10)|
Here, we present results of the framework ablating different loss terms (except for the encoding feature loss) during the training of the autoencoder to study their impact on landmark localization accuracy (see Table 9) using the 300-W dataset. In addition, we report the effects of the optional finetuning step on accuracy, in which the autoencoder is further tuned on the 300-W training dataset. All setups were trained on 128x128px images, i.e., at half the resolution of the setup reported in the paper (see also Figure 7).
As benchmarks, the first two rows of Table 9 also list a standard ResNet-18 predictor of landmark locations (trained on 300-W) as well as a standard heatmap-based system (trained on 300-W). Both approaches offer roughly the same kind of performance on this dataset with a slight advantage for heatmap-based prediction.
If we only add the autoencoder (using , ) to our ResNet-architecture, then performance is comparable to that of the standard, non-bottlenecked ResNet-18 architecture, which shows that the 99 dimensions seem to be sufficient to capture the landmark ”knowledge” - it is important to note, however, that this landmark knowledge was obtained from unsupervised training. Further (supervised) finetuning of the autoencoder on 300-W provides another, significant boost that goes beyond the performance of both supervised benchmark systems. Hence, the finetuning step on the dataset is able to sharpen the implicit landmark representation obtained during the unsupervised step.
Forcing the autoencoder to generate believable images by adding the adversarial loss (using , , ) provides a further 7% improvement in NME for standard and finetuned training. Finally, the addition of the structural loss that further enhances small details in the reconstructed faces (using , , , ) yields another 7% improvement. Overall, these results clearly show that losses that tune the face representation to be able to generate more detailed faces will also improve the landmark localization accuracy.
We note that the columns reporting ”global” reconstruction errors (as RMSE or SSIM comparisons between the original and reconstructed images, respectively) and ”local” reconstruction error (as SSIM errors evaluated for patches centered on the landmark locations of the original and reconstructed images) yield already good quality for the most ”simple” loss setup. For this it is best to look at Figure 7, which shows how the different losses affect the visual quality of the reconstruction. When looking at rows (A), (B), (C), faces gain an increasing amount of high-frequency detail. When adding the GAN loss, these high-frequency details will not aid the reconstruction error at first as the details are ”hallucinated” globally all over the face - these details, however, seem to be able to aid the landmark layers in providing a better mapping onto heatmaps and therefore landmark locations. The addition of the SSIM loss does improve the reconstruction error again as the loss forces the high-frequency details to better match with the trained source face images - again, the added details in this case will help landmark localization.
The effect of finetuning on face appearance is interesting to observe as the faces gain immediate detail for all loss setups, yet their overall reconstruction is sometimes more ”different” to the source face compared to the non-finetuned version. This is because finetuning unfreezes the weights of the encoder but will train to predict the landmark locations more reliably - hence, the reconstructed faces will favor clear landmark localizability (through well-defined facial feature locations) at the expense of more faithful face reconstruction. Overall, the effect is therefore an increase of the reconstruction error.
As a final note, we observe that training the autoencoder setup on 256x256px provides another jump in performance as the system will learn to reconstruct facial details at an even higher fidelity (see final two rows in Figure 7).
4.6.4 Encoding length
In Table 10, we report the effect of halving the dimensionality of the embedding on landmark localization accuracy. Although yielding a slightly higher NME, the reduced autoencoder obtains a slightly lower FR, which overall means that both embedding dimensionalities result in similar performance levels. An issue with the reduced-dimensionality embedding, however, was that the subsequent landmark training was notably less robust, requiring a much more conservative learning rate.
Hence, for the task of landmark localization, the current framework may work with a lower-dimensional embedding space; however, it seems that pulling the implicit information out of the reduced dimensions is a harder task than for a richer embedding.
Further experiments are needed to investigate the effects of increasing the dimensionality as well as providing further constraints on the embedding vector during the unsupervised training.
|Unlabeled training data||Labeled training data|
|Model||Pre-train||Num. of images||100% (3,189): NME||100% (3,189): FR@0.1 % (#)||1.5% (50): NME||1.5% (50): FR@0.1 % (#)|
|ResNet-18||None||0||5.64||4.64 (32)||8.70||22.21 (153)|
|C-FT||None||0||6.73||11.03 (76)||15.56||88.82 (612)|
|C-FT||300-W||3,189||5.40||4.79 (33)||7.95||15.82 (109)|
|C-FT||VGG + AN||100k||4.73||1.74 (12)||6.34||9.29 (64)|
|C-FT||VGG + AN||2.1M||4.41||1.45 (10)||5.71||4.35 (30)|
4.6.5 Unsupervised training and few-shot learning
We next take a look at the effects of the unsupervised training step as well as the amount of supervised post-training on 300-W. Table 11 shows again the ResNet-18 baseline and then four different training setups for the full, finetuned system.
The first of these setups is the full architecture without any unsupervised pre-training and hence without any implicit face knowledge. When the supervised training step is added, the interleaved transfer layers (ITLs) are able to pick up on the encoded information and recover reasonable performance, yet not at the level of the much deeper ResNet-18 architecture (note that the ITLs only consist of single 3×3 convolution layers). If the amount of training data is reduced, the performance of this severely limited architecture predictably drops significantly.
The next rows show results for the full architecture with different amounts of pre-training. Pre-training on the 300-W training dataset results in equal or slightly better performance compared to the ResNet-18 architecture, showing that the system is able to pick up implicit knowledge from as few as 3,189 images. Pre-training on 100,000 images provides a significant further jump, as does pre-training on the full 2.1M image dataset.
Importantly, the error increase in the presence of limited training data (columns labeled 1.5% in Table 11, i.e., just 50 images) showcases the power of the pre-trained representation: whereas the NME of ResNet-18 increases by around 54% when going from 100% to 1.5% of the training set, the NME of our pre-trained architectures increases by only 47%, 34%, and 29%, respectively, owing to the more robust generalization from the latent space.
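The relative increases quoted above follow directly from the NME values in Table 11. The short sketch below recomputes them; the dictionary keys are merely labels chosen here for the table rows.

```python
# NME values from Table 11: (100% of labeled data, 1.5% of labeled data)
nme_table = {
    "ResNet-18 (no pre-training)": (5.64, 8.70),
    "C-FT (no pre-training)": (6.73, 15.56),
    "C-FT (300-W, 3,189 images)": (5.40, 7.95),
    "C-FT (VGG + AN, 100k images)": (4.73, 6.34),
    "C-FT (VGG + AN, 2.1M images)": (4.41, 5.71),
}

def relative_increase(full, few):
    """Relative NME increase in percent when going from full to few-shot training."""
    return 100.0 * (few - full) / full

increases = {name: relative_increase(full, few)
             for name, (full, few) in nme_table.items()}

for name, inc in increases.items():
    print(f"{name}: +{inc:.1f}%")
```

The non-pre-trained C-FT row, which is not quoted in the text, more than doubles its error, which illustrates how strongly the few-shot behaviour depends on the unsupervised stage.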
4.7 Runtime performance
Since inference complexity is equivalent to two forward passes through a ResNet-18, our method is able to run at frame rates of close to 300 fps on a TitanX GPU, an order of magnitude faster than state-of-the-art approaches with similarly high accuracy (LAB: 16 fps, Wing: 30 fps, Deep Regression: 83 fps, Laplace: 20 fps).
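To put the quoted frame rates side by side, the following sketch converts them into per-frame latencies and speed-up factors relative to the approximate 300 fps reported for our method. It uses only the numbers given above; hardware and measurement conditions of the other methods may differ, so the ratios are indicative only.

```python
# Frame rates as quoted in the text (frames per second).
fps = {
    "3FabRec": 300.0,   # "close to 300 fps", treated as 300 here
    "LAB": 16.0,
    "Wing": 30.0,
    "Deep Regression": 83.0,
    "Laplace": 20.0,
}

# Per-frame latency in milliseconds.
latency_ms = {name: 1000.0 / rate for name, rate in fps.items()}

# Speed-up of 3FabRec over each other method.
speedup = {name: fps["3FabRec"] / rate
           for name, rate in fps.items() if name != "3FabRec"}

for name, factor in speedup.items():
    print(f"{name}: {latency_ms[name]:.1f} ms/frame, 3FabRec is {factor:.1f}x faster")
```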
With 3FabRec, we have demonstrated that unsupervised, generative training on large amounts of faces captures implicit information about face shape, making it possible to solve landmark localization with only a minimal amount of supervised follow-up training. This paradigm makes our approach inherently more robust against overfitting to specific training datasets as well as against human annotation variability. The critical ingredients of 3FabRec that enable this generalization are the use of an adversarial autoencoder that reconstructs high-quality faces from a low-dimensional latent space, together with low-overhead, interleaved transfer layers added to the decoder-generator stage that transfer face reconstruction to landmark heatmap reconstruction.
Results show that the autoencoder is easily able to generalize from its training set (VGGFace2 and AffectNet) to data from unseen datasets. On more challenging subsets (such as WFLW make-up and occlusion), in which face occlusions and large variability in face appearance are more common than in the training set, the autoencoder reconstructions will "explain away" these changes to reconstruct a face whose shape is as close to the evidence as possible. Most importantly, whereas results for training with full datasets are competitive, the framework shows impressive generalization when trained on only a few percent of the training sets and still produces reliable results from as few as 10 images - far below anything reported so far in the literature. This result testifies to the power of the face representation learned in the unsupervised stage. At the same time, the interleaved transfer layers keep heatmap inference at the cost of two forward passes through a ResNet-18, such that our framework can run at frame rates of 300 fps on a GPU, which is at least an order of magnitude faster than other highly accurate approaches.
Additional improvements to the method will come from improved training of the autoencoder to handle facial occlusion and larger pose changes (either by more extensive data augmentation or by using specific datasets) as well as from leveraging the implicit constraints on facial landmark positions contained in video data. Our approach is not restricted to facial landmark localization, but can also be used in other domains for which large amounts of unlabeled data are available, including prediction of body pose (human and non-human), registration and alignment of medical imaging data, or even a combination of tasks.
-  Riza Alp Guler, George Trigeorgis, Epameinondas Antonakos, Patrick Snape, Stefanos Zafeiriou, and Iasonas Kokkinos. Densereg: Fully convolutional dense shape regression in-the-wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6799–6808, 2017.
-  Peter N Belhumeur, David W Jacobs, David J Kriegman, and Neeraj Kumar. Localizing parts of faces using a consensus of exemplars. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(12):2930–2940, 2013.
-  Matteo Bodini. A review of facial landmark extraction in 2d images and videos using deep learning. Big Data and Cognitive Computing, 3(1):14, 2019.
-  Bjoern Browatzki and Christian Wallraven. Robust discrimination and generation of faces using compact, disentangled embeddings. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 0–0, 2019.
-  Adrian Bulat and Georgios Tzimiropoulos. Two-stage convolutional part heatmap regression for the 1st 3D face alignment in the wild (3DFAW) challenge. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), volume 9914 LNCS, pages 616–624, 2016.
-  Qiong Cao, Li Shen, Weidi Xie, Omkar M Parkhi, and Andrew Zisserman. Vggface2: A dataset for recognising faces across pose and age. In Face & Gesture Recognition (FG 2018), pages 67–74. IEEE, 2018.
-  Gavin C Cawley and Nicola LC Talbot. On over-fitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research, 11(Jul):2079–2107, 2010.
-  Timothy F Cootes, Gareth J Edwards, and Christopher J Taylor. Active appearance models. In European Conference on Computer Vision, pages 484–498. Springer, 1998.
-  Timothy F Cootes and Christopher J Taylor. Active shape models—‘smart snakes’. In BMVC92, pages 266–275. Springer, 1992.
-  Arnaud Dapogny, Kévin Bailly, and Matthieu Cord. DeCaFA: Deep Convolutional Cascade for Face Alignment In The Wild. pages 6893–6901, 2019.
-  Jiankang Deng, Qingshan Liu, Jing Yang, and Dacheng Tao. M3 csr: Multi-view, multi-scale and multi-component cascade shape regression. Image and Vision Computing, 47:19–26, 2016.
-  Jiankang Deng, George Trigeorgis, Yuxiang Zhou, and Stefanos Zafeiriou. Joint multi-view face alignment in the wild. IEEE Transactions on Image Processing, 28(7):3636–3648, 2019.
-  Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
-  Xuanyi Dong, Yan Yan, Wanli Ouyang, and Yi Yang. Style Aggregated Network for Facial Landmark Detection. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 379–388, 2018.
-  Xuanyi Dong and Yi Yang. Teacher Supervises Students How to Learn From Partially Labeled Images for Facial Landmark Detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 783–792, 2019.
-  Xuanyi Dong, Shoou-I Yu, Xinshuo Weng, Shih-En Wei, Yi Yang, and Yaser Sheikh. Supervision-by-registration: An unsupervised approach to improve the precision of facial landmark detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 360–368, 2018.
-  Zhen-Hua Feng, Guosheng Hu, Josef Kittler, William Christmas, and Xiao-Jun Wu. Cascaded collaborative regression for robust facial landmark detection trained using a mixture of synthetic and real images with dynamic weighting. IEEE Transactions on Image Processing, 24(11):3425–3440, 2015.
-  Zhen-Hua Feng, Josef Kittler, Muhammad Awais, Patrik Huber, and Xiao-Jun Wu. Wing Loss for Robust Facial Landmark Localisation with Convolutional Neural Networks. 2017.
-  Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
-  Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.
-  Sina Honari, Pavlo Molchanov, Stephen Tyree, Pascal Vincent, Christopher Pal, and Jan Kautz. Improving Landmark Localization with Semi-Supervised Learning. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 1546–1555, 2018.
-  Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.
-  Kyle Hsu, Sergey Levine, and Chelsea Finn. Unsupervised learning via meta-learning. arXiv preprint arXiv:1810.02334, 2018.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
-  Martin Koestinger, Paul Wohlhart, Peter M Roth, and Horst Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV workshops), pages 2144–2151. IEEE, 2011.
-  Vuong Le, Jonathan Brandt, Zhe Lin, Lubomir Bourdev, and Thomas S Huang. Interactive facial feature localization. In European Conference on Computer Vision, pages 679–692. Springer, 2012.
-  Zhujin Liang, Shengyong Ding, and Liang Lin. Unconstrained facial landmark localization with backbone-branches fully-convolutional networks. arXiv preprint arXiv:1507.03409, 2015.
-  Jiangjing Lv, Xiaohu Shao, Junliang Xing, Cheng Cheng, and Xi Zhou. A deep regression architecture with two-stage re-initialization for high performance facial landmark detection. Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, 2017-January:3691–3700, 2017.
-  Jiang-Jing Lv, Cheng Cheng, Guo-Dong Tian, Xiang-Dong Zhou, and Xi Zhou. Landmark perturbation-based data augmentation for unconstrained face recognition. Signal Processing: Image Communication, 47:465–475, 2016.
-  Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
-  Meilu Zhu, Daming Shi, Mingjie Zheng, and Muhammad Sadiq. Robust Facial Landmark Detection via Occlusion-adaptive Deep Networks. CVPR, pages 3486–3496, 2019.
-  Daniel Merget, Matthias Rock, and Gerhard Rigoll. Robust Facial Landmark Detection via a Fully-Convolutional Local-Global Context Network. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 781–790, 2018.
-  Kieron Messer, Jiri Matas, Josef Kittler, Juergen Luettin, and Gilbert Maitre. Xm2vtsdb: The extended m2vts database. In Second international conference on audio and video-based biometric person authentication, volume 964, pages 965–966, 1999.
-  Xin Miao, Xiantong Zhen, Xianglong Liu, Cheng Deng, Vassilis Athitsos, and Heng Huang. Direct shape regression networks for end-to-end face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5040–5049, 2018.
-  Ali Mollahosseini, Behzad Hasani, and Mohammad H. Mahoor. AffectNet: A Database for Facial Expression, Valence, and Arousal Computing in the Wild. IEEE Transactions on Affective Computing, 2017.
-  Shengju Qian, Keqiang Sun, Wayne Wu, Chen Qian, and Jiaya Jia. Aggregation via Separation: Boosting Facial Landmark Detector with Semi-Supervised Style Translation. In Proceedings of the IEEE International Conference on Computer Vision, 2019.
-  Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
-  Rajeev Ranjan, Vishal M Patel, and Rama Chellappa. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(1):121–135, 2017.
-  Shaoqing Ren, Xudong Cao, Yichen Wei, and Jian Sun. Face alignment at 3000 fps via regressing local binary features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1685–1692, 2014.
-  Shaoqing Ren, Xudong Cao, Yichen Wei, and Jian Sun. Face alignment via regressing local binary features. IEEE Transactions on Image Processing, 25(3):1233–1245, 2016.
-  Joseph P Robinson, Yuncheng Li, Ning Zhang, Yun Fu, and Sergey Tulyakov. Laplace Landmark Localization. In Proceedings of the IEEE International Conference on Computer Vision, 2019.
-  Christos Sagonas, Epameinondas Antonakos, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic. 300 Faces In-The-Wild Challenge: database and results. Image and Vision Computing, 47:3–18, 2016.
-  Christos Sagonas, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic. 300 faces in-the-wild challenge: The first facial landmark localization challenge. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pages 397–403, 2013.
-  Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
-  Yi Sun, Xiaogang Wang, and Xiaoou Tang. Deep convolutional network cascade for facial point detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3476–3483, 2013.
-  James Thewlis, Samuel Albanie, Hakan Bilen, and Andrea Vedaldi. Unsupervised learning of landmarks by descriptor vector exchange. In Proceedings of the IEEE International Conference on Computer Vision, pages 6361–6371, 2019.
-  Phil Tresadern, Tim Cootes, Chris Taylor, and Vladimir Petrović. Face alignment models. In Handbook of face recognition, pages 109–135. Springer, 2011.
-  George Trigeorgis, Patrick Snape, Mihalis A Nicolaou, Epameinondas Antonakos, and Stefanos Zafeiriou. Mnemonic descent method: A recurrent process applied for end-to-end face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4177–4187, 2016.
-  Tim Valentine, Michael B Lewis, and Peter J Hills. Face-space: A unifying concept in face recognition research. The Quarterly Journal of Experimental Psychology, 69(10):1996–2019, 2016.
-  Xinyao Wang, Liefeng Bo, and Li Fuxin. Adaptive Wing Loss for Robust Face Alignment via Heatmap Regression. 2019.
-  Zhou Wang, Eero P Simoncelli, and Alan C Bovik. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, volume 2, pages 1398–1402. IEEE, 2003.
-  Wenyan Wu, Chen Qian, Shuo Yang, Quan Wang, Yici Cai, and Qiang Zhou. Look at Boundary: A Boundary-Aware Face Alignment Algorithm. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2129–2138, 2018.
-  Yue Wu, Tal Hassner, KangGeon Kim, Gerard Medioni, and Prem Natarajan. Facial Landmark Detection with Tweaked Convolutional Neural Networks. 2015.
-  Xuehan Xiong and Fernando De la Torre. Supervised descent method and its applications to face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 532–539, 2013.
-  Wei Yang, Shuang Li, Wanli Ouyang, Hongsheng Li, and Xiaogang Wang. Learning feature pyramids for human pose estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1281–1290, 2017.
-  Egor Zakharov, Aliaksandra Shysheya, Egor Burkov, and Victor Lempitsky. Few-Shot Adversarial Learning of Realistic Neural Talking Head Models. 2019.
-  Richard Zhang, Phillip Isola, and Alexei A Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1058–1067, 2017.
-  Zhanpeng Zhang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. Facial Landmark Detection by Deep Multi-task Learning. ECCV, 2014.
-  Shizhan Zhu, Cheng Li, Chen Change Loy, and Xiaoou Tang. Face alignment by coarse-to-fine shape searching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4998–5006, 2015.
-  Shizhan Zhu, Cheng Li, Chen Change Loy, and Xiaoou Tang. Face alignment by coarse-to-fine shape searching. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 07-12-June:4998–5006, 2015.
-  Xiaojin Zhu. Semi-supervised learning. Encyclopedia of Machine Learning and Data Mining, pages 1142–1147, 2017.
-  Xiangyu Zhu, Zhen Lei, Xiaoming Liu, Hailin Shi, and Stan Z. Li. Face Alignment Across Large Poses: A 3D Solution. 2016.
-  Xiangxin Zhu and Deva Ramanan. Face detection, pose estimation, and landmark localization in the wild. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2879–2886. IEEE, 2012.