, age estimation, and expression classification. Various benchmark datasets [1, 12, 14, 16] have been released, each containing large quantities of labelled images. Although the databases were collected with the goal of being as rich and diverse as possible, inherent bias across datasets is unavoidable in practice.
The bias takes the form of different characteristics and distributions across datasets, as depicted in Fig. 1. For instance, one set mainly contains Caucasian males with mostly frontal faces, while another set consists of challenging samples with various poses or severe occlusions. In addition, the proportion of profile views can differ by more than 10% across datasets. Clearly, training a model on one dataset would easily lead to over-fitting and cause poor performance on unseen domains. To improve generalisation, it is of practical interest to combine different databases so as to leverage the characteristics and distributions of multiple sources. This idea, however, is hindered by the annotation gaps (see the first column of Fig. 1), which require huge effort to standardise before database fusion is possible.
The objective of this study is to formulate an approach that allows integration of different databases despite their different annotation protocols. At first glance, this seems unsolvable, but we make it possible by exploiting common landmarks across datasets. Specifically, we observe that many landmarks are well labelled with decisive semantic definitions across different datasets, e.g. left and right eye corners, mouth corners and pupil centers. These common landmarks can usually be found on different datasets, although their numbers can be different. Often, there are 6 to 12 common landmarks annotated on a pair of datasets (the dark blue landmarks in Fig. 4). These common landmarks provide us with an opportunity to transfer information from one dataset to the other.
To this end, we propose a simple yet effective approach that exploits common landmarks as guidance to transfer labelled landmarks from a given source dataset and fit them to the images in an arbitrary target set. Performing dataset fusion with the proposed approach offers two main advantages: (i) sample diversification – our method allows standardisation of disparate annotation spaces, thus allowing fusion of datasets. The combined dataset captures the diverse characteristics of samples from multiple sources. A model trained on this dataset is expected to generalise better to unseen samples. (ii) annotation enrichment – we are no longer limited to a model that can only predict the particular landmark configuration of the specific dataset on which it is trained. With the proposed approach, one can transfer densely labelled annotations from a source to a sparsely labelled target for high-quality dense annotation in the target domain.
Contribution: We show for the first time how annotation spaces of different face alignment datasets can be standardised automatically. This allows us to combine and exploit diverse datasets for training a single model. Extensive experiments show that the resulting model achieves state-of-the-art face alignment results in cross-dataset and unseen-domain evaluations. In particular, we achieve 16.6% improvement on average against the method with the ‘closed-world’ assumption when performing cross-dataset evaluations, and 11.4% improvement on average compared to naïve training set fusion. Based on the proposed annotation transfer approach, we obtain and release dense annotations (68 and 194 points) on the popular face verification dataset LFW [12] via the link http://mmlab.ie.cuhk.edu.hk/projects/landmarksTransferring.html.
2 Related Work
Face Alignment: Approaches for face alignment can be broadly divided into three categories: (i) active appearance model based methods, (ii) cascaded regression methods, and (iii) detection based methods. As the most classic method, the original active appearance model (AAM) [10] searches for shape parameters by minimising the residual between the face appearance and a face template. The method suffers from poor generalisation and sensitivity to initialisation.
Cascaded regression treats shape estimation as a regression problem. It starts from raw estimates of landmark positions, and iteratively learns regressors that map shape dependent features into pose increments. Examples of cascaded regression methods include the approach by Cao et al. [6], which employs boosted nonlinear regression with shape dependent pixel difference features. Burgos-Artizzu et al. [4] build a cascaded regression model with an occlusion detection and voting strategy to cope with severe occlusion. Xiong and De la Torre [26] address regression by learning generic descent directions and performing linear mapping on non-linear SIFT features, which achieves state-of-the-art results.
Detection based approaches detect object parts independently and then estimate pose and/or shape directly from the detections [7, 9] or through flexible part models [5, 3, 24]. These methods are effective at detecting and localising articulated objects from multiple views in challenging scenarios. Sun et al. [19] propose a cascaded deep convolutional network for five-point face alignment. The network detects approximate locations of the landmarks in the lower cascade and refines the estimates in the higher cascade. State-of-the-art results were recently achieved by Zhang et al. [28, 29] using a deep convolutional network trained for facial landmark detection together with heterogeneous but subtly correlated tasks, like head pose estimation and facial attribute inference. The model achieves a mean error of 9.15% on the challenging IBUG subset of the 300-W benchmark dataset.
Dataset Bias: Torralba and Efros [23] raise an important question: are the datasets deployed for computer vision studies unbiased representations of the visual world? They showed that even when a large number of training images are employed, an image classification model can still over-fit if it is trained on a single biased dataset. The over-fitting would severely hamper cross-dataset generalisation. A number of studies focus on undoing this bias by transfer learning [17, 15, 22] or through other means, such as a max-margin based learning framework or a subspace alignment method [13, 11]. To our knowledge, our work is among the first studies to investigate the problem of dataset bias in the face alignment domain. We wish to show that using existing databases independently for training and testing risks a ‘closed-world’ evaluation environment. To improve cross-dataset generalisation, we devise a novel transductive alignment method to bridge the annotation gap between diverse datasets, which in turn facilitates seamless database fusion for domain adaptive face alignment.
It is worth noting that, in recent work, Smith and Zhang [18] independently presented an alternative way to combine multiple face landmark datasets with different landmark definitions into a super dataset.
3.1 Problem and Notations
In a typical face alignment pipeline, one assumes the training set consists of both images and the corresponding ground truth coordinates , each image contains a cropped face and each ground truth pose , in which is the number of landmarks on each face. We use an asterisk to denote ground-truth coordinates.
The goal is to learn a model , to estimate the location of landmarks in the test set . For a cascaded regression method, the estimate of the current landmarks’ coordinates for iteration is denoted by . For clarity, we use for abbreviation in the following discussion. We use
to represent the shape-indexed features extracted according to the specific pose parameterised by . For SIFT-based shape-dependent features , the dimension of the features is .
In this study, we assume there exists a source dataset, represented as for training and for testing. On the other hand, we have a target set and . More precisely, as shown in Fig. 2, the source and target training sets share some common landmarks , which co-exist between source and target training sets. On the other hand, there exist private landmarks, which refer to those that can only be found on either source or target training sets, but not both. They are represented as and , respectively.
Note that although the source and target training sets share some common landmarks, their private landmarks are different, thus the total numbers of landmark annotations, , and , are also different. Our task is to bridge this annotation gap. As discussed in Section 1, performing such an annotation transfer is challenging, in that the annotation protocols of the source and target sets could differ significantly. We address this problem by exploiting the common landmarks that co-exist in the source and target sets. The details are presented in Section 3.3.
Transferring the annotations from source to target will provide the target set with source-type landmarks, as shown in the right-most subfigure in Fig. 2. The transferred private landmarks from the source to the target set are denoted as , whilst the transferred common landmarks are given as . The target set with this new set of annotations is known as the pseudo-labeled target training set, and it is represented as . We show in Section 5.1 that the transferred annotations are close to human annotation accuracy. We can readily combine the synthesised target training set with the source training set, since they now have an identical set of annotations. We show this possibility in Section 3.4.
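The partition into common and private landmarks can be illustrated with a minimal sketch. It assumes that each annotation protocol is available as a list of semantic landmark names; the function `split_landmarks` and all names in the example are hypothetical placeholders, not the labels of any real dataset.

```python
def split_landmarks(source_names, target_names):
    """Partition landmark indices into common and private sets.

    Landmarks are matched by semantic name; the names used in the
    example below are hypothetical placeholders.
    """
    target_index = {name: i for i, name in enumerate(target_names)}
    # (source index, target index) pairs for landmarks defined in both sets.
    common = [(i, target_index[name])
              for i, name in enumerate(source_names) if name in target_index]
    common_src = {i for i, _ in common}
    common_tgt = {j for _, j in common}
    # Private landmarks exist in only one of the two protocols.
    private_src = [i for i in range(len(source_names)) if i not in common_src]
    private_tgt = [j for j in range(len(target_names)) if j not in common_tgt]
    return common, private_src, private_tgt


source = ["left_eye_outer", "left_eye_inner", "nose_tip", "mouth_left", "chin"]
target = ["left_eye_outer", "mouth_left", "left_ear", "nose_tip"]
common, private_src, private_tgt = split_landmarks(source, target)
print(common)       # [(0, 0), (2, 3), (3, 1)]
print(private_src)  # [1, 4]
print(private_tgt)  # [2]
```

In this toy case the two protocols share three landmarks, while the source keeps two private landmarks and the target one.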
3.2 Brief Overview of Supervised Descent Method
Before we detail how annotation transfer is performed, we first provide a brief review of the Supervised Descent Method (SDM) [26], as it forms the basis of the proposed approach. We note that our concept of transferring annotations is not limited to SDM but can be adapted to other existing cascaded regression-based approaches [6, 4].
In SDM, faces are centered to a mean shape and initial poses for all samples are initialised as the mean pose. Features are then extracted from the initial landmarks. The goal of SDM is to refine the current landmarks’ locations iteratively following a movement target. This is achieved by defining a loss function
The movement target
can be obtained by optimisation and approximated by linear regression. Here the loss function can be approximated by its second order Taylor expansion, where and are the Jacobian and Hessian matrices of the loss function , respectively. The approximate solution for can be obtained by .
Since features are not always differentiable and numerical differentiation is computationally expensive, the projection from derivatives of features to is estimated through the following regression problem:
where is the coefficient matrix. However, in the testing stage, the ground-truth locations are not available, thus it is impossible to extract features . SDM resorts to an approximation, using to replace the factor so that the regression can be learned as follows:
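In the standard SDM notation, the loss, its approximate minimiser and the two regressions (Eq. 1 to Eq. 3) can be sketched as below. All symbols are assumptions made for illustration: $x$ is the current landmark estimate, $x^{*}$ the ground truth, $\phi(\cdot)$ the shape-indexed features, $\Delta x$ the movement target, $J$ and $H$ the Jacobian and Hessian of the loss $f$, $R$ the coefficient matrix, and $b$ the bias standing in for $-R\,\phi(x^{*})$.

```latex
\begin{align*}
f(x + \Delta x) &= \lVert \phi(x + \Delta x) - \phi(x^{*}) \rVert_2^2 \tag{1}\\
f(x + \Delta x) &\approx f(x) + J^{\top} \Delta x
   + \tfrac{1}{2}\, \Delta x^{\top} H\, \Delta x,
   \qquad \Delta x = -H^{-1} J \\
\Delta x &= R \left( \phi(x) - \phi(x^{*}) \right) \tag{2}\\
\Delta x &\approx R\, \phi(x) + b \tag{3}
\end{align*}
```

Since the gradient of the loss in Eq. 1 is proportional to $\phi(x) - \phi(x^{*})$, the Newton step is linear in this feature difference, which is what Eq. 2 learns directly instead of computing $J$ and $H$.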
This formulation inspires the proposed transductive alignment method, which will be presented next. In particular, in cross-dataset annotation transfer, we can exploit the set of common landmarks to estimate . Given the estimated , we can achieve better regression performance than with the approximation in Eq. 3.
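One SDM stage, as reviewed in this section, amounts to a linear regression from features to landmark increments. The following is a minimal sketch rather than the implementation of [26]. It assumes the features are already extracted into a matrix, and it adds a small ridge term for numerical stability, which is an assumption of this sketch. The names `fit_sdm_stage` and `apply_sdm_stage` are hypothetical.

```python
import numpy as np


def fit_sdm_stage(features, delta_x, ridge=1e-3):
    """Fit delta_x ~= features @ R + b by ridge regression.

    features: (n_samples, n_features) features at the current estimate.
    delta_x:  (n_samples, 2 * n_landmarks) target landmark increments.
    """
    n = features.shape[0]
    # Append a constant column so the bias is learned jointly with R.
    design = np.hstack([features, np.ones((n, 1))])
    reg = ridge * np.eye(design.shape[1])
    reg[-1, -1] = 0.0  # do not penalise the bias term
    weights = np.linalg.solve(design.T @ design + reg, design.T @ delta_x)
    return weights[:-1], weights[-1]


def apply_sdm_stage(x, features, R, b):
    """Move the current landmark estimate by the predicted increment."""
    return x + features @ R + b


# Synthetic check: the regression recovers a known linear map.
rng = np.random.default_rng(0)
phi = rng.normal(size=(200, 8))
R_true = rng.normal(size=(8, 4))
b_true = rng.normal(size=4)
R, b = fit_sdm_stage(phi, phi @ R_true + b_true, ridge=1e-8)
x_new = apply_sdm_stage(np.zeros((200, 4)), phi, R, b)
```

In a full cascade, this fit would be repeated for several iterations, with features re-extracted at the updated landmark estimates after each stage.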
3.3 Transductive Alignment
The core step of our approach is transductive alignment. The goal of transductive alignment is to obtain the synthesized target training set , as shown in Fig. 2. We will obtain the transferred annotations with the guidance of common landmarks . The details are given as follows.
To extend the original SDM for transductive alignment, we need to estimate those unknown features . In SDM, all information from is absorbed implicitly into the bias term, causing a loss of information. Considering our task of forming the synthesized training dataset , we actually have extra information from common landmarks . Here we attempt to partially recover the missing term in Equation 3 by . Since the features extracted around the landmarks always share a considerable overlapping area, especially in densely annotated regions, features and thus have high correlation (footnote 1: we denote as , thus ).
More precisely, we assume we can estimate by a linear projection from , as
Substituting this back into Equation 2, we find that also has a linear relation with . Since we add relatively accurate estimates to the regression, it is more suitable to apply the following regression strategy:
where are the estimated source-type common and private landmarks, whilst are the ground-truth common landmarks. and denote the regression coefficient matrix and bias learned using the source dataset.
Figure 3 summarizes our transductive alignment step in an intuitive schematic diagram. We obtain substantial improvement using the reference information from features extracted from common landmarks. Note that we do not directly use the specific location information from in our estimation, mainly because we need to prevent improvement bias, in which the estimate on improves a great deal while is still poorly corrected, since provides global reference information beneficial to all the landmarks but only contributes to itself. Experiments in Section 5.1 demonstrate that the proposed transductive alignment method produces accurate source-type annotations on the target domain.
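The substitution described above can be written out as follows. The notation is assumed for illustration: $x$ denotes the estimated source-type landmarks, $x_c^{*}$ the ground-truth common landmarks, $\phi(\cdot)$ the shape-indexed features, $A$ and $c$ the linear projection that estimates $\phi(x^{*})$ from $\phi(x_c^{*})$, and $R_x$ the coefficient matrix of Equation 2.

```latex
\begin{align*}
\phi(x^{*}) &\approx A\, \phi(x_c^{*}) + c \\
\Delta x &= R_x \left( \phi(x) - \phi(x^{*}) \right)
  \approx R_x\, \phi(x) - R_x A\, \phi(x_c^{*}) - R_x c \\
&= R \begin{bmatrix} \phi(x) \\ \phi(x_c^{*}) \end{bmatrix} + b,
  \qquad R = \begin{bmatrix} R_x & -R_x A \end{bmatrix},
  \quad b = -R_x c
\end{align*}
```

Under these assumptions, the increment is a single linear regression on the stacked features, so $R$ and $b$ can be learned jointly without estimating $A$ and $c$ separately.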
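The benefit of adding common-landmark features to the regressor can be illustrated on synthetic data. This is a minimal sketch under stated assumptions, not the authors' implementation: random vectors stand in for real SIFT features, the ground-truth features are an approximately linear function of the common-landmark features, and all sizes, names and values are hypothetical.

```python
import numpy as np


def fit_linear(features, targets, ridge=1e-6):
    """Ridge-regularised least squares with an unpenalised bias column."""
    design = np.hstack([features, np.ones((features.shape[0], 1))])
    reg = ridge * np.eye(design.shape[1])
    reg[-1, -1] = 0.0
    weights = np.linalg.solve(design.T @ design + reg, design.T @ targets)
    return weights[:-1], weights[-1]


def rms_residual(features, targets, R, b):
    """Root mean square of the regression residual."""
    return float(np.sqrt(np.mean((features @ R + b - targets) ** 2)))


# Synthetic stand-ins for real features (all sizes are hypothetical).
rng = np.random.default_rng(0)
n, d, d_c, out = 300, 6, 3, 4
phi_c_star = rng.normal(size=(n, d_c))   # features at ground-truth common landmarks
A = rng.normal(size=(d_c, d))
phi_star = phi_c_star @ A + 0.01 * rng.normal(size=(n, d))  # features at full ground truth
phi = phi_star + rng.normal(size=(n, d))                    # features at current estimate
R_true = rng.normal(size=(d, out))
delta_x = (phi - phi_star) @ R_true                         # movement target

# SDM-style stage: regress on the current features only.
R_sdm, b_sdm = fit_linear(phi, delta_x)
base_err = rms_residual(phi, delta_x, R_sdm, b_sdm)

# Transductive stage: also regress on the common-landmark features.
joint = np.hstack([phi, phi_c_star])
R_tr, b_tr = fit_linear(joint, delta_x)
trans_err = rms_residual(joint, delta_x, R_tr, b_tr)
print(base_err, trans_err)
```

In this toy setting the joint regressor can explain the part of the increment that depends on the ground-truth features, whereas the SDM-style regressor can only absorb its average into the bias.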
3.4 Augmenting Source and Target Training Sets
Figure 4 shows the full pipeline of our proposed algorithm, including the step of augmenting the source and pseudo-labeled target training sets. We call the full pipeline Transductive Cascaded Regression (TCR).
Step 1 – Unlike the conventional cascaded model learning process (depicted with red arrows), we first obtain the pseudo-labeled target training set by transductive alignment described in Section 3.3.
Step 2 – We then filter erroneous transferred annotations in the pseudo-labeled target training set. This is done through comparing the estimated and ground truth common landmarks. In particular, we remove target training samples with error larger than in their estimated common landmarks. Only those samples with valid transferred annotations remain in the pseudo-labeled target training set. The filtered transferred annotations are clean and close to human annotation, as we will show in Section 5.1.
Step 3 – We combine the cleaned pseudo-labeled target training set, , with the source training set .
Step 4 – A model is learned using the combined training set.
Note that Step 3 is possible thanks to the transductive alignment step, which bridges the annotation gap between the source and target training sets. Next, we demonstrate the effectiveness of the proposed approach in cross-dataset and unseen data evaluation, and its robustness in handling challenging settings, such as large pose variations and severe occlusions.
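Steps 2 and 3 can be sketched as follows. The sketch assumes that the error on the common landmarks is the mean point-to-point distance expressed as a percentage of the interocular distance, and it uses the threshold of 7.5 from the experimental settings as the default. The function name, argument layout and toy data are hypothetical.

```python
import numpy as np


def filter_and_fuse(src_shapes, tgt_pseudo_shapes, tgt_common_pred,
                    tgt_common_gt, interocular, threshold=7.5):
    """Drop poorly transferred target samples, then merge with the source set.

    src_shapes:        (n_src, L, 2) source ground-truth landmarks.
    tgt_pseudo_shapes: (n_tgt, L, 2) transferred source-type landmarks.
    tgt_common_pred:   (n_tgt, C, 2) estimated common landmarks on the target.
    tgt_common_gt:     (n_tgt, C, 2) annotated common landmarks on the target.
    interocular:       (n_tgt,) interocular distance of each target face.
    """
    dist = np.linalg.norm(tgt_common_pred - tgt_common_gt, axis=2)
    error = 100.0 * dist.mean(axis=1) / interocular  # percent, per sample
    keep = error <= threshold
    fused = np.concatenate([src_shapes, tgt_pseudo_shapes[keep]], axis=0)
    return fused, keep


# Toy data: three target samples with common-landmark errors of 0, 5 and 10.
src = np.zeros((2, 5, 2))
pseudo = np.ones((3, 5, 2))
gt_c = np.zeros((3, 2, 2))
offsets = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
pred_c = gt_c + offsets[:, None, :]
fused, keep = filter_and_fuse(src, pseudo, pred_c, gt_c, np.full(3, 100.0))
print(keep)         # [ True  True False]
print(fused.shape)  # (4, 5, 2)
```

Only the landmark arrays are fused here; in practice the corresponding images would be selected with the same `keep` mask.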
4 Experimental Settings
Datasets: We selected a number of popular face alignment datasets for evaluation. These datasets differ in their distributions of pose variation and in their degrees of illumination and occlusion. Table 1 summarizes the datasets, with sample images provided in Fig. 1.
|Dataset||# Images||# Training / # Test||# Landmarks||# Miss-detected||Characteristics|
|LFW||13233||10389/2595||10||249||Mainly male with frontal faces|
|AFLW||21123||9565/2396||18||178||Challenging in pose variation|
|LFPW||1432||782/188||29||0||Faces are mostly frontal|
|HELEN||2330||2000/330||194||0||Extreme closeup & dense labels|
These four datasets can be combined in different ways to form source-target pairs, resulting in 12 possible combinations. Training on these combinations gives us 12 models for cross-dataset evaluation. Recall that we aim to predict source-type landmarks on the target testing set. We therefore require extra labelling to generate ground truth for evaluation. We collected all four types of annotations (LFW 10 pts, AFLW 18 pts, LFPW 29 pts, HELEN 194 pts) on HELEN and LFPW. In addition, we selected 40 samples randomly from the testing sets of LFW and AFLW with challenging pose variations, and labelled them manually with two other types of annotations (we did not label LFW and AFLW with 194 landmarks, since we did not have the special annotation tool) to form the testing sets LFW-C and AFLW-C. Note that all the additional labelled landmarks are used only for evaluation purposes.
Performance Evaluation: Similar to previous studies [6, 26, 4], we measured error as the Root Mean Square Error (RMSE) expressed as a percentage of the interocular distance. Estimates with error larger than 10% are reported as failure cases.
Comparison: To our knowledge, no method exists for transferring annotations to perform cross-dataset face alignment, so no directly comparable baseline is available. We choose SDM [26] as our baseline method, because it is the closest to our approach and can be compared under the same initialization and feature settings. It achieves state-of-the-art performance on most of the popular benchmark datasets. However, it cannot exploit additional datasets due to annotation discrepancy. Since the training code for [26] is not publicly available, and most of the employed databases are not shared, we re-implemented the method and verified the correctness of our implementation on LFW and HELEN.
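The count of 12 follows from treating each source-target pair as ordered, as this short sketch shows:

```python
from itertools import permutations

datasets = ["LFW", "AFLW", "LFPW", "HELEN"]
# Ordered pairs: (source, target) and (target, source) are distinct settings.
pairs = list(permutations(datasets, 2))
print(len(pairs))  # 12
```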
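The error measure can be sketched as below. The sketch takes one reading of the metric, namely the root of the mean squared point-to-point distance over landmarks divided by the distance between the pupils; this averaging convention is an assumption, as are the function names and the toy coordinates.

```python
import numpy as np


def rmse_percent(pred, gt, left_pupil, right_pupil):
    """RMSE over landmarks as a percentage of the interocular distance."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    interocular = np.linalg.norm(
        np.asarray(left_pupil, dtype=float) - np.asarray(right_pupil, dtype=float))
    sq_dist = np.sum((pred - gt) ** 2, axis=1)  # squared distance per landmark
    return 100.0 * np.sqrt(sq_dist.mean()) / interocular


def failure_rate(errors, threshold=10.0):
    """Fraction of samples whose error exceeds the failure threshold."""
    return float(np.mean(np.asarray(errors, dtype=float) > threshold))


# Toy face: every landmark is off by 5 pixels, pupils are 50 pixels apart.
gt = np.array([[30.0, 40.0], [80.0, 40.0], [55.0, 70.0]])
pred = gt + np.array([3.0, 4.0])
err = rmse_percent(pred, gt, left_pupil=gt[0], right_pupil=gt[1])
print(err)                              # 10.0
print(failure_rate([5.0, 12.0, 10.0]))  # one of three samples exceeds 10%
```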
Implementation Details: In our framework, faces were detected using a multi-view Viola-Jones detector [27], which returned not only a bounding box for each face but also a rough pose category label. The number of miss-detected images in each dataset is reported in Table 1. We initialised each face by aligning it to a mean pose in a 250 × 250 normalised square. We set the initial landmarks for Step 1 to the mean locations over all samples in the training set. We tuned parameters following the same settings as in [26]: training samples were perturbed 10 times by a random rigid transform, and we reduced the dimensionality of the regression data by performing PCA that preserves 98% of the energy of the extracted features. We used SIFT features with fixed direction, and each descriptor covers 20 × 20 pixels. In the 12 pairs of cross-dataset evaluations, the number of common landmarks ranges from 6 to 12. The threshold error for selecting valid pseudo annotations is fixed at 7.5 throughout the experiments.
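The PCA step that preserves 98% of the energy can be sketched as below. The sketch assumes that energy means the cumulative share of variance captured by the leading principal components; the function name `pca_energy` and the synthetic data are hypothetical.

```python
import numpy as np


def pca_energy(features, energy=0.98):
    """Return the mean and the leading principal axes that keep `energy`
    of the total variance of `features` (n_samples, n_features)."""
    mean = features.mean(axis=0)
    centred = features - mean
    # Singular values of the centred data give the per-component variance.
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    var = s ** 2
    cumulative = np.cumsum(var) / var.sum()
    # Smallest k whose cumulative share reaches the requested energy.
    k = min(int(np.searchsorted(cumulative, energy)) + 1, len(s))
    return mean, vt[:k]


# Synthetic features: one dominant direction and two weak ones.
rng = np.random.default_rng(0)
data = rng.normal(size=(500, 3)) * np.array([10.0, 1.0, 0.1])
mean, basis = pca_energy(data, energy=0.98)
reduced = (data - mean) @ basis.T
print(basis.shape, reduced.shape)
```

The same `mean` and `basis` learned on the training features would be reused to project the test features before regression.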
5.1 Evaluating the Effectiveness of Transductive Alignment
In this experiment, we wish to verify the effectiveness of the transductive alignment method (Step 1 in Sec. 3.4) by evaluating the accuracy of transferred annotations on the target training set. Specifically, we measured the mean error of (i) the transferred common landmarks, and (ii) the transferred private landmarks, on the target training set. Since only the LFPW and HELEN datasets have all four types of ground truth annotations (see Sec. 4), our evaluation was limited to 6 source-target pairs. The average errors of transferred common landmarks and transferred private landmarks are 3.35 and 3.87, respectively. Both errors are very close to human annotation performance, which is commonly within the range between 3.0 and 4.5.
Note that a straightforward way to label the target training set is to train a model on the source training set and apply the model to infer landmarks on the target. Figure 5 shows the results obtained with such a naïve cross-dataset alignment, and compares them with those yielded by the proposed transductive alignment method. Clearly, our approach outperforms the naïve alternative, thanks to the guidance offered by the common landmarks.
5.2 Evaluating TCR on Common Landmarks
In this experiment, we evaluate the performance of our full Transductive Cascaded Regression model (the training steps described in Sec. 3.4). We compared the proposed model against the state-of-the-art method SDM [26]. We used it to represent a ‘closed-world’ method without annotation transfer and source-target set augmentation. In addition, we compared against a naïve fusion method, which simply combines the source and target sets without transductive alignment. In particular, this naïve method was trained only with common landmarks from the source and target training sets. It can therefore predict only common landmark locations during testing. Note that although both our method and SDM are capable of predicting full source-type landmarks, we used them only for predicting common landmarks.
Table 2 summarizes the results. In general the proposed TCR outperforms the two baselines. Several observations are outlined below:
From the diagonal values under the SDM column, we can observe that the alignment result is at its best when training and testing are conducted on the same dataset. Nevertheless, the performance deteriorates in cross-dataset evaluations. These results strongly suggest the existence of dataset bias.
The proposed TCR method obtains superior performance over the SDM, which assumes a ‘closed world’ training/test environment.
The proposed model also outperforms the naïve fusion method in most of the source-target pairs. The reason is that our model learns from a richer set of landmarks transferred from the source domain. In comparison, the naïve fusion method learns from common landmarks alone. The richer set of landmarks offers more constraints on the face shape, which in turn leads to more accurate estimation. This highlights the value of performing transductive alignment for annotation enrichment.
|Closed-World (SDM [26])||Naïve Training Set Fusion||The Proposed TCR|
5.3 Evaluating TCR on All Landmarks
Our evaluation in Section 5.2 was confined to common landmarks between source and target. Owing to its unique capability of standardizing the annotation spaces between datasets, our model can readily exploit additional independent sources to enrich the annotations of a target dataset. To evaluate this capability, in this experiment we examined the performance of our algorithm over full source-type annotations. Similar to Section 5.2, we compared our model with the ‘closed-world’ model (using SDM [26]).
Table 3 summarizes the results. (No results are reported on (HELEN / LFW-C) and (HELEN / AFLW-C), since we do not have the special tool for annotating 194 landmarks as in HELEN.) It is observed that the TCR method outperforms the ‘closed-world’ method in all cases. Note that the proposed method employs target+source training data for learning a model, whilst the ‘closed-world’ method learns only from the source training set. This is arguably an ‘unfair’ comparison, but it highlights the importance of fusing different datasets for better generalisation. The fusion is not possible without the proposed transductive alignment approach. Figure 6 shows the results of our transductive algorithm.
|Closed-World (SDM [26])||The Proposed TCR||Relative Improvement|
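A relative improvement figure can be computed as sketched below. The sketch assumes that it is defined as the percentage reduction of the error with respect to the baseline; the function name and the error values are hypothetical and serve only to illustrate the arithmetic.

```python
def relative_improvement(baseline_error, new_error):
    """Percentage reduction of the error relative to the baseline."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical error values, not taken from the reported results.
improvement = relative_improvement(8.0, 6.0)
print(improvement)  # 25.0
```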
5.4 Evaluating TCR on Unseen Samples with Occlusions
In previous experiments we focused on training a model using the target training set and testing it on the target testing set. In this experiment, we evaluated our model in an unseen domain in which many faces are partially occluded. We selected the challenging Caltech Occluded Faces in the Wild (COFW) [4] for our evaluation. The training set of COFW consists of nearly all training samples in LFPW and 507 extra faces with heavy occlusions. Its testing set contains 500 challenging samples with occlusion. The annotation type is identical to LFPW, with 29 landmarks annotated on each face. To increase the challenge, we did not train our model on COFW, but on the combination of LFPW and AFLW after applying the transductive alignment. Here, we used LFPW as source training data and AFLW as target training data. We compared our method against RCPR [4], which is trained on COFW and specifically designed for handling faces with heavy occlusion. We used the publicly available implementation with the same parameter settings.
The quantitative results and qualitative examples are summarised in Fig. 7. As expected, our model performs much better than the ‘closed-world’ SDM method, which was trained only on source data, i.e. LFPW. Interestingly, the proposed model, which is trained on LFPW and AFLW but without COFW, achieves competitive or even better results than SDM with COFW as the training set. The results suggest the effectiveness of our model in combining the source LFPW with the target AFLW, leading to superior generalization even without the dedicated COFW as training set.
We have formulated a novel Transductive Cascaded Regression (TCR) method, whose core is a transductive alignment approach capable of transferring annotation style from one dataset to another seamlessly. Effectively bridging the annotation space allows one to combine two different datasets with diverse characteristics. We have shown that a model trained on combined datasets performs well in cross-dataset evaluation and even in an unseen domain with severe occlusion. In particular, our method achieves 16.6% improvement on average against the ‘closed-world’ method when performing cross-dataset evaluations, and 11.4% improvement on average compared to naïve training set fusion.
[1] Belhumeur, P.N., Jacobs, D.W., Kriegman, D., Kumar, N.: Localizing parts of faces using a consensus of exemplars. In: CVPR. pp. 545–552 (2011)
[2] Berg, T., Belhumeur, P.N.: Poof: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In: CVPR. pp. 955–962 (2013)
[3] Bourdev, L., Malik, J.: Poselets: Body part detectors trained using 3d human pose annotations. In: ICCV. pp. 1365–1372 (2009)
[4] Burgos-Artizzu, X.P., Perona, P., Dollár, P.: Robust face landmark estimation under occlusion. In: CVPR (2013)
[5] Burl, M.C., Weber, M., Perona, P.: A probabilistic approach to object recognition using local photometry and global geometry. In: ECCV. pp. 628–641 (1998)
[6] Cao, X., Wei, Y., Wen, F., Sun, J.: Face alignment by explicit shape regression. In: CVPR. pp. 2887–2894 (2012)
[7] Cevikalp, H., Triggs, B., Franc, V.: Face and landmark detection by using cascade of classifiers. In: FG. pp. 1–7 (2013)
[8] Chen, K., Gong, S., Xiang, T., Loy, C.C.: Cumulative attribute space for age and crowd density estimation. In: CVPR. pp. 2467–2474 (2013)
[9] Cootes, T.F., Ionita, M.C., Lindner, C., Sauer, P.: Robust and accurate shape model fitting using random forest regression voting. In: ECCV. pp. 278–291 (2012)
[10] Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. In: ECCV. pp. 484–498 (1998)
[11] Fernando, B., Habrard, A., Sebban, M., Tuytelaars, T., et al.: Unsupervised visual domain adaptation using subspace alignment. In: ICCV (2013)
[12] Huang, G.B., Mattar, M., Berg, T., Learned-Miller, E.: Labeled faces in the wild: A database for studying face recognition in unconstrained environments (2007)
[13] Khosla, A., Zhou, T., Malisiewicz, T., Efros, A.A., Torralba, A.: Undoing the damage of dataset bias. In: ECCV. pp. 158–171 (2012)
[14] Kostinger, M., Wohlhart, P., Roth, P.M., Bischof, H.: Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In: ICCV-Workshop. pp. 2144–2151 (2011)
[15] Kulis, B., Saenko, K., Darrell, T.: What you saw is not what you get: Domain adaptation using asymmetric kernel transforms. In: CVPR. pp. 1785–1792 (2011)
[16] Le, V., Brandt, J., Lin, Z., Bourdev, L., Huang, T.S.: Interactive facial feature localization. In: ECCV. pp. 679–692 (2012)
[17] Saenko, K., Kulis, B., Fritz, M., Darrell, T.: Adapting visual category models to new domains. In: ECCV. pp. 213–226 (2010)
[18] Smith, B.M., Zhang, L.: Collaborative facial landmark localization for transferring annotations across datasets. In: ECCV. pp. 78–93 (2014)
[19] Sun, Y., Wang, X., Tang, X.: Deep convolutional network cascade for facial point detection. In: CVPR. pp. 3476–3483 (2013)
[20] Sun, Y., Wang, X., Tang, X.: Hybrid deep learning for face verification. In: ICCV. pp. 1489–1496 (2013)
[21] Sun, Y., Wang, X., Tang, X.: Deep learning face representation from predicting 10,000 classes. In: CVPR. pp. 1891–1898 (2014)
[22] Tommasi, T., Quadrianto, N., Caputo, B., Lampert, C.H.: Beyond dataset bias: Multi-task unaligned shared knowledge transfer. In: ACCV. pp. 1–15 (2013)
[23] Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: CVPR. pp. 1521–1528 (2011)
[24] Van De Sande, K.E., Gevers, T., Snoek, C.G.: Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9), 1582–1596 (2010)
[25] Wang, Z., Wang, S., Ji, Q.: Capturing complex spatio-temporal relations among facial muscles for facial expression recognition. In: CVPR. pp. 3422–3429 (2013)
[26] Xiong, X., De la Torre, F.: Supervised descent method and its applications to face alignment. In: CVPR (2013)
[27] Zhang, C., Zhang, Z.: Winner-take-all multiple category boosting for multi-view face detection. Tech. rep. (2009)
[28] Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Facial landmark detection by deep multi-task learning. In: ECCV. pp. 94–108 (2014)
[29] Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Learning and transferring multi-task deep representation for face alignment. arXiv:1408.3967 (2014)
[30] Zhu, Z., Luo, P., Wang, X., Tang, X.: Deep learning identity-preserving face space. In: ICCV. pp. 113–120 (2013)