We present an algorithm for simultaneous face detection, landmarks localization, pose estimation and gender recognition using deep convolutional neural networks (CNN). The proposed method, called HyperFace, fuses the intermediate layers of a deep CNN using a separate CNN, followed by a multi-task learning algorithm that operates on the fused features. It exploits the synergy among the tasks, which boosts their individual performances. Additionally, we propose two variants of HyperFace: (1) HyperFace-ResNet, which builds on the ResNet-101 model and achieves significant improvement in performance, and (2) Fast-HyperFace, which uses a high-recall fast face detector for generating region proposals to improve the speed of the algorithm. Extensive experiments show that the proposed models are able to capture both global and local information in faces and perform significantly better than many competitive algorithms on each of these four tasks.
Detection and analysis of faces is a challenging problem in computer vision, and has been actively researched for applications such as face verification, face tracking and person identification. Although recent methods based on deep Convolutional Neural Networks (CNN) have achieved remarkable results for the face detection task, it is still difficult to obtain facial landmark locations, head pose estimates and gender information from face images containing extreme pose, illumination and resolution variations. The tasks of face detection, landmark localization, pose estimation and gender classification have generally been solved as separate problems. Recently, it has been shown that learning correlated tasks simultaneously can boost the performance of individual tasks.
In this paper, we present a novel framework based on CNNs for simultaneous face detection, facial landmarks localization, head pose estimation and gender recognition from a given image (see Figure 1). We design a CNN architecture to learn common features for these tasks and exploit the synergy among them. We exploit the fact that information contained in features is hierarchically distributed throughout the network. Lower layers respond to edges and corners, and hence have better localization properties; they are more suitable for learning the landmarks localization and pose estimation tasks. On the other hand, deeper layers are class-specific and suitable for learning complex tasks such as face detection and gender recognition. It is evident that we need to make use of all the intermediate layers of a deep CNN in order to train the different tasks under consideration. We refer to the set of intermediate layer features as hyperfeatures. We borrow this term from , which uses it to denote a stack of local histograms for multilevel image coding.
Since a CNN architecture contains multiple layers with hundreds of feature maps in each layer, the overall dimension of the hyperfeatures is too large to be efficient for learning multiple tasks. Moreover, the hyperfeatures must be combined in a way that efficiently encodes the features common to the multiple tasks. This can be handled using feature fusion techniques. Feature fusion aims to transform the features to a common subspace where they can be combined linearly or non-linearly. Recent advances in deep learning have shown that CNNs are capable of approximating an arbitrarily complex function. Hence, we construct a separate fusion-CNN to fuse the hyperfeatures. In order to learn the tasks, we train them simultaneously using multiple loss functions. In this way, the features get better at understanding faces, which leads to improvements in the performance of the individual tasks. The deep CNN combined with the fusion-CNN can be learned together in an end-to-end fashion.
We also study the performance of the face detection, landmarks localization, pose estimation and gender recognition tasks using the off-the-shelf Region-based CNN (R-CNN)  approach. Although R-CNN for face detection has been explored in DP2MFD , we provide a comprehensive study of all these tasks based on R-CNN. Furthermore, we study the multitask approach without fusing the intermediate layers of the CNN. Detailed experiments show that the multi-task learning method performs better than methods based on individual learning. Fusing the intermediate layer features provides an additional performance boost. This paper makes the following contributions.
We propose two novel CNN architectures that perform face detection, landmarks localization, pose estimation and gender recognition by fusing the intermediate layers of the network. The first one called HyperFace is based on AlexNet  model, while the second one called HyperFace-ResNet (HF-ResNet) is based on ResNet-101  model.
We propose two post-processing methods: Iterative Region Proposals (IRP) and Landmarks-based Non-Maximum Suppression (L-NMS), which leverage the multi-task information obtained from the CNN to improve the overall performance.
We study the performance of R-CNN-based approaches for individual tasks and the multi-task approach without intermediate layer fusion.
We achieve significant improvement in performance on challenging unconstrained datasets for all of these four tasks.
This paper is organized as follows. Section 2 reviews related work. Section 3 describes the proposed HyperFace framework in detail. Section 4 describes the implementation of R-CNN, Multitask_Face and HF-ResNet approaches. Section 5 provides the results of HyperFace and HF-ResNet along with R-CNN baselines on challenging datasets. Finally, Section 6 concludes the paper with a brief summary and discussion.
Multi-Task Learning: Multi-task learning (MTL) was first analyzed in detail by Caruana . Since then, several approaches have adopted MTL for solving different problems in computer vision. One of the earlier approaches for jointly addressing the tasks of face detection, pose estimation, and landmark localization was proposed in  and later extended in . This method is based on a mixture of trees with a shared pool of parts in the sense that every facial landmark is modeled as a part and uses global mixtures to capture the topological changes due to viewpoint variations. A joint cascade-based method was recently proposed in  for simultaneously detecting faces and landmark points on a given image. This method yields improved detection performance by incorporating a face alignment step in the cascade structure.
Multi-task learning using CNNs has also been studied recently. Eigen and Fergus  proposed a multi-scale CNN for simultaneously predicting depth, surface normals and semantic labels from an image. They apply CNNs at three different scales, where the output of the smaller-scale network is fed as input to the larger one. UberNet  adopts a similar concept of simultaneously training low-, mid- and high-level vision tasks. It fuses all the intermediate layers of a CNN at three different scales of the image pyramid for multi-task training on diverse sets. Gkioxari et al.  train a CNN for person pose estimation and action detection, using features only from the last layer. The use of MTL for face analysis is somewhat limited. Zhang et al.  used an MTL-based CNN for facial landmark detection along with the tasks of discrete head-yaw estimation, gender recognition, smile and glasses detection. In their method, the predictions for all these tasks were pooled from the same feature space. Instead, we strategically design the network architecture such that the tasks exploit low-level as well as high-level features of the network. We also jointly predict the tasks of face detection and landmark localization. These two tasks always go hand-in-hand and are used in most end-to-end face analysis systems.
Feature Fusion: Fusing intermediate layers from CNN to bring both geometry and semantically rich features together has been used by quite a few methods. Hariharan et al.  proposed Hypercolumns to fuse pool2, conv4 and fc7 layers of AlexNet  for image segmentation. Yang and Ramanan  proposed DAG-CNNs, which extract features from multiple layers to reason about high, mid and low-level features for image classification. Sermanet et al.  merge the 1st stage output of CNN to the classifier input after sub-sampling, for the application of pedestrian detection.
Face detection: The Viola-Jones detector  is a classic method which uses cascaded classifiers on Haar-like features to detect faces. This method provides real-time face detection, but works best for full, frontal, and well-lit faces. Deformable Parts Model (DPM) -based face detection methods have also been proposed in the literature, where a face is essentially defined as a collection of parts , . It has been shown that in unconstrained face detection, features like HOG or Haar wavelets do not capture discriminative facial information under varying illumination or pose. To overcome these limitations, various deep CNN-based face detection methods have been proposed in the literature. These methods have produced state-of-the-art results on many challenging publicly available face detection datasets. Some of the other recent face detection methods include NPDFaces , PEP-Adapt , and .
Landmarks localization: Fiducial point extraction, or landmarks localization, is one of the most important steps in face recognition. Several approaches have been proposed in the literature. These include both regression-based  and model-based  methods. While the former learns the shape increment given a mean initial shape, the latter trains an appearance model to predict the keypoint locations. CNN-based landmark localization methods have also been proposed in recent years and have achieved remarkable performance.
Although much work has been done for localizing landmarks for frontal faces, limited attention has been given to profile faces which occur more often in real world scenarios. Jourabloo and Liu recently proposed PIFA  that estimates 3D landmarks for large pose face alignment by integrating a 3D point distribution model with a cascaded coupled-regressor. Similarly, 3DDFA  fits a dense 3D model by estimating its parameters using a CNN. Zhu et al.  proposed a cascaded compositional learning approach that combines shape prediction from multiple domain specific regressors.
Pose estimation: The task of head pose estimation is to infer the orientation of a person’s head relative to the camera view. It is useful in face verification for matching face similarity across different orientations. Non-linear manifold-based methods have been proposed in , ,  to classify face images based on pose. A survey of various head pose estimation methods is provided in .
Gender recognition: Previous works on gender recognition have focused on finding good discriminative features for classification. Most previous methods use one feature or a combination of features such as LBP, SURF, HOG or SIFT. In recent years, attribute-based methods for face recognition have gained a lot of traction. Binary classifiers were used in  for each attribute such as male, long hair, white, etc. Separate features were computed for the different attributes and used to train an individual SVM for each attribute. CNN-based methods have also been proposed for learning attribute-based representations in , .
We propose a single CNN model for simultaneous face detection, landmark localization, pose estimation and gender classification. The network architecture is deep in both vertical and horizontal directions, i.e., it has both top-down and lateral connections, as shown in Figure 2. In this section, we provide a brief overview of the system and then discuss the different components in detail.
The proposed algorithm, called HyperFace, consists of three modules. The first one generates class-independent region proposals from the given image and scales them to 227 × 227 pixels. The second module is a CNN which takes in the resized candidate regions and classifies them as face or non-face. If a region gets classified as a face, the network additionally provides the facial landmark locations, estimated head pose and gender information. The third module is a post-processing step which involves Iterative Region Proposals (IRP) and Landmarks-based Non-Maximum Suppression (L-NMS) to boost the face detection score and improve the performance of the individual tasks.
We start with AlexNet  for image classification. The network consists of five convolutional layers along with three fully connected layers. We initialize the network with the weights of the R-CNN_Face network trained for the face detection task, as described in Section 4. All the fully connected layers are removed, as they encode image-classification-specific information that is not needed for pose estimation and landmark extraction. We exploit the following two observations to create our network. 1) The features in a CNN are distributed hierarchically in the network. While the lower-layer features are effective for landmarks localization and pose estimation, the higher-layer features are suitable for more complex tasks such as detection or classification. 2) Learning multiple correlated tasks simultaneously builds a synergy and improves the performance of individual tasks, as shown in [6, 65]. Hence, in order to simultaneously learn face detection, landmarks, pose and gender, we need to fuse the features from the intermediate layers of the network (hyperfeatures) and learn multiple tasks on top of them. Since adjacent layers are highly correlated, we do not consider all the intermediate layers for fusion.
We fuse the max1, conv3 and pool5 layers of AlexNet using a separate network. A naive way to fuse them is to directly concatenate the features. However, since the feature maps for these layers have different dimensions (27×27×96, 13×13×384 and 6×6×256, respectively), they cannot be concatenated directly. We therefore add conv1a and conv3a convolutional layers to the max1 and conv3 layers to obtain consistent feature maps of dimension 6×6×256 at their outputs. We then concatenate the outputs of these layers along with pool5 to form 6×6×768-dimensional feature maps. This dimension is still quite high for training a multi-task framework. Hence, a 1×1 kernel convolution layer (conv_all) is added to reduce the dimension to 6×6×192. We add a fully connected layer (fc_all) to conv_all, which outputs a 3072-dimensional feature vector. At this point, we split the network into five separate branches corresponding to the different tasks. We add fc_detection, fc_landmarks, fc_visibility, fc_pose and fc_gender fully connected layers, each of dimension 512, to fc_all. Finally, a fully connected layer is added to each branch to predict the individual task labels. After every convolution or fully connected layer, we apply the Rectified Linear Unit (ReLU). We do not include any pooling operation in the fusion network, as pooling provides local invariance, which is not desired for the face landmark localization task. Task-specific loss functions are then used to learn the weights of the network.
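The shape bookkeeping of this fusion can be sketched in a few lines of numpy. This is only a dimensional illustration: average pooling and random channel projections stand in for the learned conv1a, conv3a and conv_all filters, and the standard AlexNet feature-map shapes are assumed.

```python
import numpy as np

def strided_pool(x, k, s):
    # naive average pooling with kernel k and stride s; x has shape (C, H, W)
    C, H, W = x.shape
    Ho, Wo = (H - k) // s + 1, (W - k) // s + 1
    out = np.empty((C, Ho, Wo))
    for i in range(Ho):
        for j in range(Wo):
            out[:, i, j] = x[:, i*s:i*s+k, j*s:j*s+k].mean(axis=(1, 2))
    return out

def conv1x1(x, w):
    # a 1x1 convolution is a per-pixel linear map over channels; w: (Cout, Cin)
    C, H, W = x.shape
    return (w @ x.reshape(C, H * W)).reshape(w.shape[0], H, W)

rng = np.random.default_rng(0)
max1  = rng.standard_normal((96, 27, 27))    # AlexNet max1 output
conv3 = rng.standard_normal((384, 13, 13))   # AlexNet conv3 output
pool5 = rng.standard_normal((256, 6, 6))     # AlexNet pool5 output

# conv1a stand-in: 4x4 stride-4 spatial reduction, project 96 -> 256 channels
a = conv1x1(strided_pool(max1, 4, 4), rng.standard_normal((256, 96)))
# conv3a stand-in: 2x2 stride-2 spatial reduction, project 384 -> 256 channels
b = conv1x1(strided_pool(conv3, 2, 2), rng.standard_normal((256, 384)))

fused = np.concatenate([a, b, pool5], axis=0)               # 768 x 6 x 6
conv_all = conv1x1(fused, rng.standard_normal((192, 768)))  # 192 x 6 x 6
```

All three branches arrive at the same 6×6 spatial resolution, which is what makes the channel-wise concatenation possible.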
We use the AFLW  dataset for training the HyperFace network. It contains 25,993 faces in 21,997 real-world images with full pose, expression, ethnicity, age and gender variations. It provides annotations for 21 landmark points per face, along with the face bounding box, face pose (yaw, pitch and roll) and gender information. We randomly selected 1,000 images for testing and used the rest for training the network. Different loss functions are used for training the tasks of face detection, landmark localization, pose estimation and gender classification.
Face Detection: We use the Selective Search  algorithm, as in R-CNN , to generate region proposals for faces in an image. A region having an Intersection over Union (IoU) overlap of more than 0.5 with the ground truth bounding box is considered a positive sample (l = 1). Candidate regions with IoU overlap less than 0.35 are treated as negative instances (l = 0). All other regions are ignored. We use the softmax loss function given by (1) for training the face detection task:

loss_D = −(1 − l) · log(1 − p) − l · log(p)    (1)
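The IoU-based labeling of proposals can be sketched as follows; the 0.5/0.35 thresholds are exposed as parameters, and boxes are assumed to be corner-format tuples:

```python
def iou(a, b):
    # intersection-over-union of two boxes given as (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def label_proposal(proposal, gt, pos_thresh=0.5, neg_thresh=0.35):
    # 1 = face, 0 = non-face, None = ignored during training
    o = iou(proposal, gt)
    if o > pos_thresh:
        return 1
    if o < neg_thresh:
        return 0
    return None
```

Proposals falling between the two thresholds are neither positive nor negative, so they contribute no gradient.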
where p is the probability that the candidate region is a face. The probability values p and 1 − p are obtained from the last fully connected layer for the detection task.
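For a single region this binary cross-entropy can be written directly:

```python
import math

def detection_loss(p, l):
    # softmax (cross-entropy) loss for the face / non-face decision:
    # p = P(face) from the network, l = 1 for a positive region, 0 otherwise
    return -(1 - l) * math.log(1 - p) - l * math.log(p)
```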
Landmarks Localization: We use the 21-point markup for face landmark locations as provided in the AFLW  dataset. Since the faces have full pose variations, some of the landmark points are invisible. The dataset provides annotations only for the visible landmarks. We consider bounding-box regions with IoU overlap greater than 0.35 with the ground truth for learning this task, while ignoring the rest. A region can be characterized by {x, y, w, h}, where (x, y) are the co-ordinates of the center of the region and w, h are the width and height of the region, respectively. Each visible landmark point is shifted with respect to the region center (x, y) and normalized by (w/2, h/2), as given by (2):

(a_i, b_i) = ( (x_i − x) / (w/2), (y_i − y) / (h/2) )    (2)
where the (x_i, y_i)’s are the given ground truth fiducial co-ordinates. The (a_i, b_i)’s are treated as labels for training the landmark localization task using the Euclidean loss weighted by the visibility factor. The loss in predicting the landmark locations is computed from (3):

loss_L = (1 / 2N) Σ_{i=1}^{N} v_i ( (x̂_i − a_i)² + (ŷ_i − b_i)² )    (3)
where (x̂_i, ŷ_i) is the i-th landmark location predicted by the network, relative to a given region, and N is the total number of landmark points (21 for AFLW). The visibility factor v_i is 1 if the i-th landmark is visible in the candidate region, and 0 otherwise. This implies that there is no loss corresponding to invisible points, and hence they do not take part in back-propagation.
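A minimal numpy version of this visibility-weighted Euclidean loss, assuming landmarks are stored as (N, 2) arrays of normalized offsets:

```python
import numpy as np

def landmark_loss(pred, gt, vis):
    # Euclidean loss over N normalized landmark offsets, weighted by
    # per-point visibility: invisible points contribute zero loss.
    sq = ((pred - gt) ** 2).sum(axis=1)   # (N,) squared distances
    return (vis * sq).sum() / (2 * len(pred))
```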
Learning Visibility: We also learn the visibility factor in order to test the presence of each predicted landmark. For a given region with overlap higher than 0.35, we use a simple Euclidean loss to train the visibility, as shown in (4):

loss_V = (1 / N) Σ_{i=1}^{N} (v̂_i − v_i)²    (4)
where v̂_i is the predicted visibility of the i-th landmark. The true visibility v_i is 1 if the i-th landmark is visible in the candidate region, and 0 otherwise.
Pose Estimation: We use the Euclidean loss to train the head pose estimates of roll (p_1), pitch (p_2) and yaw (p_3). We compute the loss for a candidate region having an overlap of more than 0.5 with the ground truth, from (5):

loss_P = ( (p̂_1 − p_1)² + (p̂_2 − p_2)² + (p̂_3 − p_3)² ) / 3    (5)
where (p̂_1, p̂_2, p̂_3) are the estimated pose labels.
Gender Recognition: Predicting gender is a two-class problem similar to face detection. For a candidate region with an overlap of more than 0.5 with the ground truth, we compute the softmax loss given in (6):

loss_G = −(1 − g) · log(p_0) − g · log(p_1)    (6)
where g ∈ {0, 1} is the ground truth gender label. Here, (p_0, p_1) is the two-dimensional probability vector computed from the network.
The total loss is computed as the weighted sum of the five individual losses, as shown in (7):

loss_full = Σ_t λ_t · loss_t    (7)
where loss_t is the loss for task t from the set of tasks T = {Detection, Landmarks, Visibility, Pose, Gender}. The weight parameter λ_t is decided based on the importance of the task in the overall loss. We choose (λ_D, λ_L, λ_V, λ_P, λ_G) = (1, 5, 0.5, 5, 2) for our experiments. Higher weights are assigned to the landmark localization and pose estimation tasks as they need spatial accuracy.
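The weighted combination can be sketched as a dictionary of per-task losses and hand-set weights. The weight values below are illustrative, reflecting the text's point that landmarks and pose receive higher weight; they are fixed by hand, not learned:

```python
def total_loss(losses, weights):
    # weighted sum of the per-task losses (Eq. 7-style combination)
    return sum(weights[t] * losses[t] for t in losses)

# illustrative weights emphasizing the spatially sensitive tasks
weights = {"detection": 1.0, "landmarks": 5.0, "visibility": 0.5,
           "pose": 5.0, "gender": 2.0}
```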
From a given test image, we first extract the candidate region proposals using Selective Search . For each region, we predict the task labels by a forward pass through the HyperFace network. Only those regions whose detection scores are above a certain threshold are classified as faces and processed for the subsequent tasks. The predicted landmark points are scaled and shifted to the image co-ordinates using (8):

(x_i, y_i) = ( x̂_i · w/2 + x, ŷ_i · h/2 + y )    (8)
where (x̂_i, ŷ_i) are the predicted locations of the i-th landmark from the network, and {x, y, w, h} are the region parameters defined in (2). Points with predicted visibility less than a certain threshold are marked invisible. The pose labels obtained from the network are the estimated roll, pitch and yaw for the face region. The gender is assigned according to the label with the maximum predicted probability.
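Assuming the half-width/half-height normalization of (2), the inverse mapping back to image coordinates is a one-liner per point:

```python
def to_image_coords(pred, region):
    # map normalized landmark offsets back to image coordinates;
    # region = (cx, cy, w, h): center, width and height of the detection
    cx, cy, w, h = region
    return [(ax * w / 2 + cx, ay * h / 2 + cy) for ax, ay in pred]
```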
There are two major issues while using proposal-based face detection. First, the proposals might not be able to capture small and difficult faces, hence reducing the overall recall of the system. Second, the proposal boxes might not be well localized with the actual face region. It is a common practice to use bounding-box regression  as a post processing step to improve the localization of the detected face box. This adds an additional burden of training regressors to learn the transformation from the detected candidate box to the annotated face box. Moreover, the localization is still weak since the regressors are usually linear. Recently, Gidaris and Komodakis proposed LocNet  which tries to solve these limitations by refining the detection bounding box. Given a set of initial bounding box proposals, it generates new sets of bounding boxes that maximize the likelihood of each row and column within the box. It allows an accurate inference of bounding box under a simple probabilistic framework.
Instead of using the probabilistic framework , we solve the above mentioned issues in an iterative way using the predicted landmarks. The fact that we obtain landmark locations along with the detections, enables us to improve the post-processing step so that all the tasks benefit from it. We propose two novel methods: Iterative Region Proposals (IRP) and Landmarks-based Non-Maximum Suppression (L-NMS) to improve the performance. IRP improves the recall by generating more candidate proposals by using the predicted landmarks information from the initial set of region proposals. On the other hand, L-NMS improves the localization by re-adjusting the detected bounding boxes according to the predicted landmarks and performing NMS on top of them. No additional training is required for these methods.
Iterative Region Proposals (IRP): We use a fast version of Selective Search, which extracts around 2,000 regions from an image; we call this version Fast_SS. It is quite possible that some faces with poor illumination or small size fail to get captured by any candidate region with a high overlap, in which case the network fails to detect the face due to a low score. In these situations, it is desirable to have a candidate box which precisely captures the face. Hence, we generate a new candidate bounding box from the predicted landmark points using the FaceRectCalculator provided by , and pass it again through the network. The new region, being better localized, yields a higher detection score and improves the corresponding task outputs, thus increasing the recall. This procedure can be repeated, so that the boxes at a given step are more localized to faces than those of the previous step. From our experiments, we found that the localization saturates after a single additional pass, which shows the strength of the predicted landmarks. The pseudo-code of IRP is presented in Algorithm 1. The usefulness of IRP can be seen in Figure 3, which shows a low-resolution face region cropped from the top-right image in Figure 15.
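A toy sketch of the IRP loop. The `predict` callback is hypothetical, standing in for a HyperFace forward pass that returns a detection score and landmarks, and a margin-expanded landmark box stands in for AFLW's FaceRectCalculator:

```python
def box_from_landmarks(pts, margin=0.25):
    # enclosing box of the predicted landmarks, expanded by a margin
    # (stands in for AFLW's FaceRectCalculator)
    xs, ys = [p[0] for p in pts], [p[1] for p in pts]
    w, h = max(xs) - min(xs), max(ys) - min(ys)
    return (min(xs) - margin * w, min(ys) - margin * h,
            max(xs) + margin * w, max(ys) + margin * h)

def iterative_region_proposals(proposals, predict, thresh=0.5, steps=1):
    # predict(box) -> (score, landmarks); for low-scoring regions,
    # re-propose a landmark-tight box and score it again
    detections = []
    for box in proposals:
        score, pts = predict(box)
        for _ in range(steps):
            if score >= thresh:
                break
            box = box_from_landmarks(pts)
            score, pts = predict(box)
        if score >= thresh:
            detections.append((box, score, pts))
    return detections
```

The loop terminates after a fixed number of steps; in the toy test below, a loose proposal only passes the threshold after being re-proposed from its landmarks.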
Landmarks-based Non-Maximum Suppression (L-NMS): The traditional approach of non-maximum suppression involves selecting the top-scoring region and discarding all other regions with overlap more than a certain threshold. This method can fail in the following two scenarios: 1) If a region corresponding to the same detected face has little overlap with the highest-scoring region, it can be detected as a separate face. 2) The highest-scoring region might not always be well localized on the face, which can create some discrepancy if two faces are close together. To overcome these issues, we perform NMS on a new region whose bounding box is defined by the boundary co-ordinates {min_i x_i, min_i y_i, max_i x_i, max_i y_i} of the landmarks predicted for the given region. In this way, the candidate regions come closer to each other, decreasing the ambiguity of the overlap and improving the localization.
We apply landmarks-based NMS to keep the top-k boxes, based on the detection scores. The detected face corresponds to the region with the maximum score. The landmark points, pose estimates and gender classification scores are decided by the median of the top-k boxes obtained. Hence, the predictions do not rely on just one face region, but consider the votes from the top-k regions when generating the final output. From our experiments, we found that the best results are obtained with k = 5. The pseudo-code for L-NMS is given in Algorithm 2.
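The landmark re-boxing and the top-k median vote for one face cluster can be sketched as follows; the standard greedy NMS over the landmark-derived boxes is assumed to have already grouped the candidates covering a single face:

```python
import numpy as np

def landmarks_box(pts):
    # bounding box from the extreme landmark coordinates; pts: (N, 2)
    return (pts[:, 0].min(), pts[:, 1].min(),
            pts[:, 0].max(), pts[:, 1].max())

def fuse_top_k(regions, k=5):
    # regions: candidate (score, landmarks) pairs covering one face;
    # the surviving top-k candidates vote, the median giving the
    # final landmarks, and the max score giving the face score
    top = sorted(regions, key=lambda r: -r[0])[:k]
    score = top[0][0]
    pts = np.median(np.stack([p for _, p in top]), axis=0)
    return score, landmarks_box(pts), pts
```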
To emphasize the importance of multitask approach and fusion of the intermediate layers of CNN, we study the performance of simpler CNNs devoid of such features. We evaluate four R-CNN-based models, one for each task of face detection, landmark localization, pose estimation and gender recognition. We also build a separate Multitask_Face model which performs multitask learning just like HyperFace, but does not fuse the information from the intermediate layers. These models are described as follows:
R-CNN_Face: This model is used for the face detection task. The network architecture is shown in Figure 4(a). For training R-CNN_Face, we use the region proposals from the AFLW training set, each associated with a face label based on its overlap with the ground truth. The loss is computed as per (1). The model parameters are initialized using the AlexNet weights trained on the ImageNet dataset. Once trained, the learned parameters from this network are used to initialize other models, including Multitask_Face and HyperFace, as the standard ImageNet initialization does not converge well. We also perform a linear bounding-box regression to localize the face co-ordinates.
R-CNN_Fiducial: This model is used for locating the facial landmarks. The network architecture is shown in Figure 4(b). It simultaneously learns the visibility of the points to account for invisible points at test time, and can thus be used as a standalone fiducial extractor. The loss functions for landmarks localization and visibility of points are computed using (3) and (4), respectively. Only region proposals having an overlap of more than 0.35 with the ground truth bounding box are used for training. The model parameters are initialized from R-CNN_Face.
R-CNN_Pose: This model is used for the head pose estimation task. The outputs of the network are the roll, pitch and yaw of the face. Figure 4(c) presents the network architecture. As with R-CNN_Fiducial, only region proposals with sufficient overlap with the ground truth bounding box are used for training. The training loss is computed using (5).
R-CNN_Gender: This model is used for face gender recognition task. The network architecture is shown in Figure 4(d). It has the same training set as R-CNN_Fiducial and R-CNN_Pose. The training loss is computed using (6).
Multitask_Face: Similar to HyperFace, this model is used to simultaneously detect face, localize landmarks, estimate pose and predict its gender. The only difference between Multitask_Face and HyperFace is that HyperFace fuses the intermediate layers of the network whereas Multitask_Face combines the tasks using the common fully connected layer at the end of the network as shown in Figure 5. Since it provides the landmarks and face score, it leverages iterative region proposals and landmark-based NMS post-processing algorithms during evaluation.
The performance of all the above models for their respective tasks is evaluated and discussed in detail in Section 5.
Similar to HyperFace, we fuse the geometrically rich features from the lower layers and the semantically strong features from the deeper layers of ResNet, so that multi-task learning can leverage their synergy. Taking inspiration from , we fuse the features using hierarchical element-wise addition. Starting with the ‘res2c’ features, we first reduce their spatial resolution using a convolution kernel with a stride of 2. The result is then passed through a 1×1 convolution layer that increases the number of channels to match the next-level features (‘res3b3’ in this case). Element-wise addition is applied between the two to generate a new set of fused features. The same operation is applied in a cascaded manner to fuse the ‘res4b22’ and ‘res5c’ features of the ResNet-101 model. Finally, average pooling is carried out to generate a 2048-dimensional feature vector that is shared among all the tasks. Task-specific sub-networks are branched out separately in a similar way as in HyperFace. Each convolution layer is followed by a Batch-Norm+Scale  layer and a ReLU activation unit. We do not use dropout in HF-ResNet. The training loss functions are the same as described in Section 3.2.
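The cascaded element-wise fusion can be sketched in numpy. Again this is only a shape illustration: strided slicing stands in for the strided convolution, random channel projections stand in for the learned 1×1 convolutions, and the standard ResNet-101 feature shapes for a 224×224 input are assumed.

```python
import numpy as np

def downsample2(x):
    # stride-2 spatial reduction (stands in for the strided convolution)
    return x[:, ::2, ::2]

def conv1x1(x, w):
    # per-pixel channel projection; w: (Cout, Cin)
    C, H, W = x.shape
    return (w @ x.reshape(C, H * W)).reshape(w.shape[0], H, W)

def fuse(lower, higher, w):
    # reduce resolution, match channels, then element-wise add
    return conv1x1(downsample2(lower), w) + higher

rng = np.random.default_rng(0)
res2c   = rng.standard_normal((256, 56, 56))
res3b3  = rng.standard_normal((512, 28, 28))
res4b22 = rng.standard_normal((1024, 14, 14))
res5c   = rng.standard_normal((2048, 7, 7))

f = fuse(res2c, res3b3, rng.standard_normal((512, 256)))
f = fuse(f, res4b22, rng.standard_normal((1024, 512)))
f = fuse(f, res5c, rng.standard_normal((2048, 1024)))
shared = f.mean(axis=(1, 2))   # global average pooling -> 2048-dim vector
```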
HF-ResNet is slower than HyperFace since it performs more convolutions. This makes it difficult to use with the Selective Search  algorithm, which generates around 2,000 region proposals to be processed. Hence, we use a faster source of region proposals: a high-recall SSD  face detector. It produces far fewer proposals in a fraction of a second, which considerably reduces the total runtime of HF-ResNet. The fast version of HyperFace is discussed in Section 5.6.
We evaluated the proposed HyperFace method, along with HF-ResNet, Multitask_Face, R-CNN_Face, R-CNN_Fiducial, R-CNN_Pose and R-CNN_Gender, on six challenging datasets:
Annotated Face in-the-Wild (AFW)  for evaluating face detection, landmarks localization, and pose estimation tasks
300-W Faces in-the-Wild (IBUG)  for evaluating 68-point landmarks localization.
Annotated Facial Landmarks in the Wild (AFLW)  for evaluating landmarks localization and pose estimation tasks
Our method was trained on randomly selected images from the AFLW dataset using Caffe; the remaining images were used for testing.
We present face detection results for the AFW, PASCAL and FDDB datasets. The AFW dataset  was collected from Flickr, and the images in this dataset contain large variations in appearance and viewpoint. In total, there are 205 images with 468 faces in this dataset. The FDDB dataset  consists of 2,845 images containing 5,171 faces collected from news articles on the Yahoo website. This dataset is the most widely used benchmark for unconstrained face detection. The PASCAL faces dataset  was collected from the test set of the PASCAL person layout dataset, which is a subset of PASCAL VOC . This dataset contains 1,335 faces from 851 images with large appearance variations. For improved face detection performance, we learn an SVM classifier on top of the detection features, using the training splits from the FDDB dataset.
Some of the recently published methods compared in our evaluations include DP2MFD , Faceness , HeadHunter , JointCascade , CCF , SquaresChnFtrs-5 , CascadeCNN , Structured Models , DDFD , NPDFace , PEP-Adapt , TSM , as well as three commercial systems: Face++, Picasa and Face.com.
The precision-recall curves of the different detectors on the AFW and PASCAL faces datasets are shown in Figures 7 (a) and (b), respectively. Figure 8 compares the performance of the different detectors using Receiver Operating Characteristic (ROC) curves on the FDDB dataset. As can be seen from these figures, both HyperFace and HF-ResNet outperform all the reported academic and commercial detectors on the AFW and PASCAL datasets, with HyperFace achieving a high mean average precision (mAP) on both datasets and HF-ResNet improving the mAP further.
The FDDB dataset is very challenging for HyperFace, and for any other R-CNN-based face detection method, as it contains many small and blurred faces. First, some of these faces do not get included in the region proposals from Selective Search. Second, re-sizing small faces to the network input size of 227 × 227 adds distortion to the face, resulting in a low detection score. In spite of these issues, HyperFace performs comparably to recently published deep learning-based face detection methods such as DP2MFD  and Faceness  on the FDDB dataset (results at http://vis-www.cs.umass.edu/fddb/results.html).
It is interesting to note the performance differences between R-CNN_Face, Multitask_Face and HyperFace for the face detection task. Figures 7 and 8 clearly show that the multitask CNNs (Multitask_Face and HyperFace) outperform R-CNN_Face by a wide margin. The performance gain is mainly due to the following two reasons. First, the multitask learning approach helps the network learn improved features for face detection, which is evident from the mAP values on the AFW dataset: using just the linear bounding-box regression and traditional NMS, HyperFace obtains a higher mAP (Figure 13) than R-CNN_Face. Second, having landmark information associated with the detection boxes makes it easier to localize the bounding box to a face, using the IRP and L-NMS algorithms. On the other hand, HyperFace and Multitask_Face perform comparably to each other on all the face detection datasets, which suggests that the network does not gain much from fusing intermediate layers for the face detection task.
We evaluate the performance of different landmark localization algorithms on the AFW and AFLW datasets. Both of these datasets contain faces with full pose variations. The methods compared include the Multiview Active Appearance Model-based method (Multi. AAM), Constrained Local Model (CLM), the Oxford facial landmark detector, Zhu, FaceDPL, JointCascade, CDM, RCPR, ESR, SDM and 3DDFA. Although both of these datasets provide ground-truth bounding boxes, we do not use them when evaluating HyperFace, HF-ResNet, Multitask_Face and R-CNN_Fiducial; instead we use the respective algorithms to detect both the face and its fiducial points. Since R-CNN_Fiducial cannot detect faces, we provide it with the detections from HyperFace.
Figure 9 compares the performance of different landmark localization methods on the AFW dataset. In this figure, (*) indicates models that are evaluated on near-frontal faces or use hand initialization. The dataset provides six keypoints for each face: left_eye_center, right_eye_center, nose_tip, mouth_left, mouth_center and mouth_right. We compute the error as the mean distance between the predicted and ground-truth keypoints, normalized by the face size.
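The error metric above is straightforward to compute. The following is a minimal sketch; the function name and array layout (one row of (x, y) coordinates per keypoint) are our own conventions, not part of the original evaluation code.

```python
import numpy as np

def normalized_mean_error(pred, gt, face_size):
    """Mean Euclidean distance between predicted and ground-truth
    keypoints, normalized by the face size."""
    pred = np.asarray(pred, dtype=float)  # shape (K, 2)
    gt = np.asarray(gt, dtype=float)      # shape (K, 2)
    errors = np.linalg.norm(pred - gt, axis=1)  # per-keypoint distances
    return errors.mean() / face_size
```

For example, a prediction that is off by 5 pixels on one of two keypoints of a face of size 100 yields an error of 0.025.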
For the AFLW dataset, we calculate the error using all the visible keypoints, and adopt the same protocol as defined in . The only difference is that our AFLW testset consists of a subset of the images, since we use the rest of the images for training. To be consistent with the protocol, we randomly create subsets of equal size from our testset whose absolute yaw angles lie within [0°, 30°], [30°, 60°] and [60°, 90°], respectively. Figure 10 compares the performance of different landmark localization methods. We obtain the comparison plots from , where the evaluations for RCPR, ESR and SDM are carried out after adapting the algorithms to face profiling. Table I provides the Normalized Mean Error (NME) on the AFLW dataset for each of the pose groups.
AFLW Dataset (21 pts)
Method | [0°, 30°] | [30°, 60°] | [60°, 90°] | Mean | Std
As can be seen from the figures, R-CNN_Fiducial, Multitask_Face, HyperFace and HF-ResNet outperform many recent state-of-the-art landmark localization methods, including FaceDPL, 3DDFA and SDM. Table I shows that HyperFace performs consistently well across all pose angles. This clearly suggests that while most of the methods work well on frontal faces, HyperFace is able to predict landmarks for faces with full pose variations. Moreover, we find that R-CNN_Fiducial and Multitask_Face attain similar performance. HyperFace has an advantage over them as it fuses the intermediate layers. Local information is well preserved in the lower layers of the CNN and becomes invariant as depth increases; fusing the layers brings out that hidden information, which boosts the performance on the landmark localization task. Additionally, we observe that HF-ResNet significantly improves over HyperFace on both the AFW and AFLW datasets. The large margin in performance can be attributed to the greater depth of the HF-ResNet model.
We also evaluate our models on the challenging subset of the 300-W landmark localization dataset (IBUG). The subset contains test images with wide variations in expression, illumination and head pose. Since the dataset provides 68 landmark points instead of the 21 used in AFLW training, the model cannot be directly applied to IBUG. We therefore retrain the network with the prediction of the 68 facial keypoints as an additional task, alongside the tasks already in hand. We implement it by adding two fully-connected layers in a cascade to the shared feature space (fc-full).
Following the protocol described in , we use faces with 68-point annotations for training. The network is trained end-to-end for 68-point landmark localization along with the other tasks mentioned in Section 3.2, using the standard Euclidean loss for the landmark task. For evaluation, we compute the average error of all landmarks normalized by the inter-pupil distance. Table II compares the Normalized Mean Error (NME) obtained by HyperFace and HF-ResNet with other recently published methods. We observe that HyperFace achieves a competitive NME, while HF-ResNet achieves the state-of-the-art result on IBUG. This shows the effectiveness of the proposed models for 68-point landmark localization.
We evaluate R-CNN_Pose, Multitask_Face and HyperFace on the AFW and AFLW datasets for the pose estimation task. The detection boxes used for evaluating the landmark localization task are used here as well for initialization. For the AFW dataset, we compare our approach with Multi. AAM, Multiview HoG, FaceDPL (available at: http://www.ics.uci.edu/~dramanan/software/face/face_journal.pdf) and face.com. Note that the multiview AAMs are initialized using the ground-truth bounding boxes (denoted by *). Figure 12 shows the cumulative error distribution curves on the AFW dataset; each curve gives the fraction of faces for which the estimated pose is within a given error tolerance. As can be seen from the figure, both HyperFace and HF-ResNet outperform existing methods by a large margin. For the AFLW dataset, no pose estimation results are available for previous methods, so we show the performance of our method for different pose angles: roll, pitch and yaw in Figures 11 (a), (b) and (c), respectively. It can be seen that the network learns roll and pitch better than yaw.
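The cumulative error distribution curves of Figure 12 reduce to a simple computation over per-face pose errors. A minimal sketch (function name and interface are illustrative):

```python
import numpy as np

def cumulative_error_distribution(errors, tolerances):
    """For each tolerance, return the fraction of faces whose pose
    error is within that tolerance; plotting these fractions against
    the tolerances gives a curve as in Figure 12."""
    errs = np.asarray(errors, dtype=float)
    return [float((errs <= t).mean()) for t in tolerances]
```

A method whose curve rises faster toward 1.0 localizes pose accurately for a larger fraction of faces.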
The performance trends of R-CNN_Pose, Multitask_Face, HyperFace and HF-ResNet for the pose estimation task mirror those for the landmark localization task. R-CNN_Pose and Multitask_Face perform comparably, whereas HyperFace achieves a boost in performance due to the fusion of intermediate layers. This shows that tasks which rely on the structure and orientation of the face benefit from features in the lower layers of the CNN. HF-ResNet further improves the performance for roll, pitch as well as yaw.
We report gender recognition performance on the CelebA and LFWA datasets, since these datasets come with gender annotations. The CelebA and LFWA datasets contain labeled images selected from the CelebFaces and LFW datasets, respectively. The CelebA dataset contains about 200,000 images of 10,000 identities, while the LFWA dataset has 13,233 images of 5,749 identities. We compare our approach with FaceTracer, PANDA-w, PANDA-1 and other recently published CNN-based methods. The gender recognition performance of different methods is reported in Table III. On the LFWA dataset, our method outperforms PANDA and FaceTracer, and matches the best reported result. On the CelebA dataset, our method performs comparably to the state-of-the-art. Unlike methods that train on the full CelebA training and validation sets, we only use images from the validation set of CelebA to fine-tune the network.
Similar to the face detection task, we find that gender recognition works better for HyperFace and Multitask_Face than for R-CNN_Gender, showing that learning related tasks together improves the discriminative capability of the individual tasks. Again, we do not see much difference between Multitask_Face and HyperFace, suggesting that the intermediate layers do not contribute much to the gender recognition task. HF-ResNet achieves state-of-the-art results on both the CelebA and LFWA datasets.
Figure 13 provides an experimental analysis of the post-processing methods, IRP and L-NMS, for the face detection task on the AFW dataset. Fast SS denotes the quick version of selective search, which produces fewer region proposals and runs faster per image, while Quality SS refers to its slow version, which outputs far more region proposals at a much higher per-image cost. HyperFace with linear bounding-box regression and traditional NMS achieves a lower mAP; simply replacing them with L-NMS provides a clear boost, since in this case the bounding box is constructed from the landmark information rather than by linear regression. Additionally, we can see from the figure that although Quality SS generates more region proposals, it performs worse than Fast SS with iterative region proposals. IRP adds only a small number of new regions for a typical image at negligible cost, which makes it highly efficient compared to Quality SS.
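The core idea of deriving a detection box from landmarks rather than from linear regression can be sketched as follows. This is a minimal illustration, not the paper's exact L-NMS implementation; the padding fraction of 0.2 is a hypothetical value.

```python
import numpy as np

def box_from_landmarks(landmarks, pad=0.2):
    """Bounding box enclosing the predicted landmarks, expanded by a
    padding fraction (0.2 here is illustrative) so that the box covers
    the whole face rather than just the keypoint hull."""
    pts = np.asarray(landmarks, dtype=float)  # shape (K, 2)
    x0, y0 = pts.min(axis=0)
    x1, y1 = pts.max(axis=0)
    w, h = x1 - x0, y1 - y0
    return (x0 - pad * w, y0 - pad * h, x1 + pad * w, y1 + pad * h)
```

Because the box tracks the predicted keypoints directly, it stays anchored to the face even when the proposal region was poorly centered.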
The HyperFace method is tested on a machine with 8 CPU cores and a GTX TITAN X GPU. Most of the per-image time for performing all four tasks is spent not in the CNN but in the Selective Search algorithm used to generate candidate region proposals; one forward pass through the HyperFace network for all the proposals takes only a small fraction of the total time.
We also propose a fast version of HyperFace, which uses a high-recall fast face detector instead of Selective Search to generate candidate region proposals. We implement the face detector using the Single Shot Detector (SSD) framework. The SSD-based face detector takes a 512×512 input image and generates face boxes in a single forward pass, with confidence scores ranging from 0 to 1. We use a low probability threshold to select high-recall detection boxes. Unlike a traditional SSD, we do not apply non-maximum suppression to the detector output, so that more region proposals are retained. These proposals are directly passed through HyperFace to generate face detection scores, localize landmarks, estimate pose and recognize gender for every face in the image. Running on a GTX TITAN X GPU, Fast-HyperFace achieves a mAP on the AFW face detection task comparable to that of HyperFace, improving the speed substantially with negligible degradation in performance.
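The proposal-selection step above amounts to low-threshold filtering with NMS deliberately skipped. A minimal sketch; the function name and the threshold value 0.05 are illustrative choices, not taken from the paper.

```python
def high_recall_proposals(boxes, scores, threshold=0.05):
    """Keep every detection whose confidence is at least a low
    threshold, deliberately skipping NMS so that overlapping boxes
    survive as region proposals for the downstream network."""
    return [box for box, score in zip(boxes, scores) if score >= threshold]
```

Keeping overlapping boxes trades a little extra HyperFace computation for higher recall, since the downstream L-NMS step later consolidates overlapping detections of the same face.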
We present some observations based on our experiments. First, all the face-related tasks benefit from the multi-task learning framework. The gain is mainly due to the network's ability to learn more discriminative features, and to post-processing methods that can leverage the landmarks as well as the detection score of a region. Second, fusing intermediate layers improves performance for the structure-dependent tasks of pose estimation and landmark localization, since the features become invariant to geometry in the deeper layers of the CNN. HyperFace exploits these observations to improve the performance of all four tasks.
We also visualize the features learned by the HyperFace network. Figure 14 shows the network activations for a few selected feature maps from the fused layer. It can be seen that some feature maps are dedicated to a single task while others can be used for several tasks. For example, one feature map distinguishes face from non-face regions and can thus be used for face detection, whereas another outputs high activations for female faces and can be used for gender recognition. Similarly, one feature map shows high activations near the eye and mouth regions, while another gives a rough contour of the face orientation; these features can be used for the landmark localization and pose estimation tasks, respectively.
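Visualizations like Figure 14 are typically produced by min-max normalizing each 2-D activation map to an 8-bit image. A small sketch of that step, under our own naming (the paper does not specify its visualization code):

```python
import numpy as np

def activation_to_image(fmap):
    """Min-max normalize one 2-D feature map to [0, 255] for display."""
    fmap = np.asarray(fmap, dtype=float)
    rng = fmap.max() - fmap.min()
    if rng == 0:
        # Constant map: render as all-black rather than dividing by zero.
        return np.zeros(fmap.shape, dtype=np.uint8)
    return ((fmap - fmap.min()) / rng * 255.0).astype(np.uint8)
```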
Several qualitative results of our method on the AFW, PASCAL and FDDB datasets are shown in Figure 15. As can be seen from this figure, our method is able to simultaneously perform all four tasks on images containing extreme pose, illumination and resolution variations with cluttered backgrounds.
In this paper, we presented a multi-task deep learning method called HyperFace for simultaneously detecting faces, localizing landmarks, estimating head pose and identifying gender. Extensive experiments using various publicly available unconstrained datasets demonstrate the effectiveness of our method on all four tasks. In the future, we will evaluate the performance of our method on other applications such as simultaneous human detection and human pose estimation, object recognition and pedestrian detection.
We thank Dr. Jun-Cheng Chen for implementing the SSD512-based face detector used in the Fast-HyperFace pipeline. This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 2014-14071600012. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.