SSH Face Detector in Keras
We introduce the Single Stage Headless (SSH) face detector. Unlike two-stage proposal-classification detectors, SSH detects faces in a single stage directly from the early convolutional layers in a classification network. SSH is headless. That is, it is able to achieve state-of-the-art results while removing the "head" of its underlying classification network, i.e. all fully connected layers in VGG-16, which contain a large number of parameters. Additionally, instead of relying on an image pyramid to detect faces with various scales, SSH is scale-invariant by design. We simultaneously detect faces with different scales in a single forward pass of the network, but from different layers. These properties make SSH fast and light-weight. Surprisingly, with a headless VGG-16, SSH beats the ResNet-101-based state-of-the-art on the WIDER dataset, even though, unlike the current state-of-the-art, SSH does not use an image pyramid and is 5X faster. Moreover, if an image pyramid is deployed, our light-weight network achieves state-of-the-art on all subsets of the WIDER dataset, improving the AP by 2.5%. SSH also reaches state-of-the-art results on the FDDB and Pascal-Faces datasets while using a small input size, leading to a runtime of 50 ms/image on a GPU. The code is available at https://github.com/mahyarnajibi/SSH.
Face detection is a crucial step in various problems involving verification, identification, expression analysis, etc. From the Viola-Jones detector to recent work by Hu et al., the performance of face detectors has improved dramatically. However, detecting small faces is still considered a challenging task. The recent introduction of the WIDER face dataset, containing a large number of small faces, exposed the performance gap between humans and current face detectors. The problem becomes more challenging when the speed and memory efficiency of the detectors are taken into account. The best performing face detectors are usually slow and have high memory footprints (taking more than a second to process an image, see Section 4.5), partly due to the huge number of parameters as well as the way robustness to scale or incorporation of context are addressed.
State-of-the-art CNN-based detectors convert image classification networks into two-stage detection systems [4, 24]. In the first stage, early convolutional feature maps are used to propose a set of candidate object boxes. In the second stage, the remaining layers of the classification network (fc6~8 in VGG-16), which we refer to as the network “head”, are deployed to extract local features for these candidates and classify them. The head of a classification network can be computationally expensive (its parameter count runs into the millions in both VGG-16 and ResNet-101). Moreover, in two-stage detectors, this computation must be performed for all proposed candidate boxes.
Very recently, Hu et al. showed state-of-the-art results on the WIDER face detection benchmark by using an approach similar to Region Proposal Networks (RPN) to directly detect faces. Robustness to input scale is achieved by introducing an image pyramid as an integral part of the method. However, it involves processing an input pyramid with an up-sampling scale up to pixels per side and passing each level to a very deep network, which increases inference time.
In this paper, we introduce the Single Stage Headless (SSH) face detector. SSH performs detection in a single stage. Like RPN, the early feature maps in a classification network are used to regress a set of predefined anchors towards faces. However, unlike two-stage detectors, the final classification takes place together with regressing the anchors. SSH is headless. It is able to achieve state-of-the-art results while removing the head of its underlying network (all fully connected layers in VGG-16), leading to a light-weight detector. Finally, SSH is scale-invariant by design. Instead of relying on an external multi-scale pyramid as input, inspired by , SSH detects faces from various depths of the underlying network. This is achieved by placing an efficient convolutional detection module on top of the layers with different strides, each of which is trained for an appropriate range of face scales. Surprisingly, SSH based on a headless VGG-16 not only outperforms the best-reported VGG-16 by a large margin but also beats the current ResNet-101-based state-of-the-art method on the WIDER face detection dataset. Unlike the current state-of-the-art, SSH does not deploy an input pyramid and is 5 times faster. If an input pyramid is used with SSH as well, our light-weight VGG-16-based detector outperforms the best reported ResNet-101 on all three subsets of the WIDER dataset and improves the mean average precision by and on the validation and the test set respectively. SSH also achieves state-of-the-art results on the FDDB and Pascal-Faces datasets with a relatively small input size, leading to a runtime of 50 ms/image.
Prior to the re-emergence of convolutional neural networks (CNN), different machine learning algorithms were developed to improve face detection performance [29, 39, 10, 11, 18, 2, 31]. However, following the success of these networks in classification tasks, they were applied to detection as well. Face detectors based on CNNs significantly closed the performance gap between human and artificial detectors [12, 33, 32, 38, 7]. However, the introduction of the challenging WIDER dataset, containing a large number of small faces, re-highlighted this gap. To improve performance, CMS-RCNN changed the Faster R-CNN object detector to incorporate context information. Very recently, Hu et al. proposed a face detection method based on proposal networks which achieves state-of-the-art results on this dataset. However, in addition to skip connections, an input pyramid is processed by re-scaling the image to different sizes, leading to slow detection speeds. In contrast, SSH is able to process multiple face scales simultaneously in a single forward pass of the network, which reduces inference time noticeably.
The idea of detecting and localizing objects in a single stage has been previously studied for general object detection. SSD  and YOLO  perform detection and classification simultaneously by classifying a fixed grid of boxes and regressing them towards objects. G-CNN  models detection as a piece-wise regression problem and iteratively pushes an initial multi-scale grid of boxes towards objects while classifying them. However, current state-of-the-art methods on the challenging MS-COCO object detection benchmark are based on two-stage detectors. SSH is a single stage detector; it detects faces directly from the early convolutional layers without requiring a proposal stage.
Although SSH is a detector, it is more similar to the object proposal algorithms which are used as the first stage in detection pipelines. These algorithms generally regress a fixed set of anchors towards objects and assign an objectness score to each of them. MultiBox deploys clustering to define anchors. RPN, on the other hand, defines anchors as a dense grid of boxes with various scales and aspect ratios, centered at every location in the input feature map. SSH uses similar strategies, but to localize faces and detect them at the same time.
Being scale invariant is important for detecting faces in unconstrained settings. For generic object detection, [1, 36] deploy feature maps of earlier convolutional layers to detect small objects. Recently,  used skip connections in the same way as  and employed multiple shared RPN and classifier heads from different convolutional layers. For face detection, CMS-RCNN  used the same idea as [1, 36] and added skip connections to the Faster RCNN .  creates a pyramid of images and processes each separately to detect faces of different sizes. In contrast, SSH is capable of detecting faces at different scales in a single forward pass of the network without creating an image pyramid. We employ skip connections in a similar fashion as [17, 14], and train three detection modules jointly from the convolutional layers with different strides to detect small, medium, and large faces.
models context by deploying a recurrent neural network. For face detection, CMS-RCNN utilizes a larger window at the cost of duplicating the classification head. This increases the memory requirement as well as detection time. SSH uses simple convolutional layers to achieve the same larger-window effect, leading to more efficient context modeling.
SSH is designed to decrease inference time, have a low memory foot-print, and be scale-invariant. SSH is a single-stage detector; instead of dividing the detection task into bounding box proposal and classification, it performs classification together with localization from the global information extracted from the convolutional layers. We empirically show that in this way, SSH can remove the “head” of its underlying network while achieving state-of-the-art face detection accuracy. Moreover, SSH is scale-invariant by design and can incorporate context efficiently.
Figure 2 shows the general architecture of SSH. It is a fully convolutional network which localizes and classifies faces early on by adding a detection module on top of feature maps with strides of , , and , depicted as , , and respectively. The detection module consists of a convolutional binary classifier and a regressor for detecting faces and localizing them respectively.
To solve the localization sub-problem, as in [28, 24, 19], SSH regresses a set of predefined bounding boxes called anchors, to the ground-truth faces. We employ a similar strategy to the RPN  to form the anchor set. We define the anchors in a dense overlapping sliding window fashion. At each sliding window location, anchors are defined which have the same center as that window and different scales. However, unlike RPN, we only consider anchors with aspect ratio of one to reduce the number of anchor boxes. We noticed in our experiments that having various aspect ratios does not have a noticeable impact on face detection precision. More formally, if the feature map connected to the detection module has a size of , there would be anchors with aspect ratio one and scales .
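The dense, square anchor layout described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `make_anchors`, the base size of 16 pixels, and the stride and scale values in the usage note are assumptions chosen for the example.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride, scales, base_size=16):
    """Return (feat_h * feat_w * len(scales), 4) square anchors as [x1, y1, x2, y2].

    base_size, stride and scales are illustrative assumptions, not values
    taken from the text.
    """
    # Side length of each square anchor (aspect ratio one).
    sides = base_size * np.asarray(scales, dtype=np.float64)
    # Centre of every sliding-window position, in input-image pixels.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cy, cx = np.meshgrid(cy, cx, indexing="ij")
    centres = np.stack([cx.ravel(), cy.ravel()], axis=1)   # (H*W, 2)
    half = sides[None, :, None] / 2.0                      # (1, K, 1)
    c = centres[:, None, :]                                # (H*W, 1, 2)
    # Every location gets K anchors sharing the same centre.
    boxes = np.concatenate([c - half, c + half], axis=2)   # (H*W, K, 4)
    return boxes.reshape(-1, 4)
```

For a 2 x 3 feature map with two scales, `make_anchors(2, 3, stride=8, scales=[1, 2])` yields 2 * 3 * 2 = 12 anchors, all with equal width and height.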
For the detection module, a set of convolutional layers are deployed to extract features for face detection and localization as depicted in Figure 3. This includes a simple context module to increase the effective receptive field as discussed in section 3.3. The number of output channels of the context module, (“” in Figures 3 and 4) is set to for detection module and for modules and . Finally, two convolutional layers perform bounding box regression and classification. At each convolution location in , the classifier decides whether the windows at the filter’s center and corresponding to each of the scales contains a face. A convolutional layer with output channels is used as the classifier. For the regressor branch, another convolutional layer with output channels is deployed. At each location during the convolution, the regressor predicts the required change in scale and translation to match each of the positive anchors to faces.
In unconstrained settings, faces in images have varying scales. Although forming a multi-scale input pyramid and performing several forward passes during inference, as in , makes it possible to detect faces with different scales, it is slow. In contrast, SSH detects large and small faces simultaneously in a single forward pass of the network. Inspired by , we detect faces from three different convolutional layers of our network using detection modules , and . These modules have strides of , , and and are designed to detect small, medium, and large faces respectively.
More precisely, the detection module performs detection from the conv5-3 layer in VGG-16. Although it is possible to place the detection module directly on top of conv4-3, we use the feature map fusion which was previously deployed for semantic segmentation and generic object detection. However, to decrease the memory consumption of the model, the number of channels in the feature map is reduced from 512 to 128 using convolutions. The conv5-3 feature maps are up-sampled and summed with the conv4-3 features, followed by a convolutional layer. We use bilinear up-sampling in the fusion process. For detecting larger faces, a max-pooling layer with stride of is added on top of the conv5-3 layer to increase its stride to . The detection module is placed on top of this newly added layer.
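The fusion step can be illustrated with a small NumPy sketch. It assumes that the channel reduction is a 1x1 convolution and that conv5-3 has half the spatial resolution of conv4-3 (as in a standard VGG-16). The function names and the random weights are hypothetical, and the convolutional layer applied after the sum is omitted.

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: x is (H, W, C_in), w is (C_in, C_out)."""
    return x @ w

def _interp_axis0(a, pos):
    """Linear interpolation of `a` along axis 0 at fractional positions `pos`."""
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, a.shape[0] - 1)
    t = (pos - lo).reshape(-1, *([1] * (a.ndim - 1)))
    return a[lo] * (1 - t) + a[hi] * t

def upsample_bilinear_2x(x):
    """Bilinear 2x up-sampling of an (H, W, C) map, applied separably per axis."""
    h, w, _ = x.shape
    # Output sample positions in input coordinates (half-pixel centres, clamped).
    ys = np.clip((np.arange(2 * h) + 0.5) / 2.0 - 0.5, 0, h - 1)
    xs = np.clip((np.arange(2 * w) + 0.5) / 2.0 - 0.5, 0, w - 1)
    out = _interp_axis0(x, ys)                        # (2H, W, C)
    out = _interp_axis0(out.transpose(1, 0, 2), xs)   # (2W, 2H, C)
    return out.transpose(1, 0, 2)                     # (2H, 2W, C)

def fuse_conv4_conv5(conv4_3, conv5_3, w4, w5):
    """Reduce both maps to fewer channels, up-sample conv5-3, and sum."""
    reduced4 = conv1x1(conv4_3, w4)
    reduced5 = upsample_bilinear_2x(conv1x1(conv5_3, w5))
    return reduced4 + reduced5
```

With 512-channel inputs and weights of shape (512, 128), the fused map has the spatial size of conv4-3 and 128 channels, which is the memory saving the text refers to.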
During the training phase, each detection module is trained to detect faces from a target scale range as discussed in 3.4. During inference, the predicted boxes from the different scales are joined together followed by Non-Maximum Suppression (NMS) to form the final detections.
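The inference-time merge can be sketched as below: detections from all modules are concatenated and a single greedy NMS pass is run over the union. The function names are hypothetical and the IoU threshold of 0.3 is an assumed placeholder, since the value is not given here.

```python
import numpy as np

def nms(boxes, scores, iou_threshold):
    """Greedy non-maximum suppression; boxes are [x1, y1, x2, y2]. Returns kept indices."""
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    order = np.argsort(-scores)          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of the kept box with every remaining box.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        # Drop boxes that overlap the kept box too much.
        order = rest[iou <= iou_threshold]
    return keep

def merge_detections(per_module, iou_threshold=0.3):
    """per_module: list of (boxes, scores) pairs, one per detection module.

    iou_threshold=0.3 is an assumption for illustration.
    """
    boxes = np.concatenate([b for b, _ in per_module], axis=0)
    scores = np.concatenate([s for _, s in per_module], axis=0)
    keep = nms(boxes, scores, iou_threshold)
    return boxes[keep], scores[keep]
```

Running NMS once over the union, rather than per module, lets a face picked up by two neighbouring modules collapse to its single best-scoring box.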
In two-stage detectors, it is common to incorporate context by enlarging the window around the candidate proposals. SSH mimics this strategy by means of simple convolutional layers. Figure 4 shows the context layers which are integrated into the detection modules. Since anchors are classified and regressed in a convolutional manner, applying a larger filter resembles increasing the window size around proposals in a two-stage detector. To this end, we use and filters in our context module. Modeling the context in this way increases the receptive field proportional to the stride of the corresponding layer and as a result the target scale of each detection module. To reduce the number of parameters, we use a similar approach as and deploy sequential filters instead of larger convolutional filters. The number of output channels of the detection module (“” in Figure 4) is set to for and for modules and . It should be noted that our detection module together with its context filters uses fewer parameters compared to the module deployed for proposal generation in . Despite being more efficient, the context module improves the mean average precision on the WIDER validation dataset by more than half a percent, as we found empirically.
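The saving from sequential small filters can be checked with a short calculation. The sketch below assumes 3x3, stride-1 filters standing in for a single larger filter; the filter sizes and the channel count of 64 are illustrative assumptions, since the concrete sizes are not given here.

```python
def stacked_3x3_receptive_field(n_layers):
    """Receptive field (in feature-map cells) of n stacked 3x3, stride-1 convolutions."""
    return 1 + 2 * n_layers

def conv_weights(kernel, c_in, c_out):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * c_in * c_out

# Two stacked 3x3 layers see a 5x5 window, three see a 7x7 window.
rf_two = stacked_3x3_receptive_field(2)
rf_three = stacked_3x3_receptive_field(3)

# Weight count for a 7x7 window with 64 input and output channels (assumed width).
single_7x7 = conv_weights(7, 64, 64)
three_3x3 = 3 * conv_weights(3, 64, 64)
```

Because the window is measured in feature-map cells, the same stack covers more input pixels on a layer with a larger stride, which is why the context grows in proportion to the stride.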
We use stochastic gradient descent with momentum and weight decay for training the network. As discussed in Section 3.2, we place three detection modules on layers with different strides to detect faces with different scales. Consequently, our network has three multi-task losses for the classification and regression branches in each of these modules, as discussed in Section 3.4.1. To specialize each of the three detection modules for a specific range of scales, we only back-propagate the loss for the anchors which are assigned to faces in the corresponding range. This is implemented by distributing the anchors based on their size to these three modules (smaller anchors are assigned to compared to , and ). An anchor is assigned to a ground-truth face if and only if it has an IoU higher than 0.5. This is in contrast to the methods based on Faster R-CNN, which assign to each ground-truth at least one anchor with the highest IoU. Thus, we do not back-propagate the loss through the network for ground-truth faces inconsistent with the anchor sizes of a module.
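The assignment rule can be sketched per module as follows. The thresholds of 0.5 (positive) and 0.3 (negative) are the ones stated in the loss discussion. The function names and the label convention (1 positive, 0 negative, -1 ignored) are assumptions made for this example.

```python
import numpy as np

def iou_matrix(anchors, gts):
    """Pairwise IoU between (A, 4) anchors and (G, 4) ground-truth boxes."""
    ax1, ay1, ax2, ay2 = [anchors[:, i][:, None] for i in range(4)]
    gx1, gy1, gx2, gy2 = [gts[:, i][None, :] for i in range(4)]
    iw = np.clip(np.minimum(ax2, gx2) - np.maximum(ax1, gx1), 0, None)
    ih = np.clip(np.minimum(ay2, gy2) - np.maximum(ay1, gy1), 0, None)
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (gx2 - gx1) * (gy2 - gy1) - inter
    return inter / union

def label_anchors(anchors, gts, pos_thresh=0.5, neg_thresh=0.3):
    """Label each anchor of one module: 1 positive, 0 negative, -1 ignored."""
    labels = np.full(len(anchors), -1, dtype=int)
    if len(gts) == 0:
        labels[:] = 0
        return labels
    max_iou = iou_matrix(anchors, gts).max(axis=1)
    labels[max_iou < neg_thresh] = 0
    # Positives come from the threshold only. No anchor is forced onto a
    # ground-truth face that none of this module's anchors fits.
    labels[max_iou > pos_thresh] = 1
    return labels
```

A face much smaller or larger than a module's anchors therefore produces no positives in that module and contributes no regression loss there.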
SSH has a multi-task loss. This loss can be formulated as follows:
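A form of the loss consistent with the definitions that follow can be written as below. The symbol names are assumed here for illustration, and the balancing weight `\lambda` is an assumption that the surrounding text does not define (setting it to 1 removes it).

```latex
\mathcal{L} \;=\; \sum_{k} \frac{1}{N_k^{c}} \sum_{i \in \mathcal{A}_k} \ell_c(p_i, g_i)
\;+\; \lambda \sum_{k} \frac{1}{N_k^{r}} \sum_{i \in \mathcal{A}_k} I(g_i = 1)\, \ell_r(b_i, t_i),
\qquad N_k^{r} = \sum_{i \in \mathcal{A}_k} I(g_i = 1)
```

In this notation, `\ell_c` and `\ell_r` are the classification and regression losses, `k` indexes the detection modules, `\mathcal{A}_k` is the anchor set of module `k`, `p_i` and `g_i` are the predicted category and ground-truth label of anchor `i`, `b_i` and `t_i` are its predicted and target regression vectors, and `N_k^{c}` and `N_k^{r}` are the per-module normalizers.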
where is the face classification loss. We use standard multinomial logistic loss as . The index goes over the SSH detection modules and represents the set of anchors defined in . The predicted category for the ’th anchor in and its assigned ground-truth label are denoted as and respectively. As discussed in Section 3.2, an anchor is assigned to a ground-truth bounding box if and only if it has an IoU greater than a threshold (0.5). As in , negative labels are assigned to anchors with IoU less than a predefined threshold (0.3) with any ground-truth bounding box. is the number of anchors in module which participate in the classification loss computation.
represents the bounding box regression loss. Following [6, 5, 24], we parameterize the regression space with a log-space shift in the box dimensions and a scale-invariant translation and use smooth loss as . In this parametrized space, represents the predicted four dimensional translation and scale shift and is its assigned ground-truth regression target for the ’th anchor in module . is the indicator function that limits the regression loss only to the positively assigned anchors, and .
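The parameterization and loss can be sketched as below, assuming the usual R-CNN-style encoding: centre offsets divided by the anchor size, and the logarithm of the size ratios. The function names are hypothetical.

```python
import numpy as np

def encode_targets(anchors, gts):
    """Regression targets (tx, ty, tw, th) for anchors matched one-to-one with gts.

    Both inputs are (N, 4) arrays of [x1, y1, x2, y2].
    """
    aw = anchors[:, 2] - anchors[:, 0]
    ah = anchors[:, 3] - anchors[:, 1]
    acx = anchors[:, 0] + 0.5 * aw
    acy = anchors[:, 1] + 0.5 * ah
    gw = gts[:, 2] - gts[:, 0]
    gh = gts[:, 3] - gts[:, 1]
    gcx = gts[:, 0] + 0.5 * gw
    gcy = gts[:, 1] + 0.5 * gh
    tx = (gcx - acx) / aw          # scale-invariant translation
    ty = (gcy - acy) / ah
    tw = np.log(gw / aw)           # log-space shift in box dimensions
    th = np.log(gh / ah)
    return np.stack([tx, ty, tw, th], axis=1)

def smooth_l1(pred, target):
    """Smooth L1 loss summed over the four box coordinates, one value per anchor."""
    d = np.abs(pred - target)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum(axis=1)
```

Dividing the translation by the anchor size and taking the log of the size ratio keeps the targets in a comparable range for small and large anchors.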
We use online negative and positive mining (OHEM) for training SSH as described in . However, OHEM is applied to each of the detection modules () separately. That is, for each module , we select the negative anchors with the highest scores and the positive anchors with the lowest scores with respect to the weights of the network at that iteration to form our mini-batch. Also, since the number of negative anchors is more than the positives, following , of the mini-batch is reserved for the positive anchors. As empirically shown in Section 4.8, OHEM has an important role in the success of SSH which removes the fully connected layers out of the VGG-16 network.
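The per-module selection can be sketched as follows. The function name is hypothetical, and the positive fraction of 0.25 is an assumed placeholder because the value is not given here. The score is taken to be the predicted face probability of each anchor.

```python
import numpy as np

def ohem_select(face_scores, labels, batch_size, pos_fraction=0.25):
    """Pick hard anchors for one detection module.

    face_scores: predicted face probability per anchor.
    labels: 1 for positive, 0 for negative, -1 for ignored anchors.
    pos_fraction=0.25 is an assumption for illustration.
    """
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = min(len(pos), int(batch_size * pos_fraction))
    n_neg = min(len(neg), batch_size - n_pos)
    # Hard positives: faces the network currently scores lowest.
    hard_pos = pos[np.argsort(face_scores[pos])[:n_pos]]
    # Hard negatives: background anchors the network currently scores highest.
    hard_neg = neg[np.argsort(-face_scores[neg])[:n_neg]]
    return hard_pos, hard_neg
```

Calling this once per module, with that module's scores and labels, keeps the mini-batch of each module focused on its own hardest anchors.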
All models are trained on GPUs in parallel using stochastic gradient descent. We use a mini-batch of images. Our networks are fine-tuned for iterations starting from a pre-trained ImageNet classification network. Following, we fix the initial convolutions up to conv3-1. The learning rate is initially set to and drops by a factor of 10 after iterations. We set momentum to , and weight decay to . Anchors with an IoU greater than 0.5 are assigned to the positive class, and anchors which have an IoU less than 0.3 with all ground-truth faces are assigned to the background class. For anchor generation, we use scales in , in , and in with a base anchor size of pixels. All anchors have an aspect ratio of one. During training, detections per module are selected for each image. During inference, each module outputs best scoring anchors as detections, and NMS with a threshold of is performed on the outputs of all modules together.
WIDER dataset: This dataset contains images with annotated faces, of which are in the train set, in the validation set and the rest are in the test set. The validation and test sets are divided into “easy”, “medium”, and “hard” subsets cumulatively (the “hard” set contains all images). This is one of the most challenging public face datasets, mainly due to the wide variety of face scales and occlusion. We train all models on the train set of the WIDER dataset and evaluate on the validation and test sets. Ablation studies are performed on the validation set (“hard” subset).
FDDB: FDDB contains 2845 images and 5171 annotated faces. We use this dataset only for testing.
We compare SSH with HR , CMS-RCNN , Multitask Cascade CNN , LDCF , Faceness , and Multiscale Cascade CNN . When reporting SSH without an image pyramid, we rescale the shortest side of the image up to pixels while keeping the largest side below pixels without changing the aspect ratio. SSH+Pyramid is our method when we apply SSH to a pyramid of input images. Like HR, a four level image pyramid is deployed. To form the pyramid, the image is first scaled to have a shortest side of up to pixels and the longest side less than pixels. Then, we scale the image to have min sizes of , and pixels in the pyramid. All modules detect faces on all pyramid levels, except which is not applied to the largest level.
Table 1 compares SSH with best performing methods on the WIDER validation set. SSH without using an image pyramid and based on the VGG-16 network outperforms the VGG-16 version of HR by , and in “easy”, “medium”, and “hard” subsets respectively. Surprisingly, SSH also outperforms HR based on ResNet-101 on the whole dataset (“hard” subset) by . In contrast HR deploys an image pyramid. Using an image pyramid, SSH based on a light VGG-16 model, outperforms the ResNet-101 version of HR by a large margin, increasing the state-of-the-art on this dataset by .
The precision-recall curves on the test set are presented in Figure 5. We submitted the detections of SSH with an image pyramid only once for evaluation. As can be seen, SSH based on a headless VGG-16 outperforms the prior methods on all subsets, increasing the state-of-the-art by .
In these datasets, we resize the shortest side of the input to 400 pixels while keeping the larger side less than 800 pixels, leading to an inference time of less than ms/image. We compare SSH with HR, HR-ER, Conv3D, Faceness, Faster R-CNN(VGG-16), MTCNN, DP2MFD, and Headhunter. Figures 5(a) and 5(b) show the ROC curves with respect to the discrete and continuous measures on the FDDB dataset respectively.
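The resizing rule can be sketched as a single scale factor applied to both sides, so the aspect ratio is preserved. The function name is hypothetical; the defaults of 400 and 800 pixels are the values quoted above. The sketch caps the longer side at the limit rather than strictly below it.

```python
def resize_scale(height, width, min_side=400, max_side=800):
    """Scale factor that brings the shorter side to min_side without the
    longer side exceeding max_side; the aspect ratio is unchanged."""
    short, long_side = min(height, width), max(height, width)
    scale = min_side / short
    # If the longer side would overshoot, let it set the scale instead.
    if long_side * scale > max_side:
        scale = max_side / long_side
    return scale

# A 600 x 1000 image is limited by its shorter side,
# a 300 x 1200 image by its longer side.
scale_a = resize_scale(600, 1000)
scale_b = resize_scale(300, 1200)
```

For elongated images the shorter side ends up below the target, which is the price of keeping the longer side within the limit.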
It should be noted that HR-ER also uses FDDB as training data in a -fold cross validation fashion. Moreover, HR-ER and Conv3D both generate ellipses to decrease the localization error. In contrast, SSH does not use FDDB for training, and is evaluated on this dataset out-of-the-box by generating bounding boxes. Nevertheless, as can be seen, SSH outperforms all other methods with respect to the discrete score. Compared to HR, SSH improves the results by and with respect to the continuous and discrete scores.
SSH performs face detection in a single stage while removing all fully-connected layers from the VGG-16 network. This makes SSH an efficient detection algorithm. Table 2 shows the inference time with respect to different input sizes. We report average time on the WIDER validation set. Timings are measured on an NVIDIA Quadro P6000 GPU. In column with max size , the shortest side of the images is resized to “” pixels while keeping the longest side less than “” pixels. As shown in Sections 4.3 and 4.4, SSH outperforms HR on all datasets without an image pyramid. On WIDER we resize the image to the last column and as a result detection takes ms/image. In contrast, HR has a runtime of ms/image, more than 5X slower. As mentioned in Section 4.4, a maximum input size of is enough for SSH to achieve state-of-the-art performance on FDDB and Pascal-Faces, with a detection time of ms/image. If an image pyramid is used, the runtime would be dominated by the largest scale.
As discussed in Section 3.2, SSH uses each of its detection modules, , to detect faces in a certain range of scales from layers with different strides. To better understand the impact of these design choices, we compare the results of SSH with and without multiple detection modules. That is, we remove and only detect faces with from conv5-3 in VGG-16. However, for fair comparison, all anchor scales in are moved to (we use in ). Other parameters remain the same. We refer to this simpler method as “SSH-Only”. As shown in Figure 6(a), by removing the multiple detection modules from SSH, the AP significantly drops by on the hard subset which contains smaller faces. Although SSH does not deploy the expensive head of its underlying network, results suggest that having independent simple detection modules from different layers of the network is an effective strategy for scale-invariance.
The input size can affect face detection precision, especially for small faces. Table 3 shows the AP of SSH on the WIDER validation set when it is trained and evaluated with different input sizes. Even at a maximum input size of , SSH outperforms HR-VGG16, which up-scales images up to pixels, by , showing the effectiveness of our scale-invariant design for detecting small faces.
As discussed in Section 3.5, we apply hard negative and positive mining (OHEM) to select anchors for each of our detection modules. To show its role, we train SSH, with and without OHEM. All other factors are the same. Figure 6(b) shows the results. Clearly, OHEM is important for the success of our light-weight detection method which does not use the pre-trained head of the VGG-16 network.
In SSH, to form the input features for detection module , the outputs of conv4-3 and conv5-3 are fused together. Figure 6(c) shows the effectiveness of this design choice. Although it does not have a noticeable computational overhead, it improves the AP on the WIDER validation set, as illustrated.
Figure 8 shows some qualitative results on the WIDER validation set. The colors encode the score of the classifier. Green and blue represent score and respectively.
We introduced the SSH detector, a fast and lightweight face detector that, unlike two-stage proposal/classification approaches, detects faces in a single stage. SSH localizes and detects faces simultaneously from the early convolutional layers in a classification network. SSH is able to achieve state-of-the-art results without using the “head” of its underlying classification network (fc layers in VGG-16). Moreover, instead of processing an input pyramid, SSH is designed to be scale-invariant while detecting different face scales in a single forward pass of the network. SSH achieves state-of-the-art performance on the challenging WIDER dataset as well as FDDB and Pascal-Faces while reducing the detection time considerably.
Acknowledgement This research is based upon work supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA R&D Contract No. 2014-14071600012. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the ODNI, IARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon.
From facial parts responses to face detection: A deep learning approach. In Proceedings of the IEEE International Conference on Computer Vision, pages 3676–3684, 2015.
Face detection, pose estimation, and landmark localization in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2879–2886. IEEE, 2012.