Face detection is a fundamental step for many face-related applications, e.g., face landmark localization [31, 40] and face recognition [24, 27, 41]. Starting from the pioneering work of Viola-Jones, face detection has made substantial progress, especially with the recent development of convolutional neural networks. However, occlusion remains a challenging problem, and few works have been presented to address it. More importantly, occlusion caused by masks, sunglasses, or other faces is widespread in real-life applications.
The difficulty of addressing occlusion lies in the risk of false positives. Consider the case of detecting a face occluded by sunglasses, where only the lower part of the face is visible. A model that recognizes a face based only on the lower part will easily produce false detections at positions such as hands, which share a similar skin color. How to handle occlusion while preventing false positives is still a challenging research topic.
In this paper, we present an effective face detector based on the one-shot detection pipeline, called Face Attention Network (FAN), which addresses both the occlusion and false positive issues. More specifically, following a setting similar to RetinaNet, we utilize a feature pyramid network and use different layers of the network to handle faces of different scales. Our anchor setting is designed specifically for the face application, and an anchor-level attention is introduced which provides different attention regions for different feature layers. The attention is trained with supervision from anchor-specific heatmaps. In addition, data augmentation such as random crop is introduced to generate more cropped (occluded) training samples.
In summary, there are three contributions in our paper.
We propose an anchor-level attention, which addresses the occlusion issue in the face detection task. An illustrative example of our detection results in a crowded scene can be found in Figure LABEL:fig:view.
A practical baseline setting is introduced based on the one-shot RetinaNet detector, which obtains comparable performance with fast computation speed.
2 Related Work
Face detection, as a fundamental problem of computer vision, has been extensively studied. Prior to the renaissance of convolutional neural networks (CNNs), numerous machine learning algorithms were applied to face detection. The pioneering work of Viola-Jones utilizes AdaBoost with Haar-like features to train a cascade model that detects faces with real-time performance. Deformable part models (DPM) have also been employed for face detection with remarkable performance. However, a limitation of these methods is their use of weak features, e.g., HOG or Haar-like features. Recently, deep learning based algorithms have been used to improve both feature representation and classification performance. Depending on whether they follow the propose-and-refine strategy, these methods can be divided into single-stage detectors, such as YOLO, SSD, and RetinaNet, and two-stage detectors, such as Faster R-CNN.
Single-stage method. CascadeCNN proposes a cascade structure to detect faces from coarse to fine. MT-CNN develops an architecture that addresses detection and landmark alignment jointly. Later, DenseBox utilizes a unified end-to-end fully convolutional network to predict confidence and bounding boxes directly. UnitBox presents a new intersection-over-union (IoU) loss to directly optimize the IoU target. SAFD and RSA unit focus on handling scale explicitly using a CNN or an RNN. RetinaNet introduces a new focal loss to relieve the class imbalance problem.
Two-stage method. Besides, face detection has inherited some achievements from generic object detection tasks. The Faster R-CNN framework has been used to improve face detection performance. CMS-RCNN enhances the Faster R-CNN architecture by adding body context information. Convnet joins the Faster R-CNN framework with a 3D face model to increase occlusion robustness. Additionally, Spatial Transformer Networks (STN) and its variant, OHEM, and grid loss also present several effective strategies to improve face detection performance.
3 Face Attention Network (FAN)
Although remarkable improvements have been achieved for the face detection problem as discussed in Section 2, the challenge of locating faces with large occlusion remains. Meanwhile, in many face-related applications such as security surveillance, faces usually appear with large occlusion by masks, sunglasses, or other faces, and these need to be detected as well. One approach addresses this problem by merging information from different feature layers; another changes the anchor matching strategy to increase the recall rate. Inspired by prior work, we leverage a fully convolutional feature hierarchy in which each layer handles faces within a different scale range by assigning different anchors. Our algorithm, called Face Attention Network (FAN), can be considered as an integration of a single-stage detector discussed in Section 3.1 and our anchor-level attention in Section 3.2. An overview of our network structure can be found in Figure 1.
3.1 Base Framework
A convolutional neural network has different semantic information and spatial resolution at different feature layers. Shallow layers usually have high spatial resolution, which is good for the spatial localization of small objects, but low semantic information, which is not good for visual classification. On the other hand, deep layers carry more semantic information, but their spatial resolution is compromised. Recent work such as Feature Pyramid Network (FPN) proposes a divide-and-conquer principle: a U-shaped structure is attached to maintain both high spatial resolution and semantic information, and objects of different scales are split and handled at different feature layers.
RetinaNet combines FPN with a ResNet as backbone to generate a hierarchy of feature pyramids with rich semantic information. Based on this backbone, RetinaNet attaches two subnets: one for classification and the other for regression. We borrow the main network structure from RetinaNet and adapt it for the face detection task.
The classification subnet applies four convolution layers, each with 256 filters, followed by a convolution layer with $KA$ filters, where $K$ is the number of classes and $A$ is the number of anchors per location. For face detection, $K = 1$ since we use sigmoid activation, and the same $A$ is used in most experiments. All convolution layers in this subnet share parameters across all pyramid levels, just like the original RetinaNet. The regression subnet is identical to the classification subnet except that it terminates in $4A$ convolution filters with linear activation. Figure 1 provides an overview of our algorithm. Note that we only draw three pyramid levels for illustrative purposes.
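As a rough illustration of the head sizes described above, the following sketch counts the output channels of the final layer in each subnet. It assumes the RetinaNet convention of $KA$ classification outputs and $4A$ regression outputs per spatial location; the function name `head_output_channels` and the anchor count used in the usage line are hypothetical and not taken from this paper.

```python
def head_output_channels(num_classes: int, num_anchors: int) -> dict:
    """Channels of the last conv layer in each subnet.

    Assumes the RetinaNet convention: one sigmoid score per class and
    anchor for classification, four box offsets per anchor for regression.
    """
    if num_classes < 1 or num_anchors < 1:
        raise ValueError("num_classes and num_anchors must be positive")
    return {
        "classification": num_classes * num_anchors,
        "regression": 4 * num_anchors,
    }


# Hypothetical usage: a single 'face' class with an assumed 6 anchors per location.
channels = head_output_channels(num_classes=1, num_anchors=6)
```

With a single face class, the classification head has exactly one score per anchor, which is why a sigmoid output is sufficient.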
3.2 Attention Network
Compared with the original RetinaNet, we have designed our anchor setting together with our attention function. There are three design principles:
addressing different scales of the faces in different feature layers,
highlighting the features from the face region and diminish the regions without face,
generating more occluded faces for training.
3.2.1 Anchor Assign Strategy
We first discuss our anchor setting. In FAN, we have five detector layers, each associated with anchors of a specific scale. In addition, the aspect ratios of our anchors are set to 1 and 1.5, because most frontal faces are approximately square and profile faces can be considered as 1:1.5 rectangles. Besides, we calculate statistics of the ground-truth face sizes in the WiderFace train set. As Figure 2 shows, more than 80% of faces have an object scale from 16 to 406 pixels. Faces of small size lack sufficient resolution, and therefore it may not be a good choice to include them in the training data. Thus, we set our anchors to cover areas from $16^2$ to $406^2$ across the pyramid levels. We set the anchor scale step such that every ground-truth box has an anchor with sufficiently high IoU. Specifically, an anchor is assigned to the ground-truth box with which it has the highest IoU if that IoU exceeds a positive threshold, and to background if its highest IoU is below a lower negative threshold. Unassigned anchors are ignored during training.
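The anchor layout and assignment rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes RetinaNet-style anchors whose base side doubles per level starting at 16 pixels, three sub-scales per level, a ratio defined as height over width, and the RetinaNet default thresholds of 0.5 (positive) and 0.4 (background). All function names are hypothetical.

```python
import math


def make_anchor_shapes(base_side=16.0, num_levels=5, scales_per_level=3,
                       ratios=(1.0, 1.5)):
    """Return, for each pyramid level, a list of (width, height) anchor shapes."""
    levels = []
    for k in range(num_levels):
        shapes = []
        for s in range(scales_per_level):
            # Side length of the equivalent square anchor (assumed geometric steps).
            side = base_side * 2.0 ** (k + s / scales_per_level)
            for r in ratios:
                # r = height / width (assumption); the anchor area stays side * side.
                shapes.append((side / math.sqrt(r), side * math.sqrt(r)))
        levels.append(shapes)
    return levels


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def assign_anchors(anchors, gt_boxes, pos_thr=0.5, neg_thr=0.4):
    """Label each anchor: matched ground-truth index, -1 background, -2 ignored."""
    labels = []
    for anchor in anchors:
        best_iou, best_gt = 0.0, -1
        for j, gt in enumerate(gt_boxes):
            overlap = iou(anchor, gt)
            if overlap > best_iou:
                best_iou, best_gt = overlap, j
        if best_iou > pos_thr:
            labels.append(best_gt)   # positive anchor
        elif best_iou < neg_thr:
            labels.append(-1)        # background anchor
        else:
            labels.append(-2)        # ignored during training
    return labels
```

Under these assumptions the largest square anchor has a side of about 406 pixels, which is consistent with the 16 to 406 pixel range quoted above.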
3.2.2 Attention Function
To address the occlusion issue, we propose a novel anchor-level attention based on the network structure mentioned above. Specifically, we utilize a segmentation-like side branch as shown in Figure 5. The attention supervision is obtained by filling the ground-truth boxes. Meanwhile, as Figure 3 shows, the supervised heatmaps are associated with the ground-truth faces assigned to the anchors in the current layer. These hierarchical attention maps reduce the correlation among them. Different from the traditional usage of an attention map, which naively multiplies it with the feature maps, our attention maps are first fed to an exponential operation and then multiplied element-wise with the feature maps. This keeps more context information while highlighting the detection information. In the case of an occluded face, most invisible parts are not useful and may be harmful for detection. Some attention results can be found in Figure 4. Our attention mask enhances the feature maps in the facial area and diminishes them elsewhere.
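The two operations described above, building the supervision heatmap by filling ground-truth boxes and re-weighting features with the exponential of the attention map, can be sketched in plain Python. This is an illustrative sketch only: the function names are hypothetical, the feature map is a nested list of shape [C][H][W], and `boxes` is assumed to already contain only the faces assigned to the current pyramid level.

```python
import math


def box_heatmap(height, width, boxes, stride):
    """Binary target map for one level: 1 inside any ground-truth box, else 0.

    boxes are (x1, y1, x2, y2) in image pixels; stride maps pixels to cells.
    """
    heat = [[0.0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        c0 = max(0, math.floor(x1 / stride))
        c1 = min(width, math.ceil(x2 / stride))
        r0 = max(0, math.floor(y1 / stride))
        r1 = min(height, math.ceil(y2 / stride))
        for r in range(r0, r1):
            for c in range(c0, c1):
                heat[r][c] = 1.0
    return heat


def apply_attention(features, attention):
    """Multiply every channel element-wise by exp(attention).

    With attention in [0, 1] the scale factor lies in [1, e], so background
    features are kept (factor 1) rather than zeroed out.
    """
    return [
        [[f * math.exp(a) for f, a in zip(f_row, a_row)]
         for f_row, a_row in zip(channel, attention)]
        for channel in features
    ]
```

Compared with plain multiplication, where an attention value of 0 removes the feature entirely, the exponential keeps context outside the face region while still amplifying the face region.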
3.2.3 Data Augmentation
We find that the number of occluded faces in the training dataset, e.g., WiderFace train, is limited and is not sufficient for training a CNN. According to the annotations, only 16% of faces are highly occluded. Thus, we propose a random crop strategy, which can generate a large number of occluded faces for training. More specifically, we randomly crop square patches from the original training images, with side lengths in the range [0.3, 1] of the short edge of the original image. In addition, we keep the overlapped part of a ground-truth box if its center is in the sampled patch. Besides random crop augmentation, we also employ random flip and color jitter.
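A minimal sketch of the random crop step is given below. It only handles the geometry (patch selection and box clipping), not the pixel data, and the function name, argument names, and the uniform sampling of the side length are assumptions made for illustration.

```python
import random


def random_square_crop(img_w, img_h, boxes, rng=None, min_frac=0.3, max_frac=1.0):
    """Sample a square patch and return it with the surviving, clipped boxes.

    boxes are (x1, y1, x2, y2) in pixels. A box is kept only if its center
    falls inside the patch; the kept part is clipped to the patch and
    expressed in patch coordinates.
    """
    rng = rng or random
    short = min(img_w, img_h)
    # Side length is a random fraction of the short edge (uniform sampling assumed).
    side = int(round(rng.uniform(min_frac, max_frac) * short))
    side = max(1, min(side, short))
    x0 = rng.randint(0, img_w - side)
    y0 = rng.randint(0, img_h - side)

    kept = []
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        if x0 <= cx < x0 + side and y0 <= cy < y0 + side:
            kept.append((max(x1, x0) - x0, max(y1, y0) - y0,
                         min(x2, x0 + side) - x0, min(y2, y0 + side) - y0))
    return (x0, y0, side), kept
```

Because a box only needs its center inside the patch, many kept faces are cut by the patch border, which is what produces the additional partially visible (occluded) training samples.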
3.3 Loss function
We employ a multi-task loss function to jointly optimize model parameters:

$$L = \sum_k \frac{1}{N_k^c} \sum_{i \in A_k} L_c(p_i, p_i^*) + \lambda_1 \sum_k \frac{1}{N_k^r} \sum_{i \in A_k} \mathbb{I}(p_i^* = 1)\, L_r(t_i, t_i^*) + \lambda_2 \sum_k L_a(m_k, m_k^*)$$

where $k$ is the index of a feature pyramid level, and $A_k$ represents the set of anchors defined in pyramid level $P_k$. The ground-truth label $p_i^*$ is 1 if the anchor is positive, 0 otherwise. $p_i$ is the predicted classification result from our model. $t_i$ is a vector representing the 4 parameterized coordinates of the predicted bounding box, and $t_i^*$ is that of the ground-truth box associated with a positive anchor.
The classification loss $L_c$ is the focal loss introduced in RetinaNet over two classes (face and background). $N_k^c$ is the number of anchors in $P_k$ which participate in the classification loss computation. The regression loss $L_r$ is the smooth L1 loss defined in Fast R-CNN. $\mathbb{I}(p_i^* = 1)$ is the indicator function that restricts the regression loss to the positively assigned anchors, and $N_k^r = \sum_{i \in A_k} \mathbb{I}(p_i^* = 1)$. The attention loss $L_a$ is the pixel-wise sigmoid cross entropy. $m_k$ is the attention map generated per level, and $m_k^*$ is the ground truth described in Section 3.2.2. $\lambda_1$ and $\lambda_2$ are used to balance these three loss terms and are simply kept fixed here.
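To make the three terms concrete, here is a small plain-Python sketch of the loss for a single pyramid level. It is illustrative only: the focal loss parameters `alpha=0.25` and `gamma=2.0` are the defaults from the focal loss paper rather than values stated here, the balance weights default to 1.0 as an assumption, the attention term is averaged over pixels as an assumption, and all function names are hypothetical.

```python
import math


def focal_loss(p, target, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss for one anchor; p is the predicted face probability."""
    p = min(max(p, eps), 1.0 - eps)
    pt = p if target == 1 else 1.0 - p
    weight = alpha if target == 1 else 1.0 - alpha
    return -weight * (1.0 - pt) ** gamma * math.log(pt)


def smooth_l1(pred, target):
    """Smooth L1 loss summed over the 4 box coordinates."""
    total = 0.0
    for a, b in zip(pred, target):
        d = abs(a - b)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def pixel_cross_entropy(probs, targets, eps=1e-7):
    """Mean sigmoid cross entropy over a flattened attention map."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / max(1, len(probs))


def level_loss(cls_probs, labels, box_preds, box_targets,
               att_probs, att_targets, lambda1=1.0, lambda2=1.0):
    """Classification + regression + attention loss for one pyramid level."""
    n_cls = max(1, len(labels))   # anchors participating in classification
    n_pos = max(1, sum(labels))   # positively assigned anchors
    l_cls = sum(focal_loss(p, y) for p, y in zip(cls_probs, labels)) / n_cls
    l_reg = sum(smooth_l1(t, t_star)
                for t, t_star, y in zip(box_preds, box_targets, labels)
                if y == 1) / n_pos
    l_att = pixel_cross_entropy(att_probs, att_targets)
    return l_cls + lambda1 * l_reg + lambda2 * l_att
```

The total loss would be the sum of `level_loss` over all pyramid levels; ignored anchors are assumed to have been filtered out before the call.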
We use ResNet-50 as the base model. All models are trained by SGD over 8 GPUs with a total of 32 images per mini-batch (4 images per GPU). Similar to prior work, the four convolution layers attached to FPN are initialized with a fixed bias and Gaussian weights of fixed variance. For the final convolution layer of the classification subnet, we use a dedicated bias initialization. Meanwhile, the initial learning rate is set to 3e-3. We sample 10k image patches per epoch. Models without data augmentation are trained for 30 epochs. With data augmentation, models are trained for 130 epochs, with the learning rate divided by 10 at 100 epochs and again at 120 epochs. Weight decay is 1e-5 and momentum is 0.9. Anchors whose IoU with a ground-truth box is sufficiently high are assigned to the positive class, and anchors whose IoU with all ground-truth boxes is low are assigned to the background class, as described in Section 3.2.1.
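The step schedule used with data augmentation can be written as a short function. This is a sketch under one assumption that the text does not settle: epochs are counted from zero, so the first drop takes effect once 100 epochs have been completed. The function name is hypothetical.

```python
def learning_rate(epoch, base_lr=3e-3, milestones=(100, 120), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each milestone epoch."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr *= factor
    return lr


# Rates over the 130-epoch run: 3e-3, then 3e-4 from epoch 100, then 3e-5 from epoch 120.
schedule = [learning_rate(e) for e in range(130)]
```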
The performance of FAN is evaluated across multiple face datasets: WiderFace and MAFA.
WiderFace dataset: The WiderFace dataset contains 32,203 images and 393,703 annotated faces with a high degree of variability in scale, pose and occlusion. 158,989 of these are chosen as the train set, 39,496 are in the validation set and the rest form the test set. The validation set and test set are split into 'easy', 'medium' and 'hard' subsets according to the difficulty of detection. Due to the variability in scale, pose and occlusion, WiderFace is one of the most challenging face datasets. Our FAN is trained only on the train set and evaluated on both the validation set and the test set. Ablation studies are performed on the validation set.
MAFA dataset: The MAFA dataset contains 30,811 images with 35,806 masked faces collected from the Internet. It is a face detection benchmark for masked faces, in which faces have a wide variety of orientations and occlusions. Besides, this dataset is divided into a masked face subset and an unmasked face subset according to whether at least one part of each face is occluded by a mask. We use both the whole dataset and the occluded subset to evaluate our method.
|Our FAN Baseline||89.0||87.7||79.8|
4.1.1 Anchor setting and assign
We compare three anchor settings in Table 1. For the RetinaNet setting, we follow the setting described in the original paper. For our FAN baseline, we set our anchors to cover areas from $16^2$ to $406^2$ across the pyramid levels. In addition, the aspect ratio is set to 1 and 1.5. Also, inspired by prior work, we choose an anchor assignment rule with a higher cover rate. We use 8 anchors per location, spanning 4 scales (the scale interval is unchanged) and 2 ratios, 1 and 1.5. The dense setting is the same as our FAN setting except that we apply denser scales. Based on the results in Table 1, we can see that anchor scale is important to detector performance and that our setting is clearly superior to the RetinaNet setting. The comparison with the dense setting shows that a higher anchor cover rate does not translate directly into better final detection performance, as dense sampling may introduce many negative windows.
4.1.2 Attention mechanism
As discussed in Section 3.2.2, we apply an anchor-level attention mechanism to enhance the facial parts. We compare our FAN baseline with and without attention in Table 2 on the WiderFace val dataset. For the MAFA dataset, the results can be found in Table 3. Based on the experimental results, our attention improves performance by 1.1% on the WiderFace hard subset and by 2% on the MAFA masked subset.
|BaseNet||Dense anchor||Anchor assign||Attention||Data augmentation||Multi-scale||AP (easy)||AP (medium)||AP (hard)|
4.1.3 Data augmentation
According to the statistics of the WiderFace dataset, around 26% of faces are occluded. Among them, around 16% are seriously occluded. As we are targeting occluded faces, the number of training samples with occlusion may not be sufficient. Thus, we employ the random crop data augmentation discussed in Section 3.2.3. The results can be found in Table 4. The performance improvement is significant. Besides the benefits for occluded faces, our random crop augmentation potentially improves the performance on small faces, as more small faces are enlarged after augmentation.
4.1.4 WiderFace Dataset
We compare our FAN with state-of-the-art detectors such as SFD, SSH, HR and ScaleFace. Our FAN is trained on the WiderFace train set with data augmentation and tested on both the validation and test sets with multi-scale 600, 800, 1000, 1200, 1400. The precision-recall curves and AP are shown in Figure 6 and Table 5. Our algorithm obtains the best result in all subsets, i.e., 0.953 (Easy), 0.942 (Medium) and 0.888 (Hard) for the validation set, and 0.946 (Easy), 0.936 (Medium) and 0.885 (Hard) for the test set. On the hard subset, which contains many occluded faces, we have a larger margin over the previous state-of-the-art results, which validates the effectiveness of our algorithm for occluded faces. Example results from our FAN can be found in Figure 7.
4.1.5 MAFA Dataset
As the MAFA dataset is specifically designed for occluded face detection, it is adopted to evaluate our algorithm. We compare our FAN with LLE-CNNs and AOFD. Our FAN is trained on the WiderFace train set with data augmentation and tested on the MAFA test set with scales 400, 600, 800, 1000. The results in terms of average precision are shown in Table 6. FAN significantly outperforms state-of-the-art detectors on the MAFA test set with standard testing (IoU threshold = 0.5), which shows its promising performance on occluded faces. Example results from our FAN can be found in Figure 8.
4.2 Inference Time
Despite the strong performance obtained by our FAN, the speed of our algorithm is not compromised. As shown in Table 7, our FAN detector not only obtains state-of-the-art results but also runs efficiently. The computational cost is tested on an NVIDIA TITAN Xp. The min size is the length to which the shortest side of the image is resized while keeping the aspect ratio. Compared with the baseline results in Table 5, when testing with short side 1000, our FAN already outperforms previous state-of-the-art detectors.
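The "min size" resizing rule can be sketched as below. The function names are hypothetical, and rounding to the nearest integer pixel is an assumption; the second helper shows how detections found on the resized image would be mapped back to the original coordinates.

```python
def resize_keep_aspect(width, height, min_size):
    """New (width, height, scale) so that the shortest side equals min_size."""
    scale = min_size / min(width, height)
    return round(width * scale), round(height * scale), scale


def boxes_to_original(boxes, scale):
    """Map (x1, y1, x2, y2) boxes from the resized image back to the original."""
    return [tuple(v / scale for v in box) for box in boxes]


# Example: a 1024x768 image tested with short side 1000.
new_w, new_h, scale = resize_keep_aspect(1024, 768, 1000)
```

Multi-scale testing as used in the experiments would repeat this for each test scale and merge the mapped-back detections, which is why inference time grows with the chosen min size.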
In this paper, we target the problem of detecting occluded faces. We propose the FAN detector, which integrates our specifically designed single-stage base net and our anchor-level attention algorithm. With our anchor-level attention, we can highlight the features from the facial regions and relieve the risk of false positives. Experimental results on challenging benchmarks such as WiderFace and MAFA validate the effectiveness and efficiency of our proposed algorithm.
-  D. Chen, G. Hua, F. Wen, and J. Sun. Supervised transformer network for efficient face detection. In European Conference on Computer Vision, pages 122–138. Springer, 2016.
-  Y. Chen, L. Song, and R. He. Masquer hunter: Adversarial occlusion-aware face detection. arXiv preprint arXiv:1709.05188, 2017.
-  N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.
-  P. Dollar, R. Appel, S. J. Belongie, and P. Perona. Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(8):1532–1545, 2014.
-  P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008.
-  S. Ge, J. Li, Q. Ye, and Z. Luo. Detecting masked faces in the wild with lle-cnns. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2682–2690, 2017.
-  R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
-  Z. Hao, Y. Liu, H. Qin, J. Yan, X. Li, and X. Hu. Scale-aware face detection. arXiv preprint arXiv:1706.09876, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
-  P. Hu and D. Ramanan. Finding tiny faces. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  L. Huang, Y. Yang, Y. Deng, and Y. Yu. Densebox: Unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874, 2015.
-  M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.
-  H. Jiang and E. Learned-Miller. Face detection with the faster r-cnn. In Automatic Face & Gesture Recognition, pages 650–657. IEEE, 2017.
-  H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network cascade for face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5325–5334, 2015.
-  Y. Li, B. Sun, T. Wu, and Y. Wang. Face detection with end-to-end integration of a convnet and a 3d model. In European Conference on Computer Vision, pages 420–436. Springer, 2016.
-  T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. arXiv preprint arXiv:1612.03144, 2016.
-  T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal loss for dense object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2980–2988, 2017.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
-  Y. Liu, H. Li, J. Yan, F. Wei, X. Wang, and X. Tang. Recurrent scale approximation for object detection in cnn. In IEEE International Conference on Computer Vision, 2017.
-  M. Najibi, P. Samangouei, R. Chellappa, and L. S. Davis. Ssh: Single stage headless face detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4875–4884, 2017.
-  E. Ohn-Bar and M. M. Trivedi. To boost or not to boost? on the limits of boosted trees for object detection. In Pattern Recognition (ICPR), 2016 23rd International Conference on, pages 3350–3355. IEEE, 2016.
-  M. Opitz, G. Waltner, G. Poier, H. Possegger, and H. Bischof. Grid loss: Detecting occluded faces. In European Conference on Computer Vision, pages 386–402. Springer, 2016.
-  O. M. Parkhi, A. Vedaldi, A. Zisserman, et al. Deep face recognition. In BMVC, volume 1, page 6, 2015.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
-  F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 815–823, 2015.
-  A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 761–769, 2016.
-  P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I–I. IEEE, 2001.
-  P. Viola and M. J. Jones. Robust real-time face detection. International journal of computer vision, 57(2):137–154, 2004.
-  X. Xiong and F. De la Torre. Supervised descent method and its applications to face alignment. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 532–539, 2013.
-  B. Yang, J. Yan, Z. Lei, and S. Z. Li. Aggregate channel features for multi-view face detection. In Biometrics (IJCB), 2014 IEEE International Joint Conference on, pages 1–8. IEEE, 2014.
-  S. Yang, P. Luo, C.-C. Loy, and X. Tang. From facial parts responses to face detection: A deep learning approach. In Proceedings of the IEEE International Conference on Computer Vision, pages 3676–3684, 2015.
-  S. Yang, P. Luo, C.-C. Loy, and X. Tang. Wider face: A face detection benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5525–5533, 2016.
-  S. Yang, Y. Xiong, C. C. Loy, and X. Tang. Face detection through scale-friendly deep convolutional networks. arXiv preprint arXiv:1706.02863, 2017.
-  J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang. Unitbox: An advanced object detection network. In Proceedings of the 2016 ACM on Multimedia Conference, pages 516–520, 2016.
-  K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
-  S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. S3fd: Single shot scale-invariant face detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 192–201, 2017.
-  C. Zhu, Y. Zheng, K. Luu, and M. Savvides. Cms-rcnn: contextual multi-scale region-based cnn for unconstrained face detection. In Deep Learning for Biometrics, pages 57–79. Springer, 2017.
-  X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face alignment across large poses: A 3d solution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 146–155, 2016.
-  X. Zhu, Z. Lei, J. Yan, D. Yi, and S. Z. Li. High-fidelity pose and expression normalization for face recognition in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 787–796, 2015.