Face detection has been well studied in recent years. From the pioneering work of the Viola-Jones face detector [Viola and Jones2001] to recent state-of-the-art CNN-based methods, the performance of face detectors has improved remarkably. For example, the average precision has been boosted to over 98% [Hu and Ramanan2017, Najibi et al.2017, Zhang et al.2017] on the unconstrained FDDB dataset.
Although face detection algorithms have obtained quite good results under general scenarios, detecting faces in specific scenarios is still worth studying. For instance, one of the remaining challenges is partially occluded face detection. Facial occlusions occur frequently, e.g., due to facial accessories such as sunglasses, masks and scarves.
Occluded faces are only partially visible, and occluded regions have arbitrary appearances that may diverge from normal face regions. Hence occluded faces exhibit significant intra-class variation, leading to difficulties in learning discriminative features for detection. A standard paradigm to address this problem is to enlarge the training dataset of occluded faces, but this does not solve the problem at its root. Moreover, the lack of large-scale occluded face datasets makes this obstacle even harder to handle.
In this paper, we propose a framework for occluded face detection, aiming to formulate a new strategy to tackle the problem of limited occluded face training data, and to exploit the power of CNN representations for occluded faces as far as possible. Firstly, motivated by the remarkable success achieved by adversarial learning in recent years, a deep adversarial network is proposed in our approach to generate face samples with occlusions. A compact constraint is adopted to reinforce the realism of the generated masks. Secondly, we introduce an occlusion-aware model that predicts the occlusion segments while detecting faces. For the detection task, a segmentation branch can be of great help in locating heavily occluded face areas. Intuitively, jointly solving these two tasks can be reciprocal.
To sum up, we make contributions in the following aspects:
A novel adversarial framework is proposed to alleviate the lack of occluded training face images by generating occluded or masked face features. We employ a compact constraint to get more realistic occlusions.
Mask prediction is conducted simultaneously with detecting occluded faces. The occluded area is regarded not as a hindrance but as an aid to face detection.
Experimental evaluations on the MAFA dataset demonstrate that the proposed AOFD can significantly improve face detection accuracy under heavy occlusion. Besides, AOFD also achieves competitive performance on the unconstrained face detection benchmark.
2 Related Work
The Viola-Jones detector [Viola and Jones2001] can be recognized as a milestone in the field of face detection. Following this work, many boosting-based models were proposed [Mathias et al.2014, Zhu and Peng2016], focusing on designing more sophisticated hand-crafted features or improving the boosting strategy. Deformable part models [Felzenszwalb et al.2010, Ghiasi and Fowlkes2014] were first proposed for object detection and achieved impressive accuracy in complex environments; the pipeline of these methods is divided into two stages. Recently, benefiting from the prosperity of social networks and big data, numerous end-to-end deep learning based object detection algorithms have been proposed [Girshick2015, Ren et al.2015]. CNN-based detectors have gradually become the mainstream in face detection [Farfade et al.2015, Li et al.2015, Chen and Li2017, Wu et al.2015].
Although many efforts have been made in face detection, the performance of occluded face detection is still far from satisfactory. [Yang et al.2015] inferred faceness scores through local part responses, but the additional face-specific attribute annotations required by this method are very difficult to collect. [Mahbub et al.2016] introduced a partial face detection approach based on the detection of facial segments, mainly focusing on detecting incomplete faces captured by the front cameras of smartphones. Recently, [Ge et al.2017] combined pre-trained CNN features with local linear embedding to obtain similarity-based descriptors for partially visible faces. [Wang et al.2017a] applied anchor-level attention on Feature Pyramid Networks [Lin et al.2016].
As mentioned above, our work is also related to adversarial learning, which provides a simple yet efficient way to train powerful models via a min-max two-player game between a generator and a discriminator. Recently, researchers have begun to use adversarial learning to increase the capacity of the discriminator. [Wang et al.2017b] used adversarial learning to generate hard examples for object detection. [Li et al.2017] employed a Perceptual GAN to enhance the representations of small objects. Inspired by these applications, we develop an adversarial occlusion-aware model that synthesizes occlusion-like face features to boost occluded face detectors.
3 Proposed Method

In this section, we propose AOFD to tackle one of the most common and vicious problems in face detection: occlusion. We first analyze the occluded face detection problem (Sec. 3.1) and summarize the overall architecture of AOFD (Sec. 3.2), and then introduce the mask generation and segmentation methods of AOFD in Sec. 3.3 and Sec. 3.4, respectively.
3.1 Problem Analysis
In real-world situations, face occlusion problems can generally be classified into three categories: facial landmark occlusion, occlusion by other faces, and occlusion by objects. Facial landmark occlusion includes conditions such as wearing glasses or gauze masks. Occlusion by other faces is a complicated situation, because a detector easily mis-recognizes several faces as one or detects only a part of the faces; the segmentation method is proposed to mitigate this problem. When occluded by an object, usually more than half of a face is directly masked; an original masking strategy is used to mimic these in-the-wild situations.
We also visualized the features of occluded faces and found that occluded areas rarely respond. For some heavily occluded faces, the useful information in the feature maps is too scarce for a detector to identify a face. To tackle this problem, we need to enhance the representation ability of the exposed areas. On the other hand, recognition of the occluded area can also indicate that "there is a face", provided that sufficient context information is available.
3.2 Overall Architecture
In order to detect faces with heavy occlusion, a robust detector needs not only to find distinctive parts such as the eyes, nose and mouth, but also to turn the interference of the occlusions into beneficial information. For the former, we find that undetected faces are typically those whose characteristic facial parts, such as the eyes and mouth, are occluded. One feasible remedy is to mask the distinctive parts of faces in the training set, forcing the detector to learn what a face may look like even when little of it is exposed. To this end, a mask generator is designed in an adversarial way to generate a mask for each positive sample; it generates different masks for faces of different poses. A masking strategy is applied for better utilization of the mask generator as well. More details are given in Sec. 3.3.
For the latter, we introduce a segmentation branch to segment occluded areas, including hair, glasses, scarves, hands and other objects. This is not an easy task due to the scarcity of training samples. Therefore, we labeled 374 training samples downloaded from the Internet and devised an original training strategy; more details are given in Sec. 4.1. We denote our training images for segmentation as SFS (small dataset for segmentation).
As demonstrated in Figure 2, a mask generator is added after a region of interest (RoI) pooling layer, followed by a classification branch and a bounding box regression branch. Finally, a segmentation branch is responsible for segmenting the occluded area inside each bounding box. The final output combines classification, bounding box regression and occlusion segmentation. The overall loss of our architecture takes the following multi-task form:
$$\mathcal{L} = \lambda_1 \mathcal{L}_{cls} + \lambda_2 \mathcal{L}_{reg} + \lambda_3 \mathcal{L}_{seg}$$

where $\mathcal{L}_{cls}$ denotes a binary softmax loss for classification, $\mathcal{L}_{reg}$ denotes a smooth L1 loss for bounding box regression, and $\mathcal{L}_{seg}$ is the binary softmax loss applied to the segmentation branch. During training, the coefficients $\lambda_1$, $\lambda_2$ and $\lambda_3$ are all set to 1.
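As a concrete illustration, the multi-task objective can be sketched in a few lines of NumPy. This is a minimal sketch for a single RoI; the function names, tensor shapes and the per-pixel averaging of the segmentation loss are illustrative assumptions, not the actual implementation.

```python
import numpy as np

def softmax_xent(logits, label):
    """Binary softmax cross-entropy for one prediction (logits: shape (2,))."""
    z = logits - logits.max()                 # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    return -np.log(p[label] + 1e-12)

def smooth_l1(pred, target):
    """Smooth L1 loss, summed over the 4 box coordinates."""
    d = np.abs(pred - target)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).sum()

def aofd_loss(cls_logits, cls_label, box_pred, box_target,
              seg_logits, seg_labels, l1=1.0, l2=1.0, l3=1.0):
    """L = l1*L_cls + l2*L_reg + l3*L_seg, all coefficients set to 1."""
    l_cls = softmax_xent(cls_logits, cls_label)
    l_reg = smooth_l1(box_pred, box_target)
    # segmentation: per-pixel binary softmax, averaged over pixels
    l_seg = np.mean([softmax_xent(seg_logits[i], seg_labels[i])
                     for i in range(len(seg_labels))])
    return l1 * l_cls + l2 * l_reg + l3 * l_seg
```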
3.3 Mask Generator
Mask generator: Since the human face is highly structured, facial features tend to appear in similar locations. However, with different poses, expressions and occlusions, the distinctive facial area varies significantly. Our aim is to find this distinctive area and to generate a customized mask; we visualize some examples in Figure 3. As observed, occluded areas rarely respond in the features of real images. To simulate this characteristic, masks are applied directly to RoIs. The generator, which contains four convolutional layers with a straight mapping, is therefore designed simply, as mask generation can be regarded as a binary prediction problem. What distinguishes our mask generator from [Wang et al.2017b] is the generating procedure and the mask forms: since face structures are inherently different from those of generic objects, they need to be learned in a more subtle and flexible way, or no plausible mask can be obtained.
Masking strategy: The generated mask is a one-channel heat map in which 0 represents a masked area and 1 otherwise. During training, each pixel value is squeezed to zero or one. We mask the quarter of pixels with the lowest values when training the generator, and the third with the lowest values when training the overall model.
Heavily occluded samples after masking become an extremely hard training source, making it difficult for the model to converge. Therefore, three types of masks are proposed and trained jointly with the original features. The first type uses the mask generator and corresponds to facial landmark occlusion. The second type masks half of the features, whether the left, right, top or bottom half, and the third type randomly drops half of the pixels. This masking strategy embodies the in-the-wild occlusion types analyzed in Sec. 3.1.
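The three mask types can be sketched as follows. This is a minimal NumPy illustration; the mask fractions follow the text, while the feature-map shape and the stand-in heat map are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def binarize_lowest(heat, frac):
    """Type 1: zero out the `frac` lowest-valued pixels of a generated heat
    map (1/4 when training the generator, 1/3 for the full model)."""
    k = int(frac * heat.size)
    thresh = np.sort(heat.ravel())[k]
    return (heat >= thresh).astype(np.float32)

def half_mask(shape, side):
    """Type 2: mask one half of the features ('left'/'right'/'top'/'bottom')."""
    m = np.ones(shape, np.float32)
    h, w = shape
    if side == 'left':   m[:, :w // 2] = 0
    if side == 'right':  m[:, w // 2:] = 0
    if side == 'top':    m[:h // 2, :] = 0
    if side == 'bottom': m[h // 2:, :] = 0
    return m

def random_drop(shape, frac=0.5):
    """Type 3: randomly drop `frac` of the pixels."""
    return (rng.random(shape) >= frac).astype(np.float32)

# masks multiply the RoI features: 0 = masked, 1 = kept
feat = rng.random((7, 7))                  # a 7x7 single-channel RoI map
masked = feat * half_mask(feat.shape, 'left')
```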
When training the mask generator, we employ an adversarial training method: the generator aims to increase the classification loss as much as possible. Since the masked area is limited and the distinctive facial area is comparatively salient in the feature maps, the model converges easily. However, we find this insufficient, because the generated occluded area is sometimes strip-like or sporadic, while in real situations it should be more compact. Recall that, under an edge-detector kernel, areas with longer or irregular edges produce larger per-pixel responses. We therefore design a kernel that makes the occluded area sleeker and more circular, serving as a compact constraint on the generated masks. The loss function is:
$$\mathcal{L}_G = -\lambda_1 \mathcal{L}_{cls} + \lambda_2 \mathcal{L}_{compact}$$

where $\mathcal{L}_G$ denotes the loss for the generator, $\mathcal{L}_{compact}$ denotes the compact loss, and $\lambda_1$ and $\lambda_2$ are coefficients. $\lambda_1$ is set to 1 and $\lambda_2$ is set to 1 in order to balance the derivatives. The compact loss is computed with a convolutional layer as follows:

$$\mathcal{L}_{compact} = \sum \left| M_1 \circledast k \right|$$

where $\circledast$ denotes a convolutional operation, $M_1$ is the first type of mask generated by the mask generator, and $k$ is the designed edge-detector-style kernel. In this way, strip-like or sporadic areas receive a very high penalty and more reasonable masks can be obtained.
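The effect of the compact penalty can be sketched as below. The 8-neighbour Laplacian-style kernel is a hypothetical stand-in, since the paper's exact kernel values are not reproduced here; the point is only that a strip-like mask of the same area incurs a far larger penalty than a compact blob.

```python
import numpy as np

# Hypothetical edge-detector kernel (Laplacian-style); an assumption,
# not the paper's actual kernel values.
K = np.array([[-1, -1, -1],
              [-1,  8, -1],
              [-1, -1, -1]], np.float32)

def compact_loss(mask):
    """Sum of |mask convolved with K| over all positions (zero padding).
    Long or irregular mask boundaries respond strongly, compact blobs weakly."""
    h, w = mask.shape
    padded = np.pad(mask, 1)
    out = 0.0
    for i in range(h):
        for j in range(w):
            out += abs((padded[i:i + 3, j:j + 3] * K).sum())
    return out

grid = np.zeros((20, 20), np.float32)
square = grid.copy(); square[8:12, 8:12] = 1    # compact 4x4 occlusion
strip = grid.copy(); strip[10, 2:18] = 1        # strip-like 1x16 occlusion
# same occluded area, but the strip is penalised much more heavily
```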
3.4 Segmentation Branch

Design: Previous works on segmentation have proved that CNNs are capable of comprehending the semantic information of a picture and conducting an elaborate pixel-wise classification. When detection is combined with segmentation, segmentation is usually conducted at the RoI level to achieve higher accuracy.
When segmenting each RoI, one problem in the occluded-face setting is that the overlap of two bounding boxes can have different meanings. For example, if one face is occluded by another face, part of the front face should be regarded as an occlusion for the back face, while there should not be any occluded area for the front face. Since our goal is to use the information contained in the occluded area to confirm that there is a back face, and then to make the exposed area more distinctive, ample context information is required. For these reasons, segmentation is conducted at the image level to affect the features. The detector is thereby able to find faces with more informative features that embody image-level signals, such as the appearance of a person. We call this an occlusion-aware method.
The segmentation branch is designed in a fully convolutional way. In order to suppress noise, it follows the bounding box regression branch, and only the areas inside the bounding boxes are retained (Figure 4); bounding boxes are enlarged in scale before the noise is dropped. Although the final results prove the feasibility of this method, the edges of the segmentation are somewhat rough. This is caused by the limited size of the SFS training set. Nevertheless, we have verified that the model can be trained with very limited training samples.
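The box-constrained, image-level segmentation step can be sketched as follows. The enlargement factor and the function shapes are illustrative assumptions; the paper does not specify the exact scale.

```python
import numpy as np

def keep_inside_boxes(seg_map, boxes, enlarge=1.2):
    """Zero the occlusion-segmentation map outside the (enlarged) detected
    boxes; boxes are (x1, y1, x2, y2). `enlarge` is an assumed scale factor."""
    h, w = seg_map.shape
    keep = np.zeros((h, w), bool)
    for x1, y1, x2, y2 in boxes:
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
        bw, bh = (x2 - x1) * enlarge / 2.0, (y2 - y1) * enlarge / 2.0
        xa, xb = max(0, int(cx - bw)), min(w, int(np.ceil(cx + bw)))
        ya, yb = max(0, int(cy - bh)), min(h, int(np.ceil(cy + bh)))
        keep[ya:yb, xa:xb] = True          # retain pixels inside this box
    return np.where(keep, seg_map, 0)
```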
4 Experiments

In this section, we evaluate the proposed method against state-of-the-art methods. We first describe the training details (Sec. 4.1), and then test AOFD on several competitive benchmarks (Sec. 4.2). A series of ablation studies verifies the effectiveness of our method (Sec. 4.3).
4.1 Training Details
Based on Faster RCNN [Ren et al.2015], we first train the mask generator with the settings mentioned above. In the second stage, the detector and the segmentation branch are trained jointly with the mask generator fixed. Due to the limited training data for segmentation, a non-standard training strategy is needed. We first train on SFS for 10k iterations with the original settings, then on the combination of WIDER FACE and SFS for 50k iterations with the loss weight for segmentation set to 1, and finally tune the model on SFS for 3 epochs with the original settings. Gradients from the WIDER FACE training set are zeroed for the segmentation branch, as these images carry no segmentation labels. The basic learning rate is 0.001. AOFD runs at 5 FPS on a TITAN X GPU, which is similar to the original Faster RCNN.
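The zeroing of segmentation gradients for unlabeled images can be sketched as a per-image mask over the mixed batch. Names and shapes are illustrative assumptions; only images with segmentation labels (SFS) contribute to the segmentation loss and hence to its gradients.

```python
import numpy as np

def batch_seg_loss(per_pixel_losses, has_seg_label):
    """Mixed-batch segmentation loss: WIDER FACE images carry no occlusion
    masks, so their contribution (and hence gradient) is zeroed; only SFS
    images drive the segmentation branch."""
    total, count = 0.0, 0
    for loss_map, labeled in zip(per_pixel_losses, has_seg_label):
        if labeled:                       # SFS image: keep the loss
            total += float(np.mean(loss_map))
            count += 1
    return total / count if count else 0.0
```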
Experiment settings: AOFD is based on Faster RCNN with a VGG16 backbone [Simonyan and Zisserman2014]. For the anchors of the RPN, we use three aspect ratios (1.7, 1 and 1.3) and four scales. The batch size is set to 1. An RoI is treated as foreground if its intersection over union (IoU) with any ground truth bounding box is higher than 0.5. To balance the numbers of foreground and background training samples, the ratio of foreground RoIs to background RoIs is set to 1:3. During training, the short side of an input image is resized to either 512 or 1024, on condition that the long side is no longer than 1024.
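The foreground/background sampling rule above can be sketched as below. This is an illustrative sketch; the actual implementation follows the Faster RCNN pipeline, and the helper names are assumptions.

```python
import numpy as np

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def sample_rois(rois, gt_boxes, fg_thresh=0.5, ratio=3, rng=None):
    """Label an RoI foreground if IoU > 0.5 with any ground truth, then keep
    background RoIs at (up to) a 3:1 ratio to foreground."""
    rng = rng or np.random.default_rng(0)
    fg = [r for r in rois if any(iou(r, g) > fg_thresh for g in gt_boxes)]
    bg = [r for r in rois if all(iou(r, g) <= fg_thresh for g in gt_boxes)]
    n_bg = min(len(bg), ratio * max(len(fg), 1))
    bg = [bg[i] for i in rng.choice(len(bg), n_bg, replace=False)] if bg else []
    return fg, bg
```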
4.2 Evaluation on benchmarks
We evaluate AOFD on the FDDB and MAFA databases. Although the MAFA database does not release its training set, we still obtain state-of-the-art results on the MAFA testing set without fine-tuning the model to adjust for the variance between different annotation protocols.
FDDB (Face Detection Data Set and Benchmark) is an unconstrained dataset for face detection containing 2,845 images with 5,171 faces. The detection results of different methods are shown in Figure 6. [Liu et al.2017] and several other methods obtain higher continuous scores because they transform the rectangular bounding boxes into ellipses. Since we did not carry out extra training on the FDDB training set, localization errors may increase because of differences in annotation criteria. Nevertheless, in comparison with state-of-the-art methods, we observe that AOFD outperforms all the other methods in terms of discrete score, demonstrating its strong ability to detect nearly all large faces, even though faces with a short side of less than around 15 pixels are mostly neglected due to the anchor settings.
Furthermore, AOFD also outperforms other Faster RCNN-based methods with similar settings by a large margin in recall rate at 1000 FPs on FDDB (Figure 5 and Figure 6). This superior performance further reveals that applying the masking strategy and training with a segmentation task are valuable means of enhancing the model's capacity.
MAFA is designed for the evaluation of masked face detection and contains 35,806 face annotations with a minimum size of 32×32. Since the MAFA testing set uses squares to label faces, the rectangular bounding boxes in our results are transformed into squares to match the annotation.
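The rectangle-to-square conversion can be sketched as follows. Keeping the box centre and the longer side is our assumption for illustration; the paper does not specify the exact rule.

```python
def rect_to_square(x1, y1, x2, y2):
    """Convert a rectangular detection to the square annotation style of the
    MAFA test set, keeping the box centre and the longer side (assumed rule)."""
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    s = max(x2 - x1, y2 - y1) / 2.0          # half of the longer side
    return cx - s, cy - s, cx + s, cy + s
```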
There are three types of annotations in the MAFA dataset: masked, unmasked and ignored. Blurry or deformed faces, or those with a side length of less than 32 pixels, are labeled as 'Ignored', although we find that many 'ignored' faces are also acceptable. Since the other methods ([Wang et al.2017a], [Ge et al.2017], [Zhang et al.2016]) did not count the annotations labeled as 'Ignored', we report our results on both MAFA subsets and the whole testing set for comparison (Table 1).
| Methods | All | ‘masked’ only | w/o ‘Ignored’ |
As shown in Table 1, the average precision reaches 91.9% (threshold 0.5) when we evaluate only on the faces with 'masked' and 'unmasked' labels. This result outperforms LLE-CNNs [Ge et al.2017] by a large margin and is also better than the state-of-the-art Face Attention Network (FAN) [Wang et al.2017a]. Since AOFD is proposed to address the occlusion problem, we also evaluate our model on faces labeled as 'masked' only. AOFD achieves 83.5%, a 7% improvement over the state-of-the-art result obtained by FAN.
Figure 6(c) further shows the PR (Precision-Recall) curves of the three experimental settings. If we only count the faces annotated as 'masked' (the orange curve in Figure 6(c)), precision drops sharply at the beginning. This is caused by detections of unmasked and unlabeled faces, which are regarded as false positives when evaluating only masked faces. More results on MAFA are presented in Figure 7.
Furthermore, we have studied the main obstacle preventing our model from achieving a higher AP. When the minimum IoU threshold for a true positive is lowered from 0.5 to 0.45, AP is boosted from 91.9% to 93.8%. This suggests that the localization precision of the bounding boxes can still be improved.
| Settings | Recall rate at 1000 FPs |
4.3 Model Analysis
To better understand the function of each part of our model, we ablate each component and observe AOFD's performance; the mask generator and the segmentation branch are removed one after another. We also investigate the optimal mask area and find that it is crucial to the functioning of the mask generator. Besides, the efficacy of the compact constraint and a comparison with online hard example mining (OHEM) [Shrivastava et al.2016] are discussed in this section.
Mask facilitates detection: State-of-the-art detectors are able to detect some occluded faces, but with lower confidence. As shown in Figure 8, AOFD increases the confidence of occluded faces by a large margin. Without the mask generator, AOFD pays less attention to the exposed area and face structure, and the recall rate at 1000 false positives on FDDB drops by 1.3% (Table 2). The sharp decline (3.2%) of average precision on the MAFA testing set in Figure 5(a) reveals the value of the mask generator as well. We also observe that AOFD's results drop by around 1% with only random and square-like occlusions. Since faces have unique structural characteristics such as facial symmetry, generating adaptive occlusions is essential in order to fool the detector.
Segmentation increases recall: With the segmentation branch, the result in Table 2 improves by 0.75%. This improvement is relatively slight because there are few heavily occluded faces in the FDDB testing set. The drop of average precision from 79.9% to 77.4% when the branch is removed (Figure 5(b)) confirms the effectiveness of the segmentation branch more convincingly.
Mask area is crucial: We find that the mask vitiates the detector if the mask area is too large, yet is of no use if it is too small. Figure 5 gives a brief overview of our experiments, from which we find that occluding one-third of the features is the ideal mask area.
Compact constraint matters: We propose a compact constraint to help generate more practical masks. As mentioned in Sec. 3.3, without the constraint the generated masks are discrete or sporadic and thus implausible, e.g., two pixels of occlusion on the mouth, three on the eyes and others on the corners (Figure 3). In our initial experiments, the average precision was 0.785 when masking 1/3 of RoIs without the compact constraint, similar to using a 1/6 masking area in Figure 5(c). With the compact loss, masks become harder and more reasonable, which accounts for the increase in performance.
| Methods | AP on MAFA | Recall on FDDB |
| AOFD with OHEM | 81.3% | 97.88% |
Comparison with OHEM: We compare online hard example mining [Shrivastava et al.2016] with our method in Table 3. The performance of a Faster RCNN trained with OHEM is generally worse than that of AOFD alone, but the combination of the two leads to better performance. Although a harder training procedure yields a more robust detector in this setting, the degree of hardness needs to be handled carefully; one example is the performance decrease incurred by an overly large masking area, as shown in Figure 5(c).
5 Conclusion

This paper has proposed a face detection model named AOFD to address the long-standing issue of facial occlusion. A novel masking strategy has been integrated into AOFD to increase training complexity, and it can flexibly mimic different situations of face occlusion. The multi-task training method with a segmentation branch provides a feasible solution and verifies the possibility of training an auxiliary task with very limited training data. The superior performance on both general and masked face detection benchmarks demonstrates the effectiveness of AOFD.
References

- [Chen and Li2017] Yujia Chen and Ce Li. Gm-net: Learning features with more efficiency. arXiv preprint arXiv:1706.06792, 2017.
- [Farfade et al.2015] Sachin Sudhakar Farfade, Mohammad J Saberian, and Li-Jia Li. Multi-view face detection using deep convolutional neural networks. In ACM ICMR, pages 643–650, 2015.
- [Felzenszwalb et al.2010] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE TPAMI, 32(9):1627–1645, 2010.
- [Ge et al.2017] Shiming Ge, Jia Li, Qiting Ye, and Zhao Luo. Detecting masked faces in the wild with lle-cnns. In IEEE CVPR, 2017.
- [Ghiasi and Fowlkes2014] Golnaz Ghiasi and Charless C Fowlkes. Occlusion coherence: Localizing occluded faces with a hierarchical deformable part model. In IEEE CVPR, pages 2385–2392, 2014.
- [Girshick2015] Ross Girshick. Fast r-cnn. In IEEE ICCV, pages 1440–1448, 2015.
- [Hu and Ramanan2017] Peiyun Hu and Deva Ramanan. Finding tiny faces. In IEEE CVPR, 2017.
- [Huang et al.2017] Rui Huang, Shu Zhang, Tianyu Li, and Ran He. Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis. In IEEE ICCV, 2017.
- [Li et al.2015] Haoxiang Li, Zhe Lin, Xiaohui Shen, Jonathan Brandt, and Gang Hua. A convolutional neural network cascade for face detection. In IEEE CVPR, pages 5325–5334, 2015.
- [Li et al.2017] Jianan Li, Xiaodan Liang, Yunchao Wei, Tingfa Xu, Jiashi Feng, and Shuicheng Yan. Perceptual generative adversarial networks for small object detection. In IEEE CVPR, 2017.
- [Lin et al.2016] Tsung-Yi Lin, Piotr Dollár, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. CoRR, 2016.
- [Liu et al.2017] Yu Liu, Hongyang Li, Junjie Yan, Fangyin Wei, Xiaogang Wang, and Xiaoou Tang. Recurrent scale approximation for object detection in cnn. arXiv preprint arXiv:1707.09531, 2017.
- [Mahbub et al.2016] Upal Mahbub, Vishal M Patel, Deepak Chandra, Brandon Barbello, and Rama Chellappa. Partial face detection for continuous authentication. In IEEE ICIP, pages 2991–2995, 2016.
- [Mathias et al.2014] Markus Mathias, Rodrigo Benenson, Marco Pedersoli, and Luc Van Gool. Face detection without bells and whistles. In ECCV, pages 720–735, 2014.
- [Najibi et al.2017] Mahyar Najibi, Pouya Samangouei, Rama Chellappa, and Larry Davis. Ssh: Single stage headless face detector. In IEEE ICCV, 2017.
- [Ren et al.2015] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, pages 91–99, 2015.
- [Shrivastava et al.2016] Abhinav Shrivastava, Abhinav Gupta, and Ross Girshick. Training region-based object detectors with online hard example mining. In CVPR, pages 761–769, 2016.
- [Simonyan and Zisserman2014] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [Viola and Jones2001] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In IEEE CVPR, volume 1, pages I–I, 2001.
- [Wang et al.2017a] Jianfeng Wang, Ye Yuan, and Gang Yu. Face attention network: An effective face detector for the occluded faces. arXiv preprint arXiv:1711.07246v1, 2017.
- [Wang et al.2017b] Xiaolong Wang, Abhinav Shrivastava, and Abhinav Gupta. A-fast-rcnn: Hard positive generation via adversary for object detection. In IEEE CVPR, 2017.
- [Wu et al.2015] Xiang Wu, Ran He, and Zhenan Sun. A lightened cnn for deep face representation. arXiv preprint arXiv:1511.02683, 2015.
- [Yang et al.2015] Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. From facial parts responses to face detection: A deep learning approach. In IEEE ICCV, pages 3676–3684, 2015.
- [Yang et al.2016] Shuo Yang, Ping Luo, Chen-Change Loy, and Xiaoou Tang. Wider face: A face detection benchmark. In IEEE CVPR, pages 5525–5533, 2016.
- [Zhang et al.2016] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
- [Zhang et al.2017] Shifeng Zhang, Xiangyu Zhu, Zhen Lei, Hailin Shi, Xiaobo Wang, and Stan Z Li. S3fd: Single shot scale-invariant face detector. In IEEE ICCV, 2017.
- [Zhu and Peng2016] Chao Zhu and Yuxin Peng. Group cost-sensitive boosting for multi-resolution pedestrian detection. In AAAI, pages 3676–3682, 2016.