Learning Better Features for Face Detection with Feature Fusion and Segmentation Supervision

11/20/2018 ∙ by Wanxin Tian, et al. ∙ Beijing Didi Infinity Technology and Development Co., Ltd.

The performance of face detectors has improved greatly with the development of convolutional neural networks. However, it remains challenging for face detectors to detect tiny, occluded or blurry faces. Besides, most face detectors cannot localize faces precisely and therefore cannot achieve high Intersection-over-Union (IoU) scores. We assume that the underlying problems are inadequate use of supervision information and an imbalance between semantics and details in the feature maps at all levels of a CNN, even with Feature Pyramid Networks (FPN). In this paper, we present a novel single-shot face detection network, named DF^2S^2 (Detection with Feature Fusion and Segmentation Supervision), which introduces a more effective feature fusion pyramid and a more efficient segmentation branch on ResNet-50 to handle these problems. Specifically, inspired by FPN and SENet, we apply semantic information from higher-level feature maps as contextual cues to augment low-level feature maps in a spatial and channel-wise attention style, preventing details from being covered by too much semantics and making semantics and details complement each other. We further propose a semantic segmentation branch to make the best use of detection supervision information while applying an attention mechanism in a self-supervised manner. The segmentation branch is supervised by weak segmentation ground truth (no extra annotation is required) in a hierarchical manner and is discarded at inference time, so it does not compromise inference speed. We evaluate our model on the WIDER FACE dataset and achieve state-of-the-art results.




1 Introduction

Figure 1: An example of face detection with our proposed DFS. In the above image, our model finds 827 of the 1000 faces present with few false positives. The detection confidence scores are positively correlated with the transparency of the bounding boxes. Best viewed in color.

Face detection is an essential step for many subsequent face-related applications, such as face alignment [43], face recognition [44] and face verification [26]. It has been well developed over the past few decades. Following the pioneering work of the Viola-Jones face detector [28], most early works focused on crafting effective features manually and training powerful classifiers. But these hand-crafted features are not discriminative enough and each component is isolated, making the face detection pipeline sub-optimal.

Recently, object detection has borrowed ImageNet [14] pre-trained models from image classification as backbones and has improved significantly. Because image classification needs only semantics to recognize the category, feature maps carry more semantic information and less detailed information as the CNN goes deeper. However, face detectors need both semantics and details to detect faces at different locations with various scales and characteristics. Consequently, FPN [15] follows a divide-and-conquer principle in which objects of different scales are distributed to different feature layers, with a top-down architecture attached to maintain both high spatial resolution and semantic information.

We observe that FPN obtains semantic enrichment at lower-level layers by adding transformed higher-level feature maps to lower-level feature maps, which may let too much semantics from the higher-level feature maps damage the details in the lower-level ones. As can be seen in [35], semantics represent the more semantically meaningful patterns whose receptive field is larger, while details represent basic visual patterns whose receptive field is smaller. Intuitively, the two conflict when they are fused by addition. So the key to feature fusion is to prevent conflicts among different feature maps and loss of information during transformation. To obtain semantic enrichment at lower-level layers while preventing details from being covered by too much semantics, we propose a novel feature pyramid structure that fuses higher-level and lower-level feature maps in a spatial and channel-wise attention manner. More specifically, we apply the semantic information of higher-level feature maps as contextual cues that multiply the lower-level feature maps element-wise. We further avoid loss of semantic information by applying transposed convolution (also called deconvolution [36]) to transform the feature maps.

Secondly, most works divide the detection task into a classification task and a regression task, both of which operate on pre-set anchors. When anchors do not match objects well, those objects are ignored and their detection supervision information is wasted, making optimization sub-optimal. So the anchor assignment strategy determines the performance ceiling of anchor-based face detection.

In this paper, to complement the anchor assignment strategy and make the best use of detection supervision information, we introduce an efficient segmentation branch like [18]. The segmentation branch is trained with bounding-box segmentation ground truth in a hierarchical manner. It helps the network learn more discriminative features from object regions in a self-supervised manner, which has been proved helpful in [25]. We employ segmentation only in the training phase to apply an attention mechanism, that is, a dynamic feature extractor that combines contextual fixations over time, since CNN features are naturally spatial, channel-wise and multi-layer [4]; no extra parameters are needed at inference time. We conduct extensive experiments on the WIDER FACE [32] benchmark to validate the efficacy of our proposed structure.

As a summary, the main contributions of this paper include the following:

• We propose a novel feature pyramidal structure to apply semantic information in higher-level feature maps as contextual cues to augment semantics in lower-level feature maps in a spatial and channel-wise attention manner.

• We improve the typical deep single-shot detector by compensating for the anchor mechanism with a semantic segmentation branch that applies an attention mechanism, without compromising the inference speed.

• We present a novel single-shot face detector, called DFS (Detection with Feature Fusion and Segmentation Supervision), which learns better features for face detection and can therefore handle the occlusion and multi-scale issues well. We show a qualitative result of our DFS in Figure 1.

• Extensive experiments are carried out on the WIDER FACE dataset to demonstrate the efficiency and effectiveness of our model.

• We achieve state-of-the-art results on the WIDER FACE dataset with real-time inference speed.

Figure 2: The network architecture of our proposed DFS. It consists of a backbone network, a feature fusion pyramid structure and a detection module. The detection module contains the detection branch and the segmentation branch.

2 Related work

Face Detection. Benefiting from the remarkable achievements of deep convolutional networks on image classification [14] and object detection [23] [17] [24], CNN-based face detectors have also gained much in performance recently. Deep learning models trained on large-scale image datasets provide more discriminative features for face detectors than traditional hand-crafted features. Besides, the end-to-end training style promotes better optimization. The performance gap between human and artificial face detectors has narrowed significantly. Based on whether they follow the propose-and-refine strategy, deep learning methods can be divided into two categories: one-stage approaches, such as YOLO [23], SSD [17] and RetinaNet [16], and two-stage approaches, such as Faster R-CNN [24] and R-FCN [6]. UnitBox [34] presents a new intersection-over-union (IoU) loss to directly optimize the IoU target. HR [12] builds multi-level image pyramids for multi-scale training and testing, which finds upscaled tiny faces. SFD [39] addresses tiny faces with a scale-equitable framework and a new anchor matching strategy. RetinaNet [16] introduces a new focal loss to relieve the class imbalance problem. PyramidBox [27] utilizes contextual information with an improved SSD network structure.

Attention Mechanism. Attention mechanisms have proved effective in various computer vision tasks such as image captioning [31] and visual question answering [2]. They are inspired by the reasonable assumption that human vision does not tend to process a whole image in its entirety at once; instead, one focuses only on selective parts of the whole visual space when and where needed [5]. Specifically, rather than encoding an image into a static vector, an attention mechanism allows image features to evolve from the sentence context at hand, resulting in richer and longer descriptions for cluttered images. In this way, an attention mechanism can be considered a dynamic feature extraction mechanism that combines contextual fixations over time.

Segmentation branch. A segmentation branch was initially used in the semantic segmentation task [18] to classify each pixel in an image. However, Papandreou et al. [7] proved that weakly annotated data such as bounding boxes and image-level labels can also be utilized for semantic segmentation. He et al. [9] showed that multi-task training of object detection and instance segmentation can improve object detection when extra instance segmentation annotation is available. However, we do not consider extra annotation in our work. DenseBox [13] utilizes a unified end-to-end fully convolutional network to predict confidence and bounding boxes directly. FAN [30] introduces anchor-level attention into RetinaNet to detect occluded faces. In this paper, we introduce a segmentation branch into the popular single-shot detector with weak segmentation ground truth, applying an attention mechanism without compromising inference speed.

Feature Pyramid. A feature pyramid is a structure that applies skip connections to combine semantically meaningful features with semantically weak but visually strong features. FPN [15] proposed a top-down architecture to use high-level semantic feature maps at all scales. FANet [37] agglomerates multi-scale features to augment lower-level feature maps in a concatenation style. In this paper, we propose a novel feature fusion connection which aggregates multi-scale features in a spatial and channel-wise attention manner.

3 Detection with Feature Fusion and Segmentation Supervision (DFS)

In this section, we present our DFS framework for face detection. First, we present the overall architecture in Section 3.1. Then we introduce a segmentation branch in Section 3.2 and a novel feature fusion pyramid structure, which replaces FPN (Feature Pyramid Networks), in Section 3.3; both serve to balance semantics and details in the detection feature maps at all levels. Finally, we introduce the associated training methodology in Section 3.4.

3.1 Overall architecture

Our goal is to learn more discriminative hierarchical features with enriched semantics and details at all levels in order to detect hard faces, such as tiny or partially occluded faces. Figure 2 illustrates our proposed network with the feature fusion pyramid and the segmentation branch. For generality, we adopt the widely used ResNet-50 as the backbone CNN architecture and follow SFD [39] to build our single-shot multi-scale face detector.

First, we build our feature fusion pyramid structure on four layers from ResNet-50 (colored white in the upper-left part of Figure 2). The structure takes the four feature maps from these layers as inputs and generates four corresponding new feature maps with augmented semantics and details (highlighted as blue feature maps in the lower-left part of Figure 2), whose spatial resolutions and numbers of channels are identical to those of the respective input feature maps. To get a larger receptive field for detecting bigger faces, we simply max-pool the last of these new feature maps twice in succession to obtain two extra feature maps. This gives six detection feature maps, each with its own stride. As shown in Figure 2, detection and segmentation are performed on these detection feature maps.

In the detection branch, the classification subnet applies four convolution layers, each with 256 filters, followed by a convolution layer whose number of filters is the number of classes times the number of anchors per location. For face detection the number of classes is one, since we use sigmoid activation, and we use the same number of anchors per location in most experiments. All convolution layers in this subnet share parameters across all pyramid levels to accelerate convergence. The regression subnet is identical to the classification subnet except that its final convolution layer has four filters per anchor, with linear activation.

To enhance the correlation between the classification subnet and the regression subnet and improve the separation of semantic supervision information and location supervision information, the parameters of the convolutional layers are shared across the detection branch, except for the last prediction layer.
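The pyramid and head dimensions described in this subsection can be made concrete with a small sketch. It is illustrative only: the strides 4, 8, 16, 32, 64 and 128 are an assumption (the usual ResNet stage strides followed by two 2x max-poolings), as are the ceiling rounding and the default of two anchors per location; none of these values is stated in the text above.

```python
import math

# ASSUMPTION: strides of the six detection feature maps. The first four follow
# the usual ResNet stages, the last two come from two 2x max-poolings.
ASSUMED_STRIDES = (4, 8, 16, 32, 64, 128)


def pyramid_shapes(height, width, strides=ASSUMED_STRIDES):
    """Return (rows, cols) of each detection feature map (ceiling rounding)."""
    return [(math.ceil(height / s), math.ceil(width / s)) for s in strides]


def head_output_channels(num_classes=1, anchors_per_location=2):
    """Channels of the final classification and regression layers.

    anchors_per_location=2 is a hypothetical default, not a value from the text.
    """
    cls_channels = num_classes * anchors_per_location
    reg_channels = 4 * anchors_per_location  # four box offsets per anchor
    return cls_channels, reg_channels


def total_anchors(height, width, anchors_per_location=2):
    """Total number of anchors scored over all pyramid levels."""
    return sum(rows * cols * anchors_per_location
               for rows, cols in pyramid_shapes(height, width))
```

For a 640 x 640 input under these assumptions, the six maps shrink from 160 x 160 down to 5 x 5, which shows why most anchors, and hence most of the classification work, sit on the shallowest levels.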

3.2 Segmentation branch

To compensate for the anchor assignment strategy and make full use of detection supervision information, we present our effective and efficient segmentation branch. As shown in the lower-right part of Figure 2, the segmentation branch is parallel to the classification subnet and the regression subnet in the head architecture. It takes the same feature maps as inputs as the detection branch and is supervised with bounding-box-level segmentation ground truth in a hierarchical manner. Following the matching principle of [19] and SFD [39], these hierarchical segmentation maps are associated with the ground-truth faces that match their corresponding receptive fields. The receptive field is identical between the segmentation branch and the detection branch, so that both focus on the same range of face scales. Consequently, our segmentation branch helps the network learn more discriminative features from face regions and makes classification and regression easier for the detection branch, promoting better optimization.

We add four convolutional layers after the input feature maps, followed by one convolutional layer whose number of filters equals the number of classes. For face detection the number of classes is one, since we use sigmoid activation. To strengthen the effect of segmentation supervision on the detection branch and to save parameters in the segmentation branch, the parameters of the first four convolutional layers are shared with the detection branch. The segmentation branch is discarded at inference time, because the segmentation prediction maps are not needed.

Our advantage over other uses of a segmentation branch is that, instead of applying segmentation prediction maps (like FAN [30]) or intermediate results (like DES [40]) to activate the feature maps of the main branch, we apply the attention mechanism in a self-supervised manner without extra parameters or activation operations. Besides, there is little redundant background in the bounding-box segmentation ground truth for face detection, since face regions usually occupy most of the ground-truth bounding box; this matters because cluttered backgrounds interfere with learning discriminative features from object regions. Put differently, the average IoU (Intersection over Union) between the actual segmentation ground truth and the bounding-box ground truth for faces is so high that the influence of redundant context regions is negligible.
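The weak, hierarchical segmentation ground truth described in this subsection can be sketched as follows. This is a minimal illustration rather than the authors' procedure: the function names, the use of the square root of the box area as the face size, and the per-level size ranges passed in `level_ranges` are all hypothetical stand-ins for the receptive-field matching rule mentioned above.

```python
import numpy as np


def box_mask(boxes, image_hw, stride):
    """Rasterize face boxes (x1, y1, x2, y2 in pixels) into a binary mask
    at the resolution of a feature map with the given stride."""
    rows = int(np.ceil(image_hw[0] / stride))
    cols = int(np.ceil(image_hw[1] / stride))
    mask = np.zeros((rows, cols), dtype=np.float32)
    for x1, y1, x2, y2 in boxes:
        c1, r1 = int(np.floor(x1 / stride)), int(np.floor(y1 / stride))
        c2, r2 = int(np.ceil(x2 / stride)), int(np.ceil(y2 / stride))
        mask[max(r1, 0):min(r2, rows), max(c1, 0):min(c2, cols)] = 1.0
    return mask


def hierarchical_masks(boxes, image_hw, level_ranges):
    """Build one mask per level from the faces whose size falls in that
    level's range. level_ranges maps stride -> (min_size, max_size);
    these ranges are hypothetical and must be chosen by the user."""
    masks = {}
    for stride, (lo, hi) in level_ranges.items():
        matched = [b for b in boxes
                   if lo <= np.sqrt((b[2] - b[0]) * (b[3] - b[1])) < hi]
        masks[stride] = box_mask(matched, image_hw, stride)
    return masks
```

Because the masks are derived from the detection boxes alone, no extra annotation is needed, which is the point made in the text; the cost is that background pixels inside each box are labeled as face.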

BaseNet AP (easy) AP (medium) AP (hard)
RetinaNet 92.9 90.9 74.2
FAN 88.4 88.4 80.9
FAN 94.0 93.0 86.8
FAN 91.7 90.4 84.2
FAN 95.3 94.2 88.8
Ours 92.9 91.4 81.4
Ours 94.0 92.8 84.0
Ours 94.1 92.8 86.8
Ours 95.2 94.3 88.8
Table 1: The comparative experiments with RetinaNet (Baseline) and FAN [30] on the WIDER FACE validation set. Rows of the same method differ in their attention, data augmentation and segmentation branch settings. Minimum size of input images for FAN is 1000.

3.3 Feature fusion pyramids

Figure 2 illustrates the idea of the proposed feature fusion pyramid and the feature fusion block. We apply the block to fuse feature maps from top to bottom recursively. Each fusion step takes two inputs, a shallower feature map and a deeper one. A transposed convolution with its own learned parameters is applied to the deeper, high-level feature map, and its output is multiplied element-wise with the shallower feature map. The element-wise multiplication can be seen as a combination of spatial and channel-wise attention that maximizes mutual information between lower-level and higher-level representations. Furthermore, to enhance the detailed information that is essential for detecting hard faces, the low-level feature map is added to the result of the element-wise multiplication. The newly generated feature map then continues to take part in fusion with lower-level feature maps until the lowest level is reached.
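Read together, the description above suggests the following form for one fusion step. The notation is an assumption introduced here for illustration, since the original symbols are not available: $\phi_i$ is the shallower input feature map, $\phi'_{i+1}$ is the already fused deeper map (equal to the input map at the top level), $\mathcal{T}(\cdot\,;\theta_i)$ is the transposed convolution with parameters $\theta_i$, and $\otimes$ is element-wise multiplication.

```latex
\phi'_i \;=\; \mathcal{T}\!\left(\phi'_{i+1};\, \theta_i\right) \otimes \phi_i \;+\; \phi_i
```

The first term gates the low-level details with high-level context, and the second term adds the unmodified low-level map back so that details are not suppressed where the context response is small.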

It is worth noting that, when transforming the higher-level feature maps, we apply transposed convolution instead of a combination of an up-sampling operation and a convolution. On the one hand, if we first up-sample the high-level feature map, it doubles the number of parameters for the following convolutional operation, which compromises inference speed. On the other hand, if we first convolve the high-level feature map to halve the number of channels, we inevitably lose some of its semantics, hurting the fusion of features. So we take advantage of transposed convolution, which changes the spatial resolution and the number of channels of the feature map in one step.
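A minimal NumPy sketch of the fusion step may help make the data flow concrete. It is written under stated assumptions: the transposed convolution uses a 2x2 kernel with stride 2 (the kernel size is not given in the text), the function names are hypothetical, and no bias or nonlinearity is included.

```python
import numpy as np


def transposed_conv_2x(x, weight):
    """Stride-2 transposed convolution with a 2x2 kernel (assumed kernel size).

    x:      (C_in, H, W) deeper feature map
    weight: (C_in, C_out, 2, 2) kernel
    returns (C_out, 2H, 2W)
    """
    c_in, h, w = x.shape
    c_out = weight.shape[1]
    # Each input pixel (i, j) writes a 2x2 patch at output (2i + a, 2j + b).
    y = np.einsum("cij,cdab->diajb", x, weight)
    return y.reshape(c_out, 2 * h, 2 * w)


def fuse(low, high, weight):
    """One fusion step: up-sampled semantics gate the details element-wise,
    then the low-level map is added back."""
    context = transposed_conv_2x(high, weight)
    return context * low + low


def top_down(features, weights):
    """Fuse maps ordered shallow -> deep, starting from the deepest one.

    weights[i] maps the channels of features[i + 1] to those of features[i].
    """
    fused = [features[-1]]
    for low, weight in zip(reversed(features[:-1]), reversed(weights)):
        fused.insert(0, fuse(low, fused[0], weight))
    return fused
```

Note that a single `transposed_conv_2x` call changes both the resolution and the channel count, which is the one-step property the paragraph above argues for.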

Figure 3: Precision-recall curves on WIDER FACE validation set.
Figure 4: Precision-recall curves on WIDER FACE testing set.

3.4 Training

In this section, we introduce our anchor assign strategy, loss function, data augmentation and other implementation details.

Anchor assign strategy. Following the anchor scale design in SFD, we have six detection layers, each associated with anchors of a specific scale. The scales are designed according to the effective receptive field, making the anchor size four times the stride of each layer, so the anchor area grows from the lowest pyramid level to the highest. In addition, we use two aspect ratios, because most frontal faces are approximately square and profile faces can be considered rectangular. An anchor is assigned to the ground-truth box with which it has the highest IoU if that IoU exceeds a foreground threshold, and to background if its highest IoU is below a background threshold. Unassigned anchors are ignored during training.
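The anchor layout and assignment rule above can be sketched in a few lines. The anchor scale of four times the stride comes from the text; the aspect ratios (1 and 1.5), the thresholds (0.5 for foreground, 0.4 for background) and all function names are hypothetical placeholders, since the actual values are not available here.

```python
import numpy as np


def make_anchors(rows, cols, stride, ratios=(1.0, 1.5)):
    """Anchors (x1, y1, x2, y2) centred on every cell of one level.
    The anchor scale is four times the stride; ratio = height / width.
    The default ratios are assumed values."""
    size = 4.0 * stride
    ys, xs = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    cx = (xs.ravel() + 0.5) * stride
    cy = (ys.ravel() + 0.5) * stride
    anchors = []
    for r in ratios:
        w, h = size / np.sqrt(r), size * np.sqrt(r)
        anchors.append(np.stack([cx - w / 2, cy - h / 2,
                                 cx + w / 2, cy + h / 2], axis=1))
    return np.concatenate(anchors, axis=0)


def iou_matrix(a, b):
    """Pairwise IoU between boxes a (N, 4) and b (M, 4)."""
    lt = np.maximum(a[:, None, :2], b[None, :, :2])
    rb = np.minimum(a[:, None, 2:], b[None, :, 2:])
    inter = np.clip(rb - lt, 0, None).prod(axis=2)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)


def assign(anchors, gt_boxes, pos_thr=0.5, neg_thr=0.4):
    """Label each anchor with the index of its matched ground-truth box,
    -1 for background, or -2 for ignored. Thresholds are assumed values."""
    iou = iou_matrix(anchors, gt_boxes)
    best = iou.argmax(axis=1)
    best_iou = iou.max(axis=1)
    labels = np.full(len(anchors), -2, dtype=np.int64)
    labels[best_iou < neg_thr] = -1
    pos = best_iou > pos_thr
    labels[pos] = best[pos]
    return labels
```

Faces that no anchor overlaps above the foreground threshold contribute nothing to the detection losses under this rule, which is the supervision gap the segmentation branch is meant to cover.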

Loss function. In the training phase, an extra cross-entropy loss for the segmentation branch is added to the original face detection loss to jointly optimize the model parameters. The total loss is a sum over the levels of the feature fusion pyramid of a classification term, a regression term and a segmentation term, where each level has its own set of anchors. For each anchor, the ground-truth label indicates whether the anchor is positive, and the model predicts a classification result. The predicted bounding box is represented by a vector of parameterized coordinates, as is the ground-truth box associated with a positive anchor.

The classification loss is the focal loss introduced in [16] over two classes (face and background), normalized by the number of anchors in the level that participate in the classification loss computation. The regression loss is the smooth L1 loss; an indicator function restricts it to the positively assigned anchors. The segmentation loss is the pixel-wise sigmoid cross-entropy between the segmentation prediction map generated per level and the weak segmentation ground truth described in Section 3.2. Two weights balance these three loss terms; here we simply set the regression weight to 1 and discuss the segmentation weight in Section 4.3.
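One way to write the joint loss consistent with the definitions above is given below. The symbols are assumptions chosen for illustration, not necessarily the authors' notation: $k$ indexes pyramid levels, $A_k$ is the anchor set of level $k$, $p_i$ and $p_i^{*}$ are the predicted and ground-truth labels of anchor $i$, $t_i$ and $t_i^{*}$ are the predicted and ground-truth box parameterizations, $m_k$ and $m_k^{*}$ are the predicted and weak ground-truth segmentation maps, $N_k^{c}$ and $N_k^{r}$ are the per-level normalizers of the classification and regression terms, and $\lambda_1$, $\lambda_2$ are the balancing weights.

```latex
L \;=\; \sum_{k} \frac{1}{N_k^{c}} \sum_{i \in A_k} L_c\!\left(p_i, p_i^{*}\right)
\;+\; \lambda_1 \sum_{k} \frac{1}{N_k^{r}} \sum_{i \in A_k} \mathbb{1}\!\left(p_i^{*} = 1\right) L_r\!\left(t_i, t_i^{*}\right)
\;+\; \lambda_2 \sum_{k} L_s\!\left(m_k, m_k^{*}\right)
```

Here $L_c$ is the focal loss, $L_r$ the smooth L1 loss and $L_s$ the pixel-wise sigmoid cross-entropy; taking $N_k^{r}$ as the number of positive anchors in level $k$ is a natural choice but is likewise an assumption.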

Data augmentation. According to the statistics of the WIDER FACE dataset, a share of its faces are occluded, and part of these are seriously occluded. Since we target occluded faces, the number of training samples with occlusion may not be sufficient. Thus, we employ random-crop data augmentation, which improves performance significantly. Besides its benefit for occluded faces, random-crop augmentation can also improve performance on small faces, as more small faces are enlarged after augmentation.
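A random-crop augmentation of the kind described above might look like the following sketch. The details are assumptions: the crop is square, its side is drawn uniformly between a hypothetical `min_frac` of the shorter image side and the full shorter side, and a face is kept when its centre falls inside the crop. The final resize to the training resolution is omitted.

```python
import numpy as np


def random_square_crop(image, boxes, rng, min_frac=0.3):
    """Crop a random square patch and keep faces whose centre lies inside.

    image: (H, W, C) array; boxes: (N, 4) array of x1, y1, x2, y2.
    Returns the patch and the kept boxes in patch coordinates.
    min_frac is a hypothetical lower bound on the relative crop size.
    """
    h, w = image.shape[:2]
    side = int(rng.uniform(min_frac, 1.0) * min(h, w))
    x0 = int(rng.integers(0, w - side + 1))
    y0 = int(rng.integers(0, h - side + 1))
    patch = image[y0:y0 + side, x0:x0 + side]

    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    keep = (cx >= x0) & (cx < x0 + side) & (cy >= y0) & (cy < y0 + side)
    kept = boxes[keep] - np.array([x0, y0, x0, y0])
    # Boxes cut by the crop border become partially visible faces.
    kept = np.clip(kept, 0, side)
    return patch, kept
```

Faces clipped at the crop border resemble occluded faces, and resizing a small patch back to the training resolution enlarges the small faces it contains, which matches the two benefits named in the paragraph.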

Other implementation details. Training starts by fine-tuning the ResNet-50 backbone using SGD with momentum and weight decay, with the total batch spread over GPUs. The newly added layers are initialized with the “xavier” method. We train with the initial learning rate for the first 80 epochs and then continue training at two further learning rates. Our implementation is based on Detectron [8], and our source code will be made publicly available.

4 Experiments

In this section, we first analyze the effectiveness of our segmentation branch and feature fusion pyramids structure on extensive experiments and ablation studies. Then, we compare our proposed face detector with the state-of-the-art face detectors on popular face detection benchmarks and finally evaluate the inference speed of the proposed face detector.


Easy Med. Hard
92.9 90.9 74.2
92.9 91.4 81.4
94.0 92.8 84.0
93.7 91.7 80.3
94.2 93.2 85.5
Table 2: The ablation study of feature fusion pyramid and segmentation branch. Seg. stands for the segmentation branch, Aug. stands for the data augmentation method mentioned in Section 3.4 and Fus. stands for the feature fusion pyramid.

Datasets. We conduct model analysis on the WIDER FACE dataset [32], which has 32,203 images with about 400k faces covering a large range of scales. It is divided into three subsets for training, validation and testing. The annotations of the training and validation sets are available online. According to the difficulty of the detection task, it has three splits: Easy, Medium and Hard. The evaluation metric is mean average precision (mAP) at a fixed Intersection-over-Union (IoU) threshold. We train our model on the training set of WIDER FACE and evaluate it on the validation and testing sets. Unless otherwise specified, the results in Tables 1, 2 and 3 are obtained by single-scale testing, in which the shorter side of the image is resized to 800 while keeping the aspect ratio.
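The two evaluation ingredients mentioned above, resizing by the shorter side and average precision, can be sketched generically as follows. This is not the official WIDER FACE evaluation code; the function names are hypothetical, and the AP here is the plain area under the precision-recall curve of ranked detections, assuming each detection has already been marked as a true or false positive at the chosen IoU threshold.

```python
import numpy as np


def resize_shorter_side(height, width, target=800):
    """New (height, width) after scaling the shorter side to `target`."""
    scale = target / min(height, width)
    return round(height * scale), round(width * scale)


def average_precision(scores, is_true_positive, num_gt):
    """Area under the precision-recall curve of ranked detections.

    scores:           confidence of each detection
    is_true_positive: 1 if the detection matched an unclaimed ground-truth
                      face at the required IoU, else 0
    num_gt:           number of ground-truth faces
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(1.0 - tp)
    recall = cum_tp / num_gt
    precision = cum_tp / (cum_tp + cum_fp)

    # Make precision non-increasing in recall, then integrate step-wise.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    recall = np.concatenate([[0.0], recall])
    return float(np.sum((recall[1:] - recall[:-1]) * precision))
```

Because a detection counts as correct only above the IoU threshold, a detector that finds faces but localizes them loosely is penalized, which is the localization concern raised in the abstract.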

Baseline. To evaluate our contributions, we conduct comparative experiments against a baseline. We adopt the closely related detector RetinaNet as the baseline. RetinaNet achieved state-of-the-art results on several well-known face detection benchmarks. It inherits the standard SSD framework and relieves the class imbalance problem via a novel focal loss function. We train all models with the identical strategies described in Section 3.4 for fair comparison.

4.1 Ablation studies on segmentation branch

To examine the impact of our segmentation branch, we conducted a series of comparative experiments, reported in Table 1. The comparison between the first and the sixth rows in Table 1 indicates that our segmentation branch effectively improves performance, especially for small faces. The AP increases by 0.5 and 7.2 points on the WIDER FACE medium and hard subsets, respectively (from 90.9 to 91.4 and from 74.2 to 81.4), without bells and whistles. The large gain on tiny faces demonstrates that our segmentation branch indeed helps the model learn more robust features from small faces and makes the features highlight face regions.


Easy Med. Hard
94.2 93.2 85.5
94.1 93.0 85.4
94.3 93.2 85.3
94.4 93.1 83.4
94.3 92.9 83.5

Table 3: Ablation results evaluated on the WIDER FACE validation set for the hyper-parameter that controls the trade-off between the segmentation loss and the detection loss in the loss function of Section 3.4.

Besides, we further compare our model with FAN [30], which also introduces a segmentation branch to make the network pay more attention to face regions. The differences between our model and FAN are analyzed in Section 3.2. The results of FAN are obtained by single-scale testing in which the shorter side is resized to 1000 while keeping the aspect ratio. Without data augmentation and multi-scale testing, our model scores higher than FAN on the easy, medium and hard subsets. This indicates that our segmentation branch is more effective, since more comprehensive supervision lets the model learn and adapt its parameters by itself. In contrast, FAN applies attention maps to weight the feature maps spatially in order to highlight face features, at the risk of hurting semantics and details. With data augmentation and multi-scale training, our model is comparable with FAN. Our advantage is that there are no extra parameters at inference time, with nearly the same improvements.

4.2 Ablation studies on feature fusion

We build “Ours (Fus.)” by replacing the FPN part of RetinaNet with our feature fusion pyramid structure, and “Ours (Seg.)” by adding the segmentation branch to RetinaNet, to conduct comparative experiments. In Table 2, compared with the plain RetinaNet, “Ours (Fus.)” improves on the easy, medium and hard subsets, which validates the efficacy of the feature fusion pyramid for enriching semantics and details in a balanced manner and demonstrates the superiority of our model over FPN. With the segmentation branch, the performance improves further on all three subsets. The large improvements on the hard subset show that our feature fusion method can indeed enrich semantics in lower-level feature maps without damaging details.

Algorithms Backbone Easy Med. Hard
MTCNN  [38] - 84.8 82.5 59.8
LDCF+  [22] - 79.0 76.9 52.2
CMS-RCNN  [42] VGG16 89.9 87.4 62.4
MSCNN  [1] VGG16 91.6 90.3 80.2
Face R-CNN  [29] VGG16 93.7 92.1 83.1
SSH  [21] VGG16 93.1 92.1 84.5
SFD  [39] VGG16 93.7 92.5 85.9
PyramidBox  [27] VGG16 96.1 95.0 88.9
FANet  [37] VGG16 95.6 94.7 89.5
HR  [12] ResNet101 92.5 91.0 80.6
Face R-FCN  [6] ResNet101 94.7 93.5 87.3
Zhu  [41] ResNet101 94.9 93.8 86.1
ScaleFace  [33] ResNet50 86.8 86.7 77.2
FAN  [30] ResNet50 95.3 94.2 88.8
DFS(ours) ResNet50 95.6 94.7 89.8
Table 4: Evaluation on WIDER FACE validation set (mAP). The red marked part represents the highest score in the corresponding dataset, and the blue represents the second highest score.

4.3 Experiments on Balancing the loss

Another ablation study concerns the weight of the segmentation loss. Because the segmentation branch is absent at inference time, we assume that it may not be optimal to make these losses numerically consistent. To find the optimal weight, we train our model with different weights, i.e., 0.05, 0.1, 0.2, 0.5 and 1. Table 3 shows which weight yields the best overall performance. The small margins among the results indicate that our segmentation branch consistently improves the model through its inherent ability of self-optimization and brings little risk of hurting detection performance.
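To show where the segmentation weight enters, the following NumPy sketch combines the three loss terms for a list of pyramid levels. It is a simplified illustration under stated assumptions: the focal-loss defaults (alpha 0.25, gamma 2) are the usual ones from the focal loss paper, ignored anchors are assumed to have been filtered out beforehand, the dictionary keys and function names are hypothetical, and the segmentation term is averaged over pixels.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Sigmoid focal loss summed over anchors (targets are 0 or 1)."""
    p = sigmoid(logits)
    p_t = np.where(targets == 1, p, 1.0 - p)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    return float(np.sum(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t + 1e-12)))


def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 loss summed over all box coordinates."""
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return float(np.sum(loss))


def sigmoid_cross_entropy(logits, mask):
    """Pixel-wise sigmoid cross-entropy averaged over the map."""
    p = sigmoid(logits)
    eps = 1e-12
    return float(np.mean(-(mask * np.log(p + eps)
                           + (1 - mask) * np.log(1 - p + eps))))


def joint_loss(levels, lambda_seg, lambda_reg=1.0):
    """Sum classification, regression and segmentation terms over levels."""
    total = 0.0
    for lv in levels:
        pos = lv["labels"] == 1
        n_cls = max(lv["labels"].size, 1)  # anchors taking part in classification
        n_pos = max(int(pos.sum()), 1)     # positive anchors only
        cls = focal_loss(lv["cls_logits"], lv["labels"]) / n_cls
        reg = smooth_l1(lv["box_pred"][pos], lv["box_target"][pos]) / n_pos
        seg = sigmoid_cross_entropy(lv["seg_logits"], lv["seg_mask"])
        total += cls + lambda_reg * reg + lambda_seg * seg
    return total
```

Sweeping `lambda_seg` over 0.05, 0.1, 0.2, 0.5 and 1 changes only the training objective; since the segmentation branch is dropped at inference, the weight has no effect on the deployed network's size or speed.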

4.4 Evaluation on WIDER FACE benchmark

We compare our DFS with state-of-the-art detectors such as PyramidBox, FANet, FAN and SFD. Our DFS is trained on the WIDER FACE training set with the data augmentation described in Section 3.4 and tested on both the validation and testing sets with multi-scale testing. Figure 3 and Figure 4 show the precision-recall curves on the WIDER FACE validation and testing sets, and Table 4 summarizes the state-of-the-art results on the WIDER FACE validation set. Our algorithm obtains the best result on the hard subset and competitive results on the medium and easy subsets, i.e., 0.956 (Easy), 0.947 (Medium) and 0.898 (Hard) on the validation set, and 0.949 (Easy), 0.940 (Medium) and 0.891 (Hard) on the testing set. On the hard subset, which contains many occluded, tiny and blurry faces, our model outperforms the previous state-of-the-art result of PyramidBox, which validates the effectiveness of our algorithm in handling large scale variance and occlusion.

4.5 Inference Speed

Our DFS detector is a single-shot detector and thus enjoys high inference speed. It runs in real time at 26.45 FPS for input images of a fixed size on a computing environment with an NVIDIA Tesla P40 GPU and cuDNN v7.

5 Conclusion

This paper proposed a novel framework, DFS (Detection with Feature Fusion and Segmentation Supervision), for face detection. Our model achieves state-of-the-art performance on the WIDER FACE dataset, yet still enjoys real-time inference speed on GPU thanks to the single-stage detection framework. We present an effective feature fusion pyramid structure and an efficient segmentation branch, both of which help the model learn better features. The feature fusion pyramid structure applies semantic information from higher-level feature maps as contextual cues to augment low-level feature maps in a spatial and channel-wise attention style without loss of detailed information, making semantics and details complement each other. The semantic segmentation branch uses detection supervision information to direct the model to learn more discriminative features from face regions without compromising inference speed. We note that neither idea is restricted to face detection; both might also benefit general object detection and even image segmentation. In future work, we will explore the potential of these two ideas further.

Acknowledgments. This work was sponsored by DiDi GAIA Research Collaboration Initiative, and partially supported by the National Natural Science Foundation of China under Grant Nos. 61573068 and 61871052, Beijing Nova Program under Grant No. Z161100004916088.