Improved Selective Refinement Network for Face Detection

01/20/2019 ∙ by Shifeng Zhang, et al.

As a long-standing problem in computer vision, face detection has attracted much attention in recent decades for its practical applications. With the availability of the WIDER FACE benchmark dataset, much progress has been made by various algorithms in recent years. Among them, the Selective Refinement Network (SRN) face detector introduces the two-step classification and regression operations selectively into an anchor-based face detector to reduce false positives and improve location accuracy simultaneously. Moreover, it designs a receptive field enhancement block to provide more diverse receptive fields. In this report, to further improve the performance of SRN, we exploit some existing techniques via extensive experiments, including a new data augmentation strategy, an improved backbone network, MS COCO pretraining, a decoupled classification module, a segmentation branch and the Squeeze-and-Excitation block. Some of these techniques bring performance improvements, while a few of them do not adapt well to our baseline. As a consequence, we present an improved SRN face detector by combining these useful techniques together and obtain the best performance on the widely used face detection benchmark WIDER FACE dataset.




1 Introduction

Face detection is the primary procedure for other face-related tasks including face alignment, face recognition, face animation, face attribute analysis and human computer interaction, to name a few. The accuracy of face detection systems has a direct impact on these tasks, hence the success of face detection is of crucial importance. Given an arbitrary image, the goal of face detection is to determine whether there are any faces in the image, and if present, return the image location and extent of each face. In recent years, great progress has been made on face detection [15, 26, 17, 44, 8, 31, 29, 1, 51, 49] due to the development of deep convolutional neural networks (CNNs) [30, 10, 32, 13] and the collection of the WIDER FACE benchmark dataset [41]. This challenging dataset has a high degree of variability in scale, pose and occlusion as well as plenty of tiny faces in various complex scenes, motivating a number of robust CNN-based algorithms.

Figure 1: The brief overview of Selective Refinement Network. It consists of Selective Two-step Classification (STC), Selective Two-step Regression (STR) and Receptive Field Enhancement (RFE).

We first give a brief introduction to these algorithms on the WIDER FACE dataset as follows. ACF [39] brings the concept of channel features to the face detection domain. Faceness [40] formulates face detection as scoring facial parts responses to detect faces under severe occlusion. MTCNN [47] proposes a joint face detection and alignment method using unified cascaded CNNs for multi-task learning. CMS-RCNN [55] integrates contextual reasoning into the Faster R-CNN algorithm to help reduce the overall detection errors. LDCF+ [24] utilizes the boosted decision tree classifier to detect faces. The face detection model for finding tiny faces [12] trains separate detectors for different scales. Face R-CNN [35] and Face R-FCN [37] apply Faster R-CNN [27] and R-FCN [6] to face detection and achieve promising results. ScaleFace [42] detects different scales of faces with a specialized set of deep convolutional networks with different structures. SSH [22] adds large filters on each prediction head to merge the context information. SFD [50] compensates anchors for small faces with a few strategies in the SSD [21] framework. MSCNN [2] performs detection at multiple output layers so as to let receptive fields match objects of different scales. Based on RetinaNet [19], FAN [36] proposes an attention mechanism at the anchor level to detect occluded faces. Zhu et al. [54] propose an Expected Max Overlapping score to evaluate the quality of anchor matching. PyramidBox [33] takes advantage of the information around human faces to improve detection performance. FDNet [45] applies several training and testing techniques to Faster R-CNN to perform face detection. Inspired by RefineDet [48], SRN [5] appends another binary classification and regression stage to RetinaNet, in order to filter out most of the simple negative anchors in the large feature maps and coarsely adjust the locations of anchors in the high-level feature maps. FANet [46] aggregates higher-level features to augment lower-level features at marginal extra computation cost. DSFD [16] strengthens the representation ability with a feature enhance module. DFS [34] introduces a more effective feature fusion pyramid and a more efficient segmentation branch to handle hard faces. VIM-FD [52] combines many previous techniques on SRN and achieves state-of-the-art performance.

In this report, we exploit some existing techniques from classification and detection tasks to further improve the performance of SRN, including a data augmentation strategy, an improved backbone network, MS COCO pretraining, a decoupled classification module, a segmentation branch and the SE block. By conducting extensive experiments, we share some useful techniques that make SRN regain state-of-the-art performance on WIDER FACE. Meanwhile, we list some techniques that do not work well in our model, probably because (1) our strong baseline leaves little room for them to help, (2) combining ideas is not trivial, (3) they are not robust enough to generalize, or (4) our implementation may be wrong. This does not mean that they are not applicable to other models or other datasets.

2 Review of Baseline

In this section, we present a simple review of our baseline Selective Refinement Network (SRN). As illustrated in Figure 1, it consists of the Selective Two-step Classification (STC), Selective Two-step Regression (STR) and Receptive Field Enhancement (RFE). These three modules are elaborated as follows.

2.1 Selective Two-step Classification

For one-stage detectors, the large number of anchors with an extreme positive/negative sample ratio (e.g., there are about anchors and the positive/negative ratio is approximately in SRN) leads to quite a few false positives. Hence, another stage, like an RPN, is needed to filter out some negative examples. Selective Two-step Classification, inherited from RefineDet, effectively rejects lots of negative anchors and alleviates the class imbalance problem.

Specifically, most of the anchors (i.e., ) are tiled on the first three low-level feature maps, which do not contain adequate context information, so it is necessary to apply STC on these three low-level features. The other three high-level feature maps only produce anchors with abundant semantic information, which makes them unsuitable for STC. To sum up, applying STC on the three low-level features improves results, while applying it on the three high-level ones is ineffective and adds computational cost. The STC module suppresses the number of negative anchors by a large margin, increasing the positive/negative sample ratio about times (i.e., from around : to :). The two classification steps share the same classification convolution module and the same binary Focal Loss, since both aim to distinguish faces from the background.
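The selective filtering described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `(level, score)` anchor record, the level indices 0 to 2 for the low-level maps, and the threshold value `theta=0.01` are hypothetical and are not taken from the SRN implementation.

```python
from typing import List, Tuple

# Hypothetical anchor record: (pyramid_level, first_step_face_score).
Anchor = Tuple[int, float]


def selective_first_step_filter(
    anchors: List[Anchor],
    low_levels: frozenset = frozenset({0, 1, 2}),  # assumed indices of the low-level maps
    theta: float = 0.01,                           # illustrative threshold, not from the paper
) -> List[bool]:
    """Return a keep-mask over the anchors.

    Anchors on low-level maps are kept only if their first-step face score
    reaches theta; anchors on high-level maps are never filtered here.
    """
    keep = []
    for level, score in anchors:
        if level in low_levels:
            keep.append(score >= theta)
        else:
            keep.append(True)
    return keep


anchors = [(0, 0.001), (0, 0.80), (1, 0.005), (2, 0.30), (4, 0.002), (5, 0.90)]
mask = selective_first_step_filter(anchors)
```

Note that the low-scoring anchor on level 4 survives: the first-step filter is applied selectively, so only the low-level maps lose easy negatives.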

2.2 Selective Two-step Regression

Multi-step regression like Cascade RCNN [3] can improve the accuracy of bounding box locations, especially in some challenging scenes, e.g., MS COCO-style evaluation metrics. However, applying multi-step regression to the face detection task without careful consideration may hurt the detection results.

For SRN, the numerous small anchors from the three low-level feature maps would bias the loss towards the regression problem and hinder the essential classification problem. Meanwhile, the feature representations of the three lower pyramid levels for small faces are coarse, which makes two-step regression difficult. These concerns do not arise when performing two-step regression on the three high-level features, whose detailed features of large faces with large anchor scales help regress to more accurate locations. In summary, Selective Two-step Classification and Regression is a specific and efficient variant of RefineDet for the face detection task, especially for small faces and some false positives.
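A minimal sketch of selective two-step regression follows. It assumes the common center-size box parameterization (offsets scaled by anchor width and height, log-space sizes) without variance scaling, and level indices 3 to 5 for the high-level maps; both are assumptions made for illustration rather than details confirmed by the text.

```python
import math


def decode(box, delta):
    """Apply center-size offsets (dx, dy, dw, dh) to a (cx, cy, w, h) box."""
    cx, cy, w, h = box
    dx, dy, dw, dh = delta
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))


def two_step_regress(anchor, level, first_delta, second_delta,
                     high_levels=frozenset({3, 4, 5})):  # assumed high-level indices
    """Refine the anchor with the first step only on high-level maps,
    then apply the second-step offsets to whichever box results."""
    start = decode(anchor, first_delta) if level in high_levels else anchor
    return decode(start, second_delta)


anchor = (100.0, 100.0, 64.0, 64.0)
first = (0.25, 0.0, 0.0, 0.0)   # coarse shift along x
second = (0.0, 0.5, 0.0, 0.0)   # final shift along y

high = two_step_regress(anchor, 4, first, second)  # both steps applied
low = two_step_regress(anchor, 0, first, second)   # first step skipped
```

On the high-level map the second step starts from the coarsely adjusted anchor, while on the low-level map it starts from the original anchor.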

2.3 Receptive Field Enhancement

Current networks usually possess square receptive fields, which affect the detection of objects with different aspect ratios. To address this issue, SRN designs a Receptive Field Enhancement (RFE) to diversify the receptive field of features before predicting classes and locations, which helps to capture faces well in some extreme poses.

3 Description of Improvement

Here we share some existing techniques that make SRN regain the state-of-the-art performance on the WIDER FACE dataset, including data augmentation, feature extractor and training strategy.

3.1 Data Augmentation

We use the original data augmentation strategies of SRN, including photometric distortions, random expansion by zero-padding, random cropping of patches from images and resizing of patches to . Additionally, with probability of , we utilize the data-anchor-sampling in PyramidBox [33], which randomly selects a face in an image and crops a sub-image based on the anchor. These data augmentation methods are crucial to prevent over-fitting and construct a robust model.

3.2 Feature Extractor

The greatest challenge in WIDER FACE is to accurately detect plenty of tiny faces. We believe that the ResNet-50-FPN [18] backbone of SRN still leaves considerable room to improve the accuracy, especially for tiny faces. Root-ResNet from ScratchDet [56] aims to improve the detection performance on small objects, but its training speed is much slower than that of ResNet. To balance training efficiency and detection accuracy, we improve ResNet-50 by taking advantage of Root-ResNet and DRN [43].

Specifically, the downsampling operation (stride= ) applied to the image in the first convolution layer of ResNet causes the loss of important information, especially for small faces. Following the motivation of Root-ResNet and DRN, we change the first conv layer’s stride from to and its channel number from to , and add two residual blocks (see Figure 2). One residual block enriches the representational information while the other performs downsampling; their channel numbers are reduced to and to balance the parameters. This configuration can keep essential information of small faces without additional overhead.

Figure 2: Network structure illustration. (a) ResNet-18: original structure. (b) Root-ResNet-18: replacing the conv layer with three stacked conv layers and changing the stride to . (c) New-ResNet-18: combining DRN with Root-ResNet-18 to have a training speed/accuracy trade-off backbone for SRN.
Figure 3: A qualitative result. Our detector successfully finds about faces out of the reported faces in the above image. The confidences of the detections are presented in the color bar on the right hand. Best view in color.

3.3 Training Strategy

Because our ResNet-50-FPN backbone has been modified, we cannot use the ImageNet pretrained model. One solution, as in DRN, is to train the modified backbone on the ImageNet dataset [28] and then finetune on WIDER FACE. However, He et al. [9] and ScratchDet have shown that ImageNet pretraining is not necessary. Thus, we double the training schedule to epochs and train the model with the modified backbone from scratch. One of the key factors in training from scratch is the normalization. Due to the large input size (i.e., ), one G GPU can only hold up to images, which prevents Batch Normalization [14] from working well during training from scratch. To this end, we utilize Group Normalization [38] with group= to train this modified ResNet-50 backbone from scratch.
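To make the batch-size argument concrete, here is a small sketch of the Group Normalization computation for a single sample. The data layout (a list of channels, each a flat list of spatial values), the omission of the learnable scale and shift, and the group count used in the example are simplifications chosen for illustration.

```python
import math


def group_norm(x, num_groups, eps=1e-5):
    """Normalize one sample laid out as a list of channels (flat value lists).

    Mean and variance are computed over all values of the channels in a
    group, so the statistics never depend on how many images share a batch.
    The learnable per-channel scale and shift are omitted in this sketch.
    """
    channels = len(x)
    assert channels % num_groups == 0, "channels must divide evenly into groups"
    per_group = channels // num_groups
    out = []
    for g in range(num_groups):
        group = x[g * per_group:(g + 1) * per_group]
        values = [v for channel in group for v in channel]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        inv_std = 1.0 / math.sqrt(var + eps)
        out.extend([[(v - mean) * inv_std for v in channel] for channel in group])
    return out


# Four channels with two spatial positions each, split into two groups.
sample = [[1.0, 3.0], [1.0, 3.0], [10.0, 10.0], [30.0, 30.0]]
normalized = group_norm(sample, num_groups=2)
```

Because the statistics come from a single sample, the result is the same whether the GPU holds one image or many, which is the property that matters when the batch is small.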

Besides, the recent work FA-RPN [23] demonstrates that pretraining the model on the MS COCO dataset [20] helps to improve the performance of face detectors on the WIDER FACE dataset. We attribute this improvement to the many examples of the people category and the objects with similar small scale (i.e., ground truth area ) in the MS COCO dataset. So we also apply this pretraining strategy.

3.4 Implementation Detail

Anchor Setting and Matching. Two anchor scales (i.e., and , where represents the total stride size at each pyramid level) and one aspect ratio (i.e., ) cover the input images (i.e., ), with the anchor scale ranging from to pixels across pyramid levels. We assign anchors with IOU as positive, anchors with IOU in as negative and others as ignored examples. Empirically, we set and for the first step, and and for the second step.
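The three-way anchor assignment can be sketched as below. The threshold names `theta_p` and `theta_n`, their values in the example, and the exact boundary handling (`>=` for positives, `<` for negatives) are assumptions made for illustration; the thresholds used in the experiments are not reproduced here.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_anchor(anchor, faces, theta_p, theta_n):
    """Return 1 (positive), 0 (negative) or -1 (ignored) for one anchor."""
    best = max((iou(anchor, face) for face in faces), default=0.0)
    if best >= theta_p:
        return 1
    if best < theta_n:
        return 0
    return -1


faces = [(0.0, 0.0, 10.0, 10.0)]
labels = [
    label_anchor((0.0, 0.0, 10.0, 10.0), faces, theta_p=0.6, theta_n=0.4),    # IoU 1.0
    label_anchor((0.0, 0.0, 10.0, 5.0), faces, theta_p=0.6, theta_n=0.4),     # IoU 0.5
    label_anchor((20.0, 20.0, 30.0, 30.0), faces, theta_p=0.6, theta_n=0.4),  # IoU 0.0
]
```

Running the same function with a different threshold pair for each step mirrors the idea of using separate settings for the first and second steps.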

Optimization. During training, we simply sum the STC loss and the STR loss. We pretrain the newly designed backbone network with GroupNorm on MS COCO and finetune on the WIDER FACE training set using SGD with momentum, weight decay, and batch size . After a 5-epoch warm-up, the learning rate is set to for the first epochs, and decayed to and for another and epochs, respectively. Our method is implemented with the PyTorch library.
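The shape of such a schedule (a warm-up followed by two step decays) can be sketched as a plain function of the epoch index. Every number below (`base_lr`, the milestone epochs, the decay factor) is a hypothetical placeholder, and the linear form of the warm-up is also an assumption; only the 5-epoch warm-up length comes from the text.

```python
import math


def learning_rate(epoch, base_lr=0.01, warmup_epochs=5,
                  milestones=(100, 120), gamma=0.1):
    """Learning rate for a zero-based epoch index.

    Linear warm-up to base_lr over warmup_epochs, then base_lr is
    multiplied by gamma at each milestone that has been reached.
    All default values are illustrative placeholders.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


schedule = [learning_rate(e) for e in (0, 4, 50, 100, 120)]
```

In a PyTorch training loop the same curve could be obtained by setting the optimizer's learning rate from this function at the start of each epoch.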


Inference. During the inference phase, the STC first filters out the anchors on the first three feature maps whose positive confidence scores are smaller than the threshold , and then the STR adjusts the anchors on the last three feature maps. The second step keeps the top high-confidence detections among these refined anchors. Finally, we apply non-maximum suppression (NMS) with a Jaccard overlap of to generate high-confidence results per image. The multi-scale testing strategy is used during the inference phase.
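The last two inference steps (keeping the top-scoring detections, then NMS) can be sketched with greedy NMS in plain Python. The `top_k` and `iou_threshold` values in the example are illustrative only, and the detection layout `(x1, y1, x2, y2, score)` is an assumption of this sketch.

```python
def iou(a, b):
    """Jaccard overlap of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold, top_k):
    """Greedy NMS over (x1, y1, x2, y2, score) tuples.

    Keeps the top_k highest-scoring detections, then repeatedly accepts the
    best remaining one unless it overlaps an accepted box by more than
    iou_threshold.
    """
    ranked = sorted(detections, key=lambda d: d[4], reverse=True)[:top_k]
    kept = []
    for det in ranked:
        if all(iou(det[:4], k[:4]) <= iou_threshold for k in kept):
            kept.append(det)
    return kept


detections = [
    (0.0, 0.0, 10.0, 10.0, 0.9),
    (1.0, 1.0, 11.0, 11.0, 0.8),    # overlaps the first box (IoU about 0.68)
    (50.0, 50.0, 60.0, 60.0, 0.7),
]
kept = nms(detections, iou_threshold=0.4, top_k=100)
```

Here the second box is suppressed by the first, and the distant third box is kept.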

4 Result on WIDER FACE

The WIDER FACE dataset contains images and annotated face bounding boxes with a high degree of variability in scale, pose, facial expression, occlusion and lighting condition. It is split into training (), validation () and testing () subsets by randomly sampling from each scene category (totally classes), and defines three levels of difficulty: Easy, Medium and Hard, based on the detection rate of EdgeBox [57]. Following the evaluation protocol in WIDER FACE, we only train the model on the training set and test on both the validation and testing sets. To obtain the evaluation results on the testing set, we submit the detection results to the authors for evaluation.

As shown in Figure 4, we compare our method (namely ISRN) with state-of-the-art face detection methods [41, 39, 40, 47, 55, 24, 12, 35, 37, 42, 22, 50, 2, 36, 54, 33, 45, 5, 46, 16, 34, 52]. We find that our model achieves the state-of-the-art performance based on the average precision (AP) across the three evaluation metrics, especially on the Hard subset which contains a large amount of tiny faces. Specifically, it produces the best AP scores in all subsets of both validation and testing sets, i.e., (Easy), (Medium) and (Hard) for validation set, and (Easy), (Medium) and (Hard) for testing set, surpassing all approaches, which demonstrates the superiority of our face detector. We show one qualitative result of the World Largest Selfie in Figure 3. Our detector successfully finds about faces out of the reported faces.

Figure 4: Precision-recall curves on WIDER FACE validation and testing subsets. Panels: (a) Val: Easy, (b) Test: Easy, (c) Val: Medium, (d) Test: Medium, (e) Val: Hard, (f) Test: Hard.
Figure 5: The brief overview of Selective Refinement Network with segmentation branch.

5 Things We Tried That Did Not Work Well

This section lists some techniques that do not work well in our model, probably because (1) we have a strong baseline, (2) combining ideas is not trivial, (3) they are not robust enough to generalize, or (4) our implementation may be wrong. This does not mean that they are not applicable to other models or other datasets.

Decoupled Classification Refinement (DCR) [4]. It is an extra-stage classifier for classification refinement. During the training process, DCR samples hard false positives with high confidence scores from the base Faster R-CNN detector, and then trains a stronger classifier. At inference time, it simply multiplies the score from the base detector by another score from DCR to rerank detection results. Faster R-CNN with DCR gets great improvement on the MS COCO [20] and PASCAL VOC [7] datasets. Therefore, we initially try to use DCR to suppress false positives and conduct some exploratory experiments based on our SRN baseline. However, with the help of RPN proposals and ROIs, the sampling strategies of DCR for two-stage detectors are much easier to design than those for one-stage methods. The SRN face detector produces too many boxes, so we try a lot of sampling heuristics. Besides, we try several crop sizes for the training examples due to the large scale variance and numerous small faces on WIDER FACE. Considering the scale of the training set (positive and negative examples cropped from the WIDER FACE training set) and network overfitting, we also try DCR backbones of different orders of magnitude. With the setting of crop size= , DRN-22 backbone and sampling strategy (positive examples: 0.5 IOU 0.8 and negative examples: IOU 0.3), our best result is still slightly lower than our baseline detector. Further experiments about DCR need to be conducted for the face detection task on WIDER FACE.
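The inference-time reranking step of DCR (multiplying the two scores) is simple enough to sketch directly. The detection layout `(box_id, base_score)` and the function name are hypothetical and serve only to illustrate the score fusion.

```python
def rerank_with_dcr(detections, dcr_scores):
    """Fuse base-detector scores with refinement-classifier scores.

    detections: list of (box_id, base_score) pairs (hypothetical layout).
    dcr_scores: one refinement score per detection, in the same order.
    Returns (box_id, fused_score) pairs sorted by fused score, descending.
    """
    fused = [(box_id, base * refined)
             for (box_id, base), refined in zip(detections, dcr_scores)]
    return sorted(fused, key=lambda item: item[1], reverse=True)


base = [("a", 0.9), ("b", 0.8)]
reranked = rerank_with_dcr(base, dcr_scores=[0.5, 1.0])
```

Detection "a" ranks first under the base detector alone but falls behind "b" once the refinement score is multiplied in, which is how a confident false positive would be demoted.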

Segmentation Branch [53]. In DES, a segmentation branch is added to SSD; it is applied to the low-level feature map and supervised by weak bounding-box-level segmentation ground truth. The low-level feature map is reweighted by the output map of the segmentation branch. This enhancement can be regarded as an element-wise attention mechanism. As shown in Figure 5, we apply the segmentation branch to the first low-level feature map of SRN, but the final results drop a bit on all three metrics.
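As a rough illustration of the two ingredients named above, the sketch below builds a weak box-level mask on a feature-map grid and applies an attention map to a feature element-wise. The stride value, the rounding of box edges to grid cells, and the use of the target mask in place of the branch's predicted output are simplifying assumptions, not details taken from DES or from our implementation.

```python
import math


def box_level_mask(boxes, height, width, stride):
    """Binary mask on a height x width grid: 1 inside any face box, else 0.

    Boxes are (x1, y1, x2, y2) in image pixels; stride maps pixels to cells.
    """
    mask = [[0] * width for _ in range(height)]
    for x1, y1, x2, y2 in boxes:
        c1, r1 = int(x1 // stride), int(y1 // stride)
        c2, r2 = int(math.ceil(x2 / stride)), int(math.ceil(y2 / stride))
        for r in range(max(0, r1), min(height, r2)):
            for c in range(max(0, c1), min(width, c2)):
                mask[r][c] = 1
    return mask


def reweight(feature, attention):
    """Multiply every channel of a C x H x W feature by an H x W map."""
    return [[[v * a for v, a in zip(f_row, a_row)]
             for f_row, a_row in zip(channel, attention)]
            for channel in feature]


mask = box_level_mask([(4, 4, 12, 8)], height=3, width=4, stride=4)
feature = [[[2.0] * 4 for _ in range(3)]]   # one channel, 3 x 4
attended = reweight(feature, mask)          # target mask used as a stand-in
```

In a trained model the attention map would be the branch's prediction rather than the target mask, so responses outside faces would be damped rather than zeroed.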

Figure 6: Applying SE block to the detection feature.

Squeeze-and-Excitation (SE) Block [11]. It adaptively reweights channel-wise features by using global information to selectively emphasise informative features and suppress useless ones. It can be regarded as a channel-wise attention mechanism with a squeeze-and-excitation feature map, and the original feature is reweighted to generate a more representational one. As shown in Figure 6, we apply the SE block to the final detection feature map of SRN, but the final results drop , and respectively on the Easy, Medium and Hard metrics.
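The channel reweighting performed by an SE block can be written out in plain Python. This is a sketch under simplifying assumptions: the feature is a nested list of shape C x H x W, the two fully connected layers have no bias terms, and the weight matrices `w1` and `w2` are toy values rather than learned parameters.

```python
import math


def se_block(feature, w1, w2):
    """Squeeze-and-excitation on a C x H x W nested-list feature.

    squeeze:    global average pooling gives one value per channel.
    excitation: hidden = relu(w1 @ squeezed); gates = sigmoid(w2 @ hidden).
    scale:      each channel is multiplied by its gate.
    Bias terms are omitted in this sketch.
    """
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w2]
    return [[[v * g for v in row] for row in ch] for ch, g in zip(feature, gates)]


feature = [[[1.0, 3.0]], [[2.0, 2.0]]]   # C=2, H=1, W=2
w1 = [[1.0, 1.0]]                        # 2 channels -> 1 hidden unit
w2 = [[0.0], [0.0]]                      # zero weights give gates of exactly 0.5
out = se_block(feature, w1, w2)
```

With zero excitation weights every gate equals sigmoid(0) = 0.5, so the output is the input halved; learned weights would instead produce a different gate per channel.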

6 Conclusion

To further boost the performance of SRN, we exploit some existing techniques including a new data augmentation strategy, an improved backbone network, MS COCO pretraining, a decoupled classification module, a segmentation branch and the SE block. By conducting extensive experiments on the WIDER FACE dataset, we find that some of these techniques bring performance improvements, while a few of them do not adapt well to our baseline. By combining these useful techniques together, we present an improved SRN detector and obtain state-of-the-art performance on the widely used face detection benchmark WIDER FACE dataset.