Human detection serves as a key component for a wide range of real-world applications, such as advanced human-machine interactions, video surveillance or crowd analysis. In recent years, the performance of human detectors has improved rapidly with the development of deep convolutional neural networks (CNNs) [36, 39, 22].
However, crowd occlusion is a challenging problem for human detection systems. Examples are illustrated in Figure 1. Crowded scenarios, which occur frequently in real life, bring several challenges for CNN-based detectors. First, there are large variations in scales, aspect ratios, and poses in crowd scenes [33, 16], so robustness is a challenging issue. Second, when people overlap heavily with each other, the semantic features of different instances interweave and make it difficult for detectors to discriminate instance boundaries. As a result, detectors may treat the crowd as a whole, or mistakenly shift the target bounding box to another person. Finally, even when the detectors succeed in identifying different human instances in a crowd, the highly overlapped results may still be suppressed by the non-maximum suppression (NMS) post-processing. A higher NMS threshold is required to keep the crowded bounding boxes, at the expense of introducing more false positives.
A common solution to alleviate the crowd occlusion problem is to focus on instance parts [8, 21, 23, 24, 29, 37, 38]. When a full-body detector fails to recognize an occluded person, the visible parts may still yield high confidences and guide the detector to discriminate instances that are crowded together. For part-based solutions, the reliability of the part detectors is of great importance. Most previous works [8, 21, 23, 24] generate part labels by leveraging the differences between the visible-region and full-body bounding boxes of each person. These methods are usually designed for pedestrian detection, where most objects appear with similar poses and aspect ratios. However, we point out that this assumption does not hold for general human detection because of the large diversity of poses and occlusions, especially for visible-part human detection.
In this paper, we propose Double Anchor R-CNN to improve human detection in crowded scenes by detecting the body and head of each person at the same time. Compared with the human body, the head usually has a smaller scale, less overlap, and a better view in real-world images, and thus is more robust to pose variations and crowd occlusion. This is especially useful in crowded scenarios: Figure 1 shows a heavily crowded scene. The human detector is unable to discriminate instance boundaries since the parts of different instances interweave with each other, which may lead to false positives. In this case, head features may significantly help discriminate different instances, so that false positive human detections that are inconsistent with the head detections can be removed. Moreover, human detection is difficult in crowded situations due to heavy occlusion or suppression by NMS, which may lead to false negatives. The heads, however, often remain visible and overlap only slightly, which can notably help to recover heavily occluded humans.
The main contributions of this work are threefold:
We propose a double anchor region proposal network (Double Anchor RPN) to detect human heads and bodies at the same time. The head and body of each person are naturally coupled and complement each other for human detection in a crowd.
A proposal crossover strategy is developed to generate high-quality proposals for both parts as a training augmentation. In addition, features of heads and bodies are aggregated efficiently to make the final prediction more reliable. A Joint NMS algorithm is introduced to suppress false positive results in a crowd and improve the robustness of the post-processing.
State-of-the-art results are reported on various challenging human detection datasets. We have achieved a remarkable performance improvement of MR of at least 3pp on the CrowdHuman dataset, COCOPersons (crowded sub-dataset) and CrowdPose (crowded sub-dataset).
2 Related Work
2.1 General Object Detection
The advances in human detection systems have been driven by powerful baseline systems for general object detection. Modern object detection systems can be divided into two categories: one-stage detectors and two-stage detectors. Generally speaking, two-stage approaches, represented by Faster R-CNN [27], adopt a coarse-to-fine strategy and focus on achieving top performance on various benchmarks [10, 18]. In comparison, one-stage approaches aim at achieving real-time speed while maintaining comparable performance [25, 26, 20].
2.2 Human Detection
Besides detecting humans as a simple category with general detectors, many works have been proposed to handle the occlusion and scale-variation problems in human detection [14, 16, 4, 39, 35]. SA-Fast R-CNN tries to handle the scale variation problem by extending Fast R-CNN with jointly trained small-scale and large-scale networks [14]. Lin et al. propose an approach that incorporates fine-grained attention masks to extract better semantic features [16]. Zhang et al. propose an attention mechanism that focuses on visible body regions instead of learning various parts [35].
Several works have been proposed to detect humans in a crowd by leveraging part-based detectors [8, 9, 30, 21]. The part-based detectors assume that the visible parts are able to generate high-confidence predictions and reveal the occluded body. Pioneering works usually train detectors of different parts independently. Later works exploit relationships between different parts by learning various part features jointly [24, 29, 38, 37]. Most of the previous works generate part labels in a semi-supervised style by comparing the visible and full-body annotations of pedestrians [29, 38, 36]. However, this solution is hard to extend to human detection because of the huge diversity of poses and occlusions in real-world scenarios.
Special losses have also been proposed to better discriminate overlapping people in crowded scenes. Wang et al. propose a repulsion loss that makes surrounding proposals from different targets repel each other [31]. Zhang et al. design an aggregation loss that forces proposals to lie closer to the ground truth [36]. Besides, variants of NMS such as Soft-NMS [3] and Adaptive-NMS [19] have been proposed to reduce the sensitivity to the NMS threshold in crowded scenarios.
3 Double Anchor R-CNN
The framework of Double Anchor R-CNN is illustrated in Figure 2. The architecture is designed on top of the Feature Pyramid Network (FPN) [17] and can be easily extended to other frameworks such as Faster R-CNN and Mask R-CNN. The Double Anchor R-CNN framework consists of the following phases: (i) a double anchor region proposal network that generates head and body proposals in pairs, (ii) a proposal crossover module that generates high-quality training samples for the R-CNN part, (iii) an aggregation module that fuses the features of heads and bodies effectively, and (iv) a Joint NMS algorithm for post-processing. In this section, we introduce each part in turn.
3.1 Double Anchor RPN
The original region proposal network first slides a small network over the convolutional feature maps and regresses the target bounding boxes from pre-designed anchors. On top of that, Double Anchor RPN is conceptually simple: the network will regress both the head offsets and the body offsets for each human instance simultaneously from the same anchor. The method is shown in Figure 3.
It should be noted that Double Anchor RPN requires selecting one principal part for anchor matching. For example, we can set the principal part to the head. Anchors that overlap with the head ground-truths with a high intersection-over-union (IoU) are matched first. Then the network is forced to regress the attached body part based on the principal head anchors. We call this branch the head-body branch in this paper. To cover both parts better, two branches, i.e., the head-body branch and the body-head branch, are employed in the framework. Each branch sets either heads or bodies as the principal part in Double Anchor RPN. Besides, Double Anchor RPN only predicts one classification score for each anchor, since region proposals are used to distinguish the foreground from the background in a class-agnostic style. Finally, the loss function for the Double Anchor RPN module combines three terms: the cross-entropy loss for the classification of foreground and background, and two regression losses (e.g. the Smooth L1 loss) for the head bounding boxes and the body bounding boxes, respectively.
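The three loss terms described above can be sketched in plain Python as follows. This is only an illustrative sketch: the equal term weights (`head_weight`, `body_weight`), the normalisation by the number of positive anchors, and the dictionary layout of `samples` are assumptions made for the example, not details given in the text.

```python
import math


def smooth_l1(x, beta=1.0):
    """Smooth L1 penalty for a single residual x."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def binary_cross_entropy(prob, label, eps=1e-12):
    """Cross-entropy for one foreground probability and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)
    return -math.log(prob) if label == 1 else -math.log(1.0 - prob)


def box_regression_loss(pred, target):
    """Sum of Smooth L1 penalties over the four box offsets."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


def double_anchor_rpn_loss(samples, head_weight=1.0, body_weight=1.0):
    """Classification term plus head and body regression terms.

    `samples` is a hypothetical list of dicts with keys "score" and "label";
    positive samples additionally carry "head_pred", "head_target",
    "body_pred" and "body_target" (four offsets each).
    The unit weights are an assumption, not values from the text.
    """
    cls = sum(binary_cross_entropy(s["score"], s["label"]) for s in samples) / len(samples)
    positives = [s for s in samples if s["label"] == 1]
    norm = max(1, len(positives))
    # Both regression terms are computed on positive anchors only.
    head = sum(box_regression_loss(s["head_pred"], s["head_target"]) for s in positives) / norm
    body = sum(box_regression_loss(s["body_pred"], s["body_target"]) for s in positives) / norm
    return cls + head_weight * head + body_weight * body
```

Note that a single classification score is shared by the head and the body of a pair, so only one cross-entropy term appears.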
For detailed implementation, we assign positive labels for anchors when the anchor overlaps with principal part ground-truth (e.g., the head ground-truth for the head-body branch) with an IoU larger than a threshold (0.7 in our work). Only one ground-truth with the highest IoU will be assigned as the target for offset regression. For positive anchors, we calculate the regression targets for both heads and bodies based on the same anchor.
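The matching rule for the head-body branch can be sketched as below. The 0.7 threshold comes from the text; the `(dx, dy, dw, dh)` offset encoding is the usual R-CNN parameterisation and is assumed here, and the handling of negative or ignored anchors is deliberately simplified (every non-positive anchor is labelled 0). The function and key names are hypothetical.

```python
import math


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def encode(anchor, gt):
    """Offsets (dx, dy, dw, dh) of a ground-truth box relative to an anchor."""
    aw, ah = anchor[2] - anchor[0], anchor[3] - anchor[1]
    ax, ay = anchor[0] + 0.5 * aw, anchor[1] + 0.5 * ah
    gw, gh = gt[2] - gt[0], gt[3] - gt[1]
    gx, gy = gt[0] + 0.5 * gw, gt[1] + 0.5 * gh
    return ((gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah))


def match_head_body_branch(anchors, persons, pos_thresh=0.7):
    """Label anchors by their IoU with head boxes (the principal part).

    `persons` is a list of dicts with "head" and "body" boxes.
    Returns one (label, head_target, body_target) tuple per anchor.
    """
    results = []
    for anchor in anchors:
        # Only the ground truth with the highest head IoU is used as target.
        best = max(persons, key=lambda p: iou(anchor, p["head"]), default=None)
        if best is not None and iou(anchor, best["head"]) > pos_thresh:
            # Head and body targets are both encoded from the SAME anchor.
            results.append((1, encode(anchor, best["head"]), encode(anchor, best["body"])))
        else:
            results.append((0, None, None))
    return results
```

Because the body target is encoded relative to a head-sized anchor, its offsets are typically much larger than the head offsets, which is one way to see why the attached part is harder to regress.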
3.2 Proposal Crossover
Double Anchor RPN generates proposals in pairs of heads and bodies. The most confident proposal pairs are fed to the second (R-CNN) stage, where an RoI module is used to predict the final results. As mentioned in Cascade R-CNN [5], high-quality detection requires sufficient high-quality positive samples. However, as illustrated in Figure 4, we find that the quality of the attached part is not guaranteed, since the Double Anchor RPN module only considers the principal part when assigning the pair label.
A simple method to generate high-quality proposal pairs is to constrain the IoU thresholds for both parts in the pair. However, as discussed later in Section 4.3, this method does not work for Double Anchor R-CNN because too few positives have a qualified IoU for both parts. The network is then dominated by noisy proposals and ultimately cannot discriminate “good” proposals from “bad” ones.
In order to generate more qualified proposal pairs, we introduce a training augmentation strategy named Proposal Crossover, which generates adequate augmented positive training samples by exploiting the complementarity of the two branches. To be specific, we add a body-head branch as an augmentation alongside the head-body branch, as illustrated in Figure 2. First, we obtain the labels of the pairs from each branch by calculating the overlaps between the principal parts of each branch and the corresponding ground-truths. A pair is regarded as positive if the overlap is larger than a threshold (0.5 in our work). It should be noted that the principal parts are qualified here, but the attached parts are noisy since they are given the same positive labels without consideration of their own overlaps. Then we cross over the proposals between the head-body branch and the body-head branch to generate final paired proposals that are qualified for both parts. Overlaps between the attached part of the head-body branch (i.e., the body proposals) and the principal part of the body-head branch (also body proposals) are calculated. If the maximum overlap exceeds a certain threshold (0.5 in our work), the body proposal from the head-body branch is replaced by the body proposal from the body-head branch with the maximum overlap. The new proposal pairs, which consist of the original head proposals from the head-body branch and the crossover body proposals from the body-head branch, are of good quality. Finally, the crossover method generates adequate high-quality proposals for R-CNN and effectively leads to a better training procedure.
It should be noted that the proposal crossover is not needed at inference time and will not introduce extra complexity since it only serves as an effective training augmentation for the R-CNN part.
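The crossover step described above can be sketched as follows. This is a simplified illustration under stated assumptions: proposals are plain `(x1, y1, x2, y2)` tuples, each pair is a `(head, body)` tuple, and the function name `proposal_crossover` is hypothetical. Only the 0.5 threshold is taken from the text.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def proposal_crossover(head_body_pairs, body_head_pairs, thresh=0.5):
    """Swap noisy attached bodies for principal bodies from the other branch.

    head_body_pairs: (head, body) pairs whose head is the principal part.
    body_head_pairs: (head, body) pairs whose body is the principal part.
    Returns a new list of (head, body) pairs for R-CNN training.
    """
    crossed = []
    for head, body in head_body_pairs:
        best_overlap, best_body = 0.0, None
        for _, candidate in body_head_pairs:
            overlap = iou(body, candidate)
            if overlap > best_overlap:
                best_overlap, best_body = overlap, candidate
        if best_body is not None and best_overlap > thresh:
            # Keep the original head, take the better-localised body.
            crossed.append((head, best_body))
        else:
            # No sufficiently overlapping body found: keep the pair unchanged.
            crossed.append((head, body))
    return crossed
```

In this sketch the head of each pair is never modified, which mirrors the description that new pairs keep the original head proposals from the head-body branch.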
3.3 Feature Aggregation
Features of heads may significantly help discriminate instances in a crowd. Meanwhile, semantic information from the body also benefits head prediction by providing effective context. Therefore, the features of heads and bodies are aggregated in Double Anchor R-CNN.
There are different ways to aggregate the features of heads and bodies. A simple solution is to directly combine either the spatial feature maps or the fully-connected (FC) vectors. In this work, we try both methods and choose the latter to avoid misalignments between head features and body features. Moreover, the classification task usually requires more global information, while the localization task demands better spatial resolution. Therefore, we decouple the classification and localization tasks into two branches. The classification features of heads and bodies are extracted from the aggregated FC vectors, while the regression tasks for heads and bodies are performed independently on their individual feature maps.
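As a rough illustration of the FC-vector variant, the sketch below scores a head-body pair from the two FC vectors combined into one. Concatenation is assumed as the combination operator (the text only says the vectors are combined), and the tiny `linear` and `softmax` helpers as well as the weight layout are hypothetical stand-ins for the real network layers.

```python
import math


def linear(x, weight, bias):
    """Dense layer: one output per row of `weight`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_pair(head_fc, body_fc, cls_weight, cls_bias):
    """Classify a head-body pair from the aggregated FC vectors.

    Assumption: aggregation is a simple concatenation of the two vectors.
    Box regression is NOT done here; per the text it stays on the
    individual head and body features.
    """
    joint = list(head_fc) + list(body_fc)
    return softmax(linear(joint, cls_weight, cls_bias))
```

Since FC vectors carry no spatial layout, combining them avoids the head/body misalignment that arises when spatial feature maps of differently sized regions are fused.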
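Table 1: Detection results on CrowdHuman. MR-B and MR-H denote the MR for visible body and head detection; the last two columns give the improvement over the baseline in pp.

|Method|Region Feature Extraction|MR-B|MR-H|Improvement (MR-B)|Improvement (MR-H)|
|---|---|---|---|---|---|
|Shao et al. [28]|RoI Pooling|55.94|52.06|-0.58|-1.72|
|Baseline ( + RoI Align)|RoI Align|55.36|50.34|-|-|
|Baseline + Multi-task|RoI Align|54.72|-|+0.64|-|
|Repulsion Loss [31]|RoI Align|54.64|-|+0.72|-|
|Soft-NMS [3]|RoI Align|60.05|-|-7.30|-|
|DA-RCNN + J-NMS|RoI Align|51.79|49.68|+3.57|+0.66|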
3.4 Joint NMS
Non-Maximum Suppression (NMS) is an essential step for removing duplicated predictions in detection frameworks. The performance of detectors is greatly affected by the NMS threshold, especially in crowded situations. Applying a higher threshold such as 0.7 increases false positives, while a lower threshold such as 0.3 may lead to poor recall.
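The threshold sensitivity can be seen in a minimal greedy NMS, sketched here in plain Python. The box format `(x1, y1, x2, y2)` and the function names are choices made for this example.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, thresh):
    """Greedy NMS; returns the indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # A box survives only if it overlaps no already-kept box too much.
        if all(iou(boxes[i], boxes[k]) <= thresh for k in keep):
            keep.append(i)
    return keep


# Two people standing close together: their boxes have an IoU of 0.6.
boxes = [(0, 0, 100, 200), (25, 0, 125, 200)]
scores = [0.9, 0.8]
print(nms(boxes, scores, thresh=0.5))  # the second person is suppressed
print(nms(boxes, scores, thresh=0.7))  # both people are kept
```

With a threshold of 0.5 the second, correctly detected person is lost, while 0.7 keeps both but would equally keep a duplicate detection with the same overlap.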
In this work, Joint NMS is adopted to improve the robustness of the post-processing procedure for human detection in crowded scenes. One of the biggest problems of human detection in a crowd lies in the large number of false positive predictions with high confidences. Therefore, we propose to suppress false positive predictions by taking both the head parts and the body parts into consideration. To be specific, the confidences of the two parts are weighted together, and a box with lower confidence is suppressed if either its head overlap or its body overlap exceeds the threshold. The Joint NMS algorithm is formally described in Algorithm 1.
The benefits of Joint NMS can be summarized in two aspects. First, the joint score follows the idea of ensembling and is more reliable than the single score of the human body. Second, the original NMS only takes one branch into consideration, so false positives caused by the other branch are not suppressed. In contrast, Joint NMS suppresses false positives from both branches at the same time. As a result, the proposed Joint NMS is more robust to hyperparameters than the original NMS.
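The procedure described in this section can be sketched as follows. This is a reading of the prose above, not a reproduction of Algorithm 1: the convex combination of head and body scores, the `body_weight` parameter name, and the use of a single shared overlap threshold are assumptions made for illustration.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def joint_nms(pairs, thresh=0.5, body_weight=0.5):
    """Greedy NMS over head-body pairs.

    `pairs` is a list of dicts with keys "head", "body", "head_score"
    and "body_score". Returns the indices of the kept pairs, ordered by
    decreasing joint score.
    """
    # Assumed weighting: convex combination of the two part scores.
    joint = [body_weight * p["body_score"] + (1.0 - body_weight) * p["head_score"]
             for p in pairs]
    order = sorted(range(len(pairs)), key=lambda i: joint[i], reverse=True)
    keep = []
    for i in order:
        suppressed = any(
            iou(pairs[i]["head"], pairs[k]["head"]) > thresh
            or iou(pairs[i]["body"], pairs[k]["body"]) > thresh
            for k in keep
        )
        if not suppressed:
            keep.append(i)
    return keep
```

For example, a spurious body box that shares its head with an already accepted detection is removed here even when the two body boxes overlap only slightly, a case that body-only NMS would let through.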
4 Experimental Results
4.1 Datasets and Evaluation Metric
CrowdHuman Dataset. The CrowdHuman dataset [28] is a human detection benchmark aimed at evaluating detectors in crowded scenarios. Compared with other pedestrian detection datasets such as Caltech [7], KITTI [11] and CityPersons [34], the CrowdHuman dataset contains more crowded cases and the average number of persons per image is much larger. Three categories of bounding box annotations are provided: head bounding boxes, human visible-region bounding boxes and human full-body bounding boxes. Detecting the visible region is more difficult since its aspect ratios are more diverse than those of the full-body annotations. We benchmark the proposed method with the visible-region and head annotations. All models are trained on the training set and evaluated on the validation set.
COCOPersons and CrowdPose Dataset. COCOPersons and CrowdPose are both benchmark datasets for human detection. COCOPersons is the subset of MSCOCO [18] images that have ground-truth bounding boxes for “person”. According to our statistics, there are 64115 images in the “trainval minus minival” set, and the “minival” set has 2693 images for validation. CrowdPose [15] is a recent dataset that extracts crowded images containing humans from MSCOCO [18], MPII [2] and AI Challenger [32]. It should be noted that all persons labeled in COCOPersons and CrowdPose are annotated with visible-body boxes and there are no head bounding box annotations. To verify the effectiveness of our method, we annotate the head bounding boxes for the persons in these two datasets. However, these two datasets are less crowded than the CrowdHuman dataset, so we split out crowded sub-datasets from COCOPersons and CrowdPose, respectively, consisting of the images that contain at least one pair of human boxes with an IoU greater than 0.5. Visual comparisons between the normal dataset and the crowded sub-dataset of CrowdPose can be seen in Figure 5.
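The selection rule for the crowded sub-datasets (at least one pair of human boxes with an IoU greater than 0.5) can be sketched as below. The dictionary layout of `dataset`, mapping an image id to its list of person boxes, is a hypothetical format chosen for the example.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_crowded(boxes, thresh=0.5):
    """True if any two person boxes in the image overlap with IoU > thresh."""
    return any(
        iou(boxes[i], boxes[j]) > thresh
        for i in range(len(boxes))
        for j in range(i + 1, len(boxes))
    )


def crowded_subset(dataset, thresh=0.5):
    """Return the ids of images that qualify for the crowded sub-dataset."""
    return [image_id for image_id, boxes in dataset.items() if is_crowded(boxes, thresh)]
```

An image with a single person can never qualify under this rule, since no pair of boxes exists.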
Evaluation Metric. The standard log-average miss rate (MR) [7] is chosen as the main metric in our experiments; it is the official metric of the Caltech, CityPersons, and CrowdHuman datasets. The MR is computed over the false positives per image (FPPI) range of [10^-2, 10^0]. Besides, AP is also evaluated following the standard COCO evaluation metric.
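For reference, a log-average miss rate can be computed from a miss-rate-versus-FPPI curve roughly as sketched below. The nine log-spaced reference points in [10^-2, 10^0] follow the common Caltech-style convention and are an assumption here, as is the choice to count a reference point with no operating point at or below it as a miss rate of 1.0. Official evaluation toolkits differ in such edge-case conventions.

```python
import math


def log_average_miss_rate(fppi, miss_rate, lo=1e-2, hi=1e0, num_points=9):
    """Geometric mean of the miss rate sampled at log-spaced FPPI values.

    `fppi` must be sorted in ascending order, with `miss_rate[i]` being
    the miss rate at `fppi[i]`.
    """
    log_lo, log_hi = math.log10(lo), math.log10(hi)
    step = (log_hi - log_lo) / (num_points - 1)
    log_sum = 0.0
    for k in range(num_points):
        ref = 10.0 ** (log_lo + k * step)
        # Assumed convention: no point at or below `ref` counts as MR = 1.0.
        mr = 1.0
        for f, m in zip(fppi, miss_rate):
            if f <= ref:
                mr = m  # last operating point not exceeding the reference
            else:
                break
        log_sum += math.log(max(mr, 1e-10))
    return math.exp(log_sum / num_points)
```

Averaging in log space means that low miss rates at small FPPI values weigh as much as those at large FPPI values, which is why confident false positives hurt this metric so strongly.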
4.2 Implementation Details
model pre-trained on ImageNet dataset as our baseline. RoI Align  is adopted for better feature extraction. The head and visible body detection results for the baseline are obtained using two models trained for head and visible body separately. For all of CrowdHuman, COCOPersons and CrowdPose datasets, the anchor ratios for both human head and visible body detection are set to 1:2, 1:1, 2:1. Considering the various sizes of images in the dataset, the input image is re-scaled such that its shortest edge is 800 pixels, and the longest side is not beyond 1400 pixels. Synchronized SGD is adopted over 8 GPUs with a total of 16 images per minibatch and the initial learning rate is
. For the CrowdHuman and CrowdPose datasets, we train for 40 epochs in total and decrease the learning rate by a factor of 10 at epochs 20 and 30. For the COCOPersons dataset, we train for 100k iterations in total and decrease the learning rate by a factor of 10 at two intermediate points.
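The image rescaling rule above (shortest edge 800 pixels, longest side at most 1400 pixels) can be sketched as a small helper. The function name and the rounding of the output size to whole pixels are choices made for this example.

```python
def rescale_size(width, height, short_edge=800, max_long_edge=1400):
    """Return (new_width, new_height, scale) for the rescaling rule.

    The image is scaled so that its shortest edge equals `short_edge`,
    unless that would push the longest edge beyond `max_long_edge`, in
    which case the longest edge is set to `max_long_edge` instead.
    """
    scale = short_edge / min(width, height)
    if max(width, height) * scale > max_long_edge:
        scale = max_long_edge / max(width, height)
    return round(width * scale), round(height * scale), scale
```

For a 640x480 image the short edge becomes 800 and the long edge about 1067, whereas a 2000x500 panorama is limited by the long-edge cap and ends up at 1400x350.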
4.3 Detection Results on CrowdHuman
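Table 2: Comparison of proposal sampling strategies for Double Anchor R-CNN on CrowdHuman.

|Method|MR-B|MR-H|
|---|---|---|
|Baseline (our implementation)|55.36|50.34|
|DA-RPN, sample by head|79.52|48.35|
|DA-RPN, sample by both-0.4|73.37|51.05|
|DA-RPN, sample by both-0.5|72.09|53.33|
|DA-RPN + crossover|52.75|50.12|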
The detection results on CrowdHuman are shown in Table 1. FPN and FPN with RoI Align are tested with the original NMS on the head and the visible body separately. For body detection, whose performance is denoted by “MR-B”, DA-RCNN achieves an improvement of 3.06pp over the baseline. To further demonstrate that the improvement stems mainly from our method rather than from collecting more annotations for the head boxes, we compare DA-RCNN with multi-task learning, which detects heads and bodies as a multi-category task. DA-RCNN achieves an improvement of 2.42pp over multi-task learning. Moreover, Joint NMS brings an extra gain of 0.49pp for human body detection on top of DA-RCNN, while the results of Soft-NMS are not encouraging. We argue that Soft-NMS retains many long-tail detection results to improve recall at the expense of more false positives, which has a negative impact on human detection, especially in terms of MR. It is worth noting that DA-RCNN with Joint NMS surpasses the state-of-the-art method using Repulsion Loss on the CrowdHuman dataset for human body detection, which indicates the effectiveness of our method for detecting humans in crowded scenes. Besides, the performance of head detection is improved by 0.36pp, benefiting from the context information provided by the human body. Example results from our method are visualized in Figure 6.
Ablation Study on Proposal Crossover.
We evaluate different proposal selection strategies for Double Anchor R-CNN. The results are shown in Table 2. The naive implementation samples positive proposals according to the head parts only (termed “DA-RPN, sample by head”). This method brings an improvement of 1.99pp on MR for heads over the baseline (from 50.34 to 48.35 in Table 2), which indicates that modeling the relationship between a head and its corresponding body is beneficial to head detection. However, as discussed in Section 3.2, sampling proposals by head parts sacrifices the performance of human body detection since the body proposals are noisy.
Then the sampling strategy switches to an updated version which takes both head and body proposals into account. The method is represented as “sample by both-x” in Table 2 and “x” stands for the positive overlap threshold for person boxes. Obviously, the MR of body (MR-B) is significantly improved compared to the naive sampling strategy. Note that with the increasing overlap threshold, the result for visible body detection is better while the result is worse for head detection. This indicates the trade-off between the number of noisy samples for visible body and the decrease in the number of positive proposals. However, the human detection result is still much worse than the baseline results since the reduction in the number of qualified proposals is very harmful to the detection performance.
Finally, we adopt a proposal crossover module to improve the quantity and quality of paired proposals. Shown as “+crossover” in Table 2, the proposal crossover module brings a significant improvement for body detection, reaching an MR-B of 52.75. The improvement comes from the increased number of qualified proposal pairs provided by the proposal crossover module. To verify this assumption, we count the number of qualified proposal pairs during training. There are only 40 positive pairs per image on average if proposals are sampled by requiring an IoU threshold of 0.5 for both the body and head parts. In contrast, the average number of positive proposal pairs after the crossover strategy increases to 97 per image. This indicates that more qualified proposals are beneficial to detection performance.
Ablation Study on Feature Aggregation.
As discussed in Section 3.3, we adopt the FC vector aggregation module in our work. The results are shown in Table 3. Compared with the baseline framework without the feature aggregation module, fusing FC vectors improves both MR-B and MR-H. The results demonstrate the effectiveness of feature aggregation. Besides, compared with aggregating FC vectors, fusing spatial feature maps leads to a drop on MR-H because of the misalignment between head and body features.
Ablation Study on Joint NMS.
To validate Joint NMS, we compare it with the original NMS on the human body detection task in Table 4. The threshold of the original NMS is set to 0.5 for simplicity. For Joint NMS, the weighting factor is a hyper-parameter that balances the head scores and the visible body scores. Different values of the weighting factor are evaluated, and the result of visible body detection becomes better as the weight of the body score increases. We also find that the result is not sensitive to this factor. Moreover, to validate the effectiveness of suppressing false positives, we compare the results under “FPPI over recall” in Table 5. The proposed method reduces false positives under almost all recall settings.
4.4 Results on COCOPersons and CrowdPose
To investigate the generalization capacity of the proposed method, experimental results on COCOPersons and CrowdPose are reported in Table 6. The proposed Double Anchor R-CNN with Joint NMS improves MR by 1.39pp and 1.59pp on the whole validation sets of COCOPersons and CrowdPose, respectively. Compared with CrowdHuman, the COCOPersons and CrowdPose datasets are less crowded. We therefore split out a crowded sub-dataset consisting of the images that contain at least one pair of human boxes with an IoU greater than 0.5. On the crowded sub-datasets, our method achieves a boost of 3.82pp on MR and 1.28 points on AP for COCOPersons, and gains of 4.24pp on MR and 3.78 points on AP for CrowdPose. The results demonstrate that the proposed framework is also suitable for regular, challenging human detection datasets and is more effective in crowded scenarios.
5 Conclusion
We propose Double Anchor R-CNN for human detection in crowded scenes. The framework is intuitive and effective for handling the crowd occlusion problem by naturally coupling the head and body of each person. Through a variety of experiments on challenging human detection datasets, Double Anchor R-CNN is shown to improve performance and to achieve state-of-the-art results. Our approach is also extensible and can be easily generalized to detect other parts, for example, detecting the head, face and body of each person with a triple anchor R-CNN. We hope the proposed method provides insights for future work on human detection and human-object interactions.
References
[1] (2014) Socially-aware large-scale crowd forecasting. In CVPR.
[2] 2D human pose estimation: new benchmark and state of the art analysis. In CVPR.
[3] Soft-NMS – improving object detection with one line of code. In The IEEE International Conference on Computer Vision (ICCV).
[4] (2016) A unified multi-scale deep convolutional neural network for fast object detection. In ECCV.
[5] (2018) Cascade R-CNN: delving into high quality object detection. In CVPR.
[6] (2009) ImageNet: a large-scale hierarchical image database. In CVPR.
[7] (2012) Pedestrian detection: an evaluation of the state of the art. T-PAMI 34 (4), pp. 743–761.
[8] (2010) A structural filter approach to human detection. In ECCV.
[9] (2010) Multi-cue pedestrian classification with partial occlusion handling. In CVPR.
[10] (2010) The PASCAL visual object classes (VOC) challenge. IJCV 88 (2), pp. 303–338.
[11] (2012) Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR.
[12] (2017) Mask R-CNN. In ICCV.
[13] (2016) Deep residual learning for image recognition. In CVPR.
[14] (2017) Scale-aware Fast R-CNN for pedestrian detection. T-MM 20 (4), pp. 985–996.
[15] CrowdPose: efficient crowded scenes pose estimation and a new benchmark. In CVPR.
[16] Graininess-aware deep feature learning for pedestrian detection. In ECCV.
[17] (2017) Feature pyramid networks for object detection. In CVPR.
[18] (2014) Microsoft COCO: common objects in context. In ECCV.
[19] (2019) Adaptive NMS: refining pedestrian detection in a crowd. In CVPR.
[20] (2016) SSD: single shot multibox detector. In ECCV.
[21] Handling occlusions with franken-classifiers. In ICCV.
[22] (2018) Improving occlusion and hard negative handling for single-stage pedestrian detectors. In CVPR.
[23] (2012) A discriminative deep model for pedestrian detection with occlusion handling. In CVPR.
[24] Joint deep learning for pedestrian detection. In ICCV.
[25] (2016) You only look once: unified, real-time object detection. In CVPR.
[26] (2017) YOLO9000: better, faster, stronger. In CVPR.
[27] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NIPS.
[28] (2018) CrowdHuman: a benchmark for detecting human in a crowd. arXiv:1805.00123.
[29] (2016) Deep learning strong parts for pedestrian detection. In ICCV.
[30] (2012) A discriminative deep model for pedestrian detection with occlusion handling. In CVPR.
[31] (2018) Repulsion loss: detecting pedestrians in a crowd. In CVPR.
[32] (2017) AI Challenger: a large-scale dataset for going deeper in image understanding. arXiv:1711.06475.
[33] (2016) How far are we from solving pedestrian detection? In CVPR.
[34] (2017) CityPersons: a diverse dataset for pedestrian detection. In CVPR.
[35] (2018) Occluded pedestrian detection through guided attention in CNNs. In CVPR.
[36] (2018) Occlusion-aware R-CNN: detecting pedestrians in a crowd. In ECCV.
[37] (2016) Learning to integrate occlusion-specific detectors for heavily occluded pedestrian detection. In ACCV.
[38] (2017) Multi-label learning of part detectors for heavily occluded pedestrian detection. In ICCV.
[39] (2018) Bi-box regression for pedestrian detection and occlusion estimation. In ECCV.