Pedestrian detection has attracted increasing attention in recent years since it plays an essential role in applications such as autonomous cars, smart surveillance and robotics. Despite the technical progress from models relying on hand-crafted features to deep learning networks, pedestrian detection remains a challenging problem due to complicated real-world scenarios, especially crowd scenes.
Pedestrian detection can inherit many successful techniques from generic object detection frameworks. In contrast to conventional anchor-based detection approaches such as Faster RCNN and SSD, emerging anchor-free detection methods such as CornerNet and CenterNet are more flexible in design and achieve more promising results. Our CSID detector is based on the anchor-free object detection framework.
However, pedestrian detection in the real world has its own characteristics and challenges, such as vastly different scales, poor appearance conditions, and extremely challenging occlusion in crowd scenarios. For example, as pointed out in , annotated pedestrians are occluded by other pedestrians in the CityPersons dataset . The level of crowd occlusion is even higher in the CrowdHuman dataset . Our goal is to address the problem of detecting pedestrians in challenging crowd scenarios.
Among the various pedestrian detection approaches, the recently proposed CSP (Center and Scale Prediction) is a promising anchor-free detector that predicts both the center and the scale of each pedestrian. Although it addresses the challenge of diverse scales, it does not explicitly tackle crowd occlusion, which remains a great challenge. In particular, like other generic detectors, CSP adopts standard greedy Non-Maximum Suppression (NMS), a necessary post-processing step that refines the final detections by significantly reducing false positives. However, when detecting pedestrians in a crowd, greedy NMS (typically with a single filtering threshold) faces a dilemma: a lower threshold can miss highly overlapped objects, while a higher one often brings in more false positives, as noted in prior work.
To address the crowd occlusion challenge, we propose a new Center, Scale, Identity-and-Density-aware (CSID) pedestrian detector together with a novel Identity-and-Density-aware NMS (ID-NMS) algorithm to improve the state-of-the-art CSP anchor-free detector. Our main contributions include: (i) a novel Identity Density Map (ID-Map) that converts each positive instance into a feature vector encoding both identity and density information simultaneously; (ii) a modified optimization target that enables the definition of the ID-loss and eases the extreme class imbalance during training; (iii) a novel ID-NMS algorithm that uses the identity and density information of each predicted box, provided by the ID-Map, to effectively refine the detection results; and (iv) evaluations of the new CSID detector and our novel ID-NMS on two benchmark datasets (CityPersons and CrowdHuman) with new state-of-the-art results.
2 Related Work
Pedestrian detection is a special case of generic object detection. Our work is also closely related to Non-maximum suppression (NMS).
Early object detectors are based on hand-crafted features and classifiers, finding objects with the sliding-window paradigm or region proposals. With the development of deep learning, anchor-based single-stage detectors (SSD, YOLO) and two-stage detectors (Fast-RCNN, Faster-RCNN) have dominated in past years. Anchor-free detectors such as CornerNet and CenterNet bypass the requirement of anchor boxes, have a simpler structure and have demonstrated superior performance. Our CSID detector essentially belongs to the anchor-free detection framework.
Pedestrian Detection Recently, CNN-based detectors have dominated the field of pedestrian detection, especially the Faster RCNN framework. In one line of work, an RPN generates proposals and provides CNN features, followed by a boosted decision forest. A variant of Faster RCNN detects pedestrians on multi-scale feature maps to match objects of different scales, and plain Faster-RCNN has also been adapted for pedestrian detection. A framework with extra features has been presented to further improve performance. Another approach detects an object by predicting its top and bottom vertexes. ALF presents the asymptotic localization fitting strategy to gradually refine the localization results. CSP suggests an anchor-free framework that detects center and scale in pedestrian detection. Many efforts have been made to deal with occlusion. Part-based models [16, 30, 31] are commonly used with further fusion mechanisms for occlusion handling. A more recent trend is to deal with pedestrian detection in a crowd. The CityPersons and CrowdHuman datasets were collected especially for crowd scenarios. Attention mechanisms have been used to represent occlusion patterns. RepLoss and OR-CNN design two novel regression losses that generate more compact boxes to tackle occluded pedestrian detection in crowded scenes.
Non-Maximum Suppression NMS is a critical post-processing step for object detectors. Greedy-NMS tends to wrongly suppress objects in crowd scenarios. Recently, instead of discarding predicted boxes during suppression, Soft-NMS decreases the scores of neighbors by an increasing function of their overlap with the higher-scored bounding box. A deep neural network has also been trained to perform the NMS function using the predicted boxes and their corresponding scores, and a relation module has been used to learn the NMS function. [24, 9] learn extra localization confidences to guide a better NMS. [8, 17] estimate a crowd density map for the people counting task. A quadratic unconstrained binary optimization solution has been proposed to suppress detection boxes. Adaptive-NMS estimates the density of predicted boxes, which enables an adaptive threshold in the NMS algorithm.
3 CSID Pedestrian Detector
In this section, we first give the overall architecture of the CSID detector. Then we describe each component of CSID, including the optimization target design, the ID-NMS algorithm, and the ID-Map together with the ID-loss used to train it. Finally, we present the objective function.
3.1 Overall Architecture
Our proposed CSID approach is based on the state-of-the-art anchor-free pedestrian detector CSP. The overall architecture of CSID is illustrated in Fig. 1. We use a modified DLA-34 as the backbone CNN and an anchor-free detector as our framework.
Specifically, we hierarchically incorporate information from feature maps at several strides, and the final detection is performed on a single fused, down-sampled feature map. A detection head is appended to the fused feature map to parse it into detection results. We attach one convolution layer with 256 channels before each output head. A final convolution produces the desired output. Instead of using the typical optimization target design (ground truth center map, scale map and offset map) shown in Fig. 2, we present an alternative design together with a newly designed ID-Map. The detection head thus consists of four branches in total; the offset map is omitted from Fig. 1 for simplicity. Our new optimization target design, described in the next section, is proposed for defining the ID-loss and for easing the extreme class imbalance during training. The ID-Map converts each positive instance into an embedding vector that encodes both identity and density information. As shown in the top right of Fig. 1, the embedding vector has two properties: its length denotes the density, while the angle between two vectors indicates the similarity of two predicted boxes (whether they belong to the same identity). The post-processing step, ID-NMS, refines the predicted boxes by considering the identity and density information provided by the ID-Map. All these components form our proposed CSID detector, and we give the details of each part in the following.
3.2 Optimization Target in CSID
In CSP , the detection head consists of center map, scale map and offset map. Each map is with the same size as fused feature maps (i.e. ). Given the bounding box annotations, the optimization targets (ground truth maps) are generated as follows. For clarity, we denote and as the center and height of the -th ground truth bounding box, respectively. For center target, the location where an object’s center point falls is assigned as positive while all others are negatives. For scale target, the -th positive location is assigned with the value of corresponding to the -th object. Besides, is also assigned to the negatives with a radius 2 of the positives. An offset branch sibling to the center and scale branch is attached to compensate the precision loss due to the down-sampling rate for defining center points. The offset targets of those centers can be defined as .
In the optimization target design of our proposed CSID detector, for each ground truth bounding box we define the four corners (pixel locations) surrounding its real center as positives and all other locations as negatives. The scale and offset targets are modified accordingly. Specifically, the scale value of the object is assigned to its 2×2 positive region. In addition, the same value is assigned to the negatives within a one-pixel extension of the positive region, giving a 4×4 region. As for the offset target, we estimate the offset of each corner (top-left, top-right, bottom-left and bottom-right) towards the real center. Fig. 2 illustrates our design.
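The corner-based target assignment described above can be illustrated with a small sketch. It is only one reading of the text: the function names, the `stride` parameter (standing in for the down-sampling rate) and the sign convention of the offsets (real center minus corner location) are assumptions made for illustration, not the authors' implementation.

```python
import math

# Corner order: (dx, dy) steps from the top-left corner of the 2x2 positive region.
CORNERS = {
    "top_left": (0, 0),
    "top_right": (1, 0),
    "bottom_left": (0, 1),
    "bottom_right": (1, 1),
}

def corner_targets(center_x, center_y, stride):
    """Return, for each of the four corners around the real center,
    its integer map location and its offset towards the real center.

    center_x, center_y: box center in input-image pixels.
    stride: hypothetical down-sampling rate of the output map.
    """
    fx, fy = center_x / stride, center_y / stride   # real center on the output map
    x0, y0 = math.floor(fx), math.floor(fy)         # top-left corner of the cell
    targets = {}
    for name, (dx, dy) in CORNERS.items():
        x, y = x0 + dx, y0 + dy
        # Offset points from the corner to the real center (assumed convention).
        targets[name] = ((x, y), (fx - x, fy - y))
    return targets

def scale_region(x0, y0):
    """Locations receiving the scale value: the 2x2 positives plus a
    one-pixel extension on every side, i.e. a 4x4 region."""
    return [(x, y) for y in range(y0 - 1, y0 + 3) for x in range(x0 - 1, x0 + 3)]
```

For a center at (10, 6) with `stride=4`, the real center on the map is (2.5, 1.5), so the top-left corner (2, 1) gets offset (0.5, 0.5) and the bottom-right corner (3, 2) gets offset (-0.5, -0.5): each corner regresses in a different direction towards the same point.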
Our optimization target design of CSID enjoys some key advantages as follows.
There is no apparent boundary between corners. In a typical design, a trained detector may predict the other three corners as the center as well. However, the offset direction is always bottom-right, which makes the localization less precise. Our design regresses offsets in all directions towards the real center, making the localization more precise.
By setting a 2×2 region as positives, we make the number of positive locations four times larger than in the typical design, which eases the extreme class imbalance during training to some extent.
Our design assigns more than one positive location to each ground truth bounding box, which enables us to design our ID-loss. More details are described in ID-Map section.
3.3 ID-NMS Algorithm
NMS is a critical post-processing step for object detectors. In this section, we first revisit the greedy-NMS and adaptive-NMS algorithms and discuss their limitations, and then propose the ID-NMS algorithm. Note that ID-NMS is the key component of our proposed CSID detector.
3.3.1 Greedy-NMS and Adaptive-NMS Revisited
Given a set of predicted bounding boxes $B$ with corresponding confidence scores $S$, greedy-NMS first selects the box $M$ with the maximum score and removes all boxes in $B$ whose overlap with $M$ is larger than the threshold $N_t$. It then selects the next highest-scoring box among the remaining ones, and repeats until no boxes are left. The suppressing step is as follows.
$$ s_i = \begin{cases} s_i, & \mathrm{iou}(M, b_i) \le N_t \\ 0, & \mathrm{iou}(M, b_i) > N_t \end{cases} $$
where $M$ is the bounding box with the highest score in $B$ currently and $s_i$ is the confidence score of a box $b_i$. iou represents the intersection-over-union score. $N_t$ is a constant threshold. Greedy-NMS suffers from the dilemma that a lower threshold leads to missing highly overlapped objects while a higher one brings in more false positives.
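As a reference point for the discussion that follows, greedy-NMS can be sketched in a few lines of NumPy. The function names and the `(x1, y1, x2, y2)` box layout are choices made for this illustration only.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def greedy_nms(boxes, scores, threshold=0.5):
    """Return the indices of the boxes kept by greedy-NMS."""
    order = np.argsort(-scores)          # highest score first
    keep = []
    while order.size > 0:
        m, rest = order[0], order[1:]    # m: current highest-scoring box
        keep.append(int(m))
        overlaps = iou(boxes[m], boxes[rest])
        # Remove every remaining box whose overlap with m exceeds the threshold.
        order = rest[overlaps <= threshold]
    return keep
```

With two heavily overlapping boxes (IoU about 0.68) and one distant box, a threshold of 0.5 keeps only one of the overlapping pair, while a threshold of 0.7 keeps both. This is the single-threshold dilemma described above: no fixed value is right for both sparse and crowded regions.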
To address this issue, adaptive-NMS adaptively adjusts the threshold by predicting the density of each predicted box. The ground truth density of each box is defined as its maximum iou with the other ground truth boxes in the image:
$$ d_i = \max_{b_j \in G,\, j \ne i} \mathrm{iou}(b_i, b_j) $$
where $G$ is the ground truth set. The suppressing step of adaptive-NMS can be described as below.
where is the density of predicted box . denotes the adaptive NMS threshold for . The adaptive threshold is determined by predicted box only.
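The ground truth density defined above (maximum IoU with any other ground truth box) is straightforward to compute. The sketch below also includes an adaptive threshold of the form max(base threshold, density of the selected box); that rule is our understanding of the adaptive-NMS paper and should be treated as an assumption here, as should all function and parameter names.

```python
import numpy as np

def pairwise_iou(boxes):
    """IoU matrix for an (N, 4) array of boxes given as (x1, y1, x2, y2)."""
    boxes = np.asarray(boxes, dtype=float)
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (areas[:, None] + areas[None, :] - inter)

def ground_truth_density(gt_boxes):
    """Density of each ground truth box: its maximum IoU with the other boxes."""
    ious = pairwise_iou(gt_boxes)
    np.fill_diagonal(ious, 0.0)          # ignore each box's IoU with itself
    return ious.max(axis=1)

def adaptive_threshold(density_m, base_threshold=0.5):
    """Assumed adaptive-NMS rule: never go below the base threshold,
    and raise the threshold when the selected box sits in a dense region."""
    return max(base_threshold, density_m)
```

Note that the threshold depends only on the density of the selected box, so two predictions of the same pedestrian in a dense region are treated exactly like two predictions of different pedestrians.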
The purpose of NMS algorithm is to suppress the predicted boxes that belong to the same identity while preserving those from different identities.
We argue that neither greedy-NMS nor adaptive-NMS meets these two conditions simultaneously. Greedy-NMS sets the threshold blindly. Adaptive-NMS may preserve more boxes in crowd scenes, but it also preserves more boxes belonging to the same identity, leading to more false positives. As shown in Fig. 3a, greedy-NMS runs into trouble when different thresholds are needed under different conditions. Adaptive-NMS also has problems in highly overlapped scenes, as shown in Fig. 3b.
In the CSID detector, we propose a more effective post-processing step called ID-NMS that takes both the identity and density information of each box into account. The identity and density are predicted by CSID's newly designed ID-branch, which is described in the next section. The suppressing step of ID-NMS is defined as follows.
Here the threshold depends on a distance function between two boxes and on the density of the selected box; both are described in the ID-Map section. Note that the threshold is therefore a function not only of the currently selected (highest-scoring) box but also of the box being compared against it. Our threshold function and suppression strategy have the following properties. (i) The threshold is high if and only if the selected box has a high density and the two compared boxes belong to different identities (their distance is large). This design thus satisfies the two conditions mentioned before simultaneously. (ii) When the selected box is not in a crowded scene, the threshold is low since its density is small. (iii) If the selected box lies in a crowded region (its density is high), we take the distance between the two boxes into account: if they belong to the same identity (their distance is small), the threshold is low, so the neighboring box is suppressed to reduce false positives. (iv) Otherwise, the threshold is high, so neighboring boxes that belong to different identities are preserved. As shown in Fig. 3c, our ID-NMS can address the issue in a crowd.
The ID-NMS algorithm is formally described in Fig. 4. The remaining problem is how our CSID detector can predict the identity and density of each box so that ID-NMS can be applied.
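To make the control flow concrete, the following is a minimal sketch of a pairwise-threshold NMS loop in the spirit of ID-NMS. It is not the algorithm of Fig. 4: the threshold function is passed in as a callable, and `example_threshold` is a hypothetical placeholder chosen only because it satisfies properties (i) to (iv) qualitatively. The density is taken as the embedding length and the identity distance as the Euclidean distance between normalized embeddings, as described for the ID-Map.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def example_threshold(density, distance, base=0.5, min_distance=1.0):
    """Hypothetical placeholder, NOT the paper's threshold function:
    raise the threshold only for a dense box compared with a different identity."""
    if distance > min_distance:
        return max(base, density)
    return base

def id_nms_sketch(boxes, scores, embeddings, threshold_fn=example_threshold):
    """NMS with a per-pair threshold derived from density and identity distance."""
    lengths = np.linalg.norm(embeddings, axis=1)              # density per box
    unit = embeddings / np.maximum(lengths, 1e-12)[:, None]   # identity direction
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        m, rest = order[0], order[1:]
        keep.append(int(m))
        overlaps = iou(boxes[m], boxes[rest])
        distances = np.linalg.norm(unit[rest] - unit[m], axis=1)
        thresholds = np.array([threshold_fn(lengths[m], d) for d in distances])
        order = rest[overlaps <= thresholds]   # threshold differs per compared box
    return keep
```

In a toy case with three mutually overlapping boxes, where the second has a different embedding direction from the first and the third has the same direction, the sketch keeps the first two and suppresses the third, whereas a single threshold of 0.5 would suppress both neighbors.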
3.4 ID-Map
Our CSID detector contains a newly designed ID-Map. Its purpose is to represent the identity and density information of each ground truth annotation simultaneously, so that it can be used in our ID-NMS. In this section, we give the definition of the ID-Map as well as the loss function used to train the ID-branch.
A natural way to predict density is to define the density of each box as in adaptive-NMS. Specifically, in our CSID, the positive locations of each object are assigned that object's density. As for the identity information of each box, since there is no ground truth label, we borrow the idea of the associative embedding technique, in which the distance between two boxes is defined as the Euclidean distance in the embedding space. In this way, an identity map can be attached as a sibling of the center map, and this branch can be trained with a push-pull loss. However, this raises a question.
Extra single or dual branches? According to the above analysis, one can either design two separate branches for identity and density representation or try to design a single branch. Which one is better? We argue for a single branch: not only is it a more elegant formulation, but we also find that as more branches are attached, the CSID detector becomes harder to train, which deteriorates detection performance. We conduct an ablation study in the Experiment section.
Paradox in single value embedding. However, we cannot train a single branch that embeds each point into a single value and satisfies the identity and density requirements simultaneously. The paradox is as follows. Suppose two identities have their largest iou with each other. CSID then targets the same density value for both, so their embeddings are at distance zero. However, identity prediction requires the two to be a large distance apart.
ID-Map We show that it is still possible to satisfy these two properties with our specially designed ID-Map and ID-loss in a single branch. The solution of CSID is to embed each point into a vector of length .
Suppose the embedding vector of the -th object is . Note that each object has four positive locations, as defined in CSID's optimization target design. Our ID-loss is defined as follows.
Push-pull loss is defined on the normalized embedding vector . is the mean of four embedding vectors from the same object. We set to be and to be in all our experiments. Similar to the offset loss, the loss is only applied at positive locations. We enjoy the benefit of CSID’s optimization target design that each ground truth center corresponds to points thus enabling defining the pull loss. The density loss regresses the length of to . The push loss aims at pushing away embedding vectors from different objects while the pull loss tries to attract embedding vectors that belong to the same object.
Next, we show why our design works. Taking an embedding length of as an example, the embedding vector will fall in the sphere of radius . In our design, the density is defined as the length of the vector, while identity is implicitly trained by restricting the Euclidean distance between normalized embedding vectors (the surface points obtained by extending each embedding vector to the sphere). The density loss is defined on the length of each vector, while the identity loss is defined on the distance between paired vectors. As shown in Fig. 5, two embedding vectors with similar density (similar lengths) can still have a large distance between their normalized vectors. In our experiment, we set .
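The way a single vector carries both quantities can be checked numerically. The sketch below decodes density as the vector length and identity distance as the Euclidean distance between normalized vectors, following the description above; the function names and the example vectors are illustrative assumptions.

```python
import numpy as np

def density(embedding):
    """Density encoded as the length of the embedding vector."""
    return float(np.linalg.norm(embedding))

def identity_distance(a, b):
    """Euclidean distance between the two embeddings after normalization,
    i.e. between their projections onto the unit sphere."""
    ua = np.asarray(a, dtype=float) / np.linalg.norm(a)
    ub = np.asarray(b, dtype=float) / np.linalg.norm(b)
    return float(np.linalg.norm(ua - ub))

# Two objects that occlude each other: same density, different identity.
person_a = np.array([0.6, 0.0])
person_b = np.array([0.0, 0.6])
# A second prediction of person A with a different predicted density.
person_a_again = np.array([0.3, 0.0])

same_density = abs(density(person_a) - density(person_b)) < 1e-9
far_apart = identity_distance(person_a, person_b) > 1.0
same_identity = identity_distance(person_a, person_a_again) < 1e-9
```

Here `person_a` and `person_b` have the same length but orthogonal directions, so they share a density while remaining far apart in identity. A single scalar per point cannot express both facts, which is the paradox the vector embedding resolves.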
At the end of this section, we give the definition of the function and as follows.
3.5 Objective Function
In this section, we summarize the objective function used to train the CSID detector. Except for the ID-Map, we keep the losses defined on the center, scale and offset maps as in CSP. The overall objective function is as follows.
where the four loss weights are experimentally set to 0.01, 1, 0.03 and 0.01, respectively.
We use DLA-34 as the backbone network, pre-trained on the ImageNet dataset, with some modifications. Specifically, following CenterNet, we augment the skip connections from lower layers to the output with deformable convolution. Besides, we replace the last down-sampling convolution layer with a dilated convolution layer with stride 1 and dilation 2 to suit the specific task of pedestrian detection. As for the detection head, a convolution layer with 256 channels is added before each output head, and a final convolution layer produces the desired output.
4.2 Experiment Settings
To evaluate the efficacy of the CSID pedestrian detector, we conduct experiments on challenging crowd pedestrian benchmarks, including CityPersons and CrowdHuman.
Datasets CityPersons is a pedestrian detection dataset built on top of the semantic segmentation dataset CityScapes. It is a challenging dataset with various occlusion levels. We train the model on the official training set with 2,975 images and test on the validation set with 500 images. CrowdHuman has recently been released to specifically target the crowd issue in the human detection task. It collects 15,000, 4,370, and 5,000 images from the Internet for training, validation, and testing, respectively. We train on the training set and test on the validation set, and only the full-body region annotations are used for training and evaluation. On both datasets, we follow the standard evaluation metric, the log-average Miss Rate over a range of False Positives Per Image (FPPI).
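For readers unfamiliar with the metric, a log-average miss rate can be computed from a miss-rate/FPPI curve roughly as follows. The FPPI range is not recoverable from the text above, so the defaults in this sketch (nine reference points, log-spaced between 0.01 and 1) follow the common Caltech-benchmark convention and are an assumption; the function name is also illustrative.

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate, low=1e-2, high=1.0, num_points=9):
    """Geometric mean of the miss rate sampled at log-spaced FPPI values.

    fppi, miss_rate: curve points, with fppi sorted in ascending order.
    low, high, num_points: assumed sampling convention (see lead-in).
    """
    fppi = np.asarray(fppi, dtype=float)
    miss_rate = np.asarray(miss_rate, dtype=float)
    refs = np.logspace(np.log10(low), np.log10(high), num_points)
    samples = []
    for ref in refs:
        idx = np.where(fppi <= ref)[0]
        # Use the miss rate at the largest FPPI not exceeding the reference;
        # if the curve has not started yet, count it as a miss rate of 1.
        samples.append(miss_rate[idx[-1]] if idx.size else 1.0)
    samples = np.maximum(samples, 1e-10)   # guard against log(0)
    return float(np.exp(np.mean(np.log(samples))))
```

A lower value is better. Because the average is taken in log space, improvements at low FPPI (few false positives per image) count as much as improvements at high FPPI.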
Our method is implemented in PyTorch. For a fair comparison, most of the optimization settings are kept the same as in CSP, including the data augmentation, the Adam optimizer and moving-average weights. The training inputs are and for CityPersons and CrowdHuman, respectively. For CityPersons, we optimize the network on 2 GPUs (GTX 1080Ti) with 4 images per GPU in a mini-batch; the learning rate is set as and training is stopped after iterations. For CrowdHuman, a mini-batch contains 16 images on 4 GPUs (GTX 1080Ti); the learning rate is set as and training is stopped after iterations.
4.3 Ablation study
In this section, we conduct an ablation study of the CSID detector on the CityPersons dataset.
How does the backbone affect the performance? As shown in Table 1, where the only difference is the backbone, our modified DLA-34 structure in CSID improves the performance compared with the commonly adopted ResNet-50 backbone. The benefit comes from the network design, which hierarchically incorporates semantics from deeper layers and makes the feature map more representative. Despite the higher performance, our CSID also runs faster than with the ResNet-50 backbone.
How does the optimization target design affect the performance? Quantitatively, CSID's optimization target design outperforms the typical design (see Table 1). The performance gain arises because our center map design assigns more pixels to positives, which eases the extreme class imbalance during training, while the offset map design makes the localization more precise.
How do the ID-Map and ID-NMS affect the performance? We compare our ID-Map together with the ID-NMS algorithm against the following settings: a density map together with density-aware NMS, which follows the strategy of adaptive-NMS; an identity map with identity-aware NMS; and two separate maps (a density map and an identity map) together with ID-NMS. The threshold of identity-aware NMS is defined as follows.
are set as and , respectively.
As shown in Table 1, our ID-NMS outperforms the default setting with greedy-NMS. It also outperforms both density-aware NMS and identity-aware NMS. With two separate branches, ID-NMS is still effective and outperforms the default setting with greedy-NMS. However, our single ID-Map design in CSID outperforms the two separate branches; we argue that more detection heads deteriorate the detection performance, which validates our design for the CSID detector.
4.4 Comparison to the State-of-the-art
CityPersons. The proposed CSID detector is extensively compared with the state of the art in four settings: Reasonable, Bare, Partial and Heavy. As shown in Table 2, our CSID detector outperforms all previous methods by a large margin. Specifically, on the Reasonable setting we outperform the best competitor, Adaptive-NMS. Our method also performs consistently better in the other three settings (Heavy, Partial and Bare). Besides, it even runs faster than the recent state-of-the-art methods ALF and CSP.
CrowdHuman. We compare the CSID detector with the most recent work, Adaptive-NMS, which provides results for both single-stage and two-stage anchor-based methods. We estimate both height and width for CrowdHuman since the aspect ratio is not fixed in this dataset. As shown in Table 3, our CSID detector without ID-NMS beats the competitor by a clear margin; even compared with the two-stage FPN, it still performs considerably better. With the ID-NMS algorithm, we further improve our result. This demonstrates the superiority of the components of the CSID detector.
In this paper, we propose the CSID detector for pedestrian detection in a crowd. An ID-Map is employed to encode both the identity and density information of each predicted box simultaneously. Moreover, an alternative optimization target is designed to define the ID-loss and ease the extreme class imbalance during training. More importantly, a novel ID-NMS algorithm is proposed to refine the bounding boxes more effectively in crowd scenes, using the identity and density information provided by the ID-Map. Finally, we conduct extensive experiments to demonstrate the efficacy of our CSID detector. As a result, the CSID detector outperforms state-of-the-art methods by a large margin on both the CityPersons and CrowdHuman datasets for pedestrian detection. In future work, we plan to extend our approach to instance segmentation in crowds.
References
Soft-NMS – improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5561–5569.
A unified multi-scale deep convolutional neural network for fast object detection. In European Conference on Computer Vision, pp. 354–370.
(2014) Fast feature pyramids for object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (8), pp. 1532–1545.
(2009) Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (9), pp. 1627–1645.
(2015) Fast R-CNN. In 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 1440–1448.
Learning non-maximum suppression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4507–4515.
(2018) Relation networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3588–3597.
(2018) Composition loss for counting, density map estimation and localization in dense crowds. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 532–546.
(2018) Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 784–799.
(2018) CornerNet: detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750.
(2019) Adaptive NMS: refining pedestrian detection in a crowd. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6459–6468.
(2016) SSD: single shot multibox detector. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part I, pp. 21–37.
(2018) Learning efficient single-stage pedestrian detectors by asymptotic localization fitting. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 618–634.
(2019) High-level semantic feature detection: a new perspective for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5187–5196.
(2017) What can help pedestrian detection? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3127–3136.
(2012) A discriminative deep model for pedestrian detection with occlusion handling. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3258–3265.
(2018) Iterative crowd counting. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 270–285.
(2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271.
(2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
(2013) Optimized pedestrian detection for multiple and occluded people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3690–3697.
(2018) CrowdHuman: a benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123.
(2018) Small-scale pedestrian detection based on topological line localization and temporal feature aggregation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 536–551.
(2015) Pedestrian detection aided by deep learning semantic tasks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5079–5087.
(2018) Improving object localization with fitness NMS and bounded IoU loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6877–6885.
(2018) Repulsion loss: detecting pedestrians in a crowd. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7774–7783.
(2018) Deep layer aggregation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2403–2412.
(2016) Is Faster R-CNN doing well for pedestrian detection? In European Conference on Computer Vision, pp. 443–457.
(2017) CityPersons: a diverse dataset for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4457–4465.
(2018) Occlusion-aware R-CNN: detecting pedestrians in a crowd. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 637–653.
(2017) Multi-label learning of part detectors for heavily occluded pedestrian detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3486–3495.
(2018) Bi-box regression for pedestrian detection and occlusion estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 135–151.
(2019) Objects as points. arXiv preprint arXiv:1904.07850.