Resisting the Distracting-factors in Pedestrian Detection

Pedestrian detection has been heavily studied in the last decade due to its wide applications. Despite incremental progress, several distracting-factors in the aspect of geometry and appearance still remain. In this paper, we first analyze these impeding factors and their effect on the general region-based detection framework. We then present a novel model that is resistant to these factors by incorporating methods that are not solely restricted to pedestrian detection domain. Specifically, to address the geometry distraction, we design a novel coulomb loss as a regulator on bounding box regression, in which proposals are attracted by their target instance and repelled by the adjacent non-target instances. For appearance distraction, we propose an efficient semantic-driven strategy for selecting anchor locations, which can sample informative negative examples at training phase for classification refinement. Our detector can be trained in an end-to-end manner, and achieves consistently high performance on both the Caltech-USA and CityPersons benchmarks. Code will be publicly available upon publication.



1 Introduction

Pedestrian detection is a canonical sub-problem of object detection that remains a critical research topic in computer vision and has attracted massive research interest in recent years [1, 9, 10, 26, 40, 41]. It aims to predict accurate bounding boxes enclosing each pedestrian instance and serves as a key component of various real-world applications such as autonomous driving, robotics, and intelligent video surveillance.

Although promising performance has been achieved, several distracting-factors still challenge state-of-the-art pedestrian detection models. Since most detection frameworks adopt a region-based approach, the accuracy of region localization and classification is directly reflected in detection performance. Therefore, we divide the distracting-factors into two categories, geometry and appearance, as illustrated in Figure 1.

Figure 1: Pedestrian detection in the wild. Green boxes represent correct predictions. Red boxes indicate missed targets and misplaced prediction caused by geometry distraction. Blue boxes are the false positive predictions caused by appearance distraction.

Geometry distraction is mainly caused by crowd occlusion (also known as intra-class occlusion). It is the most significant barrier to accurate pedestrian detection in the wild and also the major occlusion case in most pedestrian datasets. When pedestrians gather together and occlude each other, the detector is prone to be disturbed by the instance adjacent to the target and to generate bounding boxes in their overlap (as with the left red box in Figure 1). Even worse, during non-maximum suppression (NMS), misplaced boxes with higher confidence scores may suppress accurate ones, and bigger boxes may suppress their small neighbours. Crowd occlusion also makes the detector sensitive to the NMS threshold, as a higher threshold brings in more false positives while a lower threshold leads to more missed detections [38].

Appearance distraction refers to false positives that share a similar shape with the human body (e.g. pillars, light poles). These objects frequently appear in common real-world scenes, and we define them as human-like structures. Due to complicated lighting conditions and varying resolutions, the detector is unable to recognize these objects correctly and may assign them a higher probability of being a person than of being background (as with the blue boxes in Figure 1).

Several efforts have been made to tackle these two challenges. For the former, previous methods such as [38, 44] introduce an extra penalty term on the bounding box regressor to constrain each sampled proposal. However, their regularization is incomplete and is likely to conflict with the original regression function. Others such as [2, 23] try to mitigate the negative impact by refining traditional greedy NMS [14]. However, the effectiveness of these post-processing methods is restricted by the accuracy of the predictions. For the latter, several hard/soft-sampling methods [21, 31] are proposed to either mine hard negative samples or re-weight each proposal to train the classifier. Both methods are loss-driven, which means they may easily neglect the semantic relations with the foreground, which can be useful clues for negative example sampling.

In this paper, we put forward a novel Distracting-factors Resistant model (dubbed DR-CNN) based on the Faster R-CNN framework [30] to tackle the aforementioned challenges. For geometry distraction, the key point is to generate accurate bounding boxes in occluded scenes. Inspired by the Coulomb force [16] between two electric charges, we define an Attractive Force between proposals and their target ground truth as well as a Repulsive Force between proposals and their non-target ground truth. With this insight, we build a physics model and use the energy consumption, calculated by the work formula, as the measure of the loss value. The new loss function, termed Coulomb Loss (CouLoss), works as a regulator that constrains each proposal during the regression process. As for appearance distraction, the breakthrough can be achieved by avoiding misclassification of human-like structures. To this end, we propose an efficient anchor location selecting strategy that functions as informative negative example mining. By adding an extra branch to the region proposal network (RPN) [30], a probability map is produced, and we only process anchors whose probabilities are above a dynamic threshold. These informative negative examples not only cause high loss values but also have semantic relations with pedestrian foregrounds.

To validate the effectiveness of these improvements, we conduct extensive experiments on both Caltech-USA [10] and CityPersons [42] benchmarking datasets. The main contributions are as follows:

  • For geometry distracting-factor, we design a new CouLoss on the basis of work formula that serves as a regulator for bounding box regression. It enforces proposals to minimize the intra-region distance as well as to maximize the inter-region distance.

  • For appearance distracting-factor, we modify RPN [30] with an extra branch for anchor location selecting, and propose a novel sampling method to capture informative negative examples to train the classifier.

  • Experimental results show the superiority of the proposed methods on pedestrian detection benchmarks. We also carry out experiments on PASCAL VOC dataset [13] to validate that our approaches are applicable for other general object detection tasks.

2 Related Work

We briefly review recent work on CNN-based pedestrian detectors and discuss related research on the two target distracting-factors: crowd occlusion and human-like structures.

CNN-Based Pedestrian Detectors. Recently, CNN-based methods have dominated the field of pedestrian detection [3, 20, 26, 27, 35, 43, 45] and achieved state-of-the-art performance [38, 25, 44] on several benchmarks: INRIA [8], ETH [12], Caltech-USA [10], and CityPersons [42]. Most of these models adopt region-based approaches in which detectors are trained to localize and classify sampled regions. As in general object detection, there are two different frameworks in pedestrian detection. Two-stage frameworks such as [38, 42, 44] first generate a set of candidate proposals and then sample a small batch of proposals for further bounding box regression and classification. One-stage frameworks such as [24, 27, 33] directly predict bounding box offsets and class scores from all anchors at each coordinate.

The proposal stage in a two-stage framework can rapidly narrow down the number of candidate proposals and provide processed proposals for the following stage, which makes it more suitable for handling distracting-factors in pedestrian detection. With this consideration, we choose Faster R-CNN [30] as our baseline model.

Crowd occlusion resolving. Attention models are proposed to improve the feature representation of visible parts. [20] generates scale-aware attention masks in a semantic segmentation manner. [43] employs a channel-wise attention mechanism with three different attention modules. Anchor-free methods are used to directly predict the bounding box of each target. In [18], bounding boxes are learned through a single convolutional neural network. [25] predicts the center points of targets and regresses their height and width. Other solutions formulate the issue as a regression problem. To better allocate proposals to each pedestrian, [38] proposes Repulsion Loss to keep proposals away from non-target ground truths and their proposals, while [44] comes up with Aggregation Loss, which enforces proposals to locate compactly around each other when they belong to the same target.

Our method shares a common spirit with [38, 44], where an extra regularization term is used in the loss function to guide proposal regression. The distinctive part is that we simultaneously consider both the attraction and repulsion processes in the extra regularization term, which makes our constraints theoretically more complete than those of [38, 44]. Moreover, we propose a physics framework to unify these two processes and make them compatible with each other.

Human-like structures handling. A multi-classifier design is a common structure for refining classification results. [35] employs different patterns that generate a pool of parts for the classifier to choose from. [11] trains multiple classifiers in parallel and fuses their scores to filter candidates. A set of grid score maps from multiple stages is generated by [27] to revise final prediction scores. Methods such as [19, 21, 28, 31] balance the regions of interest (RoIs) used to train the classifier. [31] proposes a hard example mining method that only samples negative proposals with high loss values. [21] designs Focal Loss, which assigns different weights to all proposals based on their probabilities.

We believe the problem is caused by under-sampling of useful negative examples (foreground-background imbalance). Our strategy is to mine human-like structures as negative examples to train the classifier. Unlike current sampling methods such as [19, 21, 31], which are loss-driven, and [5, 28], which are IoU-driven, our method samples regions that are semantically related to pedestrians. We term these regions informative negative examples since they are more likely than others to contain human-like structures.

3 Proposed Approach

We transform the aforementioned critical issues into two specific tasks respectively: bounding box localization of pedestrians in crowds and sampling human-like structures as negative examples. In this section, we systematically analyze each task and then offer our solution. Besides the benefit to the performance, an important advantage of our method is that we do not increase any computational cost during inference phase.

Specifically, we introduce the Coulomb Loss, which is designed especially for crowd scenes, in Section 3.1. Then, a novel anchor sampling method is proposed in Section 3.2 to mine informative negative examples. Finally, in Section 3.3, we present the network architecture and the loss function for end-to-end training.

3.1 Coulomb Loss

Ideal detection proposals should not only be close to their target ground truth, but also have minimal intra-region distances and maximal inter-region distances. This shares the same goal as visual recognition tasks, where both intra-class margin minimization and inter-class margin maximization are required. This insight encourages us to leverage a regularization term in the loss function to improve the localization accuracy of pedestrians in crowds.

Inspired by the Coulomb force, we regard each bounding box as a single charge. We then define the Attractive Force and the Repulsive Force as the interaction between a proposal and its target and non-target ground truth respectively. Consider the set of proposals that have a high Intersection over Union (IoU) with a ground-truth box, and take two such proposals together with their respective target ground truths. For convenience of analysis, we form a triplet consisting of a proposal, its own target ground truth, and the target ground truth of the other proposal, where the proposal is the anchor while its target and the non-target ground truth are the positive and negative samples respectively.

In physics, work is used to measure the energy consumed in moving an object from one place to another. Accordingly, we can take this value as the cost of pulling a proposal toward its target ground truth or pushing it away from a non-target ground truth, which is exactly the loss we need. To utilize the work formula, we build a physics model at the box level, which is discussed in detail below.

First and most importantly, we have to define the force between boxes, which is related to their distance. Since IoU is a widely used metric for measuring the closeness between two bounding boxes, we refer to the objective function of IoU Loss [39] and formulate the forces as:


Note that the forces only exist when there is an overlap between the proposal and the ground truth (i.e., their IoU is greater than 0). From Eq. 1 we can see that the lower the closeness between a proposal and its target instance, the stronger the Attractive Force applied to the proposal, whilst the higher the closeness between a proposal and its non-target instance, the stronger the Repulsive Force applied. Numerically, the Attractive Force gets extremely large when the IoU with the target approaches 0, which makes the training process unstable. Here, we propose a re-sampling strategy for proposals that shares a similar spirit with [36], where we only select proposals whose center points fall into the region of their corresponding ground truth boxes.
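The properties described above can be illustrated with a minimal sketch. It assumes that the Attractive Force takes the IoU Loss form, -ln(IoU), and that the Repulsive Force takes the mirrored form, -ln(1 - IoU); both forms, the function names, and the `eps` clamp are assumptions made for illustration and are not taken from Eq. 1.

```python
import math

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def attractive_force(proposal, target_gt, eps=1e-7):
    # Assumed form: -ln(IoU). Grows as the proposal drifts away from its target.
    overlap = iou(proposal, target_gt)
    if overlap <= 0.0:
        return 0.0  # forces only exist when the boxes overlap
    return -math.log(max(overlap, eps))

def repulsive_force(proposal, non_target_gt, eps=1e-7):
    # Assumed form: -ln(1 - IoU). Grows as the proposal covers a non-target box.
    overlap = iou(proposal, non_target_gt)
    if overlap <= 0.0:
        return 0.0
    return -math.log(max(1.0 - overlap, eps))
```

Under these assumed forms, a proposal that coincides with its target feels no attraction, and the attraction increases without bound as the IoU approaches 0, which is the instability that the center-based re-sampling is meant to avoid.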

In the work formula, the cosine term is introduced because a force does not always point in the direction that moves the object toward its target location. In this case, only part of the force is effective. It is reasonable to follow the same definition in box regression, as illustrated in Figure 2(a). The Attractive Force always pulls in the correct direction, but the Repulsive Force may push the proposal away from its original target when its direction does not lie along the center line toward the target. To handle such cases, we introduce the Effective Force as the component of the original force:


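where the cosine of the angle can be calculated by the law of cosines, since we have the coordinates of each proposal and ground truth. With Eq. 2, the effective Attractive Force is defined as the force pulling the proposal toward its target ground truth, and the effective Repulsive Force is the force pushing the proposal toward its target ground truth.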
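As a hedged illustration of the law-of-cosines step, the sketch below computes the cosine between the repulsion direction and the direction toward the target from the three box centers. It assumes that the repulsion acts along the line from the non-target center through the proposal center; this geometry and the helper names `box_center`, `repulsion_cosine`, and `effective_force` are assumptions for illustration.

```python
import math

def box_center(box):
    """Center point of a box given as (x1, y1, x2, y2)."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def repulsion_cosine(proposal, target_gt, non_target_gt):
    p = box_center(proposal)
    g_t = box_center(target_gt)
    g_n = box_center(non_target_gt)
    a = math.dist(p, g_n)    # proposal center to non-target center
    b = math.dist(p, g_t)    # proposal center to target center
    c = math.dist(g_n, g_t)  # non-target center to target center
    if a == 0.0 or b == 0.0:
        return 0.0  # direction undefined, treat the force as ineffective
    # Law of cosines gives the interior angle at the proposal center.
    cos_interior = (a * a + b * b - c * c) / (2.0 * a * b)
    # The push points away from the non-target, so the angle to the
    # target direction is the supplement of the interior angle.
    return max(-1.0, min(1.0, -cos_interior))

def effective_force(force, cos_theta):
    # Component of the force along the direction toward the target.
    return force * cos_theta
```

With this convention the cosine is 1 when the proposal lies between the non-target and the target, so the push moves it straight toward its target, and 0 when the push is perpendicular to the target direction.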

Figure 2: (a) The Attractive and Repulsive Forces between proposal and ground truth. The direction of the Attractive Force is always toward the target, while the direction of the Repulsive Force may deviate from the target. (b) The proposed DR-RPN module with an extra anchor location branch. The newly added branch yields a probability map of the existence of human-shaped structures. During training, a dynamic threshold is used to filter out low-probability regions, and the model only selects anchors whose centers fall into the remaining regions.

Finally, we define the distance between a proposal and its target ground truth:


In Eq. 5, the four terms are the distances from the center point of the proposal to the left, right, top, and bottom borders of its ground truth respectively, as shown in Figure 2(a). A merit of Eq. 5 is that in crowd scenes, where there is an intrinsic overlap between the target and non-target ground truths, it restricts the repulsion from the non-target ground truth when the proposal is close to its target. This is another crucial difference from RepLoss [38]. Under their definition, the repulsion always exists as long as the proposal overlaps the non-target ground truth, which pushes a well-regressed proposal away from its target.
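A minimal sketch of such a distance is given below. It assumes the distance is one minus the center-ness of [36], computed from the four center-to-border offsets; this exact form, the function name `target_distance`, and the fallback value for centers outside the box are assumptions for illustration rather than the definition in Eq. 5.

```python
import math

def target_distance(proposal, gt):
    """Distance in [0, 1] from the proposal center to the center of its ground truth.

    Boxes are (x1, y1, x2, y2). Assumed form: 1 - center-ness.
    """
    cx = (proposal[0] + proposal[2]) / 2.0
    cy = (proposal[1] + proposal[3]) / 2.0
    left, right = cx - gt[0], gt[2] - cx
    top, bottom = cy - gt[1], gt[3] - cy
    if min(left, right, top, bottom) <= 0.0:
        return 1.0  # center on or outside the border: maximal distance
    centerness = math.sqrt(
        (min(left, right) / max(left, right)) * (min(top, bottom) / max(top, bottom))
    )
    return 1.0 - centerness
```

Because this distance vanishes when the proposal is centered on its target, any work computed from it also vanishes there, which matches the behaviour described for well-regressed proposals in crowd scenes.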

The work value and the overall CouLoss are calculated as:


It is worth noting that in Eq. 4 we ignore the cases in which the force has no component toward the target, since they do not perform any work that moves proposals toward their target locations. Last but not least, the new CouLoss can benefit both the RPN and Fast R-CNN [15] modules of the Faster R-CNN algorithm.
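The aggregation can be sketched as follows, using the physics definition of work as force times distance times the cosine of the angle between them. The way the attractive and repulsive work are summed and averaged over triplets is an assumption for illustration, as are the names `coulomb_work` and `coulomb_loss`.

```python
def coulomb_work(force, cos_theta, distance):
    """Work W = F * s * cos(theta); cases with no component toward the target are ignored."""
    if cos_theta <= 0.0:
        return 0.0
    return force * cos_theta * distance

def coulomb_loss(triplets):
    """Average work over triplets of (attractive force, repulsive force,
    repulsion cosine, distance to target). The attractive force is assumed
    to always point at the target, so its cosine is 1."""
    total, count = 0.0, 0
    for f_att, f_rep, cos_rep, distance in triplets:
        total += coulomb_work(f_att, 1.0, distance)
        total += coulomb_work(f_rep, cos_rep, distance)
        count += 1
    return total / count if count else 0.0
```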

Besides the physics view, Eq. 4 can also be interpreted from a measurement angle. Since the force is related to the IoU value, it can be regarded as a scale calibration between the proposal and the ground truth. The distance (similar to the center-ness in [36]) serves as a location calibration from the proposal to the ground truth. Finally, the cosine term is the dynamic weight for the aggregated value.

3.2 Anchor Location Selecting

Human-like structures frequently act as false positives in pedestrian detection due to the foreground-background class imbalance. This problem is caused by the detection framework. For instance, in RPN [30], since the only sampling principle for negative examples is a low IoU with the ground-truth bounding boxes, there is a high probability that negative proposals are sampled from easily distinguished areas (e.g., sky and road). A classifier trained with these negative examples soon converges and loses the ability to learn hard ones. Our solution is therefore to mine informative negative examples to train the classifier.

To better sample informative negative examples, we put forward a novel scheme that erases anchors from easily distinguished areas. As shown in Figure 2(b), an anchor localization branch is added to the RPN module, which yields a probability map representing the existence of human-shaped structures (including humans and human-like structures) at each coordinate. The root mean square value of the probability map is set as a dynamic threshold during the training phase to reserve valuable regions that contain informative negative examples. It is worth noting that, though our modification to RPN is similar to the changes in [37], the two models have completely different design goals. [37] tries to generate accurate bounding boxes for foregrounds by learnable shape and location, while we propose to sample negative proposals that have high scores on the location confidence map. Our setting is based on the fact that human-like structures usually have feature representations similar to those of humans. Therefore, strong semantic relations exist between them, and we use these relations as clues to mine informative negative examples.
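The thresholding step can be sketched in a few lines, assuming the probability map is available as a 2-D array with one value per anchor location. The function name `select_anchor_locations` is hypothetical.

```python
import numpy as np

def select_anchor_locations(prob_map):
    """Keep locations whose probability exceeds the root mean square of the map."""
    prob_map = np.asarray(prob_map, dtype=float)
    threshold = float(np.sqrt(np.mean(prob_map ** 2)))  # dynamic, recomputed per map
    mask = prob_map > threshold
    return mask, threshold
```

Because the threshold is recomputed from each map, a map dominated by low-probability background keeps only its few high-probability locations, whatever their absolute values are.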

To train the anchor localization branch, we employ the ground-truth bounding boxes to generate a binary score map where 1 indicates a selected location and 0 indicates the rest. Specifically, we categorize three types of regions on each score map, as shown in Figure 3.

(1) Positive region. We define the areas of the visible bounding boxes as the positive region, since these parts provide the most valuable semantic information.

(2) Ignored region. The non-visible part is obtained by excluding the visible part from the full-body bounding box. We mark this area as the ignored region. These regions are harmful to the classifier, because proposals in them might be labeled as positive without containing any human feature representations (see Figure 4(b) in [45] for further details).

(3) Negative region. The rest of the score map only contains background information and is regarded as the negative region.

Figure 3: The construction of the anchor location target. We use both full-body and visible-body box annotations to define the positive, ignored, and negative regions.
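The three regions above can be rasterized into a target map as in the following sketch. The label encoding (1 for positive, 0 for negative, -1 for ignored), integer pixel coordinates, and the function name `build_location_target` are assumptions for illustration.

```python
import numpy as np

POSITIVE, NEGATIVE, IGNORED = 1, 0, -1  # assumed label encoding

def build_location_target(height, width, full_boxes, visible_boxes):
    """Build the anchor location target from full-body and visible-body boxes.

    Boxes are integer (x1, y1, x2, y2) with exclusive x2 and y2.
    """
    target = np.full((height, width), NEGATIVE, dtype=np.int8)
    # Mark every full-body box as ignored first.
    for x1, y1, x2, y2 in full_boxes:
        target[y1:y2, x1:x2] = IGNORED
    # Visible parts override the ignored label, leaving only the
    # occluded remainder of each full-body box as ignored.
    for x1, y1, x2, y2 in visible_boxes:
        target[y1:y2, x1:x2] = POSITIVE
    return target
```

Painting all ignored regions before any positive region means that a visible pedestrian is never masked out by the occluded part of a neighbour.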

The proposed anchor selecting strategy rapidly narrows down the search space of generated anchors to a small scale. As shown in Figure 5, it can effectively filter out the low-probability regions and select the anchors that have strong semantic relations with foregrounds (e.g. human-like structures) as negative examples. Please note that we only use this strategy during the training phase, and the anchor localization branch can be dropped during inference to save computational cost.

3.3 Network Architecture

Our DR-CNN detector follows the implementation of Faster R-CNN [6] and uses VGG-16 [32] as the backbone. To better fulfill pedestrian detection task, the detector is modified following the settings in [42].

The final loss function is jointly optimized with the following losses:


where the first term represents the original classification and regression losses in both the RPN and Fast R-CNN modules, the two CouLoss terms are the extra regularization terms for regression, and the last term is the Focal Loss [22] used to train the binary classification for anchor location selecting. The three coefficients are hyperparameters used to balance the auxiliary losses.
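A minimal sketch of this weighted combination is given below, together with a binary focal loss for the anchor location branch. The default values alpha = 0.25 and gamma = 2, the ignore label of -1, the normalization by the number of non-ignored locations, and all function and parameter names are assumptions for illustration.

```python
import numpy as np

def binary_focal_loss(prob, target, alpha=0.25, gamma=2.0, ignore_label=-1, eps=1e-7):
    """Focal loss for binary labels, skipping locations marked with ignore_label."""
    prob = np.clip(np.asarray(prob, dtype=float), eps, 1.0 - eps)
    target = np.asarray(target)
    valid = target != ignore_label
    p_t = np.where(target == 1, prob, 1.0 - prob)        # probability of the true class
    alpha_t = np.where(target == 1, alpha, 1.0 - alpha)  # class balancing weight
    loss = -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)
    return float(loss[valid].sum() / max(int(valid.sum()), 1))

def total_loss(base_loss, coulomb_rpn, coulomb_rcnn, location_loss,
               w_rpn=1.0, w_rcnn=1.0, w_loc=1.0):
    """Original detection loss plus the three weighted auxiliary losses."""
    return base_loss + w_rpn * coulomb_rpn + w_rcnn * coulomb_rcnn + w_loc * location_loss
```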

4 Experiments

4.1 Experimental Setting

Datasets. We conduct experiments on two benchmarks: Caltech-USA [10] and CityPersons [42]. Both benchmarks contain annotations for the visible areas. We use Caltech-USA10x which samples 42,782 frames and 4,024 frames as training and testing datasets respectively. The refined annotation provided by [41] is used in related experiments. CityPersons is a more challenging dataset derived from Cityscapes [7]. It includes 5,000 images in total and 2,975, 500, 1,525 images for training, validation and testing respectively.

Implementation details. As a common convention, we horizontally flip training images for pre-processing. The Adam solver with a weight decay of 0.0001 is adopted to optimize the network on 1 Nvidia TITAN GPU. A mini-batch involves 2 images per GPU due to computational resource constraints. We set the base learning rate to 0.0001 and train the network for 16 epochs and 12 epochs on Caltech-USA and CityPersons respectively. The three loss-balancing hyperparameters are empirically set to 1. We only select ground-truth pedestrian examples with a height of at least 50 pixels and set the rest as ignored examples for training purposes.

Evaluation protocols. The models are evaluated by the log-average miss rate ($MR^{-2}$), which is the miss rate averaged over the false positives per image (FPPI) range of $[10^{-2}, 10^{0}]$. A lower value represents better pedestrian detection performance. To further evaluate performance in occluded scenes, pedestrian instances are divided into bare, partial, and heavy subsets according to their visible ratio.
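For reference, the log-average miss rate is commonly computed by sampling the miss-rate curve at nine FPPI values evenly spaced in log space between 0.01 and 1 and taking their geometric mean. The sketch below follows that convention; the handling of reference points the curve never reaches and the function name are assumptions, and the official benchmark toolkits may differ in detail.

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of the miss rate sampled at log-spaced FPPI reference points."""
    fppi = np.asarray(fppi, dtype=float)
    miss_rate = np.asarray(miss_rate, dtype=float)
    order = np.argsort(fppi, kind="stable")
    fppi, miss_rate = fppi[order], miss_rate[order]
    refs = np.logspace(-2.0, 0.0, num_points)
    sampled = []
    for ref in refs:
        idx = np.where(fppi <= ref)[0]
        # If the curve never reaches this FPPI, count every pedestrian as missed.
        sampled.append(miss_rate[idx[-1]] if idx.size else 1.0)
    sampled = np.maximum(np.array(sampled), 1e-10)  # avoid log(0)
    return float(np.exp(np.mean(np.log(sampled))))
```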

4.2 Comparisons with State-of-the-art Methods

Result on CityPersons dataset. We compare our DR-CNN with state-of-the-art pedestrian detection frameworks, including FRCNN [42], RepLoss [38], OR-CNN [44], ATT-part [43], Bi-Box [45], MGAN [29], ALFNet [24], TLL [34] and CSP [25], on the CityPersons validation set. It is worth noting that existing pedestrian detection methods employ different detection frameworks and backbones and use different input scales, so we also list these components in Table 1 for fair comparison.

The performance results are summarized in Table 1. Our model achieves the best performance on the Reasonable subset, outperforming the second best results by a margin of 0.6. Compared with CSP [25], which is the current best region-based one-stage detector, our DR-CNN improves the miss rate on the Reasonable subset from 11.0 to 10.4. It is worth mentioning that the extra anchor location branch in DR-RPN is removable during inference, which makes the architecture of our detector no different from that of FRCNN [42] and RepLoss [38]. Our DR-CNN surpasses these two models by 2.5 / 1.2 on the Reasonable subset and by 3.6 / 8.4 on the Heavy subset. Models such as OR-CNN [44], ATT-part [43], Bi-Box [45], and MGAN [29] modify the network architecture in the second stage, which leads to better performance under occlusion. Our DR-CNN achieves 46.9 on the Heavy subset, which is competitive with these models.

Method                              Backbone    Reasonable  Heavy  Partial  Bare
ATT-part [43]                       VGG-16      16.0        56.7   -        -
TLL [34]                            ResNet-50   15.5        53.6   17.2     10.0
FRCNN [42]                          VGG-16      12.9        50.5   -        -
ALFNet [24]                         ResNet-50   12.0        51.9   11.4     8.4
RepLoss [38]                        ResNet-50   11.6        55.3   14.8     7.0
MGAN [29]                           VGG-16      11.5        51.7   -        -
Bi-Box [45]                         VGG-16      11.2        44.2   -        -
OR-CNN [44]                         VGG-16      11.0        51.3   13.7     5.9
CSP [25]                            ResNet-50   11.0        49.3   10.4     7.3
Ours (two-stage), baseline          VGG-16      12.7        54.2   14.4     7.3
Ours (two-stage), CouLoss           VGG-16      10.5        51.9   11.3     5.8
Ours (two-stage), ALS               VGG-16      11.1        51.2   12.2     5.9
Ours (two-stage), CouLoss + ALS     VGG-16      10.4        46.9   10.7     5.8

Table 1: Pedestrian detection results on the CityPersons validation set. ALS is the short form of Anchor Location Selecting. All models are trained on the training set. We use the log-average miss rate to compare the detectors (lower is better). The best and the second best are highlighted in red and blue.

Result on Caltech-USA dataset. We conduct an extensive comparison with recent methods, including DeepParts [35], RPN+BF [40], MS-CNN [4], SDS-RCNN [3], ATT-part [43], RepLoss [38], Bi-Box [45], and CSP [25]. The results are mainly compared on three occlusion settings: Reasonable, Heavy, and All, where All covers instances whose visible ratio is larger than 0.2.

As shown in Table 2, our DR-CNN achieves superior results compared with most of the models and performs competitively with the state-of-the-art method. Specifically, on the Reasonable subset, our model surpasses Bi-Box [45] by a margin of 2.7 but slightly falls behind CSP [25] by 0.4. Compared with RepLoss [38], the miss rates on the Heavy and All subsets are reduced from 47.9 to 45.5 and from 59.0 to 57.0 respectively.

Method Reasonable Heavy All
DeepParts [35] 11.9 60.4 64.8
ATT-part [43] 10.3 45.2 54.5
MS-CNN [4] 10.0 59.9 60.9
RPN+BF [40] 9.6 74.4 64.7
Bi-Box [45] 7.6 44.4 -
SDS-RCNN [3] 7.4 58.6 61.5
RepLoss [38] 5.0 47.9 59.0
CSP [25] 4.5 45.8 56.9
Ours 4.9 45.5 57.0
Table 2: Pedestrian detection results on the Caltech-USA test set. It is worth mentioning that all models are directly trained on Caltech-USA. We use the log-average miss rate to compare the detectors (lower is better). The best and the second best are highlighted in red and blue.

4.3 Ablation Study

We carry out comprehensive ablation studies on CityPersons dataset to evaluate the contribution of different model components and the training configurations.

Model           RPN weight  Fast R-CNN weight  Reasonable  Heavy  Partial  Bare
IoULoss [39]    -           -                  12.4        52.0   12.6     6.9
RepLoss [38]    -           -                  11.6        55.3   14.8     7.0
AggLoss [44]    -           -                  11.4        52.6   13.8     6.2
Baseline        0           0                  12.7        54.2   14.4     7.3
CouLoss         1           0                  10.9        53.0   11.5     5.8
CouLoss         0           1                  11.0        53.8   11.7     6.0
CouLoss         0.3         0.7                10.5        51.9   11.3     5.8

Table 3: Comparison between CouLoss and other loss functions on CityPersons. The fourth row represents the baseline model; the fifth and sixth rows represent using CouLoss only in the RPN stage and only in the Fast R-CNN stage respectively.

Coulomb loss. We denote by DR-CNN-A the detector that uses CouLoss as an embedded regularization on the original regression loss of the baseline detector; its results are shown in Table 3. The different weight pairs represent different combinations of CouLoss on the RPN and Fast R-CNN modules. Comparing the detection results in Table 3, we observe that with CouLoss the miss rates on the four subsets decrease from the baseline by margins of 2.2, 2.3, 3.1, and 1.5 respectively. The results of using CouLoss at different stages show that CouLoss benefits both stages by better aligning proposals around their ground truth. We train DR-CNN-A for several rounds and find that the best combination sets the CouLoss weights to 0.3 for RPN and 0.7 for Fast R-CNN. Moreover, we also study the effectiveness of the Attractive Force and the Repulsive Force separately, and observe that each improves over the baseline on the Reasonable subset.

It is worth mentioning that our CouLoss is superior to the two state-of-the-art methods using AggLoss [44] and RepLoss [38], which also serve as regularization terms on the regression function. This suggests that the design guideline of CouLoss is more complete and more suitable for bounding box regression. We also compare the results with IoULoss [39], since our CouLoss is based on the form of IoULoss. The results show that the improvements are only marginally related to using the form of IoULoss.

Since CouLoss can pull proposals to their target ground truths and push them away from non-target ones, DR-CNN-A becomes less sensitive to the NMS threshold. To demonstrate this point, we present the miss rate with CouLoss across various NMS thresholds. As mentioned in Section 1, a high NMS threshold may lead to more false positives, while a low NMS threshold may lead to more false negatives. In Figure 4(a), DR-CNN-A always produces a lower miss rate than the baseline. It is noteworthy that the curve of DR-CNN-A is smoother than that of the baseline, indicating that changing the NMS threshold has less impact on DR-CNN-A. In addition, we visualize the predicted bounding boxes before NMS in crowd scenes in Figure 4(b). Compared with the baseline, the predictions of DR-CNN-A locate compactly around the ground truths, and there are fewer proposals lying in the overlaps between adjacent pedestrians.
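The threshold sensitivity discussed here can be reproduced with a minimal greedy NMS sketch. This is the standard greedy procedure rather than any implementation specific to this work, and the function name and box layout are assumptions for illustration.

```python
import numpy as np

def greedy_nms(boxes, scores, iou_threshold):
    """Return indices of kept boxes, highest score first. Boxes are (x1, y1, x2, y2)."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # Intersection of the best box with every remaining box.
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        ious = inter / (area_best + area_rest - inter)
        # Suppress every remaining box that overlaps the best box too much.
        order = rest[ious <= iou_threshold]
    return keep
```

For two adjacent pedestrians whose boxes overlap with an IoU of about 0.43, a threshold of 0.3 suppresses the lower-scored pedestrian and produces a missed detection, while a threshold of 0.5 keeps both.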

Figure 4: (a) Comparison between DR-CNN-A and the baseline based on the miss rate across different NMS thresholds. The curve of DR-CNN-A is smoother than that of the baseline, indicating that it is less sensitive to the NMS threshold. The bar at each point represents the deviation from the average value. (b) Visualization of the predicted bounding boxes before NMS. Compared with the baseline, the predictions from CouLoss locate compactly around the ground truths, and there are fewer proposals lying in the overlaps between adjacent pedestrians.

Anchor location selecting. We first demonstrate the effectiveness of the DR-RPN architecture by constructing a detector (DR-CNN-B) that uses the proposed DR-RPN instead of the original RPN in the baseline. The comparison results are reported in Table 4. In the second row, the anchor location selecting process is bypassed (labeled as w/o selecting). This model outperforms the baseline by a margin of 1.0 in miss rate on the Heavy subset, indicating that the extra anchor location branch is helpful in centering anchors around foregrounds. When anchor location selecting replaces the sliding-window anchoring method in DR-RPN, DR-CNN-B achieves a further improvement of 0.6 on the Reasonable subset. This suggests that the proposed anchor sampling strategy can filter out less informative proposals from the negative examples. Since the proposed method is specially designed for false positive cases, which cannot be clearly reflected in the miss rate, we also use FPPI to evaluate the model and observe consistently lower scores on all subsets.

Results mentioned above are also supported by the visualization results shown in Figure 5, where we present input images, the generated probability maps and the selected anchors sampled by the proposed method. It can be seen that the probability maps in Figure 5(b) are highly correlated with the human-shape structures, which leads the selected anchors to concentrate more on these objects as shown in Figures 5(c).

An additional experiment is done to validate the necessity of introducing the ignored region (IR) when training the anchor location branch. As shown in the third row of Table 4, the model trained without IR performs consistently worse on all subsets. This is mainly because it mislabels proposals as positive examples when they are largely occupied by non-visible parts, as discussed in Section 3.2.

Model                       Reasonable  Heavy      Partial    Bare
Baseline                    12.7/0.22   54.2/0.65  14.4/0.24  7.3/0.04
DR-CNN-B w/o selecting      11.7/0.16   53.2/0.54  11.7/0.20  6.9/0.04
DR-CNN-B w/o IR             11.9/0.17   54.0/0.53  12.7/0.20  6.2/0.03
DR-CNN-B w selecting/IR     11.1/0.14   51.2/0.48  12.2/0.20  5.9/0.02

Table 4: Validation of the necessity of the anchor location selecting process and the ignored region (IR) in DR-CNN-B. The results are reported in the form of miss rate / FPPI.

Figure 5: Visualization of the generated probability maps and the selected anchors sampled by the proposed method. We can observe that the probability maps in (b) are highly correlated with human-shape structures in (a), which leads the selected anchors to concentrate more on these objects as shown in (c).

5 Extension: Results on PASCAL VOC

In this section, we apply the proposed methods to general object detection to examine their generality. General object detection also suffers from occluded scenes and false positive examples.

Our experiments are performed on the PASCAL VOC dataset [13], which is a common benchmark for general object detection. We employ Faster R-CNN with ResNet-101 [17] as the backbone for the baseline detector. The model is trained on the training and validation sets of PASCAL VOC 2007 and PASCAL VOC 2012, and is tested on the testing set of PASCAL VOC 2007. To evaluate high-quality detection results from our methods, we use the COCO metrics for evaluation (the annotations of PASCAL VOC are transformed to COCO format and the COCO toolbox is used for evaluation). The results in Table 5 show that the proposed methods bring significant improvements on the general object detection task, especially under a high IoU threshold, which supports the generality of the proposed methods.
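The annotation conversion mentioned above can be sketched with the Python standard library alone. This is an illustrative sketch under stated assumptions, not the authors' script: the box width is taken as xmax - xmin (some converters add 1), VOC "difficult" objects are mapped to COCO's iscrowd flag so that they are not counted as ordinary targets, and the XML snippet is made up.

```python
# Minimal sketch: convert PASCAL VOC XML annotations to a COCO-style dict.
import xml.etree.ElementTree as ET

# Hypothetical VOC-style annotation (not a real file from the dataset).
VOC_XML = """
<annotation>
  <filename>example.jpg</filename>
  <size><width>353</width><height>500</height><depth>3</depth></size>
  <object>
    <name>dog</name><difficult>0</difficult>
    <bndbox><xmin>48</xmin><ymin>240</ymin><xmax>195</xmax><ymax>371</ymax></bndbox>
  </object>
  <object>
    <name>person</name><difficult>1</difficult>
    <bndbox><xmin>8</xmin><ymin>12</ymin><xmax>352</xmax><ymax>498</ymax></bndbox>
  </object>
</annotation>
"""


def voc_to_coco(xml_strings, class_names):
    """Build COCO 'images', 'annotations', and 'categories' from VOC XML strings."""
    cat_ids = {name: idx + 1 for idx, name in enumerate(class_names)}
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": i, "name": n} for n, i in cat_ids.items()],
    }
    ann_id = 1
    for image_id, xml_text in enumerate(xml_strings, start=1):
        root = ET.fromstring(xml_text.strip())
        size = root.find("size")
        coco["images"].append({
            "id": image_id,
            "file_name": root.findtext("filename"),
            "width": int(size.findtext("width")),
            "height": int(size.findtext("height")),
        })
        for obj in root.iter("object"):
            box = obj.find("bndbox")
            xmin, ymin, xmax, ymax = (
                float(box.findtext(k)) for k in ("xmin", "ymin", "xmax", "ymax"))
            w, h = xmax - xmin, ymax - ymin  # COCO boxes are [x, y, width, height]
            coco["annotations"].append({
                "id": ann_id,
                "image_id": image_id,
                "category_id": cat_ids[obj.findtext("name")],
                "bbox": [xmin, ymin, w, h],
                "area": w * h,
                # Assumption: treat VOC "difficult" objects like COCO crowd regions.
                "iscrowd": int(obj.findtext("difficult", default="0")),
            })
            ann_id += 1
    return coco


coco = voc_to_coco([VOC_XML], ["dog", "person"])
print(coco["annotations"][0]["bbox"])  # [48.0, 240.0, 147.0, 131.0]
```

A dictionary in this layout can be written to JSON and passed to a COCO-style evaluator, which then reports average precision averaged over IoU thresholds as well as at individual thresholds.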

| Method | AP | AP50 | AP75 |
| Faster R-CNN | 49.2 | 77.2 | 53.8 |
| + CouLoss | +0.8 | +0.2 | +1.0 |
| + Anchor location selecting | +0.7 | +0.2 | +1.1 |
| + All | +2.3 | +0.5 | +3.7 |
Table 5: General object detection results on the PASCAL VOC 2007 test set. Note that these results are evaluated under the COCO metrics, which differ from the original VOC metrics.

6 Conclusion

In this paper, we put forward a novel DR-CNN framework to tackle the geometry and appearance distracting-factors in pedestrian detection, i.e., crowd occlusion and human-like structures. We transform these factors into two specific tasks, namely bounding box localization of pedestrians in crowds and sampling of human-like structures as negative examples, and devise two general methods to approach them. For geometry distraction, we design a new loss function, termed CouLoss, to regulate the process of bounding box regression. Specifically, we build a physics framework to unify the proposed Attractive Force and Repulsive Force, which pull proposals towards their target ground truths and push proposals away from non-target ones, respectively. For appearance distraction, we introduce an efficient semantic-driven strategy for selecting anchor locations, which samples human-like structures as informative negative examples during training for classification refinement. It is worth mentioning that neither method adds any computational cost during inference.

Our model is trained in an end-to-end fashion and achieves competitive performance on two widely adopted benchmark datasets, i.e., Caltech-USA and CityPersons. Detailed ablation experiments demonstrate the effectiveness of each proposed approach. More importantly, the promising preliminary results on PASCAL VOC show that our methods could also be adopted for other appearance-based object detection tasks.


References

  • [1] R. Benenson, M. Omran, J. Hosang, and B. Schiele (2014) Ten years of pedestrian detection, what have we learned?. In European Conference on Computer Vision, pp. 613–627. Cited by: §1.
  • [2] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis (2017-10) Soft-nms – improving object detection with one line of code. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1.
  • [3] G. Brazil, X. Yin, and X. Liu (2017) Illuminating pedestrians via simultaneous detection & segmentation. arXiv preprint arXiv:1706.08564. Cited by: §2, §4.2, Table 2.
  • [4] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos (2016) A unified multi-scale deep convolutional neural network for fast object detection. In European Conference on Computer Vision, pp. 354–370. Cited by: §4.2, Table 2.
  • [5] Y. Cao, K. Chen, C. C. Loy, and D. Lin (2019) Prime sample attention in object detection. arXiv preprint arXiv:1904.04821. Cited by: §2.
  • [6] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: §3.3.
  • [7] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223. Cited by: §4.1.
  • [8] N. Dalal and B. Triggs (2005) Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, Vol. 1, pp. 886–893. Cited by: §2.
  • [9] P. Dollár, Z. Tu, P. Perona, and S. Belongie (2009) Integral channel features. Cited by: §1.
  • [10] P. Dollár, C. Wojek, B. Schiele, and P. Perona (2009) Pedestrian detection: a benchmark. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 304–311. Cited by: §1, §1, §2, §4.1.
  • [11] X. Du, M. El-Khamy, J. Lee, and L. Davis (2017) Fused dnn: a deep neural network fusion approach to fast and robust pedestrian detection. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pp. 953–961. Cited by: §2.
  • [12] A. Ess, B. Leibe, K. Schindler, and L. Van Gool (2008) A mobile vision system for robust multi-person tracking. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pp. 1–8. Cited by: §2.
  • [13] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: 3rd item, §5.
  • [14] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §1.
  • [15] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §3.1.
  • [16] D. Halliday, R. Resnick, and J. Walker (2013) Fundamentals of physics. John Wiley & Sons. Cited by: §1.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §5.
  • [18] L. Huang, Y. Yang, Y. Deng, and Y. Yu (2015) Densebox: unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874. Cited by: §2.
  • [19] B. Li, Y. Liu, and X. Wang (2019) Gradient harmonized single-stage detector. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 8577–8584. Cited by: §2, §2.
  • [20] C. Lin, J. Lu, G. Wang, and J. Zhou (2018) Graininess-aware deep feature learning for pedestrian detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 732–747. Cited by: §2, §2.
  • [21] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar (2017-10) Focal loss for dense object detection. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2, §2.
  • [22] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §3.3.
  • [23] S. Liu, D. Huang, and Y. Wang (2019-06) Adaptive nms: refining pedestrian detection in a crowd. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1.
  • [24] W. Liu, S. Liao, W. Hu, X. Liang, and X. Chen (2018-09) Learning efficient single-stage pedestrian detectors by asymptotic localization fitting. In The European Conference on Computer Vision (ECCV), Cited by: §2, §4.2, Table 1.
  • [25] W. Liu (2019) High-level semantic feature detection: a new perspective for pedestrian detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2, §2, §4.2, §4.2, §4.2, §4.2, Table 1, Table 2.
  • [26] J. Mao, T. Xiao, Y. Jiang, and Z. Cao (2017) What can help pedestrian detection?. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6034–6043. Cited by: §1, §2.
  • [27] J. Noh, S. Lee, B. Kim, and G. Kim (2018) Improving occlusion and hard negative handling for single-stage pedestrian detectors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 966–974. Cited by: §2, §2.
  • [28] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin (2019) Libra r-cnn: towards balanced learning for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 821–830. Cited by: §2, §2.
  • [29] Y. Pang, J. Xie, M. H. Khan, R. M. Anwer, F. S. Khan, and L. Shao (2019-10) Mask-guided attention network for occluded pedestrian detection. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §4.2, §4.2, Table 1.
  • [30] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: 2nd item, §1, §2, §3.2.
  • [31] A. Shrivastava, A. Gupta, and R. Girshick (2016) Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769. Cited by: §1, §2, §2.
  • [32] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §3.3.
  • [33] T. Song, L. Sun, D. Xie, H. Sun, and S. Pu (2018-09) Small-scale pedestrian detection based on topological line localization and temporal feature aggregation. In The European Conference on Computer Vision (ECCV), Cited by: §2.
  • [34] T. Song, L. Sun, D. Xie, H. Sun, and S. Pu (2018) Small-scale pedestrian detection based on topological line localization and temporal feature aggregation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 536–551. Cited by: §4.2, Table 1.
  • [35] Y. Tian, P. Luo, X. Wang, and X. Tang (2015) Deep learning strong parts for pedestrian detection. In Proceedings of the IEEE international conference on computer vision, pp. 1904–1912. Cited by: §2, §2, §4.2, Table 2.
  • [36] Z. Tian, C. Shen, H. Chen, and T. He (2019-10) FCOS: fully convolutional one-stage object detection. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §3.1, §3.1.
  • [37] J. Wang, K. Chen, S. Yang, C. C. Loy, and D. Lin (2019) Region proposal by guided anchoring. arXiv preprint arXiv:1901.03278. Cited by: §3.2.
  • [38] X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, and C. Shen (2018) Repulsion loss: detecting pedestrians in a crowd. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7774–7783. Cited by: §1, §1, §2, §2, §2, §3.1, §4.2, §4.2, §4.2, §4.2, §4.3, Table 1, Table 2, Table 3.
  • [39] J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang (2016) Unitbox: an advanced object detection network. In Proceedings of the 2016 ACM on Multimedia Conference, pp. 516–520. Cited by: §3.1, Table 3.
  • [40] L. Zhang, L. Lin, X. Liang, and K. He (2016) Is faster r-cnn doing well for pedestrian detection?. In European Conference on Computer Vision, pp. 443–457. Cited by: §1, §4.2, Table 2.
  • [41] S. Zhang, R. Benenson, M. Omran, J. Hosang, and B. Schiele (2016) How far are we from solving pedestrian detection?. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1259–1267. Cited by: §1, §4.1.
  • [42] S. Zhang, R. Benenson, and B. Schiele (2017) Citypersons: a diverse dataset for pedestrian detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 3. Cited by: §1, §2, §3.3, §4.1, §4.2, §4.2, Table 1.
  • [43] S. Zhang, J. Yang, and B. Schiele (2018) Occluded pedestrian detection through guided attention in cnns. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6995–7003. Cited by: §2, §2, §4.2, §4.2, §4.2, Table 1, Table 2.
  • [44] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li (2018) Occlusion-aware r-cnn: detecting pedestrians in a crowd. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 637–653. Cited by: §1, §2, §2, §2, §4.2, §4.2, §4.3, Table 1, Table 3.
  • [45] C. Zhou and J. Yuan (2018) Bi-box regression for pedestrian detection and occlusion estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 135–151. Cited by: §2, §3.2, §4.2, §4.2, §4.2, §4.2, Table 1, Table 2.