NOH-NMS: Improving Pedestrian Detection by Nearby Objects Hallucination

07/27/2020 ∙ by Penghao Zhou, et al. ∙ 0

Greedy-NMS inherently raises a dilemma: a lower NMS threshold potentially leads to a lower recall rate, while a higher threshold introduces more false positives. This problem is more severe in pedestrian detection because the instance density varies more intensively. However, previous works on NMS either ignore or only vaguely consider the existence of nearby pedestrians. Thus, we propose Nearby Objects Hallucinator (NOH), which pinpoints the objects nearby each proposal with a Gaussian distribution, together with NOH-NMS, which dynamically eases the suppression for the space that might contain other objects with a high likelihood. Compared to Greedy-NMS, our method, as the state-of-the-art, improves by 3.9% AP, 5.1% Recall, and 0.8% MR^-2 on CrowdHuman, reaching 89.0% AP, 92.9% Recall, and 43.9% MR^-2, respectively.

1. Introduction

Non-maximum Suppression (NMS) is widely used in proposal-based object detectors (Redmon et al., 2016; Redmon and Farhadi, 2017, 2018; Liu et al., 2016; Fu et al., 2017; Girshick et al., 2014; Girshick, 2015; Ren et al., 2015; Lin et al., 2017a; He et al., 2017; Cai and Vasconcelos, 2018; Dai et al., 2016, 2017; Lin et al., 2017b) as the post-processing step that eliminates redundant detections. Ideally, the proposal with the maximum score should suppress all the other proposals of the same object, and only those. However, NMS distinguishes objects solely by a universal Intersection over Union (IoU) threshold. That is, if two proposals have an IoU above the pre-defined threshold, they are considered to be detecting the same object and one of them is eliminated as a duplicate.
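The IoU criterion described above can be sketched in a few lines of Python. This is a minimal illustration, assuming boxes are given as (x1, y1, x2, y2) corner coordinates; the function name `iou` and the box format are our assumptions, not notation from the paper.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Width and height are clamped to zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Two heavily overlapping pedestrians and two duplicate detections of one pedestrian can produce the same IoU value, which is why a single universal threshold cannot separate the two cases.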

Figure 1. Comparison among various NMS methods. The blue dotted box in (b) shows the mistakenly suppressed detection, which is caused by the NMS threshold dilemma in Greedy-NMS. The red dotted box in (c) highlights the false positive introduced by Adaptive-NMS, as it is unable to pinpoint the overlapping areas. To recall the detection of the boy and suppress the red box, adapting the IoU threshold alone is not enough, and NOH-NMS is designed to fill this gap.

This scheme works well for generic object detection. However, it raises a dilemma in pedestrian detection, where the object density varies a lot. No universal IoU threshold is ideal, because denser regions call for a higher threshold while less crowded regions call for a lower one (see Fig. 1).

Previous work tries to address this issue of the rigid NMS threshold. Soft-NMS (Bodla et al., 2017) proposes to degrade the score of nearby highly overlapped proposals instead of eliminating them, but just like Greedy-NMS, it still blindly penalizes the highly overlapped boxes. Adaptive-NMS (Liu et al., 2019) suggests directly predicting a proper NMS threshold for each proposal. However, even though the proposal can sense the density of the nearby objects, it is not aware of the locations and spread of the crowded regions. This results in a new dilemma, as shown in Fig. 1, where the area to the left of the proposal is not dense at all while the area to its right is rather crowded.

Thus, to tackle this problem, we propose Nearby Objects Hallucinator (NOH) and NOH-NMS. Our key observation is that, in a crowded scene, the visual information inside the bounding box of one pedestrian will mostly contain cues about the locations and sizes of other pedestrians. Therefore, we design NOH, which hallucinates the objects nearby each proposal based on the Region-of-Interest (RoI) feature and represents the hallucination with a Gaussian distribution. Furthermore, we propose NOH-NMS, a novel NMS strategy that leverages this Gaussian distribution.

The proposed NOH and NOH-NMS can be integrated naturally into both one-stage and two-stage object detectors with marginal computation cost, and they require no annotations other than the full-body bounding boxes during training.

To evaluate the effectiveness of our method, we have conducted quantitative and qualitative experiments on the CityPersons (Zhang et al., 2017) and CrowdHuman (Shao et al., 2018) datasets (see Sec. 4). As a result, we achieve state-of-the-art performance with 89.0% AP, 92.9% Recall, and 43.9% MR^-2 on CrowdHuman, and 10.8% MR^-2 on the Reasonable split of CityPersons.

Our contributions can be summarized as follows:

  • We propose NOH-NMS, which is aware of the existence of other nearby objects when performing the suppression, to address the rigid NMS threshold problem in pedestrian detection.

  • We design NOH to pinpoint the objects nearby each proposal with a Gaussian distribution.

  • Our method achieves state-of-the-art performance on CityPersons and CrowdHuman with negligible overhead.

2. Related Work

Over the past decade, deep convolutional neural networks (CNNs) have made great strides in image recognition (He et al., 2016). To adapt an image classifier into an object detector, the current common practice, called the proposal-based object detector, leverages a sliding window to densely predict, for each proposal, a set of category confidence scores and proposal refinement coefficients. These refined proposals are then fed into the NMS algorithm to get rid of the redundant detections. According to the strategy used to generate the proposals, proposal-based object detectors can be classified into one-stage, where proposals are pre-defined anchors, and two-stage, where Region Proposal Networks (RPNs) are used for proposal generation. In addition, great progress has been made in multi-scale handling (Lin et al., 2017a; Liu et al., 2018a), learnable anchors (Wang et al., 2019; Yang et al., 2018), deformable feature sampling (Dai et al., 2017; Zhu et al., 2019), etc.

Even though state-of-the-art generic object detectors show promising performance on benchmark datasets, such as COCO (Lin et al., 2014) and Pascal VOC (Everingham et al., 2015), it is non-trivial to adapt them into the pedestrian detection task, because the occlusion is much more severe and frequent in pedestrian detection datasets.

Occlusion can be divided into two categories, namely inter-class occlusion and intra-class occlusion. In intra-class occlusion scenarios, the pedestrian is occluded by other pedestrians. And the inter-class occlusion results in the partially visible feature of pedestrians mixed with the feature of background objects.

To address the problem of inter-class occlusion, some algorithms (Zhou and Yuan, 2018; Pang et al., 2019; Zhang et al., 2018) seek to leverage the annotated visible bounding box (VBB). Zhou and Yuan (2018) introduce a visible part estimation branch and a new training sample selection strategy assisted by VBB. OR-CNN (Zhang et al., 2018) exploits the topological structure of the pedestrian with visibility prediction for occluded pedestrian detection. To emphasize visible pedestrian regions during feature extraction, MGAN (Pang et al., 2019) proposes an attention module supervised by VBB.

In intra-class occlusion scenarios, the pedestrian is occluded by other pedestrians, which occurs frequently in crowded scenes. The heavy occlusion between pedestrians confuses the models, as it is hard to distinguish instance boundaries. To alleviate this problem, OR-CNN (Zhang et al., 2018) designs an aggregation loss to enforce more compact bounding boxes. In addition, RepLoss (Wang et al., 2018) proposes a novel repulsion loss to prevent the proposal from shifting to surrounding objects.

Though OR-CNN (Zhang et al., 2018) and RepLoss (Wang et al., 2018) successfully ease the localization problem in crowded scenes, an even worse issue remains in the post-processing stage. There, Non-maximum Suppression (NMS) is widely used to suppress false positive proposals (i.e., redundant pedestrian proposals belonging to the same identity). However, NMS may also suppress true positive proposals (i.e., highly overlapped pedestrian proposals belonging to different identities). Therefore, a lower threshold leads to lower recall while a higher threshold results in lower precision.

To address this dilemma, Bodla et al. (2017) propose Soft-NMS, which replaces the elimination operation with decaying the detection scores according to the IoU. Zhang et al. (2019) and Chi et al. (2019) suggest using additional annotated head bounding boxes to solve the problem of NMS in a crowd, as the head parts usually suffer less from occlusion. More recently, Adaptive-NMS (Liu et al., 2019) proposes to predict an adaptive IoU threshold in NMS for each proposal. It aims at predicting a higher NMS threshold if the objects gather together and occlude each other, and a lower NMS threshold if the objects are sparse. However, even if Adaptive-NMS could predict an accurate density for each proposal, a density scalar is not enough to precisely express the spatial locations of the crowded areas. In other words, the proposal is capable of sensing how crowded its surroundings are, but cannot tell whether the area to its left is more crowded than the area to its right. As a result, Adaptive-NMS runs into a new dilemma when different spatial locations around one object call for different IoU thresholds, as shown in Fig. 1.

We observe this inflexibility in Adaptive-NMS and thus propose NOH and NOH-NMS to address this problem. Specifically, for NOH, we design a mini 2-fc branch to predict, for each proposal, not only a density scalar but also a Gaussian distribution which highlights the surrounding objects. In addition, our NOH-NMS leverages the output from NOH as auxiliary information, together with the normal NMS input (detection boxes with class confidence), to perform a nearby-objects-aware NMS.

3. Our Method

In this section, we first briefly recap the previous NMS algorithms (Sec. 3.1). Then we propose our NOH-NMS, which integrates the nearby-objects distribution into the NMS pipeline (Sec. 3.2). In addition, we illustrate how our NOH module learns to predict the nearby-objects distribution just from box-level supervision (Sec. 3.3). Finally, we compare our method with the state-of-the-art NMS counterparts with visualization (Sec. 3.4).

Input:
    the list of initial detection boxes
    the corresponding detection scores
    the corresponding detection densities
    the parameters of the nearby-objects distribution of each corresponding detection
    the NMS threshold
begin
    initialize the output list as empty
    while the list of detection boxes is not empty do
        select the detection with the maximum score
        append it to the output list
        remove it from the list of detection boxes
        for each remaining detection box do
            if its IoU with the selected detection is above the NMS threshold then
                Greedy-NMS: remove the box and its score from the lists
                NOH-NMS: re-score the box with the nearby-objects-aware re-scoring function
            end if
        end for
    end while
    return the output list and its scores
end
Algorithm 1. Pseudo code of NOH-NMS. NOH-NMS replaces the pruning step in Greedy-NMS (highlighted in red in the original layout) with a nearby-objects-aware re-scoring function (marked in green).

3.1. Background

A proposal-based object detection framework consists of the following five stages: (1) extracting full-image-level feature; (2) generating bounding box proposals; (3) extracting proposal-region-level feature; (4) performing classification and box regression for each proposal; (5) removing redundant detections. In this pipeline, the proposals are usually densely arranged and there is no punishment if two or more detections are detecting the same object. Thus, prior to stage 5, it is rather common that one object area is occupied with multiple detections whereas only one of them counts towards true positive, and the rest are considered as false positive.

To avoid the aforementioned problem, Greedy-NMS selects the detection with the maximum score and eliminates the surrounding inferior detections whose IoU with it is above a certain threshold, and then repeats this pruning process with the next best detection, as shown in Fig. 1. The pruning step, as the core of the NMS algorithm, can be formulated as a re-scoring function as follows:

(1)

Here the re-scoring function acts on the confidence score and bounding box coefficients of each inferior detection, which is either left unmodified or completely removed depending solely on its IoU with the maximum-score detection. This introduces two problems. (1) The consequence is too extreme and IoU, as the only metric, is not robust enough, which makes the performance very sensitive to the choice of the NMS threshold. For example, a detection whose IoU lies just above the threshold is eliminated, yet a slight perturbation that moves its IoU just below the threshold lets it survive. (2) No single NMS threshold suits every image. For example, an image crowded with objects might call for a high threshold that is not suitable for an image with a single object.
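The pruning loop and its threshold sensitivity can be illustrated with a short Python sketch. It is a plain re-implementation of the well-known Greedy-NMS procedure under our own assumptions: boxes are (x1, y1, x2, y2) tuples, and the names `greedy_nms` and `iou_thr` are illustrative rather than taken from the paper.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_thr):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # detection with the maximum score
        keep.append(best)
        # Prune every remaining box whose IoU with the best box is above the threshold.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_thr]
    return keep


# Boxes 0 and 1 overlap heavily (IoU = 90/110); box 2 is far away.
boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 0, 30, 10)]
scores = [0.9, 0.8, 0.7]
strict = greedy_nms(boxes, scores, 0.5)   # box 1 is pruned
loose = greedy_nms(boxes, scores, 0.9)    # box 1 survives
```

Whether box 1 is a duplicate or a second pedestrian cannot be decided from the IoU alone, so either threshold is wrong for one of the two interpretations.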

In response to the first problem, Soft-NMS softens the consequence by gradually decaying the score of the overlapped detections instead of eliminating them. Its re-scoring function is:

(2)

where the decaying function is chosen to be:

(3)
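The score-decay idea can be sketched as follows. Because the decaying function is not recoverable from the text above, the sketch assumes the linear decay (1 − IoU), which is one of the two variants described by Bodla et al. (2017); the names `soft_nms_linear` and `min_score` are hypothetical.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms_linear(boxes, scores, iou_thr, min_score=0.0):
    """Return (index, final_score) pairs in the order the boxes are selected."""
    scores = list(scores)                 # work on a copy
    remaining = list(range(len(boxes)))
    kept = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        if scores[best] <= min_score:
            break                         # everything left scores no higher
        kept.append((best, scores[best]))
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            if overlap > iou_thr:
                # Assumed linear decay: overlapped boxes are down-weighted, not removed.
                scores[i] *= 1.0 - overlap
    return kept


boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 0, 30, 10)]
kept = soft_nms_linear(boxes, [0.9, 0.8, 0.7], iou_thr=0.5)
```

In this toy case the overlapped box is kept but its score drops from 0.8 to roughly 0.145, regardless of whether it covers a second pedestrian.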

For the second problem, Adaptive-NMS customizes an NMS IoU threshold for each proposal and follows the design of Greedy-NMS, except that the IoU threshold now varies with the current best detection. Their strategy can be formulated as:

(4)
(5)

where the density term is the density prediction of the current best proposal.
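A minimal sketch of this per-detection threshold is given below. It assumes the threshold is the maximum of the base NMS threshold and the predicted density of the current best detection, as described by Liu et al. (2019); the function and argument names are ours.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def adaptive_nms(boxes, scores, densities, base_thr):
    """Greedy pruning whose threshold adapts to the density of the best box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Assumed form: a crowded best detection raises the pruning threshold.
        thr = max(base_thr, densities[best])
        order = [i for i in order if iou(boxes[best], boxes[i]) <= thr]
    return keep


boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 0, 30, 10)]
scores = [0.9, 0.8, 0.7]
crowded = adaptive_nms(boxes, scores, [0.85, 0.85, 0.0], base_thr=0.5)
sparse = adaptive_nms(boxes, scores, [0.0, 0.0, 0.0], base_thr=0.5)
```

The threshold is a single scalar per detection, so it is relaxed in every direction around the best box at once.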

On carefully revisiting Adaptive-NMS, we find that, due to the maximum function, Adaptive-NMS can be rewritten as a more general case of Soft-NMS:

(6)

where

(7)

As shown in Eq. 2 and Eq. 6, compared to Greedy-NMS, Soft-NMS takes the location of an inferior detection into consideration when suppressing it, and Adaptive-NMS further considers the density around the maximum-score detection. However, neither of them can accurately distinguish whether an inferior detection is detecting a nearby object or is a false positive. Although equipped with density prediction, Adaptive-NMS still cannot tell where the objects around the maximum-score detection are, let alone Soft-NMS.

3.2. NOH-NMS

The key idea of our NOH-NMS is to introduce the nearby-objects distribution, described by a set of parameters, into the NMS pipeline. The nearby-objects distribution could be modeled by any probability density function (PDF), and we cover our choice for generating it in Sec. 3.3. Note that, in the pedestrian detection task, the only object category we care about is human, so the nearby objects mostly refer to nearby pedestrians in this paper. However, our method can also be used in other tasks where the nearby objects are not limited to humans.

Figure 2. Architecture. The illustration of integrating Nearby Objects Hallucinator (NOH) into a two-stage object detector, such as Faster-RCNN (Ren et al., 2015). Note that our NOH can fit in single-stage object detectors as well by placing the branch in parallel with the detection head. In this example, the lady at the front left is highly overlapped with the lady behind her, and our NOH pinpoints the location and shape of the lady behind so that her detection won't be mistakenly suppressed, whereas other false positives will be eliminated.

NOH-NMS consists of two components, namely the overlap detector and the NOH-Suppressor.

Overlap Detector. Since our assumption is that the bounding box area of one pedestrian will mostly contain the cues of other pedestrians, we first need to rule out the cases where the cues are not abundant (e.g., a pedestrian standing alone). Thus, we propose a simple overlap detector, which predicts the IoU between the current detection and the object that overlaps with it the most. If the predicted IoU is less than a density threshold, which we set empirically, then the NOH-Suppressor is not triggered because of insufficient cues, and we follow the design of Greedy-NMS or Soft-NMS.

NOH-Suppressor. If the cues are predicted to be sufficient, we perform NOH-Suppression, which re-scores each neighboring box by multiplying its score by the probability of it being a nearby object. In this way, when a neighboring box matches the attributes of a nearby object, the suppression on it is dynamically eased, whereas if it is very unlikely to be a nearby object, we treat it as detecting the same object as the current detection and degrade its score. We formally describe the difference between NOH-NMS and Greedy-NMS in Algorithm 1. As we only replace the re-scoring function with a Gaussian function, we introduce no additional computational complexity into the NMS pipeline. In addition, since we leverage the mini 2-fc branch to predict both the distribution parameters and the density directly from the RoI feature, the overhead is negligible (see Fig. 2).

In summary, the strategy we adopt can be described as follows:

(8)
(9)

Note that, if the step function in Eq. 7 is used as the PDF, then our NOH-NMS degenerates to Adaptive-NMS. However, the step function is rarely used for modeling natural distributions because (1) it is not continuous, and (2) it is oversimplified. Thus, we propose NOH (Sec. 3.3) to better capture the true nearby-objects distribution using the Gaussian distribution.
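The two-component strategy described in this section can be sketched as below. This is a hedged illustration rather than the authors' implementation: the nearby-object probability is passed in as a caller-supplied function `nearby_prob(best, i)`, the fallback is taken to be Greedy-NMS pruning, and all names (`noh_nms`, `density_thr`, `nearby_prob`) are assumptions.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def noh_nms(boxes, scores, densities, nearby_prob, iou_thr, density_thr):
    """Return (index, final_score) pairs in selection order."""
    scores = list(scores)
    remaining = list(range(len(boxes)))
    kept = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        kept.append((best, scores[best]))
        for i in remaining:
            if iou(boxes[best], boxes[i]) <= iou_thr:
                continue                      # not overlapped enough to be touched
            if densities[best] < density_thr:
                scores[i] = 0.0               # overlap detector: too few cues, prune greedily
            else:
                # NOH-Suppressor: scale by the probability of being a nearby object.
                scores[i] *= nearby_prob(best, i)
        remaining = [i for i in remaining if scores[i] > 0.0]
    return kept


boxes = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 0, 30, 10)]
scores = [0.9, 0.8, 0.7]
# Hypothetical hallucination: box 1 is very likely a real neighbor of box 0.
prob = lambda best, i: 0.9
eased = noh_nms(boxes, scores, [0.8, 0.8, 0.0], prob, iou_thr=0.5, density_thr=0.5)
pruned = noh_nms(boxes, scores, [0.0, 0.0, 0.0], prob, iou_thr=0.5, density_thr=0.5)
```

Swapping `nearby_prob` for a step function of the IoU would reduce this loop to an adaptive-threshold scheme, which mirrors the degenerate case noted above.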

3.3. Nearby Objects Hallucinator (NOH)

NOH is responsible for generating the nearby-objects distribution for each detection. We achieve this by hallucinating the locations and shapes of the nearby objects from the cues in the region of the detection, and expressing the hallucination with a Gaussian distribution. We call this process hallucination because, unlike proposal-based instance recognition, which predicts box coefficients from the proper RoI feature, our NOH can rely only on partially visible cues.

Essentially, based on the features extracted from the region of a detection, multiple hallucinated objects could be proposed. However, for simplicity, we only capture the one nearby object that overlaps with the detection the most. We represent the hallucinated object by its center location, width, and height relative to the detection. Since the hallucinated object is predicted from partially visible cues, the prediction is expected to be imprecise. Thus, we decay the nearby-objects likelihood with a Gaussian distribution that is centered at the hallucinated object and whose spread is controlled by a hyper-parameter, the Gaussian standard deviation.

With all the definitions above, our NOH applies the following strategy:
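One way to picture the Gaussian decay is the sketch below. Several choices here are assumptions made only for illustration: the relative encoding follows the common R-CNN style (center offsets normalized by the reference size, log size ratios), the Gaussian is isotropic with a single `sigma`, and it is left unnormalized so that the likelihood peaks at 1.

```python
import math


def relative_encoding(ref, box):
    """Encode `box` relative to `ref`; both are (x1, y1, x2, y2)."""
    rw, rh = ref[2] - ref[0], ref[3] - ref[1]
    bw, bh = box[2] - box[0], box[3] - box[1]
    rcx, rcy = ref[0] + rw / 2.0, ref[1] + rh / 2.0
    bcx, bcy = box[0] + bw / 2.0, box[1] + bh / 2.0
    # Assumed encoding: normalized center shift and log size ratio.
    return ((bcx - rcx) / rw, (bcy - rcy) / rh, math.log(bw / rw), math.log(bh / rh))


def nearby_likelihood(ref, box, mu, sigma):
    """Unnormalized Gaussian likelihood that `box` is the hallucinated neighbor of `ref`."""
    z = relative_encoding(ref, box)
    sq_dist = sum((a - b) ** 2 for a, b in zip(z, mu))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


ref = (0, 0, 10, 20)
mu = (0.5, 0.0, 0.0, 0.0)        # hypothetical hallucination: same size, half a width to the right
neighbor = nearby_likelihood(ref, (5, 0, 15, 20), mu, sigma=0.2)
duplicate = nearby_likelihood(ref, (0, 0, 10, 20), mu, sigma=0.2)
```

A box that matches the hallucinated neighbor keeps a likelihood near 1, while a box sitting on top of the reference detection is pushed toward 0, which is the directional behavior a single density scalar cannot express.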

(10)
(11)
Figure 3. Visualization of the suppression degree. The suppression degree is a function of the relative center location and relative shape of two boxes, resulting in four degrees of freedom. To visualize it in 2-d space, we unify the shapes of all the boxes so that each box can be represented by its center point. The detection score is attached to the corner of the box. The color map shows to what extent the detection with the maximum score (blue box) suppresses its surrounding inferior detections. For instance, the center point of the green box in (d) lies in the red area (keeping area), meaning it is very likely to survive the suppression, whereas the red box will be penalized harshly as its center sits in the blue area (suppressing area).

We implement NOH with a prediction head in parallel with the classification and regression heads of Faster-RCNN. The training target of NOH is derived from the relative box coefficients of the nearest object with respect to the detection, and we use the Smooth-L1 loss as the training loss. Note that the Gaussian function is not represented during training. However, we could convert the training target from the relative box coefficients into a Dirac delta function and supervise it with KL Loss (He et al., 2019). In this paper, we keep the training process simple and stick with the Smooth-L1 loss, as we find it works as expected.
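The target selection and loss described here can be sketched as follows. The helper names, the `beta` transition point of the Smooth-L1 loss, and the rule of skipping proposals with no overlapping neighbor are illustrative assumptions, not details confirmed by the paper.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def most_overlapping(proposal, gt_boxes, matched_idx):
    """Index of the ground-truth box (other than the matched one) with the highest IoU."""
    best_idx, best_iou = None, 0.0
    for idx, gt in enumerate(gt_boxes):
        if idx == matched_idx:
            continue                      # skip the object the proposal itself detects
        overlap = iou(proposal, gt)
        if overlap > best_iou:
            best_idx, best_iou = idx, overlap
    return best_idx                       # None when nothing overlaps the proposal


def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 loss summed over the box coefficients."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total


gts = [(0, 0, 10, 10), (1, 0, 11, 10), (20, 0, 30, 10)]
neighbor_idx = most_overlapping((0, 0, 10, 10), gts, matched_idx=0)
loss = smooth_l1([0.0, 0.0, 0.0, 0.0], [0.5, 0.0, 0.0, 2.0])
```

The relative box coefficients of the selected neighbor would then serve as the regression target for the hallucination head.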

3.4. Comparisons with other NMS Strategies

To better understand the differences among our NMS strategy and those proposed by other methods, we visualize the suppression effect of the maximum-score detection on the other overlapping detections in Fig. 3. According to the figure, Greedy-NMS harshly eliminates the detections around the maximum-score detection, and Soft-NMS gradually adds a keeping area. Adaptive-NMS, on the other hand, adds a keeping area with a harsher boundary, as a result of using the step function, but the proportion of this area adapts to the pedestrian density. Note that when Soft-NMS and Adaptive-NMS are combined, the keeping area is both continuous and adaptive. However, none of the aforementioned methods can shift the center of the keeping area, because they don't explicitly predict the distribution of the nearby pedestrians, whereas our method places the keeping area more accurately thanks to the NOH module.

4. Experiments

In this section, we first cover the datasets and metrics that we use for all the experiments (Sec. 4.1). We then give our implementation details in Sec. 4.2 and show quantitative results of NOH-NMS compared to various NMS methods (Sec. 4.3). We also conduct a sensitivity analysis (Sec. 4.4) to demonstrate the robustness of our method. Qualitative results are provided in Sec. 4.5 for better visualization.

4.1. Datasets and Metrics

CityPersons. CityPersons (Zhang et al., 2017) is currently a widely used benchmark dataset for the pedestrian detection task. Based on the images in the Cityscapes (Cordts et al., 2016) dataset, CityPersons creates more fine-grained bounding box annotations dedicated to pedestrian detection. In total, CityPersons provides thousands of person and ignore-region (fake humans such as statues) annotations. In addition, CityPersons aims at including persons with heavy occlusion and small scale, which raises the average number of persons per image.

CrowdHuman. CrowdHuman (Shao et al., 2018) was released more recently and further emphasizes the crowd issue. It also provides thousands of person and ignore-region annotations. The person density is significantly higher than in CityPersons, and each image contains many pairwise overlapping instances.

Evaluation metrics

We follow the evaluation metrics used in CityPersons and CrowdHuman, denoted as MR^-2, AP, and Recall:

  • MR^-2, or log-average Miss Rate over a range of False Positives Per Image (FPPI), is commonly used to evaluate detectors whose applications have an upper limit on the acceptable FPPI rate independent of object density. Thus, MR^-2 is particularly sensitive to false positives.

  • Average Precision (AP) is the most popular metric in generic object detection, which summarizes the precision-recall curve of the detection results. In the following experiments, we follow the AP metric in PASCAL VOC (Everingham et al., 2015), where a prediction is positive if its IoU with the ground truth exceeds the required threshold.

  • Recall is short for the maximum recall given a fixed number of detections. As Soft-NMS, Adaptive-NMS, and NOH-NMS all aim at recalling the mis-eliminated true positives, as shown in Fig. 3, this metric reflects the effectiveness of this intention. For fair comparisons, we allow the same number of detections for all NMS methods.
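As an illustration of the first metric, the sketch below computes a log-average miss rate from a miss-rate-versus-FPPI curve. It assumes the convention of the Caltech pedestrian benchmark, namely nine reference points evenly spaced in log-space over FPPI in [10^-2, 10^0], and it assigns a miss rate of 1.0 to reference points that the curve does not reach; both conventions and all names are assumptions, since the exact range is not recoverable from the text above.

```python
import math


def log_average_miss_rate(fppi, miss_rate, num_points=9):
    """Geometric mean of the miss rate sampled at log-spaced FPPI reference points.

    `fppi` must be sorted in ascending order and aligned with `miss_rate`.
    """
    refs = [10 ** (-2 + 2 * k / (num_points - 1)) for k in range(num_points)]
    samples = []
    for ref in refs:
        # Miss rate at the largest FPPI that does not exceed the reference point.
        reached = [m for f, m in zip(fppi, miss_rate) if f <= ref]
        samples.append(reached[-1] if reached else 1.0)
    # Averaging in log-space; the floor avoids log(0) for a perfect detector.
    log_sum = sum(math.log(max(m, 1e-10)) for m in samples)
    return math.exp(log_sum / len(samples))


flat = log_average_miss_rate([0.001, 0.1, 1.0], [0.5, 0.5, 0.5])
late = log_average_miss_rate([0.5], [0.25])
```

Because most reference points sit at low FPPI, a detector that only reaches a low miss rate by emitting many false positives is penalized, which is why the metric is sensitive to false positives.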

Methods | Extra Anno. | Backbone | Scale | Reasonable | Bare | Partial | Heavy
OR-CNN (Zhang et al., 2018) | | VGG-16 | | 11.0 | 5.9 | 13.7 | 51.3
MGAN (Pang et al., 2019) | | VGG-16 | | 10.5 | - | - | 47.2
JointDet (Chi et al., 2019) | | ResNet-50 | | 10.2 | - | - | -
TLL (MRF) (Song et al., 2018) | | ResNet-50 | - | 14.4 | - | - | -
Adapted Faster RCNN (Zhang et al., 2017) | | VGG-16 | | 13.0 | - | - | -
ALFNet (Liu et al., 2018b) | | VGG-16 | | 12.0 | 8.4 | 11.4 | 51.9
RepLoss (Wang et al., 2018) | | ResNet-50 | | 11.6 | 7.0 | 14.8 | 55.3
Adaptive-NMS w/ AggLoss (Liu et al., 2019) | | VGG-16 | | 10.8 | 6.2 | 11.4 | 54.0
Our baseline | | ResNet-50 | | 11.9 | 7.4 | 12.3 | 53.0
NOH-NMS | | ResNet-50 | | 10.8 | 6.6 | 11.2 | 53.0

Table 1. Performance on the CityPersons validation set. MR^-2 is used as the metric (lower is better). Scale is short for input scale.

4.2. Implementation Details

For all the experiments, we adopt Faster-RCNN (Ren et al., 2015) with FPN (Lin et al., 2017a) as our baseline and build the various NMS methods upon the same baseline for fair comparisons. Specifically, we choose the standard ResNet-50 (He et al., 2016) as the backbone and replace the RoIPooling operation in the original Faster-RCNN with RoIAlign (He et al., 2017). We also change the aspect ratios of the anchors separately for CrowdHuman and CityPersons, as the original anchor settings are optimized for COCO (Lin et al., 2014). Following the choice of input size in (Zhang et al., 2017) and (Shao et al., 2018), we enlarge the input height and width of CityPersons by a fixed factor and resize the input of CrowdHuman so that the shorter edge equals 800 pixels while the longer edge is no longer than 1,400 pixels.
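The CrowdHuman resizing rule can be written down directly. The sketch below computes the scale factor under the stated constraints (shorter edge 800 pixels, longer edge at most 1,400 pixels); the function name and argument names are ours.

```python
def resize_scale(height, width, short_edge=800, max_long_edge=1400):
    """Scale factor that brings the shorter edge to `short_edge` without
    letting the longer edge exceed `max_long_edge`."""
    scale = short_edge / min(height, width)
    if max(height, width) * scale > max_long_edge:
        # The longer edge would overflow, so it becomes the binding constraint.
        scale = max_long_edge / max(height, width)
    return scale


regular = resize_scale(600, 800)      # shorter edge 600 -> 800
panorama = resize_scale(500, 2000)    # longer edge capped at 1400
```

For a 600 x 800 image the shorter edge governs the scale, while for a 500 x 2000 image the cap on the longer edge takes over and the shorter edge ends up below 800 pixels.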

During training, we randomly initialize all the parameters of the model with Kaiming initialization (He et al., 2015), except for the ResNet-50 backbone, whose initial parameters are loaded from ImageNet (Russakovsky et al., 2015) pre-training. We use SGD with momentum and weight decay as the optimizer and train the model for a dataset-specific number of iterations on CityPersons and CrowdHuman. The learning rate is decreased step-wise by a constant factor at dataset-specific iterations. The same batch size is used for both datasets. Note that we train on 8 GPUs without Synchronized BN.

For CityPersons, a sample is assigned as positive if its IoU with the ground truth is greater than an upper threshold, and as negative if the IoU is less than a lower threshold; otherwise the sample is ignored and does not contribute to the loss. For CrowdHuman, samples with an IoU greater than a single threshold qualify as positive and all others are considered negative. In addition, we clip the ground-truth bounding boxes at the image boundary for CityPersons, but do not apply this operation to CrowdHuman.
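The two assignment rules can be expressed with one small helper. The threshold values used in the example calls are purely illustrative placeholders, since the actual values are not recoverable from the text above; the function name and the label codes are also our own.

```python
POSITIVE, NEGATIVE, IGNORED = 1, 0, -1


def assign_label(max_iou, pos_thr, neg_thr=None):
    """Label a sample from its highest IoU with any ground-truth box.

    With `neg_thr` set, samples between the two thresholds are ignored
    (CityPersons-style rule). With `neg_thr=None`, every non-positive
    sample is negative (CrowdHuman-style rule).
    """
    if max_iou > pos_thr:
        return POSITIVE
    if neg_thr is None or max_iou < neg_thr:
        return NEGATIVE
    return IGNORED


# Placeholder thresholds, chosen only to exercise each branch.
two_sided = [assign_label(v, pos_thr=0.7, neg_thr=0.3) for v in (0.8, 0.5, 0.1)]
one_sided = [assign_label(v, pos_thr=0.5) for v in (0.6, 0.4)]
```

Ignored samples are simply excluded when the loss is accumulated.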

During inference, we use the same NMS IoU threshold for all NMS methods and cap the number of detections per image. We also apply the same input resizing operation as in the training stage.

Methods | Backbone | AP | Recall | MR^-2
Repulsion Loss (Wang et al., 2018) | R50 | - | - | 45.7
JointDet* (Chi et al., 2019) | R50 | - | - | 46.5
Baseline in (Liu et al., 2019) | R50 | 83.0 | 90.6 | 52.4
Adaptive-NMS (Liu et al., 2019) | R50 | 84.7 | 91.3 | 49.7
Our Baseline | R50 | 85.1 | 87.8 | 44.7
NOH-NMS | R50 | 89.0 | 92.9 | 43.9

Table 2. Performance on the CrowdHuman validation set. R50 denotes ResNet-50. * marks the methods which leverage extra annotations (e.g. head box) during training.

Methods | IoU threshold | AP | Recall | MR^-2
Greedy-NMS | 0.5 | 85.1 | 87.8 | 44.7
Soft-NMS (Bodla et al., 2017) | 0.5 | 86.4 | 90.6 | 44.6
Adaptive-NMS (Liu et al., 2019) | 0.5 | 87.1 | 89.2 | 45.0
NOH-NMS | 0.5 | 89.0 | 92.9 | 43.9

Table 3. Comparison of different NMS methods on the CrowdHuman validation set. All the methods are implemented by us, and for fair comparisons, we show the best results from multiple runs.

4.3. Results

CityPersons. We report the results of NOH-NMS and other state-of-the-art pedestrian detectors on the CityPersons validation set in Tab. 1. According to the level of occlusion, CityPersons has four splits, namely Bare, Partial, Reasonable, and Heavy, which are defined by the ratio of the visible part. Within the group of methods that don't use extra annotations, NOH-NMS achieves the best performance on the Reasonable split, which is the most valued, and on the Partial split. Moreover, our performance is comparable to that of the methods using additional annotations (e.g. head bounding boxes, visible bounding boxes).

CrowdHuman. Tab. 2 shows the performance on the CrowdHuman validation set. For a comprehensive evaluation, we use three metrics: AP, Recall, and MR^-2. We re-implement a strong FPN (Lin et al., 2017a) baseline. Our baseline achieves 85.1% AP, 87.8% Recall, and 44.7% MR^-2, which outperforms the baseline in Adaptive-NMS (Liu et al., 2019) by 2.1% AP and 7.7% MR^-2. Even compared to our strong Greedy-NMS baseline, NOH-NMS still significantly improves AP, Recall, and MR^-2 by 3.9%, 5.1%, and 0.8%, respectively. Moreover, its superior performance compared to other state-of-the-art methods demonstrates the effectiveness of our method.

To better demonstrate that our performance gain does not come from the strong baseline, and to show more clearly the advantage of NOH-NMS over its counterparts, we re-implement Soft-NMS and Adaptive-NMS on our strong baseline. The results are shown in Tab. 3. According to the results, NOH-NMS still delivers the best performance across all the evaluation metrics.

Figure 4. Precision vs. Recall at multiple NMS IoU thresholds. Experiments are conducted on the CrowdHuman validation set and all the NMS methods are implemented by us based on the same baseline.

Figure 5. Sensitivity to hyper-parameters. We show the effect of different choices of the density threshold and the Gaussian standard deviation on NOH-NMS. All the experiments are done on the CrowdHuman validation set.

4.4. Sensitivity Analysis

Although NOH-NMS introduces two more hyper-parameters (the density threshold and the Gaussian standard deviation) than the other NMS methods, as we analyze below, it is not only robust to the choice of both, but also less sensitive to the common hyper-parameter, the NMS IoU threshold, than other NMS methods.

IoU threshold. As shown in Fig. 4, we plot the precision vs. recall curves at various NMS IoU thresholds for both Greedy-NMS and NOH-NMS. We draw two conclusions from the figure. (1) Even though both methods degrade with a sub-optimal IoU threshold, NOH-NMS is less sensitive, as it outperforms Greedy-NMS at all recall levels across all choices of the threshold. (2) Simply relaxing the IoU threshold for Greedy-NMS does recall more true positives, but it also introduces even more false positives, which degrade the overall performance.

Density threshold. As one of the additional hyper-parameters we introduce, the density threshold determines how much evidence is needed before a detection is considered to have abundant cues supporting the existence of other nearby pedestrians. As shown in Fig. 5, AP, Recall, and MR^-2 vary only slightly over a wide range of density thresholds, which demonstrates the robustness of NOH-NMS.

Gaussian standard deviation. The Gaussian standard deviation controls the spread of the Gaussian distribution we use in NOH. Even though we set it empirically to a fixed value in our previous experiments, it proves not to be very sensitive, as illustrated in Fig. 5. Note that, if KL Loss is used during training, the standard deviation can be trained end-to-end and will no longer be a hyper-parameter. However, we leave this as future work since it is not the focus of this paper.

4.5. Qualitative Results

Qualitative results are given in two aspects: (1) detections visualization compared with Greedy-NMS and Adaptive-NMS (Fig. 6); (2) illustration of the effectiveness of the nearby objects hallucination (Fig. 7).

As shown in Fig. 6, our NOH-NMS successfully recalls the highly overlapped detections that other methods miss. Moreover, in Fig. 7, NOH works as expected, pinpointing the nearby persons with a reasonable Gaussian distribution, which contributes significantly to easing the suppression on the highly overlapped areas.

5. Conclusion

In this paper, we present a novel algorithm that improves the performance of pedestrian detection by taking into account the distribution of nearby objects. As the core part of our algorithm, NOH learns to predict the Gaussian distribution of nearby objects from only full-body box annotations and introduces marginal overhead. Comprehensive experiments and analyses on CityPersons (Zhang et al., 2017) and CrowdHuman (Shao et al., 2018) show the strength of our method.

Figure 6. Qualitative results. Evaluation results on the CrowdHuman validation set. The NMS IoU threshold is set to the same value for all the methods. The dotted boxes show the missing detections.

Figure 7. The visualization of the nearby objects hallucination results. NOH models the distribution of nearby objects with a 4-d Gaussian whose mean represents the expectation of the location and shape of the nearest object (shown in the dotted blue box). The variance of the 2-d translation of the center points is illustrated in red (we don't show the shape variance). The green boxes show the detections for which the nearby object is hallucinated.

References

  • N. Bodla, B. Singh, R. Chellappa, and L. S. Davis (2017) Soft-nms–improving object detection with one line of code. In Proceedings of the IEEE international conference on computer vision, pp. 5561–5569. Cited by: §1, §2, Table 3.
  • Z. Cai and N. Vasconcelos (2018) Cascade r-cnn: delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6154–6162. Cited by: §1.
  • C. Chi, S. Zhang, J. Xing, Z. Lei, S. Z. Li, and X. Zou (2019) Relational learning for joint head and human detection. arXiv preprint arXiv:1909.10674. Cited by: §2, Table 1, Table 2.
  • M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3213–3223. Cited by: §4.1.
  • J. Dai, Y. Li, K. He, and J. Sun (2016) R-fcn: object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pp. 379–387. Cited by: §1.
  • J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 764–773. Cited by: §1, §2.
  • M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2015) The pascal visual object classes challenge: a retrospective. International journal of computer vision 111 (1), pp. 98–136. Cited by: §2, 2nd item.
  • C. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg (2017) Dssd: deconvolutional single shot detector. arXiv preprint arXiv:1701.06659. Cited by: §1.
  • R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §1.
  • R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §1.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1, §4.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034. Cited by: §4.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2, §4.2.
  • Y. He, C. Zhu, J. Wang, M. Savvides, and X. Zhang (2019) Bounding box regression with uncertainty for accurate object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2888–2897. Cited by: §3.3.
  • T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017a) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §1, §2, §4.2, §4.3.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017b) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §1.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §2, §4.2.
  • S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia (2018a) Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.
  • S. Liu, D. Huang, and Y. Wang (2019) Adaptive nms: refining pedestrian detection in a crowd. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6459–6468. Cited by: §1, §2, §4.3, Table 1, Table 2, Table 3.
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §1.
  • W. Liu, S. Liao, W. Hu, X. Liang, and X. Chen (2018b) Learning efficient single-stage pedestrian detectors by asymptotic localization fitting. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 618–634. Cited by: Table 1.
  • Y. Pang, J. Xie, M. H. Khan, R. M. Anwer, F. S. Khan, and L. Shao (2019) Mask-guided attention network for occluded pedestrian detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4967–4975. Cited by: §2, Table 1.
  • J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §1.
  • J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263–7271. Cited by: §1.
  • J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §1.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1, Figure 2, §4.2.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §4.2.
  • S. Shao, Z. Zhao, B. Li, T. Xiao, G. Yu, X. Zhang, and J. Sun (2018) Crowdhuman: a benchmark for detecting human in a crowd. arXiv preprint arXiv:1805.00123. Cited by: §1, §4.1, §4.2, §5.
  • T. Song, L. Sun, D. Xie, H. Sun, and S. Pu (2018) Small-scale pedestrian detection based on topological line localization and temporal feature aggregation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 536–551. Cited by: Table 1.
  • J. Wang, K. Chen, S. Yang, C. C. Loy, and D. Lin (2019) Region proposal by guided anchoring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2965–2974. Cited by: §2.
  • X. Wang, T. Xiao, Y. Jiang, S. Shao, J. Sun, and C. Shen (2018) Repulsion loss: detecting pedestrians in a crowd. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7774–7783. Cited by: §2, §2, Table 1, Table 2.
  • T. Yang, X. Zhang, Z. Li, W. Zhang, and J. Sun (2018) Metaanchor: learning to detect objects with customized anchors. In Advances in Neural Information Processing Systems, pp. 320–330. Cited by: §2.
  • K. Zhang, F. Xiong, P. Sun, L. Hu, B. Li, and G. Yu (2019) Double anchor r-cnn for human detection in a crowd. arXiv preprint arXiv:1909.09998. Cited by: §2.
  • S. Zhang, R. Benenson, and B. Schiele (2017) Citypersons: a diverse dataset for pedestrian detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3221. Cited by: §1, §4.1, §4.2, Table 1, §5.
  • S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li (2018) Occlusion-aware r-cnn: detecting pedestrians in a crowd. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 637–653. Cited by: §2, §2, §2, Table 1.
  • C. Zhou and J. Yuan (2018) Bi-box regression for pedestrian detection and occlusion estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 135–151. Cited by: §2.
  • X. Zhu, H. Hu, S. Lin, and J. Dai (2019) Deformable convnets v2: more deformable, better results. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9308–9316. Cited by: §2.