Which to Match? Selecting Consistent GT-Proposal Assignment for Pedestrian Detection

03/18/2021 · by Yan Luo, et al.

Accurate pedestrian classification and localization have received considerable attention due to their wide applications such as security monitoring and autonomous driving. Although pedestrian detectors have made great progress in recent years, the fixed Intersection over Union (IoU) based assignment-regression scheme still limits their performance. Two main factors are responsible for this: 1) the IoU threshold faces a dilemma, as a lower one results in more false positives while a higher one filters out well-matched positives; 2) the IoU-based GT-Proposal assignment suffers from an inconsistent supervision problem, in which spatially adjacent proposals with similar features are assigned to different ground-truth boxes; such proposals may thus be forced to regress towards different targets, which confuses the bounding-box regression when predicting location results. In this paper, we first raise the question of whether the regression direction affects performance in pedestrian detection. We then address the weakness of IoU by introducing a geometrically sensitive search algorithm as a new assignment and regression metric. Unlike the previous IoU-based one-to-one assignment of one proposal to one ground-truth box, the proposed method seeks a reasonable matching between the set of proposals and the set of ground-truth boxes. Specifically, we improve MR-FPPI under R_75 by 8.8% on the Citypersons dataset. Furthermore, by incorporating this method as a metric into state-of-the-art pedestrian detectors, we show a consistent improvement.


1 Introduction

Figure 1: IoU-based GT-Proposal assignment may result in inconsistent regression supervision: the two proposals in the left picture, which have close locations, similar areas, and similar features, are assigned to two very different ground-truth boxes. This inconsistent supervision may confuse the regression training, so a low-quality regression box (right picture) is predicted. With the proposed DRNet, the proposal is instead assigned to the ground-truth box with the smaller depth cost, and a correct predicted box (right picture) is obtained.

Pedestrian detection is a key problem in a number of real-world applications, including autonomous driving and surveillance systems, and is required to have both high classification and localization accuracy. Driven by the success of general object detection, many of the proposed pedestrian detectors [27, 11, 14, 6] still follow the basic practice of frameworks such as Faster R-CNN [16] and SSD [10]. Therefore, Intersection over Union (IoU), which is required to assign proposals, has been an indispensable metric in recent frameworks.

However, the commonly used IoU metric has two main drawbacks. (1) On one hand, it is difficult to set a proper IoU threshold during training. A lower IoU threshold (e.g., 0.5) keeps an adequate number of positive samples but results in many "close but not correct" false positives during inference [11]. A relatively higher threshold (e.g., 0.7) rejects low-quality proposals but removes a large number of matched positives. Although Cascade R-CNN [3] and ALFNet [11] provide solutions that gradually refine the proposals over several stages with a set of IoU thresholds, the hand-crafted IoU threshold, whether single or multiple, is still not the best choice. (2) On the other hand, the IoU-based GT-Proposal assignment suffers from the inconsistent supervision problem: spatially adjacent proposals with similar features are assigned to different ground-truth boxes, so some very similar proposals may be forced to regress towards different targets, which confuses the bounding-box regression when predicting location results. This disadvantage is more prominent in scenes with large scale variation and occlusion. As illustrated in Figure 1, overlapping proposals with almost the same features are assigned to different ground-truth boxes, which causes confusion during training and thus brings about low-quality regression results during inference (Figure 1, red solid box).
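To make the inconsistency concrete, the following minimal numpy sketch (the coordinates are hypothetical, not from the paper) shows how argmax-IoU assignment can hand two heavily overlapping proposals to two different ground-truth boxes:

```python
import numpy as np

def iou(box_a, box_b):
    # Boxes are [x1, y1, x2, y2]; returns intersection over union.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two adjacent pedestrians and two almost identical proposals between them.
gts = np.array([[100, 50, 140, 150], [128, 48, 168, 152]])
proposals = np.array([[112, 49, 152, 151], [115, 49, 155, 151]])

for p in proposals:
    ious = [iou(p, g) for g in gts]
    print(p, "-> GT", int(np.argmax(ious)), np.round(ious, 3))
# The near-duplicate proposals land on different GTs (GT 0 vs. GT 1), so the
# regressor receives contradictory targets for nearly identical features.
```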

The above analyses motivate us to address the weakness of the IoU-based GT-Proposal matching mechanism and to propose a new assignment and regression metric for pedestrian detection. Different from the previous one-to-one assignment of one proposal to one ground-truth box, we conduct the assignment as a set-to-set process of finding a reasonable matching between the two sets (proposals and ground-truth boxes). This searching-based method does not depend on a fixed hyper-parameter (e.g., an IoU threshold); instead, it constantly searches for matched pairs along the dynamic training procedure. Furthermore, the search algorithm contains a cost function that answers the question of "which to match" for each proposal. Motivated by the observation that the distribution of pedestrian scales is highly connected with the depth-variant direction, the introduced cost function starts from depth estimation and gradually assigns all proposals step by step, pushing more proposals towards ground-truth boxes with smaller depth (scale) variance. In this way, the regressor is optimized towards a consistent location direction: regress in the direction with smaller depth variance. On top of this, to the best of our knowledge we are the first to point out that "Regression needs Direction", and a novel pedestrian detection architecture is thus constructed, denoted as the Directional Regression Network (DRNet).

To demonstrate the generality of the proposed depth-guided search algorithm, we evaluate various CNN-based pedestrian detectors, including OR-CNN [29], PBM [6] and CSP [12], on both the Citypersons [26] and Caltech [5] datasets. To sum up, the main contributions of this work are:

(1) We are the first to raise the inconsistent supervision problem of the IoU-based assignment mechanism, and we propose a search algorithm as a new assignment metric that is both dynamic and direction-sensitive during the training procedure.

(2) We are the first to point out that "Regression needs Direction" and propose a directional regression network, named DRNet, to tackle the challenging problem of large scale variation for accurate pedestrian localization.

(3) We achieve state-of-the-art accuracy on widely used datasets including Caltech and Citypersons, improving by 8.8% under the challenging R_75 setting on the Citypersons dataset. We incorporate the proposed method into popular pedestrian detection algorithms such as CSP [12] and PBM [6], and show further performance improvement.

Figure 2: An example of the matching process of the proposed method. The second row shows the matching cost of different proposals to GTs. In each step, the ground-truth box is assigned several proposals with minimal matching cost. Besides, if any conflict exists, meaning the proposal has already been assigned to one ground-truth box, the matching process is rerun. The detailed procedure is shown in Algorithm 1.

2 Related Work

2.1 Pedestrian Detection

Driven by the success of general object detection, many pedestrian detectors follow the anchor-based/anchor-free paradigm. Specifically, anchor-based detectors are proposed in two-stage/one-stage frameworks. RPN+BF [25] improves the performance of Faster R-CNN for pedestrian detection with an RPN followed by shared boosted forests. MS-CNN [2] introduces various receptive fields that match different object scales to tackle the multi-scale problem in pedestrian detection. Recently, methods such as PBM [6] use both part and full-body features to more accurately classify and localize pedestrians, especially in crowded scenes. However, the aforementioned detectors focus more on feature representation, and less attention is paid to the inherent drawbacks of the IoU-based assignment method during the training procedure.

2.2 Increment-based Label Assignment

Most recently, and similar to our intuition, some researchers have attempted to revisit the role of IoU for object detection. First, methods such as Cascade R-CNN [3] and ALFNet [11] try to replace the single threshold with an incremental IoU setting. Although these detectors provide solutions that gradually refine the proposals, they are still not free from hand-crafted settings and are somewhat inflexible during training. A progressive network [21] has also been proposed with a three-phase progression that gradually refines anchors following the human annotation process. However, the problems are also obvious. On the one hand, although the gradual strategy brings performance improvements, it also doubles the computational cost. On the other hand, some incremental settings are not flexible; for example, Cascade R-CNN and ALFNet both use manually defined IoU thresholds (e.g., 0.5, 0.6 and 0.7) to measure quality.

2.3 Statistics-based Label Assignment

Different from previous methods that choose a fixed number of best-scoring anchors, statistics-based label assignment detectors model the anchor selection procedure as likelihood estimation for a probability distribution. ATSS [28] proposes an adaptive training sample selection method that uses the sum of the mean and standard deviation of IoU as the IoU threshold. PAA [7] proposes a probabilistic anchor assignment strategy that adaptively separates a set of anchors into positive and negative samples for a GT box according to the learning status of the model associated with it. These methods take a step towards more accurate assignment strategies. Nevertheless, to the best of our knowledge, they are still caught in IoU-based assignment, which means they do not escape from the inconsistency problem our work raises.

2.4 Prior-based Label Assignment

To relieve the extra computation cost, some works introduce human priors to improve label assignment. FreeAnchor [22] proposes a detector that guarantees at least one anchor's prediction is close to the ground truth for each object. GuidedAnchoring [19] and MetaAnchor [23] try to dynamically change the shape of anchors to fit various distributions of objects. The aforementioned methods, whether prior-related (e.g., GuidedAnchoring) or instance-related (e.g., FreeAnchor), ignore the inconsistent supervision problem that exists during the assignment process. Therefore, they are still not free from the confusion caused by "similar proposals, different regression targets", which brings about false regression results during inference. Compared to theirs, our method is not only dynamic and flexible but also ensures the consistency of matching, so it better answers the question of "which to match" and thus shows further improvement for accurate pedestrian localization. On top of the above, combined with depth estimation (z-axis), our work is also the first attempt to solve the assignment problem from a 3D perspective, in contrast to other 2D assignments (x-y axes), e.g., ATSS [28] or PAA [7].

3 Proposed Method

Generally speaking, the proposed method consists of a search algorithm and a depth-guided cost function that together solve the main task: "which to match" for each proposal. The overall process is shown in Figure 2.

3.1 How to Design Depth-guided Assignment?

"Which to Match" is an important question, especially for pedestrian detection. It is essentially a matching problem: how to match reasonable ground-truth boxes to each proposal so as to push the network training in a better direction. From the above analysis, the IoU metric performs a one-to-one assignment of one proposal to one ground-truth box, but brings about the inconsistent supervision problem during training. Different from IoU, we regard the assignment as the process of finding a reasonable matching between two sets (proposals and ground-truth boxes). We formulate this process in the following. The ground-truth boxes and proposals are denoted as $\mathcal{G} = \{g_i\}_{i=1}^{N}$ and $\mathcal{P} = \{p_j\}_{j=1}^{M}$, in which $N$ and $M$ represent the number of ground-truth boxes and proposals, respectively.

Figure 3: An example of the same shortest path length with several possible paths. The red, yellow, green and blue paths all have the same shortest Manhattan length of 14. The two dark red points represent the start and the end, respectively.

To compute the relevance between one proposal and one ground-truth box, the proposed method defines one search function, which can be divided into two parts. The first is the Manhattan distance, denoted as $\mathcal{D}_m$, which measures the distance between the paired proposal and ground-truth box. It is worth noting that, in the view of paths, there exist several shortest paths with the smallest Manhattan distance, as shown in Figure 3. The red, green, yellow and blue paths all have the same shortest path length, and this brings some uncertainty into the second part, the depth estimation, denoted as $\mathcal{D}_d$. From the start at one proposal to the end at one ground-truth box, the Manhattan distance yields several possible paths, denoted as $\mathcal{S} = \{s_1, \dots, s_T\}$, in which $s_k$ represents the k-th path and $T$ is the total number of paths. The various paths in $\mathcal{S}$ show various depth variances, denoted as $\{v_1, \dots, v_T\}$, in which $v_k$ is the sum of the depth changes along the k-th path, formulated as:

$v_k = \Phi(s_k) = \sum_{q=1}^{Q_k - 1} \delta_q$        (1)

where $\Phi(\cdot)$ represents the aggregation function to calculate the depth variance, $Q_k$ represents the total number of points included in the path $s_k$, and $\delta_q$ represents the q-th depth variance between two adjacent points. We select the path with the minimal depth variance among the different paths to calculate the depth variance cost along the matching process. It can be defined as follows:

$\mathcal{D}_d = \min_{k \in \{1, \dots, T\}} v_k$        (2)
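As a concrete reading of Equations (1) and (2), the sketch below (ours; the depth map and the restriction to monotone 4-connected shortest paths are assumptions) computes the minimal accumulated depth change over all shortest Manhattan paths with a simple dynamic program:

```python
import numpy as np

def min_depth_variance(depth, start, end):
    """Among all shortest (monotone) Manhattan paths from `start` to `end`,
    return the minimal sum of absolute depth changes between adjacent cells.
    `depth` is a 2-D depth map; points are (row, col). Illustrative sketch."""
    (r0, c0), (r1, c1) = start, end
    dr, dc = (1 if r1 >= r0 else -1), (1 if c1 >= c0 else -1)
    rows = range(r0, r1 + dr, dr)
    cols = range(c0, c1 + dc, dc)
    cost = np.full((len(rows), len(cols)), np.inf)
    cost[0, 0] = 0.0
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            d = depth[r, c]
            if i > 0:   # step in the row direction
                prev = depth[rows[i - 1], c]
                cost[i, j] = min(cost[i, j], cost[i - 1, j] + abs(d - prev))
            if j > 0:   # step in the column direction
                prev = depth[r, cols[j - 1]]
                cost[i, j] = min(cost[i, j], cost[i, j - 1] + abs(d - prev))
    return cost[-1, -1]

# Hypothetical 5x5 depth map in which depth grows toward the top of the image.
depth = np.linspace(10, 2, 5)[:, None] * np.ones((1, 5))
print(min_depth_variance(depth, start=(4, 0), end=(0, 4)))  # -> 8.0
```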

Consider the example in Figure 2, which schematically shows a detector with five generated proposals and three ground-truth boxes. The overall matching cost can be formulated as:

$\mathcal{C} = \mathcal{D}_m + \mathcal{D}_d$        (3)

Input:
      $\mathcal{P}$ is a set of ranked proposals generated by RPN
      $\mathcal{G}$ is a set of ground-truth boxes on the image
      $N$ and $M$ are the number of GTs and proposals
      $N_{pos}$ and $N_{neg}$ are the number of positives and negatives
      $C_{j,i}$ is the matching cost between $p_j$ and $g_i$.
Output:
      $\mathcal{P}^{+}$ and $\mathcal{P}^{-}$ are sets of positives and negatives

1:  build an empty set for positive samples: $\mathcal{P}^{+} \leftarrow \varnothing$;
2:  build an empty set for negative samples: $\mathcal{P}^{-} \leftarrow \varnothing$;
3:  build an empty set for pending proposals: $\mathcal{P}^{*} \leftarrow \varnothing$;
4:  for each box $g_i \in \mathcal{G}$ do
5:     select the proposal $p_j$ with the minimal cost $C_{j,i}$ as the assigned pair;
6:     if the number of assigned pairs of $g_i$ < $\lfloor N_{pos}/N \rfloor$ then
7:        $\mathcal{P}^{+} \leftarrow \mathcal{P}^{+} \cup \{p_j\}$;
8:     else
9:        re-compare the cost of all proposals that have been assigned to $g_i$;
10:       if the proposal with the largest cost is $p_j$ then
11:          $\mathcal{P}^{-} \leftarrow \mathcal{P}^{-} \cup \{p_j\}$;
12:       else
13:          $p_r \leftarrow$ proposal with the largest cost;
14:          $\mathcal{P}^{+} \leftarrow \mathcal{P}^{+} \cup \{p_j\}$; abandon $p_r$;
15:          $\mathcal{P}^{*} \leftarrow \mathcal{P}^{*} \cup \{p_r\}$: put it into the pending set;
16:       end if
17:    end if
18:    remove $p_j$ from $\mathcal{P}$;
19: end for
20: if $|\mathcal{P}^{-}| < N_{neg}$ then
21:    compute IoU between $\mathcal{P}$ and $\mathcal{G}$;
22:    randomly assign negative samples to fill $\mathcal{P}^{-}$;
23: end if
24: return $\mathcal{P}^{+}$, $\mathcal{P}^{-}$, $\mathcal{P}^{*}$;
Algorithm 1 Search Algorithm

where $\mathcal{D}_m$ and $\mathcal{D}_d$ represent the Manhattan distance and depth variance values of all proposals. In the following, we consider a simple assignment strategy based on the estimated matching cost $\mathcal{C}$. We sort the proposals by the corresponding confidence estimated by the Region Proposal Network (RPN). The proposed matching strategy sequentially assigns proposals to the ground-truth boxes. In detail, we define the maximum number of assignments for each ground-truth box as $\lfloor N_{pos}/N \rfloor$, in which $N_{pos}$ represents the number of positive samples and $N$ represents the number of ground-truth boxes in the input image. For each proposal in $\mathcal{P}$, we sequentially assign the ground-truth box with the smallest matching cost. The detailed matching process is shown in Algorithm 1.
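A compact Python rendering of this matching strategy might look as follows; it is a sketch of Algorithm 1 under our reading, with the per-GT quota $\lfloor N_{pos}/N \rfloor$ and the pending set as described above, and all names are illustrative:

```python
import numpy as np

def search_assign(cost, order, n_pos):
    """Greedy set-to-set matching, a sketch of Algorithm 1.
    cost:  (M proposals x N GTs) matching-cost matrix C.
    order: proposal indices ranked by RPN confidence.
    n_pos: total number of positives; each GT takes at most n_pos // N."""
    n = cost.shape[1]
    quota = max(n_pos // n, 1)               # per-GT assignment quota
    positives = {i: [] for i in range(n)}    # GT index -> proposal indices
    negatives, pending = [], []
    for j in order:                          # ranked proposals
        i = int(np.argmin(cost[j]))          # GT with minimal cost for p_j
        if len(positives[i]) < quota:
            positives[i].append(j)
        else:
            # GT i is full: re-compare costs among its assignees plus p_j,
            # keep the cheapest `quota`, displace the most expensive one.
            group = sorted(positives[i] + [j], key=lambda q: cost[q, i])
            positives[i], worst = group[:quota], group[quota]
            # A displaced p_j becomes a (hard) negative; a displaced earlier
            # assignee goes to the pending set (Algorithm 1, lines 10-15).
            (negatives if worst == j else pending).append(worst)
    return positives, negatives, pending

cost = np.random.rand(10, 3)                 # toy: 10 proposals, 3 GTs
pos, neg, pend = search_assign(cost, order=range(10), n_pos=6)
```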

It is worth noting that: (1) each ground-truth box can be assigned up to $\lfloor N_{pos}/N \rfloor$ proposals; (2) proposals that were previously assigned to a ground-truth box but removed later are sampled as hard negatives; (3) if the negative set is still not full at the end, IoU is also introduced to fill it. Besides, in order to enrich the diversity of negative samples, half of the negative proposals are randomly sampled from the negative set $\mathcal{P}^{-}$, and the remaining half come from the pending set $\mathcal{P}^{*}$. The matching algorithm assigns a unique proposal to each ground-truth box. Given this algorithm, we define a loss function on the pair of sets $\mathcal{P}$ and $\mathcal{G}$ as:

$\mathcal{L} = \lambda_1 \mathcal{L}_{reg} + \lambda_2 \mathcal{C} + \lambda_3 \mathcal{L}_{cls}$        (4)

where $\mathcal{L}_{reg}$ is a regression loss, $\mathcal{C}$ is the matching cost, and $\mathcal{L}_{cls}$ is a cross-entropy loss on a proposal's confidence that it would be matched to a ground-truth box. The label for this cross-entropy loss is provided by the matching. $\lambda_1$, $\lambda_2$ and $\lambda_3$ are the parameters trading off the three parts. Note that, given the proposed matching, we can update the network by back-propagating the gradient of this loss function.
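Assembled in code, Equation 4 could look like the following sketch; the concrete loss functions are placeholders (the paper does not specify them here), and the default $\lambda$ values echo Sec. 4.1, though their order is our assumption:

```python
import torch
import torch.nn.functional as F

def total_loss(pred_boxes, target_boxes, match_cost, cls_logits, cls_labels,
               lambdas=(1.0, 1.0, 0.01)):    # values quoted in Sec. 4.1
    l_reg = F.smooth_l1_loss(pred_boxes, target_boxes)  # placeholder L_reg
    l_match = match_cost.mean()                         # matching-cost term C
    l_cls = F.cross_entropy(cls_logits, cls_labels)     # matched/unmatched CE
    l1, l2, l3 = lambdas
    return l1 * l_reg + l2 * l_match + l3 * l_cls
```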

3.2 Implementation Details

The above algorithm attempts to answer the question "which to match" for each proposal. In this section, we further improve the proposed algorithm, focusing on the two most important aspects: (1) how to assign proposals with various scales to different ground-truth boxes? (2) how to relieve the inconsistency between the depth-guided assignment and the IoU-based performance evaluation?

To answer the first question, we follow the practice in FCOS [18], which introduces multi-level prediction with FPN [8]. With the use of multi-level feature maps, defined as $\{P_3, P_4, P_5, P_6, P_7\}$, proposals and ground-truth boxes with different scales are allocated to specific feature levels, and each feature level is only used to assign proposals of the corresponding scale. In our experiments, the size range is (0, 64] for $P_3$, (64, 128] for $P_4$, (128, 256] for $P_5$, (256, 512] for $P_6$ and (512, $\infty$) for $P_7$. Since proposals with different sizes are allocated to different feature levels, the proposals and ground-truth boxes on the same level are closer with larger overlap, which further strengthens our algorithm.
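A small helper along these lines (hypothetical; the paper does not state which box-size measure it uses, so the longer side is an assumption) maps a box to its feature level under the ranges above:

```python
def fpn_level(box, ranges=((0, 64), (64, 128), (128, 256), (256, 512),
                           (512, float("inf")))):
    """Map a box [x1, y1, x2, y2] to a feature level P3..P7 by its scale.
    Scale is taken here as the longer side, which is an assumption."""
    size = max(box[2] - box[0], box[3] - box[1])
    for level, (lo, hi) in enumerate(ranges, start=3):
        if lo < size <= hi:
            return level
    return 7  # fall-through for degenerate boxes

print(fpn_level([100, 50, 140, 150]))  # height 100 -> level 4 (P4)
```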

In order to ensure consistency between training and evaluation, we attempt to incorporate the assignment into the confidence score of each predicted bounding box. We add one parallel branch to the detection head, alongside classification and regression, to predict the matching cost of each proposal. The matching cost calculated above serves as the estimation target to supervise this branch. The total loss in Equation 4 is therefore redefined as:

$\mathcal{L}' = \mathcal{L} + \mathrm{Logistic}(\hat{\mathcal{C}}, \mathcal{C})$        (5)

where $\mathrm{Logistic}(\cdot)$ denotes the loss function used in standard logistic regression, and $\hat{\mathcal{C}}$ is the matching cost predicted by the parallel branch. In this way, we try to reduce the confidence of detected boxes whose regression differs from our matching procedure. In detail, taking the proposal $p_j$ as an example, the regressed box of $p_j$ is denoted as $b_j$. The actual matching cost between $b_j$ and the ground-truth boxes can be calculated following Equation 3, denoted as $C_{b_j}$. The confidence score of $b_j$ is then re-measured, formulated as:

(6)
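Since Equation 6 is not reproduced in this copy, the following is only a schematic of the refinement described above: a parallel head predicts the matching cost, is supervised by a logistic-regression loss against the computed cost, and the prediction then discounts the classification confidence; the exact discounting form is our assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CostHead(nn.Module):
    """Parallel branch beside the cls/reg heads that predicts, per proposal,
    a normalized matching cost. Schematic only; layer sizes are assumptions."""
    def __init__(self, in_ch=256):
        super().__init__()
        self.fc = nn.Linear(in_ch, 1)

    def forward(self, feat):
        return torch.sigmoid(self.fc(feat)).squeeze(-1)  # cost in (0, 1)

def logistic_loss(pred_cost, target_cost):
    # Binary cross-entropy, i.e. the standard logistic-regression loss, with
    # the matching cost computed during assignment (normalized) as target.
    return F.binary_cross_entropy(pred_cost, target_cost)

def rescore(confidence, pred_cost):
    # Assumed discounting: boxes whose regression disagrees with the matching
    # (high predicted cost) get their confidence suppressed.
    return confidence * (1.0 - pred_cost)
```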
Method | Year | ex. | Backbone | Scale | R | R_75 | Heavy | Inference Time
Faster R-CNN+ATT [27] | CVPR2018 | | VGG-16 | | 16.0 | 48.1 | 56.7 | -
TLL [17] | ECCV2018 | | ResNet-50 | | 15.5 | 44.2 | 53.6 | -
TLL+MRF [17] | ECCV2018 | ✓ | ResNet-50 | | 14.4 | - | 52.0 | -
ALFNet [11] | ECCV2018 | | ResNet-50 | | 12.0 | 36.5 | 51.9 | 0.27s / img
RepLoss [20] | CVPR2018 | | ResNet-50 | | 11.6 | - | 55.3 | -
OR-CNN [29] | ECCV2018 | | VGG-16 | | 12.8 | - | 55.7 | -
MGAN [14] | ICCV2019 | | VGG-16 | | 11.5 | - | 51.7 | -
PBM [6] | CVPR2020 | ✓ | VGG-16 | | 11.0 | - | 53.3 | -
CSP [12] | CVPR2019 | | ResNet-50 | | 11.0 | 34.7 | 49.3 | 0.33s / img
PRNet [21] | ECCV2020 | ✓ | ResNet-50 | | 10.8 | - | 53.3 | 0.22s / img
DRNet (ours) | - | | VGG-16 | | 10.2 | 29.7 | 46.9 | 0.12s / img
DRNet (ours) | - | | VGG-16 | | 9.5 | 27.0 | 43.6 | 0.14s / img
DRNet (ours) | - | | ResNet-50 | | 10.1 | 25.9 | 46.2 | 0.15s / img
DRNet (ours) | - | | ResNet-50 | | 9.1 | 25.4 | 42.1 | 0.18s / img
Table 1: Comparisons with the state-of-the-art methods on the Citypersons validation dataset. Results are reported in the MR$^{-2}$ evaluation metric, in which lower is better. Boldface indicates the best performance. ex. marks whether the method introduces extra annotations; for example, PBM and PRNet use extra part annotations, and TLL+MRF considers sequence input.

4 Experiments

We assess the effectiveness of our proposed method for pedestrian detection on the widely used Caltech [4, 5] and Citypersons [26] datasets.

4.1 Experimental Setup

The proposed method is based on the Faster R-CNN baseline [16], pre-trained on ImageNet. We optimize the network using Stochastic Gradient Descent (SGD) with 0.9 momentum and 0.0005 weight decay, training on a single 1080Ti GPU with a mini-batch of 1 image per GPU. For both the Caltech and Citypersons datasets, we train for a fixed number of iterations with an initial learning rate that is decayed once for the remaining iterations. All images are kept at their original scale during training and testing. The trade-off parameters $\lambda_1$, $\lambda_2$ and $\lambda_3$ are set to 1, 1 and 0.01, respectively, and $N_{pos}$ and $N_{neg}$ are both set to 256 as usual. Our depth estimation model is VNL [24]; a better depth estimation model may yield better performance, but depth estimation itself is not the main scope of our method.
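For reference, the optimizer settings above correspond to a standard PyTorch configuration; the learning-rate values and decay milestone are elided in this copy, so those below are placeholders only:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)   # stand-in for the detector network
base_lr = 1e-3             # placeholder; the LR value is elided in this copy
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                            momentum=0.9, weight_decay=0.0005)  # per Sec. 4.1
# One decay step partway through training, as described above; the milestone
# iteration count below is likewise a placeholder.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60000], gamma=0.1)
```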

4.2 Evaluation Metrics

In experiments, we use the standard log-average miss rate (MR$^{-2}$) over False Positives Per Image (FPPI) in $[10^{-2}, 10^{0}]$. This metric is somewhat similar to mean Average Precision (mAP), but puts more emphasis on objects that are not detected.
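The metric averages the miss rate at nine FPPI points evenly spaced in log space over $[10^{-2}, 10^{0}]$; a minimal sketch of the computation (ours, assuming the detector's miss-rate/FPPI curve is already available):

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """MR^-2: average miss rate at 9 FPPI values log-spaced in [1e-2, 1e0].
    `fppi` (increasing) and `miss_rate` sample the detector's curve."""
    refs = np.logspace(-2, 0, 9)
    mrs = []
    for r in refs:
        # Take the miss rate at the largest achieved FPPI not exceeding the
        # reference point; clip at the ends of the curve.
        idx = np.searchsorted(fppi, r, side="right") - 1
        mrs.append(miss_rate[max(idx, 0)])
    # Geometric mean (log-average), as in Dollar et al. [4].
    return float(np.exp(np.mean(np.log(np.maximum(mrs, 1e-10)))))
```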

On Caltech and Citypersons, we report results across different occlusion degrees: Reasonable, Heavy and Partial, whose visibility ratios are (0.65, 1], (0, 0.65) and (0.65, 1), respectively. In all subsets, pedestrians over 50 pixels in height are taken for evaluation. It is worth noting that Heavy is designed to evaluate performance under severe occlusion. To further demonstrate our performance, we design two special settings. The first validates localization accuracy: we test on the Reasonable set not only under IoU = 0.5 (R) but also under IoU = 0.75 (R_75). The second splits Citypersons into a new subset called Large Height Variation (LHV), in which each image contains a pedestrian height variation larger than 50 pixels with some occlusion (visibility ratio in [0.2, 0.9]).

Backbone | Scale | IoU Assignment | Direction Assignment | Refinement | R | Heavy | Partial | Bare
VGG-16 | | ✓ | | | 15.8 | 53.2 | 16.9 | 9.7
VGG-16 | | | ✓ | | 11.8 | 49.6 | 15.5 | 9.3
VGG-16 | | | ✓ | ✓ | 10.2 | 46.9 | 14.5 | 6.6
Table 2: Comparisons of different modules on the Citypersons validation dataset. Results are the MR$^{-2}$ evaluation metric of the corresponding methods, in which lower is better. Boldface indicates the best performance. Direction Assignment is described in Sec 3.1 and Refinement is described in Sec 3.2.
Figure 4: Visualization results of our DRNet and the state-of-the-art methods, e.g., CSP [12] and OR-CNN [29]. Boxes with a check mark or a cross represent correct and false results, respectively. The red and yellow crosses show false positives of OR-CNN and CSP, while ours performs better.

4.3 Main Results

We compare DRNet with competing methods on the Citypersons and Caltech datasets in Table 1 and Table 4. For fair comparison, we report results in terms of backbone, scale, inference time and all challenging subsets.

(1) Citypersons Dataset. Table 1 reports the results compared to the state of the art on Citypersons. First of all, our algorithm shows significant performance improvement under various fair settings. For example, when leveraging ResNet-50, ours achieves the highest accuracy, with an improvement of 0.9% MR over the closest competitor CSP [12] on R and 8.8% MR on R_75. Furthermore, it is worth noting from Table 1 that ours also demonstrates a self-contained ability to handle occlusion in crowded scenes. In particular, on the Heavy occlusion subset, ours reports a new state of the art of 42.1% MR. This is probably because harder samples are mined with the introduced directional assignment metric, thus training a more discriminative predictor. Moreover, since we mainly improve the training-time matching process, there is no additional modification to the network structure; therefore, there is no extra computation during testing, and our inference time is comparable to others.

(2) Caltech Dataset. We also test our method on Caltech; the comparison with the state of the art on this benchmark is shown in Table 4. Our method achieves an MR of 3.08% under the IoU threshold of 0.5, surpassing the best competitor (3.80% of CSP [12]). Besides, under the stricter occlusion level of the Heavy subset, our method achieves 30.45% MR, outperforming all previous state-of-the-art methods with an improvement of 6.05% MR over CSP [12]. This indicates that our method has substantially better localization accuracy.

4.4 Ablation

To show the effectiveness of each proposed component, we report the overall ablation studies in Table 2.

(1) Search and Assignment. As analyzed above, the IoU metric is suboptimal primarily because it struggles to answer the "which to match" question for each proposal. The performance is summarized in Table 2. When evaluated separately, the search and assignment strategy shows an improvement of 4.0% MR compared with the original IoU assignment. Furthermore, by incorporating the refinement module, we show a consistent further improvement of 1.6% MR. In addition, as the visualization results in Figure 4 show, our algorithm effectively reduces false detections when there is large height variance of pedestrians in an image, compared with other algorithms, e.g., CSP [12] or OR-CNN [29]. Taking the first image in Figure 4 as an example, the red dotted box fails to regress toward the green box (the boy with the white T-shirt), and thus brings about "close but not correct" false positives. Moreover, our algorithm is particularly effective in two challenging scenarios. One is the scene with large height variance, such as the example in Figure 4 (first row, third column). The confusion in IoU makes it harder to train a high-quality regressor, and thus brings about "close but not correct" false positives during testing, while ours can effectively eliminate this kind of low-quality regression. The second is the occlusion scene. In this case, our algorithm has more consistent matching targets and therefore performs better (Figure 4, second row, second column).

Method | Backbone | R | LHV
BiBox [30] | VGG-16 | 11.2 | -
BiBox† | VGG-16 | 11.2 | 30.6
BiBox‡ | VGG-16 | 10.5 | 24.7
PBM [6] | VGG-16 | 11.1 | -
PBM† | VGG-16 | 11.0 | 30.3
PBM‡ | VGG-16 | 10.2 | 23.9
CSP [12] | ResNet-50 | 11.0 | -
CSP† | ResNet-50 | 10.9 | 29.0
CSP‡ | ResNet-50 | 10.2 | 23.7
Table 3: Comparisons with the state-of-the-art methods on the Citypersons validation dataset. Boldface indicates the best performance. † means our re-implementation of the referred method, and ‡ means the corresponding detector combined with our assignment manner.
Method | Backbone | R | Heavy
DeepParts [13] | - | 12.90 | 60.42
MS-CNN [2] | - | 8.08 | 59.94
RPN+BF [25] | - | 7.28 | 74.36
SDS-RCNN [1] | ResNet-50 | 6.44 | 58.55
ALFNet [11] | ResNet-50 | 4.50 | 43.40
RepLoss [20] | ResNet-50 | 4.00 | 63.36
CSP [12] | ResNet-50 | 3.80 | 36.50
DRNet (ours) | ResNet-50 | 3.08 | 30.45
Table 4: Comparisons with the state-of-the-art methods on the Caltech test set. Boldface indicates the best performance.
Figure 5: The performance of MR$^{-2}$ along the training process under different epochs. LHV means the maximum pedestrian height variance per image.

(2) Large Height Variance. To better verify our algorithm, we specifically construct a Large Height Variance (LHV) subset. The separation of this subset is based on the maximum variance in pedestrian height per image. On one hand, Figure 5 shows the change of miss rate during the training process under different settings; we find that our method detects accurately under large height variance (LHV). On the other hand, we also integrate the algorithm with other detectors and verify performance on the LHV subset (Table 3). For fair comparison, we re-implemented the relevant algorithms and report both our reproduced performance and the performance reported in the corresponding papers. When integrated with other algorithms, performance is also greatly improved, especially on the LHV subset. However, we have also noticed that although our algorithm improves different detectors (e.g., CSP or PBM), the results reported in Table 3 are merely competitive with the bare search algorithm in Table 1. We believe this is because of some incompatibility between the previous algorithms and our method. For example, PBM [6] and BiBox [30] both introduce part annotations, which leads to some confusion in our assignment procedure.

Method | Backbone | AP | AP_75
FPN [8] | ResNet-50 | 33.9 | -
RetinaNet [9] | ResNet-50 | 36.3 | 38.8
FCOS [18] | ResNet-50 | 36.6 | 38.9
FreeAnchor [22] | ResNet-50 | 38.7 | 41.6
ATSS [28] | ResNet-50 | 39.3 | 42.8
PAA [7] | ResNet-50 | 41.1 | 44.3
FPN‡ | ResNet-50 | 37.5 | 40.9
FCOS‡ | ResNet-50 | 41.4 | 44.7
RetinaNet‡ | ResNet-50 | 41.5 | 45.0
Table 5: Comparisons with the state-of-the-art methods on the COCO minival set. ‡ means the corresponding detector combined with our assignment manner. Boldface indicates the best performance.

(3) Extended Experiments. To further validate our algorithm, we also report results on the COCO minival dataset compared with other detectors. We use a COCO training setting that matches [7] in batch size, frozen Batch Normalization, learning rate, etc. For ablation studies, we use the ResNet-50 backbone and run 135k iterations of training. We observe that the proposed label assignment improves performance when added to other widely used frameworks: by 3.6% on FPN [8], 5.2% on RetinaNet [9] and 4.8% on FCOS [18]. Furthermore, compared with other label assignment strategies, ours also shows competitive performance in Table 5. Although our algorithm is designed specifically for pedestrian detection, it still brings some improvement on a more general detection dataset such as COCO. We believe that performance would be further improved with a more refined design for general-purpose detection.

5 Conclusion

In this paper, we presented a simple yet effective pedestrian detector with a novel assignment strategy, achieving competitive accuracy and inference time compared with state-of-the-art methods. On top of a backbone, the proposed method can serve as a metric incorporated into other pedestrian detectors, and experimental results show a consistent improvement on popular pedestrian detection benchmarks, e.g., Citypersons and Caltech. This design is flexible and independent of any backbone network, and is not limited to the two-stage detection framework. Therefore, it would also be interesting to incorporate the proposed assignment strategy into other detectors like FCOS [18] and YOLO [15], which we will study in future work. Furthermore, we plan to extend this method beyond pedestrian detection to more general object detection tasks.

References

  • [1] G. Brazil, X. Yin, and X. Liu (2017) Illuminating pedestrians via simultaneous detection & segmentation. In International Conference on Computer Vision (ICCV).
  • [2] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos (2016) A unified multi-scale deep convolutional neural network for fast object detection. In European Conference on Computer Vision (ECCV).
  • [3] Z. Cai and N. Vasconcelos (2018) Cascade R-CNN: delving into high quality object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [4] P. Dollár, C. Wojek, B. Schiele, and P. Perona (2012) Pedestrian detection: an evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence 34, pp. 743.
  • [5] P. Dollár, C. Wojek, B. Schiele, and P. Perona (2009) Pedestrian detection: a benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 304–311.
  • [6] X. Huang, Z. Ge, Z. Jie, and O. Yoshie (2020) NMS by representative region: towards crowded pedestrian detection by proposal pairing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [7] K. Kim and H. S. Lee (2020) Probabilistic anchor assignment with IoU prediction for object detection. arXiv preprint arXiv:2007.08103.
  • [8] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [9] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In International Conference on Computer Vision (ICCV).
  • [10] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In European Conference on Computer Vision (ECCV).
  • [11] W. Liu, S. Liao, W. Hu, X. Liang, and X. Chen (2018) Learning efficient single-stage pedestrian detectors by asymptotic localization fitting. In European Conference on Computer Vision (ECCV).
  • [12] W. Liu (2019) High-level semantic feature detection: a new perspective for pedestrian detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [13] Y. Liu and L. Jin (2017) Deep matching prior network: toward tighter multi-oriented text detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [14] Y. Pang, J. Xie, M. H. Khan, and R. M. Anwer (2019) Mask-guided attention network for occluded pedestrian detection. In International Conference on Computer Vision (ICCV).
  • [15] J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [16] S. Ren, K. He, R. Girshick, and J. Sun (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, pp. 1137–1149.
  • [17] T. Song, L. Sun, D. Xie, H. Sun, and S. Pu (2018) Small-scale pedestrian detection based on somatic topology localization and temporal feature aggregation. In European Conference on Computer Vision (ECCV).
  • [18] Z. Tian, C. Shen, H. Chen, and T. He (2019) FCOS: fully convolutional one-stage object detection. In International Conference on Computer Vision (ICCV).
  • [19] J. Wang, K. Chen, S. Yang, C. C. Loy, and D. Lin (2019) Region proposal by guided anchoring. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [20] X. Wang, T. Xiao, Y. Jiang, S. Shuai, S. Jian, and C. Shen (2018) Repulsion loss: detecting pedestrians in a crowd. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [21] S. Xiaolin, Z. Kaili, C. Wen-Sheng, Z. Honggang, and G. Jun (2020) Progressive refinement network for occluded pedestrian detection. In European Conference on Computer Vision (ECCV).
  • [22] Z. Xiaosong, W. Fang, L. Chang, J. Rongrong, and Y. Qixiang (2019) FreeAnchor: learning to match anchors for visual object detection. In Neural Information Processing Systems (NeurIPS).
  • [23] T. Yang, X. Zhang, W. Zhang, and J. Sun (2018) MetaAnchor: learning to detect objects with customized anchors. In Neural Information Processing Systems (NeurIPS).
  • [24] W. Yin, Y. Liu, C. Shen, and Y. Yan (2019) Enforcing geometric constraints of virtual normal for depth prediction. In International Conference on Computer Vision (ICCV).
  • [25] L. Zhang, L. Liang, X. Liang, and K. He (2016) Is Faster R-CNN doing well for pedestrian detection? In European Conference on Computer Vision (ECCV).
  • [26] S. Zhang, R. Benenson, and B. Schiele (2017) CityPersons: a diverse dataset for pedestrian detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [27] S. Zhang, J. Yang, and B. Schiele (2018) Occluded pedestrian detection through guided attention in CNNs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [28] S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li (2019) Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. arXiv preprint arXiv:1912.02424.
  • [29] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li (2018) Occlusion-aware R-CNN: detecting pedestrians in a crowd. In European Conference on Computer Vision (ECCV).
  • [30] C. Zhou and J. Yuan (2018) Bi-box regression for pedestrian detection and occlusion estimation. In European Conference on Computer Vision (ECCV).