Pedestrian detection is a key problem in many real-world applications, including autonomous-driving and surveillance systems, and requires both high classification and localization accuracy. Driven by the success of general object detection, many pedestrian detectors [27, 11, 14, 6] still follow the basic practice of frameworks such as Faster R-CNN and SSD. Consequently, Intersection over Union (IoU), which is required to assign proposals, has become an indispensable metric in recent frameworks.
However, the commonly used IoU metric has two main drawbacks. (1) On one hand, it is difficult to set a proper IoU threshold during training. A lower threshold (e.g. 0.5) keeps an adequate number of positive samples but results in many "close but not correct" false positives during inference. A relatively higher threshold (e.g. 0.7) rejects low-quality proposals but removes a large number of matched positives. Although Cascade R-CNN and ALFNet have provided solutions that gradually refine the proposals over several stages with a set of IoU thresholds, a hand-crafted IoU threshold, whether single or multiple, is still not the best choice. (2) On the other hand, IoU-based GT-Proposal assignment suffers from the inconsistent supervision problem: spatially adjacent proposals with similar features can be assigned to different ground-truth boxes, so very similar proposals may be forced to regress towards different targets, confusing the bounding-box regressor when predicting locations. This disadvantage is more prominent in scenes with large scale variation and occlusion. As illustrated in Figure 1, overlapped proposals with almost identical features are assigned to different ground-truth boxes, which causes confusion during training and thus low-quality regression results during inference (Figure 1, red solid box).
The above analyses motivate us to address the weakness of the IoU-based GT-Proposal matching mechanism and to propose a new assignment and regression metric for pedestrian detection. Different from the previous one-to-one manner of assigning one proposal to one ground-truth box, we conduct the assignment as a set-to-set process of finding a reasonable matching between the two sets (proposals and ground-truth boxes). This search-based method does not depend on a fixed hyper-parameter (e.g. an IoU threshold); instead, it continually searches for matched pairs along the dynamic training procedure. Furthermore, the search algorithm contains a cost function that answers the question of "which to match" for each proposal. Motivated by the observation that the distribution of pedestrian scales is highly correlated with the depth direction, the introduced cost function starts from depth estimation and gradually assigns all proposals step by step, pushing more proposals closer to ground-truth boxes with smaller depth (scale) variance. In this way, the regressor is optimized towards a consistent location direction: regress in the direction with smaller depth variance. On top of this, to the best of our knowledge we are the first to point out that "Regression needs Direction", and a novel pedestrian detection architecture is thus constructed, denoted as the Directional Regression Network (DRNet).
To demonstrate the generality of the proposed depth-guided search algorithm, we evaluate various CNN-based pedestrian detectors on both the Citypersons and Caltech datasets, including OR-CNN, PBM and CSP. To sum up, the main contributions of this work are:
(1) We are the first to put forward the inconsistent supervision problem of the IoU-based assignment mechanism, and we propose a search algorithm as a new assignment metric that is both dynamic and direction-sensitive during training.
(2) We are the first to point out that "Regression needs Direction", and we propose a directional regression network, named DRNet, to tackle the challenging problem of large scale variation for accurate pedestrian localization.
(3) We achieve state-of-the-art accuracy on widely-used datasets including Caltech and Citypersons, improving by 8.8% under the challenging R setting on Citypersons. We also incorporate the proposed method into popular pedestrian detection algorithms such as CSP and PBM, and show further performance improvements.
2 Related Work
2.1 Pedestrian Detection
Driven by the success of general object detection, many pedestrian detectors follow the anchor-based or anchor-free paradigm, with anchor-based detectors appearing in both two-stage and one-stage frameworks. RPN+BF improves Faster R-CNN for pedestrian detection with an RPN followed by boosted forests on shared features. MS-CNN introduces receptive fields matched to different object scales to tackle the multi-scale problem in pedestrian detection. Recently, methods such as PBM use both part and full-body features to more accurately classify and localize pedestrians, especially in crowded scenes. However, the aforementioned detectors focus more on feature representation, and less attention is paid to the inherent drawbacks of the IoU-based assignment method during training.
2.2 Increment-based Label Assignment
Similar to our intuition, some researchers have recently revisited the role of IoU in object detection. First, methods such as Cascade R-CNN and ALFNet replace the single threshold with an incremental IoU setting. Although these detectors gradually refine the proposals, they are still not free from hand-crafted settings and are somewhat inflexible during training. Most recently, a progressive network was proposed with a three-phase progression that gradually refines anchors following the human annotation process. However, the problems are also obvious. On the one hand, although the gradual strategy brings performance improvements, it also doubles the computational cost. On the other hand, some incremental settings are not flexible: for example, Cascade R-CNN and ALFNet both use manually defined IoU thresholds (e.g., 0.5, 0.6 and 0.7) to measure quality.
2.3 Statistics-based Label Assignment
Different from previous methods that choose a fixed number of best-scoring anchors, statistics-based label assignment detectors model anchor selection as likelihood estimation for a probability distribution. ATSS proposes an adaptive training sample selection method that uses the sum of the mean and standard deviation of IoU as the threshold. PAA proposes a probabilistic anchor assignment strategy that adaptively separates anchors into positive and negative samples for each GT box according to the learning status of the model. These methods take a step towards more accurate assignment strategies. Nevertheless, to the best of our knowledge, they remain caught in IoU-based assignment, which means they do not escape the inconsistency problem our work raises.
2.4 Prior-based Label Assignment
To relieve the extra computational cost, some works introduce human priors to improve label assignment. FreeAnchor proposes a detector that guarantees at least one anchor's prediction is close to the ground-truth for each object. GuideAnchoring and MetaAnchor dynamically change the shape of anchors to fit various object distributions. The aforementioned methods, whether prior-related (e.g., GuideAnchoring) or instance-related (e.g., FreeAnchor), ignore the inconsistent supervision problem in the assignment process. Therefore, they are still not free from the confusion caused by "similar proposals, different regression targets", which brings about false regression results during inference. Compared with theirs, our method is not only dynamic and flexible but also ensures the consistency of matching, so it better answers the question of "which to match" and thus shows further improvement for accurate pedestrian localization. On top of the above, combined with depth estimation (the z axis), our work is also the first attempt to solve the assignment problem from a 3D perspective, in contrast to 2D (x-y axis) assignment methods such as ATSS or PAA.
3 Proposed Method
Generally speaking, the proposed method consists of a search algorithm and a depth-guided cost function that together answer the main question, "which to match", for each proposal. The overall process is shown in Figure 2.
3.1 How to Design Depth-guided Assignment?
"Which to match" is an important question, especially for pedestrian detection. It is essentially a matching problem: how to match reasonable ground-truth boxes to each proposal so as to push network training in a better direction. As analyzed above, the IoU metric performs a one-to-one assignment of one proposal to one ground-truth box, but brings about the inconsistent supervision problem during training. Different from IoU, we regard the assignment as the process of finding a reasonable matching between two sets (proposals and ground-truth boxes). We formulate this process as follows. The ground-truth boxes and proposals are denoted as $G$ and $P$, containing $N_{gt}$ ground-truth boxes and $M$ proposals, respectively.
To compute the relevance between one proposal and one ground-truth box, the proposed method defines a search function that can be divided into two parts. The first is the Manhattan distance, denoted as $d_{man}$, which measures the distance between the paired proposal and ground-truth box. It is worth noting that, in the view of paths, there exist several shortest paths with the same smallest Manhattan distance, as shown in Figure 3: the red, green, yellow and blue paths all have the same length, and this brings some uncertainty to our second part, the depth term $d_{dep}$. From the start at one proposal to the end at one ground-truth box, the Manhattan distance admits several possible shortest paths, denoted as $\{t_1, \dots, t_T\}$, in which $t_k$ represents the k-th path and $T$ is the total number of paths. Different paths exhibit different depth variances, denoted as $\{D_1, \dots, D_T\}$, in which $D_k$ is the sum of the depth changes along the k-th path, formulated as:

$$D_k = \Phi(t_k) = \sum_{q=1}^{Q_k} \delta_q,$$

where $\Phi$ represents the aggregation function that calculates the depth variance, $Q_k$ represents the total number of points included in the path $t_k$, and $\delta_q$ represents the q-th depth change between two adjacent points. We select the path with the minimal depth variance among the candidates to serve as the depth-variance cost along the matching process, defined as:

$$d_{dep} = \min_{1 \le k \le T} D_k.$$
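As a concrete illustration, the minimal cumulative depth change over all shortest Manhattan paths between two points can be computed with a small dynamic program over the axis-aligned rectangle spanned by the points (every monotone path inside that rectangle is a shortest Manhattan path). This is only a sketch under our own assumptions about the depth map and path discretization, not the reference implementation:

```python
import numpy as np

def min_depth_variance_cost(depth, start, end):
    """Minimal sum of absolute depth changes over all monotone (i.e.
    shortest-Manhattan) paths from `start` to `end` on a depth map.

    `depth` is an HxW array of per-pixel depth estimates; `start` and
    `end` are (row, col) centers of a proposal and a ground-truth box.
    """
    (r0, c0), (r1, c1) = start, end
    dr = 1 if r1 >= r0 else -1
    dc = 1 if c1 >= c0 else -1
    rows = list(range(r0, r1 + dr, dr))
    cols = list(range(c0, c1 + dc, dc))
    cost = np.full((len(rows), len(cols)), np.inf)
    cost[0, 0] = 0.0
    for i in range(len(rows)):
        for j in range(len(cols)):
            d_here = depth[rows[i], cols[j]]
            if i > 0:  # step along the row direction
                step = abs(d_here - depth[rows[i - 1], cols[j]])
                cost[i, j] = min(cost[i, j], cost[i - 1, j] + step)
            if j > 0:  # step along the column direction
                step = abs(d_here - depth[rows[i], cols[j - 1]])
                cost[i, j] = min(cost[i, j], cost[i, j - 1] + step)
    return float(cost[-1, -1])
```

On a constant depth map the cost is zero, and on a depth ramp the cost equals the total depth difference, as expected.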
Consider the example in Figure 2, which schematically shows a detector with five generated proposals and three ground-truth boxes. The overall matching cost can be formulated as:

$$C = d_{man} + d_{dep},$$

where $d_{man}$ and $d_{dep}$ denote the Manhattan-distance and depth-variance costs over all proposals. In the following, we consider a simple assignment strategy based on the estimated matching cost $C$. We sort the proposals by the confidence estimated by the Region Proposal Network (RPN), and the proposed strategy sequentially assigns proposals to ground-truth boxes. In detail, we define the maximum number of assignments per ground-truth box as $\lfloor N_{pos}/N_{gt} \rfloor$, in which $N_{pos}$ represents the number of positive samples and $N_{gt}$ the number of ground-truth boxes in the input image. For each proposal in the sorted list, we assign the ground-truth box with the smallest $C$ during its matching. The detailed matching process is shown in Algorithm 1.
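The sequential assignment described above can be sketched as a short greedy routine. The names and tie-breaking details are our own illustration, assuming the pairwise matching costs are precomputed:

```python
def greedy_match(costs, rpn_scores, n_pos, n_gt):
    """Greedy GT-proposal assignment sketch.

    `costs[i][j]`: matching cost (Manhattan distance plus depth-variance
    term) of proposal i against ground-truth j; `rpn_scores[i]`: RPN
    confidence of proposal i. Each ground-truth box receives at most
    `n_pos // n_gt` proposals; proposals are visited in order of
    decreasing RPN confidence and take the cheapest non-full GT box.
    """
    cap = n_pos // n_gt
    counts = [0] * n_gt
    assignment = {}  # proposal index -> ground-truth index
    order = sorted(range(len(costs)), key=lambda i: -rpn_scores[i])
    for i in order:
        open_gts = [j for j in range(n_gt) if counts[j] < cap]
        if not open_gts:
            break  # remaining proposals go to the negative/pending sets
        j = min(open_gts, key=lambda j: costs[i][j])
        assignment[i] = j
        counts[j] += 1
    return assignment
```

Once every ground-truth box is full, the leftover proposals fall through to negative or pending sampling, mirroring the notes below.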
It is worth noting that (1) each ground-truth box can be assigned up to $\lfloor N_{pos}/N_{gt} \rfloor$ proposals; (2) a proposal that was previously assigned to a ground-truth box but removed later is sampled as a hard negative; (3) if the negative set is still not full at the end, IoU is also used to fill it. Besides, to enrich the diversity of negative samples, half of the negative proposals are randomly sampled from the negative set and the remaining half from the pending set. The matching algorithm assigns a unique set of proposals to each ground-truth box. Given this algorithm, we define a loss function on the pair of sets (ground-truth boxes and proposals) as:

$$L = \lambda_1 L_{reg} + \lambda_2 C + \lambda_3 L_{cls}, \qquad (4)$$

where $L_{reg}$ is a regression loss, $C$ is the matching cost and $L_{cls}$ is a cross-entropy loss on a proposal's confidence that it would be matched to a ground-truth box; the label for this cross-entropy loss is provided by the matching result. $\lambda_1$, $\lambda_2$ and $\lambda_3$ are parameters trading off the three parts. Note that under the proposed matching, we can update the network by back-propagating the gradient of this loss function.
3.2 Implementation Details
The above algorithm answers the question "which to match" for each proposal. In this section, we further improve the proposed algorithm, focusing on the two most important aspects: (1) how to assign proposals of various scales to different ground-truth boxes, and (2) how to relieve the inconsistency between the depth-guided assignment and IoU-based performance evaluation.
To answer the first question, we follow the practice in FCOS, which introduces multi-level prediction with FPN. With multi-level feature maps, defined as {P3, P4, P5, P6, P7}, proposals and ground-truth boxes of different scales are allocated to specific feature levels, and each feature level is only used to assign proposals of the corresponding scale. In our experiments, the size range is (0, 64] for P3, (64, 128] for P4, (128, 256] for P5, (256, 512] for P6 and (512, ∞) for P7. Since proposals of different sizes are allocated to different feature levels, the proposals and ground-truth boxes on the same level are closer, with larger overlap, which further enhances our algorithm.
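The scale-to-level allocation can be sketched as follows; the level names P3–P7 are the conventional FPN ones and are our assumption here, since the exact symbols were not preserved above:

```python
def assign_level(box_size):
    """Map a box's scale to an FPN level using FCOS-style size ranges."""
    ranges = [("P3", 0, 64), ("P4", 64, 128), ("P5", 128, 256),
              ("P6", 256, 512), ("P7", 512, float("inf"))]
    for level, lo, hi in ranges:
        if lo < box_size <= hi:
            return level
    return "P3"  # degenerate sizes default to the finest level
```

For instance, a 100-pixel proposal lands on P4, while a 600-pixel one lands on P7.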
To ensure consistency between training and evaluation, we incorporate the assignment into the confidence score of each predicted bounding box. We add a branch, parallel to the classification and regression heads, to predict the matching cost of each proposal; the calculated matching cost serves as the target supervising this branch. The total loss in Equation 4 is therefore redefined by adding a standard logistic-regression loss for the cost-prediction branch. In this way, we reduce the confidence of detected boxes whose regression disagrees with our matching procedure. In detail, taking one proposal as an example, the actual matching cost of its regressed box can be calculated following Equation 4, and the confidence score of the box is re-measured as a function of its classification score and this predicted cost.
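Since the exact re-scoring formula was not preserved above, the following is only a plausible sketch: the classification score is down-weighted by the predicted matching cost through an exponential factor, where the exponential form is our assumption for illustration:

```python
import math

def rescore(cls_score, pred_cost):
    """Down-weight a detection whose predicted matching cost is high.

    The exponential decay is an illustrative choice, not the paper's
    exact formulation: zero predicted cost leaves the classification
    score unchanged, and larger costs shrink it towards zero.
    """
    return cls_score * math.exp(-pred_cost)
```

A box that regresses exactly as the matching procedure expects keeps its score, while a box whose predicted cost is high is suppressed before NMS.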
| Method | Venue | Backbone | R | R (strict IoU) | Heavy | Test Time |
|---|---|---|---|---|---|---|
| Faster R-CNN+ATT | CVPR2018 | VGG-16 | 16.0 | 48.1 | 56.7 | - |
| ALFNet | ECCV2018 | ResNet-50 | 12.0 | 36.5 | 51.9 | 0.27s / img |
| CSP | CVPR2019 | ResNet-50 | 11.0 | 34.7 | 49.3 | 0.33s / img |
| PRNet | ECCV2020 | ResNet-50 | 10.8 | - | 53.3 | 0.22s / img |
| DRNet (ours) | - | VGG-16 | 10.2 | 29.7 | 46.9 | 0.12s / img |
| | | | 9.5 | 27.0 | 43.6 | 0.14s / img |
| DRNet (ours) | - | ResNet-50 | 10.1 | 25.9 | 46.2 | 0.15s / img |
| | | | 9.1 | 25.4 | 42.1 | 0.18s / img |
4.1 Experimental Setup
The proposed method is built on the Faster R-CNN baseline, pre-trained on ImageNet. We optimize the network using Stochastic Gradient Descent (SGD) with 0.9 momentum and 0.0005 weight decay, trained on one 1080Ti GPU with a mini-batch of 1 image per GPU. For each of the Caltech and Citypersons datasets, the network is trained for a fixed number of iterations at an initial learning rate, which is then decayed for the remaining iterations. All images are kept at their original scale during training and testing. The parameters $\lambda_1$, $\lambda_2$ and $\lambda_3$ are set to 1, 1 and 0.01, respectively, and $N_{pos}$ and the number of negatives are both set to 256 as usual. Specifically, our depth-estimation model is VNL; a better depth-estimation model may yield better performance, but that is beyond the main scope of our method.
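For reference, a single SGD update with the momentum and weight decay values above can be written out explicitly. This is a plain-Python sketch of the standard update rule, not tied to any framework:

```python
def sgd_step(w, g, v, lr, momentum=0.9, weight_decay=5e-4):
    """One SGD update with momentum and L2 weight decay folded into the
    gradient: v <- momentum*v + (g + wd*w); w <- w - lr*v.

    `w`, `g`, `v` are lists of parameters, gradients and momentum
    buffers; returns the updated parameters and buffers.
    """
    new_v = [momentum * vi + (gi + weight_decay * wi)
             for vi, gi, wi in zip(v, g, w)]
    new_w = [wi - lr * nvi for wi, nvi in zip(w, new_v)]
    return new_w, new_v
```

With zero gradient, the weight still shrinks slightly each step due to the weight-decay term.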
4.2 Evaluation Metrics
In experiments, we use the standard log-average miss rate (MR) over the False Positives Per Image (FPPI) range [10⁻², 10⁰]. This metric is similar in spirit to Average Precision (AP), but it focuses more on objects that are not detected.
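Concretely, the metric can be sketched as: sample the miss-rate curve at nine FPPI points log-spaced in [10⁻², 10⁰] and take the geometric mean. This is a sketch of the standard protocol, assuming a precomputed (FPPI, miss-rate) curve:

```python
import numpy as np

def log_average_miss_rate(fppi, miss_rate):
    """Log-average miss rate over FPPI in [1e-2, 1e0].

    `fppi` and `miss_rate` are 1-D arrays describing the detector's
    miss-rate curve, with `fppi` strictly increasing. The curve is
    interpolated in log-FPPI space at 9 reference points, and the
    geometric mean of the sampled miss rates is returned.
    """
    refs = np.logspace(-2.0, 0.0, 9)
    mrs = np.interp(np.log(refs), np.log(fppi), miss_rate)
    # geometric mean = exp of the mean log miss rate (clipped for log)
    return float(np.exp(np.mean(np.log(np.maximum(mrs, 1e-10)))))
```

A detector with a flat miss rate of 0.1 across the whole FPPI range scores exactly 0.1 (i.e., 10% MR).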
On Caltech and Citypersons, we report results across different occlusion degrees: Reasonable, Heavy and Partial, with visibility ratios of (0.65, 1], (0, 0.65) and (0.65, 1), respectively. In all subsets, pedestrians taller than 50 pixels are used for evaluation. It is worth noting that Heavy is designed to evaluate performance under severe occlusion. To further demonstrate our performance, we design two special settings. The first validates localization accuracy: we test on the Reasonable set not only under the standard IoU threshold but also under a stricter one. The second splits from Citypersons a new subset called Large Height Variation (LHV), in which each image contains pedestrian height variation larger than 50 pixels with some occlusion (visibility ratio in [0.2, 0.9]).
| Backbone | Scale | IoU Assignment | Direction Assignment | Refinement | R | Heavy | Partial | Bare |
|---|---|---|---|---|---|---|---|---|
4.3 Main Results
We compare DRNet with corresponding methods on the Citypersons and Caltech datasets in Table 1 and Table 4. For fair comparisons, we report results in terms of backbone, scale, inference time and all challenging subsets.
(1) Citypersons Dataset. Table 1 reports the results compared to the state of the art on Citypersons. First, our algorithm achieves significant performance improvements under various fair settings. For example, when leveraging ResNet-50, ours achieves the highest accuracy, with an improvement of 0.9% MR over the closest competitor CSP on the Reasonable subset and 8.8% MR under the stricter IoU setting. Furthermore, Table 1 also demonstrates a self-contained ability to handle occlusion in crowded scenes: on the Heavy occlusion subset, ours reports a new state of the art of 42.1% MR. This is probably because harder samples are mined with the introduced directional assignment metric, thus training a more discriminative predictor. Since we mainly improve the training-time matching process, there is no additional modification to the network structure; hence there is no extra computational cost during testing, and our inference time is comparable to others.
(2) Caltech Dataset. We also test our method on Caltech; the comparison with the state of the art on this benchmark is shown in Table 4. Our method achieves an MR of 3.08% under an IoU threshold of 0.5, which compares favorably with the best competitor (3.80% for CSP). Besides, at the stricter Heavy occlusion level, our method achieves 30.45% MR, outperforming all previous state-of-the-art methods with an improvement of 6.05% MR over CSP. This indicates that our method has substantially better localization accuracy.
To show the effectiveness of each proposed component, we report the overall ablation studies in Table 2.
(1) Search and Assignment. As analyzed above, the IoU metric is suboptimal primarily because it struggles to answer the "which to match" question for each proposal. The results are summarized in Table 2. Evaluated on its own, the search-and-assignment strategy improves by 4.0% MR over the original IoU assignment; incorporating the refinement module gives a further consistent improvement of 1.6% MR. The visualization results in Figure 4 also show that, compared with other algorithms such as CSP or OR-CNN, ours effectively reduces false detections when pedestrian heights vary strongly within an image. Taking the first image in Figure 4 as an example, the red dotted box fails to regress toward the green box (the boy in the white T-shirt), producing a "close but not correct" false positive. Our algorithm is particularly effective in two challenging scenarios. The first is the scene with large height variance, such as Figure 4 (first row, third column): the confusion inherent in IoU assignment makes it harder to train a high-quality regressor and thus produces "close but not correct" false positives at test time, whereas ours effectively eliminates this kind of low-quality regression. The second is the occlusion scene, where our algorithm has more consistent matching targets and therefore better performance (Figure 4, second row, second column).
(2) Large Height Variance. To better verify our algorithm, we set up a Large Height Variance (LHV) subset, separated according to the maximum variance in pedestrian height per image. On one hand, Figure 5 shows the change in miss rate during training under different settings; our method detects accurately under large height variance. On the other hand, we integrate our algorithm with other detectors and verify performance on the LHV subset (Table 3). For a fair comparison, we reproduced the relevant algorithms and report both our reproduced performance and the performance reported in the corresponding papers. When integrated with other algorithms, performance improves considerably, especially on LHV. However, we also notice that although our algorithm improves different detectors (e.g., CSP or PBM), the results in Table 3 are merely competitive with the bare search algorithm in Table 1. We believe this is due to some incompatibility between the previous algorithms and our method; for example, PBM and BiBox both introduce part annotations, which causes some confusion in our assignment procedure.
Comparisons with the state-of-the-art methods on the COCO minival set; a marker denotes the corresponding detector combined with our assignment manner. Boldface indicates the best performance.
(3) Extended Experiments. To further validate our algorithm, we also report results on the COCO minival set compared with other detectors. We use a COCO training setting that matches common practice in batch size, frozen Batch Normalization, learning rate, etc. For ablation studies, we use a ResNet-50 backbone and run 135k iterations of training. Adding the proposed label assignment improves other widely-used frameworks, e.g., by 3.6% on FPN, 5.2% on RetinaNet and 4.8% on FCOS. Furthermore, compared with other label assignment strategies, ours also shows competitive performance in Table 5. Although our algorithm is designed specifically for pedestrian detection, it still improves performance on a more general detection dataset such as COCO; we believe performance would improve further with a more refined general-purpose design.
In this paper, we present a simple yet effective pedestrian detector with a novel assignment strategy, achieving competitive accuracy and inference time compared with state-of-the-art methods. On top of a backbone, the proposed method can serve as a metric incorporated into other pedestrian detectors, and experimental results show consistent improvements on popular pedestrian detection benchmarks, e.g. Citypersons and Caltech. The design is flexible, independent of the backbone network, and not limited to the two-stage detection framework; it would therefore be interesting to combine the proposed assignment strategy with other detectors such as FCOS and YOLO, which we leave for future work. Furthermore, we plan to extend the method beyond pedestrian detection to more general object detection tasks.
- (2017) Illuminating pedestrians via simultaneous detection and segmentation. In International Conference on Computer Vision (ICCV).
- (2016) A unified multi-scale deep convolutional neural network for fast object detection. In European Conference on Computer Vision (ECCV).
- (2018) Cascade R-CNN: delving into high quality object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2012) Pedestrian detection: an evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 34, pp. 743.
- (2009) Pedestrian detection: a benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 304–311.
- (2020) NMS by representative region: towards crowded pedestrian detection by proposal pairing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2020) Probabilistic anchor assignment with IoU prediction for object detection. arXiv preprint arXiv:2007.08103.
- (2017) Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2017) Focal loss for dense object detection. In International Conference on Computer Vision (ICCV).
- (2016) SSD: single shot multibox detector. In European Conference on Computer Vision (ECCV).
- (2018) Learning efficient single-stage pedestrian detectors by asymptotic localization fitting. In European Conference on Computer Vision (ECCV).
- (2019) High-level semantic feature detection: a new perspective for pedestrian detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2017) Deep matching prior network: toward tighter multi-oriented text detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2019) Mask-guided attention network for occluded pedestrian detection. In International Conference on Computer Vision (ICCV).
- (2017) YOLO9000: better, faster, stronger. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 39, pp. 1137–1149.
- (2018) Small-scale pedestrian detection based on somatic topology localization and temporal feature aggregation. In European Conference on Computer Vision (ECCV).
- (2019) FCOS: fully convolutional one-stage object detection. In International Conference on Computer Vision (ICCV).
- (2019) Region proposal by guided anchoring. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2018) Repulsion loss: detecting pedestrians in a crowd. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2020) Progressive refinement network for occluded pedestrian detection. In European Conference on Computer Vision (ECCV).
- (2019) FreeAnchor: learning to match anchors for visual object detection. In Neural Information Processing Systems (NeurIPS).
- (2018) MetaAnchor: learning to detect objects with customized anchors. In Neural Information Processing Systems (NeurIPS).
- (2019) Enforcing geometric constraints of virtual normal for depth prediction. In International Conference on Computer Vision (ICCV).
- (2016) Is Faster R-CNN doing well for pedestrian detection? In European Conference on Computer Vision (ECCV).
- (2017) CityPersons: a diverse dataset for pedestrian detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2018) Occluded pedestrian detection through guided attention in CNNs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2019) Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. arXiv preprint arXiv:1912.02424.
- (2018) Occlusion-aware R-CNN: detecting pedestrians in a crowd. In European Conference on Computer Vision (ECCV).
- (2018) Bi-box regression for pedestrian detection and occlusion estimation. In European Conference on Computer Vision (ECCV).