The vision community has rapidly improved performance on object recognition tasks, including object detection, instance segmentation, and human pose estimation. Among these, object detection is a prerequisite for many downstream applications. Detector performance has progressed dramatically with the help of powerful backbone networks, careful design of optimization objectives, and well-annotated datasets.
However, detecting objects of widely varying scales remains challenging, especially objects of extreme size. As shown in Table 1, AP on small objects falls far below that on medium and large objects. To alleviate the scale variation problem, state-of-the-art detectors rely on feature pyramids or image pyramids. On the feature-pyramid side, FPN constructs a multi-stage network that makes parallel predictions, each stage handling an isolated scale range. On the image-pyramid side, the simple multi-scale training & testing strategy still plays a role in multiple recognition tasks. In particular, [26, 27, 20] found that ignoring loss signals from extremely tiny and large objects can improve detection accuracy.
It should be pointed out that each of the aforementioned methods has its own defects. According to , the receptive field and semantic scope of the same RoI on a feature map should be consistent. For a large object, the receptive field may be insufficient; for a small one, the semantic scope is large compared with the object's size. Consistency therefore holds only when the object's size falls into a moderate range. Multi-scale training resizes images to different resolutions, bringing some objects to normal scales, but it also pushes other objects to extreme scales with great inconsistency, which degrades final accuracy. As shown in Table 2, a detector trained on an image pyramid with a wider scale range is inferior to one trained on the normal scale range. FPN employs a heuristic rule that makes the feature map at each stage responsible for RoIs in an isolated scale range. It still suffers from the inconsistency when encountering tiny objects (downsampled to sub-pixels and hard to recognize) or large objects (the receptive field cannot cover them). SNIP ignores extremely tiny and large objects, which removes the inconsistency on these extreme samples, but inconsistency remains in its use of scale ranges. A detailed analysis is given in Section 3.2.
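The heuristic rule mentioned above maps each RoI to a pyramid stage by its area; a minimal sketch, using the formula and the commonly used defaults from the FPN paper (canonical size 224 at stage P4, stages clamped to P2–P5):

```python
import math

def fpn_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Assign an RoI of size (w, h) to FPN stage P_k: RoIs at the canonical
    scale go to P_k0, and each doubling/halving of scale moves one stage."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k_max, k))
```

Under this rule a 224x224 RoI lands on P4, while very small RoIs are clamped to P2, where the feature map may already be too coarse for near-sub-pixel objects.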
We propose a simple and effective approach, called Consistent Scale Normalization (CSN), to further alleviate the inconsistency caused by scale variation. CSN also trains the detector on an image pyramid, but only optimizes the model on objects in a moderate scale range. When integrated with FPN, the heuristic rule in FPN distributes RoIs over a large scale range to multiple stages, each of which processes RoIs in a smaller range. This eases learning and enlarges the feasible range of object scales, yielding more available training samples and better generalization. Our contributions can be summarized as follows:
We propose CSN, which excludes RoIs of extreme scale in both training and testing; this strategy alleviates the negative impact of large scale variation. Without any modification to the network structure, CSN obtains state-of-the-art results with a single ResNet-101 model on the COCO object detection benchmark.
We extend CSN to other recognition tasks, i.e., instance segmentation and keypoint detection. Comparable improvements are achieved, especially for keypoint detection (+3.6% AP), which is more sensitive to scale variation. To the best of our knowledge, CSN achieves the best reported performance of a single ResNet-50 model on both tasks.
CSN benefits detectors with various backbones, especially tiny ones like ResNet-18 and MobileNet-v2 (e.g., on a ResNet-18 based Faster R-CNN, CSN boosts mAP by 5 points over ordinary multi-scale training & testing). For fast inference, models are usually tested at a single scale; we find that a model trained with CSN still outperforms the ordinary one in this setting. This makes CSN a completely cost-free freebie for object recognition in practical applications.
| img-scale | AP | AP_S | AP_M | AP_L |
| --- | --- | --- | --- | --- |
| [640, 800] | 28.8 | 13.0 | 30.8 | 40.8 |
| [160, 1600] | 27.8 | 12.3 | 29.4 | 40.4 |
2 Related Works
Driven by the representational capacity of deep convolutional features, CNN-based detectors [6, 24, 18, 22] are the dominant paradigm in the object detection community. The R-CNN series and its variants [24, 8, 24] gradually pushed the performance upper bound of two-stage detectors. In particular, Faster R-CNN adopts a backbone network shared between proposal generation and RoI classification, enabling real-time detection and higher accuracy. Mask R-CNN introduced a multi-task framework for related recognition tasks, such as instance segmentation and human pose estimation, achieving higher accuracy with a concise procedure and structure. On the other hand, one-stage detectors such as YOLO-v3, RetinaNet, and CornerNet predict in a single stage, while struggling to catch up with the top two-stage detectors. To compare and verify the effectiveness of the proposed method, CSN models are implemented based on Faster R-CNN and Mask R-CNN for the various tasks.
Detectors still have difficulty with objects exhibiting large scale variation. Early solutions learn scale-invariant representations to tackle the problem. Among traditional methods, the Haar face detector [29, 28] and DPM become more scale-robust with the help of image pyramids. SSD, SSH, and MS-CNN detect small objects at lower layers and large objects at higher layers. Going further, FPN fuses feature maps at adjacent scales to combine semantics from upper layers with details from lower layers; objects of different sizes are predicted at the corresponding levels according to a heuristic rule. SAFD predicts a scale distribution histogram of faces, which guides zoom-in and zoom-out for face detection. FSAF dynamically selects the most suitable feature level for each object during training, in contrast to FPN. SNIP assumes it is easier for detectors to generalize well to objects of moderate size, so only objects in a normal scale range are used to help the model converge better. Further, SNIPER effectively mines generated chips for better results.
Current models still suffer from large scale variation and cannot obtain satisfactory accuracy even with multi-scale training & testing. This section introduces our consistent scale normalization (CSN) to handle this better. Concretely, Section 3.1 recaps the common Faster R-CNN detector. Section 3.2 analyzes the drawbacks of SNIP, which motivate CSN. Section 3.3 details CSN's object sampling mechanism, including its effects on FPN. Section 3.4 describes CSN on various recognition tasks.
3.1 Faster R-CNN detector recap
Faster R-CNN and its variants are currently the leading detectors in the object detection community. They consist of two stages. In the first stage, a region proposal network (RPN) generates a set of RoIs (Regions of Interest) from pre-defined anchors; these RoIs indicate regions where objects may exist in the image. In the second stage, a Fast R-CNN head extracts a fixed-size feature for each RoI from the feature map shared with the RPN, implemented with the RoIPooling or RoIAlign operator. Finally, these features are sent to two independent subnets for category classification and box regression, respectively.
In Faster R-CNN, all ground-truth object bounding-boxes (gt-bboxes) in the current image are collected to participate in training.
3.2 Object scale range
We argue that ConvNets intrinsically suffer from large scale variation in training, as shown in Table 1. An ordinary solution is to diminish the variation, namely by sampling objects in a moderate scale range. SNIP gives a detailed analysis of the effect of object scale on training. However, we argue that there is an inconsistency in SNIP's object selection mechanism.
3.2.1 Inconsistency in SNIP
SNIP is a training scheme designed for multi-scale training, i.e., with an image pyramid. It excludes objects of extreme size in the training phase. In detail, a valid training range in the original image is carefully tuned for each image resolution. Then, at a given resolution, only the RoIs whose area (w.r.t. the original image) falls in that resolution's valid range participate in training; the others are ignored.
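SNIP's per-resolution selection can be sketched as follows; the three valid ranges below are hypothetical placeholders, not SNIP's tuned values:

```python
def snip_valid(roi_area_orig, resolution_idx, valid_ranges):
    """An RoI participates in training at resolution i only if its area,
    measured in the ORIGINAL image, lies inside that resolution's own
    carefully tuned valid range; otherwise it is ignored."""
    lo, hi = valid_ranges[resolution_idx]
    return lo <= roi_area_orig < hi

# Hypothetical 3-level pyramid: the upscaled resolution trains on small
# objects, the downscaled one on large objects.
VALID_RANGES = [(0, 80 ** 2), (32 ** 2, 150 ** 2), (120 ** 2, float("inf"))]
```

Because each resolution carries its own range in original-image coordinates, two objects that end up at the same size after resizing can still be treated differently.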
However, there is a fundamentally unreasonable case under this training paradigm: objects with nearly the same scale in the resized images may not take part in training together. As illustrated in Fig 1, the man on the left in Fig 1(b) and the leftmost woman player in Fig 1(d) share the same scale in the resized resolutions, yet the former takes part in training while the latter does not.
Fig 1(e) shows the distribution of training objects in SNIP: plenty of extremely tiny and huge objects participate in training. Fig 1(f) shows the distribution of ignored objects; they overlap considerably with the trained objects, which causes the contradiction in the case above. In a word, SNIP's ignoring mechanism is implicit and inconsistent, so the model's behavior is uncertain and the most suitable testing scale is indeterminate. In SNIP, the valid range for testing is obtained via a greedy search on the val set.
We therefore argue that the valid range is inconsistent between the training and testing phases, as well as across resolutions. This results in burdensome hyper-parameter tuning: for a three-level image pyramid, one needs to tune three pairs of valid ranges for training and another three pairs for testing. The inconsistency can also hamper model performance due to the self-contradictory sampling strategy.
3.3 Consistent Scale Normalization
Previous scale normalization methods, e.g., SNIP, suffer from self-contradiction in their sampling strategies. Our consistent scale normalization (CSN) is an improved multi-scale training & testing strategy with a consistent scale normalization on objects.
3.3.1 Consistent scale normalization
Using the same chip generation strategy as SNIPER, CSN applies scale normalization with a single consistent scale range, called the CSN range, across all phases and resolutions. In the training phase, a set of scaling factors is pre-defined, and the original image is resized by each factor to produce the corresponding resolution. Each RoI whose scale at that resolution falls within the CSN range is marked valid, otherwise invalid, as shown in Fig 2. Invalid objects constitute ignored regions that do not contribute to training. In the testing phase, at each resized resolution, all predicted RoIs whose scale falls within the same CSN range are kept for post-processing. In this way, all objects are consistently normalized to a moderate scale range. This selection strategy is clearly superior. On the one hand, the uniform scale range eliminates scale adaptation between the training and testing phases. On the other hand, consistent scale normalization reduces the scale fluctuation of objects trained at different image resolutions. Moreover, CSN tunes only two parameters, the bounds of the CSN range, no matter how many image resolutions are used, whereas SNIP's parameter count grows linearly with the number of resolutions; for a 3-level image pyramid, SNIP has 12 scale-range parameters.
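The selection rule can be sketched as follows (the CSN range and the scaling factor used here are assumed placeholders, not the tuned values from Sec. 4.2.1); the key point is that a single range is checked against the object's scale after resizing, identically in every phase and at every resolution:

```python
import math

def csn_valid(w, h, scale_factor, csn_range=(32, 224)):
    """Valid iff the object's scale (sqrt of its area) AFTER resizing by
    scale_factor falls in the single, shared CSN range."""
    lo, hi = csn_range
    scale = math.sqrt((w * scale_factor) * (h * scale_factor))
    return lo <= scale <= hi
```

A 400x400 object, for example, is invalid at full resolution but becomes valid at the 0.5x resolution; the same range decides both, in training and testing alike.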
3.3.2 Feature pyramid integration
Restricting the scale range of objects decreases the number of valid training objects, which can lead to over-fitting, as  shows. Even multi-scale training cannot ease this over-fitting.
CSN therefore turns to the feature pyramid network (FPN) for help. FPN adopts separate stages to predict objects in disjoint scale ranges and is better at detecting objects over a wide scale range. Even if the CSN scale range is enlarged when integrating FPN, each stage can still be trained well because its own range remains reduced. FPN thus not only provides a more powerful feature representation for objects but also enlarges the feasible scale range. Our experiments also verify the positive effect of CSN on FPN.
| Method | Backbone | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SNIP (w DCN) | ResNet-50 | 43.6 | 65.2 | 48.8 | 26.4 | 46.5 | 55.8 |
| SNIPER (w DCN) | ResNet-50 | 43.5 | 65.0 | 48.6 | 26.1 | 46.3 | 56.0 |
| CSN (w DCN) | ResNet-50-FPN | 45.3 | 66.9 | 50.7 | 32.1 | 47.1 | 54.9 |
| SNIPER (w DCN) | ResNet-101 | 46.1 | 67.0 | 51.6 | 29.6 | 48.9 | 58.1 |
| CSN (w DCN) | ResNet-101-FPN | 46.5 | 67.8 | 52.0 | 32.1 | 48.3 | 56.4 |
| Method | Backbone | AP | AP_50 | AP_75 | AP_S | AP_M | AP_L |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Faster R-CNN (our impl.) | ResNet-50-FPN | 38.0 | 58.6 | 41.8 | 22.0 | 41.5 | 48.8 |
| +MS Train&MS Test | ResNet-50-FPN | 40.9 | 62.2 | 45.2 | 26.9 | 43.9 | 51.2 |
| +MS Train&MS Test | ResNet-50-FPN | 43.0 | 64.7 | 47.8 | 28.0 | 46.2 | 54.6 |
3.4 CSN on recognition
Inspired by CSN's success on object detection, we also verify its effect on other instance-level recognition tasks: instance segmentation and human pose estimation. Instance segmentation aims to precisely localize and predict a pixel-wise mask for every instance in an image; human pose estimation aims to accurately localize person keypoints of 17 categories. He et al. proposed a unified framework, Mask R-CNN, that solves both tasks. We apply CSN to Mask R-CNN for better recognition accuracy: CSN simply filters out objects outside the given scale range, and Mask R-CNN is otherwise used unchanged in both training and testing.
4.1 Common settings
We conduct experiments on three tasks: object detection, instance segmentation, and human pose estimation. Experiments are performed on the COCO dataset following the official split: models are trained on the 118k train set and evaluated on the 5k validation set. Final detection results for comparison are submitted to the official COCO test server.
Unless otherwise specified, the following settings apply to both the baseline and CSN.
All models are implemented in the same codebase, based on MXNet (https://mxnet.apache.org/), for fair comparison. The training and testing hyper-parameters largely follow Mask R-CNN, with a few modifications.
Regarding network structure, all models use a 2-fc-layer detection head, unlike , which attaches conv5 as the hidden layers of the head. For small backbones such as ResNet-18, ResNet-18-FPN, and MobileNet-v2-FPN, the dimensions of the two fc layers are 512 and 256; other backbones use two 1024-dimensional layers. Besides, we replace conventional NMS with Soft-NMS for post-processing after the detection head. Other settings generally follow Mask R-CNN.
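Soft-NMS replaces hard suppression with score decay; a minimal sketch of the Gaussian variant (the sigma value is the common default from the Soft-NMS paper, not a value tuned in this work):

```python
import math

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def soft_nms(boxes, scores, sigma=0.5):
    """Gaussian Soft-NMS sketch: the highest-scoring box is kept, and every
    remaining box has its score decayed by exp(-IoU^2 / sigma) instead of
    being discarded outright."""
    boxes = [list(b) for b in boxes]
    scores = list(scores)
    kept = []
    while scores:
        i = max(range(len(scores)), key=scores.__getitem__)
        kept.append((boxes[i], scores[i]))
        best = boxes.pop(i)
        scores.pop(i)
        for j in range(len(scores)):
            scores[j] *= math.exp(-box_iou(best, boxes[j]) ** 2 / sigma)
    return kept
```

A low-score threshold is usually applied afterwards to drop heavily decayed boxes; that final filtering step is omitted here for brevity.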
For the training scheme, all models are trained on 8 GPUs with synchronized SGD for 18 epochs. The learning rate is initialized to a value linear in the mini-batch size, and is divided by a fixed factor at two later epochs successively. A warm-up strategy is used for the first epoch.
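The schedule can be sketched as follows; the base learning rate, decay epochs, and decay factor below are assumed placeholder values (the paper's exact numbers are not reproduced here), and only the warm-up-then-step shape matches the description:

```python
def learning_rate(epoch, it, iters_per_epoch,
                  base_lr=0.02, warmup_epochs=1,
                  decay_epochs=(12, 16), decay_factor=0.1):
    """Linear warm-up over the first epoch(s), then step decay: the rate is
    multiplied by decay_factor at each listed epoch."""
    if epoch < warmup_epochs:
        frac = (epoch * iters_per_epoch + it) / (warmup_epochs * iters_per_epoch)
        return base_lr * frac
    lr = base_lr
    for e in decay_epochs:
        if epoch >= e:
            lr *= decay_factor
    return lr
```

In the linear-scaling rule, `base_lr` itself would additionally be multiplied by (mini-batch size / reference batch size), as described in the large-minibatch SGD reference.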
Baseline models are all trained and tested at a single fixed scale under the single-scale strategy. The multi-scale training & testing strategy follows : training scales are randomly sampled from a range of pixel sizes with a fixed step, while testing uses a similar range with its own step, resulting in 7 scales for training and 9 scales for testing.
The scaling factors of the CSN models are shared between the training and testing phases. The CSN range is set according to the experiments in Sec. 4.2.1.
In the following, CSN experiments on the different recognition tasks are described in detail.
4.2 CSN on object detection
The detector is based on the two-stage Faster R-CNN framework, i.e., the detection part of Mask R-CNN. Backbones include vanilla ResNet, ResNet-FPN (ResNet with FPN), and MobileNet-v2-FPN. We report the standard COCO detection metrics, including AP (mean AP over multiple IoU thresholds), AP_50 and AP_75, and AP_S, AP_M, AP_L (AP at different scales). All training and testing implementation details are as in Section 4.1.
4.2.1 CSN range
CSN introduces consistency in the scale range; the only remaining problem is finding the best CSN range. We propose a simple greedy policy that iteratively adjusts the upper and lower bounds to find a locally best AP. Experiments on the different range candidates are performed to evaluate AP; Table 5 shows some CSN range candidates with the corresponding accuracy statistics. The iterative search procedure is described below.
The initial range is set according to the object-size limits of the COCO dataset. First, the upper bound is fixed and several lower bounds are evaluated. The AP with the initial lower bound is low; raising the bound increases AP by excluding many hard tiny objects during training, but lifting it further deteriorates AP because too many small objects are then ignored in the testing phase. A locally optimal lower bound is thus obtained. Second, the lower bound is fixed at that value and several upper bounds are evaluated. Moderately reducing the upper bound increases AP, because some extremely large objects, which evidently disturb the learning process, are filtered out during training. However, reducing it further deteriorates AP, particularly on larger objects. We believe each FPN stage learns from objects of a certain size interval under the heuristic rule, and discarding all objects in a stage's interval causes under-fitting of that stage due to the lack of training samples; reducing the upper bound even more deteriorates AP accordingly. A locally optimal upper bound is thus obtained. Third, the lower bound is re-examined with the new upper bound, and the AP does not improve. The iterative search therefore terminates, and the resulting best CSN range is used in all following experiments.
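The search above is a coordinate descent over the two bounds; a sketch, where `evaluate_ap` stands in for the expensive train-and-evaluate step and the candidate lists are hypothetical:

```python
def greedy_range_search(evaluate_ap, lower_cands, upper_cands,
                        init_upper=float("inf"), max_rounds=10):
    """Alternately fix one bound and choose the other from its candidate
    list by validation AP, stopping when neither bound changes (a local
    optimum). evaluate_ap takes a (lower, upper) range and returns AP."""
    lo, hi = None, init_upper
    for _ in range(max_rounds):
        new_lo = max(lower_cands, key=lambda c: evaluate_ap((c, hi)))
        new_hi = max(upper_cands, key=lambda c: evaluate_ap((new_lo, c)))
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return lo, hi
```

In practice each `evaluate_ap` call means training a detector with that CSN range, so the candidate lists are kept short and the search stops after one or two rounds.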
4.2.2 Main Results
The comparison of CSN with other state-of-the-art methods on COCO test-dev is shown in Table 3. To the best of our knowledge, for a Faster R-CNN with a ResNet-50-FPN backbone (with deformable convolutions), the final mAP on COCO test-dev surpasses previous detectors with the same backbone and architecture by a large margin. CSN, the baseline, and other methods are also compared on COCO val in Table 4. With the ResNet-50-FPN backbone, the 38.0 AP of our re-implemented single-scale baseline is comparable to that in Detectron. While multi-scale training & testing improves the baseline by 2.9 points, CSN improves it by 4.9 points. Deformable convolution (DCN) is also introduced for its strength at modelling object deformation: with DCN, the baseline rises to 41.1, and CSN again surpasses the multi-scale competitor by a 2-point gap. These results demonstrate the effectiveness and compatibility of CSN. In addition, on the MobileNet-v2 backbone, CSN improves by a huge 14.5 points over the SSDLite version.
4.2.3 Results on small backbones
We also evaluate CSN on small network architectures, including vanilla ResNet-18, ResNet-18-FPN, and MobileNet-v2-FPN. For the ResNet-18 based Faster R-CNN in Table 6, multi-scale training & testing improves the single-scale AP by about 2 points, and CSN adds a remarkable 5 more points, with AP at all scales improving steadily. Moreover, as Table 6 shows, FPN improves the single-scale AP by 4 points, and multi-scale FPN increases it further; CSN with FPN still adds 3.1 points. This demonstrates that CSN consistently benefits the detector with or without FPN. What's more, vanilla Faster R-CNN with CSN even surpasses multi-scale FPN without CSN by 0.5 points, manifesting CSN's superiority. We also apply CSN to MobileNet-v2-FPN, as shown in Table 6; CSN steadily improves by more than 2 points over multi-scale training & testing.
4.2.4 CSN for efficient detection
Although CSN gives models an impressive accuracy boost, it may be criticized for its time-consuming multi-scale testing in real applications. We therefore inspect the influence of CSN on tiny models under single-scale testing, comparing two circumstances for the same tiny model: testing all images at the same pre-defined resolution, versus testing each image at its original resolution. All results are shown in Table 7; the Baseline, MS train, and CSN models are the same as in Table 6. In the first circumstance, the pre-defined resolution for CSN differs from that of the Baseline and MS train models because it is closer to the patch size used in CSN training. In both circumstances, CSN's AP beats the other two competitors. When switching from the pre-defined resolution to the original resolution, the AP of the Baseline and MS train models declines sharply, while CSN's AP decreases by only 1 point. This also proves CSN's robustness to scale variation.
In real applications, strict detection accuracy is often required only for objects in a common scale range. As Table 7 shows, CSN's AP in this range surpasses the other two methods by a large margin, because CSN training makes the model focus on learning object representations in this range. In summary, CSN achieves better accuracy in a specific scale range while being faster with single-scale testing, suggesting its potential in real scenarios such as security monitoring.
4.3 CSN on other recognition tasks
Experiments are also performed on two other object-related recognition tasks: instance segmentation and human pose estimation (i.e., person keypoint detection). Each is evaluated with its respective AP metric; as with detection, higher AP means higher accuracy. For economy, the backbones used here are only ResNet-FPN (18 and 50), and only the multi-scale strategy is compared against CSN. Details follow.
| Backbone | Strategy | AP^bb | AP^mask | AP_50 | AP_75 | AP_S | AP_M | AP_L | AR_S | AR_M | AR_L |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ResNet-18-FPN | MS Train&MS Test | 36.1 | 34.1 | 53.8 | 36.6 | 16.2 | 36.3 | 49.3 | 36.6 | 56.4 | 67.6 |
| ResNet-50-FPN | MS Train&MS Test | 41.0 | 37.6 | 58.3 | 40.6 | 18.3 | 40.8 | 53.5 | 39.1 | 59.5 | 70.0 |
| ResNet-50-FPN-DCN | MS Train&MS Test | 42.6 | 39.0 | 60.2 | 42.0 | 19.6 | 42.1 | 56.1 | 39.8 | 59.6 | 70.2 |
| Backbone | Strategy | AP | AP_50 | AP_75 | AP_M | AP_L |
| --- | --- | --- | --- | --- | --- | --- |
| ResNet-18-FPN | MS Train&SS Test | 57.8 | 81.9 | 62.2 | 52.7 | 65.5 |
| | MS Train&MS Test | 58.7 | 81.7 | 63.5 | 54.0 | 66.6 |
| | MS Train&SS Test | 61.2 | 84.1 | 66.4 | 56.8 | 68.2 |
| | MS Train&MS Test | 61.8 | 83.7 | 66.7 | 57.3 | 69.5 |
| ResNet-50-FPN-DCN | MS Train&SS Test | 62.6 | 85.0 | 67.9 | 57.4 | 70.7 |
| | MS Train&MS Test | 62.9 | 84.4 | 68.2 | 58.4 | 70.7 |
4.3.1 CSN on instance segmentation
Detection and instance segmentation results with different backbones are shown in Table 8. Mask AP with CSN gains more than 1 point over multi-scale training & testing on both shallow and deep backbones. Even when the DCN component is introduced, the gain remains clear and robust, indicating the compatibility of CSN with other strong components (e.g., DCN). It is worth noting that CSN also improves the AR metric on objects of different sizes, which matters for real applications.
4.3.2 CSN on human pose estimation
Generally speaking, there are two common modes of human pose estimation: 1) from the image, i.e., detecting a person bounding-box, cropping it from the original image, resizing it to a standard scale, and finally predicting keypoint heatmaps directly with a backbone; 2) from the feature map, i.e., detecting person instances and their keypoints simultaneously, which amounts to predicting heatmaps from an intermediate feature map. The former is usually more accurate than the latter because of its explicit scale normalization in image space, i.e., cropping and resizing to a uniform resolution. Our keypoint experiments use COCO-kps data and the Mask R-CNN structure, i.e., the second mode. The resulting accuracy boost on pose estimation implies that CSN provides an implicit normalization effect.
The original Mask R-CNN models a keypoint's location distribution on an RoI as a one-hot mask and predicts one heatmap per keypoint type. However, this kps head (the head for human pose estimation) is too heavy for real applications, especially multi-person pose estimation. We therefore make four modifications to the head structure for both the CSN and baseline models. First, to better approximate a person's shape, we use RoIAlign to extract a differently sized feature map for each person RoI. Second, to reduce computational complexity and bandwidth consumption, the channel number of the eight conv layers in the kps head is reduced from 512 to 256, and the last deconv layer is removed. Third, for each keypoint category, the kps head outputs a dense heatmap predicting whether each position lies in the vicinity of that keypoint. Finally, the head also outputs a 2-D local offset vector for each category and position to localize keypoints more precisely. Different from , a smooth-L1 loss is used to optimize both outputs separately. In addition, the patches used in CSN training here are generated from COCO keypoint data, which includes both bounding-box and keypoint annotations.
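Given the modified head's outputs, a keypoint can be decoded by combining the dense heatmap's arg-max with the local offset predicted at that cell; a sketch with an assumed feature-to-image stride (the real head's resolution and stride are not specified here):

```python
def decode_keypoint(heatmap, offsets, stride=4):
    """Pick the arg-max cell of one keypoint's dense heatmap, then refine
    the coarse grid location with the 2-D offset predicted at that cell,
    and map back to image coordinates via the feature stride."""
    best, best_pos = float("-inf"), (0, 0)
    for y, row in enumerate(heatmap):
        for x, v in enumerate(row):
            if v > best:
                best, best_pos = v, (x, y)
    x, y = best_pos
    dx, dy = offsets[y][x]
    return (x + dx) * stride, (y + dy) * stride
```

The offset branch is what recovers sub-cell precision; without it, localization error is bounded below by half the stride.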
Person detection and pose estimation results with different backbones are shown in Table 9. First, the official Mask R-CNN is compared with the CSN and MS train models on the ResNet-50-FPN backbone in the second row of Table 9. Note that our modified design nearly halves the number of floating-point operations in the kps head. We observe that the MS train model is inferior in keypoint AP to the official Mask R-CNN: with the reduced kps head, the relative capacity shift toward the detection head may make training pay more attention to the detection branch, an effect also reported in . However, even with the same lightweight design, CSN still exceeds Mask R-CNN by 1 point in AP, proving the effectiveness of CSN for human pose estimation.
Second, CSN significantly boosts human pose estimation on several backbones, as shown in Table 9. Its AP gets closer to that of image-based kps methods (the aforementioned first mode) than Mask R-CNN does. This indicates that the implicit scale normalization in CSN assists keypoint detection, consistent with the explicit scale normalization of first-mode methods.
We propose a novel paradigm, consistent scale normalization, to address the severe scale variation problem in instance-level vision tasks. CSN integrates the image pyramid (for scale normalization) and the feature pyramid (for easier learning within the CSN range) into one paradigm, achieving enhanced scale-handling capability. It significantly boosts the performance of multi-scale training & testing on object recognition tasks, i.e., object detection, instance segmentation, and human pose estimation, over strong baselines, and it also extends to more efficient detection for real applications. Overall, CSN provides a new perspective on problems arising from scale and should inspire future work on other large-variation problems.
- (1984) Pyramid methods in image processing. RCA Engineer 29(6), pp. 33–41.
- Soft-NMS: improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5561–5569.
- (2016) A unified multi-scale deep convolutional neural network for fast object detection. In Proceedings of the European Conference on Computer Vision, pp. 354–370.
- (2017) Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 764–773.
- (2010) Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(9), pp. 1627–1645.
- Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587.
- (2018) Detectron. https://github.com/facebookresearch/detectron
- (2015) Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448.
- Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.
- (2017) Scale-aware face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6186–6195.
- (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- (2018) CornerNet: detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision, pp. 734–750.
- (2019) Scale-aware Trident Networks for object detection. arXiv preprint arXiv:1901.01892.
- (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. 4.
- (2017) Focal loss for dense object detection. arXiv preprint arXiv:1708.02002.
- (2014) Microsoft COCO: common objects in context. In Proceedings of the European Conference on Computer Vision, pp. 740–755.
- (2016) SSD: single shot multibox detector. In Proceedings of the European Conference on Computer Vision, pp. 21–37.
- (2017) SSH: single stage headless face detector. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4875–4884.
- (2018) AutoFocus: efficient multi-scale inference. arXiv preprint arXiv:1812.01600.
- (2017) Towards accurate multi-person pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4903–4911.
- (2016) You Only Look Once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
- (2018) YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
- (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
- (2018) An analysis of scale invariance in object detection – SNIP. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3578–3587.
- (2018) SNIPER: efficient multi-scale training. In Advances in Neural Information Processing Systems, pp. 9333–9343.
- (2004) Robust real-time face detection. International Journal of Computer Vision 57(2), pp. 137–154.
- (2001) Rapid object detection using a boosted cascade of simple features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 1, pp. I–I.
- (2019) Region proposal by guided anchoring. arXiv preprint arXiv:1901.03278.
- (2018) Simple baselines for human pose estimation and tracking. In Proceedings of the European Conference on Computer Vision, pp. 466–481.
- (2019) Feature selective anchor-free module for single-shot object detection. arXiv preprint arXiv:1903.00621.