Consistent Scale Normalization for Object Recognition

08/20/2019 ∙ by Zewen He, et al. ∙ Horizon Robotics

Scale variation remains a challenging problem for object detection. Common paradigms usually adopt multi-scale training & testing (image pyramid) or FPN (feature pyramid network) to process objects over a wide scale range. However, multi-scale methods introduce even more scale variation, which even deep convolutional neural networks with FPN cannot handle well. In this work, we propose a paradigm called Consistent Scale Normalization (CSN) to resolve the above problem. CSN compresses the scale space of objects into a consistent range (the CSN range) in both the training and testing phases. This addresses the scale-variation problem fundamentally and reduces the difficulty of network learning. Experiments show that CSN surpasses its multi-scale counterpart significantly for object detection, instance segmentation and multi-task human pose estimation, across several architectures. On COCO test-dev, our single model based on CSN achieves 46.5 mAP with a ResNet-101 backbone, which is among the state-of-the-art (SOTA) candidates for object detection.




1 Introduction

The vision community has rapidly improved performance on object recognition, especially object detection [24], instance segmentation [11] and human pose estimation [11]. Among these tasks, object detection is a prerequisite for many downstream applications. The performance of detectors has improved dramatically with the help of powerful backbone networks [12], careful design of optimization objectives [16] and well-annotated datasets [17].

However, detecting objects of various scales remains challenging, especially for objects of extreme size. As shown in Table 1, AP on small objects falls far behind that on medium and large objects. To alleviate the scale-variation problem, state-of-the-art detectors rely on feature pyramids [15] or image pyramids [5]. On the feature-pyramid side, FPN [15] constructs a multi-stage network whose stages predict in parallel on objects of disjoint scale ranges. On the image-pyramid side, the simple multi-scale training & testing strategy still plays a role in multiple recognition tasks. In particular, [26, 27, 20] found that ignoring loss signals from extremely tiny and large objects improves detection accuracy.

It should be pointed out that the aforementioned methods each have defects. According to [30], the receptive field and the semantic scope of the same RoI on the feature map should be consistent. For a large object, the receptive field may not be sufficient; for a small one, the semantic scope is large compared to the object's size. So consistency holds only when an object's size falls into a moderate range. Multi-scale training resizes images to different resolutions, thereby resizing some objects to normal scales. But it also pushes some objects to extreme scales with great inconsistency, which degrades final accuracy. As shown in Table 2, a detector trained on an image pyramid with a wider scale range is inferior to one trained on a normal scale range. FPN [15] employs a heuristic rule that makes the feature map at each stage responsible for RoIs in a disjoint scale range. It also suffers from inconsistency when encountering tiny objects (downsampled to sub-pixels and hard to recognize) or large objects (which the receptive field cannot cover). SNIP [27] ignores extremely tiny and large objects, removing the inconsistency on these extreme samples. But there still exists inconsistency in its scale-range usage; a detailed analysis is given in Sec. 3.2.

We propose a simple and effective approach, called Consistent Scale Normalization (CSN), to further alleviate the inconsistency resulting from scale variation. CSN also trains the detector on an image pyramid, but optimizes the model only on objects in a moderate scale range. With FPN integrated, the heuristic rule in FPN distributes RoIs from a large scale range to multiple stages, each of which processes RoIs in a smaller range. This alleviates the learning difficulty and enlarges the feasible range of object scales, yielding more available training samples and better generalization. Our contributions can be summarized as follows:

  • We propose CSN to restrict RoIs of extreme scale in training and testing; this strategy alleviates the negative impact caused by large scale variation. Without any modification to the network structure, CSN obtains a SOTA result for a single ResNet-101 model on the COCO object detection benchmark.

  • We extend CSN to other recognition tasks, i.e., instance segmentation and keypoint detection. Comparable improvements are achieved, especially for keypoint detection (+3.6% AP), which is more sensitive to scale variation. To the best of our knowledge, CSN achieves the best performance of a single ResNet-50 model on both tasks.

  • CSN benefits detectors with various backbones, especially tiny ones like ResNet-18 and MobileNet-v2 (e.g., on ResNet-18 based Faster R-CNN, CSN boosts mAP by 5 points over the original multi-scale training & testing). For fast inference, models are usually tested at a single scale. We found that a model trained with CSN still achieves better performance than the ordinary one in this situation. This makes CSN a totally cost-free freebie for object recognition in practical applications.

Backbone AP AP_S AP_M AP_L
ResNet-50 34.7 16.3 38.2 49.3
ResNet-50-FPN 38.0 22.0 41.5 48.8
Table 1: Results across object scales for Faster R-CNN with a ResNet-50 backbone. Subscripts S/M/L denote AP on small, medium and large objects.
image pyramids AP AP_S AP_M AP_L
img-scale [640, 800] 28.8 13.0 30.8 40.8
img-scale [160, 1600] 27.8 12.3 29.4 40.4
Table 2: We train two detectors using image scales randomly sampled from a normal range [640, 800] and a larger range [160, 1600], respectively. Both detectors are tested at a single scale. The accuracy from the larger scale range is worse than from the normal one. Detectors are simple Faster R-CNN on ResNet-18.

2 Related Works

Driven by the representation capacity of deep convolutional features, CNN-based detectors [6, 24, 18, 22] are the dominant paradigm in the object detection community. The R-CNN series and its variants [24, 8, 24] gradually push the upper bound of performance for two-stage detectors. In particular, Faster R-CNN [24] adopts a shared backbone network for proposal generation and RoI classification, yielding near real-time detection and higher accuracy. Mask R-CNN [11] introduces a multi-task framework for related recognition tasks, such as instance segmentation and human pose estimation, achieving higher accuracy with a simple procedure and structure. On the other hand, one-stage detectors such as YOLO-v3 [23], RetinaNet [16] and CornerNet [13] predict in a single stage, while struggling to catch up with the top two-stage detectors. To compare and verify the effectiveness of the proposed method, CSN models are implemented on Faster R-CNN [24] and Mask R-CNN [11] for the various tasks.

Difficulties remain for detectors on objects with large scale variation. Early solutions learn scale-invariant representations to tackle the problem. Among traditional methods, the Haar face detector [29, 28] and DPM [5] become more scale-robust with the help of image pyramids [1]. SSD [18], SSH [19] and MS-CNN [3] detect small objects at lower layers and big objects at higher layers. Going further, FPN [15] fuses feature maps at adjacent scales to combine semantics from upper layers with details from lower ones; objects of different sizes are predicted at the corresponding levels according to a heuristic rule. SAFD [10] predicts a scale distribution histogram of faces, which guides zoom-in and zoom-out for face detection. FSAF [32] selects the most suitable feature level for each object dynamically during training, in contrast to FPN. SNIP [26] assumes it is easy for detectors to generalize well to objects of moderate size, so only objects in a normal scale range are used, making the model converge better. Further, SNIPER [27] effectively mines generated chips for better results.

Compared to SNIP and its variants [26, 27], CSN adopts a consistent scale-normalization strategy to select training samples, and integrates FPN to attain better generalization. It also extends to other recognition tasks, validating the effectiveness of our method.

3 Method

Current models still suffer from large scale variation and cannot obtain satisfactory accuracy even with multi-scale training & testing. We introduce Consistent Scale Normalization (CSN) in this section to deal with this better. Concretely, in Sec. 3.1 the common Faster R-CNN detector is recapped. In Sec. 3.2 we analyze the drawbacks of SNIP, which motivate the proposal of CSN. In Sec. 3.3 we detail the object sampling mechanism of CSN, including its effect on FPN. In Sec. 3.4, details of CSN on various recognition tasks are described.

3.1 Faster R-CNN detector recap

Faster R-CNN and its variants are currently the leading detectors in the object detection community. They basically consist of two stages. In the first stage, a region proposal network (RPN) generates a set of RoIs (Regions of Interest) from pre-defined anchors. These RoIs indicate regions where objects may exist in the image. Then, in the second stage, a Fast R-CNN [8] head extracts a fixed-size feature for each RoI from the feature map shared with RPN. This can be implemented by the RoIPooling [24] or RoIAlign [11] operator. Finally, these features are sent to two independent subnets for category classification and box regression, respectively.

In Faster R-CNN, all ground-truth object bounding-boxes (gt-bboxes) in current image are collected to participate in training.

3.2 Object scale range

We observe that ConvNets intrinsically suffer from large scale variation in training, as shown in Table 1. An ordinary solution is to diminish the large variation, namely sampling objects in a moderate scale range. SNIP [26] gives a detailed analysis of the effect of object scale on training. However, we argue that there exists inconsistency in SNIP's object selection mechanism.

(a) Original (480,640)
(b) SNIP (800,1200)
(c) Original (425,640)
(d) SNIP (480,723)
(e) Objects trained in SNIP
(f) Objects ignored in SNIP
Figure 1: Inconsistency in SNIP: In (a)-(d), each red bbox marks an object ignored by SNIP, and each green bbox marks one used by SNIP. The black number at the top right of each bbox is its scale in that image. In SNIP [26], each resolution has its own valid range in the original image. (a) and (c) are resized to (b) and (d). The man on the left in image (b) is retained, while the woman player on the leftmost in image (d) is ignored, even though the two have the same scale in the resized images: they are treated as different roles in training. For (e)-(f), the former shows the scale distribution of objects in resized images used in SNIP training, while the latter shows the distribution of ignored ones. There is a large overlap between them.

3.2.1 Inconsistency in SNIP

SNIP is a training scheme designed for multi-scale training, i.e., with an image pyramid. It aims to exclude objects of extreme size in the training phase. In detail, a valid training range in the original image is carefully tuned for each image resolution. Then, for the i-th resolution, the RoIs whose area (w.r.t. the original image) falls in the valid range participate in training; otherwise, they are ignored.

However, this training paradigm has a fundamentally unreasonable case: objects with nearly the same scale in the resized images may not take part in training together. As illustrated in Fig. 1, the man on the left in Fig. 1(b) and the woman player on the leftmost in Fig. 1(d) share the same scale in the resized resolutions, yet the former takes part in training while the latter does not.

Fig. 1(e) gives the distribution of training objects in SNIP. As it shows, plenty of extremely tiny and huge objects participate in the training phase. Fig. 1(f) exhibits the distribution of objects ignored by SNIP. The ignored objects overlap heavily with the trained ones, which is the cause of the contradiction in the case above. In a word, the ignoring mechanism in SNIP is implicit and inconsistent. As a result, the model's behavior is uncertain and the most suitable testing scale is indeterminate; in SNIP [26], the valid range for testing is obtained via greedy search on the val set.

So we argue that the valid range is not consistent between the training and testing phases, nor among different resolutions. This results in burdensome hyper-parameter tuning: for a three-level image pyramid, one needs to tune three pairs of valid ranges for training and another three pairs for testing. The inconsistency can also hamper model performance due to the self-contradictory sampling strategy.
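The contradiction above can be made concrete with a toy sketch. All range values here are hypothetical, chosen only to reproduce the effect in Fig. 1; the paper's exact per-resolution ranges are not reproduced.

```python
# SNIP ties a separate valid range, in ORIGINAL-image scale, to each
# resolution; CSN applies one range to the scale in the RESIZED image.
SNIP_VALID = {800: (80, 160), 480: (120, 400)}   # resolution -> (lo, hi), hypothetical
CSN_RANGE = (16, 560)

def snip_valid(orig_scale, resolution):
    lo, hi = SNIP_VALID[resolution]
    return lo <= orig_scale <= hi

def csn_valid(resized_scale):
    lo, hi = CSN_RANGE
    return lo <= resized_scale <= hi

# Two objects that end up with the SAME scale (240) in the resized images:
a = {"orig": 120, "factor": 2.0, "res": 800}     # 120 * 2.0 = 240
b = {"orig": 480, "factor": 0.5, "res": 480}     # 480 * 0.5 = 240
# SNIP treats them differently; CSN treats them the same.
print(snip_valid(a["orig"], a["res"]), snip_valid(b["orig"], b["res"]))        # True False
print(csn_valid(a["orig"] * a["factor"]), csn_valid(b["orig"] * b["factor"]))  # True True
```

Because SNIP's check is done in original-image scale with per-resolution bounds, two objects that look identical after resizing can receive opposite training labels, which is exactly the inconsistency CSN removes.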

Figure 2: CSN plus FPN: CSN with 3 scaling factors for multi-scale training & testing. In training, CSN first resizes the original image by each scaling factor to obtain the corresponding resolution. It then selects objects whose scale falls in the CSN range (marked by green boxes) as valid for training, and discards invalid boxes (red). In testing, only the predicted boxes from the different resolutions whose scales fall in the CSN range are preserved and fused.

3.3 Consistent Scale Normalization

The previous scale-normalization method, SNIP, has some self-contradiction in its sampling strategy. Our consistent scale normalization (CSN) is an improved multi-scale training & testing strategy with consistent scale normalization of objects.

3.3.1 Consistent scale normalization

With the same chip generation strategy as SNIPER [27], CSN adopts a single consistent scale range, called the CSN range, for all phases and resolutions. In the training phase, a set of scaling factors is pre-defined. The original image is resized by the i-th factor to obtain the i-th resolution. Each RoI whose scale on the i-th resolution falls in the CSN range is then marked valid, and invalid otherwise, as shown in Fig. 2. The invalid objects constitute ignored regions that do not contribute to training. In the testing phase, for each resized resolution, all predicted RoIs whose scale falls within the same CSN range are kept for post-processing. In this way, all objects are normalized to a moderate scale range consistently. This selection strategy has clear advantages. On the one hand, the uniform scale range eliminates the need to adapt scales between the training and testing phases. On the other hand, consistent scale normalization reduces the scale fluctuation of objects trained on different image resolutions. Moreover, CSN tunes only two parameters, the bounds of the CSN range, no matter how many image resolutions are used, while the number of parameters in SNIP grows linearly with the number of resolutions: for a 3-level image pyramid, SNIP has 12 scale-range parameters.
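The sampling rule can be sketched as follows. The scaling factors and CSN range below are placeholders (the actual values are chosen experimentally in Sec. 4.2.1), and object scale is taken here as the square root of box area.

```python
import math

# Hypothetical settings for illustration only.
SCALE_FACTORS = [0.5, 1.0, 2.0]   # s_i: resize factors for the image pyramid
CSN_RANGE = (16, 560)             # one range, shared by all phases/resolutions

def object_scale(w, h):
    """Scale of a box, taken as sqrt(area)."""
    return math.sqrt(w * h)

def valid_objects(boxes, s_i, csn_range=CSN_RANGE):
    """Return the boxes (x, y, w, h) whose scale on the i-th resolution
    falls in the CSN range; the rest become ignored regions there."""
    lo, hi = csn_range
    return [b for b in boxes
            if lo <= object_scale(b[2] * s_i, b[3] * s_i) <= hi]

# One small and one large object in the original image.
boxes = [(0, 0, 10, 10), (0, 0, 400, 400)]
# The small object becomes valid only at s=2.0; the large one only at s<=1.0.
print([valid_objects(boxes, s) for s in SCALE_FACTORS])
```

Each object is thus trained only at resolutions where its resized scale is moderate, which is the normalization CSN performs; at test time the same range filters the predicted boxes before fusion.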

3.3.2 Feature pyramid integration

When a scale-range limitation is set for objects, the number of valid objects decreases, which risks over-fitting, as [14] shows. Even with multi-scale training, this over-fitting cannot be eased.

CSN therefore turns to the feature pyramid network (FPN) for help. FPN adopts separate stages to predict objects in disjoint scale ranges, and has a better capability for detecting objects over a wide scale range. Even if the CSN range is enlarged when integrating FPN, each stage can still be trained well because the per-stage range remains reduced. Not only does FPN bring a more powerful feature representation for objects, but it also enlarges the feasible scale range.

The experiments also verify the positive effect of CSN on FPN.
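For reference, the heuristic RoI-to-level assignment from the FPN paper [15] can be written as a short function. The canonical level and 224-pixel canonical size follow [15]; the clamping bounds are typical choices, not values from this paper.

```python
import math

def fpn_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Heuristic RoI-to-level rule from FPN [15]:
    k = floor(k0 + log2(sqrt(w*h) / canonical)), clamped to [k_min, k_max].
    A 224x224 RoI maps to the canonical level P4; smaller RoIs map lower."""
    k = math.floor(k0 + math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k_max, k))

print(fpn_level(224, 224), fpn_level(56, 56), fpn_level(900, 900))
```

This rule is what lets CSN keep a wider CSN range when FPN is integrated: the pyramid splits that range into disjoint per-stage sub-ranges, so each stage still sees only moderately varying scales.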

Method Backbone AP AP_50 AP_75 AP_S AP_M AP_L
SNIP (w DCN) [26] ResNet-50 43.6 65.2 48.8 26.4 46.5 55.8
SNIPER (w DCN) [27] ResNet-50 43.5 65.0 48.6 26.1 46.3 56.0
CSN (w DCN) ResNet-50-FPN 45.3 66.9 50.7 32.1 47.1 54.9
SNIPER (w DCN) [27] ResNet-101 46.1 67.0 51.6 29.6 48.9 58.1
CSN (w DCN) ResNet-101-FPN 46.5 67.8 52.0 32.1 48.3 56.4
Table 3: SOTA comparison on COCO test-dev. "w DCN" indicates that deformable convolution [4] is adopted in the network. All methods are trained on COCO train2017. SNIP [26] and SNIPER [27] are recently proposed methods with very impressive performance. CSN is integrated with FPN and surpasses the previous methods by a large margin.
Method Backbone AP AP_50 AP_75 AP_S AP_M AP_L
Faster R-CNN [7] ResNet-50-FPN 37.9
Faster R-CNN (our impl.) ResNet-50-FPN 38.0 58.6 41.8 22.0 41.5 48.8
+MS Train&MS Test ResNet-50-FPN 40.9 62.2 45.2 26.9 43.9 51.2
+CSN ResNet-50-FPN 42.9 64.5 47.5 31.4 45.0 53.9
Faster R-CNN+DCN ResNet-50-FPN 41.1 62.1 45.7 24.6 44.2 53.7
+MS Train&MS Test ResNet-50-FPN 43.0 64.7 47.8 28.0 46.2 54.6
+CSN ResNet-50-FPN 45.0 66.6 50.3 34.0 48.0 56.3
SNIPER [27] ResNet-101 46.1 67.0 51.6 29.6 48.9 58.1
+CSN ResNet-101-FPN 46.4 67.8 51.9 35.2 49.0 57.7
SSD [25] MobileNet-v2 22.1
SNIPER [27] MobileNet-v2 34.1 54.4 37.7 18.2 36.9 46.2
Faster R-CNN+CSN MobileNet-v2-FPN 36.6 57.8 40.1 25.5 38.0 46.1
+DCN MobileNet-v2-FPN 38.7 60.8 42.3 27.8 40.8 48.9
Table 4: Comparison on COCO val2017. The baseline is trained and tested at a single scale (800, 1333). MS denotes multi-scale (7 scales for training and 9 for testing); the detailed scale settings follow [11]. As shown, CSN consistently improves all APs. Methods marked with superscript * are evaluated on COCO test-dev.

3.4 CSN on recognition

Inspired by the success of CSN on object detection, we also verify its effect on other instance-level recognition tasks, namely instance segmentation and human pose estimation. Instance segmentation aims to precisely localize and predict a pixel-wise mask for every instance in an image; human pose estimation aims to accurately localize person keypoints of 17 categories. He et al. [11] proposed a unified framework, Mask R-CNN, to solve both tasks. We apply CSN to Mask R-CNN for better recognition accuracy: CSN simply filters out objects that fall outside the given scale range, and otherwise uses Mask R-CNN unchanged in the training and testing phases.

4 Experiments

4.1 Common settings

We conduct experiments on three tasks: object detection, instance segmentation and human pose estimation. Experiments are performed on the COCO [17] dataset following the official split; that is, models are trained on the 118k train set and evaluated on the 5k validation set. Final detection results for comparison are submitted to the official COCO test server.

Implementation details

Unless otherwise specified, the following settings apply to both the baseline and CSN.

All models are implemented on the same MXNet-based codebase for fair comparison. The training and testing hyper-parameters are almost the same as in Mask R-CNN [11], with some modifications.

Regarding network structure, a 2-fc-layer detection head is used in all models, different from [12], which attaches conv5 as the hidden layers of the head. For small backbones, such as ResNet-18, ResNet-18-FPN and MobileNet-v2-FPN, the dimensions of the two fc layers are 512 and 256; other backbones use two 1024-d layers. Besides, we use Soft-NMS [2] instead of conventional NMS for post-processing after the detection head. Other settings generally follow Mask R-CNN [11].

Regarding the training scheme, all models are trained on 8 GPUs with synchronized SGD for 18 epochs. The learning rate is initialized in proportion to the mini-batch size, following the linear scaling rule of [9], and is decayed in steps at two later epochs successively. The warm-up strategy of [9] is also used for the first epoch.
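The schedule above can be sketched as follows. The base LR, decay factor, decay epochs and warm-up length below are illustrative assumptions, since the exact numbers are elided in the text; the linear scaling and warm-up follow [9].

```python
def learning_rate(epoch, it, iters_per_epoch, batch_size,
                  base_lr=0.02, base_batch=16,
                  decay_epochs=(12, 16), warmup_epochs=1):
    """Per-iteration LR: linear batch-size scaling, step decay, warm-up.
    All default hyper-parameters are placeholders, not the paper's values."""
    lr = base_lr * batch_size / base_batch      # linear with mini-batch size [9]
    for e in decay_epochs:                      # step decay at later epochs
        if epoch >= e:
            lr /= 10.0
    if epoch < warmup_epochs:                   # linear warm-up over first epoch [9]
        lr *= (epoch * iters_per_epoch + it) / (warmup_epochs * iters_per_epoch)
    return lr
```

For example, with these placeholders a batch of 32 would start from twice the base LR, ramp up linearly during epoch 0, then drop by 10x at each decay epoch.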

Baseline models are all trained and tested at a single scale when considering the single-scale strategy. The multi-scale training & testing strategy follows [11]: training scales are randomly sampled from a range with a fixed step, and testing likewise uses a fixed set of scales, giving 7 scales for training and 9 scales for testing.

The scaling factors of CSN models are shared between the training and testing phases. The CSN range is determined by experiments (Sec. 4.2.1).

In the following, experiments with CSN on the different recognition tasks are described in detail.

4.2 CSN on object detection

The detector is based on the two-stage Faster R-CNN [24] framework, i.e., the detection part of Mask R-CNN [11]. The backbones used here are vanilla ResNet, ResNet-FPN (ResNet with FPN) and MobileNet-v2-FPN. We report the standard COCO detection metrics, including AP (mean AP over multiple IoU thresholds), AP_50 and AP_75, and AP_S, AP_M, AP_L (AP at different scales). All training and testing implementation details are as in Sec. 4.1.

4.2.1 CSN range

CSN introduces consistency in the scale range, and the only remaining problem is how to find the best CSN range. We propose a simple greedy policy that iteratively adjusts the upper and lower bounds to find a locally best AP. Experiments on the different range candidates are performed to evaluate AP. Table 5 shows several CSN-range candidates with their accuracy statistics. The iterative search proceeds as follows.

The initial range is set to [0, 640] because of the object-size limits of the COCO dataset. First, the upper bound 640 is fixed and lower bounds 16 and 32 are evaluated. The AP of [0, 640] is only 37.4, while the AP of [16, 640] increases to 38.2 because many hard tiny objects (a sizable fraction of COCO objects have scales below 16) are excluded during training. However, further lifting the lower bound to 32 deteriorates AP because too many small objects are ignored in the testing phase. So the locally optimal lower bound is 16. Second, the lower bound is set to 16 and upper bounds 560, 496 and 320 are evaluated. From [16, 640] to [16, 560], AP, AP_M and AP_L increase because some large objects are filtered during training; this indicates that extremely large objects can disturb the learning process. However, further reducing the upper bound to 496 deteriorates AP, in particular AP_M and AP_L. We believe each stage in FPN learns from objects of different sizes by the heuristic rule [15], and discarding the objects between the two bounds causes under-fitting due to a lack of training samples; analogously, [16, 320] deteriorates AP more. So the locally optimal upper bound is 560. Third, with the upper bound set to 560, the lower bound 32 is evaluated again, and AP is not improved. The iterative search is then complete, and the best CSN range is [16, 560], which is used for the following experiments.
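The greedy procedure above can be sketched compactly. Training a detector per candidate is expensive, so `evaluate_ap` is mocked here with the AP numbers from Table 5; in practice it would train and validate a model for the given range.

```python
# AP values copied from Table 5 (ResNet-18-FPN), keyed by CSN range.
AP_TABLE = {(0, 640): 37.4, (16, 640): 38.2, (32, 640): 38.1,
            (16, 560): 38.7, (16, 496): 37.9, (16, 320): 37.2,
            (32, 560): 38.4}

def evaluate_ap(lo, hi):
    """Stand-in for: train with CSN range [lo, hi], return validation AP."""
    return AP_TABLE[(lo, hi)]

def greedy_search():
    lo, hi = 0, 640                      # initial range from COCO's object sizes
    # step 1: fix the upper bound, pick the best lower bound
    lo = max((0, 16, 32), key=lambda l: evaluate_ap(l, hi))
    # step 2: fix the new lower bound, pick the best upper bound
    hi = max((640, 560, 496, 320), key=lambda h: evaluate_ap(lo, h))
    # step 3: re-check the lower bound against the new upper bound
    lo = max((16, 32), key=lambda l: evaluate_ap(l, hi))
    return lo, hi

print(greedy_search())  # -> (16, 560)
```

The search is only locally optimal by construction, but because the two bounds interact weakly in Table 5, one extra re-check iteration already converges.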

CSN range [0,640] [16,640] [32,640] [16,560] [16,496] [16,320] [32,560]
AP_S 26.0 26.9 25.9 26.5 26.2 26.3 27.0
AP_M 39.4 40.1 40.6 40.5 39.8 39.1 40.6
AP_L 46.6 47.6 47.9 47.9 46.7 46.8 48.5
AP 37.4 38.2 38.1 38.7 37.9 37.2 38.4
Table 5: ResNet-18-FPN trained with different CSN ranges. Since object scales in COCO [17] range from 0 to 640, the range trials are limited to that region. Experimental settings are identical across the table except for the CSN range.
Method Backbone AP AP_S AP_M AP_L
Baseline ResNet-18 29.3 12.3 31.2 42.5
+MST ResNet-18 31.1 15.6 32.6 44.0
+CSN ResNet-18 36.1 22.6 40.5 47.9
Baseline ResNet-18-FPN 33.3 17.7 35.8 44.0
+MST ResNet-18-FPN 35.6 21.8 37.8 45.7
+CSN ResNet-18-FPN 38.7 26.5 40.5 47.9
Baseline MobileNet-v2-FPN 32.8 18.5 35.3 42.6
+MST MobileNet-v2-FPN 34.3 21.6 36.3 44.3
+CSN MobileNet-v2-FPN 36.6 25.5 38.0 46.1
Table 6: Results on ResNet-18, ResNet-18-FPN and MobileNet-v2-FPN. The baseline is trained and tested at a single scale. MST denotes multi-scale training & testing. Each "+" row adds an individual component to the corresponding Baseline row.
MobileNet-v2-FPN AP AR AP_[16,560] AR_[16,560]
Baseline (raw) 27.6 54.4 31.0 60.3
Baseline (800) 32.8 60.8 35.9 65.2
+MS Train (raw) 29.1 55.7 32.6 61.4
+MS Train (800) 32.9 61.1 35.9 65.4
+CSN (raw) 32.6 59.4 37.1 66.9
+CSN (600) 33.6 61.4 37.4 67.1
Table 7: Single-scale testing on MobileNet-v2-FPN. The baseline is trained at a single scale (800, 1333); the multi-scale-trained model and the model trained with CSN are also shown. Each row is tested either on the raw image or with short side 600 or 800, as indicated. Testing on the raw image is much faster than with short side 800. AP_[16,560] and AR_[16,560] are calculated on GT boxes whose scale lies in [16, 560].

4.2.2 Main Results

The comparison of CSN with other state-of-the-art methods on COCO test-dev is shown in Table 3. To the best of our knowledge, for a Faster R-CNN architecture with a ResNet-50-FPN backbone (with deformable convolutions), our final mAP on COCO test-dev surpasses previous detectors with the same backbone and architecture by a large margin. CSN, the baseline and other methods are also compared on COCO val, as shown in Table 4. For the ResNet-50-FPN backbone, the 38.0 AP of our re-implemented single-scale baseline is comparable to that in Detectron [7]. While multi-scale training & testing improves the baseline by 2.9 points, CSN improves it much more, by 4.9 points. Deformable convolution (DCN) is also introduced for its good ability to model object deformation. With DCN integrated, the baseline rises to 41.1, and CSN again surpasses the multi-scale competitor by a 2-point gap. These results demonstrate the effectiveness and compatibility of CSN. In addition, CSN brings a huge 14.5-point gain over the SSDLite [25] version with a MobileNet-v2 backbone.

4.2.3 Results on small backbones

We also evaluate CSN on other small network architectures, including vanilla ResNet-18, ResNet-18-FPN and MobileNet-v2-FPN. For ResNet-18 based Faster R-CNN in Table 6, multi-scale training & testing improves the AP of the single-scale model by about 2 points, and CSN boosts it by another 5 points; AP_S, AP_M and AP_L improve steadily as well. Moreover, as Table 6 shows, using FPN improves the AP of the single-scale model by 4 points, multi-scale FPN raises AP further, and CSN with FPN still adds 3.1 points. This demonstrates that CSN brings consistent gains to the detector with or without FPN. What's more, vanilla Faster R-CNN with CSN even surpasses multi-scale FPN without CSN by 0.5 points, manifesting CSN's superiority. We also apply CSN to MobileNet-v2-FPN, as shown in Table 6; CSN steadily improves AP by more than 2 points compared with multi-scale training & testing.

4.2.4 CSN for efficient detection

Although CSN gives the model an impressive accuracy boost, it may be criticized for its time-consuming multi-scale testing in real applications. We therefore inspect the influence of CSN on tiny models under single-scale testing, comparing two settings for the same tiny model: testing all images at the same pre-defined resolution, and testing each image at its original resolution. All results are shown in Table 7. The trained Baseline, MS train and CSN models are the same as in Table 6. In the pre-defined setting, the resolution for CSN is smaller than for the Baseline and MS train models because it is closer to the patch size used in CSN training. In both settings, the AP of CSN beats the other two competitors. When changing from the pre-defined resolution to the original resolution, the AP of Baseline and MS train declines sharply, while the AP of CSN only decreases by 1 point. This also proves CSN's robustness to scale variation.

In real applications, there can be strict accuracy requirements only for objects in a common scale range. As Table 7 shows, the range-restricted AP of CSN surpasses the other two methods by a large margin, because CSN training makes the model focus on learning object representations within this range. In summary, CSN achieves better accuracy in a specific scale range while being faster with single-scale testing. This suggests potential usage of CSN in real situations such as security monitoring.

4.3 CSN on other recognition tasks

Experiments are also performed on two other object-related recognition tasks: instance segmentation and human pose estimation (namely person keypoint detection). Their evaluation metrics are denoted AP^mask and AP^kp, respectively; as with the detection metric, higher AP means more accurate. For economy, the backbones used here only include ResNet-FPN (18 and 50), and only the multi-scale strategy is compared to CSN. Details follow.

Backbone Method AP^bbox AP^mask AP^mask_50 AP^mask_75 AP^mask_S AP^mask_M AP^mask_L AR_S AR_M AR_L
ResNet-18-FPN MS Train&MS Test 36.1 34.1 53.8 36.6 16.2 36.3 49.3 36.6 56.4 67.6
ResNet-18-FPN CSN 37.8 35.2 55.4 38.3 20.5 37.0 49.6 42.6 58.9 66.8
ResNet-50-FPN MS Train&MS Test 41.0 37.6 58.3 40.6 18.3 40.8 53.5 39.1 59.5 70.0
ResNet-50-FPN CSN 43.0 39.0 60.0 42.3 22.8 40.6 53.3 45.6 61.6 70.6
ResNet-50-FPN-DCN MS Train&MS Test 42.6 39.0 60.2 42.0 19.6 42.1 56.1 39.8 59.6 70.2
ResNet-50-FPN-DCN CSN 45.0 40.3 61.9 44.2 24.2 41.7 56.8 45.3 61.8 71.3
Table 8: Mask results of ResNet-18/50 backbones on COCO 2017. CSN achieves more than 1 point of improvement in AP^bbox and AP^mask on the different backbones; robust and comparable improvements in AR are also observed.
Backbone Method AP^kp AP^kp_50 AP^kp_75 AP^kp_M AP^kp_L
ResNet-18-FPN MS Train&SS Test 57.8 81.9 62.2 52.7 65.5
ResNet-18-FPN MS Train&MS Test 58.7 81.7 63.5 54.0 66.6
ResNet-18-FPN CSN 62.5 83.1 68.3 58.2 70.0
ResNet-50-FPN Mask R-CNN [11] 64.2 86.6 69.7 58.7 73.0
ResNet-50-FPN MS Train&SS Test 61.2 84.1 66.4 56.8 68.2
ResNet-50-FPN MS Train&MS Test 61.8 83.7 66.7 57.3 69.5
ResNet-50-FPN CSN 65.2 85.1 71.1 60.8 72.7
ResNet-50-FPN-DCN MS Train&SS Test 62.6 85.0 67.9 57.4 70.7
ResNet-50-FPN-DCN MS Train&MS Test 62.9 84.4 68.2 58.4 70.7
ResNet-50-FPN-DCN CSN 66.5 86.0 72.5 61.7 74.4
Table 9: CSN vs. MST: keypoint detection AP on COCO val2017. The backbones are ResNet-18-FPN and ResNet-50-FPN, without and with DCN. CSN improves AP^kp by more than 3.5 points over MST for both backbones. The accuracy gain is steady with or without DCN, showing that CSN is compatible with DCN. Compared with the official Mask R-CNN, the MS Train&MS Test model with the modified kps head is quicker and more bandwidth-saving in practice, but inferior in accuracy; the CSN model with the same modified kps head beats Mask R-CNN [11] easily.

4.3.1 CSN on instance segmentation

Main results

Results of detection and instance segmentation on different backbones are shown in Table 8. Mask AP for CSN achieves more than a 1-point increment over multi-scale training & testing on both shallow and deep backbones. Even when the DCN component is introduced, the increment remains obvious and robust, indicating the compatibility of CSN with other strong components (e.g., DCN). It is worth noting that CSN also improves the AR metric on objects of different sizes, which is essential for real applications.

4.3.2 CSN on human pose estimation

Generally speaking, there are two modes of human pose estimation: 1) from the image, i.e., detecting a person bounding-box, cropping it from the original image, resizing it to a standard scale, and finally using a backbone to predict keypoint heatmaps directly; 2) from the feature map, i.e., detecting person instances and their keypoints simultaneously, which amounts to predicting heatmaps from an intermediate feature map. The former mode is usually more accurate than the latter because of its explicit scale normalization in image space, i.e., cropping and resizing to a uniform resolution. Our experiments with CSN for keypoints use COCO-kps data and the Mask R-CNN structure, namely the second mode. The final accuracy boost on pose estimation implies the effect of the implicit normalization from CSN.

Implementation details

Original Mask R-CNN [11] models a keypoint's location distribution on an RoI as a one-hot mask and predicts heatmaps, one per keypoint type. However, this kps head (the head for human pose estimation) is too heavy for real applications, especially multi-person pose estimation. We therefore make four modifications to the head structure, for both CSN and the baseline model. First, to better approximate a person's shape, we change the size of the feature map that RoIAlign extracts for each person RoI. Second, to reduce computational complexity and bandwidth consumption, the channel number of the eight conv layers in the kps head is reduced from 512 to 256, and the last deconv layer is removed. Third, for each keypoint category, the kps head outputs a dense heatmap predicting whether each position lies in the vicinity of that keypoint. Finally, the head also outputs a 2-D local offset vector for each category and position, as in [21], to localize keypoints more precisely. Different from [21], however, a smooth-L1 loss is used to optimize the two outputs separately. In addition, the patches used for CSN training here are generated from COCO keypoint data, which includes both bounding-box and keypoint annotations.
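The two head outputs described above combine at inference time: the per-category heatmap picks a coarse grid cell, and that cell's 2-D local offset refines the position. A minimal numpy sketch follows; the tensor shapes, `stride`, and function names are assumptions for illustration, not the exact tensors of the modified head, and `smooth_l1` shows the elementwise loss form applied to both outputs.

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Elementwise smooth-L1 (Huber-style) loss: quadratic near zero,
    linear beyond |x| >= beta."""
    ax = np.abs(x)
    return np.where(ax < beta, 0.5 * ax ** 2 / beta, ax - 0.5 * beta)

def decode_keypoints(heatmaps, offsets, stride=8):
    """Combine the two kps-head outputs.
    heatmaps: (K, H, W) per-category vicinity scores on the feature grid.
    offsets:  (K, 2, H, W) per-cell (dy, dx) refinements in pixels.
    Returns (K, 2) keypoint locations as (x, y) in image coordinates."""
    K, H, W = heatmaps.shape
    kps = np.zeros((K, 2))
    for k in range(K):
        # Coarse location: highest-scoring cell of category k's heatmap.
        y, x = divmod(int(np.argmax(heatmaps[k])), W)
        # Refinement: that cell's local offset vector.
        dy, dx = offsets[k, :, y, x]
        kps[k] = (x * stride + dx, y * stride + dy)
    return kps

# Toy check: one category, peak at grid cell (y=3, x=5), offset (dy=2.0, dx=1.5)
hm = np.zeros((1, 7, 9)); hm[0, 3, 5] = 1.0
off = np.zeros((1, 2, 7, 9)); off[0, :, 3, 5] = (2.0, 1.5)
print(decode_keypoints(hm, off))  # keypoint at (41.5, 26.0)
```

The quadratic region of smooth-L1 keeps gradients bounded for small residuals while the linear region limits the influence of outlier offsets, one plausible reason for preferring it over a pure L2 heatmap loss.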

Main results

Results of person detection and pose estimation on different backbones are shown in Table 9. First, the official Mask R-CNN [11] is compared to our CSN and MS-train models on the ResNet-50-FPN backbone in the second row of Table 9. Note that the modified design nearly halves the number of floating-point operations in the kps head. We observe that the MS-train model is inferior to the official Mask R-CNN [11] in AP: because the reduced kps head changes the relative capacity between heads, training may pay more attention to the detection branch. A similar trade-off is reported in [11]. However, even with the same lightweight design, CSN still exceeds Mask R-CNN [11] by 1 point in AP, which demonstrates the effectiveness of CSN for human pose estimation.

Second, CSN boosts human pose estimation performance significantly across backbones, as shown in Table 9. Its AP moves closer to that of the from-image kps method [31] (the aforementioned first mode) than Mask R-CNN [11] does. This indicates that the implicit scale normalization in CSN assists keypoint prediction, consistent with the explicit scale normalization of first-mode methods.

5 Conclusion

We propose a novel paradigm, Consistent Scale Normalization (CSN), to address the severe scale variation problem in instance-level vision tasks. CSN integrates the image pyramid (for scale normalization) and the feature pyramid (for easier learning within the CSN range) in one paradigm, achieving enhanced scale-handling capability. It significantly outperforms multi-scale training & testing on object recognition tasks, i.e., object detection, instance segmentation and human pose estimation, over strong baselines, and it can be extended to more efficient detection for real applications. Overall, CSN provides a new perspective on problems arising from scale, and should inspire future work on other large-variation problems.


  • [1] E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Ogden (1984) Pyramid methods in image processing. RCA engineer 29 (6), pp. 33–41. Cited by: §2.
  • [2] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis (2017) Soft-NMS: improving object detection with one line of code. In Proceedings of the IEEE international conference on computer vision, pp. 5561–5569. Cited by: §4.1.
  • [3] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos (2016) A unified multi-scale deep convolutional neural network for fast object detection. In Proceedings of the european conference on computer vision, pp. 354–370. Cited by: §2.
  • [4] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 764–773. Cited by: Table 3.
  • [5] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan (2010) Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence 32 (9), pp. 1627–1645. Cited by: §1, §2.
  • [6] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §2.
  • [7] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He (2018) Detectron. Note: Cited by: Table 4, §4.2.2.
  • [8] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §2, §3.1.
  • [9] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He (2017) Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §4.1.
  • [10] Z. Hao, Y. Liu, H. Qin, J. Yan, X. Li, and X. Hu (2017) Scale-aware face detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6186–6195. Cited by: §2.
  • [11] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §1, §2, §3.1, §3.4, Table 4, §4.1, §4.1, §4.1, §4.2, §4.3.2, §4.3.2, §4.3.2, Table 9.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §1, §4.1.
  • [13] H. Law and J. Deng (2018) Cornernet: detecting objects as paired keypoints. In Proceedings of the european conference on computer vision, pp. 734–750. Cited by: §2.
  • [14] Y. Li, Y. Chen, N. Wang, and Z. Zhang (2019) Scale-aware trident networks for object detection. arXiv preprint arXiv:1901.01892. Cited by: §3.3.2.
  • [15] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie (2017) Feature pyramid networks for object detection.. In Proceedings of the IEEE conference on computer vision and pattern recognition, Vol. 1, pp. 4. Cited by: §1, §1, §2, §4.2.1.
  • [16] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. arXiv preprint arXiv:1708.02002. Cited by: §1, §2.
  • [17] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In Proceedings of the european conference on computer vision, pp. 740–755. Cited by: §1, §4.1, Table 5.
  • [18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In Proceedings of the european conference on computer vision, pp. 21–37. Cited by: §2, §2.
  • [19] M. Najibi, P. Samangouei, R. Chellappa, and L. S. Davis (2017) Ssh: single stage headless face detector. In Proceedings of the IEEE international conference on computer vision, pp. 4875–4884. Cited by: §2.
  • [20] M. Najibi, B. Singh, and L. S. Davis (2018) Autofocus: efficient multi-scale inference. arXiv preprint arXiv:1812.01600. Cited by: §1.
  • [21] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy (2017) Towards accurate multi-person pose estimation in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4903–4911. Cited by: §4.3.2.
  • [22] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §2.
  • [23] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §2.
  • [24] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1, §2, §3.1, §4.2.
  • [25] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4510–4520. Cited by: Table 4, §4.2.2.
  • [26] B. Singh and L. S. Davis (2018) An analysis of scale invariance in object detection–snip. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3578–3587. Cited by: §1, §2, §2, Figure 1, §3.2.1, §3.2, Table 3.
  • [27] B. Singh, M. Najibi, and L. S. Davis (2018) SNIPER: efficient multi-scale training. In Advances in Neural Information Processing Systems, pp. 9333–9343. Cited by: §1, §1, §2, §2, §3.3.1, Table 3, Table 4.
  • [28] P. Viola and M. J. Jones (2004) Robust real-time face detection. International journal of computer vision 57 (2), pp. 137–154. Cited by: §2.
  • [29] P. Viola and M. Jones (2001) Rapid object detection using a boosted cascade of simple features. In Proceedings of the IEEE conference on computer vision and pattern recognition, Vol. 1, pp. I–I. Cited by: §2.
  • [30] J. Wang, K. Chen, S. Yang, C. C. Loy, and D. Lin (2019) Region proposal by guided anchoring. arXiv preprint arXiv:1901.03278. Cited by: §1.
  • [31] B. Xiao, H. Wu, and Y. Wei (2018) Simple baselines for human pose estimation and tracking. In Proceedings of the european conference on computer vision, pp. 466–481. Cited by: §4.3.2.
  • [32] C. Zhu, Y. He, and M. Savvides (2019) Feature selective anchor-free module for single-shot object detection. arXiv preprint arXiv:1903.00621. Cited by: §2.