Scope Head for Accurate Localization in Object Detection

by Geng Zhan, et al.

Existing anchor-based and anchor-free object detectors in multi-stage or one-stage pipelines have achieved very promising detection performance. However, they still face design difficulties in hand-crafted 2D anchor definition and learning complexity in 1D direct location regression. To tackle these issues, we propose a novel detector, coined ScopeNet, which models the anchors of each location as a mutually dependent relationship. This approach quantises the prediction space and employs a coarse-to-fine strategy for localisation. It retains the flexibility of regression-based anchor-free methods while producing more precise predictions. In addition, an inherent anchor-selection score is learned to indicate the localisation quality of the detection result, and we propose to better represent the confidence of a detection box by combining the category-classification score and the anchor-selection score. With our concise and effective design, the proposed ScopeNet achieves state-of-the-art results on COCO.



1 Introduction

Object detection is a fundamental task in computer vision. Existing object detection approaches can be mainly categorized into multi-stage [20, 5, 1] and one-stage detectors [12, 19, 21]. Multi-stage detectors usually achieve relatively higher accuracy at the cost of more computational steps. In contrast, one-stage detectors are considered to be simpler, faster, and more flexible in design choices. In this paper we focus on the one-stage object detection problem.

The definition or generation of anchor boxes is a critical component for accurate localization in many state-of-the-art one-stage detectors [12, 19]. Anchor boxes are generally pre-defined 2D boxes with a set of fixed box shapes, i.e. aspect ratios and scales, at each location (see Fig. 1(a)). The clear benefit of using anchors is that predicting offsets instead of locations makes it easier for the network to learn [18], which in turn improves detection accuracy. However, commonly used 2D anchor boxes also have several drawbacks. First, dense prediction over the anchor boxes brings redundancy: anchors need to be dense enough to cover most of the target objects, which vary widely in box shape, and the network then has to predict categories and locations for all the anchors. Second, the performance is very sensitive to the design choices of anchors. The detection performance deteriorates markedly with inappropriate anchor designs [12, 8]. Moreover, the representational power of anchors for objects with varying box shapes is limited. Common one-stage anchor-based detectors [12] typically employ 9 anchors to cover 3 scales and 3 aspect ratios, which clearly makes detection more challenging for objects with large variation in shape.

To overcome the above-mentioned disadvantages of 2D anchor boxes, researchers recently proposed anchor-free strategies [2, 29, 27, 21] that use direct regression for object localization (see Fig. 1(b)). By relaxing the box shape constraints of the anchor boxes and learning 1D predictions of four directional offsets (i.e. left, right, top, and bottom) corresponding to the target location, anchor-free approaches are ideally able to localize objects of arbitrary shape. Despite their simplicity, they achieve comparable or even better performance than anchor-based methods [21, 8]. However, these anchor-free methods rely on a single regression network for predicting a precise location in an unbounded space, which can be excessively challenging for the network.

Figure 1: Different approaches for localization prediction. Blue boxes denote ground truth objects. Red boxes in solid line represent the box prediction. Red arrows are regression predictions. Predictions are based on current center locations (black dots). (a) Anchor-based approaches: the dashed box in red denotes one anchor box with possibly high IoU with the target. It first uses classification to select an anchor box that has high IoU with the object, then uses regression to predict a more accurate location. (b) Anchor-free approaches: the four borders of a bounding box are directly regressed. (c) Our 1D anchor-dependent modeling approach: along each of four directions, we first estimate the coarse range of prediction with softmax classification, and then apply the corresponding regression network and anchor scale (notch in red) for more fine-grained localization.

The analysis above leads to a straightforward question: is it possible to maintain the flexibility and simplicity of anchor-free methods while simultaneously making the localization more accurate? Since a single regression network may not be capable of effectively handling the unbounded prediction space, it is natural to consider restricting the range of the regression. Inspired by the anchor-based methods that use distinct regression networks and anchors for objects with different shapes, we also predict different regression values with corresponding distinct networks and anchors. However, in contrast to common anchor modelling, where each anchor prediction is independent of the others, we model the anchor existence as a mutually dependent relationship, i.e. there is only one most probable anchor for the object, as depicted in Fig. 1(c). Specifically, we set different 1D anchor scales. For instance, we can divide the prediction range into several intervals, each interval corresponding to an anchor scale. Each anchor scale is responsible for a certain range of locations, and there is one specific regression network for each anchor scale. During inference, a classification network first produces a coarse prediction and decides which anchor scale and regression network will be used. Then, the corresponding regression network refines the coarse prediction together with the selected anchor scale. This strategy comes with several benefits. First, it balances classification and regression: by quantizing the prediction space, the regression is bounded to a reasonable range. Second, using softmax for anchor selection makes the best-matched anchor compete against the ill-matched ones, thus facilitating more accurate localization. This effectively rules out the predictions of ill-matched anchors before possible follow-up processing steps (e.g. top-K and NMS), which reduces redundancy and saves computation. Moreover, we enable learning the ranges of the anchor scales, which preserves the flexibility of existing anchor-free approaches.

To summarize, our main contributions are three-fold:

  • We model the anchors of each location as a mutually dependent relationship and use the softmax to make the best-matched anchor compete against ill-matched ones. A coarse-to-fine pipeline is further devised for object localization, which retains the flexibility of regression-based anchor-free methods while clearly reducing the output redundancy and producing significantly more precise predictions.

  • We propose a novel and effective strategy to better represent the confidence of a detection box by combining the category-classification score and the inherent anchor-selection score, which indicates the localization quality of the detection result.

  • We design a concise detection framework, termed ScopeNet, with the proposed anchor-dependent modeling and localization strategies, which establishes state-of-the-art results on COCO without bells and whistles.

2 Related Work

2.1 Anchor-based Detectors

In deep learning based object detection, inspired by pioneering works such as Faster R-CNN [20] and SSD [14], the concept of anchor boxes has been widely adopted in later research. Classic multi-stage detectors [5, 1, 16] select regions of interest (ROIs) that are most likely to contain objects using a Region Proposal Network (RPN) in the first stage, and the anchor box is a basic component of the RPN [20]. In the next stages, they first extract deep features from the ROIs [20, 5, 4], and two sibling branches are then used for predicting the object class and regressing the object location, respectively. In these pipelines, the anchors are usually ROIs from a previous stage, and the object localization task relies directly on the regression. To improve the localization performance, several recent works leverage classification for localization in further stages. Two representative works are LocNet [3] and Grid R-CNN [15]. LocNet [3] first extracts 1D features along the horizontal and vertical axes, and then performs binary classification to predict the confidence of each position being a bounding box border. It requires several iterative steps to obtain the final results. Grid R-CNN [15] directly predicts the probability of a border location for each position from the corresponding 2D feature of each ROI. Most single-stage detectors are built upon the RPN with multi-category classification and regression [12, 14, 19]. As far as we know, no previous work models the existence probability of the anchors at the same location as a mutually dependent relationship. Besides, existing anchor-based methods model anchors as 2D boxes instead of the 1D scales we explore.

2.2 Anchor-free Detectors

Based on the localization strategy utilized, anchor-free approaches could be generally divided into classification-based and regression-based ones.

2.2.1 Classification based methods.

Classification based methods [9, 22, 28] typically model 2D bounding boxes as sets of points. CornerNet [9] directly predicts two diagonal corners of the object with classification, and then uses embeddings for grouping them into boxes. DeNet [22] also models the box as four corners and predicts them with per-corner classification, but it groups the corners by maximizing an overall confidence. ExtremeNet [28] predicts one point on each side of the box using semantic meaning, and further considers geometric constraints for the grouping. These classification based methods usually require geometric or semantic relations for grouping corners into boxes in an extra stage. In contrast, the proposed approach directly performs object classification and localization, and does not require any further grouping or association step.

2.2.2 Regression based methods.

For two-stage detectors [23, 5, 6], direct regression is usually performed in region proposal generation and final object localization. In contrast, most recent single-stage detectors [17, 2, 21] model object localization in an anchor-free fashion. For instance, early works such as YOLO [17] predict the bounding boxes with direct regression of the borders. CenterNet [2] learns center heatmaps for localizing the object centers instead of corner heatmaps. RepPoints [25] regresses object boundaries with an iterative dynamic sampling strategy. To develop neat designs, FCOS [21] and FoveaBox [8] recently employ direct regression of a 4D offset vector at each position to represent the four directions of a bounding box for localization. All these regression based approaches perform coarse regression on bounding box borders or offsets without bounding the regression range. In our approach, by contrast, each regression prediction is only responsible for a certain prediction range, which enables more fine-grained and precise localization. Furthermore, our design of using the inherent anchor-selection score to better represent the confidence of a detection box is not investigated in these works.

3 The Proposed Approach

Our detection pipeline is illustrated in Fig. 2. The backbone network first produces a deep feature map of the input image, which is then used as the input of an object classification branch and an object localization branch. For each position on the feature map, the classification branch predicts the confidence scores of the categories. To localize objects more accurately, the Scope Head detailed in Section 3.1 is used as the localization branch.

3.1 Scope Head

At the Scope Head, we consider candidate anchors along each of four directions (i.e. left, right, top, and bottom). There are two branches in the Scope Head: a bin classification branch and a border regression branch. The bin classification branch performs a multi-class classification over the candidate anchors to learn anchor selection. For these candidate anchors, the border regression branch performs regression to localize the border along each direction. Specifically, for each direction, we select the anchor with the highest bin classification score and decode its corresponding regression prediction to obtain the boundary position of the object in that direction. Finally, the object bounding box is determined by gathering the boundary positions along the four directions. Note that in our approach, candidate anchors are learnable parameters.

Separate 1D representations. In most works, an anchor is a 2D representation, e.g. a scale and an aspect ratio. In this paper, an anchor is obtained from four separate 1D representations corresponding to the top, bottom, left, and right borders of the anchor. Therefore, the bounding box regression targets are formulated as the distances from the position of the current feature point to the four boundaries. With this design, it is very flexible to generate anchors with different scales in the 1D space rather than bounding boxes with various shapes in the 2D space. This naturally brings an inherent advantage, i.e. a much higher degree of freedom for the generation of anchor boxes. For instance, because each of the four directions selects its anchor separately, the number of anchor box shapes that can potentially be produced grows combinatorially with the number of anchors per direction. Therefore, compared with traditional 2D anchor generation, the 1D anchor representation is more capable of handling the various aspect ratios of objects.
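To make the combinatorial argument concrete, the following minimal sketch enumerates the anchor boxes obtained by choosing one 1D anchor scale per direction. The function name and the three example scales are hypothetical and are not taken from the paper.

```python
from itertools import product


def enumerate_anchor_boxes(scales):
    """Combine 1D anchor scales into (left, top, right, bottom) anchor boxes.

    Each direction picks one scale independently, so k scales per
    direction yield k ** 4 combinations.
    """
    return list(product(scales, repeat=4))


# Hypothetical anchor scales (in pixels), chosen only for illustration.
example_scales = [16.0, 64.0, 256.0]
boxes = enumerate_anchor_boxes(example_scales)
print(len(boxes))  # 3 ** 4 = 81 combinations
```

Several combinations can share the same width and height while differing in how the box is placed around the feature point, so the count above refers to offset combinations rather than distinct width and height pairs.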

Figure 2: An overview of the proposed detection pipeline ScopeNet. Given an input image, a backbone CNN is used for extracting features. There are then two main branches: an object classification branch and an object localization branch. The classification branch performs multi-class classification to identify the object category. The object localization branch is implemented by our Scope Head, which is divided into two branches: bin classification and border regression. The bin classification branch performs multi-class classification for anchor selection in each of the four directions. The border regression branch performs border prediction for the anchors selected by the bin classification branch.

Learnable 1D anchors. In this work, instead of manually designing anchors, we aim at generating learnable anchors that are data-dependent. In the 1D anchor representation, there are four directions (left, top, right, and bottom). Along each direction, we divide the distance into a number of bins, where each bin represents an interval of distances and has a corresponding anchor. Following the formulation of the target normalization in [20], we write the boundary prediction for a given bin as follows:


Here the predicted boundary for a direction is obtained from the anchor scale of the bin and the raw prediction of the border regression branch. In the actual design, the anchor scale is not learned directly: the network learns one parameter per anchor, from which the anchor scale is derived. This is because the anchor scale might be large (consider the largest anchors in existing anchor designs). To avoid excessively large values that might make the learning unstable, we predict a quantity that lies in a more reasonable range for network prediction. In previous works, the anchor scale is manually designed by assuming a fixed anchor scale and aspect ratio. In our design, it is learned to adjust the anchor.
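As a rough illustration of this decoding step, the sketch below assumes an exponential form in the style of the size normalization of Faster R-CNN [20]: the border distance is the anchor scale multiplied by the exponential of the raw regression output, and the anchor scale itself is stored as a learnable log-scale. Both choices, as well as the names `decode_border` and `anchor_log_scales`, are assumptions made for this example; the exact form of Eqn. (1) is not recoverable from the text here.

```python
import math


def decode_border(anchor_log_scales, raw_predictions, selected_bin):
    """Decode one border distance from the selected 1D anchor.

    Assumed form (not confirmed by the source):
        anchor_scale = exp(anchor_log_scales[k])
        distance     = anchor_scale * exp(raw_predictions[k])
    where k is the bin chosen by the bin classification branch.
    """
    anchor_scale = math.exp(anchor_log_scales[selected_bin])
    return anchor_scale * math.exp(raw_predictions[selected_bin])


# Hypothetical learned log-scales for three anchors (16, 64 and 256 pixels).
log_scales = [math.log(16.0), math.log(64.0), math.log(256.0)]
# One raw regression output per bin; only the selected one is used.
raw = [0.3, 0.0, -0.2]
distance = decode_border(log_scales, raw, selected_bin=1)
```

Storing the log of the scale keeps the learnable parameter in a small numeric range even when the anchor itself spans hundreds of pixels, which matches the stability motivation given in the text.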

Learning anchor selection. The choice of anchor is critical for accurate regression. Only the selected anchor will be used for the regression in Eqn. (1). Unlike 2D anchor assignment methods, where the Intersection-over-Union (IoU) metric is adopted, we assign the anchor for a boundary by determining which bin the target falls in, as illustrated in Fig. 3. Learning anchor selection aims to select an optimal bin for the target object. A straightforward method is to use a fully connected layer as the classifier and a simple Softmax function to predict the probabilities of the target belonging to the different bins.
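The bin assignment described above can be sketched as a simple interval lookup. The bin edges below are hypothetical placeholders, and the half-open interval convention is an assumption made for this example.

```python
from bisect import bisect_right


def assign_bin(distance, inner_edges):
    """Return the index of the bin that contains a border distance.

    inner_edges are the sorted boundaries between neighbouring bins, so
    n edges define n + 1 bins. A distance equal to an edge is assigned to
    the bin on its right (an assumed convention).
    """
    return bisect_right(inner_edges, distance)


def one_hot(index, num_bins):
    """One-hot training target for the bin classification branch."""
    return [1.0 if k == index else 0.0 for k in range(num_bins)]


# Hypothetical edges: bins are [0, 32), [32, 64), [64, 128), [128, inf).
edges = [32.0, 64.0, 128.0]
target_bin = assign_bin(70.0, edges)
target = one_hot(target_bin, len(edges) + 1)
```

A target sitting right next to an edge, such as 63.9 versus 64.1, lands in different bins even though the two distances are almost identical, which is the ambiguity discussed in the following paragraph.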

However, unlike classic object classification, where the boundary between categories is clear, our bin classification task is ambiguous. For example, for a right border close to the boundary between two bins, the network tends to output high confidence scores for both bins. From our experimental observation, such samples dominate the loss gradient once the network is trained well on other, less ambiguous samples, which degrades the learning performance. To tackle this issue, we propose a strategy that smooths the probability distribution from the Softmax function and down-weights the loss of such samples as follows:


Here the temperature [7] indicates the confidence of the decision, the output is a normalized probability for a given direction and bin, and the input is the classification score of that bin. In addition to the per-bin classification outputs, the network produces an extra output for predicting the temperature. Therefore, the temperature depends on the training sample. The temperature is parameterized in a way that guarantees it is positive.
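The smoothing effect of a temperature can be illustrated with a standard temperature-scaled Softmax. This is a minimal sketch: the exponential mapping used to keep the temperature positive and the name `raw_temperature` are assumptions for this example, and the exact form of Eqn. (2) may differ.

```python
import math


def temperature_softmax(scores, raw_temperature):
    """Softmax over bin scores, smoothed by a per-sample temperature.

    raw_temperature stands for the extra network output; it is mapped
    through exp() so that the temperature is always positive (assumed
    mapping). A larger temperature gives a flatter distribution.
    """
    temperature = math.exp(raw_temperature)
    scaled = [s / temperature for s in scores]
    shift = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


bin_scores = [2.0, 1.0, 0.0]
sharp = temperature_softmax(bin_scores, raw_temperature=0.0)   # temperature 1
smooth = temperature_softmax(bin_scores, raw_temperature=1.0)  # temperature e
```

With the higher temperature, the largest probability shrinks and the distribution moves towards uniform, which is how an ambiguous sample near a bin boundary can be expressed as a less confident decision.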

Localization guided detection score. To further improve the accuracy of object detection, we employ a localization guided detection score. Most previous works rely only on the classification results to evaluate the quality of box predictions. However, a high-quality detection result requires not only that the category is recognized but also that the object is accurately localized. Therefore, representing the quality of a prediction with a classification score alone is biased. To address this problem, we use the product of the localization confidence and the classification score as the score of an object for NMS, which is formulated as


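Here the localization confidence is computed from the maximum bin classification probability of each border, with the probabilities defined as in Eqn. (2), and the two factors of the product are the classification score and the localization score of the object.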
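A minimal sketch of such a combined score is given below. The text states that the final score is the product of the classification score and a localization confidence built from the maximum bin probability of each border; how the four per-border values are aggregated is not recoverable here, so the arithmetic mean used in this example is an assumption, as are the function and argument names.

```python
def detection_score(class_score, border_bin_probs):
    """Combine the classification score with a localization confidence.

    border_bin_probs maps each direction to its bin probability list.
    The localization confidence is the mean of the per-border maximum
    probabilities (assumed aggregation; the source may use another one).
    """
    maxima = [max(probs) for probs in border_bin_probs.values()]
    localization_confidence = sum(maxima) / len(maxima)
    return class_score * localization_confidence


# Hypothetical bin probabilities for the four borders of one detection.
probs = {
    "left": [0.9, 0.1],
    "top": [0.7, 0.3],
    "right": [0.8, 0.2],
    "bottom": [0.6, 0.4],
}
score = detection_score(0.8, probs)  # 0.8 * mean(0.9, 0.7, 0.8, 0.6)
```

A box whose borders are selected with low confidence is ranked lower during NMS even if its category score is high, which is the intended effect of the re-scoring.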

3.2 Scope Net Details

Figure 3: Illustration of the box parameterization and bin assignment. The yellow circles and the orange box denote the sample points of interest for classification and the target bounding box, respectively. The green vectors that point to the top and left boundaries represent the predicted boundary targets for the current center position. The formulations for the right and bottom boundaries are analogous and are not depicted for simplicity.

We adopt the Feature Pyramid Network [11] as our baseline network structure. Four stacked convolution layers are used to extract separate features for the classification network branch and for the localization network branch, as in Fig. 2.

The classification network branch. It uses the classification features for predicting object category scores. For each feature location, it predicts only one set of multi-class probability scores. In contrast, existing detectors using 2D anchors need to predict multiple sets of scores because there are multiple anchors per location.

The localization network branch. This branch, i.e. the proposed Scope Head network, takes the localization features as input and produces regression outputs for each bin as well as confidence scores for anchor classification, which are used together for the border prediction.

Label assignment for different prediction heads. An important step for network training is to assign samples to different FPN levels. Existing anchor box based approaches assign anchors of different sizes to feature maps of corresponding sizes. In each FPN level, anchors are set as positive samples if the IoU with any ground truth bounding boxes is above a pre-defined threshold. While in our approach, each FPN level is responsible for different ranges of regression. We consider a feature point as a positive sample for the -th feature level of FPN based on two factors: (i) the point has to be within a distance to the center of a ground-truth object bounding box, denoted as , and (ii) the maximum value of its four location prediction along the four directions, should lie in a reasonable regression range of the -th FPN level. In our experiments, we use , , , , as the regression range for the FPN level from 3 to 7, respectively. For the classification branch , we set as , and both the positive and negative samples are used during training. We set as for the localization branch and only the positive samples contribute in learning . We use one-hot label for category classification and bin selection prediction. We set the target as directly maximizing the IoU between the decoded bounding box prediction and the target box for border prediction.
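The two conditions for a positive sample can be sketched as follows. The regression ranges per FPN level and the center radius are hypothetical placeholders, since the actual values are not recoverable from the text here, and the per-axis distance to the box center is likewise an assumed choice of metric.

```python
def is_positive(point, gt_box, regress_range, center_radius):
    """Check whether a feature point is a positive sample for one FPN level.

    point is (x, y) in image coordinates and gt_box is (x1, y1, x2, y2).
    Condition (i): the point lies within center_radius of the box center
    along both axes (assumed metric). Condition (ii): the largest of its
    four border distances falls inside the level's regression range.
    """
    px, py = point
    x1, y1, x2, y2 = gt_box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    near_center = abs(px - cx) <= center_radius and abs(py - cy) <= center_radius
    distances = (px - x1, py - y1, x2 - px, y2 - py)
    inside_box = min(distances) > 0
    low, high = regress_range
    return near_center and inside_box and low <= max(distances) <= high


# Hypothetical regression ranges for FPN levels 3 to 5 (illustration only).
level_ranges = {3: (0.0, 64.0), 4: (64.0, 128.0), 5: (128.0, 256.0)}
box = (0.0, 0.0, 100.0, 100.0)
assigned = [lvl for lvl, rng in level_ranges.items()
            if is_positive((50.0, 50.0), box, rng, center_radius=10.0)]
```

In this example the point at the box center has a maximum border distance of 50, so only the level whose range covers 50 treats it as positive.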

Loss formulation.

With the ground-truth labels assigned, the overall loss function is defined as:


where denotes an indicator function, which returns 1 if i.e. a positive sample, otherwise returns 0. is the bin classification loss and standard Cross Entropy Loss is adopted; is the feature point classification loss and Focal Loss is used as in [12] without parameter tuning; and is the location regression loss where we use the IoU loss following [26, 21]. Bin classification losses from four directions are averaged as , where

The loss weights for the bin classification loss and the location regression loss are set to 0.5 and 1, respectively. We empirically find that this setting stabilizes the training process.
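For the regression term, the text only states that an IoU loss is used. The sketch below assumes the common negative-log form, -ln(IoU), computed on boxes that are parameterized by their four border distances from the same feature point; the function name and the `eps` stabilizer are choices made for this example.

```python
import math


def iou_loss(pred, target, eps=1e-9):
    """Negative log IoU between two boxes given as (left, top, right, bottom).

    Both boxes are measured from the same feature point, so they always
    overlap around that point and the intersection is found by taking
    the smaller distance in every direction.
    """
    pl, pt, pr, pb = pred
    tl, tt, tr, tb = target
    area_pred = (pl + pr) * (pt + pb)
    area_target = (tl + tr) * (tt + tb)
    inter_w = min(pl, tl) + min(pr, tr)
    inter_h = min(pt, tt) + min(pb, tb)
    intersection = inter_w * inter_h
    union = area_pred + area_target - intersection
    return -math.log((intersection + eps) / (union + eps))


# A prediction half as large as the target in every direction: IoU = 0.25.
loss = iou_loss((10.0, 10.0, 10.0, 10.0), (20.0, 20.0, 20.0, 20.0))
```

Because the loss depends on the box as a whole rather than on each border separately, an error in one border is weighed against the overall overlap, which fits the stated target of directly maximizing the IoU of the decoded box.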

4 Understanding Scope Head Localization

4.1 From the View of Anchor-Free Approach

Recently proposed regression based anchor-free methods apply regression only for localization. They are similar to our approach in some aspects. First, both perform multi-class category classification per location and use similar approaches for choosing positive and negative training samples. Second, both model the localization predictions as distances to the four borders of the objects. The main difference is whether there is anchor selection. Other anchor-free methods use a single regression network to handle the entire prediction space. In fact, such anchor-free methods [21, 8] are a special case of our approach in which only one anchor scale is used. In contrast, our approach first quantizes the prediction space into several intervals. This bounds each regression to a reasonable range and thus relieves the burden on the regression prediction. Besides, the anchor selection score provides information on the localization quality of the predicted box, which existing anchor-free methods lack.

4.2 From the Viewpoint of Anchor-based Approach

There are similarities between our approach and anchor-based ones. First, these anchor-based methods first predict the confidence on whether the anchor box matches an object or not. Our approach also selects the anchor scale, which is a 1D anchor that is similar to the 2D anchor box used in existing anchor-based approaches. Second, both anchor-based methods and our approach use regression to refine the location of the anchor with a high score. The differences are in three-fold. First, the anchor prediction is dependent in ours but independent in anchor-based approaches. Second, our approach reduces the number of confidence scores required in existing anchor-based methods for anchors. Assume there are anchors and categories. For anchor-based ones, the number of confidence scores is . In our approach, the number is . Third, our method can represent anchor boxes more efficiently. In anchor-based approaches, anchors can only represent box shapes. While in our approach, anchors in each direction can represent different shapes of anchor boxes.

5 Experiments

To validate the effectiveness of the proposed ScopeNet detection approach, we conduct experiments on the large scale detection dataset COCO [13]. Following [21, 12], we use the trainval135k split as the training set and conduct ablation study on the minival set. We compare our approach with state-of-the-art methods on the test-dev set.

5.1 Training Details

We adopt ResNet [6] with FPN [11] as the backbone. We train all our models using 8 Nvidia 1080Ti GPUs and a total batch size of 16. The loss weights for object classification, anchor selection, and border regression are set to 1, 0.5, and 1, respectively. We use SGD for optimization.

For the ablation study, we train our models for 90K iterations, i.e. the 1× schedule. The initial learning rate is 0.01 and is reduced by a factor of 10 at iterations 60K and 80K. Input images are resized so that the shorter side is 800 and the longer side is at most 1333. No augmentation other than horizontal flipping with a probability of 0.5 is adopted.
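The step schedule described above can be written as a small helper. This is a sketch of the stated numbers only; the function name and its defaults are chosen for this example, and no warm-up phase is modeled because none is mentioned.

```python
def learning_rate(iteration, base_lr=0.01, milestones=(60_000, 80_000), gamma=0.1):
    """Step learning-rate schedule: multiply by gamma at every milestone.

    The defaults follow the ablation setting in the text: an initial rate
    of 0.01, reduced by a factor of 10 at iterations 60K and 80K.
    """
    passed = sum(1 for milestone in milestones if iteration >= milestone)
    return base_lr * gamma ** passed


# Learning rate at the start, after the first drop, and after the second.
rates = [learning_rate(i) for i in (0, 60_000, 85_000)]
```

For a schedule with twice as many iterations, the same helper can be called with the milestones scaled proportionally, e.g. `milestones=(120_000, 160_000)`.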

For comparison with state-of-the-art methods, we train our models with the 2× schedule, where we double the iterations to 180K and scale the change points of the learning rate proportionally. The shorter sides of input images range from 640 to 800. We adopt the improvements listed in Table 4. Other settings are the same as for the model with 38.0 mAP in Table 3.

5.2 Ablation Study

5.2.1 On baseline of direct regression.

As discussed in Sec. 4, our model has few differences from anchor-free detectors that use regression only. Clearly, direct regression is a special case of our approach in which a single anchor scale is used. Therefore, we use a single anchor scale in our method and take this direct regression model as the baseline. As shown in Table 1, this baseline achieves 35.8 mAP.

5.2.2 On multiple anchor scales and localization guided detection score.

Table 1 shows the experimental results for the new designs in our ScopeNet. When our design of multiple anchors is adopted, the mAP increases from 35.8 to 37.2. This validates the effectiveness of using multiple anchors. Another advantage is that the anchor-selection score provides information about the localization quality of the detection box. As shown in Table 1, further utilizing the localization guided detection score improves the mAP by another 0.8 points, to 38.0. This information is missing in the direct regression approach, because there is no anchor selection score in that setting.

Direct regression   Multiple anchors   Re-score   AP     AP50   AP75   APS    APM    APL
✓                                                 35.8   55.2   37.5   18.7   40.1   47.6
                    ✓                             37.2   56.0   39.4   19.6   41.2   49.6
                    ✓                  ✓          38.0   56.1   40.6   20.8   41.9   50.1
Table 1: Improvements upon the baseline. ‘Direct regression’ means the baseline predicting the box with regression only, which corresponds to using a single anchor scale. ‘Multiple anchors’ means using multiple anchor scales (5 here). ‘Re-score’ denotes using the localization confidence for rescoring the detection confidence.

5.2.3 Parameter choices of anchors.

In this part, we present experimental results on several key factors for designing anchors in our model. We mainly discuss how the prediction range of each anchor should be assigned, and the effect of the number of anchors.

The prediction range of each anchor and the number of anchors provide a trade-off on classification and regression for localization. When more anchors are used, the network relies more on classification for selecting a more fine-grained range for the following regression. Thus, the pressure for regression is eased since it focuses on a smaller prediction range. On the contrary, when fewer anchors are adopted, the regression needs to handle a larger range. A good balance that effectively leverages classification and regression could produce a better result. To this end, we vary the prediction range for each anchor and the number of anchors. The results are shown in Table 2.

In general, it is clear that dividing the whole prediction space into several sub-spaces is better than using a single regression to handle the entire prediction space. It can also be observed that our Scope Head is not sensitive to the hyperparameter choices. This indicates that classification and regression can be well balanced and trained under different settings. For detectors with independent 2D anchors, in contrast, the experimental results in [12, 8] show that the performance is more sensitive to the design choices of anchors.

size   num   AP     AP50   AP75   APS    APM    APL
-      1     35.8   55.2   37.5   18.7   40.1   47.6
       3     36.9   55.9   38.9   19.3   40.8   48.5
       5     37.0   55.6   39.2   19.3   41.2   48.6
       7     36.9   55.4   39.4   19.7   40.6   49.1
       3     37.1   56.0   39.4   20.1   41.1   48.5
       5     37.2   56.0   39.4   19.6   41.2   49.6
       7     37.2   56.1   39.4   20.2   40.8   49.4
       3     37.1   56.1   39.5   19.3   40.8   49.6
       5     37.0   56.0   39.3   19.5   41.0   49.6
       7     36.9   55.4   39.4   19.7   40.6   49.1
Table 2: Effects of different parameter choices of anchors. We try different sub-space sizes and anchor numbers. The sizes are given as . Scope Head is not sensitive to hyper parameter choices.

5.2.4 Localization guided box score and uncertainty term.

To show the effectiveness of incorporating the localization score into the box score and of the uncertainty term, we provide results in Table 3. When no localization score is used, 37.2 mAP is achieved. Next, when the localization score is used, a model without the uncertainty term achieves 37.7 mAP, which is 0.5 points higher than the previous one. In comparison, our model using uncertainty as in Eqn. (2) achieves 38.0 mAP, which is 0.3 points higher than the model with the vanilla Softmax function.

- - 37.2 56.0 39.4 51.7 54.8
- 37.7 55.2 40.5 52.3 55.1
38.0 56.1 40.6 52.5 55.4
Table 3: Incorporate anchor confidence and uncertainty for estimating bounding box score.

5.2.5 With strategies from other methods.

We show that the proposed method works well with custom strategies that boost performance and are adopted in other works. Table 4 reports the results of our approach equipped with different custom strategies. This validates the compatibility of our approach with other components.

38.0 -
38.1 0.1
38.4 0.4
39.4 1.4
Table 4: Improvements on COCO minival set with free use of extra components. represents the relative gain against the baseline model without using extra components. ‘Custom FPN’ is introduced in FCOS [21]. ‘NMS’ means changing the NMS threshold from 0.5 to 0.6. ‘GN’ means using Group Normalization [24] in the detection head.

5.2.6 Generalization to input of different sizes.

Besides being less sensitive to hyperparameter choices, as mentioned in Sec. 5.2.3, our anchor design also generalizes well to inputs of different sizes. In Table 5, we compare our anchor design with vanilla anchors on various input (image and box) sizes. For our anchor design, we use our ScopeNet with 5 anchors and the uncertainty term. For the vanilla anchor design, we use RetinaNet with 9 anchors in mmdetection without modifications. For both methods, we use the same backbone structure and only change the anchor scheme. It is clear that our anchor modelling consistently outperforms vanilla anchors in handling inputs of different sizes.

Anchor    Input size   AP     AP50   AP75   APS    APM    APL
Vanilla   400          30.6   48.6   32.4   11.0   34.2   47.5
Ours      400          32.5   49.8   34.2   13.5   34.8   49.6
Vanilla   600          34.3   53.7   36.8   16.6   38.2   48.1
Ours      600          35.9   54.3   38.1   17.5   39.6   50.0
Vanilla   800          35.8   55.5   38.3   20.1   39.5   47.7
Ours      800          38.0   56.1   40.6   20.8   41.9   50.1
Table 5: Vanilla anchors vs. our anchors on inputs with different sizes.

5.3 Comparison with State-of-the-art Detectors

We compare ScopeNet with other state-of-the-art object detectors on the test-dev split of the COCO dataset. As shown in Table 6, with the same ResNet-101-FPN backbone our method reaches 43.4 AP, surpassing both the anchor-based RetinaNet (39.1) and the anchor-free FCOS (41.5).

Method Anchor Backbone AP AP50 AP75 APS APM APL
Faster R-CNN [12] ResNet-101-FPN 36.2 59.1 39.0 18.2 39.0 48.2
Mask R-CNN [5] ResNet-101-FPN 38.2 60.3 41.7 20.1 41.1 50.2
Cascade R-CNN [1] ResNet-101-FPN 42.8 62.1 46.3 23.7 45.5 55.2
YOLO v3 [19] Darknet-53 33.0 57.9 34.4 18.3 35.4 41.9
CornerNet [9] Hourglass-104 40.6 56.4 43.2 19.1 42.8 54.3
CenterNet [2] Hourglass-104 44.9 62.4 48.1 25.6 47.4 57.4
ExtremeNet [28] Hourglass-104 40.2 55.5 43.2 20.4 43.2 53.1
TridentNet [10] ResNet-101 42.7 63.6 46.5 23.9 46.6 56.6
RepPoints [25] ResNet-101-FPN 41.0 62.9 44.3 23.6 44.1 51.7
RetinaNet [12] ResNet-101-FPN 39.1 59.1 42.3 21.8 42.7 50.2
FSAF [29] ResNet-101-FPN 40.9 61.5 44.0 24.0 44.2 51.3
Fovea [8] ResNet-101-FPN 40.6 60.1 43.5 23.3 45.2 54.5
FCOS [21] ResNet-101-FPN 41.5 60.7 45.0 24.4 44.8 51.6
ScopeNet (ours) - ResNet-101-FPN 43.4 61.2 47.8 26.0 46.8 53.8
Table 6: COCO test-dev results for ScopeNet and other state-of-the-art approaches. ‘Anchor’ indicates that anchor boxes are used.

In detail, our model achieves comparable results on AP50 (61.2). In terms of AP75, our model (47.8) surpasses all the other methods except CenterNet (48.1), which uses quite different settings [2]. This demonstrates that our model performs better at accurate localization.
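
The link between stricter IoU thresholds and localization quality can be made concrete. COCO's AP50 and AP75 count a detection as correct only if its IoU with a ground-truth box reaches 0.5 or 0.75 respectively. The sketch below uses hypothetical boxes and a hypothetical helper `iou`; it is not the COCO evaluation code.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0, 0, 10, 10)
loose_pred = (2, 0, 12, 10)  # shifted by 2 pixels: IoU = 80 / 120
tight_pred = (1, 0, 11, 10)  # shifted by 1 pixel:  IoU = 90 / 110

for name, pred in [("loose", loose_pred), ("tight", tight_pred)]:
    overlap = iou(ground_truth, pred)
    # A prediction is a match at a threshold if its IoU reaches it.
    print(name, round(overlap, 3), overlap >= 0.5, overlap >= 0.75)
```

Both predictions count as correct at the 0.5 threshold, but only the tighter one survives at 0.75, so a detector that improves AP75 more than AP50 is producing tighter boxes rather than finding more objects.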

5.4 Qualitative Results

We provide visualization results of our approach in Fig. 4. We use the model that achieves on . We select images that contain objects with different aspect ratios, as well as crowded scenes. From these images, we observe that our approach localizes objects accurately when the objects have a reasonable size and are successfully recognized by the classification network.

Figure 4: Visualization of detection results. Examples shown in the and rows are successful detection samples. Examples in the row contain failure detection cases. For readability, we only draw bounding boxes with prediction scores while not showing class labels on the bounding boxes.

6 Conclusion

We have presented ScopeNet, a framework for object detection. It models the anchors of each location as a mutually dependent relationship and adopts a coarse-to-fine pipeline for object localization. The proposed approach retains the flexibility of other regression-based anchor-free methods, while reducing output redundancy and producing more precise predictions. Moreover, ScopeNet combines the category-classification score with the inherent anchor-selection score, which indicates the localization quality of the detection result; this combination has been shown to represent the confidence of a detection box effectively. Extensive experiments demonstrate that the proposed ScopeNet achieves state-of-the-art results on the COCO dataset.

7 Acknowledgement

We would like to thank Zhi Tian for constructive discussions.


  • [1] Z. Cai and N. Vasconcelos (2018) Cascade r-cnn: delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6154–6162. Cited by: §1, §2.1, Table 6.
  • [2] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian (2019) Centernet: keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6569–6578. Cited by: §1, §2.2.2, §5.3, Table 6.
  • [3] S. Gidaris and N. Komodakis (2016) Locnet: improving localization accuracy for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 789–798. Cited by: §2.1.
  • [4] J. Gu, H. Hu, L. Wang, Y. Wei, and J. Dai (2018) Learning region features for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 381–395. Cited by: §2.1.
  • [5] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1, §2.1, §2.2.2, Table 6.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2.2.2, §5.1.
  • [7] L. P. Kaelbling, M. L. Littman, and A. W. Moore (1996) Reinforcement learning: a survey. Journal of artificial intelligence research 4, pp. 237–285. Cited by: §3.1.
  • [8] T. Kong, F. Sun, H. Liu, Y. Jiang, and J. Shi (2019) FoveaBox: beyond anchor-based object detector. arXiv preprint arXiv:1904.03797. Cited by: §1, §1, §2.2.2, §4.1, §5.2.3, Table 6.
  • [9] H. Law and J. Deng (2018) Cornernet: detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750. Cited by: §2.2.1, Table 6.
  • [10] Y. Li, Y. Chen, N. Wang, and Z. Zhang (2019) Scale-aware trident networks for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6054–6063. Cited by: Table 6.
  • [11] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: §3.2, §5.1.
  • [12] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §1, §1, §2.1, §3.2, §5.2.3, Table 6, §5.
  • [13] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §5.
  • [14] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §2.1.
  • [15] X. Lu, B. Li, Y. Yue, Q. Li, and J. Yan (2019) Grid r-cnn. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7363–7372. Cited by: §2.1.
  • [16] W. Ouyang, K. Wang, X. Zhu, and X. Wang (2017) Chained cascade network for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1938–1946. Cited by: §2.1.
  • [17] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §2.2.2.
  • [18] J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263–7271. Cited by: §1.
  • [19] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §1, §1, §2.1, Table 6.
  • [20] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1, §2.1, §3.1.
  • [21] Z. Tian, C. Shen, H. Chen, and T. He (2019) FCOS: fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355. Cited by: §1, §1, §2.2.2, §3.2, §4.1, Table 4, Table 6, §5.
  • [22] L. Tychsen-Smith and L. Petersson (2017) Denet: scalable real-time object detection with directed sparse sampling. In Proceedings of the IEEE international conference on computer vision, pp. 428–436. Cited by: §2.2.1.
  • [23] J. Wang, K. Chen, S. Yang, C. C. Loy, and D. Lin (2019) Region proposal by guided anchoring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2965–2974. Cited by: §2.2.2.
  • [24] Y. Wu and K. He (2018) Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3–19. Cited by: Table 4.
  • [25] Z. Yang, S. Liu, H. Hu, L. Wang, and S. Lin (2019) RepPoints: point set representation for object detection. arXiv preprint arXiv:1904.11490. Cited by: §2.2.2, Table 6.
  • [26] J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang (2016) Unitbox: an advanced object detection network. In Proceedings of the 24th ACM international conference on Multimedia, pp. 516–520. Cited by: §3.2.
  • [27] X. Zhou, D. Wang, and P. Krähenbühl (2019) Objects as points. arXiv preprint arXiv:1904.07850. Cited by: §1.
  • [28] X. Zhou, J. Zhuo, and P. Krahenbuhl (2019) Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 850–859. Cited by: §2.2.1, Table 6.
  • [29] C. Zhu, Y. He, and M. Savvides (2019) Feature selective anchor-free module for single-shot object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 840–849. Cited by: §1, Table 6.