RepPoints V2: Verification Meets Regression for Object Detection

07/16/2020 · by Yihong Chen, et al. · Microsoft · Peking University

Verification and regression are two general methodologies for prediction in neural networks. Each has its own strengths: verification can be easier to infer accurately, while regression is more efficient and applicable to continuous target variables. Hence, it is often beneficial to carefully combine them to take advantage of their respective benefits. In this paper, we take this philosophy to improve state-of-the-art object detection, specifically RepPoints. Though RepPoints provides high performance, we find that its heavy reliance on regression for object localization leaves room for improvement. We introduce verification tasks into the localization prediction of RepPoints, producing RepPoints v2, which provides consistent improvements of about 2.0 mAP over the original RepPoints on the COCO object detection benchmark using different backbones and training methods. RepPoints v2 also achieves 52.1 mAP on COCO test-dev with a single model. Moreover, we show that the proposed approach can more generally elevate other object detection frameworks as well as applications such as instance segmentation. The code is available at https://github.com/Scalsol/RepPointsV2.


1 Introduction

Two common methodologies for neural network prediction are verification and regression. While either can drive network features to fit the final task targets, they each have different strengths. For the object localization problem, verification can be easier to infer because each feature is spatially aligned with the target to be verified. On the other hand, regression is often more efficient and it can also predict continuous target variables that enable subtle localization refinement.

To take advantage of all these benefits, earlier object localization methods Girshick (2015a); Liu et al. (2016); Lin et al. (2017b) combined verification and regression by first performing coarse localization through verifying several anchor box hypotheses, and then refining the localization by regressing box offsets. This combination approach was shown to be effective and led to state-of-the-art performance at the time. However, recent methods based purely on regression, which directly regress the object extent from each feature map position Yang et al. (2019a); Tian et al. (2019); Zhang et al. (2019a), can perform competitively or even better, as seen when comparing a representative regression method, RepPoints, to RetinaNet Lin et al. (2017b).

In this work, we examine whether pure regression based methods can be enhanced by the inclusion of verification methodology. We observe that verification has proven to be advantageous when used in certain ways. In CornerNet Law and Deng (2018), feature map points are verified as a bounding box corner or not, in contrast to verifying anchor boxes for coarse hypothesis localization in RetinaNet Lin et al. (2017b). This use of verification leads to significantly better localization performance as shown in Table 1. The difference may be attributed to corner points representing the exact spatial extent of a ground-truth object box, while an anchor box gives only a coarse hypothesis. In addition, each feature in corner point verification is well aligned to the corresponding point, while in anchor verification, the center feature used for verification lies away from the boundary area.

To elevate the performance of regression-based methods, specifically RepPoints Yang et al. (2019a), we thus seek to incorporate effective and compatible forms of verification. However, the different granularity of object representations processed by the two methods, i.e., whole objects in RepPoints and object parts (corners) in corner verification, presents an obstacle. To address this issue, we propose to model verification tasks by auxiliary side-branches that are added to the major regression branch at only the feature level and result level, without affecting intermediate representations. Through these auxiliary side-branches, verification can be fused with regression to provide the following benefits: better features by multi-task learning, feature enhancement through inclusion of verification cues, and joint inference by both methodologies. The fusion is simple, intuitive, general enough to utilize any kind of verification cue, and does not disrupt the flow of the RepPoints algorithm.

Through different techniques for harnessing verification, the localization and classification ability of RepPoints is substantially improved. The resulting detector, called RepPoints v2, shows consistent improvements of about 2.0 mAP over the original RepPoints on the COCO benchmark with different backbones. It also achieves 52.1 mAP on the COCO object detection test-dev set with a single ResNeXt-101-DCN model.

The proposed approach of choosing proper verification tasks and introducing them into a regression framework as auxiliary branches is flexible and general. It can be applied to object detection frameworks other than RepPoints, such as FCOS Tian et al. (2019). The additional corner and within-box verification tasks are shown to improve a vanilla FCOS detector by 1.3 mAP on COCO test-dev using a ResNet-50 model. This approach can also be applied beyond object detection, such as to instance segmentation by Dense RepPoints Yang et al. (2019b), where additional contour and mask verification tasks improve performance by 1.3 mAP using a ResNet-50 model on the COCO instance segmentation test-dev set, reaching 38.9 mask mAP.

 

Method | methodology | backbone | AP | AP50 | AP60 | AP70 | AP80 | AP90
RetinaNet Lin et al. (2017b) | ver.+reg. | ResNeXt-101 | 40.0 | 60.9 | 56.4 | 48.7 | 35.8 | 14.6
CornerNet Law and Deng (2018) | verification | HG-104 | 40.6 | 56.1 | 52.0 | 46.8 | 38.8 | 23.4
RepPoints* Yang et al. (2019a) | regression | ResNet-50 | 39.1 | 58.8 | 54.8 | 48.0 | 35.5 | 14.4
RepPoints v2 | ver.+reg. | ResNet-50 | 41.0 | 59.9 | 55.9 | 49.1 | 37.2 | 18.5

 

Table 1: Analysis of the performance on COCO val set among different methods. “RepPoints*” indicates our improved re-implementation of RepPoints.

2 Related Works

Verification based object detection

Early deep learning based object detection approaches Szegedy et al. (2013); Sermanet et al. (2014) adopt a multi-scale sliding window mechanism to verify whether each window is an object or not. Corner/extreme point based verification is also proposed Tychsen-Smith and Petersson (2017); Law and Deng (2018); Zhou et al. (2019b); Duan et al. (2019), where the verification of a 4-d hypothesis is factorized into sub-problems of verifying 2-d corners, such that the hypothesis space is more completely covered. A sub-pixel offset branch is typically included in these methods to predict continuous corner coordinates through regression. However, since this mainly deals with quantization error due to the lower resolution of the feature map compared to the input image, we treat these methods as purely verification based in our taxonomy.

Regression based object detection Achieving object detection by pure regression dates back to YOLO Redmon et al. (2016) and DenseBox Huang et al. (2015), where four box borders are regressed at each feature map position. Though attractive for their simplicity, their accuracy is often limited by the large displacements of regression targets, the issue of multiple objects located within a feature map bin, and extremely imbalanced positive and negative samples. Recently, after alleviating these issues with a feature pyramid network (FPN) Lin et al. (2017a) structure along with a focal loss Lin et al. (2017b), regression-based object detection has regained attention Tian et al. (2019); Kong et al. (2019); Zhou et al. (2019a); Yang et al. (2019a), with performance on par with or even better than other verification or hybrid methods. Our work advances in this direction, by leveraging verification methodology within regression based detectors without disrupting their flow and while largely maintaining the convenience of the original detectors. We mainly base our study on the RepPoints detector, but the method can be generally applied to other regression based detectors.

Hybrid approaches Most detectors are hybrid, for example, those built on anchors or proposals Girshick et al. (2014); Girshick (2015b); Ren et al. (2015); Lin et al. (2017b); Cai and Vasconcelos (2018); Lu et al. (2019). The verification and regression methodologies are employed in succession, where the anchors and proposals which provide coarse box localization are verified first, and then are refined by regression to produce the detection output. The regression target is usually at a relatively small displacement that can be easily inferred. Our work demonstrates a different hybrid approach, where the verification and regression steps are not run in serial but instead mostly in parallel to better combine their strengths. Moreover, this paper utilizes the more accurate corner verification tasks to complement regression based approaches.

3 Verification Meets Regression for Object Detection

3.1 A Brief Review of a Pure Regression Method: RepPoints

RepPoints Yang et al. (2019a) adopts pure regression to achieve object localization. Starting from a feature map position $i$, it directly regresses a set of points to represent the spatial extent of an object using two successive steps:

$$\mathcal{RP}_{\text{init}} = \{(x_i + \Delta x_k,\ y_i + \Delta y_k)\}_{k=1}^{n}, \quad \{(\Delta x_k, \Delta y_k)\}_{k=1}^{n} = g_1(\mathbf{F}_i),$$
$$\mathcal{RP}_{\text{refine}} = \{(x_k + \Delta x'_k,\ y_k + \Delta y'_k)\}_{k=1}^{n}, \quad \{(\Delta x'_k, \Delta y'_k)\}_{k=1}^{n} = g_2(\mathbf{F}_i), \tag{1}$$

where $\mathcal{RP}_{\text{init}} = \{(x_k, y_k)\}_{k=1}^{n}$ is the intermediate point set representation; $\mathbf{F}_i$ denotes the feature vector at position $i$; and $g_1$, $g_2$ are 2-d regression functions implemented by a linear layer. The bounding boxes of an object are obtained by applying a conversion function $\mathcal{T}$ on the point sets $\mathcal{RP}_{\text{init}}$ and $\mathcal{RP}_{\text{refine}}$, where $\mathcal{T}$ is modeled as the min-max, partial min-max or moment function.
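To make the two-step regression concrete, below is a minimal PyTorch-style sketch. It is not the released RepPointsV2 code: the per-position linear layers, the default of n = 9 points, and the module names are illustrative assumptions (the actual detector predicts offsets with conv layers over the whole feature map and refines using features resampled at the initial points).

```python
import torch
import torch.nn as nn

class PointRegressionSketch(nn.Module):
    """Minimal sketch of RepPoints-style two-step point regression."""
    def __init__(self, feat_dim=256, num_points=9):
        super().__init__()
        self.num_points = num_points
        self.init_reg = nn.Linear(feat_dim, 2 * num_points)    # g1
        self.refine_reg = nn.Linear(feat_dim, 2 * num_points)  # g2

    def forward(self, feat, xy):
        # feat: (B, C) feature vectors; xy: (B, 2) feature-map positions
        B = feat.shape[0]
        base = xy.unsqueeze(1).expand(B, self.num_points, 2)
        pts_init = base + self.init_reg(feat).view(B, self.num_points, 2)
        # the real detector re-samples features at pts_init before refining
        pts_refine = pts_init + self.refine_reg(feat).view(B, self.num_points, 2)
        return pts_init, pts_refine

def minmax_box(points):
    """Conversion function T (min-max variant): point set -> (x1, y1, x2, y2)."""
    x1y1 = points.min(dim=1).values
    x2y2 = points.max(dim=1).values
    return torch.cat([x1y1, x2y2], dim=-1)
```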

The direct regression in RepPoints Yang et al. (2019a) makes it a simple framework without anchoring. Though no anchor verification step is employed, it performs no worse than anchor-based detectors, e.g. RetinaNet Lin et al. (2017b), in localization accuracy, as shown in Table 1. Nevertheless, we are motivated by the potential synergy between regression and verification to consider the following questions: What kind of verification tasks can benefit the regression-based RepPoints Yang et al. (2019a)? Can various verification tasks be conveniently fused into the RepPoints framework without impairing the original detector?

3.2 Verification Tasks

We first discuss a pair of verification tasks that may help regression-based localization methods.

3.2.1 Corner Point Verification

Two corner points, e.g. the top-left corner and bottom-right corner, can determine the spatial extent of a bounding box, providing an alternative to the usual 4-d descriptor consisting of the box’s center point and size. This has been used in several bottom-up object detection methods Law and Deng (2018); Zhou et al. (2019b); Tychsen-Smith and Petersson (2017), which in general perform worse than other kinds of detectors in classification, but are significantly better in object localization, as seen in Table 1. In later sections, we show that this verification task can complement regression based methods to obtain more accurate object localization.

Corner point verification operates by associating a score to each point in the feature map, indicating its probability of being a corner point. An additional offset is predicted to produce continuous coordinates for corner points, which are initially quantized due to the lower resolution of the feature map compared to the input image. Following the original implementation Law and Deng (2018), corner pooling is computed in the head, with a focal loss Lin et al. (2017b) to train the corner score prediction and a smooth L1 loss for the sub-pixel offset prediction. In label assignment, each feature map point is labeled positive if a ground truth corner point is located within its feature bin, and all others are labeled negative. In computing the loss, the negative samples around each ground truth corner are assigned lower weights by an inverse Gaussian function of their distance to the ground-truth corner point. A more detailed description is given in Appendix A.
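As a rough illustration of this label assignment, the sketch below builds the heatmap label and sub-pixel offset target for a single ground-truth corner. It is a simplified sketch, not the paper's exact target generation: a constant sigma is used where the paper employs an object size-adaptive one, and the handling of multiple corners is omitted.

```python
import numpy as np

def corner_target(x, y, fmap_h, fmap_w, stride, sigma=2.0):
    """Targets for one ground-truth corner (x, y) given in image coordinates."""
    cx, cy = int(x / stride), int(y / stride)              # positive bin
    offset = np.array([x / stride - cx, y / stride - cy])  # sub-pixel offset target

    # Gaussian-decayed label map: 1 at the corner bin, falling off around it.
    # The negative penalty weight used by the loss is the inverse Gaussian
    # (1 - heatmap), so negatives close to the corner are penalized less.
    ys, xs = np.mgrid[0:fmap_h, 0:fmap_w]
    heatmap = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    heatmap[cy, cx] = 1.0
    return heatmap.astype(np.float32), offset
```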

Different from CornerNet Law and Deng (2018), which employs a special backbone architecture with an hourglass structure and a single-level high-resolution feature map (4x downsampled from the original image), most other recent object detectors adopt an FPN backbone with multi-level feature maps. We adapt the corner verification to utilize multi-level feature maps, e.g. the C3-C7 settings in RepPoints Yang et al. (2019a). Specifically, all ground truth corner points are assigned to every feature map level, contrary to the usual practice in FPN-based object detection of assigning according to object size. We find that assignment in this manner performs slightly better although it disregards the scale normalization issue, probably due to more positive samples at each level in training. It also performs substantially better than training on a single feature map level of highest resolution, i.e. C3, and then copying/resizing the predicted score/offset map to other levels.

3.2.2 Within-box Foreground Verification

Another verification task with potential to benefit regression based object detectors is to verify whether a feature map point is located within an object box or not. This within-box foreground verification task provides localization information evenly inside an object box, in contrast to corner points which focus on the box extremes. It is thus not as precise as corner points in describing object bounds, but may benefit object detectors given a coarse localization criterion.

We also differentiate among different object categories by using a non-binary, category-aware foreground heatmap. Concretely, for K object categories, there is a K-channel output, with each channel indicating the probability of a feature point lying inside an object of the corresponding category. As with corner point verification, each ground truth object is assigned to every level of the FPN backbone.

Normalized focal loss.

In training, a vanilla focal loss lets larger objects contribute significantly more than smaller objects, resulting in poorly learnt foreground scores for small objects. To address this issue, a normalized focal loss is proposed, which normalizes every positive feature map point by the total number of positive points within the same object box in the feature map. For negative points, the normalized loss uses the number of positive points as the denominator. A more detailed description is given in Appendix A.
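As a rough sketch of the weighting described above (the exact loss, including the focal terms and overall normalization, is given in Appendix A; the tensor layout here is an assumption):

```python
import torch

def foreground_loss_weights(pos_mask, box_area):
    """pos_mask: (H, W) bool tensor, True for points inside a ground-truth box.
    box_area: (H, W) long tensor, number of positive points of the box each
    location belongs to (ignored where pos_mask is False).
    Positive points are normalized by their own box's positive count;
    negative points by the total number of positive points."""
    num_pos = pos_mask.sum().clamp(min=1).float()
    pos_w = 1.0 / box_area.clamp(min=1).float()   # per-object normalization
    neg_w = (1.0 / num_pos).expand_as(pos_w)      # shared denominator for negatives
    return torch.where(pos_mask, pos_w, neg_w)
```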

Figure 1: Overview of the general fusion method. The outputs of verification modules (corner and foreground) are incorporated with the input feature to elevate the performance of regression-based object localization, and then joint inference is further employed.

3.3 A General Fusion Method

In this section, we incorporate these forms of verification to elevate the performance of regression-based methods. In general, regression-based methods detect objects in a top-down manner where all intermediate representations model the whole object. Since the two verification tasks process object parts, such as a corner or a foreground point, the different granularity of their object representations complicates fusion of the two methodologies.

To address this issue, we propose to model verification tasks by auxiliary side-branches that are fused with the major regression branch in a manner that does not affect its intermediate representations, as illustrated in Figure 1. Fusion occurs only at the feature level and result level. With these auxiliary side-branches, the detector can gain several benefits:

Better features by multi-task learning

The auxiliary verification tasks provide richer supervision in learning, yielding stronger features that increase detection accuracy, as shown in Table 4. Note that this multi-task learning is different from that of Mask R-CNN He et al. (2017). In Mask R-CNN He et al. (2017), the bounding box object detection benefits from the object mask prediction task, but it requires extra annotation of the object mask. In contrast, our additional auxiliary tasks are automatically generated from only the object bounding box annotation, allowing them to be applied in scenarios where just bounding box annotations are available.

Figure 2: Illustration of the corner module and foreground module.
Feature enhancement for better regression

The verification output includes strong cues regarding corner locations and the foreground area, which should benefit the regression task. Since the prediction output of these verification tasks has the same resolution as the feature map used for regression at each FPN level, we directly fuse them by adding to the original feature map an embedded feature map produced from the verification output by one conv layer. The embedding aims to project any verification output to the same dimension as the original feature map, and is shared across feature map levels. Note that a copy of the verification output detached from back-propagation is fed into the embedding convolution layer, to avoid affecting the learning of that verification task.
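A minimal sketch of this feature-level fusion is given below. It is a hand-written approximation rather than the released code; the 1x1 kernel of the embedding convolution and the channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class VerificationFusion(nn.Module):
    """Fuse verification outputs (corner/foreground heatmaps and offsets)
    back into the regression feature map of one FPN level."""
    def __init__(self, verif_channels, feat_channels=256):
        super().__init__()
        # shared across FPN levels: projects the verification output to the feature dim
        self.embed = nn.Conv2d(verif_channels, feat_channels, kernel_size=1)

    def forward(self, feat, verif_out):
        # detach so gradients from the regression branch do not
        # disturb the learning of the verification task
        return feat + self.embed(verif_out.detach())
```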

Joint inference by both methodologies

Feature-level fusion implicitly aids object localization. We also explicitly utilize the verification output from corner prediction together with regression-based localization in a joint inference approach that makes use of both of their strengths. Specifically, corner verification usually yields more accurate sub-pixel corner localization within a small neighborhood than the main regression branch, but is worse at judging whether a point is a real corner, since it lacks a whole picture of the object. Conversely, the main regression branch is better at the latter but worse at accurate sub-pixel localization. To combine their strengths, we refine a corner point $\mathcal{C}_i$ of the bounding box predicted from the main regression branch by

$$\mathcal{C}_i \leftarrow \arg\max_{\mathcal{C}'_i:\ \|\mathcal{C}'_i - \mathcal{C}_i\| < r} s(\mathcal{C}'_i), \tag{2}$$

where $i$ indicates the corner type (top-left or bottom-right); $\mathcal{C}'_i$ is a sub-pixel corner point produced by the corner prediction branch at a feature map position; $s(\mathcal{C}'_i)$ is its verification score; and $r$ is the neighborhood threshold, set to 1 by default. The regressed corner is kept unchanged if no verified corner falls within the neighborhood. Note that this result-level fusion is designed for the corner verification task only.
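The sketch below illustrates one reading of this refinement rule; since the exact selection criterion in Eq. (2) is reconstructed here, treat the argmax-within-radius behavior as an assumption rather than the definitive implementation.

```python
import torch

def refine_corner(reg_corner, corner_xy, corner_score, r=1.0):
    """reg_corner: (2,) corner from the regression branch.
    corner_xy: (M, 2) sub-pixel corners from the verification branch.
    corner_score: (M,) their verification scores.
    Returns the refined corner."""
    if corner_xy.numel() == 0:
        return reg_corner
    dist = (corner_xy - reg_corner).norm(dim=-1)
    in_nbhd = dist < r
    if not in_nbhd.any():
        return reg_corner                       # keep the regressed corner
    scores = corner_score.masked_fill(~in_nbhd, float('-inf'))
    return corner_xy[scores.argmax()]           # snap to the best verified corner
```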

This fusion method is flexible and general, utilizing any kind of verification cue, as it avoids interaction with the intermediate representations in the main branch, and thus has few requirements on the types of verification target. It also does not disrupt the overall flow of the main branch and largely maintains the convenience of the original detector built on the main branch.

3.4 RepPoints v2: Fusing Verification into RepPoints

RepPoints is a pure regression based object detector. We now complement it with verification tasks of different forms, specifically corner and within-box foreground verification. To increase the compatibility of RepPoints with the auxiliary verification tasks, we first make a small modification to it, such that the first two points explicitly represent the top-left and bottom-right corner points. We refer to this as the explicit-corners variant. The bounding box is then defined directly by these two corner points, instead of by the min-max, partial min-max or moment operation on the point set used in the original RepPoints. With the corner points, the conversion function becomes

$$\mathcal{T}: \mathcal{RP} \rightarrow \left(\tfrac{x_1 + x_2}{2},\ \tfrac{y_1 + y_2}{2},\ x_2 - x_1,\ y_2 - y_1\right), \tag{3}$$

where $(x_1, y_1)$ and $(x_2, y_2)$ are the first two points of the set, and the four numbers denote the x-center, y-center, width and height, respectively. To this explicit-corners variant of RepPoints, we add the auxiliary side-branches for verification. Specifically, we take the feature map right after the conv layers of the localization head as input, to reuse the existing head for computational savings. As illustrated in Figure 2, a convolutional layer is applied on this feature map, followed by two small sub-networks for the two verification tasks. The corner sub-network consists of a corner pooling layer Law and Deng (2018) followed by a conv layer to predict heatmap scores and sub-pixel offsets. The foreground sub-network is a single conv layer to predict the foreground score heatmap. In training, we adopt a multi-task loss:

$$\mathcal{L} = \mathcal{L}_{\text{RepPoints}} + \lambda_{\text{corner}}\,\mathcal{L}_{\text{corner}} + \lambda_{\text{fg}}\,\mathcal{L}_{\text{fg}}, \tag{4}$$

with loss weights $\lambda_{\text{corner}}$ and $\lambda_{\text{fg}}$. More details are given in Appendix A.

Customizing the general fusion method of Section 3.3 to RepPoints, we use corner verification for multi-task learning, feature enhancement and joint inference. Foreground verification is instead used only for multi-task learning and feature enhancement. The resulting detector is named RepPoints v2.

3.5 Extension to Other Detectors and Problems

The fusion method used for RepPoints can also improve other detectors such as FCOS Tian et al. (2019). As FCOS shares similar classification and localization heads with RepPoints, the fusion of RepPoints v2 can be directly applied to it. Concretely, the corner and foreground verification heads are applied on the feature map after the conv layers of the localization head. The verification output maps are fused into the main branch, and the final regression results are obtained by the joint inference described in Section 3.3.

The fusion method can also be extended to other tasks such as instance segmentation by Dense RepPoints Yang et al. (2019b), a regression-based method. Since there is an additional object mask annotation, more fine-grained verification formats can be used, such as object contour verification and category-aware semantic segmentation. As shown in Table 8, the additional verification methodology brings 1.3 mask AP gains to Dense RepPoints on the COCO test-dev set. More details are presented in Appendix B.

4 Experiments

We conduct experiments on the challenging MS COCO 2017 benchmark Lin et al. (2014b), which is split into train, val and test-dev sets with 115K, 5K and 20K images, respectively. We train all the models using the train set and conduct an ablation study on the val set. A system-level comparison to other methods is reported on the test-dev set.

4.1 Implementation Details

We use the mmdetection codebase Chen et al. (2019) for experiments. All experiments perform training with an SGD optimizer on 8 GPUs with 2 images per GPU, using an initial learning rate of 0.01, a weight decay of 0.0001 and momentum of 0.9. In ablations, most experiments follow the 1x settings, where 12 epochs with single-scale training are used and the learning rate is decayed by a factor of 10 after epochs 8 and 11. Most of the ablations use a ResNet-50 He et al. (2016) backbone pretrained on ImageNet Russakovsky et al. (2015). We also test our approach using multi-scale and longer training (2x settings with 24 epochs in total and the learning rate decayed at epochs 16 and 22) on stronger backbones, to see whether the gains of the proposed approach hold on these stronger baselines.

In inference, unless otherwise specified, we adopt a single-scale test approach with the image size the same as in single-scale training. We also conduct multi-scale testing on the strongest backbone for comparison with the previous state-of-the-art approaches. An IoU threshold of 0.6 is applied for Non-Maximum Suppression (NMS) to remove duplicate boxes.

For RepPoints Yang et al. (2019a), we use an improved implementation obtained by replacing the IoU assigner with an ATSS assigner Zhang et al. (2019a), yielding 39.1 mAP on COCO val using a ResNet-50 model and the 1x settings, 0.9 mAP higher than that reported in the original paper.

 

variant | +verification | AP | AP50 | AP75 | APS | APM | APL
min-max |   | 39.1 | 58.8 | 42.4 | 22.4 | 42.8 | 50.5
min-max | ✓ | 40.7 | 59.8 | 43.7 | 23.3 | 44.4 | 54.0
partial min-max |   | 39.0 | 58.7 | 42.4 | 21.8 | 42.5 | 50.7
partial min-max | ✓ | 40.7 | 59.7 | 43.6 | 23.1 | 44.4 | 54.0
moment |   | 39.1 | 58.9 | 42.2 | 22.3 | 42.6 | 50.8
moment | ✓ | 40.8 | 59.7 | 43.7 | 23.5 | 44.7 | 53.9
explicit-corners |   | 39.1 | 58.8 | 42.5 | 22.4 | 42.6 | 50.6
explicit-corners | ✓ | 41.0 | 59.9 | 43.9 | 23.8 | 44.8 | 54.0

 

Table 2: Performance of the explicit-corners variant of RepPoints.

4.2 Ablation Study

Explicit-corners variant. We first validate the effectiveness of the explicit-corners variant of RepPoints described in Section 3.4, as shown in Table 2. This variant performs on par with the three variants used in the original RepPoints, but performs 0.2-0.3 mAP better than the other variants when the verification module is added. This may be attributed to more effective interaction between the verification and regression tasks in this explicit-corners variant.

Forms of verification. Table 3 ablates the two forms of verification. The corner verification task alone brings 1.4 mAP gains over the RepPoints baseline. The benefits are mainly at higher IoU criteria, e.g. AP90 is improved by 4.0 mAP while AP50 increases by only 0.2 mAP. The additional foreground verification task brings another 0.5 mAP in gains, but mainly at lower IoU criteria; for example, AP50 is improved by 0.9 AP while AP90 remains about the same.

 

corner | foreground | AP | AP50 | AP75 | AP90 | APS | APM | APL
  |   | 39.1 | 58.8 | 42.5 | 14.4 | 22.4 | 42.6 | 50.6
✓ |   | 40.5 | 59.0 | 43.5 | 18.4 | 23.4 | 44.1 | 53.5
✓ | ✓ | 41.0 | 59.9 | 43.9 | 18.5 | 23.8 | 44.8 | 54.0

 

Table 3: Ablations on two forms of verification.

 

multi-task | enhance feature | joint inference | AP | AP50 | AP75 | AP90 | APS | APM | APL
  |   |   | 39.1 | 58.8 | 42.5 | 14.4 | 22.4 | 42.6 | 50.6
✓ |   |   | 39.5 | 58.9 | 42.7 | 14.6 | 22.5 | 43.1 | 51.0
✓ | ✓ |   | 40.2 | 60.0 | 43.5 | 15.7 | 24.1 | 43.8 | 52.5
✓ | ✓ | ✓ | 41.0 | 59.9 | 43.9 | 18.5 | 23.8 | 44.8 | 54.0

 

Table 4: Ablations on three types of fusion.

Types of fusion. Table 4 ablates the types of fusion, specifically multi-task learning, feature enhancement for regression, and joint inference. Multi-task learning brings a 0.4 mAP gain over the RepPoints baseline. Note that this multi-task learning does not rely on annotations beyond bounding boxes, in contrast to that in Mask R-CNN He et al. (2017). The additional feature enhancement operation brings another 0.7 mAP gain. The explicit fusion by joint inference increases mAP by a further 0.8, such that the full approach surpasses its counterpart without verification modules by 1.9 mAP.

Complexity analysis. Our approach involves slightly more parameters (38.3M vs 37.0M) and marginally more computation (244.2G vs 211.0G) than the original RepPoints. This overhead mainly comes from the additional heads that produce the verification score/offset maps. We also evaluate a RepPoints baseline with heavier computation, obtained by adding one more convolutional layer to the heads, resulting in a model with similar parameters and computation to our approach (38M/235.8G vs. 38.3M/244.2G). This enhanced baseline performs only 0.2 mAP better than the vanilla RepPoints, indicating that the improvements from our approach are mostly not due to more parameters and computation.

 

backbone | +verification | AP | AP50 | AP60 | AP70 | AP80 | AP90
ResNet-50 |   | 41.8 | 61.8 | 58.1 | 51.1 | 38.6 | 15.9
ResNet-50 | ✓ | 43.9 | 63.1 | 59.3 | 52.5 | 40.1 | 20.6
ResNet-101 |   | 43.4 | 63.3 | 59.4 | 53.0 | 40.4 | 18.0
ResNet-101 | ✓ | 45.5 | 64.5 | 60.6 | 54.1 | 42.3 | 22.2
ResNeXt-101 |   | 45.5 | 65.9 | 62.1 | 55.2 | 42.4 | 19.7
ResNeXt-101 | ✓ | 47.3 | 66.9 | 62.9 | 56.1 | 44.0 | 23.7

 

Table 5: Experiments on RepPoints baselines with stronger backbones, using 2x settings (24 epochs) and multi-scale training, on the COCO val set.

Stronger baselines. We further validate our method on stronger RepPoints baselines, using longer and multi-scale training (2x settings) and stronger backbones, as shown in Table 5. It can be seen that the gains are well maintained on these stronger RepPoints baselines, at about 2.0 mAP. This indicates that the proposed approach is largely complementary to improvements in the baseline architecture, in contrast to many techniques that exhibit decreasing gains with respect to stronger baselines.

Visualization. The visualization results are given in Appendix C.

4.3 Comparison to State-of-the-art Methods

We compare the proposed method to other state-of-the-art object detectors on the COCO 2017 test-dev set, as shown in Table 6. Here we use the GIoU loss Rezatofighi et al. (2019) instead of the smooth L1 loss in the regression branch. With ResNet-101 as the backbone, our method achieves 46.0 mAP without bells and whistles. By using stronger ResNeXt-101 Xie et al. (2017) and DCN Dai et al. (2017) models, the accuracy rises to 49.4 mAP. With additional multi-scale testing as in Zhang et al. (2019a), the proposed method achieves 52.1 mAP.

 

method | backbone | epochs | AP | AP50 | AP75 | APS | APM | APL
RetinaNet Lin et al. (2017b) | ResNet-101 | 18 | 39.1 | 59.1 | 42.3 | 21.8 | 42.7 | 50.2
FCOS Tian et al. (2019) | ResNeXt-101 | 24 | 43.2 | 62.8 | 46.6 | 26.5 | 46.2 | 53.3
DCN V2* Zhu et al. (2019) | ResNet-101+DCN | 18 | 46.0 | 67.9 | 50.8 | 27.8 | 49.1 | 59.5
RepPoints* Yang et al. (2019a) | ResNet-101+DCN | 24 | 46.5 | 67.4 | 50.9 | 30.3 | 49.7 | 57.1
MAL* Ke et al. (2020) | ResNeXt-101 | 24 | 47.0 | 66.1 | 51.2 | 30.2 | 50.1 | 58.9
FreeAnchor* Zhang et al. (2019b) | ResNeXt-101 | 24 | 47.3 | 66.3 | 51.5 | 30.6 | 50.4 | 59.0
ATSS* Zhang et al. (2019a) | ResNeXt-101+DCN | 24 | 50.7 | 68.9 | 56.3 | 33.2 | 52.9 | 62.4
TSD* Song et al. (2020) | SENet154+DCN | 24 | 51.2 | 71.9 | 56.0 | 33.8 | 54.8 | 64.2
CornerNet Law and Deng (2018) | HG-104 | 100 | 40.5 | 56.5 | 43.1 | 19.4 | 42.7 | 53.9
ExtremeNet Zhou et al. (2019b) | HG-104 | 100 | 40.2 | 55.5 | 43.2 | 20.4 | 43.2 | 53.1
CenterNet Duan et al. (2019) | HG-104 | 100 | 44.9 | 62.4 | 48.1 | 25.6 | 47.4 | 57.4
RepPoints v2 | ResNet-50 | 24 | 44.4 | 63.5 | 47.7 | 26.6 | 47.0 | 54.6
RepPoints v2 | ResNet-101 | 24 | 46.0 | 65.3 | 49.5 | 27.4 | 48.9 | 57.3
RepPoints v2 | ResNeXt-101 | 24 | 47.8 | 67.3 | 51.7 | 29.3 | 50.7 | 59.5
RepPoints v2 | ResNet-101+DCN | 24 | 48.1 | 67.5 | 51.8 | 28.7 | 50.9 | 60.8
RepPoints v2 | ResNeXt-101+DCN | 24 | 49.4 | 68.9 | 53.4 | 30.3 | 52.1 | 62.3
RepPoints v2* | ResNeXt-101+DCN | 24 | 52.1 | 70.1 | 57.5 | 34.5 | 54.6 | 63.6

 

Table 6: Comparison of RepPoints v2 to state-of-the-art detectors on COCO test-dev. * denotes that the number is obtained by multi-scale testing.

4.4 Extension to Other Detectors and Applications

 

method | backbone | AP | AP50 | AP75 | AP90 | APS | APM | APL
FCOS | ResNet-50 | 38.2 | 57.1 | 41.2 | 15.3 | 22.2 | 42.3 | 49.5
+verification | ResNet-50 | 39.5 | 57.7 | 41.9 | 18.4 | 22.3 | 43.2 | 52.7

 

Table 7: Applying the verification module to FCOS, which is implemented in mmdetection.

Direct application to FCOS. FCOS Tian et al. (2019) is another popular regression based object detector. We directly apply our approach to this detector without modification, and a 1.3 mAP improvement is obtained, as shown in Table 7, indicating the generality of the proposed approach.

 

method | backbone | AP | AP50 | AP75 | APS | APM | APL
Dense RepPoints | ResNet-50 | 37.6 | 60.4 | 40.2 | 20.9 | 40.5 | 48.6
+verification | ResNet-50 | 38.9 | 61.5 | 41.9 | 21.2 | 42.0 | 51.1

 

Table 8: Adding the verification module to the instance segmentation algorithm Dense RepPoints on COCO test-dev.

Extension to instance segmentation. Table 8 shows the effect of additional verification modules in the instance segmentation method of Dense RepPoints Yang et al. (2019b). The additional contour and foreground modules improve accuracy by 1.3 mAP, demonstrating the broad applicability of the fusion method.

5 Conclusion

In this paper, we propose RepPoints v2, which enhances the original regression-based RepPoints by fusing verification tasks into it in various ways. A new variant of RepPoints is proposed to increase its compatibility with the auxiliary verification tasks. The resulting object detector shows consistent improvements over the original RepPoints with different backbones and training approaches, and achieves 52.1 mAP on COCO test-dev. Moreover, the approach transfers easily to other detectors and to the instance segmentation domain, boosting the performance of the base detector/segmenter by a considerable margin.

Broader Impact

Since this work is about designing better object detectors, researchers and engineers working on object detection and instance segmentation for natural images, medical images and even video data may benefit from this paper. If the system fails, the model may not detect objects correctly. As with most object detectors, the detection results may not be interpretable, so it is hard to predict failure scenarios. The detector also inherits biases in the dataset used for training, and may incur a performance drop on datasets that have a domain gap with the training data.

References

  • Z. Cai and N. Vasconcelos (2018) Cascade r-cnn: delving into high quality object detection. In CVPR, pp. 6154–6162. Cited by: §2.
  • K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: §4.1.
  • J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In ICCV, pp. 764–773. Cited by: §4.3.
  • K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian (2019) CenterNet: keypoint triplets for object detection. In ICCV, Cited by: §2, Table 6.
  • R. B. Girshick (2015a) Fast R-CNN. In ICCV, pp. 1440–1448. Cited by: §1.
  • R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, pp. 580–587. Cited by: §2.
  • R. Girshick (2015b) Fast r-cnn. In ICCV, pp. 1440–1448. Cited by: §2.
  • K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In ICCV, pp. 2961–2969. Cited by: §3.3, §4.2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In CVPR, pp. 770–778. Cited by: §4.1.
  • L. Huang, Y. Yang, Y. Deng, and Y. Yu (2015) Densebox: unifying landmark localization with end to end object detection. arXiv preprint arXiv:1509.04874. Cited by: §2.
  • W. Ke, T. Zhang, Z. Huang, Q. Ye, J. Liu, and D. Huang (2020) Multiple anchor learning for visual object detection. In CVPR, pp. 7363–7372. Cited by: Table 6.
  • T. Kong, F. Sun, H. Liu, Y. Jiang, and J. Shi (2019) FoveaBox: beyond anchor-based object detector. arXiv preprint arXiv:1904.03797. Cited by: §2.
  • H. Law and J. Deng (2018) Cornernet: detecting objects as paired keypoints. In ECCV, pp. 734–750. Cited by: §A.1, §A.1, §A.1, Table 1, §1, §2, §3.2.1, §3.2.1, §3.2.1, §3.4, Table 6.
  • T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017a) Feature pyramid networks for object detection. In ICCV, pp. 2117–2125. Cited by: §2.
  • T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017b) Focal loss for dense object detection. In ICCV, pp. 2980–2988. Cited by: §A.1, Table 1, §1, §1, §2, §2, §3.1, §3.2.1, Table 6.
  • T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014a) Microsoft coco: common objects in context. In ECCV, pp. 740–755. Cited by: Appendix C.
  • T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014b) Microsoft COCO: common objects in context. In ECCV, pp. 740–755. Cited by: §4.
  • W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In ECCV, pp. 21–37. Cited by: §1.
  • X. Lu, B. Li, Y. Yue, Q. Li, and J. Yan (2019) Grid R-CNN. In CVPR, pp. 7363–7372. Cited by: §2.
  • J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In CVPR, pp. 779–788. Cited by: §2.
  • S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In NeurIPS, pp. 91–99. Cited by: §2.
  • H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. D. Reid, and S. Savarese (2019) Generalized intersection over union: A metric and a loss for bounding box regression. In CVPR, pp. 658–666. Cited by: §4.3.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. S. Bernstein, A. C. Berg, and F. Li (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision 115 (3), pp. 211–252. Cited by: §4.1.
  • P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun (2014) OverFeat: integrated recognition, localization and detection using convolutional networks. In ICLR, Cited by: §2.
  • G. Song, Y. Liu, and X. Wang (2020) Revisiting the sibling head in object detector. In CVPR, Cited by: Table 6.
  • C. Szegedy, A. Toshev, and D. Erhan (2013) Deep neural networks for object detection. In NeurIPS, pp. 2553–2561. Cited by: §2.
  • Z. Tian, C. Shen, H. Chen, and T. He (2019) FCOS: fully convolutional one-stage object detection. In ICCV, Cited by: §1, §1, §2, §3.5, §4.4, Table 6.
  • L. Tychsen-Smith and L. Petersson (2017) Denet: scalable real-time object detection with directed sparse sampling. In ICCV, pp. 428–436. Cited by: §2, §3.2.1.
  • S. Xie, R. B. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In CVPR, pp. 5987–5995. Cited by: §4.3.
  • Z. Yang, S. Liu, H. Hu, L. Wang, and S. Lin (2019a) RepPoints: point set representation for object detection. In ICCV, pp. 9656–9665. Cited by: Appendix C, Table 1, §1, §1, §2, §3.1, §3.1, §3.2.1, §4.1, Table 6.
  • Z. Yang, Y. Xu, H. Xue, Z. Zhang, R. Urtasun, L. Wang, S. Lin, and H. Hu (2019b) Dense reppoints: representing visual objects with dense point sets. arXiv preprint arXiv:1912.11473. Cited by: Appendix B, §1, §3.5, §4.4.
  • S. Zhang, C. Chi, Y. Yao, Z. Lei, and S. Z. Li (2019a) Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. arXiv preprint arXiv:1912.02424. Cited by: §1, §4.1, §4.3, Table 6.
  • X. Zhang, F. Wan, C. Liu, R. Ji, and Q. Ye (2019b) FreeAnchor: learning to match anchors for visual object detection. In NeurIPS, pp. 147–155. Cited by: Table 6.
  • X. Zhou, D. Wang, and P. Krähenbühl (2019a) Objects as points. arXiv preprint arXiv:1904.07850. Cited by: §2.
  • X. Zhou, J. Zhuo, and P. Krähenbühl (2019b) Bottom-up object detection by grouping extreme and center points. In CVPR, Cited by: §2, §3.2.1, Table 6.
  • X. Zhu, H. Hu, S. Lin, and J. Dai (2019) Deformable convnets V2: more deformable, better results. In CVPR, pp. 9308–9316. Cited by: Table 6.

Appendix A Details of Verification Tasks

A.1 Corner Point Verification

Ground-truth assignment.

We follow CornerNet Law and Deng (2018) to assign ground-truth corners. For each corner, only the location of the corner itself is positive, and all other locations are negative. Moreover, the penalty given to negative locations within a radius of the positive location is reduced. Specifically, for a given corner point $p$ on the original image of size $H \times W$, the ground-truth heatmap with downsampling rate $s$ has size $\lfloor H/s \rfloor \times \lfloor W/s \rfloor$, and the corresponding location of $p$ on the heatmap is $\tilde{p} = (\tilde{x}, \tilde{y})$. The penalty weight of a negative location $(x, y)$ is defined by an inverse Gaussian function:

$$w_{xy} = 1 - \exp\!\left(-\frac{(x - \tilde{x})^2 + (y - \tilde{y})^2}{2\sigma^2}\right), \tag{5}$$

where $\sigma$ is an object size-adaptive standard deviation. Note that for different positive points, the penalty weight of a negative point may be different; in that case, the largest one is taken as the penalty weight of the negative point.

For the additional offset prediction, we follow Law and Deng (2018) in supervising only the positive locations. For a given corner point $p = (x, y)$ and its corresponding downsampled location $\tilde{p} = (\lfloor x/s \rfloor, \lfloor y/s \rfloor)$, the training target is

$$o = \left(\frac{x}{s} - \left\lfloor\frac{x}{s}\right\rfloor,\ \frac{y}{s} - \left\lfloor\frac{y}{s}\right\rfloor\right). \tag{6}$$
Loss.

Following CornerNet Law and Deng (2018), we use a modified focal loss Lin et al. (2017b) to learn the corner heatmap:

$$\mathcal{L}_{\text{det}} = \frac{-1}{N}\sum_{x,y}\begin{cases}(1 - p_{xy})^{\alpha}\log(p_{xy}), & \text{if } y_{xy} = 1,\\ (1 - y_{xy})^{\beta}\,p_{xy}^{\alpha}\log(1 - p_{xy}), & \text{otherwise},\end{cases} \tag{7}$$

where $N$ is the number of objects in an image, and $p_{xy}$ and $y_{xy}$ are the score and label at location $(x, y)$ in the predicted heatmap. We set $\alpha = 2$ and $\beta = 4$, following Law and Deng (2018).
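A compact sketch of this penalty-reduced focal loss is given below, written from the CornerNet-style formulation above; the normalization by the number of positive locations (rather than objects) and the epsilon guard are simplifying assumptions.

```python
import torch

def corner_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """pred: predicted corner heatmap in (0, 1); gt: ground-truth heatmap,
    1 at corner locations and Gaussian-decayed values elsewhere."""
    pos = gt.eq(1).float()
    neg = 1.0 - pos
    pos_loss = pos * (1 - pred) ** alpha * torch.log(pred + eps)
    neg_loss = neg * (1 - gt) ** beta * pred ** alpha * torch.log(1 - pred + eps)
    # normalize by the number of positive locations (roughly the object count)
    num_pos = pos.sum().clamp(min=1)
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```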

In addition, the loss to learn the offsets is defined as

$$\mathcal{L}_{\text{off}} = \frac{1}{N}\sum_{k=1}^{N}\text{SmoothL1}\!\left(o_k, \hat{o}_k\right), \tag{8}$$

where $o_k$ is the ground-truth offset, $\hat{o}_k$ is the predicted offset, and $k$ indexes the corner points. Finally, the overall loss of the corner branch, $\mathcal{L}_{\text{corner}}$, is simply defined as the summation of $\mathcal{L}_{\text{det}}$ and $\mathcal{L}_{\text{off}}$.

A.2 Within-box Foreground Verification

Normalized focal loss.

The normalized focal loss is defined as

$$\mathcal{L}_{\text{fg}} = -\frac{1}{Z}\sum_{x,y,c} w_{xy}\Big[\alpha\,\mathbb{1}[y_{xyc} = 1]\,(1 - p_{xyc})^{\beta}\log(p_{xyc}) + (1 - \alpha)\,\mathbb{1}[y_{xyc} \neq 1]\,p_{xyc}^{\beta}\log(1 - p_{xyc})\Big], \tag{9}$$

where $y_{xyc}$ is the value on the ground-truth foreground heatmap, $p_{xyc}$ is the $c$-th category score at location $(x, y)$ of the predicted heatmap, and $w_{xy}$ is the normalizing factor, which is defined as

$$w_{xy} = \begin{cases}\dfrac{1}{A_{xy}}, & \text{if } (x, y) \text{ is a positive location},\\[4pt] \dfrac{1}{N_{\text{pos}}}, & \text{otherwise},\end{cases} \tag{10}$$

where $A_{xy}$ is the area of the object that $(x, y)$ lies in; if multiple objects of the same category collide at the same location, we take the smallest size. $Z$ is defined as $\sum_{x,y} w_{xy}$, the sum of the normalizing factor over all locations, and $N_{\text{pos}}$ is the number of positive points. $\alpha$ and $\beta$ are set to 0.25 and 2, respectively.

A.3 Overall Loss

The overall loss is defined as

$$\mathcal{L} = \mathcal{L}_{\text{RepPoints}} + \lambda_{\text{corner}}\,\mathcal{L}_{\text{corner}} + \lambda_{\text{fg}}\,\mathcal{L}_{\text{fg}}, \tag{11}$$

with loss weights $\lambda_{\text{corner}}$ and $\lambda_{\text{fg}}$, where $\mathcal{L}_{\text{corner}}$ and $\mathcal{L}_{\text{fg}}$ are defined above.

Appendix B Extension to Instance Segmentation

Training settings.

We build on Dense RepPoints Yang et al. (2019b) to validate the effectiveness of our method, as Dense RepPoints is the state-of-the-art regression-based instance segmentation approach. Because contour points have no type, a single heatmap is used to predict all contour points. Other parameters, network architectures and training details are the same as for object detection.

Joint inference.

With only a few modifications, joint inference can also be used for instance segmentation. For a predicted representative point that is close to a contour point, we refine the predicted representative point set by adding the adjacent contour point into the set. More specifically, if the score of a representative point in the contour heatmap is greater than 0.5, the point with the highest contour score among all points within a distance of 1 is added to the set.
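A minimal sketch of this refinement step, under the description above, is given below; the point representations and the way contour scores are looked up are simplifying assumptions rather than the released implementation.

```python
import torch

def refine_point_set(rep_points, rep_scores, contour_points, contour_scores,
                     score_thr=0.5, radius=1.0):
    """rep_points: (N, 2) predicted representative points;
    rep_scores: (N,) their scores sampled from the contour heatmap;
    contour_points/contour_scores: (M, 2)/(M,) verified contour points.
    For each representative point that looks like a contour point, the
    best nearby verified contour point is added to the set."""
    extra = []
    for p, s in zip(rep_points, rep_scores):
        if s <= score_thr:
            continue
        dist = (contour_points - p).norm(dim=-1)
        near = dist < radius
        if near.any():
            scores = contour_scores.masked_fill(~near, float('-inf'))
            extra.append(contour_points[scores.argmax()])
    if extra:
        rep_points = torch.cat([rep_points, torch.stack(extra)], dim=0)
    return rep_points
```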

Experimental results.

The results are given in Table 9. A ResNet-50 backbone and 3x scheduler are adopted. Adding the verification modules elevates performance by 1.0 mAP, and further applying joint inference improves it by an additional 0.3 mAP. This demonstrates the flexibility of our proposed method.

 

method | backbone | AP | AP50 | AP75 | APS | APM | APL
Dense RepPoints | ResNet-50 | 37.6 | 60.4 | 40.2 | 20.9 | 40.5 | 48.6
+contour&fg | ResNet-50 | 38.6 | 61.4 | 41.7 | 21.3 | 41.8 | 50.8
+joint inference | ResNet-50 | 38.9 | 61.5 | 41.9 | 21.2 | 42.0 | 51.1

 

Table 9: Adding the verification module to the instance segmentation algorithm Dense RepPoints on COCO test-dev.

Appendix C Visualization

Figure 3 compares object detection results on COCO 2017 Lin et al. (2014a) between RepPoints v1 Yang et al. (2019a) and RepPoints v2. Both methods adopt a ResNet-50 backbone and 1x scheduler. As can be seen, compared to RepPoints v1, RepPoints v2 provides more precise localization.

Figure 3: Visualization results of RepPoints v1 and RepPoints v2. The top row shows detections from RepPoints v1 and the bottom row shows RepPoints v2. Red boxes are generated without joint inference, while green boxes adopt joint inference. As can be seen, the full version of RepPoints v2 achieves better localization.

Figure 4 visualizes the main components of RepPoints v2. From left to right are the predicted set of representative points, the foreground prediction, the top-left corner prediction and the bottom-right corner prediction. As can be seen, all components provide informative cues, leading to better performance.

Figure 4: Visualization of the main components of RepPoints v2. From left to right: the predicted set of representative points, the foreground prediction, the top-left corner prediction and the bottom-right corner prediction.