Dynamic Anchor Learning for Arbitrary-Oriented Object Detection

12/08/2020 ∙ by Qi Ming, et al. ∙ Beijing Institute of Technology

Arbitrary-oriented objects widely appear in natural scenes, aerial photographs, remote sensing images, etc.; thus arbitrary-oriented object detection has received considerable attention. Many current rotation detectors use plenty of anchors with different orientations to achieve spatial alignment with ground-truth boxes, and then the Intersection-over-Union (IoU) is applied to sample the positive and negative candidates for training. However, we observe that the selected positive anchors cannot always ensure accurate detections after regression, while some negative samples can achieve accurate localization. This indicates that the quality assessment of anchors through IoU is not appropriate, and it further leads to inconsistency between classification confidence and localization accuracy. In this paper, we propose a dynamic anchor learning (DAL) method, which utilizes the newly defined matching degree to comprehensively evaluate the localization potential of the anchors and carry out a more efficient label assignment process. In this way, the detector can dynamically select high-quality anchors to achieve accurate object detection, and the divergence between classification and regression is alleviated. With the newly introduced DAL, we achieve superior detection performance for arbitrary-oriented objects with only a few horizontal preset anchors. Experimental results on three remote sensing datasets, HRSC2016, DOTA, and UCAS-AOD, as well as the scene text dataset ICDAR 2015, show that our method achieves substantial improvement compared with the baseline model. Besides, our approach is also universal for object detection using horizontal bounding boxes. The code and models are available at https://github.com/ming71/DAL.


Introduction

Object detection is one of the most fundamental and challenging problems in computer vision. In recent years, with the development of deep convolutional neural networks (CNNs), tremendous successes have been achieved in object detection ren2015faster; dai2016r; redmon2016you; liu2016ssd. Most detection frameworks utilize preset horizontal anchors to achieve spatial alignment with the ground-truth (GT) boxes. Positive and negative samples are then selected through a specific strategy during the training phase, which is called label assignment.

Since objects in real scenes tend to appear in diverse orientations, the issue of oriented object detection has gradually received considerable attention. Many approaches achieve oriented object detection by introducing an extra orientation prediction and preset rotated anchors ma2018arbitrary; liao2018textboxes++. These detectors often follow the same label assignment strategy as general object detection frameworks. For simplicity, we refer to the IoU between the GT box and the anchor as the input IoU, and the IoU between the GT box and the regression box as the output IoU. The selected positives tend to obtain a higher output IoU than negatives, because their better spatial alignment with the GT provides sufficient semantic knowledge, which is conducive to accurate classification and regression.

Figure 1: Predefined anchor (red) and its regression box (green). (a) shows that anchors with a high input IoU cannot guarantee perfect detection. (b) reveals that the anchor that is hardly spatially aligned with the GT box still has the potential to localize object accurately.

However, we observe that the localization performance of the assigned samples is not consistent with the assumption mentioned above. As shown in Figure 1, the division into positive and negative anchors does not always correlate with the detection performance. Furthermore, we examine the distribution of the anchor localization performance over all candidates to explore whether this phenomenon is universal. As illustrated in Figure 2, a considerable percentage (26%) of positive anchors are poorly aligned with the GT after regression; therefore, positive anchors cannot ensure accurate localization. Besides, more than half of the candidates that achieve high-quality predictions are regressed from negatives, as shown in Figure 2, which implies that a considerable number of negatives with high localization potential have not been effectively used. In summary, we conclude that the localization performance does not entirely depend on the spatial alignment between the anchors and the GT.

Figure 2: Analysis of the classification and regression capabilities of anchors when the input IoU is used for label assignment. (a) Only 74% of the positive anchors localize the GT well after regression, which illustrates that many false-positive samples are introduced. (b) Only 42% of the high-quality detections (output IoU higher than 0.5) come from matched anchors, which means that quite a lot of negative anchors (58%) have the potential to achieve accurate localization. (c) The current label assignment leads to a positive correlation between the classification confidence and the input IoU. (d) High-performance detection results exhibit a weak correlation between localization ability and classification confidence, which is not conducive to selecting accurate detection results by classification score during inference.

Besides, the inconsistent localization performance before and after anchor regression further leads to inconsistency between classification and localization, which has been discussed in previous works jiang2018acquisition; kong2019consistent; he2019bounding. As shown in Figure 2, the anchor matching strategy based on the input IoU induces a positive correlation between the classification confidence and the input IoU. However, as discussed above, the input IoU is not entirely equivalent to the localization performance. Thus we cannot distinguish the localization performance of the detection results based on the classification score. The results in Figure 2 also confirm this viewpoint: a large number of regression boxes with high output IoU are misjudged as background.
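The kind of statistics reported in Figure 2 can be reproduced with a small diagnostic helper. The following is a minimal sketch (the function name and 0.5 thresholds are our own choices), assuming per-anchor input and output IoU values have already been collected:

```python
def assignment_diagnostics(input_iou, output_iou, pos_thr=0.5, hq_thr=0.5):
    """Summarize how well input-IoU label assignment predicts final
    localization quality (output IoU), in the spirit of Figure 2.

    Returns:
      pos_ok      - fraction of positives (input IoU >= pos_thr) that still
                    localize well after regression (output IoU >= hq_thr)
      hq_from_pos - fraction of high-quality detections regressed from positives
    """
    pairs = list(zip(input_iou, output_iou))
    # Output IoUs of anchors that were labeled positive by input IoU.
    pos = [o for i, o in pairs if i >= pos_thr]
    # Input IoUs of anchors whose regression box turned out high-quality.
    hq = [i for i, o in pairs if o >= hq_thr]
    pos_ok = sum(o >= hq_thr for o in pos) / len(pos) if pos else 0.0
    hq_from_pos = sum(i >= pos_thr for i in hq) / len(hq) if hq else 0.0
    return pos_ok, hq_from_pos
```

A low `pos_ok` reproduces observation (a), and a low `hq_from_pos` reproduces observation (b).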

To solve these problems, we propose a Dynamic Anchor Learning (DAL) method for better label assignment and further improved detection performance. First, a simple yet effective standard, the matching degree, is designed to assess the localization potential of anchors; it comprehensively considers the prior information of spatial alignment, the localization ability represented by feature alignment, and the regression uncertainty. We then adopt the matching degree for training sample selection, which helps to eliminate false-positive samples, dynamically mine potential high-quality candidates, and suppress the disturbance caused by regression uncertainty during training. Next, we propose a matching-sensitive loss function to further alleviate the inconsistency between classification and regression, making the classifier more discriminative for proposals with high localization performance, so that high-quality object detection can be achieved.

Extensive experiments on publicly available datasets, including the remote sensing datasets HRSC2016, DOTA, and UCAS-AOD and the scene text dataset ICDAR 2015, show that our method achieves consistent and substantial improvements for arbitrary-oriented object detection. Integrated with our approach, even a vanilla one-stage detector can compete with current state-of-the-art approaches on several datasets. Additionally, experiments on ICDAR 2013 and NWPU VHR-10 prove that our approach is also universal for object detection using horizontal boxes. The proposed DAL approach is general and can be easily integrated into existing object detection pipelines with no extra inference overhead. The contributions can be summarized as follows:

  • We observe that label assignment based on the IoU between the anchor and the GT box leads to a suboptimal assessment of localization ability, and further causes misaligned classification and regression performance.

  • The matching degree is introduced to measure the localization potential of anchors; the superior label assignment method based on this metric is conducive to high-quality detection.

  • The matching-sensitive loss is proposed to alleviate the weak correlation between classification and regression and to improve the discrimination ability for high-quality proposals.

Related Work

Arbitrary-Oriented Object Detection

The current mainstream detectors can be divided into two categories: two-stage detectors ren2015faster; dai2016r and one-stage detectors redmon2016you; liu2016ssd. Existing rotation detectors are mostly built on detectors using the horizontal bounding box representation. To localize rotated objects, preset rotated anchors and additional angle prediction are adopted in the literature liu2018arbitrary; ma2018arbitrary; liao2018textboxes++; liu2017rotated. Nevertheless, due to the variation of orientation, these detectors are obliged to preset plenty of rotated anchors to spatially align with the GT boxes. There are also some methods that detect oriented objects using only horizontal anchors. For example, RoI Transformer ding2019learning uses horizontal anchors but learns rotated RoIs through spatial transformation, reducing the number of predefined anchors. R3Det yang2019r3det adopts cascade regression and a refined box re-encoding module to achieve state-of-the-art performance with horizontal anchors. Although the above approaches achieve good performance, they cannot make a correct judgment on the quality of anchors, so the noise caused by label assignment still brings an adverse impact during the training process.

Label Assignment

Most anchor-based detectors densely preset anchors at each position of the feature maps. The massive preset anchors lead to a serious imbalance problem, especially for arbitrary-oriented objects with the additional angle setting. The most common solution is to control the ratio of candidates through a specific sampling strategy shrivastava2016training; pang2019libra. Besides, focal loss lin2017focal lowers the weight of easy examples to avoid their overwhelming contribution to the loss. The work li2019gradient further considers extremely hard samples as outliers, and a gradient harmonizing mechanism is proposed to conquer the imbalance problem. We demonstrate that the existence of outliers is universal, and our method prevents such noisy samples from being assigned incorrectly.

Some works have observed problems caused by using the input IoU as the standard for label assignment. Dynamic R-CNN zhang2020dynamic and ATSS zhang2020bridging automatically adjust the IoU threshold to select high-quality positive samples, but they fail to consider whether the IoU indicator itself is credible. The work li2020learning points out that the binary labels assigned to anchors are noisy, and constructs a cleanliness score for each anchor to supervise the training process. However, it only considers the noise of positive samples, ignoring the potentially powerful localization capabilities of the massive negative samples. HAMBox liu2020hambox reveals that unmatched anchors can also achieve accurate predictions, and attempts to utilize these samples. Nevertheless, its compensated anchors mined according to the output IoU are not reliable; moreover, it does not consider the degradation of matched positives. FreeAnchor zhang2019freeanchor formulates object-anchor matching as a maximum likelihood estimation procedure to select the most representative anchors, but its formulation is relatively complicated.

Proposed Method

Rotation Detector Built on RetinaNet

Real-time inference is essential for arbitrary-oriented object detection in many scenarios, hence we use the one-stage detector RetinaNet lin2017focal as the baseline model. It utilizes ResNet-50 as the backbone, and an architecture similar to FPN lin2017feature is adopted to construct a multi-scale feature pyramid. Predefined horizontal anchors are set on the features of every level P3, P4, P5, P6, and P7. Note that rotated anchors are not used here, because they are inefficient and unnecessary; we will further prove this point in the following sections. Since an extra angle parameter is introduced, the oriented box is represented in the format (x, y, w, h, θ). For bounding box regression, we have:

t_x = (x − x_a)/w_a,   t_y = (y − y_a)/h_a,   t_w = log(w/w_a),   t_h = log(h/h_a),   t_θ = θ − θ_a        (1)

where x, y, w, h, and θ denote the center coordinates, width, height, and angle, respectively; x and x_a stand for the predicted box and the anchor, respectively (likewise for y, w, h, and θ). Given the ground-truth box offsets t*, the multi-task loss is defined as follows:

L = (1/N) Σ_i [ L_cls(p_i, c_i) + c_i · L_reg(t_i, t*_i) ]        (2)

in which the value p and the vector t denote the predicted classification score and the predicted box offsets, respectively. The variable c represents the class label for anchors (c = 1 for positive samples and c = 0 for negative samples).
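As a concrete illustration of the offset transform in Eq. (1), the following minimal sketch encodes a rotated box against an anchor and decodes it back (the function names are ours; we assume the common five-parameter encoding with the angle offset taken as θ − θ_a in radians):

```python
import math

def encode_offsets(box, anchor):
    """Encode a rotated box (x, y, w, h, theta) relative to an anchor."""
    x, y, w, h, t = box
    xa, ya, wa, ha, ta = anchor
    tx = (x - xa) / wa          # center offsets, normalized by anchor size
    ty = (y - ya) / ha
    tw = math.log(w / wa)       # log-scale size offsets
    th = math.log(h / ha)
    tt = t - ta                 # angle offset in radians
    return (tx, ty, tw, th, tt)

def decode_offsets(offsets, anchor):
    """Inverse transform: recover the predicted box from offsets."""
    tx, ty, tw, th, tt = offsets
    xa, ya, wa, ha, ta = anchor
    return (tx * wa + xa, ty * ha + ya,
            wa * math.exp(tw), ha * math.exp(th), ta + tt)
```

Encoding followed by decoding against the same anchor recovers the original box, which is the invariant the regression head relies on.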

Dynamic Anchor Selection

Existing research zhang2019freeanchor; song2020revisiting has reported that the discriminative features required to localize objects are often not evenly distributed on the GT box, especially for objects with a wide variety of orientations and aspect ratios. Thus the label assignment strategy based on spatial alignment, i.e., the input IoU, fails to capture the critical features demanded for object recognition.

An intuitive approach is to use the feedback of the regression results, that is, the output IoU, to represent the feature alignment ability and dynamically guide the training process. Several attempts jiang2018acquisition; li2020learning have been made in this respect. In particular, we tentatively select the training samples based on the output IoU and use it as the soft label for classification. However, we find that the model is hard to converge because of the following two cases:

  • Anchors with high input IoU but low output IoU are not always negative samples; the low output IoU may simply be caused by insufficient training.

  • The unmatched low-quality anchors that accidentally achieve accurate localization performance tend to be misjudged as positive samples.

The above analysis shows that regression uncertainty interferes with the credibility of the output IoU as a representation of feature alignment. Regression uncertainty, which denotes the instability and irrelevance in the regression process, has been widely discussed in previous works feng2018towards; choi2019gaussian; kendall2017uncertainties; choi2018uncertainty. We find in our experiments that it misleads label assignment. Specifically, high-quality samples cannot be effectively utilized, and the selected false-positive samples cause unstable training. Unfortunately, neither the input IoU nor the output IoU used for label assignment can avoid the interference caused by regression uncertainty.

Based on these observations, we introduce the concept of matching degree (MD), which utilizes the prior information of spatial matching, the feature alignment ability, and the regression uncertainty of the anchor to measure its localization capacity. It is defined as follows:

md = α · sa + (1 − α) · fa − u^γ        (3)

where sa denotes the a priori spatial alignment, whose value is equivalent to the input IoU; fa is the feature alignment capability, calculated as the IoU between the GT box and the regression box; α and γ are hyperparameters used to weight the influence of the different items. u is a penalty term that denotes the regression uncertainty during training; it is obtained via the IoU variation before and after regression:

u = |sa − fa|        (4)

The suppression of interference during regression is vital for high-quality anchor sampling and stable training. The variation of IoU before and after regression represents the probability of incorrect anchor assessment. Note that our construction of the penalty term for regression uncertainty is very simple: since detection performance is not sensitive to the form of u, this naive, intuitive yet effective form is retained.
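A minimal sketch of how the matching degree could be computed for one anchor-GT pair (the function name is ours; the defaults α = 0.3 and γ = 5 are taken from the best setting in Table 2):

```python
def matching_degree(sa, fa, alpha=0.3, gamma=5):
    """Matching degree md = alpha*sa + (1-alpha)*fa - u**gamma,
    where sa is the input IoU (spatial alignment), fa is the output IoU
    (feature alignment), and u = |sa - fa| is the uncertainty penalty."""
    u = abs(sa - fa)
    return alpha * sa + (1 - alpha) * fa - u ** gamma
```

An anchor whose IoU barely changes during regression (sa ≈ fa) pays almost no penalty, while a large swing in either direction is discounted.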

With the newly defined matching degree, we conduct dynamic anchor selection for superior label assignment. In the training phase, we first calculate the matching degree between the GT boxes and the anchors; anchors with a matching degree higher than a certain threshold (set to 0.6 in our experiments) are then selected as positives, and the rest are negatives. After that, each GT that matches no anchor is compensated with the anchor of the highest matching degree. To achieve more stable training, we gradually adjust the impact of the input IoU during training. The specific adjustment schedule is as follows:

(5)

where T is the total number of iterations, and α is the final weighting factor that appears in Eq. 3.
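The selection-plus-compensation rule described above can be sketched as follows (a simplified illustration with our own function name; we assume the matching degrees are arranged as one row per GT and one column per anchor, with the 0.6 threshold from the text):

```python
def assign_labels(md, md_thr=0.6):
    """Dynamic anchor selection.

    md: list of per-GT lists of matching degrees, shape (num_gt, num_anchor).
    Returns a per-anchor list of booleans (True = positive sample).
    """
    num_anchor = len(md[0])
    # An anchor is positive if it exceeds the threshold for any GT.
    positive = [any(row[a] >= md_thr for row in md) for a in range(num_anchor)]
    # Compensation: a GT matched by no anchor gets its best anchor anyway.
    for row in md:
        best = max(row)
        if best < md_thr:
            positive[row.index(best)] = True
    return positive
```

This guarantees every GT keeps at least one positive anchor while still letting the matching degree, rather than the raw input IoU, drive the split.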

(a) Without MSL
(b) With MSL
Figure 3: The correlation between the output IoU and classification score with and without MSL.

Matching-Sensitive Loss

To further enhance the correlation between classification and regression and thus achieve high-quality arbitrary-oriented detection, we integrate the matching degree into the training process and propose the matching-sensitive loss function (MSL). The classification loss is defined as:

L_cls = (1/N) Σ_{i∈A} FL(p_i, c_i) + (1/N_P) Σ_{j∈P} md'_j · FL(p_j, c_j)        (6)

where A and P respectively represent the set of all anchors and the set of positive samples selected by the matching-degree threshold. N and N_P denote the total number of all anchors and of positive anchors, respectively. FL is the focal loss defined in lin2017focal. md' indicates the matching compensation factor, which is used to distinguish positive samples of different localization potential. For each ground-truth box g, we first calculate its matching degree with all anchors. Positive candidates are then selected according to a certain threshold, and the matching degree of the positives is denoted md_pos. Supposing that the maximal matching degree for g is md_max, this value will be adjusted to 1, and the compensation value is denoted Δmd:

Δmd = 1 − md_max        (7)

After that, Δmd is added to the matching degree of all positives to form the matching compensation factor:

md' = md_pos + Δmd        (8)

With the well-designed matching compensation factor, the detector treats positive samples of different localization capability distinctively. In particular, the classifier pays more attention to candidates with high localization potential. Thus high-quality predictions can be picked out by the classification score, which helps to alleviate the inconsistency between classification and regression.
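The compensation step is tiny in code; a minimal sketch (function name ours), following the description above, shifts every positive's matching degree so the best one reaches exactly 1 while relative gaps are preserved:

```python
def compensation_factor(md_pos):
    """Matching compensation factor: raise all positives' matching degrees
    by delta = 1 - max(md_pos), so the best-matched positive gets weight 1."""
    delta = 1.0 - max(md_pos)
    return [m + delta for m in md_pos]
```

Because delta is shared per GT, a GT whose anchors all match poorly still yields full-weight supervision for its best anchor.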

Matching degree measures the localization ability of anchors, thus it can be further used to promote high-quality localization. We formulate the matching-sensitive regression loss as follows:

L_reg = (1/N_P) Σ_{j∈P} md'_j · L_smooth-L1(t_j, t*_j)        (9)

L_smooth-L1 denotes the smooth-L1 loss for regression. The matching compensation factor is embedded into the regression loss to prevent the contribution of high-potential positives from being submerged by the dominant loss of negative samples and mediocre positive samples. It can be seen from Figure 3(a) that the correlation between the classification score and the localization ability of the regression box is not strong enough, which makes the prediction results selected by classification confidence sometimes unreliable. After training with the matching-sensitive loss, as shown in Figure 3(b), a higher classification score accurately characterizes better localization performance as represented by the output IoU, which verifies the effectiveness of the proposed method.
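Putting the pieces together, the matching-sensitive regression weighting can be sketched as follows (a simplified illustration with our own names; the per-positive smooth-L1 values are assumed to be precomputed):

```python
def matching_sensitive_reg_loss(reg_losses, md_pos):
    """Weight each positive's smooth-L1 regression loss by its matching
    compensation factor, then average over the positives.

    reg_losses: per-positive smooth-L1 loss values
    md_pos:     per-positive matching degrees
    """
    delta = 1.0 - max(md_pos)                 # compensation shift
    weights = [m + delta for m in md_pos]     # matching compensation factors
    return sum(w * l for w, l in zip(weights, reg_losses)) / len(md_pos)
```

High-matching-degree positives thus contribute with weight close to 1, while mediocre positives are down-weighted rather than discarded.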

Experiments

Datasets

We conduct experiments on the remote sensing datasets HRSC2016, DOTA, and UCAS-AOD, and on the scene text dataset ICDAR 2015. The ground-truth boxes in the images are all oriented bounding boxes. HRSC2016 lb2017high is a challenging remote sensing ship detection dataset, which contains 1061 images. The entire dataset is divided into a training set, validation set, and test set, including 436, 541, and 444 images, respectively. DOTA xia2018dota is the largest public dataset for object detection in remote sensing imagery with oriented bounding box annotations. It contains 2806 aerial images with 188,282 annotated instances across 15 categories. Since images in DOTA are too large, we crop them into 800×800 patches with the stride set to 200. UCAS-AOD zhu2015orientation is an aerial aircraft and car detection dataset, which contains 1510 images. We randomly divide it into a training set, validation set, and test set at a ratio of 5:2:3. The ICDAR 2015 dataset is used for the Incidental Scene Text challenge (Challenge 4) of the ICDAR Robust Reading Competition karatzas2015icdar. It contains 1500 images, including 1000 training images and 500 test images.

Variant | with Input IoU | with Output IoU | + Uncertainty Suppression | + Matching-Sensitive Loss
mAP     | 80.8           | 78.9            | 85.9                      | 88.6
AP75    | 52.4           | 50.4            | 57.7                      | 67.6
Table 1: Effects of each component in our method on the HRSC2016 dataset.
α \ γ | 3    | 4    | 5
0.2   | 84.1 | 88.1 | 87.3
0.3   | 88.3 | 88.2 | 88.6
0.5   | 86.2 | 85.5 | 88.4
0.7   | 84.1 | 77.9 | 88.1
0.9   | 70.1 | 75.5 | 83.5
Table 2: Analysis of the hyperparameters α and γ (mAP) on the HRSC2016 dataset.
Method | Baseline | yang2018automatic | HAMBox liu2020hambox | ATSS zhang2020bridging | DAL (Ours)
mAP    | 80.8     | 82.2              | 85.4                 | 86.1                   | 88.6
Table 3: Comparisons with other label assignment strategies on HRSC2016.
Methods Backbone PL BD BR GTF SV LV SH TC BC ST SBF RA HA SP HC mAP
FR-O xia2018dota R-101 79.09 69.12 17.17 63.49 34.20 37.16 36.20 89.19 69.60 58.96 49.40 52.52 46.69 44.80 46.30 52.93
R-DFPN yang2018automatic R-101 80.92 65.82 33.77 58.94 55.77 50.94 54.78 90.33 66.34 68.66 48.73 51.76 55.10 51.32 35.88 57.94
R2CNN jiang2017r2cnn R-101 80.94 65.67 35.34 67.44 59.92 50.91 55.81 90.67 66.92 72.39 55.06 52.23 55.14 53.35 48.22 60.67
RRPN ma2018arbitrary R-101 88.52 71.20 31.66 59.30 51.85 56.19 57.25 90.81 72.84 67.38 56.69 52.84 53.08 51.94 53.58 61.01
ICN azimi2018towards R-101 81.36 74.30 47.70 70.32 64.89 67.82 69.98 90.76 79.06 78.20 53.64 62.90 67.02 64.17 50.23 68.16
RoI Trans. ding2019learning R-101 88.64 78.52 43.44 75.92 68.81 73.68 83.59 90.74 77.27 81.46 58.39 53.54 62.83 58.93 47.67 69.56
CAD-Net zhang2019cad R-101 87.80 82.40 49.40 73.50 71.10 63.50 76.70 90.90 79.20 73.30 48.40 60.90 62.00 67.00 62.20 69.90
DRN pan2020dynamic H-104 88.91 80.22 43.52 63.35 73.48 70.69 84.94 90.14 83.85 84.11 50.12 58.41 67.62 68.60 52.50 70.70
O2-DNet wei2019oriented H-104 89.31 82.14 47.33 61.21 71.32 74.03 78.62 90.76 82.23 81.36 60.93 60.17 58.21 66.98 61.03 71.04
SCRDet yang2019scrdet R-101 89.98 80.65 52.09 68.36 68.36 60.32 72.41 90.85 87.94 86.86 65.02 66.68 66.25 68.24 65.21 72.61
R3Det yang2019r3det R-152 89.49 81.17 50.53 66.10 70.92 78.66 78.21 90.81 85.26 84.23 61.81 63.77 68.16 69.83 67.17 73.74
CSL yang2020arbitrary R-152 90.25 85.53 54.64 75.31 70.44 73.51 77.62 90.84 86.15 86.69 69.60 68.04 73.83 71.10 68.93 76.17
Baseline R-50 88.67 77.62 41.81 58.17 74.58 71.64 79.11 90.29 82.13 74.32 54.75 60.60 62.57 69.67 60.64 68.43
Baseline+DAL R-50 88.68 76.55 45.08 66.80 67.00 76.76 79.74 90.84 79.54 78.45 57.71 62.27 69.05 73.14 60.11 71.44
Baseline+DAL R-101 88.61 79.69 46.27 70.37 65.89 76.10 78.53 90.84 79.98 78.41 58.71 62.02 69.23 71.32 60.65 71.78
S2A-Net han2020align R-50 89.11 82.84 48.37 71.11 78.11 78.39 87.25 90.83 84.90 85.64 60.36 62.60 65.26 69.13 57.94 74.12
S2A-Net+DAL R-50 89.69 83.11 55.03 71.00 78.30 81.90 88.46 90.89 84.97 87.46 64.41 65.65 76.86 72.09 64.35 76.95
Table 4: Performance evaluation of the OBB task on the DOTA dataset. R-101 denotes ResNet-101 (likewise for R-50), and H-104 stands for Hourglass-104.
Methods Backbone Size NA mAP
Two-stage:
R2CNN jiang2017r2cnn ResNet101 800×800 21 73.07
RC1&RC2 lb2017high VGG16 - - 75.70
RRPN ma2018arbitrary ResNet101 800×800 54 79.08
zhang2018toward VGG16 - 24 79.60
RoI Trans. ding2019learning ResNet101 512×800 5 86.20
Gliding Vertex xu2020gliding ResNet101 512×800 5 88.20
Single-stage:
RRD liao2018rotation VGG16 384×384 13 84.30
R3Det yang2019r3det ResNet101 800×800 21 89.26
R-RetinaNet lin2017focal ResNet101 800×800 121 89.18
Baseline ResNet50 416×416 3 80.81
Baseline+DAL ResNet50 416×416 3 88.60
Baseline+DAL ResNet101 416×416 3 88.95
Baseline+DAL ResNet101 800×800 3 89.77
Table 5: Comparisons with state-of-the-art detectors on HRSC2016. NA denotes the number of preset anchors at each location of the feature map.
Methods car airplane mAP AP75
FR-O xia2018dota 86.87 89.86 88.36 47.08
RoI Transformer ding2019learning 87.99 89.90 88.95 50.54
Baseline 84.64 90.51 87.57 39.15
Baseline+DAL 89.25 90.49 89.87 74.30
Table 6: Detection results on UCAS-AOD dataset.
Methods P R F
CTPN tian2016detecting 74.2 51.6 60.9
SegLink shi2017detecting 73.1 76.8 75.0
RRPN ma2018arbitrary 82.2 73.2 77.4
SCRDet yang2019scrdet 81.3 78.9 80.1
RRD liao2018rotation 85.6 79.0 82.2
DB liao2020real 91.8 83.2 87.3
Baseline 77.2 77.8 77.5
Baseline+DAL 83.7 79.5 81.5
Baseline+DAL* 84.4 80.5 82.4
Table 7: Comparisons of different methods on ICDAR 2015. P, R, F indicate precision, recall, and F-measure, respectively. * means multi-scale training and testing.
Dataset Method mAP/F
ICDAR 2013 RetinaNet 77.2
RetinaNet+DAL 81.3
NWPU VHR-10 RetinaNet 86.4
RetinaNet+DAL 88.3
VOC 2007 RetinaNet 74.9
RetinaNet+DAL 76.1
Table 8: Performance evaluation of the HBB task on ICDAR 2013, NWPU VHR-10, and VOC 2007.

Implementation Details

For the experiments, we build the baseline on RetinaNet as described above. For HRSC2016, DOTA, and UCAS-AOD, only three horizontal anchors are set, with aspect ratios of {1/2, 1, 2}; for ICDAR 2015, only five horizontal anchors are set, with aspect ratios of {1/5, 1/2, 1, 2, 5}. All images are resized to 800×800. We use random flipping, rotation, and HSV colour-space transformation for data augmentation.

The optimizer used for training is Adam. The initial learning rate is set to 1e-4 and is divided by 10 at each decay step. The total numbers of iterations for HRSC2016, DOTA, UCAS-AOD, and ICDAR 2015 are 20k, 30k, 15k, and 40k, respectively. We train the models on an RTX 2080Ti with the batch size set to 8.

Ablation Study

Evaluation of different components

We conduct a component-wise experiment on HRSC2016 to verify the contribution of the proposed method. The experimental results are shown in Table 1. For the variant with the output IoU, a setting of 0.8 is used for stable training; even so, detection performance still drops from 80.8% to 78.9%. This indicates that the output IoU is not always credible for label assignment. With the suppression of regression uncertainty, the prior spatial alignment and the posterior feature alignment can work together effectively on label assignment; thus performance is dramatically improved to 85.9%, clearly higher than the baseline. Furthermore, the model using the matching-sensitive loss function achieves an mAP of 88.6%, and the proportion of high-precision detections is significantly increased. For example, AP75 is 9.9% higher than the variant with uncertainty suppression, which indicates that the matching-degree-guided loss effectively distinguishes anchors with different localization capabilities and pays more attention to anchors with a high matching degree, improving high-quality detection results.

Hyper-parameters

To find suitable hyperparameter settings and explore the relationship between the parameters, we conduct parameter sensitivity experiments; the results are shown in Table 2. In the presence of the uncertainty suppression term, as α is reduced appropriately, the influence of feature alignment increases and the mAP increases. This indicates that the feature alignment represented by the output IoU is beneficial for selecting anchors with high localization capability. However, when α is extremely large, the performance decreases sharply. The possible reason is that most potential high-quality samples are suppressed by the uncertainty penalty term when the output IoU can hardly provide feedback information. In this case, weakening the uncertainty suppression, that is, increasing γ, helps to alleviate this problem and makes anchor selection more stable.

Figure 4: Visualization of detection results on DOTA with our method.

Experiment Results

Comparison with other sampling methods

The experimental results are shown in Table 3. The baseline model conducts label assignment based on the input IoU. ATSS zhang2020bridging has achieved great success in object detection with horizontal boxes; when applied to rotated object detection, it still brings a substantial improvement, 5.3% higher than the baseline model. As for HAMBox liu2020hambox, since the samples mined according to the output IoU are likely to be low-quality, too many mined samples may cause the network to fail to converge, so we only compensate one anchor for each GT that does not match enough anchors; it outperforms the baseline by 4.6%. The proposed DAL method achieves a significant gain of 7.8% over the baseline model. Compared with the popular ATSS, our approach considers the localization performance of the regression box, so the selected samples have more powerful localization capability, and the result is 2.5% higher, which confirms the effectiveness of our method.

Results on DOTA

We compare the proposed approach with other state-of-the-art methods. As shown in Table 4, we achieve an mAP of 71.44%, which outperforms the baseline model by 3%. Integrated with DAL, even the vanilla RetinaNet can compete with many advanced methods. Besides, we also embed our approach into other models to verify its universality. S2A-Net han2020align is an advanced rotation detector that achieves state-of-the-art performance on the DOTA dataset. It can be seen that our method further improves its performance by 2.83%, reaching an mAP of 76.95%, the best result among all compared models. Some detection results on DOTA are shown in Figure 4.

Results on HRSC2016

HRSC2016 contains many rotated ships with large aspect ratios and arbitrary orientations. Our method achieves state-of-the-art performance on HRSC2016, as depicted in Table 5. Using ResNet-101 as the backbone with the input image resized to 800×800, our method reaches the highest mAP of 89.77%. Even with the lighter backbone ResNet-50 and a smaller input scale of 416×416, we still achieve an mAP of 88.6%, which is comparable to many current advanced methods.

It is worth mentioning that our method uses only three horizontal anchors at each position, yet outperforms frameworks with a large number of anchors. This shows that it is critical to effectively utilize the predefined anchors and select high-quality samples, and that there is no need to preset a large number of rotated anchors. Besides, our model is a one-stage detector using the feature maps P3–P7; compared with P2–P6 for two-stage detectors, the total number of positions at which anchors need to be set is smaller, so the inference speed is faster. With the input image resized to 416×416, our model reaches 34 FPS on an RTX 2080 Ti GPU.

Results on UCAS-AOD

Experimental results in Table 6 show that our model achieves a further improvement of 2.3% compared with the baseline. Specifically, the detection performance for small vehicles is significantly improved, indicating that our method is also robust to small objects. Note that the DAL method dramatically improves AP75, which reveals that the matching-degree-based loss function helps to pay more attention to high-quality samples and effectively distinguish them to achieve high-quality object detection.

Results on ICDAR 2015

To verify the generalization of our method to different scenarios, we also conduct experiments on a scene text detection dataset. The results are shown in Table 7. Our baseline model only achieves an F-measure of 77.5% after careful parameter selection and long-term training. The proposed DAL method improves the detection performance by 4%, achieving an F-measure of 81.5%. With multi-scale training and testing, it reaches 82.4%, which is comparable to the performance of many well-designed text detectors. However, there are a large number of long texts in the ICDAR 2015 dataset, which are often mistakenly detected as several short texts. Designed for general rotation detection, DAL does not specifically consider this situation; thus the naive model still cannot outperform current state-of-the-art approaches for scene text detection, such as DB liao2020real.

Experiments on Object Detection with HBB

When localizing objects with horizontal bounding boxes (HBB), label assignment still suffers from unevenly distributed discriminative features. Although this situation is not as severe as with rotated objects, it still risks unstable training. Therefore, our method is also effective for general object detection using HBB. The experimental results on ICDAR 2013 karatzas2013icdar, NWPU VHR-10 cheng2016learning, and VOC 2007 everingham2010pascal are shown in Table 8. It can be seen that DAL achieves substantial improvements for object detection with HBB, which proves the universality of our method.

Conclusion

In this paper, we propose a dynamic anchor learning strategy to achieve high-performance arbitrary-oriented object detection. Specifically, the matching degree is constructed to comprehensively consider the spatial alignment, feature alignment ability, and regression uncertainty for label assignment. Then, dynamic anchor selection and the matching-sensitive loss are integrated into the training pipeline to improve high-precision detection performance and alleviate the gap between the classification and regression tasks. Extensive experiments on several datasets confirm the effectiveness and universality of our method.

References