Although object detection has achieved significant progress in natural images, rotated object detection in aerial images remains challenging due to the arbitrary orientations, large scale and aspect ratio variations, and extreme density of objects [xia2018dota]. Rotated object detection aims to predict a set of oriented bounding boxes (OBBs) and their corresponding classes in an aerial image, and serves as an essential step in many applications, e.g., urban management, emergency rescue, and precision agriculture [ding2021object]. Modern rotated object detectors can be divided into two categories in terms of OBB representation: angle-based detectors and angle-free detectors.
In angle-based detectors, the OBB of a rotated object is usually represented by five parameters (x, y, w, h, θ). Most existing state-of-the-art methods are angle-based detectors relying on two-stage RCNN frameworks [jiang2018r, ding2019learning, yang2019scrdet, han2021redet, xie2021oriented]. Generally, these methods use an RPN to generate horizontal or rotated RoIs, then use a dedicated RoI pooling operator to extract features from these RoIs. Finally, an RCNN head predicts the OBB and the corresponding class. Compared with two-stage detectors, one-stage angle-based detectors [ma2018arbitrary, zhang2018toward, yang2020arbitrary, yang2019r3det, han2021align] directly regress and classify OBBs based on dense anchors for efficiency.
However, angle-based detectors usually suffer from a long-standing boundary discontinuity problem [DCL, yang2020scrdet++] due to the periodicity of the angle and the exchangeability of the edges. Moreover, the units of the five-parameter representation are inconsistent: (x, y, w, h) are measured in pixels while θ is an angle. These obstacles make training unstable and limit performance.
In contrast with angle-based detectors, angle-free detectors usually represent a rotated object as an eight-parameter OBB (x1, y1, x2, y2, x3, y3, x4, y4), which denotes the four corner points of the rotated object. Modern angle-free detectors [azimi2018towards, qian2019learning, xu2020gliding, yi2021oriented] directly perform quadrilateral regression, which is more straightforward than the angle-based representation. Unfortunately, although angle regression is abandoned and the parameter units are consistent, the performance of existing angle-free detectors is still relatively limited. Designing a more straightforward and effective framework that alleviates the boundary discontinuity problem is therefore key to the success of rotated object detectors.
In this paper, we propose a purely angle-free framework for rotated object detection, called Point RCNN, which can alleviate the boundary discontinuity problem and attain state-of-the-art performance. Concretely, Point RCNN is a two-stage detector and mainly consists of an RPN (PointRPN) and an RCNN head (PointReg), both of which are angle-free. Given an input feature map, first, PointRPN learns a set of representative points for each feature point in a coarse-to-fine manner. Then, rotated RoIs (RRoIs) are generated through the minAreaRect function of OpenCV [opencv_library]. Finally, PointReg applies a rotated RoI Align operator [han2021redet, ding2019learning] to extract RRoI features, and then refines and classifies the eight-parameter OBB of corner points. In addition, existing methods largely ignore the category imbalance in aerial images; we propose to resample images of rare categories to stabilize convergence during training.
The contributions of this paper are as follows:
We propose Point RCNN, a purely angle-free framework for rotated object detection. Without introducing angle prediction, Point RCNN is able to address the boundary discontinuity problem.
We propose PointRPN and PointReg to reformulate angle prediction as the more straightforward points regression. Both of them are angle-free and have consistent parameter units. We further propose to resample images of rare categories to stabilize training and improve overall performance.
Compared with the state-of-the-art methods, our Point RCNN framework attains higher detection performance on several large-scale datasets.
2 Related Work
2.1 Horizontal Object Detection
In the past decade, object detection has become an important computer vision task and has received considerable attention. One line of research focuses on two-stage detectors [girshick2014rich, girshick2015fast, ren2015faster, he2017mask, lin2017feature, cai2018cascade, hu2018relation], which first generate a sparse set of Regions of Interest (RoIs) with a Region Proposal Network (RPN), and then perform classification and bounding box regression. While two-stage detectors still attract much attention, another line of research tends to develop efficient one-stage detectors [liu2016ssd, redmon2016you, lin2017focal, law2018cornernet, tian2019fcos, duan2019centernet, yang2019reppoints], in which SSD [liu2016ssd] and YOLO [redmon2016you] are the fundamental methods that use a set of pre-defined anchor boxes to predict object categories and anchor box offsets. Recently, some anchor-free methods [law2018cornernet, duan2019centernet, yang2019reppoints] detect objects by predicting center, corner, or representative points, which also inspires us to develop an angle-free detector for rotated objects.
2.2 Rotated Object Detection
In terms of the representation of oriented bounding box (OBB), modern rotated object detectors can be mainly divided into two categories: angle-based detectors and angle-free detectors.
Angle-based detectors detect rotated objects by learning a five-parameter OBB (x, y, w, h, θ), in which (x, y, w, h) denotes a horizontal bounding box and θ denotes the angle between the longer edge and the horizontal axis. RRPN [ma2018arbitrary] and R2PN [zhang2018toward] make use of multiple rotated anchors with different angles, scales, and aspect ratios, which improves performance at the cost of increased computational complexity (see LABEL:fig:frontcover(a)). R2CNN [jiang2018r] detects horizontal and rotated bounding boxes simultaneously with multi-task learning. RoI Transformer [ding2019learning] proposes a rotated RoI (RRoI) learner to transform a horizontal RoI into an RRoI, which provides more accurate RRoIs with a complex pipeline (see LABEL:fig:frontcover(b)). SCRDet [yang2019scrdet] enhances features with an attention module and proposes an IoU-smooth loss to alleviate the loss discontinuity issue. CSL [yang2020arbitrary] reformulates angle prediction from regression to classification to alleviate the boundary discontinuity problem. GWD [GWD2021] and KLD [KLD2021] propose more efficient loss functions for OBB regression. S2A-Net [han2021align] proposes a single-shot alignment network to realize full feature alignment and alleviate the inconsistency between regression and classification. Recently, ReDet [han2021redet] uses a rotation-equivariant network to encode rotation equivariance explicitly and presents rotation-invariant RoI Align to extract rotation-invariant features. Oriented R-CNN [xie2021oriented] is a two-stage detector that consists of an oriented RPN for generating RRoIs and an oriented RCNN head for refining them. Both ReDet and Oriented R-CNN achieve promising accuracy.
However, the boundary problem in angle regression still makes training unstable and limits performance. While angle-based detectors still find many applications, angle-free methods are attracting more and more attention from the community.
Angle-free detectors reformulate rotated object regression as learning an eight-parameter OBB (x1, y1, x2, y2, x3, y3, x4, y4), which represents the four corner points of a rotated object. ICN [azimi2018towards] directly estimates the four vertices of a quadrilateral to regress an oriented object based on image and feature pyramids. RSDet [qian2019learning] and Gliding Vertex [xu2020gliding] achieve more accurate rotated object detection via direct quadrilateral regression. Recently, BBAVectors [yi2021oriented] extends the horizontal keypoint-based object detector to the oriented object detection task. CFA [BeyondBBox] proposes a convex-hull feature adaptation approach for configuring convolutional features. Compared with angle-based methods, angle-free detectors are more straightforward and can alleviate the boundary problem to a large extent. However, their performance is still relatively limited.
In this paper, we propose an effective angle-free framework for rotated object detection, i.e., Point RCNN, which mainly consists of PointRPN and PointReg. Compared with other RRoI generation methods, our PointRPN generates accurate RRoIs in an anchor-free and angle-free manner (see LABEL:fig:frontcover(c)).
3 Point RCNN
The overall structure of our Point RCNN is depicted in Fig. 1. We start by revisiting the boundary discontinuity problem of angle-based detectors. Then, we describe the overall pipeline of Point RCNN. Finally, we elaborate on the PointRPN and PointReg modules, and propose a balanced dataset strategy to rebalance long-tailed datasets during training.
3.1 Boundary Discontinuity
The boundary problem [DCL, yang2020scrdet++] is a long-standing issue in angle-based detectors. Take the commonly used five-parameter OBB representation (x, y, w, h, θ) as an example, where (x, y) represents the center coordinates, (w, h) represents the shorter and longer edges of the box, and θ represents the angle between the longer edge and the horizontal axis. As shown in Fig. 2, when the target box is approximately square, a slight variation in edge length may cause w and h to swap, leading to a substantial jump of nearly 90° in the angle θ.
This boundary discontinuity in angle prediction confuses the optimization of the network and limits detection performance.
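The jump can be made concrete with a few lines of code. The sketch below assumes the common long-edge (le90) angle convention, in which θ is measured against the longer edge and normalized into [-90°, 90°); the helper name `to_le90` is ours, not from the paper.

```python
def to_le90(w, h, theta):
    """Normalize a five-parameter OBB to the long-edge convention:
    h is the longer edge and theta lies in [-90, 90) degrees."""
    if w > h:
        w, h = h, w
        theta += 90.0
    theta = (theta + 90.0) % 180.0 - 90.0  # wrap into [-90, 90)
    return w, h, theta

# Two almost identical, near-square boxes: a tiny change in edge length
# swaps which edge is "longer", and the angle target jumps by 90 degrees.
_, _, a1 = to_le90(10.0, 10.1, 0.0)
_, _, a2 = to_le90(10.1, 10.0, 0.0)
print(a1, a2)  # 0.0 -90.0
```

Although the two boxes are nearly identical geometrically, their regression targets differ by 90°, which is exactly the discontinuity that destabilizes angle-based training.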
3.2 Overall Pipeline
The overall pipeline of Point RCNN is shown in Fig. 1. During training, Backbone-FPN first extracts feature maps from an input image. Then, PointRPN performs representative points regression and generates pseudo OBBs for rotated RoIs (RRoIs). Finally, for each RRoI, PointReg refines the corner points and classifies them to produce the final detection results. Besides, we propose to resample images of rare categories to stabilize training and improve overall performance.
The overall training objective is described as:

L = L_PointRPN + L_PointReg,   (1)

where L_PointRPN denotes the losses in PointRPN and L_PointReg denotes the losses in PointReg. We describe them in detail in the following sections.
3.3 PointRPN
Existing rotated object detection methods generate rotated proposals indirectly by transforming the outputs of an RPN [faster-rcnn] and suffer from the boundary discontinuity problem caused by angle prediction. For example, [han2021redet, ding2019learning] use an RoI transformer to convert horizontal proposals into rotated proposals with an additional angle prediction task. Unlike these methods, we propose to directly predict the rotated proposals via representative points learning. The learning of points is more flexible, and the distribution of points can reflect the angle and size of the rotated object. The boundary discontinuity problem can thus be alleviated without angle regression.
Representative Points Prediction. Inspired by RepPoints [yang2019reppoints] and CFA [BeyondBBox], we propose PointRPN to predict the representative points in the RPN stage. The predicted points can effectively represent the rotating box and can be easily converted to rotated proposals in subsequent RCNN stages.
As shown in Fig. 3, PointRPN learns a set of representative points for each feature point. To make the features better adapt to representative points learning, we adopt a coarse-to-fine prediction manner, in which the features are refined with DCN [dai2017deformable] using the offsets predicted in the initial stage. For each feature point, the predicted representative points of the two stages are:

P_init = {(x_i + Δx_i, y_i + Δy_i)}_{i=1}^{n},  P_refine = {(x_i + Δx_i + Δx_i', y_i + Δy_i + Δy_i')}_{i=1}^{n},   (2)

where n denotes the number of predicted representative points (we set n = 9 by default), (x_i, y_i) denotes the initial location, (Δx_i, Δy_i) denote the learned offsets of the initial stage, and (Δx_i', Δy_i') denote the learned offsets of the refine stage.
Label Assignment. PointRPN predicts representative points for each feature point in the initial and refine stages. This section will describe how we determine the positive samples among all feature points for these two stages.
For the initial stage, we project each ground-truth box to the corresponding feature level according to its area, and then select the feature point closest to its center as the positive sample. The rule used for projecting a ground-truth box B to its feature level k is defined as:

k = ⌊log2(√(w_B · h_B) / s)⌋,   (3)

where s is a hyper-parameter set to 16 by default, and w_B and h_B are the width and height of B.
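The projection rule above can be sketched directly; the clamping of k to a valid range of pyramid levels (here indexed 0 to 4) is an illustrative assumption of standard FPN practice, not a detail stated above.

```python
import math

def assign_level(w, h, s=16, min_level=0, max_level=4):
    """Map a ground-truth box to an FPN level by its scale:
    k = floor(log2(sqrt(w * h) / s)), clamped to the valid range."""
    k = int(math.floor(math.log2(math.sqrt(w * h) / s)))
    return max(min_level, min(k, max_level))
```

For example, a 64x64 box lands on level 2, a 16x16 box on level 0, and very large boxes are clamped to the coarsest level.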
For the refine stage, we use the predicted representative points from the initial stage to help determine the positive samples. Specifically, for each feature point with its corresponding prediction P_refine, if the maximum convex-hull IoU (defined in Eq. 6) between P_refine and the ground-truth boxes exceeds the threshold τ, we select this feature point as a positive sample. We set τ = 0.1 in all our experiments.
Optimization. The optimization of the proposed PointRPN is driven by a classification loss and rotated object localization losses. The learning objective is formulated as:

L_PointRPN = λ1 · L_loc^init + λ2 · L_cls^refine + λ3 · L_loc^refine,   (4)

where λ1, λ2, and λ3 are trade-off parameters set to 0.5, 1.0, and 1.0 by default, respectively. L_loc^init denotes the localization loss of the initial stage; L_cls^refine and L_loc^refine denote the classification loss and localization loss of the refine stage. Note that the classification loss is only computed in the refine stage, and the two localization losses are only computed for positive samples.
In the initial stage, the localization loss is computed between the convex hulls converted from the learned points (see the initial stage in Fig. 3) and the ground-truth OBBs. We use the convex-hull GIoU loss [BeyondBBox] to calculate the localization loss:

L_loc^init = (1 / N_pos) Σ_{i=1}^{N_pos} (1 − GIoU(CH(P_i), CH(B_i))),   (5)

where N_pos indicates the number of positive samples of the initial stage, B_i is the matched ground-truth OBB, and GIoU(CH(P_i), CH(B_i)) represents the convex-hull GIoU between the two convex hulls CH(P_i) and CH(B_i), which is differentiable and can be calculated as:

GIoU(A, B) = IoU(A, B) − |C \ (A ∪ B)| / |C|,   (6)

where the first term denotes the convex-hull IoU, C denotes the smallest enclosing convex object of A = CH(P_i) and B = CH(B_i), and CH(·) denotes the Jarvis March algorithm [jarvis1973identification] used to calculate the convex hull from a point set.
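The Jarvis March (gift-wrapping) step used here can be sketched in a few lines; this is a generic textbook implementation, not the paper's code:

```python
def jarvis_march(points):
    """Convex hull of 2D points by gift wrapping (Jarvis march),
    returned in counter-clockwise order (standard axes), starting
    from the leftmost point."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return pts

    def cross(o, a, b):  # z-component of (a - o) x (b - o)
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    start = pts[0]  # leftmost (then lowest) point is always on the hull
    hull, p = [], start
    while True:
        hull.append(p)
        q = pts[0] if pts[0] != p else pts[1]
        for r in pts:
            if r == p:
                continue
            c = cross(p, q, r)
            # take r if it is clockwise of p->q, or collinear but farther
            if c < 0 or (c == 0 and dist2(p, r) > dist2(p, q)):
                q = r
        p = q
        if p == start:
            break
    return hull

# interior points (here (1, 1)) are discarded, hull corners are kept
hull = jarvis_march([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)])
```

In PointRPN this converts the n learned points into a convex hull so that the convex-hull GIoU against the ground-truth OBB can be computed.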
The learning of the refine stage, which is responsible for outputting more accurate rotated proposals, is driven by both classification and localization losses. L_cls^refine is a standard focal loss [lin2017focal], which can be calculated as:

L_cls^refine = −(1 / N_pos) Σ_i α (1 − p_{i, c_i})^γ log(p_{i, c_i}),   (7)

where N_pos denotes the number of positive samples in the refine stage, p_i and c_i are the classification output and the assigned ground-truth category of the i-th sample, respectively, and α and γ are hyper-parameters set to 0.25 and 2.0 by default. The localization loss L_loc^refine is similar to Eq. 5 and can be formulated as:

L_loc^refine = (1 / N_pos) Σ_{i=1}^{N_pos} (1 − GIoU(CH(P_i^refine), CH(B_i))).   (8)
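The focusing behaviour of the focal loss can be illustrated with a scalar sketch (binary case; the helper name and probe values are ours):

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for a single prediction: p is the predicted
    foreground probability and y the label (1 = object, 0 = background).
    The (1 - p_t)^gamma factor down-weights well-classified examples."""
    p_t = p if y == 1 else 1.0 - p
    a_t = alpha if y == 1 else 1.0 - alpha
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9, 1)  # confident, correct positive
hard = focal_loss(0.1, 1)  # badly mis-classified positive
```

A confident correct prediction contributes several orders of magnitude less loss than a hard mis-classified one, which keeps the dense negatives from dominating training.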
With the refined representative points, a pseudo OBB is obtained using the minAreaRect function of OpenCV [opencv_library], which is then used to generate the RRoI for PointReg.
As illustrated in Fig. 5, our PointRPN can automatically learn extreme points and semantic key points of rotated objects.
3.4 PointReg
Corner Points Refinement. The rotated proposals generated by PointRPN already provide a reasonable estimate of the target rotated objects. To avoid the problems caused by angle regression and further improve performance, we instead refine the four corners of the rotated proposals in the RCNN stage. As shown in Fig. 4, with the rotated proposals as input, we use an RRoI feature extractor [ding2019learning, han2021redet] to extract RRoI features. Then, two consecutive fully-connected and ReLU layers encode the RRoI features. Finally, two fully-connected layers predict the class probability and the refined corners of the corresponding rotated object. The refined corners are represented as:

(x_i', y_i') = (x_i + Δx_i, y_i + Δy_i),  i = 1, 2, 3, 4,   (9)

where (x_i, y_i) denotes the corner coordinates of the input rotated proposal and (Δx_i, Δy_i) denotes the predicted corner offsets.
Instead of directly performing angle prediction, we refine the four corners of the input rotated proposals. Adopting corner points refinement has three advantages: (1) it alleviates the boundary discontinuity problem caused by angle prediction; (2) the parameter units are consistent among the eight parameters; (3) localization accuracy can be improved in a coarse-to-fine manner.
We can easily extend PointReg to cascade structure for better performance. As shown in Fig. 1, in the cascade structure, the refined rotated proposals of the previous stage are used as the input of the current stage.
Optimization. The learning of PointReg is driven by a classification loss and a rotated object localization loss:

L_PointReg = μ1 · L_cls + μ2 · L_loc,   (10)

where μ1 and μ2 are trade-off coefficients, both set to 1.0 by default. L_cls indicates the classification loss, which is a cross-entropy loss:

L_cls = −(1 / M) Σ_{i=1}^{M} Σ_{c=0}^{C} y_{i,c} log(p_{i,c}),   (11)
where M denotes the number of training samples in PointReg, C is the number of categories excluding background, p_{i,c} is the predicted classification probability of the i-th RRoI for class c, and y_{i,c} = 1 if the ground-truth class of the i-th RRoI is c and 0 otherwise. L_loc represents the localization loss between the refined corners and the corners of the ground-truth OBB. We use the ℓ1 loss to optimize the corner refinement learning:

L_loc = (1 / M_pos) Σ_{i=1}^{M_pos} min_{σ ∈ Π} ‖C_i^pred − σ(C_i^gt)‖_1,   (12)

where C_i^pred denotes the refined corners of the i-th rotated proposal, C_i^gt denotes the corners of the matched ground-truth OBB, and σ(C_i^gt) denotes the permutation of the four corners of C_i^gt with the smallest ℓ1 loss. Note that L_loc is only calculated for the M_pos positive training samples.
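The permutation-minimal corner loss can be sketched as follows; we assume the permutations searched are the four cyclic shifts of the ground-truth corners, and the helper names are ours:

```python
def l1(a, b):
    """Summed L1 distance between two corner lists."""
    return sum(abs(ax - bx) + abs(ay - by)
               for (ax, ay), (bx, by) in zip(a, b))

def corner_loss(pred, gt):
    """L1 corner loss minimized over the cyclic orderings of the
    ground-truth corners, so the target does not depend on which
    corner happens to be labeled first."""
    cyclic = [gt[i:] + gt[:i] for i in range(4)]
    return min(l1(pred, p) for p in cyclic)

gt = [(0, 0), (4, 0), (4, 2), (0, 2)]
pred = [(4, 0), (4, 2), (0, 2), (0, 0)]  # same box, corners shifted by one
```

Here a naive per-index L1 loss would heavily penalize a geometrically perfect prediction, while the permutation-minimal loss correctly evaluates to zero.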
3.5 Balanced Dataset
The extremely nonuniform object densities of aerial images usually make the dataset long-tailed, which may cause unstable training and limit detection performance. For instance, DOTA-v1.0 contains 52,516 ship instances but only 678 ground track field instances [ding2021object]. To alleviate this issue, inspired by [lvis2019], we resample images of rare categories. More concretely, first, for each category c, we compute the fraction f_c of images that contain this category. Then, we compute the category-level repeat factor for each category:

r_c = max(1, √(t / f_c)),   (13)

where t is a threshold indicating that there is no oversampling for categories with f_c ≥ t. Finally, we compute the image-level repeat factor for each image I:

r_I = max_{c ∈ I} r_c,   (14)

where c ∈ I ranges over the categories contained in image I. In other words, images that contain long-tailed categories have a greater chance of being resampled during training.
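The two-step computation can be sketched directly in the style of [lvis2019]; the threshold value used here is for illustration only, and the function name is ours:

```python
import math

def repeat_factors(image_cats, t):
    """Repeat-factor sampling: categories whose image frequency f_c falls
    below the threshold t get a repeat factor above 1, and each image
    inherits the maximum repeat factor over its categories."""
    n = len(image_cats)
    freq = {}
    for cats in image_cats:
        for c in set(cats):
            freq[c] = freq.get(c, 0) + 1
    r_cat = {c: max(1.0, math.sqrt(t / (f / n))) for c, f in freq.items()}
    r_img = [max(r_cat[c] for c in set(cats)) for cats in image_cats]
    return r_cat, r_img

# 'ship' appears in every image, 'gtf' in only one of four
r_cat, r_img = repeat_factors(
    [['ship'], ['ship'], ['ship'], ['ship', 'gtf']], t=0.5)
```

The frequent category keeps a repeat factor of 1, while the image containing the rare category is oversampled by a factor of √2.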
4 Experiments
4.1 Datasets
To evaluate the effectiveness of our proposed Point RCNN framework, we perform experiments on two popular large-scale datasets: DOTA [xia2018dota] and HRSC2016 [liu2017high].
DOTA [xia2018dota] is the largest dataset for oriented object detection with three released versions: DOTA-v1.0, DOTA-v1.5, and DOTA-v2.0. To compare with the state-of-the-art methods, we perform experiments on DOTA-v1.0 and DOTA-v1.5. DOTA-v1.0 contains 2806 images ranging in size from 800×800 to 4000×4000 pixels, and contains 188,282 instances in 15 categories: Bridge (BR), Harbor (HA), Ship (SH), Plane (PL), Helicopter (HC), Small vehicle (SV), Large vehicle (LV), Baseball diamond (BD), Ground track field (GTF), Tennis court (TC), Basketball court (BC), Soccer-ball field (SBF), Roundabout (RA), Swimming pool (SP), and Storage tank (ST). DOTA-v1.5 has the same images as DOTA-v1.0 but contains 402,089 instances. It is a more challenging dataset that introduces a new category, Container Crane (CC), and many more small instances.
HRSC2016 [liu2017high] contains 1061 aerial images with sizes ranging from 300×300 to 1500×900. There are 436, 181, and 444 images in the training, validation, and test sets, respectively.
|Method||Backbone||PL||BD||BR||GTF||SV||LV||SH||TC||BC||ST||SBF||RA||HA||SP||HC||mAP|
|RoI Trans. [ding2019learning]||R101-FPN||88.64||78.52||43.44||75.92||68.81||73.68||83.59||90.74||77.27||81.46||58.39||53.54||62.83||58.93||47.67||69.56|
|Gliding Vertex [xu2020gliding]||R101-FPN||89.64||85.00||52.26||77.34||73.01||73.14||86.82||90.74||79.02||86.81||59.55||70.91||72.94||70.86||57.32||75.02|
|Oriented R-CNN [xie2021oriented]||R101-FPN||90.26||84.74||62.01||80.42||79.04||85.07||88.52||90.85||87.24||87.96||72.26||70.03||82.93||78.46||68.05||80.52|
|Point RCNN (Ours)||ReR50-ReFPN||82.99||85.73||61.16||79.98||77.82||85.90||88.94||90.89||88.89||88.16||71.84||68.21||79.03||80.32||75.71||80.37|
|Point RCNN (Ours)||ReR50-ReFPN||86.21||86.44||60.30||80.12||76.45||86.17||88.58||90.84||88.58||88.44||73.03||70.10||79.26||79.02||77.15||80.71|
|Point RCNN (Ours)||Swin-T-FPN||86.59||85.72||61.64||81.08||81.01||86.49||88.84||90.83||87.22||88.23||68.85||71.48||82.09||83.60||76.08||81.32|
|Method||Backbone||PL||BD||BR||GTF||SV||LV||SH||TC||BC||ST||SBF||RA||HA||SP||HC||CC||mAP|
|Mask R-CNN [he2017mask]||R50-FPN||76.84||73.51||49.90||57.80||51.31||71.34||79.75||90.46||74.21||66.07||46.21||70.61||63.07||64.46||57.81||9.42||62.67|
|Oriented R-CNN [xie2021oriented]||R101-FPN||87.20||84.67||60.13||80.79||67.51||81.63||89.74||90.88||82.21||78.51||70.98||78.63||79.46||75.40||75.71||39.69||76.45|
|Point RCNN (Ours)||ReR50-ReFPN||83.40||86.59||60.76||80.25||79.92||83.37||90.04||90.86||87.45||84.50||72.79||77.32||78.29||77.48||78.92||47.97||78.74|
|Point RCNN (Ours)||ReR50-ReFPN||83.12||86.55||60.84||82.43||80.60||83.39||90.01||90.88||87.25||84.60||73.49||78.51||78.75||78.41||76.12||54.12||79.31|
|Point RCNN (Ours)||Swin-T-FPN||86.93||85.79||59.52||80.42||81.91||81.92||89.95||90.35||85.72||85.84||68.57||76.35||78.79||81.24||78.64||69.23||80.14|
4.2 Implementation Details
We implement Point RCNN using the MMDetection tool-box [mmdetection]. We follow ReDet [han2021redet] to use ReResNet with ReFPN as our backbone (ReR50-ReFPN), which has shown the ability to extract rotation-equivariant features. We also verify with the more generalized transformer backbone (Swin-Tiny) to show the generalization and scalability of our Point RCNN.
On the DOTA dataset, following previous methods [ding2019learning, han2021align, han2021redet], we crop the images into 1024×1024 patches with a stride of 824 pixels, and we also resize the images to three scales for multi-scale data. Random horizontal flipping and random rotation are adopted for multi-scale training. On the HRSC2016 dataset, like previous methods [han2021redet], we resize all images to (800, 512), and random horizontal flipping is applied during training. Unless otherwise specified, we train all models for 19 epochs on DOTA and 36 epochs on HRSC2016. Specifically, we train the models using AdamW [adam] on 8 Tesla-V100 GPUs with β1 = 0.9 and β2 = 0.999, an initial learning rate of 0.0002, a weight decay of 0.05, and a mini-batch size of 16 (2 images per GPU). The learning rate decays by a factor of 10 at each decay step.
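The 1024-pixel crop with an 824-pixel stride corresponds to a simple sliding-window schedule along each image axis. In the sketch below, clamping the final window to the image border is our assumption of common practice, not a detail stated above:

```python
def crop_starts(size, patch=1024, stride=824):
    """Top-left coordinates (along one axis) of sliding-window crops."""
    if size <= patch:
        return [0]
    starts = list(range(0, size - patch + 1, stride))
    if starts[-1] + patch < size:  # clamp a last window to the border
        starts.append(size - patch)
    return starts

# A 4000-pixel side of a DOTA image yields five overlapping windows:
starts = crop_starts(4000)
```

The 200-pixel overlap between adjacent windows (1024 − 824) ensures objects near patch borders appear fully inside at least one crop.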
|Method||Backbone||mAP (VOC2007, %)||mAP (VOC2012, %)|
|Gliding Vertex [xu2020gliding]||R101-FPN||88.20||-|
|Oriented R-CNN [xie2021oriented]||R101-FPN||90.50||97.60|
|Point RCNN (Ours)||ReR50-ReFPN||90.53||98.53|
4.3 Main Results
We compare our Point RCNN framework with other state-of-the-art methods on three datasets: DOTA-v1.0, DOTA-v1.5, and HRSC2016. As shown in Tab. 1, Tab. 2, and Tab. 3, without bells and whistles, our Point RCNN demonstrates superior performance against state-of-the-art methods.
On DOTA-v1.0, as reported in Tab. 1, Point RCNN achieves state-of-the-art 80.71 mAP. With the more generalized transformer backbone Swin-Tiny [liu2021swin] (Swin-T), Point RCNN can further improve the performance by 0.61% (from 80.71 to 81.32).
On DOTA-v1.5, which is more challenging than DOTA-v1.0, Point RCNN achieves 79.31 mAP, significantly improving the previous best performance by 2.51%. With the more generalized transformer backbone Swin-T, Point RCNN further improves the performance by 0.83% (from 79.31 to 80.14). The results are reported in Tab. 2.
On HRSC2016, as reported in Tab. 3, Point RCNN attains new state-of-the-art performance under both the VOC2007 and VOC2012 metrics.
4.4 Ablation Study
In this section, if not specified, all the models are trained only on the training and validation set with scale 1.0 for simplicity, and are tested using multi-scale testing. The metric mAP is evaluated on the DOTA-v1.5 test set and obtained by submitting prediction results to DOTA’s evaluation server.
4.4.1 Effect of PointRPN
To analyze the effectiveness of PointRPN, we evaluate its detection recall on the validation set of DOTA-v1.5. For simplicity, we train the models on the training set with scale 1.0 and evaluate recall on the validation set with scale 1.0 as well. The positive IoU threshold is set to 0.5. We select the top-300, top-1000, and top-2000 proposals to calculate their recall values and report the results in Tab. 4. When the number of proposals decreases from top-2000 to top-1000, the recall drops by only 0.17%. Even with only the top-300 proposals, the recall still reaches 85.93%.
|Method||Recall@top-300 (%)||Recall@top-1000 (%)||Recall@top-2000 (%)|
4.4.2 Effect of Regression Type of PointReg
In this section, we analyze the effect of the OBB regression type of PointReg. As shown in Tab. 5, the eight-parameter regression achieves higher performance than the five-parameter representation.
|Regression type||mAP (%)|
4.4.3 Effect of Balanced Dataset
|Oversampling threshold t||mAP (%)|
In this section, we analyze the impact of the oversampling threshold t of the balanced dataset strategy. As shown in Tab. 6, the best detection accuracy of 77.60% is obtained with the best-performing threshold, which we adopt in all other experiments on DOTA.
4.4.4 Factor-by-factor Experiment
|Method||PointRPN||Balanced Dataset||PointReg||mAP (%)|
To explore the effectiveness of each module of the proposed Point RCNN framework, we conduct a factor-by-factor experiment on PointRPN, PointReg, and the balanced dataset strategy. As depicted in Tab. 7, each component has a positive effect, and combining all components obtains the best performance.
4.4.5 Visualization Analysis
We visualize some detection results on the DOTA-v1.0 test set. Fig. 5 shows examples of the learned points of PointRPN, indicating that PointRPN is capable of learning representative points of rotated objects. Fig. 6 shows final detection results of Point RCNN, where the red points denote the corner points learned by PointReg and the colored OBBs, converted via the minAreaRect function of OpenCV, are the final results.
Although experiments substantiate the superiority of Point RCNN over state-of-the-art methods, our method does not perform well enough on some categories, e.g., PL (Plane), which needs further exploration. Point RCNN also relies on rotated NMS to remove duplicate results, which may mistakenly delete true positives. Transformer-based methods [DETR] may be a potential solution, which we leave as future work.
5 Conclusion
In this work, we revisit rotated object detection and propose a purely angle-free framework named Point RCNN, which mainly consists of PointRPN for generating accurate RRoIs and PointReg for refining corner points based on the generated RRoIs. In addition, we propose a balanced dataset strategy to overcome the long-tailed distribution of object classes in aerial images. Extensive experiments on several large-scale benchmarks demonstrate the significant superiority of the proposed framework over state-of-the-art methods.