With the increasing availability of remote sensing images, their efficient interpretation has become more and more important. Object detection is an efficient way to interpret massive numbers of remote sensing images. In recent years, with the rapid development of deep learning, object detection technology has advanced by leaps and bounds [10, 4, 9, 2]. Object detection in remote sensing images has also made great progress, and a series of efficient detectors have been proposed to achieve high-precision object detection in remote sensing images [3, 5, 6, 7, 14, 1, 8, 15].
Unlike general object detection, which uses horizontal bounding boxes (HBB) to annotate objects, object detection in remote sensing images often uses oriented bounding boxes (OBB) to characterize objects. Compared with an HBB, an OBB contains less background, so it describes the contour of a target more effectively, which helps convolutional neural networks (CNNs) distinguish boundary features. Current mainstream methods all preset rotated prior bounding boxes or generate rotated proposals, such as RoI-Transformer and CFC-Net.
However, most current detectors inherit from generic detection methods and do not address a problem unique to remote sensing images: similar targets of different categories are difficult to distinguish when viewed from high altitude. For example, DOTA is currently the largest remote sensing dataset with OBB annotations and contains 15 categories whose appearances differ considerably; even in remote sensing images, it is not hard to distinguish such categories (e.g., bridges and airplanes). Methods that perform well on the DOTA dataset may therefore perform poorly when distinguishing fine-grained targets. As shown in Figure 1, it is difficult to tell different types of aircraft apart unless you are an expert in the field.
Fine-grained object recognition in remote sensing images is a challenging task. Objects have few texture features at high viewing distances, which makes them hard to identify. Besides, the long-tail distribution of instances further degrades recognition accuracy. In this technical report, we use an oriented feature alignment network (OFA-Net) to achieve high-precision fine-grained object recognition in remote sensing images. OFA-Net lays dense anchor boxes on the images; these anchors generate candidate proposals that capture the possible positions of objects and improve the recall rate. Specifically, we use an additional oriented box refinement module to adjust the positions of the obtained proposals. For the classification task, we designed an oriented feature alignment branch that extracts the features inside each rotated proposal for fine-grained classification.
Category imbalance is another problem that affects detection accuracy. As shown in Figure 2, there is a huge gap between the numbers of instances of different categories. To achieve a balanced training process, we performed class-balanced sampling to expand the dataset. At the same time, data augmentations (implementation: https://github.com/ming71/toolbox) such as random cropping, flipping, and optical distortion are used to expand the dataset. These methods effectively improve recognition accuracy.
The main contributions of this technical report are as follows:
We analyzed the technical difficulties in fine-grained object recognition and developed solutions for them.
OFA-Net is adopted to decouple the fine-grained object detection task into a localization subtask and a classification subtask. A rotated anchor refinement module (RARM) and an accurate detection module (ADM) are then applied to achieve high-precision localization and classification.
We designed an effective data augmentation strategy that greatly improves recognition accuracy. Besides, we tried a series of tricks and report their effects in detail.
We suggest that the fine-grained object recognition task can be decoupled into a localization subtask and a classification subtask. We then design effective structures for the two subtasks to achieve high performance.
The localization subtask aims to achieve both high recall and high precision for bounding box regression. To this end, we adopt a cascaded oriented refinement module in the detection branch. As shown in Figure 3, the localization branch consists of two parts: a rotated anchor refinement module (RARM) and an accurate detection module (ADM). The RARM uses a low IoU threshold to select positive samples (set to 0.4 in our experiments), so as to improve the recall rate as much as possible. The high-quality proposals obtained after oriented anchor refinement are sent to the ADM for the final localization output, thereby achieving high localization accuracy. The superiority of the cascaded refinement module has been confirmed in previous work [5, 3, 13].
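The cascade idea above can be sketched as follows. This is an illustrative toy, not the authors' implementation: it uses axis-aligned IoU for brevity (the actual method matches rotated boxes), and the 0.6 second-stage threshold is an assumed value for demonstration.

```python
import numpy as np

def iou_axis_aligned(box, boxes):
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def assign_positives(anchors, gt_boxes, pos_thresh):
    """Mark anchors whose best IoU with any ground-truth box exceeds pos_thresh."""
    ious = np.stack([iou_axis_aligned(g, anchors) for g in gt_boxes])  # (G, A)
    return ious.max(axis=0) >= pos_thresh

anchors = np.array([[0, 0, 10, 10], [3, 3, 13, 13], [30, 30, 40, 40]], dtype=float)
gts = np.array([[1, 1, 11, 11]], dtype=float)
# Stage 1 (RARM-like): permissive threshold keeps more candidates for recall.
stage1 = assign_positives(anchors, gts, pos_thresh=0.4)
# Stage 2 (ADM-like): stricter threshold keeps only well-aligned boxes.
stage2 = assign_positives(anchors, gts, pos_thresh=0.6)
```

Here the second anchor (IoU ≈ 0.47) survives the first, permissive stage but is rejected by the stricter second stage, mirroring the recall-then-precision design of the cascade.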
The classification task is one of the key points of fine-grained object recognition: fine-grained features must be captured to discriminate accurately between similar classes. We therefore suggest extracting as much effective texture information as possible while ignoring the background. To this end, we use the AlignConv from S2A-Net to align and extract local features. AlignConv extracts aligned convolutional features from the proposal area that may contain objects, capturing texture information efficiently to support high-performance classification. The details of AlignConv are shown in Figure 4.
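The core of such alignment is placing the convolution's sampling points along the rotated box rather than on a fixed axis-aligned grid. The sketch below computes a 3×3 sampling grid aligned with a rotated box; the angle convention and the division of the box into thirds are illustrative assumptions, and real AlignConv derives per-location offsets at feature-map stride rather than in image coordinates.

```python
import numpy as np

def aligned_sampling_points(cx, cy, w, h, theta):
    """3x3 sampling grid aligned with a rotated box (cx, cy, w, h, angle in rad),
    analogous to how an aligned convolution places its kernel sampling points."""
    # Unit offsets of a standard 3x3 kernel grid (rows vary in y, columns in x).
    ys, xs = np.meshgrid([-1, 0, 1], [-1, 0, 1], indexing="ij")
    # Scale the grid to the box extent (one cell per third of the box)...
    pts = np.stack([xs * w / 3.0, ys * h / 3.0], axis=-1).reshape(-1, 2)
    # ...then rotate by the box angle and translate to the box center.
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    return pts @ rot.T + np.array([cx, cy])

# A box centered at (10, 20), 9 wide and 6 tall, rotated 90 degrees.
pts = aligned_sampling_points(10.0, 20.0, 9.0, 6.0, np.pi / 2)
```

The center sampling point stays at the box center, while the corner points rotate with the box, so the features are gathered from inside the oriented proposal instead of from surrounding background.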
Moreover, to further improve fine-grained recognition performance, we tried many tricks. To address the long-tail distribution in fine-grained object recognition, we adopted class-balanced sampling on the original dataset: categories such as ARJ21 have relatively few instances, so we resample them to expand the dataset. Secondly, the scale of some objects (such as ships) varies greatly across scenarios, so we used multi-scale training and testing to make the network adapt to scale variations. Finally, to let the network learn a degree of translation and rotation invariance, we adopted affine-transformation data augmentations.
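One concrete way to realize class-balanced sampling is repeat-factor sampling, where images containing rare categories are repeated in proportion to how far the category falls below a frequency threshold. The report does not specify its exact scheme, so the following is a hedged sketch of this style of resampling; the threshold 0.2 and the toy frequencies are assumptions.

```python
import math
from collections import Counter

def repeat_factors(image_labels, thresh=0.2):
    """Per-image repeat factors for class-balanced resampling.
    image_labels: one set of category labels per image.
    A category seen in fraction f of images gets factor max(1, sqrt(thresh/f));
    each image inherits the largest factor among its categories."""
    counts = Counter(lbl for labels in image_labels for lbl in labels)
    total = len(image_labels)
    cat_rep = {c: max(1.0, math.sqrt(thresh * total / n)) for c, n in counts.items()}
    return [max(cat_rep[c] for c in labels) for labels in image_labels]

# Toy example: "ARJ21" appears in 1 of 10 images, "Boeing737" in 9 of 10.
imgs = [{"Boeing737"}] * 9 + [{"ARJ21"}]
factors = repeat_factors(imgs, thresh=0.2)
```

Common categories keep a factor of 1.0, while the rare ARJ21 image is oversampled (factor √2 here), flattening the long tail during training.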
There are further data analysis and processing techniques, such as pre-training weights, the training schedule, and learning-rate adjustment; we describe them in detail in the experimental section.
3 Experimental Results
3.1 Dataset and Implementation Details
The datasets we used include the fine-grained object recognition track of the GaoFen Challenge and the FAIR1M dataset. FAIR1M is the ISPRS benchmark for object detection in high-resolution satellite images and contains 37 classes. The object categories in the FAIR1M dataset are Boeing 737, Boeing 777, Boeing 747, Boeing 787, Airbus A320, Airbus A220, Airbus A330, Airbus A350, COMAC C919, COMAC ARJ21, other-airplane, passenger ship, motorboat, fishing boat, tugboat, engineering ship, liquid cargo ship, dry cargo ship, warship, other-ship, small car, bus, cargo truck, dump truck, van, trailer, tractor, truck tractor, excavator, other-vehicle, baseball field, basketball court, football field, tennis court, roundabout, intersection, and bridge. In the dataset, each object is annotated with an oriented bounding box (OBB). We evaluated on the test set of the FAIR1M dataset and submitted the final results to the servers of the GaoFen Challenge.
The input images are cropped into 800×800 patches with a gap of 150. We use the SGD optimizer to train the network with a learning rate of 0.05, training the models for 12 epochs on 2 RTX 2080 Ti GPUs. Random cropping, flipping, affine transformation, and optical distortion are used for data augmentation. We train the model on the training set of FAIR1M and evaluate on the test set of FAIR1M. Finally, we test the model in the GaoFen Challenge. The overall results on FAIR1M are shown in Table 1.
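The patch cropping can be sketched as follows, assuming (as in common DOTA-style splitting tools) that the "gap" is the overlap between adjacent patches, so the stride is 800 − 150 = 650; a final window flush with the border covers the remainder.

```python
def patch_origins(length, patch, overlap):
    """Top-left offsets along one axis when tiling `length` pixels into
    `patch`-sized windows that overlap by `overlap` pixels."""
    if length <= patch:
        return [0]
    stride = patch - overlap
    origins = list(range(0, length - patch, stride))
    origins.append(length - patch)  # last window flush with the image border
    return origins

def patch_boxes(width, height, patch=800, overlap=150):
    """All (x1, y1, x2, y2) crop windows for an image of the given size."""
    return [(x, y, x + patch, y + patch)
            for y in patch_origins(height, patch, overlap)
            for x in patch_origins(width, patch, overlap)]

# A 2000x900 image splits into a 3x2 grid of overlapping 800x800 patches.
boxes = patch_boxes(2000, 900)
```

Detections on each patch are later shifted back by the patch origin and merged, which is why the overlap matters: objects cut at a patch border are seen whole in the neighboring patch.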
3.2 Evaluation of pre-trained models
We tried using additional data for pre-training, including the DOTA dataset and 300 images of fine-grained aircraft recognition data from the GaoFen 2020 competition, and then fine-tuned the model on FAIR1M. Experiments ID1 and ID6 in Table 1 (40.7887% vs. 40.7576%) show that the DOTA pre-training weights provide good prior knowledge for categories with higher scene similarity (such as baseball field, basketball court, football field, and tennis court), improving the performance of the related classes. On the other hand, fine-grained aircraft images are very similar to FAIR1M data; experiments ID5 and ID6 in Table 1 show that using a pre-trained model effectively improves aircraft recognition performance. Moreover, a pre-trained model speeds up convergence: with the DOTA pre-training weights, a model fine-tuned for 1 epoch on FAIR1M achieves accuracy comparable to 6 epochs of training from scratch.
3.3 Evaluation of IoU threshold for training sample selection
Different IoU thresholds lead to different training sample distributions and affect detection performance. A low IoU threshold for positive samples helps improve the recall rate but reduces detection accuracy; a high IoU threshold achieves higher accuracy, but the recall rate may suffer. We tried different threshold settings. As shown by ID5 and ID9 in Table 1 (40.6944% vs. 41.3631%), using a low IoU threshold in the first stage to ensure recall and a high IoU threshold in the second stage to improve accuracy achieves higher detection performance.
3.4 Evaluation of detection confidence
A model using a high detection confidence threshold outputs more credible detections, while one using a low threshold achieves a higher recall rate, so a trade-off is needed. The comparisons of ID1 (40.7887%) with ID4 (40.9713%) and of ID2 (39.4929%) with ID3 (39.0654%) show that a lower confidence threshold yields higher mAP.
3.5 Evaluation of data augmentation
We use a variety of data augmentation methods, including random flips, affine transformations, and optical distortions. Methods such as cutout, random noise, and random pixel dropping were also tried, but they did not work well. In addition, we resample categories with few instances to expand the training set; with these augmentations, we doubled the dataset. Multi-scale training and testing were then performed, and a substantial performance improvement was achieved, as shown by ID10 and ID11 (40.8882% vs. 43.7304%). This single model achieved an mAP of 46.1747% in the GaoFen fine-grained aircraft recognition track (ranking 10/213), an improvement of more than 6 points over the baseline.
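Multi-scale testing requires merging per-scale predictions back in the original image's coordinate frame. The sketch below is a minimal version of that merge, using axis-aligned greedy NMS for brevity; the actual pipeline operates on rotated boxes, and the 0.5 IoU threshold is an assumed default.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS on axis-aligned (x1, y1, x2, y2) boxes; returns kept indices."""
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]  # drop boxes overlapping the kept one
    return keep

def merge_multiscale(preds):
    """preds: list of (boxes, scores, scale) triples. Boxes were predicted on a
    rescaled image, so divide by the scale before merging with NMS."""
    boxes = np.concatenate([b / s for b, _, s in preds])
    scores = np.concatenate([sc for _, sc, _ in preds])
    keep = nms(boxes, scores)
    return boxes[keep], scores[keep]

# The same object detected at scales 1.0 and 2.0 collapses to one box.
preds = [
    (np.array([[0.0, 0.0, 10.0, 10.0]]), np.array([0.9]), 1.0),
    (np.array([[0.0, 0.0, 20.0, 20.0]]), np.array([0.8]), 2.0),
]
merged_boxes, merged_scores = merge_multiscale(preds)
```

After rescaling, the two detections coincide and NMS keeps only the higher-scoring one, so multi-scale testing adds recall without duplicating outputs.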
3.6 Outlook and other analysis
The above experimental results come from single-model evaluation. Due to constraints on time and available GPU resources, we were not able to further apply complex data augmentation and multi-model ensembles for higher mAP, and we believe our methods and training strategies could achieve even better performance.
We also tried other strategies. For example, a single-model ensemble helps to slightly improve performance (about 0.5 points in the GaoFen Challenge). Interestingly, an expert model for certain categories cannot significantly improve detection accuracy even when complex data augmentations are applied to the small amount of data; this may be caused by overfitting or by confusion between similar classes, but we are not sure. The first step is to determine whether classification or localization is the real problem, or both, which we will clarify in future work. In addition, class-agnostic NMS never worked for us. Evidently, recall matters more than precision for mAP in most cases, which is consistent with the conclusion in Section 3.4.
In this technical report, we analyzed the difficulties of fine-grained object recognition in optical remote sensing images and designed effective strategies to achieve high-precision object detection. Specifically, we decompose the fine-grained object recognition task into a detection subtask and a classification subtask. The rotated anchor refinement module is designed to obtain accurate object localization, while the oriented feature alignment module effectively extracts the texture features of objects. Besides, we designed customized data augmentation and resampling strategies to alleviate category imbalance. Even a single model of our method achieved impressive results. There is still much room for improvement, and we will explore further solutions in the future.
-  (2019) Learning roi transformer for oriented object detection in aerial images. In , pp. 2849–2858. Cited by: §1, §1.
-  (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §1.
-  (2021) Align deep features for oriented object detection. IEEE Transactions on Geoscience and Remote Sensing, pp. 1–11. Cited by: §1, §2, §2.
-  (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §1.
-  (2021) CFC-net: a critical feature capturing network for arbitrary-oriented object detection in remote-sensing images. IEEE Transactions on Geoscience and Remote Sensing, pp. 1–14. Cited by: §1, §1, §2.
-  (2021) Sparse label assignment for oriented object detection in aerial images. Remote Sensing 13 (14), pp. 2664. Cited by: §1.
-  (2021) Optimization for arbitrary-oriented object detection via representation invariance loss. IEEE Geoscience and Remote Sensing Letters, pp. 1–5. Cited by: §1.
-  (2021) Dynamic anchor learning for arbitrary-oriented object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35, pp. 2355–2363. Cited by: §1.
-  (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §1.
-  (2016) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE transactions on pattern analysis and machine intelligence 39 (6), pp. 1137–1149. Cited by: §1.
-  (2021) FAIR1M: a benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery. arXiv preprint arXiv:2103.05569. Cited by: §3.1.
-  (2018) DOTA: a large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3974–3983. Cited by: §1.
-  (2019) R3det: refined single-stage detector with feature refinement for rotating object. arXiv preprint arXiv:1908.05612. Cited by: §2.
-  (2021) Rethinking rotated object detection with gaussian wasserstein distance loss. arXiv preprint arXiv:2101.11952. Cited by: §1.
-  (2021) Learning high-precision bounding box for rotated object detection via kullback-leibler divergence. arXiv preprint arXiv:2106.01883. Cited by: §1.