Aerial Object Detection on DOTA-v2.0
In the past decade, object detection has achieved significant progress in natural images but not in aerial images, due to the massive variations in the scale and orientation of objects caused by the bird's-eye view of aerial images. More importantly, the lack of large-scale benchmarks has become a major obstacle to the development of object detection in aerial images (ODAI). In this paper, we present a large-scale Dataset of Object deTection in Aerial images (DOTA) and comprehensive baselines for ODAI. The proposed DOTA dataset contains 1,793,658 object instances of 18 categories with oriented-bounding-box annotations, collected from 11,268 aerial images. Based on this large-scale and well-annotated dataset, we build baselines covering 10 state-of-the-art algorithms with over 70 configurations, where the speed and accuracy of each model have been evaluated. Furthermore, we provide a uniform code library for ODAI and build a website for testing and evaluating different algorithms. Previous challenges run on DOTA have attracted more than 1,300 teams worldwide. We believe that the expanded large-scale DOTA dataset, the extensive baselines, the code library and the challenges can facilitate the design of robust algorithms and reproducible research on the problem of object detection in aerial images.
Currently, Earth vision (also known as Earth observation and remote sensing) technologies enable us to observe the earth's surface with aerial images at resolutions of up to half a meter. Interpreting these huge volumes of images requires mathematical tools and numerical algorithms; among these tasks, object detection refers to localizing objects of interest (e.g., vehicles and ships) on the earth's surface and predicting their categories. Object detection in aerial images (ODAI) has been an essential step in many real-world applications such as urban management, precision agriculture, emergency rescue and disaster relief [1, 2]. Although extensive studies have been devoted to object detection in aerial images and appreciable breakthroughs have been made [3, 4, 5, 6, 7, 8], the task still presents numerous difficulties such as arbitrary orientations, scale variations, extremely nonuniform object densities and large aspect ratios (ARs), as shown in Fig. 1.
Among these difficulties, the arbitrary orientation of objects caused by the overhead view is the main difference between natural images and aerial images, and it complicates the object detection task in two ways. First, rotation-invariant feature representations are preferred in the detection of arbitrarily oriented objects, but they are often beyond the capability of most current deep neural network models. Although methods such as those designed in [9, 6, 10]
use rotation-invariant convolutional neural networks (CNNs), the problem is far from solved. Second, the horizontal bounding box (HBB) object representation used in conventional object detection [11, 12, 13] cannot localize oriented objects precisely, such as ships and large vehicles, as shown in Fig. 1. The oriented bounding box (OBB) object representation is more appropriate for aerial images [14, 15, 16, 4, 17]. It allows us to distinguish densely packed instances (as shown in Fig. 3) and extract rotation-invariant features [4, 18, 19]. The OBB object representation actually introduces a new object detection task, called oriented object detection. In contrast with horizontal object detection [20, 21, 22, 8], oriented object detection is a recently emerging research direction, and most of the methods for this new task attempt to transfer successful deep-learning-based object detectors pre-trained on large-scale natural image datasets (e.g., ImageNet
and Microsoft Common Objects in Context (MS COCO)) to aerial scenes [19, 18, 23, 24, 25] due to the lack of large-scale annotated aerial image datasets.
To mitigate the dataset problem, some public datasets of aerial images have been created, see [26, 7, 15, 16, 17], but they contain a limited number of instances and tend to use images taken under ideal conditions (e.g., clear backgrounds and centered objects), which cannot reflect the real-world difficulties of the problem. The recently released xView dataset provides a wide range of categories and contains large quantities of instances in complicated scenes. However, it annotates the instances with HBBs instead of the more precise OBBs. Thus, a large-scale dataset that has OBB annotations and reflects the difficulties in real-world applications of aerial images is in high demand.
Another issue with ODAI is that the module designs and hyperparameter settings of conventional object detectors learned from natural images are not appropriate for aerial images due to domain differences. Thus, in the sense of algorithm development, comprehensive baselines and sufficient ablative analyses of models on aerial images are required. However, comparing different algorithms is difficult due to the diversity in hardware, software platforms, detailed settings and so on. These factors influence both speed and accuracy. Therefore, when building the baselines, implementing the algorithms with a unified code library and keeping the hardware and software platform the same is highly desirable. Nevertheless, current object detection libraries, e.g., MMDetection and Detectron, do not support oriented object detection.
To address the above-mentioned problems, in this paper we first extend the preliminary version of DOTA, i.e., DOTA-v1.0, to DOTA-v2.0. Specifically, DOTA-v2.0 collects aerial images from various sensors and platforms and contains approximately 1.8 million object instances annotated with OBBs in 18 common categories, which, to our knowledge, makes it the largest public Earth vision object detection dataset. Then, to facilitate algorithm development and comparisons with DOTA, we provide a well-designed code library that supports oriented object detection in aerial images. Based on the code library, we also build more comprehensive baselines than the preliminary version, keeping the hardware, software platform, and settings the same. In total, we evaluate 10 algorithms and over 70 models with different configurations. We then provide detailed speed and accuracy analyses to explore the module designs and parameter settings suited to aerial images and to guide future research. These experiments verify the large differences in object detector design between natural and aerial images and provide materials for universal object detection algorithms.
The main contributions of this paper are three-fold:
To the best of our knowledge, the expanded DOTA is the largest dataset for object detection in Earth vision. The OBB annotations of DOTA not only provide a large-scale benchmark for object detection in Earth vision but also pose interesting algorithmic questions and challenges to generalized object detection in computer vision.
We build a code library for object detection in aerial images. This is expected to facilitate the development and benchmarking of object detection algorithms in aerial images with both HBB and OBB representations.
With the expanded DOTA, we evaluate 10 representative algorithms with over 70 model configurations, providing a comprehensive analysis that can guide the design of object detection algorithms in aerial images.
The dataset, code library, and regular evaluation server are available and maintained on the DOTA website (https://captain-whu.github.io/DOTA/). It is worth noting that the creation and use of DOTA have already advanced object detection in aerial images. For instance, the regular DOTA evaluation server and two object detection contests organized at the 2018 International Conference on Pattern Recognition (ICPR'2018 with DOTA-v1.0, https://captain-whu.github.io/ODAI/results.html) and the 2019 Conference on Computer Vision and Pattern Recognition (CVPR'2019 with DOTA-v1.5, https://captain-whu.github.io/DOAI2019/challenge.html) have attracted more than 1,300 registrations. We believe that our new DOTA dataset, with a comprehensive code library and an online evaluation platform, will further promote reproducible research in Earth vision.
Well-annotated datasets have played an important role in data-driven computer vision research [31, 12, 13, 32, 33, 34, 35] and have promoted cutting-edge research in a number of tasks such as object detection and classification. In this section, we briefly review object detection datasets of natural images and aerial images.
As a pioneer, PASCAL Visual Object Classes (VOC) 
has held challenges on object detection from 2005 to 2012. The computer vision community widely adopts the PASCAL VOC datasets and their evaluation metrics. Specifically, the PASCAL VOC Challenge 2012 dataset contains 11,530 images, 20 classes, and 27,450 annotated bounding boxes. Later, the ImageNet dataset was developed; it is an order of magnitude larger than PASCAL VOC, with hundreds of object categories and far more annotated bounding boxes. However, non-iconic views are not addressed. Then MS COCO was released, containing a total of 328K images, 91 categories, and 2.5 million labeled, segmented objects. MS COCO has on average more instances and categories per image and contains more contextual information than PASCAL VOC and ImageNet. It is worth noticing that, in Earth vision, the image size can be extremely large, so the number of images alone cannot reflect the scale of a dataset. In this case, the pixel area is more reasonable when comparing the scales of natural and aerial image datasets. Moreover, the large images include more instances per image and more contextual information. Tab. I provides the detailed comparisons.
In aerial object detection, a dataset resembling MS COCO and ImageNet both in terms of the number of images and detailed annotations has been missing, which has become one of the main obstacles to research in Earth vision, especially for developing deep-learning-based algorithms. In Earth vision, many aerial image datasets are prepared for actual demands in a specific category, such as building datasets [7, 36], vehicle datasets [26, 15, 37, 16, 8, 38], ship datasets [4, 39], and plane datasets [40, 17]. Although some public datasets [17, 41, 42, 43, 44] have multiple categories, they contain only a limited number of samples, which are hardly sufficient for training robust deep models. For example, NWPU VHR-10 only contains 800 images, 10 classes and 3,651 instances.
To alleviate this problem, our preliminary work DOTA-v1.0 presented a dataset with 15 categories and 188,282 instances, which for the first time enabled us to efficiently train robust deep models for ODAI without the help of large-scale datasets of natural images, such as MS COCO and ImageNet. Later, iSAID provided an instance segmentation extension of DOTA-v1.0. A notable dataset is xView, which contains 60 fine-grained categories grouped into several main categories and approximately 1 million instances. Another dataset, DIOR, provides a number of instances comparable to DOTA-v1.0. However, the instances in xView and DIOR are both annotated by HBBs, which are not suitable for precisely detecting objects that are arbitrarily oriented in aerial images. In addition, VisDrone is also a large-scale dataset for drone images but focuses more on video object detection and tracking; its image subset for object detection is not very large. Furthermore, most of the previous datasets are heavily biased toward positive samples, and their negative samples are not sufficient to represent the real-world distribution.
| Dataset | Image source | Annotation | # of main categories | # of categories | # of instances | # of images | Image width | Year |
|---|---|---|---|---|---|---|---|---|
| SZTAKI-INRIA | multi-source | OBB | 1 | 1 | 665 | 9 | 800 | 2012 |
| NWPU VHR-10 | multi-source | HBB | 10 | 10 | 3,651 | 800 | 1,000 | 2014 |
| VEDAI | satellite | OBB | 3 | 9 | 2,950 | 1,268 | 512, 1,024 | 2015 |
| DLR 3K | aerial | OBB | 2 | 8 | 14,235 | 20 | 5,616 | 2015 |
| UCAS-AOD | Google Earth | OBB | 2 | 2 | 14,596 | 1,510 | 1,000 | 2015 |
| HRSC2016 | Google Earth | OBB | 1 | 26 | 2,976 | 1,061 | 1,100 | 2016 |
| RSOD | Google Earth | HBB | 4 | 4 | 6,950 | 976 | 1,000 | 2017 |
| LEVIR | Google Earth | HBB | 3 | 3 | 11,000 | 22,000 | 800×600 | 2018 |
| SpaceNet MVOI | satellite | polygon | 1 | 1 | 126,747 | 60,000 | 900 | 2019 |
| HRRSD | multi-source | HBB | 13 | 13 | 55,740 | 21,761 | 152–10,569 | 2019 |
| DIOR | Google Earth | HBB | 20 | 20 | 190,288 | 23,463 | 800 | 2019 |
| iSAID | multi-source | polygon | 14 | 15 | 655,451 | 2,806 | 800–13,000 | 2019 |
| FGSD | Google Earth | OBB | 1 | 43 | 5,634 | 2,612 | 930 | 2020 |
| DOTA-v1.0 | multi-source | OBB | 14 | 15 | 188,282 | 2,806 | 800–13,000 | 2018 |
As we stated previously, a good dataset for aerial image object detection should have the following properties: 1) substantial annotated data to facilitate data-driven, especially deep-learning-based, methods; 2) large images to contain more contextual information; 3) OBB annotations to describe the precise locations of objects; and 4) balance in image sources, as pointed out in prior work. DOTA is built with these principles in mind (unless otherwise specified, DOTA refers to DOTA-v2.0). Detailed comparisons of the existing datasets and DOTA are shown in Tab. II. Compared to other aerial datasets, as we shall see in Sec. 4, DOTA is challenging due to its tremendous number of object instances, arbitrary orientations, various categories, nonuniform density distribution, and diverse aerial scenes from various image sources. These properties make DOTA helpful for real-world applications.
Object detection in aerial images is a longstanding problem. Recently, with the development of deep learning, many researchers in Earth vision have adapted deep object detectors [49, 50, 51, 52, 53] developed for natural images to aerial images. However, the challenges caused by the domain shift need to be addressed. Here, we highlight some notable works.
Objects in aerial images are often arbitrarily oriented due to the bird's-eye view, and the scale variations are larger than those in natural images. To handle rotation variations, a simple model plugs an additional rotation-invariant layer into R-CNN, relying on rotation data augmentation. The oriented response network (ORN) introduces active rotating filters (ARFs) to produce rotation-invariant features without data augmentation, and is adopted by the rotation-sensitive regression detector (RRD). The deformable modules designed for general object deformation are also widely used in aerial images. The methods mentioned above do not fully utilize the OBB annotations. When OBB annotations are available, a rotation R-CNN (RR-CNN) uses rotation region-of-interest (RRoI) pooling to extract rotation-invariant region features; however, RR-CNN generates proposals in a hand-crafted way. The RoI Transformer instead uses the supervision of OBBs to learn RoI-wise spatial transformations, and the later SA-Net extracts spatially invariant features in one-stage detectors. To handle the challenge of scale variations, feature pyramids [57, 19] and image pyramids [24, 25]
are widely used to extract scale-invariant features in aerial images. We evaluate the geometric transformation network modules and geometric data augmentations in Sec. 6.1.
Crowded instances represented by HBBs are difficult to distinguish (see Fig. 3), and traditional HBB-based non-maximum suppression (NMS) fails in such cases. Therefore, these methods [18, 25, 24] use rotated NMS (R-NMS), which requires precise OBB detections. Similar to text and face detection in natural scenes [23, 58, 59, 60], precise ODAI can also be modeled as an oriented object detection task. Most of the previous works [23, 61, 14, 24, 25] consider it a regression problem and regress the offsets of the OBB ground truth relative to anchors (or proposals). However, the definition of an OBB is ambiguous; for example, there are four cyclic permutations of the corner points of a quadrilateral. Faster R-CNN OBB solves this by using a predefined rule to determine the order of points in OBBs. Later work further uses gliding offsets and an obliquity factor to eliminate the ambiguity. The circular smooth label (CSL) transforms the angle regression into a classification problem to avoid the issue. Mask OBB and CenterMap cast oriented object detection as a pixel-level classification problem to avoid the ambiguity. Mask-based methods converge more easily but require more floating-point operations (FLOPs) than regression-based methods. We give a more detailed comparison between them within one unified code library in Sec. 6.1.1.
The final challenge is detecting objects in large images. Aerial images are usually extremely large, and current GPU memory capacity is insufficient to process them at full size, while downsampling a large image to a small size loses detailed information. To address this problem, the large images can simply be split into small patches [14, 16]; after obtaining the results on these patches, the detections are merged back into the original image coordinates. To speed up inference on large images, other methods [20, 21, 22, 66] first find regions that are likely to contain instances in the large images and then detect objects within those regions. In this paper, we simply follow the naive solutions [14, 16] to build baselines.
The development of object detection algorithms is a sophisticated process, and the many design choices and hyperparameter settings make comparisons between different methods difficult. Therefore, object detection code libraries such as the TensorFlow Object Detection API, Detectron, maskrcnn-benchmark, Detectron2, MMDetection and SimpleDet have been developed to facilitate such comparisons. These code libraries primarily use a modular design, which makes it easy to develop new algorithms. The widely used default settings, such as the training schedule, come from Detectron. However, these code libraries mainly focus on horizontal object detection; only Detectron2 has limited support for oriented object detection. In our work, we enrich MMDetection with several crucial operators for oriented object detection and evaluate 10 algorithms for object detection in aerial images.
In aerial images, differences in resolution and sensor type are factors that produce dataset biases. To mitigate these biases, we collect images with multiple resolutions from various sensors and platforms, including Google Earth, the Gaofen-2 (GF-2) satellite, the Jilin-1 (JL-1) satellite, and aerial images (taken by CycloMedia in Rotterdam). To obtain the DOTA images, we first collected the coordinates of areas of interest (e.g., airports or harbors) from all over the world. Then, according to the coordinates, images were collected from Google Earth and the GF-2 and JL-1 satellites. For the images from Google Earth, we cropped patches that contain instances of interest, with sizes from 800 to 4,000 pixels. For the satellite and aerial images, however, we maintained their original sizes. Large images better approach real-world distributions and also pose a challenge for finding small instances. In DOTA, the satellite and aerial images are therefore typically much larger than the Google Earth patches.
We choose eighteen categories: plane, ship, storage tank, baseball diamond, tennis court, swimming pool, ground track field, harbor, bridge, large vehicle, small vehicle, helicopter, roundabout, soccer ball field, basketball court, container crane, airport and helipad. We select these categories according to their frequency of occurrence and their value for real-world applications. The first 10 categories are common in existing datasets, e.g., [16, 41, 17, 37]. The other categories are added considering their value in real-world applications. For example, we selected “helicopter” because moving objects are of significant importance in aerial images, and “roundabout” because it plays an essential role in roadway analyses. It is worth discussing whether to take “stuff” categories into account. There are usually no clear definitions for “stuff” categories (harbor, airport, parking lot), as shown in the Scene UNderstanding (SUN) dataset. However, their contextual information may be helpful for object detection. Based on this idea, we select the harbor and airport categories because their borders are relatively easy to define and there are abundant harbor and airport instances in our image sources.
In computer vision, many visual concepts, such as region descriptions, objects, attributes, and relationships, are commonly represented with bounding boxes. A common representation of the bounding box is (x_c, y_c, w, h), where (x_c, y_c) is the center location and w and h are the width and height, respectively, of the bounding box. We call this type of bounding box an HBB. The HBB can describe objects well in most cases. However, it cannot accurately outline oriented instances such as text and objects in aerial images. As shown in Fig. 3, the HBB cannot differentiate crowded oriented objects, and the conventional NMS algorithm fails in such cases. Moreover, the regional features extracted from HBBs are not rotation invariant. To address these problems, we represent objects with OBBs. In detail, an OBB is denoted by {(x_i, y_i), i = 1, 2, 3, 4}, where (x_i, y_i) denotes the position of the OBB's i-th vertex in the image. The vertices are arranged in clockwise order.
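To make the two representations concrete, the following minimal sketch (illustrative helper names, not part of the released code library) converts a four-vertex OBB into the axis-aligned HBB that encloses it; the example shows how much extra background the HBB of a rotated object covers.

```python
import numpy as np

def obb_to_hbb(obb):
    """Convert a four-vertex OBB [(x1, y1), ..., (x4, y4)] into the
    axis-aligned HBB (x_min, y_min, x_max, y_max) that encloses it."""
    pts = np.asarray(obb, dtype=np.float64).reshape(4, 2)
    x_min, y_min = pts.min(axis=0)
    x_max, y_max = pts.max(axis=0)
    return x_min, y_min, x_max, y_max

# A square rotated by 45 degrees: its enclosing HBB covers roughly twice the
# object's area, which is why HBBs of densely packed oriented objects overlap.
print(obb_to_hbb([(10, 0), (20, 10), (10, 20), (0, 10)]))  # (0.0, 0.0, 20.0, 20.0)
```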
The most straightforward way to annotate an OBB is to draw an HBB and then adjust the angle. However, since there is no reference for HBBs, several adjustments of the center, height, width and angle are usually needed to fit an arbitrarily oriented object well. Clicking on physical points lying on the object could make crowd-sourced annotation more efficient, as such points are easy to find. Inspired by this idea, we allow the annotators to click the four corners of the OBBs. For most categories (e.g., tennis courts, basketball courts and vehicles), the corners of the OBBs lie on or close to the objects; however, there are still some categories whose shapes are very different from OBBs. For these categories, we instead annotate four key points lying on the object. For example, we annotate planes with 4 key points representing the head, the two wingtips, and the tail, and then transfer the 4 key points to an OBB.
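One simple way to realize the key-point-to-OBB transfer described above is to fit the minimum-area rotated rectangle around the clicked points; the sketch below (an assumption about the exact fitting procedure, using OpenCV) illustrates the idea.

```python
import cv2
import numpy as np

def keypoints_to_obb(points):
    """Fit an OBB to annotated key points (e.g., a plane's head, two wingtips
    and tail) as the minimum-area rotated rectangle around them."""
    pts = np.asarray(points, dtype=np.float32)
    rect = cv2.minAreaRect(pts)      # ((cx, cy), (w, h), angle)
    return cv2.boxPoints(rect)       # 4 x 2 array of OBB vertices

# Hypothetical key points of a plane: head, right wingtip, tail, left wingtip.
print(keypoints_to_obb([(50.0, 10.0), (90.0, 60.0), (50.0, 95.0), (10.0, 60.0)]))
```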
However, when using OBBs to represent objects, we could obtain four different representations of the same object by changing the order of the points. For example, if ((x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4)) represents an object, then ((x_2, y_2), (x_3, y_3), (x_4, y_4), (x_1, y_1)) represents the same object. For categories with a difference between head and tail (e.g., helicopter, large vehicle, small vehicle, harbor), we carefully select the first point to indicate the “head” of the object. For other categories (e.g., soccer-ball field, swimming pool and bridge) that do not have visual clues to determine the first point, we choose the top-left point as the starting point.
Some examples of annotated patches are shown in Fig. 4.
We count the proportion of three image sources (Google Earth, satellite and aerial images) in terms of the number of images, number of instances and pixel area in Tab. III. We can see that the carefully selected Google Earth images contain the majority of positive samples. Nevertheless, the negative samples are also important to avoid positive sample bias . The collected satellite and aerial images are close to the real-world distribution and provide enough background area.
| Image source | GF-2 and JL-1 | Google Earth | Aerial image |
|---|---|---|---|
| Proportion of images | 0.05 | 0.90 | 0.05 |
| Proportion of instances | 0.10 | 0.76 | 0.14 |
The ground sample distance (GSD), which indicates the distance between pixel centers measured on the ground, has several potential uses. For example, it allows us to calculate the actual sizes of objects, which can be used to filter mislabeled or misclassified outliers, since the object sizes of the same category are usually limited to a small range. Furthermore, we can conduct scale normalization based on priors of the object size and GSD. In DOTA, the GSDs of the satellite images and aerial images are approximately 1 m and 0.1 m, respectively, while the GSDs of the Google Earth images range from 0.1 m to 4.5 m.
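As a small illustration of how the GSD can be used, the hedged snippet below converts a pixel extent into an approximate physical size; the specific numbers are only examples.

```python
def physical_size_m(pixel_extent, gsd_m_per_pixel):
    """Approximate real-world size (in metres) of an object from its pixel
    extent and the ground sample distance (GSD) of the image."""
    return pixel_extent * gsd_m_per_pixel

# 120 pixels in a 1 m-GSD satellite image is ~120 m, but only ~12 m in a
# 0.1 m-GSD aerial image; implausible sizes for a category can flag
# mislabeled or misclassified instances.
print(physical_size_m(120, 1.0), physical_size_m(120, 0.1))
```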
Objects in overhead-view images have a high diversity of orientations because they are not constrained by gravity. As shown in Fig. 1 (g), the objects appear with roughly equal probability at any orientation. It is worth noting that although objects in scene text detection and face detection also show many orientation variations, the angles of most of those objects lie within a narrow range due to gravity. The unique angle distribution of DOTA makes it a good dataset for research on rotation-invariant feature extraction and oriented object detection.
Following common convention, we use the height of an HBB to measure the pixel size of an instance. We divide all the instances in our dataset into three splits according to the heights of their HBBs: small (10–50 pixels), medium (50–300 pixels), and large (above 300 pixels). Tab. IV shows the proportions of these three splits in different datasets. It is clear that the PASCAL VOC dataset, the NWPU VHR-10 dataset and the DLR 3K Munich Vehicle dataset are dominated by medium or small instances.
MS COCO and DOTA-v1.0 have a good balance between small and medium instances. DOTA-v2.0 has more small instances than DOTA-v1.0; in DOTA-v2.0, instances as small as approximately 10 pixels are annotated.
In Fig. 5, we also show the distribution of instances' pixel sizes for different categories in DOTA. This figure indicates that the scales vary greatly both within and between categories. These large variations in scale among instances make the detection task more challenging.
| Dataset | 10–50 pixels | 50–300 pixels | above 300 pixels |
|---|---|---|---|
| PASCAL VOC | 0.14 | 0.61 | 0.25 |
| MS COCO | 0.43 | 0.49 | 0.08 |
| NWPU VHR-10 | 0.15 | 0.83 | 0.02 |
| DLR 3K | 0.93 | 0.07 | 0 |
The AR is essential for anchor-based models such as Faster R-CNN and You Only Look Once (YOLOv2). We compute two kinds of ARs for all the instances in our dataset to guide the model design, namely: 1) the ARs of the original OBBs and 2) the ARs of the HBBs, which are generated by computing the axis-aligned bounding boxes of the OBBs. Fig. 6 illustrates the distributions of these two types of aspect ratios in DOTA. We can see that instances vary significantly in aspect ratio, and many instances have a large aspect ratio.
The number of instances per image is an important property of object detection datasets and varies greatly in DOTA. An image can be very dense (up to 1,000 instances per image patch) or very sparse (only one instance per image patch). We compare this property between DOTA and general object detection datasets in Fig. 2. The number of instances per image in DOTA varies more widely than in natural image datasets.
Different categories also have different density distributions. To give a quantitative analysis, for each instance we first measure the distance to the closest instance of the same category. We then bin the distances into three parts: dense, normal and sparse (see Fig. 7). Fig. 7 shows that storage tank, ship and small vehicle are the three densest categories.
The three versions of DOTA, DOTA-v1.0, DOTA-v1.5, and DOTA-v2.0, are summarized in Table V.
DOTA-v1.0 contains 15 common categories, 2,806 images and 188,282 instances. The proportions of the training set, validation set, and testing set in DOTA-v1.0 are 1/2, 1/6, and 1/3, respectively.
DOTA-v1.5 uses the same images as DOTA-v1.0, but extremely small instances (less than 10 pixels) are also annotated. Moreover, a new category, “container crane”, is added, bringing the total to 402,089 instances. The number of images and the dataset splits are the same as those in DOTA-v1.0. This version was released for the DOAI Challenge 2019 on Object Detection in Aerial Images, held in conjunction with CVPR 2019.
DOTA-v2.0 collects more Google Earth, GF-2 satellite, and aerial images. There are 18 common categories, 11,268 images and 1,793,658 instances in DOTA-v2.0. Compared to DOTA-v1.5, it further adds the new categories “airport” and “helipad”. The images of DOTA-v2.0 are split into training, validation, test-dev, and test-challenge sets. To avoid the problem of overfitting, the proportion of the training and validation sets is smaller than that of the test sets. Furthermore, we have two test sets, namely test-dev and test-challenge, similar to the MS COCO dataset. The detailed splits are as follows:
Training contains 1,830 images and 268,627 instances. We will release both the images and ground truths.
Validation contains 593 images and 81,048 instances. We will release both the images and ground truths.
Test-dev contains 2,792 images and 353,346 instances. We will release the images without ground truths. For evaluation, one can submit results to the evaluation server that we built (https://captain-whu.github.io/DOTA/evaluation.html). Submissions from each team are limited to one per day to avoid overfitting of the test set. All the DOTA-v2.0 experiments in this paper are evaluated on test-dev.
Test-challenge contains 6,053 images and 1,090,637 instances. It will only be available during the contest.
The task of object detection is to locate and classify the instances in images. We use two location representations (HBB and OBB) in our paper. The HBB is an axis-aligned rectangular region (x_c, y_c, w, h), and the OBB is an oriented rectangular region {(x_i, y_i), i = 1, 2, 3, 4}. There are therefore two tasks: detection with HBBs and detection with OBBs. To be more specific, we evaluate the methods on two kinds of ground truths: HBB and OBB ground truths. We adopt the PASCAL VOC 07 metric for the calculation of the mean average precision (mAP). It is worth noting that for the OBB task, the intersection over union (IoU) is calculated between OBBs.
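For reference, a minimal sketch of the PASCAL VOC 07 (11-point interpolated) AP computation used here is given below; it assumes precision and recall have already been computed from matched detections.

```python
import numpy as np

def voc07_ap(recall, precision):
    """11-point interpolated average precision (PASCAL VOC 2007 metric):
    the mean, over recall thresholds 0.0, 0.1, ..., 1.0, of the maximum
    precision attained at recall >= threshold."""
    recall = np.asarray(recall, dtype=np.float64)
    precision = np.asarray(precision, dtype=np.float64)
    ap = 0.0
    for t in np.arange(0.0, 1.1, 0.1):
        mask = recall >= t
        ap += (precision[mask].max() if mask.any() else 0.0) / 11.0
    return ap

# Toy precision-recall curve for a single category.
print(voc07_ap([0.1, 0.4, 0.7, 0.9], [1.0, 0.8, 0.6, 0.3]))
```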
In previous benchmarks, the algorithms were implemented with different code bases and settings, which makes them hard to compare fairly on DOTA. To this end, we implement and evaluate all the algorithms in one unified code library modified from MMDetection.
Since large images cannot be directly fed to CNN-based detectors due to memory limitations, we crop a series of 1,024 × 1,024 patches from the original images with a stride of 824 (different from the previous stride of 512). During inference, we first send the patches (with the same settings as training) through the detector to obtain temporary results. Then we map the detected results from the patch coordinates to the original image coordinates. Finally, we apply NMS to these results in the original image coordinates, using separate NMS thresholds for the HBB and OBB experiments. For multi-scale training and testing, we first rescale the original images by factors of 0.5, 1.0 and 1.5 and then crop them into patches of size 1,024 × 1,024 with a stride of 824. We use 4 GPUs for training with a total batch size of 8 (2 images per GPU). The learning rate is set to 0.01. Except for RetinaNet, which adopts the “2×” schedule, the other algorithms adopt the “1×” training schedule. We set the number of proposals and the maximum number of predictions per image patch to 2,000 for all the experiments except when otherwise mentioned. The other hyperparameters follow those of Detectron.
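The cropping-and-merging pipeline described above can be sketched as follows; the patch size, stride and the `detect_fn` callback are placeholders rather than the exact interface of our code library.

```python
import numpy as np

def patch_origins(img_h, img_w, patch=1024, stride=824):
    """Top-left corners of overlapping patches covering the whole image."""
    xs = list(range(0, max(img_w - patch, 0) + 1, stride))
    ys = list(range(0, max(img_h - patch, 0) + 1, stride))
    if xs[-1] + patch < img_w:          # make sure the right border is covered
        xs.append(img_w - patch)
    if ys[-1] + patch < img_h:          # make sure the bottom border is covered
        ys.append(img_h - patch)
    return [(x, y) for y in ys for x in xs]

def detect_large_image(image, detect_fn, patch=1024, stride=824):
    """Run `detect_fn` (patch -> array of [x1, y1, x2, y2, score]) on every
    patch, shift the boxes back to image coordinates, then apply NMS on the
    merged results (NMS omitted here for brevity)."""
    h, w = image.shape[:2]
    merged = []
    for x0, y0 in patch_origins(h, w, patch, stride):
        dets = np.asarray(detect_fn(image[y0:y0 + patch, x0:x0 + patch]))
        if dets.size:
            dets = dets.astype(np.float64)
            dets[:, [0, 2]] += x0       # shift x coordinates
            dets[:, [1, 3]] += y0       # shift y coordinates
            merged.append(dets)
    return np.concatenate(merged) if merged else np.zeros((0, 5))
```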
We use two ways to build baselines for the HBB task. The first directly predicts the HBB results, while the second first predicts the OBB results and then converts the OBBs to HBBs. To directly predict HBB results, we use RetinaNet, Mask R-CNN, Cascade Mask R-CNN, Hybrid Task Cascade and Faster R-CNN as baselines. The methods used for OBB prediction are introduced in the following section.
Most state-of-the-art object detection methods are not designed for oriented objects. To enable these methods to predict OBBs, we build the baselines in two ways. The first is to change the HBB head to an OBB head, which regresses the offsets of the OBBs relative to the HBBs. The second is a mask head, which treats the OBB as a coarse mask and predicts a pixel-level classification within each RoI.
OBB Head To predict OBBs, the previous Faster R-CNN OBB and Textboxes++ modified the RoI head of Faster R-CNN and the anchor head of the single-shot detector (SSD), respectively, to regress quadrangles. In this paper, we use the four-vertex representation {(x_i, y_i), i = 1, 2, 3, 4} for OBB regression instead of an angle-based representation. More precisely, a rectangular RoI (anchor) can be written as (x_c, y_c, w, h); we can also consider it a special OBB and rewrite it as its four corner points. For matching, IoUs are calculated between the horizontal RoIs (anchors) and the HBBs of the ground truths for computational simplicity. Each ground-truth OBB V = (v_1, v_2, v_3, v_4) has four equivalent forms obtained by cyclically shifting its vertices, V^(k) = (v_{1+k}, v_{2+k}, v_{3+k}, v_{4+k}) for k = 0, 1, 2, 3 (indices taken modulo 4). Before calculating the regression targets, we choose the best-matched ground-truth form, whose index is k* = argmin_k D(V^(k), R), where R denotes the matched RoI written in its four-corner form and D is a distance function, which could be the Euclidean distance or another distance function. We denote the best-matched form by V^(k*). Then the learning target is calculated as the corner offsets of V^(k*) relative to the corners of R, normalized by the width and height of the RoI.
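A minimal sketch of the matching and target computation just described is given below (assuming the Euclidean distance as D and normalization by the RoI width and height; function names are illustrative).

```python
import numpy as np

def obb_regression_targets(gt_obb, roi):
    """Pick the cyclic permutation of the ground-truth OBB vertices closest to
    the corners of the horizontal RoI, then return the normalized corner
    offsets used as the OBB head's regression target."""
    x1, y1, x2, y2 = roi
    w, h = x2 - x1, y2 - y1
    # The RoI viewed as a "special OBB": its four corners in clockwise order.
    roi_pts = np.array([[x1, y1], [x2, y1], [x2, y2], [x1, y2]], dtype=np.float64)
    gt = np.asarray(gt_obb, dtype=np.float64).reshape(4, 2)

    forms = [np.roll(gt, -k, axis=0) for k in range(4)]     # four equivalent forms
    dists = [np.linalg.norm(f - roi_pts) for f in forms]    # Euclidean distance as D
    best = forms[int(np.argmin(dists))]                     # best-matched form

    return ((best - roi_pts) / np.array([w, h])).ravel()    # (dx_i, dy_i), i = 1..4

# Toy example: a slightly rotated ground-truth box matched against a horizontal RoI.
print(obb_regression_targets([(12, 8), (48, 14), (44, 42), (8, 36)], (10, 10, 50, 40)))
```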
We then simply replace the HBB RoI Head of Faster R-CNN and anchor head of RetinaNet with OBB Head and obtain two models, called Faster R-CNN OBB and RetinaNet OBB. We also modify the Faster R-CNN to predict both the HBB and OBB in parallel, which is similar to Mask R-CNN . We call this model Faster R-CNN H-OBB. We further evaluate the deformable RoI pooling (Dpool) and RoI Transformer by replacing the RoI Align in Faster R-CNN OBB. Then we have two models: Faster R-CNN OBB + Dpool and Faster R-CNN OBB + RoI Transformer. Note that the RoI Transformer used here is slightly different from the original one. The original RoI Transformer uses the Light Head R-CNN  as the base detector while we use Faster R-CNN.
Mask Head Mask R-CNN was originally designed for instance segmentation. Although DOTA does not have pixel-level annotations for each instance, the OBB annotations can be considered coarse pixel-level annotations, so we can apply Mask R-CNN to DOTA. During inference, we calculate the minimum-area OBBs that contain the predicted masks. The original Mask R-CNN only applies the mask head to the top 100 HBBs ranked by score. Due to the large number of instances per image, as illustrated in Fig. 2, we instead apply the mask head to all the HBBs remaining after NMS. In this way, we evaluate Mask R-CNN, Cascade Mask R-CNN and Hybrid Task Cascade.
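At inference time, converting a predicted mask into the minimum OBB can be done with standard OpenCV operations, for example as follows (a sketch assuming OpenCV ≥ 4; our code library may implement this step differently).

```python
import cv2
import numpy as np

def mask_to_obb(binary_mask):
    """Convert a predicted instance mask into the minimum-area OBB."""
    mask = (np.asarray(binary_mask) > 0).astype(np.uint8)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    largest = max(contours, key=cv2.contourArea)   # keep the largest component
    rect = cv2.minAreaRect(largest)                # ((cx, cy), (w, h), angle)
    return cv2.boxPoints(rect)                     # 4 x 2 array of OBB vertices
```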
| Method | fps | DOTA-v1.0 HBB mAP | DOTA-v1.0 OBB mAP | DOTA-v1.5 HBB mAP | DOTA-v1.5 OBB mAP | DOTA-v2.0 HBB mAP | DOTA-v2.0 OBB mAP |
|---|---|---|---|---|---|---|---|
| Cascade Mask R-CNN | 7.2 | 71.36 | 70.96 | 64.31 | 63.41 | 50.98 | 50.04 |
| Hybrid Task Cascade* | 7.9 | 72.49 | 71.21 | 64.47 | 63.40 | 50.88 | 50.34 |
| Faster R-CNN OBB | 14.1 | 71.91 | 69.36 | 63.85 | 62.00 | 49.37 | 47.31 |
| Faster R-CNN OBB + Dpool | 12.1 | 71.83 | 70.14 | 63.67 | 62.20 | 50.48 | 48.77 |
| Faster R-CNN H-OBB | 13.7 | 70.37 | 70.11 | 64.43 | 62.57 | 50.38 | 48.90 |
| Faster R-CNN OBB + RoI Transformer | 12.4 | 74.59 | 73.76 | 66.09 | 65.03 | 53.37 | 52.81 |
In this section, we introduce the aerial object detection code library (https://github.com/dingjiansw101/AerialDetection) and the development kit (https://github.com/CAPTAIN-WHU/DOTA_devkit). To construct the comprehensive baselines, we select MMDetection as the fundamental code library since it contains rich object detection algorithms and features a modular design. However, the original MMDetection lacks the modules needed to support oriented object detection. Therefore, we enriched MMDetection with the OBB head described in Sec. 5.2.2 to enable OBB predictions. We also implemented modules such as rotated RoI Align and rotated position-sensitive RoI Align for rotated region feature extraction, which are crucial components of algorithms such as the rotated region proposal network (RRPN) and the RoI Transformer. These new modules are compatible with the modularly designed MMDetection, so new algorithms for oriented object detection, beyond the baseline methods in this paper, can be created easily. We also provide a development kit containing the necessary functions for object detection on DOTA, including:
Loading and visualizing the ground truths.
Calculating the IoU between OBBs, implemented as a mixture of Python and C; we provide both CPU and GPU versions (a minimal reference sketch is given after this list).
Evaluating the results. The evaluation metrics are described in Sec. 5.1.
Cropping and merging images. The original image sizes in DOTA are very large, and one can use our tools to split the large images into patches; the scale and the gap between patches are optional arguments. After testing on the patches, the tools can map the results from patch coordinates back to the original image coordinates and apply NMS.
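As a reference for the IoU computation mentioned above, a minimal (unoptimized) polygon-based sketch is shown below; the released CPU/GPU implementations compute the same quantity much faster.

```python
from shapely.geometry import Polygon

def obb_iou(obb_a, obb_b):
    """IoU between two OBBs, each given as four (x, y) vertices."""
    a, b = Polygon(obb_a), Polygon(obb_b)
    if not (a.is_valid and b.is_valid):
        return 0.0
    inter = a.intersection(b).area
    union = a.area + b.area - inter
    return inter / union if union > 0 else 0.0

# Two axis-aligned 4x2 boxes shifted by 1 pixel: IoU = 6 / 10 = 0.6.
print(obb_iou([(0, 0), (4, 0), (4, 2), (0, 2)],
              [(1, 0), (5, 0), (5, 2), (1, 2)]))
```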
In this section, we conduct a comprehensive evaluation of over 70 experiments and analyze the results. First, we report the baseline results of 10 algorithms on DOTA-v1.0, DOTA-v1.5 and DOTA-v2.0. The baselines cover both two-stage and one-stage algorithms. For most algorithms, we report the mAPs of the HBB and OBB predictions, respectively, except for RetinaNet and Faster R-CNN, which do not support oriented object detection. For algorithms with only OBB heads (RetinaNet OBB, Faster R-CNN OBB, Faster R-CNN OBB + Dpool, Faster R-CNN OBB + RoI Transformer), we obtain their HBB results by converting the OBBs as described in Sec. 5.2.1. For algorithms with both HBB and OBB heads (Mask R-CNN, Cascade Mask R-CNN, Hybrid Task Cascade*, and Faster R-CNN H-OBB), the HBB mAP is the maximum of the predicted HBB mAP and the converted HBB mAP. The OBB mAP is usually slightly lower than the HBB mAP for the same algorithm, since the OBB task requires more precise localization than the HBB task.
Tab. VI shows that performance declines from DOTA-v1.0 to DOTA-v1.5 to DOTA-v2.0, indicating the increasing difficulty of the datasets. To give a more detailed comparison of speed vs. accuracy, we evaluate all algorithms with different backbones (as shown in Fig. 8). From the speed-accuracy curve, Faster R-CNN OBB + RoI Transformer outperforms the other methods. To explore the properties of DOTA and provide guidelines for future research, we then evaluate module designs and hyperparameter settings, analyze the influence of data augmentation in detail, and finally visualize the results to illustrate the difficulties of ODAI.
The OBB head tackles oriented object detection as a regression problem, while the mask head tackles it as a pixel-level classification problem. The mask head converges more easily and achieves better results but is more computationally expensive. Taking the results on the DOTA-v2.0 test-dev set as an example, Mask R-CNN outperforms Faster R-CNN H-OBB by 0.57 points in OBB mAP, but Mask R-CNN is slower than Faster R-CNN H-OBB by 4 fps. Note that the cost of converting masks to OBBs is not included here; otherwise, Mask R-CNN would be even slower.
Geometric variations remain challenging in object detection. In this part, we evaluate the RoI Transformer and Dpool by replacing RoI Align in Faster R-CNN OBB, yielding Faster R-CNN OBB + RoI Transformer and Faster R-CNN OBB + Dpool. Tab. VI and Fig. 8 show that Dpool improves the performance of Faster R-CNN OBB in most cases, while the RoI Transformer performs better than Dpool. This finding verifies that carefully designed geometric transformation modules such as the RoI Transformer are better suited to aerial images than general geometric transformation modules such as Dpool.
During training on DOTA-v1.5 and DOTA-v2.0, many extremely small instances cause numerical instability. For the experiments on DOTA-v1.5 and DOTA-v2.0, we therefore set thresholds to exclude instances that are too small, and we explore the influence of different thresholds on DOTA-v2.0. We filter the small instances by two rules: 1) the area of the instance is below a threshold, and 2) max(w, h) is below a threshold, where w and h are the width and height, respectively, of the corresponding HBB. The results in Tab. VII show that the filtered small instances have little influence on the results.
| # of filtered instances | Filtering strategy | HBB mAP |
|---|---|---|
| 99,317 | area < 50 and max(w, h) < 10 | 51.08 |
| 157,287 | area < 80 and max(w, h) < 10 | 51.35 |
| 158,629 | area < 80 and max(w, h) < 12 | 50.71 |
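The filtering rule in Tab. VII can be expressed as a one-line predicate; the thresholds below mirror the second row of the table and are only an example.

```python
def keep_instance(hbb_w, hbb_h, area, min_area=80, min_side=10):
    """Drop an instance from training if both its polygon area and the longer
    side of its HBB fall below the thresholds (second row of Tab. VII)."""
    return not (area < min_area and max(hbb_w, hbb_h) < min_side)
```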
| # of proposals | 1,000 | 2,000 | 3,000 | 4,000 | 5,000 | 6,000 | 7,000 | 8,000 | 9,000 | 10,000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Faster R-CNN OBB + RoI Transformer, OBB mAP (%) | 51.72 | 52.81 | 52.81 | 53.24 | 53.29 | 53.51 | 53.70 | 53.94 | 53.93 | 53.92 |
| Faster R-CNN OBB, OBB mAP (%) | 47.10 | 47.31 | 48.03 | 48.09 | 48.32 | 48.35 | 48.48 | 48.49 | 48.49 | 48.49 |
The number of proposals is an important hyperparameter in modern detectors. As mentioned before, the number of instances in aerial images is quite different from that in natural images; in DOTA, one image may contain more than 1,000 instances, so parameters that perform well on natural images are unlikely to be optimal for aerial images. Here we explore the optimal settings for aerial images. As shown in Tab. VIII, the best number of proposals for Faster R-CNN OBB + RoI Transformer is 8,000, and for Faster R-CNN OBB the increase in mAP also levels off at approximately 8,000 proposals. Furthermore, from 1,000 to 10,000 proposals, the improvements for Faster R-CNN OBB + RoI Transformer and Faster R-CNN OBB are 2.2 and 1.39 points in mAP, respectively. However, an increased number of proposals brings more computation. Therefore, for the other experiments in this paper, we choose 2,000 proposals. The optimal number of proposals in DOTA is much larger than that in PASCAL VOC, where 300 proposals are optimal. This finding again confirms the large difference between aerial and natural images.
In this section, we explore the influence of data augmentation in detail; the experiments are conducted on DOTA-v1.5. Previous work used multi-scale training and testing as well as rotation training as data augmentation; we follow these strategies and further conduct rotation testing. The model we select is Faster R-CNN OBB + RoI Transformer with an R-50-FPN backbone, and we adopt five data augmentation strategies. The first is a higher patch overlap: we change the overlap between patches from 200 to 512 pixels, since large instances may otherwise be cut off at patch edges. The second and third are multi-scale training and testing, respectively: we resize the original images by factors of 0.5, 1.0 and 1.5 and then crop them into patches of size 1,024 × 1,024. The fourth is rotation training: for images with roundabouts and storage tanks, we rotate the patches randomly by one of four angles (0°, 90°, 180°, 270°); for images with the other categories, we rotate by an angle drawn continuously at random during training. The last is rotation testing, where we rotate the images by the same four angles (0°, 90°, 180°, 270°). As shown in Tab. IX, both scale and rotation augmentation improve detection performance by a large margin, which is consistent with the large scale and orientation variations in DOTA. Furthermore, this baseline model already uses a feature pyramid network (FPN) and the RoI Transformer, which indicates that the FPN and RoI Transformer do not completely solve the problems of scale and rotation variation; geometric modeling with CNNs is still an open problem.
[Tab. IX: ablation of the five data augmentation strategies (higher patch overlap, multi-scale training, multi-scale testing, rotation training, rotation testing) for Faster R-CNN OBB + RoI Transformer on DOTA-v1.5.]
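As a concrete illustration of the rotation augmentation described above, the sketch below rotates the four OBB vertices of an annotation with the same transform applied to the image patch; the 90° example corresponds to one of the four discrete angles used for the circle-like categories.

```python
import numpy as np

def rotate_obb(obb, angle_deg, center):
    """Rotate the four OBB vertices by angle_deg around `center` (the same
    rotation that is applied to the image patch during augmentation)."""
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    pts = np.asarray(obb, dtype=np.float64).reshape(4, 2) - center
    return pts @ rot.T + center

# Rotate an annotation by 90 degrees around the centre of a 1,024 x 1,024 patch.
print(rotate_obb([(100, 200), (300, 200), (300, 260), (100, 260)], 90, (512, 512)))
```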
We show the performance of Faster R-CNN, Faster R-CNN OBB, RetinaNet OBB, Mask R-CNN and Faster R-CNN OBB + RoI Transformer on difficult scenes in Fig. 9. 1) The first row shows densely packed large vehicles. Faster R-CNN misses many instances because the HBBs of neighboring large vehicles overlap heavily, so many of them are suppressed by NMS. Faster R-CNN OBB, Mask R-CNN and Faster R-CNN OBB + RT perform well, while RetinaNet OBB has lower localization precision due to feature misalignment. 2) The second and third rows show elongated instances with large ARs. These instances are self-similar, meaning that each part of the instance has features similar to those of the whole instance; for example, in the second row, all methods produce at least two predictions on a single bridge, and the third row shows several predictions on a single ship. 3) The second and third rows also indicate that several categories have very similar features: bridges are easily classified as airports or harbors, while ships are easily classified as harbors or bridges. 4) The last row shows the difficulty of detecting extremely small instances (less than or approximately 10 pixels), for which the recall is very low.
Results on DOTA-v1.0 (per-category AP and mAP). Categories: PL = plane, BD = baseball diamond, BR = bridge, GTF = ground track field, SV = small vehicle, LV = large vehicle, SH = ship, TC = tennis court, BC = basketball court, ST = storage tank, SBF = soccer-ball field, RA = roundabout, HA = harbor, SP = swimming pool, HC = helicopter.

OBB task:

| Method | Backbone | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LR-O + RT | R-101-FPN | 88.64 | 78.52 | 43.44 | 75.92 | 68.81 | 73.68 | 83.59 | 90.74 | 77.27 | 81.46 | 58.39 | 53.54 | 62.83 | 58.93 | 47.67 | 69.56 |
| Gliding Vertex | R-101-FPN | 89.64 | 85.00 | 52.26 | 77.34 | 73.01 | 73.14 | 86.82 | 90.74 | 79.02 | 86.81 | 59.55 | 70.91 | 72.94 | 70.86 | 57.32 | 75.02 |
| Li et al. | R-101-FPN | 90.41 | 85.21 | 55.00 | 78.27 | 76.19 | 72.19 | 82.14 | 90.70 | 87.22 | 86.87 | 66.62 | 68.43 | 75.43 | 72.70 | 57.99 | 76.36 |
| FR-O* + RT | R-50-FPN | 88.34 | 77.07 | 51.63 | 69.62 | 77.45 | 77.15 | 87.11 | 90.75 | 84.90 | 83.14 | 52.95 | 63.75 | 74.45 | 68.82 | 59.24 | 73.76 |
| FR-O* + RT (Aug.) | R-50-FPN | 87.89 | 85.01 | 57.83 | 78.55 | 75.22 | 84.37 | 88.04 | 90.88 | 87.28 | 85.79 | 71.04 | 69.67 | 79.00 | 83.29 | 73.43 | 79.82 |

HBB task:

| Method | Backbone | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Li et al. | ResNet101 | 90.41 | 85.77 | 61.94 | 78.18 | 77.00 | 79.94 | 84.03 | 90.88 | 87.30 | 86.92 | 67.78 | 68.76 | 82.10 | 80.44 | 60.43 | 78.79 |
| FR-O* + RT | R-50-FPN | 88.47 | 81.00 | 54.10 | 69.19 | 78.42 | 81.16 | 87.35 | 90.75 | 84.90 | 83.55 | 52.63 | 62.97 | 75.89 | 71.31 | 57.22 | 74.59 |
| FR-O* + RT (Aug.) | R-50-FPN | 87.91 | 85.11 | 62.65 | 77.73 | 75.83 | 85.03 | 88.18 | 90.88 | 87.28 | 86.18 | 71.49 | 70.37 | 84.94 | 84.11 | 73.61 | 80.75 |
In this section, we compare the performance of Faster R-CNN OBB + RoI Transformer with state-of-the-art algorithms on DOTA-v1.0. As shown in Tab. X, Faster R-CNN OBB + RoI Transformer achieves an OBB mAP of 73.76 on DOTA-v1.0, outperforming all previous state-of-the-art methods except the one proposed by Li et al. Note that the method of Li et al., SCRDet and the image cascade network (ICN) all use multiple scales for training and testing to achieve high performance, and the method of Li et al. further uses rotation data augmentation during training, as described in Sec. 6.1.5. When using the same data augmentation, we achieve an mAP of 79.82, outperforming the method of Li et al. by 3.46 points in OBB mAP and 1.96 points in HBB mAP. In addition, there is a significant improvement on densely packed small instances (e.g., small vehicles, large vehicles, and ships); for example, the detection performance for the large vehicle category improves by 12.18 points compared to the previous results.
Results of Faster R-CNN OBB + RoI Transformer (FR-O + RT) on DOTA-v1.5 (per-category AP and mAP; CC = container crane, other category abbreviations as above).

OBB task:

| Method | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | CC | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FR-O + RT | 71.92 | 76.07 | 51.87 | 69.24 | 52.05 | 75.18 | 80.72 | 90.53 | 78.58 | 68.26 | 49.18 | 71.74 | 67.51 | 65.53 | 62.16 | 9.99 | 65.03 |
| FR-O + RT (Aug.) | 87.54 | 84.34 | 62.22 | 79.77 | 67.29 | 83.16 | 89.93 | 90.86 | 83.85 | 77.74 | 73.91 | 75.31 | 78.61 | 77.07 | 75.20 | 54.77 | 77.60 |

HBB task:

| Method | PL | BD | BR | GTF | SV | LV | SH | TC | BC | ST | SBF | RA | HA | SP | HC | CC | mAP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FR-O + RT | 71.92 | 75.21 | 54.09 | 68.10 | 52.54 | 74.87 | 80.79 | 90.46 | 78.58 | 68.41 | 51.57 | 71.48 | 74.91 | 74.84 | 56.66 | 13.01 | 66.09 |
| FR-O + RT (Aug.) | 87.79 | 84.33 | 63.75 | 79.13 | 72.92 | 83.08 | 90.04 | 90.86 | 83.85 | 77.80 | 73.30 | 75.66 | 84.84 | 82.16 | 75.20 | 57.39 | 78.88 |
DOTA-v1.5 was used for the Challenge 2019 on ODAI, held in conjunction with CVPR 2019 (DOAI 2019, https://captain-whu.github.io/DOAI2019/). There were 173 registrations in total; 13 teams submitted valid results for the OBB task, and 22 teams submitted valid results for the HBB task. The team from the University of Science and Technology of China received first place in the OBB task and second place in the HBB task, while the team from the Nanjing University of Science and Technology received first place in the HBB task and second place in the OBB task. We list the top 3 results of the challenge for the OBB and HBB tasks in Tab. XI; the detailed leaderboards, including team members, institutes and the methods used, can be found on the DOAI 2019 website (https://captain-whu.github.io/DOAI2019/results.html). Note that the top results are ensembles of different models; however, one of the top teams also reported a single model achieving 74.9 OBB mAP and 77.9 HBB mAP, using data augmentations such as multi-scale training and testing and rotation training. Our best model with the same data augmentation reaches 76.43 OBB mAP and 77.24 HBB mAP, as shown in Tab. IX, which is higher in OBB mAP and comparable in HBB mAP.
ODAI is challenging. To advance future research, we introduce a large-scale dataset, DOTA, containing 1,793,658 instances annotated with OBBs. The statistics of DOTA show that it represents the real world well. We then build a code library for both oriented and horizontal ODAI and conduct a comprehensive evaluation, which we hope can serve as a benchmark for fair comparisons between ODAI algorithms. The results show that the hyperparameter settings and module designs (e.g., the number of proposals) that work for natural images are very different from those suited to aerial images, indicating that DOTA can also serve as a supplement to natural scene images to facilitate universal object detection.
In the future, we will continue to extend the dataset, host more challenges, and integrate more algorithms for oriented object detection into our code library. We believe that DOTA, challenges and code library will not only promote the development of object detection in Earth vision but also pose interesting algorithmic questions for general object detection in computer vision.
B. Zhou, À. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, “Learning deep features for scene recognition using places database,” in NIPS, 2014, pp. 487–495.
B. Uzkent, C. Yeh, and S. Ermon, “Efficient object detection in large images using deep reinforcement learning,” in WACV, 2020, pp. 1824–1833.
F. Massa and R. Girshick, “maskrcnn-benchmark: Fast, modular reference implementation of Instance Segmentation and Object Detection algorithms in PyTorch,” https://github.com/facebookresearch/maskrcnn-benchmark, 2018.
D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. K. Ghosh, A. D. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, F. Shafait, S. Uchida, and E. Valveny, “ICDAR 2015 competition on robust reading,” in ICDAR, 2015.