SCRDet++: Detecting Small, Cluttered and Rotated Objects via Instance-Level Feature Denoising and Rotation Loss Smoothing

04/28/2020 ∙ Xue Yang, et al. ∙ Shanghai Jiao Tong University

Small and cluttered objects are common in real-world scenes and are challenging to detect. The difficulty is further pronounced when the objects are rotated, as traditional detectors routinely locate objects with horizontal bounding boxes, such that the region of interest is contaminated by background or nearby interleaved objects. In this paper, we first introduce the idea of denoising to object detection: instance-level denoising on the feature map is performed to enhance the detection of small and cluttered objects. To handle the rotation variation, we also add a novel IoU constant factor to the smooth L1 loss to address the long-standing boundary problem, which, by our analysis, is mainly caused by the periodicity of angular (PoA) and the exchangeability of edges (EoE). Combining these two features, our proposed detector is termed SCRDet++. Extensive experiments are performed on the large public aerial image datasets DOTA, DIOR and UCAS-AOD, as well as the natural image dataset COCO, the scene text dataset ICDAR2015, the small traffic light dataset BSTLD and our S^2TLD newly released with this paper. The results show the effectiveness of our approach. A project page is publicly available.




1 Introduction

Visual object detection is one of the fundamental tasks in computer vision, and various general-purpose detectors [16, 20, 17, 39, 48, 7, 51] based on convolutional neural networks (CNNs) have been devised. Promising results have been achieved on public benchmarks including MS COCO [35] and VOC2007 [12]. However, most existing detectors pay little attention to some aspects common in robust object detection in the wild: small size, cluttered arrangement and arbitrary orientations. These challenges are especially pronounced in aerial imagery [64, 29, 5, 42], which has become an important detection domain in practice given its various civil applications, e.g. resource detection, environmental monitoring, and urban planning.

(a) Horizontal detection.
(b) Rotation detection.
Fig. 1: Small, cluttered and rotated objects in a complex scene, where rotation detection plays an important role. Red boxes indicate missed detections that are suppressed by non-maximum suppression (NMS).

In the context of remote sensing, we further present some specific discussion to motivate this paper, as shown in Fig. 1. It shall be noted that the following three aspects also prevail in other sources, e.g. natural images and scene texts.

1) Small objects. Aerial images often contain small objects overwhelmed by complex surrounding scenes.

2) Cluttered arrangement. Objects in aerial images, e.g. vehicles and ships, are often densely arranged, leading to inter-class feature coupling and intra-class feature boundary blur.

3) Arbitrary orientations. Objects in aerial images can appear in various orientations. Rotation detection is necessary, especially considering the high-aspect-ratio issue: the horizontal bounding box for a rotated object is looser than an aligned rotated one, such that the box contains a large portion of background or nearby cluttered objects as disturbance. Moreover, it is greatly affected by non-maximum suppression, see Fig. 1(a).

As described above, the small/cluttered objects problem can be interleaved with rotation variance. In this paper, we aim to address the first challenge by seeking a new way of dismissing noisy interference from both the background and other foreground objects, while for rotation alignment a new rotation loss is devised accordingly. Both techniques can serve as plug-ins for existing detectors [51, 33, 34, 43, 23, 71] in an out-of-the-box manner. We give further description as follows.

For small and cluttered object detection, we devise a denoising module; in fact, denoising has not previously been studied for object detection. We observe two common types of noise that are orthogonal to each other: i) image-level noise, which is object-agnostic, and ii) instance-level noise, often in the form of mutual interference between objects as well as background interference. Such noise is ubiquitous and pronounced in remotely sensed aerial images. Denoising has long been studied [56, 66, 44, 6] in image processing, yet existing methods are rarely designed for object detection: denoising is ultimately performed on the raw image for the purpose of image enhancement rather than for downstream semantic tasks, let alone in an end-to-end manner.

In this paper, we explore performing instance-level denoising (InLD), in particular on the feature map (i.e. the latent layers' outputs of CNNs), for robust detection. The hope is to reduce inter-class feature coupling and intra-class interference, while blocking background interference. To this end, a novel InLD component is designed to approximately decouple the features of different object categories into their respective channels. Meanwhile, in the spatial domain, the features of objects and background are enhanced and weakened, respectively. It is worth noting that the above idea is conceptually similar to, but inherently different from, the recent efforts [66, 6] on image-level feature map denoising (ImLD), which is used as a way of enhancing an image recognition model's robustness against attack rather than for location-sensitive object detection. Readers are referred to Table V for a quick verification that our InLD improves detection more effectively than ImLD for both horizontal and rotation cases.

On the other hand, as a closely interleaved problem with small/cluttered object detection, accurate rotation estimation is addressed by devising a novel IoU-Smooth L1 loss. It is motivated by the fact that existing state-of-the-art regression-based rotation detection methods, e.g. five-parameter regression [1, 10, 71, 77], suffer from discontinuous boundaries, which are inherently caused by the periodicity of angular (PoA) and the exchangeability of edges (EoE) [74] (see details in Sec. 3.3.2).

We conduct extensive ablation studies and experiments on multiple datasets, including aerial images from DOTA [64], DIOR [29] and UCAS-AOD [27], as well as the natural image dataset COCO [35], the scene text dataset ICDAR2015 [24], the small traffic light dataset BSTLD [3] and our newly released S^2TLD, to illustrate the promising effects of our techniques.

Fig. 2: The pipeline of our method (using RetinaNet [34] as an embodiment). Our SCRDet++ mainly consists of four modules: the basic embodiment for feature extraction, the image-level denoising module for removing common image noise, the instance-level denoising module for suppressing instance noise (i.e., inter-class feature coupling and distraction between intra-class objects and background), and the 'class+box' branch for predicting the classification score and bounding box position. 'C' and 'A' denote the number of object categories and the number of anchors at each feature point, respectively.

The preliminary content of this paper has partially appeared in the conference version [75], with the detector named SCRDet (Small, Cluttered, and Rotated Object Detector). In this journal version, we extend it to an improved detector called SCRDet++. Compared with the conference version, this journal version makes the following extensions: i) we take a novel feature map denoising perspective on the small and cluttered object detection problem, and specifically devise a new instance-level feature denoising technique for detecting small and cluttered objects with little additional computation and parameter overhead; ii) we conduct a comprehensive ablation study of our instance-level feature denoising component across datasets, which can be easily plugged into existing detectors, and our new method significantly outperforms the detector in the conference version (e.g. overall detection accuracy 72.61% versus 76.81%, and 75.35% versus 79.35% on the OBB and HBB tasks of the DOTA dataset, respectively); iii) we collect, annotate and release a new small traffic light dataset (5,786 images with 14,130 traffic light instances across five categories) to further verify the versatility and generalization performance of the instance-level denoising module; iv) last but not least, the paper has been largely rephrased and expanded to cover discussion of up-to-date works, including those on image denoising and small object detection, and the source code is released. The overall contributions are:

1) To our best knowledge, we are the first to develop the concept of instance-level noise (at least in the context of object detection), and we design a novel Instance-Level Denoising (InLD) module in the feature map. This is realized by supervised segmentation whose ground truth is approximately obtained from the bounding boxes in object detection. The proposed module effectively addresses the challenges in detecting objects of small size, arbitrary orientation, and dense distribution, with little increase in computation and parameters.

2) Towards more robust handling of arbitrarily-rotated objects, an improved smooth L1 loss is devised by adding the IoU constant factor, which is tailored to solve the boundary problem of the rotating bounding box regression.

3) We create and release a real-world traffic light dataset: S^2TLD. It consists of 5,786 images with 14,130 traffic light instances across five categories: red, green, yellow, off and wait on. It further verifies the effectiveness of InLD, and it is publicly available.

4) Our method achieves state-of-the-art performance on public datasets for rotation detection in complex scenes like the aerial images. Experiments also show that our InLD module, which can be easily plugged into existing architectures, can notably improve detection on different tasks.

2 Related Work

We first discuss existing detectors for both horizontal bounding box based detection and rotation detection. Then some representative works on image denoising and small object detection are also introduced.

2.1 Horizontal Region Object Detection

There is an emerging line of deep network based object detectors. R-CNN [16] pioneered the CNN-based detection pipeline. Subsequently, region-based models such as Fast R-CNN [17], Faster R-CNN [51], and R-FCN [7] were proposed, which achieve more cost-effective detection. SSD [39], YOLO [48] and RetinaNet [34] are representative single-stage methods, whose single-stage structure further improves detection speed. In addition to anchor-based methods, many anchor-free methods have also become popular in recent years. FCOS [57], CornerNet [26], CenterNet [11] and ExtremeNet [79] attempt to predict keypoints of objects, such as corners or extreme points, which are then grouped into bounding boxes; these detectors have also been applied to the field of remote sensing [63, 65]. R-P-Faster R-CNN [18] achieves satisfactory performance on small datasets. The method of [69] combines deformable convolution layers [8] and region-based fully convolutional networks (R-FCN) to further improve detection accuracy. The work [52] adopts top-down and skipped connections to produce a single high-level feature map of fine resolution, improving the performance of the deformable Faster R-CNN model. IoU-Adaptive R-CNN [70] reduces the loss of small object information through a new IoU-guided detection network. FMSSD [59] aggregates context information both across multiple scales and within the same-scale feature maps. However, objects in aerial images with small size, cluttered distribution and arbitrary rotation remain challenging, especially for horizontal region detection methods.

2.2 Arbitrary-Oriented Object Detection

The demand for rotation detection has been increasing recently, e.g. for aerial images and scene texts. Recent advances are mainly driven by the adoption of rotated bounding boxes or quadrangles to represent multi-oriented objects. For scene text detection, RRPN [43] employs a rotated RPN to generate rotated proposals and further performs rotated bounding box regression. TextBoxes++ [31] adopts vertex regression on SSD. RRD [32] further improves TextBoxes++ by decoupling classification and bounding box regression on rotation-invariant and rotation-sensitive features, respectively. EAST [80] directly predicts words or text lines of arbitrary orientations and quadrilateral shapes in full images, eliminating unnecessary intermediate steps with a single neural network. Recent text spotting methods like FOTS [40] show that training text detection and recognition simultaneously can greatly boost detection performance. In contrast, object detection in aerial images is more challenging: first, multi-category object detection requires good generalization of the detector; second, small objects in aerial images are usually densely arranged at a large scale; third, aerial image detection requires a more robust algorithm due to the variety of noise. Many rotation detection algorithms for aerial images are designed for different problems. ICN [1], ROI Transformer [10], and SCRDet [75] are representative two-stage aerial image rotation detectors, mainly designed from the perspective of feature extraction, and they have achieved good performance in small or dense object detection. Compared with these methods, R³Det [71] and RSDet [46] are single-stage methods that pay more attention to the trade-off between accuracy and speed. Gliding Vertex [68] and RSDet [46] achieve more accurate object detection via quadrilateral regression prediction. Axis Learning [65] and O-DNet [63] adopt the latest popular anchor-free ideas to overcome the problem of too many anchors in anchor-based detection methods.

2.3 Image Denoising

Deep learning has attracted much attention in image denoising. The survey [56] divides image denoising using CNNs into four types (see the references therein): 1) additive white noisy images; 2) real noisy images; 3) blind denoising; and 4) hybrid noisy images, i.e. combinations of noisy, blurred and low-resolution images. Image denoising also helps to improve the performance of other computer vision tasks, such as image classification [66], object detection [44] and semantic segmentation [6]. Beyond image noise, we find that there is also instance noise in the field of object detection. Instance noise describes object-aware noise, which is more widespread in object detection than object-agnostic image noise. In this paper, we explore the application of image-level and instance-level denoising techniques to object detection in complex scenes.

2.4 Small Object Detection

Small object detection remains an unsolved challenge. Common solutions for small objects include data augmentation [25], multi-scale feature fusion [33, 9], tailored sampling strategies [81, 41, 75], generative adversarial networks [28], and multi-scale training [54], etc. In this paper, we show that denoising is also an effective means to improve the detection performance of small objects. In complex scenes, the feature information of small objects is often overwhelmed by the background area, which frequently contains a large number of similar objects. Unlike ordinary image-level denoising, we use instance-level denoising to improve the detection of small objects, which is a new perspective.

This paper mainly considers designing a general-purpose instance-level feature denoising module to boost the performance of horizontal detection and rotation detection in challenging aerial imagery, as well as natural images and scene texts. Besides, we also design an IoU-Smooth L1 loss to solve the boundary problem of arbitrary-oriented object detection for more accurate rotation estimation.

3 The Proposed Method

3.1 Approach Overview

Figure 2 illustrates the pipeline of the proposed SCRDet++. It mainly consists of four modules: i) feature extraction via CNNs, which can take different forms from existing detectors, e.g. [16, 39]; ii) an image-level denoising (ImLD) module for removing common image noise, which is optional as its effect can be well covered by the subsequent InLD devised in this paper; iii) an instance-level denoising (InLD) module for suppressing instance noise (i.e., inter-class feature coupling and distraction between intra-class objects and background); and iv) the class and box branch for predicting the score and (rotated) bounding box. Specifically, we first describe our main technique, the instance-level denoising module (InLD), in Sec. 3.2, which also contains a comparison with the image-level denoising module (ImLD). We then detail the network learning, which involves a specially designed smooth loss for rotation estimation, in Sec. 3.3. Note that our experiments show that InLD can replace ImLD and play a more effective role for detection, making ImLD a dispensable component in our pipeline.

3.2 Instance-level Feature Map Denoising

In this subsection, we present our devised instance-level feature map denoising approach. To emphasize the importance of the instance-level operation, we further compare it with image-level denoising in the feature map, which has also been adopted for robust image recognition model learning in [66]. To our best knowledge, our approach is the first to use (instance-level) feature map denoising for object detection. The denoising module can be learned end-to-end together with the other modules, optimized for the object detection task.

3.2.1 Instance-Level Noise

Instance-level noise generally refers to the mutual interference among objects, as well as interference from the background. We discuss its properties in the following aspects. In particular, as shown in Fig. 3, the adverse effect on object detection is especially pronounced in the feature map, which calls for denoising in feature space rather than on the raw input image.

1) Non-objects with object-like shapes can have high responses in the feature map, especially around small objects (see the top row of Fig. 3).

2) Cluttered objects that are densely arranged tend to suffer from inter-class feature coupling and intra-class feature boundary blurring (see the middle row of Fig. 3).

3) The response of an object surrounded by background is not prominent enough (see the bottom row of Fig. 3).

3.2.2 Mathematical Modeling of Instance-Level Denoising

To dismiss instance-level noise, one can generally refer to the idea of attention mechanisms, a common way of re-weighting convolutional response maps to highlight the important parts and suppress the uninformative ones, such as spatial attention [61] and channel-wise attention [22]. Existing aerial image rotation detectors, including FADet [27], SCRDet [75] and CAD-Net [77], often use such a simple attention mechanism to re-weight the output, which can be reduced to the following general form:

F_2 = A(F_1) ⊗ F_1 = [ (W^S · W^C_1) ⊗ F_1^(1) ∥ ⋯ ∥ (W^S · W^C_C) ⊗ F_1^(C) ],   (1)

where F_1, F_2 ∈ R^(W×H×C) represent two feature maps of the input image. The attention function A(·) refers to the output of a certain attention module, e.g. [61, 22]. Note ⊗ is the element-wise product. W^S ∈ R^(W×H) and W^C ∈ R^C denote the spatial weight and channel weight, and W^C_c indicates the weight of the c-th channel. Throughout the paper, [· ∥ ·] means the concatenation operation connecting tensors along the feature map's channels.
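As a concrete illustration, this re-weighting can be sketched in a few lines of NumPy; the array names, shapes, and random weights below are our own illustration, not the paper's implementation:

```python
import numpy as np

def attention_reweight(feat, w_spatial, w_channel):
    # Sketch of the general form above: channel c of `feat` (H, W, C) is
    # scaled element-wise by the spatial weight map (H, W) and the
    # scalar channel weight w_channel[c].
    return w_spatial[..., None] * w_channel[None, None, :] * feat

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 4))      # toy feature map F_1
w_s = rng.uniform(size=(8, 8))             # spatial attention weights W^S
w_c = rng.uniform(size=(4,))               # channel attention weights W^C
f2 = attention_reweight(feat, w_s, w_c)    # re-weighted feature map F_2
```

With all weights equal to one, the map passes through unchanged, which makes explicit that such attention only rescales responses and cannot separate categories from one another.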

Fig. 3: Images (left) and their feature maps before (middle) and after (right) the instance-level denoising operation. First row: non-object with object-like shape. Second row: inter-class feature coupling and intra-class feature boundary blurring. Third row: weak feature response.
Fig. 4: Feature maps corresponding to clean images (top) and to their noisy versions (bottom). The noise is randomly generated by a Gaussian function with a mean of 0 and a variance of 0.005. The first and third columns: images; the remaining columns: feature maps. The contrast between foreground and background in the feature map of the clean image is more obvious (second column), and the boundaries between dense objects are clearer (fourth column).

However, Eq. 1 simply distinguishes the feature response between objects and background in the spatial domain, and W^C is only used to measure the importance of each channel. In other words, the interaction between intra-class and inter-class objects is not considered, which is important for detection in complex scenes. We aim to devise a new network that can not only distinguish objects from background, but also weaken the mutual interference among objects. Specifically, we propose adding an instance-level denoising (InLD) module at intermediate layers of the convolutional network. The key is to decouple the features of different object categories into their respective channels, while in the spatial domain the features of objects and background are enhanced and weakened, respectively.

As a result, our new formulation is as follows, which considers the total number Z of object categories with one additional category for background:

F_2 = Ã(F_1) ⊗ F_1 = [ W_0 ⊗ F_1^(0) ∥ W_1 ⊗ F_1^(1) ∥ ⋯ ∥ W_Z ⊗ F_1^(Z) ],  W_k = [ W_k^S · W_(k,1)^C ∥ ⋯ ∥ W_k^S · W_(k,C_k)^C ],   (2)

where Ã(F_1) is a hierarchical weight. W_k and F_1^(k) represent the weight and feature response corresponding to the k-th category, whose channel number is denoted by C_k, for k = 0, 1, …, Z (k = 0 for background). W_(k,c)^C and F_1^(k,c) represent the weight and feature of the k-th category along the c-th channel, respectively.

As can be seen from Eq. 1 and Eq. 2, Ã(F_1) can be approximated as a combination of multiple A_k(F_1), where A_k(F_1) denotes the attention function of category k. Thus we have:

Ã(F_1) ≈ [ A_0(F_1) ∥ A_1(F_1) ∥ ⋯ ∥ A_Z(F_1) ].   (3)
Without loss of generality, consider an image containing objects belonging to the first P (P ≤ Z) categories. In this paper, we aim to decouple the above formula into three parts concatenated with each other (see Fig. 5):

F_2 ≈ [ A_(1:P)(F_1) ⊗ F_1^(1:P) ∥ A_(P+1:Z)(F_1) ⊗ F_1^(P+1:Z) ∥ A_0(F_1) ⊗ F_1^(0) ].   (4)

For the background and the unseen categories not in the image, ideally the response is filtered by our devised denoising module to be as small as possible. From this perspective, Eq. 4 can be further interpreted as:

F_2 ≈ [ A_(1:P)(F_1) ⊗ F_1^(1:P) ∥ D_(P+1:Z) ∥ D_0 ],   (5)

where D_k denotes a tensor with the small feature response one aims to achieve, for each unseen category k ∈ {P+1, …, Z} and the background D_0.

In the following subsection, we show how to achieve the above decoupled feature learning among categories.
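The decoupling of Eqs. 2–5 can be sketched as follows; the per-category channel grouping and the gating by per-pixel category probabilities are our own simplification of the idea (function and variable names are assumptions, not the paper's code):

```python
import numpy as np

def inld_decouple(feat, seg_probs, channels_per_cat):
    # Category k owns `channels_per_cat[k]` consecutive channels of `feat`
    # (H, W, C); each group is gated by that category's per-pixel
    # probability map (H, W, Z+1, last index = background), so categories
    # absent from the image and the background are suppressed toward zero.
    out = np.empty_like(feat)
    start = 0
    for k, ck in enumerate(channels_per_cat):
        out[..., start:start + ck] = seg_probs[..., k:k + 1] * feat[..., start:start + ck]
        start += ck
    return out

rng = np.random.default_rng(1)
feat = rng.standard_normal((16, 16, 10))
# 3 foreground categories with 3 channels each, plus 1 background channel
seg = rng.uniform(size=(16, 16, 4))
f2 = inld_decouple(feat, seg, [3, 3, 3, 1])
```

When a category's probability map is zero everywhere, its whole channel group is zeroed out, which is exactly the small response D_k asked of unseen categories.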

Fig. 5: Feature map with decoupled category-specific feature signals along channels. The abbreviation ‘HA’, ‘SP’, ‘SH’, and ‘SV’ indicate ‘Harbor’, ‘Swimming pool’, ‘Ship’, and ‘Small vehicle’, respectively. ‘Others’ include background and unseen categories that do not appear in the image. Features of different categories are decoupled into their respective channels (top and middle), while the features of object and background are enhanced and suppressed in spatial domain, respectively (bottom).
Base Model    Image-Level Denoising          mAP (%)
R³Det [71]    none                           65.73
              bilateral, dot prod            66.94
              bilateral, gaussian            67.03
              nonlocal, dot prod             66.82
              nonlocal, gaussian             67.68
              nonlocal, gaussian, 3×3 mean   66.88
TABLE I: Ablative study of five image-level denoising settings as used in [66] on the OBB task of the DOTA dataset.

3.2.3 Implementation of Instance-Level Denoising

Based on the above derivations, we devise a practical neural-network-based implementation. Our analysis starts with the simplest case of a single channel for each category's weight in Eq. 2, namely C_k = 1. In this setting, the learned weight Ã(F_1) can be regarded as the result of semantic segmentation of the image for specific categories (a three-dimensional one-hot vector). More channels of weight (C_k > 1) in Eq. 2 can then be guided by semantic segmentation, as illustrated in Fig. 2 and Fig. 5. In the semantic segmentation task, the feature responses of each category in the layers preceding the output layer tend to be separated along the channel dimension, and the feature responses of foreground and background in the spatial dimension are also polarized. Hence one can adopt a semantic segmentation network for the operations in Eq. 5. Another advantage of holding this semantic segmentation view is that it can be conducted in an end-to-end supervised fashion, whose learned denoising weights can be more reliable and effective than self-attention based alternatives [61, 22].

In Fig. 2, we give a specific implementation as follows. The input feature map first expands its receptive field through dilated convolutions [76] and a convolutional layer; the number of dilated convolutions is set per pyramid level P3 to P7 in our experiments (cf. Table III). The feature map is then processed by two parallel convolutional layers to obtain two important outputs. One output (a three-dimensional one-hot feature map) is used to perform coarse multi-class segmentation, for which the annotated bounding boxes of the detection task serve as the approximate ground truth. The hope is that this output will guide the other output into a denoising feature map.

As shown in Fig. 5, this denoising feature map and the original feature map are combined (by element-wise product) to obtain the final decoupled feature map. The purpose is two-fold: along the channel dimension, the inter-class feature responses of different object categories (excluding the background) are basically decoupled into their respective channels; in the spatial dimension, intra-class feature boundaries are sharpened because the feature response of the object area is enhanced while that of the background is weakened. As such, the three issues raised at the beginning of this subsection are alleviated.
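A toy single-channel sketch of this two-branch head follows; the random weights, the single channel, and the sigmoid activations are simplifications for illustration and not the paper's trained network:

```python
import numpy as np

def dilated_conv3x3(x, w, rate):
    # Single-channel 3x3 convolution with dilation `rate`, zero padding.
    H, W = x.shape
    xp = np.pad(x, rate)
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            di, dj = (i - 1) * rate, (j - 1) * rate
            out += w[i, j] * xp[rate + di:rate + di + H, rate + dj:rate + dj + W]
    return out

def inld_head(feat, w_ctx, w_seg, w_den, rate=2):
    # Enlarge the receptive field with a dilated conv, then run two
    # parallel branches: a coarse segmentation map, and a denoising map
    # in (0, 1) that gates the original feature by element-wise product.
    ctx = dilated_conv3x3(feat, w_ctx, rate)
    seg = 1.0 / (1.0 + np.exp(-dilated_conv3x3(ctx, w_seg, 1)))
    den = 1.0 / (1.0 + np.exp(-dilated_conv3x3(ctx, w_den, 1)))
    return seg, den * feat

rng = np.random.default_rng(2)
feat = rng.standard_normal((12, 12))
w_ctx, w_seg, w_den = rng.standard_normal((3, 3, 3)) * 0.1
seg, denoised = inld_head(feat, w_ctx, w_seg, w_den)
```

Because the gate lies in (0, 1), the output response can only shrink relative to the input, which is the intended suppression of background and interfering instances.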

As shown in the upper right corner of Fig. 2, the classification model is decomposed into two terms, objectness and category classification:

P(class_c, object) = P(object) · P(class_c | object).   (6)

The probability map P(object) relates to whether the anchor at each feature point is an object, while the above decoupled features are directly used for object classification (as well as rotation regression, which will be discussed in Sec. 3.3).

During training, the probability map is used as a weight for the regression loss (see Eq. 10), so that ambiguous positive samples receive smaller weights and higher-quality positive samples receive more attention. We find in experiments that the introduction of the probability map speeds up the convergence of the model and improves the detection results, as shown in Table II.

3.2.4 Comparison with Image-Level Denoising

Image denoising is a fundamental task in image processing that can notably impact image recognition, as recently studied and verified in [66]. Specifically, the work [66] shows that the transformations performed by the network layers exacerbate the perturbation, and the hallucinated activations can overwhelm those due to the true signal, leading to worse prediction.

Here we also study this issue in the context of aerial images, directly borrowing the image-level denoising model of [66]. As shown in Fig. 4, we add Gaussian noise to raw aerial images and compare them with the clean ones. We visualize the same feature map, extracted from the same channel of a res3 block in the same detection network trained on clean images, for both the clean and noisy inputs. Though the noise has little visible effect on the raw image and is hard to distinguish with the naked eye, it becomes much more obvious in the feature map: objects are gradually submerged in the background, and the boundaries between objects become blurred.

Since the convolution operation and traditional denoising filters are highly correlated, we resort to a potential solution [66] which employs convolutional layers to simulate different types of differential filters, such as non-local means, bilateral filtering, mean filtering, and median filtering. Inspired by the success of these operations against adversarial attacks [66], in this paper we migrate and extend them to object detection. We show the generic form of ImLD in Fig. 2. It processes the input features with a denoising operation, such as non-local means or one of the other variants. The denoised representation is first processed by a 1×1 convolutional layer, and then added to the module's input via a residual connection. The simulation of ImLD is expressed as follows:

F_2 = F_1 + conv_1×1(denoise(F_1)),   (7)

where denoise(F_1) is the output of a certain filter, and F_1, F_2 represent the whole feature map of the input image. The effect of the imposed denoising module is shown in Table I. In the following, we further show that the more notable detection improvement comes from the InLD module, whose effect can well cover the image-level one.
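A minimal NumPy sketch of such a denoising block follows, here using the dot-product non-local means variant of [66] followed by a 1×1 channel-mixing matrix and a residual connection (the function name and the identity default are our own choices):

```python
import numpy as np

def nonlocal_denoise(feat, w_1x1=None):
    # Dot-product non-local means over all spatial positions of `feat`
    # (H, W, C), normalized by the number of positions as in the
    # dot-product variant of [66], followed by a 1x1 "conv"
    # (channel-mixing matrix) and a residual add.
    H, W, C = feat.shape
    x = feat.reshape(-1, C)           # flatten positions: (HW, C)
    aff = (x @ x.T) / x.shape[0]      # pairwise dot-product affinities
    y = (aff @ x).reshape(H, W, C)    # weighted sum over all positions
    if w_1x1 is None:
        w_1x1 = np.eye(C)             # identity mixing for the sketch
    return feat + y @ w_1x1           # residual connection

rng = np.random.default_rng(3)
feat = rng.standard_normal((6, 6, 4))
out = nonlocal_denoise(feat)
```

The residual connection means the block can learn to be a no-op (a zero 1×1 matrix returns the input unchanged), so adding it never has to hurt the baseline.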

3.3 Loss Function Design and Learning

3.3.1 Horizontal Object Detection

Horizontal and rotation detection settings are both considered. For rotation detection, we need to redefine the representation of the bounding box. Fig. 6 shows the rectangular definition with the 90-degree angle representation range [72, 75, 71, 46, 73]: θ denotes the acute angle to the x-axis, and we refer to the corresponding side as w. Note this definition is also officially adopted by OpenCV.

The regression of the bounding box is given by:

t_x = (x − x_a)/w_a,  t_y = (y − y_a)/h_a,  t_w = log(w/w_a),  t_h = log(h/h_a),  t_θ = θ − θ_a,
t′_x = (x′ − x_a)/w_a,  t′_y = (y′ − y_a)/h_a,  t′_w = log(w′/w_a),  t′_h = log(h′/h_a),  t′_θ = θ′ − θ_a,   (8)

where x, y, w, h, θ denote the box's center coordinates, width, height and angle, respectively. Variables x, x_a, x′ are for the ground-truth box, anchor box, and predicted box, respectively (likewise for y, w, h, θ).
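This parameterization can be written directly as code; the sketch below assumes the angle offset is a plain difference, matching the formulation above:

```python
import numpy as np

def rbox_targets(gt, anchor):
    # Regression targets for a rotated box (x, y, w, h, theta): centers
    # normalized by the anchor size, log width/height ratios, and a
    # plain angle offset.
    x, y, w, h, t = gt
    xa, ya, wa, ha, ta = anchor
    return np.array([(x - xa) / wa,    # t_x
                     (y - ya) / ha,    # t_y
                     np.log(w / wa),   # t_w
                     np.log(h / ha),   # t_h
                     t - ta])          # t_theta

anchor = np.array([10.0, 10.0, 4.0, 4.0, -0.3])
gt = np.array([11.0, 12.0, 4.0, 2.0, -0.5])
t = rbox_targets(gt, anchor)
```

When the ground truth equals the anchor, all five targets are zero, which is the sanity check usually applied to such encodings.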

For horizontal detection, the multi-task loss is used, defined as follows:

L = (λ_1/N) Σ_n obj_n · s_n · L_reg(v′_n, v_n) + (λ_2/N) Σ_n L_cls(p_n, t_n) + (λ_3/(h·w)) Σ_i L_InLD(u′_i, u_i),   (9)

where N indicates the number of anchors and obj_n is a binary value (obj_n = 1 for foreground and obj_n = 0 for background, so there is no regression for background). s_n indicates the probability that the current anchor is an object. v′_n denotes the predicted offset vector of the n-th anchor, and v_n is the target vector between the n-th anchor and the ground-truth it matches. t_n represents the label of the object, and p_n is the probability distribution over classes calculated by the sigmoid function. u_i and u′_i denote the label and prediction of a mask pixel, respectively. The hyper-parameters λ_1, λ_2, λ_3 control the trade-off and are set to 1 by default. The classification loss L_cls is the focal loss [34], the regression loss L_reg is the smooth L1 loss as defined in [17], and the InLD loss L_InLD is the pixel-wise softmax cross-entropy.
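A hedged sketch of this multi-task loss follows; the function signatures, the objectness weighting of the regression term, and the reduction details are our assumptions on top of the formula, not the paper's exact implementation:

```python
import numpy as np

def smooth_l1(x, beta=1.0 / 9.0):
    # Element-wise smooth L1 as in [17].
    ax = np.abs(x)
    return np.where(ax < beta, 0.5 * ax ** 2 / beta, ax - 0.5 * beta)

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    # Binary focal loss [34] on sigmoid probabilities p with labels y.
    pt = np.where(y == 1, p, 1 - p)
    a = np.where(y == 1, alpha, 1 - alpha)
    return -a * (1 - pt) ** gamma * np.log(np.clip(pt, 1e-12, None))

def multitask_loss(reg_pred, reg_tgt, fg, obj_prob, cls_p, cls_y,
                   seg_p, seg_y, lambdas=(1.0, 1.0, 1.0)):
    # fg: binary foreground indicator per anchor; obj_prob: objectness
    # probability used to weight the regression term; seg_p/seg_y:
    # per-pixel class probabilities and one-hot labels for the InLD branch.
    l_reg = (fg * obj_prob * smooth_l1(reg_pred - reg_tgt).sum(-1)).sum() / max(fg.sum(), 1.0)
    l_cls = focal_loss(cls_p, cls_y).mean()
    l_seg = -(seg_y * np.log(np.clip(seg_p, 1e-12, None))).sum(-1).mean()
    l1, l2, l3 = lambdas
    return l1 * l_reg + l2 * l_cls + l3 * l_seg

rng = np.random.default_rng(4)
n = 6
loss = multitask_loss(
    reg_pred=rng.standard_normal((n, 5)),
    reg_tgt=rng.standard_normal((n, 5)),
    fg=np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0]),
    obj_prob=rng.uniform(0.5, 1.0, n),
    cls_p=rng.uniform(0.01, 0.99, n),
    cls_y=np.array([1, 0, 0, 1, 1, 0]),
    seg_p=np.full((8, 3), 1.0 / 3.0),
    seg_y=np.eye(3)[rng.integers(0, 3, 8)],
)
```

Setting all three λ weights to 1, as the text does by default, simply sums the three terms without any task-specific balancing.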

Fig. 6: Rotation box definition (OpenCV definition). θ denotes the acute angle to the x-axis, and we refer to the corresponding side as w. The range of the angle representation is [−90°, 0°).
(a) Ideal case.
(b) Actual case.
Fig. 7: Boundary discontinuity of angle regression. Blue, green, and red bounding boxes denote the anchor/proposal, ground-truth, and prediction box, respectively.
(a) Smooth L1 loss.
(b) IoU-smooth L1 loss.
Fig. 8: Detection results by two losses. For this dense arrangement case, the angle estimation error will also make the classification even harder.
Base Model    Mask Type     Coproduct   FPS    mAP (%)
R³Det [71]    none          –           14     65.73
              Binary-Mask   –           13.5   68.12
              Multi-Mask    ✗           13     69.43
              Multi-Mask    ✓           13     69.81
TABLE II: Ablative study of the speed and accuracy of InLD on the OBB task of DOTA. Binary-Mask and Multi-Mask refer to binary and multi-class semantic segmentation, respectively. Coproduct denotes whether the objectness term in Eq. 6 is multiplied in.

3.3.2 Rotation Object Detection

In contrast, rotation detection needs to carefully address the boundary problem in angle regression, as shown in Fig. 7(a). It depicts an ideal form of regression (the blue box rotates counterclockwise to the red box), yet the loss in this situation is very large due to the periodicity of angular (PoA) and the exchangeability of edges (EoE). The model therefore has to regress in more complex forms, e.g. the blue box rotating clockwise while rescaling its width and height as in Fig. 7(b), which increases the difficulty of regression, as shown in Fig. 8(a). We introduce an IoU constant factor into the traditional smooth L1 loss to solve this problem, as shown in Eq. 11. This new loss function is named the IoU-smooth L1 loss. In the boundary case, the loss is approximately equal to zero since the IoU between the prediction and the ground truth is close to 1, eliminating the sudden increase in loss, as shown in Fig. 8(b). The new regression loss can be divided into two parts: the normalized smooth L1 term determines the direction of gradient propagation, and the IoU factor determines the magnitude of the gradient. In addition, using IoU to optimize location accuracy is consistent with the IoU-dominated evaluation metric, which is more straightforward and effective than coordinate regression.


where IoU denotes the overlap between the prediction box and the ground-truth box.
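The idea can be sketched as follows. This is a minimal sketch, assuming a 3-parameter residual (w, h, θ) and a precomputed IoU value; the actual Eq. 11 operates on the full 5-parameter box and is integrated into the training graph.

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Element-wise smooth L1 on the parameter residuals x."""
    x = np.abs(x)
    return np.where(x < beta, 0.5 * x * x / beta, x - 0.5 * beta)

def iou_smooth_l1(pred, target, iou, eps=1e-12):
    """Sketch of the IoU-smooth L1 idea: the (normalized) smooth L1 term keeps
    only the gradient *direction*, while |-log(IoU)| supplies the *magnitude*,
    so the loss stays small whenever the boxes overlap well -- even if their
    (w, h, angle) parameters sit across the angular boundary."""
    u = smooth_l1(pred - target).sum()
    direction = u / (abs(u) + eps)   # ~1; carries only the gradient direction
    magnitude = -np.log(iou + eps)   # -> 0 as IoU -> 1
    return float(direction * magnitude)

# Boundary case: two parameterisations of nearly the same physical box.
gt   = np.array([50.0, 10.0, -90.0])  # (w, h, theta) at the angular boundary
pred = np.array([10.0, 50.0, -0.5])   # edges exchanged (EoE), angle wrapped (PoA)
plain = float(smooth_l1(pred - gt).sum())  # spikes despite near-perfect overlap
ours  = iou_smooth_l1(pred, gt, iou=0.95)  # stays small since IoU is close to 1
```

The plain smooth L1 loss explodes in this boundary case, while the IoU-governed magnitude remains close to zero, which is exactly the discontinuity Fig. 8 illustrates.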

InLD RetinaNet-H [71] R3Det [71]
dilated convolution [76]
62.21 65.73
{4,4,3,2,2} 62.36 66.62
{1,1,1,1,1} 65.40 69.81
{4,4,3,2,2} 65.52 69.07
TABLE III: Ablative study by accuracy (%) of the number of dilated convolutions per pyramid level and of the InLD loss on the OBB task of DOTA. It can be found that supervised learning is the main contribution of InLD rather than the additional convolution layers.

Dataset Base Model InLD red yellow green off wait on mAP
STLD RetinaNet [34] 97.94 88.63 97.17 90.13 92.40 93.25
98.15 87.66 97.12 93.88 93.75 94.11
FPN [33] 97.98 87.55 97.42 93.42 98.31 94.93
98.04 92.84 97.69 92.06 99.08 95.94
BSTLD [3] RetinaNet [34] 69.91 19.71 77.11 22.33 47.26
70.50 24.05 77.16 22.51 48.56
FPN [33] 89.27 47.82 92.01 40.73 67.46
89.88 49.93 92.42 42.45 68.67
TABLE IV: Detailed ablative study by accuracy (%) of the effect of InLD on two traffic light datasets. Note the category ‘wait on’ is only available in our collected STLD dataset as released by this paper.
(a) red, green, off
(b) red, green, wait on
(c) red
(d) red, green, off
(e) red, yellow
(f) red
(g) red, green, wait on
(h) red
(i) red
(j) green
Fig. 9: Illustrations of the five categories and different lighting and weather conditions in our collected STLD dataset as released in the paper.
Dataset and task Base Model Baseline ImLD InLD ImLD + InLD
DOTA OBB[64] RetinaNet-H [71] 62.21 62.39 65.40 65.62 (+0.22)
RetinaNet-R [71] 61.94 63.96 64.52 64.60 (+0.08)
R3Det [71] 65.73 67.68 69.81 69.95 (+0.14)
DOTA HBB [64] RetinaNet [34] 67.76 68.05 68.33 68.50 (+0.17)
DIOR [29] RetinaNet [34] 68.05 68.42 69.36 69.35 (-0.01)
FPN [33] 71.74 71.83 73.21 73.25 (+0.04)
ICDAR2015 [24] RetinaNet-H [71] 77.13 78.68
COCO [35] FPN [33] 36.1 37.2
RetinaNet [34] 34.4 35.8
STLD RetinaNet [34] 93.25 94.11
TABLE V: Ablative study by accuracy (%) of ImLD, InLD and their combination (numbers in bracket denote relative improvement against using InLD alone) on different datasets and different detection tasks.

4 Experiments

Experiments are performed on a server with a GeForce RTX 2080 Ti GPU with 11 GB memory. We first describe the datasets, and then use them to verify the advantages of the proposed method. The source code is publicly available.

Base Method Backbone InLD Data Aug. PL BD BR GTF SV LV SH TC BC ST SBF RA HA SP HC mAP
RetinaNet-H [71] ResNet50 88.87 74.46 40.11 58.03 63.10 50.61 63.63 90.89 77.91 76.38 48.26 55.85 50.67 60.23 34.23 62.22
ResNet50 88.83 74.70 40.80 65.85 59.76 53.51 67.38 90.82 78.49 80.52 52.02 59.77 53.56 66.80 48.24 65.40
RetinaNet-R [71] ResNet50 88.92 67.67 33.55 56.83 66.11 73.28 75.24 90.87 73.95 75.07 43.77 56.72 51.05 55.86 21.46 62.02
ResNet50 88.96 70.77 33.30 62.02 66.35 75.69 73.49 90.84 78.73 77.21 47.54 55.59 51.52 58.06 37.65 64.52
R3Det [71] ResNet50 88.78 74.69 41.94 59.88 68.90 69.77 69.82 90.81 77.71 80.40 50.98 58.34 52.10 58.30 43.52 65.73
ResNet152 89.24 80.81 51.11 65.62 70.67 76.03 78.32 90.83 84.89 84.42 65.10 57.18 68.10 68.98 60.88 72.81
ResNet50 88.63 75.98 45.88 65.45 69.74 74.09 78.30 90.78 78.96 81.28 56.28 63.01 57.40 68.45 52.93 69.81
ResNet101 89.25 83.30 49.94 66.20 71.82 77.12 79.53 90.65 82.14 84.57 65.33 63.89 67.56 68.48 54.89 72.98
ResNet152 89.20 83.36 50.92 68.17 71.61 80.23 78.53 90.83 86.09 84.04 65.93 60.80 68.83 71.31 66.24 74.41
TABLE VI: Ablative study by accuracy (%) of each component in our method on the OBB task of the DOTA dataset. For RetinaNet, 'H' and 'R' represent horizontal and rotation anchors, respectively.
RetinaNet-R [71] ResNet50 88.92 67.67 33.55 56.83 66.11 73.28 75.24 90.87 73.95 75.07 43.77 56.72 51.05 55.86 21.46 62.02
ResNet50 89.27 74.93 37.01 64.49 66.00 75.87 77.75 90.76 80.35 80.31 54.75 61.17 61.07 64.78 51.24 68.65 (+6.63)
SCRDet [75] ResNet101 89.65 79.51 43.86 67.69 67.41 55.93 64.86 90.71 77.77 84.42 57.67 61.38 64.29 66.12 62.04 68.89
ResNet101 89.41 78.83 50.02 65.59 69.96 57.63 72.26 90.73 81.41 84.39 52.76 63.62 62.01 67.62 61.16 69.83 (+0.94)
FPN [33] ResNet101 90.25 85.24 55.18 73.24 70.38 73.77 77.00 90.77 87.74 86.63 68.89 63.45 72.73 67.96 60.23 74.90
ResNet101 89.77 83.90 56.30 73.98 72.60 75.63 82.82 90.76 87.89 86.14 65.24 63.17 76.05 68.06 70.24 76.20 (+1.30)
TABLE VII: Ablative study by accuracy (%) of IoU-Smooth L1 loss by using it or not in the three methods on the OBB task of DOTA dataset. Numbers in bracket denote the relative improvement by using the proposed IoU-Smooth L1 loss.
OBB (oriented bounding boxes) Backbone PL BD BR GTF SV LV SH TC BC ST SBF RA HA SP HC mAP
Two-stage methods
FR-O [64] ResNet101 [21] 79.09 69.12 17.17 63.49 34.20 37.16 36.20 89.19 69.60 58.96 49.4 52.52 46.69 44.80 46.30 52.93
R-DFPN [72] ResNet101 80.92 65.82 33.77 58.94 55.77 50.94 54.78 90.33 66.34 68.66 48.73 51.76 55.10 51.32 35.88 57.94
R2CNN [23] ResNet101 80.94 65.67 35.34 67.44 59.92 50.91 55.81 90.67 66.92 72.39 55.06 52.23 55.14 53.35 48.22 60.67
RRPN [43] ResNet101 88.52 71.20 31.66 59.30 51.85 56.19 57.25 90.81 72.84 67.38 56.69 52.84 53.08 51.94 53.58 61.01
ICN [1] ResNet101 81.40 74.30 47.70 70.30 64.90 67.80 70.00 90.80 79.10 78.20 53.60 62.90 67.00 64.20 50.20 68.20
RADet [30] ResNeXt101 [67] 79.45 76.99 48.05 65.83 65.46 74.40 68.86 89.70 78.14 74.97 49.92 64.63 66.14 71.58 62.16 69.09
RoI-Transformer [10] ResNet101 88.64 78.52 43.44 75.92 68.81 73.68 83.59 90.74 77.27 81.46 58.39 53.54 62.83 58.93 47.67 69.56
CAD-Net [77] ResNet101 87.8 82.4 49.4 73.5 71.1 63.5 76.7 90.9 79.2 73.3 48.4 60.9 62.0 67.0 62.2 69.9
SCRDet [75] ResNet101 89.98 80.65 52.09 68.36 68.36 60.32 72.41 90.85 87.94 86.86 65.02 66.68 66.25 68.24 65.21 72.61
SARD [62] ResNet101 89.93 84.11 54.19 72.04 68.41 61.18 66.00 90.82 87.79 86.59 65.65 64.04 66.68 68.84 68.03 72.95
FADet [27] ResNet101 90.21 79.58 45.49 76.41 73.18 68.27 79.56 90.83 83.40 84.68 53.40 65.42 74.17 69.69 64.86 73.28
MFIAR-Net [60] ResNet152 [21] 89.62 84.03 52.41 70.30 70.13 67.64 77.81 90.85 85.40 86.22 63.21 64.14 68.31 70.21 62.11 73.49
Gliding Vertex [68] ResNet101 89.64 85.00 52.26 77.34 73.01 73.14 86.82 90.74 79.02 86.81 59.55 70.91 72.94 70.86 57.32 75.02
Mask OBB [58] ResNeXt101 89.56 85.95 54.21 72.90 76.52 74.16 85.63 89.85 83.81 86.48 54.89 69.64 73.94 69.06 63.32 75.33
FFA [14] ResNet101 90.1 82.7 54.2 75.2 71.0 79.9 83.5 90.7 83.9 84.6 61.2 68.0 70.7 76.0 63.7 75.7
APE [83] ResNeXt-101 89.96 83.62 53.42 76.03 74.01 77.16 79.45 90.83 87.15 84.51 67.72 60.33 74.61 71.84 65.55 75.75
CSL [74] ResNet152 90.25 85.53 54.64 75.31 70.44 73.51 77.62 90.84 86.15 86.69 69.60 68.04 73.83 71.10 68.93 76.17
SCRDet++ (FPN-based) ResNet101 89.77 83.90 56.30 73.98 72.60 75.63 82.82 90.76 87.89 86.14 65.24 63.17 76.05 68.06 70.24 76.20
SCRDet++ MS (FPN-based) ResNet101 90.05 84.39 55.44 73.99 77.54 71.11 86.05 90.67 87.32 87.08 69.62 68.90 73.74 71.29 65.08 76.81
Single-stage methods
IENet [36] ResNet101 80.20 64.54 39.82 32.07 49.71 65.01 52.58 81.45 44.66 78.51 46.54 56.73 64.40 64.24 36.75 57.14
Axis Learning [65] ResNet101 79.53 77.15 38.59 61.15 67.53 70.49 76.30 89.66 79.07 83.53 47.27 61.01 56.28 66.06 36.05 65.98
P-RSDet [78] ResNet101 89.02 73.65 47.33 72.03 70.58 73.71 72.76 90.82 80.12 81.32 59.45 57.87 60.79 65.21 52.59 69.82
O-DNet [63] Hourglass104 [45] 89.31 82.14 47.33 61.21 71.32 74.03 78.62 90.76 82.23 81.36 60.93 60.17 58.21 66.98 61.03 71.04
R3Det [71] ResNet152 89.24 80.81 51.11 65.62 70.67 76.03 78.32 90.83 84.89 84.42 65.10 57.18 68.10 68.98 60.88 72.81
RSDet [46] ResNet152 90.1 82.0 53.8 68.5 70.2 78.7 73.6 91.2 87.1 84.7 64.3 68.2 66.1 69.3 63.7 74.1
SCRDet++ (R3Det-based) ResNet152 89.20 83.36 50.92 68.17 71.61 80.23 78.53 90.83 86.09 84.04 65.93 60.80 68.83 71.31 66.24 74.41
SCRDet++ MS (R3Det-based) ResNet152 88.68 85.22 54.70 73.71 71.92 84.14 79.39 90.82 87.04 86.02 67.90 60.86 74.52 70.76 72.66 76.56
HBB (horizontal bounding boxes) Backbone PL BD BR GTF SV LV SH TC BC ST SBF RA HA SP HC mAP
Two-stage methods
FR-H [51] ResNet101 80.32 77.55 32.86 68.13 53.66 52.49 50.04 90.41 75.05 59.59 57.00 49.81 61.69 56.46 41.85 60.46
ICN [1] ResNet101 90.00 77.70 53.40 73.30 73.50 65.00 78.20 90.80 79.10 84.80 57.20 62.10 73.50 70.20 58.10 72.50
IoU-Adaptive R-CNN [70] ResNet101 88.62 80.22 53.18 66.97 76.30 72.59 84.07 90.66 80.95 76.24 57.12 66.65 84.08 66.36 56.85 72.72
SCRDet [75] ResNet101 90.18 81.88 55.30 73.29 72.09 77.65 78.06 90.91 82.44 86.39 64.53 63.45 75.77 78.21 60.11 75.35
FADet [27] ResNet101 90.15 78.60 51.92 75.23 73.60 71.27 81.41 90.85 83.94 84.77 58.91 65.65 76.92 79.36 68.17 75.38
Mask OBB [58] ResNeXt-101 89.69 87.07 58.51 72.04 78.21 71.47 85.20 89.55 84.71 86.76 54.38 70.21 78.98 77.46 70.40 76.98
ARMNet [47] ResNet101 89.84 83.39 60.06 73.46 79.25 83.07 87.88 90.90 87.02 87.35 60.74 69.05 79.88 79.74 65.17 78.45
SCRDet++ (FPN-based) ResNet101 90.01 82.32 61.94 68.62 69.62 81.17 78.83 90.86 86.32 85.10 65.10 61.12 77.69 80.68 64.25 76.24
SCRDet++ MS (FPN-based) ResNet101 90.00 86.25 65.04 74.52 72.93 84.17 79.05 90.72 87.37 87.06 72.10 66.72 82.64 80.57 71.07 79.35
Single-stage methods
SBL [55] ResNet50 89.15 66.04 46.79 52.56 73.06 66.13 78.66 90.85 67.40 72.22 39.88 56.89 69.58 67.73 34.74 64.77
FMSSD [59] VGG16 [53] 89.11 81.51 48.22 67.94 69.23 73.56 76.87 90.71 82.67 73.33 52.65 67.52 72.37 80.57 60.15 72.43
EFR [15] VGG16 88.36 83.90 45.78 67.24 76.80 77.15 85.35 90.77 85.55 75.77 54.64 60.76 71.40 77.90 60.94 73.49
SCRDet++ (RetinaNet-based) ResNet152 87.89 84.64 56.94 68.03 74.67 78.75 78.50 90.80 85.60 84.98 53.56 56.75 76.66 75.08 62.75 74.37
TABLE VIII: AP and mAP (%) across categories of OBB and HBB task on DOTA. MS indicates multi-scale training and testing.
Backbone c1 c2 c3 c4 c5 f6 c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 mAP
Two-stage methods
Faster-RCNN [51] VGG16 53.6 49.3 78.8 66.2 28.0 70.9 62.3 69.0 55.2 68.0 56.9 50.2 50.1 27.7 73.0 39.8 75.2 38.6 23.6 45.4 54.1
Mask‐RCNN [19] ResNet‐50 53.8 72.3 63.2 81.0 38.7 72.6 55.9 71.6 67.0 73.0 75.8 44.2 56.5 71.9 58.6 53.6 81.1 54.0 43.1 81.1 63.5
ResNet‐101 53.9 76.6 63.2 80.9 40.2 72.5 60.4 76.3 62.5 76.0 75.9 46.5 57.4 71.8 68.3 53.7 81.0 62.3 43.0 81.0 65.2
PANet [38] ResNet‐50 61.9 70.4 71.0 80.4 38.9 72.5 56.6 68.4 60.0 69.0 74.6 41.6 55.8 71.7 72.9 62.3 81.2 54.6 48.2 86.7 63.8
ResNet‐101 60.2 72.0 70.6 80.5 43.6 72.3 61.4 72.1 66.7 72.0 73.4 45.3 56.9 71.7 70.4 62.0 80.9 57.0 47.2 84.5 66.1
CornerNet [26] Hourglass104 58.8 84.2 72.0 80.8 46.4 75.3 64.3 81.6 76.3 79.5 79.5 26.1 60.6 37.6 70.7 45.2 84.0 57.1 43.0 75.9 64.9
FPN [33] ResNet‐50 54.1 71.4 63.3 81.0 42.6 72.5 57.5 68.7 62.1 73.1 76.5 42.8 56.0 71.8 57.0 53.5 81.2 53.0 43.1 80.9 63.1
ResNet‐101 54.0 74.5 63.3 80.7 44.8 72.5 60.0 75.6 62.3 76.0 76.8 46.4 57.2 71.8 68.3 53.8 81.1 59.5 43.1 81.2 65.1
CSFF [4] ResNet-101 57.2 79.6 70.1 87.4 46.1 76.6 62.7 82.6 73.2 78.2 81.6 50.7 59.5 73.3 63.4 58.5 85.9 61.9 42.9 86.9 68.0
FPN ResNet‐50 66.57 83.00 71.89 83.02 50.41 75.74 70.23 81.08 74.83 79.03 77.74 55.29 62.06 72.26 72.10 68.64 81.20 66.07 54.56 89.09 71.74
SCRDet++ (FPN-based) ResNet‐50 66.35 83.36 74.34 87.33 52.45 77.98 70.06 84.22 77.95 80.73 81.26 56.77 63.70 73.29 71.94 71.24 83.40 62.28 55.63 90.00 73.21
SCRDet++ (FPN-based) ResNet‐101 80.79 87.67 80.46 89.76 57.83 80.90 75.23 90.01 82.93 84.51 83.55 63.19 67.25 72.59 79.20 70.44 89.97 70.71 58.82 90.25 77.80
Single-stage methods
SSD [13] VGG16 59.5 72.7 72.4 75.7 29.7 65.8 56.6 63.5 53.1 65.3 68.6 49.4 48.1 59.2 61.0 46.6 76.3 55.1 27.4 65.7 58.6
YOLOv3 [50] Darknet‐53 72.2 29.2 74.0 78.6 31.2 69.7 26.9 48.6 54.4 31.1 61.1 44.9 49.7 87.4 70.6 68.7 87.3 29.4 48.3 78.7 57.1
RetinaNet [34] ResNet‐50 53.7 77.3 69.0 81.3 44.1 72.3 62.5 76.2 66.0 77.7 74.2 50.7 59.6 71.2 69.3 44.8 81.3 54.2 45.1 83.4 65.7
ResNet‐101 53.3 77.0 69.3 85.0 44.1 73.2 62.4 78.6 62.8 78.6 76.6 49.9 59.6 71.1 68.4 45.8 81.3 55.2 44.4 85.5 66.1
RetinaNet ResNet‐50 59.98 79.02 70.85 83.37 45.25 75.93 64.53 76.87 66.63 80.25 76.75 55.94 60.70 70.38 61.45 60.15 81.13 62.76 44.52 84.46 68.05
SCRDet++ (RetinaNet-based) ResNet‐50 64.33 78.99 73.24 85.72 45.83 75.99 68.41 79.28 68.93 77.68 77.87 56.70 62.15 70.38 67.66 60.42 80.93 63.74 44.44 84.56 69.36
SCRDet++ (RetinaNet-based) ResNet‐101 71.94 84.99 79.48 88.86 52.27 79.12 77.63 89.52 77.79 84.24 83.07 64.22 65.57 71.25 76.51 64.54 88.02 70.91 47.12 85.10 75.11
TABLE IX: Accuracy (%) on DIOR. Markers indicate our own implementation (higher than the official baseline) and the use of data augmentation, respectively.

4.1 Datasets and Protocols

We choose a wide variety of public datasets from both aerial images as well as natural images and scene texts for evaluation. The details are as follows.

DOTA [64]: DOTA is a complex aerial image dataset for object detection, containing objects with a wide variety of scales, orientations, and shapes. DOTA contains 2,806 aerial images from different sensors and platforms and 15 common object categories. The fully annotated DOTA benchmark contains 188,282 instances, each labeled by an arbitrary quadrilateral. There are two detection tasks for DOTA: horizontal bounding boxes (HBB) and oriented bounding boxes (OBB). The training set, validation set, and test set account for 1/2, 1/6, and 1/3 of the entire dataset, respectively. Since the image sizes vary over a wide range, we divide the images into subimages with an overlap of 150 pixels and rescale them to a fixed size, obtaining about 27,000 patches in total. The model is trained for 135k iterations in total, and the learning rate is decayed at the 81k-th and 108k-th iterations, from 5e-4 down to 5e-6. The short names for categories are defined as (abbreviation-full name): PL-Plane, BD-Baseball diamond, BR-Bridge, GTF-Ground field track, SV-Small vehicle, LV-Large vehicle, SH-Ship, TC-Tennis court, BC-Basketball court, ST-Storage tank, SBF-Soccer-ball field, RA-Roundabout, HA-Harbor, SP-Swimming pool, and HC-Helicopter.
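The splitting step can be sketched as follows. This is a minimal sketch: the 150-pixel overlap is stated above, while the 600-pixel patch size is an assumption for illustration; crops at the border are clamped so every pixel is covered.

```python
def tile_origins(width, height, patch=600, overlap=150):
    """Top-left corners of overlapping crops covering a (width x height)
    image. The stride is patch - overlap, and the last crop in each
    direction is clamped to the image border so no pixels are missed.
    Patch size 600 is an assumed value for illustration."""
    stride = patch - overlap
    xs = list(range(0, max(width - patch, 0) + 1, stride))
    if xs[-1] + patch < width:
        xs.append(width - patch)  # clamp the final column of crops
    ys = list(range(0, max(height - patch, 0) + 1, stride))
    if ys[-1] + patch < height:
        ys.append(height - patch)  # clamp the final row of crops
    return [(x, y) for y in ys for x in xs]
```

Detections on the patches are later mapped back to the original image coordinates and merged, e.g. by non-maximum suppression.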

(a) COCO: the red boxes represent missed objects and the orange boxes represent false alarms.
(b) ICDAR2015: red arrows denote missed objects.
(c) STLD: the red box represents a missed object.
Fig. 10: Detection illustration on the datasets of COCO, ICDAR2015, STLD before and after using InLD.

DIOR [29]: DIOR is another large aerial image dataset labeled with horizontal bounding boxes. It consists of 23,463 images and 190,288 instances, covering 20 object classes. DIOR has large variation in object size, not only in spatial resolution, but also in inter-class and intra-class size variability across objects. The complexity of DIOR is also reflected in different imaging conditions, weathers, seasons, and image qualities, and it exhibits high inter-class similarity and intra-class diversity. The training protocol of DIOR is basically consistent with DOTA. The short names c1-c20 for categories in our experiment are defined as: Airplane, Airport, Baseball field, Basketball court, Bridge, Chimney, Dam, Expressway service area, Expressway toll station, Golf course, Ground track field, Harbor, Overpass, Ship, Stadium, Storage tank, Tennis court, Train station, Vehicle, and Wind mill.

UCAS-AOD [82]: UCAS-AOD contains 1,510 aerial images with 14,596 instances in two categories. In line with [64, 1], we randomly select 1,110 images for training and 400 for testing.

BSTLD [3]: BSTLD contains 13,427 camera images at a resolution of 1280×720 pixels with about 24,000 annotated small traffic lights. Specifically, 5,093 training images are annotated with 15 labels at roughly one frame every 2 seconds, but only 3,153 of these images contain instances, about 10,756 in total. Since many categories have very few instances, we reclassify them into 4 categories (red, yellow, green, off). In contrast, the 8,334 consecutive test images are annotated with 4 labels at about 15 fps. In this paper, we only use the training set of BSTLD, whose median traffic light width is 8.6 pixels, and divide it into a training set and a test set. Note that we use RetinaNet with the P2 feature level as well as FPN to verify InLD, and rescale the input images to a fixed size.

STLD: STLD is our newly collected and annotated traffic light dataset released with this paper, which contains 5,786 images at two resolutions (1,222 and 4,564 images, respectively). It contains 14,130 instances over 5 categories (namely red, yellow, green, off and wait on). The scenes cover a variety of lighting, weather and traffic conditions, including busy inner-city street scenes, dense stop-and-go traffic, strong changes in illumination/exposure, flickering/fluctuating traffic lights, multiple visible traffic lights, and image parts that can be confused with traffic lights (e.g. large round tail lights), as shown in Fig. 9. The training strategy is consistent with BSTLD.

In addition to the above datasets, we also use natural image dataset COCO [35] and scene text dataset ICDAR2015 [24] for further evaluation.

The experiments are initialized with ResNet50 [21] by default unless otherwise specified. The weight decay and momentum for all experiments are set to 0.0001 and 0.9, respectively. We employ MomentumOptimizer over 8 GPUs with a total of 8 images per minibatch. We follow the standard evaluation protocol of COCO, while for the other datasets, the anchors of the RetinaNet-based methods have areas of 32² to 512² on pyramid levels P3 to P7, respectively. At each pyramid level we use anchors at seven aspect ratios and three scales. For the rotating anchor-based method (RetinaNet-R), the angle is set by an arithmetic progression from -90° to -15° with an interval of 15 degrees.
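The anchor shapes per level can be sketched as follows. The three scales follow the standard RetinaNet setting; the particular set of seven aspect ratios listed here is an assumption for illustration, matching only the count stated above.

```python
import itertools
import math

def make_anchors(base_size, scales=(2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3)),
                 ratios=(1, 1 / 2, 2, 1 / 3, 3, 1 / 5, 5)):
    """(w, h) anchor shapes at one pyramid level. Each anchor preserves the
    area (base_size * scale)^2 while its aspect ratio h/w varies. The seven
    ratios listed here are an assumed example set."""
    anchors = []
    for s, r in itertools.product(scales, ratios):
        area = (base_size * s) ** 2
        w = math.sqrt(area / r)
        h = w * r  # h / w == r, w * h == area
        anchors.append((w, h))
    return anchors

# One set of shapes per pyramid stage P3..P7, base areas 32^2 .. 512^2:
levels = {f"P{i + 3}": make_anchors(32 * 2 ** i) for i in range(5)}
```

This yields 21 anchor shapes (3 scales × 7 ratios) per spatial location at each level; RetinaNet-R would additionally replicate each shape over the set of rotation angles.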

4.2 Ablation Study

The ablation study covers the detailed evaluation of the effect of image-level denoising (ImLD) and instance-level denoising (InLD), as well as their combination.

Effect of Image-Level Denoising. We experiment with the five denoising modules introduced in [66] on the DOTA dataset. We use our previous work R3Det [71], one of the state-of-the-art methods on DOTA, as the baseline. From Table I, one can observe that most denoising methods are effective except mean filtering. Among them, non-local with Gaussian is the most effective (1.95% higher).
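The non-local (Gaussian) module can be sketched as follows. This is a simplified NumPy version of the feature-denoising block of [66]: every output feature is a softmax-weighted average of all features, added back residually; the 1×1 convolution of the full block is omitted for brevity.

```python
import numpy as np

def nonlocal_gaussian_denoise(x):
    """Simplified non-local (Gaussian) denoising on a feature map x of shape
    (H, W, C): affinity f(x_i, x_j) = exp(x_i . x_j), normalized by softmax,
    followed by a residual connection."""
    h, w, c = x.shape
    flat = x.reshape(h * w, c)
    logits = flat @ flat.T                       # pairwise dot-product affinities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    denoised = weights @ flat                    # weighted mean over all positions
    return x + denoised.reshape(h, w, c)         # residual connection
```

Because every position attends to all others, isolated noisy activations are pulled toward the global feature statistics, which is the denoising effect exploited by ImLD.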

Effect of Instance-Level Denoising. The purpose of InLD is to decouple the features of different categories into different channels, while the features of object and non-object regions are enhanced and weakened in the spatial dimension, respectively. We design several verification tests and obtain positive results, as shown in Table II. We first explore the utility of weakening the non-object noise by binary semantic segmentation, which increases the detection mAP from 65.73% to 68.12%. The result of multi-category semantic segmentation further proves that there is indeed interference between objects, reflected by a further increase of the detection mAP (reaching 69.43%). From these two experiments, we can preliminarily conclude that interference in the non-object area is the main factor affecting the performance of the detector. It is surprising to find that coproducting the prediction score for objectness (see Eq. 6) can further improve performance and speed up training, with a final accuracy of 69.81%. Experiments in Table VI show that InLD greatly improves R3Det's performance on small objects, such as BR, SV, LV, SH, SP, and HC, which increase by 3.94%, 0.84%, 4.32%, 8.48%, 10.15%, and 9.41%, respectively. While the accuracy is greatly improved, the detection speed of the model is reduced by only 1 FPS (to 13 FPS). In addition to DOTA, we use more datasets to verify the general applicability, such as DIOR, ICDAR2015, COCO and STLD. InLD obtains 1.44%, 1.55%, 1.4% and 0.86% improvements on these four datasets according to Table V, and Fig. 10 shows the visualization results before and after using InLD. To investigate whether the performance improvement brought by InLD comes from the extra computation (dilated convolutions) or from supervised learning, we perform ablation experiments controlling the number of dilated convolutions and the supervision signal.
Table III shows that supervised learning is the main contribution of InLD rather than more convolution layers.
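The spatial re-weighting idea behind InLD can be sketched as follows. This is an illustrative simplification: the tensor shapes and the exact coproduct form of Eq. 6 are assumptions, and the real module operates inside the network with learned dilated convolutions.

```python
import numpy as np

def inld_reweight(feature, seg_logits):
    """Sketch of instance-level denoising: a multi-class segmentation head
    predicts, per pixel, background vs. object classes, and the feature map
    is re-weighted by the objectness (1 - background probability), which
    suppresses non-object regions.
    feature: (H, W, C); seg_logits: (H, W, K+1), channel 0 = background."""
    z = seg_logits - seg_logits.max(axis=-1, keepdims=True)  # stable softmax
    probs = np.exp(z)
    probs /= probs.sum(axis=-1, keepdims=True)
    objectness = 1.0 - probs[..., 0]          # per-pixel object score
    return feature * objectness[..., None]    # broadcast over channels
```

Pixels the segmentation head deems background are driven toward zero response, while object pixels pass through nearly unchanged, matching the enhance/weaken behavior described above.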

In particular, we conduct a detailed study on the SJTU Small Traffic Light Dataset (STLD), our newly released traffic light detection dataset. Compared with BSTLD, STLD has more available categories. In addition, STLD contains images at two different resolutions taken by two different cameras, which can be used for more challenging detection tasks. Table IV shows the effectiveness of InLD on these two traffic light datasets.

(a) BC and TC
(b) SBF and GTF
(c) HA and SH
(d) SP
(e) RA and SV
(f) ST
(g) BD and RA
(h) SV and LV
(i) PL and HC
(j) BR
Fig. 11: Detection illustration on OBB task on DOTA of different objects by the proposed method.

Effect of combining ImLD and InLD. A natural question is whether we can combine the two denoising structures, as shown in Fig. 2. For a more comprehensive study, we perform detailed ablation experiments on different datasets and detection tasks. The experimental results are listed in Table V, from which we make the following remarks:

1) Most of the datasets are relatively clean, so ImLD does not yield a significant gain on every dataset.

2) The performance improvement of detectors with InLD is very significant and stable, and is superior to ImLD.

3) The gain from combining ImLD and InLD is not large, mainly because their effects partly overlap: InLD weakens the feature response of non-object regions, which also suppresses image noise interference.

(a) Small vehicle and large vehicle (HBB task).
(b) Plane (OBB task).
Fig. 12: Detection examples of our proposed method in large scenarios on DOTA dataset. Our method can both effectively handle the dense (top plot with white bounding box) and rotating (bottom plot with red bounding box) cases.

Therefore, ImLD is an optional module depending on the dataset and computing environment. We will not use ImLD in subsequent experiments unless otherwise stated.

Effect of IoU-Smooth L1 Loss. The IoU-smooth L1 loss eliminates the boundary effect on the angle, making it easier for the model to regress the object coordinates. Table VII shows that the new loss improves the three detectors' accuracy to 69.83%, 68.65% and 76.20%, respectively.

Effect of Data Augmentation and Backbone. Using ResNet101 as the backbone together with data augmentation (random horizontal and vertical flipping, random graying, and random rotation), we observe a reasonable improvement, as shown in Table VI (69.81% → 72.98%). We further improve the performance from 72.98% to 74.41% by using ResNet152 as the backbone. Due to the extreme category imbalance in the dataset, data augmentation brings a large advantage, but we find that it does not diminish the contribution of InLD under these heavy settings (from 72.81% to 74.41%). All experiments are performed on the OBB task of DOTA, and the final model based on R3Det is also named R3Det++.

4.3 Comparison with the State-of-the-Art Methods

We compare our proposed InLD with the state-of-the-art algorithms on two datasets DOTA [64] and DIOR [29]. Our model outperforms all other models.

Results on DOTA. We compare our results with the state-of-the-art results on DOTA, as depicted in Table VIII. The results reported here are obtained by submitting our predictions to the official DOTA evaluation server. In the OBB task, we add the proposed InLD module to a single-stage detection method (R3Det++) and a two-stage detection method (FPN-InLD). Our methods achieve the best performance, 76.56% and 76.81%, respectively. For a fair comparison, we do not use stacks of extra tricks, oversized backbones, or model ensembles, which are often used by methods on DOTA's leaderboard. In the HBB task, we conduct the same experiments and obtain competitive detection mAP, about 74.37% and 76.24%. Model performance can be further improved to 79.35% if multi-scale training and testing are used. It is worth noting that FADet [27], SCRDet [75] and CAD-Net [77] use the simple attention mechanism described in Eq. 1, but our performance is far better than all of them. Fig. 11 shows some aerial subimages, and Fig. 12 shows aerial images of large scenes.

Results on DIOR and UCAS-AOD. DIOR is a new large-scale aerial image dataset with more categories than DOTA. In addition to the official baselines, we also give our final detection results in Table IX. It should be noted that the baselines we reproduce are higher than the official ones. In the end, we obtain 77.80% and 75.11% mAP with the FPN- and RetinaNet-based methods, respectively. Table X compares performance on the UCAS-AOD dataset. As can be seen, our method achieves 96.95% on the OBB task, the best among all existing published methods.

Method mAP Plane Car
YOLOv2 [49] 87.90 96.60 79.20
R-DFPN [72] 89.20 95.90 82.50
DRBox [37] 89.95 94.90 85.00
SARN [2] 94.90 97.60 92.20
RetinaNet-H [71] 95.47 97.34 93.60
ICN [1] 95.67
FADet [27] 95.71 98.69 92.72
R3Det [71] 96.17 98.20 94.14
SCRDet++ (R3Det-based) 96.95 98.93 94.97
TABLE X: Performance by accuracy (%) on UCAS-AOD dataset.

5 Conclusion

We have presented an instance-level denoising technique on the feature map for improving detection, especially for small and densely arranged objects, e.g. in aerial images. The core idea of InLD is to decouple the features of different categories into different channels, while the features of object and non-object regions are enhanced and weakened in the spatial dimension, respectively. Meanwhile, an IoU constant factor is added to the smooth L1 loss to address the boundary problem in rotation detection for more accurate rotation estimation. We perform extensive ablation studies and comparative experiments on multiple aerial image datasets such as DOTA, DIOR, and UCAS-AOD, the small traffic light dataset BSTLD and our newly released STLD, and demonstrate that our method achieves state-of-the-art detection accuracy. We also use the natural image dataset COCO and the scene text dataset ICDAR2015 to verify the effectiveness of our approach.


This research was supported by National Key Research and Development Program of China (2018AAA0100704, 2016YFB1001003), and NSFC (61972250, U19B2035), STCSM (18DZ1112300). The author Xue Yang is supported by Wu Wen Jun Honorary Doctoral Scholarship, AI Institute, Shanghai Jiao Tong University.


  • [1] S. M. Azimi, E. Vig, R. Bahmanyar, M. Körner, and P. Reinartz (2018) Towards multi-class object detection in unconstrained remote sensing imagery. In Asian Conference on Computer Vision, pp. 150–165. Cited by: §1, §2.2, §4.1, TABLE X, TABLE VIII.
  • [2] S. Bao, X. Zhong, R. Zhu, X. Zhang, Z. Li, and M. Li (2019) Single shot anchor refinement network for oriented object detection in optical remote sensing imagery. IEEE Access 7, pp. 87150–87161. Cited by: TABLE X.
  • [3] K. Behrendt, L. Novak, and R. Botros (2017) A deep learning approach to traffic lights: detection, tracking, and classification. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 1370–1377. Cited by: §1, TABLE IV, §4.1.
  • [4] G. Cheng, Y. Si, H. Hong, X. Yao, and L. Guo (2020) Cross-scale feature fusion for object detection in optical remote sensing images. IEEE Geoscience and Remote Sensing Letters. Cited by: TABLE IX.
  • [5] G. Cheng, P. Zhou, and J. Han (2016) Learning rotation-invariant convolutional neural networks for object detection in vhr optical remote sensing images. IEEE Transactions on Geoscience and Remote Sensing 54 (12), pp. 7405–7415. Cited by: §1.
  • [6] S. J. Cho, T. J. Jun, B. Oh, and D. Kim (2019) DAPAS: denoising autoencoder to prevent adversarial attack in semantic segmentation. arXiv preprint arXiv:1908.05195. Cited by: §1, §1, §2.3.
  • [7] J. Dai, Y. Li, K. He, and J. Sun (2016) R-fcn: object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pp. 379–387. Cited by: §1, §2.1.
  • [8] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 764–773. Cited by: §2.1.
  • [9] C. Deng, M. Wang, L. Liu, and Y. Liu (2020) Extended feature pyramid network for small object detection. arXiv preprint arXiv:2003.07021. Cited by: §2.4.
  • [10] J. Ding, N. Xue, Y. Long, G. Xia, and Q. Lu (2019-06) Learning roi transformer for oriented object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §1, §2.2, TABLE VIII.
  • [11] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian (2019) Centernet: keypoint triplets for object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 6569–6578. Cited by: §2.1.
  • [12] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: §1.
  • [13] C. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg (2017) DSSD: deconvolutional single shot detector. arXiv preprint arXiv:1701.06659. Cited by: TABLE IX.
  • [14] K. Fu, Z. Chang, Y. Zhang, G. Xu, K. Zhang, and X. Sun (2020) Rotation-aware and multi-scale convolutional neural network for object detection in remote sensing images. ISPRS Journal of Photogrammetry and Remote Sensing 161, pp. 294–308. Cited by: TABLE VIII.
  • [15] K. Fu, Z. Chen, Y. Zhang, and X. Sun (2019) Enhanced feature representation in detection for optical remote sensing images. Remote Sensing 11 (18), pp. 2095. Cited by: TABLE VIII.
  • [16] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 580–587. Cited by: §1, §2.1, §3.1.
  • [17] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 1440–1448. Cited by: §1, §2.1, §3.3.1.
  • [18] X. Han, Y. Zhong, and L. Zhang (2017) An efficient and robust integrated geospatial object detection framework for high spatial resolution remote sensing imagery. Remote Sensing 9 (7), pp. 666. Cited by: §2.1.
  • [19] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2961–2969. Cited by: TABLE IX.
  • [20] K. He, X. Zhang, S. Ren, and J. Sun (2014) Spatial pyramid pooling in deep convolutional networks for visual recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 346–361. Cited by: §1.
  • [21] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. Cited by: §4.1, TABLE VIII.
  • [22] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7132–7141. Cited by: §3.2.2, §3.2.3.
  • [23] Y. Jiang, X. Zhu, X. Wang, S. Yang, W. Li, H. Wang, P. Fu, and Z. Luo (2017) R2CNN: rotational region cnn for orientation robust scene text detection. arXiv preprint arXiv:1706.09579. Cited by: §1, TABLE VIII.
  • [24] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. Chandrasekhar, S. Lu, et al. (2015) ICDAR 2015 competition on robust reading. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1156–1160. Cited by: §1, TABLE V, §4.1.
  • [25] M. Kisantal, Z. Wojna, J. Murawski, J. Naruniec, and K. Cho (2019) Augmentation for small object detection. arXiv preprint arXiv:1902.07296. Cited by: §2.4.
  • [26] H. Law and J. Deng (2018) Cornernet: detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750. Cited by: §2.1, TABLE IX.
  • [27] C. Li, C. Xu, Z. Cui, D. Wang, T. Zhang, and J. Yang (2019) Feature-attentioned object detection in remote sensing imagery. In 2019 IEEE International Conference on Image Processing (ICIP), pp. 3886–3890. Cited by: §1, §3.2.2, §4.3, TABLE X, TABLE VIII.
  • [28] J. Li, X. Liang, Y. Wei, T. Xu, J. Feng, and S. Yan (2017) Perceptual generative adversarial networks for small object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1222–1230. Cited by: §2.4.
  • [29] K. Li, G. Wan, G. Cheng, L. Meng, and J. Han (2020) Object detection in optical remote sensing images: a survey and a new benchmark. ISPRS Journal of Photogrammetry and Remote Sensing 159, pp. 296–307. Cited by: §1, §1, TABLE V, §4.1, §4.3.
  • [30] Y. Li, Q. Huang, X. Pei, L. Jiao, and R. Shang (2020) RADet: refine feature pyramid network and multi-layer attention network for arbitrary-oriented object detection of remote sensing images. Remote Sensing 12 (3), pp. 389. Cited by: TABLE VIII.
  • [31] M. Liao, B. Shi, and X. Bai (2018) Textboxes++: a single-shot oriented scene text detector. IEEE Transactions on Image Processing 27 (8), pp. 3676–3690. Cited by: §2.2.
  • [32] M. Liao, Z. Zhu, B. Shi, G. Xia, and X. Bai (2018) Rotation-sensitive regression for oriented scene text detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5909–5918. Cited by: §2.2.
  • [33] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 4. Cited by: §1, §2.4, TABLE IV, TABLE V, TABLE VII, TABLE IX.
  • [34] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988. Cited by: Fig. 2, §1, §2.1, §3.3.1, TABLE IV, TABLE V, TABLE IX.
  • [35] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 740–755. Cited by: §1, §1, TABLE V, §4.1.
  • [36] Y. Lin, P. Feng, and J. Guan (2019) IENet: interacting embranchment one stage anchor free detector for orientation aerial object detection. arXiv preprint arXiv:1912.00969. Cited by: TABLE VIII.
  • [37] L. Liu, Z. Pan, and B. Lei (2017) Learning a rotation invariant detector with rotatable bounding box. arXiv preprint arXiv:1711.09405. Cited by: TABLE X.
  • [38] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia (2018) Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8759–8768. Cited by: TABLE IX.
  • [39] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 21–37. Cited by: §1, §2.1, §3.1.
  • [40] X. Liu, D. Liang, S. Yan, D. Chen, Y. Qiao, and J. Yan (2018) Fots: fast oriented text spotting with a unified network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5676–5685. Cited by: §2.2.
  • [41] Y. Liu, X. Tang, X. Wu, J. Han, J. Liu, and E. Ding (2019) HAMBox: delving into online high-quality anchors mining for detecting outer faces. arXiv preprint arXiv:1912.09231. Cited by: §2.4.
  • [42] Z. Liu, L. Yuan, L. Weng, and Y. Yang (2017) A high resolution optical satellite image dataset for ship recognition and some new baselines. In Proceedings of the International Conference on Pattern Recognition Applications and Methods (ICPRAM), Vol. 2, pp. 324–331. Cited by: §1.
  • [43] J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue (2018) Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia. Cited by: §1, §2.2, TABLE VIII.
  • [44] S. Milani, R. Bernardini, and R. Rinaldo (2012) Adaptive denoising filtering for object detection applications. In 2012 IEEE International Conference on Image Processing (ICIP), pp. 1013–1016. Cited by: §1, §2.3.
  • [45] A. Newell, K. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 483–499. Cited by: TABLE VIII.
  • [46] W. Qian, X. Yang, S. Peng, Y. Guo, and C. Yan (2019) Learning modulated loss for rotated object detection. arXiv preprint arXiv:1911.08299. Cited by: §2.2, §3.3.1, TABLE VIII.
  • [47] H. Qiu, H. Li, Q. Wu, F. Meng, K. N. Ngan, and H. Shi (2019) A2RMNet: adaptively aspect ratio multi-scale network for object detection in remote sensing images. Remote Sensing 11 (13), pp. 1594. Cited by: TABLE VIII.
  • [48] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 779–788. Cited by: §1, §2.1.
  • [49] J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7263–7271. Cited by: TABLE X.
  • [50] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: TABLE IX.
  • [51] S. Ren, K. He, R. Girshick, and J. Sun (2017) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1137–1149. Cited by: §1, §1, §2.1, TABLE VIII, TABLE IX.
  • [52] Y. Ren, C. Zhu, and S. Xiao (2018) Deformable faster r-cnn with aggregating multi-layer features for partially occluded object detection in optical remote sensing images. Remote Sensing 10 (9), pp. 1470. Cited by: §2.1.
  • [53] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: TABLE VIII.
  • [54] B. Singh, M. Najibi, and L. S. Davis (2018) SNIPER: efficient multi-scale training. In Advances in Neural Information Processing Systems, pp. 9310–9320. Cited by: §2.4.
  • [55] P. Sun, G. Chen, G. Luke, and Y. Shang (2018) Salience biased loss for object detection in aerial images. arXiv preprint arXiv:1810.08103. Cited by: TABLE VIII.
  • [56] C. Tian, L. Fei, W. Zheng, Y. Xu, W. Zuo, and C. Lin (2019) Deep learning on image denoising: an overview. arXiv preprint arXiv:1912.13171. Cited by: §1, §2.3.
  • [57] Z. Tian, C. Shen, H. Chen, and T. He (2019) Fcos: fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 9627–9636. Cited by: §2.1.
  • [58] J. Wang, J. Ding, H. Guo, W. Cheng, T. Pan, and W. Yang (2019) Mask obb: a semantic attention-based mask oriented bounding box representation for multi-category object detection in aerial images. Remote Sensing 11 (24), pp. 2930. Cited by: TABLE VIII.
  • [59] P. Wang, X. Sun, W. Diao, and K. Fu (2019) FMSSD: feature-merged single-shot detection for multiscale objects in large-scale remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing. Cited by: §2.1, TABLE VIII.
  • [60] P. Wang (2020) Multi-scale feature integrated attention-based rotation network for object detection in vhr aerial images. Sensors 20 (6), pp. 1686. Cited by: TABLE VIII.
  • [61] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7794–7803. Cited by: §3.2.2, §3.2.3.
  • [62] Y. Wang, Y. Zhang, Y. Zhang, L. Zhao, X. Sun, and Z. Guo (2019) SARD: towards scale-aware rotated object detection in aerial imagery. IEEE Access 7, pp. 173855–173865. Cited by: TABLE VIII.
  • [63] H. Wei, L. Zhou, Y. Zhang, H. Li, R. Guo, and H. Wang (2019) Oriented objects as pairs of middle lines. arXiv preprint arXiv:1912.10694. Cited by: §2.1, §2.2, TABLE VIII.
  • [64] G. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang (2018) DOTA: a large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §1, TABLE V, §4.1, §4.1, §4.3, TABLE VIII.
  • [65] Z. Xiao, L. Qian, W. Shao, X. Tan, and K. Wang (2020) Axis learning for orientated objects detection in aerial images. Remote Sensing 12 (6), pp. 908. Cited by: §2.1, §2.2, TABLE VIII.
  • [66] C. Xie, Y. Wu, L. van der Maaten, A. L. Yuille, and K. He (2019) Feature denoising for improving adversarial robustness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 501–509. Cited by: §1, §1, §2.3, §3.2.4, §3.2.4, §3.2.4, §3.2, TABLE I, §4.2.
  • [67] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1492–1500. Cited by: TABLE VIII.
  • [68] Y. Xu, M. Fu, Q. Wang, Y. Wang, K. Chen, G. Xia, and X. Bai (2020) Gliding vertex on the horizontal bounding box for multi-oriented object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.2, TABLE VIII.
  • [69] Z. Xu, X. Xu, L. Wang, R. Yang, and F. Pu (2017) Deformable convnet with aspect ratio constrained nms for object detection in remote sensing imagery. Remote Sensing 9 (12), pp. 1312. Cited by: §2.1.
  • [70] J. Yan, H. Wang, M. Yan, W. Diao, X. Sun, and H. Li (2019) IoU-adaptive deformable r-cnn: make full use of iou for multi-class object detection in remote sensing imagery. Remote Sensing 11 (3), pp. 286. Cited by: §2.1, TABLE VIII.
  • [71] X. Yang, Q. Liu, J. Yan, and A. Li (2019) R3Det: refined single-stage detector with feature refinement for rotating object. arXiv preprint arXiv:1908.05612. Cited by: §1, §1, §2.2, §3.3.1, TABLE I, TABLE II, TABLE III, TABLE V, §4.2, TABLE X, TABLE VI, TABLE VII, TABLE VIII.
  • [72] X. Yang, H. Sun, K. Fu, J. Yang, X. Sun, M. Yan, and Z. Guo (2018) Automatic ship detection in remote sensing images from google earth of complex scenes based on multiscale rotation dense feature pyramid networks. Remote Sensing 10 (1), pp. 132. Cited by: §3.3.1, TABLE X, TABLE VIII.
  • [73] X. Yang, H. Sun, X. Sun, M. Yan, Z. Guo, and K. Fu (2018) Position detection and direction prediction for arbitrary-oriented ships via multitask rotation region convolutional neural network. IEEE Access 6, pp. 50839–50849. Cited by: §3.3.1.
  • [74] X. Yang and J. Yan (2020) Arbitrary-oriented object detection with circular smooth label. arXiv preprint arXiv:2003.05597. Cited by: §1, TABLE VIII.
  • [75] X. Yang, J. Yang, J. Yan, Y. Zhang, T. Zhang, Z. Guo, X. Sun, and K. Fu (2019-10) SCRDet: towards more robust detection for small, cluttered and rotated objects. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Cited by: §1, §2.2, §2.4, §3.2.2, §3.3.1, §4.3, TABLE VII, TABLE VIII.
  • [76] F. Yu and V. Koltun (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. Cited by: §3.2.3, TABLE III.
  • [77] G. Zhang, S. Lu, and W. Zhang (2019) CAD-net: a context-aware detection network for objects in remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing 57 (12), pp. 10015–10024. Cited by: §1, §3.2.2, §4.3, TABLE VIII.
  • [78] L. Zhou, H. Wei, H. Li, Y. Zhang, X. Sun, and W. Zhao (2020) Objects detection for remote sensing images based on polar coordinates. arXiv preprint arXiv:2001.02988. Cited by: TABLE VIII.
  • [79] X. Zhou, J. Zhuo, and P. Krahenbuhl (2019) Bottom-up object detection by grouping extreme and center points. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 850–859. Cited by: §2.1.
  • [80] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang (2017) EAST: an efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2642–2651. Cited by: §2.2.
  • [81] C. Zhu, R. Tao, K. Luu, and M. Savvides (2018) Seeing small faces from robust anchor’s perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5127–5136. Cited by: §2.4.
  • [82] H. Zhu, X. Chen, W. Dai, K. Fu, Q. Ye, and J. Jiao (2015) Orientation robust object detection in aerial images using deep convolutional neural network. In 2015 IEEE International Conference on Image Processing (ICIP), pp. 3735–3739. Cited by: §4.1.
  • [83] Y. Zhu, X. Wu, and J. Du (2019) Adaptive period embedding for representing oriented objects in aerial images. arXiv preprint arXiv:1906.09447. Cited by: TABLE VIII.