R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object
Small and cluttered objects are common in real-world scenes and are challenging to detect. The difficulty is further pronounced when the objects are rotated, as traditional detectors routinely locate objects with horizontal bounding boxes, such that the region of interest is contaminated by background or nearby interleaved objects. In this paper, we first innovatively introduce the idea of denoising to object detection. Instance-level denoising on the feature map is performed to enhance the detection of small and cluttered objects. To handle rotation variation, we also add a novel IoU constant factor to the smooth L1 loss to address the long-standing boundary problem, which, by our analysis, is mainly caused by the periodicity of angular (PoA) and exchangeability of edges (EoE). Combining these two techniques, our proposed detector is termed SCRDet++. Extensive experiments are performed on the large public aerial image datasets DOTA, DIOR and UCAS-AOD, as well as the natural image dataset COCO, the scene text dataset ICDAR2015, the small traffic light dataset BSTLD and our newly released S^2TLD. The results show the effectiveness of our approach. Project page: https://yangxue0827.github.io/SCRDet++.html.
Visual object detection is one of the fundamental tasks in computer vision, and various general-purpose detectors [16, 20, 17, 39, 48, 7, 51] based on convolutional neural networks (CNNs) have been devised. Promising results have been achieved on public benchmarks including MS COCO and VOC2007. However, most existing detectors do not pay particular attention to some common aspects of robust object detection in the wild: small size, cluttered arrangement and arbitrary orientations. These challenges are especially pronounced for aerial images [64, 29, 5, 42], which have become an important domain for detection in practice owing to their various civil applications, e.g. resource detection, environmental monitoring, and urban planning.
In the context of remote sensing, we further present some specific discussion to motivate this paper, as shown in Fig. 1. It shall be noted that the following three aspects also prevail in other sources e.g. natural images and scene texts.
1) Small objects. Aerial images often contain small objects overwhelmed by complex surrounding scenes.
2) Cluttered arrangement. Objects e.g. vehicles and ships in aerial images are often densely arranged, leading to inter-class feature coupling and intra-class feature boundary blur.
3) Arbitrary orientations. Objects in aerial images can appear in various orientations. Rotation detection is necessary especially considering the high aspect ratio issue: the horizontal bounding box of a rotated object is looser than an aligned rotated one, such that the box contains a large portion of background or nearby cluttered objects as disturbance. Moreover, it is greatly affected by non-maximum suppression, see Fig. 1(a).
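To make the looseness concrete, a small numpy calculation (ours, not from the paper) shows how little of a horizontal box a high-aspect-ratio rotated object can occupy:

```python
import numpy as np

def hbb_area_of_rotated_rect(w, h, theta_deg):
    """Area of the axis-aligned (horizontal) bounding box of a w x h
    rectangle rotated by theta_deg."""
    t = np.deg2rad(theta_deg)
    W = w * abs(np.cos(t)) + h * abs(np.sin(t))
    H = w * abs(np.sin(t)) + h * abs(np.cos(t))
    return W * H

# A ship-like object with a 10:1 aspect ratio, rotated 45 degrees:
w, h = 100.0, 10.0
fill = (w * h) / hbb_area_of_rotated_rect(w, h, 45.0)
print(f"object fills {fill:.1%} of its horizontal box")  # well under 20%
```

The remaining area of the horizontal box is background or neighboring objects, which is exactly the disturbance discussed above.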
As described above, the small/cluttered object problem can be interleaved with rotation variance. In this paper, we address the first challenge by seeking a new way of dismissing noisy interference from both the background and other foreground objects. For rotation alignment, a new rotation loss is devised accordingly. Both of our techniques can serve as out-of-the-box plug-ins for existing detectors [51, 33, 34, 43, 23, 71]. We give further description as follows.
For small and cluttered object detection, we devise a denoising module; in fact, denoising has not previously been studied for object detection. We observe two common types of noise that are orthogonal to each other: i) image-level noise, which is object-agnostic, and ii) instance-level noise, often in the form of mutual interference between objects as well as background interference. Such noises are ubiquitous and pronounced in remotely sensed aerial images. Denoising has long been studied [56, 66, 44, 6] in image processing, but it is rarely designed for object detection: it is usually performed on the raw image for the purpose of image enhancement rather than for downstream semantic tasks, let alone in an end-to-end manner.
In this paper, we explore performing instance-level denoising (InLD), particularly on the feature map (i.e. the latent layers' outputs of CNNs), for robust detection. The hope is to reduce inter-class feature coupling and intra-class interference, while blocking background interference. To this end, a novel InLD component is designed to approximately decouple the features of different object categories into their respective channels. Meanwhile, in the spatial domain, the features of objects and background are enhanced and weakened, respectively. It is worth noting that the above idea is conceptually similar to but inherently different from recent efforts [66, 6] on image-level feature map denoising (ImLD), which is used to enhance an image recognition model's robustness against attack, rather than for location-sensitive object detection. Readers are referred to Table V for a quick verification that our InLD improves detection more effectively than ImLD for both the horizontal and rotation cases.
On the other hand, as a problem closely interleaved with small/cluttered object detection, accurate rotation estimation is addressed by devising a novel IoU-Smooth L1 loss. It is motivated by the fact that existing state-of-the-art regression-based rotation detection methods, e.g. five-parameter regression [1, 10, 71, 77], suffer from discontinuous boundaries, which is inherently caused by the periodicity of angular (PoA) and exchangeability of edges (EoE) (see details in Sec. 3.3.2).
We conduct extensive ablation studies and experiments on multiple datasets, including aerial images from DOTA , DIOR  and UCAS-AOD , as well as the natural image dataset COCO , the scene text dataset ICDAR2015 , the small traffic light dataset BSTLD  and our newly released S^2TLD, to illustrate the promising effects of our techniques.
The preliminary content of this paper has partially appeared in the conference version^1, with the detector named SCRDet (Small, Cluttered, and Rotated Object Detector). In this journal version, we present an improved detector called SCRDet++. The overall contributions are:
^1 Compared with the conference version, this journal version makes the following extensions: i) we take a novel feature map denoising perspective on the small and cluttered object detection problem, and specifically devise a new instance-level feature denoising technique for detecting small and cluttered objects with little additional computation and parameter overhead; ii) we present a comprehensive ablation study of our instance-level feature denoising component across datasets; the component can be easily plugged into existing detectors, and our new method significantly outperforms our previous detector in the conference version (e.g. overall detection accuracy 72.61% versus 76.81%, and 75.35% versus 79.35% on the OBB and HBB tasks of the DOTA dataset, respectively); iii) we collect, annotate and release a new small traffic light dataset (5,786 images with 14,130 traffic light instances across five categories) to further verify the versatility and generalization of the instance-level denoising module; iv) last but not least, the paper has been largely rephrased and expanded to cover the discussion of up-to-date works, including those on image denoising and small object detection. Finally, the source code is released.
1) To the best of our knowledge, we are the first to develop the concept of instance-level noise, at least in the context of object detection, and to design a novel Instance-Level Denoising (InLD) module in the feature map. This is realized by supervised segmentation whose approximate ground truth is obtained from the bounding boxes in object detection. The proposed module effectively addresses the challenges of detecting objects of small size, arbitrary orientation, and dense distribution, with little increase in computation and parameters.
2) Towards more robust handling of arbitrarily-rotated objects, an improved smooth L1 loss is devised by adding an IoU constant factor, which is tailored to solve the boundary problem of rotated bounding box regression.
3) We create and release a real-world traffic light dataset: S^2TLD. It consists of 5,786 images with 14,130 traffic light instances across five categories: red, green, yellow, off and wait on. It further verifies the effectiveness of InLD, and it is available at https://github.com/Thinklab-SJTU/S2TLD.
4) Our method achieves state-of-the-art performance on public datasets for rotation detection in complex scenes like the aerial images. Experiments also show that our InLD module, which can be easily plugged into existing architectures, can notably improve detection on different tasks.
We first discuss existing detectors for both horizontal bounding box based detection and rotation detection. Then some representative works on image denoising and small object detection are also introduced.
There is an emerging line of deep network based object detectors. R-CNN  pioneered the CNN-based detection pipeline. Subsequently, region-based models such as Fast R-CNN , Faster R-CNN , and R-FCN  were proposed, which achieve more cost-effective detection. SSD , YOLO  and RetinaNet  are representative single-stage methods, whose single-stage structure further improves detection speed. In addition to anchor-based methods, many anchor-free methods have also become popular in recent years. FCOS , CornerNet , CenterNet  and ExtremeNet  attempt to predict keypoints of objects such as corners or extreme points, which are then grouped into bounding boxes; these detectors have also been applied to the field of remote sensing [63, 65]. R-P-Faster R-CNN  achieves satisfactory performance on small datasets. The method  combines deformable convolution layers  and region-based fully convolutional networks (R-FCN)  to further improve detection accuracy. The work  adopts top-down and skip connections to produce a single high-level feature map of fine resolution, improving the performance of the deformable Faster R-CNN model. IoU-Adaptive R-CNN  reduces the loss of small object information with a new IoU-guided detection network. FMSSD  aggregates context information from both multi-scale and same-scale feature maps. However, objects in aerial images with small size, cluttered distribution and arbitrary rotation remain challenging, especially for horizontal region detection methods.
The demand for rotation detection has been increasing recently, e.g. for aerial images and scene texts. Recent advances are mainly driven by the adoption of rotated bounding boxes or quadrangles to represent multi-oriented objects. For scene text detection, RRPN  employs a rotated RPN to generate rotated proposals and further performs rotated bounding box regression. TextBoxes++  adopts vertex regression on SSD. RRD  further improves TextBoxes++ by decoupling classification and bounding box regression on rotation-invariant and rotation-sensitive features, respectively. EAST  directly predicts words or text lines of arbitrary orientations and quadrilateral shapes in full images, eliminating unnecessary intermediate steps with a single neural network. Recent text spotting methods like FOTS  show that training text detection and recognition simultaneously can greatly boost detection performance. In contrast, object detection in aerial images is more challenging: first, multi-category object detection requires good generalization of the detector; second, small objects in aerial images are usually densely arranged at a large scale; third, aerial image detection requires a more robust algorithm due to the variety of noises. Many rotation detection algorithms for aerial images are designed for different problems. ICN , ROI Transformer , and SCRDet  are representative two-stage aerial image rotation detectors, mainly designed from the perspective of feature extraction, and they have achieved good performance on small or dense object detection. Compared to these methods, R3Det  and RSDet  are single-stage detection methods which pay more attention to the trade-off between accuracy and speed. Gliding Vertex  and RSDet  achieve more accurate object detection via quadrilateral regression prediction.
Axis Learning  and O-DNet  adopt the recently popular anchor-free paradigm to overcome the problem of too many anchors in anchor-based detection methods.
Deep learning has obtained much attention in image denoising. The survey  divides image denoising using CNNs into four types (see the references therein): 1) additive white noisy images; 2) real noisy images; 3) blind denoising and 4) hybrid noisy images, as the combination of noisy, blurred and low-resolution images. In addition, image denoising also helps to improve the performance of other computer vision tasks, such as image classification , object detection , semantic segmentation , etc. In addition to image noise, we find that there is also instance noise in the field of object detection. Instance noise describes object-aware noise, which is more widespread in object detection than object-agnostic image noise. In this paper, we will explore the application of image-level denoising and instance-level denoising techniques to object detection in complex scenes.
Small object detection remains an unsolved challenge. Common small object solutions include data augmentation , multi-scale feature fusion [33, 9], tailored sampling strategies [81, 41, 75], generative adversarial networks , and multi-scale training  etc. In this paper, we show that denoising is also an effective means to improve the detection performance of small objects. In complex scenes, the feature information of small objects is often overwhelmed by the background area, which often contains a large number of similar objects. Unlike ordinary image-level denoising, we will use instance-level denoising to improve the detection capabilities of small objects, which is a new perspective.
This paper mainly considers designing a general-purpose instance-level feature denoising module to boost the performance of horizontal and rotation detection in challenging aerial imagery, as well as in natural images and scene texts. Besides, we also design an IoU-Smooth L1 loss to solve the boundary problem of arbitrary-oriented object detection for more accurate rotation estimation.
Figure 2 illustrates the pipeline of the proposed SCRDet++. It mainly consists of four modules: i) feature extraction via CNNs, which can take the form of different backbones from existing detectors, e.g. [16, 39]; ii) an image-level denoising (ImLD) module for removing common image noise, which is optional as its effect can be largely covered by the subsequent InLD devised in this paper; iii) an instance-level denoising (InLD) module for suppressing instance noise (i.e., inter-class feature coupling and distraction between intra-class objects and background); and iv) the class and box branch for predicting the score and (rotated) bounding box. Specifically, we first describe our main technique, the instance-level denoising module (InLD), in Sec. 3.2, which further contains a comparison with the image-level denoising module (ImLD). Finally, we detail the network learning, which involves a specially designed smooth loss for rotation estimation, in Sec. 3.3. Note that in experiments we show that InLD can replace ImLD and play a more effective role for detection, making ImLD a dispensable component in our pipeline.
In this subsection, we present our devised instance-level feature map denoising approach. To emphasize the importance of the instance-level operation, we further compare it with image-level denoising in the feature map, which has also been adopted for robust image recognition model learning in . To our best knowledge, our approach is the first to use (instance-level) feature map denoising for object detection. The denoising module can be learned end-to-end together with the other modules, optimized for the object detection task.
Instance-level noise generally refers to the mutual interference among objects, as well as that from the background. We discuss its properties in the following aspects. In particular, as shown in Fig. 3, the adverse effect on object detection is especially pronounced in the feature map, which calls for denoising in feature space rather than on the raw input image.
1) Non-object regions with object-like shapes can have high responses in the feature map, especially around small objects (see the top row of Fig. 3).
2) Densely arranged, cluttered objects tend to suffer from inter-class feature coupling and intra-class feature boundary blurring (see the middle row of Fig. 3).
3) The response of an object surrounded by background is often not prominent enough (see the bottom row of Fig. 3).
To dismiss instance-level noise, one can generally refer to the idea of attention mechanisms, a common way of re-weighting convolutional response maps to highlight the important parts and suppress the uninformative ones, such as spatial attention  and channel-wise attention . Existing aerial image rotation detectors, including FADet , SCRDet  and CAD-Net , often use such a simple attention mechanism to re-weight the output, which can be reduced to the following general form:
where F and F' represent the input and output feature maps of the image. The attention function A(·) refers to the output of a certain attention module, e.g. [61, 22]. Note ⊙ is the element-wise product. w_s and w_c denote the spatial weight and the channel weight, and w_c^i indicates the weight of the i-th channel. Throughout the paper, [·; ·] means the concatenation operation for connecting tensors along the feature map's channels.
However, Eq. 1 only distinguishes feature responses between objects and background in the spatial domain, and the channel weight merely measures the importance of each channel. In other words, the interaction between intra-class and inter-class objects is not considered, which is important for detection in complex scenes. We aim to devise a new network that can not only distinguish objects from background, but also weaken the mutual interference among objects. Specifically, we propose adding an instance-level denoising (InLD) module at intermediate layers of the convolutional network. The key is to decouple the features of different object categories into their respective channels, while in the spatial domain the features of objects and background are enhanced and weakened, respectively.
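As a sketch of the general attention form above, the following numpy snippet (illustrative only; the weight-generating modules of [61, 22] are abstracted into precomputed w_s and w_c) re-weights a feature map spatially and channel-wise:

```python
import numpy as np

def attention_reweight(F, w_s, w_c):
    """Re-weight feature map F (C, H, W) by a spatial attention map
    w_s (H, W) and per-channel attention weights w_c (C,)."""
    return F * w_s[None, :, :] * w_c[:, None, None]

rng = np.random.default_rng(0)
F = rng.standard_normal((4, 8, 8))
w_s = 1.0 / (1.0 + np.exp(-rng.standard_normal((8, 8))))  # sigmoid-squashed
w_c = 1.0 / (1.0 + np.exp(-rng.standard_normal(4)))
F_out = attention_reweight(F, w_s, w_c)
```

Note that such re-weighting is per-position and per-channel only; it carries no notion of which category a channel serves, which is exactly the limitation discussed above.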
As a result, our new formulation is as follows, which considers the total number of object categories with one additional category for background:
where W is a hierarchical weight. W^c and F^c represent the weight and feature response corresponding to the c-th category, whose channel number is denoted by l_c, for c = 0, 1, ..., C (with c = 0 for the background). W^c_i and F^c_i represent the weight and feature of the c-th category along the i-th channel, respectively.
Without loss of generality, consider an image containing objects belonging to the first P (P ≤ C) categories. In this paper, we aim to decouple the above formula into three parts concatenated with each other (see Fig. 5):
For the background and for categories not present in the image, ideally the response is filtered by our devised denoising module to be as small as possible. From this perspective, Eq. 4 can be further interpreted as:
where ε^c denotes a tensor with the small feature response one aims to achieve, for each absent category c and the background.
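The decoupling idea can be sketched as follows, assuming for illustration that each category owns a fixed group of channels and that a (soft) per-category segmentation map is available; the function names are ours, not the paper's:

```python
import numpy as np

def decouple_by_segmentation(F, seg, ch_per_cat):
    """Gate per-category channel groups of F (K*ch_per_cat, H, W) by the
    corresponding (soft) segmentation maps seg (K, H, W), so each
    category's features live in their own channels and responses of
    absent categories / background are suppressed."""
    K, H, W = seg.shape
    groups = F.reshape(K, ch_per_cat, H, W)
    return (groups * seg[:, None]).reshape(K * ch_per_cat, H, W)

# toy example: 2 object categories + background, 2 channels each
seg = np.zeros((3, 4, 4))
seg[1, :, :2] = 1.0            # category 1 occupies the left half
F = np.ones((6, 4, 4))
F_dec = decouple_by_segmentation(F, seg, ch_per_cat=2)
```

After gating, category 1's channels respond only inside its own region, while channels of the absent category are driven toward the small response ε described above.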
In the following subsection, we show how to achieve the above decoupled feature learning among categories.
|Base Model||Image-Level Denoising||mAP (%)|
| ||bilateral, dot prod||66.94|
| ||nonlocal, dot prod||66.82|
| ||nonlocal, gaussian, 3x3 mean||66.88|
Based on the above derivations, we devise a practical neural network based implementation. Our analysis starts from the simplest case with a single channel for each category's weight in Eq. 2, i.e. l_c = 1. In this setting, the learned weight can be regarded as the result of semantic segmentation of the image for specific categories (a three-dimensional one-hot tensor). More channels of weighting can then be guided by semantic segmentation, as illustrated in Fig. 2 and Fig. 5. In a semantic segmentation task, the feature responses of each category in the layers preceding the output layer tend to be separated along the channel dimension, and the foreground and background responses in the spatial dimension are also polarized. Hence one can adopt a semantic segmentation network for the operations in Eq. 5. Another advantage of this semantic segmentation view is that it can be trained in an end-to-end supervised fashion, whose learned denoising weights can be more reliable and effective than self-attention based alternatives [61, 22].
In Fig. 2, we give a specific implementation as follows. The input feature map first expands its receptive field via dilated convolutions  and a convolutional layer; the dilation rates take different values on pyramid levels P3 to P7, respectively, as set in our experiments. The feature map is then processed by two parallel convolutional layers to obtain two important outputs. One output (a three-dimensional one-hot feature map) is used to perform coarse multi-class segmentation, for which the annotated bounding boxes of the detection task serve as approximate ground truth. The hope is that this output will guide the other output into a denoising feature map.
As shown in Fig. 5, this denoising feature map and the original feature map are combined (by a dot operation) to obtain the final decoupled feature map. The purpose is two-fold: along the channel dimension, the inter-class feature responses of different object categories (excluding the background) are basically decoupled into their respective channels; in the spatial dimension, intra-class feature boundaries are sharpened because the feature response of the object area is enhanced while that of the background is weakened. As such, the three issues raised at the beginning of this subsection are alleviated.
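A minimal sketch of the InLD head described above, with the dilated-convolution stage omitted and the two parallel convolutions reduced to 1x1 convolutions implemented via einsum (all names and shapes are illustrative assumptions, not the released implementation):

```python
import numpy as np

def conv1x1(x, w):
    """1x1 convolution: x (C_in, H, W), w (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum('oi,ihw->ohw', w, x)

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inld_forward(F, w_seg, w_den):
    """Two parallel 1x1 branches: one predicts coarse multi-class
    segmentation (supervised by box-derived masks), the other a gating
    map in [0, 1] whose dot product with F yields the denoised feature."""
    seg_prob = softmax(conv1x1(F, w_seg))            # (K, H, W)
    gate = 1.0 / (1.0 + np.exp(-conv1x1(F, w_den)))  # (C, H, W)
    return seg_prob, F * gate

rng = np.random.default_rng(1)
F = rng.standard_normal((8, 16, 16))
seg_prob, F_dn = inld_forward(F, rng.standard_normal((4, 8)),
                              rng.standard_normal((8, 8)))
```

In training, a segmentation loss on seg_prob supervises the branch, which in turn shapes the gating map; at inference only the gated feature F_dn is passed on.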
As shown in the upper right corner of Fig. 2, the classification model is decomposed into two terms: objectness and category classification, as written by:
This probability map relates to whether the anchor at each feature point corresponds to an object, while the above decoupled features are directly used for object classification (as well as rotation regression, which will be discussed in Sec. 3.3).
During training, the probability map will be used as a weight for the regression loss (see Eq. 10), making those ambiguous positive samples get smaller weights and giving higher quality positive samples more attention. We find in the experiment that the introduction of the probability map can speed up the convergence of the model and improve the detection results, as shown in Table II.
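The decomposition and the probability-map weighting can be sketched as follows (a toy illustration of the weighting idea in Eq. 10, not the exact training code):

```python
import numpy as np

def final_score(p_obj, p_cls):
    """Decomposed classification: objectness times per-category probability."""
    return p_obj * p_cls

def weighted_reg_loss(reg_loss, p_obj):
    """Weight each positive anchor's regression loss by its objectness
    probability, so ambiguous positives contribute less."""
    return float(np.sum(p_obj * reg_loss) / np.maximum(np.sum(p_obj), 1e-6))

p_obj = np.array([0.9, 0.2])      # a confident and an ambiguous positive
reg = np.array([0.5, 2.0])        # their raw regression losses
loss = weighted_reg_loss(reg, p_obj)
```

With the weighting, the ambiguous anchor's large regression loss is strongly down-weighted compared to a uniform average, giving higher-quality positives more attention as described above.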
Image denoising is a fundamental task in image processing, and it can impose a notable impact on image recognition, as has been recently studied and verified in . Specifically, the work  shows that the transformations performed by the network layers exacerbate the perturbation, and the hallucinated activations can overwhelm those due to the true signal, leading to worse predictions.
Here we also study this issue in the context of aerial images, directly borrowing the image-level denoising model of . As shown in Fig. 4, we add Gaussian noise to raw aerial images and compare them with the clean ones. We visualize the same feature map, extracted from the same channel of a res3 block in the same detection network trained on clean images, for the clean and noisy inputs. The noise is subtle and difficult to distinguish with the naked eye; however, it becomes much more pronounced in the feature map, such that objects are gradually submerged in the background or the boundaries between objects become blurred.
Since the convolution operation and traditional denoising filters are highly correlated, we resort to a potential solution  which employs convolutional layers to simulate different types of differential filters, such as non-local means, bilateral filtering, mean filtering, and median filtering. Inspired by the success of these operations in defending adversarial attacks , in this paper we migrate and extend these differential operations to object detection. The generic form of ImLD is shown in Fig. 2: it processes the input features by a denoising operation, such as non-local means or other variants; the denoised representation is first processed by a 1×1 convolutional layer and then added to the module's input via a residual connection. ImLD is expressed as follows:
where D(·) is the output of a certain denoising filter, and F, F' represent the input and output feature maps of the image. The effect of the imposed denoising module is shown in Table I. In the following, we further show that the more notable detection improvement comes from the InLD module, whose effect can well cover the image-level one.
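A minimal numpy sketch of such an ImLD block, using a 3x3 mean filter as the denoising operation (one of the variants above) followed by a 1x1 convolution and a residual connection:

```python
import numpy as np

def mean_filter3x3(x):
    """Per-channel 3x3 mean filtering with edge padding (one possible
    denoising operation; non-local means etc. could be substituted)."""
    C, H, W = x.shape
    p = np.pad(x, ((0, 0), (1, 1), (1, 1)), mode='edge')
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += p[:, dy:dy + H, dx:dx + W]
    return out / 9.0

def imld_block(x, w1x1):
    """Denoise, apply a 1x1 convolution, then add back to the input
    through a residual connection."""
    return x + np.einsum('oi,ihw->ohw', w1x1, mean_filter3x3(x))
```

The residual connection lets the block default to a near-identity mapping when the 1x1 weights are small, so denoising can be learned without disrupting the signal.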
Horizontal and rotation detection settings are both considered. For rotation detection, we need to define the representation of the rotated bounding box. Fig. 6 shows the 90-degree angle-based rectangle definition [72, 75, 71, 46, 73]: θ denotes the acute angle to the x-axis, formed with the side we refer to as w, while the other side is h. Note this definition is also officially adopted by OpenCV (https://opencv.org/).
The regression of the bounding box is given by:
t_x = (x − x_a)/w_a,  t_y = (y − y_a)/h_a,  t_w = log(w/w_a),  t_h = log(h/h_a),  t_θ = θ − θ_a,
t'_x = (x' − x_a)/w_a,  t'_y = (y' − y_a)/h_a,  t'_w = log(w'/w_a),  t'_h = log(h'/h_a),  t'_θ = θ' − θ_a,
where x, y, w, h, θ denote the box's center coordinates, width, height and angle, respectively. Variables x, x_a, x' are for the ground-truth box, anchor box, and predicted box, respectively (likewise for y, w, h, θ).
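The standard five-parameter encoding and decoding can be written as follows (a sketch consistent with the offsets above; the angle offset is left unnormalized for simplicity):

```python
import numpy as np

def encode(box, anchor):
    """Offsets (t_x, t_y, t_w, t_h, t_theta) of a five-parameter box
    (x, y, w, h, theta) relative to an anchor."""
    x, y, w, h, th = box
    xa, ya, wa, ha, tha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha), th - tha])

def decode(t, anchor):
    """Inverse of encode: recover the box from offsets and the anchor."""
    xa, ya, wa, ha, tha = anchor
    return np.array([t[0] * wa + xa, t[1] * ha + ya,
                     wa * np.exp(t[2]), ha * np.exp(t[3]), t[4] + tha])
```

Decoding the encoded offsets against the same anchor recovers the original box, which is the invariant any implementation of these formulas should satisfy.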
For horizontal detection, a multi-task loss is used, defined as follows:
L = (λ_1/N) Σ_n obj_n · L_reg(v'_n, v_n) + (λ_2/N) Σ_n L_cls(p_n, t_n) + (λ_3/(h × w)) Σ_{i,j} L_InLD(u'_{ij}, u_{ij}),
where N indicates the number of anchors and obj_n is a binary value (obj_n = 1 for foreground and obj_n = 0 for background, with no regression for background). p_n indicates the probability that the current anchor is an object. v'_n denotes the predicted offset vector of the n-th anchor, and v_n is the target vector between the n-th anchor and the ground truth it matches. t_n represents the label of the object, and u_{ij}, u'_{ij} denote the label and prediction of the mask pixel at (i, j), respectively. The hyper-parameters λ_1, λ_2, λ_3 control the trade-off and are set to 1 by default. The classification loss L_cls is the focal loss , the regression loss L_reg is the smooth L1 loss as defined in , and the InLD loss L_InLD is the pixel-wise softmax cross-entropy.
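The two per-anchor loss terms can be sketched as follows (textbook forms of focal loss and smooth L1, not the authors' exact implementation):

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Smooth L1: quadratic below beta, linear above."""
    ax = np.abs(x)
    return np.where(ax < beta, 0.5 * ax ** 2 / beta, ax - 0.5 * beta)

def focal_loss(p, t, alpha=0.25, gamma=2.0):
    """Binary focal loss for predicted probability p and label t in {0, 1}:
    easy, well-classified anchors are down-weighted by (1 - p_t)^gamma."""
    pt = np.where(t == 1, p, 1.0 - p)
    a = np.where(t == 1, alpha, 1.0 - alpha)
    return -a * (1.0 - pt) ** gamma * np.log(np.clip(pt, 1e-9, 1.0))
```

The focal term keeps the dense single-stage classifier from being dominated by easy background anchors, while smooth L1 bounds the gradient of large regression errors.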
|Base Model||Mask Type||Coproduct||FPS||mAP (%)|
In contrast, rotation detection needs to carefully address the boundary problem. In particular, there exists a boundary problem for the angle regression, as shown in Fig. 7(a). It shows an ideal form of regression (the blue box rotates counterclockwise to the red box), but the loss in this situation is very large due to the periodicity of angular (PoA) and exchangeability of edges (EoE). The model therefore has to regress in other, more complex forms, such as the blue box rotating clockwise while scaling w and h as in Fig. 7(b), increasing the difficulty of regression, as shown in Fig. 8(a). We introduce an IoU constant factor into the traditional smooth L1 loss to solve this problem, as shown in Eq. 11. This new loss function is named the IoU-smooth L1 loss. It can be seen that in the boundary case, the loss function is approximately equal to |-log(IoU)| ≈ 0, eliminating the sudden increase in loss caused by PoA and EoE, as shown in Fig. 8(b). The new regression loss can be divided into two parts: the normalized smooth L1 term determines the direction of gradient propagation, and |-log(IoU)| the magnitude of the gradient. In addition, using IoU to optimize location accuracy is consistent with the IoU-dominated metric, which is more straightforward and effective than coordinate regression.
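A numeric sketch of both the boundary problem and the structure of the IoU-smooth L1 loss (the IoU value is supplied by hand here; computing rotated IoU is omitted, and the scalar sign stands in for the per-coordinate gradient direction):

```python
import numpy as np

def smooth_l1(x):
    ax = np.abs(x)
    return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

def iou_smooth_l1(t_pred, t_gt, iou):
    """Direction from the smooth L1 term, magnitude from |-log(IoU)|:
    near the angular boundary the IoU stays high, so the loss stays small."""
    u = float(np.sum(smooth_l1(t_pred - t_gt)))
    return (u / (abs(u) + 1e-9)) * abs(-np.log(max(iou, 1e-9)))

# Two parameterisations of (almost) the same physical box across the
# boundary: (w=70, h=10, theta=-90 deg) vs (w=10, h=70, theta=0 deg).
t_gt = np.array([0.0, 0.0, np.log(70.0), np.log(10.0), np.deg2rad(-90.0)])
t_pred = np.array([0.0, 0.0, np.log(10.0), np.log(70.0), 0.0])
plain = float(np.sum(smooth_l1(t_pred - t_gt)))   # large despite near-identity
smoothed = iou_smooth_l1(t_pred, t_gt, iou=0.95)  # small, as IoU is high
```

The plain smooth L1 value is large even though the two boxes almost coincide, which is exactly the PoA/EoE discontinuity; driving the magnitude by IoU removes the spike.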
where IoU denotes the overlap between the predicted box and the ground-truth box.
|InLD||RetinaNet-H ||R3Det |
|dilated convolution |
Ablation of the dilated convolution and supervised learning in InLD on the OBB task of DOTA. It can be found that supervised learning is the main contribution of InLD rather than more convolution layers.
|Dataset||Base Model||InLD||red||yellow||green||off||wait on||mAP|
|BSTLD ||RetinaNet ||69.91||19.71||77.11||22.33||–||47.26|
|Dataset and task||Base Model||Baseline||ImLD||InLD||ImLD + InLD|
|DOTA OBB||RetinaNet-H ||62.21||62.39||65.40||65.62 (+0.22)|
|RetinaNet-R ||61.94||63.96||64.52||64.60 (+0.08)|
|R3Det ||65.73||67.68||69.81||69.95 (+0.14)|
|DOTA HBB ||RetinaNet ||67.76||68.05||68.33||68.50 (+0.17)|
|DIOR ||RetinaNet ||68.05||68.42||69.36||69.35 (-0.01)|
|FPN ||71.74||71.83||73.21||73.25 (+0.04)|
|ICDAR2015 ||RetinaNet-H ||77.13||–||78.68||–|
|COCO ||FPN ||36.1||–||37.2||–|
Experiments are performed on a server with GeForce RTX 2080 Ti and 11G memory. We first give the description of the dataset, and then use these datasets to verify the advantage of the proposed method. Source code is available at https://github.com/SJTU-Thinklab-Det/DOTA-DOAI.
|Base Method||Backbone||InLD||Data Aug.||PL||BD||BR||GTF||SV||LV||SH||TC||BC||ST||SBF||RA||HA||SP||HC||mAP|
|OBB (oriented bounding boxes)||Backbone||PL||BD||BR||GTF||SV||LV||SH||TC||BC||ST||SBF||RA||HA||SP||HC||mAP|
|FR-O ||ResNet101 ||79.09||69.12||17.17||63.49||34.20||37.16||36.20||89.19||69.60||58.96||49.4||52.52||46.69||44.80||46.30||52.93|
|RADet ||ResNeXt101 ||79.45||76.99||48.05||65.83||65.46||74.40||68.86||89.70||78.14||74.97||49.92||64.63||66.14||71.58||62.16||69.09|
|MFIAR-Net ||ResNet152 ||89.62||84.03||52.41||70.30||70.13||67.64||77.81||90.85||85.40||86.22||63.21||64.14||68.31||70.21||62.11||73.49|
|Gliding Vertex ||ResNet101||89.64||85.00||52.26||77.34||73.01||73.14||86.82||90.74||79.02||86.81||59.55||70.91||72.94||70.86||57.32||75.02|
|Mask OBB ||ResNeXt101||89.56||85.95||54.21||72.90||76.52||74.16||85.63||89.85||83.81||86.48||54.89||69.64||73.94||69.06||63.32||75.33|
|SCRDet++ MS (FPN-based)||ResNet101||90.05||84.39||55.44||73.99||77.54||71.11||86.05||90.67||87.32||87.08||69.62||68.90||73.74||71.29||65.08||76.81|
|Axis Learning ||ResNet101||79.53||77.15||38.59||61.15||67.53||70.49||76.30||89.66||79.07||83.53||47.27||61.01||56.28||66.06||36.05||65.98|
|O-DNet ||Hourglass104 ||89.31||82.14||47.33||61.21||71.32||74.03||78.62||90.76||82.23||81.36||60.93||60.17||58.21||66.98||61.03||71.04|
|SCRDet++ MS (R3Det-based)||ResNet152||88.68||85.22||54.70||73.71||71.92||84.14||79.39||90.82||87.04||86.02||67.90||60.86||74.52||70.76||72.66||76.56|
|HBB (horizontal bounding boxes)||Backbone||PL||BD||BR||GTF||SV||LV||SH||TC||BC||ST||SBF||RA||HA||SP||HC||mAP|
|IoU-Adaptive R-CNN ||ResNet101||88.62||80.22||53.18||66.97||76.30||72.59||84.07||90.66||80.95||76.24||57.12||66.65||84.08||66.36||56.85||72.72|
|Mask OBB ||ResNeXt-101||89.69||87.07||58.51||72.04||78.21||71.47||85.20||89.55||84.71||86.76||54.38||70.21||78.98||77.46||70.40||76.98|
|SCRDet++ MS (FPN-based)||ResNet101||90.00||86.25||65.04||74.52||72.93||84.17||79.05||90.72||87.37||87.06||72.10||66.72||82.64||80.57||71.07||79.35|
|FMSSD ||VGG16 ||89.11||81.51||48.22||67.94||69.23||73.56||76.87||90.71||82.67||73.33||52.65||67.52||72.37||80.57||60.15||72.43|
|YOLOv3 ||Darknet‐53||72.2||29.2||74.0||78.6||31.2||69.7||26.9||48.6||54.4||31.1||61.1||44.9||49.7||87.4||70.6||68.7||87.3||29.4||48.3||78.7||57.1|
We choose a wide variety of public datasets covering aerial images, natural images, and scene texts for evaluation. The details are as follows.
DOTA: DOTA is a complex aerial image dataset for object detection, which contains objects exhibiting a wide variety of scales, orientations, and shapes. DOTA contains 2,806 aerial images and 15 common object categories collected from different sensors and platforms. The fully annotated DOTA benchmark contains 188,282 instances, each labeled by an arbitrary quadrilateral. There are two detection tasks for DOTA: horizontal bounding boxes (HBB) and oriented bounding boxes (OBB). The training set, validation set, and test set account for 1/2, 1/6, and 1/3 of the entire dataset, respectively. Since the image sizes range from around to pixels, we divide the images into subimages with an overlap of 150 pixels and scale them to . With all these processes, we obtain about 27,000 patches. The model is trained for 135k iterations in total, and the learning rate is decayed from 5e-4 to 5e-6 at the 81k and 108k iterations. The short names for categories are defined as (abbreviation-full name): PL-Plane, BD-Baseball diamond, BR-Bridge, GTF-Ground track field, SV-Small vehicle, LV-Large vehicle, SH-Ship, TC-Tennis court, BC-Basketball court, ST-Storage tank, SBF-Soccer-ball field, RA-Roundabout, HA-Harbor, SP-Swimming pool, and HC-Helicopter.
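The cropping step above can be sketched as a sliding-window routine. The 600-pixel patch size below is an assumed value (the exact patch and image sizes are elided in the text); only the 150-pixel overlap is given by the paper.

```python
import numpy as np

def crop_patches(image, patch_size=600, overlap=150):
    """Split a large aerial image into overlapping square patches.

    Sliding-window cropping: the stride is patch_size - overlap, and a
    final window along each axis is clamped to the image border so every
    pixel is covered.  patch_size=600 is an assumed value.
    """
    h, w = image.shape[:2]
    stride = patch_size - overlap
    ys = list(range(0, max(h - patch_size, 0) + 1, stride))
    xs = list(range(0, max(w - patch_size, 0) + 1, stride))
    # Clamp a last window to the border if the strides do not reach it.
    if ys[-1] + patch_size < h:
        ys.append(h - patch_size)
    if xs[-1] + patch_size < w:
        xs.append(w - patch_size)
    patches = []
    for y in ys:
        for x in xs:
            patches.append(((y, x), image[y:y + patch_size, x:x + patch_size]))
    return patches
```

Each patch would then be rescaled to the training resolution, and the box annotations shifted by the patch offset `(y, x)`.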
DIOR: DIOR is another large aerial image dataset, labeled with horizontal bounding boxes. It consists of 23,463 images and 190,288 instances, covering 20 object classes. DIOR exhibits large variation in object size, not only in spatial resolution but also in inter-class and intra-class size variability. The complexity of DIOR is also reflected in its different imaging conditions, weathers, seasons, and image qualities, and it has high inter-class similarity and intra-class diversity. The training protocol of DIOR is basically consistent with that of DOTA. The short names c1-c20 for categories in our experiment are defined as: Airplane, Airport, Baseball field, Basketball court, Bridge, Chimney, Dam, Expressway service area, Expressway toll station, Golf course, Ground track field, Harbor, Overpass, Ship, Stadium, Storage tank, Tennis court, Train station, Vehicle, and Wind mill.
UCAS-AOD: UCAS-AOD contains 1,510 aerial images of approximately pixels, covering two categories with 14,596 instances. In line with [64, 1], we randomly select 1,110 images for training and 400 for testing.
BSTLD: BSTLD contains 13,427 camera images at a resolution of pixels, with about 24,000 annotated small traffic lights. Specifically, 5,093 training images are annotated with 15 labels at intervals of 2 seconds, but only 3,153 of them contain instances, about 10,756 in total. Many categories have very few instances, so we regroup them into 4 categories (red, yellow, green, off). In contrast, 8,334 consecutive test images are annotated with 4 labels at about 15 fps. In this paper, we only use the training set of BSTLD, whose median traffic light width is 8.6 pixels. In the experiment, we divide the BSTLD training set into a training set and a test set according to the ratio of . Note that we use RetinaNet with the P2 feature level and FPN to verify InLD, and scale the input images to .
STLD: STLD (available at https://github.com/Thinklab-SJTU/S2TLD) is our newly collected and annotated traffic light dataset released with this paper, which contains 5,786 images of approximately pixels (1,222 images) and pixels (4,564 images). It also contains 5 categories (namely red, yellow, green, off and wait on) of 14,130 instances. The scenes cover a variety of lighting, weather and traffic conditions, including busy inner-city street scenes, dense stop-and-go traffic, strong changes in illumination/exposure, flickering/fluctuating traffic lights, multiple visible traffic lights, and image parts that can be confused with traffic lights (e.g. large round tail lights), as shown in Fig. 9. The training strategy is consistent with that of BSTLD.
The experiments are initialized with ResNet50 by default unless otherwise specified. The weight decay and momentum for all experiments are set to 0.0001 and 0.9, respectively. We employ MomentumOptimizer over 8 GPUs with a total of 8 images per minibatch. We follow the standard evaluation protocol of COCO, while for other datasets, the anchors of the RetinaNet-based method have areas of to on pyramid levels P3 to P7, respectively. At each pyramid level we use anchors at seven aspect ratios and three scales. For the rotating anchor-based method (RetinaNet-R), the angles are set by an arithmetic progression from to with an interval of degrees.
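The anchor configuration can be sketched as follows. The base sizes (32 to 512 for P3 to P7), the particular seven aspect ratios, the three octave scales, and the 15-degree angle interval are assumed values following common RetinaNet/RetinaNet-R conventions, since the exact numbers are elided above.

```python
import numpy as np

# Assumed values -- common RetinaNet / RetinaNet-R conventions.
BASE_SIZES = [32, 64, 128, 256, 512]       # one per pyramid level P3..P7
RATIOS = [1, 1 / 2, 2, 1 / 3, 3, 1 / 5, 5]  # seven aspect ratios
SCALES = [2 ** (i / 3) for i in range(3)]   # three octave scales
ANGLES = list(range(-90, 0, 15))            # -90, -75, ..., -15 degrees

def anchor_shapes(base_size):
    """Enumerate (w, h, angle) anchor shapes for one pyramid level."""
    shapes = []
    for ratio in RATIOS:
        for scale in SCALES:
            # Keep the anchor area equal to (base_size * scale)^2
            # while the aspect ratio sets w / h.
            size = base_size * scale
            w = size * np.sqrt(ratio)
            h = size / np.sqrt(ratio)
            for angle in ANGLES:
                shapes.append((w, h, angle))
    return shapes
```

With these assumed values, each position of a pyramid level carries 7 × 3 × 6 = 126 rotated anchors; the horizontal RetinaNet variant would drop the angle loop.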
The ablation study covers the detailed evaluation of the effect of image-level denoising (ImLD) and instance-level denoising (InLD), as well as their combination.
Effect of Image-Level Denoising. We have experimented with five denoising modules from the literature on the DOTA dataset. We use our previous work R3Det, one of the state-of-the-art methods on DOTA, as the baseline. From Table I, one can observe that most methods are effective except mean filtering. Among them, the non-local module with Gaussian affinity is the most effective (1.95% higher).
Effect of Instance-Level Denoising. The purpose of designing InLD is to decouple the features of different categories in the channel dimension, while the features of object and non-object regions are enhanced and weakened in the spatial dimension, respectively. We have designed several verification tests and obtained positive results, as shown in Table II. We first explore the utility of weakening the non-object noise by binary semantic segmentation, and the detection mAP increases from 65.73% to 68.12%. The result on multi-category semantic segmentation further proves that there is indeed interference between objects, which is reflected by the increase of detection mAP (reaching 69.43%). From the above two experiments, we can preliminarily speculate that the interference in the non-object area is the main factor affecting the performance of the detector. It is surprising to find that jointly predicting the objectness score (see Eq. 6) can further improve performance and speed up training, with a final accuracy of 69.81%. Experiments in Table VI show that InLD greatly improves R3Det's performance on small objects, such as BR, SV, LV, SH, SP, and HC, which increase by 3.94%, 0.84%, 4.32%, 8.48%, 10.15%, and 9.41%, respectively. While the accuracy is greatly improved, the detection speed of the model is only reduced by 1 fps (at 13 fps). In addition to the DOTA dataset, we have used more datasets to verify the general applicability, such as DIOR, ICDAR, COCO and STLD. InLD obtains 1.44%, 1.55%, 1.4% and 0.86% improvements on these four datasets according to Table V, and Fig. 10 shows the visualization results before and after using InLD. To investigate whether the performance improvement brought by InLD is due to the extra computation (dilated convolutions) or to the supervised learning, we perform ablation experiments controlling the number of dilated convolutions and the supervision signal.
Table III shows that supervised learning is the main contribution of InLD rather than more convolution layers.
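The InLD reweighting described above can be sketched as below. Taking the maximum category probability as the spatial objectness score is a simplifying assumption of this sketch; Eq. 6 defines the exact form used in the paper, and the segmentation branch itself (the dilated convolutions under the supervision signal) is assumed to exist elsewhere.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def apply_inld(features, seg_logits):
    """Instance-level denoising: reweight detection features by objectness.

    features:   (C, H, W) feature map feeding the detection heads.
    seg_logits: (K, H, W) per-category segmentation logits from the
                InLD branch (K object categories).
    Non-object positions get near-zero objectness, so their feature
    responses are weakened; object positions are kept (enhanced
    relatively), which is the spatial half of the InLD idea.
    """
    prob = sigmoid(seg_logits)            # per-category soft masks
    objectness = prob.max(axis=0)         # (H, W) spatial objectness score
    return features * objectness[None]    # suppress non-object regions
```

The channel half of the idea, decoupling categories across channels, comes from supervising `seg_logits` with per-category masks rather than a single binary mask.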
In particular, we conduct a detailed study on the SJTU Small Traffic Light Dataset (STLD), our newly released traffic light detection dataset. Compared with BSTLD, STLD has more available categories. In addition, STLD contains images at two different resolutions taken from two different cameras, which can be used for more challenging detection tasks. Table IV shows the effectiveness of InLD on these two traffic light datasets.
Effect of combining ImLD and InLD. A natural question is whether we can combine these two denoising structures, as shown in Fig. 2. For a more comprehensive study, we perform detailed ablation experiments on different datasets and different detection tasks. The experimental results are listed in Table V, from which we draw the following remarks:
1) Most of the datasets are relatively clean, so ImLD does not yield a significant gain on every dataset.
2) The performance improvement of detectors with InLD is very significant and stable, and is superior to ImLD.
3) The gain from combining ImLD and InLD is not large, mainly because their effects partly overlap: InLD weakens the feature response of non-object regions, which also suppresses image noise interference.
Therefore, ImLD is an optional module depending on the dataset and computing environment. We will not use ImLD in subsequent experiments unless otherwise stated.
Effect of IoU-Smooth L1 Loss. The IoU-Smooth L1 loss (source code separately available at https://github.com/DetectionTeamUCAS/RetinaNet_Tensorflow_Rotation) eliminates the boundary effects of the angle, making it easier for the model to regress the object coordinates. Table VII shows that the new loss improves the three detectors' accuracy to 69.83%, 68.65% and 76.20%, respectively.
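The loss can be sketched as follows: the smooth L1 term over the box parameters keeps the direction of the gradient, while an |-log(IoU)| factor replaces its magnitude, so the loss stays consistent at angular boundary cases where the parameter targets jump but the IoU does not. Computing the rotated IoU itself is assumed to be done elsewhere (e.g. by a polygon intersection routine), and the `eps` handling is an implementation detail of this sketch.

```python
import numpy as np

def smooth_l1(diff, beta=1.0 / 9.0):
    """Element-wise smooth L1 (Huber) penalty."""
    ad = np.abs(diff)
    return np.where(ad < beta, 0.5 * ad ** 2 / beta, ad - 0.5 * beta)

def iou_smooth_l1(reg_diff, iou, eps=1e-6):
    """IoU-smooth L1 loss for one rotated box (a sketch).

    reg_diff: per-parameter regression residuals (pred - target offsets).
    iou: rotated IoU between the decoded prediction and the ground truth.
    """
    sl1 = smooth_l1(reg_diff).sum()
    # Normalize out the smooth-L1 magnitude, keeping only its "direction";
    # the IoU constant factor then sets the loss magnitude.
    unit = sl1 / (np.abs(sl1) + eps)
    return unit * np.abs(-np.log(max(iou, eps)))
```

At a boundary case, exchanging edges or wrapping the angle makes `reg_diff` large while the IoU stays high, so `|-log(IoU)|` keeps the loss small, which is the intended behavior.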
Effect of Data Augmentation and Backbone. Using ResNet101 as the backbone together with data augmentation (random horizontal and vertical flipping, random graying, and random rotation), we observe a reasonable improvement, as shown in Table VI (from 69.81% to 72.98%). We further improve the final performance from 72.98% to 74.41% by using ResNet152 as the backbone. Due to the extreme category imbalance in the dataset, data augmentation brings a huge advantage, but we find that this does not affect the functioning of InLD under these heavy settings, from 72.81% to 74.41%. All experiments are performed on the OBB task of DOTA, and the final model based on R3Det is also named R3Det++ (code of R3Det and R3Det++ is available at https://github.com/Thinklab-SJTU/R3Det_Tensorflow).
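The augmentation pipeline can be sketched as below. Restricting the random rotation to multiples of 90 degrees is a simplifying assumption of this sketch (it avoids interpolation), and the matching transform of the rotated box annotations is omitted.

```python
import numpy as np

def random_augment(image, rng):
    """Random horizontal/vertical flip, random graying, random rotation.

    image: (H, W, 3) array; rng: a numpy random Generator.
    Each transform is applied independently with probability 0.5 (the
    probabilities are an assumption of this sketch).
    """
    if rng.random() < 0.5:
        image = image[:, ::-1]                   # horizontal flip
    if rng.random() < 0.5:
        image = image[::-1, :]                   # vertical flip
    if rng.random() < 0.5:
        gray = image.mean(axis=2, keepdims=True)
        image = np.repeat(gray, 3, axis=2)       # random graying
    k = rng.integers(0, 4)
    image = np.rot90(image, k)                   # rotation by k * 90 degrees
    return np.ascontiguousarray(image)
```

Random graying in particular helps categories whose appearance is dominated by shape rather than color, which is common for the rare classes in DOTA.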
Results on DOTA. We compare our results with the state-of-the-art results on DOTA, as depicted in Table VIII. The results of DOTA reported here are obtained by submitting our predictions to the official DOTA evaluation server (https://captain-whu.github.io/DOTA/). In the OBB task, we add the proposed InLD module to a single-stage detection method (R3Det++) and a two-stage detection method (FPN-InLD). Our methods achieve the best performance, 76.56% and 76.81%, respectively. For a fair comparison, we do not use overlays of various tricks, oversized backbones, or model ensembles, which are often used by methods on DOTA's leaderboard. In the HBB task, we conduct the same experiments and obtain competitive detection mAP, about 74.37% and 76.24%. Model performance can be further improved to 79.35% if multi-scale training and testing are used. It is worth noting that FADet, SCRDet and CAD-Det use the simple attention mechanism described in Eq. 1, but our performance is far better than all of them. Fig. 11 shows some aerial subimages, and Fig. 12 shows the aerial images of large scenes.
Results on DIOR and UCAS-AOD. DIOR is a new large-scale aerial image dataset with more categories than DOTA. In addition to the official baselines, we also give our final detection results in Table IX. Note that the baselines we reproduce are higher than the official ones. In the end, we obtain 77.80% and 75.11% mAP with the FPN-based and RetinaNet-based methods, respectively. Table X illustrates the comparison of performance on the UCAS-AOD dataset. Our method achieves 96.95% for the OBB task, the best among all existing published methods.
We have presented an instance-level denoising technique on the feature map for improving detection, especially for small and densely arranged objects, e.g. in aerial images. The core idea of InLD is to decouple the features of different categories over different channels, while the features of object and non-object regions are enhanced and weakened in the spatial dimension, respectively. Meanwhile, an IoU constant factor is added to the smooth L1 loss to address the boundary problem in rotation detection for more accurate rotation estimation. We perform extensive ablation studies and comparative experiments on multiple aerial image datasets such as DOTA, DIOR, UCAS-AOD, the small traffic light dataset BSTLD and our newly released STLD, and demonstrate that our method achieves state-of-the-art detection accuracy. We also use the natural image dataset COCO and the scene text dataset ICDAR2015 to verify the effectiveness of our approach.
This research was supported by National Key Research and Development Program of China (2018AAA0100704, 2016YFB1001003), and NSFC (61972250, U19B2035), STCSM (18DZ1112300). The author Xue Yang is supported by Wu Wen Jun Honorary Doctoral Scholarship, AI Institute, Shanghai Jiao Tong University.
DAPAS: denoising autoencoder to prevent adversarial attack in semantic segmentation. arXiv preprint arXiv:1908.05195.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Stacked hourglass networks for human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 483-499.