1 Introduction
When objects such as scene text and aerial targets (e.g., airplanes, ships, vehicles) appear in an image at an arbitrary angle, the horizontal bounding box commonly used for natural object detection no longer meets the detection requirement, because it may enclose many redundant pixels that actually belong to the background. Moreover, when objects with large aspect ratios are densely packed, the combination of horizontal bounding boxes and non-maximum suppression (NMS) easily causes missed detections, as shown in Fig. 1. To solve these problems, the oriented bounding box was proposed as an output form, and oriented object detection has attracted increasing attention in recent years.
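The interaction between horizontal boxes and NMS can be illustrated with a small numerical sketch. The object sizes, the 45-degree angle, the 15-pixel spacing and the helper names below are illustrative assumptions, not values from this paper. Two thin objects parked side by side do not overlap at all, yet their enclosing horizontal boxes overlap heavily, so an NMS threshold around 0.5 would discard one of them.

```python
import math

def rotated_rect_corners(cx, cy, length, width, angle_deg):
    """Corners of a length x width rectangle centred at (cx, cy), rotated by angle_deg."""
    a = math.radians(angle_deg)
    ca, sa = math.cos(a), math.sin(a)
    half = [(-length / 2, -width / 2), (length / 2, -width / 2),
            (length / 2, width / 2), (-length / 2, width / 2)]
    return [(cx + x * ca - y * sa, cy + x * sa + y * ca) for x, y in half]

def enclosing_horizontal_box(corners):
    """Axis-aligned box (x_min, y_min, x_max, y_max) enclosing the corners."""
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return min(xs), min(ys), max(xs), max(ys)

def horizontal_iou(a, b):
    """IoU of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

# Hypothetical scene: two 100 x 10 objects at 45 degrees, centres 15 px apart
# along the normal direction, so a 5 px gap separates them (no real overlap).
ANGLE, SPACING = 45.0, 15.0
normal = (-math.sin(math.radians(ANGLE)), math.cos(math.radians(ANGLE)))
box_a = enclosing_horizontal_box(rotated_rect_corners(0.0, 0.0, 100.0, 10.0, ANGLE))
box_b = enclosing_horizontal_box(rotated_rect_corners(
    SPACING * normal[0], SPACING * normal[1], 100.0, 10.0, ANGLE))
iou = horizontal_iou(box_a, box_b)
print(f"IoU of the two horizontal boxes: {iou:.3f}")
```

In this configuration the horizontal-box IoU is roughly 0.59, although the two objects themselves are disjoint.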
Many recent advances in oriented object detection rely heavily on R-CNN frameworks. For text detection in natural scene images, R2CNN adds a branch to Faster R-CNN that regresses two points and the height of the oriented bounding box. However, during training, horizontal anchors are still suppressed by NMS within the RPN when the objects to be detected have large aspect ratios and are densely packed. RRPN replaces the horizontal anchors with rotated anchors and replaces the corresponding NMS with one computed from rotated IoU. However, to cover objects at any angle, more anchors with different angles, aspect ratios and scales must be set, which increases the computational complexity of the model. Moreover, the rotated anchors in RRPN require an additional regression of the angle via Smooth L1, but rotated IoU and Smooth L1 do not combine well because the angle has a boundary problem, so the oriented bounding boxes output by this algorithm are inaccurate and often exhibit angle jitter.
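The angle boundary problem can be made concrete with a minimal sketch. It assumes angles expressed in degrees with a period of 180 and a standard Smooth L1 with beta = 1; the exact angle parameterization of RRPN is not reproduced here. Two boxes at 89 and -89 degrees are almost parallel, yet a direct Smooth L1 on the angle treats them as far apart.

```python
def smooth_l1(x, beta=1.0):
    """Standard Smooth L1 (Huber-style) penalty on a scalar residual."""
    x = abs(x)
    return 0.5 * x * x / beta if x < beta else x - 0.5 * beta

def orientation_gap_deg(a, b):
    """Smallest angle between two undirected orientations (period 180 degrees)."""
    d = abs(a - b) % 180.0
    return min(d, 180.0 - d)

pred_angle, target_angle = 89.0, -89.0
naive_penalty = smooth_l1(pred_angle - target_angle)        # penalises a 178 degree gap
true_gap = orientation_gap_deg(pred_angle, target_angle)    # the boxes differ by 2 degrees
print(naive_penalty, true_gap)
```

The regression target jumps at the boundary even though the geometry barely changes, which is the source of the jitter described above.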
In aerial images, oriented object detection is more difficult than scene text detection because of the complex backgrounds and the variation in the spatial resolution of images. SCRDet proposes an IoU loss to address the boundary problem of oriented bounding boxes noted above for RRPN. In addition, it adds a Multi-Dimensional Attention Network to deal with the complex background of aerial images. However, SCRDet is still anchor-based and NMS-based and inherits the problems they bring. For instance, in the testing stage of SCRDet, 300 proposals are taken from 10000 regression boxes by NMS, but according to our statistics on the DOTA dataset, a cropped image of size can contain up to objects, which causes missed detections.
Recently, in view of the disadvantages of anchor-based models, a number of anchor-free algorithms [14, 39, 32, 38] have emerged. CornerNet and ExtremeNet consist of keypoint detection followed by keypoint grouping. They are not well suited to aerial images, which may contain many objects, because of the high time complexity of their grouping algorithms. CenterNet regresses the height and width of an object at its center point. To be NMS-free, it extracts center points from the heatmap by comparing each candidate with its 8-connected neighbors. However, this method still needs to select the top-scoring objects, so it cannot be applied to a single image with numerous targets. In addition, it may cause the center point to drift, which in turn causes the bounding box to drift. Unlike the keypoint-based anchor-free models above, FCOS belongs to the dense sampling category: it regresses a box at many pixels for each object. Because the bounding box of an object is regressed from a large number of pixels, FCOS still needs NMS to filter the redundant boxes.
In this paper, we propose a novel anchor-free and NMS-free model named O²-DNet, shown in Fig. 2, which detects oriented objects by a pair of middle lines. Our model is a new form of anchor-free detector that combines keypoint detection and dense sampling. To be NMS-free, we use keypoint detection to locate the intersection point of each pair of middle lines. To handle intersection point drift, we design a drift region inspired by dense sampling, which ensures that drift of the intersection point within the region does not affect the position of the final bounding box. To predict the middle lines, we design a specific Line Loss according to the characteristics of lines (e.g., length, slope, position) to regress each middle line. The Line Loss consists of three parts: the position loss, which controls the location of the endpoints of each middle line; the parallel loss, which keeps the two endpoints of each middle line and the intersection point of the two middle lines collinear; and the vertical loss, which controls the geometric relationship between the two middle lines of one object. Because the endpoints of the middle lines are regressed in a fixed order, we also face the boundary problem. To solve it, we design O²-DNet with two branches, one for predicting horizontal objects at 90 degrees and the other for oriented objects at other angles. The two-branch design also enables us to apply O²-DNet to COCO without changing the network structure.
Our contributions and innovations are as follows:
(1) We propose a novel anchor-free and NMS-free model named O²-DNet to detect oriented objects by a pair of middle lines.
(2) O²-DNet is a new form of anchor-free detector that combines keypoint detection and dense sampling, and we design a drift region to relax the requirement for accurate keypoint extraction.
(3) O²-DNet can detect oriented objects and horizontal objects in a single network via two branches without increasing computational complexity. For the regression of middle lines, we design a special Line Loss.
The rest of this paper is organized as follows. Section 2 introduces related work and the basic principles underlying our method. Section 3 describes the details of our network and algorithms. Section 4 presents our experimental results and analysis. Section 5 summarizes and concludes our work.
2 Related Works
2.1 Detectors based on Manually Engineered Features
Traditional object detectors mainly depend on manually engineered features. They first extract features such as the Histogram of Oriented Gradients (HOG) [5] and then use a classifier such as a support vector machine [3] to identify the existence of an object. The generalization capability of these detectors is limited by feature extraction, and their robustness needs further improvement.
2.2 Detectors based on DCNNs
In recent years, the success and development of deep convolutional neural networks (DCNNs) [15, 13] have brought great progress to object detection. Compared with the traditional detectors above, DCNN-based detectors [27, 14, 39, 32, 38, 25, 26, 20] automatically extract features through backbone networks [29, 9], and the accuracy and robustness of the models are greatly improved. At present, DCNN-based detectors fall into two branches: anchor-based and anchor-free.
2.3 Anchor-Based vs. Anchor-Free Detectors
The concept of the anchor was proposed in the Region Proposal Network (RPN) of Faster R-CNN, where it serves to extract proposals and guide the regression task of the network. Subsequently, the anchor mechanism within the RPN has been widely used in two-stage detectors [10, 22, 36, 2]. Among one-stage detectors [25, 26, 20, 17], which detect objects without an RPN, YOLOv1 does not use the anchor mechanism and cannot provide accuracy comparable to that of two-stage detectors. Afterwards, anchors were also extensively adopted in one-stage detectors [26, 20, 17] to improve performance. In oriented object detection, most algorithms rely heavily on the anchor mechanism. In general, these models use rotated anchors to regress additional angle information, output oriented bounding boxes, and then obtain the final bounding boxes by filtering with a rotated NMS algorithm.
The anchor mechanism has promoted the development of object detection, but it is not perfect and has problems such as those discussed in [14, 39, 32, 38]. Recently, anchor-free detection has become a hot topic in object detection. At present, anchor-free detectors can be roughly divided into two categories. The first locates objects through keypoints, such as corner points in CornerNet, extreme points in ExtremeNet and center points in CenterNet; the second, like FCOS, regresses the location of an object at many points. For oriented objects, both categories have defects. In the inference stage, they all need to keep the K highest-scoring objects, which may cause missed detections when a single image contains many targets, such as small cars in aerial images. O²-DNet is a one-stage, anchor-free detector that locates objects through a pair of middle lines and their intersection point. To solve the top-K problem, we combine the two categories of existing anchor-free algorithms in the design of O²-DNet. The details of our model are explained in the next section.
Fig. 3 illustrates the architecture of our method. O²-DNet locates each object by detecting a pair of corresponding middle lines. Following CornerNet, we use 104-Hourglass as the backbone of our model for its excellent feature extraction performance. For an input image of size , our model outputs heatmaps to predict the intersection points of target middle lines, and regression maps to predict the corresponding two middle lines, where represents the number of classes, is the output stride of the downsampling module, and the first means the two branches of O²-DNet. The two-branch design deals with the angle boundary problem of oriented objects by predicting objects at an angle of 90 degrees independently. We obtain each middle line by regressing its two endpoints. The regression form is inspired by CenterNet: each endpoint is regressed as its position relative to the intersection point. Moreover, we design special loss functions that constrain the relationship between endpoints so that the predicted middle lines are more accurate. In addition, to reduce the dependence of middle line extraction on the precision of intersection point extraction, we propose the point drift region, which lets O²-DNet output high-quality oriented bounding boxes even when the extracted intersection points are not accurate enough.
3.2 Hourglass Networks
Hourglass networks were first proposed for human keypoint detection. In CornerNet, Law et al. modified the hourglass network and introduced it to object detection. We choose the 104-Hourglass network modified in CornerNet as our backbone. For an input image, the hourglass network regresses channels of heatmaps, in which each pixel value is the confidence of that pixel being positive. Unlike CornerNet, O²-DNet sets the ground truth of each keypoint to 1 instead of using a Gaussian kernel. At inference, to avoid missed detections, we do not use the NMS and top-K keypoint extraction of [14, 39, 32, 38]. Instead, we take a simple and coarse approach: we find the connected domains in the heatmap and define the center of each connected domain as the target keypoint. This method is admittedly not very accurate, but combined with the intersection drift region proposed in our model it achieves satisfactory results in the experiments.
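The connected-domain extraction described above can be sketched in a few lines of plain Python. The threshold value, the use of 8-connectivity and the function name are assumptions made for illustration; the text only states that connected domains are found and their centers are taken as keypoints.

```python
from collections import deque

def extract_keypoints(heatmap, threshold=0.5):
    """Return the (row, col) centre of every connected region above threshold.

    heatmap is a list of rows of floats. 8-connectivity and the default
    threshold are illustrative assumptions.
    """
    h, w = len(heatmap), len(heatmap[0])
    visited = [[False] * w for _ in range(h)]
    keypoints = []
    for r in range(h):
        for c in range(w):
            if heatmap[r][c] <= threshold or visited[r][c]:
                continue
            # Breadth-first search collects all pixels of this region.
            queue = deque([(r, c)])
            visited[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and heatmap[ny][nx] > threshold
                                and not visited[ny][nx]):
                            visited[ny][nx] = True
                            queue.append((ny, nx))
            mean_y = sum(p[0] for p in pixels) / len(pixels)
            mean_x = sum(p[1] for p in pixels) / len(pixels)
            keypoints.append((mean_y, mean_x))
    return keypoints

# Toy heatmap with one 2 x 2 blob and one isolated pixel.
toy = [[0.0] * 6 for _ in range(6)]
for y, x in [(1, 1), (1, 2), (2, 1), (2, 2), (4, 4)]:
    toy[y][x] = 0.9
points = extract_keypoints(toy)
print(points)
```

Because no top-K selection is involved, the number of extracted keypoints is bounded only by the number of regions in the heatmap.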
3.3 Middle Lines and Their Intersection Point
As shown in Fig. 4, let denote middle line 1 of the object, and let and be endpoints 1 and 2 of , respectively. Similarly, is defined as middle line 2. In the first branch of O²-DNet, we define the horizontal middle line as and the vertical one as . We regard the right endpoint and the top endpoint as endpoint 1 of and , respectively. In the second branch, the longer of the two middle lines is defined as and the other as . Endpoint 1 is again the right endpoint of and the top endpoint of in this branch. In both branches, the intersection point of the two middle lines can be obtained through simple operations.
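To make the geometry concrete, the sketch below shows one plausible way to recover the four corners of an oriented box from its two middle lines. The function name and the convention that endpoints are given as offsets from the intersection point are assumptions for illustration; the decoding step is not spelled out in this section.

```python
def box_from_middle_lines(center, line1_offsets, line2_offsets):
    """Corners of the box whose middle lines have the given endpoint offsets.

    center: (x, y) of the intersection point.
    line1_offsets, line2_offsets: ((dx1, dy1), (dx2, dy2)), the offsets of
    endpoint 1 and endpoint 2 of each middle line relative to center.
    Each corner is the centre plus one endpoint offset from each line.
    """
    cx, cy = center
    (u1, u2), (v1, v2) = line1_offsets, line2_offsets
    pairs = [(u1, v1), (u1, v2), (u2, v2), (u2, v1)]  # walk around the box
    return [(cx + u[0] + v[0], cy + u[1] + v[1]) for u, v in pairs]

# Horizontal object in image coordinates (y grows downwards):
# middle line 1 is horizontal (right endpoint first),
# middle line 2 is vertical (top endpoint first).
corners = box_from_middle_lines(
    center=(10, 10),
    line1_offsets=((4, 0), (-4, 0)),
    line2_offsets=((0, -2), (0, 2)),
)
print(corners)
```

The same construction applies unchanged to rotated objects, since it only adds offset vectors.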
3.3.1 Intersection Point
We follow the modified focal loss in CornerNet to predict the heatmap of the intersection points of target middle lines. Because the ground truth in our model is set to 1 instead of a Gaussian kernel, as mentioned in Section 3.2, the loss in O²-DNet differs slightly in form from that of CornerNet. We name the intersection point loss in our model :
where is a hyper-parameter fixed to 2 in our experiments, represents the pixel value at coordinate in the heatmap, is the corresponding ground truth, and is the number of objects.
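For reference, a form consistent with this description is the CornerNet focal loss with the Gaussian weighting term removed, since that term equals 1 for a binary ground truth. The symbols below ($L_{ip}$, $\alpha$, $p_{ij}$, $y_{ij}$, $N$) are assumed names chosen for this sketch, and the expression should be read as a plausible reconstruction rather than the exact equation of the paper.

```latex
L_{ip} = -\frac{1}{N} \sum_{i,j}
\begin{cases}
(1 - p_{ij})^{\alpha} \, \log\left(p_{ij}\right), & \text{if } y_{ij} = 1, \\
p_{ij}^{\alpha} \, \log\left(1 - p_{ij}\right), & \text{otherwise,}
\end{cases}
\qquad \alpha = 2 .
```

Here $p_{ij}$ is the predicted heatmap value at pixel $(i, j)$, $y_{ij}$ is the binary ground truth, and $N$ is the number of objects in the image.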
3.3.2 Middle Lines
Each middle line is regressed as the relative distance between each of its endpoints and the intersection point. As shown in Fig. 6, for middle line , we need to regress 4 maps, whose values at the position of the intersection point are and respectively. Middle line is regressed in the same form as middle line . The loss for regressing each middle line is as follows:
where is the number of objects, denotes the endpoint of the corresponding middle line, is the coordinate in the regression map, and and represent the ground truth.
Regressing each endpoint of a middle line independently may leave the two endpoints and the intersection point non-collinear. To address this problem, we introduce a loss function as follows:
where means the two middle lines of each object, and and denote the endpoints and of each middle line.
Each object has two middle lines, and they are generally perpendicular to each other. To control the relationship between the two target middle lines, we design as follows:
where means endpoint of middle line , means endpoint of middle line .
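The two geometric constraints can be illustrated with simple residuals that vanish exactly when the constraint holds. These are not the loss formulas of the paper, which are not recoverable from the text here; the cross-product and dot-product forms and the function names are assumptions chosen because they express collinearity and perpendicularity directly.

```python
import math

def collinearity_residual(e1, e2):
    """|sin| of the angle between the two endpoint offsets of one middle line.

    e1, e2 are offsets of endpoints 1 and 2 from the intersection point.
    The value is 0 when both endpoints and the intersection point are collinear.
    """
    cross = e1[0] * e2[1] - e1[1] * e2[0]
    return abs(cross) / (math.hypot(*e1) * math.hypot(*e2))

def perpendicularity_residual(line1, line2):
    """|cos| of the angle between the directions of two middle lines.

    Each line is ((x1, y1), (x2, y2)). The value is 0 for perpendicular lines.
    """
    u = (line1[0][0] - line1[1][0], line1[0][1] - line1[1][1])
    v = (line2[0][0] - line2[1][0], line2[0][1] - line2[1][1])
    dot = u[0] * v[0] + u[1] * v[1]
    return abs(dot) / (math.hypot(*u) * math.hypot(*v))

# A well-formed pair of middle lines, given as offsets from the intersection point.
line_1 = ((4.0, 0.0), (-4.0, 0.0))
line_2 = ((0.0, -2.0), (0.0, 2.0))
good_collinear = collinearity_residual(*line_1)
good_perpendicular = perpendicularity_residual(line_1, line_2)

# A badly regressed endpoint bends middle line 1 at the intersection point.
bent = collinearity_residual((4.0, 0.0), (-4.0, 1.0))
print(good_collinear, good_perpendicular, bent)
```

A training loss built on such residuals would be averaged over objects and weighted against the position term, as the total loss below does with its weights.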
The , and make up the Line Loss of our model:
And the total loss of our model can be expressed as:
where , and are weights of losses.
3.3.3 Drift Region
The accuracy of intersection point extraction from the heatmap affects the accuracy of middle line extraction. Inspired by FCOS, we set up a circular drift region at the center of each object according to its size. Every pixel in the drift region regresses different values according to its relative distances from the endpoints of the middle lines. The drift region ensures that inaccurate extraction of the intersection point from the heatmap does not influence the position of the final oriented bounding box. The radius of the drift region is set as follows:
where stride is the output stride of our model, and are middle lines and respectively, and is in our model.
Unlike in remote sensing images, objects in natural scenes sometimes overlap and form ambiguous samples. We follow FCOS to address this problem: when a pixel belongs to two targets at the same time, it regresses the smaller one.
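The effect of the drift region can be sketched as follows. The radius is passed in as a plain parameter because its formula is not recoverable from the text, and the function names and integer pixel grid are assumptions. Every pixel inside the circle stores the offsets from itself to the endpoints, so decoding from any of these pixels yields the same endpoints.

```python
import math

def drift_region_targets(center, endpoints, radius):
    """Regression targets for all integer pixels within radius of center.

    Returns {(x, y): [(dx, dy), ...]} where each offset points from the pixel
    to one endpoint of the middle lines.
    """
    cx, cy = center
    reach = int(math.ceil(radius))
    targets = {}
    for y in range(cy - reach, cy + reach + 1):
        for x in range(cx - reach, cx + reach + 1):
            if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                targets[(x, y)] = [(ex - x, ey - y) for ex, ey in endpoints]
    return targets

def decode_endpoints(pixel, offsets):
    """Recover absolute endpoint positions from a pixel and its offsets."""
    px, py = pixel
    return [(px + dx, py + dy) for dx, dy in offsets]

endpoints = [(14, 10), (6, 10), (10, 8), (10, 12)]
targets = drift_region_targets(center=(10, 10), endpoints=endpoints, radius=1.0)
decoded = [decode_endpoints(pixel, offsets) for pixel, offsets in targets.items()]
print(len(targets), decoded[0])
```

If the extracted keypoint drifts from (10, 10) to (11, 10), the decoded endpoints are unchanged, which is the property the drift region is designed to provide.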
4 Experiments
4.1 Datasets
We select three datasets to verify the performance of our model. They cover different research fields: oriented object detection in aerial images, text detection in natural scenes, and object detection in natural images. They are introduced in detail below.
4.1.1 DOTA
DOTA is a common benchmark for object detection in aerial images. It includes two detection tasks, horizontal bounding boxes and oriented bounding boxes, and we use only the oriented one in our experiments. DOTA contains aerial images in total, with sizes ranging from to pixels. These images are annotated with categories (e.g., aircraft, small car, ship). In practice, we divide each large image into crops of with an overlap of .
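Cropping with overlap can be sketched as below. The crop size and overlap used in the example (512 and 128) are placeholders, since the actual values are not recoverable from the text, and the helper name is hypothetical. The last crop is shifted back so that it ends exactly at the image border.

```python
def crop_origins(length, crop, overlap):
    """Start offsets of crops of size crop that cover [0, length).

    Neighbouring crops share at least overlap pixels; the final crop is
    placed flush with the border instead of running past it.
    """
    assert 0 <= overlap < crop
    if length <= crop:
        return [0]
    stride = crop - overlap
    starts = list(range(0, length - crop, stride))
    starts.append(length - crop)
    return starts

def crop_windows(width, height, crop, overlap):
    """All (x, y, crop, crop) windows for an image of the given size."""
    return [(x, y, crop, crop)
            for y in crop_origins(height, crop, overlap)
            for x in crop_origins(width, crop, overlap)]

# Placeholder sizes, not the settings of the paper.
xs = crop_origins(1000, crop=512, overlap=128)
windows = crop_windows(1000, 1000, crop=512, overlap=128)
print(xs, len(windows))
```

Detections from overlapping crops would then need to be mapped back to the original image coordinates by adding each window's origin.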
4.1.2 ICDAR 2015
ICDAR 2015 is a dataset for text detection in natural scenes. The training set and test set include and images of size , respectively.
4.1.3 COCO
The challenging MS COCO dataset contains 80k images for training, 40k for validation and 20k for testing, and includes 80 categories. The annotations in COCO are horizontal bounding boxes, which we use to test the generality of our model.
4.2 Training and Testing Details
All our experiments are implemented in PyTorch 1.0 on two NVIDIA Tesla V100 GPUs with GB of memory. For DOTA, following the settings in CornerNet, we set the input resolution to and the output stride to during training. Adam is the optimizer for O²-DNet. We train our model from scratch for iterations with a batch size of . The learning rate starts at 0.001 and is reduced by a factor of 10 after each third of the iterations. Simple random horizontal and vertical flipping as well as color dithering are used for data augmentation. The loss weights , and (Section 3.3.2) are set to and , respectively, during training. For ICDAR 2015 and COCO, O²-DNet is fine-tuned on two V100 GPUs for iterations with a batch size of from a pre-trained CornerNet model that was trained on 10 GPUs for 500k iterations. Other settings are the same as for DOTA. Note that we do not divide the two branches of our model strictly by degrees, but by an angle range of degrees.
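One reading of the learning-rate schedule is a step decay that divides the rate by 10 after each third of the training run. The sketch below follows that interpretation, which is an assumption; the total of 300 iterations is a placeholder, since the actual number is not recoverable from the text.

```python
def learning_rate(iteration, total_iterations, base_lr=1e-3):
    """Step schedule: base_lr, base_lr / 10, base_lr / 100 over three equal stages."""
    stage = min(3 * iteration // total_iterations, 2)
    return base_lr * (0.1 ** stage)

TOTAL = 300  # placeholder iteration count
schedule = [learning_rate(i, TOTAL) for i in (0, 99, 100, 199, 200, 299)]
print(schedule)
```

In a PyTorch training loop the same behaviour is usually obtained with a multi-step scheduler whose milestones sit at one third and two thirds of the run.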
At test time, as mentioned in Section 3.2, we binarize the heatmap to extract the intersection points of the target middle lines, with the threshold set to . When the angle of an object lies at the boundary between the two branches, both branches may output it. We take the output with the higher intersection point score as the final output of O²-DNet.
4.3 Comparisons with State-of-the-art Frameworks
In this part, we first demonstrate the performance of O²-DNet on the oriented object datasets (DOTA, ICDAR 2015). Then we test the generality of our model on a natural object dataset (COCO).
4.3.1 DOTA
As shown in Table 1, O²-DNet achieves mAP on the DOTA dataset, better than most two-stage and one-stage models currently used for aerial object detection. For bridges with large aspect ratios and densely parked small vehicles, our anchor-free model achieves the best AP, which we attribute to its more adaptive feature extraction compared with anchor-based models.
4.3.2 ICDAR 2015
In the ICDAR 2015 dataset, most annotations are not rectangles but irregular quadrilaterals close to parallelograms. We therefore disable the of the Line Loss, which keeps the two target middle lines perpendicular. As shown in Table 2, O²-DNet achieves F1, better than the other models chosen for comparison. The experimental results show that our model can be used not only for aerial image detection but also for natural scene text detection.
|CTPN|51.56|74.22|60.85|
|SegLink|76.80|73.10|75.00|
|R2CNN|79.68|85.62|82.54|
|EAST|78.33|83.27|80.72|
|FOTS RT|85.95|79.83|82.78|
|RRPN|82.17|73.23|77.44|
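As a quick consistency check on the rows above, the third value in each row is the F1 score, i.e. the harmonic mean of the first two values (precision and recall, in either order, since the formula is symmetric). The sketch below recomputes it; the tolerance of 0.15 is an assumption that allows for rounding in the reported numbers.

```python
def f1_score(a, b):
    """Harmonic mean of precision and recall (symmetric in its arguments)."""
    return 2 * a * b / (a + b)

# (first value, second value, reported F1), copied from the rows above.
rows = {
    "CTPN": (51.56, 74.22, 60.85),
    "SegLink": (76.80, 73.10, 75.00),
    "R2CNN": (79.68, 85.62, 82.54),
    "EAST": (78.33, 83.27, 80.72),
    "FOTS RT": (85.95, 79.83, 82.78),
    "RRPN": (82.17, 73.23, 77.44),
}

recomputed = {name: f1_score(a, b) for name, (a, b, _) in rows.items()}
for name, (_, _, reported) in rows.items():
    print(f"{name}: reported {reported:.2f}, recomputed {recomputed[name]:.2f}")
```

All rows agree with the reported F1 to within rounding; the SegLink row shows the largest gap, about 0.1.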
4.3.3 COCO
To verify the general performance of our model, we also conduct experiments on the COCO dataset for natural scene object detection. Because objects in COCO are labeled with horizontal bounding boxes, our model produces output only in the first branch. As shown in Table 3, O²-DNet achieves AP on the COCO dataset, ahead of most one-stage detectors.
|Method|Backbone|AP|AP50|AP75|APS|APM|APL|
|D-RFCN + SNIP|DPN-98|45.7|67.3|51.1|29.3|48.8|57.1|

|Without Line Loss|87.25|82.12|47.04|60.14|67.21|70.01|71.38|90.26|79.89|81.09|58.43|59.12|56.05|66.82|60.06|69.12|
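The last value of the "Without Line Loss" row can be checked against the others: the first fifteen values are per-category APs and the sixteenth is their mean (mAP). The short sketch below recomputes it from the numbers in the row.

```python
# Values copied from the "Without Line Loss" row above.
per_category_ap = [87.25, 82.12, 47.04, 60.14, 67.21, 70.01, 71.38, 90.26,
                   79.89, 81.09, 58.43, 59.12, 56.05, 66.82, 60.06]
reported_map = 69.12

mean_ap = sum(per_category_ap) / len(per_category_ap)
print(f"recomputed mAP: {mean_ap:.2f}, reported: {reported_map:.2f}")
```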
4.4 Ablation Studies
In this part, we conduct three ablation experiments on the DOTA dataset, studying the influence of different backbones, of the Line Loss, and of using a single branch on the performance of our model. Table 4 shows all experimental data. The specific analysis follows.
4.4.2 Without Line Loss
To validate the Line Loss, we disable the , part of the Line Loss and keep the other settings of O²-DNet. Table 4 shows that our model with Line Loss improves mAP over the model without it. The Line Loss effectively enforces the line segment property of the regressed middle lines in our model. Its effect is shown in Fig. 7.
4.4.3 Single branch
To verify that the two branches of O²-DNet handle the boundary problem better, we remove the first branch and feed the 90-degree objects to the second branch using the original ground truth definition in Section 3.3. The experimental results show that the mAP of two branches is higher than that of a single branch. The two-branch design is therefore significant for our model.
5 Conclusion
We propose a novel one-stage, anchor-free model named O²-DNet to detect oriented objects. O²-DNet locates each object by predicting a pair of middle lines inside it. As a result, our model is competitive with state-of-the-art detectors in several fields: oriented object detection in aerial images, text detection in natural scenes, and object detection in natural images.
References
[1] (2018) Towards multi-class object detection in unconstrained remote sensing imagery. arXiv preprint arXiv:1807.02700.
[2] (2018) Cascade R-CNN: delving into high quality object detection. pp. 6154–6162.
[3] (1995) Support-vector networks. Machine Learning 20 (3), pp. 273–297.
[4] (2017) Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 764–773.
[5] (2005) Histograms of oriented gradients for human detection. In International Conference on Computer Vision & Pattern Recognition (CVPR'05), Vol. 1, pp. 886–893.
[6] (2019) Learning RoI transformer for detecting oriented objects in aerial images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[7] (2015) Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1440–1448.
[8] (2017) Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
[9] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[10] (2017) R2CNN: rotational region CNN for orientation robust scene text detection. arXiv preprint arXiv:1706.09579.
[11] (2015) ICDAR 2015 competition on robust reading. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 1156–1160.
[12] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[13] (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
[14] (2018) CornerNet: detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750.
[15] (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
[16] (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125.
[17] (2017) Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988.
[18] (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
[19] (2018) Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8759–8768.
[20] (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
[21] (2018) FOTS: fast oriented text spotting with a unified network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5676–5685.
[22] (2018) Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia 20 (11), pp. 3111–3122.
[23] (2016) Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pp. 483–499.
[24] (2017) Automatic differentiation in PyTorch.
[25] (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
[26] (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7263–7271.
[27] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
[28] (2017) Detecting oriented text in natural images by linking segments.
[29] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[30] (2018) An analysis of scale invariance in object detection SNIP. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3578–3587.
[31] (2016) Detecting text in natural image with connectionist text proposal network. In European Conference on Computer Vision, pp. 56–72.
[32] (2019) FCOS: fully convolutional one-stage object detection. arXiv preprint arXiv:1904.01355.
[33] (2018) DOTA: a large-scale dataset for object detection in aerial images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[34] (2019) R3Det: refined single-stage detector with feature refinement for rotating object. arXiv preprint arXiv:1908.05612.
[35] (2018) Automatic ship detection in remote sensing images from Google Earth of complex scenes based on multiscale rotation dense feature pyramid networks. Remote Sensing 10 (1), pp. 132.
[36] (2019) SCRDet: towards more robust detection for small, cluttered and rotated objects. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8232–8241.
[37] (2018) Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4203–4212.
[38] (2019) Objects as points. arXiv preprint arXiv:1904.07850.
[39] (2019) Bottom-up object detection by grouping extreme and center points. pp. 850–859.
[40] (2017) EAST: an efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.