Automatic ship detection in optical remote sensing images has important applications in a range of tasks, from maritime surveillance and seaborne traffic service to fishery management and military reconnaissance. The increased resolution of available images in recent years makes it more attractive for such applications, and ship detection in high-resolution (HR) remote sensing images has attracted increasing attention from researchers.
In recent years, encouraged by the great success of convolutional neural networks (CNNs) and deep-learning-based object detection in natural images, many researchers have proposed similar methodologies for ship detection[1, 2, 3, 4]. However, unlike most objects in natural images, ship targets in remote sensing images are relatively small and less clear. In addition, they have special properties that make them harder to detect accurately than general objects, and they are easily degraded by mist, clouds and ocean waves, as well as affected by the complex background in harbors. Although current CNN-based ship detection methods achieve compelling results, there is still much room for improvement compared with their counterparts for object detection in natural images.
One special issue is that ships in remote sensing images may appear in notably diverse orientations, because of their narrow-rectangle shapes and the overhead viewpoint of imaging from the sky. This differs from general objects in natural images, which are taken from the ground. For those objects, horizontal bounding boxes are typically used and lead to good detection[5, 6, 7, 8, 9]. However, for narrow-rectangle ships in remote sensing images, detecting with horizontal bounding boxes irrespective of orientation would result in inaccurate detection when the ships are inclined in the image. Even worse, it could not separate individual ships (or would result in missed detections) when they are docked densely beside each other in the harbor. Moreover, the horizontal ground-truth boxes of inclined ships labeled for training sometimes contain too much background besides the ships themselves, which may mislead ship feature learning and extraction.
More recent research on detection algorithms focuses on the use of rotated bounding boxes (i.e., boxes that can be rotated according to the ship orientation) based on CNNs and deep learning[12, 10, 11]. Rotated bounding boxes are crucial for detecting ships in diverse orientations. However, most such algorithms are directly extended from the general ones with horizontal bounding boxes by introducing an additional orientation variable into the framework; the overall strategy remains almost unchanged, and all related variables (including the orientation, width, height and center location of the bounding box) are predicted together in one process. Considering more unknown variables at the same time increases the complexity. The current approaches, though effective, are sub-optimal for generating accurate rotated bounding boxes, because of the large uncertainty of the orientation and its influence on the other variables, which needs to be handled more properly.
Besides the issue of diverse orientations, other factors and related properties, such as insufficient image clarity, background clutter, relatively small sizes and shapes with large aspect ratios, make the detection of ships more challenging than that of general objects. Even well-tuned state-of-the-art object detection methods, for example[9, 13], cannot obtain very satisfactory detection results for ships in remote sensing images. Therefore, more dedicated and innovative treatments are necessary to improve both detection accuracy and reliability.
In this paper, a novel CNN-based algorithm is proposed to better detect ships in remote sensing images. For the detection of arbitrary-oriented ships, existing approaches have to predefine a set of anchors (also called default boxes) in various orientations, on the basis of which the ship orientation is then predicted together with the other unknown variables via one regression process. In contrast, we predict the orientation and the other variables independently by creating two regression branches based on CNN features with different characteristics. This brings multiple benefits, such as fewer anchors to define, easier training, and an increased capacity for accurate prediction of the orientation as well as the other variables. This approach, as will be shown, effectively improves the quality of rotated region proposals, ultimately contributing to better detection results with rotated bounding boxes.
Furthermore, we create a feature representation more suitable for ship detection, which is achieved via a novel feature pooling process on the proposals, named multilevel adaptive pooling. The contribution and novelty mainly lie in two aspects. (I) Unlike the typical ROI pooling performed in a fixed pattern, we develop a shape-adaptive pooling which obtains better spatially distributed features for the detection of ships with various aspect ratios. (II) We are the first to incorporate multilevel features via the process of pooling, creating a compact feature representation better suited to the joint tasks of classification and regression in ship detection. Our approach differs from other detection algorithms that use multilevel features or a feature pyramid. It is motivated by the observation that higher-level features are better at object-level classification, while lower-level ones are more useful for accurate localization. We integrate them into one representation with a spatially variant pooling, which makes it possible to take full advantage of both at the same time.
Our method achieves a considerable boost in detection performance compared with state-of-the-art CNN-based methods, without noticeably increasing network complexity. Experimental results on a variety of images show that it not only generates more accurate rotated bounding boxes, but also effectively reduces both false and missed detections. Moreover, we performed a detailed quantitative ablation analysis of the proposed techniques, which gives a more comprehensive understanding of why they work well and offers some insight into the proper use of multilevel CNN features (including the feature representation) to improve detection performance for ship targets in remote sensing images.
The rest of this paper is organized as follows. The related work is introduced in Section II. Section III describes the details of the proposed ship detection method. Experimental results and detailed comparisons are shown in Section IV to verify the superiority of our method. Finally, conclusions are drawn in Section V.
II Related Work
Ship detection has long been an active research topic in remote sensing. Most early ship detection methods rely on geometric elements and manually designed features to separate ships from the background. For instance, some methods use contour and line-segment features for inshore ship detection, and a hierarchical ship detection method has been proposed using shape and texture features. Generally, detection with these basic geometric features is susceptible to interference from complex backgrounds. Other methods exploit the more prominent features of the ship head for preliminary localization. In one such method, the regions of potential ship heads are first predicted by transforming local pixels into the polar coordinate system, after which the saliency of directional gradient information is employed to identify the ship body. In another, the ship heads are detected by corner features, and shape analysis and region growing are then used to determine the complete ship region. A further method first determines the potential ship regions by saliency segmentation and then uses structure-LBP features to identify the real ships. Since manually designed features capture only low-level information and generalize poorly, these methods often suffer from the influence of complex backgrounds, resulting in false or missed detections.
Features extracted by convolutional neural networks carry stronger semantic information than manually designed features, which makes them very useful for object detection in complex environments. Current CNN-based object detection methods are generally divided into two categories. Methods in the first category first generate a number of region proposals; classification and regression are then performed on each proposal to predict the bounding box of the object[5, 9, 14]. Faster R-CNN is one of the most representative methods in this category. It integrates CNN feature extraction, proposal generation and the subsequent detection steps into a unified network. Methods in the second category treat object detection as a one-step regression and classification process by discarding the first step of region proposal generation[6, 7, 8, 13]. For example, SSD (single shot multibox detector) directly predicts the bounding boxes together with probability scores based on densely distributed default boxes. YOLO (you only look once) divides the image into several grids and sets two default boxes at the center location of each grid to perform region classification and bounding box regression. In general, the one-step methods achieve lower detection accuracy than the proposal-based methods. Nevertheless, YOLOv3 (the latest version of YOLO) still achieves impressive performance after several rounds of upgrades.
In the past few years, many CNN-based algorithms have been proposed specifically to detect ships or other objects in remote sensing images. For example, a coarse-to-fine CNN-based method was proposed to better detect diverse objects, including inshore ships, in remote sensing images. In another work, the SVD algorithm is used to construct a more compact and efficient CNN structure for ship detection, in which the candidate regions are extracted through multi-scale features. To improve localization accuracy, iterative bounding box regression has been used in a CNN-based method for ship detection. An end-to-end detection method for ships of various scales was proposed based on a region proposal network and multi-scale feature mapping through hierarchical selection. Deng et al. presented a more sophisticated detection algorithm for multi-scale objects in remote sensing images, using multi-scale region proposals coupled with multi-scale feature maps. Such an algorithm, however, comes at a high computational cost. Although the above CNN-based detection methods typically produce good results, they are unable to accurately locate arbitrary-oriented ships or to distinguish inclined ones that are docked closely beside each other. To alleviate this problem, one method uses Soft-NMS as a post-processing step to preserve highly overlapped boxes in the detection results. In another, the ROIs are rotated according to their main directions before detection is performed. However, neither the post-processing nor the preprocessing approach can fully resolve the above problem.
Recently, some researchers have tried to detect arbitrary-oriented ships in remote sensing images more accurately and reliably with rotated bounding boxes. In one method, an angle parameter is introduced to define the orientation of the rotatable bounding box, and its estimation is performed within the typical one-step detection framework. Similarly, another method generates the rotated bounding box by taking the additional angle information into account in a one-step regression process based on the framework of YOLOv2. One limitation of those methods is that one-step regression is inadequate for accurately estimating the orientation of the rotated bounding box, because of its large uncertainty and the larger number of unknown variables that have to be considered in the bounding box regression. To overcome this limitation, many proposal-based detection methods have been proposed. For instance, one method produces detection results with rotated bounding boxes based on the framework of Faster R-CNN, in which the region proposal network and ROI pooling are generalized to their rotated versions. In another, the features generated from a densely connected version of the feature pyramid are used for producing the rotated region proposals and the final bounding boxes. A similar detection approach employs an alternative feature-pyramid-based structure to extract multi-scale context features. To segment ship targets in remote sensing images, a further method introduces a rotated region proposal layer into the framework of Mask R-CNN and uses an adapted pooling approach, modified from ROI-Align, for feature extraction within the proposals.
Despite the effectiveness of these detection algorithms, their rotated region proposal networks are directly extended from the normal horizontal ones by introducing an additional orientation variable, and all unknown variables are predicted indiscriminately in one process. This limits the quality of the generated region proposals and thus affects the final detection results. Another essential issue is the CNN features that are directly used for ship detection. Although sophisticated approaches, for example a densely connected feature pyramid or complicated hierarchical connections, can improve the representation power, they are highly complex and produce features of high dimensionality, which sacrifices compactness and may increase the risk of over-fitting for ship targets. In this paper, these problems are resolved by our method, and a considerable boost in detection performance is obtained according to the experimental results.
III The Proposed Method
The overall framework of our method is shown in Fig. 1. In addition to the typical backbone network, there are two distinctive networks in the proposed algorithm, i.e., dual-branch regression for region proposal and multilevel adaptive pooling. The dual-branch regression, which consists of two independent regression branches (i.e., orientation-agnostic regression and orientation regression), is used to generate high-quality rotated region proposals. Based on these proposals, multilevel adaptive pooling produces a feature representation better suited to accurate ship detection in remote sensing images. In the following, we describe them, as well as other relevant parts of the proposed algorithm, in detail.
III-A Rotated Region Proposal Network
In many CNN-based detection algorithms, a number of object region proposals are first generated by the region proposal network (RPN). In the absence of any prior information about the object’s location and size in the image, the RPN predefines a set of anchors centered on each location of the feature map. Each anchor provides a coarse initial bounding box of the object, typically denoted by $(x, y, w, h)$, in which $(x, y)$ represents its center point while $w$ and $h$ represent the width and height, respectively. The refined boxes for object region proposal are then obtained from those anchors by prediction with the corresponding CNN features. Recently, it has been discovered that using rotated bounding boxes can effectively improve ship detection performance in remote sensing images[12, 34]. In this case, the bounding box is expressed as $(x, y, w, h, \theta)$, where the additional variable $\theta$ represents the ship orientation. However, this is more complex than using typical horizontal bounding boxes, not only because of the increased number of unknown variables, but also because the ship orientation can be arbitrary in images[10, 11]. To resolve this problem, current methods have to define a set of anchors in various orientations and predict all five variables together in one process. One obvious problem is that this increases the burden on the shared CNN features, and consequently limits the potential to acquire more accurate predictions for individual variables.
Fortunately, for a specific ship in remote sensing images, the shape of its bounding box (determined by the variables $w$ and $h$) can be assumed to be rotationally invariant (see Fig. 2). This indicates that it is possible to predict $(w, h)$ and $\theta$ independently with separate and more effective approaches. Therefore, unlike existing RPNs that are built on one set of shared CNN features, we construct two regression branches to predict $(w, h)$ (along with the center point $(x, y)$) and $\theta$, respectively (see Fig. 1). Each branch generates particular CNN features that are more suitable for the corresponding regression task.
Orientation-agnostic regression: A rotation-invariant feature module is constructed to predict $(x, y, w, h)$ irrespective of the ship orientation $\theta$. The benefit of this module is twofold: first, it helps to decouple the influence of $\theta$ during feature learning and produces more capable features that better predict $(x, y, w, h)$ independently; second, ships in any orientation can all be fully used to train the features without the need to differentiate their orientations, and there is also no requirement for training samples in diverse orientations, in contrast to current methods[24, 34, 29].
Specifically, the rotation-invariant feature module consists of multi-oriented response layers followed by orientation pooling to produce rotation-invariant features. The multi-oriented response layers are constructed by leveraging Active Rotating Filters (ARFs). An ARF is a filter bank containing an initial convolutional filter $F_0$ and its $N-1$ clockwise rotated versions $F_1, \dots, F_{N-1}$. Given an input $I$, using the ARF we can produce a feature map $M$ consisting of $N$ channels of oriented convolutional response:

$$M^{(k)} = F_k * I, \quad k = 0, 1, \dots, N-1, \tag{1}$$

where $*$ is the convolution operator, and $M^{(k)}$ denotes each channel of $M$. Similarly, a set of different feature maps can be produced by using ARFs with differing initial filters. Next, in order to obtain rotation-invariant feature maps, orientation pooling is performed on each feature map by picking the maximum oriented response (i.e., the maximum value) at each location among its $N$ channels.
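As a rough illustration of the multi-oriented response and the orientation pooling, the NumPy sketch below assumes a single-channel input and only 4 orientations (multiples of 90°), so that each rotated filter can be produced exactly with `np.rot90`; actual ARFs use finer rotations obtained by interpolation. The function names are illustrative and are not taken from any library.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def arf_responses(image, base_filter, n_orient=4):
    """Correlate `image` with clockwise-rotated copies of `base_filter`.

    Returns an array of shape (n_orient, H - k + 1, W - k + 1), i.e. one
    oriented response channel per rotated filter.
    """
    k = base_filter.shape[0]
    windows = sliding_window_view(image, (k, k))      # (H-k+1, W-k+1, k, k)
    channels = []
    for r in range(n_orient):
        rotated = np.rot90(base_filter, -r)           # negative count = clockwise
        channels.append(np.einsum("ijkl,kl->ij", windows, rotated))
    return np.stack(channels)


def orientation_pooling(responses):
    """Keep the maximum oriented response at each location."""
    return responses.max(axis=0)


rng = np.random.default_rng(0)
image = rng.standard_normal((9, 9))
base_filter = rng.standard_normal((3, 3))

pooled = orientation_pooling(arf_responses(image, base_filter))
pooled_rot = orientation_pooling(arf_responses(np.rot90(image), base_filter))
```

In this simplified setting, rotating the input by 90° simply rotates the pooled map, whereas the un-pooled channels are cyclically shifted (and rotated), so they remain sensitive to orientation.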
Compared with the straightforward way of data augmentation, i.e., rotating the training samples, which often relies on rich convolutional filters to achieve the invariance/tolerance to rotation, using ARFs requires fewer network parameters due to the weight sharing in each ARF, and also decreases both the training cost and the risk of over-fitting. Moreover, the ARF explicitly encodes the orientation-related information into the feature map via the multi-oriented response, which would be beneficial for the orientation regression described below.
Orientation regression: This branch focuses on the prediction of the ship orientation $\theta$. To this end, rotation-sensitive features are used, by sharing the feature maps produced by the ARFs mentioned above. An obvious advantage of such feature maps is that, under rotation, direct changes appear across the orientation channels, which helps the subsequent orientation-offset prediction performed by this branch. Another advantage is that they actively comprise the feature responses in various orientations, which makes it easier to predict arbitrary ship orientations through regression without extensive training on a large variety of orientation samples.
Although it may theoretically be possible to predict the orientation through direct regression, we found that the training is hard to converge and the expected performance cannot be achieved in the test, because the range of the unknown orientation can be as large as $180^\circ$ (two opposite orientations are regarded as the same one). To enable a feasible regression, we resort to predicting the angle offset relative to some predefined orientations, which can be much smaller. In our implementation, we define 6 fixed orientations $\theta_a$ that are evenly distributed with an interval of $30^\circ$. The regression is then performed to predict each of the corresponding angle offsets $t_\theta$, and the final results are obtained by correcting $\theta_a$ with $t_\theta$.
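The offset-based orientation prediction can be sketched as follows. This is only an illustration under stated assumptions: angles are in degrees, the 6 predefined orientations are assumed to start at 0°, and the helper names are hypothetical.

```python
import numpy as np

# Assumption: the 6 predefined orientations start at 0 degrees.
PREDEFINED = np.arange(6) * 30.0          # [0, 30, 60, 90, 120, 150]


def angular_diff(a, b):
    """Signed difference a - b in degrees, wrapped to [-90, 90).

    Two opposite orientations are treated as the same one, so the
    period is 180 degrees rather than 360.
    """
    return (a - b + 90.0) % 180.0 - 90.0


def encode(theta):
    """Return (index of the nearest predefined orientation, angle offset)."""
    diffs = angular_diff(theta, PREDEFINED)
    k = int(np.argmin(np.abs(diffs)))
    return k, float(diffs[k])


def decode(k, offset):
    """Correct the k-th predefined orientation with a predicted offset."""
    return float((PREDEFINED[k] + offset) % 180.0)
```

With this layout the offset to the nearest predefined orientation never exceeds 15° in magnitude, which is far smaller than the full 180° range that a direct regression would have to cover.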
The orientation-agnostic regression branch predicts $(x, y, w, h)$ based on orientation-agnostic anchors (i.e., the horizontal anchors illustrated in Fig. 1). In our implementation we predefine 8 anchors with 4 scales and 2 aspect ratios. This branch predicts the offsets relative to each anchor with an $8 \times 4$-dimensional output layer. Similarly, the output layer for the orientation regression branch is $6 \times 1$-dimensional. Thus, the total number of regression outputs is 38. As a comparison, current approaches[12, 10, 11, 34, 29] that predict all five variables together need to generate $2 \times 4 \times 6 = 48$ anchors that differ in scale, aspect ratio or orientation. In this case there would be $48 \times 5 = 240$ regression outputs in total.
In addition to the regression branches, a classification branch is used to output the objectness scores for the 6 fixed-orientation clones of each orientation-agnostic anchor, among which only the one with the highest score is selected to be corrected with the corresponding offsets $t$ and $t_\theta$.
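A minimal sketch of how the three branch outputs at one feature-map location might be combined into rotated proposals is given below. The array layouts, the Faster R-CNN style box decoding, the 0° starting orientation and the function name are all assumptions made for illustration.

```python
import numpy as np

N_ANCHORS, N_ORIENT = 8, 6
PREDEFINED = np.arange(N_ORIENT) * 30.0    # assumed to start at 0 degrees


def assemble_proposals(anchors, box_offsets, angle_offsets, scores):
    """Combine the branch outputs at one feature-map location.

    anchors:       (8, 4) horizontal anchors as (x, y, w, h)
    box_offsets:   (8, 4) output of the orientation-agnostic branch
    angle_offsets: (6,)   output of the orientation branch
    scores:        (8, 6) objectness of each fixed-orientation clone
    Returns (8, 5) proposals (x, y, w, h, theta) and their (8,) scores.
    """
    xa, ya, wa, ha = anchors.T
    tx, ty, tw, th = box_offsets.T
    # Assumed Faster R-CNN style decoding of the box offsets.
    x, y = xa + tx * wa, ya + ty * ha
    w, h = wa * np.exp(tw), ha * np.exp(th)
    best = scores.argmax(axis=1)            # best-scoring clone per anchor
    theta = (PREDEFINED[best] + angle_offsets[best]) % 180.0
    return np.stack([x, y, w, h, theta], axis=1), scores.max(axis=1)


# Regression outputs per location: 8 * 4 + 6 * 1 = 38, compared with
# 2 * 4 * 6 * 5 = 240 when all five variables are regressed per oriented anchor.
n_outputs_dual = N_ANCHORS * 4 + N_ORIENT * 1
n_outputs_joint = 2 * 4 * 6 * 5
```

Note that the angle offsets are shared by all anchors at a location in this sketch, which is what keeps the orientation output 6-dimensional.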
Loss function and training: For training the proposed RPN, the multi-task loss is defined as follows:

$$L = L_{cls} + \lambda \left( l_a^* L_{reg} + l_\theta^* L_{ori} \right), \tag{2}$$

where $L_{cls}$, $L_{reg}$ and $L_{ori}$ are the classification loss, orientation-agnostic regression loss and orientation regression loss, respectively; $l_a^*$ and $l_\theta^*$ denote the ground-truth labels for the orientation-agnostic anchor and predefined orientation, respectively. The classification and regression terms are balanced by the parameter $\lambda$ (set to a fixed default value in our method).
We assign the label $l_a^*$ for orientation-agnostic anchors based on their Intersection-over-Union (IoU) overlap with the ground-truth boxes, as in typical approaches[12, 11, 29]. However, in our case we need to define an alternative IoU, named the orientation-agnostic IoU, to achieve this goal. The orientation-agnostic IoU is computed without taking into account the orientation difference between the anchor and ground-truth box, i.e., they are aligned in orientation when the IoU is computed (see Fig. 3(a)). In our method, two kinds of orientation-agnostic anchors are assigned the positive label ($l_a^* = 1$): (I) the orientation-agnostic anchor or anchors having the highest orientation-agnostic IoU overlap with a ground-truth box, or (II) an orientation-agnostic anchor having an orientation-agnostic IoU overlap higher than 0.7 with any ground-truth box. The others are assigned the negative label ($l_a^* = 0$), so that their corresponding regression loss is disabled in Eq. (2). For a similar purpose, we assign the positive label ($l_\theta^* = 1$) to a predefined orientation if its deviation from the orientation of any ground-truth box at the location is smaller than a preset angular threshold, and assign the negative label ($l_\theta^* = 0$) otherwise.
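The positive-label rules for orientation-agnostic anchors could be sketched as below. The sketch assumes that aligning the two boxes in orientation means rotating each about its own center, so that the overlap reduces to an axis-aligned IoU on $(x, y, w, h)$; that reading of Fig. 3(a), and the function names, are assumptions.

```python
import numpy as np


def orientation_agnostic_iou(anchor, gt):
    """IoU of two boxes after aligning their orientations.

    Assumption: each box is rotated about its own center, so only
    (x, y, w, h) matter and the angle of `gt` is ignored.
    """
    ax, ay, aw, ah = anchor[:4]
    gx, gy, gw, gh = gt[:4]
    iw = max(0.0, min(ax + aw / 2, gx + gw / 2) - max(ax - aw / 2, gx - gw / 2))
    ih = max(0.0, min(ay + ah / 2, gy + gh / 2) - max(ay - ah / 2, gy - gh / 2))
    inter = iw * ih
    return inter / (aw * ah + gw * gh - inter)


def positive_labels(anchors, gts, threshold=0.7):
    """Return (labels, iou): label 1 for anchors meeting rule (I) or (II)."""
    iou = np.array([[orientation_agnostic_iou(a, g) for g in gts]
                    for a in anchors])
    above = (iou > threshold).any(axis=1)                # rule (II)
    best = iou.max(axis=0)                               # best IoU per ground truth
    is_best = ((iou == best) & (best > 0)).any(axis=1)   # rule (I), ties included
    return (above | is_best).astype(int), iou
```

For example, a horizontal anchor `(1, 0, 8, 2)` and a ground-truth box `(0, 0, 8, 2, 30.0)` overlap with an orientation-agnostic IoU of 14/18, so the anchor is labeled positive even though the ground-truth box is inclined.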
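The regression outputs the offsets $t = (t_x, t_y, t_w, t_h)$ and $t_\theta$ from the two independent branches, respectively, and we have:

$$t_x = \frac{x - x_a}{w_a}, \quad t_y = \frac{y - y_a}{h_a}, \quad t_w = \log\frac{w}{w_a}, \quad t_h = \log\frac{h}{h_a},$$

$$t_\theta = \theta - \theta_a,$$

in which $x_a$, $x$ are for the orientation-agnostic anchor and the predicted box, respectively (likewise for $y$, $w$, $h$), and $\theta_a$, $\theta$ denote the predefined and predicted orientations, respectively. Given the ground-truth offsets $t^*$ and $t_\theta^*$, we employ the smooth-$L_1$ loss for the regression:

$$L_{reg}(t, t^*) = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t_i - t_i^*), \qquad L_{ori}(t_\theta, t_\theta^*) = \mathrm{smooth}_{L_1}(t_\theta - t_\theta^*),$$

in which $t^*$ and $t_\theta^*$ are computed in the same way as $t$ and $t_\theta$, with $x$, $y$, $w$, $h$ and $\theta$ replaced by $x^*$, $y^*$, $w^*$, $h^*$ and $\theta^*$, which denote the corresponding ground-truth values.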
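For concreteness, the offset encoding and the smooth-$L_1$ loss might be implemented as in the sketch below. It assumes the standard Faster R-CNN parameterization for the box terms and a plain angular difference in degrees (wrapped to a 180° period) for the orientation term; the exact normalization used by the method may differ, and the function names are illustrative.

```python
import numpy as np


def encode_box(box, anchor):
    """Offsets of `box` (x, y, w, h) relative to `anchor` (xa, ya, wa, ha)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return np.array([(x - xa) / wa, (y - ya) / ha,
                     np.log(w / wa), np.log(h / ha)])


def encode_angle(theta, theta_a):
    """Angle offset in degrees, wrapped to [-90, 90).

    The wrap reflects that two opposite orientations are the same one.
    """
    return (theta - theta_a + 90.0) % 180.0 - 90.0


def smooth_l1(pred, target):
    """Sum of smooth-L1 terms: 0.5 * d^2 if |d| < 1, else |d| - 0.5."""
    d = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    return float(np.sum(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)))
```

Ground-truth offsets are obtained by passing the ground-truth box and orientation to the same encoders, and the two losses are then `smooth_l1` applied to the box offsets and to the angle offset, respectively.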
$L_{cls}$ calculates the classification loss for the oriented anchor obtained from the combination of the given orientation-agnostic anchor and predefined orientation $\theta_a$. $L_{cls}$ is defined as:

$$L_{cls}(p, l^*) = -(1 - p_t)^{\gamma} \log p_t, \qquad p_t = \begin{cases} p, & l^* = 1, \\ 1 - p, & l^* = 0, \end{cases} \tag{7}$$

in which $l^*$ denotes the ground-truth label for the oriented anchor ($l^* = 1$ if it is positive, and $l^* = 0$ otherwise), and $p$ is the predicted probability for it. Different from the widely used cross-entropy loss, there is an additional modulating factor $(1 - p_t)^{\gamma}$ in Eq. (7). The modulating factor down-weights the loss of well-classified examples with large $p_t$, and thus focuses the loss on hard samples that have been misclassified (a fixed value of $\gamma$ is used in our implementation).
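The effect of the modulating factor can be checked with a small sketch of a focal-style loss. The default `gamma=2.0` below is a common choice in the focal loss literature and is an assumption here, not a value taken from this paper.

```python
import math


def focal_loss(p, label, gamma=2.0):
    """Focal-style loss for one oriented anchor.

    p:     predicted probability of being a ship, strictly in (0, 1)
    label: 1 for a positive anchor, 0 for a negative one
    gamma: focusing parameter (assumed default; gamma = 0 gives cross entropy)
    """
    p_t = p if label == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = focal_loss(0.9, 1)    # well-classified positive
hard = focal_loss(0.1, 1)    # misclassified positive
```

With `gamma=2.0`, the well-classified example is scaled by 0.01 relative to its cross-entropy loss, while the misclassified one keeps 81% of it, so the hard example dominates the training signal.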
To assign the ground-truth label for the oriented anchor, typical approaches[12, 11, 29] use the standard IoU, which is calculated from the direct overlap between the oriented anchor and ground-truth box (see Fig. 3(b)). However, the standard IoU is sensitive to the orientation deviation between the oriented anchor and ground-truth box because of their narrow-rectangle shapes. As shown in Fig. 3(b), even a small orientation deviation can make the standard IoU drop to 0.35, whereas the orientation-agnostic IoU is 0.74 (see Fig. 3(c)). Because of this, the standard IoU is not as effective as expected in determining the ground-truth label for the oriented anchors of ship targets. Therefore, instead of using the standard IoU, we assign the ground-truth label by considering the orientation-agnostic IoU (denoted by $\mathrm{IoU}_{oa}$) and the orientation deviation separately. In our method, the criteria for identifying a positive oriented anchor are $\mathrm{IoU}_{oa} > 0.7$ (or having the highest $\mathrm{IoU}_{oa}$ with a ground-truth box) and an orientation deviation below the preset angular threshold. This is equivalent to using the product of the two ground-truth labels, $l_a^* l_\theta^*$, to determine the positive and non-positive oriented anchors. Among the non-positive oriented anchors, only the definitely negative ones, i.e., those with a sufficiently low $\mathrm{IoU}_{oa}$ or a sufficiently large orientation deviation, contribute to the related training objective.
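The sensitivity of the standard IoU to orientation can be reproduced with a small pure-Python sketch that intersects two rotated rectangles by Sutherland-Hodgman clipping. The box size (7 by 1) and the 15° deviation are illustrative values chosen here, not the ones behind Fig. 3.

```python
import math


def corners(box):
    """Counter-clockwise corners of a rotated box (x, y, w, h, degrees)."""
    x, y, w, h, deg = box
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    local = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(x + c * px - s * py, y + s * px + c * py) for px, py in local]


def _side(a, b, p):
    """Positive if p lies to the left of the directed edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def _clip(subject, clipper):
    """Clip a polygon against a convex, counter-clockwise clip polygon."""
    out = subject
    for i in range(len(clipper)):
        a, b = clipper[i], clipper[(i + 1) % len(clipper)]
        inp, out = out, []
        for j in range(len(inp)):
            p, q = inp[j - 1], inp[j]
            sp, sq = _side(a, b, p), _side(a, b, q)
            if (sp >= 0) != (sq >= 0):            # the edge p -> q crosses a -> b
                t = sp / (sp - sq)
                out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
            if sq >= 0:                           # q is on the inner side
                out.append(q)
    return out


def _area(poly):
    """Polygon area by the shoelace formula (0 for an empty polygon)."""
    return 0.5 * abs(sum(poly[i - 1][0] * poly[i][1] - poly[i][0] * poly[i - 1][1]
                         for i in range(len(poly))))


def rotated_iou(a, b):
    """Standard IoU between two rotated boxes (x, y, w, h, degrees)."""
    inter = _area(_clip(corners(a), corners(b)))
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)


iou_tilted = rotated_iou((0, 0, 7, 1, 0), (0, 0, 7, 1, 15))
```

Here the two boxes share the same center and size, so their orientation-agnostic IoU is 1, yet the standard IoU in `iou_tilted` falls below 0.5 for a deviation of only 15°.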
III-B Multilevel Adaptive Pooling
In the second stage, the proposals are further refined based on the features from each proposal. For this purpose, the first step is to pool the features in different proposals to a fixed size, usually termed ROI (region of interest) pooling[9, 11, 34, 37]. However, to obtain a fixed size of features, the typical ROI pooling uniformly outputs the same number of feature samples along the width and height directions regardless of the aspect ratio of the proposal. For ship region proposals, which are mostly narrow rectangles (i.e., with a large aspect ratio), this results in severely uneven pooling of features along the two directions, i.e., one is extremely dense while the other is very sparse (Fig. 4(a)). The resulting feature distribution, concentrated along the narrow side, is clearly unfavorable to accurate ship detection. The ROI pooling designed for general object detection, however, does not take this problem into account. In this paper, we propose an adaptive pooling that produces a feature representation better distributed according to the shape of the proposal (Fig. 4(b)). This approach works particularly well in our method, since the proposed RPN provides high-quality proposals for ship targets.
Furthermore, we propose to incorporate multilevel features through the pooling, with the aim of generating features better suited to the subsequent joint tasks of classification and regression. Generally, high-level convolutional features are less sensitive to intra-class variations, including geometric distortion and small shifts, which is beneficial for the classification task. To a certain extent, however, this contradicts the demand of regression in object detection, whose task is to locate the object’s bounding box as accurately as possible. In contrast, low-level features encoded in the shallower layers of the convolutional network, such as edges, texture and corners, are more spatially sensitive. They are better at localization but less semantically meaningful for object-level classification. Although previous work[24, 13, 29] also recognized the importance of using multilevel features for object detection, the underlying motivation and purpose in this paper are quite different. To the best of our knowledge, we are the first to advocate incorporating multilevel features to promote simultaneous object localization and classification, and to achieve this goal via a novel pooling process.
More specifically, unlike the conventional pooling that is conducted in rows and columns within the proposal, we perform a ring-like pooling as shown in Fig. 5. It arranges the pooled features spatially along different square-shaped circles (denoted by the enclosed dashed lines in Fig. 5). The basic idea is to pool the feature samples along the outer circles from lower-level feature maps, while the inner feature samples come from higher-level feature maps. Such spatially variant multilevel pooling helps to create a compact representation that takes full advantage of both lower-level and higher-level features at the same time. This is based on the consideration that, within the region of an object, positions farther from the center are typically more sensitive to the object’s geometric transformation (e.g., in scale, aspect ratio and orientation). At those positions the lower-level features are preferred, which helps to maximize their advantage in object localization. For relatively indiscernible ships in remote sensing images, in particular, the only prominent cues for accurate localization are usually the low-level boundary features, which are expected to be captured by the outermost circle of pooling. Conversely, the higher-level features, which have larger receptive fields and focus more on object-level classification, should be pooled from the inner areas.
Along each circle, the aforementioned shape-adaptive pooling is performed. Specifically, instead of dividing the neighborhood into bins and aggregating the feature values of each bin using a max or average operation as usual, we only sample the feature value at the center of each bin by bilinear interpolation. This leads to a set of sampled points evenly spaced along each circle, as shown in Fig. 5(a) (denoted by the brown dots). The underlying consideration is that interpolating a single value at the center is nearly as effective as max or average pooling, while the calculation is considerably simplified and (like ROI-Align) avoids the harsh quantization that introduces harmful misalignment between the proposal and the extracted features. This process always outputs feature values of a fixed size (see Fig. 5(b)), organized as three groups of values that are sampled from the three different circles, respectively. The spacing between neighboring sampled points along each circle depends on the circle’s perimeter, and the distribution of each group of sampled points varies according to its aspect ratio. It is this sampling along a circle that makes the pooling shape-adaptive. To ensure an even distribution of the three circles within the proposal, in our implementation their widths and heights are set to fixed fractions of those of the proposal.
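A minimal NumPy sketch of the ring-like sampling is given below. The ring scales, the number of points per ring, the requirement that all feature maps share one resolution, and the function names are assumptions made for illustration; proposals are given as $(x, y, w, h, \theta)$ in feature-map coordinates with $\theta$ in degrees.

```python
import numpy as np


def ring_points(width, height, n_points):
    """Bin centers evenly spaced by arc length along a centered rectangle."""
    perimeter = 2.0 * (width + height)
    pts = np.empty((n_points, 2))
    for i in range(n_points):
        d = (i + 0.5) * perimeter / n_points
        if d < width:                                   # top edge
            pts[i] = (-width / 2 + d, -height / 2)
        elif d < width + height:                        # right edge
            pts[i] = (width / 2, -height / 2 + (d - width))
        elif d < 2 * width + height:                    # bottom edge
            pts[i] = (width / 2 - (d - width - height), height / 2)
        else:                                           # left edge
            pts[i] = (-width / 2, height / 2 - (d - 2 * width - height))
    return pts


def bilinear(feature, x, y):
    """Bilinearly interpolate a 2-D feature map at the point (x, y)."""
    h, w = feature.shape
    x, y = np.clip(x, 0, w - 1), np.clip(y, 0, h - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    return ((1 - ax) * (1 - ay) * feature[y0, x0] + ax * (1 - ay) * feature[y0, x1]
            + (1 - ax) * ay * feature[y1, x0] + ax * ay * feature[y1, x1])


def multilevel_ring_pool(features, proposal, ring_scales, points_per_ring):
    """Sample one ring per feature level, ordered from outer to inner.

    features: 2-D maps ordered from lower level (outer ring) to higher level.
    """
    cx, cy, w, h, deg = proposal
    c, s = np.cos(np.radians(deg)), np.sin(np.radians(deg))
    pooled = []
    for feat, scale, n in zip(features, ring_scales, points_per_ring):
        local = ring_points(w * scale, h * scale, n)
        xs = cx + c * local[:, 0] - s * local[:, 1]     # rotate, then translate
        ys = cy + s * local[:, 0] + c * local[:, 1]
        pooled.append(np.array([bilinear(feat, x, y) for x, y in zip(xs, ys)]))
    return pooled
```

Because the points are spaced by arc length, a narrow proposal receives most of its samples along its long sides, which is the shape adaptation described above.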
The above pooling shares partly a similar goal with the deformable ROI pooling, i.e., augmenting the distribution of sampling locations within the proposal. Our approach is however based on a readily assigned sampling pattern without the learning and computation of the offsets (which may vary significantly for different instances), and able to properly accommodate multilevel features.
As mentioned above, the input feature maps for different circles should also be different. However, we found that directly feeding in raw feature maps from different levels of a deep ConvNet degrades performance, probably because of the large semantic gaps between levels. To address this problem, as shown in Fig. 1, we ‘smooth’ the gaps through a convolution between the higher and lower levels (prior to which the higher level needs to be upsampled by 2x).
III-C Implementation Details
We use all convolutional layers in VGG16-Net as the backbone network, which is followed by the proposed RPN and multilevel adaptive pooling. There are 3 successive multi-oriented response layers in our RPN, each with 40 Active Rotating Filters for 8 orientations. The four scales of the orientation-agnostic anchor are 32, 64, 128 and 256, and the two aspect ratios are 1:4 and 1:7. Up to 2000 and 300 rotated region proposals generated by the proposed RPN are retained for training and test, respectively. Similar to some other methods[37, 40], we enlarge both the width and height of the generated proposals by a factor of 1.2 to use more contextual information. Next, multilevel adaptive pooling outputs a fixed-size feature map for each rotated region proposal. To facilitate the pooling operation, the multilevel features are unified to the same size after being smoothed. In our implementation, we achieve this with a reshape operation that converts larger feature maps to a set of smaller ones. To obtain the final refined bounding box, after the pooling, we use two 1024-d fully-connected (FC) layers to perform the classification and regression, like other proposal-based ship detection methods[11, 29]. In the end, NMS post-processing with a threshold of 0.2 is used to remove redundant detections.
IV-A Dataset and Training Details
We train and test the proposed method on a ship dataset with 1300 images, of which 1055 are from the HRSC2016 dataset. To show the performance of various detection methods on densely arranged ships, we added an additional 245 ship images taken from Google Earth. For HRSC2016, the image sizes range from to ( of which are larger than ), while the resolutions are between 2m and 0.4m. There are an average of 2.8 ships per image. Compared to the image size, the maximum, minimum and average sizes of the ships are , , and , respectively. In addition, the images taken by ourselves have a fixed size and resolution, which are and 1.5m, respectively. The average number of ships per image is 8.2, and the average size of these ships is .
For a fair comparison with the object detection methods that use horizontal bounding boxes, all the ships are annotated with both horizontal and rotated bounding boxes. The entire dataset is randomly divided into three parts (training set, validation set and test set) with a ratio of 6:1:3. Data augmentation is performed by flipping each image horizontally and vertically, which triples the number of training images.
Except for the ImageNet pre-trained backbone network, all learnable layers are initialized from a zero-mean Gaussian distribution with standard deviation 0.01. The input image is resized such that its shorter side has 600 pixels, while ensuring that the longer side does not exceed 1000 pixels. The network is trained with the Adam optimizer on a GTX 1080 Ti GPU with 2 images per mini-batch and a total of 256 anchors per image at a foreground-to-background ratio of 1:3. During training, we use a learning rate of 0.001 for 40k mini-batches, and 0.0001 for the next 40k mini-batches. We also use a momentum of 0.9 and a weight decay of 0.0005. Moreover, we use an IoU threshold of 0.6 for proposal classification. That is, proposals whose IoU overlap with any ground-truth box is larger than 0.6 are regarded as positive samples, and the others as negative samples. Compared with the typical value of 0.5, a higher threshold helps to preserve more accurate proposals for the detection.
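The 6:1:3 split and the flip augmentation can be sketched as follows. This is a minimal illustration that represents an image as a list of pixel rows; the function names and the fixed random seed are hypothetical, and the matching flips of the box annotations are omitted.

```python
import random


def split_dataset(items, ratios=(6, 1, 3), seed=0):
    """Randomly split items into training, validation and test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)   # seed is an assumption, for repeatability
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]       # remainder goes to the test set
    return train, val, test


def flip_augment(image):
    """Return the original image plus its horizontal and vertical flips.

    `image` is a list of rows. Bounding boxes would need the same flips,
    which this sketch leaves out.
    """
    horizontal = [row[::-1] for row in image]   # mirror left-right
    vertical = image[::-1]                      # mirror top-bottom
    return [image, horizontal, vertical]


if __name__ == "__main__":
    train, val, test = split_dataset(range(1300))
    print(len(train), len(val), len(test))
    print(flip_augment([[1, 2], [3, 4]]))
```

For the 1300 images mentioned above, this split gives 780 training, 130 validation and 390 test images, and the augmentation turns each training image into three.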
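The resizing rule, the two-stage learning rate and the proposal labelling described above can be written compactly as below. This is a sketch under the assumption that the resize keeps the aspect ratio and that mini-batches are indexed from zero; the function names are hypothetical.

```python
def resize_scale(width, height, short_target=600, long_cap=1000):
    """Scale factor that brings the shorter side to `short_target` pixels
    unless that would push the longer side beyond `long_cap` pixels."""
    short_side, long_side = min(width, height), max(width, height)
    return min(short_target / short_side, long_cap / long_side)


def learning_rate(step):
    """Step schedule from the text: 0.001 for the first 40k mini-batches,
    then 0.0001 for the next 40k. `step` is a 0-based mini-batch index."""
    if not 0 <= step < 80_000:
        raise ValueError("training runs for 80k mini-batches")
    return 0.001 if step < 40_000 else 0.0001


def label_proposal(max_iou, threshold=0.6):
    """Return 1 (positive) if the best IoU with any ground-truth box
    exceeds the threshold, else 0 (negative)."""
    return 1 if max_iou > threshold else 0


if __name__ == "__main__":
    print(resize_scale(1200, 800))    # shorter side governs
    print(resize_scale(2000, 600))    # longer-side cap governs
    print(learning_rate(0), learning_rate(40_000))
    print(label_proposal(0.55), label_proposal(0.65))
```

A proposal with an IoU of 0.55 would count as positive under the typical 0.5 threshold but is labelled negative here, which is the stricter selection the text refers to.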
IV-B Experimental Analysis
We mainly use the average precision (AP) and the precision-recall curve to evaluate the performance of different methods. AP is the precision averaged over different recall levels. Precision and recall are defined as follows:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
Here, $TP$, $FP$ and $FN$ denote the numbers of true positives, false positives and false negatives, respectively. A true positive means that the IoU overlap between the predicted bounding box and the ground-truth box is higher than 0.5.
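The precision and recall definitions above, together with one common way of averaging precision over recall levels, can be sketched as follows. The all-point interpolated AP used here is an assumption, since the exact AP protocol is not specified in this section; the function names and the toy detections are illustrative only.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(scored_hits, num_gt):
    """All-point interpolated AP (assumed protocol).

    scored_hits: list of (confidence, is_true_positive) per detection,
                 where a true positive already satisfies IoU > 0.5.
    num_gt:      number of ground-truth ships.
    """
    ranked = sorted(scored_hits, key=lambda d: d[0], reverse=True)
    precisions, recalls = [], []
    tp = fp = 0
    for _, hit in ranked:
        if hit:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    # Interpolate: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum precision weighted by each increase in recall.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


if __name__ == "__main__":
    print(precision_recall(tp=8, fp=2, fn=2))
    toy = [(0.9, True), (0.8, False), (0.7, True)]
    print(average_precision(toy, num_gt=2))
```

In the toy case, the false positive ranked second lowers the precision at full recall to 2/3, so the AP is 0.5 x 1 + 0.5 x 2/3, which is about 0.833.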
In addition, we use the average recall (AR) to quantitatively evaluate the quality of region proposals; it averages the recall obtained with a fixed number of proposals over IoU thresholds from 0.5 to 1.
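A discretized version of AR can be sketched as below. It assumes that, for each ground-truth ship, the best IoU achieved by any of the top-N proposals has already been computed, and it samples the IoU range at steps of 0.05 from 0.5 to 0.95; both the sampling and the function name are assumptions rather than the exact protocol used here.

```python
def average_recall(best_ious, thresholds=None):
    """Mean recall over a set of IoU thresholds.

    best_ious:  for each ground-truth box, the highest IoU reached by any
                of the top-N proposals (N is fixed by the caller).
    thresholds: IoU thresholds to average over; defaults to
                0.50, 0.55, ..., 0.95 (an assumed discretization).
    """
    if thresholds is None:
        thresholds = [(50 + 5 * i) / 100 for i in range(10)]
    if not best_ious:
        return 0.0
    recalls = []
    for t in thresholds:
        covered = sum(1 for iou in best_ious if iou >= t)
        recalls.append(covered / len(best_ious))
    return sum(recalls) / len(recalls)


if __name__ == "__main__":
    # Three ground-truth ships: one tightly covered, one loosely, one missed.
    print(average_recall([0.97, 0.72, 0.40]))
```

Because the recall is averaged over increasingly strict thresholds, AR rewards proposals that are well localized and not merely overlapping, which is why it is used alongside the mean IoU when comparing proposal quality.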
1) Evaluation of the proposed RPN: To verify the superiority of the proposed RPN with dual-branch regression, we compare it with a baseline model in which all variables of the rotated proposals are predicted together in a single regression process. The baseline is a direct extension of the typical horizontal RPN that simply introduces the unknown orientation variable into the original formulation, and it is widely used in existing ship detection algorithms with rotated bounding boxes[12, 10, 11, 34, 29]. For a fair comparison, the predefined scales, aspect ratios and orientations in the baseline are all consistent with those in the proposed RPN.
We conduct several experiments to evaluate the quality of the proposals generated by these two models, and the experimental results are shown in Table I. We report the ARs for the 100, 500 and 1000 top-scoring proposals per image (denoted by $AR_{100}$, $AR_{500}$ and $AR_{1000}$, respectively). The higher scores of the proposed RPN on all the ARs indicate that the dual-branch regression effectively improves the quality of region proposals for ships. In addition, Table I shows that the mean IoU (computed only over positive proposals, i.e., those with an IoU overlap higher than 0.5 with any ground-truth box) is clearly improved by the proposed RPN compared with the baseline model.
2) Evaluation of multilevel adaptive pooling: To evaluate the effectiveness of multilevel adaptive pooling, we constructed detection methods using different types of pooling operations for comparison. All other parts of the compared detection methods are kept identical except for the pooling subnetwork (the proposed RPN is used in all of them for region proposal). The comparison results are provided in Table II, where AP is used to evaluate the detection performance of the different types of pooling operations. In the comparison, the benchmark (the first row) is the typical ROI pooling on single-level features (i.e., the highest-level features from the backbone network). This type of pooling operates on regular rotated bins; it can be seen as a rotated version of the original ROI pooling and is used in many current detection algorithms[11, 34, 37]. We can see that the proposed multilevel adaptive pooling (the fourth row) obtains a clearly higher AP.
For a more detailed evaluation, Table II also includes two other types of pooling, i.e., the adaptive pooling on single-level features (the second row) and the typical pooling on multilevel features (the third row). We can see that they produce comparable improvements over the benchmark. Interestingly, the adaptive pooling brings a significantly larger improvement on multilevel features (row #4 vs. row #3) than on single-level features (row #2 vs. row #1). This can probably be explained as follows. The single level contains only the highest-level features, which are relatively insensitive to spatial distortion. This characteristic alleviates the problem of uneven pooling in the spatial domain caused by the typical method. However, when multilevel features are exploited, more lower-level features are involved, so the adaptive pooling, which generates better spatially distributed features, shows a greater advantage. In this case, the multilevel features can also better realize their potential for the detection (row #4 vs. row #2).
|Pooling method|Features of pooling|AP|
|---|---|---|
|adaptive|multi-level (channel concat)|88.6%|
|adaptive|multi-level (reverse connect)|88.4%|
3) Different arrangements of multilevel features: In this paper, we perform a ring-like spatially-variant pooling, in which the outer feature samples are pooled from the lower levels, while the inner feature samples come from the higher levels. To verify the superiority of this arrangement of multilevel features, we compare it with two other possible arrangements in Table II. The first is the direct channel-wise concatenation of multilevel features (denoted by ‘channel concat’ in Table II) after the pooling. In this case, the lower- and higher-level pooled features fully overlap at each location, and the total number of feature channels is increased. The second is the reverse of our arrangement, in which the inner feature samples come from the lower levels while the outer ones are pooled from the higher levels. From Table II, it can be seen that both arrangements yield a lower AP. Although the difference is small, the reverse arrangement in particular is inferior to channel concatenation.
4) Different numbers of feature levels: The backbone network produces features at five levels, distinguished by the size of the feature maps. To assess the influence of the number of feature levels used, Fig. 6 plots AP against the number of feature levels, which ranges from 1 to 5. In each case, only the top levels of the backbone features are employed. From the plot in Fig. 6, we can see that using more than one level improves the detection performance. However, once sufficient levels are used (in our method, the right number is 3), involving more lower-level features degrades the performance. This suggests that the lowest-level features are not only unhelpful but may also mislead the learning-based ship detection.
5) Different backbones: Besides the VGG-16 network adopted as the backbone in this paper, we experimented with two deeper networks, ResNet-50 and ResNet-101, to study the effects of different backbones on the proposed method. As shown in Table III, compared to VGG-16, ResNet-50 and ResNet-101 boost the AP by 0.2 and 0.3 percentage points, respectively.
A possible explanation for this small gain is that we have improved parts of the algorithm other than the backbone, mainly the independent orientation prediction in the RPN and the multilevel adaptive pooling. These measures already improve the accuracy of the algorithm to a considerable degree, which makes further improvement from a different backbone difficult to achieve.
IV-C Detection Results and Comparison
The proposed method is compared with four other representative CNN-based detection methods which are Faster R-CNN, YOLOv3, method, method and R-DFPN. Some examples of the detection results by different methods are shown in Fig. 7, where ground-truth bounding boxes plotted in green dashed lines mark the ships that are not detected in each image. From the first row of Fig. 7, we can see that although all of the methods successfully detect the ships, our method generates bounding boxes with much higher accuracy, especially for the larger ship in the lower left. In the second row, the other four methods again produce less accurate detection results and, even worse, fail to detect some of the small ships close to the dock. Besides the higher detection accuracy, our method is more robust to interference from complex backgrounds (or other objects with a similar appearance). For example, in the third row of Fig. 7, there is a ship closely surrounded by several other objects. While all of the other methods fail to detect this ship, our method locates it with satisfactory accuracy. In the fourth row, we can see that Faster R-CNN, method and R-DFPN produce false detections on the land and a dock which looks like a ship, respectively.
Another challenge is to detect ships docked densely in the harbor. It is hard to accurately locate each individual ship among them, especially when they are docked in an inclined orientation or vary in size. The last three rows of Fig. 7 provide some examples. From the enlarged image patches, we can see that Faster R-CNN and YOLOv3, which detect the ships with horizontal bounding boxes, are prone to inaccurate or missed detections. The method, method and R-DFPN perform better in accurately locating the individual ships by using rotated bounding boxes based on the orientation prediction. However, there are still some missed detections when narrow ships are docked densely and very close to each other. Moreover, they cannot guarantee the correct orientation of the bounding box (see the last but one row of Fig. 7), because it is difficult to reliably predict the orientation together with the other variables in one regression process. By contrast, our method obtains much better detection results in the above cases. More detection results obtained by our method can be seen in Fig. 8.
Quantitative comparisons on the test set are provided in Table IV and Fig. 9. Faster R-CNN and YOLOv3 are evaluated with the horizontal ground-truth boxes, while the other methods are evaluated with the rotated ground-truth boxes. From Table IV, it can be seen that our method obtains the highest scores on precision, recall and AP. The method of R-DFPN achieves slightly higher AP compared with the method, while Faster R-CNN gets the worst performance. Although YOLOv3 achieves a higher AP than R-DFPN, its recall is lower, probably because YOLOv3 misses more of the inclined, densely docked ships. In addition, as a typical one-stage detection algorithm, the performance of method is only better than that of Faster R-CNN. To a certain extent, this comparison shows that it is quite difficult to obtain all the variables of the rotated bounding box accurately through a single regression step. In other words, the two-stage algorithm retains its advantages for predicting the rotated bounding box.
Fig. 9 presents a more comprehensive comparison of the above methods with the precision-recall curves. It can be seen that our method clearly outperforms the others at the majority of points on the curve. With its carefully designed, more powerful backbone network and several generations of algorithmic upgrades, YOLOv3 now performs very well on various objects and has become a highly competitive method in many object-detection tasks[33, 43, 44, 45]. In Fig. 9, we can see that YOLOv3 overall achieves the second-best performance for ship detection, except that it obtains a lower recall than R-DFPN when the precision drops. The fact that R-DFPN as well as the method and method perform worse than YOLOv3 reveals, to some extent, the difficulty of detecting ships via rotated bounding boxes. That is, the additional unknown variable of orientation increases the complexity of the system and, if not properly addressed, in turn harms the performance of the algorithm. We can see that our method resolves the relevant problems well and achieves a boost in performance, even though it does not use a more powerful backbone network (or the other sophisticated techniques) as YOLOv3 does.
From Fig. 9, we can see that R-DFPN and the method overall achieve comparable performance. The difference is that R-DFPN has a higher recall but a lower precision, and the converse holds for the method. In fact, the two compared methods can be regarded as two opposite extremes in how features are employed for ship detection. The method employs only single-level raw features from the backbone, while R-DFPN uses complicated multilevel features from a densely connected feature pyramid. It is generally believed that using more sophisticated multilevel features helps to improve the detection performance. However, the increased complexity of the network may also raise the risk of over-fitting to the ship targets and hinder the extraction of intrinsic features. In this case, more false detections are generated, which might be one of the reasons for the lower precision of R-DFPN. By contrast, our method alleviates this problem via a novel compact representation of multilevel features.
In this paper, we propose a novel CNN-based method for detecting ships in high-resolution optical remote sensing images via rotated bounding boxes. The proposed method first generates high-quality rotated proposals with a dual-branch regression network. Compared with other methods, which handle all unknown variables together with shared features in one regression process, the dual-branch regression network predicts the orientation and the other variables independently, each with CNN features better suited to the corresponding regression task. Next, a multilevel adaptive pooling method is proposed to alleviate the uneven sampling problem caused by typical pooling methods while incorporating multilevel features via a spatially-variant pooling operation. This approach creates a compact feature representation that takes full advantage of multilevel features to improve the detection performance. Finally, a detailed ablation study is performed on the proposed techniques, and the superiority of the proposed detection method is verified by comparison with several representative CNN-based detection methods, which also yields some useful insights into the proper use of multilevel CNN features for ship detection. In future work, we may consider using a structure similar to a graph model to capture the spatial relationships between different parts of the ship. The learned spatial relationships could better promote feature integration in the ROI pooling process, thereby further improving the detection performance. Code will be publicly available at https://github.com/lilinhao/ShipDetection.
-  R. Zhang, J. Yao, K. Zhang, C. Feng, and J. Zhang, “S-CNN ship detection from high-resolution remote sensing images,” ISPRS-International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, pp. 423–430, 2016.
-  G. Cheng, P. Zhou, and J. Han, “Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 12, pp. 7405–7415, 2016.
-  Y. Long, Y. Gong, Z. Xiao, and Q. Liu, “Accurate object localization in remote sensing images based on convolutional neural networks,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 5, pp. 2486–2498, 2017.
-  X. Li, S. Wang, B. Jiang, and X. Chan, “Inshore ship detection in remote sensing images based on deep features,” in 2017 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC). IEEE, 2017, pp. 1–5.
-  R. Girshick, “Fast R-CNN,” in IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in European conference on computer vision. Springer, 2016, pp. 21–37.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
-  J. Redmon and A. Farhadi, “YOLO9000: better, faster, stronger,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7263–7271.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, p. 1137, 2017.
-  W. Liu, L. Ma, and H. Chen, “Arbitrary-oriented ship detection framework in optical remote-sensing images,” IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 6, pp. 937–941, 2018.
-  Z. Zhang, W. Guo, S. Zhu, and W. Yu, “Toward arbitrary-oriented ship detection with rotated region proposal and discrimination networks,” IEEE Geoscience and Remote Sensing Letters, vol. 15, no. 11, pp. 1745–1749, 2018.
-  L. Liu, Z. Pan, and B. Lei, “Learning a rotation invariant detector with rotatable bounding box,” arXiv preprint arXiv:1711.09405, 2017.
-  J. Redmon and A. Farhadi, “Yolov3: An incremental improvement,” arXiv preprint arXiv:1804.02767, 2018.
-  T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
-  L. Lei and Y. Su, “An inshore ship detection method based on contour matching,” Remote Sens. Technol. Appl., vol. 22, no. 5, pp. 622–627, 2007.
-  J. Lin, X. Yang, S. Xiao, Y. Yu, and C. Jia, “A line segment based inshore ship detection method,” in Future Control and Automation. Springer, 2012, pp. 261–269.
-  C. Zhu, H. Zhou, R. Wang, and J. Guo, “A novel hierarchical method of ship detection from spaceborne optical image based on shape and texture features,” IEEE Transactions on geoscience and remote sensing, vol. 48, no. 9, pp. 3446–3456, 2010.
-  S. Li, Z. Zhou, B. Wang, and F. Wu, “A novel inshore ship detection via ship head classification and body boundary determination,” IEEE geoscience and remote sensing letters, vol. 13, no. 12, pp. 1920–1924, 2016.
-  G. Liu, Y. Zhang, X. Zheng, X. Sun, K. Fu, and H. Wang, “A new method on inshore ship detection in high-resolution satellite images using shape and context information,” IEEE geoscience and remote sensing letters, vol. 11, no. 3, pp. 617–621, 2013.
-  F. Yang, Q. Xu, and B. Li, “Ship detection from optical satellite images based on saliency segmentation and structure-LBP feature,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 5, pp. 602–606, 2017.
-  X. Li and S. Wang, “Object detection using convolutional neural networks in a coarse-to-fine manner,” IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 11, pp. 2037–2041, 2017.
-  Z. Zou and Z. Shi, “Ship detection in spaceborne optical image with SVD networks,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 10, pp. 5832–5845, 2016.
-  F. Wu, Z. Zhou, B. Wang, and J. Ma, “Inshore ship detection based on convolutional neural network in optical satellite images,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 11, no. 11, pp. 4005–4015, 2018.
-  Q. Li, L. Mou, Q. Liu, Y. Wang, and X. X. Zhu, “Hsf-Net: Multiscale deep feature embedding for ship detection in optical remote sensing imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 12, pp. 7147–7161, 2018.
-  Z. Deng, H. Sun, S. Zhou, J. Zhao, L. Lei, and H. Zou, “Multi-scale object detection in remote sensing imagery with convolutional neural networks,” ISPRS journal of photogrammetry and remote sensing, vol. 145, pp. 3–22, 2018.
-  S. Nie, Z. Jiang, H. Zhang, B. Cai, and Y. Yao, “Inshore ship detection based on mask R-CNN,” in IGARSS 2018-2018 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2018, pp. 693–696.
-  N. Bodla, B. Singh, R. Chellappa, and L. S. Davis, “Soft-NMS–improving object detection with one line of code,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 5561–5569.
-  S. Zhang, R. Wu, K. Xu, J. Wang, and W. Sun, “R-CNN-Based ship detection from high resolution remote sensing imagery,” Remote Sensing, vol. 11, no. 6, p. 631, 2019.
-  X. Yang, H. Sun, K. Fu, J. Yang, X. Sun, M. Yan, and Z. Guo, “Automatic ship detection in remote sensing images from google earth of complex scenes based on multiscale rotation dense feature pyramid networks,” Remote Sensing, vol. 10, no. 1, p. 132, 2018.
-  Y. Feng, W. Diao, X. Sun, M. Yan, and X. Gao, “Towards automated ship detection and category recognition from high-resolution aerial images,” Remote Sensing, vol. 11, no. 16, p. 1901, 2019.
-  Y. Zhang, Y. Zhang, Z. Shi, J. Zhang, and M. Wei, “Rotationally unconstrained region proposals for ship target segmentation in optical remote sensing,” IEEE Access, vol. 7, pp. 87 049–87 058, 2019.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
-  Q. Zhao, T. Sheng, Y. Wang, Z. Tang, Y. Chen, L. Cai, and H. Ling, “M2Det: A single-shot object detector based on multi-level feature pyramid network,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, 2019, pp. 9259–9266.
-  Z. Liu, J. Hu, L. Weng, and Y. Yang, “Rotated region based CNN for ship detection,” in 2017 IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 900–904.
-  Y. Zhou, Q. Ye, Q. Qiu, and J. Jiao, “Oriented response networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 519–528.
-  T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988.
-  J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue, “Arbitrary-oriented scene text detection via rotation proposals,” IEEE Transactions on Multimedia, vol. 20, no. 11, pp. 3111–3122, 2018.
-  J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei, “Deformable convolutional networks,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 764–773.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  J. Ma, Z. Zhou, B. Wang, H. Zong, and F. Wu, “Ship detection in optical satellite images via directional bounding boxes based on ship center and orientation prediction,” Remote Sensing, vol. 11, no. 18, p. 2173, 2019.
-  Z. Liu, H. Wang, L. Weng, and Y. Yang, “Ship rotated bounding box space for ship extraction from high-resolution optical satellite images with complex backgrounds,” IEEE Geoscience and Remote Sensing Letters, vol. 13, no. 8, pp. 1074–1078, 2016.
-  J. Hosang, R. Benenson, P. Dollár, and B. Schiele, “What makes for effective detection proposals?” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 4, pp. 814–830, 2015.
-  M. Braun, S. Krebs, F. Flohr, and D. Gavrila, “Eurocity persons: a novel benchmark for person detection in traffic scenes,” IEEE transactions on pattern analysis and machine intelligence, 2019.
-  B. Benjdira, T. Khursheed, A. Koubaa, A. Ammar, and K. Ouni, “Car detection using unmanned aerial vehicles: Comparison between faster r-cnn and yolov3,” in 2019 1st International Conference on Unmanned Vehicle Systems-Oman (UVS). IEEE, 2019, pp. 1–6.
-  G. Ghiasi, T.-Y. Lin, and Q. V. Le, “Nas-fpn: Learning scalable feature pyramid architecture for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7036–7045.