I Introduction
Object detection in optical remote sensing images is one of the most active topics in many fields, such as disaster control, traffic monitoring, and traffic planning, and it has received increasing research interest in remote sensing image analysis over the last decades [1, 2].
Traditional object detection methods typically comprise three key procedures: candidate region proposal, feature extraction, and classification. The first procedure extracts a relatively small set of effective candidate boxes according to different rules [3, 4, 5]. In [3], selective search was proposed, which combines the strengths of exhaustive search and segmentation. In [4], a region proposal method called binarized normed gradients (BING) was proposed for object detection, which produces a small set of candidate object windows. In [5], an iterative training method based on the invariant generalized Hough transform was used to learn a robust shape model automatically. The model captures the shape variability of the target contained in the training data set, and every point in the model is given an individual weight according to its importance, which greatly reduces the false-positive rate. Compared with the sliding window method, the above methods greatly reduce the number of candidate regions without sacrificing efficiency.
Conventional features adopted for object detection in remote sensing images are handcrafted features such as color histograms, texture features, shape features, histogram of oriented gradients (HOG) [6, 7], scale-invariant feature transform (SIFT) [8, 9], bag of words (BOW) [10, 11], saliency [12, 13], and so on [14].
After feature extraction, a classifier can be trained on the training dataset by a number of possible approaches with the objective of minimizing the misclassification error. In practice, there are many different learning approaches, including support vector machine (SVM) [11], AdaBoost [15, 16], k-nearest-neighbor (kNN) [17, 18], conditional random field (CRF) [19], sparse representation-based classification (SRC) [20], and artificial neural network (ANN) [21].
Currently, owing to the powerful feature representation of deep learning, convolutional neural networks (CNNs) have been used in classification [22, 23, 24, 25, 26] and object detection. In particular, object detection using CNNs has achieved remarkable success with two-stage detection approaches [27, 28, 29, 30] and single-stage detection methods [31, 32]. Among two-stage approaches, R-CNN [27] applied high-capacity CNNs to bottom-up region proposals generated by selective search in order to locate and classify objects efficiently, and obtained good performance compared with traditional detection methods. Fast R-CNN [28] extends R-CNN [27] by using the ROI pooling layer to address the multi-stage pipeline, thus improving training and testing speed while also increasing detection accuracy. Going further, Faster R-CNN [29] proposed the region proposal network (RPN) to replace selective search. The RPN generates multi-scale and translation-invariant region proposals based on the high-level features of images, and for this reason Faster R-CNN achieved remarkable success in object detection. A sequence of advances [30] builds on the two-stage framework.
For single-stage detection methods, YOLO [31] unified the separate components of multiple bounding box regression and classification into a single neural network, which greatly improved detection speed. Based on this unified detection architecture, SSD [32] used multi-scale convolutional bounding box outputs attached to multiple feature maps at the top of the network to achieve better performance. To improve the detection accuracy of SSD, SFD [33] extended SSD with a scale-equitable framework that guarantees an adequate number of matched anchors for objects of different scales. A sequence of advances [34, 35] builds on the single-stage SSD framework. Generally, the main advantage of single-stage detectors is high efficiency, while two-stage detectors are superior to single-stage detectors in accuracy.
How to pursue accuracy and speed simultaneously has always been an open problem for detectors. RON [36] addressed this question by using reverse connections to provide the former layers of CNNs with more semantic information, and by guiding the search for objects with an objectness prior, which yielded good performance in both speed and accuracy. RetinaNet [37] is another successful method that matches the speed of previous single-stage detectors while surpassing the accuracy of two-stage detectors. It proposed the focal loss, which applies a modulating term to the cross-entropy loss in order to focus learning on hard examples and down-weight the numerous easy negatives. The success of the focal loss is largely attributed to overcoming the extreme foreground-background class imbalance encountered during the training of detectors, which shows that this imbalance is a major problem limiting the performance of detectors, especially dense detectors.
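As a rough illustration of the modulating term mentioned above, the following Python sketch implements the binary form of the focal loss from [37], FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The function names and the default values gamma = 2 and alpha = 0.25 are illustrative assumptions, not settings used by DAPNet.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for one prediction.

    p: predicted foreground probability in (0, 1)
    y: ground-truth label, 1 for foreground and 0 for background
    """
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss; gamma and alpha defaults are illustrative assumptions."""
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    # (1 - p_t) ** gamma is the modulating term: it shrinks the loss of
    # well-classified (easy) examples and leaves hard examples almost untouched.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With these definitions, an easy background anchor (for example `p = 0.01`, `y = 0`) contributes far less under the focal loss than under plain cross-entropy, while setting `gamma = 0` reduces the focal loss to an alpha-weighted cross-entropy.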
Driven by the success of the focal loss [37] and SFD [33], guaranteeing plenty of balanced positive and negative anchors for each object is a key to improving detector performance. To obtain enough anchors for each object, single-stage detectors generate anchors at each position of multiple layers of deep CNNs, and select the positive and negative anchors according to the overlaps between anchors and ground truth. However, without region proposals, the detectors have to suppress too many negative anchors while regressing the positive anchors, which increases the difficulty of training a network with good performance.
Compared with single-stage detection methods [32, 37, 33], two-stage detectors divide the object detection task into two subnetworks: the region proposal network and the accurate detection network. The region proposal network rejects most of the negative samples, which ensures the balance of positive and negative anchors in a largely reduced search space for the accurate detection network. Besides, regressing twice yields accurate final detection results. However, the two-stage detectors [29, 30, 38, 39] feed a fixed number of candidate boxes to the accurate detection network, whereas the overall arrangement of objects differs from image to image, as illustrated in Fig. 1. It is therefore unreasonable to treat all images equally with a fixed number of candidate boxes.
As mentioned above, detecting objects in high-resolution aerial images is still a challenging task because of the complex background information and the uneven distribution of objects in remote sensing images, which typically degrade detection performance. In this paper, a novel and effective DAPNet is proposed to tackle this problem by learning a category prior network (CPN) and a fine-region proposal network (FRPN) to facilitate sparse and dense object detection in remote sensing images. The CPN is realized by learning a new regression layer following the convolutional features to automatically predict the number of objects of each class in the given image.
For clarity, the main contributions of this paper are as follows:

Focusing on the problem of sparse and dense object detection in remote sensing images, such as ships and vehicles, we propose an effective DAPNet framework based on Faster R-CNN, which extensively reduces the redundant boxes passed to the accurate detector and detects multi-scale targets under complex backgrounds in remote sensing images, while significantly improving performance compared with state-of-the-art approaches.

We develop a category prior network (CPN) to compute the prior information of each category via a small branch network, and model the process with a global ROI pooling layer followed by a binary classification layer and a regression layer. To adapt to sparse and dense objects, we define nine levels of base numbers as regression references.

Aiming at category balance in the training data for the accurate object detection network, we build a fine-region proposal network (FRPN) by changing the classification branch and the testing strategy of the existing RPN. The results of the CPN and the FRPN are combined to achieve adaptive region proposals for each image, and thus facilitate sparse and dense object detection in remote sensing images.
The remainder of this paper is organized as follows. Section II introduces the proposed DAPNet framework in detail. Experimental results and analysis are presented in Section III. Finally, Section IV concludes this paper.
II Proposed Method
II-A Overview of the Proposed Method
The proposed object detection method, called DAPNet, is composed of four major components: the VGG16 backbone network, the category prior network (CPN), the fine-region proposal network (FRPN), and the accurate detection network (ARCNN), shown in aqua, yellowish, pink, and light blue respectively in Fig. 2. DAPNet is a novel network that automatically adjusts the number of candidate boxes according to the distribution of objects in an image. First, we use a deep fully convolutional backbone network to generate high-level convolutional features of the image. Then, the features are sent to three separate networks: the CPN, the FRPN, and the ARCNN. The CPN predicts the number of objects of each category in the image, and the FRPN generates category-independent candidate regions and classification results for the image. Finally, according to the per-category numbers and the candidate regions, we produce adaptive candidate boxes and perform accurate object detection on them.
The overall architecture of the VGG16 network consists of sixteen weighted layers, including thirteen convolutional layers, two fully connected layers, and one softmax classification layer. Readers can refer to [40] for more details. We build on the successful VGG16 to learn our DAPNet model. In the training stage, the parameters of the first thirteen layers (the thirteen convolutional layers) are transferred from VGG16 weights pretrained on the competition data set [41], and the last three layers are discarded. Besides, to ensure that small objects have enough features for learning, we adjust the structure of the VGG16 backbone network by deleting the pool4 layer, so that the high-level features retain more information for small objects, as shown on the left of Fig. 2. The light green layers represent convolutional layers, and the light grey layers indicate pooling layers.
II-B Category Prior Network
A category prior network (CPN) takes an image (of any size) as input and outputs the prior information of each category for the image, as mentioned above. We are specifically inspired by Faster R-CNN [29], which uses an RPN to generate multi-scale candidate regions. Similar in structure to the RPN, our CPN generates multi-level numbers for each category. We model this process with a small fully convolutional network that follows the backbone network. This small network includes a global ROI pooling layer and a convolutional layer, followed by two sibling output layers, as shown in Fig. 3(2).
To generate category number priors, we slide a small network over the highlevel convolutional feature map output by the  layer. First, the whole highlevel convolutional feature map is mapped to an feature map by a global ROI pooling layer. Then this small feature is fed into two convolutional layers, the first convolutional layer uses filters with size, and the last convolutional layer produces regression values and 2 classification scores and for each level category regression, with filters size , and thus has a ()channels output layer with object categories.
II-B1 Multi-Level Number as Regression Reference
The number of objects can take any value because of the complex scenes and uneven distribution of objects in remote sensing images. For the high-level features of each image, we simultaneously predict regression numbers at nine levels for each category, and the regressions of different categories do not affect each other.
Our design of the CPN presents a novel scheme for addressing multi-level category number regression. In supervised batch learning, one of the keys to learning the category prior network is the definition of training samples. For each image, we use nine levels (1, 2, 4, 8, 16, 24, 32, 48, and 64) as regression base references, as illustrated in Fig. 3(1). First, the ground-truth (GT) number of objects of each category in each image is counted, and the nine reference base numbers for each category are fixed. Then, the difference values for each category level are computed according to the following formula:
(1) 
in which represents the category, and means the regression level within the range of . denotes the ground truth number for category, and is the level reference base number. To facilitate different images and obtain the highquality regression category numbers, the position of the difference value higher than 0 in every nonzero category number are recorded, and the category references at the position above are defined as the positive training samples. However, the ratio of to within to and 2 to 4 are ignored, in order to expand the disparity between the positive and negative training samples, and all of the rest category regression base numbers are defined as the negative searching space. The positive samples are shown the red color numbers in Fig. 3(1)(e). Besides, to ensure the balance between the positive and negative numbers, we random select 3 times the number of positive samples in negative searching space as negative samples, as shown the orange color numbers in Fig. 3(1)(f).
II-B2 Loss Function
To train the CPN, we define a new regression and classification loss and minimize an objective function following the multi-task loss in Fast R-CNN. Our multi-task loss function for an image is defined as
(2) 
Here, is the level index of all reference level , let be an indicator for matching the category base number to positive samples, it means if the category object exist in this image and the base number is responsible for this category.
is the predicted probability of
category base number. The groundtruth label is means that the image including category object and the base number is positive sample, and is represents the category base number is negative sample. The is the regression result of category level, and is that of ground truth regression value. The classification loss is log loss over two class (object contain or not ). represents the class number. For the regression loss, we use where is the robust function defined in [28]. The term means the regression loss is activated only for existing object category base number () and is disabled otherwise ().The classification and regression loss terms are normalized by and respectively, and weighted by balancing parameter . In our current implementation, is the total classification numbers, including the positive and negative samples, about four times the number of objects. The represents the total positive numbers for regression, about two times the number of objects. By default we set , which means that we bias toward better category number regression. For category number regression, similar to RPN, we regress the offsets of the category number.
(3)  
(4) 
where variables , , and represents the predicted category number, ground truth category number, and base level category number respectively.
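For reference, the robust function of [28] is the smooth L1 function. A minimal Python sketch follows. The helper name `cpn_regression_loss` and its list-based inputs are hypothetical; the sketch only illustrates how the indicator restricts the regression loss to positive category levels and how the sum is normalized by the number of positives, and it does not reproduce the offset parameterization of Eqs. (3) and (4).

```python
def smooth_l1(x):
    """Smooth L1 function from Fast R-CNN: quadratic near zero, linear elsewhere."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def cpn_regression_loss(predicted, target, is_positive):
    """Average smooth L1 loss over positive category levels only.

    predicted, target: per-level regression values (hypothetical flat lists)
    is_positive: per-level indicator; negative levels are excluded from the loss
    """
    terms = [
        smooth_l1(p - t)
        for p, t, pos in zip(predicted, target, is_positive)
        if pos
    ]
    return sum(terms) / len(terms) if terms else 0.0
```

Because the function grows linearly for residuals larger than 1, a single badly mispredicted level cannot dominate the gradient as it would under a squared loss.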
II-B3 Training and Testing CPN
The CPN is a small network built on the output of the backbone network. As mentioned before, the backbone network was pretrained on the competition dataset [41]. For the other convolutional layers in the CPN, we initialize the parameters with the method of [42]. The CPN can be trained end-to-end by backpropagation and stochastic gradient descent (SGD) [43], which involves iterative forward passes that take labeled data as input. With the above multi-task loss function, the network can be optimized for the number of each category.
After the CPN was trained, the training images were sent to the network and the offset of every category level number was produced for each image. However, the purpose of the CPN is to obtain a single number for each category rather than a number for every category level. Therefore, during testing, we predict a score for each category level regression. We apply a threshold of 0.7 to these scores to filter out negative categories and low-score category level regressions: only the category level regressions with scores higher than 0.7 are kept, and their average regression result is taken as the final predicted number for the category.
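The test-time aggregation described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name `predict_category_count` is hypothetical, the per-level counts are assumed to be already decoded from the regressed offsets, and returning 0 when no level passes the threshold is our reading of "filtering out negative categories".

```python
def predict_category_count(level_scores, level_counts, score_threshold=0.7):
    """Aggregate per-level CPN outputs into one predicted count for a category.

    level_scores: classification score of each reference level
    level_counts: count predicted at each level (assumed already decoded)
    score_threshold: levels scoring at or below this value are discarded
    """
    kept = [
        count
        for score, count in zip(level_scores, level_counts)
        if score > score_threshold
    ]
    if not kept:
        # Assumption: no confident level means the category is absent.
        return 0.0
    return sum(kept) / len(kept)
```

For example, with scores `[0.1, 0.9, 0.8, 0.2]` and decoded counts `[1.0, 5.0, 7.0, 20.0]`, only the second and third levels pass the threshold, so the predicted count is their average.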
II-C Fine Region Proposal Network
The advances in Faster R-CNN are driven by the success of the region proposal network, which extracts high-quality candidate regions at nearly no cost. Typically, the existing RPN takes an image as input and produces a fixed collection of rectangular object boxes with different scales and aspect ratios. To produce the candidate regions, the binary classification scores are used to filter out negative boxes, and a fixed number of top-ranked boxes are selected as candidate regions. Yet the number of objects differs from image to image, so it is unreasonable to select the same fixed number of candidate regions for all images. Therefore, we propose a fine-region proposal network (FRPN) to produce adaptive region proposals for different images.
Similar to the RPN, the FRPN is built on a VGG16 backbone network that extracts high-level semantic features from the image. The FRPN consists of a convolutional layer and two sibling convolutional output layers, one for fine classification of boxes and the other for box regression. In particular, we design a novel proposal selection strategy and a fine classification network to generate more effective and comprehensive high-quality region proposals. The overall architecture of the FRPN is shown in the center right of Fig. 4.
Before training the fine classification and regression network, the positive and negative anchor training samples was required to determine. For every position in  feature maps, we generated anchors of four scales (, , , and pixels) and three aspect ratios (, , and ). We assigned the anchors which value higher than 0.5 with positive label that corresponding to its category. For negative anchor samples, to guarantee the balance between the positive and negative anchor training samples, we randomly selected three times number of positive anchors from the overlap with all ground truth box lower than , and assigned the negative label with label to these anchors.
II-C1 Fine Classification Network
A major challenge for object detection is accurate object localization, and one problem is the imbalance of region proposals across objects, which limits localization performance. An important property of our approach is that it has both the CPN and the FRPN, so we can balance the number of region proposals for every image. We therefore design a fine classification network that is derived from the RPN but extended to handle multiple object categories; it determines the specific class of an anchor and its confidence. The training objective of the fine classification network is the softmax loss over all positive and negative anchor samples:
(5) 
where
(6) 
in which is the predicted softmax outpus and the denotes the corresponding predicted category scores, represents the predicted scores for samples ground truth category, the class 0 represents the background. is an indicator for matching the candidate regions to the ground truth boxes, if matched and otherwise . is the number of matched candidate regions. The fine classification network is nearly costfree compared with the original classification network in RPN, after obtained the class for each region proposal, we used a category NMS, the method will be explain in the testing strategy section in detail.
II-C2 Regression Network
The architecture of the regression network is inherited from the RPN in Faster R-CNN. To regress the offsets between the positive anchor boxes and the corresponding ground-truth boxes, we slide a convolutional layer over the feature map. Each anchor is represented by four parameters: the center x-coordinate, the center y-coordinate, the anchor width, and the anchor height. The robust function is used to measure the difference between the output of the regression network and the offsets of the anchors from their corresponding ground-truth boxes. Therefore, the training loss for a single image is the average over all positive anchors:
(7) 
in which represents the index of the positive sample, is the corresponding offset between the anchor with the ground truth boxes, is the predicted offset. And where , , , are the center , center , width of boxes, and height of boxes, respectively. Where the is a robust
loss that is less sensitive to outliers compared with the
loss. Training the FRPN is straightforward; the challenge is how to select high-quality region proposals from all predicted boxes, which is addressed in the testing strategy section.
II-C3 Training Objective
The FRPN can be trained end-to-end with a multi-task loss function that includes the regression loss and the fine classification loss mentioned above. We define a weight term to balance the regression network and the classification network in the FRPN:
(8) 
where is the loss to train the FRPN. In experiment, we set the , which means that we bias toward better box location.
II-C4 Testing Strategy
After the FRPN is trained, all training images are sent to the network to produce a sufficient number of region proposals for each image; the next problem is to select high-quality proposals that contain objects. Faster R-CNN [29] handles this by applying a score threshold and NMS to select region proposals for each image. However, NMS based on binary scores is unreasonable, because the predicted scores for different object classes are not comparable. Besides, the number of region proposals per image is too large, since there are not that many objects to be detected in each image. Therefore, we extend the testing strategy of the RPN as follows.
Scores Filter: The output of the FRPN includes two items: the region proposals and the category scores for each region proposal. We filter out the region proposals that do not contain any object according to the scores: if the predicted class of a region proposal is class 0, the region proposal is background and is discarded. Besides, we filter out all predicted region proposals that are not contained in the image.
Category NMS: Generally, the object information differs from image to image, and the category scores for every box are obtained by the fine classification network. We therefore propose a category NMS strategy to filter redundant region proposals and keep the high-quality ones. Category NMS is easy to realize: using the scores of each category, we set a threshold for that category based on its predicted number and filter the redundant boxes of that single category. To a certain degree, category NMS reduces the interaction between classes.
At test time, the CPN prediction yields the category numbers for each image; this prior information is the key to selecting the candidate regions and reducing computation. Because a single image obviously contains far fewer objects of each category than there are region proposals, most of the region proposals are redundant. Thus, the number of proposals retained for each category in an image is determined by the predicted category number and the output of category NMS. In short, the fine-region proposal network generates adaptive and high-quality candidate boxes.
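The selection procedure above (score filter, category NMS, and a per-category budget derived from the CPN counts) could be sketched as follows. This is only a sketch under explicit assumptions: the function names, the fixed `iou_threshold`, and the `margin` multiplier that converts a predicted count into a proposal budget are all hypothetical, since the exact rule that ties the threshold to the category number is not recoverable from the text.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(indices, boxes, scores, iou_threshold):
    """Greedy NMS restricted to the given proposal indices."""
    keep = []
    for i in sorted(indices, key=lambda k: scores[k], reverse=True):
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def adaptive_proposals(boxes, labels, scores, category_counts,
                       iou_threshold=0.7, margin=2.0):
    """Select proposals per category using CPN counts as a budget.

    labels: predicted class per proposal, with 0 meaning background
    category_counts: {class_id: predicted object count} from the CPN
    margin: hypothetical multiplier turning a count into a proposal budget
    """
    selected = []
    for cls, count in sorted(category_counts.items()):
        if cls == 0 or count <= 0:
            continue  # score filter: skip background and absent categories
        members = [i for i, label in enumerate(labels) if label == cls]
        kept = nms(members, boxes, scores, iou_threshold)
        budget = math.ceil(margin * count)
        selected.extend(kept[:budget])  # kept is already sorted by score
    return selected
```

Running NMS separately per category means a high-scoring box of one class never suppresses an overlapping box of another class, which is one way to read the claim that category NMS reduces the interaction between classes.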
II-D Accurate Object Localization
The CPN and the FRPN exist to generate high-quality adaptive candidate boxes, which are the areas most likely to contain objects. To complete object detection, the ARCNN performs regression and classification on these candidate boxes. To implement the ARCNN, the candidate boxes are mapped to the high-level feature map, and ROI pooling is then used to pool each proposal feature into a fixed size, followed by two fully connected layers that classify and regress the object; in our experiments, we set , as shown in the lower right of Fig. 2. The loss of the ARCNN subnet is the same as that of Fast R-CNN [28], so we do not elaborate on it here.
The purpose of the CPN, the FRPN, and the ARCNN is to strengthen the features of objects and suppress background characteristics. The three subnets are therefore all built on the high-level feature output by the backbone, which allows convolutional layers to be shared among the three networks rather than learning three separate networks. In the training phase, the loss of DAPNet for each image is the combination of the three subnet losses mentioned earlier, with the loss weight of each subnet set to 1. During testing, after the accurate object proposals are obtained, we use category NMS to get the final object localization; combined with the classification scores, this produces the final detection results.
III Experiments
For the experiments in this paper, the NWPU VHR-10 geospatial object detection dataset is used to evaluate the performance of the proposed DAPNet framework. We first introduce the data set and the evaluation metrics. Then, the implementation details of DAPNet are presented, including an analysis of hyperparameters and an ablation experiment. Finally, the proposed DAPNet method is compared with several traditional methods and deep learning methods, with results and numerical analysis.
III-A Description of Data Set
The NWPU VHR-10 data set [44] is one of the pioneering works in the field of remote sensing object detection and is designed to provide a standardized test bed for multi-class object detection in remote sensing images. The data set was cropped from Google Earth and the Vaihingen dataset and then manually annotated by experts. It contains ten classes of objects: airplane, ship, storage tank, baseball diamond, tennis court, basketball court, ground track field, harbor, bridge, and vehicle. In particular, the data set contains both sparse and dense objects, where sparsity and density refer to the distribution of objects in images. For instance, storage tanks and vehicles are in part dense objects, while ground track fields are sparse objects. The object statistics of the ten classes are shown in Table I.
TABLE I: THE ANALYSIS OF THE TRAINING DATASET
Class Name          Total Number  Minimum Number  Maximum Number
Airplane            757           1               31
Ship                302           1               15
Storage tank        655           6               63
Baseball diamond    390           1               8
Tennis court        524           1               24
Basketball court    159           1               6
Ground track field  163           1               1
Harbor              224           3               18
Bridge              124           1               5
Vehicle             477           2               48
Table I shows the object distribution of the NWPU VHR-10 dataset, including the total number of objects and the minimum and maximum numbers of objects per image for each category. The data set clearly contains both sparse and dense objects in different images, and thus it is suitable for evaluating our proposed method. Sample images from the NWPU VHR-10 data set are shown in Fig. 1.
The NWPU VHR-10 data set includes two parts, 650 positive images and 150 negative images, for a total of 800 images used for training our DAPNet framework and for comparative trials. First, we randomly select 500 images from the positive images and combine them with the 150 negative images to form the training dataset. The remaining 150 positive images are used to test the performance of the object detection methods in this paper.
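The split described above can be reproduced with a few lines of Python. The function name, the identifier format, and the fixed `seed` are assumptions made for illustration; the text does not state which random seed or tooling was used.

```python
import random


def split_dataset(positive_ids, negative_ids, num_train_positive=500, seed=0):
    """Split image ids into a training set and a test set.

    The training set holds `num_train_positive` randomly chosen positive
    images plus all negative images; the test set holds the remaining
    positive images. The seed is an illustrative assumption.
    """
    rng = random.Random(seed)
    shuffled = list(positive_ids)
    rng.shuffle(shuffled)
    train = shuffled[:num_train_positive] + list(negative_ids)
    test = shuffled[num_train_positive:]
    return train, test


# Hypothetical identifiers standing in for the 650 positive and 150 negative images.
positives = [f"pos_{i:03d}" for i in range(650)]
negatives = [f"neg_{i:03d}" for i in range(150)]
train_ids, test_ids = split_dataset(positives, negatives)
```

With these sizes, the training set contains 650 images (500 positive and 150 negative) and the test set contains the remaining 150 positive images.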
III-B Evaluation Metrics
The mean average precision (mAP) and the precision-recall curve (PRC) are adopted to compare the detection performance of different approaches. To allow a better understanding of the mAP, we first explain the PRC and the average precision (AP).
III-B1 Precision-Recall Curves
The output of the object detection framework is a collection of bounding boxes. Bounding boxes whose overlap with a ground-truth box is higher than 0.5 are regarded as positive samples, while the others are regarded as negative samples. In addition, if several detections overlap with the same ground-truth bounding box, only one is counted as a true positive, and the others are counted as false positives.
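A small Python sketch of this matching rule is given below. It assumes boxes in `(x1, y1, x2, y2)` form, measures overlap as intersection over union, and processes detections in descending score order so that the highest-scoring detection claims a ground-truth box first. The function names and the score-ordering choice are assumptions; the text only fixes the 0.5 overlap threshold and the one-match-per-ground-truth rule.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def label_detections(detections, gt_boxes, iou_threshold=0.5):
    """Label detections of one class in one image as true or false positives.

    detections: list of (box, score) pairs
    Returns (flags, num_false_negatives), where flags follows descending
    score order and True marks a true positive.
    """
    matched = set()
    flags = []
    for box, _score in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_gt = 0.0, None
        for g, gt in enumerate(gt_boxes):
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_gt = overlap, g
        if best_iou > iou_threshold and best_gt not in matched:
            matched.add(best_gt)   # first match of this ground-truth box
            flags.append(True)
        else:
            flags.append(False)    # low overlap or duplicate detection
    return flags, len(gt_boxes) - len(matched)
```

In this sketch a duplicate detection of an already matched ground-truth box is labeled as a false positive, and every unmatched ground-truth box counts as a false negative.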
The precision indicator measures the proportion of detections that are true positives, and the recall indicator represents the fraction of positives that are correctly identified. The precision and recall indicators are formulated as follows:
(9) $\mathrm{Precision} = \dfrac{TP}{TP + FP}$
(10) $\mathrm{Recall} = \dfrac{TP}{TP + FN}$
where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, respectively. The precision-recall curve takes recall as the abscissa and precision as the ordinate. If a detection method keeps precision high as recall increases, it performs well.
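To make the construction of the curve concrete, the following Python sketch computes precision and recall from the counts and then traces the curve by sweeping down a score-ranked list of detections. The function names and the input format (a list of true/false-positive flags in descending score order) are assumptions made for illustration.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP) and recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


def precision_recall_curve(flags, num_gt):
    """Trace (recall, precision) points over a ranked detection list.

    flags: True for a true positive, False for a false positive,
           in descending score order
    num_gt: total number of ground-truth objects of the class
    """
    points = []
    tp = fp = 0
    for is_tp in flags:
        if is_tp:
            tp += 1
        else:
            fp += 1
        # Ground-truth objects not yet detected are false negatives.
        precision, recall = precision_recall(tp, fp, num_gt - tp)
        points.append((recall, precision))
    return points
```

Each prefix of the ranked list corresponds to one score threshold, so the returned points are the samples of the precision-recall curve for that class.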
III-B2 Average Precision
The average precision (AP) computes the average value of the precision over the recall interval from 0 to 1, i.e., it can be formulated as:
(11) $AP = \int_{0}^{1} P(R)\, dR$
where $P$ denotes the precision and $R$ denotes the recall. Hence, the AP equals the area under the precision-recall curve; the higher the AP, the better the performance. The mean AP (mAP) is the average of the AP values over all categories.
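In practice the area under the curve is computed numerically from sampled points. The sketch below uses the common approach of first making precision monotonically non-increasing and then summing rectangles over the recall steps. This interpolation scheme and the function names are assumptions, since the text does not say which numerical scheme was used.

```python
def average_precision(recalls, precisions):
    """Area under a precision-recall curve sampled at increasing recall.

    Uses the precision envelope (an assumed interpolation scheme): each
    precision value is replaced by the maximum precision at any higher recall.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    return sum((r[i] - r[i - 1]) * p[i] for i in range(1, len(r)))


def mean_average_precision(ap_per_class):
    """Average the per-category AP values to obtain the mAP."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```

A detector that reaches full recall at precision 1 obtains an AP of 1, and averaging the per-category values gives the mAP.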
III-C Implementation Details
We run a considerable number of experiments to analyze the behavior of the DAPNet framework with various hyperparameters, including the contribution of the CPN with different level references and the contribution of the FRPN with various anchor scales. For fair comparison with other methods, we use the same images and data augmentation strategy for all network training and testing.
III-C1 Baselines and Network Initialization
To comprehensively demonstrate the superiority of the proposed DAPNet method, we carry out comparative experiments with several baselines. Our DAPNet is inspired by Faster R-CNN [29], so we directly use Faster R-CNN as a baseline, marked as Faster RCNN. Then, we show the influence of the fine-region proposal network and the CPN with a second baseline, marked as FFaster RCNN. Besides, several other baselines are compared with our DAPNet framework, including the traditional detectors FDDL and COPD and the deep learning methods transferred CNN and RICNN [38].
For network initialization, there are two ways to initialize network weights in the training process: one is random initialization using the NWPU VHR-10 data set, and the other is fine-tuned initialization using CNN models trained on a public data set. As is common practice, all network backbones are pretrained on the classification dataset [41] and then fine-tuned on our detection dataset. The training and testing of our DAPNet are performed using the open-source Caffe [45] framework.
III-C2 Training and Testing Strategy
The experiment consists of two procedures, the training process and the testing process, both of which are conducted on a computer with an NVIDIA Titan X GPU, 64 GB of memory, and the Compute Unified Device Architecture (CUDA) to improve speed.
For detection in remote sensing images, DAPNet learns a unified network composed of the CPN, the FRPN, and the ARCNN with shared convolutional layers. We use the approximate joint training strategy to train DAPNet: the CPN, FRPN, and ARCNN are merged into one network during training, as shown in Fig. 5. In each stochastic gradient descent (SGD) iteration, during forward propagation the CPN generates category priors, which are combined with the output of the FRPN to produce the adaptive region proposals. The adaptive region proposals are then used to train the ARCNN detector. Backward propagation takes place as usual; for the shared layers, the backward-propagated signals from the CPN loss, the FRPN loss, and the ARCNN loss are combined. This solution is easy to implement, but it ignores the derivative with respect to the proposal box coordinates, which are also network responses, so it is approximate. In our experiments, to ensure fairness, the Faster R-CNN and FRPN are both trained in this way. The hyperparameters of these networks are listed in Table II.
Argument  Value

Learning rate  0.001 (0.0001 for fine-tuning)
Batch size  1 (for FRPN and CPN)
Dropout  0.5
Momentum  0.9
Weight decay  0.0005
Max iter  50000
TABLE II. ARGUMENTS FOR TRAINING DAPNet
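To make the role of the learning rate, momentum, and weight decay in Table II concrete, the following minimal sketch applies one SGD step with momentum and L2 weight decay. It assumes the standard update used by momentum SGD solvers (velocity = momentum × velocity − learning rate × regularized gradient); the function name `sgd_momentum_step` and the toy values are illustrative and are not taken from the paper or from the authors' training code.

```python
def sgd_momentum_step(weights, grads, velocity,
                      lr=0.001, momentum=0.9, weight_decay=0.0005):
    """One SGD step with momentum and L2 weight decay (assumed standard form).

    Default values mirror Table II; the update rule itself is an assumption.
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        # L2 weight decay adds weight_decay * w to the raw gradient.
        g_total = g + weight_decay * w
        # Momentum keeps a running velocity of past updates.
        v_next = momentum * v - lr * g_total
        new_velocity.append(v_next)
        new_weights.append(w + v_next)
    return new_weights, new_velocity


# Toy usage: a single weight of 1.0 with gradient 0.5 and zero initial velocity.
w1, v1 = sgd_momentum_step([1.0], [0.5], [0.0])
```

With these toy values the regularized gradient is 0.5005, so the weight moves by −0.0005005 in the first step.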
III-C3 Category Prior with CPN
To achieve better performance of the category prior network during training, nine base numbers (1, 2, 4, 8, 16, 24, 32, 48, 64) are set as regression references. This gives each category a sufficient regression range to accommodate both dense and sparse objects in remote sensing images. Category levels whose difference value is higher than 0 are assigned as positive level numbers, and levels whose ratio of ground-truth number to base number lies between the to and 2 to 4 are assigned as ignored level numbers to avoid erroneous regression. All remaining category levels are assigned as negative level numbers. During testing, we trust only the regression results whose category level scores are higher than 0.5. For each category, we take the average of these trusted regression results as the final predicted prior.
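The test-time rule described above (keep only levels whose score exceeds 0.5, then average their regressed counts) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `category_prior`, the input layout (one score and one regressed count per base level), and the fallback value of 0.0 when no level is trusted are all assumptions.

```python
# The nine base numbers used as regression references in the text.
BASE_NUMBERS = (1, 2, 4, 8, 16, 24, 32, 48, 64)


def category_prior(level_scores, level_counts, score_threshold=0.5):
    """Average the regressed counts of the levels whose score is trusted.

    level_scores[i] and level_counts[i] belong to BASE_NUMBERS[i].
    Returning 0.0 when no level passes the threshold is an assumption.
    """
    trusted = [count for score, count in zip(level_scores, level_counts)
               if score > score_threshold]
    if not trusted:
        return 0.0
    return sum(trusted) / len(trusted)


# Toy usage for one category: three levels exceed the 0.5 threshold.
scores = [0.1, 0.2, 0.7, 0.9, 0.6, 0.3, 0.1, 0.0, 0.0]
counts = [5.0, 6.0, 7.0, 8.0, 9.0, 20.0, 30.0, 40.0, 50.0]
prior = category_prior(scores, counts)  # mean of 7.0, 8.0, 9.0
```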
To investigate the behavior of the CPN, we conduct an ablation study. To verify the contribution of the CPN to the final detection precision, the proposed DAPNet is compared with the FRPN without the CPN; the results are presented in Table III. DAPNet performs better on the NWPU VHR10 data set, especially for the storage tank and vehicle categories.
In addition, we examine the influence of the CPN on detection speed. The key factor limiting the speed of a two-stage detection framework is the number of region proposals, and dense objects require more region proposals than sparse objects. To make this more intuitive, the numbers of region proposals output by FFaster RCNN and DAPNet are extracted and compared, as shown in Fig. 6.
Fig. 6 shows the numbers of region proposals of FFaster RCNN and DAPNet, and Table III shows the detection results of the two methods. It can be seen that the proposed DAPNet achieves better overall detection performance than FFaster RCNN while requiring fewer region proposals. In addition, the DAPNet framework generates adaptive region proposals according to the object distribution in each image.
III-C4 Region Proposal with FRPN
In the experiments, the architecture of the FRPN is similar to that of the RPN. However, the FRPN produces a large number of proposal regions together with detailed category confidences for each one, so the specific class of each region proposal can be obtained from the FRPN. The adaptive candidate boxes are then produced according to the category priors predicted by the CPN and the positive factor. For example, if the positive factor and a category number is 3, we retain the 3 candidate boxes for this category. In addition, to ensure a sufficient search space for ARCNN, the base number of candidate boxes for each category is 100. In the original Faster RCNN, by contrast, the number of candidate boxes for each image is fixed at 2000. DAPNet uses the FRPN and CPN to adjust the number of candidate boxes in line with the dense or sparse distribution of objects, thereby reducing computation and improving accuracy.
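The per-category selection of candidate boxes can be sketched as a top-k filter over the class-labelled proposals. The text does not give the exact rule that combines the base number, the positive factor, and the category prior, so the budget used below (base number plus positive factor times predicted count) is purely an assumption for illustration; the names `proposal_budget` and `select_adaptive_proposals` are hypothetical as well.

```python
def proposal_budget(prior, positive_factor, base_number=100):
    """Assumed budget rule: base number plus positive factor times the prior.

    The paper does not state this formula; it is a placeholder.
    """
    return base_number + int(round(positive_factor * prior))


def select_adaptive_proposals(proposals, priors, positive_factor,
                              base_number=100):
    """Keep the highest-scoring proposals of each category within its budget.

    proposals: iterable of (category, score, box) tuples.
    priors: dict mapping category to the predicted object count.
    """
    by_category = {}
    for category, score, box in proposals:
        by_category.setdefault(category, []).append((score, box))

    kept = {}
    for category, items in by_category.items():
        budget = proposal_budget(priors.get(category, 0),
                                 positive_factor, base_number)
        # Sort by confidence only, highest first, then truncate to the budget.
        items.sort(key=lambda item: item[0], reverse=True)
        kept[category] = items[:budget]
    return kept


# Toy usage with a small base number so that the truncation is visible.
toy = [("vehicle", 0.1 * i, (i, i, i + 5, i + 5)) for i in range(10)]
kept = select_adaptive_proposals(toy, {"vehicle": 2}, positive_factor=3,
                                 base_number=1)
```

In the toy call the budget is 1 + 3 × 2 = 7, so seven of the ten vehicle proposals are retained, ordered by decreasing score.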
The number of candidate boxes is the key factor limiting the speed of two-stage detectors, and it directly affects the recall rate of detection methods. Therefore, several groups of comparative experiments are conducted by varying the positive factor. Fig. 7 explores the trade-off between the performance of the proposed DAPNet method (measured by mean AP) and the quantity of object proposals (controlled by the positive factor) on our test data set. During training, all other hyperparameters are kept fixed. From Fig. 7 we observe the following: 1) The mean AP improves rapidly and then tends to stabilize as the positive factor increases. 2) There is an evident gap between the mean AP of the object region proposals and that of the final detections, which demonstrates the necessity of ARCNN.
Method  airplane  ship  storage tank  baseball diamond  tennis court  basketball court  ground track field  harbor  bridge  vehicle  mAP 

FDDL  0.2934  0.3768  0.7714  0.2584  0.0269  0.0361  0.2004  0.2541  0.2163  0.0436  0.2477 
COPD  0.6301  0.7027  0.6580  0.8208  0.3351  0.3407  0.8527  0.5606  0.1564  0.4412  0.5499 
Transferred CNN  0.6617  0.5709  0.8493  0.8174  0.3506  0.4611  0.7954  0.6224  0.4265  0.4305  0.5986 
RICNN without finetuning  0.8617  0.7581  0.8502  0.8758  0.3927  0.5797  0.8579  0.6649  0.5834  0.6811  0.7106 
RICNN with finetuning  0.8853  0.7789  0.8573  0.8857  0.4072  0.5780  0.8694  0.6804  0.6182  0.7151  0.7276 
Faster RCNN  0.9091  0.8021  0.7790  0.9091  0.9027  0.8182  0.8859  0.8031  0.7038  0.7882  0.8301 
FFaster RCNN  0.9947  0.8053  0.6893  0.9925  0.8926  0.9053  0.8856  0.8818  0.8148  0.7570  0.8619 
DAPNet  0.9990  0.8075  0.7888  0.9091  0.9870  0.8956  0.9026  0.8564  0.7935  0.8028  0.8742 
TABLE III. PERFORMANCE COMPARISONS OF EIGHT DIFFERENT METHODS IN TERMS OF AP VALUES.
THE BOLD NUMBERS DENOTE THE HIGHEST VALUES IN EACH ROW
III-D Experimental Results and Comparisons
Visual results of the experiments are shown in Fig. 8, obtained by the proposed DAPNet framework and the original Faster RCNN on test images of the NWPU VHR10 dataset. The text at the upper left of each rectangle denotes the object category and the predicted score for that box. Fig. 8(a) shows the detection results of the Faster RCNN model, and Fig. 8(b) shows those of the proposed DAPNet method. As shown in Fig. 8, the Faster RCNN method suffers from many missed detections, especially for dense objects such as storage tanks and vehicles. In contrast, despite the large variations in the distribution and size of objects, the proposed approach successfully detects most of the objects.
To comprehensively evaluate the effectiveness and superiority of the proposed DAPNet model, we compare it with the following methods: 1) Fisher discrimination dictionary learning (FDDL). 2) The collection of part detectors (COPD). 3) A transferred CNN model from AlexNet [46]. 4) A newly trained rotation-invariant convolutional neural network (RICNN) with only the rotation-invariant layer and the softmax layer being trained (without fine-tuning the other layers). 5) A newly trained RICNN model (RICNN with fine-tuning) with all layers being trained. 6) The two-stage detection method Faster RCNN.
In addition, to understand DAPNet better, we conduct ablation experiments to examine how each proposed component affects the final performance. We evaluate our method under two settings: 1) FFaster RCNN: this variant uses only the FRPN component on top of the Faster RCNN model and ablates the CPN component. Comparing the results of FFaster RCNN and Faster RCNN shows the effectiveness of the FRPN component. 2) DAPNet: this is our complete model, consisting of the CPN and FRPN components. Comparing the results of DAPNet and FFaster RCNN verifies the contribution of the CPN.
To make a fair comparison: 1) The same training and testing data sets are used for the proposed DAPNet method and the other comparison methods. 2) We uniformly employ the same unmodified region proposal network parameters for Faster RCNN, FFaster RCNN, and the proposed DAPNet method to generate object proposals. 3) The proposed DAPNet model, the FFaster RCNN model, the Faster RCNN model, and the RICNN model are based on the VGG16 model and are initialized with the same weights pretrained on the ImageNet classification dataset.
Fig. 10 reports the evaluation results on the ten object categories for the proposed DAPNet and the Faster RCNN algorithms. The proposed DAPNet algorithm shows better detection performance on the ten classes, especially for the ship, basketball court, harbor, bridge, and vehicle categories. However, DAPNet shows a smaller improvement on the storage tank and ground track field classes. This can be easily explained: DAPNet generates adaptive region proposals, while Faster RCNN produces 2000 region proposals for each image. Therefore, for sparse objects the number of proposals for DAPNet is much smaller than for Faster RCNN, and Faster RCNN achieves performance similar to that of DAPNet. This also shows that DAPNet can obtain equivalent performance with fewer region proposals.
Table IV summarizes the computation cost of the eight methods. It shows that DAPNet significantly improves the overall detection precision at a low computation cost: 0.408 s per image, well below FDDL, COPD, transferred CNN, and RICNN, and only moderately above Faster RCNN (0.289 s). This demonstrates the effectiveness of the proposed DAPNet model.
Methods  Average running time per image (second) 

FDDL  7.54 
COPD  1.19 
Transferred CNN  5.37 
RICNN without finetuning  8.83 
RICNN with finetuning  8.83 
Faster RCNN  0.289 
FFaster RCNN  0.382 
DAPNet  0.408 
TABLE IV. COMPUTATION TIME COMPARISONS OF EIGHT DIFFERENT METHODS
Table III and Fig. 9 show the quantitative comparison results of the eight methods, measured by AP values and PRCs, respectively. From them we observe the following: 1) The proposed DAPNet method outperforms all other comparison approaches in terms of mean AP over the ten object classes. In particular, DAPNet obtains clear AP gains over the Faster RCNN model on the airplane, tennis court, basketball court, harbor, and bridge categories (see Table III). This demonstrates the superiority of the proposed method over existing state-of-the-art object detection methods for remote sensing images. 2) By adding the category prior network, the mean AP is further boosted from 0.8619 (FFaster RCNN) to 0.8742 (DAPNet), with the largest improvements on the storage tank and vehicle categories. Although our method achieves the best overall performance, its detection accuracy for the storage tank category is lower than that of RICNN and transferred CNN. This is mainly due to the small size of storage tanks: the region proposal network operates on high-level convolutional features, whereas the selective search in RICNN produces region proposals from the original images, so RICNN can obtain a higher AP for small objects with distinctive features than feature-based region proposal networks. In summary, the experiments show the superiority of the proposed DAPNet model.
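The mean AP values and per-class gains discussed above can be recomputed directly from the rows of Table III. The short sketch below copies the Faster RCNN, FFaster RCNN, and DAPNet rows from the table; the helper names `mean_ap` and `ap_gains` are illustrative, and the mean is assumed to be the unweighted average over the ten classes.

```python
CLASSES = ["airplane", "ship", "storage tank", "baseball diamond",
           "tennis court", "basketball court", "ground track field",
           "harbor", "bridge", "vehicle"]

# Per-class AP values copied from Table III.
AP = {
    "Faster RCNN": [0.9091, 0.8021, 0.7790, 0.9091, 0.9027,
                    0.8182, 0.8859, 0.8031, 0.7038, 0.7882],
    "FFaster RCNN": [0.9947, 0.8053, 0.6893, 0.9925, 0.8926,
                     0.9053, 0.8856, 0.8818, 0.8148, 0.7570],
    "DAPNet": [0.9990, 0.8075, 0.7888, 0.9091, 0.9870,
               0.8956, 0.9026, 0.8564, 0.7935, 0.8028],
}


def mean_ap(method):
    """Unweighted mean of the ten per-class AP values, rounded to 4 places."""
    values = AP[method]
    return round(sum(values) / len(values), 4)


def ap_gains(method, baseline):
    """Per-class AP difference between a method and a baseline."""
    return {name: round(a - b, 4)
            for name, a, b in zip(CLASSES, AP[method], AP[baseline])}


gains = ap_gains("DAPNet", "Faster RCNN")
for name in ("airplane", "tennis court", "basketball court",
             "harbor", "bridge"):
    print(f"{name}: {gains[name]:+.4f}")
print("mean AP:", {method: mean_ap(method) for method in AP})
```

Under these assumptions the recomputed means agree with the mAP column of Table III (0.8301, 0.8619, and 0.8742).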
IV Conclusion and Future Work
In this paper, a novel and effective DAPNet framework is proposed to adapt to dense and sparse objects in optical remote sensing images and further improve detection quality. The framework uses the CPN to predict category information for each class as prior information for the FRPN and combines it with the region proposals output by the FRPN, yielding an adaptive proposal network for each image. The experiments demonstrate that our three contributions lead DAPNet to state-of-the-art performance on a publicly available ten-class VHR object detection data set, especially for small objects. However, the FRPN is based on high-level convolutional features, which limits the scale of detectable objects, in particular small objects with distinct feature information such as storage tanks. As a result, the traditional selective search performs better than the FRPN in DAPNet for such objects. Hence, in future work, we intend to further improve the accuracy on small objects by learning a scale adaptation network.
V Acknowledgment
The authors would like to thank the anonymous reviewers for their helpful comments. The authors would also like to thank Prof. Han's team for making the NWPU VHR10 dataset publicly available.
References
 [1] G. Cheng and J. Han, “A survey on object detection in optical remote sensing images,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 117, pp. 11–28, 2016.
 [2] J. Han, D. Zhang, G. Cheng, N. Liu, and D. Xu, “Advanced deep-learning techniques for salient and category-specific object detection: A survey,” IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 84–100, 2018.

 [3] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.
 [4] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr, “Bing: Binarized normed gradients for objectness estimation at 300fps,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3286–3293.
 [5] J. Xu, X. Sun, D. Zhang, and K. Fu, “Automatic detection of inshore ships in high-resolution remote sensing images using robust invariant generalized Hough transform,” IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 12, pp. 2070–2074, 2014.
 [6] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005, pp. 886–893.
 [7] W. Shao, W. Yang, G. Liu, and J. Liu, “Car detection from high-resolution aerial imagery using multiple features,” in Geoscience and Remote Sensing Symposium, 2012, pp. 4379–4382.
 [8] J. Han, D. Zhang, G. Cheng, L. Guo, and J. Ren, “Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 6, pp. 3325–3337, 2015.
 [9] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
 [10] Y. Karklin and M. S. Lewicki, “A hierarchical bayesian model for learning nonlinear statistical regularities in nonstationary natural signals,” Neural Computation, vol. 17, no. 2, pp. 397–423, 2005.
 [11] S. Xu, T. Fang, D. Li, and S. Wang, “Object classification of aerial images with bag-of-visual words,” IEEE Geoscience and Remote Sensing Letters, vol. 7, no. 2, pp. 366–370, 2010.
 [12] F. Zhang, B. Du, and L. Zhang, “Saliency-guided unsupervised feature learning for scene classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 4, pp. 2175–2184, 2014.
 [13] J. Han, P. Zhou, D. Zhang, G. Cheng, L. Guo, Z. Liu, S. Bu, and J. Wu, “Efficient, simultaneous detection of multi-class geospatial targets based on visual saliency modeling and discriminative learning of sparse coding,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 89, no. 1, pp. 37–48, 2014.
 [14] G. Cheng and J. Han, “A survey on object detection in optical remote sensing images,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 117, pp. 11–28, 2016.
 [15] T. T. Nguyen, H. Grabner, H. Bischof, and B. Gruber, “Online boosting for car detection from aerial images,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 63, no. 3, pp. 382–396, 2008.
 [16] J. Leitloff, S. Hinz, and U. Stilla, “Vehicle detection in very high resolution satellite images of city areas,” IEEE Transactions on Geoscience and Remote Sensing, vol. 48, no. 7, pp. 2795–2806, 2010.
 [17] Y. Chen, N. M. Nasrabadi, and T. D. Tran, “Sparse representation for target detection in hyperspectral imagery,” IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 3, pp. 629–640, 2011.
 [18] G. Cheng, L. Guo, T. Zhao, J. Han, H. Li, and J. Fang, “Automatic landslide detection from remote-sensing imagery using a scene classification method based on bovw and plsa,” International Journal of Remote Sensing, vol. 34, no. 1, pp. 45–59, 2013.
 [19] P. Zhong and R. Wang, “A multiple conditional random fields ensemble model for urban area detection in remote sensing optical images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 45, no. 12, pp. 3978–3988, 2007.
 [20] H. Sun, X. Sun, H. Wang, Y. Li, and X. Li, “Automatic target detection in high-resolution remote sensing images using spatial sparse coding bag-of-words model,” IEEE Geoscience and Remote Sensing Letters, vol. 9, no. 1, pp. 109–113, 2011.
 [21] D. C. Park, M. A. El-Sharkawi, R. J. Marks II, L. E. Atlas, and M. J. Damborg, “Electric load forecasting using an artificial neural network,” IEEE Transactions on Power Systems, vol. 6, no. 2, pp. 442–449, 1991.
 [22] X. Liu, L. Jiao, J. Zhao, J. Zhao, D. Zhang, F. Liu, S. Yang, and X. Tang, “Deep multiple instance learning-based spatial-spectral classification for pan and ms imagery,” IEEE Transactions on Geoscience and Remote Sensing, 2018.
 [23] L. Jiao, M. Liang, H. Chen, S. Yang, H. Liu, and X. Cao, “Deep fully convolutional network-based spatial distribution prediction for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 10, pp. 5585–5599, 2017.
 [24] W. Zhao, L. Jiao, W. Ma, J. Zhao, J. Zhao, H. Liu, X. Cao, and S. Yang, “Superpixel-based multiple local cnn for panchromatic and multispectral image classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 7, pp. 4141–4156, 2017.
 [25] L. Jiao and F. Liu, “Wishart deep stacking network for fast polsar image classification,” IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3273–3286, 2016.
 [26] F. Liu, L. Jiao, B. Hou, and S. Yang, “Polsar image classification based on wishart dbn and local spatial information,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 6, pp. 3292–3308, 2016.
 [27] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” Computer Science, pp. 580–587, 2013.
 [28] R. Girshick, “Fast rcnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
 [29] S. Ren, K. He, R. Girshick, and J. Sun, “Faster rcnn: Towards real-time object detection with region proposal networks,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 6, p. 1137, 2017.
 [30] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie, “Feature pyramid networks for object detection,” in CVPR, vol. 1, no. 2, 2017, p. 4.
 [31] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
 [32] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European Conference on Computer Vision. Springer, 2016, pp. 21–37.
 [33] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li, “S3fd: Single shot scale-invariant face detector,” in Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017, pp. 192–201.
 [34] C.Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg, “Dssd: Deconvolutional single shot detector,” arXiv preprint arXiv:1701.06659, 2017.
 [35] Z. Shen, Z. Liu, J. Li, Y.G. Jiang, Y. Chen, and X. Xue, “Dsod: Learning deeply supervised object detectors from scratch,” in The IEEE International Conference on Computer Vision (ICCV), vol. 3, no. 6, 2017, p. 7.
 [36] T. Kong, F. Sun, A. Yao, H. Liu, M. Lu, and Y. Chen, “Ron: Reverse connection with objectness prior networks for object detection,” in IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, 2017, p. 2.
 [37] T. Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal loss for dense object detection,” in IEEE International Conference on Computer Vision, 2017, pp. 2999–3007.
 [38] G. Cheng, P. Zhou, and J. Han, “Learning rotation-invariant convolutional neural networks for object detection in vhr optical remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 12, pp. 7405–7415, 2016.
 [39] K. Li, G. Cheng, S. Bu, and X. You, “Rotation-insensitive and context-augmented object detection in remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 56, no. 4, pp. 2337–2348, 2018.
 [40] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
 [41] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, and M. Bernstein, “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2014.
 [42] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” Journal of Machine Learning Research, vol. 9, pp. 249–256, 2010.
 [43] Y. Lecun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.
 [44] G. Cheng, J. Han, P. Zhou, and L. Guo, “Multi-class geospatial object detection and geographic image classification based on collection of part detectors,” ISPRS Journal of Photogrammetry and Remote Sensing, vol. 98, no. 1, pp. 119–132, 2014.
 [45] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, and J. Long, “Caffe: Convolutional architecture for fast feature embedding,” pp. 675–678, 2014.
 [46] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in International Conference on Neural Information Processing Systems, 2012, pp. 1097–1105.