With the advance of deep neural networks, object detection (e.g., Faster R-CNN, YOLO, SSD) has witnessed great progress on natural images (e.g., 600×400 images in MS COCO) in recent years. Despite the promising results for general object detection, the performance of these detectors on aerial images (e.g., 2,000×1,500 images in VisDrone) is far from satisfactory in both accuracy and efficiency, which is caused by two challenges: (1) targets typically have small scales relative to the images; and (2) targets are generally sparsely and nonuniformly distributed in the whole image.
Compared with objects in natural images, the scale challenge leads to less effective feature representations in deep networks for objects in aerial images. It is therefore difficult for modern detectors to effectively leverage appearance information to distinguish objects from the surrounding background or from similar objects. To deal with the scale issue, a natural solution is to partition an aerial image into several uniform small chips and then perform detection on each of them [10, 23]. Although these approaches alleviate the resolution challenge to some extent, they are inefficient because they ignore the target sparsity challenge. Consequently, a lot of computation resources are wasted on regions with sparse or even no objects (see Fig. 1). We observe from Fig. 1 that objects in an aerial image are not only sparse and nonuniform but also tend to be highly clustered in certain regions. For example, pedestrians are usually concentrated in squares and vehicles are often clustered on highways. Hence, an intuitive way to improve detection efficiency is to focus the detector on these clustered regions containing a large amount of objects.
Inspired by this motivation, this paper proposes a novel clustered detection (ClusDet) network that addresses both aforementioned challenges by integrating object and cluster detection in a uniform framework. As illustrated in Fig. 2, ClusDet consists of three key components: a cluster proposal sub-network (CPNet), a scale estimation sub-network (ScaleNet) and a baseline detection network (DetecNet). Based on the initial detection of an aerial image, CPNet generates a set of object cluster regions. These clustered regions are then cropped out for subsequent fine detection. To this end, the regions first have to be resized to fit the detector, which may result in extremely large or small objects in the clustered regions and thus deteriorate detection performance. To handle this issue, we present the ScaleNet to estimate an appropriate scale for the objects in each cluster chip and rescale the chip accordingly before feeding it to a detector, which differs from [10, 23, 18], where cropped chips are directly resized. Afterwards, each cluster chip is fed to the dedicated detector DetecNet for fine detection. The final detection is achieved by fusing the detection results on both the cluster chips and the global image.
Compared to previous approaches, the proposed ClusDet shows several advantages: (i) owing to the CPNet, we only need to process the clustered regions containing plenty of objects, which significantly reduces the computation cost and improves detection efficiency; (ii) with the help of the ScaleNet, each cluster chip is refined for better subsequent fine detection, leading to improvement in accuracy; and (iii) the DetecNet is specially designed for clustered region detection and implicitly models prior context information to further boost detection accuracy. In extensive experiments on three aerial image datasets, ClusDet achieves the best performance with a single model while requiring less computation.
In summary, the paper has the following contributions:
A novel ClusDet network is proposed to simultaneously address the scale and sparsity challenges for object detection in aerial images.
An effective ScaleNet is presented to alleviate the nonuniform scale issue in cluster chips for better fine detection.
2 Related work
Object detection has been extensively explored in recent decades with a huge amount of literature. In the following, we first review three lines of work that are the most relevant to ours, and then highlight the differences between existing approaches and ours.
Generic Object Detection. Inspired by the success in image recognition, deep convolutional neural networks (CNNs) have dominated object detection. According to the detection pipeline, existing detectors can be roughly categorized into two types: region-based detectors and region-free detectors. Region-based detectors separate detection into two steps: proposal extraction and object detection. In the first stage, the search space for detection is significantly reduced by extracting candidate regions (i.e., proposals). In the second stage, these proposals are further classified into specific categories. Representatives of region-based detectors include R-CNN, Fast/er R-CNN [11, 26], Mask R-CNN and Cascade R-CNN. In contrast, region-free detectors, such as SSD, YOLO, YOLO9000, RetinaNet and RefineDet, perform detection without region proposals, which leads to high efficiency at the sacrifice of accuracy.
Despite excellent performance on natural images (e.g., 500×400 images in PASCAL VOC and 600×400 images in MS COCO), the performance of these generic detectors degrades when they are applied to high-resolution aerial images (e.g., 2,000×1,500 images in VisDrone).
Aerial Image Detection.
Compared to detection in natural images, aerial image detection is more challenging because (1) objects have small scales relative to the high-resolution aerial images and (2) targets are sparse, nonuniform and concentrated in certain regions. Since this work is focused on deep learning, we only review relevant works using deep neural networks for aerial image detection. In , a simple CNN-based approach is presented for automatic detection in aerial images. The method in  integrates detection in aerial images with semantic segmentation to improve performance. In , the authors directly extend Fast/er R-CNN [11, 26] for vehicle detection in aerial images. The work of  proposes a coupled region-based CNN for aerial vehicle detection. The approach of  investigates the misalignment between Regions of Interest (RoIs) and objects in aerial image detection, and introduces an RoI transformer to address this issue. The algorithm in  presents a scale adaptive proposal network for object detection in aerial images.
Region Search in Detection. The strategy of region search is commonly adopted in detection to handle small objects. The approach of  proposes to adaptively direct computational resources to sub-regions where objects are sparse and small. The work of  introduces a context-driven search method to efficiently localize the regions containing a specific class of object. In , the authors propose to dynamically explore the search space in proposal-based object detection by learning contextual relations. The method in  proposes to leverage reinforcement learning to sequentially select regions for detection at a higher resolution scale. In a more specific domain, vehicle detection in wide aerial motion imagery (WAMI), the work of  proposes two-stage spatial-temporal convolutional neural networks to detect vehicles from a sequence of WAMI frames.
Our Approach. In this paper, we aim at solving the two aforementioned challenges for aerial image detection. Our approach is related to but different from previous region search based detectors (e.g., [23, 10]), which partition high-resolution images into small uniform chips for detection. In contrast, our solution first predicts cluster regions in the images and then extracts these clustered regions for fine detection, leading to a significant reduction of the computation cost. Although the method in  also performs detection on chips that potentially contain objects, our approach significantly differs from it. In , the obtained chips are directly resized to fit the detector for subsequent detection. In contrast, inspired by the observation in  that objects with extreme scales may deteriorate detection performance, we propose a ScaleNet to alleviate this issue, resulting in improved fine detection on each chip.
3 Clustered Detection (ClusDet) Network
As shown in Fig. 2
, detection of an aerial image consists of three stages: cluster region extraction, fine detection on cluster chips and fusion of detection results. Specifically, after feature extraction from an aerial image, CPNet takes the feature maps as input and outputs the clustered regions. To avoid processing too many cluster chips, we propose an iterative cluster merging (ICM) module to reduce the noisy cluster chips. Afterwards, the cluster chips, together with the initial detection results on the global image, are fed into the ScaleNet to estimate an appropriate scale for the objects in each cluster chip. With this scale information, the cluster chips are rescaled for fine detection with DetecNet. The final detection is obtained by fusing the detection results of each cluster chip and the global image with standard non-maximum suppression (NMS).
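The three-stage pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: all function arguments (`detect`, `propose_clusters`, `estimate_scale`, `rescale_chip`, `crop`, `nms`) are hypothetical stand-ins for the corresponding components, and the mapping of chip detections back to global image coordinates is omitted.

```python
# Sketch of the ClusDet inference pipeline. All callables are hypothetical
# stand-ins; detections are (x1, y1, x2, y2, score) tuples.

def clusdet_inference(image, detect, propose_clusters, estimate_scale,
                      rescale_chip, crop, nms, top_n=3):
    """Run global detection, refine the cluster chips, and fuse the results."""
    global_dets = detect(image)                    # initial detection on the whole image
    clusters = propose_clusters(image)[:top_n]     # CPNet proposals after ICM merging
    all_dets = list(global_dets)
    for cluster_box in clusters:
        chip = crop(image, cluster_box)            # crop the cluster region
        scale = estimate_scale(chip, global_dets)  # ScaleNet: relative scale offset
        chip = rescale_chip(chip, scale)           # rescale before fine detection
        all_dets.extend(detect(chip))              # fine detection on the chip
    return nms(all_dets)                           # local-global fusion with NMS
```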
3.2 Cluster Region Extraction
Cluster region extraction consists of two steps: initial cluster generation using cluster proposal sub-network (CPNet) and cluster reduction with iterative cluster merging (ICM).
3.2.1 Cluster Proposal Sub-network (CPNet)
The core of the cluster region extraction is the cluster proposal sub-network (CPNet). CPNet works on the high-level feature maps of an aerial image and aims at predicting the locations and scales of clusters. (In this work, a cluster in an aerial image is defined as a rectangular region containing at least three objects.) Motivated by the region proposal network (RPN) , we formulate CPNet as a block of fully convolutional networks. Specifically, CPNet takes as input the high-level feature maps from the feature extraction backbone, and utilizes two subnets for regression and classification, respectively. Although our CPNet shares a similar idea with RPN, they are different: RPN is used to propose candidate regions of objects, while CPNet aims at proposing candidate regions of clusters. Compared to an object proposal, a cluster is much larger in size, and thus CPNet needs a larger receptive field than RPN. For this reason, we attach CPNet to the top of the feature extraction backbone.
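An RPN-style head like CPNet slides anchors over a coarse feature map; the key difference noted above is that cluster anchors must be much larger than typical object anchors. The sketch below only illustrates this anchor generation step; the anchor sizes and aspect ratios are illustrative assumptions, not the paper's settings.

```python
import numpy as np

# Illustrative cluster-anchor generation for a CPNet-style head. The sizes
# here are assumptions: cluster anchors are much larger than typical RPN
# object anchors because one cluster covers many objects.

def cluster_anchors(feat_h, feat_w, stride, sizes=(256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return an (N, 4) array of anchors in (x1, y1, x2, y2) image coordinates."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride   # anchor center
            for s in sizes:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)      # keep area == s * s
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.asarray(anchors)
```

A regression subnet would then predict offsets from these anchors to cluster boxes, and a classification subnet would score each anchor as cluster / non-cluster, mirroring the RPN formulation.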
It is worth noting that the learning of CPNet is a supervised process. However, none of the existing public datasets provide groundtruth for clusters. In this work, we adopt a simple strategy to generate the required cluster groundtruth for training CPNet. We refer the readers to the supplementary material for details on generating cluster groundtruth.
3.2.2 Iterative Cluster Merging (ICM)
As shown in Fig. 3 (a), the initial clusters produced by CPNet are dense and messy. Such cluster regions are difficult to leverage directly for fine detection because of their high overlap and large number, resulting in an extremely heavy computation burden in practice. To solve this problem, we present a simple yet effective iterative cluster merging (ICM) module to clean up the clusters. Let B represent the set of cluster bounding boxes detected by CPNet, and S denote the corresponding cluster classification scores. With a pre-defined overlap threshold and a maximum number TopN of clusters after merging, we obtain the merged cluster set with Alg. 1.
Briefly speaking, we first find the cluster with the highest score, then select the clusters whose overlaps with it are larger than the threshold and merge them with it. All the merged clusters are removed from B. Afterwards, we repeat the aforementioned process until B is empty. The procedure above corresponds to the non-max merging (NMM) in Alg. 1. We conduct NMM several times until the preset maximum number of clusters is reached. For the details of NMM, the readers are referred to the supplementary material. Fig. 3 (b) shows the final merged clusters, demonstrating that the proposed ICM module is able to effectively merge the dense and messy clusters.
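The NMM/ICM procedure described above can be sketched as follows. This is a reading of the textual description rather than a reproduction of Alg. 1: merging two clusters is assumed to produce their enclosing box, and the iteration stops when the cluster count no longer shrinks.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an (N, 4) array of boxes, (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def non_max_merging(boxes, scores, iou_thr=0.7):
    """One NMM pass: greedily merge clusters overlapping the top-scored one."""
    boxes, scores = boxes.copy(), scores.copy()
    merged_boxes, merged_scores = [], []
    while len(boxes) > 0:
        i = int(np.argmax(scores))
        keep = iou(boxes[i], boxes) <= iou_thr      # clusters NOT merged this round
        group = boxes[~keep]                        # includes boxes[i] itself
        merged_boxes.append([group[:, 0].min(), group[:, 1].min(),
                             group[:, 2].max(), group[:, 3].max()])
        merged_scores.append(scores[i])             # enclosing box keeps the top score
        boxes, scores = boxes[keep], scores[keep]
    return np.asarray(merged_boxes), np.asarray(merged_scores)

def iterative_cluster_merging(boxes, scores, iou_thr=0.7, top_n=3, max_iter=10):
    """Repeat NMM until at most top_n clusters remain or merging converges."""
    for _ in range(max_iter):
        if len(boxes) <= top_n:
            break
        new_boxes, new_scores = non_max_merging(boxes, scores, iou_thr)
        if len(new_boxes) == len(boxes):            # no further merging possible
            break
        boxes, scores = new_boxes, new_scores
    return boxes[:top_n], scores[:top_n]
```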
3.3 Fine Detection on Cluster Chip
After obtaining the cluster chips, a dedicated detector is utilized to perform fine detection on these chips. Unlike existing approaches [23, 18, 10] that directly resize these chips for detection, we present a scale estimation sub-network (ScaleNet) to estimate the scales of objects in the chips, which prevents extreme object scales from degrading detection performance. Based on the estimated scales, we perform partition and padding (PP) operations on each chip before detection.
3.3.1 Scale Estimation Sub-network (ScaleNet)
We regard scale estimation as a regression problem and formulate ScaleNet with a stack of fully connected layers. As shown in Fig. 4, ScaleNet receives three inputs, namely the feature maps extracted by the network backbone, the cluster bounding boxes and the initial detection results on the global image, and outputs a relative scale offset for the objects in each cluster chip. Here, the initial detection results are obtained from the detection subnet.
Let p_i be the relative scale offset for cluster i, i.e., p_i = (s̄_i − s_i) / s_i, where s_i and s̄_i represent the reference scale of the detected objects and the average scale of the groundtruth boxes in cluster i, respectively. The loss of the ScaleNet can then be defined as a smooth L1 regression loss between the predicted offsets p̂_i and the targets p_i, L = (1/N) Σ_i smooth_L1(p̂_i − p_i), where N is the number of clusters.
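A minimal sketch of this regression loss, under the assumption that the target is the relative offset between the average groundtruth scale and the detected reference scale and that the penalty is the standard smooth L1 (Huber) function used in Fast R-CNN box regression:

```python
import numpy as np

def smooth_l1(x, beta=1.0):
    """Smooth L1 (Huber) penalty, as used in Fast R-CNN box regression."""
    x = np.abs(x)
    return np.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta)

def scale_loss(pred_offsets, det_scales, gt_scales):
    """Mean smooth-L1 loss over a batch of clusters (assumed formulation).

    det_scales: reference scale of the detected objects in each cluster chip.
    gt_scales:  average groundtruth object scale in each cluster chip.
    """
    target = (gt_scales - det_scales) / det_scales   # relative scale offset p_i
    return float(np.mean(smooth_l1(pred_offsets - target)))
```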
3.3.2 Partition and Padding (PP)
The partition and padding (PP) operations ensure that the scales of objects fall within a reasonable range. Given a cluster bounding box, the corresponding estimated object scale and the input size of the detector, we can compute the approximate object scale in the input space of the detector. If this scale exceeds the upper bound of a certain range, the cluster is padded proportionally; if it falls below the lower bound, the cluster is partitioned into two equal chips. Note that detections in the padded region are ignored in the final detection. The process is visualized in Fig. 5. The specific scale range setting is discussed in Section 4.
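The decision logic above can be sketched as follows. The scale range values and the choice of splitting along the longer side are illustrative assumptions (the range here follows COCO's small/medium boundary; the paper's exact setting is given in Section 4):

```python
def partition_or_pad(cluster, obj_scale, input_size, scale_range=(32, 96)):
    """Decide how to preprocess one cluster chip before fine detection (sketch).

    cluster:     (x1, y1, x2, y2) cluster bounding box in image coordinates.
    obj_scale:   estimated object scale inside the chip (from ScaleNet).
    input_size:  detector input size; scale_range: target object-scale range
                 in detector input space (illustrative values).
    """
    x1, y1, x2, y2 = cluster
    w, h = x2 - x1, y2 - y1
    scale_in_input = obj_scale * input_size / max(w, h)  # approx. scale after resize
    if scale_in_input > scale_range[1]:
        # Objects would become too large: pad the chip proportionally so the
        # resize factor shrinks (detections in the padded area are ignored).
        pad = scale_in_input / scale_range[1]
        return [("pad", (x1, y1, x1 + w * pad, y1 + h * pad))]
    if scale_in_input < scale_range[0]:
        # Objects would stay too small: split into two equal chips along the
        # longer side so each half is resized with a larger factor.
        if w >= h:
            return [("chip", (x1, y1, x1 + w / 2, y2)),
                    ("chip", (x1 + w / 2, y1, x2, y2))]
        return [("chip", (x1, y1, x2, y1 + h / 2)),
                ("chip", (x1, y1 + h / 2, x2, y2))]
    return [("chip", (x1, y1, x2, y2))]                  # already in range
```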
3.4 Final Detection with Local-Global Fusion
The final detection of an aerial image is obtained by fusing the local detection results on the cluster chips and the global detection results on the whole image with standard NMS post-processing (see Fig. 6). The local detection results are obtained through the proposed approach described above, and the global detection results are derived from the detection subnet (Fig. 2). It is worth noting that any existing modern detector can be used for global detection.
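The fusion step can be sketched as follows: chip detections are translated back into global image coordinates by their chip offsets and merged with the global detections under standard NMS. This is a minimal illustration (score-sorted greedy NMS, translation-only coordinate mapping); the actual implementation may also rescale chip detections.

```python
import numpy as np

def nms(dets, iou_thr=0.5, max_det=500):
    """Standard greedy NMS over (x1, y1, x2, y2, score) rows."""
    dets = dets[np.argsort(-dets[:, 4])]                 # sort by descending score
    keep = []
    while len(dets) > 0 and len(keep) < max_det:
        top = dets[0]
        keep.append(top)
        x1 = np.maximum(top[0], dets[1:, 0]); y1 = np.maximum(top[1], dets[1:, 1])
        x2 = np.minimum(top[2], dets[1:, 2]); y2 = np.minimum(top[3], dets[1:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_t = (top[2] - top[0]) * (top[3] - top[1])
        area_r = (dets[1:, 2] - dets[1:, 0]) * (dets[1:, 3] - dets[1:, 1])
        iou = inter / (area_t + area_r - inter)
        dets = dets[1:][iou <= iou_thr]                  # suppress overlapping boxes
    return np.asarray(keep)

def fuse_detections(global_dets, chip_dets_list, chip_offsets, iou_thr=0.5):
    """Map chip detections to image coordinates and fuse them with global ones."""
    mapped = [global_dets]
    for dets, (ox, oy) in zip(chip_dets_list, chip_offsets):
        shifted = dets.copy()
        shifted[:, [0, 2]] += ox                         # translate x1, x2
        shifted[:, [1, 3]] += oy                         # translate y1, y2
        mapped.append(shifted)
    return nms(np.concatenate(mapped, axis=0), iou_thr)
</```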
4.1 Implementation Details
We implement ClusDet based on the publicly available Detectron  and Caffe2. Faster R-CNN (FRCNN)  with a Feature Pyramid Network (FPN)  is adopted as the baseline detection network (DetecNet). The feature subnet and detection subnet (Fig. 2) share weights with the backbone network and the object detection subnet in DetecNet, respectively. The architecture of CPNet is implemented as a convolutional layer followed by two sibling convolutional layers (for regression and classification, respectively). In ScaleNet, the FC layers that convert the feature map into a feature vector have size 1,024; the FC layers in the scale offset regressor have sizes 1,024 and 1, respectively. The IoU threshold for merging clusters in the NMM process is set to 0.7. Following the definition in the COCO dataset, the object scale range in cluster chip partition and padding is set to pixels.
Training phase. The input size of the detector is set to pixels on the VisDrone  and UAVDT  datasets and pixels on the DOTA  dataset. On all three datasets, the training data is augmented by dividing images into chips. On the VisDrone  and UAVDT  datasets, each image is uniformly divided into 6 and 4 chips, respectively, without overlap. The reason for setting these specific numbers of chips is that the size of a cropped chip is then similar to that of images in the COCO  dataset. On the DOTA  dataset, we use the tool provided by the authors to divide the images. When training the model on the VisDrone  and UAVDT  datasets using 2 GPUs, we set the base learning rate to 0.005 and the total number of iterations to 140k. After the first 120k iterations, the learning rate decreases to 0.0005. Then, we train the model for 100k iterations before lowering the learning rate to 0.00005. A momentum of 0.9 and a parameter decay of 0.0005 (on weights and biases) are used. On the DOTA  dataset, the base learning rate and the total number of iterations are set to 0.005 and 40k, respectively. The learning rate is decreased by a factor of 0.1 after 30k and 35k iterations.
Test phase. The input size of the detector is the same as in the training phase unless otherwise specified. The maximum number of clusters (TopN) in cluster chip generation is empirically set to 3 on the VisDrone , 2 on the UAVDT , and 5 on the DOTA  dataset. In detection fusion, the threshold of the standard non-max suppression (NMS) is set to 0.5 on all datasets. The maximum number of final detections is set to 500.
VisDrone. The dataset consists of 10,209 images (6,471 for training, 548 for validation, 3,190 for testing) with rich annotations for ten categories of objects. The image scale of the dataset is about pixels. Since the evaluation server is currently closed, we cannot evaluate our method on the test set. Therefore, the validation set is used as the test set to evaluate our method.
UAVDT. The UAVDT  dataset contains 23,258 training images and 15,069 test images. The resolution of the images is about pixels. The dataset was acquired with a UAV platform at a number of locations in urban areas. The annotated object categories are car, bus, and truck.
DOTA. The dataset is collected from multiple sensors and platforms (e.g., Google Earth) with multiple resolutions (800×800 through 4,000×4,000 pixels) in multiple cities. Fifteen categories are chosen and annotated. Since our method is based on the cluster characteristic of objects in aerial images, some categories in the dataset, e.g., roundabout and bridge, are not suitable for our method. Thus, we only choose images containing movable objects, i.e., plane, ship, large vehicle, small vehicle, and helicopter, to evaluate our method. The resulting training and validation sets contain 920 and 285 images, respectively.
4.3 Compared Methods
We compare our ClusDet with the evenly image partition (EIP) method on all datasets. If EIP results are not provided for a dataset, we implement EIP according to the properties of that dataset. In addition, we compare our method with representative state-of-the-art methods on all datasets.
4.4 Evaluation Metric
Following the evaluation protocol of the COCO  dataset, we use AP, AP50, and AP75 as the metrics to measure precision. Specifically, AP is computed by averaging over all categories, while AP50 and AP75 are computed at the single IoU thresholds 0.5 and 0.75, respectively, over all categories. Efficiency is measured by the number of images that need to be processed by the detector in the inference stage, i.e., the sum of global images and cropped chips. This number of images is reported in the subsequent experiments.
4.5 Ablation Study
To validate the contributions of cluster detection and scale estimation to the improvement of detection performance, we conduct extensive experiments on the VisDrone  dataset. In the following experiments, the input size of the detector in the test phase is set to pixels. To validate whether the proposed method gains consistent improvement under different backbone networks, we conduct experiments with three backbones: ResNet-50 , ResNet-101 , and ResNeXt-101 .
Effect of EIP. The experimental results are listed in Table 1. We note that FRCNN  performs worse than on the COCO  dataset (AP=36.7). This is because the relative scale of objects to images in the VisDrone  dataset is much smaller than in the COCO  dataset. By applying EIP to the images, the performance of the detectors increases significantly, especially on small objects. However, the number of images that need to be processed increases 6 times (3,288 vs 548). In addition, we note that although the overall performance is improved by applying EIP, the performance on large objects decreases. This is because EIP truncates large objects into pieces, which results in a lot of false positives.
Effect of Cluster Detection. From Table 1, we note that DetecNet+CPNet processes far fewer images (1,945 vs 3,288) yet achieves better performance than FRCNN  plus EIP. This demonstrates that CPNet not only selects the clustered regions to save computation but also implicitly encodes prior context information to improve performance. In addition, we note that, unlike EIP, CPNet does not reduce the performance on large objects. This can be attributed to CPNet introducing the spatial distribution information of objects into the ClusDet network, thereby avoiding the truncation of large objects.
Effect of Scale Estimation. After integrating ScaleNet with CPNet and DetecNet, the number of processed images increases to 2,716, because the PP module partitions some cluster chips into pieces. This mitigates the small scale problem in detection, such that the performance is improved to 26.7 with the ResNet-50  backbone network. In addition, we see that ScaleNet improves the detection performance on all backbone networks. In particular, the metric is boosted by 2-3 points, and is increased by 1.6 points even on the very strong ResNeXt-101  backbone. This demonstrates that ScaleNet does alleviate the scale problem to a certain extent.
The Effect of Hyperparameter TopN. To fairly investigate the effect of TopN, we only change this setting in the test phase, which avoids any influence from the amount of training data. From Fig. 7, we see that beyond a certain value of TopN, the number of processed images gradually increases, yet the AP does not change much and merely fluctuates. This means that many cluster regions are computed repeatedly when TopN is set to a high value. This observation also indicates that the cluster merging operation is critical for decreasing the computation cost.
4.6 Quantitative Results
VisDrone. The detection performance of the proposed method and representative detectors, i.e., Faster R-CNN and RetinaNet, is shown in Table 2. We note that our method outperforms the state-of-the-art methods by a large margin across various backbone settings. Besides, we observe that when testing the model with the multi-scale setting, the performance is significantly boosted, except for the methods using EIP. This is because in the multi-scale test, the cropped chips are resized to extremely large scales, such that the detectors output a lot of false positives on background or on local regions of objects.
UAVDT. The experimental results on the UAVDT  dataset are displayed in Table 3. The performance of the compared methods, except for FRCNN +FPN , is computed from the experimental results provided in . From Table 3, we observe that applying EIP to the test data does not improve performance; on the contrary, it dramatically decreases it (11.0 vs 6.1). The reason for this phenomenon is that the objects, i.e., vehicles, in UAVDT always appear in the center of the image, while the EIP operation divides objects into pieces such that the detector cannot correctly estimate the object scale. Our method is superior to both FRCNN +FPN  (FFPN) and FFPN+EIP. The performance improvement mainly benefits from the different image cropping operation: in our method, the image is cropped based on the cluster information, which is less likely to truncate numerous objects. The performance of all detectors on UAVDT  is much lower than on VisDrone , which is caused by the extremely unbalanced data.
DOTA. On the DOTA dataset, our method achieves performance similar to state-of-the-art methods but processes dramatically fewer image chips. This is because CPNet significantly reduces the number of chips for fine detection. Although our method does not outperform the state-of-the-art methods in terms of overall performance, it obtains a higher AP75 value, which indicates that our method estimates the object scale more precisely. Besides, we observe that the performance does not change much when more complex backbone networks are adopted. This can be attributed to the limited number of training images: without a large amount of data, a complex model cannot realize its advantage.
This work presents a clustered object detection (ClusDet) network that unifies object clustering and object detection in an end-to-end framework. We show that the ClusDet network can successfully predict the clustered regions in an image, significantly reducing the number of chips for detection and thus improving efficiency. Moreover, we propose a cluster-based object scale estimation network to effectively detect small objects. In addition, we experimentally demonstrate that the proposed ClusDet network implicitly models prior context information to improve detection precision. Through extensive experiments, we show that our method obtains state-of-the-art performance on three public datasets.
-  B. Alexe, N. Heess, Y. W. Teh, and V. Ferrari. Searching for objects driven by context. In NIPS. 2012.
-  N. Audebert, B. Le Saux, and S. Lefèvre. Segment-before-detect: Vehicle detection and classification through semantic segmentation of aerial images. Remote Sensing, 9(4):368, 2017.
-  Z. Cai and N. Vasconcelos. Cascade r-cnn: Delving into high quality object detection. In CVPR, 2018.
-  X. S. Chen, H. He, and L. S. Davis. Object detection in 20 questions. In WACV, 2016.
-  J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In NIPS, 2016.
-  Z. Deng, H. Sun, S. Zhou, J. Zhao, and H. Zou. Toward fast and accurate vehicle detection in aerial images using coupled region-based convolutional neural networks. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 10(8):3652–3664, 2017.
-  J. Ding, N. Xue, Y. Long, G.-S. Xia, and Q. Lu. Learning roi transformer for detecting oriented objects in aerial images. In CVPR, 2019.
-  D. Du, Y. Qi, H. Yu, Y. Yang, K. Duan, G. Li, W. Zhang, Q. Huang, and Q. Tian. The unmanned aerial vehicle benchmark: object detection and tracking. In ECCV, 2018.
-  M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. IJCV, 111(1):98–136, 2015.
-  M. Gao, R. Yu, A. Li, V. I. Morariu, and L. S. Davis. Dynamic zoom-in network for fast object detection in large images. In CVPR, 2018.
-  R. Girshick. Fast r-cnn. In ICCV, 2015.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
-  R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He. Detectron. https://github.com/facebookresearch/detectron, 2018.
-  K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In ICCV, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  T. Kong, F. Sun, A. Yao, H. Liu, M. Lu, and Y. Chen. Ron: Reverse connection with objectness prior networks for object detection. In CVPR, 2017.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
-  R. LaLonde, D. Zhang, and M. Shah. Clusternet: Detecting small objects in large scenes by exploiting spatio-temporal information. In CVPR, 2018.
-  T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
-  T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In ICCV, 2017.
-  T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In ECCV, 2016.
-  Y. Lu, T. Javidi, and S. Lazebnik. Adaptive object detection using adjacency and zoom prediction. In CVPR, 2016.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
-  J. Redmon and A. Farhadi. Yolo9000: better, faster, stronger. In CVPR, 2017.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
-  I. Ševo and A. Avramović. Convolutional neural network based automatic object detection on aerial images. GRSL, 13(5):740–744, 2016.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
-  B. Singh and L. S. Davis. An analysis of scale invariance in object detection snip. In CVPR, 2018.
-  L. W. Sommer, T. Schuchert, and J. Beyerer. Fast deep vehicle detection in aerial images. In WACV, 2017.
-  G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang. Dota: A large-scale dataset for object detection in aerial images. In CVPR, 2018.
-  S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
-  S. Zhang, G. He, H.-B. Chen, N. Jing, and Q. Wang. Scale adaptive proposal network for object detection in remote sensing images. GRSL, 2019.
-  S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li. Single-shot refinement neural network for object detection. In CVPR, 2018.
-  P. Zhu, L. Wen, X. Bian, H. Ling, and Q. Hu. Vision meets drones: a challenge. arXiv:1804.07437, 2018.
-  P. Zhu et al. Visdrone-det2018: The vision meets drone object detection in image challenge results. In ECCVW, 2018.