Vehicle detection in aerial photography is challenging but widely used in different scenarios, e.g., traffic surveillance, urban planning, satellite reconnaissance, or UAV detection. Since the introduction of Region-CNN, which uses region proposals and learns region features with a convolutional neural network instead of traditional hand-crafted features, many excellent object detection frameworks based on this structure have been proposed, e.g., Light-head R-CNN, Fast/Faster R-CNN [6, 24], YOLO [22, 23], and SSD. These frameworks do not, however, work well for aerial imagery due to the challenges specific to this setting.
In particular, the camera’s bird’s eye view and the high-resolution images make target recognition hard for the following reasons: (1) Features describing small vehicles with arbitrary orientation are difficult to extract in high-resolution images. (2) The large number of visually similar targets from different categories (e.g., building roofs, containers, water tanks) interferes with the detection. (3) There are many densely packed target vehicles, typically with a monotonous appearance. (4) Occlusions and shadows increase the difficulty of feature extraction. Fig. 1 illustrates some challenging examples in aerial imagery.
A prior evaluation of recent frameworks on the DOTA dataset indicates that two-stage object detection frameworks [2, 24] do not work well for finding objects in dense scenarios, whereas one-stage object detection frameworks [17, 22] cannot detect dense and small targets. Moreover, all frameworks have problems detecting vehicles with arbitrary orientation. We argue that one important reason is that RoI pooling uses interpolation to align region proposals of all sizes, which reduces the accuracy of the features’ spatial information or loses it entirely.
To address these problems, we propose the Local-aware Region Convolutional Neural Network (LR-CNN) for vehicle detection in aerial imagery. The goal of LR-CNN is to make the deeper high-level semantic representation regain high-precision location information. We, therefore, predict affine transformation parameters from the shallower-layer feature maps, which contain a wealth of location information. After spatial transformation processing, the pixels of the shallower-layer feature maps are projected, based on these transformation parameters, onto the corresponding pixels of deeper feature maps containing higher-level semantic information. Finally, the resampled features, guided by the loss function, possess local invariance and contain both location and high-level semantic information. To summarize, our contributions are the following:
- A novel network framework for vehicle detection in aerial imagery.
- Preservation of the translation invariance of the aggregated RoI features, which addresses the boundary quantization issue for dense vehicles.
- A resampled pooled feature, which allows higher-level semantic features to regain location information and have local feature invariance. This allows detecting vehicles at an arbitrary orientation.
- An analysis of our results showing that we can detect vehicles in aerial imagery accurately and with tighter bounding boxes even in front of complex backgrounds.
2 Related Work
Recent object detection techniques can be roughly grouped into two categories. Two-step strategies first generate many candidate regions, which likely contain objects of interest. Then a separate sub-network determines the category of each of these candidates and regresses its location. The most representative work is Faster R-CNN, which introduced the Region Proposal Network (RPN) for candidate generation. It is derived from R-CNN, which uses Selective Search to generate candidate regions. SPPnet proposed a Spatial Pyramid Pooling layer to obtain multi-scale features at a fixed feature size. Lastly, Fast R-CNN introduced the RoI pooling layer and enabled the network to be trained in an end-to-end fashion. Because of its high precision and good performance on small and dense objects, Faster R-CNN is currently the most popular pipeline for object detection. In contrast, one-step approaches predict the location of objects and their category labels simultaneously. Representative works are YOLO [21, 22, 23] and SSD. Because there is no separate region proposal step, this strategy is fast but achieves lower detection accuracy.
Vehicle detection is a special case of object detection, so the aforementioned methods can be applied directly [25, 28]. These methods are, however, carefully designed to work on images collected from the ground, in which the objects have rich appearance characteristics. In contrast, visual information is very limited and monotonous when seen from an aerial perspective. Moreover, aerial images have a much higher resolution (e.g., those in ITCVD compared to those in ImageNet) and cover a wider area. The objects of interest (vehicles in this work) are much smaller, and their scale, size, and orientation vary strongly. An important prior for object detection on ground-view images is that the main or large objects within an image are mostly at the image center. In contrast, an object’s location is unpredictable in an aerial image. Selective search, RPN, or YOLO are therefore likely not ideal for handling these challenges. Given inaccurate region proposals, the subsequent classifier cannot make a reliable final decision. Further challenges are that vehicles can be in dark shadow, occluded by buildings, or packed densely on parking lots. All these challenges make the existing sophisticated object detection algorithms not well suited for aerial images.
Vehicle detection in aerial images has been investigated by many recent studies, e.g. [1, 10, 16, 19, 20, 26, 31]. [26, 31] extract features from shallower convolution layers (conv3 and conv4) through skip connections and fuse them with the final features (output of conv5). A standard RPN is then used on the multi-scale feature maps to obtain proposals at different scales. One approach trains a set of boosted classifiers to improve the final prediction accuracy. Another uses the focal loss instead of the cross entropy as loss function for the RPN and the classification layer during training to overcome the imbalance between easy and hard examples, and reports a significant improvement in this task. A further approach extracts features hierarchically at different scales so that the network is able to detect objects of different sizes. To address the arbitrary orientation problem, it rotates the anchors of the proposals to a set of predefined angles. The number of anchors, however, increases dramatically and the computation becomes costly.
3 Our Approach
Motivated by DFL-CNN, our approach uses a two-stage object detection strategy, as shown in Fig. 2. In this section, we give details for each of the sub-networks and discuss how our approach improves the accuracy of detecting vehicles in aerial images.
3.1 Base feature extractor
Excessive downsampling can lead to a loss of feature information for small target vehicles. In contrast, low-level features from shallower layers can retain not only rich feature details of small targets, but also rich spatial information. We adopt ResNet-101 and extract the base features from the shallow layers. As shown in Fig. 2, we use feature maps from the third and fourth convolutional blocks, which have the same resolution. Since there is a gap of 69 convolutional layers between the outputs of the third and fourth convolutional blocks, the latter contains deeper features, whereas the third convolutional block is relatively shallow and its output retains better spatial information of the pooled objects’ features.
3.2 Region proposal network
Twin region proposals.
We model the region proposal network (RPN) as in Faster R-CNN. For each input image, the RPN outputs 128 potential RoIs, which are mapped to the feature maps from the third and fourth convolutional blocks. It has been argued that RoI pooling’s nearest-neighbor interpolation leads to a loss in translation invariance of the aligned RoI features. Low RoI alignment accuracy is particularly harmful for region proposal features that represent small target vehicles. We, therefore, use RoIAlign instead of RoI pooling to aggregate high-precision RoIs.
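To make the difference between the two pooling schemes concrete, the following is a minimal NumPy sketch under stated assumptions: a single-channel feature map, a box given as `(y_min, x_min, y_max, x_max)` in feature-map coordinates, max pooling for the quantized variant, and averaged bilinear samples for the aligned variant. The function names and the toy input are hypothetical and are not taken from the authors' implementation.

```python
import numpy as np

def bilinear_sample(feature, y, x):
    """Sample a 2-D feature map at a continuous (y, x) location."""
    h, w = feature.shape
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature[y0, x0] + dx * feature[y0, x1]
    bottom = (1 - dx) * feature[y1, x0] + dx * feature[y1, x1]
    return (1 - dy) * top + dy * bottom

def roi_align(feature, box, out_size=2, samples=2):
    """Average bilinear samples inside each bin; box coordinates are never rounded."""
    y_min, x_min, y_max, x_max = box
    bin_h = (y_max - y_min) / out_size
    bin_w = (x_max - x_min) / out_size
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            vals = []
            for sy in range(samples):
                for sx in range(samples):
                    y = y_min + (i + (sy + 0.5) / samples) * bin_h
                    x = x_min + (j + (sx + 0.5) / samples) * bin_w
                    vals.append(bilinear_sample(feature, y, x))
            out[i, j] = np.mean(vals)
    return out

def roi_pool_quantized(feature, box, out_size=2):
    """Quantized RoI pooling: snap the box to the integer grid, then max-pool each bin."""
    y_min, x_min = int(np.floor(box[0])), int(np.floor(box[1]))
    y_max, x_max = int(np.ceil(box[2])), int(np.ceil(box[3]))
    region = feature[y_min:y_max, x_min:x_max]
    h, w = region.shape
    out = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            ys = int(np.floor(i * h / out_size))
            ye = int(np.ceil((i + 1) * h / out_size))
            xs = int(np.floor(j * w / out_size))
            xe = int(np.ceil((j + 1) * w / out_size))
            out[i, j] = region[ys:ye, xs:xe].max()
    return out

if __name__ == "__main__":
    # Toy feature map whose value equals the column index.
    feature = np.tile(np.arange(6, dtype=float), (6, 1))
    box = (1.2, 1.2, 3.2, 3.2)
    print(roi_align(feature, box))
    print(roi_pool_quantized(feature, box))
```

On a feature map that varies linearly, the aligned variant returns the value at each bin center, whereas the quantized variant shifts the result toward integer grid positions. For a vehicle that covers only a few feature-map cells, a shift of that size is a large fraction of the object.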
RoI feature processing.
As Fig. 3 illustrates, the input from the third convolutional block is sent into a large separable convolution (LSC) module containing two separate branches. Afterwards, the feature is compressed to position-sensitive score maps, which have 49 3-channel feature map blocks. This greatly reduces the computational expense of generating position-sensitive score maps since the feature is now much thinner than it used to be.
In the LSC module, each branch uses a large kernel size to enlarge the receptive field to preserve large local features. Large local features, while not accurate enough, retain more spatial information than local features extracted with small convolution kernels. This means that the larger local features facilitate further affine transformation parameterization, which effectively preserves the spatial information.
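The cost argument behind a large separable convolution can be illustrated with a simple weight count. The sketch below assumes the branch layout used in Light-head R-CNN (one branch applies a k x 1 convolution followed by a 1 x k convolution, the other the reverse order); the text above only states that there are two branches with large kernels, so this layout is an assumption. The kernel size, input width, and intermediate width are hypothetical; only the 49 x 3 output channels come from the text.

```python
def dense_conv_params(k, c_in, c_out):
    """Number of weights in a single k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def lsc_params(k, c_in, c_mid, c_out):
    """Number of weights in two branches, each a k x 1 and a 1 x k convolution in sequence."""
    per_branch = k * c_in * c_mid + k * c_mid * c_out
    return 2 * per_branch

if __name__ == "__main__":
    # k, c_in and c_mid are hypothetical; c_out = 49 * 3 follows the score-map layout.
    k, c_in, c_mid, c_out = 15, 512, 64, 49 * 3
    dense = dense_conv_params(k, c_in, c_out)
    separable = lsc_params(k, c_in, c_mid, c_out)
    print(dense, separable, dense / separable)
```

Under these assumed sizes, the separable form needs roughly an order of magnitude fewer weights than a dense kernel with the same receptive field.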
As discussed above, RoI pooling increases noise in the feature representation when RoIs are aggregated. Additionally, it has been demonstrated that the translation invariance of the feature is lost after the RoI pooling operation. Inspired by both observations, we build the position-sensitive RoIAlign by replacing RoI pooling with RoIAlign in the position-sensitive pooling structure. As the structure of the position-sensitive RoIAlign in Fig. 3 indicates, the higher alignment precision of RoIs aggregated by RoIAlign strongly improves the position-sensitive scoring and significantly reduces the noise in the features of small targets.
Since the distribution of large and small vehicle samples in aerial images is sparse, the ratio of positive and negative examples for training is very unbalanced. Hence, we use the focal loss, which reduces the weight of easy-to-classify examples, in order to improve the learnability of dense vehicle detection. The loss function of the RPN is defined as
Here, $i$ denotes the index of the proposal, $p_i$ is the predicted probability of the corresponding proposal, and $p_i^*$ represents the ground truth label ($p_i^* = 1$ for a positive and $p_i^* = 0$ for a negative proposal). $t_i$ describes the predicted bounding box vector and $t_i^*$ indicates the ground truth box vector if $p_i^* = 1$. The balance parameters are set to fixed values, and the focusing parameter of the modulating factor is chosen as in the focal loss paper.
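For orientation, the RPN loss described here can be written out in the form commonly used by Faster R-CNN-style detectors, with the focal loss taking the place of the cross entropy. This is a sketch rather than the authors' exact equation: the symbols $i$, $p_i$, $p_i^*$, $t_i$, $t_i^*$, the normalizers $N_{\mathrm{cls}}$ and $N_{\mathrm{reg}}$, and the names and placement of the balance weights $\alpha$ and $\lambda$ are assumptions; $\gamma$ stands for the focusing parameter of the modulating factor.

```latex
\begin{align}
L_{\mathrm{RPN}}(\{p_i\},\{t_i\}) &= \frac{1}{N_{\mathrm{cls}}}\sum_i L_{\mathrm{cls}}(p_i, p_i^*)
  + \lambda\,\frac{1}{N_{\mathrm{reg}}}\sum_i p_i^*\, L_{\mathrm{reg}}(t_i, t_i^*),\\
L_{\mathrm{cls}}(p_i, p_i^*) &= -\alpha\, p_i^*\,(1-p_i)^{\gamma}\log p_i
  - (1-\alpha)\,(1-p_i^*)\,p_i^{\gamma}\log(1-p_i),\\
L_{\mathrm{reg}}(t_i, t_i^*) &= \mathrm{smooth}_{L_1}(t_i - t_i^*).
\end{align}
```

The factor $p_i^*$ in the second term switches the regression loss on for positive proposals only, which matches the statement that the ground truth box vector is defined only for positive labels.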
3.3 Resampled pooled feature
[2, 7, 12] argue that RoI pooling uses interpolation to align the region proposal, which causes the pooled feature to lose location information. They therefore propose higher-precision interpolations to improve the precision of RoI pooling. We instead assume that the region proposal undergoes an affine transformation after interpolation alignment, such as stretching, rotation, shifting, etc. We thus exploit spatial transformer networks (STNs) to let the deep high-level semantic representation regain location information from the shallower features that retain the spatial information. Thereby, we strengthen the local feature invariance of the target vehicle in the RoI.
The STN trains a model to predict the spatial variation and alignment of features (including translation, scaling, rotation, and other geometric transformations) by adaptively predicting the parameters of an affine transformation. Fig. 4 depicts the architecture of the resampled pooled feature subnetwork. Six parameters are sufficient to describe the affine transformation. We feed the position-sensitive pooled feature into the localization network, which regresses these six parameters and thereby parameterizes the location information in the RoI. Next, the grid generator converts the standard pooled features into a parameterized sampling grid that models, at the pixel level, the coordinate correspondence between the resampled pooled feature and the standard pooled feature under this transformation. Once this correspondence has been modeled, the resampled pooled feature is resampled pixel by pixel from the standard pooled feature, and thus the spatial information is re-added to it.
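The grid generator and sampler steps can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' subnetwork: the localization network is omitted and its output is stood in for by a given 2 x 3 matrix `theta`, coordinates are normalized to [-1, 1] as in the original STN formulation, the feature is single-channel, and sampling outside the feature map clamps to the border. All function names are hypothetical.

```python
import numpy as np

def affine_grid(theta, out_h, out_w):
    """Map every output pixel (normalized to [-1, 1]) to a source (x, y) location."""
    ys, xs = np.meshgrid(np.linspace(-1.0, 1.0, out_h),
                         np.linspace(-1.0, 1.0, out_w), indexing="ij")
    target = np.stack([xs, ys, np.ones_like(xs)], axis=-1)   # (H, W, 3)
    return target @ np.asarray(theta, dtype=float).T         # (H, W, 2)

def bilinear_resample(feature, grid):
    """Resample a 2-D feature at the grid locations with bilinear interpolation."""
    h, w = feature.shape
    # Convert normalized coordinates to pixel indices and clamp to the border.
    x = np.clip((grid[..., 0] + 1.0) * (w - 1) / 2.0, 0.0, w - 1.0)
    y = np.clip((grid[..., 1] + 1.0) * (h - 1) / 2.0, 0.0, h - 1.0)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = (1 - dx) * feature[y0, x0] + dx * feature[y0, x1]
    bottom = (1 - dx) * feature[y1, x0] + dx * feature[y1, x1]
    return (1 - dy) * top + dy * bottom

if __name__ == "__main__":
    pooled = np.arange(20, dtype=float).reshape(4, 5)   # stand-in pooled feature
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    flip_x = [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    print(bilinear_resample(pooled, affine_grid(identity, 4, 5)))
    print(bilinear_resample(pooled, affine_grid(flip_x, 4, 5)))
```

The six entries of `theta` are exactly the six affine parameters mentioned above. Because the sampler is differentiable with respect to both the feature and `theta`, the transformation can be learned from the detection loss without extra supervision.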
The feature map visualization in Fig. 9 shows that our resampled pooled features have enhanced local feature invariance and that the feature representation remains strong for vehicles at any orientation.
3.4 Loss of classifier and regressor
For the final classifier and regressor, we continue using the focal loss and the smooth L1 loss function, respectively:
where represents the index of the proposal. All other definitions are as in Eq. (1). The parameters remain as , and . The total loss function can then be represented as
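As an illustration of the two loss terms named in this section (not a reconstruction of the total loss), the following NumPy sketch implements a binary focal loss and the smooth L1 loss. The default values `alpha=0.25` and `gamma=2.0` are the defaults reported in the focal loss paper and are used here only as assumptions; the values chosen by the authors are not recoverable from the text.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss per example; p is the predicted foreground probability."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    y = np.asarray(y)
    p_t = np.where(y == 1, p, 1.0 - p)              # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1.0 - alpha)  # class balance weight
    # The modulating factor (1 - p_t)^gamma down-weights easy examples.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)

def smooth_l1(pred, target):
    """Smooth L1 loss per element: quadratic near zero, linear elsewhere."""
    d = np.abs(np.asarray(pred, dtype=float) - np.asarray(target, dtype=float))
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5)

if __name__ == "__main__":
    probs = np.array([0.95, 0.6, 0.1])
    labels = np.array([1, 1, 1])
    print(focal_loss(probs, labels))                 # easy example contributes least
    print(smooth_l1([0.5, 3.0], [0.0, 0.0]))
```

With `gamma = 0` and `alpha = 0.5` the focal loss reduces to half the usual cross entropy, which makes the role of the modulating factor easy to check.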
We evaluate the proposed method on three datasets with different characteristics, testing different aspects of the accuracy of our method.
The VEDAI  dataset consists of satellite imagery taken over Utah in 2012. It contains 1210 RGB images with a resolution of pixels. VEDAI contains sparse vehicles and is challenging due to strong occlusions and shadows.
DOTA  has 2806 aerial images, which are collected with different sensors and platforms. Their resolutions range from to about pixels. The dataset is randomly split into three sets: Half of the original images form the training set, 1/6 are used as validation set, and the remaining 1/3 form the testing set. Annotations are publicly accessible for all images not in the testing set. The experimental results on DOTA reported in this paper are therefore from the validation set. Furthermore, we evaluate the accuracy of detecting large and small vehicles separately for comparison purposes.
The DLR 3K dataset  consists of 20 images (10 images for training and the other 10 for testing), which are captured at the height of about 1000 feet over Munich with a resolution of pixels. This dataset is used to evaluate the generalization ability of our method.
DOTA and VEDAI provide annotations of different kinds of object categories. Given the goal of this paper, we only use the vehicle annotations. Our method can, however, likely be generalized to detect arbitrary categories of interest.
Because of the very high resolution of the images and limited GPU memory, we process images larger than pixels in tiles. I.e., we crop them into pixel patches with an overlap of 100 pixels. This truncates some targets. We only keep targets with more than remaining as positive samples.
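The tiling step can be sketched as follows. The overlap of 100 pixels comes from the text; the tile size and the minimum remaining fraction for truncated targets are not recoverable from the text, so `tile=1024` in the usage example and the idea of thresholding the returned fraction are hypothetical placeholders. Boxes are assumed to be axis-aligned `(x1, y1, x2, y2)` tuples.

```python
def tile_origins(length, tile, overlap):
    """Start offsets of tiles along one image axis, with the last tile flush to the border."""
    if tile <= overlap:
        raise ValueError("tile size must exceed the overlap")
    if length <= tile:
        return [0]
    stride = tile - overlap
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)
    return origins

def clip_to_tile(box, x0, y0, tile):
    """Clip a box to a tile; return tile-relative coordinates and the remaining area fraction."""
    bx1, by1, bx2, by2 = box
    ix1, iy1 = max(bx1, x0), max(by1, y0)
    ix2, iy2 = min(bx2, x0 + tile), min(by2, y0 + tile)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = (bx2 - bx1) * (by2 - by1)
    fraction = inter / area if area > 0 else 0.0
    return (ix1 - x0, iy1 - y0, ix2 - x0, iy2 - y0), fraction

if __name__ == "__main__":
    # Hypothetical 2000-pixel-wide image cut into 1024-pixel tiles with 100 pixels overlap.
    print(tile_origins(2000, tile=1024, overlap=100))
    clipped, fraction = clip_to_tile((1000, 10, 1050, 30), 0, 0, 1024)
    print(clipped, fraction)   # keep the target only if fraction exceeds a chosen threshold
```

Placing the last tile flush with the image border means that it overlaps its neighbor by more than the nominal amount, which avoids padding at the cost of some duplicated detections that have to be merged afterwards.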
In order to assess the accuracy of our framework, we adopt the standard VOC 2010 object detection evaluation metric for quantitative results of precision, recall, and average precision.
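For reference, the average precision used from VOC 2010 onwards is the area under the precision-recall curve after precision has been made monotonically non-increasing. The sketch below assumes that detections have already been matched to ground truth, so each detection carries a true-positive flag; the matching step itself and the function name `voc_ap` are not part of the text.

```python
import numpy as np

def voc_ap(scores, is_tp, num_gt):
    """Area under the monotone precision envelope over all recall points."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / num_gt
    precision = tp_cum / (tp_cum + fp_cum)

    # Pad the curve, then make precision non-increasing from right to left.
    mrec = np.concatenate([[0.0], recall, [1.0]])
    mpre = np.concatenate([[0.0], precision, [0.0]])
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])

    # Sum rectangle areas wherever recall changes.
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))

if __name__ == "__main__":
    # Three detections ranked by score; the second one is a false positive.
    print(voc_ap([0.9, 0.8, 0.7], [1, 0, 1], num_gt=2))
```

Averaging this quantity over the small-vehicle and large-vehicle classes gives the mAP.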
4.1.1 Implementation details
We use ResNet-101 as backbone network to learn features and initialize its parameters with a model pretrained on ImageNet. The remaining layers are initialized randomly. During training, stochastic gradient descent (SGD) is used to optimize the parameters. The base learning rate is 0.05 with a decay every 3 epochs. Separate IoU thresholds for NMS are used for training and for inference. The RPN part is trained first before the whole framework is trained jointly. All experiments were conducted with NVIDIA Titan XP GPUs. A single image keeps a maximum of 600 RoIs after NMS and takes ca. 1.4 s during training and ca. 0.33 s for testing.
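The NMS step with a cap on the number of kept RoIs can be sketched as greedy suppression. This is a generic sketch, not the authors' code: boxes are assumed to be `(x1, y1, x2, y2)` rows, the IoU threshold is passed in because the values used for training and inference are not recoverable from the text, and `max_keep` mirrors the limit of 600 RoIs mentioned above.

```python
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an (N, 4) array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0.0, None) * np.clip(y2 - y1, 0.0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_threshold, max_keep=None):
    """Greedy NMS: keep the best box, drop boxes overlapping it too much, repeat."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(-np.asarray(scores, dtype=float))
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        if max_keep is not None and len(keep) >= max_keep:
            break
        rest = order[1:]
        overlaps = iou_one_to_many(boxes[best], boxes[rest])
        order = rest[overlaps <= iou_threshold]
    return keep

if __name__ == "__main__":
    boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores, iou_threshold=0.5, max_keep=600))
```

A lower threshold suppresses more aggressively, which in densely packed parking lots can remove correct detections of neighboring vehicles, so the threshold is a trade-off between duplicates and missed vehicles.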
4.2 Results and comparison
Table 1: Quantitative detection results (column headers: AP, AP, SV AP, LV AP, mAP).
4.2.1 Quantitative results
Tab. 1 summarizes the experimental results. Our method outperforms all compared methods on all datasets. On the DOTA evaluation server, it achieves an AP of 68.56% for small vehicles and 69.87% for large vehicles, and the mAP is 69.22%. In particular, compared to the baseline method and the state of the art, our model increases the AP on the most challenging dataset, DOTA. When small and large vehicles are considered as two classes, our model also achieves relative gains against the baseline for both. These gains indicate that our Large Separable Convolution, Position-Sensitive RoIAlign, and Spatial Transformer Network modules are effective.
Fig. 5 depicts the precision-recall curves of different methods on DOTA. We can see that for vehicle detection our method (blue solid line) has a wider smooth region and a smoother tendency, which means that our method is more robust and has higher object classification precision than the others. In contrast, both Faster R-CNN and DFL (red and green solid lines, respectively) show a rapid drop at the high-precision end of the plot. In other words, our method achieves higher recall without obviously sacrificing precision. We can also see that small vehicle detection is more difficult for all methods: The curves (dotted lines) begin to drop much earlier (for LR-CNN at a recall of 0.4) than those for general or large-vehicle detection (at a recall of 0.65), and the transition region is also wide (until a recall of 0.67 for LR-CNN). It is worth mentioning that DFL and LR-CNN have very good curves for large vehicle detection (dashed lines) with long smooth regions and a rapid drop.
4.2.2 Qualitative results
Fig. 6 gives a qualitative comparison between different methods on DOTA. It shows a typical complex scene: vehicles are in arbitrary places, dense or sparse, and the background is complex. As shown in the first row, Faster R-CNN fails to detect many vehicles, especially when they are dense (Regions 2, 3) or in shadow (Regions 5, 6). DFL detects more small vehicles. In particular, it is sensitive to the dark small vehicles, e.g., an unclear car on the road (Region 1) is detected. However, this has side effects: DFL cannot distinguish small dark vehicles from shadow well. E.g., the shadow of the white vehicle in Region 4 is detected as a small vehicle but the vehicles in Regions 5 and 6 are not detected. Furthermore, its accuracy for detecting vehicles in dense cases and classifying the vehicles’ type is not good enough (Regions 2, 3). Fig. 6(c) shows that our method distinguishes large and small vehicles well. It can also detect individual vehicles in dense parts of the scene. The advantages of detecting vehicles in dense situations and distinguishing the vehicles from the similar background objects are further showcased in the second row.
4.2.3 Generalization ability
To evaluate the generalization ability of our approach, we test it on the DLR 3K dataset with models trained on different datasets. Because the ground truth of the test set of DLR 3K is not publicly accessible, we test the models on the training and validation sets, whose annotations are available. We also compare the results with the ones reported for HRPN, which was trained on DLR 3K. Experimental results are listed in Tab. 2. We can see that, for each method, the model trained on DOTA reports a higher AP than the one trained on VEDAI. The main reason is that DOTA has more training samples, which are also more diverse. DFL and our method trained on DOTA both outperform HRPN. These results show that our model has good generalization abilities as well as transferability. For better understanding, we show some examples in Fig. 7. When comparing the dashed purple boxes (results of models trained on VEDAI) with the green boxes (results of models trained on DOTA) from the same method, we can see that the models trained on DOTA detect more vehicles. When comparing the results of different methods trained on DOTA, we can see that LR-CNN successfully detects more vehicles. Within the region highlighted by the dashed yellow box, where vehicles are dense, LR-CNN successfully detects almost all individual vehicles.
4.2.4 Ablation study
To evaluate the impact of the STN placed at different locations in the network, we conduct an ablation study. We do not provide separate experiments to evaluate the impact of the focal loss and RoIAlign pooling because these have been provided in [14, 26] and in the original RoIAlign work, respectively. Tab. 3 reports our results. When the STN is placed at the output of the conv3_x block, the model achieves better results, especially for large vehicle detection. The reason is that the STN mainly processes spatial information, which is much richer in the output features of conv3_x than in those of conv4_x.
For better understanding, we visualize some feature maps in Fig. 9. The features extracted from conv3_x (second row) contain more spatial and detailed information than those from conv4_x (fourth row): The edges are clearer and the locations corresponding to the vehicle show stronger activations. Comparing the feature maps before and after the STN (2nd row vs. 3rd row and 4th row vs. 5th row) shows that the activations of the background regions are weaker after the STN. Active regions corresponding to the foreground are closer to the vehicle’s shape and orientation than before applying the STN since the features are transformed and regularized by the STN module. Furthermore, after STN processing, in addition to being accurate in position, the feature representation is also slimmer. This is why our bounding boxes are tighter than those of other detectors. From these observations, we can intuitively conclude that the STN module is better able to find the transformation parameters on conv3_x to regularize the features used to regress the location and classify the RoIs.
Fig. 8 illustrates how the quality of proposals from RPN affects the final localization and classification. When comparing the final detection results (green boxes) with the RPN proposals (dashed purple boxes) of different methods, we can make the following observations: LR-CNN correctly detects more vehicles. In addition, the green bounding boxes given by LR-CNN are tighter, which means that LR-CNN gives more precise localization. To analyze the reasons for this, we compare the proposals (dashed purple boxes) of different methods. We can see that the proposals given by DFL and our method are closer to the targets than the ones of Faster R-CNN. Even though each vehicle is detected by its own RPN, the final classifier removes these proposals (Proposals 2 and 4) since they deviate from the ground truth location too much and contain too much background. Thus, the features pooled from these RoIs are not precise enough to represent the targets. Consequently, the final classifier cannot determine well based on these features whether they are an object of interest, especially in dense cases. To analyze why LR-CNN localizes the objects better, we look at the mathematical definition of target regression. The regression target for width is
the logarithm of the ratio between the ground truth width and the predicted width. The target height is handled equivalently. Only when the prediction is close to the target can this equation be approximated by a linear relationship. This matters because the regression targets of the center shift are already defined as linear functions and all four parameters are predicted simultaneously: the regression layer is easier to train and works better when all four target equations are linear. For all these reasons, our framework obtains better proposals in our RPN and yields better final classification and localization.
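The linearization argument can be made explicit with a first-order expansion. The symbols below are assumptions in the style of R-CNN box regression, since the original notation is not recoverable: $G_w$ is the ground truth width, $P_w$ the predicted (proposal) width, and $t_w$ the regression target for the width.

```latex
\begin{align}
t_w = \log\frac{G_w}{P_w}
    = \log\!\left(1 + \frac{G_w - P_w}{P_w}\right)
    \approx \frac{G_w - P_w}{P_w}
    \qquad \text{for } |G_w - P_w| \ll P_w .
\end{align}
```

The approximation uses $\log(1 + x) \approx x$ for small $x$, so it holds only when the proposal width is already close to the ground truth width. For loose proposals the target remains strongly nonlinear in the width error, which is consistent with the observation that tighter proposals lead to better final localization.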
Compared to Faster R-CNN and DFL, our approach performs much better on detecting small targets. This improvement benefits from the skip connection structure that fuses the richer detail information from the shallower layers with the features from deeper layers, which contain higher-level semantic information. This is important for detecting small objects in high-resolution aerial images. In our method, the position-sensitive RoIAlign pooling is adopted to extract more accurate information compared with the traditional RoI pooling. An accurate representation is important for precisely locating and classifying small objects. As a result, our final classifier is better able to identify the targets and further refine their location. Most importantly, the STN module in our framework regularizes the learned features after RoIAlign pooling well, which reduces the burden on the following layers, which are expected to learn sufficiently powerful feature representations for classification and further regression. That is the reason why LR-CNN distinguishes small and large vehicles better and has more precise detection. All the above elements enable our method to have a good generalization ability and to reach a new state of the art in vehicle detection in high-resolution aerial images.
We present an accurate local-aware region-based framework for vehicle detection in aerial imagery. Our method mitigates the boundary quantization issue for dense vehicles by aggregating the RoIs’ features with higher precision, and it improves the detection accuracy for vehicles at arbitrary orientations by letting the high-level semantic pooled feature regain location information via learning. In addition, we develop a training strategy that allows the pooled feature, whose location information lacks precision, to reacquire accurate spatial information from shallower-layer features via learning. Our approach achieves state-of-the-art accuracy for detecting vehicles in aerial imagery and has good generalization ability. Given these properties, we believe that it should also be easy to generalize to detecting additional object classes under similar circumstances.
This work was supported by the German Research Foundation (DFG) grants COVMAP (RO 2497/12-2) and PhoenixD (EXC 2122, Project ID 390833453).
-  (2018) Towards multi-class object detection in unconstrained remote sensing imagery. arXiv preprint arXiv:1807.02700. Cited by: §2.
-  (2016) R-fcn: object detection via region-based fully convolutional networks. In Neural Information Processing Systems, pp. 379–387. Cited by: §1, §3.2, §3.3.
-  (2009) ImageNet: a large-scale hierarchical image database. In Computer Vision and Pattern Recognition, Cited by: §2, §4.1.1.
-  (2015) The pascal visual object classes challenge: a retrospective. International Journal of Computer Vision 111 (1), pp. 98–136. Cited by: §4.1.
-  (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, pp. 580–587. Cited by: §1, §2.
-  (2015) Fast R-CNN. In International Conference on Computer Vision, pp. 1440–1448. Cited by: §1, §2.
-  (2017) Mask R-CNN. In International Conference on Computer Vision, pp. 2980–2988. Cited by: §3.2, §3.3, §4.2.4.
-  (2014) Spatial pyramid pooling in deep convolutional networks for visual recognition. In European Conference on Computer Vision, pp. 346–361. Cited by: §2.
-  (2016) Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §3.1.
-  (2004) Detection of vehicles and vehicle queues in high resolution aerial images. Photogrammetrie-Fernerkundung-Geoinformation. Cited by: §2.
-  (2015) Spatial transformer networks. In Neural Information Processing Systems, pp. 2017–2025. Cited by: §3.3, §3.3.
-  (2018) Acquisition of localization confidence for accurate object detection. arXiv preprint arXiv:1807.11590. Cited by: §3.3.
-  (2017) Light-head R-CNN: in defense of two-stage object detector. arXiv preprint arXiv:1711.07264. Cited by: §1, §3.2.
-  (2020) Focal loss for dense object detection. Transactions on Pattern Analysis and Machine Intelligence 42 (1), pp. 318–327. Cited by: §2, §3.2, §4.2.4.
-  (2015) Fast multiclass vehicle detection on aerial images. IEEE Geosci. Remote Sensing Lett. 12 (9), pp. 1938–1942. Cited by: §4.1.
-  (2017) Learning a rotation invariant detector with rotatable bounding box. arXiv preprint arXiv:1711.09405. Cited by: §2.
-  (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37. Cited by: §1, §1, §2.
-  (2018) Arbitrary-oriented scene text detection via rotation proposals. IEEE Transactions on Multimedia 20 (11), pp. 3111–3122. Cited by: §2.
-  (2017) Vehicle detection from high-resolution aerial images using spatial pyramid pooling-based deep convolutional neural networks. Multimedia Tools and Applications 76 (20), pp. 21651–21663. Cited by: §2.
-  (2015-03) Vehicle detection in aerial imagery: a small target detection benchmark. Journal of Visual Communication and Image Representation 34, pp. . Cited by: §2, §4.1.
-  (2016) You only look once: unified, real-time object detection. In Computer Vision and Pattern Recognition, pp. 779–788. Cited by: §2.
-  (2017) YOLO9000: better, faster, stronger. arXiv preprint arXiv:1612.08242. Cited by: §1, §1, §2, §2.
-  (2018) YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §1, §2.
-  (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Neural Information Processing Systems, pp. 91–99. Cited by: §1, §1, §2, §3.2, §4.2.
-  (2017) Forward vehicle detection based on incremental learning and Fast R-CNN. In CIS, pp. 73–76. Cited by: §2.
-  (2017) Vehicle detection in aerial images based on region convolutional neural networks and hard negative example mining. Sensors 17 (2), pp. 336. Cited by: §2, §4.2.3, §4.2.4.
-  (2013) Selective search for object recognition. International Journal of Computer Vision 104 (2), pp. 154–171. Cited by: §2.
-  (2018) Vehicle re-identification with the space-time prior. In CVPR Workshop (CVPRW) on the AI City Challenge, Cited by: §2.
-  (2018) DOTA: a large-scale dataset for object detection in aerial images. In Computer Vision and Pattern Recognition, Cited by: §1, §4.1.
-  (2019) Vehicle detection in aerial images. Photogrammetric Engineering & Remote Sensing (PE&RS) 85 (4), pp. 297–304. Cited by: §2.
-  (2018) Deep learning for vehicle detection in aerial images. In International Conference on Image Processing (ICIP), pp. 3079–3083. Cited by: §2, §3, §4.2.