I Introduction
ALONG with the now widespread availability of aeroplanes and unmanned aerial vehicles (UAVs), the detection and localization of small targets in high-resolution airborne imagery have been attracting considerable attention in the remote sensing community [1, 2, 3, 4, 5, 6]. They have numerous useful applications, to name a few: surveillance, defense, and traffic planning [7, 8, 9, 10, 11]. In this paper, vehicles are considered the small targets of interest, and our task is to automatically detect and localize vehicles in complex urban scenes (see Fig. 1). This is an exceedingly challenging task because of 1) huge differences in visual appearance among cars (e.g., colors, sizes, and shapes) and 2) the various orientations of vehicles.
I-A Detection of Vehicles Using Feature Engineering
Since effective feature representation is of great importance to an object detection system, vehicle detection in remote sensing images was traditionally dominated by works that make use of low-level, handcrafted visual features (e.g., color histogram, texture features, scale-invariant feature transform (SIFT) [12], and histogram of oriented gradients (HOG) [13]) and classifiers. For example, in [14], the authors incorporate multiple visual features, local binary pattern (LBP) [15], HOG, and opponent histogram, for vehicle detection from high-resolution aerial images. Moranduzzo and Melgani [16] first use SIFT to detect interest points of vehicles and then train a support vector machine (SVM) to classify these interest points into vehicle and non-vehicle categories based on the SIFT descriptors. They later present an approach
[17] that performs filtering operations in the horizontal and vertical directions to extract HOG features and yields vehicle detections after the computation of a similarity measure, using a catalog of vehicles as a reference. In [18], the authors make use of an integral channel concept, with Haar-like features and an AdaBoost classifier in a soft-cascade structure, to achieve fast and robust vehicle detection. In [19], the authors use a sliding window framework consisting of four stages, namely window evaluation, extraction and encoding of features, classification, and post-processing, to detect cars in complex urban environments by using a combined feature of the local distributions of gradients, colors, and texture. Moreover, in [20], the authors utilize a multi-graph region-based matching method to detect moving vehicles in segmented UAV video frames. In [21], the authors apply a bag-of-words (BoW) model and a local steering kernel with the sliding window strategy to detect vehicles in arbitrary orientations, but this method is quite time-consuming.
I-B Detection of Vehicles Using Convolutional Neural Networks
The aforementioned methods mainly rely on manual feature engineering to build a classification system. Recently, as an important branch of the deep learning family, convolutional neural networks (CNNs) have become the method of choice for many computer vision and remote sensing problems
[22] (e.g., object detection [23, 24, 7, 25, 11, 26, 27, 28, 5]), as they are capable of automatically extracting mid- and high-level features from raw images for the purpose of visual analysis. For example, Chen et al. [29] propose a vehicle detection model, called hybrid deep neural network, which consists of a sliding window technique and a CNN. The main insight behind their model is to divide the feature maps of the last convolutional layer into different scales, allowing for the extraction of multi-scale features for vehicle detection. In [30], the authors segment an input image into homogeneous superpixels that can be considered vehicle candidate regions, make use of a pre-trained deep CNN to extract features, and train a linear SVM to classify these candidate regions into vehicle and non-vehicle classes. Moreover, several recent works focus on a similar task, vehicle instance segmentation. For instance, Audebert et al. [10] propose a deep learning-based three-stage method called "segment-before-detect" for the semantic segmentation and subsequent classification of several types of vehicles in high-resolution remote sensing images. The use of SegNet [31] in this method is capable of producing pixel-wise annotations for vehicle semantic mapping. Mou et al. [27] propose a unified multi-task learning network that can simultaneously learn two complementary tasks, namely segmenting vehicle regions and detecting semantic boundaries. The latter subproblem is helpful for differentiating "touching" vehicles, which are usually not correctly separated into instances.
I-C Is Non-Rotatable Detection Enough for Vehicle Detection?
As our survey of related work shows, most existing approaches have focused on non-rotatable car detection [16, 17, 18, 32, 33, 19, 34], i.e., detecting all instances of vehicles and localizing them in the image in the form of horizontal bounding boxes with confidence scores. Detecting vehicles of arbitrary orientations in complex urban environments has received much less attention and remains challenging for most car detection algorithms, although the orientation information of vehicles is important for practical applications such as traffic monitoring. In this paper, we make an effort to build an effective, rotatable detection model for vehicles of arbitrary orientations in complicated urban scenes.
In this paper, we propose an end-to-end trainable network, the rotatable region-based residual network (RNet), for simultaneously localizing vehicles and identifying their orientations in high-resolution remote sensing images. To this end, we introduce a series of effective rotatable operations in the network, aiming at generating multi-oriented bounding boxes for our task. When directly applied to detect vehicles of arbitrary orientations, conventional detection networks (e.g., Faster R-CNN [35] and R-FCN [36]) that are primarily designed for horizontal detection yield low precision. In contrast, the proposed rotatable network is capable of offering better performance, especially in complex scenes such as a crowded parking lot. In this paper, we also apply our network to car detection and tracking in aerial videos and find that RNet is able to provide satisfactory vehicle trajectories. Moreover, when we take into account the temporal information of a video (i.e., the frame association produced by multiple object tracking algorithms), better detection results can be obtained. This paper contributes to the literature in the following three aspects:

We take advantage of the axial symmetry property of vehicles to create a novel network architecture, which is based on the conventional two-stage object detection framework (e.g., R-FCN) but able to generate and handle rotatable bounding boxes through two tailored modules, namely a rotatable region proposal network (RRPN) and a rotatable detection network (RDN). In addition, on top of RRPN and RDN, we use a modified Shamos algorithm to obtain regular quadrilaterals.

We propose a novel rotatable position-sensitive RoI pooling operation, namely RPS pooling, in order to reduce the dimension of the feature maps of rotatable regions while keeping the information of targets in specific directions.

We propose a novel strategy, called BAR anchor, to initialize the rotatable anchors of region proposals using the size information of vehicles in the training set. Experimental results show that this strategy is able to estimate vehicle poses more accurately than the traditional anchor generation method.
The remainder of this paper is organized as follows. Following this introductory Section I, Section II is dedicated to the details of the proposed RNet for multi-oriented vehicle detection. Section III then provides dataset information, implementation settings, and experimental results. Finally, Section IV concludes the paper.
II Methodology
Our approach for vehicle detection utilizes an end-to-end trainable two-stage detection framework, including a rotatable region proposal network (RRPN) and a rotatable detection network (RDN). As shown in Fig. 2, after feature extraction by a ResNet-101 [37], we devise the framework in a rotatable detection domain to preserve competitive detection rates; the details are elaborated in the following subsections. Besides, we also utilize two other typical deep networks for feature extraction, VGG-16 [38] and ResNet-101 with a feature pyramid network (FPN) [39] (see Fig. 3).
II-A Rotatable region proposal network
The main task of RRPN is to generate coarse rotatable regions of interest (RRoIs) for the subsequent thorough detection. Following the strategy of generating regions of interest (RoIs) in Faster R-CNN [35], we propose to take batch averaging rotatable anchors (BAR anchors) as the input of RRPN and produce a series of RRoIs as output. We next describe the main ingredients of RRPN.
BAR Anchors. In the RPN stage of two-stage detection methods, such as Faster R-CNN [35] and R-FCN [36], anchors are the initial shapes of RoIs at each feature point, and the pattern of the anchors contributes to the pattern of the RoIs. For instance, a rectangular anchor finally regresses to a rectangular RoI. Likewise, an RRoI can be obtained from a rotatable anchor, and the size of the anchors depends on the size of the detected objects.
With regard to the vehicle detection task in remote sensing images, we use anchors of a fixed size in each training mini-batch, since the overall dimensions of vehicles are largely invariant. Formally, suppose there are $M$ mini-batches in the training process with batch size $S$. We use $I_m^s$ to represent the $s$-th ($s = 1, \dots, S$) training sample in the $m$-th ($m = 1, \dots, M$) mini-batch, so the $k$-th rotatable ground truth box in $I_m^s$ can be represented by a coordinate $(x_k, y_k, w_k, h_k, \theta_k)$. Then we define the average width $\bar{w}_m$ and height $\bar{h}_m$ of the initialized anchors in the $m$-th training mini-batch as follows:
$\bar{w}_m = \frac{1}{K_m} \sum_{s=1}^{S} \sum_{k} w_k^{(s)}, \qquad \bar{h}_m = \frac{1}{K_m} \sum_{s=1}^{S} \sum_{k} h_k^{(s)}$ (1)
where $K_m$ is the number of rotatable ground truth boxes in the $m$-th mini-batch. Hence, for the $m$-th training mini-batch, the coordinate of a BAR anchor at a feature point can be defined as $(x, y, \lambda \bar{w}_m, \lambda \bar{h}_m, \theta)$, where the anchor angle $\theta$ and the scale factor $\lambda$ take fixed sets of values in this paper. As a result, we get 12 BAR anchors at each feature point.
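As an illustration of how BAR anchors can be derived from one mini-batch, the sketch below averages the ground truth widths and heights and then spans a grid of orientations and scales. The particular angle/scale grid (6 angles x 2 scales) is an assumption chosen only so that 12 anchors per feature point result, as stated above; it is not the authors' exact configuration.

```python
import numpy as np

def bar_anchors(gt_boxes, angles=(0, 30, 60, 90, 120, 150), scales=(1.0, 2.0)):
    """Batch Averaging Rotatable (BAR) anchor generation, as a sketch.

    gt_boxes holds the rotatable ground truth boxes (x, y, w, h, theta)
    of one training mini-batch, shape (N, 5).  The angle/scale grid is a
    hypothetical choice yielding 6 x 2 = 12 anchors per feature point.
    """
    gt = np.asarray(gt_boxes, dtype=np.float64)
    w_bar = gt[:, 2].mean()  # batch-averaged width
    h_bar = gt[:, 3].mean()  # batch-averaged height
    # every feature point receives the same 12 (w, h, theta) templates
    return [(w_bar * s, h_bar * s, a) for s in scales for a in angles]
```

Because the averages are recomputed per mini-batch, the anchor size tracks the vehicle scale of the training data rather than a hand-picked constant.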
Definition of Parameters in Regression. Unlike traditional object detection networks that generate horizontal RoIs and represent them by 4-d vectors (i.e., an RoI's center coordinates and its width and height), the proposed method needs to produce RRoIs, which should be defined by 5-d vectors, namely an RoI's center coordinates $(x, y)$, width $w$, height $h$, and the intersection angle $\theta$ between its lengthwise direction and the horizontal direction. However, in our experiments, we found that using an 8-d vector $(x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4)$ to represent the RRoI box and the ground truth box makes the regression loss easier to optimize. This is mainly because the 8-d vector-based representation is capable of alleviating the instability caused by the angle parameter $\theta$ when computing the regression loss. The experimental results below will show the different influences of these two definitions.
To match the dimension of the RRoIs, we also convert each BAR anchor from its 5-d representation $(x, y, w, h, \theta)$ into an 8-d vector as follows:
$(x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4) = \mathcal{T}(x, y, w, h, \theta)$ (2)
where $\mathcal{T}$ is a biunique geometric transformation. Notably, the collation of the 4 vertexes in each BAR anchor's representation should be rigorously matched with those in the ground truth labels and RRoIs, and the collation rule for all rotatable rectangles is defined as follows:

Determine the intersection angle $\theta$ between its lengthwise direction and the horizontal direction;

Rotate the rectangle around its center to the horizontal direction by $-\theta$;

Label the four vertexes in a fixed coordinate order.
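One way to realize this collation rule is to generate the four corners in a fixed order in the horizontal frame and then rotate them about the center, so the ordering survives any orientation. The sketch below is illustrative; the specific corner order (top-left, top-right, bottom-right, bottom-left) is our assumption, not necessarily the authors' convention.

```python
import math

def rbox_to_corners(x, y, w, h, theta_deg):
    """Convert a rotatable box (x, y, w, h, theta) into an ordered 8-d
    corner vector (x1, y1, ..., x4, y4).

    Corners are laid out in a fixed order in the axis-aligned frame and
    then rotated by theta about the centre, so the labelling is
    consistent for every orientation (an illustrative sketch)."""
    c = math.cos(math.radians(theta_deg))
    s = math.sin(math.radians(theta_deg))
    corners = []
    # fixed order in the horizontal frame: top-left, top-right,
    # bottom-right, bottom-left
    for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2),
                   (w / 2, h / 2), (-w / 2, h / 2)):
        corners.extend((x + dx * c - dy * s, y + dx * s + dy * c))
    return corners
```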
Consequently, in the bounding box regression step, for each feature point of the feature maps (e.g., those of ResNet-101), we define an 8-d vector $\mathbf{t}$ representing the eight parameterized coordinates of a predicted bounding box, and $\mathbf{t}^*$ is its corresponding ground truth box, where the BAR anchors at the feature point are indexed by $a$ ($a = 1, \dots, 12$). Then we define the parameterizations of the coordinates as follows [40]:
$t_{x_j} = \frac{x_j - x_{a,j}}{w_a}, \qquad t_{y_j} = \frac{y_j - y_{a,j}}{h_a}, \qquad j = 1, \dots, 4$ (3)
where $(x_{a,j}, y_{a,j})$ are the vertexes and $w_a$ and $h_a$ the width and height of the corresponding BAR anchor.
Multi-Task Loss. In common with other object detection networks, such as Faster R-CNN [35] and R-FCN [36], we define a multi-task loss that combines the classification and regression losses. The loss function in RRPN is defined as an effective multi-task loss [41], which combines both the classification loss and the regression loss for each image. It can be computed as follows:
$L(\{p_i\}, \{\mathbf{t}_i\}) = \frac{1}{N_{\mathrm{cls}}} \sum_i L_{\mathrm{cls}}(p_i, p_i^*) + \frac{\lambda_1}{N_{\mathrm{reg}}} \sum_i p_i^* L_{\mathrm{reg}}(\mathbf{t}_i, \mathbf{t}_i^*)$ (4)
where $p_i$ is the predicted probability of the $i$-th anchor being a target rather than background, and the ground truth indicator label $p_i^*$ is 1 if the anchor is positive (i.e., overlap $\geq$ 0.5) and 0 if the anchor is negative. We define the classification loss $L_{\mathrm{cls}}$ as the log loss over the two classes (background and target) as follows:
$L_{\mathrm{cls}}(p_i, p_i^*) = -\left[ p_i^* \log p_i + (1 - p_i^*) \log (1 - p_i) \right]$ (5)
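A minimal sketch of such a two-class log loss, assuming the standard binary cross-entropy form:

```python
import math

def cls_loss(p, label):
    """Two-class log (cross-entropy) loss, a standard form assumed here:
    p is the predicted probability of the anchor being a target and
    label is the ground truth indicator (1 = target, 0 = background)."""
    return -math.log(p) if label == 1 else -math.log(1.0 - p)
```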
Then we define the regression loss $L_{\mathrm{reg}}$ as
$L_{\mathrm{reg}}(\mathbf{t}_i, \mathbf{t}_i^*) = \sum_{j} \mathrm{smooth}_{L_1}(t_{i,j} - t_{i,j}^*)$ (6)
where $\mathrm{smooth}_{L_1}$ is the smooth $L_1$ loss function defined in [41]; formally, it can be calculated by
$\mathrm{smooth}_{L_1}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$ (7)
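The smooth $L_1$ function from [41] is quadratic near zero and linear elsewhere, which damps the influence of regression outliers on the gradient; it can be written compactly as:

```python
import numpy as np

def smooth_l1(x):
    """Smooth-L1 loss: 0.5 * x^2 for |x| < 1, |x| - 0.5 otherwise."""
    x = np.asarray(x, dtype=np.float64)
    return np.where(np.abs(x) < 1.0, 0.5 * x ** 2, np.abs(x) - 0.5)
```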
In Eq. (4), the two loss terms are normalized by $N_{\mathrm{cls}}$ and $N_{\mathrm{reg}}$, respectively, and balanced by the hyperparameter $\lambda_1$; their values are fixed in our experiments.
Shamos Algorithm for RRoIs. The rotating calipers method was first used in the dissertation of Michael Shamos in 1978 [42]. Shamos used this method to generate all antipodal pairs of points on a convex polygon and to compute the diameter of a convex polygon in linear time. Houle and Toussaint later developed an application for computing the minimum width of a convex polygon in [43]. After RRPN, we actually obtain RRoIs in the form of irregular quadrilaterals denoted by 8-d vectors. In order to obtain RRoIs in the form of regular quadrilaterals to feed into RDN, we propose to use the Shamos algorithm to calculate the minimum multi-oriented rectangular bounding boxes. The coordinate transformation can be described as follows:
$(x, y, w, h, \theta) = \mathcal{S}(x_1, y_1, x_2, y_2, x_3, y_3, x_4, y_4)$ (8)
where a 5-d vector $(x, y, w, h, \theta)$ is utilized to represent the minimum multi-oriented rectangular RRoI box, which is obtained from the 8-d vector of an irregular quadrilateral RRoI box by the Shamos algorithm $\mathcal{S}$. Now we can feed the rectangular RRoI boxes into RDN.
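A rotating-calipers-style computation of the minimum-area oriented rectangle exploits the fact that, for a convex polygon, the optimal rectangle is flush with one of its edges. The sketch below therefore tries each edge direction of a convex quadrilateral and keeps the smallest box; it is an illustrative implementation, not the authors' modified Shamos algorithm.

```python
import math

def min_area_rect(pts):
    """Minimum-area oriented bounding rectangle of a small convex point
    set (e.g. a quadrilateral RRoI), returned as a 5-d rotatable box
    (x, y, w, h, theta_deg).  For each polygon edge, the points are
    rotated so that the edge is horizontal, the axis-aligned bounding
    box is measured, and the smallest one wins."""
    best = None
    n = len(pts)
    for i in range(n):
        (x0, y0), (x1, y1) = pts[i], pts[(i + 1) % n]
        theta = math.atan2(y1 - y0, x1 - x0)
        c, s = math.cos(-theta), math.sin(-theta)
        xs = [x * c - y * s for x, y in pts]
        ys = [x * s + y * c for x, y in pts]
        w, h = max(xs) - min(xs), max(ys) - min(ys)
        if best is None or w * h < best[2] * best[3]:
            # centre in the rotated frame, rotated back to global coords
            cx, cy = (min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2
            ct, st = math.cos(theta), math.sin(theta)
            best = (cx * ct - cy * st, cx * st + cy * ct,
                    w, h, math.degrees(theta))
    return best
```

Since a quadrilateral has only four edges, this runs in constant time per RRoI.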
II-B Rotatable detection network
Suppose that we generate a set of RRoIs after RRPN; these RRoIs are subsequently fed into RDN for the final regression and classification tasks. Here, an improved position-sensitive RoI pooling strategy, called rotatable position-sensitive RoI pooling (RPS pooling), is proposed to generate scores on rotatable position-sensitive score maps (RPS maps) for each RRoI. We next describe the main ingredients of our RDN.
RPS Maps of RRoIs. Given the RRoIs defined by 5-d vectors, each target proposal can be located on the feature maps extracted by an adjusted ResNet-101 [37], which uses a randomly initialized convolutional layer with 1024 filters instead of a global average pooling layer and a fully connected layer.
For the classification task in RDN, we apply $k^2$ RPS maps for each category, i.e., a $k^2(C+1)$-channel output layer with $C$ object categories ($C = 1$ for our vehicle detection task, plus 1 for background). The bank of $k^2$ RPS maps corresponds to a $k \times k$ spatial grid describing relative positions, and we set $k = 3$ in this paper.
Different from the position-sensitive score maps in R-FCN [36], the RPS maps in our method do not encode the cases of an arbitrary object category, but the cases of a potential vehicle category. As shown in Fig. 4, the RRoIs of vehicles are always approximately centrally symmetric, so that there are always highly similar features in the headstock and tailstock directions for most vehicles, making it hard to identify the exact direction. In order to avoid this almost "unavoidable" misidentification, in our model, the angle of an RRoI is kept within a half rotation. For example, Fig. 4 shows two possible cases in which the vehicle direction might be either of two opposite angles, and these two possible cases are encoded into RPS maps in the same order.
RPS Pooling on RPS Maps. We divide each rotatable rectangular RRoI box into $k \times k$ bins by a grid parallel to its sides. For an RRoI of width $w$ and height $h$, a bin is of size $\frac{w}{k} \times \frac{h}{k}$ [41, 44]. For the $(i, j)$-th bin, the RPS pooling step over the corresponding RPS map of the $c$-th category is defined as follows:
$r_c(i, j) = \frac{1}{n_{ij}} \sum_{(u, v) \in \mathrm{bin}(i, j)} z_{i, j, c}(x_g, y_g \mid \Theta)$ (9)
where $r_c(i, j)$ denotes the pooled output in the $(i, j)$-th bin for the $c$-th category, $z_{i, j, c}$ represents one RPS map out of the $k^2(C+1)$ score maps, $\Theta$ denotes all learnable parameters of the network, $n_{ij}$ is the number of pixels in the $(i, j)$-th bin, and $(x_g, y_g)$ is the global coordinate of a feature point, which can be defined by the following affine transformation equation:
$\begin{pmatrix} x_g \\ y_g \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} u \\ v \end{pmatrix} + \begin{pmatrix} x_{tl} \\ y_{tl} \end{pmatrix}$ (10)
where $(u, v)$ denotes the local coordinates of a feature point within the RRoI and $(x_{tl}, y_{tl})$ denotes the top-left corner of the RRoI. Formally, the rotation angle $\theta$ can be calculated by
$\theta = \arctan \frac{y_2 - y_1}{x_2 - x_1}$ (11)
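The local-to-global coordinate mapping amounts to a rotation by the RRoI angle about its top-left corner followed by a translation; a sketch (the argument names are ours):

```python
import math

def local_to_global(u, v, x_tl, y_tl, theta_deg):
    """Map a local coordinate (u, v) inside a rotated RRoI to the global
    frame of the feature map: rotate by theta about the origin, then
    translate by the RRoI's top-left corner (x_tl, y_tl)."""
    c = math.cos(math.radians(theta_deg))
    s = math.sin(math.radians(theta_deg))
    return (x_tl + u * c - v * s, y_tl + u * s + v * c)
```

Each bin's pixels are visited in this way, so the pooling window follows the RRoI's orientation instead of the image axes.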
When the $(i, j)$-th bin is pooled into a score map, we can obtain the pooled value as follows:
(12) 
and we then obtain the bounds of the bin coordinates in the following inequalities:
(13) 
Voting by Scores. After performing the RPS pooling operation on the RPS maps at the $k^2$ positions of the 2 (background/vehicle) layers, we can then assemble the pooled blocks into one voting map for each RRoI. Here, a voting map is a kind of feature map that keeps the rotatable position-sensitive features of the different position blocks. Hence, we use the outputs of all $k^2$ blocks to compute the score of the $c$-th category by
$s_c = \frac{1}{k^2} \sum_{i, j} r_c(i, j)$ (14)
The softmax response of the $c$-th category can then be computed as follows:
$p_c = \frac{e^{s_c}}{\sum_{c'=0}^{C} e^{s_{c'}}}$ (15)
Loss Function in RDN. Similarly, for the regression task in RDN, we use an 8-d vector to represent the eight parameterized coordinates of a predicted bounding box and apply RPS maps to each dimension for regression. Thus, the RPS score maps are fed into a convolutional layer with 72 ($= 8 \times 3^2$) filters for bounding box regression. We then pool these feature maps into a 72-d vector, which is aggregated into an 8-d vector by average voting for each predicted bounding box.
Likewise, we also define a multi-task loss for the RRoIs:
$L = L_{\mathrm{cls}}(p, c^*) + \lambda_2 \, [c^* = 1] \, L_{\mathrm{reg}}(\mathbf{t}, \mathbf{t}^*)$ (16)
where the ground truth label $c^*$ represents the category (e.g., 0 and 1 stand for background and vehicle, respectively), and the ground truth indicator label is 1 when there exist vehicles in an RRoI's predicted box (i.e., overlap $\geq$ 0.5). $p$ is the predicted score of the RRoI being a vehicle and covers the two categories. As usual, we use a fully connected layer activated by a softmax function to compute it as in Eq. (15). Moreover, the coordinates of the predicted bounding box and the ground truth box are parameterized as in Eq. (3). In addition, we define the classification loss as the log loss over the two classes as in Eq. (5) and the regression loss as the smooth $L_1$ loss following Eq. (6) and Eq. (7). In Eq. (16), the two terms of the loss are also normalized and balanced by a hyperparameter $\lambda_2$, whose default value is fixed.
II-C Joint loss function
The joint loss function of our end-to-end trainable two-stage detection framework is a combination of the RRPN loss $L_{\mathrm{RRPN}}$ and the RDN loss $L_{\mathrm{RDN}}$, and we use a loss weight $\mu$ to balance them. For each mini-batch, we compute the joint loss as follows:
$L = L_{\mathrm{RRPN}} + \mu L_{\mathrm{RDN}}$ (17)
where $L_{\mathrm{RRPN}}$ and $L_{\mathrm{RDN}}$ can be calculated by Eq. (4) and Eq. (16), respectively, for each image in a mini-batch. In this paper, we set the weight $\mu = 1$. We also conducted experiments to find a better $\mu$, and the details will be given in Section III.
II-D End-to-end training
An end-to-end training strategy is utilized to train our model, and we use a mini-batch gradient descent algorithm to update the network weights. The learnable parameters comprise those of ResNet-101, RRPN, and RDN, where the ResNet-101 parameters are those of its five blocks. In this paper, to train our model more efficiently, we only update the parameters of the last two blocks of ResNet-101 and keep those of the first three blocks fixed. There are two training stages, for RRPN and RDN, respectively. In RRPN, feature maps are extracted by the first four blocks of ResNet-101, and $L_{\mathrm{RRPN}}$ is computed and accumulated over a mini-batch. In RDN, we extract feature maps using all five blocks, compute $L_{\mathrm{RDN}}$ for each image, and accumulate it over a mini-batch. We then compute the joint loss by Eq. (17) and independently perform backpropagation at the end of each mini-batch.
As aforementioned, there are two transmission routes which are regarded as forward propagation in our network. We initialize the network parameters by a pre-trained model and a Gaussian distribution, respectively. At the end of RRPN and RDN, according to the discrepancy between the ground truth and the stepwise outputs of RRPN and RDN, we use the gradient descent algorithm to update the learnable parameters of the different parts with a learning rate $\eta$ as follows:
$\Theta \leftarrow \Theta - \eta \frac{\partial L}{\partial \Theta}$ (18)
Hyperparameter settings.
Prior to network training, all new layers are randomly initialized by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01, and all other layers are initialized by a model pre-trained on ImageNet
[45]. In the mini-batch gradient descent, we use a learning rate of 0.001 for the first 10K iterations and 0.0001 for the next 10K iterations on the dataset. The momentum and weight decay are set to 0.9 and 0.0005, respectively. As a result, we find that the model works well after about 10K iterations. The mini-batch size is set to 32 in this paper.
III Experiments
We use VGG-16 [38], ResNet-101 [37], and ResNet-101 with FPN [39] to extract features. The models are implemented using Caffe and Caffe2 [46] and run on an NVIDIA GeForce GTX 1080 Ti with 12 GB of on-board memory.
III-A Dataset
To evaluate the performance of our method, we use two open vehicle detection datasets, namely the DLR 3K Munich Vehicle Dataset [18] and the VEDAI Vehicle Dataset [47], in which vehicles are accurately labelled with rotatable rectangular boxes. In our experiments, we regard the various types of vehicles as one category. Statistics of the two datasets can be found in Table I. In addition, we make use of data augmentation (translation, scale, and rotation transforms) to extend the number of training samples (cf. Table I).
III-B Experimental analysis
In our detection task, there are two outputs: rotatable rectangular bounding boxes and categories (i.e., vehicle or not). In general, we evaluate the performance of detection methods under different choices of intersection over union (IoU), which indicates the overlap ratio between a predicted box and its ground truth box. The IoU is defined as:
$\mathrm{IoU} = \frac{|A_p \cap A_{gt}|}{|A_p \cup A_{gt}|}$ (19)
where $A_p$ and $A_{gt}$ are the areas of the predicted box and the ground truth box, respectively, in the shape of regular rectangles. Therefore, we convert our predicted rotatable rectangular bounding boxes and the rotatable rectangular ground truth boxes into regular rectangular ones (i.e., minimum bounding rectangles) for the purpose of calculating IoUs.
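The conversion and the IoU computation can be sketched as follows, where aabb_of_rbox returns the minimum axis-aligned bounding rectangle of a rotated box (an illustrative helper, not the paper's code):

```python
import math

def aabb_of_rbox(x, y, w, h, theta_deg):
    """Minimum axis-aligned bounding rectangle of a rotated box,
    returned as (x1, y1, x2, y2)."""
    c = abs(math.cos(math.radians(theta_deg)))
    s = abs(math.sin(math.radians(theta_deg)))
    bw, bh = w * c + h * s, w * s + h * c
    return (x - bw / 2, y - bh / 2, x + bw / 2, y + bh / 2)

def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)
```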
Then, average precision (AP) and the precision-recall curve are applied to evaluate the object detection methods. Quantitatively, AP is the average value of precision for each object category from recall = 0 to recall = 1. To compute the AP value, we define and count true positives (TPs), false positives (FPs), false negatives (FNs), and true negatives (TNs) in the detection results.
For the vehicle detection task, we regard the region of a regular rectangular bounding box as a TP if its IoU is above the given threshold. Otherwise, if the IoU is below the given threshold, the region is considered an FP (also called a false alarm). Moreover, the region of a target is regarded as an FN (also called a miss alarm) if no predicted bounding box covers it. Otherwise, we regard the region as a TN (also called a correct rejection). Consequently, we use the following equations to define the precision and recall indicators:
$\mathrm{precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \mathrm{recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$ (20)
Item                        | DLR 3K Munich Dataset | VEDAI Dataset
Category                    | Car, Bus, Truck       | Car, Pickup, Truck, Van
Image size                  |                       |
Data type                   | RGB                   | RGB & NIR
Total images (Tr. / Te.)    | 575 / 408             | 1066 / 1066
Original veh. (Tr. / Te.)   | 5214 / 3054           | 2792 / 2702
Augmented veh. (Tr. / Te.)  | 31284 / 3054          | 16752 / 2702
Besides, we use the F-score to evaluate the combined performance of precision and recall, which can be calculated as follows:
$F = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$ (21)
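Given the TP/FP/FN counts, the precision, recall, and F-score above reduce to a few lines:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F-score from detection counts; the
    zero-denominator guards are a defensive addition."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```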
Recall-IoU Analysis. Fig. 6 shows Recall-IoU comparisons of our method and the baseline methods. Although a lower IoU threshold usually means more TPs and fewer FPs, it may also yield inaccurate target locations. So, to obtain a better tradeoff between detection and localization accuracy, we conventionally set the IoU threshold to 0.5. Fig. 6 also displays the Recall-IoU trends of the three different feature extraction networks (i.e., VGG-16, ResNet-101, and ResNet-101 with FPN). We can see that the proposed method obtains significantly better performance than the other networks, which indicates that the proposed network is capable of learning more robust feature representations for multi-oriented vehicle detection tasks. Besides, it can be seen that FPN generates better localization results, as it embeds low-level features into high-level ones.
Orientation Accuracy Analysis. In addition to evaluating localization accuracy, for multi-oriented vehicle detection tasks, we also need to assess the performance of the proposed RNet in terms of vehicle orientation estimation. Thus, we compute the distribution of the deviation between the predicted angles and the ground truth angles. In Fig. 7, we compare two anchor generation strategies, i.e., the proposed BAR anchor and the traditional anchor [35], for our method (using ResNet-101 with FPN as the feature extraction network) on the DLR 3K Munich Dataset. It can be seen that applying BAR anchors to the region proposal network offers better estimations of vehicle orientations, which may be attributed to the prior information of vehicle sizes encoded in the anchor areas.
The Analysis of the Number of Proposals. As one of the important parameters of two-stage detection frameworks, the number of proposals always influences the tradeoff between detection accuracy and processing time. However, as the number of proposals increases, the detection accuracy does not keep increasing. Here, we test the recall rate with different numbers of RRPN proposals and RDN proposals in the proposed RNet (with ResNet-101) on the DLR 3K Munich Dataset and the VEDAI Dataset, respectively, with the IoU value set to 0.5. In Table II, we display the recall rates of the different proposal number settings, varying the number of proposals in RRPN from 100 to 2000 and that in RDN from 300 to 3000. It can be seen that for both the DLR 3K Munich Dataset and the VEDAI Dataset, when the number of proposals in RRPN is more than 500 and that in RDN is set to more than 1000, respectively, the recall rate is nearly the same. We therefore use 500 and 1000 as the above-mentioned numbers in our following experiments.
Loss Weight Analysis. When training our network with the joint loss, there are several important hyperparameters (i.e., the RRPN-RDN loss weight $\mu$ and the regression weights $\lambda_1$ and $\lambda_2$) which control the weights of the components of the loss function. We therefore conduct a series of experiments to seek an optimal combination of them. In our method, the loss weight $\mu$ balances RRPN and RDN, and $\lambda_1$ and $\lambda_2$ balance the classification and regression tasks in RRPN and RDN, respectively. We first fix $\lambda_1$ and $\lambda_2$ and tweak $\mu$, and then assess the influence of $\lambda_1$ and $\lambda_2$ by fixing $\mu$. In our experiments, we use the recall rate under IoU = 0.5 to assess the performance of the different parameter settings. In Table III, we show the trend of the recall rate in relation to $\mu$. It can be seen that $\mu = 1$ is a turning point: when $\mu$ is smaller than 1, the recall rate increases, and it decreases when $\mu$ is larger than 1. Hence, we set $\mu = 1$ for a good tradeoff between the RRPN loss and the RDN loss.
Besides, it is necessary to conduct experiments to find the balance between $\lambda_1$ and $\lambda_2$, which can be tweaked to guard against gradient domination. On the one hand, we fix $\lambda_2$ and vary $\lambda_1$ over seven values in the range from 0.01 to 100. According to the recall rates on the DLR 3K Munich Dataset and the VEDAI Dataset, we choose $\lambda_1 = 1$ for the DLR 3K Munich Dataset and $\lambda_1 = 10$ for the VEDAI Dataset. On the other hand, we set $\lambda_1 = 1$ and $\lambda_1 = 10$ for the DLR 3K Munich Dataset and the VEDAI Dataset, respectively, and then select $\lambda_2$ in the range of 0.01 to 100 with seven values. At last, we find that $\lambda_2 = 10$ is optimal for both datasets. From Table III, we can see that there is no gradient domination, which shows that the proposed method is robust.
III-C Comparisons with other methods on the detection task
We compare the proposed network (based on VGG-16, ResNet-101, and ResNet-101 with FPN) with two one-stage CNN-based object detection methods (i.e., SSD [48] (https://github.com/weiliu89/caffe/tree/ssd) based on VGG-16 and YOLO [49] (https://github.com/pjreddie/darknet) based on VGG-16), a two-stage CNN-based object detection method, Faster R-CNN [35] (https://github.com/rbgirshick/py-faster-rcnn) based on VGG-16, and a baseline method, R-FCN [36] (https://github.com/YuwenXiong/py-R-FCN), based on VGG-16, ResNet-101, and ResNet-101 with FPN.
Num. of RRPN proposals | 100   | 300   | 500   | 1000  | 2000
Num. of RDN proposals  | 300   | 500   | 1000  | 2000  | 3000
DLR 3K Munich Dataset  | 0.605 | 0.753 | 0.809 | 0.812 | 0.816
VEDAI Dataset          | 0.426 | 0.522 | 0.586 | 0.589 | 0.592
RRPN-RDN loss weight   | 0.01  | 0.1   | 1     | 10    | 100
DLR 3K Munich Dataset  | 0.730 | 0.764 | 0.798 | 0.792 | 0.753
VEDAI Dataset          | 0.508 | 0.520 | 0.565 | 0.537 | 0.528

Regression weight in RRPN (others fixed) | 0.01 | 0.1 | 1 | 10 | 100
DLR 3K Munich Dataset  | 0.741 | 0.766 | 0.809 | 0.781 | 0.769
VEDAI Dataset          | 0.536 | 0.558 | 0.571 | 0.586 | 0.562

Regression weight in RDN (others fixed)  | 0.01 | 0.1 | 1 | 10 | 100
DLR 3K Munich Dataset  | 0.733 | 0.778 | 0.798 | 0.809 | 0.790
VEDAI Dataset          | 0.528 | 0.545 | 0.569 | 0.586 | 0.575
Parameter Settings. Before further comparison of our method with the others, we set proper parameters for each method. Faster R-CNN is implemented based on the Caffe framework, and we generate 2000 proposals by the selective search algorithm [50]. The other parameter settings of these CNN-based methods follow their open source code.
Precision-Recall Curve and AP. We show the precision-recall curves and APs of our method and the other competitors on both the DLR 3K Munich Dataset and the VEDAI Dataset in Fig. 9 and Table IV, respectively.
On the DLR 3K Munich Dataset and the VEDAI Dataset, we set the recall range from 0 to 1 and from 0 to 0.8, respectively, and show the precision-recall curves of the different methods. It can be seen that the two-stage CNN-based detection methods achieve higher accuracy than the one-stage ones, which can be attributed to the fact that two-stage detectors provide more accurate candidate regions for the classification task than one-stage ones.
On the other hand, from Table IV, in comparison with the baseline model, we can see that the proposed method does not raise the AP performance much and even lowers the AP when taking VGG-16 or ResNet-101 as the feature extraction network. However, the proposed method improves the AP performance considerably when using features extracted by ResNet-101 together with FPN, which is probably thanks to the efficient feature fusion mechanism in FPN, contributing to more precise localization when outputting bounding boxes in the shape of rotatable rectangles. Besides, it can be seen that deeper networks offer better results owing to their stronger capability of nonlinear representation.
Method            | DLR 3K Munich Dataset | VEDAI Dataset
YOLO (V)          | 71.4% | 36.7%
SSD (V)           | 74.7% | 43.8%
Faster R-CNN (V)  | 73.4% | 44.8%
RNet (V)          | 74.2% | 45.2%
R-FCN (R)         | 80.1% | 53.2%
RNet (R)          | 79.5% | 53.4%
R-FCN (R+F)       | 85.9% | 61.8%
RNet (R+F)        | 87.0% | 64.8%
Method                  | DLR 3K Mun. (F1) | 4-cls VEDAI (mAP)
Viola-Jones [18]        | 0.61 |
Liu's [18]              | 0.77 |
AVPN_basic [25]         | 0.80 |
AVPN_large [25]         | 0.82 |
Fast R-CNN (AVPN) [25]  | 0.82 |
DPM [47]                |      | 0.46
SVM+LTP [47]            |      | 0.51
SVM+HOG31+LBP [47]      |      | 0.50
RNet (V), IoU = 0.5     | 0.83 | 0.47
RNet (R), IoU = 0.5     | 0.85 | 0.56
RNet (R+F), IoU = 0.5   | 0.91 | 0.69
In Table V, we report the F1-scores of several methods, including the Viola-Jones detector [51], Liu's method [18], and the AVPN method with different settings [25]. Moreover, we also report the mean APs (mAPs) of the Deformable Part-based Model (DPM) [52] and several traditional detectors which use handcrafted features (e.g., LBP [15], HOG [13], and LTP [53]) on four selected vehicle classes, i.e., car, pickup, truck, and van. The results show that the proposed method outperforms the others.
Task         | Mode         | Frame size | Frames | aFPS | aTPs | aFNs | aFPs | Precision | Recall | Num. of proposals | Test speed
Det.         | Cruise       | 1080       |        | 24.0 | 67   | 6    | 0    | 100.0%    | 93.8%  | 1000 / 500        | 10.3 fps
Det.         | Surveillance | 1452       |        | 24.2 | 435  | 46   | 2    | 99.3%     | 90.4%  | 5000 / 3000       | 1.8 fps
Det. by trk. | Surveillance | 1452       |        | 24.2 | 439  | 42   | 0    | 100.0%    | 91.3%  | 5000 / 3000       | 1.6 fps
RNet (with ResNet-101 and FPN) as the detection method together with a Kalman filter as the tracking method. Key: Det. = Detection; Trk. = Tracking.
| Method | Frame@1s | Frame@15s | Frame@30s | Frame@45s |
|---|---|---|---|---|
| BXceptionFCN [27] | 91.4 / 89.7 / 93.2 / 2.59 | 90.2 / 86.8 / 93.8 / 2.76 | 90.1 / 87.7 / 92.7 / 2.81 | 90.4 / 87.6 / 93.2 / 2.74 |
| BResFCN [27] | 93.3 / 95.2 / 91.5 / 1.54 | 92.6 / 91.5 / 93.6 / 1.67 | 93.6 / 94.0 / 93.2 / 1.92 | 93.1 / 94.3 / 91.8 / 1.77 |
| Mask R-CNN (R+F) | 95.3 / 99.8 / 91.2 / 0.24 | 95.6 / 99.6 / 91.9 / 0.25 | 95.0 / 99.6 / 90.8 / 0.25 | 95.0 / 99.3 / 91.1 / 0.24 |
| RNet (R+F) | 94.9 / 99.6 / 90.6 / 0.56 | 94.9 / 99.3 / 90.9 / 0.55 | 94.3 / 99.3 / 89.8 / 0.56 | 95.1 / 99.8 / 90.9 / 0.55 |
| RNet (R+F)+KF | 95.5 / 99.8 / 91.6 / 0.63 | 95.9 / 100.0 / 92.3 / 0.63 | 95.8 / 99.8 / 92.0 / 0.63 | 96.5 / 99.8 / 93.4 / 0.62 |

Key: each cell lists F-score (F) / precision (Pre.) / recall (Rec.) in %, followed by test time (Tim.).
III-D Experiments on Aerial Videos
In addition to image data, we also test our method on two aerial videos (see Fig. 10): the Parking Lot UAV Cruise Video and the Busy Parking Lot UAV Surveillance Video [27]. The former is captured in a low-altitude cruise mode, and the latter is acquired by a camera onboard a UAV hovering above the parking lot of Woburn Mall, Woburn, MA, USA (https://www.youtube.com/watch?v=yojapmOkIfg). We apply our model (RNet with ResNet-101 and FPN, trained on the DLR 3K Munich Vehicle Dataset) to the Parking Lot UAV Cruise Video for the detection task, to assess the performance of the proposed model on video frames of a dynamic scene, and to the Busy Parking Lot UAV Surveillance Video for detection as well as tracking, to test the detector's performance in a crowded and complex scene observed from a relatively static surveillance view. Furthermore, we are curious whether the temporal information in the second video can strengthen our detection network's performance. The detection results are shown in Table VI.
Detection on Parking Lot UAV Cruise Video. To quantitatively evaluate our model on this video, we manually labeled the ground truth for 20 frames and then computed the average TPs, FNs, and FPs, as well as precision and recall, based on these annotations. For simplicity, we use the transferred regular rectangle boxes to compute the IoU, as mentioned before, with the IoU threshold set to 0.5. Table VI shows that the precision is 100% and the recall is 93.8%, which indicates that our trained model generalizes well to this video. Besides, the test speed on this video is about 10.3 fps, which nearly satisfies the requirement of real-time vehicle detection from UAV videos in a low-speed cruise mode. Here, the numbers of proposals in RRPN and RDN are set to 1000 and 500, respectively.
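The IoU test on the transferred regular (axis-aligned) rectangles can be sketched as follows; the corner-based `(x1, y1, x2, y2)` box format here is an illustrative choice, not necessarily the representation used in our code:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2).

    A detection counts as a true positive when its IoU with an
    unmatched ground-truth box exceeds the threshold (0.5 here).
    """
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    iw = max(0.0, ix2 - ix1)  # zero width/height when boxes are disjoint
    ih = max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```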
Detection and Tracking on Busy Parking Lot UAV Surveillance Video. We also test our model on this video with 10 manually labeled frames. The main challenge for vehicle detection here is that a great number of tiny vehicles appear densely packed. As Table VI shows, our trained model copes well with this case, achieving a precision of 99.3% and a recall of 90.4%. Here, we set the numbers of proposals in RRPN and RDN to 5000 and 3000, respectively, and the test speed on this video is 1.8 fps, which nearly meets the requirement of a high-altitude surveillance platform.
To extend our study, we perform a multiple object tracking task on the Busy Parking Lot UAV Surveillance Video using the detection results produced by the proposed network. We exploit the simple online and realtime tracking (SORT) algorithm (https://github.com/abewley/sort), which utilizes a Kalman filter (KF) to predict tracking boxes. The results in Table VI show that tracking further improves the detection performance of the proposed network, reaching a precision of 100% and a recall of 91.3% (in this paper, we only discuss the metrics for the detection task rather than the tracking task), which is attributed to the fact that temporal context across multiple frames can compensate for single-frame detection errors. The joint detection and tracking speed during the test phase is about 1.6 fps. In Fig. 10, we show vehicle IDs, predicted boxes, and vehicle trajectories. A part of the tracking result is available at https://youtu.be/xCYDtYudN0.
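At the core of SORT is a constant-velocity Kalman filter that predicts each track's box in the next frame. Below is a much-simplified sketch that tracks only a box center; the actual SORT implementation tracks a 7-d state (center, scale, aspect ratio, and their velocities), and all noise settings here are illustrative assumptions:

```python
import numpy as np

class BoxKalmanTracker:
    """Constant-velocity Kalman filter for a box center (cx, cy)."""

    def __init__(self, cx, cy):
        # State: [cx, cy, vx, vy]; velocity is initially unknown.
        self.x = np.array([cx, cy, 0.0, 0.0])
        self.P = np.eye(4) * 10.0           # state covariance (assumed)
        self.F = np.eye(4)                  # transition, dt = 1 frame
        self.F[0, 2] = self.F[1, 3] = 1.0
        self.H = np.zeros((2, 4))           # we observe position only
        self.H[0, 0] = self.H[1, 1] = 1.0
        self.Q = np.eye(4) * 0.01           # process noise (assumed)
        self.R = np.eye(2) * 1.0            # measurement noise (assumed)

    def predict(self):
        """Advance the state one frame; returns the predicted center."""
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[:2]

    def update(self, cx, cy):
        """Correct the state with a matched detection's center."""
        z = np.array([cx, cy])
        y = z - self.H @ self.x                   # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)  # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.x[:2]
```

With consistent detections, the filter quickly learns a vehicle's velocity, so its prediction can bridge occasional missed detections, which is how tracking offsets single-frame detection errors.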
Comparison with Instance Segmentation-Based Detection Methods. We compare our method with several state-of-the-art instance segmentation-based detection methods, namely BXceptionFCN [27], BResFCN [27], and Mask R-CNN [54]. Here, the BXceptionFCN and BResFCN models are trained on the ISPRS Potsdam Semantic Labeling Dataset [55], and the Mask R-CNN model with ResNet-101 and FPN is trained on the DLR 3K Munich Dataset; its parameter settings follow the open-source code (https://github.com/facebookresearch/Detectron). Table VII compares precision, recall, and F-score. We find that the proposed models obtain somewhat lower recall than the instance segmentation-based methods in general; however, they perform better in precision and F-score, which shows the satisfactory flexibility of the proposed method on this video data.
Trade-off between Accuracy and Test Time. In Fig. 11, we evaluate the test time and average precision of our method with different numbers of proposals on the two labeled UAV videos, in order to find a good trade-off between accuracy and time cost. Based on this analysis, we set the numbers of proposals in RRPN and RDN to 1000 and 500 for the Parking Lot UAV Cruise Video, and to 5000 and 3000 for the Busy Parking Lot UAV Surveillance Video, respectively.
IV Conclusions
In this paper, a novel method is proposed to detect multi-oriented vehicles in aerial images and videos using a deep network called RNet. First, a typical CNN is utilized to extract deep features. Second, we use RRPN to generate RRoIs encoded as 8-d vectors; a novel strategy called BAR anchor is applied to initialize the templates of rotatable candidates. Third, we use RDN as the classifier and regressor to obtain the final 5-d rotatable detection boxes. Here, we propose a new downsampling method for RRoIs, called RPS pooling, to achieve fast dimensionality reduction on RRoI feature maps while preserving position and orientation information. Besides, we modify the Shamos algorithm for the conversion between the 5-d and 8-d detection boxes in RRPN and RDN. In our method, RRPN and RDN can be jointly trained for high efficiency. We then evaluate the proposed method from two perspectives. On the one hand, we perform experiments on two open vehicle detection image datasets, the DLR 3K Munich Dataset and the VEDAI Dataset, to compare with other state-of-the-art detection methods. On the other hand, we conduct additional experiments on two aerial videos using models trained on the DLR 3K Munich Dataset. Experimental results show that the proposed RNet outperforms the other methods on both aerial images and aerial videos. In particular, RNet can be combined with multiple object tracking methods to acquire further information (e.g., vehicle trajectories), showing satisfactory performance on multi-oriented vehicle detection tasks.

References

[1] G. Xia, J. Hu, F. Hu, B. Shi, X. Bai, Y. Zhong, L. Zhang, and X. Lu, “AID: A benchmark data set for performance evaluation of aerial scene classification,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 7, pp. 3965–3981, Jul. 2017.
 [2] B. Kellenberger, M. Volpi, and D. Tuia, “Fast animal detection in UAV images using convolutional neural networks,” in IEEE International Geoscience and Remote Sensing Symposium, Jul. 2017, pp. 866–869.
 [3] T. Moranduzzo, F. Melgani, M. L. Mekhalfi, Y. Bazi, and N. Alajlan, “Multiclass coarse analysis for UAV imagery,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 12, pp. 6394–6406, Dec. 2015.
 [4] N. Audebert, B. L. Saux, and S. Lefèvre, “Beyond RGB: Very high resolution urban remote sensing with multimodal deep networks,” ISPRS Journal of Photogrammetry and Remote Sensing, 2017.

[5] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang, “DOTA: A large-scale dataset for object detection in aerial images,” in IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2018.
 [6] L. Mou and X. X. Zhu, “RiFCN: Recurrent network in fully convolutional network for semantic segmentation of high resolution remote sensing images,” CoRR, vol. abs/1805.02091, 2018. [Online]. Available: http://arxiv.org/abs/1805.02091

[7] F. Zhang, B. Du, L. Zhang, and M. Xu, “Weakly supervised learning based on coupled convolutional neural networks for aircraft detection,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 9, pp. 5553–5563, Sep. 2016.
 [8] L. Mou and X. X. Zhu, “Spatiotemporal scene interpretation of space videos via deep neural network and tracklet analysis,” in IEEE International Geoscience and Remote Sensing Symposium, Jul. 2016, pp. 1823–1826.
 [9] L. Mou, X. Zhu, M. Vakalopoulou, K. Karantzalos, N. Paragios, B. L. Saux, G. Moser, and D. Tuia, “Multitemporal very high resolution from space: Outcome of the 2016 IEEE GRSS Data Fusion Contest,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 10, no. 8, pp. 3435–3447, Aug. 2017.
 [10] N. Audebert, B. Le Saux, and S. Lefèvre, “SegmentbeforeDetect: Vehicle detection and classification through semantic segmentation of aerial images,” Remote Sensing, vol. 9, no. 4, p. 368, Apr. 2017.
 [11] Q. Li, L. Mou, Q. Liu, Y. Wang, and X. X. Zhu, “HSFNet: Multiscale deep feature embedding for ship detection in optical remote sensing imagery,” IEEE Transactions on Geoscience and Remote Sensing, pp. 1–15, 2018.
 [12] D. G. Lowe, “Object recognition from local scaleinvariant features,” in IEEE International Conference on Computer Vision, vol. 2, 1999, pp. 1150–1157 vol.2.
 [13] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, Jun. 2005, pp. 886–893 vol. 1.
 [14] W. Shao, W. Yang, G. Liu, and J. Liu, “Car detection from highresolution aerial imagery using multiple features,” in IEEE International Geoscience and Remote Sensing Symposium, 2012.
 [15] T. Ojala, M. Pietikainen, and T. Maenpaa, “Multiresolution grayscale and rotation invariant texture classification with local binary patterns,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, Jul. 2002.
 [16] T. Moranduzzo and F. Melgani, “Automatic car counting method for unmanned aerial vehicle images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 3, pp. 1635–1647, 2014.
 [17] T. Moranduzzo and F. Melgani, “Detecting cars in UAV images with a catalogbased approach,” IEEE Transactions on Geoscience and Remote Sensing, vol. 52, no. 10, pp. 6356–6367, 2014.
 [18] K. Liu and G. Mattyus, “Fast multiclass vehicle detection on aerial images,” IEEE Geoscience and Remote Sensing Letters, vol. 12, no. 9, pp. 1938–1942, Sep. 2015.
 [19] M. ElMikaty and T. Stathaki, “Detection of cars in highresolution aerial images of complex urban environments,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 10, pp. 5913–5924, Oct. 2017.
 [20] B. Kalantar, S. B. Mansor, A. A. Halin, H. Z. M. Shafri, and M. Zand, “Multiple moving object detection from UAV videos using trajectories of matched regional adjacency graphs,” IEEE Transactions on Geoscience and Remote Sensing, vol. 55, no. 9, pp. 5198–5213, Sep. 2017.
 [21] H. Zhou, L. Wei, C. P. Lim, D. Creighton, and S. Nahavandi, “Robust vehicle detection in aerial images using bagofwords and orientation aware scanning,” IEEE Transactions on Geoscience and Remote Sensing, pp. 1–12, 2018.
 [22] X. X. Zhu, D. Tuia, L. Mou, G. S. Xia, L. Zhang, F. Xu, and F. Fraundorfer, “Deep learning in remote sensing: A comprehensive review and list of resources,” IEEE Geoscience and Remote Sensing Magazine, vol. 5, no. 4, pp. 8–36, Dec. 2017.
 [23] J. Han, D. Zhang, G. Cheng, L. Guo, and J. Ren, “Object detection in optical remote sensing images based on weakly supervised learning and highlevel feature learning,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 6, pp. 3325–3337, Jun. 2015.
 [24] G. Cheng, P. Zhou, and J. Han, “Learning rotationinvariant convolutional neural networks for object detection in vhr optical remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 54, no. 12, pp. 7405–7415, Dec. 2016.
 [25] Z. Deng, H. Sun, S. Zhou, J. Zhao, and H. Zou, “Toward fast and accurate vehicle detection in aerial images using coupled regionbased convolutional neural networks,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 10, no. 8, pp. 3652–3664, Aug. 2017.
 [26] Q. Li, Y. Wang, Q. Liu, and W. Wang, “Hough transform guided deep feature extraction for dense building detection in remote sensing images,” in IEEE International Conference on Acoustics, Speech and Signal Processing, Apr. 2018.
 [27] L. Mou and X. X. Zhu, “Vehicle instance segmentation from aerial image and video using a multitask learning residual fully convolutional network,” IEEE Transactions on Geoscience and Remote Sensing, pp. 1–13, 2018.
 [28] Q. Li, L. Mou, K. Jiang, Q. Liu, Y. Wang, and X. X. Zhu, “Hierarchical region based convolution neural network for multiscale object detection in remote sensing images,” in IEEE International Geoscience and Remote Sensing Symposium, Jul. 2018.
 [29] X. Chen, S. Xiang, C. L. Liu, and C. H. Pan, “Vehicle detection in satellite images by hybrid deep convolutional neural networks,” IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 10, pp. 1797–1801, 2014.
 [30] N. Ammour, H. Alhichri, Y. Bazi, B. Benjdira, N. Alajlan, and M. Zuair, “Deep learning approach for car detection in UAV imagery,” Remote Sensing, vol. 9, no. 4, p. 312, 2017.
 [31] V. Badrinarayanan, A. Handa, and R. Cipolla, “SegNet: A deep convolutional encoderdecoder architecture for robust semantic pixelwise labelling,” CoRR, vol. abs/1505.07293, 2015. [Online]. Available: http://arxiv.org/abs/1505.07293
 [32] L. Cao, F. Luo, L. Chen, Y. Sheng, H. Wang, C. Wang, and R. Ji, “Weakly supervised vehicle detection in satellite images via multiinstance discriminative learning,” Pattern Recognition, vol. 64, 2016.
 [33] L. Cao, Q. Jiang, M. Cheng, and C. Wang, “Robust vehicle detection by combining deep features with exemplar classification,” Neurocomputing, vol. 215, pp. 225–231, 2016.
 [34] Y. Xu, G. Yu, X. Wu, Y. Wang, and Y. Ma, “An enhanced ViolaJones vehicle detection method from unmanned aerial vehicles imagery,” IEEE Transactions on Intelligent Transportation Systems, vol. 18, no. 7, pp. 1845–1856, Jul. 2017.
 [35] S. Ren, K. He, R. Girshick, and J. Sun, “Faster RCNN: Towards realtime object detection with region proposal networks,” in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
 [36] J. Dai, Y. Li, K. He, and J. Sun, “RFCN: Object detection via regionbased fully convolutional networks,” in Advances in Neural Information Processing Systems, 2016, pp. 379–387.
 [37] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition, Jun. 2016, pp. 770–778.
 [38] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” in International Conference on Learning Representations, 2015.
 [39] T.Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in IEEE Conference on Computer Vision and Pattern Recognition, Jul. 2017.
 [40] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.
 [41] R. Girshick, “Fast RCNN,” in IEEE International Conference on Computer Vision, Dec. 2015, pp. 1440–1448.
 [42] M. I. Shamos, “Computational geometry.” Ph. D. thesis, Yale University, May 1978.
 [43] M. E. Houle and G. T. Toussaint, “Computing the width of a set,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 10, no. 5, pp. 761–765, Sep. 1988.
 [44] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 9, pp. 1904–1916, Sep. 2015.
 [45] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. FeiFei, “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
 [46] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ACM International Conference on Multimedia, 2014, pp. 675–678.
 [47] S. Razakarivony and F. Jurie, “Vehicle detection in aerial imagery : A small target detection benchmark,” Journal of Visual Communication and Image Representation, vol. 34, pp. 187 – 203, Nov. 2016.
 [48] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.Y. Fu, and A. C. Berg, “SSD: Single shot multibox detector,” in European Conference on Computer Vision, 2016, pp. 21–37.
 [49] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, realtime object detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.
 [50] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and A. W. M. Smeulders, “Selective search for object recognition,” International Journal of Computer Vision, vol. 104, no. 2, pp. 154–171, 2013.
 [51] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in IEEE Computer Conference on Computer Vision and Pattern Recognition, vol. 1, 2001, pp. I–511–I–518 vol.1.
 [52] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained partbased models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, Sep. 2010.

[53] X. Tan and B. Triggs, “Enhanced local texture feature sets for face recognition under difficult lighting conditions,” IEEE Transactions on Image Processing, vol. 19, no. 6, pp. 1635–1650, Feb. 2010.
 [54] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2018.
 [55] F. Rottensteiner, G. Sohn, J. Jung, M. Gerke, C. Baillard, S. Benitez, and U. Breitkopf, “The ISPRS benchmark on urban object classification and 3D building reconstruction,” ISPRS Ann. Photogramm. Remote Sens. Spat. Inf. Sci, vol. 1, no. 3, pp. 293–298, 2012.