Automatic vehicle detection based on aerial imagery is crucial for a variety of applications such as large-scale traffic monitoring, parking lot utilization, urban planning, disaster management, as well as search and rescue missions. Aerial images, with their wide field of view, provide valuable information over large open areas in a short time [Ajay2017].
Due to the steep rise in the number of vehicles, traffic monitoring and management have become considerably more complex, especially in urban areas. The major socio-economic impacts of traffic-related problems such as air pollution, time lost in traffic jams, and health issues have increased the demand for novel automatic algorithms and adequate traffic data [Lewandowski2018]. It has been shown that vehicle detection algorithms based on aerial imagery can provide frequent and cost-efficient information about the location, number, and types of vehicles in different traffic scenarios such as congestion caused by infrastructure bottlenecks, accidents, or even a lack of parking spaces [Ajay2017]. Due to the dynamic nature of traffic, the availability of large-scale information through aerial images can make traffic management more adaptive to changing traffic conditions and help predict infrastructure bottlenecks [Souza2017]. In disaster management, vehicle detection based on aerial imagery allows rapid localization of traffic congestion and abandoned vehicles to determine routes for effective search and rescue activities. Furthermore, in the case of natural disasters such as floods and earthquakes, aerial imagery is the most efficient means for detecting the affected vehicles [Makiuchi2019].
Recently, a large number of studies have focused on object detection (including vehicles) in aerial imagery [DBLP:journals/corr/PinheiroLCD16, recombinator, laplacian, stacked, fpn, D-FPN]; however, despite the pronounced differences between ground and aerial images, most of the proposed methods transfer object detection algorithms developed for natural-scene images to aerial ones because large-scale aerial image datasets are scarce. For instance, to apply deep learning detection algorithms to aerial images, previous works usually relied on fine-tuning networks pre-trained on large-scale natural-scene datasets (ImageNet [Deng2009], MSCOCO [Lin2014], PASCAL VOC [Everingham10thepascal]). As can be seen in Figure 1, the scale of the objects varies widely in aerial images due not only to differences in spatial resolution, but also to differences in the size of objects from the same category. In addition, aerial images usually contain a large number of small objects distributed and oriented differently over the scene (from sparsely distributed moving vehicles on highways to tightly packed ones in parking lots). Moreover, the number of object instances in aerial images is unbalanced, ranging from a few to thousands of objects per image.
Object detection in ground imagery owes much of its progress to large datasets such as MSCOCO, ImageNet, and PASCAL VOC. For aerial imagery, however, datasets comparable in image number and annotation detail are scarce, which has severely limited progress in developing methods for aerial images. The currently available aerial image datasets [Heitz2008TAS, Razakarivony2016VEDAI, Liu2015DLR3K, Zhu2015UCAS, Xia2017DOTA] suffer from either a low number of images and annotated instances or low-quality annotations. The largest currently available aerial image dataset for object detection is DOTA [Xia2017DOTA], which comprises 2,800 images with fifteen categories and about 188,000 bounding box annotations using already processed Google Earth and satellite images; however, it contains only 43,462 vehicles. Other datasets such as TAS [Heitz2008TAS], VEDAI [Razakarivony2016VEDAI], COWC [Mundhenk2016COWC], DLR-3K-Munich-Vehicle [Liu2015DLR3K], and UCAS-AOD [Zhu2015UCAS], which mainly focus on vehicle detection, also contain a very limited number of annotated vehicles: TAS (1,319), VEDAI (3,270), COWC (32,716), DLR-3K-Munich-Vehicle (14,235), and UCAS-AOD (2,819). In addition to the number of instances, the inadequate diversity and complexity of the images used in these datasets (clear background and limited heterogeneity of object distribution) prevent them from representing real-world situations. Table LABEL:tab:stats shows detailed statistics from the current major aerial image datasets for object detection. To promote research on vehicle detection, counting, and tracking, we propose EAGLE, a new dataset for vehicle detection in real-world aerial imagery scenarios and the largest of its kind to date.
Altogether, the main contributions of this paper are:
EAGLE, which is to the best of our knowledge the largest aerial image dataset for vehicle detection and the first dataset of its kind addressing real-world scenarios.
Its high-quality annotations can contribute to the development and evaluation of practical airborne vehicle detection systems as well as haze removal, shadow removal, in-painting, and super-resolution applications. The dataset will be made publicly available.
Benchmarks of state-of-the-art object detection algorithms as baselines for future work, covering all three detection tasks and both dataset split approaches.
II EAGLE dataset
The EAGLE dataset consists of aerial images with size of , acquired during several flight campaigns carried out between 2006 and 2019 at various times of day and year with different weather and illumination conditions. The images were taken under different traffic conditions and in situations involving vehicles such as motorways, urban/rural areas, industrial districts, floods, wildfires, earthquakes, as well as search and rescue missions over multiple locations in five countries (see Figure 2). The images contain a large diversity of vehicle orientation angles and numbers of objects per image, as shown in Figure 3, with a higher number of vehicle instances than previous datasets (see Figure 4). Figure 5 showcases some example image patches from the dataset. We acquired the images using a camera system comprising three standard DSLR cameras (Canon EOS cameras) mounted on an airborne platform with different looking angles: one nadir-looking (top-down vertical) and two side-looking cameras. Camera settings such as aperture size, image size, and ISO were adjusted according to the conditions of each flight campaign. The platform was installed either on an airplane or on a helicopter flying at altitudes between and , resulting in a range of ground sampling distance (GSD), or spatial resolution, from to per pixel. The images were taken from early in the morning until the evening in various weather conditions (sunny, snowy, rainy, and foggy) with different illumination levels. Altogether, the variability in image parameters and scenes allows our dataset to cover a wide range of real-world situations involving vehicles. Figure 2 presents further statistics on the EAGLE dataset.
II-A Image annotation
Taking into account the relevance of the vehicle categories for the real-world applications of aerial imagery according to experts in the domain, we decided on two main categories for our dataset, namely small vehicles (cars, vans, transporters, SUVs, ambulances, police cars) and large vehicles (trucks, large-trucks, minibuses, buses, firefighter trucks, construction vehicles, trailers). The annotation contains the coordinates of all four vehicle corners, with right angles between the sides, as well as an orientation angle between to indicating the angle of the vehicle head with respect to the trigonometric circle. Table LABEL:tab:stats shows a comparison between EAGLE and other existing aerial imagery datasets for vehicle detection. EAGLE contains 215,986 annotated vehicles, ranging from 1 to 3,567 annotations per image in all possible orientations (see Table I), making it the largest aerial image dataset for vehicle detection by a large margin (about 5× more vehicle instances than in the current largest dataset). Furthermore, for each instance, the visibility condition (totally/partly/hardly visible) and orientation clarity (clear/unclear) of the vehicle are provided. Stitched images with original sizes are ones of size. As shown in Table I, the EAGLE dataset contains 208,963 small and 7,023 large vehicles. A category-wise comparison is provided in Figure 4.
II-B Annotation method
We addressed various challenges during the annotation of the vehicles in our aerial images. Due to the diversity of the scene locations, the acquisition time, as well as the weather and illumination conditions, precise annotation of the vehicles can be very challenging. For example, in an image taken over a flooded area in the presence of haze and with low illumination or resolution, the visibility of the vehicles is considerably limited. In addition, occlusion by other objects or strong shadows can make vehicles difficult to find. Furthermore, spotting vehicles in large aerial images of remote places (mountains) is not trivial. Moreover, categorizing vehicles as either small or large can sometimes be tricky due to uncertainty about borderline cases such as large transporters or buses. To ease the latter situation, we treated one-cabin vehicles with a width or a height smaller than a specific threshold (specified by an expert) as small vehicles and all others as large vehicles. We also assigned a difficulty flag to occluded vehicles, which can help to train algorithms to better handle occlusion. Detecting occluded vehicles is very important in real-world scenarios such as floods, where vehicles are trapped or partially under water.
In ground imagery, objects are usually annotated by horizontal bounding boxes (HBBs), where an HBB can be defined by its top-left (TL) and bottom-right (BR) vertices, $(x_{TL}, y_{TL}, x_{BR}, y_{BR})$; or by its center point together with the width and height, $(x_c, y_c, w, h)$. HBB is an efficient object annotation approach; however, it does not consider the objects' orientation, which can lead to imprecise outlines of arbitrarily oriented objects. Moreover, HBBs overlap considerably when objects are tightly packed, which can confuse even state-of-the-art algorithms trying to distinguish them.
| ||Small vehicles||Large vehicles|
|# Weak orientation||311||10|
|# Partly visible||18,188||184|
|objects per image|
One approach to alleviating the limitations of HBB is to use arbitrary quadrilateral bounding boxes, the so-called rotated bounding boxes (RBBs) [Xia2017DOTA], which can be described by $\{(x_i, y_i) \mid i = 1, 2, 3, 4\}$, where $(x_i, y_i)$ are the vertex coordinates, which can be given in clockwise order [Xia2017DOTA]. A specific case is a rotated rectangle, in which adjacent sides meet at right angles. Inspired by [shi2017detecting, Xia2017DOTA] and the annotations in common object detection benchmarks such as MSCOCO and PASCAL VOC, we propose a right-angle constrained oriented bounding box (OBB), which can be described as $\{(x_i, y_i), \theta \mid i = 1, 2, 3, 4\}$, where $(x_i, y_i)$ are the vertex coordinates and $\theta$ indicates the bounding box orientation. An OBB can also be represented as $(x_c, y_c, w, h, \theta)$, where the bounding box edges are oriented according to $\theta$. This approach ensures the precision of the object outlines.
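The relation between the center-based OBB form, its four vertices, and the enclosing HBB can be sketched as follows. This is a minimal illustration, not the dataset's annotation tooling: the function names are hypothetical, and the angle convention (degrees, rotation applied in the image coordinate frame) is an assumption, since the exact convention is not spelled out here.

```python
import math


def obb_to_vertices(xc, yc, w, h, theta_deg):
    """Return the four corners of a right-angle box given center, size and angle.

    Assumption: theta_deg rotates the box about its center in the image frame.
    """
    t = math.radians(theta_deg)
    cos_t, sin_t = math.cos(t), math.sin(t)
    # Corner offsets in the box's own frame, listed in a fixed cyclic order.
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [
        (xc + dx * cos_t - dy * sin_t, yc + dx * sin_t + dy * cos_t)
        for dx, dy in offsets
    ]


def vertices_to_hbb(vertices):
    """Enclosing horizontal box (x_TL, y_TL, x_BR, y_BR) of a set of vertices."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return min(xs), min(ys), max(xs), max(ys)
```

For a box rotated by 45 degrees, the enclosing HBB is noticeably larger than the box itself, which illustrates why HBBs of tightly packed, rotated vehicles overlap.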
II-C Dataset splits
We split the dataset into training, validation, and test sets based on two approaches. In the first approach, we randomly assign , , and of the images respectively. In this case, images from the same flight campaign can be present in both the training and test sets, which makes the detection task easier and similar to DOTA. Thus, in the second approach, we split the dataset so that the images from any given flight campaign are confined to either the training or the test set. This approach resembles real-world scenarios in which there is no prior knowledge about future flight missions and their locations, weather, or illumination conditions.
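The campaign-based split can be illustrated with a short sketch. It is only an illustration under stated assumptions: the mapping from image name to campaign identifier, the function name, and the `test_fraction` parameter are hypothetical and are not taken from the dataset release.

```python
import random
from collections import defaultdict


def campaign_split(image_to_campaign, test_fraction=0.3, seed=0):
    """Assign whole campaigns to train or test so that no campaign is in both.

    image_to_campaign: dict mapping an image name to its campaign id (assumed format).
    test_fraction: hypothetical target share of images in the test set.
    """
    by_campaign = defaultdict(list)
    for image, campaign in image_to_campaign.items():
        by_campaign[campaign].append(image)

    campaigns = sorted(by_campaign)
    random.Random(seed).shuffle(campaigns)

    n_total = len(image_to_campaign)
    train, test = [], []
    for campaign in campaigns:
        # Fill the test set with whole campaigns until it reaches the target share.
        target = test if len(test) < test_fraction * n_total else train
        target.extend(by_campaign[campaign])
    return train, test
```

Because campaigns are moved as a unit, the resulting test share is only approximately `test_fraction`; a purely random split would instead shuffle individual images and ignore the campaign identifier.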
II-D Contributions over the existing datasets
The existing datasets containing vehicle instances (e.g. DOTA) suffer from inconsistent or inaccurate annotations, a low degree of diversity, and a small number of vehicle instances, limiting their practical applications. Therefore, vehicle detection datasets such as EAGLE with thorough annotations even for tiny yet visible vehicles (see Figure 6) are lacking in the community. Moreover, EAGLE enables research on haze and shadow removal as well as super-resolution, in-painting, and instance segmentation. Our dataset differs from the DOTA dataset in the following major respects:
EAGLE focuses on vehicle detection in real-world and practical scenarios with images of diverse location, time, resolution, weather and illumination conditions while DOTA is a multi-class general-purpose detection and classification dataset.
DOTA suffers from incomplete and noisy annotations (see Figure 6) especially for small vehicles [azimi2018towards], whereas EAGLE provides precise and comprehensive annotations (even for partially visible vehicles).
Due to overlaps between the training and test sets in DOTA, the task is less challenging than EAGLE in which two training/test splits are proposed: (1) a random patch-based split, and (2) a more realistic and challenging campaign-based split, where the test set contains locations and adverse conditions unseen during training.
We assess the performance of state-of-the-art object detection methods on EAGLE. For HBB object detection, we choose Cascade (Mask-)RCNN [cai2019cascade], Mask-RCNN [he2017maskrcnn] (https://github.com/facebookresearch/Detectron), FPN [fpn], Faster RCNN [fasterrcnnNIPS2015], FCOS [tian2019fcos] (https://github.com/tianzhi0549/FCOS), TridentNet [li2019scale], SNIPER [singh2018sniper] (https://github.com/MahyarNajibi/SNIPER), R-FCN [dai2016r] (https://github.com/msracver/Deformable-ConvNets), YOLOv3 [redmon2018yolov3], RefineDet [zhang2018single], and SSD [liu2016ssd] (https://github.com/tensorflow/models/tree/master/research/object_detection) with ResNet101 [resnetHe15], ResNeXt101 [xie2017aggregated], Triple-ResNeXt152, InceptionV2 [normalization2015accelerating], or VGG16 [Simonyan2015VeryRecognition] backbone networks as our baseline benchmark algorithms on the test set, owing to their excellent performance in HBB object detection on ground images. Furthermore, we modify the original Cascade Mask-RCNN to detect objects with RBBs described by $\{(x_i, y_i) \mid i = 1, 2, 3, 4\}$. We further adapt the algorithm to be able to detect objects with OBBs denoted as $\{(x_i, y_i), \theta \mid i = 1, 2, 3, 4\}$, where $\theta$ is the vehicle head angle. In order to evaluate the benchmark algorithms on EAGLE, we propose three different tasks: detection by HBB, RBB, and OBB. As the evaluation metric, we employ mean average precision (mAP), similar to PASCAL VOC. The image patches are stitched to form the original image before the evaluation step. In order to remove the redundant detected boxes in the overlapping regions as well as in the patches themselves, we apply non-maximum suppression (NMS) with a threshold of for HBB and for both RBB and OBB.
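The suppression of redundant boxes in overlapping regions can be sketched with a plain greedy NMS for horizontal boxes. This is a generic textbook variant rather than the implementation used in the benchmarks; `iou_threshold` is left as a required argument because the threshold values are not given here, and for RBB and OBB the rectangle IoU below would have to be replaced by a polygon IoU.

```python
def iou(a, b):
    """Intersection over union of two horizontal boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it too much.

    Returns the indices of the kept boxes, ordered by decreasing score.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # A box survives only if it does not overlap any already kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

Applied to stitched images, the same routine merges duplicate detections of a vehicle that appears in two overlapping patches, provided the patch-level boxes are first translated into full-image coordinates.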
III-A Image splitting
In the training phase, the images in the EAGLE dataset (5616×3744 px) are too large to be fed into the object detectors, so we crop them into px patches with a 50% overlap in a sliding-window fashion, resulting in 70 patches per image and leading to 12075, 4025, and 8050 patches for training, validation, and test respectively. The overlap of the patches allows keeping all the objects, even if partially clipped at image boundaries. Patches that would end up partially outside the image are shifted back into the image window. Patch-wise predictions are stitched into full images and overlaps are merged using NMS. This process can cut some vehicles into two parts. In this case, we compute the ratio between the area covered by each part () and that of the complete vehicle () as similar to [Xia2017DOTA], but with the difference that we adapt the parts' ground truths to the image boundaries to have the highest intersection with the original object. After that, for , the attribute of the part remains unchanged, for , the attribute of the part is changed to "difficult", and for , the part is ignored. This implementation will be made public. Moreover, a part which does not include the front of the vehicle (depicting the orientation) is assigned a "difficult" flag for its orientation attribute. For the testing step, we crop the images in the same way, but with a stride of px (10% overlap) to ensure that vehicles are also covered in their full appearance.
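The sliding-window cropping with shift-back, and the attribute rule for clipped vehicles, can be sketched as below. Several values are assumptions: the patch size of 1024 px is not stated in the text, and `keep_ratio` and `ignore_ratio` are hypothetical parameters standing in for the thresholds that are not given here. With the assumed patch size and a 50% overlap, the sketch yields 10 × 7 = 70 patches for a 5616×3744 px image, which agrees with the count quoted above.

```python
def patch_origins(length, patch, overlap):
    """Start offsets of patches along one axis.

    The last patch is shifted back so that it lies fully inside the image.
    """
    stride = int(patch * (1 - overlap))
    origins = []
    start = 0
    while True:
        if start + patch >= length:
            # Shift the final patch back into the image window.
            origins.append(max(length - patch, 0))
            break
        origins.append(start)
        start += stride
    return origins


def patch_grid(width, height, patch=1024, overlap=0.5):
    """Top-left corners (x, y) of all patches; patch=1024 is an assumed size."""
    return [
        (x, y)
        for y in patch_origins(height, patch, overlap)
        for x in patch_origins(width, patch, overlap)
    ]


def part_attribute(part_area, full_area, keep_ratio, ignore_ratio):
    """Label a clipped vehicle part from its share of the full vehicle area.

    keep_ratio and ignore_ratio are hypothetical thresholds supplied by the caller.
    """
    ratio = part_area / full_area
    if ratio >= keep_ratio:
        return "unchanged"
    if ratio >= ignore_ratio:
        return "difficult"
    return "ignored"
```

For example, `len(patch_grid(5616, 3744))` evaluates to 70 under the assumed settings.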
III-B Horizontal Bounding Boxes (HBB) baselines
We generate the ground truth for HBB by calculating the minimum and maximum of the $x$ and $y$ coordinates in the original rotated bounding box ground truth. We train the baseline algorithms with their default settings and hyper-parameters for a fair comparison. Table LABEL:tab:ResultsHbbandRBB shows the HBB detection results, which indicate how challenging this dataset is for state-of-the-art methods, with Cascade Mask-RCNN achieving the best performance of 39.29% mAP. SSD and YOLOv3 perform considerably worse than the others. This could be due to the random crops during data augmentation, as suggested by [Xia2017DOTA]. Furthermore, the results reflect a considerable difference between ground-level and aerial objects concerning their size, scale, and appearance.
III-C Rotated Bounding Boxes (RBB) baselines
Since most state-of-the-art algorithms are designed for non-oriented objects, applying them directly to oriented objects is not efficient, which makes benchmarking the existing algorithms on RBB challenging. We select and modify the Cascade Mask-RCNN [cai2019cascade] algorithm for predicting rotated bounding boxes, due to its accuracy on the HBB task of the EAGLE dataset. We train the remaining algorithms on the HBB annotations of our dataset and test them on the RBB annotations. Cascade Mask-RCNN is composed of one region proposal network (RPN) and three detection and segmentation heads with thresholds . While the RBB ground truth is defined by vertices, the RPN generates horizontal rectangles denoted by their top-left (TL) and bottom-right (BR) vertices $(x_{TL}, y_{TL}, x_{BR}, y_{BR})$. Therefore, we adapt the ground truth to rectangles by $x_{TL} = \min_i x_i$, $y_{TL} = \min_i y_i$, $x_{BR} = \max_i x_i$, and $y_{BR} = \max_i y_i$, similar to [Xia2017DOTA]. An alternative would be using a rotated RPN as mentioned in [azimi2018towards]. However, we try to preserve the structure of the algorithm as much as possible. In the detection heads, the output target for each RoI and its ground truth are defined as $t_{x_i} = (g_{x_i} - v_{x_i})/w$ and $t_{y_i} = (g_{y_i} - v_{y_i})/h$, where and , similar to [liao2018textboxes++]. We use the vertex coordinates of each ground truth as the object mask to prepare the mask for the segmentation head. Table LABEL:tab:ResultsHbbandRBB shows the results of the modified Cascade Mask-RCNN trained and tested on RBB compared with other baselines trained on HBB and tested against the RBB ground truth. We denote the modified method as Cascade Mask-RCNN-Rotated. The results show that by adapting the algorithm to rotated bounding box detection, we can achieve an improvement of about 7% mAP. They also indicate that the RBB task is more difficult than the general HBB task.
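The vertex regression targets above can be sketched as an encode/decode pair. The sketch assumes that $v_i$ are the four corners of the horizontal RoI and that $w$ and $h$ are the RoI width and height; this reading follows the usual TextBoxes++-style parameterization and is an assumption, as are the function names.

```python
def roi_corners(roi):
    """Corners of a horizontal RoI (x1, y1, x2, y2) in TL, TR, BR, BL order."""
    x1, y1, x2, y2 = roi
    return [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]


def encode_vertex_targets(gt_vertices, roi):
    """t_xi = (g_xi - v_xi) / w and t_yi = (g_yi - v_yi) / h for each vertex i.

    Assumption: v_i are the RoI corners and w, h are the RoI width and height.
    """
    x1, y1, x2, y2 = roi
    w, h = x2 - x1, y2 - y1
    return [
        ((gx - vx) / w, (gy - vy) / h)
        for (gx, gy), (vx, vy) in zip(gt_vertices, roi_corners(roi))
    ]


def decode_vertex_targets(targets, roi):
    """Invert the encoding: recover vertex coordinates from predicted offsets."""
    x1, y1, x2, y2 = roi
    w, h = x2 - x1, y2 - y1
    return [
        (vx + tx * w, vy + ty * h)
        for (tx, ty), (vx, vy) in zip(targets, roi_corners(roi))
    ]
```

Normalizing by the RoI size keeps the targets in a similar numeric range for small and large vehicles, which is the usual motivation for this kind of parameterization.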
III-D Oriented Bounding Boxes (OBB) baselines
For the benchmark based on OBB, we modify the detection heads of Cascade Mask-RCNN to predict the bounding box angles, and denote it as Cascade Mask-RCNN-Oriented. To this end, we regress over instead of . Other possibilities are regression over , or considering the clockwise order of bounding box vertices. The angle regression is defined as $t_\theta = \tan(g_\theta - v_\theta)$, where the $\tan$ function is used to ensure the periodicity of the angle regression, but other regression approaches can be considered. Similar to the Fast-RCNN [fastrcnn] algorithm, we use the smooth L1 loss for bounding box regression and cross-entropy loss for classification. We evaluate the performance of the algorithm on the OBB task by comparing the center coordinates, angle, width, and height of the predicted oriented bounding box. For orientation estimation, we divide the angles in the range of into 16 output bins and consider an angle prediction to be correct if it falls into the same bin as the ground truth. Cascade Mask-RCNN-Oriented achieves 43.87% mAP, with 59.45% AP and 28.29% AP for small and large vehicles respectively, and an angle accuracy of 67.34%.
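The angle target and the bin-based orientation accuracy can be sketched as follows. The angular range is an assumption: the sketch takes a full circle of 360 degrees, so that each of the 16 bins spans 22.5 degrees; the function names are hypothetical.

```python
import math


def angle_target(g_theta_deg, v_theta_deg):
    """Regression target t_theta = tan(g_theta - v_theta), angles in degrees.

    Note that tan has a period of 180 degrees.
    """
    return math.tan(math.radians(g_theta_deg - v_theta_deg))


def angle_bin(angle_deg, num_bins=16, full_range=360.0):
    """Index of the bin containing the angle (full_range=360 is an assumption)."""
    index = int((angle_deg % full_range) // (full_range / num_bins))
    # Guard against floating-point wrap-around at the upper edge of the range.
    return index % num_bins


def angle_accuracy(predicted_deg, ground_truth_deg, num_bins=16):
    """Share of predictions that fall into the same bin as the ground truth."""
    hits = sum(
        angle_bin(p, num_bins) == angle_bin(g, num_bins)
        for p, g in zip(predicted_deg, ground_truth_deg)
    )
    return hits / len(ground_truth_deg)
```

A consequence of the bin criterion is that two angles only a few degrees apart can still count as a miss when they straddle a bin boundary, while angles up to one bin width apart count as a hit when they share a bin.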
III-E Experimental analysis
By analyzing the results shown in Table LABEL:tab:ResultsHbbandRBB, we observe that HBB detection is still challenging with respect to very small objects, densely crowded regions, and occlusions in aerial images. In Figure 7, we provide a comparison of small- and large-vehicle detection results for HBB, RBB, and OBB. As shown in Figure 7, in areas where vehicles are parked tightly, HBB is less accurate than RBB and OBB in the precise localization of vehicles, because several detection results are suppressed by NMS and other post-processing steps. Furthermore, we see that some vehicles do not have right-angle detections in the RBB task, leading to localization mistakes, while OBB does not have this issue, resulting in better performance. Therefore, OBB is the more accurate representation for oriented object detection in aerial images. As for false positives, some non-vehicle objects appear similar to vehicles and confuse the detectors, as shown in the left column of Figure 7, which contains false positives on roofs. Also, in the RBB results in the middle column, a trash bin was detected as a small vehicle. The lower accuracy of the detector on large vehicles compared to small vehicles is due to the higher number of small-vehicle instances, which leads to an unbalanced dataset. Also, in highly dense areas, the results of both RBB and OBB are not satisfactory, implying the high difficulty of this task.
III-F Impact of data-related factors on the performance
A smaller GSD is already known to improve performance drastically [shermeyer2019effects, azimi2018towards], but it requires very-high-resolution image acquisition, which may not always be possible. Smaller object size and scale can also degrade performance. The segmentation of objects down to 2 px wide at different scales has already been demonstrated [azimi2019skyscapes]. Experiments on EAGLE indicate that other challenges such as low illumination, haze, shadow, and occlusion are critical factors preventing state-of-the-art object detectors from performing well. EAGLE will support future work aiming at solving these real-world issues.
III-G Cross-dataset validation
We perform a cross-dataset experiment to evaluate the generalization capability of the EAGLE dataset. We select DOTA for comparison and use its validation set for testing. We choose Cascade Mask-RCNN for the validation experiments with HBB ground truth. Table LABEL:tab:Resultscrossvalidation shows that a model trained on EAGLE generalizes well to DOTA, scoring only 6% mAP below a model trained on DOTA, indicating that EAGLE contains features of DOTA to a large extent. Moreover, as the annotation quality in EAGLE is significantly higher than in DOTA, especially with respect to very small vehicles (as mentioned in Section II-A), a portion of the false positives in this comparison is due to the detection of vehicles which are generally not annotated and ignored in DOTA because of their small size. Conversely, the model trained on DOTA only achieves 28.23% mAP on EAGLE (11% mAP below the model trained on EAGLE), reflecting that EAGLE is significantly more diverse and challenging than the currently available datasets, which makes it appropriate for real-world vehicle detection scenarios.
We present EAGLE, a large-scale dataset for the task of vehicle detection in aerial imagery, which is multiple times larger than existing datasets. Unlike common object detection datasets, we provide a high number of annotated instances with oriented bounding boxes. We build a dataset specifically focusing on real-world scenarios, covering a variety of conditions in aerial photography such as time, weather, and location. Detecting vehicles in any situation, regardless of their size, appearance, and orientation, provides useful information for many practical applications. Our benchmarks show that EAGLE is a very challenging dataset for the current state-of-the-art object detection algorithms. We also showcase a general object detection method which can be modified to detect oriented objects. We believe EAGLE brings the task of vehicle detection in remote sensing closer to practical use. It also introduces interesting challenges to the object detection domain in computer vision.
We thank Ternow AI GmbH for the data labeling support.