Unmanned aerial vehicles (UAVs) are extensively used as flying sensor platforms in domains such as traffic surveillance, urban environment management, package delivery, and aerial cinematography. For these applications, UAVs are equipped with mounted cameras and mainly gather visual data of the environment. Computer vision algorithms are then applied to the aerial visual data to extract high-level information about the environment.
Object detection is one of the most studied problems in computer vision. Recent advances in deep learning (mainly variants of convolutional neural networks (CNNs)), together with the availability of large datasets and computing power, have led to breakthrough object detection performance. Since these methods require a large number of training samples, several datasets (e.g., COCO, Pascal VOC) have been introduced as benchmarks for the object detection task. The samples in these datasets are natural images that are mainly captured by handheld cameras. The significant differences between natural and aerial images (such as object layouts and sizes) cause these object detectors to have trouble finding objects in aerial images. Therefore, several datasets (e.g., [29, 7, 13, 24, 19, 6, 14]) have been introduced in recent years as benchmarks for object detection in aerial images.
Besides the visual data gathered by a camera, data from other sensors might give crucial information about the environment. Using UAVs only as flying cameras cuts off potential advances in multi-modal object detection algorithms for aerial applications. For instance, recent advances in perception for autonomous driving have brought new datasets such as [12, 4, 5] that include multi-modal data (e.g., RGB images, Global Positioning System (GPS) coordinates, inertial measurement unit (IMU) data). Although data fusion for object detection is still an open research topic, these multi-modal datasets provide a benchmark for further research. However, to the best of our knowledge, there is no such multi-modal dataset collected in a real-world outdoor environment for UAVs.
In this work, we present a multi-modal UAV dataset (the AU-AIR dataset) in order to push forward the development of computer vision and robotic algorithms targeted at autonomous aerial surveillance. The AU-AIR dataset brings together vision and robotics for UAVs by providing multi-modal data from different on-board sensors. The dataset consists of 8 video streams (over 2 hours in total) for traffic surveillance. The videos are mainly recorded at the Skejby Nordlandsvej and P.O Pedersensvej roads (Aarhus, Denmark). The dataset includes aerial videos, time, GPS coordinates and the altitude of the UAV, IMU data, and the velocity. The videos are recorded at flight altitudes from 5 meters to 30 meters and at camera angles from 45 degrees to 90 degrees (i.e., complete bird-view images in which the camera is perpendicular to the Earth). Instances belonging to different object categories related to the traffic surveillance context are annotated with bounding boxes in video frames. Moreover, each extracted video frame is labeled with the flight data (see Fig. 1).
The whole dataset includes 32,823 labeled video frames with object annotations and the corresponding flight data. Eight object categories are annotated including person, car, van, truck, motorbike, bike, bus, trailer. The total number of annotated instances is 132,034. The dataset is split into 30,000 training-validation samples and 2,823 test samples.
In this work, we emphasize differences between aerial and natural images in the context of object detection tasks. To this end, we compare image samples and object instances between the AU-AIR dataset and the COCO dataset. In our experiments, we train and evaluate two mobile object detectors (YOLOv3-Tiny and MobileNetv2-SSDLite) on the AU-AIR dataset. We form the baseline from mobile object detectors since we focus on real-time performance and on the applicability of object detection on onboard computers mounted on a UAV.
I-A Related Work
In recent years, several drone datasets have been introduced for object detection tasks ([29, 7, 13, 24, 19, 6, 14]). Zhu et al. propose a UAV dataset (VisDrone) consisting of visual data and object annotations in images and frames. In the VisDrone dataset, object instances belonging to certain categories are annotated with bounding boxes and category labels. Besides object annotations, VisDrone includes some vision-related attributes such as the visibility of a scene and occlusion status. Du et al. propose a benchmark dataset for object detection and tracking in aerial images. The dataset also includes meta information regarding the flight altitude. Hsieh et al. propose a UAV-based counting dataset (CARPK) including object instances that belong to the car category. Robicquet et al. introduce a UAV dataset (Stanford) that collects images and videos of six types of objects in the Stanford campus area. In this dataset, some object categories dominate with a high number of samples, whereas the remaining categories have significantly fewer instances. Mueller et al. propose a synthetic dataset created by a simulator for target tracking with a UAV. Collins et al. introduce a benchmarking website (VIVID) with an evaluation dataset collected under the DARPA VIVID program. Krajewski et al. propose an aerial dataset collected from highways, including object bounding boxes and labels of vehicles.
These datasets are annotated with common objects in an environment, such as humans and different types of vehicles (e.g., car, bike, van). However, they include only visual data and bounding box annotations for objects and discard other sensory data. Among these studies, only UAVDT includes an attribute that gives limited information about the flight altitude (i.e., labels such as “low-level”, “mid-level” and “high-level”).
Fonder et al. propose a synthetic dataset (Mid-Air) for low-altitude drone flights in unstructured environments (e.g., forest, country). It includes multi-modal data regarding the flight (e.g., visual, GPS, IMU data) without any annotations for visual data.
There are also multi-modal drone datasets in the literature ([10, 1, 3, 17, 27]). However, the visual data are not collected for object detection since the main focus of these studies is UAV navigation. Therefore, these datasets do not have object annotations. The comparison of existing datasets is given in Table I.
Considering the summary of the existing studies in Table I, the main contributions of this work are the following:
To the best of our knowledge, the AU-AIR dataset is the first multi-modal UAV dataset for object detection. The dataset includes flight data (i.e., time, GPS, altitude, IMU data) in addition to visual data and object annotations.
Considering real-time applicability, we form a baseline by training and testing mobile object detectors on the AU-AIR dataset. We emphasize the differences between object detection in aerial images and natural images.
|Dataset|Environment|Data type|Visual data|Object annotations|Time|GPS|Altitude|Velocity|IMU data|
|EuRoC MAV|indoor|real|yes|no|yes|yes|yes|yes|yes|
|Zurich Urban MAV|outdoor|real|yes|no|yes|yes|yes|yes|yes|
|UPenn Fast Flight|outdoor|real|yes|no|yes|yes|yes|yes|yes|
II Object Detection in Natural Images vs Aerial Images
The availability of large amounts of data and processing power enables deep neural networks to achieve state-of-the-art results for object detection. Currently, deep learning-based object detectors fall into two groups. The first group consists of region-based CNNs that build on image classifiers. Region-based CNNs propose image regions that are likely to contain an object and classify each region into a predefined object category. The second group has only one stage, which converts the object detection problem into bounding box prediction for objects without re-purposing image classifiers. Faster R-CNN is one of the well-known models belonging to the first group, while YOLO and SSD are popular object detectors that belong to the second group.
Deep learning-based object detectors are trained and evaluated on large datasets such as COCO and PASCAL. These datasets include natural images that contain a single object or multiple objects in their natural environments. Most of the images in these datasets are captured by humans using a handheld camera, so the vast majority of images have a side view. Object detection in natural images faces challenges such as occlusion, illumination changes, rotation, low resolution, and crowded instances.
Aerial images have different characteristics from natural images due to their bird’s-eye view. First of all, objects in natural images are much larger than their counterparts in aerial images. For example, an object category such as humans may occupy a large number of pixels in natural images. However, it may occupy only a few pixels in an aerial image, which makes it quite challenging for object detectors to detect (see Fig. 2). Moreover, aerial images can be fed to a network at higher dimensions to prevent the pixels belonging to small objects from diminishing, but this increases the computational cost.
Secondly, occlusion is observed under different conditions in natural and aerial images. In natural images, an object instance may be occluded by another foreground object instance (e.g., a human in front of a car). However, objects in aerial images are less likely to be occluded by other foreground objects, especially in bird-view images captured by a camera that is perpendicular to the Earth (see Fig. 3).
Thirdly, the perspective in aerial images makes objects appear short and squat. This diminishes the information regarding an object’s height (see Fig. 4). Moreover, although aerial images can supply more contextual information about an environment through a broader view angle, the object instances may be surrounded by clutter.
Lastly, when a drone captures aerial images, altitude changes during the flight can cause variations in object size and appearance. Therefore, recording aerial videos at different altitudes may change the levels of the challenges mentioned above.
III AU-AIR – the Multi-Modal UAV Dataset
To address the challenges mentioned in Section II, we propose a multi-modal drone dataset (AU-AIR) including videos, object annotations in the extracted frames, and sensor data for the corresponding frames. The data are captured during low-altitude flights (max. 30 meters) in a traffic surveillance scenario. The AU-AIR dataset consists of video clips, sensor data, and object bounding box annotations for video frames.
III-A UAV Platform
We used a quadrotor (Parrot Bebop 2) to capture the videos and record the flight data. An on-board camera recorded the videos with a resolution of pixels at 30 frames per second (fps). The sensor data were recorded every 20 milliseconds.
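Because the sensor log (one sample every 20 milliseconds) and the video (30 fps) run at different rates, each frame has to be associated with a sensor sample at some point. The following is a minimal sketch of one plausible way to do this, nearest-timestamp matching. The source does not state how the association was actually performed, so the function name `nearest_sample_index` and the millisecond timestamps are assumptions made for illustration only.

```python
from bisect import bisect_left


def nearest_sample_index(sample_times_ms, frame_time_ms):
    """Return the index of the sensor sample closest in time to a frame.

    sample_times_ms must be sorted in ascending order. This is an assumed
    matching rule, not the procedure documented for the dataset.
    """
    pos = bisect_left(sample_times_ms, frame_time_ms)
    if pos == 0:
        return 0
    if pos == len(sample_times_ms):
        return len(sample_times_ms) - 1
    before = sample_times_ms[pos - 1]
    after = sample_times_ms[pos]
    # Ties are resolved in favor of the earlier sample.
    return pos if (after - frame_time_ms) < (frame_time_ms - before) else pos - 1


# One second of sensor samples, one every 20 ms: 0, 20, ..., 980.
sample_times = list(range(0, 1000, 20))

# The second video frame at 30 fps arrives at roughly 33.3 ms.
index = nearest_sample_index(sample_times, 1000 / 30)
print(index, sample_times[index])  # 2 40
```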
III-B Dataset Collection
The AU-AIR dataset consists of 8 video clips (approximately 2 hours in total) with 32,823 extracted frames. All videos are recorded for a scenario of aerial traffic surveillance at the intersection of Skejby Nordlandsvej and P.O Pedersensvej (Aarhus, Denmark) on windless days. Moreover, the videos cover various lighting conditions due to the time of day and the weather conditions (e.g., sunny, partly sunny, cloudy).
Capturing aerial video with a UAV brings challenges for visual surveillance that differ significantly from those of natural images. To include these challenges in our dataset, we captured the videos at different flight altitudes and camera angles. The flight altitude varies between 10 meters and 30 meters in the videos, and the camera angle is adjusted from 45 degrees to 90 degrees (perpendicular to the Earth). An increase in the camera angle makes the object detection task more challenging since the images differ more from natural images.
Although the videos were recorded at 30 fps, we extracted five frames per second in order to avoid redundant frames. Both the raw videos and the extracted frames have a resolution of pixels.
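Reducing 30 fps video to five frames per second amounts to keeping one frame out of every six. The sketch below illustrates this with a uniform stride. The function name `subsample_frame_indices` is hypothetical, and the uniform stride is an assumption, since the source does not name the extraction tool or the exact frames chosen.

```python
def subsample_frame_indices(total_frames, source_fps=30, target_fps=5):
    """Return the indices of frames kept when reducing source_fps to target_fps.

    Assumes a uniform stride, which requires source_fps to be a multiple
    of target_fps (30 / 5 = 6 in the setting described above).
    """
    if source_fps % target_fps != 0:
        raise ValueError("this sketch assumes an integer stride")
    stride = source_fps // target_fps
    return list(range(0, total_frames, stride))


# Two seconds of 30 fps video yield ten kept frames.
kept = subsample_frame_indices(total_frames=60)
print(kept)  # [0, 6, 12, 18, 24, 30, 36, 42, 48, 54]
```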
III-C Visual Data and Annotation
Considering a traffic surveillance scenario, we manually annotated specific object categories in the frames. For annotation, we used a bounding box and an object category index for each instance. The annotated object categories include eight types of objects that occur frequently in traffic surveillance: person, car, bus, van, truck, bike, motorbike, and trailer.
For annotation, we employed workers on Amazon’s Mechanical Turk (AMT). To increase the labeling quality, three workers annotated each frame separately. We then combined annotations that had the same object label and whose bounding boxes overlapped by more than a certain threshold, which we set to 0.75 experimentally. When this condition was not satisfied, we manually fine-tuned the bounding boxes and class labels. The category distribution over the dataset can be seen in Fig. 5. In the context of traffic surveillance, cars appear significantly more often than other classes, and three vehicle types (car, van, truck) account for the major portion of annotated bounding boxes.
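The agreement check between workers can be sketched as follows. The text specifies only an overlap threshold of 0.75, so this sketch makes two assumptions: that overlap is measured by intersection over union (IoU), and that agreeing boxes are merged by coordinate-wise averaging. The names `iou`, `annotations_agree`, and `merge_boxes` are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def annotations_agree(ann_a, ann_b, threshold=0.75):
    """Each annotation is (label, box). Agreement requires equal labels
    and an overlap above the threshold (IoU is an assumed overlap measure)."""
    return ann_a[0] == ann_b[0] and iou(ann_a[1], ann_b[1]) > threshold


def merge_boxes(boxes):
    """Assumed merge rule: coordinate-wise mean of the agreeing boxes."""
    count = len(boxes)
    return tuple(sum(box[i] for box in boxes) / count for i in range(4))


worker_1 = ("car", (0, 0, 10, 10))
worker_2 = ("car", (0, 0, 10, 9))
if annotations_agree(worker_1, worker_2):
    print(merge_boxes([worker_1[1], worker_2[1]]))  # (0.0, 0.0, 10.0, 9.5)
```

Pairs that fail this check would fall through to the manual correction step described above.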
The AU-AIR dataset includes frames that are captured at different flight altitudes (see Fig. 6). We recorded the data mainly at 10 meters, 20 meters, and 30 meters with camera angles from 45 degrees to 90 degrees.
III-D Sensor Data
In addition to visual data and object annotations, the AU-AIR dataset includes sensor data that are logged during the video recording. In the dataset, we have the following attributes for each extracted frame:
: current date of a frame
: current time stamp of the frame
: latitude of the UAV (read from GPS sensor)
: longitude of the UAV (read from GPS sensor)
: altitude of the UAV (read from altimeter)
: UAV roll angle (rotation around the x axis) (read from IMU sensor)
: UAV pitch angle (rotation around the y axis) (read from IMU sensor)
: UAV yaw angle (rotation around the z axis) (read from IMU sensor)
: speed on the x axis
: speed on the y axis
: speed on the z axis
Table II shows the units and value ranges for each attribute except the date. The date has the format MMDDYYYY-HHMMSS, where MM, DD, YYYY, HH, MM, and SS indicate the month, day, year, hour, minute, and second, respectively.
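As an illustration of the MMDDYYYY-HHMMSS format, the sketch below parses such a string with Python's standard `datetime.strptime`. The function name `parse_frame_date` and the sample value are made up for this example and are not taken from the dataset.

```python
from datetime import datetime


def parse_frame_date(stamp):
    """Parse a date string in MMDDYYYY-HHMMSS format into a datetime."""
    return datetime.strptime(stamp, "%m%d%Y-%H%M%S")


# Made-up example value: 15 March 2019, 14:25:30.
parsed = parse_frame_date("03152019-142530")
print(parsed.isoformat())  # 2019-03-15T14:25:30
```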
|Data|Unit|Min. value|Max. value|
The velocities and rotation angles are calculated according to the UAV body frame given in Fig. 7.
IV Evaluation and Analysis
We train and evaluate mobile object detectors with our dataset. During the evaluation, we prioritize real-time performance over state-of-the-art accuracy for the sake of applicability. Therefore, we choose two mobile object detectors (YOLOv3-Tiny and MobileNetv2-SSDLite), which offer a reasonable trade-off between detection accuracy and inference time.
IV-A Baseline networks
We configure YOLOv3-Tiny and MobileNetv2-SSDLite for benchmarking using the default parameters (e.g., learning rate, input size) suggested in the original papers. We use the models that are trained on the COCO dataset as backbones.
We split the AU-AIR dataset into 60% training, 10% validation, and 30% testing samples. The object detectors are adapted to the number of classes in the AU-AIR dataset (8 classes in total) by changing their last layers.
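A 60/10/30 partition of frame indices can be sketched as below. The source does not say whether the split was random, sequential, or made per video, so the seeded random shuffle here is purely an assumption, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_samples, train=0.6, val=0.1, seed=0):
    """Partition sample indices into train, validation, and test lists.

    The test set receives whatever remains after the train and validation
    shares. The random shuffle is an assumed strategy for illustration.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_train = int(n_samples * train)
    n_val = int(n_samples * val)
    train_ids = indices[:n_train]
    val_ids = indices[n_train:n_train + n_val]
    test_ids = indices[n_train + n_val:]
    return train_ids, val_ids, test_ids


train_ids, val_ids, test_ids = split_indices(32823)
print(len(train_ids), len(val_ids), len(test_ids))
```

If consecutive frames from one video are highly similar, splitting by video rather than by frame may give a stricter test; that choice is left open here.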
IV-B Comparison Metrics
To compare detection performance, we use mean average precision (mAP), a prominent metric in object detection [15, 8]. It is the mean of the average precision (AP) values, each of which averages the precision for an object category at discretized recall values from 0 to 1. We consider 11 different recall values and an intersection over union (IoU) threshold of 0.5.
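The 11-point AP computation can be sketched as follows, assuming the commonly used interpolated form in which the precision at each recall level is the maximum precision observed at any recall greater than or equal to that level. The input is a list of points on a precision-recall curve that has already been computed at the IoU threshold of 0.5. The function names are hypothetical, and the numbers in the example are made up rather than taken from Table III.

```python
def eleven_point_ap(recalls, precisions):
    """11-point interpolated average precision for one object category.

    recalls and precisions are parallel lists describing points on the
    precision-recall curve. Recall levels are 0.0, 0.1, ..., 1.0.
    """
    total = 0.0
    for step in range(11):
        level = step / 10
        # Interpolated precision: best precision at recall >= level.
        candidates = [p for r, p in zip(recalls, precisions) if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / 11


def mean_average_precision(ap_per_class):
    """Mean of the per-category AP values."""
    return sum(ap_per_class.values()) / len(ap_per_class)


# Made-up curve: perfect precision up to recall 0.5, nothing found beyond.
ap = eleven_point_ap(recalls=[0.5], precisions=[1.0])
print(round(ap, 4))  # 0.5455
```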
For benchmarking, we train YOLOv3-Tiny and MobileNetv2-SSDLite on the AU-AIR dataset. We use a batch size of 32 and the Adam optimizer with the default parameters (alpha=0.001, beta1=0.9, beta2=0.999). Training is stopped when the validation error starts to increase. Both networks are pre-trained on the COCO dataset. To compare training on an aerial dataset with training on a natural image dataset, we also evaluate YOLOv3-Tiny and MobileNetv2-SSDLite trained on the COCO dataset without further training on the AU-AIR dataset. The results are given in Table III.
As shown in Table III, the networks trained only on the COCO dataset have poor results. This is expected since the characteristics of aerial images are significantly different from those of natural images.
We observe that the AP values of the motorbike and bicycle categories are significantly lower than those of the other categories. This may be due to the class imbalance problem and the small object sizes of these categories. In contrast, the bus category has the highest AP value, although there are fewer bus instances. This might result from the large size of bus instances in the frames. Furthermore, although human instances are usually as small as motorbikes and bicycles, the AP values of the human category are relatively higher than those of these classes. This might be a consequence of the high number of human instances. There are no AP values available for the van and trailer categories in Table III since they do not exist in the COCO dataset.
The baselines trained on the AU-AIR dataset are good at finding objects in aerial images that are captured at different altitudes and view angles. Qualitative results can be seen in Fig. 8.
Among the baselines, YOLOv3-Tiny has higher AP values and mAP value compared to MobileNetv2-SSDLite. There is no significant difference between inference times (17.5 FPS and 17 FPS for YOLOv3-Tiny and MobileNetv2-SSDLite on TX2, respectively).
Since the number of instances of each object category is imbalanced in the AU-AIR dataset (Fig. 5), we consider several methods to address the class imbalance problem in the next version of the dataset. As a first step, we will try to collect more data to balance the number of instances. Besides, we may consider adding synthetic data (i.e., changing the brightness of images, translation, rotation) to increase the number of samples for object categories that have few samples in the current version.
We use AMT to annotate objects in images. Although three different people annotate each image and we manually check the annotations ourselves, there might still be overlooked samples that have weak annotations (e.g., unlabelled instances, loose bounding box drawings). Therefore, we consider using the three-step workflow proposed by Su et al. In this workflow, the first worker draws a bounding box around an instance, the second worker verifies whether the bounding box is correctly drawn, and the third worker checks whether all object instances are annotated.
Unlike other UAV object detection datasets, ours includes sensor data corresponding to each frame. In this work, we give a baseline only for object annotations and visual data. As future work, more baselines may be added to encourage research using sensor data (e.g., navigation and control of a UAV, object detection using multi-modal data). Also, we can add more visual sensors, such as multi-spectral cameras.
We have used a ready-to-fly quadrotor (i.e., Parrot Bebop 2) to collect the whole dataset. We also consider collecting more samples from other platforms (e.g., different types of UAVs) using cameras that have different resolutions and frame rates.
In this dataset, traffic surveillance is the primary context. In future work, we consider increasing the number of environment contexts to increase diversity in the dataset.
In this work, we propose the AU-AIR dataset, a multi-modal UAV dataset collected in an outdoor environment. Our aim is to fill the gap between computer vision and robotics by providing a diverse range of recorded data types for UAVs. Since it includes visual data, object annotations, and flight data, it can be used in different research fields focused on data fusion.
We have emphasized the differences between natural images and aerial images that affect the object detection task. Moreover, since we consider real-time performance and applicability in real-world scenarios, we have created a baseline with two mobile object detectors from the literature (i.e., YOLOv3-Tiny and MobileNetv2-SSDLite). In our experiments, we showed that mobile networks trained on natural images have trouble detecting objects in aerial images.
- [1] (2018) The Blackbird dataset: a large-scale dataset for UAV perception in aggressive flight. arXiv preprint arXiv:1810.01987.
- [2] Autonomous aerial cinematography in unstructured environments with learned artistic decision-making. Journal of Field Robotics.
- [3] (2016) The EuRoC micro aerial vehicle datasets. The International Journal of Robotics Research 35 (10), pp. 1157–1163.
- [4] (2019) nuScenes: a multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027.
- [5] (2018) KAIST multi-spectral day/night data set for autonomous and assisted driving. IEEE Transactions on Intelligent Transportation Systems 19 (3), pp. 934–948.
- [6] An open source tracking testbed and evaluation web site. In IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, Vol. 2, pp. 35.
- [7] (2018) The unmanned aerial vehicle benchmark: object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 370–386.
- [8] (2010) The Pascal visual object classes (VOC) challenge. International Journal of Computer Vision 88 (2), pp. 303–338.
- [9] (2019) Deep multi-modal object detection and semantic segmentation for autonomous driving: datasets, methods, and challenges. arXiv preprint arXiv:1902.07830.
- [10] Mid-Air: a multi-modal dataset for extremely low altitude drone flights. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops.
- [11] (2016) Drones to manage the urban environment: risks, rewards, alternatives. Journal of Unmanned Vehicle Systems 4 (2), pp. 115–124.
- [12] (2013) Vision meets robotics: the KITTI dataset. The International Journal of Robotics Research 32 (11), pp. 1231–1237.
- [13] (2017) Drone-based object counting by spatially regularized regional proposal network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4145–4153.
- [14] (2018) The highD dataset: a drone dataset of naturalistic vehicle trajectories on German highways for validation of highly automated driving systems. In 2018 IEEE 21st International Conference on Intelligent Transportation Systems (ITSC).
- [15] (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
- [16] (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
- [17] (2017) The Zurich urban micro aerial vehicle dataset. The International Journal of Robotics Research 36 (3), pp. 269–273.
- [18] (2019) A constrained instantaneous learning approach for aerial package delivery robots: onboard implementation and experimental results. Autonomous Robots 43 (8), pp. 2209–2228.
- [19] (2016) A benchmark and simulator for UAV tracking. In European Conference on Computer Vision, pp. 445–461.
- [20] (2005) A survey of unmanned aerial vehicles (UAV) for traffic surveillance. Department of Computer Science and Engineering, University of South Florida, pp. 1–29.
- [21] (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788.
- [22] (2018) YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767.
- [23] (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
- [24] (2016) Learning social etiquette: human trajectory understanding in crowded scenes. In European Conference on Computer Vision, pp. 549–565.
- [25] (2018) MobileNetV2: inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520.
- [26] Crowdsourcing annotations for visual object detection. In Workshops at the Twenty-Sixth AAAI Conference on Artificial Intelligence.
- [27] (2018) Robust stereo visual inertial odometry for fast autonomous flight. IEEE Robotics and Automation Letters 3 (2), pp. 965–972.
- [28] (2012) Amazon Mechanical Turk. Retrieved August 17, 2012.
- [29] (2018) Vision meets drones: a challenge. arXiv preprint arXiv:1804.07437.