Unmanned Aerial Vehicles (UAVs) equipped with cameras have emerged as an important asset in a wide range of fields, such as agriculture, delivery, surveillance, and search and rescue (SAR) missions [1, 44, 20]. In particular, UAVs are capable of assisting in SAR missions due to their fast and versatile deployability while providing an overview of the scene [35, 24, 4]. Especially in maritime scenarios, where wide areas need to be quickly overseen and searched, the efficient use of autonomous UAVs is crucial. Among the most challenging issues in this application scenario is the detection, localization, and tracking of people in open water [19, 38]. The small size of people relative to search radii and the variability in viewing angles and altitudes require robust vision-based systems.
Currently, these systems are implemented via data-driven methods such as deep neural networks. These methods depend on large-scale data sets portraying real-case scenarios to obtain realistic imagery statistics. However, large-scale data sets in maritime environments are severely lacking. Most data sets captured from UAVs are land-based, often focusing on traffic environments, such as VisDrone and UAVDT. Many of the few data sets that are captured in maritime environments fall into the category of remote sensing, often leveraging satellite-based synthetic aperture radar. All of these are only valuable for ship detection, as they do not provide the resolution needed for SAR missions. Furthermore, satellite-based imagery is susceptible to clouds and only provides top-down views. Finally, many current approaches in the maritime setting rely on classical machine learning methods, which are incapable of dealing with the large number of influencing variables and call for more elaborate models.
This work aims to close the gap between large-scale land-based data sets captured from UAVs and maritime-based data sets. We introduce a large-scale data set of people in open water, called SeaDronesSee. We captured videos and images of swimming test subjects in open water with various UAVs and cameras. As it is especially critical in SAR missions to detect and track objects from a large distance, we captured the RGB footage at resolutions from 3,840×2,160 px to 5,456×3,632 px. We carefully annotated ground-truth labels for objects of interest including swimmer, swimmer with life jacket, life jacket, person on boat, person with life jacket on boat, and boat.
Moreover, we note that current data sets captured from UAVs provide only very coarse or no meta information at all. We argue that this is a major impediment in the development of multi-modal systems, which take this additional information into account to improve accuracy or speed, see [25, 33]. Therefore, we provide precise meta information for every frame and image including altitude, camera angle, speed, time, and others. We also made sure that the data set is balanced with respect to this meta information.
In maritime settings, the use of multi-spectral cameras with Near Infrared channels to detect humans can be advantageous. For that reason, we also captured multi-spectral images using a MicaSense RedEdge. This enables the development of detectors that take into account the non-visible light spectra Near Infrared (842 nm) and Red Edge (717 nm).
Finally, we provide detailed statistics of the data set and conduct extensive experiments using state-of-the-art models, hereby establishing baseline models. These serve as a starting point for our SeaDronesSee benchmark. We release the training and validation sets with complete ground truth, but only the test set's videos/images. The ground truth of the test set is used by the benchmark server to calculate the generalization power of the models. We set up an evaluation web page where researchers can upload their predictions and opt to publish their results on a central leaderboard such that transparent comparisons are possible. The benchmark focuses on three tasks: (i) object detection, (ii) single-object tracking, and (iii) multi-object tracking, which will be explained in more detail in the subsequent sections.
Our main contributions are as follows:
To the best of our knowledge, SeaDronesSee is the first large annotated UAV-based data set of swimmers in open water. It can be used to further develop detectors and trackers for SAR missions.
We provide full environmental meta information for every frame, making SeaDronesSee the first UAV-based data set of that nature.
We provide an evaluation server to prevent researchers from overfitting and to allow for fair comparisons.
We perform extensive experiments on state-of-the-art object detectors and trackers on our data set.
|Object detection|env.|platform|image widths|altitude range|angle range|other meta|
|---|---|---|---|---|---|---|
|UAVDT|traffic|UAV|1,024|✕ (5-200 m*)|✕ (*)|✕|
|VisDrone|traffic|UAV|960-2,000|✕ (5-200 m*)|✕ (*)|✕|
|Airbus Ship|maritime|satellite|—|—|—|—|

|Single-object tracking|env.|#clips|frame widths|altitude range|angle range|other meta|
|---|---|---|---|---|---|---|

|Multi-object tracking|env.|#frames|frame widths|altitude range|angle range|other meta|
|---|---|---|---|---|---|---|
|UAVDT|traffic|40.7 k|1,024|✕ (5-200 m*)|✕ (*)|✓|
|VisDrone|traffic|40 k|960-2,000|✕ (5-200 m*)|✕ (*)|✓|
|SeaDronesSee (ours)|maritime|54 k|3,840|✓ (5-150 m)|✓|✓|
Comparison with the most prominent annotated aerial data sets. 'altitude range' and 'angle range' indicate whether precise altitude and viewing-angle information is available. 'other meta' refers to time stamps, GPS, and IMU data and, in the case of object tracking, can also mean attribute information about the sequences. The values with stars have been estimated based on ground-truth bounding box sizes and corresponding real-world object sizes (for altitude) and qualitative inspection of sample images (for angle). For DOTA and Airbus Ship, the range of altitudes is not available because these are satellite-based data sets.
2 Related Work
In this section, we review major labeled data sets in the field of computer vision from UAVs and in maritime scenarios which are usable for supervised learning models.
2.1 Labeled Data Sets Captured from UAVs
Over the last few years, quite a few data sets captured from UAVs have been published. The most prominent are those that depict traffic situations, such as VisDrone  and UAVDT . Both data sets focus on object detection and object tracking in unconstrained environments. Pei  collect videos (Stanford Drone Dataset) showing traffic participants on campuses (mostly people) for human trajectory prediction, also usable for object detection. UAV123  is a single-object tracking data set consisting of 123 video sequences with corresponding labels. The clips mainly show traffic scenarios and common objects. Both Hsieh  and Mundhenk  capture data sets showing parking lots for car counting tasks and constrained object detection. Li  provide a single-object tracking data set showing traffic, wildlife, and sports scenarios. Collins capture a single-object tracking data set showing vehicles on streets in rural areas. Krajewski  show vehicles on freeways.
2.2 Labeled Data Sets in Maritime Environments
Many data sets in maritime environments are captured from satellite-based synthetic aperture radar and therefore fall into the remote sensing category. In this category, the Airbus Ship data set  is the largest, featuring 40k images from synthetic aperture radars with instance segmentation labels. Li  provide a data set of ships with images mainly taken from Google Earth, but also a few UAV-based images. In , the authors provide satellite-based images of natural scenes, mainly land-based but also harbors. The work most similar to ours is . They also consider the problem of human detection in open water. However, their data mostly contains images close to shores and of swimming pools. Furthermore, it is not publicly available.
2.3 Multi-Modal Data Sets Captured from UAVs
UAVDT  provides coarse meta data for its object detection and tracking data: every frame is labeled with altitude information (low, medium, high), angle of view (front-view, side-view, bird-view), and light conditions (day, night, foggy). Wu  manually label VisDrone after its release with the same annotation information for the object detection track. Mid-Air  is a synthetic multi-modal data set with images in nature containing precise altitude, GPS, time, and velocity data but without annotated objects. Blackbird  is a real-data indoor data set for agile perception also featuring this meta information. In , street-view images with the same meta data are captured to benchmark appearance-based localization. Bozcan  release a low-altitude ( m) object detection data set containing images showing a traffic circle and provide meta data such as altitude, GPS, and velocity, but exclude the important camera angle information.
Tracking data sets often provide meta data (or attribute information) for the clips. However, in many cases these do not refer to the environmental state in which the image was captured. Instead, they abstractly describe the way in which a clip was captured: UAV123  label their clips with information such as aspect ratio change, background clutter, and fast motion, but do not provide frame-by-frame meta data. The same observation can be made for the tracking track of VisDrone . See Table 1 for an overview of annotated aerial data sets.
3 Data Set Generation
3.1 Image Data Collection
We gathered the footage on several days to obtain variance in light conditions. Taking into account safety and environmental regulations, we asked over 20 test subjects to be recorded in open water. Only subjects who met strict criteria regarding their ability to swim in open water were recruited. Small boats were rented to transport the subjects to the area of interest, where quadcopters were launched at a safe distance from the swimmers. At the same time, the fixed-wing UAV Trinity F90+ was launched from the shore. We used waypoints to enforce a strict flight schedule and maximize data collection efficiency. Care was taken to maintain strict vertical separation at all times. Subjects were free to wear life jackets, of which we provided several differently colored pieces (see also Figure 2).
To diminish the effect of camera biases within the data set, we used multiple cameras, as listed in Table 2, mounted to the following drones: DJI Matrice 100, DJI Matrice 210, DJI Mavic 2 Pro, and a Quantum Systems Trinity F90+.
|Camera|Resolution|Frame rate|
|---|---|---|
|Hasselblad L1D-20c|3,840×2,160|30 fps|
|MicaSense RedEdge-MX|1,280×960|✕|
|Zenmuse X5|3,840×2,160|30 fps|
|Zenmuse XT2|3,840×2,160|30 fps|
With the video cameras, we captured videos at 30 fps. For the object detection task, we extract at most two frames per second from these videos to avoid redundant occurrences of nearly identical frames. See Section 4 for information on the distribution of images with respect to the different cameras.
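As a sketch, selecting at most two frames per second from 30 fps footage amounts to keeping every 15th frame. This is a minimal illustration; the actual extraction pipeline is not part of the release:

```python
def subsample_frame_indices(n_frames: int, fps: int = 30, target_fps: int = 2) -> list:
    """Pick at most `target_fps` frames per second from a video recorded at `fps`."""
    step = fps // target_fps  # keep every 15th frame for 30 fps -> 2 fps
    return list(range(0, n_frames, step))

# A 90-frame (3 s) clip at 30 fps yields 6 frames:
print(subsample_frame_indices(90))  # [0, 15, 30, 45, 60, 75]
```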
Lastly, we captured top-down looking multi-spectral imagery at 1 fps. We used a MicaSense RedEdge-MX, which records five wavelengths (475 nm, 560 nm, 668 nm, 717 nm, 842 nm). Therefore, in addition to the RGB channels, the recordings also contain a RedEdge and a Near Infrared channel. The camera was referenced with a white reference before each flight. As the RedEdge-MX captures every band individually, we merge the bands using the development kit provided by MicaSense.
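Once the bands are co-registered, the merge described above amounts to stacking the five single-band rasters into one multi-spectral cube. A minimal numpy sketch with placeholder data (this is not the MicaSense development kit's actual API):

```python
import numpy as np

H, W = 960, 1280  # RedEdge-MX sensor resolution
wavelengths = [475, 560, 668, 717, 842]  # blue, green, red, Red Edge, NIR (nm)

# Hypothetical co-registered single-band images (placeholder arrays here).
band_images = [np.zeros((H, W), dtype=np.float32) for _ in wavelengths]

# Stack into a single (H, W, 5) multi-spectral cube.
cube = np.stack(band_images, axis=-1)

# The RGB channels used for the object detection task are a slice of the cube.
rgb = cube[..., [2, 1, 0]]  # red (668 nm), green (560 nm), blue (475 nm)
```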
3.2 Meta Data Collection
Every frame is accompanied by a meta stamp that is logged at 10 Hz. To align the video data (30 fps) and the time stamps, a nearest-neighbor matching was performed. The following data is logged and provided for every image/frame, read from the onboard clock, barometer, IMU, and GPS sensor, respectively:
- current date and time of capture
- relative time stamp since beginning of capture
- latitude of the UAV
- longitude of the UAV
- altitude of the UAV
- camera pitch angle (viewing angle)
- UAV roll angle
- UAV pitch angle
- UAV yaw angle
- speed along the x-axis
- speed along the y-axis
- speed along the z-axis
See Table 3 for an overview of the variables and their ranges. Note that a camera pitch angle of 90° corresponds to a top-down view and 0° to a horizontally facing camera. The date format is given in the extended form of ISO 8601. Furthermore, note that the UAV roll/pitch/yaw angles are of minor importance for meta-data-aware vision-based methods, as the onboard gimbal filters out movement by the drone such that the camera pitch angle stays roughly constant unless it is intentionally changed .
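The nearest-neighbor alignment between 30 fps frames and 10 Hz meta stamps can be sketched as follows (timestamps are illustrative; the actual log format may differ):

```python
import bisect

def nearest_meta(frame_ts: float, meta_ts: list) -> int:
    """Index of the meta stamp closest in time to a frame timestamp.
    `meta_ts` must be sorted ascending (the 10 Hz log)."""
    i = bisect.bisect_left(meta_ts, frame_ts)
    if i == 0:
        return 0
    if i == len(meta_ts):
        return len(meta_ts) - 1
    # Choose the closer of the two neighboring meta stamps.
    return i if meta_ts[i] - frame_ts < frame_ts - meta_ts[i - 1] else i - 1

meta = [0.0, 0.1, 0.2, 0.3]           # 10 Hz meta stamps (seconds)
frames = [k / 30 for k in range(10)]  # 30 fps frame timestamps
print([nearest_meta(t, meta) for t in frames])  # [0, 0, 1, 1, 1, 2, 2, 2, 3, 3]
```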
3.3 Annotation Method
Using the non-commercial labeling tool DarkLabel , we manually and carefully annotated all provided images and frames with the categories swimmer, swimmer with life jacket, life jacket, person, person with life jacket (we define a person to be on a boat), and boat. Note that the annotations are mutually exclusive. Subsequently, all annotations were checked by experts in aerial vision. We chose these classes as they are the hardest and most critical to detect in SAR missions. Furthermore, we annotated regions with other objects, such as boats on land, as ignored regions. These regions were blackened in the recordings. Our guidelines for the annotation are described in the appendix. In particular, swimmers were annotated such that the complete body including arms and legs (if visible) is within the bounding box. The bounding box format is (x, y, w, h), where x and y correspond to the upper left corner and w and h to the width and height, respectively. See Figure 2 for examples of objects.
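For illustration, converting this (x, y, w, h) format into the corner format that some detector frameworks expect is straightforward:

```python
def xywh_to_xyxy(box):
    """(x, y, w, h) with (x, y) the upper-left corner -> (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)

print(xywh_to_xyxy((10, 20, 100, 50)))  # (10, 20, 110, 70)
```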
On the left side, there are three images of swimmers with life jackets and a single image of a life jacket. The four images in the middle show swimmers in various poses. On the right, two persons on a boat are shown; the top one is wearing a life jacket.
3.4 Data Set Split
To ensure that the training, validation, and testing sets have similar statistics, we roughly balance them such that the respective subsets have similar distributions with respect to altitude and angle of view, two of the most important factors of appearance changes. Of the individual images, we randomly select portions for the training, validation, and testing sets, respectively. In addition to the individual images, we randomly cut every video into three contiguous parts and add the respective portions to the training, validation, and testing sets. This is done to avoid having subsequent frames in the training and testing sets, such that a realistic evaluation is possible. We release the training and validation sets with all annotations and the testing set's images, but withhold its annotations. Evaluation will be available via an evaluation server, where the predictions on the test set can be uploaded.
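The contiguous three-way video split can be sketched as follows; the split fractions below (50/20/30) are illustrative assumptions, not the actual values:

```python
def split_video(frame_ids, fractions=(0.5, 0.2, 0.3)):
    """Cut a frame sequence into contiguous train/val/test parts.
    `fractions` are hypothetical placeholders; the last slice absorbs
    any remainder so the parts are disjoint and cover every frame."""
    n = len(frame_ids)
    a = int(n * fractions[0])
    b = a + int(n * fractions[1])
    return frame_ids[:a], frame_ids[a:b], frame_ids[b:]

train, val, test = split_video(list(range(100)))
assert train + val + test == list(range(100))  # disjoint and exhaustive
```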
The tracking clips are based on the already described video splits, i.e., the training set consists of every recorded sequence's first part, and so on. Hence, the sets are disjoint. Both single-object tracking and multi-object tracking use the same clips. As for the object detection task, we withhold the annotations of the testing set and provide an evaluation server for both single-object and multi-object tracking.
4 Data Set Statistics
4.1 Object Detection Task
There are 5,630 images (training: 2,975; validation: 859; testing: 1,796). See Figure 3 for the distribution of images/frames with respect to cameras and the class distribution. We recorded most of the images with the L1D-20c and UMC-R10C, which have the highest resolution. We recorded only 432 images with the RedEdge-MX, which has the lowest resolution. Note that for the object detection task, only the RGB channels of the multi-spectral images are used to support a uniform data structure.
Furthermore, the class distribution is slightly skewed towards the class 'boat', since safety precautions require boats to be nearby. We emphasize that this bias can easily be diminished by blackening the respective regions, as is common for areas that are not of interest (such as boats here). After boats, swimmers with life jackets are the most common objects. We argue that this scenario is very often encountered in SAR missions. This class is often easier to detect than plain swimmers, as life jackets are mostly of a contrasting color, such as red or orange (see Fig. 2 and Table 5). However, as it is also a likely scenario to search for swimmers without life jackets, we included a considerable amount. There are also several different visual appearances of that class, which is why we recorded and annotated swimmers with and without adequate swimwear (such as wet suits). To be able to discriminate between humans in water and humans on boats, we also annotated humans on boats (with and without life jackets). Lastly, we annotated a small number of life jackets alone. However, we note that the discrimination between life jackets and humans in life jackets can become visually ambiguous, especially at higher altitudes. See also Fig. 2.
Figure 4 shows the distribution of images with respect to the altitude and viewing angle at which they were captured. Roughly 50% of the images were recorded below 50 m, because lower altitudes allow for the whole range of available viewing angles. That is, to cover all viewing angles, more images at these altitudes had to be taken. On the other hand, there are many images facing straight down, because images taken at greater altitudes tend to face downwards: acute angles yield image areas with very low pixel density, which is unsuitable for object detection. Nevertheless, every altitude and angle interval is sufficiently represented.
To assess the variability in instance sizes, we compare to other data sets by following the convention in  and , which measure the size of an object by its horizontal bounding box length. Furthermore, we form three groups of instances according to their size: 0-50 px, 50-300 px, and >300 px. Table 4 shows the distribution over these groups. As also noted in , PASCAL VOC, NWPU VHR-10, and Munich Vehicle are dominated by medium-sized and small-sized objects. In contrast, DOTA and SeaDronesSee offer a more balanced distribution. This favors models benchmarked on our data set that perform well on all object sizes.
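Assigning an instance to one of the three size groups by its horizontal bounding-box length can be sketched as:

```python
def size_group(box_width_px: float) -> str:
    """Assign an instance to a size group by its horizontal box length."""
    if box_width_px < 50:
        return "small"      # 0-50 px
    if box_width_px <= 300:
        return "medium"     # 50-300 px
    return "large"          # >300 px

print([size_group(w) for w in (12, 120, 480)])  # ['small', 'medium', 'large']
```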
4.2 Single-Object Tracking
We provide 208 short clips (4 seconds) with a total of 393,295 frames (counting the duplicates), including all available objects labeled. We randomly split the sequences into 58 training, 70 validation, and 80 testing sequences. We do not support long-term tracking. The altitude and angle distributions are similar to those in the object detection section, since the images originate from the same recordings.
4.3 Multi-Object Tracking
We provide 22 clips with a total of 54,105 frames and 403,192 annotated instances; the average clip consists of 2,460 frames. We differentiate between two use cases. In the first task, called MOT-Swimmer, only the persons in water (with and without life jackets) are tracked. In the second task, called MOT-All-Objects-In-Water, all objects in water are tracked (including boats, but not people on boats). In both tasks, all objects are grouped into one class. The data set split is performed as described in Section 3.4.
4.4 Multi-Spectral Footage
Along with the data for the three tasks, we provide multi-spectral images. We supply annotations for all channels of these recordings, but only the RGB-channels are currently part of the Object Detection Task. There are 432 images with 1,901 instances. See Figure 1 for an example of the individual bands.
5 Experiments

We evaluate current state-of-the-art object detectors and object trackers on SeaDronesSee. All experiments can be reproduced using our provided code, available on the evaluation server. Furthermore, we refer the reader to the appendix for the exact submission format and uploading requirements.
5.1 Object Detection
|Model|AP|AP50|AP75|AR1|AR10|
|---|---|---|---|---|---|
The used detectors can be split into two groups. The first group consists of two-stage detectors, which are mainly built on Faster R-CNN  and its improvements. Built for optimal accuracy, these models often lack the inference speed needed for real-time deployment, especially on embedded hardware, which can be a vital use case in UAV-based SAR missions. For that reason, we also evaluate detectors in the second group, the one-stage detectors. In particular, we perform experiments with the best performing single model (no ensemble) from the workshop report : a Faster R-CNN with a ResNeXt-101 64×4d  backbone with P6 removed. For large one-stage detectors, we take the recent CenterNet . To further test an object detector in real-time scenarios, we choose the current best model family on the COCO test-dev according to , i.e., EfficientDet , and take the smallest model, which can run in real-time on embedded hardware, such as the Nvidia Xavier . We refer the reader to the appendix for the exact parameter and training configurations of the individual models.
For evaluation of this task, an algorithm is required to output detected bounding boxes along with a confidence score for every image. Similar to the VisDrone benchmark , we evaluate detectors according to the COCO json-format , i.e., average precision at certain intersection-over-union thresholds. More specifically, we use AP (averaged over IoU thresholds 0.5:0.95), AP50, and AP75. Furthermore, we evaluate the maximum recalls for at most 1 and 10 given detections, denoted AR1 and AR10, respectively. All these metrics are averaged over all categories (except for 'ignored region'). We furthermore provide the class-wise average precisions. Moreover, similar to , we report AP results on different equidistant levels of altitude: 'low' = 5-56 m (L), 'low-medium' = 56-106 m (LM), 'medium' = 106-157 m (M), 'medium-high' = 157-208 m (MH), and 'high' = 208-259 m (H). To measure universal cross-domain performance, we report the average AP over these domains. Similarly, we report AP results for different angles of view: 'acute' = 7-23° (A), 'acute-medium' = 23-40° (AM), 'medium' = 40-56° (M), 'medium-right' = 56-73° (MR), and 'right' = 73-90° (R). Ultimately, the goal is to have detectors that are robust across all domains uniformly, which is better measured by the latter metrics.
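The intersection-over-union underlying these AP thresholds can be computed directly on (x, y, w, h) boxes; a minimal sketch (the benchmark itself relies on the standard COCO evaluation code):

```python
def iou(a, b):
    """Intersection over union of two boxes in (x, y, w, h) format."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))  # overlap width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))  # overlap height
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

# A detection shifted by half a box width overlaps with IoU = 1/3:
print(iou((0, 0, 10, 10), (5, 0, 10, 10)))  # 0.3333333333333333
```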
Table 5 shows the results for all object detection models. As expected, the large Faster R-CNN with ResNeXt-101 64×4d backbone performs best, closely followed by CenterNet-Hourglass104. Medium-sized networks, such as the ResNet-50-FPN, and fast networks, such as CenterNet-ResNet18 and EfficientDet, perform worse, as expected. However, the latter can run in real-time on an Nvidia Xavier . Swimmers with life jackets are detected significantly better than plain swimmers by most detectors. Notably, life jackets are very hard to detect since, from a far distance, they are easily confused with swimmers (see Fig. 2). Since there is a heavy class imbalance with many fewer life jackets, detectors are biased towards swimmers.
Tables 6 and 7 show the performances for different altitudes and angles, respectively. These evaluations help assess the strengths and weaknesses of individual models. For example, although ResNeXt-101-FPN performs better overall than Hourglass104 in AP (54.7 vs. 50.3), the latter is better in the domain of medium angles (45.2 vs. 49.7). Furthermore, the great performance discrepancy between CenterNet-ResNet101 and CenterNet-ResNet18 in AP (36.4 vs. 21.8) vanishes when averaged over angle domains (23.8 vs. 23.1 AP), possibly indicating ResNet101's bias towards specific angle domains.
5.2 Single-Object Tracking
Like VisDrone , we provide success and precision curves for single-object tracking and compare models based on a single number, the success score. As comparison trackers, we choose the DiMP family (DiMP50, DiMP18, PrDiMP50, PrDiMP18) [7, 12] and ATOM  because they were the foundation of many of the trackers submitted to the last VisDrone workshop . We take pre-trained versions of these trackers and evaluate them on the testing set.
Figure 5 shows that the PrDiMP and DiMP families expectedly outperform the older ATOM tracker in both success and precision. Surprisingly, PrDiMP50 slightly trails the accuracy of its predecessor DiMP50. Furthermore, all trackers perform similarly to or worse than on UAV123 (ATOM with 65.0 success) [7, 12, 11], on which they were heavily optimized. We argue that on SeaDronesSee there is still room for improvement, especially considering that the clips feature precise meta information that may be helpful for tracking. Furthermore, in our experiments, the faster trackers DiMP18 and ATOM run at approximately 27.1 fps on an Nvidia RTX 2080 Ti. However, we note that they are not capable of running in real-time on embedded hardware, a use case especially important for UAV-based SAR missions.
5.3 Multi-Object Tracking
We use a similar evaluation protocol as the MOT benchmark . That is, we report Multiple Object Tracking Accuracy (MOTA), Identification F1 score (IDF1), Multiple Object Tracking Precision (MOTP), the number of false positives (FP), the number of false negatives (FN), recall (R), precision (P), identity switches (ID sw.), and fragmentation occurrences (Frag). We refer the reader to  or the appendix for a thorough description of the metrics.
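Of these, MOTA aggregates the three main error types into a single score following the standard definition MOTA = 1 - (FN + FP + ID sw.) / #ground-truth boxes; a minimal sketch with illustrative counts:

```python
def mota(fn: int, fp: int, id_sw: int, num_gt: int) -> float:
    """Multiple Object Tracking Accuracy: 1 - (FN + FP + ID sw.) / #GT boxes.
    Can become negative when the errors exceed the ground-truth count."""
    return 1.0 - (fn + fp + id_sw) / num_gt

print(mota(fn=100, fp=50, id_sw=10, num_gt=1000))  # 0.84
```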
We train and evaluate FairMOT , a popular tracker that is the basis of many trackers submitted to the challenge . FairMOT-D34 employs a DLA34  backbone, while FairMOT-R34 makes use of a ResNet34. Another state-of-the-art tracker is Tracktor++ , which we also use for our experiments. It performed well on the MOT20  challenge and is conceptually simple.
Surprisingly, Tracktor++ performed better than FairMOT in both tasks. One reason for this may be the detector used: Tracktor++ utilizes a Faster R-CNN with a ResNet50 backbone, whereas FairMOT uses a CenterNet with a DLA34 or ResNet34 backbone, respectively.
6 Limitations and Conclusions
This work serves as an introductory benchmark for UAV-based computer vision problems in maritime scenarios. We build the first large-scale data set for detecting and tracking humans in open water. Furthermore, it is the first large-scale benchmark providing full environmental information for every frame, offering great opportunities in the so-far restricted area of multi-modal object detection and tracking. We offer three challenges, object detection, single-object tracking, and multi-object tracking, by providing an evaluation server. Researchers are able to upload their predictions and compare them fairly. We hope that the development of meta-data-aware object detectors and trackers can be accelerated by means of this benchmark. Moreover, we provide multi-spectral imagery for detecting humans in open water. These images are very promising in maritime scenarios due to their ability to capture wavelengths that set objects apart from the water background.
We note, however, that the data could be even more varied. Specifically, we want to emphasize that footage at night or during rain is of great importance in SAR missions. Furthermore, the variance in life jackets, water types, and subjects with different skin and clothing colors can be improved. We hope that the work at hand attracts more attention to UAV-based computer vision problems in maritime SAR scenarios.
We would like to thank Sebastian Koch, Hannes Leier and Aydeniz Soezbilir, without whose contribution this work would not have been possible.
This work has been supported by the German Ministry for Economic Affairs and Energy, Project Avalon, FKZ: 03SX481B.
-  (2017) Hyperspectral imaging: a review on uav-based sensors, data processing and applications for agriculture and forestry. Remote Sensing 9 (11), pp. 1110. Cited by: §1.
-  Aerial data accuracy – an experiment comparing 4 drone approaches. Note: https://www.sitemark.com/blog/accuracyAccessed: 2021-03-01 Cited by: §3.2.
-  Airbus Ship Detection Challenge. Note: https://www.kaggle.com/c/airbus-ship-detectionAccessed: 2021-03-01 Cited by: Table 1, §2.2.
-  (2020) SARDO: an automated search-and-rescue drone-based solution for victims localization. arXiv preprint arXiv:2003.05819. Cited by: §1.
-  (2018) The blackbird dataset: a large-scale dataset for uav perception in aggressive flight. In International Symposium on Experimental Robotics, pp. 130–139. Cited by: §2.3.
-  (2019-10) Tracking without bells and whistles. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §5.3, Table 9.
-  (2019) Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6182–6191. Cited by: §5.2, §5.2.
-  (2020) Au-air: a multi-modal unmanned aerial vehicle dataset for low altitude traffic surveillance. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 8504–8510. Cited by: Table 1, §2.3.
-  (2010) A complete processing chain for ship detection using optical satellite imagery. International Journal of Remote Sensing 31 (22), pp. 5837–5854. Cited by: §1.
-  (2004) The state-of-the-art in ship detection in synthetic aperture radar imagery. defence science and technology organization (dsto). Information Science Laboratory, Research Report No. DSTO-RR-0272. Cited by: §1.
Atom: accurate tracking by overlap maximization.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4660–4669. Cited by: §5.2, §5.2.
-  (2020) Probabilistic regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7183–7192. Cited by: §5.2, §5.2.
-  DarkLabel video/image labeling and annotation tool. Note: https://github.com/darkpgmr/DarkLabelAccessed: 2020-08-31 Cited by: §3.3.
-  MOT20: A benchmark for multi object tracking in crowded scenes. Technical report External Links: Cited by: §5.3.
-  (2018) The unmanned aerial vehicle benchmark: object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 370–386. Cited by: Table 1, §1, §2.1, §2.3, §4.1.
-  (2020) VisDrone-mot2020: the vision meets drone multiple object tracking challenge results. In European Conference on Computer Vision, pp. 713–727. Cited by: §5.3.
-  (2020) VisDrone-sot2020: the vision meets drone single object tracking challenge results. In European Conference on Computer Vision, pp. 728–749. Cited by: §2.3, §5.2.
-  (2019-06) Mid-air: a multi-modal dataset for extremely low altitude drone flights. In Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), Cited by: §2.3.
-  (2019) Detection of bodies in maritime rescue operations using unmanned aerial vehicles with multispectral cameras. Journal of Field Robotics 36 (4), pp. 782–796. Cited by: §1, §1.
UAV-based situational awareness system using deep learning. IEEE Access 7, pp. 122583–122594. Cited by: §1.
-  (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §5.1, Table 5.
-  (2017) Drone-based object counting by spatially regularized regional proposal network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4145–4153. Cited by: §2.1.
-  (2013) The prototype of gyro-stabilized uav gimbal for day-night surveillance. In Advanced technologies for intelligent systems of national border security, pp. 107–115. Cited by: §3.2.
-  (2018) The potential use of unmanned aircraft systems (drones) in mountain search and rescue operations. The American journal of emergency medicine 36 (4), pp. 583–588. Cited by: §1.
-  (2021) Leveraging domain labels for object detection from uavs. arXiv preprint arXiv:2101.12677. Cited by: §1, §5.1, §5.1, §5.1.
-  (2018) The highd dataset: a drone dataset of naturalistic vehicle trajectories on german highways for validation of highly automated driving systems. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 2118–2125. Cited by: §2.1.
HSF-net: multiscale deep feature embedding for ship detection in optical remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing 56 (12), pp. 7147–7161. Cited by: §2.2.
-  (2017) Visual object tracking for unmanned aerial vehicles: a benchmark and new motion models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31. Cited by: Table 1, §2.1.
-  (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755. Cited by: §5.1.
-  (2015) Fast multiclass vehicle detection on aerial images. IEEE Geoscience and Remote Sensing Letters 12 (9), pp. 1938–1942. Cited by: Table 4.
-  (2019) Unsupervised human detection with an embedded vision system on a fully autonomous UAV for search and rescue operations. Sensors 19 (16), pp. 3542. Cited by: §2.2.
-  (2017) The zurich urban micro aerial vehicle dataset. The International Journal of Robotics Research 36 (3), pp. 269–273. Cited by: §2.3.
-  (2021) Gaining scale invariance in UAV bird’s eye view object detection by adaptive resizing. arXiv preprint arXiv:2101.12694. Cited by: §1.
-  (2016) MOT16: a benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831. Cited by: §5.3.
-  (2020) Drone-surveillance for search and rescue in natural disaster. Computer Communications 156, pp. 1–10. Cited by: §1.
-  (2016) A benchmark and simulator for UAV tracking. In European Conference on Computer Vision, pp. 445–461. Cited by: Table 1, §2.1, §2.3.
-  (2016) A large contextual dataset for classification, detection and counting of cars with deep learning. In European Conference on Computer Vision, pp. 785–800. Cited by: §2.1.
-  (2019) Shipwrecked victims localization and tracking using UAVs. In 2019 15th International Wireless Communications & Mobile Computing Conference (IWCMC), pp. 1344–1348. Cited by: §1.
-  Object Detection on COCO test-dev. Note: https://paperswithcode.com/sota/object-detection-on-coco. Accessed: 2021-03-01. Cited by: §5.1.
-  (2016) Combining human computing and machine learning to make sense of big (aerial) data for disaster response. Big data 4 (1), pp. 47–59. Cited by: §2.1.
-  (2019) Human trajectory prediction in crowded scene using social-affinity long short-term memory. Pattern Recognition 93, pp. 273–282. Cited by: §2.1.
-  (2019) Are object detection assessment criteria ready for maritime computer vision? IEEE Transactions on Intelligent Transportation Systems 21 (12), pp. 5295–5304. Cited by: §1.
-  (2016) Performance measures and a data set for multi-target, multi-camera tracking. In European conference on computer vision, pp. 17–35. Cited by: §5.3.
-  (2018) UAV delivery monitoring system. In MATEC Web of Conferences, Vol. 151, pp. 04011. Cited by: §1.
-  (2019) Object detection and instance segmentation in remote sensing imagery based on precise Mask R-CNN. In IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium, pp. 1454–1457. Cited by: Table 4.
-  (2020) EfficientDet: scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10781–10790. Cited by: §5.1, Table 5.
-  (2014) Nature conservation drones for automatic localization and counting of animals. In European Conference on Computer Vision, pp. 255–270. Cited by: §2.1.
-  (2014) Reconstructing PASCAL VOC. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 41–48. Cited by: Table 4.
-  (2019) Delving into robust object detection from unmanned aerial vehicles: a deep nuisance disentanglement approach. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1201–1210. Cited by: §2.3.
-  (2018) DOTA: a large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3974–3983. Cited by: Table 1, §2.2, §4.1, Table 4.
-  (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: §5.1, Table 5.
-  (2016) WIDER FACE: a face detection benchmark. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5525–5533. Cited by: §4.1.
-  (2015) A review on marine search and rescue operations using unmanned aerial vehicles. International Journal of Marine and Environmental Sciences 9 (2), pp. 396–399. Cited by: §1.
-  (2018) Deep layer aggregation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2403–2412. Cited by: §5.3.
-  (2020) FairMOT: on the fairness of detection and re-identification in multiple object tracking. arXiv preprint arXiv:2004.01888. Cited by: §5.3, Table 9.
-  (2019) Objects as points. arXiv preprint arXiv:1904.07850. Cited by: §5.1, Table 5.
-  (2018) Vision meets drones: a challenge. arXiv preprint arXiv:1804.07437. Cited by: Table 1, §1, §2.1, §5.1.
-  (2020) Vision meets drones: past, present and future. arXiv preprint arXiv:2001.06303. Cited by: §5.2.
-  (2018) VisDrone-det2018: the vision meets drone object detection in image challenge results. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pp. 0–0. Cited by: §5.1.
-  (2017) Precise positioning of UAVs: dealing with challenging RTK-GPS measurement conditions during automated UAV flights. ISPRS Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences 4. Cited by: §3.2.