SeaDronesSee: A Maritime Benchmark for Detecting Humans in Open Water

by   Leon Amadeus Varga, et al.
Universität Tübingen

Unmanned Aerial Vehicles (UAVs) are of crucial importance in search and rescue missions in maritime environments due to their flexible and fast operation capabilities. Modern computer vision algorithms are of great interest in aiding such missions. However, they are dependent on large amounts of real-case training data from UAVs, which is only available for traffic scenarios on land. Moreover, current object detection and tracking data sets only provide limited environmental information or none at all, neglecting a valuable source of information. Therefore, this paper introduces a large-scale visual object detection and tracking benchmark (SeaDronesSee) aiming to bridge the gap from land-based vision systems to sea-based ones. We collect and annotate over 54,000 frames with 400,000 instances captured from various altitudes and viewing angles ranging from 5 to 260 meters and 0 to 90 degrees while providing the respective meta information for altitude, viewing angle and other meta data. We evaluate multiple state-of-the-art computer vision algorithms on this newly established benchmark serving as baselines. We provide an evaluation server where researchers can upload their predictions and compare their results on a central leaderboard.



1 Introduction

Unmanned Aerial Vehicles (UAVs) equipped with cameras have emerged as an important asset in a wide range of fields, such as agriculture, delivery, surveillance, and search and rescue (SAR) missions [1, 44, 20]. In particular, UAVs are capable of assisting in SAR missions due to their fast and versatile applicability while providing an overview over the scene [35, 24, 4]. Especially in maritime scenarios, where wide areas need to be quickly overseen and searched, the efficient use of autonomous UAVs is crucial [53]. Among the most challenging issues in this application scenario is the detection, localization, and tracking of people in open water [19, 38]. The small size of people relative to search radii and the variability in viewing angles and altitudes require robust vision-based systems.

Figure 1: (a) Typical image examples with varying altitudes and angles of view, captured at altitudes of 250 m, 50 m, 10 m, and 20 m (from top left to bottom right). (b) Examples of the Red Edge (717 nm, left) and Near Infrared (842 nm, right) light spectra of an image captured by the MicaSense RedEdge-MX. Note the glowing appearance of the swimmers.

Currently, these systems are implemented via data-driven methods such as deep neural networks. These methods depend on large-scale data sets portraying real-case scenarios to obtain realistic imagery statistics. However, there is a great lack of large-scale data sets in maritime environments. Most data sets captured from UAVs are land-based, often focusing on traffic environments, such as VisDrone [57] and UAVDT [15]. Many of the few data sets that are captured in maritime environments fall in the category of remote sensing, often leveraging satellite-based synthetic aperture radar [10]. All of these are only valuable for ship detection [9] as they do not provide the resolution needed for SAR missions. Furthermore, satellite-based imagery is susceptible to clouds and only provides top-down views. Finally, many current approaches in the maritime setting rely on classical machine learning methods, incapable of dealing with the large number of influencing variables and calling for more elaborate models.


This work aims to close the gap between large-scale land-based data sets captured from UAVs and maritime-based data sets. We introduce a large-scale data set of people in open water, called SeaDronesSee. We captured videos and images of swimming test subjects in open water with various UAVs and cameras. As it is especially critical in SAR missions to detect and track objects from a large distance, we captured the RGB footage at resolutions from 3,840×2,160 px to 5,456×3,632 px. We carefully annotated ground-truth labels for objects of interest including swimmer, swimmer with life jacket, life jacket, person on boat, person with life jacket on boat, and boat.

Moreover, we note that current data sets captured from UAVs only provide very coarse or no meta information at all. We argue that this is a major impediment in the development of multi-modal systems, which take this additional information into account to improve accuracy or speed, see [25, 33]. Therefore, we provide precise meta information for every frame and image including altitude, camera angle, speed, time, and others. We also made sure that the data set was balanced with respect to this meta information.

In maritime settings, the use of multi-spectral cameras with Near Infrared channels to detect humans can be advantageous [19]. For that reason, we also captured multi-spectral images using a MicaSense RedEdge. This enables the development of detectors taking into account the non-visible light spectra Near Infrared (842 nm) and Red Edge (717 nm).

Finally, we provide detailed statistics of the data set and conduct extensive experiments using state-of-the-art models and hereby establish baseline models. These serve as a starting point for our SeaDronesSee benchmark. We release the training and validation sets with complete ground truth but only the test set’s videos/images. The ground truth of the test set is used by the benchmark server to calculate the generalization power of the models. We set up an evaluation web page, where researchers can upload their predictions and opt to publish their results on a central leader board such that transparent comparisons are possible. The benchmark focuses on three tasks: (i) object detection, (ii) single-object tracking and (iii) multi-object tracking, which will be explained in more detail in the subsequent sections.
Our main contributions are as follows:

  • To the best of our knowledge, SeaDronesSee is the first large annotated UAV-based data set of swimmers in open water. It can be used to further develop detectors and trackers for SAR missions.

  • We provide full environmental meta information for every frame, making SeaDronesSee the first UAV-based data set of that nature.

  • We provide an evaluation server to prevent researchers from overfitting and allow for fair comparisons.

  • We perform extensive experiments on state-of-the-art object detectors and trackers on our data set.

Object detection        env.      platform   image widths  altitude range  angle range  other meta
DOTA [50]               cities    satellite  800-20,000
UAVDT [15]              traffic   UAV        1,024         5-200 m*        *
VisDrone [57]           traffic   UAV        960-2,000     5-200 m*        *
Airbus Ship [3]         maritime  satellite  768
AU-AIR [8]              traffic   UAV        1,920         5-30 m
SeaDronesSee            maritime  UAV        3,840-5,456   5-260 m

Single-object tracking  env.      #clips     frame widths  altitude range  angle range  other meta
UAV123 [36]             traffic   123        1,280         5-50 m*         *
DTB70 [28]              sports    70         1,280         0-10 m*         *
UAVDT [15]              traffic   50         1,024         5-200 m*        *
VisDrone [57]           traffic   167        960-2,000     5-200 m*        *
SeaDronesSee            maritime  405        3,840         5-150 m

Multi-object tracking   env.      #frames    frame widths  altitude range  angle range  other meta
UAVDT [15]              traffic   40.7 k     1,024         5-200 m*        *
VisDrone [57]           traffic   40 k       960-2,000     5-200 m*        *
SeaDronesSee            maritime  54 k       3,840         5-150 m

Table 1: Comparison with the most prominent annotated aerial data sets. 'altitude' and 'angle' indicate whether or not precise altitude and viewing angle information is available. 'other meta' refers to time stamps, GPS, and IMU data and, in the case of object tracking, can also mean attribute information about the sequences. The values with stars have been estimated based on ground truth bounding box sizes and corresponding real world object sizes (for altitude) and qualitative estimation of sample images (for angle). For DOTA and Airbus Ship the range of altitudes is not available because these are satellite-based data sets.

2 Related Work

In this section, we review major labeled data sets in the field of computer vision from UAVs and in maritime scenarios which are usable for supervised learning models.

2.1 Labeled Data Sets Captured from UAVs

Over the last few years, quite a few data sets captured from UAVs have been published. The most prominent are those that depict traffic situations, such as VisDrone [57] and UAVDT [15]. Both data sets focus on object detection and object tracking in unconstrained environments. Pei [41] collect videos (Stanford Drone Dataset) showing traffic participants on campuses (mostly people) for human trajectory prediction usable for object detection. UAV123 [36] is a single-object tracking data set consisting of 123 video sequences with corresponding labels. The clips mainly show traffic scenarios and common objects. Both Hsieh [22] and Mundhenk [37] capture a data set showing parking lots for car counting tasks and constrained object detection. Li [28] provide a single-object tracking data set showing traffic, wildlife and sports scenarios. Collins capture a single-object tracking data set showing vehicles on streets in rural areas. Krajewski [26] show vehicles on freeways.

Another active area of research focuses on drone-based wildlife detection. Van [47] release a data set for the tasks of low-altitude detection and counting of cattle. Ofli [40] release the African Savanna data set as part of their crowd-sourced disaster response project.

2.2 Labeled Data Sets in Maritime Environments

Many data sets in maritime environments are captured from satellite-based synthetic aperture radar and therefore fall into the remote sensing category. In this category, the Airbus Ship data set [3] is the largest, featuring 40k images from synthetic aperture radars with instance segmentation labels. Li [27] provide a data set of ships with images mainly taken from Google Earth, but also a few UAV-based images. In [50], the authors provide satellite-based images from natural scenes, mainly land-based but also harbors. The most similar to our work is [31], which also considers the problem of human detection in open water. However, their data mostly contains images close to shores and of swimming pools. Furthermore, it is not publicly available.

2.3 Multi-Modal Data Sets Captured from UAVs

UAVDT [15] provides coarse meta data for their object detection and tracking data: every frame is labeled with altitude information (low, medium, high), angle of view (front-view, side-view, bird-view) and light conditions (day, night, foggy). Wu [49] manually label VisDrone after its release with the same annotation information for the object detection track. Mid-Air [18] is a synthetic multi-modal data set with images in nature containing precise altitude, GPS, time, and velocity data but without annotated objects. Blackbird [5] is a real-data indoor data set for agile perception also featuring this meta information. In [32], street-view images with the same meta data are captured to benchmark appearance-based localization. Bozcan [8] release a low-altitude (5-30 m; see Table 1) object detection data set containing images showing a traffic circle and provide meta data such as altitude, GPS, and velocity but exclude the important camera angle information.

Tracking data sets often provide meta data (or attribute information) for the clips. However, in many cases these do not refer to the environmental state in which the image was captured. Instead, they abstractly describe the way in which a clip was captured: UAV123 [36] label their clips with information such as aspect ratio change, background clutter, and fast motion, but do not provide frame-by-frame meta data. The same observation can be made for the tracking track of VisDrone [17]. See Table 1 for an overview of annotated aerial data sets.

3 Data Set Generation

3.1 Image Data Collection

We gathered the footage on several days to obtain variance in light conditions. Taking into account safety and environmental regulations, we asked over 20 test subjects to be recorded in open water. Only subjects who met strict criteria regarding their ability to swim in open water were recruited. Small boats were rented to transport the subjects to the area of interest, where quadcopters were launched at a safe distance from the swimmers. At the same time, the fixed-wing UAV Trinity F90+ was launched from the shore. We used waypoints to ensure a strict flight schedule to maximize data collection efficiency. Care was taken to maintain a strict vertical separation at all times. Subjects were free to wear life jackets, of which we provided several differently colored pieces (see also Figure


To diminish the effect of camera biases within the data set, we used multiple cameras, as listed in Table 2, mounted to the following drones: DJI Matrice 100, DJI Matrice 210, DJI Mavic 2 Pro, and a Quantum Systems Trinity F90+.

Camera                Resolution    Video
Hasselblad L1D-20c    3,840×2,160   30 fps
MicaSense RedEdge-MX  1,280×960
Sony UMC-R10C         5,456×3,632
Zenmuse X5            3,840×2,160   30 fps
Zenmuse XT2           3,840×2,160   30 fps
Table 2: Overview of used cameras.

With the video cameras we captured videos at 30 fps. For the object detection task, we extract at most two frames per second of these videos to avoid having redundant occurrences of frames. See Section 4 for information on the distribution of images with respect to different cameras.
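The frame extraction described above can be illustrated with a minimal sketch. It assumes evenly spaced sampling, which the text does not specify; the helper name `subsample_frame_indices` and its parameters are hypothetical and do not come from the authors' tooling.

```python
def subsample_frame_indices(num_frames, video_fps=30, max_per_second=2):
    """Return the indices of the frames to keep (hypothetical helper).

    Assumption: frames are picked at a fixed stride so that no more than
    `max_per_second` frames fall into any one-second window of the video.
    """
    if num_frames < 0 or video_fps <= 0 or max_per_second <= 0:
        raise ValueError("num_frames must be >= 0; rates must be positive")
    # Ceiling division: for 30 fps and 2 frames per second the stride is 15.
    step = max(1, -(-video_fps // max_per_second))
    return list(range(0, num_frames, step))


if __name__ == "__main__":
    # Two seconds of 30 fps video.
    print(subsample_frame_indices(60))
```

A fixed stride keeps the selected frames far apart in time, which is what reduces near-duplicate images in the detection set.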

Lastly, we captured top-down looking multi-spectral imagery at 1 fps. We used a MicaSense RedEdge-MX, which records five wavelengths (475 nm, 560 nm, 668 nm, 717 nm, 842 nm). Therefore, in addition to the RGB channels, the recordings also contain a RedEdge and a Near Infrared channel. The camera was referenced with a white reference before each flight. As the RedEdge-MX captures every band individually, we merge the bands using the development kit provided by MicaSense.

3.2 Meta Data Collection

Data unit Min. value Max.value
degrees 90
Table 3: Meta data that comes with every image/frame.

Every frame is accompanied by a meta stamp that is logged at 10 Hz. To align the video data (30 fps) with the time stamps, we use nearest-neighbor matching. The following data is logged and provided for every image/frame, read from the onboard clock, barometer, IMU and GPS sensor, respectively:
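The nearest-neighbor alignment between 30 fps frames and 10 Hz meta stamps can be sketched as follows. This is only an illustration under stated assumptions: frame times are derived from the frame index, meta time stamps are given in seconds relative to the start of the capture and sorted in ascending order, and ties go to the earlier stamp. The function names are hypothetical.

```python
from bisect import bisect_left


def nearest_meta_index(frame_time, meta_times):
    """Index of the meta stamp closest in time to `frame_time`.

    `meta_times` must be a non-empty list sorted in ascending order.
    """
    pos = bisect_left(meta_times, frame_time)
    if pos == 0:
        return 0
    if pos == len(meta_times):
        return len(meta_times) - 1
    before, after = meta_times[pos - 1], meta_times[pos]
    # Assumption: a tie is resolved in favor of the earlier stamp.
    return pos - 1 if frame_time - before <= after - frame_time else pos


def align_frames_to_meta(num_frames, meta_times, video_fps=30.0):
    """Map every frame index to the index of its nearest meta stamp."""
    frame_times = [i / video_fps for i in range(num_frames)]
    return [nearest_meta_index(t, meta_times) for t in frame_times]


if __name__ == "__main__":
    meta_times = [0.0, 0.1, 0.2]  # three stamps logged at 10 Hz
    print(align_frames_to_meta(7, meta_times))
```

With three frames per meta stamp, every stamp is reused for about three consecutive frames.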

  • current date and time of capture

  • relative time stamp since beginning of capture

  • latitude of the UAV

  • longitude of the UAV

  • altitude of the UAV

  • camera pitch angle (viewing angle)

  • UAV roll angle

  • UAV pitch angle

  • UAV yaw angle

  • speed along the x-axis

  • speed along the y-axis

  • speed along the z-axis

See Table 3 for an overview of the variables and their ranges. Note that a camera pitch angle of 90° corresponds to a top-down view, and 0° to a horizontally facing camera. The date format is given in the extended form of ISO 8601. Furthermore, note that the UAV roll/pitch/yaw-angles are of minor importance for meta-data-aware vision-based methods as the onboard gimbal filters out movement by the drone such that the camera pitch angle is roughly constant if it is not intentionally changed [23].

We emphasize that the meta values lie within the error thresholds introduced by the different sensors, but an extended analysis is beyond the scope of this paper (see [60, 2] for an overview).

3.3 Annotation Method

Using the non-commercial labeling tool DarkLabel [13], we manually and carefully annotated all provided images and frames with the categories swimmers, swimmers with life jacket, life jackets, person, person with life jacket (we define a person to be on a boat), and boats. Note that the annotations are mutually exclusive. Subsequently, all annotations were checked by experts in aerial vision. We chose these classes as they are the hardest and most critical to detect in SAR missions. Furthermore, we annotated regions with other objects as ignored regions, such as boats on land. These regions were blackened in the recordings. Our guidelines for the annotation are described in the appendix. In particular, swimmers were annotated such that the complete body including arms and legs (if visible) is within the bounding box. The bounding box format is (x, y, w, h), where x and y correspond to the upper left corner and w and h to the width and height, respectively. See Figure 2 for examples of objects.
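As an illustration of the bounding box format, the following sketch converts a box given by its upper left corner, width and height into corner coordinates and computes the intersection over union (IoU) of two such boxes. The function names are hypothetical, and the tuple layout (x, y, w, h) is assumed from the description above.

```python
def xywh_to_xyxy(box):
    """Convert (x, y, w, h) with an upper-left origin to (x1, y1, x2, y2)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def iou_xywh(box_a, box_b):
    """Intersection over union of two boxes in (x, y, w, h) format."""
    ax1, ay1, ax2, ay2 = xywh_to_xyxy(box_a)
    bx1, by1, bx2, by2 = xywh_to_xyxy(box_b)
    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - intersection
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 10x10 boxes that overlap in a 5x10 strip.
    print(iou_xywh((0, 0, 10, 10), (5, 0, 10, 10)))
```

IoU is the overlap measure on which the detection and tracking metrics used later in the paper are based.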

Figure 2: Examples of objects. Note that these examples are crops from high resolution images. However, as the objects are small and the images taken from high altitudes, they appear blurry.
On the left side there are three images of swimmers with life jacket and a single image of a life jacket. The four images in the middle show swimmers in various poses. On the right, two persons on a boat are shown. The top one is wearing a life jacket.

3.4 Data Set Split

Object Detection

To ensure that the training, validation, and testing set have similar statistics, we roughly balance them such that the respective subsets have similar distributions with respect to altitude and angle of view, two of the most important factors of appearance changes. Of the individual images, we randomly select and add it to the training set, add to the validation set and another to the testing set. In addition to the individual images, we randomly cut every video into three parts of length , , and of the original length and add the respective portions to the training, validation, and testing set. This is done to avoid having subsequent frames in the training and testing set such that a realistic evaluation is possible. We release the training and validation set with all annotations and the testing set’s images, but withhold its annotations. Evaluation will be available via an evaluation server, where the predictions on the test set can be uploaded.
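The video split can be sketched as follows. This is a hedged illustration, not the authors' procedure: it assumes each video is cut into three contiguous parts in temporal order, and the fractions in the usage example are placeholders rather than the values used for the data set. The function name `split_video_frames` is hypothetical.

```python
def split_video_frames(num_frames, fractions):
    """Cut frame indices 0..num_frames-1 into three contiguous parts.

    `fractions` holds the relative lengths of the training, validation and
    testing part and must sum to 1. Contiguous parts keep consecutive frames
    inside one subset.
    """
    if len(fractions) != 3 or abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must contain three values that sum to 1")
    first_cut = round(num_frames * fractions[0])
    second_cut = round(num_frames * (fractions[0] + fractions[1]))
    return (
        range(0, first_cut),
        range(first_cut, second_cut),
        range(second_cut, num_frames),
    )


if __name__ == "__main__":
    # Placeholder fractions, chosen only for illustration.
    train, val, test = split_video_frames(100, (0.5, 0.25, 0.25))
    print(len(train), len(val), len(test))
```

Cutting each video into contiguous blocks, instead of sampling frames at random, is what prevents near-identical neighboring frames from ending up in both the training and the testing set.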

Object Tracking

The tracking clips are based on the already described video splits, i.e. the training set consists of every recorded sequence’s first and so on. Hence the sets are disjoint. Both single-object tracking and multi-object tracking use the same clips. Like for the object detection task, we withhold the annotations for the testing set and provide an evaluation server for both, single-object tracking and multi-object tracking.

4 Data Set Statistics

4.1 Object Detection Task

There are 5,630 images (training: 2,975; validation: 859; testing: 1,796). See Figure 3 for the distribution of images/frames with respect to cameras and the class distribution. We recorded most of the images with the L1D-20c and UMC-R10C, having the highest resolution. Having the lowest resolution, we recorded only 432 images with the RedEdge-MX. Note, for the Object Detection Task only the RGB-channels of the multi-spectral images are used to support a uniform data structure.

Furthermore, the class distribution is slightly skewed towards the class 'boat', since safety precautions require boats to be nearby. We emphasize that this bias can easily be diminished by blackening the respective regions, as is common for areas which are not of interest or unwanted (such as boats here; see [15]). Right after that, swimmers with life jacket are the most common objects. We argue that this scenario is very often encountered in SAR missions. This class is often easier to detect than plain swimmers, as life jackets mostly are of contrasting color, such as red or orange (see Fig. 2 and Table 5). However, as it is also a likely scenario to search for swimmers without life jacket, we included a considerable number of them. There are also several different visual appearances of that class, which is why we recorded and annotated swimmers with and without adequate swimwear (such as wet suits). To be able to discriminate between humans in water and humans on boats, we also annotated humans on boats (with and without life jackets). Lastly, we annotated a small number of life jackets only. However, we note that the discrimination between life jackets and humans in life jackets can become visually ambiguous, especially at higher altitudes. See also Fig. 2.

Figure 3: Distribution of images over camera types (left) and distribution of objects over classes (right).

Figure 4 shows the distribution of images with respect to the altitude and viewing angle they were captured at. Roughly 50% of the images were recorded below 50 m because lower altitudes allow for the whole range of available viewing angles (0° to 90°). That is, to cover all viewing angles, more images at these altitudes had to be taken. On the other hand, there are many images facing downwards, because images taken at greater altitudes tend to face downwards since acute angles yield image areas with tiny pixel density, which is unsuitable for object detection. Nevertheless, every altitude and angle interval is sufficiently represented.

Figure 4: Distribution of images over altitudes (left) and angles (right), respectively.

To assess the variability in sizes of instances, we compare to other data sets by following the convention in [52] and [50] which measure the size of an object by its horizontal bounding box length. Furthermore, we form three groups of instances according to their size: 0-50 px, 50-300 px, and above 300 px. Table 4 shows the distribution over these different groups. As also noted in [50], PASCAL VOC, NWPU VHR-10 and Munich Vehicle are dominated by medium-sized and small-sized objects. In contrast, DOTA and SeaDronesSee offer a more balanced distribution. This favors models benchmarked on our data set that perform well on all object sizes.

Dataset              10-50 px  50-300 px  >300 px
PASCAL VOC [48]      0.14      0.61       0.25
NWPU VHR-10 [45]     0.15      0.83       0.02
Munich Vehicle [30]  0.93      0.07       0
DOTA [50]            0.57      0.41       0.02
SeaDronesSee         0.42      0.54       0.04
Table 4: Percentages of instances in respective groups which are divided according to size.
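The size grouping behind Table 4 can be sketched as below. The sketch assumes that the horizontal bounding box length is the box width in pixels; which group the boundary values 50 px and 300 px belong to is an assumption, as are the group names and function names.

```python
from collections import Counter

GROUPS = ("small", "medium", "large")


def size_group(width_px):
    """Assign an instance to a size group by its horizontal box length."""
    if width_px < 50:
        return "small"
    if width_px <= 300:  # assumption: 300 px still counts as medium
        return "medium"
    return "large"


def size_distribution(widths):
    """Fraction of instances per size group, as reported per data set."""
    if not widths:
        raise ValueError("at least one instance is required")
    counts = Counter(size_group(w) for w in widths)
    return {group: counts[group] / len(widths) for group in GROUPS}


if __name__ == "__main__":
    # Toy widths, not taken from the data set.
    print(size_distribution([10, 40, 100, 400]))
```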

4.2 Single-Object Tracking

We provide 208 short clips (4 seconds) with a total of 393,295 frames (counting the duplicates), including all available objects labeled. We randomly split the sequences into 58 training, 70 validation and 80 testing sequences. We do not support long-term tracking. The altitude and angle distributions are similar to those in the object detection section since the origin of the images of the object detection task is the same.

4.3 Multi-Object Tracking

We provide 22 clips with a total of 54,105 frames and 403,192 annotated instances; the average clip consists of 2,460 frames. We differentiate between two use cases. In the first task, called MOT-Swimmer, only the persons in water (with and without life jacket) are tracked. In the second task, called MOT-All-Objects-In-Water, all objects in water are tracked (also the boats, but not people on boats). In both tasks, all objects are grouped into one class. The data set split is performed as described in Section 3.4.
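The two task definitions amount to filtering the annotations by category and collapsing them into one class. The sketch below illustrates this under several assumptions: annotations are dictionaries with a `category` key (a hypothetical layout), the category strings follow the class names used in this paper, and loose life jackets are counted as objects in water, which the text does not state explicitly.

```python
SWIMMER_CLASSES = {"swimmer", "swimmer with life jacket"}
# Assumption: boats and loose life jackets count as objects in water,
# while people on boats are excluded as described in the text.
IN_WATER_CLASSES = SWIMMER_CLASSES | {"boat", "life jacket"}

TASK_CLASSES = {
    "MOT-Swimmer": SWIMMER_CLASSES,
    "MOT-All-Objects-In-Water": IN_WATER_CLASSES,
}


def to_single_class(annotations, task):
    """Keep the annotations relevant for `task` and merge them into one class."""
    keep = TASK_CLASSES[task]
    return [
        {**ann, "category": "object"}
        for ann in annotations
        if ann["category"] in keep
    ]


if __name__ == "__main__":
    annotations = [
        {"track_id": 1, "category": "swimmer"},
        {"track_id": 2, "category": "boat"},
        {"track_id": 3, "category": "person on boat"},
    ]
    print(to_single_class(annotations, "MOT-Swimmer"))
    print(to_single_class(annotations, "MOT-All-Objects-In-Water"))
```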

4.4 Multi-Spectral Footage

Along with the data for the three tasks, we provide multi-spectral images. We supply annotations for all channels of these recordings, but only the RGB-channels are currently part of the Object Detection Task. There are 432 images with 1,901 instances. See Figure 1 for an example of the individual bands.

5 Evaluations

We evaluate current state-of-the-art object detectors and object trackers on SeaDronesSee. All experiments can be reproduced by using our provided code available on the evaluation server. Furthermore, we refer the reader to the Appendix for the exact form and uploading requirements.

5.1 Object Detection

Model                        AP    AP50  AP75  AR1   AR10  swimmer  swimmer†  person  person†  boat  life jacket  FPS
ResNeXt-101-FPN [51]         30.4  54.7  29.7  18.6  42.6  78.1     82.4      25.9    44.3     96.7  0.6          2
ResNet-50-FPN [21]           14.2  30.1  7.2   6.4   17.7  24.6     54.1      4.9     7.5      89.2  0.3          14
CenterNet-Hourglass104 [56]  25.6  50.3  22.2  17.7  40.1  65.1     73.6      19.1    48.1     95.8  0.3          6
CenterNet-ResNet101 [56]     15.1  36.4  10.8  9.6   21.4  16.8     39.8      0.8     1.7      74.3  0            22
CenterNet-ResNet18 [56]      9.9   21.8  9.0   7.2   19.7  20.9     21.9      2.6     3.3      81.9  0.4          78
EfficientDet-D0 [46]         20.8  37.1  20.6  11.5  29.1  65.3     55.1      3.1     3.3      95.5  0.1          26
Table 5: Average precision results for several baseline models. The right part contains AP-values for each class individually. Classes marked with † represent the same respective class with all instances wearing life jackets. All reported FPS numbers are obtained on a single Nvidia RTX 2080 Ti.

The detectors used can be split into two groups. The first group consists of two-stage detectors, which are mainly built on Faster R-CNN [21] and its improvements. Built for optimal accuracy, these models often lack the inference speed needed for real-time employment, especially on embedded hardware, which can be a vital use case in UAV-based SAR missions. For that reason, we also evaluate detectors in the second group, the one-stage detectors. In particular, we perform experiments with the best performing single model (no ensemble) from the workshop report [59]: a Faster R-CNN with a ResNeXt-101 64x4d [51] backbone with P6 removed. For large one-stage detectors, we take the recent CenterNet [56]. To further test an object detector in real-time scenarios, we choose the current best model family on the COCO test-dev according to [39], i.e. EfficientDet [46], and take the smallest model, D0, which can run in real-time on embedded hardware, such as the Nvidia Xavier [25]. We refer the reader to the appendix for the exact parameter configurations and training configurations of the individual models.

For evaluation of this task, an algorithm is required to output detected bounding boxes along with a confidence score for every image. Similar to the VisDrone benchmark [57], we evaluate detectors according to the COCO json-format [29], i.e. average precision at certain intersection-over-union (IoU) thresholds. More specifically, we use AP (averaged over the IoU thresholds 0.5, 0.55, ..., 0.95), AP50 (IoU threshold 0.5) and AP75 (IoU threshold 0.75). Furthermore, we evaluate the maximum recalls for at most 1 and 10 given detections, respectively, denoted AR1 and AR10. All these metrics are averaged over all categories (except for "ignored region"). We furthermore provide the class-wise average precisions. Moreover, similar to [25], we report AP-results on different equidistant levels of altitudes 'low' = 5-56 m (L), 'low-medium' = 55-106 m (LM), 'medium' = 106-157 m (M), 'medium-high' = 157-208 m (MH), and 'high' = 208-259 m (H). To measure the universal cross-domain performance we report the average over these domains, denoted AP. Similarly, we report AP-results for different angles of view: 'acute' = 7-23° (A), 'acute-medium' = 23-40° (AM), 'medium' = 40-56° (M), 'medium-right' = 56-73° (MR), and 'right' = 73-90° (R). Ultimately, the goal is to have detectors that are uniformly robust across all domains, which is better measured by the latter metrics.

Table 5 shows the results for all object detection models. As expected, the large Faster R-CNN with ResNeXt-101 64x4d backbone performs best, closely followed by CenterNet-Hourglass104. Medium-sized networks, such as the ResNet-50-FPN, and fast networks, such as CenterNet-ResNet18 and EfficientDet-D0, perform worse, as expected. However, the latter can run in real-time on an Nvidia Xavier [25]. Swimmers with life jackets are detected significantly better than swimmers without life jackets by most detectors. Notably, life jackets are very hard to detect since from a far distance these are easily confused with swimmers (see Fig. 2). Since there is a heavy class imbalance with many fewer life jackets, detectors are biased towards swimmers.

Tables 6 and 7 show the performances for different altitudes and angles, respectively. These evaluations help assess the strengths and weaknesses of individual models. For example, although ResNeXt-101-FPN performs overall better than Hourglass104 in AP50 (54.7 vs. 50.3), the latter is better in the domain of medium angles (45.2 vs. 49.7). Furthermore, the great performance discrepancy between CenterNet-ResNet101 and CenterNet-ResNet18 in AP50 (36.4 vs. 21.8) vanishes when averaged over angle domains (23.8 vs. 23.1 AP), possibly indicating ResNet101's bias towards specific angle domains.
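The domain-averaged score is the unweighted mean of the per-domain results. The short sketch below recomputes it from the per-domain values reported in Tables 6 and 7; the function name `domain_average` and the dictionary layout are illustrative assumptions.

```python
def domain_average(ap_per_domain):
    """Unweighted mean of the AP values over all domains."""
    values = list(ap_per_domain.values())
    if not values:
        raise ValueError("at least one domain is required")
    return sum(values) / len(values)


# Per-altitude results of ResNeXt-101-FPN as reported in Table 6.
RESNEXT_ALTITUDE = {"L": 56.8, "LM": 54.6, "M": 49.2, "MH": 65.0, "H": 78.3}
# Per-angle results of the two CenterNet variants as reported in Table 7.
RESNET101_ANGLE = {"A": 7.4, "AM": 35.8, "M": 20.5, "MR": 33.6, "R": 21.7}
RESNET18_ANGLE = {"A": 9.6, "AM": 29.5, "M": 26.3, "MR": 27.9, "R": 22.1}

if __name__ == "__main__":
    print(round(domain_average(RESNEXT_ALTITUDE), 1))
    print(round(domain_average(RESNET101_ANGLE), 1))
    print(round(domain_average(RESNET18_ANGLE), 1))
```

Because every domain carries the same weight regardless of how many images it contains, a model that is strong only in the most frequent domain is penalized. This is why the gap between the two CenterNet variants shrinks under this metric.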

Model            L     LM    M     MH    H     AP
ResNeXt-101-FPN  56.8  54.6  49.2  65    78.3  60.8
ResNet-50-FPN    32.8  29.8  23.5  40.5  48.9  35.1
Hourglass104     50.6  52.0  47.5  64.9  73.2  57.6
ResNet101        20.2  30.4  24.1  35.1  38.0  29.6
ResNet18         23.8  20.3  19.2  29.3  31.9  24.9
EfficientDet-D0  39.6  38.0  30.4  42.5  54.5  41.0
Table 6: Results on different altitude-domains. E.g. ResNeXt's AP performance in low-medium (LM) altitudes is 54.6 AP.

Model            A     AM    M     MR    R     AP
ResNeXt101-FPN   68.3  55.1  45.2  63.6  51.5  56.7
ResNet50-FPN     32.8  35.5  32.7  35.7  27.6  32.9
Hourglass104     66.4  42.1  49.7  58.7  46.9  52.76
ResNet101        7.4   35.8  20.5  33.6  21.7  23.8
ResNet18         9.6   29.5  26.3  27.9  22.1  23.1
EfficientDet-D0  26.9  47.0  40.5  40.3  36.8  38.3
Table 7: Results on different angle-domains. For example, ResNeXt's AP performance in medium-right (MR) angles (57-73) is 63.6 AP.

5.2 Single-Object Tracking

Model MOTA IDF1 MOTP MT ML FP FN Recall Prcn ID Sw. Frag
FairMOT-D34 [55] 39.0 44.8 23.6 17 17 3,604 9,445 57.2 77.8 307 1,687
FairMOT-R34 [55] 15.2 27.6 33.7 6 37 2,502 12,592 30.1 68.4 181 807
Tracktor++ [6] 55.0 69.6 25.6 62 4 7,271 3,550 85.5 74.2 165 347
Table 8: Multi-Object Tracking evaluation results for the Swimmer task.
Model MOTA IDF1 MOTP MT ML FP FN Recall Prcn ID Sw. Frag
FairMOT-D34 [55] 36.5 43.8 20.9 28 49 3,788 20,867 47.2 83.1 447 1,599
FairMOT-R34 [55] 30.5 40.8 27.3 29 127 4,401 28,999 40.2 81.6 285 1,588
Tracktor++ [6] 71.9 80.5 20.1 123 5 7,741 5,496 88.5 84.5 192 438
Table 9: Multi-Object Tracking evaluation results for the All-Objects-In-Water task.

Like VisDrone [58], we provide the success and precision curves for single-object tracking and compare models based on a single number, the success score. As comparison trackers, we choose the DiMP family (DiMP50, DiMP18, PrDiMP50, PrDiMP18) [7, 12] and Atom [11] because they were the foundation of many of the submitted trackers to the last VisDrone workshop [17]. We take pre-trained versions of these trackers and evaluate them on the testing set.

Figure 5: Success and precision plots for single-object tracking task (best viewed in color).

Figure 5 shows that the PrDiMP- and DiMP-family expectedly outperform the older Atom tracker in both success and precision. Surprisingly, PrDiMP50 slightly trails the accuracy of its predecessor DiMP50. Furthermore, all trackers' performances on SeaDronesSee are similar to or worse than on UAV123 (Atom with 65.0 success) [7, 12, 11], on which they were heavily optimized. We argue that in SeaDronesSee there is still room for improvement, especially considering that the clips feature precise meta information that may be helpful for tracking. Furthermore, in our experiments, the faster trackers DiMP18 and Atom run at approximately 27.1 fps on an Nvidia RTX 2080 Ti. However, we note that they are not capable of running in real-time on embedded hardware, a use case especially important for UAV-based SAR missions.
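The success score used to rank the trackers can be sketched as follows. The sketch assumes the common definition from the single-object tracking literature: the success curve gives, for each overlap threshold, the fraction of frames whose IoU with the ground truth exceeds that threshold, and the success score is the area under this curve. The 21 evenly spaced thresholds and the function names are assumptions, so the evaluation server may differ in detail.

```python
def success_curve(ious, num_thresholds=21):
    """Fraction of frames whose IoU exceeds each threshold in [0, 1]."""
    if not ious:
        raise ValueError("at least one frame is required")
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(iou > t for iou in ious) / len(ious) for t in thresholds]
    return thresholds, rates


def success_score(ious, num_thresholds=21):
    """Area under the success curve, approximated by the mean success rate."""
    _, rates = success_curve(ious, num_thresholds)
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Toy per-frame overlaps, not taken from the benchmark.
    print(success_score([0.9, 0.7, 0.4, 0.0]))
```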

5.3 Multi-Object Tracking

We use a similar evaluation protocol as the MOT benchmark [34]. That is, we report results for Multiple Object Tracking Accuracy (MOTA), Identification F1 Score (IDF1), Multiple Object Tracking Precision (MOTP), number of false positives (FP), number of false negatives (FN), recall (R), precision (P), ID switches (ID sw.), fragmentation occurrences (Frag). We refer the reader to [43] or the appendix for a thorough description of the metrics.
We train and evaluate FairMOT [55], a popular tracker that forms the basis of many trackers submitted to the VisDrone-MOT2020 challenge [16]. FairMOT-D34 employs a DLA34 [54] as its backbone, while FairMOT-R34 uses a ResNet34. We also include Tracktor++ [6], another state-of-the-art tracker, in our experiments. It performed well on the MOT20 [14] challenge and is conceptually simple.
Surprisingly, Tracktor++ was better than FairMOT in both tasks. One reason for this may be the detector used. Tracktor++ utilizes a Faster R-CNN with a ResNet50 backbone. In contrast, FairMOT-D34 and FairMOT-R34 use a CenterNet with a DLA34 and a ResNet34 backbone, respectively.

6 Limitations and Conclusions

This work serves as an introductory benchmark for UAV-based computer vision problems in maritime scenarios. We build the first large-scale data set for detecting and tracking humans in open water. Furthermore, it is the first large-scale benchmark providing full environmental information for every frame, offering great opportunities in the so-far restricted area of multi-modal object detection and tracking. We offer three challenges (object detection, single-object tracking, and multi-object tracking) through an evaluation server, where researchers can upload their predictions and compare them fairly. We hope that this benchmark accelerates the development of meta-data-aware object detectors and trackers. Moreover, we provide multi-spectral imagery for detecting humans in open water. These images are very promising in maritime scenarios because they capture wavelengths that set objects apart from the water background.

We note, however, that the data could be made more varied. Specifically, we want to emphasize that footage at night or during rain is of great importance in SAR missions. Furthermore, the variety of life jackets, water types, and subjects with different skin and clothing colors could be increased. We hope that the work at hand attracts more attention to UAV-based computer vision problems in maritime SAR scenarios.


We would like to thank Sebastian Koch, Hannes Leier and Aydeniz Soezbilir, without whose contribution this work would not have been possible.
This work has been supported by the German Ministry for Economic Affairs and Energy, Project Avalon, FKZ: 03SX481B.


  • [1] T. Adão, J. Hruška, L. Pádua, J. Bessa, E. Peres, R. Morais, and J. J. Sousa (2017) Hyperspectral imaging: a review on uav-based sensors, data processing and applications for agriculture and forestry. Remote Sensing 9 (11), pp. 1110. Cited by: §1.
  • [2] Aerial data accuracy – an experiment comparing 4 drone approaches. Note: 2021-03-01 Cited by: §3.2.
  • [3] Airbus Ship Detection Challenge. Note: 2021-03-01 Cited by: Table 1, §2.2.
  • [4] A. Albanese, V. Sciancalepore, and X. Costa-Pérez (2020) SARDO: an automated search-and-rescue drone-based solution for victims localization. arXiv preprint arXiv:2003.05819. Cited by: §1.
  • [5] A. Antonini, W. Guerra, V. Murali, T. Sayre-McCord, and S. Karaman (2018) The blackbird dataset: a large-scale dataset for uav perception in aggressive flight. In International Symposium on Experimental Robotics, pp. 130–139. Cited by: §2.3.
  • [6] P. Bergmann, T. Meinhardt, and L. Leal-Taixé (2019-10) Tracking without bells and whistles. In The IEEE International Conference on Computer Vision (ICCV). Cited by: §5.3, Table 9.
  • [7] G. Bhat, M. Danelljan, L. V. Gool, and R. Timofte (2019) Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6182–6191. Cited by: §5.2, §5.2.
  • [8] I. Bozcan and E. Kayacan (2020) Au-air: a multi-modal unmanned aerial vehicle dataset for low altitude traffic surveillance. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 8504–8510. Cited by: Table 1, §2.3.
  • [9] C. Corbane, L. Najman, E. Pecoul, L. Demagistri, and M. Petit (2010) A complete processing chain for ship detection using optical satellite imagery. International Journal of Remote Sensing 31 (22), pp. 5837–5854. Cited by: §1.
  • [10] D. Crisp (2004) The state-of-the-art in ship detection in synthetic aperture radar imagery. defence science and technology organization (dsto). Information Science Laboratory, Research Report No. DSTO-RR-0272. Cited by: §1.
  • [11] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg (2019) Atom: accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4660–4669. Cited by: §5.2, §5.2.
  • [12] M. Danelljan, L. V. Gool, and R. Timofte (2020) Probabilistic regression for visual tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7183–7192. Cited by: §5.2, §5.2.
  • [13] DarkLabel video/image labeling and annotation tool. Note: 2020-08-31 Cited by: §3.3.
  • [14] P. Dendorfer, H. Rezatofighi, A. Milan, J. Shi, D. Cremers, I. Reid, S. Roth, K. Schindler, and L. Leal-Taixé (2020) MOT20: a benchmark for multi object tracking in crowded scenes. arXiv preprint arXiv:2003.09003. Cited by: §5.3.
  • [15] D. Du, Y. Qi, H. Yu, Y. Yang, K. Duan, G. Li, W. Zhang, Q. Huang, and Q. Tian (2018) The unmanned aerial vehicle benchmark: object detection and tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 370–386. Cited by: Table 1, §1, §2.1, §2.3, §4.1.
  • [16] H. Fan, D. Du, L. Wen, P. Zhu, Q. Hu, H. Ling, M. Shah, J. Pan, A. Schumann, B. Dong, et al. (2020) VisDrone-mot2020: the vision meets drone multiple object tracking challenge results. In European Conference on Computer Vision, pp. 713–727. Cited by: §5.3.
  • [17] H. Fan, L. Wen, D. Du, P. Zhu, Q. Hu, H. Ling, M. Shah, B. Wang, B. Dong, D. Yuan, et al. (2020) VisDrone-sot2020: the vision meets drone single object tracking challenge results. In European Conference on Computer Vision, pp. 728–749. Cited by: §2.3, §5.2.
  • [18] M. Fonder and M. V. Droogenbroeck (2019-06) Mid-air: a multi-modal dataset for extremely low altitude drone flights. In Conference on Computer Vision and Pattern Recognition Workshop (CVPRW). Cited by: §2.3.
  • [19] A. Gallego, A. Pertusa, P. Gil, and R. B. Fisher (2019) Detection of bodies in maritime rescue operations using unmanned aerial vehicles with multispectral cameras. Journal of Field Robotics 36 (4), pp. 782–796. Cited by: §1, §1.
  • [20] R. Geraldes, A. Goncalves, T. Lai, M. Villerabel, W. Deng, A. Salta, K. Nakayama, Y. Matsuo, and H. Prendinger (2019) UAV-based situational awareness system using deep learning. IEEE Access 7, pp. 122583–122594. Cited by: §1.
  • [21] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §5.1, Table 5.
  • [22] M. Hsieh, Y. Lin, and W. H. Hsu (2017) Drone-based object counting by spatially regularized regional proposal network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4145–4153. Cited by: §2.1.
  • [23] K. Jedrasiak, D. Bereska, and A. Nawrat (2013) The prototype of gyro-stabilized uav gimbal for day-night surveillance. In Advanced technologies for intelligent systems of national border security, pp. 107–115. Cited by: §3.2.
  • [24] Y. Karaca, M. Cicek, O. Tatli, A. Sahin, S. Pasli, M. F. Beser, and S. Turedi (2018) The potential use of unmanned aircraft systems (drones) in mountain search and rescue operations. The American journal of emergency medicine 36 (4), pp. 583–588. Cited by: §1.
  • [25] B. Kiefer, M. Messmer, and A. Zell (2021) Leveraging domain labels for object detection from uavs. arXiv preprint arXiv:2101.12677. Cited by: §1, §5.1, §5.1, §5.1.
  • [26] R. Krajewski, J. Bock, L. Kloeker, and L. Eckstein (2018) The highd dataset: a drone dataset of naturalistic vehicle trajectories on german highways for validation of highly automated driving systems. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pp. 2118–2125. Cited by: §2.1.
  • [27] Q. Li, L. Mou, Q. Liu, Y. Wang, and X. X. Zhu (2018) HSF-net: multiscale deep feature embedding for ship detection in optical remote sensing imagery. IEEE Transactions on Geoscience and Remote Sensing 56 (12), pp. 7147–7161. Cited by: §2.2.
  • [28] S. Li and D. Yeung (2017) Visual object tracking for unmanned aerial vehicles: a benchmark and new motion models. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31. Cited by: Table 1, §2.1.
  • [29] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §5.1.
  • [30] K. Liu and G. Mattyus (2015) Fast multiclass vehicle detection on aerial images. IEEE Geoscience and Remote Sensing Letters 12 (9), pp. 1938–1942. Cited by: Table 4.
  • [31] E. Lygouras, N. Santavas, A. Taitzoglou, K. Tarchanidis, A. Mitropoulos, and A. Gasteratos (2019) Unsupervised human detection with an embedded vision system on a fully autonomous uav for search and rescue operations. Sensors 19 (16), pp. 3542. Cited by: §2.2.
  • [32] A. L. Majdik, C. Till, and D. Scaramuzza (2017) The zurich urban micro aerial vehicle dataset. The International Journal of Robotics Research 36 (3), pp. 269–273. Cited by: §2.3.
  • [33] M. Messmer, B. Kiefer, and A. Zell (2021) Gaining scale invariance in uav bird’s eye view object detection by adaptive resizing. arXiv preprint arXiv:2101.12694. Cited by: §1.
  • [34] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler (2016) MOT16: a benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831. Cited by: §5.3.
  • [35] B. Mishra, D. Garg, P. Narang, and V. Mishra (2020) Drone-surveillance for search and rescue in natural disaster. Computer Communications 156, pp. 1–10. Cited by: §1.
  • [36] M. Mueller, N. Smith, and B. Ghanem (2016) A benchmark and simulator for uav tracking. In European conference on computer vision, pp. 445–461. Cited by: Table 1, §2.1, §2.3.
  • [37] T. N. Mundhenk, G. Konjevod, W. A. Sakla, and K. Boakye (2016) A large contextual dataset for classification, detection and counting of cars with deep learning. In European Conference on Computer Vision, pp. 785–800. Cited by: §2.1.
  • [38] I. Nasr, M. Chekir, and H. Besbes (2019) Shipwrecked victims localization and tracking using uavs. In 2019 15th International Wireless Communications & Mobile Computing Conference (IWCMC), pp. 1344–1348. Cited by: §1.
  • [39] Object Detection on COCO test-dev. Note: 2021-03-01 Cited by: §5.1.
  • [40] F. Ofli, P. Meier, M. Imran, C. Castillo, D. Tuia, N. Rey, J. Briant, P. Millet, F. Reinhard, M. Parkan, et al. (2016) Combining human computing and machine learning to make sense of big (aerial) data for disaster response. Big data 4 (1), pp. 47–59. Cited by: §2.1.
  • [41] Z. Pei, X. Qi, Y. Zhang, M. Ma, and Y. Yang (2019) Human trajectory prediction in crowded scene using social-affinity long short-term memory. Pattern Recognition 93, pp. 273–282. Cited by: §2.1.
  • [42] D. K. Prasad, H. Dong, D. Rajan, and C. Quek (2019) Are object detection assessment criteria ready for maritime computer vision?. IEEE Transactions on Intelligent Transportation Systems 21 (12), pp. 5295–5304. Cited by: §1.
  • [43] E. Ristani, F. Solera, R. Zou, R. Cucchiara, and C. Tomasi (2016) Performance measures and a data set for multi-target, multi-camera tracking. In European conference on computer vision, pp. 17–35. Cited by: §5.3.
  • [44] K. T. San, S. J. Mun, Y. H. Choe, and Y. S. Chang (2018) UAV delivery monitoring system. In MATEC Web of Conferences, Vol. 151, pp. 04011. Cited by: §1.
  • [45] H. Su, S. Wei, M. Yan, C. Wang, J. Shi, and X. Zhang (2019) Object detection and instance segmentation in remote sensing imagery based on precise mask r-cnn. In IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing Symposium, pp. 1454–1457. Cited by: Table 4.
  • [46] M. Tan, R. Pang, and Q. V. Le (2020) Efficientdet: scalable and efficient object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10781–10790. Cited by: §5.1, Table 5.
  • [47] J. C. van Gemert, C. R. Verschoor, P. Mettes, K. Epema, L. P. Koh, and S. Wich (2014) Nature conservation drones for automatic localization and counting of animals. In European Conference on Computer Vision, pp. 255–270. Cited by: §2.1.
  • [48] S. Vicente, J. Carreira, L. Agapito, and J. Batista (2014) Reconstructing pascal voc. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 41–48. Cited by: Table 4.
  • [49] Z. Wu, K. Suresh, P. Narayanan, H. Xu, H. Kwon, and Z. Wang (2019) Delving into robust object detection from unmanned aerial vehicles: a deep nuisance disentanglement approach. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1201–1210. Cited by: §2.3.
  • [50] G. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, M. Datcu, M. Pelillo, and L. Zhang (2018) DOTA: a large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3974–3983. Cited by: Table 1, §2.2, §4.1, Table 4.
  • [51] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He (2017) Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1492–1500. Cited by: §5.1, Table 5.
  • [52] S. Yang, P. Luo, C. Loy, and X. Tang (2016) Wider face: a face detection benchmark. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5525–5533. Cited by: §4.1.
  • [53] S. Yeong, L. King, and S. Dol (2015) A review on marine search and rescue operations using unmanned aerial vehicles. International Journal of Marine and Environmental Sciences 9 (2), pp. 396–399. Cited by: §1.
  • [54] F. Yu, D. Wang, E. Shelhamer, and T. Darrell (2018) Deep layer aggregation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2403–2412. Cited by: §5.3.
  • [55] Y. Zhang, C. Wang, X. Wang, W. Zeng, and W. Liu (2020) FairMOT: on the fairness of detection and re-identification in multiple object tracking. arXiv e-prints, pp. arXiv–2004. Cited by: §5.3, Table 9.
  • [56] X. Zhou, D. Wang, and P. Krähenbühl (2019) Objects as points. arXiv preprint arXiv:1904.07850. Cited by: §5.1, Table 5.
  • [57] P. Zhu, L. Wen, X. Bian, H. Ling, and Q. Hu (2018) Vision meets drones: a challenge. arXiv preprint arXiv:1804.07437. Cited by: Table 1, §1, §2.1, §5.1.
  • [58] P. Zhu, L. Wen, D. Du, X. Bian, Q. Hu, and H. Ling (2020) Vision meets drones: past, present and future. arXiv preprint arXiv:2001.06303. Cited by: §5.2.
  • [59] P. Zhu, L. Wen, D. Du, X. Bian, H. Ling, Q. Hu, Q. Nie, H. Cheng, C. Liu, X. Liu, et al. (2018) Visdrone-det2018: the vision meets drone object detection in image challenge results. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, pp. 0–0. Cited by: §5.1.
  • [60] F. Zimmermann, C. Eling, L. Klingbeil, and H. Kuhlmann (2017) Precise positioning of uavs: dealing with challenging rtk-gps measurement conditions during automated uav flights. ISPRS Annals of Photogrammetry, Remote Sensing & Spatial Information Sciences 4. Cited by: §3.2.