Deep Learning-based Human Detection for UAVs with Optical and Infrared Cameras: System and Experiments

by Timo Hinzmann, et al.
ETH Zurich

In this paper, we present our deep learning-based human detection system that uses optical (RGB) and long-wave infrared (LWIR) cameras to detect, track, localize, and re-identify humans from UAVs flying at high altitude. In each spectrum, a customized RetinaNet network with a ResNet backbone provides human detections, which are subsequently fused to minimize the overall false detection rate. We show that by optimizing the bounding box anchors and augmenting the image resolution, the number of missed detections from high altitudes can be decreased by over 20 percent. Our proposed network is compared to different RetinaNet and YOLO variants, and to a classical optical-infrared human detection framework that uses hand-crafted features. Furthermore, along with the publication of this paper, we release a collection of annotated optical-infrared datasets recorded with different UAVs during search-and-rescue field tests, as well as the source code of the implemented annotation tool.






1 Introduction

The need for robust human detection algorithms is tremendous and has massively increased over the past years due to the vast number of emerging applications in the field [Dollar2009, Nguyen2016]. With UAV technology booming, research on human detection from aerial views has also steadily evolved and attracted much interest for real-world search-and-rescue (SAR) missions [Rudol2008, Andriluka2010, Blondel2014, Kummerle2016, Bejiga2017]. While the majority of the earlier work on human detection incorporated hand-crafted features, more recent publications make use of deep learning (DL)-based detectors, mostly in the form of convolutional neural networks (CNNs) [DeOliveira2016, Herrmann2016, Bejiga2017]. In the superordinate field of object detection, deep CNNs have been established for several years [Krizhevsky2012, Simonyan2014], and current state-of-the-art detectors produce impressive results with possible real-time performance [Redmon2018, Lin2017_RetinaNet]. A few publications to date apply these insights from object detection to human detection in aerial images [Bondi2018, Wang2018, Liu2018, Perera2018, Chang2018, Yun2019]. However, relying on either only optical or only long-wave infrared (LWIR, also referred to as thermal infrared, TI) images, no work so far has considered deep CNNs for combining information from both TI and optical images. As CNNs need a vast amount of data to outperform hand-crafted detectors, the lack of publicly available data might still limit the ubiquity of deep CNNs for human detection in both optical and TI aerial imagery. Available data in the field consists of either only optical [stanford, uav123] or only TI images [eth_tir, otcvbs1, ptb_tir], and no publicly available dataset provides real-world data collected in the field for an expressive evaluation of human detection algorithms in search-and-rescue scenarios.
Our publication provides such a collection of optical-TI field datasets for extensive training and testing of human detection algorithms. Furthermore, besides optimizing the raw detections, we propose a complete framework to detect, track, localize, and re-identify humans, as shown in Fig. 1.

Figure 1: Proposed deep learning-based human detection system that uses optical (RGB) and long-wave infrared (LWIR) cameras to detect, track, localize, and re-identify humans from UAVs flying at high altitudes.

2 Related Work

Early work in the field of human detection from UAVs [Rudol2008, Gaszczak2011, Flynn2013, Blondel2014, Vempati2015, Kummerle2016] strongly resembled that of object detection and applied classical methods such as the Haar-like features introduced by [Viola2001], the Felzenszwalb detector [Felzenszwalb2010], or HOG features together with linear SVM classifiers as introduced by [Dalal2005]. A lot of this early work [Rudol2008, Gaszczak2011, Flynn2013, Blondel2014_2, Kummerle2016] concluded that the combination of TI and optical images is highly beneficial for the task of human detection from high aerial views.

With the rise of deep learning, the application of CNNs became well-established for the task of human detection from aerial views [DeOliveira2016, Sathyan2016, Herrmann2016, Bejiga2017]. Comparing against more classical feature extractors and detectors such as Haar, HOG, or SVM, these works consistently reported an improvement in detection performance as well as better generalization when using CNNs. Research in object detection progressed quickly and gradually yielded deeper, better, and faster-performing object detection networks. On the side of two-stage detectors, the R-CNN network and its successors [Girshick2014, Girshick2015, Ren2015], as well as the FPN by [Lin2017_FPN], gained a lot of attention. On the other hand, the first one-stage detectors, such as YOLO and its successors [Redmon2016, Redmon2017, Redmon2018] or the SSD by [Liu2016], impressed with high frame rates while still achieving competitive detection performance. Some recent work in the field already includes such state-of-the-art object detectors, for example adaptations of the R-CNN network [Liu2018, Perera2018] or one-stage detectors such as YOLO9000 or SSD [Chang2018, Yun2019]. The choice between one-stage and two-stage object detectors, however, entails a speed-accuracy trade-off, as shown by [Bondi2018]. This is in accordance with recent findings by [Lin2017_RetinaNet]: their one-stage object detector framework, called RetinaNet, addresses this discrepancy between inference speed and detection performance and closes the gap between state-of-the-art one-stage and two-stage object detection frameworks. Wang et al. [Wang2018], for instance, proposed RetinaNet [Lin2017_RetinaNet] as a solution for object detection in aerial views. RetinaNet was evaluated against other one-stage and two-stage detectors, namely SSD and Faster R-CNN, on the Stanford drone dataset [stanford].
The evaluation led to a similar conclusion as [Lin2017_RetinaNet], showing state-of-the-art performance of RetinaNet compared to two-stage detectors while running at speeds comparable to those of one-stage detectors. This motivates the use of these state-of-the-art one-stage detectors also for our task, since their fast inference speeds are crucial when running on board a UAV with limited computing power.

The vast majority of deep learning architectures are trained and evaluated on optical imagery. In fact, to the best of our knowledge, the only work to date that uses a DL-based human detector on thermal imagery from a UAV was conducted by [Bondi2018] in 2018, where a Faster R-CNN framework was applied to detect poachers in thermal images. Moreover, none of these publications make use of both the optical and thermal domain together with recent state-of-the-art deep object detection networks. This is likely due to the lack of publicly available data: publicly available datasets consist of either only optical images [stanford, uav123] or only thermal images [eth_tir, otcvbs1, ptb_tir]. In this paper, we approach these research gaps as follows: Firstly, we present a comprehensive performance comparison of state-of-the-art human detection architectures trained with optical and thermal imagery. In particular, the evaluation comprises YOLO and RetinaNet with different ResNet backbones. For reference, we additionally provide the results of a hand-crafted thermal-optical human detection pipeline [Kummerle2016]. Secondly, to the best of our knowledge, this paper constitutes the first publication on deep learning-based human detection from UAVs that combines optical and thermal imagery. The employed merging strategy is explained in detail in Sec. 3.3. Thirdly, the evaluation is conducted on datasets recorded during SAR field tests. The collection of annotated datasets and the implemented annotation tool are released along with this paper. Despite the great success of recent object detection research, very small sample sizes as well as strong viewpoint variations in aerial images make the task challenging. Recent work by [Liu2018] and [Chang2018] clearly shows that adaptations of current object detection networks are crucial for decent performance on aerial images from UAVs.
In this paper, we propose to use optimized custom anchors and up-scaled images to boost the detection from high altitudes and reveal this performance gain in a detailed evaluation. Finally, we describe in detail our approach to track, localize, and re-identify humans to improve the overall performance beyond the raw detection.

3 Human Detection

3.1 State-of-the-Art Object Detectors Revisited

Our proposed pipeline uses a customized RetinaNet network with a ResNet50 backbone, introduced by [Lin2017_RetinaNet] as a state-of-the-art one-stage object detector. As already outlined, this choice is mainly due to the limited computing resources on board UAVs, as well as the detection performance of RetinaNet being similar to that of two-stage detectors in recent publications [Lin2017_RetinaNet, Wang2018]. Our implementation follows the original one and introduces crucial customizations, as outlined in Sec. 3.2. The proposed framework is then compared against YOLOv3 with a darknet-53 backbone, another state-of-the-art one-stage object detector. Both networks follow a similar basic architecture, collecting features at different scales using their respective backbone networks and subsequently regressing and classifying the output bounding boxes. As one of their main contributions, the authors of RetinaNet introduce a novel focal loss based on the well-known cross-entropy loss. Since YOLOv3 still relies on the standard cross-entropy loss, the focal loss constitutes the main difference between these two object detection frameworks. In [Lin2017_RetinaNet], Lin et al. identify the significant imbalance between image background and foreground in many object detection tasks as the main reason for the lagging performance of conventional one-stage detectors. They were able to show that a conventional cross-entropy loss is easily overwhelmed by the vast amount of background samples seen during training when regression and classification are run directly on top of a dense feature map. By adding a modulating factor to the cross-entropy loss, they define the novel focal loss with a tunable focusing parameter, able to counteract this large imbalance.
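The down-weighting of easy examples by the focal loss can be illustrated with a minimal scalar sketch. The default values gamma = 2 and alpha = 0.25 are those reported in [Lin2017_RetinaNet]; the function names are ours, not from the paper:

```python
import math

def cross_entropy(p, y):
    """Standard binary cross-entropy for predicted probability p and label y."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: -alpha_t * (1 - p_t)^gamma * log(p_t).
    The modulating factor (1 - p_t)^gamma shrinks the loss of
    well-classified (mostly background) samples."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

For a confidently classified background sample (y = 0, p = 0.05) the focal loss is several orders of magnitude smaller than the cross-entropy, so the huge number of easy background anchors no longer dominates the gradient.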

3.2 Network Customization

3.2.1 Optimal Anchor Selection

Both RetinaNet and YOLOv3 use nine anchor bounding boxes for the final bounding box regression. While RetinaNet uses fixed hand-picked anchors, YOLOv3 uses k-means dimension clustering on the COCO dataset [coco] to find optimal anchors. Furthermore, during training, the strategies for assigning anchor bounding boxes to ground-truth bounding boxes differ quite significantly between the two networks. Lin et al. [Lin2017_RetinaNet] use an adjusted assignment rule originating from the region proposal network of Faster R-CNN [Ren2015]. Anchors are assigned to ground-truth boxes for an IOU value of 0.5 or higher and to background for IOU values in a lower interval. Each anchor is assigned to at most one ground-truth box, and all remaining unassigned anchors are ignored during training. Redmon et al. [Redmon2018], on the other hand, do not use such a dual IOU threshold. Instead, they simply assign one anchor per ground-truth bounding box, namely the anchor with the largest IOU value. If an anchor is not the best but overlaps a ground-truth box with more than 0.5 IOU, the prediction is ignored. All other anchors, not assigned to any ground-truth boxes, only incur a loss for the objectness prediction. The need for customized anchors in our pipeline is crucial: Using the standard anchors for RetinaNet on our training dataset showed that many samples in both optical and thermal images did not contribute to the training process because their very small size prevented them from reaching the required IOU value with any of the standard anchor bounding boxes. Resolving this by changing the anchor sizes added at each level in the original publication [Lin2017_RetinaNet] to custom sizes sufficiently increased the number of bounding boxes contributing to the training. More specifically, the fraction of samples contributing to the final stage of training increased from 57.7% to 97% on the optical side and from 33.2% to 88.1% on the thermal side of our training dataset.
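YOLO-style anchor selection can be sketched as k-means clustering over box widths and heights with 1 - IOU as the distance measure. The toy boxes and names below are ours, purely for illustration:

```python
import random

def iou_wh(a, b):
    """IOU of two boxes given as (w, h), aligned at a common top-left corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)

def kmeans_anchors(boxes, k, iters=50, seed=0):
    """Cluster (w, h) pairs into k anchors; assignment maximizes IOU,
    i.e. minimizes the 1 - IOU distance used by YOLOv2/v3."""
    rng = random.Random(seed)
    centers = rng.sample(boxes, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for b in boxes:
            best = max(range(k), key=lambda i: iou_wh(b, centers[i]))
            clusters[best].append(b)
        centers = [
            (sum(w for w, _ in c) / len(c), sum(h for _, h in c) / len(c))
            if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return sorted(centers)
```

Run on a set of bounding boxes dominated by tiny humans, the cluster centers shrink accordingly, which is exactly the effect the custom anchor sizes above achieve for RetinaNet.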

3.2.2 Image Resolution Augmentation

A second natural way of improving the detection of very small bounding boxes is to simply augment their size. Final detection performance can be improved significantly by using augmented or higher-resolution images, as also stated by Bejiga et al. [Bejiga2017]. Since both networks are fully convolutional, this can be achieved by increasing the input image size during inference. Doubling the standard image sizes of the RetinaNet variants brought a considerable performance improvement, as shown in Sec. 6.
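Because the networks are fully convolutional, doubling the resolution is a pure inference-time change to the input tensor. A toy nearest-neighbor 2x upscaling illustrates the operation (in practice one would use the framework's bilinear resize; this minimal stand-in is ours):

```python
def upscale2x(img):
    """Nearest-neighbor 2x upscaling of a 2D image given as a list of rows.
    Each pixel is duplicated horizontally and each row vertically."""
    out = []
    for row in img:
        doubled = [px for px in row for _ in range(2)]
        out.append(doubled)
        out.append(list(doubled))  # copy so rows stay independent
    return out
```

A tiny object that covered only a couple of pixels now spans four times the area, making it easier for the smallest anchors to reach the required overlap.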

3.3 Optical-Infrared Merging Strategy

The proposed pipeline utilizes optical and thermal imagery. This is especially useful to further reduce false positives during day flights, to detect humans during night flights, or to find humans under aggravated optical conditions, for instance, when obscured by shadows from large rocks or trees. The camera intrinsic and extrinsic parameters are used to map feature positions from the optical to the thermal image and vice versa. However, small inaccuracies in the calibration, the image triggering, or the exposure time may result in pixel offsets between the two spectra. Especially in applications with fast-moving and fast-turning UAVs, this may become an issue. To account for these inaccuracies, we propose to use a sliding window to match bounding boxes between the spectra, as illustrated in Fig. 2. More specifically, a rectangle nine times the size of the mapped bounding box is searched for a target bounding box using 36 sliding window steps in total. Matching of the sliding window and target bounding boxes is done similarly to the standard procedure proposed by [Dollar2009], using an IOU threshold of 0.5. After a completed matching step, a logical OR merging scenario is used, considering all bounding boxes from both domains while averaging the respective prediction scores.

Figure 2: Exemplary matching process. Original detection in the thermal image on the left in green. Raw mapped bounding box in the right optical image in red, sliding window area in yellow and final matched bounding box (detection) in green.
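The sliding-window matching can be sketched as follows. Boxes are assumed to be (x, y, w, h) tuples, and the 36 steps are laid out as a 6x6 grid over the 3w x 3h search area (the paper states 36 steps but not their exact layout, so the grid is our assumption):

```python
def iou(a, b):
    """IOU of two axis-aligned boxes given as (x, y, w, h)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

def match_across_spectra(mapped_box, candidates, steps=6, iou_thresh=0.5):
    """Shift the mapped box over a search area nine times its size (3w x 3h)
    and return the candidate detection with the best IOU above the threshold,
    or None if nothing matches."""
    x, y, w, h = mapped_box
    best, best_iou = None, iou_thresh
    for i in range(steps):
        for j in range(steps):
            shifted = (x - w + i * (2 * w) / (steps - 1),
                       y - h + j * (2 * h) / (steps - 1), w, h)
            for c in candidates:
                v = iou(shifted, c)
                if v >= best_iou:
                    best, best_iou = c, v
    return best
```

A thermal detection mapped a few pixels off its optical counterpart is still matched, while distant detections remain unmatched and enter the OR merge as single-spectrum boxes.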

3.4 Network Training

To train the network, publicly available external datasets, as well as internal data recorded by one of our UAVs, are used. Using the implemented annotation tool, a total of optical sequences consisting of images, and thermal sequences containing images, are available for training. The human bounding boxes in these sequences are annotated with a distinct ID for each individual human, attributes for the pose (upright, sitting, or lying), and an occlusion attribute. Altogether, the newly collected datasets add up to a total of human bounding boxes in the optical images, and human bounding boxes in the thermal images. In addition to our internal datasets, all available public datasets containing humans in both optical [minidrone, stanford, uav123] and thermal [eth_tir, otcvbs1, otcvbs11, ptb_tir, vot_tir] aerial views have been gathered. Together with these external datasets, the final amount of available data consisted of around annotations in over optical images, and around annotations in over thermal images.

The final training of the two object detectors was conducted using a two-stage training procedure: Starting with pre-trained ImageNet [Russakovsky2015] weights, an initial pre-training step was carried out using all the data at hand, including our newly collected dataset and all available external datasets. The best results in pre-training were achieved by freezing all backbone weights for both RetinaNet and YOLOv3 and only training and adapting all other, randomly initialized weights to the novel domain. Subsequently, the resulting weights of the pre-training step were used to initialize a final fine-tuning step. In this step, only the newly collected, domain-specific data was used to train all layers of both networks. Training of the two object detectors was carried out according to the original publications of RetinaNet [Lin2017_RetinaNet] and YOLOv3 [Redmon2018]. Both networks were trained on a cluster using Nvidia GeForce GTX 1080 Ti GPUs. While YOLOv3 originally uses a smaller input image together with a multi-scale training strategy [Redmon2018], RetinaNet uses a larger single-size input image during training. To be able to train both networks with a similar batch size of eight images, RetinaNet was trained on a total of eight GPUs while YOLOv3 was trainable on a single GPU. Both networks include data augmentation and other training features mentioned in the original publications. Randomly selected sequences from the external datasets were used as validation sets in pre-training, and a fixed sequence of our own dataset was used as the validation set for the final fine-tuning step.

4 Human Localization and Re-Detection

This section describes our approach to track, localize, and re-detect humans in the optical spectrum based on qualitative results.

Human Tracking:

For every observation classified as human, the detector described in Sec. 3 creates a new victim ID. However, the final goal of the proposed framework is to associate every victim with a unique ID and to compute its position or path in 3D. As a first step towards this goal, an object tracker bundles all detections of a victim as long as the human is within the field of view of the camera (cf. Fig. 3). The human tracking is tested on the optical image stream with a frame-rate of , resulting in potentially large pixel displacements of an observation between two subsequent frames. To reduce the pixel displacement and simultaneously increase the speed of the tracker, the image is half-sampled. The object trackers implemented in [bradski2000opencv] were tested, including MIL [Babenko09], KCF [henriques2012exploiting], and GOTURN [held2016learning], among which CSRT [lukezic2017discriminative] performed best and was selected.

Figure 3: Human tracking: Humans that are tracked across subsequent frames (, ) are assigned the same human ID.
Human Localization:

Given a track of observations in the form of bounding boxes and the corresponding camera poses, the 3D path of the object can be estimated. The camera poses are assumed to be given as input to the framework. As the human may be non-static, two-view triangulation of consecutive observations is used. The center of the bounding box is selected for triangulation, as illustrated in Fig. 4.


Figure 4: Human localization: Based on two subsequent detections, the geo-referenced human position can be triangulated, as visualized in the satellite orthoimage.
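The paper does not specify the triangulation method; a common choice for two bearing rays is the midpoint method, sketched here with numpy (function name and interface are ours):

```python
import numpy as np

def triangulate_midpoint(c1, d1, c2, d2):
    """Midpoint of the shortest segment between two rays x = c_i + t_i * d_i,
    where c_i is a camera center and d_i the bearing toward the bounding box
    center. For intersecting rays this is the exact 3D point."""
    d1 = d1 / np.linalg.norm(d1)
    d2 = d2 / np.linalg.norm(d2)
    # Solve min_t |(c1 + t1*d1) - (c2 + t2*d2)|^2 via least squares
    A = np.stack([d1, -d2], axis=1)                  # 3x2 system matrix
    t = np.linalg.lstsq(A, c2 - c1, rcond=None)[0]   # (t1, t2)
    p1 = c1 + t[0] * d1
    p2 = c2 + t[1] * d2
    return 0.5 * (p1 + p2)
```

With noisy detections the two rays will not intersect exactly, and the midpoint gives a sensible compromise between the two observations.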
Metric Outlier Rejection:

The metric bounding box area is used to reject false detections as follows: Given two consecutive observations of an object and the triangulated 3D object position, the depth of the object can be inferred. The depth is then used to transfer the bounding box from pixel coordinates to meters, resulting in four corner points (clockwise) from which the metric area of the bounding box follows. Objects with bounding box areas above a threshold are classified as outliers and rejected, as visualized in Fig. 5.


Figure 5: Metric outlier rejection: The detection is rejected if the estimated metric area of the bounding box is above a threshold .
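Under a pinhole camera model, the pixel-to-meter conversion reduces to scaling by depth over focal length. The following sketch assumes exactly that simplification; the threshold of 2 square meters is a hypothetical value, not the paper's:

```python
def metric_bbox_area(w_px, h_px, depth, fx, fy):
    """Approximate bounding-box area in square meters from its pixel size,
    the triangulated depth (m), and the focal lengths fx, fy (px),
    assuming a pinhole camera model."""
    return (w_px * depth / fx) * (h_px * depth / fy)

def is_outlier(w_px, h_px, depth, fx, fy, max_area_m2=2.0):
    """Reject detections whose estimated metric area exceeds the threshold,
    e.g. rocks or vehicles far larger than a human."""
    return metric_bbox_area(w_px, h_px, depth, fx, fy) > max_area_m2
```

A 40 x 20 px detection triangulated at 100 m depth with fx = fy = 1000 px would span roughly 8 square meters, far too large for a human, and is rejected.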
Particle Filter:

To handle occlusions and re-detections, and to incorporate a probabilistic motion model, a particle filter (PF) is initialized for every human. For computational reasons, the PF is restricted to two dimensions, in UTM coordinates. If necessary, the altitude of the victim can be queried from the existing map. The PF implementation and notation closely follow [Simon2006] (cf. Fig. 6):


  • Initialization: Given the first triangulated position, randomly draw the initial particles from a distribution centered at this position.

  • Propagation: In the propagation step, compute the a priori particles using a random walk motion model, assuming a maximum human velocity in both the x- and y-direction: each particle is displaced by a random velocity, drawn from the corresponding uniform distribution, multiplied by the time difference between two consecutive frames in seconds.

  • Measurement: Given a new observation associated with the victim ID, compute the relative likelihood of this observation for every particle by evaluating the conditional PDF based on the measurement equation, assuming additive measurement noise. Normalize the likelihoods so that they sum to one.

  • Resampling: Draw a posteriori particles based on the normalized likelihoods using SR [Kozierski2013].

Figure 6: Particle filter: The particles for human ID 1 are visualized as blue points, the trajectory flown by the Rega drone is shown in yellow.
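The four steps above can be sketched as a small, self-contained 2D particle filter. Particle count, standard deviations, maximum velocity, and the Gaussian measurement likelihood are illustrative assumptions, not values from the paper:

```python
import math
import random

class HumanPF:
    """2D particle filter with a bounded random-walk motion model."""

    def __init__(self, pos, n=400, init_std=5.0, v_max=2.0, meas_std=3.0, seed=0):
        self.rng = random.Random(seed)
        self.v_max, self.meas_std = v_max, meas_std
        # Initialization: particles scattered around the first triangulation
        self.particles = [(pos[0] + self.rng.gauss(0, init_std),
                           pos[1] + self.rng.gauss(0, init_std)) for _ in range(n)]

    def propagate(self, dt):
        # Random walk: velocity uniform in [-v_max, v_max] in x and y
        self.particles = [(x + self.rng.uniform(-self.v_max, self.v_max) * dt,
                           y + self.rng.uniform(-self.v_max, self.v_max) * dt)
                          for x, y in self.particles]

    def update(self, z):
        # Measurement: Gaussian likelihood of observation z per particle
        w = [math.exp(-((x - z[0]) ** 2 + (y - z[1]) ** 2)
                      / (2 * self.meas_std ** 2)) for x, y in self.particles]
        total = sum(w)
        w = [wi / total for wi in w]
        # Resampling: systematic resampling over the normalized weights
        n = len(self.particles)
        cum, new, u, i = w[0], [], self.rng.uniform(0, 1.0 / n), 0
        for _ in range(n):
            while u > cum and i < n - 1:
                i += 1
                cum += w[i]
            new.append(self.particles[i])
            u += 1.0 / n
        self.particles = new

    def estimate(self):
        n = len(self.particles)
        return (sum(x for x, _ in self.particles) / n,
                sum(y for _, y in self.particles) / n)
```

After a few propagate/update cycles the particle cloud concentrates around the repeatedly observed position, which is how occluded or intermittent detections are bridged.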
Human re-detection:

If the UAV flies over a previously visited area and the detector returns an object classified as human, the algorithm needs to decide if the new detection can be associated with an already observed human or if it is, in fact, a new victim (cf. Fig. 7).

Figure 7: Human re-detection: The algorithm needs to decide if the new detection can be associated with an already observed human or if it is, in fact, a new victim. On the right, the histogram similarity is evaluated based on the Intersection metric. The query image is patch number 1.

Applying Bayes' theorem, the probability that the new observation belongs to an already observed human can be computed from a measurement likelihood and a prior, normalized over all humans. Appearance-based and spatial information are used to compute these terms. Firstly, the conditional likelihood is computed based on the triangulated location of the new observation and all existing humans currently tracked by PFs. If no existing human explains the observation sufficiently well, a new victim with a corresponding PF is initialized. Otherwise, the new detection is associated with the human that maximizes the detection probability.
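The fusion of location likelihood and appearance prior is a direct application of Bayes' rule; a minimal sketch (the function name is ours):

```python
def posterior(likelihoods, priors):
    """Bayes' rule: p(human_i | z) is proportional to
    p(z | human_i) * p(human_i), normalized over all tracked humans."""
    joint = [l * p for l, p in zip(likelihoods, priors)]
    total = sum(joint)
    return [j / total for j in joint]
```

For two tracked humans with equal appearance priors but location likelihoods 0.9 and 0.1, the posterior simply mirrors the likelihood ratio, and the detection is associated with the first human.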

Secondly, the prior, which is the probability of observing a given human, is inferred from the similarity between a patch of that human and the newly detected human patch using a binary classifier. Since the humans are detected from a large distance, we base the decision of whether two patches contain the same person solely on color information. Tab. 1 presents the similarity between patches computed by comparing their color histograms using the metrics [Bradski2000] Correlation, Chi-square, Intersection, and Bhattacharyya [bhattacharyya1943measure]. The patch numbers 1 to 4 correspond to the detections shown in Fig. 7. Note that for Correlation and Intersection, higher values correspond to higher similarity, whereas for Chi-square and Bhattacharyya, lower values correspond to higher similarity. Patch numbers 1, 2, and 3 are observations of the same human. However, this is not reflected by the results in Tab. 1, as the background influences the color histogram. Therefore, using the GrabCut algorithm [rother2004grabcut], the background is automatically subtracted (cf. Fig. 7) before computing the histogram, which improves the re-identification results, as shown in Tab. 2. Using the sigmoid function, for instance, the results from the Intersection metric are mapped to values between 0 and 1, as shown in Fig.


Method          Patch 1   Patch 2   Patch 3   Patch 4
Correlation     1.0       0.71      0.83      0.83
Chi-Square      0.0       8.85      2.39      2.52
Intersection    2.89      1.41      1.44      1.57
Bhattacharyya   0.0       0.53      0.37      0.44
Table 1: Quantitative results for Fig. 7: Patch similarity via color-histogram comparison without background subtraction.

Method          Patch 1   Patch 2   Patch 3   Patch 4
Correlation     1.0       0.57      0.83      0.09
Chi-Square      0.0       12.33     1.9       44.92
Intersection    4.12      2.31      2.25      0.51
Bhattacharyya   0.0       0.52      0.41      0.83
Table 2: Quantitative results for Fig. 7: Patch similarity via color-histogram comparison with background subtraction.
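The histogram comparison underlying Tab. 1 and 2 can be imitated in plain Python. The paper uses OpenCV's histogram metrics; the bin count, sigmoid offset, and scale below are illustrative assumptions of ours:

```python
import math

def histogram(values, bins=8, lo=0, hi=256):
    """Build a simple 1D histogram over intensity values in [lo, hi)."""
    h = [0] * bins
    for v in values:
        h[min(int((v - lo) * bins / (hi - lo)), bins - 1)] += 1
    return h

def intersection(h1, h2):
    """Histogram intersection: sum of bin-wise minima (higher = more similar)."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def similarity_prob(score, offset=2.0, scale=1.0):
    """Map an intersection score to (0, 1) with a sigmoid, usable as the
    appearance prior in the Bayes association step."""
    return 1.0 / (1.0 + math.exp(-scale * (score - offset)))
```

Two patches of the same person yield overlapping histograms and a high intersection score, which the sigmoid maps to a prior near 1; disjoint color distributions map to a prior near 0.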

5 Hardware & Experiment Preparation

5.1 Platform and Sensors

The sensorpod used for our experiments is shown in Fig. 7(c). It consists of an IDS UI-5261SE-C-HQ-R4 RGB camera with a C-mount lens and a horizontal field of view of , and a FLIR Tau2 infrared camera with a resolution of pixels. The FLIR Tau2 is mounted on a Teax image grabber with a USB2 interface. The GPU, a Jetson TX2 mounted on an Auvidea J120 carrier board, is used for network inference at run-time. It is connected via Ethernet to the CPU, an UP board with an Intel Atom processor. The CPU has an MSATA storage of and handles the triggering of the RGB camera, the infrared camera, and the ADIS16448 IMU. The casing features a plexiglass and a Germanium window for the optical and infrared camera, respectively, and a fan for additional air circulation. The two UAV platforms used for the various test flights are shown in Fig. 7(a) and 7(b).

(a) Techpod
(b) Rega drone
(c) Sensorpod with optical-infrared stereo rig.
Figure 8: The sensorpod with optical-infrared stereo rig carried by the Rega drone.

5.2 Geometric Optical-Infrared Camera and Camera-IMU Calibration

The camera intrinsics (focal length, principal point), distortion parameters, and optical-infrared extrinsics (relative pose between the cameras and IMU) are required for optical-infrared image fusion and metric human localization. For this purpose, different optical-infrared, i.e., dual-modal, calibration targets have been developed and improved over time. A dual-modal calibration target allows direct calibration of the relative pose between the thermal and optical camera without needing a camera-IMU calibration as an intermediate step. That is, the optical-infrared camera pair can be calibrated as a stereo rig with one single calibration dataset using Kalibr [Rehder2012]. The evolution of our developed calibration targets is presented in Tab. 3 and was inspired by the related publications listed in Tab. 4. The different calibration targets shown in Tab. 3 and 4 are classified based on the taxonomy proposed in [Rangel2014]: Initial works utilized classical optical checkerboard calibration targets and heated the target with a flood lamp (cf. target 1 or [Prakash2006]). Based on the emissivity difference of black and white squares, the pattern can be made visible also in the TI spectrum. However, fuzzy transitions between black and white edges lead to missed or inaccurate corner detections and unsatisfying calibration results. Based on this finding, we attempted to increase the sharpness of the edges by using materials with contrary emissivity properties. For this we designed target 2, which is made out of black colored wood (emissivity [Lide2004]) and aluminum (emissivity [Lide2004]). Likewise, the authors of [Skala2011, Vidas2012, Saponaro2015] focused on improving the contrast of checkerboard targets using different materials and masks. However, as for instance pointed out by [Yu2013], corner detection remains inherently error-prone in the TI spectrum, and the usage of circular features is advised instead.
Our final dual-modal calibration target 3 consists of circular features, laser-cut into an aluminum calibration target and filled with machine-cut black wooden plates.

Target 1: Paper print-out glued on wood. Working principle: color emissivity difference (black, white).

Target 2: Black colored wooden squares and aluminum squares. Working principles: color emissivity difference (black, white); material emissivity difference (wood, aluminum).

Target 3: Aluminum target with inserted black wooden plates, forming circular features in a square grid. Working principles: color emissivity difference (black, white); material emissivity difference (wood, aluminum).

Table 3: Evolution of our dual-modal calibration targets (sorted by date).

Based on the detections, the camera intrinsics and extrinsics are calibrated with Kalibr [Rehder2012]. To be able to use Kalibr, the infrared images are inverted and thresholded. We obtained the best calibration results for datasets recorded outside with the calibration target facing the clear sky. For this target, Kalibr reports the following errors: for a single-camera calibration, (infrared) and (optical); for a camera-IMU calibration, (infrared) and (optical).
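The inversion and thresholding preprocessing can be sketched as follows; the threshold value of 128 is an illustrative assumption, not the value used in our pipeline:

```python
def invert_and_threshold(img, thresh=128):
    """Invert an 8-bit thermal image (list of pixel rows) and binarize it,
    so that warm calibration features become dark blobs on a light
    background, as expected by an optical feature detector."""
    return [[255 if (255 - px) > thresh else 0 for px in row] for row in img]
```

After this step, the circular target features in the infrared image resemble those in an ordinary grayscale calibration image, so the same detector can be run on both spectra.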

[Prakash2006]*: Checkerboard; flood lamp; color emissivity difference (black, white); features: corners.
[Skala2011]: Hermann grid; material emissivity difference (Styrofoam, air); features: corners.
[Vidas2012]: Hermann grid; heated backdrop, e.g., monitor; material emissivity difference (cardboard, monitor); difference in temperature and/or thermal emissivity; features: corners.
[Yu2013]: Cross; difference in temperature (thermostatic heaters); color emissivity difference (black, white); features: circles.
[Saponaro2015]: Checkerboard; flood light; color emissivity difference (black, white); material emissivity difference (ceramic, paper); features: corners.
[Filippov2019]: Wall; temperature difference (flood lights, fan); color emissivity difference (black, white); material emissivity difference (foam, aluminum); features: arcs.
* Not shown in the paper but in principle possible.
Table 4: Optical and infrared calibration methods (sorted by publication date).

5.3 Optical-Infrared Dataset Collection and Annotation

To further increase the amount of training, validation, and test data, we collected additional optical-infrared datasets. The datasets were recorded with sensors mounted on the platforms shown in Fig. 7(a) and 7(b), at different locations, and resemble realistic search-and-rescue missions. A complete list of available datasets is shown in Tab. 5. Along with all datasets, we release the C++-based annotation tool that takes as input a dataset in the form of a rosbag [Quigley2009] or a video. With the help of this tool, all humans appearing in the collected data have been annotated using upright rectangular bounding boxes. During the labeling process, humans are assigned a unique ID. Every annotation contains an additional attribute for the human posture (upright, sitting, or lying) and one for occlusion (occluded, not occluded). Examples of the available annotated frames are illustrated in Fig. 15.

6 Experiments and Results

6.1 Experiment 1: Comparison of RetinaNet, YOLOv3 and Hand-crafted Detector

In a first step, we compare the vanilla deep learning-based detectors to a hand-crafted human detection framework that uses HOG features together with an SVM classifier [Kummerle2016] and serves as a baseline solution. This reference pipeline detects humans in thermal imagery and then uses the corresponding optical image solely to reduce false positives. The performance on the collected roof_test dataset, similar to the one by [Kummerle2016], is illustrated in Fig. 9 and 10 and shows a significant improvement when using deep learning-based object detectors. Furthermore, all RetinaNet variants vastly outperform the YOLOv3 framework.

Figure 9: Thermal fppi/miss-rate curves on the roof_test set (RetinaNet152 vs. HOG+SVM [Kummerle2016]), illustrating the large performance improvement of the deep learning detectors.
Figure 10: Optical fppi/miss-rate curves on the roof_test set. Performance using pure optical information is plotted as dashed lines; the improved performance using the logical OR merging scenario is plotted in bold. By combining optical and thermal information, both false positives and overall miss rates are reduced.

6.2 Experiment 2: RetinaNet Evaluation in a SAR Scenario

In this section, the performance of the proposed human detection pipeline is thoroughly evaluated on a dataset resembling a search-and-rescue scenario. The field_test datasets proved to be very challenging, containing many very small human samples in different poses recorded from high aerial viewpoints. Occasional motion blur caused by flight maneuvers and numerous heated-up rocks make the thermal imagery even more challenging. The performance of our vanilla detectors on the total optical and thermal field evaluation datasets clearly emphasizes this: only a few percent of the total number of human bounding boxes are detected. Still, the RetinaNet variants, and especially the proposed RetinaNet50 network, vastly outperform YOLOv3 in the number of total detectable humans, as depicted in Fig. 11. Modifications to the best performing RetinaNet50 variant, as described in Sec. 3.2, yield substantial improvements on the field_test dataset, as illustrated in Fig. 12. While the logical OR merging of optical and thermal detections already brings a performance improvement, the customized anchor bounding boxes increase detection performance considerably and raise the total number of detectable bounding boxes by over 7%. Finally, doubling the image size during inference further increases this number to a total of 70% detected bounding boxes at fewer than one false positive per image. This is an improvement of over 20% compared to the plain version of the original RetinaNet50 variant.
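The effect of the anchor customization can be illustrated with the sketch below. RetinaNet places anchors of several scales and aspect ratios at every position of each FPN level; shrinking the scales shifts the anchor distribution toward the tiny boxes typical of humans seen from high altitude. The scale values marked "custom" are hypothetical, not the exact configuration used in the paper:

```python
def anchor_widths(base_size, scales, ratios):
    """Widths (in px) of all anchors spawned from one FPN-level base size."""
    return sorted(base_size * s * r ** 0.5 for s in scales for r in ratios)

default_scales = [2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3)]   # stock RetinaNet
custom_scales = [2 ** -2, 2 ** -1, 2 ** 0]              # hypothetical, smaller

# The finest FPN level (P3) uses base size 32 in the stock configuration.
smallest_default = min(anchor_widths(32, default_scales, [1.0]))  # 32.0 px
smallest_custom = min(anchor_widths(32, custom_scales, [1.0]))    # 8.0 px
```

Since anchors only produce positive training samples when they overlap a ground-truth box sufficiently, a detector whose smallest anchor is several times larger than the typical person simply never learns to fire on them; shrinking the scales fixes this without touching the network weights.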

Figure 11: Performance of all networks on the total field_test sets using standard networks without any changes to anchors, image resolution, or merging of IR-RGB information.
Figure 12: Performance on the total field_test sets, showing the gradual improvements when adding our modifications to the plain RetinaNet50 variant, as described in Sec. 3.2.
Figure 13: Bounding box size analysis of the best performing RetinaNet50 on the optical part of the davos_rega test sets. RetinaNet is able to generate true positives at a bounding box size of around 2000 px and fails to make any predictions with sizes below approximately 1500 px.
Figure 14: Bounding box size analysis of the best performing RetinaNet50 on the optical part of the davos_rega test sets using custom anchors. RetinaNet now generates most of its true positive predictions at a bounding box size of around 1000 px and is further able to make predictions across a wider range of sizes.
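The bounding box size analysis of Fig. 13 and 14 amounts to bucketing the true positives by box area. A minimal sketch, with illustrative box sizes and an assumed bucket width:

```python
from collections import Counter

def bucket_by_area(boxes, bucket_px=500):
    """Histogram of (w, h) boxes by area, in buckets of `bucket_px` px."""
    return Counter((w * h) // bucket_px * bucket_px for w, h in boxes)

# Illustrative true-positive box sizes (width, height) in pixels.
true_positives = [(20, 50), (18, 55), (30, 70), (10, 40)]
hist = bucket_by_area(true_positives)  # bucket start -> count
```

Plotting such a histogram separately for the stock and the custom-anchor detector makes the shift of the true-positive mass toward smaller boxes directly visible.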
Individual Human Detection

A final evaluation investigates how well the pipeline detects individual humans: every human in our novel dataset is labeled with a unique ID, so re-detections of the same individual can be conveniently recognized. For search-and-rescue scenarios, a person does not necessarily need to be detected in every single frame; instead, it is more important that an individual is detected at least once during the mission. It is therefore informative to investigate how many of the distinct individual human IDs are detected by the pipeline, as illustrated in Fig. 13 and 14. By calculating the miss rate over actual human IDs instead of single ground-truth bounding boxes, these final results show human ID miss rates of less than 30% for the best performing RetinaNet50 while still making fewer than one false positive per image. In other words, over 70% of all individual humans are successfully detected at least once.
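With unique IDs, the ID-level miss rate reduces to a set computation over the whole mission. A minimal sketch, with illustrative data; it assumes detections have already been matched to ground-truth IDs per frame:

```python
def id_miss_rate(ground_truth_ids, detections_per_frame):
    """Fraction of individuals never detected in any frame.
    `detections_per_frame` maps frame index -> set of matched ground-truth IDs."""
    detected = set().union(*detections_per_frame.values()) \
        if detections_per_frame else set()
    missed = set(ground_truth_ids) - detected
    return len(missed) / len(ground_truth_ids)

all_ids = {1, 2, 3, 4, 5}
per_frame = {0: {1}, 1: {1, 3}, 2: set(), 3: {4}}  # IDs 2 and 5 never matched
rate = id_miss_rate(all_ids, per_frame)  # 0.4
```

Note how forgiving this metric is compared to the per-box miss rate: a person detected in a single frame out of thousands still counts as found, which matches the operational requirement of a search-and-rescue mission.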

Qualitative Samples

Fig. 16 presents qualitative examples, including an example of the merging of the optical and infrared images and humans detected during a night flight.

7 Conclusion

This work presented our human detection framework that is able to detect, track, localize, and re-identify humans from UAVs with the help of an infrared and an optical camera. Based on a detailed evaluation, it can be concluded that the RetinaNet variants, in particular RetinaNet50, are superior to YOLOv3 and to a classical human detection pipeline that served as a lower baseline. The major advantage of the RetinaNet architecture appears to be the focal loss, which copes with the drastic imbalance between the number of foreground and background samples. Moreover, customizing the anchors is crucial for detecting humans seen from high altitudes, and enlarged or higher-resolution images further improve the detection performance. Finally, the evaluation demonstrated that the logical OR merging of optical and thermal detections helps to improve detection performance, especially on more challenging datasets like the introduced field_test sets. Both the quantitative and the qualitative evaluation support the conclusion that our novel pipeline works on a challenging real-world dataset, successfully exploiting and combining information from both optical and thermal images. Given the performance in detecting individual human IDs and the final qualitative examples, one could even ask whether RetinaNet50 already surpasses human annotators at this detection task under certain circumstances.


The research leading to these results has received funding from the Swiss air rescue organization [rega]. Furthermore, the authors wish to thank the following persons for their help: Thomas Mantel (Sensor triggering and calibration), Yves Allenspach (camera mounts), and Daniel Hentzen (circular calibration target for thermal camera and graphical user interface).


Sequence Frames Annotations
total annotated total upright sitting lying occluded
roof_new 1 900 1 815 4 748 4 748 0 0 584
roof_old 2 898 2 758 23 165 23 165 0 0 2 396
roof_test 447 447 1 462 1 462 0 0 172
roof_val 231 231 991 991 0 0 258
davos_rega_01 2 553 86 159 116 12 31 21
davos_rega_02 4 806 360 543 374 0 169 37
davos_rega_03 5 853 35 66 31 18 17 20
davos_rega_04 2 962 203 322 166 3 153 29
hinwil_01 2 069 210 417 417 0 0 59
hinwil_02 9 028 1 221 1 994 1 360 530 104 262
solair 7 698 97 155 121 0 34 38
Total 40 445 7 463 34 022 32 951 563 508 3 876
(a) Internal optical image sequences.
Sequence Frames Annotations
total annotated total upright sitting lying occluded
roof_new 1 834 1 701 5 752 5 752 0 0 1 825
roof_old 1 480 1 316 6 141 6 141 0 0 477
roof_test 447 447 1 421 1 421 0 0 205
roof_val 231 231 627 627 0 0 138
davos_old 2 204 664 1 004 1 004 0 0 53
davos_rega_01 2 585 11 14 7 0 7 2
davos_rega_02 4 842 66 71 71 0 0 4
davos_rega_03 5 920 16 16 0 16 0 0
davos_rega_04 2 996 115 144 52 0 92 21
hinwil_01 2 012 114 194 194 0 0 17
hinwil_02 730 15 15 0 15 0 2
rothenturm 37 413 9 852 21 785 21 785 0 0 1 120
solair 27 324 0 0 0 0 0 0
tessin 1 011 43 44 27 0 17 1
Total 91 029 14 591 37 228 37 081 31 116 3 856
(b) Internal infrared image sequences.
Table 5: Recorded and labeled internal datasets.
Sequence Frames Annotations
total annotated total occluded
Mini-drone [minidrone]
set10 (original name: Normal_Static_Night_Empty_1_3_1, test) 543 542 1 207 0
set13 (original name: Normal_Static_Day_Half_0_1_1, training) 570 569 569 0
Stanford drone [stanford]
bookstore00 13 335 13 335 246 158 1 494
bookstore06 14 558 14 305 64 944 7 090
coupa01 11 966 11 966 71 136 3 726
coupa02 11 966 11 474 66 867 3 082
gates07 2 202 2 202 14 982 365
hyang02 12 272 12 272 172 880 4 855
hyang05 10 648 10 648 123 521 123
hyang07 574 574 14 637 221
hyang09 574 574 1 930 592
hyang10 9 928 9 928 64 228 7 845
hyang12 9 928 9 619 39 030 2 810
little00 1 518 1 518 24 517 508
little01 14 070 13 828 52 399 477
quad03 509 509 2448 0
UAV123 [uav123]
bike01 553 553 553 0
person01 799 799 799 0
person02 2 514 2 514 2 514 0
person03 643 643 643 0
person04 254 254 254 0
person05 2 101 2 101 2 101 0
person06 658 658 658 0
person07 1 943 1 873 1 873 0
person08 126 126 126 0
person10 582 514 514 0
person12 1 621 1 548 1 548 0
person13 155 155 155 0
person14 2 034 2 034 2 034 0
person15 712 712 712 0
person16 1 147 1 038 1 038 0
person17 1 852 1 820 1 820 0
person22 24 24 24 0
person23 153 153 153 0
wakeboard02 733 733 733 0
wakeboard03 748 748 748 0
wakeboard04 586 586 586 0
wakeboard05 758 758 758 0
wakeboard06 401 401 401 0
wakeboard07 59 59 59 0
wakeboard08 321 321 321 0
Total (41) 136 638 134 988 982 578 33 188
(a) Utilized external optical image sequences.
Sequence Frames Annotations
total annotated total occluded
ETH TIR [eth_tir]
asl 659 659 1 021 191
sempach06 600 413 413 0
sempach07 370 359 1 391 140
sempach08 634 634 1 359 233
sempach09 576 576 982 59
sempach10 261 215 321 19
sempach11 197 192 707 113
sempach12 775 724 724 50
OTCVBS 1 [otcvbs1]
set01 31 31 91 0
set02 28 28 100 0
set03 23 23 101 0
set04 18 18 109 0
set05 23 23 101 0
set06 18 18 97 0
set07 22 22 94 0
set08 24 24 99 0
set09 73 73 95 0
set10 24 24 97 0
OTCVBS 11 [otcvbs11]
set02 (original name: set2/seq3/nuc) 1 273 1 273 69 841 0
set03 (original name: set2/set4/nuc) 1 131 1 131 72 686 0
PTB TIR [ptb_tir]
stranger01 95 95 95 0
stranger02 280 280 280 0
stranger03 100 100 100 0
walking 315 315 315 0
VOT TIR [vot_tir]
jacket 1 451 1 451 1 451 178
Total (25) 9 001 8 701 152 670 983
(b) Utilized external infrared image sequences.
Table 6: All external datasets used for training
Figure 15: Annotated frames of the collected datasets, containing humans in different poses, in both optical and thermal imagery, and in a variety of different environments.
Figure 16: Qualitative samples. First row: Successfully detected humans, sitting and occluded. Second row: Successful merging of the optical and infrared images; here, the human is easier to detect in the infrared spectrum. Third row: Night flight and detection of a mannequin that was placed before the flight. Fourth row: Correct detections by the network that were missed during the manual labeling process.