Delving into Robust Object Detection from Unmanned Aerial Vehicles: A Deep Nuisance Disentanglement Approach

by   Zhenyu Wu, et al.

Object detection from images captured by Unmanned Aerial Vehicles (UAVs) is becoming increasingly useful. Despite the great success of the generic object detection methods trained on ground-to-ground images, a huge performance drop is observed when they are directly applied to images captured by UAVs. The unsatisfactory performance is owing to many UAV-specific nuisances, such as varying flying altitudes, adverse weather conditions, dynamically changing viewing angles, etc. Those nuisances constitute a large number of fine-grained domains, across which the detection model has to stay robust. Fortunately, UAVs will record meta-data that depict those varying attributes, which are either freely available along with the UAV images, or can be easily obtained. We propose to utilize those free meta-data in conjunction with associated UAV images to learn domain-robust features via an adversarial training framework dubbed Nuisance Disentangled Feature Transform (NDFT), for the specific challenging problem of object detection in UAV images, achieving a substantial gain in robustness to those nuisances. We demonstrate the effectiveness of our proposed algorithm, by showing state-of-the-art performance (single model) on two existing UAV-based object detection benchmarks. The code is available at





Code Repositories


[ICCV 2019] "Delving into Robust Object Detection from Unmanned Aerial Vehicles: A Deep Nuisance Disentanglement Approach"


1 Introduction

Figure 5: Examples showing the benefit of the proposed NDFT framework for object (vehicle) detection on the UAVDT dataset. Panels: (a) baseline Faster-RCNN [42]; (b) NDFT-Faster-RCNN (A), disentangling the nuisance of altitude (A); (c) NDFT-Faster-RCNN (A+V), disentangling the nuisances of both altitude (A) and view angle (V); (d) NDFT-Faster-RCNN (A+V+W), disentangling all the nuisances of altitude (A), view angle (V), and weather (W). The detection performance gradually improves from (a) to (d) as more nuisances are disentangled (red rectangular boxes denote new correct detections beyond the baseline).

Object detection has been extensively studied over the decades. While most of the promising detectors are able to detect objects of interest in clear images, such images are usually captured from ground-based cameras. With the rapid development of machinery technology, Unmanned Aerial Vehicles (UAVs) equipped with cameras have been increasingly deployed in many industrial applications, opening up a new frontier of computer vision applications in security surveillance, peacekeeping, agriculture, deliveries, aerial photography, disaster assistance [43, 26, 3, 14, 47], etc. One of the core features for UAV-based applications is to detect objects of interest (e.g., pedestrians or vehicles). Despite high demand, object detection from UAVs is still insufficiently investigated. Meanwhile, the large mobility of UAV-mounted cameras brings greater challenges than traditional object detection (using surveillance or other ground-based cameras), such as but not limited to:

  • Variations in altitude and object scale: The scales of objects captured in the image are closely affected by the flying altitude of UAVs. For example, the image captured by a DJI Inspire 2 series flying at 500 meters altitude [2] will contain very small objects, which are very challenging to detect and track. In addition, a UAV can be operated in a variety of altitudes while capturing images. When shooting in lower altitudes, its camera can capture more details of objects of interest. When it flies to higher altitudes, the camera can inspect a larger area and more objects will be captured in the image. As a consequence, the same object can vary a lot in terms of scale throughout the captured video, with different flying altitudes during a single flight.

  • Variations in view angle: The mobility of UAVs leads to video shoots from different and free angles, in addition to the varying altitudes. For example, a UAV can look at one object from front view, to side view, to bird view, in a very short period of time. The diverse view angles cause arbitrary orientations and aspect ratios of the objects. Some view angles such as bird-view hardly occur in traditional ground-based object detection. As a result, the UAV-based detection model has to deal with more different visual appearances of the same object. Note that more view angles can be presented when altitudes grow higher. Also, wider view angles often lead to denser objects in the view.

  • Variations in weather and illumination: A UAV operated in uncontrolled outdoor environments may fly under various weather and lighting conditions. Changes in illumination (daytime versus nighttime) and weather (e.g., sunny, cloudy, foggy or rainy) drastically affect object visibility and appearance.

Most off-the-shelf detectors are trained on less varied, more restricted-view data. In comparison, the abundance of UAV-specific nuisances forces a UAV-based detection model to operate in a large number of different fine-grained domains. Here a domain can be interpreted as a specific combination of nuisances: for example, images taken at low altitude in daytime and images taken at high altitude at nighttime constitute two different domains. Therefore, our goal is to train a cross-domain object detection model that stays robust across this massive number of fine-grained domains. Existing potential solutions include data augmentation [1, 13], domain adaptation [37, 8], and ensembles of expert models [27]. However, none of these approaches generalizes easily to multiple and/or unseen domains [37, 8], and they can lead to over-parameterized models that are not suitable for UAV on-board deployment [1, 13, 27].

A (Almost) Free Lunch: Fine-Grained Nuisance Annotations. In view of the above, we cast the UAV-based object detection problem as a cross-domain object detection problem with fine-grained domains. The object types of interest persist across domains; such task-related features should be preserved and extracted. The above UAV-specific nuisances constitute the domain-specific nuisances that should be eliminated for transferable feature learning. For UAVs, the major nuisance types are well recognized, e.g., altitude, angle and weather. More importantly, in the specific case of UAVs, those nuisance annotations can be easily obtained or are even freely available. For example, a UAV can record its flying altitude as metadata by GPS or, more accurately, by a barometric sensor. As another example, weather information is easy to retrieve: given each UAV flight’s time-stamp and spatial location (or path), one can straightforwardly obtain the weather at that specific time and location.

Motivated by those observations, we propose to learn an object detection model that maintains its effectiveness in extracting task-related features while eliminating the recognized types of nuisances across different domains (e.g., altitudes/angles/weather). We take advantage of the free (or easy) access to the nuisance annotations. Based on them, we are the first to adopt an adversarial learning framework to learn task-specific, domain-invariant features by explicitly disentangling task-specific and nuisance features in a supervised way. The framework, dubbed Nuisance Disentangled Feature Transform (NDFT), gives rise to highly robust UAV-based object detection models that are directly applicable not only to the domains seen in training but also to unseen domains, without any extra effort of domain adaptation or sampling/labeling. Experiments on two real UAV-based object detection benchmarks suggest the state-of-the-art effectiveness of NDFT.

2 Related Works

2.1 Object Detection: General and UAV-Specific

Object detection has progressed tremendously, partially thanks to established benchmarks (e.g., MS COCO [31] and PASCAL VOC [15]). There are two main streams of approaches: two-stage detectors and single-stage detectors, distinguished by whether the detector has a proposal-driven mechanism. Two-stage detectors [18, 23, 17, 42, 10, 54, 55] contain a region proposal network (RPN) that first generates region proposals, and then extract region-based features to predict the object categories and their corresponding locations. Single-stage detectors [39, 40, 41, 34] apply dense sampling windows over object locations and scales, and usually achieve higher speed than two-stage ones, although often at the cost of a (marginal) accuracy decrease.

Aerial Image-based Object Detection A few aerial image datasets (e.g., DOTA [52], NWPU VHR-10 [9], and VEDAI [38]) were proposed recently. However, those datasets only contain geo-spatial images (e.g., satellite) with bird-view small objects, which are not as diverse as UAV-captured images with far more varied altitudes, poses and weather. Also, the common practice for detecting objects from aerial images is still to deploy off-the-shelf ground-based object detection models [21, 36].

Public benchmarks were unavailable for specifically UAV-based object detection until recently. Two datasets, UAVDT [12] and VisDrone2018 [57], were released to address this gap. UAVDT consists of 100 video sequences (about 80k frames) captured from UAVs under complex scenarios. Moreover, it also provides full annotations for weather conditions, flying altitudes, and camera views in addition to the ground truth bounding box of the target objects. VisDrone2018 [57] is a large-scale UAV-based object detection and tracking benchmark, composed of 10,209 static images and 179,264 frames from 263 video clips.

Detecting Tiny Objects A typical ad-hoc approach to detecting tiny objects is to learn representations of all the objects at multiple scales. This approach is, however, highly inefficient with limited performance gains. [7] proposed a super-resolution algorithm using coupled dictionary learning to transfer the target region into high resolution to “augment” its visual appearance. [50, 29, 32] proposed to internally super-resolve the feature maps of small objects so that they resemble the characteristics of large objects. SNIP [45] showed that CNNs were not naturally robust to variations in object scale. It proposed to train and test detectors on the same scales of an image pyramid, and to selectively back-propagate the gradients of object instances of different sizes as a function of the image scale during training. SNIPER [46] further processed context regions around ground-truth instances at different appropriate scales to efficiently train the detector at multiple scales, further improving the detection of tiny objects.

2.2 Handling Domain Variances

Domain Adaptation via Adversarial Training Adversarial domain adaptation [16] was proposed to reduce the domain gap by learning with only labeled data from a source domain plus massive unlabeled data from a target domain. This approach has recently gained increased attention in the detection field too. [49] learned detection models robust to occlusion and deformation, through hard positive examples generated by an adversarial network. [8] improved the cross-domain robustness of object detection by enforcing adversarial domain adaptation on both the image and instance levels. [5] introduced a Siamese-GAN to learn invariant feature representations for both labeled and unlabeled aerial images coming from two different domains. CyCADA [25] unified cycle-consistency with adversarial loss to learn domain-invariance. However, these domain adaptation methods typically assume one (ideal) source domain and one (non-ideal) target domain. Whether these methodologies generalize to handling many fine-grained domains is questionable. Once a new unseen domain emerges, domain adaptation needs explicit re-training.

In comparison, our proposed framework does not assume any ideal reference (source) domain, but rather tries to extract invariant features shared by many different “non-ideal” target domains (both seen and unseen), by disentangling domain-specific nuisances. The setting thus differs from typical domain adaptation and generalizes to task-specific feature extraction in unseen domains naturally.

Data Augmentation and Model Ensemble Compared to the considerable amount of research on data augmentation for classification [16], less attention has been paid to other tasks such as detection [1]. Classical data augmentation relies on a limited set of pre-known factors (such as scaling, rotation, flipping) that are easy to invoke, and adopts ad-hoc, minor perturbations that are unlikely to change labels, in order to gain robustness to those variations. However, UAV images involve a much larger variety of nuisances, many of which are hard to “synthesize”, e.g., images from different angles. [13, 56] proposed learning-based approaches to synthesize new training samples for detection, but they focused on re-combining foreground objects and background contexts rather than re-composing specific nuisance attributes. Also, the (much) larger augmented dataset adds to the training burden and may cause over-parameterized models.

Another methodology was proposed in [27]. To capture the appearance variations caused by different shapes, poses and viewing angles, it proposed a Multi-Expert R-CNN consisting of three experts, each responsible for objects with a particular shape: horizontally elongated, square-like, and vertically elongated. This approach is limited because the model ensemble quickly becomes too expensive as more domains are involved. Furthermore, it cannot generalize to unknown or unseen domains.

Feature Disentanglement in Generative Models Feature disentanglement [53] leads to non-overlapped groups of factorized latent representations, each of which would properly describe corresponding information to particular attributes of interest. It has mostly been applied to generative models [11, 44], in order to disentangle the factors of variation from the content in the latent feature space. In image-to-image translation, a recent work [19] disentangled image representations into shared parts for both domains and exclusive parts for either domain. NDFT extends the idea of feature disentanglement to learning cross-domain robust discriminative models. Due to the different application scope from generative models, we do not add back the disentangled components to reconstruct the original input.

3 Our Approach

Figure 6: Our proposed NDFT-Faster-RCNN network.

3.1 Formulation of NDFT

Our proposed UAV-based cross-domain object detection can be characterized as an adversarial training framework. Assume our training data $X$ is associated with an Object detection task $O$ and a UAV-specific Nuisance prediction task $N$. We mathematically express the goal of cross-domain object detection as alternately optimizing two objectives as follows ($\gamma$ is a weight coefficient):

$$\min_{f_O, f_T} \; L_O(f_O(f_T(X)), Y_O) - \gamma L_N(f_N(f_T(X)), Y_N),$$
$$\min_{f_N} \; L_N(f_N(f_T(X)), Y_N). \qquad (1)$$

In (1), $f_O$ denotes the model that performs the object detection task $O$ on its input data. The label set $Y_O$ consists of the object bounding box coordinates and class labels provided on $X$. $L_O$ is a cost function defined to evaluate the object detection performance on $O$. On the other hand, the labels $Y_N$ of the UAV-specific nuisances come from metadata along with $X$ (e.g., flying altitude, camera view or weather condition), and a standard cost function $L_N$ (e.g., softmax) is defined to evaluate the task performance on $N$. Here we formulate nuisance robustness as the suppression of the nuisance prediction accuracy from the learned features.

We seek a Nuisance Disentangled Feature Transform (NDFT) $f_T$ by solving (1), such that

  • The object detection task performance $L_O$ is minimally affected over $f_T(X)$, compared to using $X$.

  • The nuisance prediction task performance $L_N$ is maximally suppressed over $f_T(X)$, compared to using $X$.

In order to deal with the case of multiple nuisances, we extend (1) to multiple prediction tasks. Here we assume $k$ nuisance prediction tasks associated with label sets $Y_N^1, \dots, Y_N^k$; $\gamma_1, \dots, \gamma_k$ are the respective weight coefficients. The modified objective naturally becomes:

$$\min_{f_O, f_T} \; L_O(f_O(f_T(X)), Y_O) - \sum_{i=1}^{k} \gamma_i L_N(f_N^i(f_T(X)), Y_N^i),$$
$$\min_{f_N^1, \dots, f_N^k} \; \sum_{i=1}^{k} L_N(f_N^i(f_T(X)), Y_N^i). \qquad (2)$$

$f_T$, $f_O$, and the $f_N^i$s can all be implemented by deep networks.

Interpretation as Three-Party Game NDFT can be derived from a three-competitor game optimization:

$$\max_{f_N} \min_{f_O, f_T} \; L_O(f_O(f_T(X)), Y_O) - \gamma L_N(f_N(f_T(X)), Y_N),$$

where $f_T$ acts as an obfuscator, $f_N$ as an attacker, and $f_O$ as a utilizer (adopting ML security terms). In fact, the two sub-optimizations in (1) denote an iterative routine to solve this unified form (performing coordinate descent between {$f_T$, $f_O$} and $f_N$). This form can easily capture many other settings or scenarios, e.g., privacy-preserving visual recognition [51, 48], where $f_T$ encodes features to avoid peeps from $f_N$ while preserving utility for $f_O$.

Algorithm 1 Learning Nuisance Disentangled Feature Transform in UAV-based Object Detection via Adversarial Training
Given pre-trained NDFT module $f_T$, object detection task module $f_O$, and nuisance prediction modules $f_N^i$s
for number of training iterations do
     Sample a mini-batch of n examples {$X^1, \dots, X^n$}
     Update NDFT module $f_T$ (weights $w_T$) and object detection module $f_O$ (weights $w_O$) with stochastic gradients of the first objective in (3)
     while at least one nuisance prediction task has training accuracy below the threshold do    ▷ Prevent the $f_N^i$s from becoming too weak.
         Update nuisance prediction modules $f_N^i$ (weights $w_N^i$) with stochastic gradients of the second objective in (3)
     Restart the $f_N^i$s every 1000 iterations, and repeat Algorithm 1 from the beginning.    ▷ Alleviate overfitting.

3.2 Implementation and Training

Architecture Overview: NDFT-Faster-RCNN As an instance of the general NDFT framework (3.1), Figure 6 displays an implementation example of NDFT using the Faster-RCNN backbone [42], while later we will demonstrate that NDFT can be plug-and-play with other more sophisticated object detection networks (e.g., FPN).

During training, the input data $X$ first goes through the NDFT module $f_T$, and its output $f_T(X)$ is passed through two subsequent branches simultaneously. The upper object detection branch, $f_O$, uses $f_T(X)$ to detect objects, while the lower nuisance prediction model $f_N$ predicts nuisance labels from the same $f_T(X)$. Finally, the network minimizes the prediction penalty (error rate) for $f_O$ while maximizing the prediction penalty for $f_N$, as shown by (3.1).

By jointly training the NDFT module $f_T$, the detection module $f_O$, and the nuisance prediction modules $f_N^i$ in the above adversarial setting, the NDFT module learns a transform that preserves the features relevant to object detection while removing the features relevant to predicting UAV-specific nuisances, fulfilling the goal of cross-domain object detection that is robust to those nuisances.

Choices of $f_T$, $f_O$ and $f_N$. In this NDFT-Faster-RCNN example, $f_T$ includes the conv1_x, conv2_x, conv3_x and conv4_x of the ResNet101 part of Faster-RCNN. $f_O$ includes the conv5_x layer, attached with a classification and regression loss for detection. We further implement $f_N$ using the same architecture as $f_O$ (except for the number of classes for prediction). The output of $f_T$ is fed to $f_O$ after going through a RoIAlign [22] layer, while it is fed to $f_N$ after going through a spatial pyramid pooling layer [23].

Choices of $L_O$ and $L_N$. $L_O$ is the bounding box classification (e.g., softmax) and regression loss (e.g., smooth $\ell_1$) as widely used in traditional two-stage detectors. However, using $-L_N$ as the adversarial loss in the first row of (3.1) is not straightforward. If we choose $L_N$ as some typical classification loss such as the softmax, then maximizing it directly is prone to gradient explosion. After experimenting with several solutions such as the gradient reversal trick [16], we decide to follow [35] and choose the negative entropy function of the predicted class vector as the adversarial loss, denoted as $L_{ne}$. Minimizing $L_{ne}$ will encourage the model to make “uncertain” predictions (equivalently, close to uniform random guesses) on the nuisances.
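To make the effect of the negative entropy loss concrete, here is a minimal sketch in plain Python. It assumes the loss is computed on a softmax probability vector as the sum of p·log p over classes; the helper names `softmax` and `negative_entropy` and the example logits are illustrative and not taken from the authors' code. The loss is bounded below by −log C for C classes and reaches that bound at the uniform prediction, which is why minimizing it avoids the unbounded growth of a maximized softmax loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def negative_entropy(probs, eps=1e-12):
    """Negative entropy sum_c p_c * log(p_c); its minimum is -log(C)."""
    return sum(p * math.log(p + eps) for p in probs)


# A nuisance head that is nearly certain about, say, the altitude class.
confident = softmax([8.0, 0.0, 0.0])
# A nuisance head that can only guess uniformly among three classes.
uniform = softmax([0.0, 0.0, 0.0])

loss_confident = negative_entropy(confident)  # close to 0
loss_uniform = negative_entropy(uniform)      # close to -log(3)
```

The uniform prediction yields the smaller loss, so gradient steps on the feature transform that lower this term push the nuisance head toward uninformative outputs.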

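Since we replace $-L_N$ with $L_{ne}$ in the first objective in (3.1), it no longer needs $Y_N$. Meanwhile, the usage of $L_N$ and $Y_N$ remains unaffected in the second objective of (3.1). $L_N$ and $Y_N$ are used to pre-train the $f_N^i$s at initialization and to keep the $f_N^i$s as “sufficiently strong adversaries” throughout the adversarial training, in order to learn a meaningful $f_T$ that can generalize better. Our final framework alternates between:

$$\min_{f_O, f_T} \; L_O(f_O(f_T(X)), Y_O) + \sum_{i=1}^{k} \gamma_i L_{ne}(f_N^i(f_T(X))),$$
$$\min_{f_N^1, \dots, f_N^k} \; \sum_{i=1}^{k} L_N(f_N^i(f_T(X)), Y_N^i). \qquad (3)$$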


Training Strategy Just like training GANs [20], our training is prone to collapse and/or bad local minima. We thus present a carefully designed training algorithm with an alternating update strategy. The training procedure is summarized in Algorithm 1 and explained below.

For each mini-batch, we first jointly optimize the weights of the NDFT module $f_T$ and the detection module $f_O$ (with the nuisance prediction modules $f_N^i$ frozen), by minimizing the first objective in (3.2) using standard stochastic gradient descent (SGD). Meanwhile, we keep “monitoring” the $f_N^i$ branches: as $f_T$ is updated, if at least one of the $f_N^i$s becomes too weak (i.e., shows poor prediction accuracy on the same mini-batch), another update is triggered by minimizing the second objective in (3.2) using SGD. The goal is to “strengthen” the nuisance prediction competitors. Besides, we also discover an empirical trick: we periodically reset the current weights of the $f_N^i$s to random initialization, and then re-train them on $f_T(X)$ (with $f_T$ fixed) to become strong nuisance predictors again, before we restart the above alternating process of $f_T$, $f_O$, and the $f_N^i$s. This restarting trick is also found to benefit the generalization of the learned $f_T$ [51], potentially because it helps escape some bad local minima.
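The alternating schedule described above can be sketched as plain control flow. This is only an illustration under stated assumptions: the four callables stand in for real optimizer steps and are hypothetical, the accuracy threshold is passed in as a parameter because its value is not given in the text here, and `max_inner_steps` is a safety cap added for the sketch rather than part of the described procedure.

```python
def train_ndft(num_iters, update_main, nuisance_accuracies, update_nuisance,
               reset_nuisance, acc_threshold, restart_every=1000,
               max_inner_steps=100):
    """Alternate between the feature/detector update and the adversaries.

    All callables are hypothetical placeholders:
      update_main()          one SGD step on the transform and detector,
                             with the nuisance predictors frozen
      nuisance_accuracies()  current accuracy of each nuisance predictor
      update_nuisance()      one SGD step on the nuisance predictors
      reset_nuisance()       re-initialise the nuisance predictors
    """
    def strengthen():
        # Keep the adversaries strong: train them while any one is too weak.
        steps = 0
        while (min(nuisance_accuracies()) < acc_threshold
               and steps < max_inner_steps):
            update_nuisance()
            steps += 1

    for it in range(1, num_iters + 1):
        update_main()
        strengthen()
        if it % restart_every == 0:
            # Periodic restart: reset the adversaries, then retrain them
            # on the current (fixed) features before continuing.
            reset_nuisance()
            strengthen()


# Toy run with integer "accuracy in percent" so the trace is exact.
log = []
state = {"acc": 95}


def update_main():
    log.append("main")
    state["acc"] -= 10  # pretend the features now hide the nuisance better


def nuisance_accuracies():
    return [state["acc"]]


def update_nuisance():
    log.append("nuisance")
    state["acc"] += 10


def reset_nuisance():
    log.append("reset")
    state["acc"] = 50


train_ndft(3, update_main, nuisance_accuracies, update_nuisance,
           reset_nuisance, acc_threshold=90, restart_every=2)
```

In the toy trace, each main step is followed by just enough adversary steps to bring the accuracy back to the threshold, and the single restart triggers a longer retraining burst.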

4 Experimental Results

Figure 9: An example showing the benefit of the proposed NDFT approach for object detection on the VisDrone2018 dataset (panel (a): DE-FPN). The blue and green rectangular boxes denote pedestrians and cars respectively. Red rectangular boxes denote new correctly detected objects by NDFT-DE-FPN beyond the baseline of DE-FPN.

Since public UAV-based object detection datasets (in particular those with nuisance annotations) are currently of very limited availability, we design three sets of experiments to validate the effectiveness, robustness, and generality of NDFT. First, we perform the main body of experiments on the UAVDT benchmark [12], which provides all three UAV-specific nuisance annotations (altitude, weather, and view angle). We demonstrate the clear observation that the more variations are disentangled via NDFT, the larger AP improvement we will gain on UAVDT; and eventually we achieve the state-of-the-art performance on UAVDT.

We then move to the other public benchmark, VisDrone2018. Originally, the nuisance annotations were not released on VisDrone2018. We manually annotate the nuisances on each image: those annotations will be released publicly, and hopefully will be contributed as a part of VisDrone. Learning NDFT gives a performance boost over the best single model, and leads us to the (single model) state-of-the-art mean average precision (mAP) on the VisDrone2018 validation set. (mAP on the 10 categories of objects is the standard evaluation criterion on VisDrone2018. The top-2 models on the UAVDT leaderboard are model ensembles; we compare with only single model solutions for fairness.)

In addition, we study a transfer learning setting from the NDFT learned on UAVDT to VisDrone2018. We explore transfer because UAVs often come across unseen scenarios, and good transferability of learned features facilitates more general usability. When detecting the (shared) vehicle category, the transferred NDFT shows strong transferability by outperforming the best single-model method currently reported on the VisDrone2018 leaderboard [4].

4.1 UAVDT: Results and Ablation Study

Problem Setting The image object detection track on UAVDT consists of around 41k frames with 840k bounding boxes. It has three categories: car, truck and bus, but the class distribution is highly imbalanced (the latter two occupy less than 5% of bounding boxes). Hence, following the convention of the authors in [12], we combine the three into one vehicle class and report AP based on that. All frames are also annotated with three categories of UAV-specific nuisances: flying altitude (low, medium and high), camera view (front-view, side-view and bird-view), and weather condition (daylight, night; we discard another “foggy” class because of its too small size). We will denote the three nuisances as A, V, and W for short, respectively.
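As a small illustration of how these annotations define fine-grained domains, the sketch below enumerates every combination of the three UAVDT nuisance categories listed above and encodes one frame's metadata as three integer labels, one per nuisance prediction head. The constant names and the `encode` helper are hypothetical and chosen only for this example.

```python
from itertools import product

# Nuisance classes as listed in the problem setting (A, V, W).
ALTITUDE = ("low", "medium", "high")
VIEW = ("front-view", "side-view", "bird-view")
WEATHER = ("daylight", "night")

# Each fine-grained domain is one combination of the three nuisances.
domains = list(product(ALTITUDE, VIEW, WEATHER))


def encode(altitude, view, weather):
    """Map one frame's metadata to integer labels for the three heads."""
    return (ALTITUDE.index(altitude), VIEW.index(view), WEATHER.index(weather))


num_domains = len(domains)                     # 3 * 3 * 2 combinations
labels = encode("high", "bird-view", "night")  # one of the harder domains
```

Keeping one label per nuisance, rather than a single label over all combinations, matches the use of a separate prediction task and weight for each nuisance.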

Implementation Details We first did our due diligence to improve the baseline (without considering nuisance handling) on UAVDT, to ensure a solid enough ground for NDFT. The authors reported an AP of 20 using a Faster-RCNN model with the VGG-16 backbone. We replace the backbone with ResNet-101, and fine-tune hyperparameters such as the anchor scales (16, 32, 64, 128, 256). We end up with an improved AP of 45.64 (using the same IoU threshold = 0.7 as the authors) as our baseline performance. We also communicated with the authors of [12] in person and they acknowledged this improved baseline. We then implement NDFT-Faster-RCNN using the architecture depicted in Figure 6, also with a ResNet-101 backbone. We denote $\gamma_1$, $\gamma_2$, and $\gamma_3$ as the coefficients in (3) for the loss terms of the altitude, view and weather nuisances, respectively.
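For readers unfamiliar with the IoU threshold mentioned above, the following is a minimal sketch of intersection-over-union for axis-aligned boxes and of the 0.7 cut-off. The box format `(x1, y1, x2, y2)` and the helper names are assumptions made for the example; the actual UAVDT evaluation protocol (how detections are matched to ground truth and how AP is integrated) is not reproduced here.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches(pred_box, gt_box, threshold=0.7):
    """A prediction counts as a hit only if IoU reaches the threshold."""
    return iou(pred_box, gt_box) >= threshold


gt = (0, 0, 10, 10)
tight = (0, 0, 10, 8)   # overlaps 80 of 100 units of area
loose = (0, 0, 10, 5)   # overlaps 50 of 100 units of area
```

With a threshold of 0.7, the tighter box is accepted and the looser one is rejected, which shows why small localization errors on tiny UAV objects are penalized heavily.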

$\gamma_1$ (A)   Low     Med     High    Overall
0.0              68.14   49.71   18.70   45.64
0.01             69.01   50.46   14.63   45.31
0.02             66.97   46.91   16.69   44.17
0.03             66.38   53.00   15.69   45.92
0.05             65.46   48.43   16.58   44.36
Table 1: Learning NDFT-Faster-RCNN on altitude nuisance only, with different $\gamma_1$ values on the UAVDT dataset.

$\gamma_2$ (V)   Front   Side    Bird    Overall
0.0              53.34   68.02   27.05   45.64
0.01             57.45   67.61   25.60   46.16
0.02             61.49   66.85   24.93   45.73
0.03             54.55   68.22   23.07   45.42
0.04             64.93   66.83   24.96   46.10
Table 2: Learning NDFT-Faster-RCNN on view angle nuisance only, with different $\gamma_2$ values on the UAVDT dataset.

$\gamma_3$ (W)   Day     Night   Overall
0.0              45.63   52.14   45.64
0.01             45.18   59.66   46.62
0.025            43.72   57.41   44.43
0.05             43.89   50.25   43.79
0.1              44.28   48.78   43.60
Table 3: Learning NDFT-Faster-RCNN on weather nuisance only, with different $\gamma_3$ values on the UAVDT dataset.

Results and Analysis We unfold our full ablation study on UAVDT in a progressive way: first we study the impact of removing each individual nuisance type (A, V, and W) . We then gradually proceed to removing two and three nuisance types, and show the resulting consistent gains.

Tables 1, 2, and 3 show the benefit of removing the flying altitude (A), camera view (V) and weather condition (W) nuisances, individually. That can be viewed as learning NDFT-Faster-RCNN (Figure 6) with only the corresponding $\gamma_i$ ($i$ = 1, 2, 3) being nonzero. The baseline model without nuisance disentanglement has $\gamma_i$ = 0, $i$ = 1, 2, 3.

As can be seen from Table 1, compared to the baseline ($\gamma_1$ = 0), an overall AP gain is obtained at $\gamma_1$ = 0.03, where we achieve an AP improvement of 0.28.

Table 2 shows the performance gain from removing the camera view (V) nuisance. At $\gamma_2$ = 0.01, an overall AP improvement of 0.52 is obtained. Similar positive observations are found in Table 3 as well, when the weather (W) nuisance is removed: $\gamma_3$ = 0.01 results in an overall AP boost of 0.98 over the baseline, with the AP of the more challenging night class increased by 7.52.

Table 4 shows the full results of incrementally adding more adversarial losses into training. For example, A+V+W stands for simultaneously disentangling the flying altitude, camera view and weather nuisances. When using two or three losses, unless otherwise stated, we apply $\gamma_i$ = 0.01 for both/all of them, as discovered to give the best single-nuisance results in Tables 1-3. As a consistent observation throughout the table, the more nuisances removed through NDFT, the better AP values we obtain (e.g., A+V outperforms any of the three single-nuisance models, and A+V+W further achieves the best AP among all). In conclusion, removing nuisances using NDFT evidently contributes to addressing the tough problem of object detection on high-mobility UAV platforms. Furthermore, the final best performer, A+V+W, noticeably improves the class-wise APs on some of the most challenging nuisance classes, such as high-altitude, bird-view and nighttime. Improving object detection in those cases can be significant for deploying camera-mounted UAVs in uncontrolled, potentially adverse visual environments with better reliability and robustness.

                   Baseline  A      V      W      A+V    A+W    V+W    A+V+W
Flying Altitude
  Low              68.14     66.38  71.09  75.32  66.05  68.61  66.89  74.84
  Med              49.71     53.00  52.29  51.59  54.07  49.18  56.07  56.24
  High             18.70     15.69  16.62  16.08  18.60  19.19  15.42  20.55
Camera View
  Front            53.34     53.90  57.45  62.36  61.23  51.05  56.67  64.88
  Side             68.02     67.41  67.61  68.47  68.82  68.71  67.62  67.50
  Bird             27.05     24.56  25.60  23.97  24.43  27.96  24.41  28.79
Weather Condition
  Day              45.63     47.32  45.30  45.18  46.26  45.19  45.90  45.91
  Night            52.14     45.82  56.70  59.66  59.16  59.78  53.35  64.16
Overall            45.64     45.92  46.16  46.62  46.88  46.64  46.03  47.91
Table 4: UAVDT NDFT-Faster-RCNN with multiple attribute disentanglement.

Adopting Stronger FPN Backbones We demonstrate that the performance gain from NDFT does not vanish as we adopt more sophisticated backbones, e.g., FPN [30]. Training FPN on UAVDT improves the baseline performance from 45.64 to 49.05. By replacing Faster-RCNN with FPN in the NDFT training pipeline, the resulting model learns to simultaneously disentangle the nuisances ($\gamma_i$ = 0.005, $i$ = 1, 2, 3). We are able to further increase the overall AP to 52.03, showing the general benefit of NDFT regardless of the backbone choice.

Proof-of-Concept for NDFT-based Tracking With object detection as our main focus, we also evaluate NDFT on UAVDT tracking as a proof of concept. We choose SORT [6] (a popular online and real-time tracker) and evaluate on the multi-object tracking (MOT) task defined on UAVDT. We follow the tracking-by-detection framework adopted in [12], and compare the tracking results based on the detection inputs from vanilla Faster-RCNN and NDFT-Faster-RCNN, respectively. All evaluation protocols are inherited from [12]. As shown in Table 5, NDFT-FRCNN outperforms the vanilla baseline in 10 out of the 11 metrics, showing its promise even beyond detection.

FRCNN 43.7 58.9 34.8 39.0 74.3 33.9 28.0 33,037 172,628 2,350 5,787
NDFT-FRCNN 52.9 66.8 44.5 38.4 76.5 39.8 27.3 32,581 152,379 1,550 5,026
Table 5: NDFT versus vanilla baseline on MOT task.
Figure 12: An example showing the superior performance of NDFT-DE-FPN(r) over DE-FPN for object detection on the VisDrone2018 dataset. Panels: (a) DE-FPN; (b) NDFT-DE-FPN(r). Red boxes highlight the local regions where NDFT-DE-FPN(r) is able to detect substantially more vehicles than DE-FPN (the state-of-the-art single-model method on VisDrone2018).

Comparing NDFT with Multi-Task Learning Another plausible option to utilize nuisance annotations is to jointly predict $Y_O$ and the $Y_N^i$s as standard multi-task learning. To compare it with NDFT fairly, we switch the sign from $-$ to $+$ in the first row of (3.1), through which the nuisance prediction tasks become three auxiliary losses (AL) in multi-task learning. We minimize this new optimization and carefully re-tune the $\gamma_i$s for AL by performing grid search. As seen from Table 6, while AL is able to slightly improve over the baseline too (as expected), NDFT is evidently and consistently better thanks to its unique ability to encode invariances. The experiments objectively establish the role of adversarial losses versus standard auxiliary losses.

Method     Overall   Altitude                View                    Weather
                     Low     Med     High    Front   Side    Bird    Day     Night
Baseline   45.64     68.14   49.71   18.70   53.34   68.02   27.05   45.63   52.14
AL         45.69     66.58   50.80   18.28   61.49   66.85   24.93   45.62   53.64
NDFT       46.81     70.48   55.06   16.12   57.06   68.07   27.59   46.05   59.56
Table 6: Comparing the baseline Faster-RCNN, adding auxiliary losses, and our proposed NDFT method.
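
The difference between the adversarial and the auxiliary-loss settings can be illustrated with a minimal sketch. It assumes that the first row of (3.1) has the form "detection loss minus a weighted sum of nuisance prediction losses"; the function name, argument names and the loss values below are hypothetical and serve only to show the sign switch. The separate update of the nuisance predictors is not shown.

```python
def feature_objective(detection_loss, nuisance_losses, coeffs, adversarial=True):
    """Scalar objective minimised by the feature extractor and detector.

    adversarial=True  : detection_loss - sum(c_i * nuisance_loss_i)  (assumed NDFT form)
    adversarial=False : detection_loss + sum(c_i * nuisance_loss_i)  (auxiliary losses, AL)
    """
    if len(nuisance_losses) != len(coeffs):
        raise ValueError("one coefficient per nuisance loss is required")
    weighted = sum(c * l for c, l in zip(coeffs, nuisance_losses))
    sign = -1.0 if adversarial else 1.0
    return detection_loss + sign * weighted


# Hypothetical loss values for the altitude, view and weather predictors.
nuisance_losses = [0.7, 1.1, 0.4]
coeffs = [0.005, 0.005, 0.005]

ndft_objective = feature_objective(2.0, nuisance_losses, coeffs, adversarial=True)
al_objective = feature_objective(2.0, nuisance_losses, coeffs, adversarial=False)
print(ndft_objective, al_objective)
```

With the negative sign, lowering the objective rewards features from which the nuisances are hard to predict; with the positive sign, it rewards features from which they are easy to predict, which is the multi-task setting.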

4.2 VisDrone2018: Results and Analysis

Problem Setting

The image object detection track on VisDrone2018 provides a dataset of 10,209 images, with 10 categories of pedestrians, vehicles and other traffic objects annotated. We manually annotate the UAV-specific nuisances, with the same three categories as on UAVDT.

According to the leaderboard [4] and workshop report [58], the best-performing single model is DE-FPN, which utilized FPN (removing P6) with a ResNeXt-101 64-4d backbone. We implement DE-FPN by identically following their method description in [58], as our comparison subject.

Implementation Details

Taking the DE-FPN backbone, NDFT is learned by simultaneously disentangling three nuisances (A+V+W). We term the resulting DE-FPN model with NDFT as NDFT-DE-FPN. The performance of DE-FPN and NDFT-DE-FPN is evaluated using the mAP over the 10 object categories on the VisDrone2018 validation set, since the testing set is not publicly accessible.

Coefficient (i = 1,2,3)   0       0.001   0.003   0.004   0.005   0.01    0.02
mAP                       48.41   48.97   49.75   51.66   52.77   51.67   50.42
Table 7: mAP comparison on VisDrone2018 validation set.

Results and Analysis

As in Table 7, NDFT-DE-FPN gives rise to a 4.36 mAP boost over DE-FPN, making it a new state-of-the-art single model on VisDrone2018. Figure 9 shows a visual comparison example.
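
The reported boost can be reproduced from the numbers in Table 7. The short sketch below assumes that the coefficient-0 column corresponds to the DE-FPN baseline without disentanglement, which is consistent with the 4.36 figure; the variable names are ours.

```python
# mAP on the VisDrone2018 validation set for each coefficient value (Table 7).
sweep = {
    0: 48.41,      # assumed to be the DE-FPN baseline (no disentanglement)
    0.001: 48.97,
    0.003: 49.75,
    0.004: 51.66,
    0.005: 52.77,
    0.01: 51.67,
    0.02: 50.42,
}

best_coeff = max(sweep, key=sweep.get)          # coefficient with the highest mAP
boost = round(sweep[best_coeff] - sweep[0], 2)  # gain over the coefficient-0 column
print(best_coeff, boost)
```

The mAP rises up to a coefficient of 0.005 and drops for larger values, so the sweep suggests a single interior optimum within the tested range.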

4.3 Transfer from UAVDT to VisDrone2018

Problem Setting

We use VisDrone2018 as a testbed to showcase the transferability of NDFT features learned from UAVDT. We choose DE-FPN as the comparison subject.

Implementation Details

DE-FPN is trained on the VisDrone2018 training set and tested on the vehicle category of the validation set. We then train the same DE-FPN backbone on UAVDT with three nuisances (A+V+W) disentangled. The learned feature transform is then transferred to VisDrone2018 by re-training only the classification/regression layer while keeping all other feature extraction layers fixed. In that way, we focus on assessing the transferability of the features learned using NDFT. Besides, we repeat the same routine with the nuisance loss coefficients set to zero, to create a transferred DE-FPN baseline without nuisance disentanglement. We denote the two transferred models as NDFT-DE-FPN(r) and DE-FPN(r), respectively. Since vehicle is the only shared category between UAVDT and VisDrone2018, we compare average precision on the vehicle class only to ensure a fair transfer setting. The performance of DE-FPN, NDFT-DE-FPN(r) and DE-FPN(r) is compared on the VisDrone2018 validation set (since the testing set is not publicly accessible).
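
The transfer protocol (fixed feature extraction layers, re-trained prediction layer) can be illustrated with a deliberately tiny, dependency-free sketch. It is not the DE-FPN pipeline: the scalar feature map, the linear head, the data and all names are hypothetical stand-ins chosen only to show that the transferred features are left untouched while the head is fitted on the target data.

```python
# Stand-in for the feature extractor learned on the source dataset.
feature_weight = 2.0


def features(x):
    """Frozen feature map: its weight is never updated during transfer."""
    return feature_weight * x


def train_head_only(data, w=0.0, b=0.0, lr=0.1, epochs=300):
    """Fit a linear head (w, b) on top of the frozen features by SGD on squared error."""
    for _ in range(epochs):
        for x, y in data:
            z = features(x)          # features are computed but not trained
            err = (w * z + b) - y
            w -= lr * err * z        # gradient of 0.5 * err**2 with respect to w
            b -= lr * err            # gradient of 0.5 * err**2 with respect to b
    return w, b


# Hypothetical target-domain data generated as y = 3 * features(x) + 1.
target_data = [(0.0, 1.0), (0.5, 4.0), (1.0, 7.0)]
head_w, head_b = train_head_only(target_data)
print(head_w, head_b, feature_weight)
```

Because only the head is updated, any difference between two models transferred in this way reflects the quality of the fixed features, which is what the comparison between NDFT-DE-FPN(r) and DE-FPN(r) is meant to isolate.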

Results and Analysis

The APs of DE-FPN, DE-FPN(r) and NDFT-DE-FPN(r) are 76.80, 75.27 and 79.50, respectively, on the vehicle category. Directly transferring DE-FPN from UAVDT to VisDrone2018 (fine-tuned on the latter) does not give rise to competitive performance, showing a substantial domain mismatch between the two datasets. However, transferring the learned NDFT to VisDrone2018 leads to performance boosts, with a 4.23 AP margin over the transfer baseline without disentanglement, and 2.70 over DE-FPN. This demonstrates that NDFT could potentially contribute to a more generally transferable UAV object detector that handles more unseen scenes (domains). A visual comparison example on VisDrone2018 is presented in Figure 12.

5 Conclusion

This paper investigates object detection from UAV-mounted cameras, a vastly useful yet under-studied problem. The problem appears to be more challenging than standard object detection, due to many UAV-specific nuisances. We propose to gain robustness to those nuisances, by explicitly learning a Nuisance Disentangled Feature Transform (NDFT), utilizing the “free” metadata. Extensive results on real UAV imagery endorse its effectiveness.