There is a growing interest in the commercial and recreational use of drones. This in turn imposes a threat to public safety. The Federal Aviation Administration (FAA) and NASA have reported numerous cases of drones disturbing the airline flight operations, leading to near collisions. It is therefore important to develop a robust drone monitoring system that can identify and track illegal drones. Drone monitoring is however a difficult task because of diversified and complex background in the real world environment and numerous drone types in the market.
Generally speaking, techniques for localizing drones can be categorized into two types: acoustic and optical sensing techniques. The acoustic sensing approach achieves target localization and recognition by using a miniature acoustic array system. The optical sensing approach processes images or videos to estimate the position and identity of a target object. In this work, we employ the optical sensing approach by leveraging the recent breakthrough in the computer vision field.
The objective of video-based object detection and tracking is to detect and track instances of a target object from image sequences. In earlier days, this task was accomplished by extracting discriminant features such as the scale-invariant feature transform (SIFT)  and the histograms of oriented gradients (HOG) 
. The SIFT feature vector is attractive since it is invariant to object’s translation, orientation and uniform scaling. Besides, it is not too sensitive to projective distortions and illumination changes since one can transform an image into a large collection of local feature vectors. The HOG feature vector is obtained by computing normalized local histograms of image gradient directions or edge orientations in a dense grid. It provides another powerful feature set for object recognition.
In 2012, Krizhevsky et al. 
demonstrated the power of the convolutional neural network (CNN) in the ImageNet grand challenge, which is a large scale object classification task, successfully. This work has inspired a lot of follow-up work on the developments and applications of deep learning methods. A CNN consists of multiple convolutional and fully-connected layers, where each layer is followed by a non-linear activation function. These networks can be trained end-to-end by back-propagation. There are several variants in CNNs such as the R-CNN, SPPNet  and Faster-RCNN . Since these networks can generate highly discriminant features, they outperform traditional object detection techniques in object detection by a large margin. The Faster-RCNN includes a Region Proposal Network (RPNs) to find object proposals, and it can reach real time computation.
The contributions of our work are summarized below.
To the best of our knowledge, this is the first one to use the deep learning technology for the challenging drone detection and tracking problem.
We propose to use a large number of synthetic drone images, which are generated by conventional image processing and 3D rendering algorithms, along with a few real 2D and 3D data to train the CNN.
We propose to use the residue information from an image sequence to train and test an CNN-based object tracker. It allows us to track a small flying object in a cluttered environment.
We propose an integrated drone monitoring system that consists of a drone detector and a generic object tracker. The integrated system outperforms the detection-only and the tracking-only sub-systems.
We have validated the proposed system on several drone datasets.
2 Data Collection and Augmentation
2.1 Data Collection
The first step in developing the drone monitoring system is to collect drone flying images and videos for the purpose of training and testing. We collect two drone datasets as shown in Fig. 1. They are explained below.
Public-Domain drone dataset.
It consists of 30 YouTube video sequences captured in an indoor or outdoor environment with different drone models. Some samples in this dataset are shown in Fig. 0(a). These video clips have a frame resolution of 1280 x 720 and their duration is about one minute. Some video clips contain more than one drone. Furthermore, some shoots are not continuous.
USC drone dataset.
It contains 30 video clips shot at the USC campus. All of them were shot with a single drone model. Several examples of the same drone in different appearance are shown in Fig. 0(b). To shoot these video clips, we consider a wide range of background scenes, shooting camera angles, different drone shapes and weather conditions. They are designed to capture drone’s attributes in the real world such as fast motion, extreme illumination, occlusion, etc. The duration of each video approximately one minute and the frame resolution is 1920 x 1080. The frame rate is 15 frames per second.
We annotate each drone sequence with a tight bounding box around the drone. The ground truth can be used in CNN training. It can also be used to check the CNN performance when we apply it to the testing data.
2.2 Data Augmentation
The preparation of a wide variety of training data is one of the main challenges in the CNN-based solution. For the drone monitoring task, the number of static drone images is very limited and the labeling of drone locations is a labor intensive job. The latter also suffers from human errors. All of these factors impose an additional barrier in developing a robust CNN-based drone monitoring system. To address this difficulty, we develop a model-based data augmentation technique that generates training images and annotates the drone location at each frame automatically.
The basic idea is to cut foreground drone images and paste them on top of background images as shown in Fig. 2. To accommodate the background complexity, we select related classes such as aircrafts, cars in the PASCAL VOC 2012 . As to the diversity of drone models, we collect 2D drone images and 3D drone meshes of many drone models. For the 3D drone meshes, we can render their corresponding images by changing camera’s view-distance, viewing-angle, lighting conditions. As a result, we can generate many different drone images flexibly. Our goal is to generate a large number of augmented images to simulate the complexity of background images and foreground drone models in a real world environment. Some examples of the augmented drone images of various appearances are shown in Fig. 2.
Specific drone augmentation techniques are described below.
We apply geometric transformations such as image translation, rotation and scaling. We randomly select the angle of rotation from the range (-30, 30). Furthermore, we conduct uniform scaling on the original foreground drone images along the horizontal and the vertical direction. Finally, we randomly select the drone location in the background image.
To simulate drones in the shadows, we generate regular shadow maps by using random lines and irregular shadow maps via Perlin noise . In the extreme lighting environments, we observe that drones tend to be in monochrome (i.e. the gray-scale) so that we change drone images to gray level ones.
This augmentation technique is used to simulate blurred drones caused by camera’s motion and out-of-focus. We use some blur filters (e.g. the Gaussian filter, the motion Blur filter) to create the blur effects on foreground drone images.
3 Drone Monitoring System
To realize the high performance, the system consists of two modules; namely, the drone detection module and the drone tracking module. Both of them are built with the deep learning technology. These two modules complement each other, and they are used jointly to provide the accurate drone locations for a given video input.
3.1 Drone Detection
The goal of drone detection is to detect and localize the drone in static images. Our approach is built on the Faster-RCNN 
, which is one of the state-of-the-art object detection methods for real-time applications. The Faster-RCNN utilizes the deep convolutional networks to efficiently classify object proposals. To achieve real time detection, the Faster-RCNN replaces the usage of external object proposals with the Region Proposal Networks (RPNs) that share convolutional feature maps with the detection network. The RPN is constructed on the top of convolutional layers. It consists of two convolutional layers – one that encodes conv feature maps for each proposal to a lower-dimensional vector and the other that provides the classification scores and regressed bounds. The Faster-RCNN achieves nearly cost-free region proposals and it can be trained end-to-end by back-propagation. We use the Faster-RCNN to build the drone detector by training it with synthetic drone images generated by the proposed data augmentation technique as described in Sec.2.2.
3.2 Drone Tracking
The drone tracker attempts to locate the drone in the next frame based on its location at the current frame. It searches around the neighborhood of the current drone’s position. This helps detect a drone in a certain region instead of the entire frame. To achieve this objective, we use the state-of-the-art object tracker called the Multi-Domain Network (MDNet) . The MDNet is able to separate the domain independent information from the domain specific information in network training. Besides, as compared with other CNN-based trackers, the MDNet has fewer layers, which lowers the complexity of an online testing procedure.
To improve the tracking performance furthermore, we propose a video pre-processing step. That is, we subtract the current frame from the previous frame and take the absolute values pixelwise to obtain the residual image of the current frame. Note that we do the same for the R,G,B three channels of a color image frame to get a color residual image. Three color image frames and their corresponding color residual images are shown in Fig. 4 for comparison. If there is a panning movement of the camera, we need to compensate the global motion of the whole frame before the frame subtraction operation.
Since there exists strong correlation between two consecutive images, most background of raw images will cancel out and only the fast moving object will remain in residual images. This is especially true when the drone is at a distance from the camera and its size is relatively small. The observed movement can be well approximated by a rigid body motion. We feed the residual sequences to the MDNet for drone tracking after the above pre-processing step. It does help the MDNet to track the drone more accurately. Furthermore, if the tracker loses the drone for a short while, there is still a good probability for the tracker to pick up the drone in a faster rate. This is because the tracker does not get distracted by other static objects that may have their shape and color similar to a drone in residual images. Those objects do not appear in residual images.
3.3 Integrated Detection and Tracking System
There are limitations in detection-only or tracking-only modules. The detection-only module does not exploit the temporal information, leading to huge computational waste. The tracking-only module does not attempt to recognize the drone object but follow a moving target only. To build a complete system, we need to integrate these two modules into one. The flow chart of the proposed drone monitoring system is shown in Fig. 5.
Generally speaking, the drone detector has two tasks – finding the drone and initializing the tracker. Typcially, the drone tracker is used to track the detected drone after the initialization. However, the drone tracker can also play the role of a detector when an object is too far away to be robustly detected as a drone due to its small size. Then, we can use the tracker to track the object before detection based on the residual images as the input. Once the object is near, we can use the drone detector to confirm whether it is a drone or not.
An illegal drone can be detected once it is within the field of view and of a reasonable size. The detector will report the drone location to the tracker as the start position. Then, the tracker starts to work. During the tracking process, the detector keeps providing the confidence score of a drone at the tracked location as a reference to the tracker. The final updated location can be acquired by fusing the confidence scores of the tracking and the detection modules as follows.
For a candidate bounding box, we can compute the confidence scores of this location via
where and denote the confidence scores obtained by the detector and the tracker, respectively, is the confidence score of this candidate location and parameters , , , are used to control the acceptance threshold.
We compute the confidence score of a couple of bounding box candidates, denoted by , , where denoted the set of candidate indices. Then, we select the one with the highest score:
where is the finally selected bounding box and is its confidence score. If , the system will report a message of rejection.
4 Experimental Results
4.1 Drone Detection
We test on both the real-world and the synthetic datasets. Each of them contains 1000 images. The images in the real-world dataset are sampled from videos in the USC Drone dataset. The images in the synthetic dataset are generated using different foreground and background images in the training dataset. The detector can take any size of images as the input. These images are then re-scaled such that their shorter side has 600 pixels .
To evaluate the drone detector, we compute the precision-recall curve. Precision is the fraction of the total number of detections that are true positive. Recall is the fraction of the total number of labeled samples in positive class that are true positive. The area under the precision-recall curve (AUC)  is also reported. The effectiveness of the proposed data augmentation technique is illustrated in Fig. 6. In this figure, we compare the performance of the baseline method that uses simple geometric transformations only and that of the method that uses all mentioned data augmented techniques, including geometric transformations, illumination conditions and image quality simulation. Clearly, better detection performance can be achieved by more augmented data. We see around and improvements in the AUC measure on the real-world and the synthetic datasets, respectively.
4.2 Drone Tracking
The MDNet is adopted as the object tracker. We take 3 video sequences from the USC drone dataset as testing ones. They cover several challenges, including scale variation, out-of-view, similar objects in background, and fast motion. Each video sequence has a duration of 30 to 40 seconds with 30 frames per second. Thus, each sequence contains 900 to 1200 frames. Since all video sequences in the USC drone dataset have relatively slow camera motion, we can also evaluate the advantages of feeding residual frames (instead of raw images) to the MDNet.
The performance of the tracker is measured with the area-under-the-curve (AUC) measure. We first measure the intersection over union for all frames in all video sequences as
where the “Area of Overlap” is the common area covered by the predicted and the ground truth bounding boxes and the “Area of Union” is the union of the predicted and the ground truth bounding boxes. The IoU value is computed at each frame. If it is higher than a threshold, the success rate is set to 1; otherwise, 0. Thus, the success rate value is either 1 or 0 for a given frame. Once we have the success rate values for all frames in all video sequences for a particular threshold, we can divide the total success rate by the total frame number. Then, we can obtain a success rate curve as a function of the threshold. Finally, we measure the area under the curve (AUC) which gives the desired performance measure.
We compare the success rate curves of the MDNet using the original images and the residual images in Fig. 7. As compared to the raw frames, the AUC value increases by around 10% using the residual frames as the input. It collaborates the intuition that removing background from frames helps the tracker identify the drones more accurately. Although residual frames help improve the performance of the tracker for certain conditions, it still fails to give good results in two scenarios: 1) movement with fast changing directions and 2) co-existence of many moving objects near the target drone. To overcome these challenges, we have the drone detector operating in parallel with the drone tracker to get more robust results.
4.3 Fully Integrated System
The fully integrated system contains both the detection and the tracking modules. We use the USC drone dataset to evaluate the performance of the fully integrated system. The performance comparison (in terms of the AUC measure) of the fully integrated system, the conventional MDNet (the tracker-only module) and the Faster-RCNN (the detector-only module) is shown in Fig. 8. The fully integrated system outperforms the other benchmarking methods by substantial margins. This is because the fully integrated system can use detection as the means to re-initialize its tracking bounding box when it loses the object.
A video-based drone monitoring system was proposed in this work. The system consisted of the drone detection module and the drone tracking module. Both of them were designed based on deep learning networks. We developed a model-based data augmentation technique to enrich the training data. We also exploited residue images as the input to the drone tracking module. The fully integrated monitoring system takes advantage of both modules to achieve high performance monitoring. Extensive experiments were conducted to demonstrate the superior performance of the proposed drone monitoring system.
This research is supported by a grant from the Pratt & Whitney Institute of Collaborative Engineering (PWICE).
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, pp. 1097–1105, 2012.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, pp. 91–99, 2015.
R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies
for accurate object detection and semantic segmentation,” in
Computer Vision and Pattern Recognition, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” in European Conference on Computer Vision, pp. 346–361, Springer, 2014.
-  H. Nam, B. Han, Learning Multi-Domain Convolution Neural Networks for Visual Tracking.In CVPR, 2016.
-  J. Huang and C. X. Ling, “Using AUC and accuracy in evaluating learning algorithms,” IEEE Transactions on knowledge and Data Engineering, vol. 17, no. 3, pp. 299–310, 2005.
-  N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1, pp. 886–893, IEEE, 2005.
-  D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, no. 2, pp. 91–110, 2004.
-  M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, “The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results.” http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
-  K. Perlin, “An image synthesizer,” ACM Siggraph Computer Graphics, vol. 19, no. 3, pp. 287–296, 1985.