Aerial multi-object tracking by detection using deep association networks

09/04/2019 · Ajit Jadhav et al. · Indian Institute of Technology Delhi, IIIT Sri City

A lot of research is focused on object detection, and it has achieved significant advances with deep learning techniques in recent years. In spite of the existing research, these algorithms are usually not optimal for dealing with sequences or images captured by drone-based platforms, due to challenges such as viewpoint change, scale, density of object distribution and occlusion. In this paper, we develop a model for detecting objects in drone images using the VisDrone2019 DET dataset. Using the RetinaNet model as our base, we modify the anchor scales to better handle the dense distribution and small size of the objects. We explicitly model channel interdependencies by using "Squeeze-and-Excitation" (SE) blocks that adaptively recalibrate channel-wise feature responses. This brings significant improvements in performance at a slight additional computational cost. Using this architecture for object detection, we build a custom DeepSORT-based tracker for multi-object tracking on the VisDrone2019 MOT dataset by training a custom deep association network for the algorithm.


1 Introduction

Object detection and tracking has remained an important research problem in computer vision [9, 13, 33, 53]. It is relevant for a myriad of applications such as video surveillance, scene understanding, semantic segmentation, object localization and robot manipulation. In real-time scenarios, object detection poses several challenges such as scale, pose and illumination variations, occlusion and clutter. In the case of videos, motion in dynamic environments adds a further challenge. We deal with a specialized category of drone images where the major difficulty arises from fine granularity and the absence of strong discriminative features to handle inter- and intra-class variance. For unmanned aerial vehicles (UAVs), identifying obstacles from a height is crucial for autonomous navigation. Drones are also widely used to patrol border areas that cannot be covered by military personnel on the ground. Typical applications range from tracking criminals in surveillance videos [44] and search and rescue [51] to sports analysis and scene understanding [52, 34, 23, 48]. There are additional challenges specific to drone images: the density of objects is high, objects appear at small scales, camera motion is unconstrained and deployment must often run in real time. Motivated by these issues, we focus on object detection and tracking in aerial imagery.

Figure 1: Detection Network

Owing to the flexibility of drone usage and navigation capabilities, the acquired images can also be utilized to perform 3D reconstruction and object discovery. However, doing so requires techniques based on simultaneous localization and mapping (SLAM), which in turn depend heavily on other sensor data such as accelerometer, gyroscope and magnetometer readings. Further, object detection and collision avoidance methods typically incur a large computational overhead. For mobile drone videos, deep learning techniques must process images in real time with high accuracy. There are two popular families of object detection frameworks: i) two-stage and ii) single-stage. Two-stage frameworks, represented by R-CNN [15] and its variants [16, 43, 8, 28, 6], extract object proposals followed by object classification and bounding box regression. Single-stage frameworks, such as YOLO [40, 41, 42] and SSD [33, 14], apply object classifiers and bounding box regressors in an end-to-end manner without explicitly extracting object proposals. Most state-of-the-art methods [40, 43, 41, 42, 26, 29, 27] focus on detecting generic objects in natural images, where targets are sparsely distributed and relatively few in number. Due to the intrinsic differences in data distribution between drone images and natural images, traditional CNN-based methods tend to miss such densely distributed small objects.

In this paper, we provide a novel multi-object tracking-by-detection framework particularly for aerial images captured by drones. We detect ten predefined categories of objects (i.e., pedestrian, person, car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle) in drone images from the VisDrone2019 dataset [56]. In view of the above discussion, the key contributions can be summarized as follows:

  • We utilize denser anchor scales with larger scale variance to better detect the dense distribution of small objects.

  • We utilize Squeeze-and-Excitation (SE) [20] blocks to capture channel dependencies, which results in better feature representations for the detection task under moving-camera constraints.

  • For the tracking model, we train the deep association network [47] on the object hypotheses generated by the detection module and feed them to the DeepSORT algorithm [46] for tracking.

The remaining sections of the paper are organized as follows. In Sec. 2 we discuss related work in object detection and tracking. In Sec. 3, we outline the proposed methodology for detecting objects and subsequently tracking them. In Sec. 4, we discuss experimental results, and we conclude the paper in Sec. 5.

2 Related Work

In this section, we provide a detailed overview of contemporary techniques in the domains closely related to this work.

2.1 Aerial imagery object detection

In [56], the authors release a challenge dataset of drone images with varying weather and lighting conditions, together with a thorough review of the latest techniques on the benchmark and exhaustive evaluation protocols. In [27], the authors deploy novel real-time, deep-learning-based object detection and tracking algorithms on mobile devices mounted on drones. In [35], the authors present an object detection method for data collected with asynchronous drone cameras; spatio-temporal information is captured from the event stream after motion compensation is applied to localize objects in motion. In [5], the authors provide an autonomous target detection and tracking algorithm for the AR Parrot drone. In [22], the authors provide a real-time motion detection algorithm for visual-inertial drone systems with dynamic backgrounds, which runs efficiently on a low-power Snapdragon processor. In [7], the authors release an interesting challenge dataset for bird-vs-drone detection aimed at preventing smuggling by drones in coastal areas; the idea is to raise an alert when drones appear in videos that may also contain flying birds. In [49], the authors propose an architecture for a collaborative aerial system with autonomous networking capabilities in aerial traffic; it consists of multi-drone systems of quadcopters fitted with various on-board sensing devices for communication, and aids applications such as disaster assistance, aerial monitoring and search and rescue. In [1], the authors provide an end-to-end trainable deep architecture for drone detection that leverages data augmentation techniques. In [19], the authors propose novel Layer Proposal Networks for localizing and counting objects in dynamic environments, leveraging spatial layout information in the kernels to improve localization accuracy.

2.2 Multi-object tracking

In [12], the authors propose a temporal generative network, namely a recurrent autoregressive network, to model appearance and motion features in temporal sequences. It strongly couples internal and external memory with the network, thus incorporating information about previous frame trajectories and long-term dependencies. In [25], a Bilinear LSTM based technique is proposed to efficiently learn long-term appearance models via a recurrent network. In [55], the authors combine the advantages of single-object tracking and data association methods to detect and track objects in noisy environments. In [18], the authors formulate tracking as a weighted graph labeling problem and fuse head and full-body detectors for tracking. In [50], the authors provide mechanisms to handle temporal errors in tracking, such as drifting and track ID switches caused by occlusion or scene noise, by incorporating motion and shape information into a Siamese network to improve tracking performance. In [54], the authors propose a Deep Continuous Conditional Random Field (DCCRF) for handling inter-object relations and movement patterns in tracking. In [37], the authors introduce a category-agnostic, detection-free tracker that uses segmentation masks with semantic segmentation based approaches. In [4], the authors propose a generalized labeled multi-Bernoulli (GLMB) filter for large-scale multi-object tracking.

2.3 Motion segmentation

Unsupervised motion segmentation is an important task that supports object localization as well as adaptive video compression. In [24], the authors perform motion segmentation and tracking by co-clustering, where segmentation is obtained by grouping trajectories. In [39], the authors provide a joint framework for unsupervised learning of depth, motion and optical flow that performs motion segmentation by exploiting geometric constraints. In [21], the authors adopt saliency estimation with spatial neighborhood information in a graph modeling framework, utilizing optical flow and edge cues for feature extraction. In [36], the authors introduce a motion event dataset and utilize a Structure from Motion (SfM) based pipeline with a computationally efficient deep neural network for event detection; they rely on dense depth map computation for motion segmentation and estimate the 3D velocities of moving objects.

3 Methodology

The VisDrone dataset comprises images taken at varying altitudes with egocentric camera movements caused by high-altitude winds, leading to drastic scale changes and occlusions in the scene. Our Detection and Tracking (DnT) framework is optimized for such scenarios. A large fraction of objects are small and densely packed, which generic DnT frameworks fail to detect, and detection is the basis of every tracking-by-detection scheme. A better detection framework not only yields good detections but also provides a strong basis for tracking; since we track by associating objects across sequential frames, the need for an optimal detector becomes even more significant. We describe our DnT architecture for object detection and tracking, illustrated in Fig. 1. The first subsection details the selection of RetinaNet as the base deep learning architecture for object detection on the drone dataset. We then construct a training strategy consisting of an optimal set of anchor scales combined with SE blocks for detection, and learn a deep association network for tracking detected objects in subsequent frames.

3.1 Selection of Base Detector: YOLOv3 vs RetinaNet

We evaluate two single-stage object detectors: YOLOv3 and RetinaNet. For the YOLO model, we use the same training parameters as in [42], but instead of the original set of variable square input sizes (320, 352, 384, 416, 448, 480, 512, 544, 576, 608) we use a set of larger input sizes (544, 576, 608, 640, 672, 704, 736, 768, 800) to account for the high resolution and scale variability of the images in the VisDrone dataset. The 9 anchor clusters obtained on the COCO dataset for this algorithm were (10 × 13), (16 × 30), (33 × 23), (30 × 61), (62 × 45), (59 × 119), (116 × 90), (156 × 198) and (373 × 326); we use the same clusters when training on the VisDrone-DET dataset. For the RetinaNet network, we use the same training parameters as in [31] while increasing the input size to 1500 × 1000 and the maximum number of detections to 500. We select RetinaNet as our base detector because it outperforms YOLOv3 on the VisDrone dataset.
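
As a rough illustration of the enlarged multi-scale input schedule described above, the Python sketch below samples a new square training size from the larger set every few batches. The 10-batch resampling period is an assumption borrowed from the original YOLO multi-scale training recipe; the paper above only specifies the size set itself.

```python
import random

# Larger square training sizes used for YOLOv3 on VisDrone (values from Sec. 3.1).
INPUT_SIZES = [544, 576, 608, 640, 672, 704, 736, 768, 800]

def multiscale_size_schedule(num_batches, period=10, seed=0):
    """Yield a square network input size for each training batch.

    A new size is sampled from INPUT_SIZES every `period` batches, following
    the multi-scale training recipe of the YOLO papers (the 10-batch period is
    an assumption; only the size set is given in the text).
    """
    rng = random.Random(seed)
    size = rng.choice(INPUT_SIZES)
    for batch_idx in range(num_batches):
        if batch_idx % period == 0:
            size = rng.choice(INPUT_SIZES)
        yield size

# Example: input sizes for the first 25 training batches.
print(list(multiscale_size_schedule(25)))
```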

Figure 2: Tracking Network

3.2 Anchor scales

One of the most important design factors in a one-stage detection system is how densely it covers the space of possible image boxes. The anchor box parameters in RetinaNet [31] are therefore critical to building a detection framework that is robust to varying object scales. RetinaNet uses translation-invariant anchor boxes. On pyramid levels P3 to P7, the anchors have areas of 32×32 to 512×512. At each pyramid level, anchors at three aspect ratios (1:2, 1:1, 2:1) are used, and scales of 2^0, 2^(1/3) and 2^(2/3) of the base anchor at each aspect ratio provide denser scale coverage at each level. In total there are A = 9 anchors per level, and across levels they cover the scale range 32–813 pixels with respect to the network's input image. These anchor parameters are suited to object detection on natural images. However, a large number of objects in the VisDrone2019 dataset are smaller than 32×32 pixels, many of them close to 8×8 pixels, so the standard anchor parameters are not a good fit for drone images: such objects are never assigned an anchor, do not contribute to training, and consequently the model cannot identify them. To address this issue, we modify the anchor parameters to cover the range of object sizes in the dataset. While we keep the same anchor sizes, aspect ratios and strides, we use the scales 0.1, 0.25, 0.5, 1, 2^(1/3) and 2.2, which cover a larger variance in size and are denser owing to the use of 6 scales instead of the original 3. Anchors are thus assigned to small objects more effectively, allowing them to contribute to training and improving the learned model.
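
To make the effect of the modified scales concrete, the following framework-agnostic NumPy sketch (with a hypothetical helper name) enumerates the square-anchor side lengths produced at each FPN level by the default RetinaNet scales versus the denser set used here. It only illustrates scale coverage, not the full anchor-matching logic.

```python
import numpy as np

# Base anchor sizes per FPN level P3-P7, as in RetinaNet [31].
BASE_SIZES = {"P3": 32, "P4": 64, "P5": 128, "P6": 256, "P7": 512}

DEFAULT_SCALES = [2 ** 0, 2 ** (1 / 3), 2 ** (2 / 3)]        # original RetinaNet scales
DENSE_SCALES = [0.1, 0.25, 0.5, 1.0, 2 ** (1 / 3), 2.2]      # scales used in this work

def anchor_side_lengths(scales):
    """Side lengths (in pixels) of the square anchors produced at each level."""
    return {level: np.round(base * np.array(scales), 1).tolist()
            for level, base in BASE_SIZES.items()}

print("default:", anchor_side_lengths(DEFAULT_SCALES))   # smallest anchor: 32 px
print("dense:  ", anchor_side_lengths(DENSE_SCALES))     # smallest anchor: 3.2 px
# With the dense scales, P3 produces anchors of roughly 3-8 px on a side, so the
# ~8x8 px objects common in VisDrone can be assigned at least one anchor.
```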

3.3 SE Blocks

In RetinaNet, the set of feature maps P3, P4, P5, P6, P7 is generated from the feature activations output by the last residual block of each stage of the ResNet backbone. Specifically, the outputs of the last residual blocks C3, C4 and C5 (i.e., the outputs of conv3, conv4 and conv5) are used. We modify the architecture by passing C3, C4 and C5 through an SE block before feeding them to the feature pyramid network. This yields better-represented features for the generation of P3, P4, P5, P6 and P7, resulting in better detection results.
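
For reference, below is a minimal PyTorch sketch of a standard Squeeze-and-Excitation block as described in [20], applied to a backbone output such as C3/C4/C5 before it enters the FPN. The reduction ratio of 16 is the default from the SE paper, not a value reported here, and the usage comment is illustrative only.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block [20]: global average pooling followed by a
    two-layer bottleneck that produces per-channel scaling weights."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global spatial average
        self.fc = nn.Sequential(                     # excitation: channel bottleneck
            nn.Linear(channels, channels // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels, bias=False),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.pool(x).view(b, c)                  # (B, C)
        w = self.fc(w).view(b, c, 1, 1)              # per-channel weights in (0, 1)
        return x * w                                 # recalibrate the feature maps

# Illustration: recalibrate C5 (2048 channels for a ResNet-50 backbone) before the FPN.
c5 = torch.randn(1, 2048, 32, 47)
c5 = SEBlock(2048)(c5)
```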

3.4 Multi-Object Tracking Framework

A multi-object tracking model is built on top of the detection model that detects objects in each frame. Similar to DeepSORT, our algorithm learns a deep association network using patches from the COCO dataset, which enables us to score patches on the basis of deep feature similarity. Unlike DeepSORT, we keep track of identity labels for multiple objects of similar classes. In addition, when matching detections from subsequent frames, we take the confidence measure provided by the detector and fuse it with the deep association metric, thereby improving tracking in scenarios where the confidence score of the detected object in the next frame is high but the deep association score is low.

First, detections are generated from the frames using the object detection model, and feature embeddings are computed for them using the trained deep association model. The detections, including object labels and confidence scores, are then passed along with the feature embeddings to a DeepSORT-like association algorithm, which generates the object tracklets.
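
The per-frame pipeline above can be sketched as follows. The function and object names (`detector`, `embedder`, `tracker`) are hypothetical interfaces, and the weighted fusion of appearance distance and detector confidence is an illustrative assumption; the paper describes the fusion only qualitatively.

```python
def fused_cost(app_dist, det_conf, alpha=0.7):
    """Combine appearance distance with detector confidence (hypothetical fusion).

    app_dist: cosine distance between track and detection embeddings.
    det_conf: detector confidence of the new detection, in [0, 1].
    A confident detection lowers the matching cost even when appearance
    similarity is weak, as described in Sec. 3.4. The weight `alpha`
    is an illustrative choice, not a value from the paper.
    """
    return alpha * app_dist + (1.0 - alpha) * (1.0 - det_conf)

def track_sequence(frames, detector, embedder, tracker):
    """Per-frame tracking loop; detector, embedder and tracker are assumed
    to expose these hypothetical interfaces."""
    for frame in frames:
        boxes, labels, scores = detector(frame)      # detection model (Sec. 3.1-3.3)
        feats = embedder(frame, boxes)               # deep association embeddings
        tracker.update(boxes, labels, scores, feats,
                       cost_fn=fused_cost)           # DeepSORT-style association
    return tracker.tracklets
```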

Figure 3: Qualitative Results

3.5 Training Strategy

RetinaNet is trained with stochastic gradient descent (SGD). All models are trained with an initial learning rate of 1e-5, a weight decay of 0.0001 and a momentum of 0.9. The training loss is the sum of the focal loss and the standard smooth L1 loss used for box regression [16]. To improve speed, we only decode box predictions from at most the 1k top-scoring predictions per FPN level, after thresholding detector confidence at 0.05. The top predictions from all levels are merged and class-wise non-maximum suppression with a threshold of 0.5 is applied to yield the final detections. The same parameters were used for training all the models. The base RetinaNet model was trained for 26 epochs with 1618 iterations per epoch and a batch size of 4. The model with improved scales was trained for 25 epochs with 3246 iterations per epoch and a batch size of 4. Finally, the model with the new scales and the SE blocks was trained for 27 epochs with 3246 iterations per epoch and a batch size of 2.
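
The inference-time decoding described above can be summarized by the sketch below, which uses torchvision's NMS operator. The tensor names for the per-level predictions are hypothetical; this is a sketch of the procedure described in the text, not the authors' implementation.

```python
import torch
from torchvision.ops import batched_nms

def decode_detections(per_level_boxes, per_level_scores, per_level_labels,
                      score_thresh=0.05, topk_per_level=1000, nms_thresh=0.5,
                      max_detections=500):
    """Decode RetinaNet outputs as described in Sec. 3.5: threshold at 0.05,
    keep at most 1k top-scoring boxes per FPN level, merge all levels, then
    apply class-wise NMS at IoU 0.5 and keep at most 500 detections."""
    boxes, scores, labels = [], [], []
    for b, s, l in zip(per_level_boxes, per_level_scores, per_level_labels):
        keep = s > score_thresh                      # confidence threshold
        b, s, l = b[keep], s[keep], l[keep]
        if s.numel() > topk_per_level:               # top-k per level
            s, idx = s.topk(topk_per_level)
            b, l = b[idx], l[idx]
        boxes.append(b); scores.append(s); labels.append(l)
    boxes, scores, labels = torch.cat(boxes), torch.cat(scores), torch.cat(labels)
    keep = batched_nms(boxes, scores, labels, nms_thresh)   # class-wise NMS
    keep = keep[:max_detections]
    return boxes[keep], scores[keep], labels[keep]
```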

4 Experimental results and analysis

The detection framework was evaluated on the VisDrone2019 challenge dataset [56], which comprises separate multi-object detection and tracking benchmarks. In this section, we describe the optimized hyper-parameters and implementation details and analyze the results on these benchmarks.

Method                                    AP@[0.50:0.95]   AP@0.50   AP@0.75
YOLOv3                                    13.8             30.43     11.18
RetinaNet                                 14.45            23.74     15.14
RetinaNet (dense scales)                  15.39            33.13     13.07
RetinaNet (dense scales + SE attention)   17.19            37.69     13.97

Table 1: Average Precision at maxDetections=500

4.1 Dataset

VisDrone2019 is a large-scale visual object detection benchmark collected over a wide area spanning 14 different cities in China. For object detection, it consists of 6,471 images in the training set and 548 images in the validation set. It covers 10 object categories drawn from real-world scenarios, such as pedestrian, car and bus, captured using multiple drone models under various weather and lighting conditions. The VisDrone-DET dataset (available at http://www.aiskyeye.com) focuses on detecting ten predefined categories of objects (i.e., pedestrian, person, car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle) in images from drones. Since the dataset ships with default train and test splits, we divide the training set into train and validation splits and select our base network architecture based on the validation results. We fine-tune our models using the same approach and report results on the provided test set.

4.2 Evaluation Metrics

The output of the detection algorithm is a list of detected bounding boxes with confidence scores for each image. Following the evaluation protocol of MS COCO [32], we use the AP@[IoU=0.50:0.05:0.95], AP@[IoU=0.50], AP@[IoU=0.75], AR@[max=1], AR@[max=10], AR@[max=100] and AR@[max=500] metrics to evaluate the detection results. These criteria penalize missed detections as well as duplicate detections (two detection results for the same object instance). Specifically, AP@[IoU=0.50:0.05:0.95] is computed by averaging over all 10 Intersection over Union (IoU) thresholds (i.e., the range [0.50 : 0.95] with a uniform step size of 0.05) and all categories, and is used as the primary metric for comparison of models. AP@[IoU=0.50] and AP@[IoU=0.75] are computed at the single IoU thresholds 0.5 and 0.75 over all categories, respectively. The AR@[max=1], AR@[max=10], AR@[max=100] and AR@[max=500] scores are the maximum recalls given 1, 10, 100 and 500 detections per image, averaged over all categories and IoU thresholds.
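
For reference, this protocol can be run with the pycocotools evaluator as sketched below, assuming the ground truth and detections have been exported to COCO-format JSON (the file names are hypothetical). Note that the stock summarize() routine reports AR only for its first three maxDets entries, so reporting AR at 500 detections requires either the VisDrone evaluation toolkit or a small modification of that routine.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Hypothetical file names: ground truth and detections in COCO JSON format.
coco_gt = COCO("visdrone_val_gt.json")
coco_dt = coco_gt.loadRes("retinanet_dense_se_dets.json")

coco_eval = COCOeval(coco_gt, coco_dt, iouType="bbox")
coco_eval.params.maxDets = [1, 10, 100, 500]   # VisDrone reports AR up to 500 detections
coco_eval.evaluate()
coco_eval.accumulate()
coco_eval.summarize()   # AP@[0.50:0.95], AP@0.50, AP@0.75, AR@{1, 10, 100, ...}
```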

Method                                    AR@1   AR@10   AR@100   AR@500
YOLOv3                                    0.36   2.63    17.53    19.34
RetinaNet                                 0.59   5.91    20.96    21.38
RetinaNet (dense scales)                  0.48   4.78    22.02    30.49
RetinaNet (dense scales + SE attention)   0.52   4.69    23.44    31.93

Table 2: Average Recall at IoU 0.50:0.95
Method              AP[%]   AP50[%]   AP75[%]   AR1[%]   AR10[%]   AR100[%]   AR500[%]
CornerNet [26]      17.41   34.12     15.78     0.39     3.32      24.37      26.11
Light-RCNN [28]     16.53   32.78     15.13     0.35     3.16      23.09      25.07
DetNet [29]         15.26   29.23     14.34     0.26     2.57      20.87      22.28
RefineDet512 [53]   14.9    28.76     14.08     0.24     2.41      18.13      25.69
RetinaNet [27]      11.81   21.37     11.62     0.21     1.21      5.31       19.29
FPN [30]            16.51   32.2      14.91     0.33     3.03      20.72      24.93
Cascade-RCNN [6]    16.09   16.09     15.01     0.28     2.79      21.37      28.43
Ours                11.19   25.65     8.78      0.56     4.87      17.19      24.09

Table 3: Detection Results

4.3 Implementation Details

We use ResNet-50 as the backbone of our detection architecture [17] and initialize all our models with weights pretrained on the COCO dataset [32, 10]. The network architecture is shown in Fig. 2. In the training stage, the input images are upsampled to 1500 × 1000. For data augmentation, we use a standard combination of random transforms such as rotation, translation, shear, scaling and horizontal flipping. At test time, we do not fix the image size and set the confidence threshold to 0.05. We train the network for 50K iterations with a batch size of 1, using the stochastic gradient descent (SGD) solver with a base learning rate of 1e-5.

For multi-object tracking, the patches generated by our object detector on the MS COCO detection dataset [32] are resized to 128 × 128 and fed to the Deep Association network for training. The initial learning rate was set to 1e-3. The network was regularized with a weight decay of 1 × 10^-8 and dropout inside the residual units with probability 0.4. The model was trained for 120k iterations with a batch size of 128.
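
The hyper-parameters above map onto a training loop roughly like the following PyTorch sketch. The small `AssociationNet` is only a stand-in for the deep cosine metric learning network of [47] (it has no residual units), the dummy patches and identity labels replace the COCO-derived crops, and the plain-SGD optimizer and cross-entropy objective are assumptions standing in for the cosine-softmax classifier used there.

```python
import torch
import torch.nn as nn

class AssociationNet(nn.Module):
    """Simplified stand-in for the deep association network of [47]:
    a small conv trunk, a 128-d unit-norm embedding and an identity classifier."""

    def __init__(self, num_ids=1000, embed_dim=128, dropout=0.4):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ELU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ELU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ELU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Dropout(dropout), nn.Linear(128, embed_dim),
        )
        self.classifier = nn.Linear(embed_dim, num_ids)

    def forward(self, x):
        emb = nn.functional.normalize(self.trunk(x), dim=1)   # embedding for association
        return self.classifier(emb), emb

# Values from Sec. 4.3: lr 1e-3, weight decay 1e-8, dropout 0.4, batch size 128.
net = AssociationNet()
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, weight_decay=1e-8)
criterion = nn.CrossEntropyLoss()

patches = torch.randn(128, 3, 128, 128)     # a batch of 128 patches resized to 128x128
labels = torch.randint(0, 1000, (128,))     # dummy identity labels
for step in range(1):                        # in practice: 120k iterations
    logits, _ = net(patches)
    loss = criterion(logits, labels)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```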

Method       AP      AP@0.25   AP@0.50   AP@0.75   AP_car   AP_bus   AP_truck   AP_ped   AP_van
CEM [2]      5.7     9.22      4.89      2.99      6.51     10.58    8.33       0.7      2.38
CMOT [3]     14.22   22.11     14.58     5.98      27.72    17.95    7.79       9.95     7.71
GOG [38]     6.16    11.03     5.3       2.14      17.05    1.8      5.67       3.7      2.55
H2T [45]     4.93    8.93      4.73      1.12      12.9     5.99     2.27       2.18     1.29
IHTLS [11]   4.72    8.6       4.34      1.22      12.07    2.38     5.82       1.94     1.4
Ours         13.88   23.19     12.81     5.64      32.2     8.83     6.61       18.61    3.16

Table 4: Tracking Results

4.4 Performance Evaluation

As shown in Table 1, RetinaNet performs better on the VisDrone dataset in terms of the primary AP metric: the AP score of YOLOv3 is 13.8 while that of RetinaNet is 14.45. However, the AP@[IoU=0.50] score of YOLOv3 is 30.43 while its AP@[IoU=0.75] score is only 11.18, whereas for RetinaNet the AP@[IoU=0.50] score is 23.74 and the AP@[IoU=0.75] score is 15.14. The large drop in YOLOv3's AP at higher IoU thresholds indicates that while it detects more objects than RetinaNet, it struggles to localize them precisely, which is an inherent limitation of the YOLO architecture. We therefore build our subsequent models on the RetinaNet architecture. Qualitative results are shown in Fig. 3.

As shown in Table 1, the base RetinaNet model achieves an AP score of 14.45 with an AR@[max=500] score of 21.38%. Our model with the new dense scales achieves an AP score of 15.39, an approximate 6% relative increase over the base RetinaNet model, and an AR@[max=500] score of 30.49%; the much higher recall follows from the denser scales producing more detections across the large range of object sizes in the dataset. Adding the SE blocks to this architecture yields only a small increment in AR@[max=500], from 30.49% to 31.93%, but a significant increase of roughly 12% in AP, to 17.19. This indicates that while the number of detected objects does not grow much, the detected objects are better localized than in the previous model, resulting in a higher AP score. This is also supported by the AP@[IoU=0.50] and AP@[IoU=0.75] values in Table 1, which increase from 33.13 to 37.69 and from 13.07 to 13.97, respectively. AP thus improves across all IoU thresholds, showing that objects are better localized thanks to the richer feature representations obtained by explicitly modelling channel interdependencies with SE blocks.

Table 2 shows the average recall for different numbers of maximum detections per image on the VisDrone detection validation split. Vanilla RetinaNet performs better than standard YOLOv3 on all AR scores. Our model with the new dense scales achieves better recall when the number of detections is high: at maxDets=500 it increases the average recall from 21.38% to 30.49%. Incorporating Squeeze-and-Excitation blocks further improves AR for all maxDets, especially when the number of detections exceeds 100; the final model raises AR from 30.49% to 31.93% at maxDets=500.

Table 3 shows the VisDrone 2019 detection results evaluated on the provided test set. Even though our method yields sub-optimal average precision, it performs markedly better in average recall for the top 1 and top 10 detections, which benefits our tracking pipeline. Although the trained detector performs well on the validation set, it is sub-optimal on the test set, suggesting room for better generalization and for placing more emphasis on smaller objects. The skewness of the class distribution is a larger problem that makes learning all the classes difficult. As can be seen from Table 4, our method performs better than all the other methods on smaller objects such as pedestrians and cars, and on par with them for larger objects such as trucks, vans and buses.

We also observe that although the trained detector is not the most optimal one, our tracker still achieves higher accuracy than almost all the baselines, which demonstrates its robustness. Even when the tracked objects have low confidence, the deep association network correctly matches the same object in subsequent frames, owing to the combined use of similarity based on deep feature embeddings and detection scores.

5 Conclusion

Aerial object detection is an important but preliminary step toward the main task of aerial multi-object tracking. A large number of average-confidence detections is preferable to a small number of high-confidence detections for building an optimal tracker. We presented an efficient detection and tracking framework that performs substantially well on the VisDrone DET and MOT datasets. We empirically choose RetinaNet as our base architecture and modify the anchor scale parameters to handle multi-scale, dense objects in the scene. We also incorporate SE blocks, enabling adaptive recalibration of channel-wise feature responses. We show that although our method does not achieve the overall best detection results, it surpasses other methods as the maximum number of detections increases. Our tracking pipeline builds on the same idea and combines feature embeddings from a trained deep association network with the generated detections and their confidence scores to create labeled tracks for every detected object. It should be emphasized that the proposed framework aims to improve multi-object tracking for aerial imagery. Not surprisingly, the uneven class distribution of the data makes it difficult to learn features for all objects, which is also visible in the results. This can be improved in the future with better augmentation methods, more relevant data and structure-similarity losses. Similarly, conditions such as strong camera motion, complex motion dynamics and occlusions create problems in tracking. Handling such situations requires a better understanding of the physics of the scene, such as flow maps, depth maps and semantic maps, which is beyond the scope of this paper.

References

  • [1] C. Aker and S. Kalkan (2017) Using deep networks for drone detection. In 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6. Cited by: §2.1.
  • [2] A. Andriyenko and K. Schindler (2011) Multi-target tracking by continuous energy minimization. In CVPR 2011, pp. 1265–1272. Cited by: Table 4.
  • [3] S. Bae and K. Yoon (2014) Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1218–1225. Cited by: Table 4.
  • [4] M. Beard, B. T. Vo, and B. Vo (2018) A solution for large-scale multi-object tracking. arXiv preprint arXiv:1804.06622. Cited by: §2.2.
  • [5] K. Boudjit and C. Larbes (2015) Detection and implementation autonomous target tracking with a quadrotor ar. drone. In 2015 12th International Conference on Informatics in Control, Automation and Robotics (ICINCO), Vol. 2, pp. 223–230. Cited by: §2.1.
  • [6] Z. Cai and N. Vasconcelos (2019) Cascade R-CNN: high quality object detection and instance segmentation. CoRR abs/1906.09756. External Links: Link, 1906.09756 Cited by: §1, Table 3.
  • [7] A. Coluccia, M. Ghenescu, T. Piatrik, G. De Cubber, A. Schumann, L. Sommer, J. Klatte, T. Schuchert, J. Beyerer, M. Farhadi, et al. (2017) Drone-vs-bird detection challenge at ieee avss2017. In 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6. Cited by: §2.1.
  • [8] J. Dai, Y. Li, K. He, and J. Sun (2016) R-fcn: object detection via region-based fully convolutional networks. In Advances in neural information processing systems, pp. 379–387. Cited by: §1.
  • [9] N. Dalal and B. Triggs (2005) Histograms of oriented gradients for human detection. Cited by: §1.
  • [10] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §4.3.
  • [11] C. Dicle, O. I. Camps, and M. Sznaier (2013) The way they move: tracking multiple targets with similar appearance. In Proceedings of the IEEE international conference on computer vision, pp. 2304–2311. Cited by: Table 4.
  • [12] K. Fang, Y. Xiang, X. Li, and S. Savarese (2018) Recurrent autoregressive networks for online multi-object tracking. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 466–475. Cited by: §2.2.
  • [13] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan (2009) Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence 32 (9), pp. 1627–1645. Cited by: §1.
  • [14] C. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg (2017) Dssd: deconvolutional single shot detector. arXiv preprint arXiv:1701.06659. Cited by: §1.
  • [15] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §1.
  • [16] R. Girshick (2015) Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 1440–1448. Cited by: §1, §3.5.
  • [17] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.3.
  • [18] R. Henschel, L. Leal-Taixe, D. Cremers, and B. Rosenhahn (2018) Fusion of head and full-body detectors for multi-object tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1428–1437. Cited by: §2.2.
  • [19] M. Hsieh, Y. Lin, and W. H. Hsu (2017) Drone-based object counting by spatially regularized regional proposal network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4145–4153. Cited by: §2.1.
  • [20] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141. Cited by: 2nd item.
  • [21] Y. Hu, J. Huang, and A. G. Schwing (2018) Unsupervised video object segmentation using motion saliency-guided spatio-temporal propagation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 786–802. Cited by: §2.3.
  • [22] C. Huang, P. Chen, X. Yang, and K. T. Cheng (2017) REDBEE: a visual-inertial drone system for real-time moving object detection. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1725–1731. Cited by: §2.1.
  • [23] K. Huang, D. Tao, Y. Yuan, X. Li, and T. Tan (2010) Biologically inspired features for scene classification in video surveillance. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 41 (1), pp. 307–313. Cited by: §1.
  • [24] M. Keuper, S. Tang, B. Andres, T. Brox, and B. Schiele (2018) Motion segmentation & multiple object tracking by correlation co-clustering. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.3.
  • [25] C. Kim, F. Li, and J. M. Rehg (2018) Multi-object tracking with neural gating using bilinear lstm. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 200–215. Cited by: §2.2.
  • [26] H. Law and J. Deng (2018) CornerNet: detecting objects as paired keypoints. CoRR abs/1808.01244. External Links: Link, 1808.01244 Cited by: §1, Table 3.
  • [27] C. Li, X. Sun, J. Cai, P. Xu, C. Li, L. Zhang, F. Yang, J. Zheng, J. Feng, Y. Zhai, et al. (2019) Intelligent mobile drone system based on real-time object detection. BIOCELL 1 (1). Cited by: §1, §2.1, Table 3.
  • [28] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun (2017) Light-head R-CNN: in defense of two-stage object detector. CoRR abs/1711.07264. External Links: Link, 1711.07264 Cited by: §1, Table 3.
  • [29] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun (2018) DetNet: A backbone network for object detection. CoRR abs/1804.06215. External Links: Link, 1804.06215 Cited by: §1, Table 3.
  • [30] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2117–2125. Cited by: Table 3.
  • [31] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §3.1, §3.2.
  • [32] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §4.2, §4.3, §4.3.
  • [33] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §1, §1.
  • [34] F. Meng, H. Li, Q. Wu, K. N. Ngan, and J. Cai (2017) Seeds-based part segmentation by seeds propagation and region convexity decomposition. IEEE Transactions on Multimedia 20 (2), pp. 310–322. Cited by: §1.
  • [35] A. Mitrokhin, C. Fermüller, C. Parameshwara, and Y. Aloimonos (2018) Event-based moving object detection and tracking. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–9. Cited by: §2.1.
  • [36] A. Mitrokhin, C. Ye, C. Fermuller, Y. Aloimonos, and T. Delbruck (2019) EV-imo: motion segmentation dataset and learning pipeline for event cameras. arXiv preprint arXiv:1903.07520. Cited by: §2.3.
  • [37] A. Ošep, W. Mehner, P. Voigtlaender, and B. Leibe (2018) Track, then decide: category-agnostic vision-based multi-object tracking. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 1–8. Cited by: §2.2.
  • [38] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes (2011) Globally-optimal greedy algorithms for tracking a variable number of objects. In CVPR 2011, pp. 1201–1208. Cited by: Table 4.
  • [39] A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, and M. J. Black (2019) Competitive collaboration: joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12240–12249. Cited by: §2.3.
  • [40] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §1.
  • [41] J. Redmon and A. Farhadi (2017) YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7263–7271. Cited by: §1.
  • [42] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §1.
  • [43] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pp. 91–99. Cited by: §1.
  • [44] X. Wang (2013) Intelligent multi-camera video surveillance: a review. Pattern recognition letters 34 (1), pp. 3–19. Cited by: §1.
  • [45] L. Wen, W. Li, J. Yan, Z. Lei, D. Yi, and S. Z. Li (2014) Multiple target tracking based on undirected hierarchical relation hypergraph. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1282–1289. Cited by: Table 4.
  • [46] N. Wojke, A. Bewley, and D. Paulus (2017) Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP), pp. 3645–3649. Cited by: 3rd item.
  • [47] N. Wojke and A. Bewley (2018) Deep cosine metric learning for person re-identification. In 2018 IEEE winter conference on applications of computer vision (WACV), pp. 748–756. Cited by: 3rd item.
  • [48] Q. Wu, H. Li, F. Meng, and K. N. Ngan (2018) Toward a blind quality metric for temporally distorted streaming video. IEEE Transactions on Broadcasting 64 (2), pp. 367–378. Cited by: §1.
  • [49] E. Yanmaz, S. Yahyanejad, B. Rinner, H. Hellwagner, and C. Bettstetter (2018) Drone networks: communications, coordination, and sensing. Ad Hoc Networks 68, pp. 1–15. Cited by: §2.1.
  • [50] Y. Yoon, A. Boragule, Y. Song, K. Yoon, and M. Jeon (2018) Online multi-object tracking with historical appearance matching and scene adaptive detection filtering. In 2018 15th IEEE International conference on advanced video and signal based surveillance (AVSS), pp. 1–6. Cited by: §2.2.
  • [51] Y. Yuan, Y. Feng, and X. Lu (2016) Statistical hypothesis detector for abnormal event detection in crowded scenes. IEEE transactions on cybernetics 47 (11), pp. 3597–3608. Cited by: §1.
  • [52] Y. Yuan, Z. Jiang, and Q. Wang (2017) HDPA: hierarchical deep probability analysis for scene parsing. In 2017 IEEE International Conference on Multimedia and Expo (ICME), pp. 313–318. Cited by: §1.
  • [53] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li (2018) Single-shot refinement neural network for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4203–4212. Cited by: §1, Table 3.
  • [54] H. Zhou, W. Ouyang, J. Cheng, X. Wang, and H. Li (2018) Deep continuous conditional random fields with asymmetric inter-object constraints for online multi-object tracking. IEEE Transactions on Circuits and Systems for Video Technology 29 (4), pp. 1011–1022. Cited by: §2.2.
  • [55] J. Zhu, H. Yang, N. Liu, M. Kim, W. Zhang, and M. Yang (2018) Online multi-object tracking with dual matching attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 366–382. Cited by: §2.2.
  • [56] P. Zhu, L. Wen, D. Du, X. Bian, H. Ling, Q. Hu, Q. Nie, H. Cheng, C. Liu, X. Liu, et al. (2019) Visdrone-det2019: the vision meets drone object detection in image challenge results. In Proceedings of the International Conference on Computer Vision (ICCV), pp. 0–0. Cited by: §1, §2.1, §4.