Vision Meets Drones: Past, Present and Future

01/16/2020 · Pengfei Zhu, et al. · Tianjin University, JD.com, Inc., University at Albany, General Electric

Drones, or general UAVs, equipped with cameras have been rapidly deployed in a wide range of applications, including agriculture, aerial photography, fast delivery, and surveillance. Consequently, automatic understanding of visual data collected from drones is in high demand, bringing computer vision and drones increasingly close together. To promote and track the development of object detection and tracking algorithms, we have organized two challenge workshops in conjunction with the European Conference on Computer Vision (ECCV) 2018 and the IEEE International Conference on Computer Vision (ICCV) 2019, attracting more than 100 teams from around the world. We provide a large-scale drone-captured dataset, VisDrone, which supports four tracks, i.e., (1) image object detection, (2) video object detection, (3) single object tracking, and (4) multi-object tracking. This paper first presents a thorough review of object detection and tracking datasets and benchmarks, and discusses the challenges of collecting large-scale drone-based object detection and tracking datasets with fully manual annotations. After that, we describe our VisDrone dataset, which is captured over various urban/suburban areas of 14 different cities across China from north to south. Being the largest such dataset ever published, VisDrone enables extensive evaluation and investigation of visual analysis algorithms on the drone platform. We provide a detailed analysis of the current state of the field of large-scale object detection and tracking on drones, conclude the challenges, and propose future directions and improvements. We expect the benchmark to greatly boost research and development in video analysis on drone platforms. All the datasets and experimental results can be downloaded from the website: https://github.com/VisDrone/VisDrone-Dataset.


1 Introduction

Computer vision has been attracting increasing attention in recent years due to its wide range of applications, such as transportation surveillance, smart cities, and human-computer interaction. As two fundamental problems in computer vision, object detection and tracking are under extensive investigation. Among the many factors and efforts that have led to the fast evolution of computer vision techniques, a notable contribution should be attributed to the creation of numerous benchmarks and challenges, such as Caltech [31], KITTI [49], ImageNet [105], and MS COCO [80] for object detection, and OTB [130], VOT [14], MOTChallenge [67], UA-DETRAC [124], and LaSOT [42] for object tracking.

Drones (or UAVs) equipped with cameras have been rapidly deployed in a wide range of areas, including agriculture, aerial photography, fast delivery, and surveillance. Consequently, automatic understanding of visual data collected from these drones is in high demand, which brings computer vision and drones increasingly close together. Despite the great progress in general computer vision algorithms, such as detection and tracking, these algorithms are usually not optimal for dealing with drone-captured sequences or images, because of various challenges such as large viewpoint changes and scale variations. Therefore, it is essential to develop and evaluate new vision algorithms for drone-captured visual data. However, as pointed out in [89, 55], studies toward this goal are seriously limited by the lack of publicly available large-scale benchmarks or datasets. Some recent efforts [89, 103, 55] have been devoted to constructing datasets captured by drones and focusing on object detection or tracking. These datasets are still limited in size and in the scenarios covered, due to the difficulties in data collection and annotation. Thorough evaluation of existing or newly developed algorithms remains an open problem. A more general and comprehensive benchmark is desired to further boost video analysis research on drone platforms.

Thus motivated, we have organized two challenge workshops in conjunction with the European Conference on Computer Vision (ECCV) 2018 and the IEEE International Conference on Computer Vision (ICCV) 2019, attracting more than 100 research teams from around the world. The challenge focuses on object detection and tracking with four tracks.

Image object detection track (DET). Given a pre-defined set of object classes, e.g., cars and pedestrians, the algorithm is required to detect objects of these classes from individual images taken by drones.

Video object detection track (VID). Similar to DET, the algorithm is required to detect objects of predefined object classes from videos taken by drones.

Single object tracking track (SOT). This track aims to estimate the state of a target, indicated in the first frame, across frames in an online manner.

Multi-object tracking track (MOT). The goal of this track is to track multiple objects, i.e., to localize object instances in each video frame and recover their trajectories across video sequences. In the VisDrone-2018 challenge, this track was divided into two sub-tracks: the first allowed algorithms to take the provided object detections in each video frame as input, while the second did not. In the VisDrone-2019 challenge, we merged these two sub-tracks and no longer distinguish submitted algorithms according to whether or not they use the provided object detections in each video frame as input.
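As a minimal illustration of what this track asks for, the sketch below links per-frame detections into trajectories by greedy IoU matching. It is only a toy baseline under assumed conventions (boxes in (x, y, w, h) format, one detection list per frame); it is not the evaluation code or any participant's method.

```python
# Minimal sketch (not the official toolkit): greedy IoU association of
# per-frame detections into trajectories, as in simple tracking-by-detection.

def iou(a, b):
    """IoU of two boxes in (x, y, w, h) format."""
    iw = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def link_detections(frames, iou_thr=0.5):
    """frames: list of per-frame detection lists.
    Returns trajectories as lists of (frame_index, box)."""
    tracks = []   # each track is a list of (frame_index, box)
    active = []   # indices of tracks matched in the previous frame
    for t, dets in enumerate(frames):
        unmatched = list(range(len(dets)))
        new_active = []
        for ti in active:
            last_box = tracks[ti][-1][1]
            # pick the unmatched detection with the highest IoU above threshold
            best, best_iou = None, iou_thr
            for di in unmatched:
                o = iou(last_box, dets[di])
                if o > best_iou:
                    best, best_iou = di, o
            if best is not None:
                tracks[ti].append((t, dets[best]))
                unmatched.remove(best)
                new_active.append(ti)
        for di in unmatched:               # unmatched detections start new tracks
            tracks.append([(t, dets[di])])
            new_active.append(len(tracks) - 1)
        active = new_active
    return tracks
```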

Image object detection scenario #images categories avg. #labels/categories resolution occlusion labels year
Caltech Pedestrian [31] driving 2012
KITTI Detection [49] driving 2012
PASCAL VOC2012 [39] life 2012
ImageNet Object Detection [105] life 2013
MS COCO [80] life 2014
VEDAI [98] satellite 2015
COWC [92] aerial 2016
CARPK [55] drone 2017
DOTA [132] aerial 2018
VisDrone drone 2018
Video object detection scenario #frames categories avg. #labels/categories resolution occlusion labels year
ImageNet Video Detection [105] life 2015
UA-DETRAC Detection [124] surveillance 2015
MOT17Det [88] life 2017
Okutama-Action [5] drone 2017
UAVDT-DET [34] drone 2018
DroneSURF [61] drone - 2019
VisDrone drone 2018
Single object tracking scenarios #sequences #frames year
ALOV3000 [107] life 2014
OTB100 [130] life 2015
TC128 [76] life 2015
VOT2016 [62] life 2016
UAV123 [89] drone 2016
NfS [47] life 2017
DTB70 [47] drone 2017
POT 210 [77] planar objects 2018
UAVDT-SOT [34] drone 2018
VisDrone drone 2018
Multi-object tracking scenario #frames categories avg. #labels/categories resolution occlusion labels year
KITTI Tracking [49] driving 2013
MOTChallenge 2015 [67] surveillance 2015
UA-DETRAC Tracking [124] surveillance 2015
DukeMTMC [102] surveillance 2016
Campus [103] drone 2016
MOT17 [88] surveillance 2017
UAVDT-MOT [34] drone 2018
VisDrone drone 2018
TABLE I: Comparison of the state-of-the-art benchmarks and datasets. Note that the resolution indicates the maximum resolution of the videos/images included in the benchmarks and datasets.

Notably, in the workshop challenges, we provide a large-scale dataset, which consists of video clips with frames and static images. The data is recorded by various drone-mounted cameras and is diverse in a wide range of aspects, including location (taken from 14 different cities in China), environment (urban and rural regions), objects (e.g., pedestrians, vehicles, and bicycles), and density (sparse and crowded scenes). We select categories of objects of frequent interest in drone applications, such as pedestrians and cars. Altogether, we carefully annotate more than a million bounding boxes of object instances from these categories. Moreover, some important attributes, including scene visibility, object category and occlusion, are provided for better data usage. A detailed comparison of the provided drone datasets with other related benchmark datasets for object detection and tracking is presented in Table I.

In this paper, we focus on the VisDrone challenges in 2018 and 2019, as well as the methods, results, and evaluation protocols of the challenge (http://www.aiskyeye.com). We hope the challenge will greatly boost research and development in the related fields.

2 Related Work

We briefly discuss some prior work in constructing benchmark object detection and tracking datasets, as well as the related challenges in recent conferences.

Fig. 1: Some annotated example images of the proposed datasets. The dashed bounding box indicates the object is occluded. Different bounding box colors indicate different classes of objects. For better visualization, we only display some attributes.

2.1 Generic Object Detection and Tracking Datasets

Image object detection datasets.

Several benchmarks have been collected for evaluating object detection algorithms. Enzweiler and Gavrila [37] present the Daimler dataset, captured by a vehicle driving through an urban environment. The dataset includes manually annotated pedestrians in video images in the training set, and video images with annotated pedestrians in the testing set. The Caltech dataset [31] consists of approximately hours of Hz videos taken from a vehicle driving through regular traffic in an urban environment. It contains frames with a total of annotated bounding boxes of unique pedestrians. The KITTI Detection dataset [49] is designed to evaluate car, pedestrian, and cyclist detection algorithms in autonomous driving scenarios, with training and testing images. Mundhenk et al. [92] create a large dataset for classification, detection and counting of cars, which contains unique cars from six different image sets, each covering a different geographical location and produced by different imagers. The UA-DETRAC benchmark [124, 84, 83] provides objects in frames for vehicle detection.

The PASCAL VOC dataset [41, 40] is one of the pioneering works in the generic object detection field, designed to provide a standardized test bed for object detection, image classification, object segmentation, person layout, and action classification. ImageNet [30, 105] follows in the footsteps of the PASCAL VOC dataset, scaling it up by more than an order of magnitude in the number of object classes and images, i.e., PASCAL VOC 2012 has object classes and images vs. ILSVRC2012 with object classes and annotated images. Recently, Lin et al. [80] release the MS COCO dataset, containing more than images with million manually segmented object instances. It has object categories with instances on average per category. Notably, it contains object segmentation annotations, which are not available in ImageNet.

Video object detection datasets.

The ILSVRC 2015 challenge [105] opens the “object detection in video” track, which contains a total of snippets for training, snippets for validation, and snippets for testing. The YouTube-Object dataset [97] is another large-scale dataset for video object detection, which consists of videos with over frames for classes of moving objects. However, only frames are annotated with a bounding box around an object instance. Building on the YouTube-Object dataset, Kalogeiton et al. [60] further provide instance segmentation annotations (http://calvin.inf.ed.ac.uk/datasets/youtube-objects-dataset/).

Single object tracking datasets.

In recent years, numerous datasets have been developed for single object tracking evaluation. Wu et al. [129] develop a standard platform to evaluate single object tracking algorithms, and scale up the data size from sequences to sequences in [130]. Similarly, Liang et al. [76] collect video sequences for evaluating color-enhanced trackers. To track the progress of the visual tracking field, Kristan et al. [64, 62, 63] organize the VOT competitions, presenting new datasets and evaluation strategies for tracking evaluation. Smeulders et al. [107] present the ALOV300 dataset, which contains video sequences with visual attributes, such as long duration, zooming camera, moving camera and transparency. Li et al. [68] construct a large-scale dataset with video sequences of pedestrians and rigid objects, covering kinds of objects captured from moving cameras. Du et al. [32] design a dataset including annotated video sequences, focusing on deformable object tracking in unconstrained environments. To evaluate tracking algorithms on higher frame rate video sequences, Galoogahi et al. [47] propose a dataset including videos ( frames) recorded by higher frame rate cameras ( frames per second) in real-world scenarios. Besides video sequences captured by RGB cameras, Felsberg et al. [45, 38] organize a series of competitions from 2015 to 2017, focusing on visual tracking on thermal video sequences recorded by eight different types of sensors.

Multi-object tracking datasets.

The most widely used multi-object tracking evaluation datasets include PETS09 [46], KITTI-T [49], MOTChallenge [67, 87], and UA-DETRAC [124, 84, 83]. The PETS09 dataset [46] mainly focuses on multi-pedestrian detection, tracking and counting in surveillance scenarios. The KITTI Tracking dataset [49] is designed for object tracking in autonomous driving, recorded from a moving vehicle with the viewpoint of the driver. MOT15 [67] and MOT16 [87] aim to provide a unified dataset, platform, and evaluation protocol for multi-object tracking algorithms, including and sequences, respectively. Recently, the UA-DETRAC benchmark [124, 84, 83] was constructed, which contains a total of sequences for tracking multiple vehicles, filmed from a surveillance viewpoint.

Moreover, in some scenarios, a network of cameras is set up to capture multi-view information for multi-view multi-object tracking. The dataset in [46] is recorded using multiple cameras with fully overlapping views in constrained environments. Other datasets are captured by non-overlapping cameras. For example, Chen et al. [18] collect four datasets, each of which includes to cameras with non-overlapping views in real scenes and simulation environments. Zhang et al. [139] develop a dataset composed of to cameras covering both indoor and outdoor scenes at a university. Ristani et al. [102] organize a challenge and present a large-scale, fully annotated and calibrated dataset, including more than million 1080p video frames taken by cameras and containing more than identities.

2.2 Drone-based Datasets

To date, there only exist a handful of drone-captured datasets in the computer vision field. Hsieh et al. [55] present a dataset for car counting, which consists of images captured in parking lot scenarios from a drone platform, including annotated cars. Robicquet et al. [103] collect several video sequences with a drone platform on campuses, including various types of objects (i.e., pedestrians, bikes, skateboarders, cars, buses, and golf carts), which enable the design of new object tracking and trajectory forecasting algorithms. Barekatain et al. [5] present the Okutama-Action dataset for concurrent human action detection from an aerial view. The dataset includes minute-long fully annotated sequences with action classes. In [89], the high-resolution UAV123 dataset is presented for single object tracking, which contains aerial video sequences with fully annotated frames, including the bounding boxes of people and their corresponding action labels. Li et al. [72] capture video sequences of high diversity with drone cameras and manually annotate the bounding boxes of objects for single object tracking evaluation. Moreover, Du et al. [34] construct a new UAV benchmark focusing on complex scenarios for three tasks, including object detection, single object tracking, and multiple object tracking. In [104], Rozantsev et al. present two separate datasets for detecting flying objects, i.e., the UAV dataset and the aircraft dataset. The former comprises video sequences with resolution and annotated bounding boxes of objects, acquired by a camera mounted on a drone flying indoors and outdoors. The latter consists of publicly available videos of radio-controlled planes with annotated bounding boxes. Recently, Xia et al. [132] propose a large-scale dataset of aerial images collected from different sensors and platforms to advance object detection research in earth vision. In contrast to the aforementioned datasets, acquired in constrained scenarios for object tracking, detection and counting, our VisDrone dataset is captured in various unconstrained scenes and focuses on four core problems in computer vision, i.e., image object detection, video object detection, single object tracking, and multi-object tracking.

2.3 Existing Challenges

The international workshop on computer vision for UAVs (https://sites.google.com/site/uavision2018/) focuses on hardware, software and algorithmic (co-)optimizations towards state-of-the-art image processing on UAVs. The VOT challenge workshop (http://www.votchallenge.net/vot2019/) provides the tracking community with a precisely defined and repeatable way to compare short-term trackers, as well as a common platform for discussing evaluation and advancements made in the field of single-object tracking. The BMTT and BMTT-PETS workshops (https://motchallenge.net/) aim to pave the way for a unified framework towards more meaningful quantification of multi-object tracking. The PASCAL VOC challenge was held for eight years, from 2005 to 2012, and aims to recognize objects from a number of visual object classes in realistic scenes. The ILSVRC challenge was also held for eight years, from 2010 to 2017, and is designed to evaluate algorithms for object detection and image classification at large scale. Compared to the aforementioned challenges, our workshop challenge focuses on object detection and tracking on drones with the following four tracks: (1) image object detection, (2) video object detection, (3) single object tracking, and (4) multi-object tracking. Our goal is to develop and distribute a new challenging benchmark for real-world problems on drones with new difficulties, e.g., large scale and viewpoint variations and heavy occlusions.

3 VisDrone Overview

A critical basis for effective algorithm evaluation is a comprehensive dataset. For this purpose, we systematically collected VisDrone, the largest such benchmark dataset, to advance object detection and tracking research on drones. It consists of video clips with frames and additional static images. The videos/images are acquired by various drone platforms, i.e., DJI Mavic and Phantom series (3, 3A, 3SE, 3P, 4, 4A, 4P), and cover different scenarios across 14 different cities in China, i.e., Tianjin, Hong Kong, Daqing, Ganzhou, Guangzhou, Jincang, Liuzhou, Nanjing, Shaoxing, Shenyang, Nanyang, Zhangjiakou, Suzhou and Xuzhou. The dataset covers various weather and lighting conditions, representing diverse scenarios in our daily life. The maximal resolutions of the video clips and static images are and , respectively.

The VisDrone benchmark focuses on the following four tasks (see Fig. 1), i.e., (1) image object detection, (2) video object detection, (3) single-object tracking, and (4) multi-object tracking. We have constructed a website, www.aiskyeye.com, for accessing the VisDrone dataset and performing evaluation on these four tasks. Notably, for each task, the images/videos in the training, validation, and testing subsets are captured at different locations but share similar scenarios and attributes. The training subset is used to train the algorithms, the validation subset is used to validate the performance of algorithms, the test-challenge subset is used for the workshop competition, and the test-dev subset is used as the default test set for public evaluation. The manually annotated ground truths for the training and validation subsets are made available to participants, but the ground truths of the testing subsets are withheld to avoid (over)fitting of algorithms.

To participate in our challenge, research teams are required to create their own accounts using an institutional email address. After registration, participants can choose the tasks of interest and submit results specifying the locations or trajectories of objects in the images or videos using the corresponding accounts. We encourage the participants to use the provided training data, but also allow them to use additional training data; the use of additional training data must be indicated during submission. In the following sections, we describe the data statistics and annotation of the datasets for each track in detail.

4 DET Track

The DET track tackles the problem of localizing multiple object categories in an image. For each image, algorithms are required to locate all object instances from a predefined set of object categories, e.g., car and pedestrian, in the given input image (if any). That is, we require the detection algorithm to predict a bounding box for each instance of each object class in the image, with a real-valued confidence. We mainly focus on ten object categories in evaluation, including pedestrian, person (if a human maintains a standing pose or is walking, we classify it as a pedestrian; otherwise, it is classified as a person), car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle. Some rarely occurring vehicles (e.g., machineshop trucks, forklift trucks, and tankers) are ignored in evaluation. The performance of algorithms is evaluated by the average precision (AP) across different object categories and intersection over union (IoU) thresholds.

Fig. 2: The number of objects per image vs. percentage of images in the training, validation, test-challenge and test-dev subsets in the DET track.
Fig. 3: The number of objects with different occlusion degrees of different object categories in the training, validation, test-challenge and test-dev sets in the DET track.

4.1 Data Collection and Annotation

The DET dataset consists of images in unconstrained challenging scenes, including images in the training subset, in the validation subset, in the test-challenge subset, and in the test-dev subset. We plot the number of objects per image vs. the percentage of images to show the distribution of the number of objects per image in Fig. 2, and the number of objects in different object categories with different occlusion degrees in Fig. 3. Notably, the large variation in the number of objects per image and the class imbalance issue significantly challenge the performance of detection algorithms. For example, as shown in Fig. 2, the minimal and maximal numbers of objects per image in the test-challenge subset are and , and the number of awning-tricycle instances is far smaller than the number of car instances.

In this track, we focus on people and vehicles in our daily life, and define ten object categories of interest in evaluation, including pedestrian, person (if a human maintains a standing pose or is walking, we classify it as a pedestrian; otherwise, it is classified as a person), car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle. Some rarely occurring vehicles (e.g., machineshop trucks, forklift trucks, and tankers) are ignored in evaluation. We manually annotate the bounding boxes of the different categories of objects in each image, and then conduct cross-checking to ensure annotation quality. In addition, we also provide two kinds of useful annotations, the occlusion and truncation ratios. Specifically, we use the fraction of an object being occluded to define the occlusion ratio, and define three degrees of occlusion: no occlusion (occlusion ratio ), partial occlusion (occlusion ratio ), and heavy occlusion (occlusion ratio ). The truncation ratio indicates the degree to which object parts appear outside the frame. If an object is not fully captured within a frame, we annotate the bounding box across the frame boundary and estimate the truncation ratio based on the region outside the image frame. It is worth mentioning that a target is skipped during evaluation if its truncation ratio is larger than .
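The truncation ratio described above can be estimated directly from an annotated box and the image size. The short sketch below illustrates one way to do so; the function name and the (x, y, w, h) box convention are our own illustrative choices, not the official annotation tools.

```python
def truncation_ratio(box, img_w, img_h):
    """Fraction of the annotated box (x, y, w, h) lying outside the image
    frame; 0.0 means the object is fully contained in the frame."""
    x, y, w, h = box
    if w <= 0 or h <= 0:
        return 0.0
    # visible part of the box, clipped to the image boundaries
    vis_w = max(0.0, min(x + w, img_w) - max(x, 0.0))
    vis_h = max(0.0, min(y + h, img_h) - max(y, 0.0))
    return 1.0 - (vis_w * vis_h) / (w * h)

# Example: a 100x50 box whose right half extends beyond a 1920x1080 image
print(truncation_ratio((1870, 200, 100, 50), 1920, 1080))  # -> 0.5
```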

4.2 Evaluation Protocol

For the DET track, we require each evaluated algorithm to output a list of detected bounding boxes with confidence scores for each test image. Following the evaluation protocol in MS COCO [80], we use the AP, AP50, AP75, AR1, AR10, AR100 and AR500 scores to evaluate the performance of detection algorithms. These criteria penalize missing detections as well as false alarms. Specifically, AP is computed by averaging over all intersection over union (IoU) thresholds (i.e., in the range [0.50, 0.95] with a uniform step size of 0.05) over all categories, and is used as the primary metric for ranking algorithms. AP50 and AP75 are computed at the single IoU thresholds 0.5 and 0.75 over all categories, respectively. The AR1, AR10, AR100 and AR500 scores are the maximum recalls given 1, 10, 100 and 500 detections per image, averaged over all categories and IoU thresholds. Please refer to [80] for more details.
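To make the evaluation protocol concrete, the following sketch computes average precision for a single category at one IoU threshold by greedily matching score-ranked detections to ground truth and accumulating the precision-recall curve. It is a simplified illustration of the COCO-style procedure (the official protocol additionally interpolates the curve and averages over IoU thresholds and categories); the helper names and data layout are hypothetical.

```python
def box_iou(a, b):
    """IoU of two boxes in (x, y, w, h) format."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def average_precision(dets, gts, iou_thr=0.5):
    """dets: list of (image_id, box, score) for one category;
    gts: dict image_id -> list of ground-truth boxes for that category.
    Returns AP at a single IoU threshold (no interpolation, for clarity)."""
    dets = sorted(dets, key=lambda d: -d[2])
    matched = {img: [False] * len(boxes) for img, boxes in gts.items()}
    n_gt = sum(len(boxes) for boxes in gts.values())
    ap = prev_recall = 0.0
    tp = fp = 0
    for img, box, _score in dets:
        best_iou, best_j = 0.0, -1
        for j, g in enumerate(gts.get(img, [])):
            o = box_iou(box, g)
            if o > best_iou:
                best_iou, best_j = o, j
        if best_iou >= iou_thr and not matched[img][best_j]:
            matched[img][best_j] = True   # true positive: unmatched ground truth
            tp += 1
        else:
            fp += 1                       # false alarm or duplicate detection
        recall = tp / max(n_gt, 1)
        precision = tp / (tp + fp)
        ap += precision * (recall - prev_recall)   # area under the PR curve
        prev_recall = recall
    return ap
```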

Codename AP Institutions Contributions and References
VisDrone-2018 Challenge:
AHOD 12.77 Tsinghua University Jianqiang Wang, Yali Li, Shengjin Wang [119]
CERTH-ODI 5.04 ITI Technical College Emmanouil Michail, Konstantinos Avgerinakis, Panagiotis Giannakeris, Stefanos Vrochidis, Ioannis Kompatsiaris [101]
CFE-SSDv2 26.48 Peking University Qijie Zhao, Feng Ni, Yongtao Wang [81]
DDFPN 21.05 Tianjin University Liyu Lu [79]
DE-FPN 27.10 South China University of Technology Jingkai Zhou, Yi Luo, Hu Lin, Qiong Liu[78]
DFS 16.73 SUN YAT-SEN University Ke Bo [78]
DPNet 30.92 University of Electronic Science and Technology of China HongLiang Li, Qishang Cheng, Wei Li, Xiaoyu Chen, Heqian Qiu, Zichen Song [101]
Faster R-CNN+ 9.67 Shandong University Tiaojio Lee, Yue Fan, Han Deng, Lin Ma, Wei Zhang [101]
Faster R-CNN2 21.34 Xidian University Fan Zhang [101]
Faster R-CNN3 3.65 Northwestern Polytechnical University Yiling Liu, Ying Li [101]
FPN+ 13.32 Texas A&M University, IBM Karthik Suresh, Hongyu Xu, Nitin Bansal, Chase Brown, Yunchao Wei, Zhangyang Wang, Honghui Shi [78]
FPN2 16.15 Chongqing University Zhenwei He, Lei Zhang [78]
FPN3 13.94 Nanjing University of Science and Technology Chengzheng Li, Zhen Cui [78]
HAL-Retina-Net 31.88 Tsinghua University Yali Li, Zhaoyue Xia, Shengjin Wang [79]
IITH DODO 14.04 IIT Hyderabad Nehal Mamgain, Naveen Kumar Vedurupaka, K. J. Joseph, Vineeth N. Balasubramanian [101]
JNU_Faster RCNN 8.72 Jiangnan University Haipeng Zhang [101]
Keras-RetinaNet 7.72 Xidian University Qiuchen Sun, Sheng Jiang [79]
L-H RCNN+ 21.34 Xidian University Li Yang, Qian Wang, Lin Cheng, Shubo Wei[74]
MFaster-RCNN 18.08 Beijing University of Posts and Telecommunications Wenrui He, Feng Zhu [101]
MMF 7.54 Xiamen University Yuqin Zhang, Weikun Wu, Zhiyao Guo, Minyu Huang [101, 100]
MMN 10.40 Ocean University of China Xin Sun [101]
MSCNN 2.89 National University of Defense Technology Dongdong Li, Yangliu Kuai, Hao Liu, Zhipeng Deng, Juanping Zhao [11]
MSYOLO 16.89 Xidian University Haoran Wang, Zexin Wang, Ke Wang, Xiufang Li [100]
RDMS 22.68 Fraunhofer IOSB Oliver Acatay, Lars Sommer, Arne Schumann[138]
RefineDet+ 21.07 University of Chinese Academy of Sciences Kaiwen Duan, Honggang Qi, Qingming Huang [138]
RetinaNet2 5.21 Xidian University Li Yang, Qian Wang, Lin Cheng, Shubo Wei[79]
R-SSRN 9.49 Xidian University Wenzhe Yang, Jianxiu Yang[138]
SOD 8.27 Shanghai Jiao Tong University, University of Ottawa Lu Ding, Yong Wang, Chen Qian, Robert Laganière, Xinbin Luo [22]
SODLSY 13.61 National Laboratory of Pattern Recognition Sujuan Wang, Yifan Zhang, Jian Cheng [100]
YOLO-R-CNN 12.06 University of Kansas Wenchi Ma, Yuanwei Wu, Usman Sajid, Guanghui Wang [101, 100]
YOLOv3+ 15.26 Xidian University Siwei Wang, Xintao Lian [100]
YOLOv3++ 10.25 University of Kansas Yuanwei Wu, Wenchi Ma, Usman Sajid, Guanghui Wang [100]
YOLOv3_DP 20.03 Xidian University Qiuchen Sun, Sheng Jiang[100]
VisDrone-2019 Challenge:
ACM-OD 29.13 SK T-Brain Sungeun Hong, Sungil Kang, Donghyeon Cho [78]
Airia-GA-Cascade 25.99 Institute of Automation, Chinese Academy of Sciences Yu Zhu, Qiang Chen [11]
BetterFPN 28.55 ShanghaiTech University Junhao Hu, Lei Jin [78]
Cascade R-CNN+ 17.67 Fraunhofer IOSB Jonas Meier, Lars Sommer, Lucas Steinmann, Arne Schumann [11]
Cascade R-CNN++ 18.33 University of Hong Kong Haocheng Han, Jiaqi Fan [11]
CenterNet 26.03 National University of Singapore, Pensees.ai, Xidian University Yanchao Li, Zhikang Wang, Yu Heng Toh, Furui Bai,Jane Shen [144]
CenterNet-Hourglass 22.36 Harbin Institute of Technology, Institute of Automation, Chinese Academy of Sciences Da Yu, Lianghua Huangm, Xin Zhao, Kaiqi Huang [144]
CNAnet 26.35 Chongqing University Keyang Wang, Lei Zhang [11]
CN-DhVaSa 27.83 Siemens Technology and Services Private Limited Dheeraj Reddy Pailla, Varghese Alex Kollerathu, Sai Saketh Chennamsetty [144]
ConstraintNet 16.09 Xidian University Dening Zeng, Di Li [144]
DA-RetinaNet 17.05 Nanjing University of Posts and Telecommunications Jingjing Xu, Dechun Cong [79]
DBCL 16.78 Snowcloud.ai Wei Dai, Weiyang Wang [131]
DCRCNN 17.79 BTS Digital Almaz Zinollayev, Anuar Askergaliyev [11]
DPNet-ensemble 29.62 University of Electronic Science and Technology of China Qishang Cheng, Heqian Qiu, Zichen Song, Hongliang Li [11]
DPN 25.09 Institute of Automation, Chinese Academy of Sciences Nuo Xu, Xin Zhang, Binjie Mao, Chunlei Huo, Chunhong Pan [11]
EHR-RetinaNet 26.46 Hanyang University Jaekyum Kim, Byeongwon Lee, Chunfei Ma, Jun Won Choi, Seungji Yang[79]
EnDet 17.81 Beijing Institute of Technology Pengyi Zhang, Yunxin Zhong [100, 101]
ERCNNs 20.45 Kakao Brain Jihoon Lee, Ildoo Kim [11, 101]
FS-Retinanet 26.31 Beijing Institute of Technology, Samsung Stanford Ziming Liu, Jing Ge, Tong Wu, Lin Sun, Guangyu Gao  [79]
GravityNet 25.66 University of Glasgow Toh Yu Heng, Harry Nguyen [144]
HRDet+ 28.39 South China University of Technology Jingkai Zhou, Weida Qin, Qiong Liu, Haitao Xiong [110]
HTC-drone 22.61 Queen Mary University of London Xindi Zhang [16]
Libra-HBR 25.57 Zhejiang University Chunfang Deng, Shuting He, Qinghong Zeng, Zhizhao Duan, Bolun Zhang[106, 94, 11]
MOD-RETINANET 16.96 Harman Aashish Kumar, George Jose, Srinivas S S Kruthiventi [79]
MSCRDet 25.13 Dalian University of Technology Xin Chen, Chang Liu, Shuhao Chen, Xinyu Zhang, Dong Wang, Huchuan Lu[11]
ODAC 17.42 Sun Yat-Sen University Junyi Zhang, Junying Huang, Xuankun Chen, Dongyu Zhang [78]
retinaplus 20.57 Northwestern Polytechnical University Zikai Zhang, Peng Wang [79]
RRNet 29.13 Ocean University of China Changrui Chen, Yu Zhang, Qingxuan Lv, Xiaorui Wang, Shuo Wei, Xin Sun [144]
SAMFR-Cascade RCNN 20.18 Xidian University Haoran Wang, Zexin Wang, Meixia Jia, Aijin Li, Tuo Feng [11]
S+D 28.59 Harbin Institute of Technology Yifu Chen [136, 11, 110]
SGE-Cascade R-CNN 27.33 Xi’an Jiaotong University Xudong Wei, Hao Qi, Wanqi Li, Guizhong Liu [11]
TridentNet 22.51 Huazhong University of Science and Technology Xuzhang Zhang[73]
TSEN 23.83 Nanjing University of Science and Technology Zhifan Zhu, Zechao Li [101, 120, 94]
TABLE II: Teams participating in the VisDrone-DET 2018 and 2019 challenges, ordered alphabetically.

4.3 Algorithms

VisDrone 2018 challenge.

A number of object detection algorithms from different institutes were submitted to this track. We present the results and team information in Table II. As shown in Table II, several methods are constructed based on the Faster R-CNN algorithm [101], such as CERTH-ODI, DPNet, Faster R-CNN+, Faster R-CNN2, Faster R-CNN3, IITH DODO, JNU_Faster RCNN, MFaster-RCNN, and MMN. Some algorithms construct feature pyramids to build high-level semantic feature maps at all scales, including DE-FPN, DFS, FPN+, FPN2, FPN3, and DDFPN. Five detectors, i.e., MSYOLO, SODLSY, YOLOv3+, YOLOv3++ and YOLOv3_DP, are improved from the one-stage YOLOv3 method [100]. MMF and YOLO-R-CNN fuse multiple models based on the Faster R-CNN and YOLOv3 methods. Keras-RetinaNet, RetinaNet2 and HAL-Retina-Net are based on RetinaNet [79]. RDMS, RefineDet+ and R-SSRN are based on the RefineDet method [138]. The top accuracy is achieved by the HAL-Retina-Net method, i.e., 31.88 AP, which uses the SE module [56] and downsampling-upsampling operations [117] to learn both channel and spatial attention.

VisDrone 2019 challenge.

We received detection methods from different institutes in this track, as shown in Table II. Nine methods are improved from Cascade R-CNN [11], i.e., Airia-GA-Cascade, Cascade R-CNN+, Cascade R-CNN++, DCRCNN, DPN, DPNet-ensemble, MSCRDet, SAMFR-Cascade RCNN and SGE-cascade R-CNN. Six detection methods, i.e., CenterNet, CenterNet-Hourglass, CN-DhVaSa, ConstraintNet, GravityNet and RRNet, are based on the anchor-free method CenterNet [144]. Five detection methods, i.e., DA-RetinaNet, EHR-RetinaNet, FS-Retinanet, MOD-RETINANET and retinaplus, are improved from the anchor-based method RetinaNet [79]. ACM-OD, BetterFPN and ODAC construct multi-scale feature pyramids using FPN [78]. CNAnet designs a convolution neighbor aggregation mechanism for detection. HRDet+ is improved from HRDet [110], which connects convolutions from high to low resolutions in parallel to generate discriminative high-resolution representations. TridentNet [73] aims to generate scale-specific features using a parallel multi-branch architecture.

Some other methods use an ensemble mechanism to improve performance. DPNet-ensemble achieves the top accuracy with 29.62 AP, ensembling two object detectors based on Cascade R-CNN [11] that use ResNet-50 and ResNet-101 as feature extractors, together with a global context module [13] and deformable convolutions [23]. EnDet combines the results of YOLOv3 [100] and Faster R-CNN [101]. TSEN ensembles three two-stage methods, including Faster R-CNN [101], Guided Anchoring [120] and Libra R-CNN [94]. ERCNNs combines the results of Cascade R-CNN [11] and Faster R-CNN [101] with different kinds of backbones. Libra-HBR ensembles improved SNIPER [106], Libra R-CNN [94] and Cascade R-CNN [11].

To further improve accuracy, some methods jointly predict the masks and bounding boxes of objects. For example, DBCL [131] uses the bounding box annotations to train a segmentation model to produce accurate results. HTC-drone improves the hybrid task cascade algorithm [16] using an instance segmentation cascade. The S+D method combines the segmentation algorithm DeepLab [136] with the detection module of HRDet [110].

VisDrone-dev benchmark.

This benchmark is designed for public evaluation. Seven state-of-the-art object detection methods are evaluated, i.e., FPN [78], RetinaNet [79], Light-RCNN [74], RefineDet [138], DetNet [75], Cascade R-CNN [11], and CornerNet [65], as shown in Table III.

Method AP AP50 AP75 AR1 AR10 AR100 AR500
VisDrone-2018 challenge:
HAL-Retina-Net 31.88 46.18 32.12 0.97 7.50 34.43 90.63
DPNet 30.92 54.62 31.17 1.05 8.00 36.80 50.48
DE-FPN 27.10 48.72 26.58 0.90 6.97 33.58 40.57
CFE-SSDv2 26.48 47.30 26.08 1.16 8.76 33.85 38.94
RDMS 22.68 44.85 20.24 1.55 7.45 29.63 38.59
L-H RCNN+ 21.34 40.28 20.42 1.08 7.81 28.56 35.41
Faster R-CNN2 21.34 40.18 20.31 1.36 7.47 28.86 37.97
RefineDet+ 21.07 40.98 19.65 0.78 6.87 28.25 35.58
DDFPN 21.05 42.39 18.70 0.60 5.67 28.73 36.41
YOLOv3_DP 20.03 44.09 15.77 0.72 6.18 26.53 33.27
VisDrone-2019 challenge:
DPNet-ensemble 29.62 54.00 28.70 0.58 3.69 17.10 42.37
RRNet 29.13 55.82 27.23 1.02 8.50 35.19 46.05
ACM-OD 29.13 54.07 27.38 0.32 1.48 9.46 44.53
S+D 28.59 50.97 28.29 0.50 3.38 15.95 42.72
BetterFPN 28.55 53.63 26.68 0.86 7.56 33.81 44.02
HRDet+ 28.39 54.53 26.06 0.11 0.94 12.95 43.34
CN-DhVaSa 27.83 50.73 26.77 0.00 0.18 7.78 46.81
SGE-cascade R-CNN 27.33 49.56 26.55 0.48 3.19 11.01 45.23
EHR-RetinaNet 26.46 48.34 25.38 0.87 7.87 32.06 38.42
CNAnet 26.35 47.98 25.45 0.94 7.69 32.98 42.28
VisDrone-dev:
CornerNet [65] 23.43 41.18 25.02 0.45 4.24 33.05 34.23
Light-RCNN [74] 22.08 39.56 23.24 0.32 3.63 31.19 32.06
FPN [78] 22.06 39.57 22.50 0.29 3.50 30.64 31.61
Cascade R-CNN [11] 21.80 37.84 22.56 0.28 3.55 29.15 30.09
DetNet [75] 20.07 37.54 21.26 0.26 2.84 29.06 30.45
RefineDet [138] 19.89 37.27 20.18 0.24 2.76 28.82 29.41
RetinaNet [79] 18.94 31.67 20.25 0.14 0.68 7.31 27.59
TABLE III: Comparison results of the algorithms on the VisDrone-DET dataset.

4.4 Results and Analysis

Results on the test-challenge set.

Top object detectors in the VisDrone-DET 2018 [147] and 2019 [35] challenges are presented in Table III. In contrast to existing object detection datasets, e.g., MS COCO [80], Caltech [31], and UA-DETRAC [124], one of the most challenging issues in the VisDrone-DET dataset is the extremely small scale of objects.

As shown in Table III, HAL-Retina-Net and DPNet are the only two methods achieving more than 30 AP in the VisDrone-DET 2018 challenge. Specifically, HAL-Retina-Net uses the Squeeze-and-Excitation [56] and downsampling-upsampling [117] modules to learn both channel and spatial attention on multi-scale features. To detect small-scale objects, it removes the higher convolutional layers in the feature pyramid. The second best detector, DPNet, uses Feature Pyramid Networks (FPN) [78] to extract multi-scale features and an ensemble mechanism to combine three detectors with different backbones, i.e., ResNet-50, ResNet-101 and ResNeXt. Similarly, DE-FPN and CFE-SSDv2 also employ multi-scale features, ranking in third and fourth place with 27.10 and 26.48 AP, respectively. RDMS trains four variants of RefineDet [138], i.e., three using SEResNeXt-50 and one using ResNet-50 as the backbone network. Moreover, DDFPN introduces a deep back-projection super-resolution network [51] to upsample the image using a deformable FPN architecture [23]. Notably, most of the submitted methods use a multi-scale testing strategy in evaluation, which is effective in improving performance.
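Multi-scale testing itself is conceptually simple: run the detector on several resized copies of each image, map the boxes back to the original resolution, and merge the pooled detections (e.g., with NMS). A minimal sketch under these assumptions is shown below; the `detector` callable, box format, and scale set are hypothetical placeholders rather than any team's actual configuration.

```python
import cv2  # assumption: OpenCV is available for image resizing

def multiscale_detect(image, detector, scales=(0.75, 1.0, 1.5), merge=None):
    """Run a single-image `detector` at several scales and pool the boxes.
    `detector(image)` is a hypothetical callable returning (x, y, w, h, score)
    tuples in the coordinates of the image it receives."""
    pooled = []
    for s in scales:
        resized = cv2.resize(image, None, fx=s, fy=s)
        for x, y, w, h, score in detector(resized):
            # map the detection back to the original image resolution
            pooled.append((x / s, y / s, w / s, h / s, score))
    # merge duplicates found at different scales, e.g., with an NMS routine
    return merge(pooled) if merge is not None else pooled
```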

In the VisDrone-DET 2019 challenge, DPNet-ensemble achieves the best result with a 29.62 AP score. It uses the global context module [13] to integrate context information and deformable convolutions [23] to enhance the transformation modeling capability of the detector. RRNet and ACM-OD tie for second place with a 29.13 AP score. RRNet is improved from [144] by integrating a re-regression module formed by the RoIAlign module [52] and several convolution layers. ACM-OD introduces an active learning strategy, which is combined with data augmentation for better performance.

In summary, as shown in Table III, although the top detector DPNet-ensemble in the VisDrone-DET 2019 challenge is slightly inferior to the top detector HAL-Retina-Net in the VisDrone-DET 2018 challenge in terms of AP score, the average AP score of the top methods in the VisDrone-DET 2019 challenge is greatly improved compared to that in the VisDrone-DET 2018 challenge. However, the top accuracy on this dataset is only 31.88 AP, achieved by HAL-Retina-Net in the VisDrone-DET 2018 challenge, which indicates the difficulty of the collected dataset and the pressing need to develop more robust methods for real-world applications.

Results on the test-dev set.

We present the evaluation results of the state-of-the-art methods in Table III. CornerNet [65] achieves the top AP score of 23.43, using the Hourglass-104 backbone for feature extraction. In contrast to FPN [78] and RetinaNet [79], which add extra stages to backbones designed for image classification in order to handle objects at various scales, DetNet [75] re-designs the backbone network for object detection, maintaining the spatial resolution and enlarging the receptive field, and achieves a 20.07 AP score. Meanwhile, RefineDet [138] with the VGG-16 backbone performs better than RetinaNet [79] with the ResNet-101 backbone, i.e., 19.89 vs. 18.94 in terms of AP score. This is because RefineDet [138] uses its object detection module to regress the locations and sizes of objects based on the coarsely adjusted anchors from the anchor refinement module.

4.5 Discussion

Captured by cameras mounted on drones, the VisDrone-DET dataset is extremely challenging due to scale variation, occlusion, and class imbalance. Compared to traditional object detection datasets, there are more issues worth exploring in drone-captured visual data.

Annotation and evaluation protocol.

As shown in Fig. 4, groups of objects in drone-captured visual data are often heavily occluded (see the orange bounding boxes of bicycles). If we use non-maximum suppression (NMS) to suppress duplicate detections, many true positive objects will inevitably be removed as well. In some real applications, it is unnecessary and impractical to locate each individual object in a crowd. Thus, it is more reasonable to use a large bounding box with a count to represent a group of objects of the same category (see the white bounding box of bicycles). Meanwhile, if we adopt this new annotation remedy, we need to redesign the metric used to evaluate detection algorithms, i.e., both localization and counting accuracy should be considered in evaluation.
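For reference, a minimal greedy NMS sketch is given below (the textbook algorithm, not VisDrone-specific code). It makes the dilemma above explicit: with any fixed IoU threshold, lowering it removes neighboring true positives in crowded regions, while raising it lets duplicate detections through.

```python
def box_iou(a, b):
    """IoU of two (x, y, w, h) boxes (repeated here so the sketch is standalone)."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def nms(dets, iou_thr=0.5):
    """Greedy NMS over (x, y, w, h, score) detections: keep the highest-scoring
    box, then discard remaining boxes overlapping it by more than iou_thr."""
    dets = sorted(dets, key=lambda d: -d[4])
    keep = []
    while dets:
        best = dets.pop(0)
        keep.append(best)
        dets = [d for d in dets if box_iou(d[:4], best[:4]) < iou_thr]
    return keep
```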

Coarse segmentation.

Current object detection methods use bounding boxes to indicate object instances, i.e., a 4-tuple (x, y, w, h), where x and y are the coordinates of the bounding box's top-left corner, and w and h are its width and height. As shown in Fig. 4, it is difficult to predict the location and size of the pedestrian (see the yellow bounding box) due to occlusion and the non-rigid deformation of the human body. A possible way to mitigate this issue is to integrate coarse segmentation into object detection, which might be effective in removing the disturbance of the background area enclosed in the bounding box of non-rigid objects, such as people and bicycles; see Fig. 4. In summary, this interesting problem is still far from solved and worth exploring.

Fig. 4: Descriptions of the challenging issues in the image object detection task.

5 VID Track

The VID track aims to locate object instances from a predefined set of categories in video sequences. That is, given a series of video clips, the algorithms are required to produce a set of bounding boxes for each object instance in each video frame (if any), with real-valued confidences. In contrast to the DET track, which focuses on object detection in individual images, the VID track deals with detecting object instances in video clips, which provide temporal consistency across consecutive frames. Five categories of objects are considered in this track, i.e., pedestrian, car, van, bus, and truck. Similar to the DET track, some rarely occurring special vehicles (e.g., machineshop trucks, forklift trucks, and tankers) are ignored in evaluation. The AP scores of algorithm predictions in individual frames, computed across different object categories and IoU thresholds, are used to evaluate the quality of the results.

Fig. 5: The length of object trajectories vs. the percentage of trajectories in the training, validation, test-challenge and test-dev subsets of the VID and MOT tracks.
Fig. 6: The number of object trajectories in different categories in the training, validation, test-challenge and test-dev subsets of the VID and MOT tracks.

5.1 Data Collection and Annotation

We provide challenging video clips in the VID track, including clips for training ( frames in total), clips for validation ( frames in total), clips in the test-challenge subset ( frames in total) and clips in the test-dev subset ( frames in total). To clearly describe the data distribution, we plot the length of object trajectories vs. the percentage of trajectories in Fig. 5, and the number of object trajectories in different categories in Fig. 6. As shown in Fig. 6, the class imbalance issue is extremely severe in the VID and MOT datasets, challenging the performance of algorithms; for example, in the training set, car trajectories far outnumber those of the less frequent categories. Meanwhile, as shown in Fig. 5, the length of object trajectories varies dramatically, e.g., the maximal and minimal lengths of object trajectories are and , requiring tracking algorithms to perform well in both short-term and long-term cases.

We manually annotate five categories of objects in each video clip, i.e., pedestrian, car, van, bus, and truck, and conduct the cross-checking to ensure the annotation quality. Similar to the DET track, we also provide the annotations of occlusion and truncation ratios of each object and ignored regions in each video frame. We present the annotation exemplars in the second row of Fig. 1.

5.2 Evaluation Protocol

For the VID track, each evaluated algorithm is required to generate a list of bounding box detections with confidence scores in each video frame. Motivated by the evaluation protocols in MS COCO [80] and ILSVRC [95], we use the AP, AP50, AP75, AR1, AR10, AR100 and AR500 scores to evaluate the results of video object detection algorithms, similar to the DET track. Notably, the AP score is used as the primary metric for ranking methods. Please see [80, 95] for more details.

Codename AP Institutions Contributions and References
VisDrone-2018 Challenge:
CERTH-ODV 9.10 Centre for Research & Technology Hellas Emmanouil Michail, Konstantinos Avgerinakis, Panagiotis Giannakeris, Stefanos Vrochidis, Ioannis Kompatsiaris [101]
CFE-SSDv2 21.57 Peking University Qijie Zhao, Feng Ni, Yongtao Wang [81]
EODST 16.54 Xidian University Zhaoliang Pi, Yinan Wu, Mengkun Liu [81, 25]
FGFA+ 16.00 Xidian University Jie Gao, Yidong Bai, Gege Zhang, Dan Wang, Qinghua Ma [149]
RD 14.95 Fraunhofer IOSB, Karlsruhe Institute of Technology Oliver Acatay, Lars Sommer, Arne Schumann [138]
RetinaNet_s 8.63 Beijing University of Posts and Telecommunications Jianfei Zhao, Yanyun Zhao [79]
VisDrone-2019 Challenge:
AFSRNet 24.77 Beijing Institute of Technology, Samsung Inc Ziming Liu, Jing Ge, Tong Wu, Lin Sun, Guangyu Gao [79, 145]
CN-DhVaSa 21.58 Siemens Technology and Services Private Limited Dheeraj Reddy Pailla, Varghese Alex Kollerathu, Sai Saketh Chennamsetty [144]
CornerNet-lite-FS 12.65 Ocean University of China Xin Sun, Hongwei Xv, Meng Zhang, Zihe Dong, Lijun Du [66]
DBAI-Det 29.22 DeepBlue Technology Zhipeng Luo, Feng Ni, Bing Dong, Yuehan Yao, Zhenyu Xu [11]
DetKITSY 20.43 Karlsruhe Institute of Technology, Sun Yat-sen University, VIPioneers (Huituo) Inc Wei Tian, Jinrong Hu, Yuduo Song, Zhaotang Chen, Long Chen, Martin Lauer [11]
DM2Det 13.52 KARI, KAIST SungTae Moon, Dongoo Lee, Yongwoo Kim, SungHyun Moon [141]
EODST++ 18.73 Xidian University Zhaoliang Pi, Yingping Li, Xier Chen, Yanchao Lian, Yinan Wu [81, 115, 24, 69]
FT 9.15 Northwestern Polytechnical University Yunfeng Zhang, Yiwen Wang, Ying Li [101]
FRFPN 16.50 Nanjing University of Science and Technology Zhifan Zhu, Zechao Li [101, 151]
HRDet+ 23.03 South China University of Technology Jingkai Zhu, Weida Qin, Qiong Liu, Haitao Xiong [110]
Libra-HBR 18.29 Zhejiang University Chunfang Deng, Qinghong Zeng, Zhizhao Duan, Bolun Zhang [106, 94, 11]
Sniper+ 18.16 Xi’an Jiaotong University Xingjie Zhao, Ting Sun, Guizhong Liu [106]
VCL-CRCNN 21.61 Tsinghua University Zhibin Xiao [11]
TABLE IV: Teams participating in VisDrone-VID 2018 and 2019 challenges, ordered alphabetically.

5.3 Algorithms

VisDrone 2018 challenge.

We received the six entries shown in Table IV in the VID track of the VisDrone-2018 challenge. Four methods are directly derived from image object detectors, i.e., CERTH-ODV, CFE-SSDv2, RetinaNet_s, and RD. The EODST method is constructed based on SSD [81] and uses the ECO tracker [25] to exploit temporal coherence. FGFA+ is modified from the video object detection framework in [149] by enhancing the contrast and brightness of frames. CFE-SSDv2 achieves the top accuracy (i.e., 21.57 AP), using a comprehensive feature enhancement module to enhance the features of small objects.

VisDrone 2019 challenge.

As presented in Table IV, 13 video detection methods were submitted to this track. Similar to the VisDrone-VID 2018 challenge, the majority of submissions are directly derived from object detectors for static images. For instance, three methods are based on Cascade R-CNN [11], i.e., DBAI-Det, DetKITSY and VCL-CRCNN. Libra-HBR combines improved SNIPER [106], Libra R-CNN [94] and Cascade R-CNN [11]. CN-DhVaSa and CornerNet-lite-FS are based on the anchor-free methods CenterNet [144] and CornerNet [66], respectively. AFSRNet integrates a feature selective anchor-free (FSAF) head [145] into RetinaNet [79] to improve accuracy. FRFPN is derived from Faster R-CNN [101] with data augmentation [151]. EODST++ improves the EODST method from the VisDrone-VID 2018 challenge, using SSD [81] and FCOS [115] for detection in individual frames, and ECO [24] and SiamRPN++ [69] to track objects and recall false negatives in detection. FT improves Faster R-CNN [101] with three-dimensional convolutions to exploit temporal information for better performance.

VisDrone-dev benchmark.

We evaluate two state-of-the-art video object detection methods, i.e., FGFA [149] and D&T [44], and four state-of-the-art image object detection methods, i.e., Faster R-CNN [101], FPN [78], CornerNet [65], and CenterNet [144], in this track. Specifically, FPN [78] and Faster R-CNN [101] are anchor-based methods, while CornerNet [65] and CenterNet [144] are anchor-free methods. The FGFA [149] and D&T [44] methods attempt to exploit the temporal coherence of objects in consecutive frames to improve performance.

Method AP AP50 AP75 AR1 AR10 AR100 AR500
VisDrone-2018 challenge:
CFE-SSDv2 21.57 44.75 17.95 11.85 30.46 41.89 44.82
EODST 16.54 38.06 12.03 10.37 22.02 25.52 25.53
FGFA+ 16.00 34.82 12.65 9.63 19.54 22.37 22.37
RD 14.95 35.25 10.11 9.67 24.60 29.72 29.91
CERTH-ODV 9.10 20.35 7.12 7.02 13.51 14.36 14.36
RetinaNet_s 8.63 21.83 4.98 5.80 12.91 15.15 15.15
VisDrone-2019 challenge:
DBAI-Det 29.22 58.00 25.34 14.30 35.58 50.75 53.67
AFSRNet 24.77 52.52 19.38 12.33 33.14 45.14 45.69
HRDet+ 23.03 51.79 16.83 4.75 20.49 38.99 40.37
VCL-CRCNN 21.61 43.88 18.32 10.42 25.94 33.45 33.45
CN-DhVaSa 21.58 48.09 16.76 12.04 29.60 39.63 40.42
DetKITSY 20.43 46.33 14.82 8.64 25.80 33.40 33.40
ACM-OD 18.82 43.15 13.42 5.98 22.29 34.78 35.92
EODST++ 18.73 44.38 12.68 9.67 22.84 27.62 27.62
Libra-HBR 18.29 44.92 11.64 10.69 26.68 35.83 36.57
Sniper+ 18.16 38.56 14.79 9.98 27.18 38.21 39.08
VisDrone-dev:
FGFA [149] 14.44 33.34 11.85 7.29 21.37 27.09 27.21
D&T [44] 14.21 32.28 10.39 7.59 19.39 26.57 25.64
FPN [78] 12.93 29.88 10.12 7.03 19.71 25.59 25.59
CenterNet [144] 12.35 28.93 9.92 6.41 18.93 24.87 24.87
CornerNet [65] 12.29 28.37 9.48 6.07 18.60 24.03 24.03
Faster-RCNN [101] 10.25 26.83 6.70 5.93 12.98 13.55 13.55
TABLE V: Comparison results of the algorithms on the VisDrone-VID dataset.

5.4 Results and Analysis

Results on the test-challenge set.

We report the evaluation results of the submissions in the VisDrone-VID 2018 [148] and 2019 [146] challenges in Table V. CFE-SSDv2 obtains the best AP score of 21.57 in the VisDrone-VID 2018 challenge; it is improved from SSD [81] by integrating a comprehensive feature enhancement module for accurate results, especially for small objects. Different from CFE-SSDv2, EODST exploits temporal information to associate object detections in individual frames using the ECO tracking algorithm [25], achieving the second best AP of 16.54. FGFA+ ranks third with 16.00 AP; it is a variant of the video object detection method FGFA [149] with various data augmentation strategies.

In the VisDrone-VID 2019 challenge, researchers propose more powerful algorithms, which benefit from several state-of-the-art detectors, such as HRDet [110], Cascade R-CNN [11], CenterNet [144], RetinaNet [79], and FPN [78]. All of the top detectors, i.e., DBAI-Det, AFSRNet, HRDet+, VCL-CRCNN and CN-DhVaSa, surpass the top detector CFE-SSDv2 of the VisDrone-VID 2018 challenge. We thus witness a significant improvement in the performance of video object detection methods, although there still remains much room for further improvement. DBAI-Det achieves the best result with 29.22 AP; it is constructed based on Cascade R-CNN [11] with a ResNeXt-101 backbone [133], and integrates deformable convolutions [23] and the global context module [13] to improve performance. AFSRNet ranks second with 24.77 AP, integrating the feature selective anchor-free head [145] into RetinaNet [79]. HRDet+, VCL-CRCNN and CN-DhVaSa rank in third, fourth, and fifth place, and are improved from HRDet [110], Cascade R-CNN [11], and CenterNet [144], respectively. To deal with large scale variations of objects, other top detectors, such as DetKITSY and EODST++, employ multi-scale features and proposals for detection, performing better than the state-of-the-art video object detector FGFA [149]. Notably, most of the video object detection methods are too computationally expensive for practical applications, running at low frame rates on a workstation with a GTX 1080Ti GPU.

Results on the test-dev set.

The evaluation results of the state-of-the-art video object detection methods [149, 44] and image object detection methods [78, 144, 65, 101] on the test-dev set are presented in Table V. We find that the two video object detectors perform much better than the four image object detectors. For example, the second best video object detector, D&T [44], achieves 14.21 AP compared to 12.93 AP for the top image object detector, FPN [78], which demonstrates the importance of exploiting temporal information in video object detection. FGFA [149] leverages temporal coherence to enhance the features of objects for accurate results. D&T [44] simultaneously solves detection and tracking with an end-to-end trained convolutional neural network and a multi-task loss for frame-based object detection and across-frame track regression. However, how to best exploit temporal information is still an open question for video object detection.

5.5 Discussion

Different from the DET task, the accuracy of detection methods in videos suffers from degenerated object appearance caused by motion blur, pose variation, and video defocus. Exploiting temporal coherence and aggregating features across consecutive frames might be two effective ways to handle these issues.

Temporal coherence.

A feasible way to exploit temporal coherence is to integrate object trackers, e.g., ECO [24] and SiamRPN++ [69], into detection algorithms. Specifically, we can assign a tracker to each detected object instance in individual frames to guide detection in consecutive frames, which is effective in suppressing false negatives in detection. Meanwhile, integrating a re-identification module is another promising way to exploit temporal coherence for better performance, as described in D&T [44].
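A minimal sketch of this idea is given below, assuming hypothetical `detect`, `create_tracker`, and `nms` callables (for example, an NMS routine like the one sketched in Section 4.5); each surviving detection seeds a single-object tracker whose propagated box is added, with a modest confidence, to the next frame's detections.

```python
def detect_with_trackers(frames, detect, create_tracker, nms):
    """Augment per-frame detections with boxes propagated by single-object
    trackers initialized on the previous frame's outputs, then let NMS remove
    duplicates. `detect`, `create_tracker`, and `nms` are hypothetical
    callables; a tracker is assumed to expose update(frame) -> (x, y, w, h)
    or None when the target is lost."""
    trackers, outputs = [], []
    for frame in frames:
        dets = list(detect(frame))          # (x, y, w, h, score) tuples
        for trk in trackers:                # propagate objects from the last frame
            box = trk.update(frame)
            if box is not None:
                dets.append((*box, 0.5))    # recovered box with modest confidence
        dets = nms(dets)                    # drop duplicate boxes
        outputs.append(dets)
        # one tracker per surviving detection, used to guide the next frame
        trackers = [create_tracker(frame, d[:4]) for d in dets]
    return outputs
```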

Feature aggregation.

Aggregating features across consecutive frames is also a useful way to improve performance. As stated in FGFA [149], aggregating nearby features along motion paths to leverage temporal coherence significantly improves detection accuracy. Thus, we can take several consecutive frames as input and feed them into deep neural networks to extract temporally salient features using 3D convolution operations or optical flow algorithms.
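As a toy illustration of feature aggregation (not the FGFA implementation), the snippet below averages backbone feature maps over a short temporal window with weights predicted from the features themselves; FGFA additionally warps neighboring features along optical flow before aggregation, which is omitted here. PyTorch is assumed, and the module and tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

class TemporalAggregator(nn.Module):
    """Toy feature aggregation over a window of consecutive frames: per-frame
    weights are predicted from the features and the weighted average replaces
    the centre frame's feature map (flow warping omitted)."""
    def __init__(self, channels):
        super().__init__()
        self.weight_head = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats):                 # feats: (T, C, H, W) for T frames
        logits = self.weight_head(feats)      # (T, 1, H, W) per-frame weight maps
        weights = torch.softmax(logits, dim=0)
        return (weights * feats).sum(dim=0)   # aggregated (C, H, W) feature map

# usage: features of 5 consecutive frames from a backbone with 256 channels
agg = TemporalAggregator(256)
fused = agg(torch.randn(5, 256, 64, 64))      # -> torch.Size([256, 64, 64])
```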

Fig. 7: (a) The number of frames vs. the aspect ratio (height divided by width) change rate with respect to the first frame, (b) the number of frames vs. the area variation rate with respect to the first frame, and (c) the distributions of the number of frames of video clips, in the training, validation, test-challenge and test-dev subsets for the SOT track.

6 SOT Track

For the SOT track, we focus on generic single object tracking, also known as model-free tracking [123, 130, 143]. In particular, given an input video sequence and the initial bounding box of the target object in the first frame, the SOT track requires algorithms to locate the target bounding boxes in the subsequent video frames. The tracking targets in these sequences include pedestrians, cars, buses, and animals.

6.1 Data Collection and Annotation

In 2018, we provided video sequences with fully annotated frames, split into four subsets, i.e., the training set ( sequences with frames in total), the validation set ( sequences with frames in total), the test-challenge 2018 set ( sequences with frames in total), and the test-dev set ( sequences with frames in total). Notably, the test-challenge 2018 subset is designed to evaluate the algorithms submitted to the VisDrone-SOT 2018 challenge competition. To thoroughly evaluate the performance of algorithms in long-term tracking, we added newly collected sequences ( frames in total) to the test-challenge 2018 set to form the test-challenge 2019 set, which is used in the VisDrone-SOT 2019 challenge competition. The tracking targets in all these sequences include pedestrians, cars, and animals. The statistics of the target objects, i.e., the aspect ratio across frames, the area change ratio, and the sequence length, are presented in Fig. 7.

The enclosing bounding box of the target object in each video frame is annotated to evaluate the performance of trackers. To thoroughly analyze the tracking performance, we also annotate sequence attributes [89], i.e., aspect ratio change, background clutter, camera motion, fast motion, full occlusion, illumination variation, low resolution, out-of-view, partial occlusion, scale variation, similar object, and viewpoint change, described as follows (a sketch for deriving several of these attributes from the annotations is given after the list).

Aspect ratio change: the ratio of the ground truth aspect ratio in the first frame to that in at least one subsequent frame is outside a predefined range.

Background clutter: the background near the target has a similar appearance to the target.

Camera motion: abrupt motion of the camera.

Fast motion: the motion of the ground truth bounding box between two consecutive frames is larger than a predefined number of pixels.

Full occlusion: the target is fully occluded.

Illumination variation: the illumination of the target changes significantly.

Low resolution: at least one ground truth bounding box contains fewer than a predefined number of pixels.

Out-of-view: some portion of the target leaves the view.

Partial occlusion: the target is partially occluded.

Scale variation: the ratio of the size of the initial bounding box to that of at least one subsequent bounding box is outside a predefined range.

Similar object: there are objects of similar shape or the same type near the target.

Viewpoint change: the change of viewpoint affects the target appearance significantly.
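As a rough illustration only (not the actual annotation tool), several of the attributes above can be derived automatically from the per-frame ground truth boxes. The thresholds and ranges below are placeholders, since the exact values used for VisDrone-SOT are not restated here.

    def sequence_attributes(boxes, ar_range=(0.5, 2.0), scale_range=(0.5, 2.0),
                            motion_thr=20.0, area_thr=400.0):
        """boxes: list of ground truth [x, y, w, h] boxes, one per frame (sketch)."""
        x0, y0, w0, h0 = boxes[0]
        prev_cx, prev_cy = x0 + w0 / 2.0, y0 + h0 / 2.0
        attrs = set()
        for x, y, w, h in boxes[1:]:
            ar = (h / w) / (h0 / w0)          # aspect ratio relative to the first frame
            sc = (w * h) / (w0 * h0)          # box size relative to the first frame
            cx, cy = x + w / 2.0, y + h / 2.0
            if not ar_range[0] <= ar <= ar_range[1]:
                attrs.add("aspect_ratio_change")
            if not scale_range[0] <= sc <= scale_range[1]:
                attrs.add("scale_variation")
            if ((cx - prev_cx) ** 2 + (cy - prev_cy) ** 2) ** 0.5 > motion_thr:
                attrs.add("fast_motion")
            if w * h < area_thr:
                attrs.add("low_resolution")
            prev_cx, prev_cy = cx, cy
        return attrs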

6.2 Evaluation Protocol

Following the evaluation methodology in [130], we use the success and precision scores to evaluate the performance of trackers. The success score is defined as the area under the success plot. That is, for each bounding box overlap threshold sampled in the interval $[0,1]$, we compute the percentage of successfully tracked frames to generate the plot of successfully tracked frames vs. bounding box overlap threshold. The overlap between the tracker prediction $B_p$ and the ground truth bounding box $B_g$ is computed as $o = \frac{|B_p \cap B_g|}{|B_p \cup B_g|}$, where $\cap$ and $\cup$ denote the intersection and union of the two regions, and $|\cdot|$ counts the number of pixels within a region. Meanwhile, the precision score is computed as the percentage of frames whose estimated location is within a given threshold distance of the ground truth, based on the Euclidean distance in the image plane. Notably, the success score is used as the primary metric for ranking methods.
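The protocol can be summarized with the following sketch, which approximates the area under the success plot by averaging over sampled overlap thresholds and computes the precision score at a fixed center-error threshold; the 20-pixel default below is the value commonly used with the OTB protocol [130] and is only an assumption here.

    import numpy as np

    def iou(a, b):
        # a, b: boxes in [x1, y1, x2, y2] format
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    def success_precision(pred_boxes, gt_boxes, dist_thr=20.0):
        """pred_boxes, gt_boxes: (N, 4) arrays of per-frame boxes (sketch)."""
        pred_boxes, gt_boxes = np.asarray(pred_boxes), np.asarray(gt_boxes)
        overlaps = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
        centers = lambda b: np.stack([(b[:, 0] + b[:, 2]) / 2, (b[:, 1] + b[:, 3]) / 2], axis=1)
        dists = np.linalg.norm(centers(pred_boxes) - centers(gt_boxes), axis=1)
        thresholds = np.linspace(0, 1, 21)                              # sampled overlap thresholds
        success = np.mean([np.mean(overlaps > t) for t in thresholds])  # area under the success plot
        precision = np.mean(dists <= dist_thr)                          # frames with small center error
        return success, precision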

Codename Success Score Institutions Contributors and References
VisDrone-2018 Challenge:
AST 56.2 Beihang University, Lancaster University, Shenyang Aerospace University Chunlei Liu, Wenrui Ding, Jinyu Yang, Baochang Zhang, Jungong Han, Hanlin Chen [71]
BTT 60.5 Shandong University Ke Song, Xixi Hu, Wenhao Wang, Yaxuan Li, and Wei Zhang [93]
C3DT 53.6 South China University of Technology Haojie Li, Sihang Wu [93]
CFCNN 55.2 Karlsruhe Institute of Technology Wei Tian, Martin Lauer [48]
CFWCRKF 50.6 Beijing University of Posts and Telecommunications Shengyin Zhu, Yanyun Zhao [53]
CKCF 32.3 Centre for Research & Technology Hellas Emmanouil Michail, Konstantinos Avgerinakis, Panagiotis Giannakeris, Stefanos Vrochidis, Ioannis Kompatsiaris [54]
DCFNet 47.4 Civil Aviation University of China, Institute of Automation, Chinese Academy of Sciences Jing Li, Qiang Wang, Weiming Hu [121]
DCST 52.8 Nanjing Artificial Intelligence Chip Research, IACAS, Institute of Automation, Chinese Academy of Sciences, Nanjing University of Information Science and Technology Jiaqing Fan, Yifan Zhang, Jian Cheng, Kaihua Zhang, Qingshan Liu [6]
DeCoM 56.9 Seoul National University, NAVER Corp Byeongho Heo, Sangdoo Yun, Jin Young Choi [93]
IMT3 17.6 University of South Australia Asanka G. Perera
LZZ-ECO 68.0 Xidian University Xiaotong Li, Jie Zhang, Xin Zhang [25]
OST 50.3 University of Ottawa Yong Wang, Lu Ding, Robert Laganière, Xinbin Luo [25]
TRACA+ 45.7 Seoul National University, Samsung R&D Campus Kyuewang Lee, Jongwon Choi, Jin Young Choi [20]
SDRCO 56.3 Beijing University of Posts and Telecommunications, Tencent, Sun Yat-sen University, Tsinghua University Zhiqun He, Ruixin Zhang, Peizhen Zhang, Xiaohao He [53]
SECFNet 51.1 National University of Defense Technology, Shanghai Jiao Tong University Dongdong Li, Yangliu Kuai, Hao Liu, Zhipeng Deng, Juanping Zhao [116]
STAPLE_SRCA 61.9 Xidian University Wenhua Zhang, Yang Meng [90]
VITALD 62.8 Harbin Institute of Technology, University of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences Yuankai Qi, Yifan Yang, Weidong Chen, Kaiwen Duan, Qianqian Xu, Qingming Huang [93, 108]
VisDrone-2019 Challenge:
ACNT 53.2 Jiangnan University, University of Surrey Tianyang Xu, Xiaojun Wu, Zhenhua Feng, Josef Kittler [26]
AST 51.9 Nanjing University of Information Science and Technology Kang Yang, Xianhai Wang, Ning Wang, Jiaqing Fan, Kaihua Zhang [26]
ATOMFR 61.7 Xidian University Wenhua Zhang, Haoran Wang, Jinliu Zhou [26]
ATOMv2 46.8 Institute of Automation, Chinese Academy of Sciences Lianghua Huang, Xin Zhao, Kaiqi Huang [26]
DATOM_AC 54.1 Northwestern Polytechnical University Xizhe Xue, Xiaoyue Yin, Shanrong Zou, Ying Li [26]
DC-Siam 46.3 Northwestern Polytechnical University Jinghao Zhou, Peng Wang [26, 69, 70]
DR-V-LT 57.9 Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences, University of Chinese Academy of Sciences Shiyu Xuan, Shengyang Li [69]
ED-ATOM 63.5 Institute of Information Engineering, Chinese Academy of Sciences, University of Chinese Academy of Sciences, CloudWalk Technology Chunhui Zhang, Shengwei Zhao, Kangkai Zhang, Shikun Li, Hao Wen, Shiming Ge [26]
flow_MDNet_RPN 52.6 Xi’an Jiaotong University Han Wu, Xueyuan Yang, Yong Yang, Guizhong Liu [93]
HCF 36.1 Yuneec Aviation Technology, University of Ottawa, Institute of Information Engineering, Chinese Academy of Sciences Zhuojin Sun, Yong Wang, Chunhui Zhang [48]
MATOM 40.9 Institute of Optics and Electronics, Chinese Academy of Sciences Lijun Zhou, Qintao Hu [26]
PTF 54.4 Xidian University Ruohan Zhang, Jie Chen, Jie Gao, Xiaoxue Li, Lingling Shi [26]
SE-RPN 41.9 Wuhan University Xu Lei, Jinwang Wang [70]
SiamDW-FC 38.3 Institute of Automation, Chinese Academy of Sciences Zhipeng Zhang, Weiming Hu [140]
SiamFCOT 47.2 Zhejiang University Yinda Xu, Zeyu Wang [70]
Siam-OM 59.3 Xidian University Xin Zhang, Xiaotong Li, Jie Zhang [26, 150]
SOT-SiamRPN++ 56.8 Zhejiang University Zhizhao Duan, Wenjun Zhu, Xi Yu, Bo Han, Zhiyong Yu, Ting He [69]
SMILE 59.4 Xidian University Ruiyan Ma, Yanjie Gao, Yuting Yang, Wei Song, Yuxuan Li [26, 69]
SSRR 44.7 Nanjing University of Information Science and Technology Ning Wang, Kaihua Zhang [26]
Stable-DL 38.2 University of Ottawa, Shanghai Jiao Tong University, Beihang University, INSKY Lab, Leotail Intelligent Tech Yong Wang, Lu Ding, Robert Laganière, Jiuqing Wan, Wei Shi
TDE 37.2 Institute of Information Engineering, Chinese Academy of Sciences, University of Chinese Academy of Science, Yuneec Aviation Technology, University of Ottawa Chunhui Zhang, Shengwei Zhao, Zhuojin Sun, Yong Wang, Shiming Ge [134]
TIOM 55.3 Beijing University of Posts and Telecommunications Shengyin Zhu, Yanyun Zhao [26]
TABLE VI: Teams participating in VisDrone-SOT 2018 and 2019 challenges, ordered alphabetically.

6.3 Algorithms

VisDrone 2018 challenge.

We present the results and team information for this track in Table VI. CFWCRKF, CKCF, DCST and STAPLE_SRCA are based on correlation filters, while C3DT, VITALD, DeCoM and BTT are improved from the deep MDNet method [93]. Seven other trackers combine CNN models and correlation filter algorithms, i.e., CFCNN, DCFNet, LZZ-ECO, OST, TRACA+, SDRCO and SECFNet. Notably, OST, CFCNN and LZZ-ECO use object detectors to perform target re-detection for more robustness. AST predicts the target using a saliency map, and IMT3 is based on the normalized cross-correlation filter. The LZZ-ECO method produces the best results in terms of success score, using YOLOv3 [100] to re-detect the drifted target and ECO [25] to track the target object.

VisDrone 2019 challenge.

As shown in Table VI, the trackers submitted to this track come from different institutes. Among them, a number of trackers are constructed based on ATOM [26], i.e., ACNT, AST, ATOMFR, ATOMv2, DATOM_AC, ED-ATOM, MATOM, SSRR and TIOM. Notably, ED-ATOM achieves the best performance in terms of both success and precision scores. PTF follows the ECO algorithm [24], while Siam-OM and SMILE use Siamese networks in combination with ATOM [26]. Several other trackers also use the Siamese network architecture, including DC-Siam, DR-V-LT, SiamDW-FC, SiamFCOT and SOT-SiamRPN++.

VisDrone-dev benchmark.

We evaluate state-of-the-art single object tracking methods for comparison in this track. We roughly divide them into three categories, i.e., the correlation filter based, the Siamese network based, and the convolutional network based approaches, listed as follows.

Correlation filter based approach: KCF [54], CSRDCF [82], LCT [86], DSST [27], ECO [24], SRDCF [28], SCT [21], fDSST [29], Staple [6], Staple_CA [90], BACF [48], PTAV [43], STRCF [71], and HCFT [85] (since HCFT [85] adaptively learns correlation filters on each convolutional layer to encode the target appearance, we categorize it into both the correlation filter based and convolutional network based approaches).

Siamese network based approach: DSiam [50], SiameseFC [7], and SiamRPN++ [69].

Convolutional network based approach: HCFT [85], MDNet [93], CFNet [116], TRACA [19], and ATOM [26].

6.4 Results and Analysis

Results on the test-challenge set.

The overall success and precision scores of the top submissions in the VisDrone-SOT 2018 [125] and 2019 [36] challenges are shown in Fig. 8(a) and (b), respectively. Notably, several challenging factors in the collected dataset, such as background clutter, large scale variation, and occlusion, make the trackers prone to drift. To that end, some trackers integrate state-of-the-art detectors to re-detect the target when drifting occurs. For example, in the VisDrone-SOT 2018 challenge, LZZ-ECO combines YOLOv3 [100] and ECO [25], achieving the best success and precision scores. VITALD trains RefineDet [138] as a reference for the VITAL tracker [108], obtaining the second best success score and the third best precision score. Another solution to the drifting problem is STAPLE_SRCA [90], which develops a sparse context-aware response scheme to recognize whether the target moves out of the scene or is covered by other objects; it obtains the third best success score and the second best precision score. DCST learns spatio-temporally regularized correlation filters using a color clustering based histogram model without a re-detection module, resulting in inferior success and precision scores.

We notice that the correlation filter based methods do not perform well in the VisDrone-SOT 2018 challenge. Thus, in the VisDrone-SOT 2019 challenge, the participants shift their focus from correlation filter based methods to deep neural network based methods, such as ATOM [26] and Siamese networks [7, 50, 69]. Specifically, ATOMFR combines SENet [56] and ATOM [26] to capture the interdependencies within feature channels and suppress feature channels that are of little use for estimating the current target size and location, achieving the top accuracy on the test-challenge 2018 set in terms of both success and precision scores.
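The channel re-weighting idea behind this combination can be illustrated with a generic squeeze-and-excitation block [56]; how exactly it is inserted into ATOM is not specified here, so the snippet below only shows the standard SE operation that suppresses uninformative feature channels.

    import torch
    import torch.nn as nn

    class SEBlock(nn.Module):
        """Generic squeeze-and-excitation channel attention [56] (sketch)."""
        def __init__(self, channels, reduction=16):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels),
                nn.Sigmoid(),
            )

        def forward(self, x):                                # x: (N, C, H, W)
            w = x.mean(dim=(2, 3))                           # squeeze: global average pooling
            w = self.fc(w).unsqueeze(-1).unsqueeze(-1)       # excitation: per-channel weights
            return x * w                                     # re-weight the feature channels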

Another critical engine for the performance improvements is the creation and utilization of large-scale datasets (e.g., MS COCO [80], GOT-10k [57], ImageNet DET/VID [105], LaSOT [42], TrackingNet [91], VOT [63] and YoutubeBB [99]) for deep neural network training. For example, ED-ATOM achieves the best success and precision scores in the VisDrone-SOT 2019 challenge. This is because ED-ATOM is built on ATOM [26] with a low-light image enhancement algorithm [135] and an online data augmentation scheme [137, 9]. Meanwhile, the model is trained on ImageNet DET/VID [105], MS COCO [80], GOT-10k [57], and LaSOT [42].

Moreover, tracker combination is an effective strategy to improve the performance. Siam-OM uses ATOM [26] to handle short-term tracking and DaSiam [150] with a ResNet backbone to handle long-term tracking, ranking fourth in the VisDrone-SOT 2019 challenge. SMILE combines two state-of-the-art trackers, ATOM [26] and SiamRPN++ [69], to improve the performance, ranking fifth. DR-V-LT integrates a distractor-aware verification network into SiamRPN++ [69], which makes it robust to the similar object challenge, ranking eighth.

In addition, comparing the results of the submitted trackers on the test-challenge 2018 and test-challenge 2019 sets, we find that the tracking accuracy is significantly degraded: the best tracker ED-ATOM achieves notably higher success and precision scores on the test-challenge 2018 set than on the test-challenge 2019 set. This demonstrates the difficulty of the newly collected long-term tracking sequences, and suggests the need to develop more effective trackers for challenging scenarios on drones.

Results on the test-dev set.

We evaluate state-of-the-art trackers on the test-dev set in Fig. 8(c). As shown in Fig. 8(c), ATOM [26] (marked as the orange cross in the top-right corner) obtains the best success score and the third best precision score. This is attributed to its network trained offline on large-scale datasets to directly predict the IoU overlap between the target and a bounding box estimate. However, it does not perform well on the low resolution and out-of-view attributes. MDNet [93] and SiamRPN++ [69] rank second and third in terms of success score, respectively. In summary, training on large-scale datasets brings significant performance improvements for trackers.

Fig. 8: (a) The success vs. precision scores of the top trackers in the VisDrone-SOT 2018 (denoted as red markers) and VisDrone-SOT 2019 (denoted as blue markers) challenges on the test-challenge 2018 set. (b) The success vs. precision scores of the top trackers in the VisDrone-SOT 2019 challenge on the test-challenge 2019 set. (c) The success vs. precision scores of the state-of-the-art trackers on the test-dev set.

6.5 Discussion

The state-of-the-art SOT algorithms on the VisDrone-SOT dataset are inspired by algorithms from the object detection and re-identification fields. They benefit greatly from offline training on large-scale datasets, such as MS COCO [80], GOT-10k [57], ImageNet DET/VID [105], LaSOT [42], TrackingNet [91], VOT [63] and YoutubeBB [99]. However, fast motion, low resolution, and occlusion still challenge the performance of SOT algorithms.

Abrupt motion.

Several SOT algorithms [69, 15, 122] formulate object tracking as a one-shot detection task, using the bounding box in the first frame as the only exemplar. These methods rely on pre-set anchor boxes to regress the bounding box of the target in consecutive frames. However, the pre-defined anchor boxes cannot adapt to the various motion patterns and scales of targets, especially when fast motion and occlusion occur. To this end, we can attempt to integrate motion information or a re-detection module to improve the accuracy of tracking algorithms.
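One simple way to inject such motion information is to maintain a constant-velocity estimate of the target position and trigger a re-detection step when the tracker confidence drops. The sketch below assumes two hypothetical callables, track_one_frame(frame, box) returning (box, confidence) and redetect(frame, last_box) returning a box or None; it is not the method of any particular submission.

    def track_with_motion(frames, init_box, track_one_frame, redetect, conf_thr=0.3):
        """Single object tracking with a constant-velocity prior and re-detection (sketch)."""
        box, velocity = list(init_box), [0.0, 0.0]            # box: [x, y, w, h]
        results = [list(init_box)]
        for frame in frames[1:]:
            # predict the new position from the previous displacement before running the tracker
            predicted = [box[0] + velocity[0], box[1] + velocity[1], box[2], box[3]]
            new_box, conf = track_one_frame(frame, predicted)
            if conf < conf_thr:                               # the tracker is likely drifting
                new_box = redetect(frame, box) or predicted
            velocity = [new_box[0] - box[0], new_box[1] - box[1]]
            box = list(new_box)
            results.append(list(box))
        return results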

Low resolution

is another challenging factor that greatly affects tracking accuracy. Most state-of-the-art methods [26, 50, 69] merely focus on the appearance variations of the target region, producing unstable and inaccurate results. We believe that exploiting context information surrounding the target and super-resolution techniques can help to improve the tracking performance.

Occlusion

happens frequently during tracking and is a major obstacle to accurate results. Some previous algorithms [1, 12, 32, 33] attempt to use part-based representations to handle the appearance changes caused by occlusion. Meanwhile, using a re-initialization module [59] is an effective strategy to deal with occlusion, i.e., the re-initialization module is able to re-detect the target after it reappears in the scene. In addition, predicting the motion pattern of the target based on its historical trajectory is also a promising direction worth exploring.

7 MOT Track

The MOT track aims to recover the trajectories of objects in video sequences, which is an important problem in computer vision with many applications, such as surveillance, activity analysis, and sports video analysis. In the VisDrone-2018 challenge, we divide this track into two sub-tracks depending on whether prior detection results in individual frames are used. Specifically, for one sub-track, a submitted algorithm is required to recover the trajectories of objects in video sequences without taking object detection results as input. The evaluation protocol presented in [95] (i.e., the average precision (AP) of trajectories per object class) is used to evaluate the performance of trackers. In contrast, for the second sub-track, prior object detection results in individual frames are provided and the participating algorithm can work on top of the input detections. In the VisDrone-2019 challenge, we merge these two sub-tracks and do not distinguish submitted algorithms according to whether they use object detections in each video frame as input or not. The average precision (AP) of the recovered trajectories [95] is used to evaluate the performance of submitted trackers. Notably, this track uses the same data as the VID track. Specifically, five categories of objects (i.e., pedestrian, car, van, bus, and truck) in video clips are considered in evaluation.

Codename AP MOTA Institutions Contributors and References
VisDrone-2018 Challenge:
Ctrack 16.12 30.80 Centre for Research & Technology Hellas Emmanouil Michail, Konstantinos Avgerinakis, Panagiotis Giannakeris, Stefanos Vrochidis, Ioannis Kompatsiaris [114]
deep-sort_d2 10.47 - Beijing University of Posts and Telecommunications Jianfei Zhao, Yanyun Zhao [79, 127]
FRMOT - 33.10 Universidad Autónoma de Madrid Elena Luna, Diego Ortego, Juan C. San Miguel, José M. Martínez [101]
GOG_EOC - 36.90 Harbin Institute of Technology, University of Chinese Academy of Sciences Hongyang Yu, Guorong Li, Qingming Huang [96]
MAD 7.27 - Xidian University Wei Song, Yuxuan Li, Zhaoliang Pi, Wenhua Zhang [100, 116]
SCTrack - 35.80 University of Missouri-Columbia, U.S. Naval Research Laboratory Noor M. Al-Shakarji, Filiz Bunyak, Guna Seetharaman, Kannappan Palaniappan [2]
TrackCG - 42.60 Karlsruhe Institute of Technology Wei Tian, Zhiming Ma, Martin Lauer [113]
V-IOU - 40.20 Technische Universität Berlin Erik Bochinski, Tobias Senst, Thomas Sikora [10]
VisDrone-2019 Challenge:
DBAI-Tracker 43.94 - DeepBlue Technology (Shanghai) Zhipeng Luo, Yuehan Yao, Zhenyu Xu, Feng Ni, Bing Dong [11, 10, 96, 54]
Flow-Tracker 30.87 - Xi’an Jiaotong University Weiqiang Li, Jiatong Mu, Guizhong Liu [11, 109, 10]
GGDTRACK 23.09 - Axis Communications, Centre for Mathematical Sciences Håkan Ardö, Mikael Nilsson [101, 4]
HMTT 28.67 - Beijing University of Posts and Telecommunications Siyang Pan, Zhihang Tong, Yanyun Zhao [10, 150, 144, 142]
IITD_DeepSort 13.88 - Indian Institute of Information Technology, Indian Institute of Technology Ajit Jadhav, Prerana Mukherjee, Vinay Kaushik, Brejesh Lall [79, 128]
OS-MOT 0.16 - University of Ottawa, Shanghai Jiao Tong University, YUNEEC Aviation Technology, Institute of Information Engineering, Chinese Academy of Sciences, INSKY Lab, Leotail Intelligent Tech Yong Wang, Lu Ding, Robert Laganière, Zhuojin Sun, Chunhui Zhang, Wei Shi [8]
SCTrack 10.09 - University of Technology, University of Missouri-Columbia, U.S. Naval Research Laboratory Noor M. Al-Shakarji, Filiz Bunyak, Guna Seetharaman, Kannappan Palaniappan [101, 2]
SGAN 2.54 - Harbin Institute of Technology, University of Chinese Academy of Sciences Hongyang Yu, Guorong Li, Qingming Huang [3]
T&D-OF 12.37 - Dalian University of Technology Xinyu Zhang, Xin Chen, Shuhao Chen, Chang Liu, Dong Wang, Huchuan Lu [22, 58, 17]
TNT_DRONE 27.32 - University of Washington, Beijing University of Posts and Telecommunications Haotian Zhang, Yanting Zhang, Gaoang Wang, Jenq-Neng Hwang [118]
TrackKITSY 39.19 - Karlsruhe Institute of Technology, Sun Yat-sen University Wei Tian, Jinrong Hu, Yuduo Song, Zhaotang Chen, Long Chen, Martin Lauer [11, 112]
VCLDAN 7.5 - Tsinghua University Zhibin Xiao [111]
TABLE VII: Teams participating in VisDrone-MOT 2018 and 2019 challenges, ordered alphabetically.

7.1 Evaluation Protocol

In the VisDrone-2018 challenge, the MOT track is divided into two sub-tracks depending on whether prior detection results in individual frames are used. For multi-object tracking without detection input, we use the tracking evaluation protocol in [95] to evaluate the performance of the algorithms. That is, each algorithm is required to output a list of bounding boxes with confidence scores and the corresponding identities. We sort the tracklets (formed by the bounding box detections with the same identity) according to the average confidence of their bounding box detections. A tracklet is considered correct if its intersection-over-union (IoU) overlap with the ground truth tracklet is larger than a threshold. Similar to [95], three IoU thresholds are used in evaluation. The performance of an algorithm is evaluated by averaging the mean average precision (mAP) per object class over the different thresholds.
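A rough sketch of this tracklet-level evaluation is given below. The exact matching rules of [95] are not reproduced: the tracklet_iou callable (e.g., the average per-frame IoU between a predicted and a ground truth tracklet) and the default threshold values are assumptions, and per-class averaging into mAP is left outside the function.

    import numpy as np

    def tracklet_ap(pred_tracklets, gt_tracklets, tracklet_iou, thresholds=(0.25, 0.5, 0.75)):
        """pred_tracklets: list of (boxes_per_frame, mean_confidence); gt_tracklets: list of boxes_per_frame."""
        aps = []
        for thr in thresholds:
            order = np.argsort([-conf for _, conf in pred_tracklets])   # high confidence first
            matched, tp, fp = set(), [], []
            for i in order:
                boxes, _ = pred_tracklets[i]
                ious = [tracklet_iou(boxes, g) if j not in matched else 0.0
                        for j, g in enumerate(gt_tracklets)]
                j = int(np.argmax(ious)) if ious else -1
                if j >= 0 and ious[j] >= thr:                 # matched to an unused ground truth tracklet
                    matched.add(j)
                    tp.append(1); fp.append(0)
                else:
                    tp.append(0); fp.append(1)
            tp, fp = np.cumsum(tp), np.cumsum(fp)
            recall = tp / max(len(gt_tracklets), 1)
            prec = tp / np.maximum(tp + fp, 1e-9)
            aps.append(np.trapz(prec, recall))                # area under the precision-recall curve
        return float(np.mean(aps))                            # averaged over the IoU thresholds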

For multi-object tracking with detection input, we follow the evaluation protocol in [87] to evaluate the performance of the algorithms. That is, the average rank over a set of metrics (i.e., MOTA, MOTP, IDF1, FAF, MT, ML, FP, FN, IDS, and FM) is used to rank the algorithms. The MOTA metric combines three error sources: FP, FN and IDS. The MOTP metric is the average dissimilarity between all true positives and the corresponding ground truth targets. The IDF1 metric indicates the ratio of correctly identified detections over the average number of ground truth and computed detections. The FAF metric indicates the average number of false alarms per frame. The FP metric is the total number of tracker outputs that are false alarms, and FN is the total number of targets missed by the tracked trajectories over all frames. The IDS metric is the total number of times that the matched identity of a tracked trajectory changes, while FM is the number of times that trajectories are interrupted. Both the IDS and FM metrics reflect the accuracy of the tracked trajectories. The ML and MT metrics measure the percentage of ground truth trajectories that are tracked for less than and more than a certain fraction of their time spans, respectively.
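For reference, the standard CLEAR-MOT definitions of the two headline metrics mentioned above can be written as

    \mathrm{MOTA} = 1 - \frac{\sum_t \left(\mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDS}_t\right)}{\sum_t g_t},
    \qquad
    \mathrm{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t},

where $g_t$ is the number of ground truth objects in frame $t$, $c_t$ is the number of matched pairs in frame $t$, and $d_{t,i}$ is the dissimilarity of the $i$-th matched pair in frame $t$ (defined, depending on the implementation, as a bounding box distance or as one minus the overlap).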

In the VisDrone-2019 challenge, we do not distinguish submitted algorithms according to whether they use object detection in each video frame as input or not. Similar to the evaluation protocol used in the multi-object tracking without detection input in the VisDrone-2018 challenge, we use the protocol in [95] to evaluate the performance of algorithms.

7.2 Algorithms

VisDrone 2018 challenge.

The multi-object tracking algorithms submitted to this track are shown in Table VII. Ctrack aggregates the predicted events in grouped targets and uses temporal constraints to stitch short tracklets [114]. V-IOU [10] uses the spatial overlap to associate input detections in consecutive frames. GOG_EOC develops a context harmony model to create exchanging object context patches via a Siamese network, and tracks the objects using the algorithm in [96]. SCTrack [2] uses a color correlation cost matrix to maintain object identities. TrackCG [113] achieves the best MOTA score among all trackers using the public input detections. It first estimates the target state using the motion pattern of grouped objects to build short tracklets, and then uses a graph model to generate long trajectories. Two other methods use private input detections, i.e., MAD, which uses YOLOv3 [100] for detection and CFNet [116] for association, and deep-sort_d2, which uses RetinaNet [79] for detection and Deep-SORT [127] for association.

VisDrone 2019 challenge.

The entries received in this track come from different institutes, as shown in Table VII. Most of the submissions are based on the tracking-by-detection framework, i.e., the trackers exploit temporal coherence to associate detections in individual frames to recover the trajectories of targets. First, several submissions use state-of-the-art detectors, such as R-FCN [22], RetinaNet [79], Cascade R-CNN [11], and CenterNet [144], to generate object detections in individual frames. After that, some submitted methods use single object tracking methods, such as KCF [54] and DaSiameseRPN [150], to recover false negatives of the detectors. Some other methods, such as GGDTRACK, Flow-Tracker, OS-MOT, T&D-OF, TrackKITSY, and SGAN, attempt to exploit low-level or mid-level temporal information to improve the tracking performance. The HMTT, IITD_DeepSort, SCTrack, T&D-OF, TNT_DRONE, and VCLDAN methods use metric learning algorithms to compute the similarities between detections in consecutive frames, which is effective under occlusion and missed detections.

VisDrone-dev benchmark.

For comparison, we evaluate the multi-object tracking methods GOG [96], IOUT [10], SORT [127] and MOTDT [17] in this track. Notably, the FPN [78] object detection method is used to generate the input detections in each individual frame.

Method AP AP@0.25 AP@0.50 AP@0.75 AP_car AP_bus AP_trk AP_ped AP_van
VisDrone-2018 challenge:
Ctrack 16.12 22.40 16.26 9.70 27.74 28.45 8.15 7.95 8.31
deep-sort_d2 10.47 17.26 9.40 4.75 29.14 2.38 3.46 7.12 10.25
MAD 7.27 12.72 7.03 2.07 16.23 1.65 2.85 14.16 1.46
VisDrone-2019 challenge:
DBAI-Tracker 43.94 57.32 45.18 29.32 55.13 44.97 42.73 31.01 45.85
TrackKITSY 39.19 48.83 39.36 29.37 54.92 29.05 34.19 36.57 41.20
Flow-Tracker 30.87 41.84 31.00 19.77 48.44 26.19 29.50 18.65 31.56
HMTT 28.67 39.05 27.88 19.08 44.35 30.56 18.75 26.49 23.19
TNT_DRONE 27.32 35.09 26.92 19.94 38.06 22.65 33.79 12.62 29.46
GGDTRACK 23.09 31.01 22.70 15.55 35.45 28.57 11.90 17.20 22.34
IITD_DeepSort 13.88 23.19 12.81 5.64 32.20 8.83 6.61 18.61 3.16
T&D-OF 12.37 17.74 12.94 6.43 23.31 22.02 2.48 9.59 4.44
SCTrack 10.09 14.95 9.41 5.92 18.98 17.86 4.86 5.20 3.58
VCLDAN 7.50 10.75 7.41 4.33 21.63 0.00 4.92 10.94 0.00
VisDrone-dev:
GOG [96] 5.14 11.02 3.25 1.14 13.70 3.09 1.94 3.08 3.87
IOUT [10] 4.34 8.32 3.29 1.40 10.90 2.15 2.53 1.98 4.11
SORT [127] 3.37 5.78 2.82 1.50 8.30 1.04 2.47 0.95 4.06
MOTDT [17] 1.22 2.43 0.92 0.30 0.36 0.00 0.15 5.08 0.49
TABLE VIII: Comparison results of the algorithms on the VisDrone-MOT dataset using the evaluation protocol in [95].
Method MOTA MOTP IDF1 FAF MT ML FP FN IDS FM
VisDrone-2018 challenge:
TrackCG 42.6 74.1 58.0 0.86 323 395 14722 68060 779 3717
V-IOU 40.2 74.9 56.1 0.76 297 514 11838 74027 265 1380
GOG_EOC 36.9 75.8 46.5 0.29 205 589 5445 86399 354 1090
SCTrack 35.8 75.6 45.1 0.39 211 550 7298 85623 798 2042
FRMOT 33.1 73.0 50.8 1.15 254 463 21736 74953 1043 2534
Ctrack 30.8 73.5 51.9 1.95 369 375 36930 62819 1376 2190
VisDrone-dev:
GOG [96] 28.7 76.1 36.4 0.78 346 836 17706 144657 1387 2237
IOUT [10] 28.1 74.7 38.9 1.60 467 670 36158 126549 2393 3829
SORT [127] 14.0 73.2 38.0 3.57 506 545 80845 112954 3629 4838
MOTDT [17] -0.8 68.5 21.6 1.97 87 1196 44548 185453 1437 3609
TABLE IX: Comparison results of the algorithms on the VisDrone-MOT dataset using the CLEAR-MOT evaluation protocol [87].

7.3 Results and Analysis

Results on the test-challenge set.

We report the evaluation results of the trackers in the VisDrone-VDT 2018 [148] and VisDrone-MOT 2019 [126] challenges with the evaluation protocols of [95] and [87] in Tables VIII and IX, respectively. As shown in Table VIII, in the sub-track without prior input detections in the VisDrone-VDT 2018 challenge, Ctrack achieves the best AP score by aggregating the predicted events in grouped targets and stitching the tracks with temporal constraints. In this way, the targets in crowded scenarios can be recovered after occlusion. In the VisDrone-MOT 2019 challenge, the submitted algorithms achieve significant improvements, e.g., DBAI-Tracker improves the top AP score from 16.12 to 43.94. Notably, the top three trackers, i.e., DBAI-Tracker, TrackKITSY and Flow-Tracker, use Cascade R-CNN [11] to generate detections in individual frames and integrate temporal information, e.g., FlowNet [109] and the IoU tracker [10], to perform association. Similarly, HMTT combines CenterNet [144], the IoU tracker [10] and DaSiameseRPN [150] for multiple object tracking, ranking fourth in the challenge.

For the sub-track using the provided input detections, i.e., generated by Faster R-CNN [101], in the VisDrone-VDT 2018 challenge, TrackCG achieves the best MOTA and IDF1 scores. V-IOU achieves slightly lower MOTA and IDF1 scores than TrackCG, but produces the best IDS score of 265. It associates detections based on their spatial overlap, i.e., intersection-over-union, in consecutive frames. We speculate that the overlap based measurement is reliable enough in drone captured videos from high altitude, which do not contain large displacements of objects. GOG_EOC obtains the best FAF, FP and FM scores, i.e., 0.29, 5445, and 1090, using both the detection overlap and the context harmony degree to measure the similarities between detections in consecutive frames. SCTrack designs a color correlation cost matrix to maintain object identities; however, the color information is not reliable enough, resulting in inferior results, i.e., it ranks fourth in terms of MOTA (35.8). FRMOT is an online tracker using the Hungarian algorithm to associate detections, leading to relatively large IDS (1043) and FM (2534) scores.

Results on the test-dev set.

We evaluate multi-object tracking methods on the test-dev set with the evaluation protocols of [95] and [87], shown in Tables VIII and IX, respectively. Notably, FPN [78] is used to generate object detections in individual frames for the sub-track using prior input detections.

GOG [96] and IOUT [10] benefit from the global information of whole sequences and the spatial overlap between detections in consecutive frames, achieving the best tracking results under both evaluation protocols [95, 87]. SORT [127] approximates the inter-frame displacement of each object with a linear constant velocity model that is independent of object categories and camera motion, which significantly degrades its performance. MOTDT [17] computes the similarities between objects using an appearance model trained on other large-scale person re-identification datasets without fine-tuning, leading to inferior accuracy.

7.4 Discussion

Most MOT algorithms formulate the tracking task as a data association problem, which aims to associate object detections in individual frames to generate object trajectories. Thus, the accuracy of object detection in individual frames significantly influences the performance of MOT. Intuitively, integrating object detection and tracking into a unified framework is a promising way to improve the performance. In the following, we discuss two potential research directions to further boost the performance.

Similarity calculation.

For the data association problem, the computation of similarities between detections in individual frames is crucial for the tracking performance. Both appearance and motion information should be considered when computing the similarities. For example, a Siamese network trained offline on the ImageNet VID dataset [105] can be used to extract temporally discriminative features of objects; the Siamese network can be fine-tuned during the tracking process to further improve the accuracy. Meanwhile, several low-level and mid-level motion features, such as KLT and optical flow, are also effective and useful for MOT algorithms.
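A minimal sketch of such a similarity computation is shown below: an appearance term (cosine similarity between embeddings, e.g., from a Siamese network) and a spatial term (IoU between a track's predicted box and a detection) are combined with an assumed weight alpha, and the resulting affinity matrix is solved by bipartite matching.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def associate(track_feats, track_boxes, det_feats, det_boxes, iou_fn, alpha=0.5, min_sim=0.3):
        """Match existing tracks to new detections using appearance + IoU similarity (sketch)."""
        cosine = lambda a, b: float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
        sim = np.zeros((len(track_boxes), len(det_boxes)))
        for i, (tf, tb) in enumerate(zip(track_feats, track_boxes)):
            for j, (df, db) in enumerate(zip(det_feats, det_boxes)):
                sim[i, j] = alpha * cosine(tf, df) + (1 - alpha) * iou_fn(tb, db)
        rows, cols = linear_sum_assignment(-sim)        # Hungarian algorithm, maximizing total similarity
        # discard weak matches; unmatched tracks and detections are handled by the caller
        return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= min_sim]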

Scene understanding

is another effective way to improve the MOT performance. For example, based on a scene understanding module, we can infer the entry and exit ports in the scene. The information about these ports is a strong prior for trackers to distinguish occlusion, termination, or re-appearance of the target. Meanwhile, the tracker is also able to suppress false trajectories based on general knowledge and scene understanding, e.g., vehicles are driven on roads rather than on buildings. In summary, this area is worth further study to improve the MOT performance.

8 Conclusion

We introduce a new large-scale benchmark, VisDrone, to facilitate research on object detection and tracking in drone captured imagery. With a substantial amount of manual annotation effort, a vast collection of object instances is gathered, annotated, and organized to drive the advancement of object detection and tracking algorithms. We place emphasis on capturing images and video clips in real-life environments. Notably, the dataset is recorded over different cities in China with various drone platforms, featuring diverse real-world scenarios. We provide a rich set of annotations, including more than a million annotated object instances along with several important attributes. The VisDrone benchmark is made available to the research community through the project website: www.aiskyeye.com. Nevertheless, the best submissions in the four tracks are still far from satisfactory for real applications.

Acknowledgements

We would like to thank Jiayu Zheng and Tao Peng for valuable and constructive suggestions to improve the quality of this paper.

References

  • [1] A. Adam, E. Rivlin, and I. Shimshoni (2006) Robust fragments-based tracking using the integral histogram. In CVPR, pp. 798–805. Cited by: §6.5.
  • [2] N. M. Al-Shakarji, G. Seetharaman, F. Bunyak, and K. Palaniappan (2017) Robust multi-object tracking with semantic color correlation. In AVSS, pp. 1–7. Cited by: §7.2, TABLE VII.
  • [3] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, F. Li, and S. Savarese (2016) Social LSTM: human trajectory prediction in crowded spaces. In CVPR, pp. 961–971. Cited by: TABLE VII.
  • [4] H. Ardö and M. Nilsson (2019) Multi target tracking by learning from generalized graph differences. CoRR abs/1908.06646. Cited by: TABLE VII.
  • [5] M. Barekatain, M. Martí, H. Shih, S. Murray, K. Nakayama, Y. Matsuo, and H. Prendinger (2017) Okutama-action: an aerial view video dataset for concurrent human action detection. In CVPRWorkshops, pp. 2153–2160. Cited by: TABLE I, §2.2.
  • [6] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. S. Torr (2016) Staple: complementary learners for real-time tracking. In CVPR, pp. 1401–1409. Cited by: §6.3, TABLE VI.
  • [7] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr (2016) Fully-convolutional siamese networks for object tracking. In ECCV, pp. 850–865. Cited by: §6.3, §6.4.
  • [8] D. P. Bertsekas (1992) Auction algorithms for network flow problems: A tutorial introduction. Comp. Opt. and Appl. 1 (1), pp. 7–66. Cited by: TABLE VII.
  • [9] G. Bhat, J. Johnander, M. Danelljan, F. S. Khan, and M. Felsberg (2018) Unveiling the power of deep tracking. In ECCV, pp. 493–509. Cited by: §6.4.
  • [10] E. Bochinski, V. Eiselein, and T. Sikora (2017) High-speed tracking-by-detection without using image information. In AVSS, pp. 1–6. Cited by: §7.2, §7.2, §7.3, §7.3, TABLE VII, TABLE VIII, TABLE IX.
  • [11] Z. Cai and N. Vasconcelos (2018) Cascade R-CNN: delving into high quality object detection. In CVPR, pp. 6154–6162. Cited by: §4.3, §4.3, §4.3, TABLE II, TABLE III, §5.3, §5.4, TABLE IV, §7.2, §7.3, TABLE VII.
  • [12] Z. Cai, L. Wen, Z. Lei, N. Vasconcelos, and S. Z. Li (2014) Robust deformable and occluded object tracking with dynamic graph. TIP 23 (12), pp. 5497–5509. Cited by: §6.5.
  • [13] Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu (2019) GCNet: non-local networks meet squeeze-excitation networks and beyond. CoRR abs/1904.11492. Cited by: §4.3, §4.4, §5.4.
  • [14] L. Cehovin, A. Leonardis, and M. Kristan (2016) Visual object tracking performance measures revisited. TIP 25 (3), pp. 1261–1274. Cited by: §1.
  • [15] B. X. Chen and J. K. Tsotsos (2019) Fast visual object tracking with rotated bounding boxes. CoRR abs/1907.03892. Cited by: §6.5.
  • [16] K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, C. C. Loy, and D. Lin (2019) Hybrid task cascade for instance segmentation. In CVPR, Cited by: §4.3, TABLE II.
  • [17] L. Chen, H. Ai, Z. Zhuang, and C. Shang (2018) Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In ICME, pp. 1–6. Cited by: §7.2, §7.3, TABLE VII, TABLE VIII, TABLE IX.
  • [18] W. Chen, L. Cao, X. Chen, and K. Huang (2017) An equalized global graph model-based approach for multicamera object tracking. TCSVT 27 (11), pp. 2367–2381. Cited by: §2.1.
  • [19] J. Choi, H. J. Chang, T. Fischer, S. Yun, K. Lee, J. Jeong, Y. Demiris, and J. Y. Choi (2018) Context-aware deep feature compression for high-speed visual tracking. In CVPR, pp. 479–488. Cited by: §6.3.
  • [20] J. Choi, H. J. Chang, T. Fischer, S. Yun, K. Lee, J. Jeong, Y. Demiris, and J. Y. Choi (2018) Context-aware deep feature compression for high-speed visual tracking. In CVPR, Cited by: TABLE VI.
  • [21] J. Choi, H. J. Chang, J. Jeong, Y. Demiris, and J. Y. Choi (2016) Visual tracking using attention-modulated disintegration and integration. In CVPR, pp. 4321–4330. Cited by: §6.3.
  • [22] J. Dai, Y. Li, K. He, and J. Sun (2016) R-FCN: object detection via region-based fully convolutional networks. In NeurIPS, pp. 379–387. Cited by: TABLE II, §7.2, TABLE VII.
  • [23] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In ICCV, pp. 764–773. Cited by: §4.3, §4.4, §4.4, §5.4.
  • [24] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg (2017) ECO: efficient convolution operators for tracking. In CVPR, pp. 6931–6939. Cited by: §5.3, §5.5, TABLE IV, §6.3, §6.3.
  • [25] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg (2017) ECO: efficient convolution operators for tracking. In CVPR, pp. 6931–6939. Cited by: §5.3, §5.4, TABLE IV, §6.3, §6.4, TABLE VI.
  • [26] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg (2018) ATOM: accurate tracking by overlap maximization. CoRR abs/1811.07628. Cited by: §6.3, §6.3, §6.4, §6.4, §6.4, §6.4, §6.5, TABLE VI.
  • [27] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg (2014) Accurate scale estimation for robust visual tracking. In BMVC, Cited by: §6.3.
  • [28] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg (2015) Learning spatially regularized correlation filters for visual tracking. In ICCV, pp. 4310–4318. Cited by: §6.3.
  • [29] M. Danelljan, G. Häger, F. S. Khan, and M. Felsberg (2017) Discriminative scale space tracking. TPAMI 39 (8), pp. 1561–1575. Cited by: §6.3.
  • [30] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li (2009) ImageNet: A large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §2.1.
  • [31] P. Dollár, C. Wojek, B. Schiele, and P. Perona (2012) Pedestrian detection: an evaluation of the state of the art. TPAMI 34 (4), pp. 743–761. Cited by: TABLE I, §1, §2.1, §4.4.
  • [32] D. Du, H. Qi, W. Li, L. Wen, Q. Huang, and S. Lyu (2016) Online deformable object tracking based on structure-aware hyper-graph. TIP 25 (8), pp. 3572–3584. Cited by: §2.1, §6.5.
  • [33] D. Du, H. Qi, L. Wen, Q. Tian, Q. Huang, and S. Lyu (2017) Geometric hypergraph learning for visual tracking. TCYB 47 (12), pp. 4182–4195. Cited by: §6.5.
  • [34] D. Du, Y. Qi, H. Yu, Y. Yang, K. Duan, G. Li, W. Zhang, Q. Huang, and Q. Tian (2018) The unmanned aerial vehicle benchmark: object detection and tracking. In ECCV, pp. 375–391. Cited by: TABLE I, §2.2.
  • [35] D. Du, P. Zhu, L. Wen, X. Bian, H. Ling, Q. Hu, and et al. (2019) VisDrone-det2019: the vision meets drone object detection in image challenge results. In ICCV Workshops, Cited by: §4.4.
  • [36] D. Du, P. Zhu, L. Wen, X. Bian, H. Ling, Q. Hu, and et al. (2019) VisDrone-sot2019: the vision meets drone single object tracking challenge results. In ICCV Workshops, Cited by: §6.4.
  • [37] M. Enzweiler and D. M. Gavrila (2009) Monocular pedestrian detection: survey and experiments. TPAMI 31 (12), pp. 2179–2195. Cited by: §2.1.
  • [38] M. F. et al. (2016) The thermal infrared visual object tracking VOT-TIR2016 challenge results. In ECCVWorkshops, pp. 824–849. Cited by: §2.1.
  • [39] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman (2012) The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. Note: http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html Cited by: TABLE I.
  • [40] M. Everingham, S. M. A. Eslami, L. J. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman (2015) The pascal visual object classes challenge: A retrospective. IJCV 111 (1), pp. 98–136. Cited by: §2.1.
  • [41] M. Everingham, L. J. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman (2010) The pascal visual object classes (VOC) challenge. IJCV 88 (2), pp. 303–338. Cited by: §2.1.
  • [42] H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and H. Ling (2018) LaSOT: A high-quality benchmark for large-scale single object tracking. CoRR abs/1809.07845. Cited by: §1, §6.4, §6.5.
  • [43] H. Fan and H. Ling (2017) Parallel tracking and verifying: A framework for real-time and high accuracy visual tracking. In ICCV, pp. 5487–5495. Cited by: §6.3.
  • [44] C. Feichtenhofer, A. Pinz, and A. Zisserman (2017) Detect to track and track to detect. In ICCV, pp. 3057–3065. Cited by: §5.3, §5.4, §5.5, TABLE V.
  • [45] M. Felsberg, A. Berg, G. Häger, J. Ahlberg, M. Kristan, J. Matas, A. Leonardis, L. Cehovin, G. Fernández, T. Vojír, G. Nebehay, and R. P. Pflugfelder (2015) The thermal infrared visual object tracking VOT-TIR2015 challenge results. In ICCVWorkshops, pp. 639–651. Cited by: §2.1.
  • [46] J. Ferryman and A. Shahrokni (2009) PETS2009: dataset and challenge. In AVSS, pp. 1–6. Cited by: §2.1, §2.1.
  • [47] H. K. Galoogahi, A. Fagg, C. Huang, D. Ramanan, and S. Lucey (2017) Need for speed: A benchmark for higher frame rate object tracking. In ICCV, pp. 1134–1143. Cited by: TABLE I, §2.1.
  • [48] H. K. Galoogahi, A. Fagg, and S. Lucey (2017) Learning background-aware correlation filters for visual tracking. In ICCV, pp. 1144–1152. Cited by: §6.3, TABLE VI.
  • [49] A. Geiger, P. Lenz, and R. Urtasun (2012) Are we ready for autonomous driving? the KITTI vision benchmark suite. In CVPR, pp. 3354–3361. Cited by: TABLE I, §1, §2.1, §2.1.
  • [50] Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, and S. Wang (2017) Learning dynamic siamese network for visual object tracking. In ICCV, pp. 1781–1789. Cited by: §6.3, §6.4, §6.5.
  • [51] M. Haris, G. Shakhnarovich, and N. Ukita (2018) Deep back-projection networks for super-resolution. In CVPR, pp. 1664–1673. Cited by: §4.4.
  • [52] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick (2017) Mask R-CNN. In ICCV, pp. 2980–2988. Cited by: §4.4.
  • [53] Z. He, Y. Fan, J. Zhuang, Y. Dong, and H. Bai (2017) Correlation filters with weighted convolution responses. In ICCVWorkshops, pp. 1992–2000. Cited by: TABLE VI.
  • [54] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista (2015) High-speed tracking with kernelized correlation filters. TPAMI 37 (3), pp. 583–596. Cited by: §6.3, TABLE VI, §7.2, TABLE VII.
  • [55] M. Hsieh, Y. Lin, and W. H. Hsu (2017) Drone-based object counting by spatially regularized regional proposal network. In ICCV, Cited by: TABLE I, §1, §2.2.
  • [56] J. Hu, L. Shen, and G. Sun (2018) Squeeze-and-excitation networks. In CVPR, Cited by: §4.3, §4.4, §6.4.
  • [57] L. Huang, X. Zhao, and K. Huang (2018) GOT-10k: A large high-diversity benchmark for generic object tracking in the wild. CoRR abs/1810.11981. Cited by: §6.4, §6.5.
  • [58] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox (2017) FlowNet 2.0: evolution of optical flow estimation with deep networks. In CVPR, pp. 1647–1655. Cited by: TABLE VII.
  • [59] Z. Kalal, K. Mikolajczyk, and J. Matas (2012) Tracking-learning-detection. TPAMI 34 (7), pp. 1409–1422. Cited by: §6.5.
  • [60] V. Kalogeiton, V. Ferrari, and C. Schmid (2016) Analysing domain shift factors between videos and images for object detection. TPAMI 38 (11), pp. 2327–2334. Cited by: §2.1.
  • [61] I. Kalra, M. Singh, S. Nagpal, R. Singh, M. Vatsa, and P. B. Sujit (2019) DroneSURF: benchmark dataset for drone-based face recognition. In FG, pp. 1–7. Cited by: TABLE I.
  • [62] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. P. Pflugfelder, L. Cehovin, T. Vojír, G. Häger, A. Lukezic, G. Fernández, and et al. (2016) The visual object tracking VOT2016 challenge results. In ECCVWorkshops, pp. 777–823. Cited by: TABLE I, §2.1.
  • [63] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. P. Pflugfelder, L. C. Zajc, T. Vojír, G. Bhat, A. Lukezic, A. Eldesokey, G. Fernández, and et al. (2018) The sixth visual object tracking VOT2018 challenge results. In ECCVWorkshops, pp. 3–53. Cited by: §2.1, §6.4, §6.5.
  • [64] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernández, T. Vojír, G. Häger, G. Nebehay, and R. P. Pflugfelder (2015) The visual object tracking VOT2015 challenge results. In ICCVWorkshops, pp. 564–586. Cited by: §2.1.
  • [65] H. Law and J. Deng (2018) CornerNet: detecting objects as paired keypoints. In ECCV, pp. 765–781. Cited by: §4.3, §4.4, TABLE III, §5.3, §5.4, TABLE V.
  • [66] H. Law, Y. Teng, O. Russakovsky, and J. Deng (2019) CornerNet-lite: efficient keypoint based object detection. CoRR abs/1904.08900. Cited by: §5.3, TABLE IV.
  • [67] L. Leal-Taixé, A. Milan, I. D. Reid, S. Roth, and K. Schindler (2015) MOTChallenge 2015: towards a benchmark for multi-target tracking. CoRR abs/1504.01942. Cited by: TABLE I, §1, §2.1.
  • [68] A. Li, M. Li, Y. Wu, M. Yang, and S. Yan (2015) NUS-PRO: a new visual tracking challenge. In TPAMI, pp. 1–15. Cited by: §2.1.
  • [69] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan (2018) SiamRPN++: evolution of siamese visual tracking with very deep networks. In CVPR, Cited by: §5.3, §5.5, TABLE IV, §6.3, §6.4, §6.4, §6.4, §6.5, §6.5, TABLE VI.
  • [70] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu (2018) High performance visual tracking with siamese region proposal network. In CVPR, pp. 8971–8980. Cited by: TABLE VI.
  • [71] F. Li, C. Tian, W. Zuo, L. Zhang, and M. Yang (2018) Learning spatial-temporal regularized correlation filters for visual tracking. In CVPR, pp. 4904–4913. Cited by: §6.3, TABLE VI.
  • [72] S. Li and D. Yeung (2017) Visual object tracking for unmanned aerial vehicles: A benchmark and new motion models. In AAAI, pp. 4140–4146. Cited by: §2.2.
  • [73] Y. Li, Y. Chen, N. Wang, and Z. Zhang (2019) Scale-aware trident networks for object detection. CoRR abs/1901.01892. Cited by: §4.3, TABLE II.
  • [74] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun (2017) Light-head R-CNN: in defense of two-stage object detector. CoRR abs/1711.07264. Cited by: §4.3, TABLE II, TABLE III.
  • [75] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun (2018) DetNet: design backbone for object detection. In ECCV, pp. 339–354. Cited by: §4.3, §4.4, TABLE III.
  • [76] P. Liang, E. Blasch, and H. Ling (2015) Encoding color information for visual tracking: algorithms and benchmark. TIP 24 (12), pp. 5630–5644. Cited by: TABLE I, §2.1.
  • [77] P. Liang, Y. Wu, H. Lu, L. Wang, C. Liao, and H. Ling (2018) Planar object tracking in the wild: A benchmark. In ICRA, pp. 651–658. Cited by: TABLE I.
  • [78] T. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie (2017) Feature pyramid networks for object detection. In CVPR, pp. 936–944. Cited by: §4.3, §4.3, §4.4, §4.4, TABLE II, TABLE III, §5.3, §5.4, §5.4, TABLE V, §7.2, §7.3.
  • [79] T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In ICCV, pp. 2999–3007. Cited by: §4.3, §4.3, §4.3, §4.4, TABLE II, TABLE III, §5.3, §5.4, TABLE IV, §7.2, §7.2, TABLE VII.
  • [80] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In ECCV, pp. 740–755. Cited by: TABLE I, §1, §2.1, §4.2, §4.4, §5.2, §6.4, §6.5.
  • [81] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. E. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In ECCV, pp. 21–37. Cited by: