Computer vision has been attracting increasing attention in recent years due to its wide range of applications, such as transportation surveillance, smart cities, and human-computer interaction. As two fundamental problems in computer vision, object detection and tracking are under extensive investigation. Among the many factors and efforts behind the fast evolution of computer vision techniques, a notable contribution should be attributed to the creation of numerous benchmarks and challenges, such as Caltech, KITTI, ImageNet, and MS COCO for object detection, and OTB, VOT, MOTChallenge, UA-DETRAC, and LaSOT for object tracking.
Drones (or UAVs) equipped with cameras have been rapidly deployed in a wide range of areas, including agriculture, aerial photography, fast delivery, and surveillance. Consequently, automatic understanding of the visual data collected by drones is in high demand, bringing computer vision and drones ever closer together. Despite great progress in general computer vision algorithms, such as detection and tracking, these algorithms are usually not optimal for drone-captured sequences or images, owing to challenges such as large viewpoint and scale changes. Therefore, it is essential to develop and evaluate new vision algorithms for drone-captured visual data. However, as pointed out in [89, 55], studies toward this goal are severely limited by the lack of publicly available large-scale benchmarks or datasets. Some recent efforts [89, 103, 55] have been devoted to constructing datasets captured by drones, focusing on object detection or tracking, but these datasets are still limited in size and covered scenarios due to the difficulties of data collection and annotation. Thorough evaluation of existing or newly developed algorithms thus remains an open problem, and a more general and comprehensive benchmark is desired to further boost video analysis research on drone platforms.
Thus motivated, we have organized two challenge workshops in conjunction with the European Conference on Computer Vision (ECCV) 2018 and the IEEE International Conference on Computer Vision (ICCV) 2019, attracting more than research teams around the world. The challenges focus on object detection and tracking, with four tracks.
Image object detection track (DET). Given a pre-defined set of object classes, e.g., cars and pedestrians, the algorithm is required to detect objects of these classes from individual images taken by drones.
Video object detection track (VID). Similar to DET, the algorithm is required to detect objects of predefined object classes from videos taken by drones.
Single object tracking track (SOT). This track aims to estimate the state of a target, indicated in the first frame, across frames in an online manner.
Multi-object tracking track (MOT). The goal of this track is to track multiple objects, i.e., localize object instances in each video frame and recover their trajectories in video sequences. In the VisDrone-2018 challenge, this track was divided into two sub-tracks: the first allowed algorithms to take the provided object detections in each video frame as input, while the second did not. In the VisDrone-2019 challenge, we merged these two sub-tracks and no longer distinguish submitted algorithms according to whether they use the provided object detections as input.
| Image object detection | scenario | #images | categories | avg. #labels/categories | resolution | occlusion labels | year |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Caltech Pedestrian | driving | | | | | | 2012 |
| KITTI Detection | driving | | | | | | 2012 |
| PASCAL VOC2012 | life | | | | | | 2012 |
| ImageNet Object Detection | life | | | | | | 2013 |
| MS COCO | life | | | | | | 2014 |

| Video object detection | scenario | #frames | categories | avg. #labels/categories | resolution | occlusion labels | year |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ImageNet Video Detection | life | | | | | | 2015 |
| UA-DETRAC Detection | surveillance | | | | | | 2015 |

| Single object tracking | scenarios | #sequences | #frames | year |
| --- | --- | --- | --- | --- |
| POT 210 | planar objects | | | 2018 |

| Multi-object tracking | scenario | #frames | categories | avg. #labels/categories | resolution | occlusion labels | year |
| --- | --- | --- | --- | --- | --- | --- | --- |
| KITTI Tracking | driving | | | | | | 2013 |
| MOTChallenge 2015 | surveillance | | | | | | 2015 |
| UA-DETRAC Tracking | surveillance | | | | | | 2015 |
Notably, in the workshop challenges, we provide a large-scale dataset, which consists of video clips with frames and static images. The data is recorded by various drone-mounted cameras and is diverse in a wide range of aspects, including location (taken in different cities in China), environment (urban and rural regions), objects (e.g., pedestrians, vehicles, and bicycles), and density (sparse and crowded scenes). We select categories of objects of frequent interest in drone applications, such as pedestrians and cars. Altogether, we carefully annotate more than million bounding boxes of object instances from these categories. Moreover, some important attributes, including scene visibility, object category, and occlusion, are provided for better data usage. A detailed comparison of the provided drone datasets with other related benchmark datasets for object detection and tracking is presented in Table I.
In this paper, we focus on the VisDrone challenges in 2018 and 2019, as well as the methods, results, and evaluation protocols of the challenges (http://www.aiskyeye.com). We hope the challenge will largely boost research and development in related fields.
2 Related Work
We briefly discuss some prior work in constructing benchmark object detection and tracking datasets, as well as the related challenges in recent conferences.
2.1 Generic Object Detection and Tracking Datasets
Image object detection datasets.
Several benchmarks have been collected for evaluating object detection algorithms. Enzweiler and Gavrila present the Daimler dataset, captured by a vehicle driving through urban environments. The dataset includes manually annotated pedestrians in video images in the training set, and video images with annotated pedestrians in the testing set. The Caltech dataset consists of approximately hours of Hz video taken from a vehicle driving through regular traffic in an urban environment. It contains frames with a total of annotated bounding boxes of unique pedestrians. The KITTI Detection dataset is designed to evaluate car, pedestrian, and cyclist detection algorithms in autonomous driving scenarios, with training and testing images. Mundhenk et al. create a large dataset for the classification, detection, and counting of cars, which contains unique cars from six different image sets, each covering a different geographical location and produced by a different imager. The UA-DETRAC benchmark [124, 84, 83] provides objects in frames for vehicle detection.
The PASCAL VOC dataset [41, 40] is one of the pioneering works in the generic object detection field, designed to provide a standardized test bed for object detection, image classification, object segmentation, person layout, and action classification. ImageNet [30, 105] follows in the footsteps of the PASCAL VOC dataset by scaling up by more than an order of magnitude in the number of object classes and images, i.e., PASCAL VOC 2012 has object classes and images vs. ILSVRC2012 with object classes and annotated images. Recently, Lin et al. released the MS COCO dataset, containing more than images with million manually segmented object instances. It has object categories with instances on average per category. Notably, it contains object segmentation annotations, which are not available in ImageNet.
Video object detection datasets.
The ILSVRC 2015 challenge opened the “object detection in video” track, which contains a total of snippets for training, snippets for validation, and snippets for testing. The YouTube-Objects dataset is another large-scale dataset for video object detection, which consists of videos with over frames for classes of moving objects. However, only frames are annotated with a bounding box around an object instance. Building on the YouTube-Objects dataset, Kalogeiton et al. further provide instance segmentation annotations (http://calvin.inf.ed.ac.uk/datasets/youtube-objects-dataset/).
Single object tracking datasets.
In recent years, numerous datasets have been developed for single object tracking evaluation. Wu et al. develop a standard platform to evaluate single object tracking algorithms, and later scale up the data size from sequences to sequences. Similarly, Liang et al. collect video sequences for evaluating color-enhanced trackers. To track the progress of the visual tracking field, Kristan et al. [64, 62, 63] have organized the VOT competitions from to , presenting new datasets and evaluation strategies for tracking evaluation. Smeulders et al. present the ALOV300 dataset, which contains video sequences with visual attributes such as long duration, zooming camera, moving camera, and transparency. Li et al. construct a large-scale dataset with video sequences of pedestrians and rigid objects, covering kinds of objects captured from moving cameras. Du et al. design a dataset including annotated video sequences, focusing on deformable object tracking in unconstrained environments. To evaluate tracking algorithms on higher frame rate video sequences, Galoogahi et al. propose a dataset including videos ( frames) recorded by high frame rate cameras ( frames per second) in real-world scenarios. Besides video sequences captured by RGB cameras, Felsberg et al. [45, 38] organized a series of competitions from 2015 to 2017 focusing on visual tracking in thermal video sequences recorded by eight different types of sensors.
Multi-object tracking datasets.
The most widely used multi-object tracking evaluation datasets include PETS09 , KITTI-T , MOTChallenge [67, 87], and UA-DETRAC [124, 84, 83]. The PETS09 dataset mainly focuses on multi-pedestrian detection, tracking, and counting in surveillance scenarios. The KITTI Tracking dataset is designed for object tracking in autonomous driving, and is recorded from a moving vehicle with the viewpoint of the driver. MOT15 and MOT16 aim to provide a unified dataset, platform, and evaluation protocol for multi-object tracking algorithms, including and sequences, respectively. Recently, the UA-DETRAC benchmark [124, 84, 83] was constructed, which contains a total of sequences for tracking multiple vehicles, where the sequences are filmed from a surveillance viewpoint.
Moreover, in some scenarios, a network of cameras is set up to capture multi-view information for multi-view multi-object tracking. The dataset in  is recorded using multiple cameras with fully overlapping views in constrained environments. Other datasets are captured by non-overlapping cameras. For example, Chen et al. collect four datasets, each of which includes to cameras with non-overlapping views in real scenes and simulated environments. Zhang et al. develop a dataset composed of to cameras covering both indoor and outdoor scenes at a university. Ristani et al. organize a challenge and present a large-scale, fully annotated and calibrated dataset, including more than million 1080p video frames taken by cameras with more than identities.
2.2 Drone-based Datasets
To date, only a handful of drone-captured datasets exist in the computer vision field. Hsieh et al. present a dataset for car counting, which consists of images captured in parking lot scenarios from a drone platform, including annotated cars. Robicquet et al. collect several video sequences with a drone platform on campuses, including various types of objects (i.e., pedestrians, bikes, skateboarders, cars, buses, and golf carts), which enables the design of new object tracking and trajectory forecasting algorithms. Barekatain et al. present the Okutama-Action dataset for concurrent human action detection from an aerial view. The dataset includes minute-long fully annotated sequences with action classes. In , the high-resolution UAV123 dataset is presented for single object tracking, which contains aerial video sequences with fully annotated frames, including the bounding boxes of people and their corresponding action labels. Li et al. capture video sequences of high diversity with drone cameras and manually annotate the bounding boxes of objects for single object tracking evaluation. Moreover, Du et al. construct a new UAV benchmark focusing on complex scenarios for three tasks: object detection, single object tracking, and multi-object tracking. In , Rozantsev et al. present two separate datasets for detecting flying objects, i.e., a UAV dataset and an aircraft dataset. The former comprises video sequences at resolution with annotated bounding boxes of objects, acquired by a camera mounted on a drone flying indoors and outdoors. The latter consists of publicly available videos of radio-controlled planes with annotated bounding boxes. Recently, Xia et al. propose a large-scale dataset of aerial images collected from different sensors and platforms to advance object detection research in earth vision.
In contrast to the aforementioned datasets, acquired in constrained scenarios for object tracking, detection, and counting, our VisDrone dataset is captured in various unconstrained scenes and focuses on four core problems in computer vision, i.e., image object detection, video object detection, single object tracking, and multi-object tracking.
2.3 Existing Challenges
The International Workshop on Computer Vision for UAVs (https://sites.google.com/site/uavision2018/) focuses on hardware, software, and algorithmic (co-)optimizations toward state-of-the-art image processing on UAVs. The VOT challenge workshop (http://www.votchallenge.net/vot2019/) provides the tracking community with a precisely defined and repeatable way to compare short-term trackers, as well as a common platform for discussing evaluation and advancements in single-object tracking. The BMTT and BMTT-PETS workshops (https://motchallenge.net/) aim to pave the way for a unified framework toward more meaningful quantification of multi-object tracking. The PASCAL VOC challenge was held for eight years, from 2005 to 2012, aiming to recognize objects from a number of visual object classes in realistic scenes. The ILSVRC challenge was also held for eight years, from 2010 to 2017, and is designed to evaluate algorithms for object detection and image classification at large scale. Compared to the aforementioned challenges, our workshop challenge focuses on object detection and tracking on drones, with the following four tracks: (1) image object detection, (2) video object detection, (3) single object tracking, and (4) multi-object tracking. Our goal is to develop and distribute a new challenging benchmark for real-world problems on drones with new difficulties, e.g., large scale and viewpoint variations and heavy occlusions.
3 VisDrone Overview
A critical basis for effective algorithm evaluation is a comprehensive dataset. To this end, in VisDrone, we systematically collected the large-scale VisDrone benchmark dataset to advance object detection and tracking research on drones. It consists of video clips with frames and additional static images. The videos/images are acquired by various drone platforms, i.e., DJI Mavic and Phantom series (3, 3A, 3SE, 3P, 4, 4A, 4P), covering different scenarios across different cities in China, i.e., Tianjin, Hong Kong, Daqing, Ganzhou, Guangzhou, Jincang, Liuzhou, Nanjing, Shaoxing, Shenyang, Nanyang, Zhangjiakou, Suzhou, and Xuzhou. The dataset covers various weather and lighting conditions, representing diverse scenarios in daily life. The maximal resolutions of the video clips and static images are and , respectively.
The VisDrone benchmark focuses on the following four tasks (see Fig. 1): (1) image object detection, (2) video object detection, (3) single object tracking, and (4) multi-object tracking. We construct a website, www.aiskyeye.com, for accessing the VisDrone dataset and performing evaluation on these four tasks. Notably, for each task, the images/videos in the training, validation, and testing subsets are captured at different locations, but share similar scenarios and attributes. The training subset is used to train algorithms, the validation subset to validate their performance, the test-challenge subset for the workshop competition, and the test-dev subset as the default test set for public evaluation. The manually annotated ground truths of the training and validation subsets are made available to participants, but the ground truths of the testing subsets are withheld to avoid (over)fitting of algorithms.
To participate in our challenge, research teams are required to create accounts using their institutional email addresses. After registration, participants can choose the tasks of interest and submit results specifying the locations or trajectories of objects in the images or videos using the corresponding accounts. We encourage participants to use the provided training data, but also allow additional training data; its use must be indicated at submission time. In the following sections, we describe the data statistics and annotation of the dataset for each track in detail.
4 DET Track
The DET track tackles the problem of localizing multiple object categories in an image. For each image, algorithms are required to locate all object instances (if any) from a predefined set of object categories, e.g., car and pedestrian. That is, we require the detection algorithm to predict the bounding box of each instance of each object class in the image, with a real-valued confidence. We mainly focus on ten object categories in evaluation: pedestrian, person (if a human maintains a standing pose or is walking, we classify it as a pedestrian; otherwise, it is classified as a person), car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle. Some rarely occurring vehicles (e.g., machineshop trucks, forklift trucks, and tankers) are ignored in evaluation. The performance of algorithms is evaluated by the average precision (AP) across different object categories and intersection over union (IoU) thresholds.
4.1 Data Collection and Annotation
The DET dataset consists of images in unconstrained challenging scenes, including images in the training subset, in the validation subset, in the test-challenge subset, and in the test-dev subset. We plot the number of objects per image vs. the percentage of images to show the distribution of the number of objects per image in Fig. 2, and the number of objects in different object categories with different occlusion degrees in Fig. 3. Notably, the large variation in the number of objects per image and the class imbalance issue significantly challenge the performance of detection algorithms. For example, as shown in Fig. 2, the minimal and maximal numbers of objects per image in the test-challenge subset are and , and the number of awning-tricycle instances is far smaller than the number of car instances.
In this track, we focus on the people and vehicles of daily life, and define the object categories of interest for evaluation as pedestrian, person (if a human maintains a standing pose or is walking, we classify it as a pedestrian; otherwise, it is classified as a person), car, van, bus, truck, motor, bicycle, awning-tricycle, and tricycle. Some rarely occurring vehicles (e.g., machineshop trucks, forklift trucks, and tankers) are ignored in evaluation. We manually annotate the bounding boxes of the different categories of objects in each image, after which cross-checking is conducted to ensure annotation quality. In addition, we provide two further kinds of useful annotations: occlusion and truncation ratios. Specifically, we use the fraction of an object being occluded to define the occlusion ratio, and define three degrees of occlusion: no occlusion (occlusion ratio ), partial occlusion (occlusion ratio ), and heavy occlusion (occlusion ratio ). The truncation ratio indicates the degree to which an object extends outside the image frame. If an object is not fully captured within a frame, we annotate the bounding box across the frame boundary and estimate the truncation ratio based on the region outside the image frame. It is worth mentioning that a target is skipped during evaluation if its truncation ratio is larger than .
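The two ratios above can be computed mechanically from box geometry. The following is a minimal sketch, not the benchmark's annotation code: the helper names are ours, and since the exact occlusion thresholds are not reproduced here, the cut-offs in `occlusion_degree` are assumed values for illustration only.

```python
def truncation_ratio(bbox, img_w, img_h):
    """Fraction of a box's area lying outside the image frame.

    bbox is (x, y, w, h) with (x, y) the top-left corner; the box may
    extend past the image boundary, as in the annotation scheme above.
    """
    x, y, w, h = bbox
    # Intersect the box with the image frame.
    ix = max(0.0, min(x + w, img_w) - max(0.0, x))
    iy = max(0.0, min(y + h, img_h) - max(0.0, y))
    inside = ix * iy
    total = w * h
    return 1.0 - inside / total


def occlusion_degree(occlusion_fraction):
    """Map an occlusion fraction to the three annotation levels.

    The 0.0 and 0.5 cut-offs below are illustrative assumptions, not
    the benchmark's official thresholds.
    """
    if occlusion_fraction == 0.0:
        return "no occlusion"
    elif occlusion_fraction <= 0.5:
        return "partial occlusion"
    return "heavy occlusion"
```

For example, a 20-pixel-wide box whose right half lies outside a 100x100 frame gets a truncation ratio of 0.5.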
4.2 Evaluation Protocol
For the DET track, we require each evaluated algorithm to output a list of detected bounding boxes with confidence scores for each test image. Following the evaluation protocol of MS COCO , we use the AP, AP50, AP75, AR1, AR10, AR100, and AR500 scores to evaluate the performance of detection algorithms. These criteria penalize missing detections as well as false alarms. Specifically, AP is computed by averaging over all intersection over union (IoU) thresholds (i.e., in the range [0.50, 0.95] with a uniform step size of 0.05) of all categories, and is used as the primary metric for ranking algorithms. AP50 and AP75 are computed at the single IoU thresholds 0.5 and 0.75 over all categories, respectively. The AR1, AR10, AR100, and AR500 scores are the maximum recalls given 1, 10, 100, and 500 detections per image, averaged over all categories and IoU thresholds. Please refer to  for more details.
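To make the protocol concrete, the sketch below shows a minimal single-class version of the IoU and AP computations described above. The function names and the greedy score-ordered matching are our own simplification of COCO-style evaluation, not the official evaluation code.

```python
import numpy as np


def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def average_precision(dets, gts, iou_thr):
    """AP for one class at one IoU threshold.

    dets: list of (score, image_id, box); gts: dict image_id -> list of
    ground-truth boxes. Detections are matched greedily in score order,
    each ground truth being matched at most once.
    """
    n_gt = sum(len(v) for v in gts.values())
    matched = {k: [False] * len(v) for k, v in gts.items()}
    tps = []
    for score, img, box in sorted(dets, key=lambda d: -d[0]):
        best, best_j = 0.0, -1
        for j, g in enumerate(gts.get(img, [])):
            o = iou(box, g)
            if o >= iou_thr and o > best and not matched[img][j]:
                best, best_j = o, j
        if best_j >= 0:
            matched[img][best_j] = True
            tps.append(1)
        else:
            tps.append(0)
    tp = np.cumsum(tps)
    fp = np.cumsum([1 - t for t in tps])
    recall = tp / max(n_gt, 1)
    precision = tp / np.maximum(tp + fp, 1)
    # Area under the precision-recall curve (all-point summation).
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += p * (r - prev_r)
        prev_r = r
    return ap
```

The primary AP metric then averages this quantity over the IoU thresholds 0.50:0.05:0.95 and over the ten object categories.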
| Codename | AP | Institutions | Contributions and References |
| --- | --- | --- | --- |
| AHOD | 12.77 | Tsinghua University | Jianqiang Wang, Yali Li, Shengjin Wang |
| CERTH-ODI | 5.04 | ITI Technical College | Emmanouil Michail, Konstantinos Avgerinakis, Panagiotis Giannakeris, Stefanos Vrochidis, Ioannis Kompatsiaris |
| CFE-SSDv2 | 26.48 | Peking University | Qijie Zhao, Feng Ni, Yongtao Wang |
| DDFPN | 21.05 | Tianjin University | Liyu Lu |
| DE-FPN | 27.10 | South China University of Technology | Jingkai Zhou, Yi Luo, Hu Lin, Qiong Liu |
| DFS | 16.73 | Sun Yat-sen University | Ke Bo |
| DPNet | 30.92 | University of Electronic Science and Technology of China | HongLiang Li, Qishang Cheng, Wei Li, Xiaoyu Chen, Heqian Qiu, Zichen Song |
| Faster R-CNN+ | 9.67 | Shandong University | Tiaojio Lee, Yue Fan, Han Deng, Lin Ma, Wei Zhang |
| Faster R-CNN2 | 21.34 | Xidian University | Fan Zhang |
| Faster R-CNN3 | 3.65 | Northwestern Polytechnical University | Yiling Liu, Ying Li |
| FPN+ | 13.32 | Texas A&M University, IBM | Karthik Suresh, Hongyu Xu, Nitin Bansal, Chase Brown, Yunchao Wei, Zhangyang Wang, Honghui Shi |
| FPN2 | 16.15 | Chongqing University | Zhenwei He, Lei Zhang |
| FPN3 | 13.94 | Nanjing University of Science and Technology | Chengzheng Li, Zhen Cui |
| HAL-Retina-Net | 31.88 | Tsinghua University | Yali Li, Zhaoyue Xia, Shengjin Wang |
| IITH DODO | 14.04 | IIT Hyderabad | Nehal Mamgain, Naveen Kumar Vedurupaka, K. J. Joseph, Vineeth N. Balasubramanian |
| JNU_Faster RCNN | 8.72 | Jiangnan University | Haipeng Zhang |
| Keras-RetinaNet | 7.72 | Xidian University | Qiuchen Sun, Sheng Jiang |
| L-H RCNN+ | 21.34 | Xidian University | Li Yang, Qian Wang, Lin Cheng, Shubo Wei |
| MFaster-RCNN | 18.08 | Beijing University of Posts and Telecommunications | Wenrui He, Feng Zhu |
| MMF | 7.54 | Xiamen University | Yuqin Zhang, Weikun Wu, Zhiyao Guo, Minyu Huang [101, 100] |
| MMN | 10.40 | Ocean University of China | Xin Sun |
| MSCNN | 2.89 | National University of Defense Technology | Dongdong Li, Yangliu Kuai, Hao Liu, Zhipeng Deng, Juanping Zhao |
| MSYOLO | 16.89 | Xidian University | Haoran Wang, Zexin Wang, Ke Wang, Xiufang Li |
| RDMS | 22.68 | Fraunhofer IOSB | Oliver Acatay, Lars Sommer, Arne Schumann |
| RefineDet+ | 21.07 | University of Chinese Academy of Sciences | Kaiwen Duan, Honggang Qi, Qingming Huang |
| RetinaNet2 | 5.21 | Xidian University | Li Yang, Qian Wang, Lin Cheng, Shubo Wei |
| R-SSRN | 9.49 | Xidian University | Wenzhe Yang, Jianxiu Yang |
| SOD | 8.27 | Shanghai Jiao Tong University, University of Ottawa | Lu Ding, Yong Wang, Chen Qian, Robert Laganière, Xinbin Luo |
| SODLSY | | National Laboratory of Pattern Recognition | Sujuan Wang, Yifan Zhang, Jian Cheng |
| YOLO-R-CNN | 12.06 | University of Kansas | Wenchi Ma, Yuanwei Wu, Usman Sajid, Guanghui Wang [101, 100] |
| YOLOv3+ | 15.26 | Xidian University | Siwei Wang, Xintao Lian |
| YOLOv3++ | 10.25 | University of Kansas | Yuanwei Wu, Wenchi Ma, Usman Sajid, Guanghui Wang |
| YOLOv3_DP | 20.03 | Xidian University | Qiuchen Sun, Sheng Jiang |
| ACM-OD | 29.13 | SK T-Brain | Sungeun Hong, Sungil Kang, Donghyeon Cho |
| Airia-GA-Cascade | 25.99 | Institute of Automation, Chinese Academy of Sciences | Yu Zhu, Qiang Chen |
| BetterFPN | 28.55 | ShanghaiTech University | Junhao Hu, Lei Jin |
| Cascade R-CNN+ | 17.67 | Fraunhofer IOSB | Jonas Meier, Lars Sommer, Lucas Steinmann, Arne Schumann |
| Cascade R-CNN++ | 18.33 | University of Hong Kong | Haocheng Han, Jiaqi Fan |
| CenterNet | 26.03 | National University of Singapore, Pensees.ai, Xidian University | Yanchao Li, Zhikang Wang, Yu Heng Toh, Furui Bai, Jane Shen |
| CenterNet-Hourglass | 22.36 | Harbin Institute of Technology, Institute of Automation, Chinese Academy of Sciences | Da Yu, Lianghua Huang, Xin Zhao, Kaiqi Huang |
| CNAnet | 26.35 | Chongqing University | Keyang Wang, Lei Zhang |
| CN-DhVaSa | 27.83 | Siemens Technology and Services Private Limited | Dheeraj Reddy Pailla, Varghese Alex Kollerathu, Sai Saketh Chennamsetty |
| ConstraintNet | 16.09 | Xidian University | Dening Zeng, Di Li |
| DA-RetinaNet | 17.05 | Nanjing University of Posts and Telecommunications | Jingjing Xu, Dechun Cong |
| DBCL | 16.78 | Snowcloud.ai | Wei Dai, Weiyang Wang |
| DCRCNN | 17.79 | BTS Digital | Almaz Zinollayev, Anuar Askergaliyev |
| DPNet-ensemble | 29.62 | University of Electronic Science and Technology of China | Qishang Cheng, Heqian Qiu, Zichen Song, Hongliang Li |
| DPN | 25.09 | Institute of Automation, Chinese Academy of Sciences | Nuo Xu, Xin Zhang, Binjie Mao, Chunlei Huo, Chunhong Pan |
| EHR-RetinaNet | 26.46 | Hanyang University | Jaekyum Kim, Byeongwon Lee, Chunfei Ma, Jun Won Choi, Seungji Yang |
| EnDet | 17.81 | Beijing Institute of Technology | Pengyi Zhang, Yunxin Zhong [100, 101] |
| ERCNNs | 20.45 | Kakao Brain | Jihoon Lee, Ildoo Kim [11, 101] |
| FS-Retinanet | 26.31 | Beijing Institute of Technology, Samsung Stanford | Ziming Liu, Jing Ge, Tong Wu, Lin Sun, Guangyu Gao |
| GravityNet | 25.66 | University of Glasgow | Toh Yu Heng, Harry Nguyen |
| HRDet+ | 28.39 | South China University of Technology | Jingkai Zhou, Weida Qin, Qiong Liu, Haitao Xiong |
| HTC-drone | 22.61 | Queen Mary University of London | Xindi Zhang |
| Libra-HBR | 25.57 | Zhejiang University | Chunfang Deng, Shuting He, Qinghong Zeng, Zhizhao Duan, Bolun Zhang [106, 94, 11] |
| MOD-RETINANET | 16.96 | Harman | Aashish Kumar, George Jose, Srinivas S S Kruthiventi |
| MSCRDet | 25.13 | Dalian University of Technology | Xin Chen, Chang Liu, Shuhao Chen, Xinyu Zhang, Dong Wang, Huchuan Lu |
| ODAC | 17.42 | Sun Yat-sen University | Junyi Zhang, Junying Huang, Xuankun Chen, Dongyu Zhang |
| retinaplus | 20.57 | Northwestern Polytechnical University | Zikai Zhang, Peng Wang |
| RRNet | 29.13 | Ocean University of China | Changrui Chen, Yu Zhang, Qingxuan Lv, Xiaorui Wang, Shuo Wei, Xin Sun |
| SAMFR-Cascade RCNN | 20.18 | Xidian University | Haoran Wang, Zexin Wang, Meixia Jia, Aijin Li, Tuo Feng |
| S+D | 28.59 | Harbin Institute of Technology | Yifu Chen [136, 11, 110] |
| SGE-Cascade R-CNN | 27.33 | Xi’an Jiaotong University | Xudong Wei, Hao Qi, Wanqi Li, Guizhong Liu |
| TridentNet | 22.51 | Huazhong University of Science and Technology | Xuzhang Zhang |
| TSEN | 23.83 | Nanjing University of Science and Technology | Zhifan Zhu, Zechao Li [101, 120, 94] |
VisDrone 2018 challenge.
Object detection algorithms from different institutes were submitted to this track. We present the results and team information in Table II. As shown in Table II, several methods are based on the Faster R-CNN algorithm , such as CERTH-ODI, DPNet, Faster R-CNN+, Faster R-CNN2, Faster R-CNN3, IITH DODO, JNU_Faster R-CNN, MFaster R-CNN, and MMN. Some algorithms construct feature pyramids to build high-level semantic feature maps at all scales, including DE-FPN, DFS, FPN+, FPN2, FPN3, and DDFPN. Several detectors, i.e., MSYOLO, SODLSY, YOLOv3+, YOLOv3++, and YOLOv3_DP, are improved from the one-stage YOLOv3 method . MMF and YOLO-R-CNN fuse multiple models from the Faster R-CNN and YOLOv3 methods. Keras-RetinaNet, RetinaNet2, and HAL-Retina-Net are based on RetinaNet . RDMS, RefineDet+, and R-SSRN are based on the RefineDet method . The top accuracy is achieved by the HAL-Retina-Net method, i.e., AP, which uses the SE module  and downsampling-upsampling operations  to learn both channel and spatial attention.
VisDrone 2019 challenge.
We received detection methods from different institutes in this track, shown in Table II. Several methods are improved from Cascade R-CNN , i.e., Airia-GA-Cascade, Cascade R-CNN+, Cascade R-CNN++, DCRCNN, DPN, DPNet-ensemble, MSCRDet, SAMFR-Cascade RCNN, and SGE-Cascade R-CNN. Several detection methods, i.e., CenterNet, CenterNet-Hourglass, CN-DhVaSa, ConstraintNet, GravityNet, and RRNet, are based on the anchor-free CenterNet method . Other detection methods, i.e., DA-RetinaNet, EHR-RetinaNet, FS-Retinanet, MOD-RETINANET, and retinaplus, are improved from the anchor-based RetinaNet method . ACM-OD, BetterFPN, and ODAC construct multi-scale feature pyramids using FPN . CNAnet designs a convolution neighbor aggregation mechanism for detection. HRDet+ is improved from HRDet , which connects convolutions from high to low resolutions in parallel to generate discriminative high-resolution representations. TridentNet  aims to generate scale-specific features using a parallel multi-branch architecture.
Some other methods use an ensemble mechanism to improve performance. DPNet-ensemble achieves the top accuracy with AP, ensembling two object detectors based on Cascade R-CNN  using ResNet-50 and ResNet-101 as feature extractors, with a global context module  and deformable convolution . EnDet combines the results of YOLOv3  and Faster R-CNN . TSEN ensembles three two-stage methods, including Faster R-CNN , Guided Anchoring , and Libra R-CNN . ERCNNs combines the results of Cascade R-CNN  and Faster R-CNN  with different kinds of backbones. Libra-HBR ensembles the improved SNIPER , Libra R-CNN , and Cascade R-CNN .
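As a rough illustration of the box-level ensembling these entries describe, one simple scheme pools the detections of several detectors and suppresses duplicates with standard non-maximum suppression (NMS). This is only a sketch of the general idea, not any submission's actual fusion procedure; `nms_merge` is a hypothetical helper.

```python
import numpy as np


def nms_merge(det_lists, iou_thr=0.5):
    """Merge detections (x1, y1, x2, y2, score) from several detectors
    by pooling them and applying standard greedy NMS."""
    boxes = np.array([d for dets in det_lists for d in dets], dtype=float)
    if len(boxes) == 0:
        return boxes
    order = boxes[:, 4].argsort()[::-1]  # indices sorted by score, descending
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the current top-scoring box with the remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # Drop boxes that overlap the kept box too strongly.
        order = order[1:][iou <= iou_thr]
    return boxes[keep]
```

In practice, submissions tend to use more elaborate fusion, e.g., per-model score calibration before merging, but pooled NMS conveys the basic mechanism.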
To further improve accuracy, some methods jointly predict the masks and bounding boxes of objects. For example, DBCL  uses the bounding box annotations to train a segmentation model to produce accurate results. HTC-drone improves the hybrid task cascade algorithm  using an instance segmentation cascade. The S+D method combines the segmentation algorithm DeepLab  with the detection module of HRDet .
This benchmark is designed for public evaluation. State-of-the-art object detection methods are evaluated, i.e., FPN , RetinaNet , Light-RCNN , RefineDet , DetNet , Cascade R-CNN , and CornerNet , as shown in Table III.
| Method | AP | AP50 | AP75 | AR1 | AR10 | AR100 | AR500 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Cascade R-CNN | 21.80 | 37.84 | 22.56 | 0.28 | 3.55 | 29.15 | 30.09 |
4.4 Results and Analysis
Results on the test-challenge set.
The top object detectors in the VisDrone-DET 2018  and 2019  challenges are presented in Table III. In contrast to existing object detection datasets, e.g., MS COCO , Caltech , and UA-DETRAC , one of the most challenging issues in the VisDrone-DET dataset is the extremely small scale of objects.
As shown in Table III, we find that HAL-Retina-Net and DPNet are the only two methods achieving more than AP in the VisDrone-DET 2018 challenge. Specifically, HAL-Retina-Net uses the Squeeze-and-Excitation  and downsampling-upsampling  modules to learn both channel and spatial attention on multi-scale features. To detect small objects, it removes the higher convolutional layers in the feature pyramid. The second-best detector, DPNet, uses Feature Pyramid Networks (FPN)  to extract multi-scale features and an ensemble mechanism to combine three detectors with different backbones, i.e., ResNet-50, ResNet-101, and ResNeXt. Similarly, DE-FPN and CFE-SSDv2 also employ multi-scale features, ranking in third and fourth place with and AP scores, respectively. RDMS trains variants of RefineDet , i.e., three using SEResNeXt-50 and one using ResNet-50 as the backbone network. Moreover, DDFPN introduces a deep back-projection super-resolution network to upsample the image, using a deformable FPN architecture . Notably, most of the submitted methods use a multi-scale testing strategy in evaluation, which is effective in improving performance.
In the VisDrone-DET 2019 challenge, DPNet-ensemble achieves the best AP score. It uses the global context module  to integrate context information and deformable convolution  to enhance the transformation modeling capability of the detector. RRNet and ACM-OD tie for second place in AP score. RRNet improves on  by integrating a re-regression module, formed by the RoIAlign module  and several convolution layers. ACM-OD introduces an active learning strategy, combined with data augmentation, for better performance.
In summary, as shown in Table III, although the top detector DPNet-ensemble in the VisDrone-DET 2019 challenge is slightly inferior to the top detector HAL-Retina-Net in the VisDrone-DET 2018 challenge in terms of AP score, the average AP score of the top methods in the VisDrone-DET 2019 challenge is greatly improved over that in the VisDrone-DET 2018 challenge. However, the top accuracy on this dataset, achieved by HAL-Retina-Net in the VisDrone-DET 2018 challenge, remains low, which indicates the difficulty of the collected dataset and the urgent need to develop more robust methods for real-world applications.
Results on the test-dev set.
The top-performing detector uses the Hourglass-104 backbone for feature extraction. In contrast to FPN  and RetinaNet , which rely on backbones designed for the image classification task with extra downsampling stages to handle objects at various scales, DetNet  re-designs the backbone network for object detection, maintaining the spatial resolution while enlarging the receptive field. Meanwhile, RefineDet  with the VGG-16 backbone performs better than RetinaNet  with the ResNet-101 backbone in terms of AP score. This is because RefineDet  uses an object detection module to regress the locations and sizes of objects based on the coarsely adjusted anchors from its anchor refinement module.
Captured by the cameras equipped on drones, the VisDrone-DET dataset is extremely challenging due to scale variation, occlusion, and class imbalance. Compared to traditional object detection datasets, there are more issues worth exploring in drone captured visual data.
Annotation and evaluation protocol.
As shown in Fig. 4, there are often groups of objects heavily occluded in drone captured visual data (see the orange bounding boxes of bicycles). If we use Non-maximum Suppression (NMS) to suppress duplicate detections, the majority of true positive objects will inevitably be removed. In some real applications, it is unnecessary and impractical to locate each individual object in a crowd. Thus, it is more reasonable to use a large bounding box with a count number to represent a group of objects of the same category (see the white bounding box of bicycles). Meanwhile, if we adopt this new annotation scheme, we need to redesign the evaluation metric, i.e., both localization and counting accuracy should be considered in evaluation.
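Such a joint localization-and-counting metric could be sketched as follows. This is a toy illustration only: the function names, the linear combination, and the `alpha` weight are illustrative assumptions, not part of the VisDrone protocol.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x, y, w, h)."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def group_score(pred_box, pred_count, gt_box, gt_count, alpha=0.5):
    """Combine localization (IoU of the group box) with counting accuracy.

    Counting accuracy is 1 minus the relative count error, clipped to
    [0, 1]; `alpha` trades off the two terms. Both choices are
    illustrative, not an official metric.
    """
    loc = box_iou(pred_box, gt_box)
    cnt = max(0.0, 1.0 - abs(pred_count - gt_count) / max(gt_count, 1))
    return alpha * loc + (1 - alpha) * cnt
```

A perfect group box with a correct count yields a score of 1, while localization and counting errors each degrade the score smoothly.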
Current object detection methods use bounding boxes to indicate object instances, i.e., a 4-tuple (x, y, w, h), where x and y are the coordinates of the bounding box’s top-left corner, and w and h are its width and height. As shown in Fig. 4, it is difficult to predict the location and size of the pedestrian (see the yellow bounding box) due to occlusion and non-rigid deformation of the human body. A possible way to mitigate this issue is to integrate coarse segmentation into object detection, which might be effective in removing the disturbance of the background area enclosed in the bounding boxes of non-rigid objects, such as persons and bicycles; see Fig. 4. In summary, this interesting problem is still far from solved and worth exploring.
5 VID Track
The VID track aims to locate object instances from a pre-defined set of categories in video sequences. That is, given a series of video clips, the algorithms are required to produce a set of bounding boxes for each object instance in each video frame (if any), with real-valued confidences. In contrast to the DET track, which focuses on object detection in individual images, this track deals with detecting object instances in video clips, which exhibit temporal consistency across consecutive frames. Five categories of objects are considered in this track, i.e., pedestrian, car, van, bus, and truck. Similar to the DET track, some rarely occurring special vehicles (e.g., machineshop trucks, forklift trucks, and tankers) are ignored in evaluation. The AP score of algorithm predictions in individual frames, computed across different object categories and IoU thresholds, is used to evaluate the quality of the results.
5.1 Data Collection and Annotation
We provide challenging video clips in the VID track, split into training, validation, test-challenge, and test-dev sets. To clearly describe the data distribution, we plot the number of objects per frame vs. the percentage of frames in Fig. 5, and the number of objects in each object category in Fig. 6. As shown in Fig. 5, the class imbalance issue is extremely severe in the VID and MOT datasets, challenging the performance of algorithms; for example, in the training set, the number of car trajectories greatly exceeds that of the other categories. Meanwhile, as shown in Fig. 5, the length of object trajectories varies dramatically, requiring the tracking algorithms to perform well in both short-term and long-term cases.
We manually annotate five categories of objects in each video clip, i.e., pedestrian, car, van, bus, and truck, and conduct the cross-checking to ensure the annotation quality. Similar to the DET track, we also provide the annotations of occlusion and truncation ratios of each object and ignored regions in each video frame. We present the annotation exemplars in the second row of Fig. 1.
5.2 Evaluation Protocol
For the VID track, each evaluated algorithm is required to generate a list of bounding box detections with confidence scores in each video frame. Motivated by the evaluation protocols in MS COCO  and ILSVRC , we use AP scores at multiple IoU thresholds and AR scores at multiple detection limits to evaluate the results of video object detection algorithms, similar to the DET track. Notably, the overall AP score is used as the primary metric for ranking methods. Please see [80, 95] for more details.
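A minimal sketch of how AP is computed from ranked detections, assuming the matching of detections to ground truth at a chosen IoU threshold has already been done. The interpolated-envelope integration follows the common COCO/VOC style; all names are illustrative.

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP as area under the precision-recall curve.

    `scores`: confidence of each detection; `is_tp`: whether each
    detection matched a previously unmatched ground-truth box at the
    chosen IoU threshold; `num_gt`: number of ground-truth boxes.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    tp_cum, fp_cum = np.cumsum(tp), np.cumsum(fp)
    recall = tp_cum / max(num_gt, 1)
    precision = tp_cum / (tp_cum + fp_cum)
    # Integrate with the interpolated precision envelope: at each recall
    # level, take the maximum precision achieved at that recall or above.
    ap, prev_r = 0.0, 0.0
    for r in recall:
        ap += (r - prev_r) * precision[recall >= r].max()
        prev_r = r
    return ap
```

The full benchmark score would average this quantity over object categories and IoU thresholds.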
|Codename||AP||Institutions||Contributions and References|
|CERTH-ODV||9.10||Centre for Research & Technology Hellas||Emmanouil Michail, Konstantinos Avgerinakis, Panagiotis Giannakeris, Stefanos Vrochidis, Ioannis Kompatsiaris |
|CFE-SSDv2||21.57||Peking University||Qijie Zhao, Feng Ni, Yongtao Wang |
|EODST||16.54||Xidian University||Zhaoliang Pi, Yinan Wu, Mengkun Liu [81, 25]|
|FGFA+||16.00||Xidian University||Jie Gao, Yidong Bai, Gege Zhang, Dan Wang, Qinghua Ma |
|RD||14.95||Fraunhofer IOSB, Karlsruhe Institute of Technology||Oliver Acatay, Lars Sommer, Arne Schumann |
|RetinaNet_s||8.63||Beijing University of Posts and Telecommunications||Jianfei Zhao, Yanyun Zhao |
|AFSRNet||24.77||Beijing Institute of Technology, Samsung Inc||Ziming Liu, Jing Ge, Tong Wu, Lin Sun, Guangyu Gao [79, 145]|
|CN-DhVaSa||21.58||Siemens Technology and Services Private Limited||Dheeraj Reddy Pailla, Varghese Alex Kollerathu, Sai Saketh Chennamsetty |
|CornerNet-lite-FS||12.65||Ocean University of China||Xin Sun, Hongwei Xv, Meng Zhang, Zihe Dong, Lijun Du |
|DBAI-Det||29.22||DeepBlue Technology||Zhipeng Luo, Feng Ni, Bing Dong, Yuehan Yao, Zhenyu Xu |
|DetKITSY||20.43||Karlsruhe Institute of Technology, Sun Yat-sen University, VIPioneers (Huituo) Inc||Wei Tian, Jinrong Hu, Yuduo Song, Zhaotang Chen, Long Chen, Martin Lauer |
|DM2Det||13.52||KARI, KAIST||SungTae Moon, Dongoo Lee, Yongwoo Kim, SungHyun Moon |
|EODST++||18.73||Xidian University||Zhaoliang Pi, Yingping Li, Xier Chen, Yanchao Lian, Yinan Wu [81, 115, 24, 69]|
|FT||9.15||Northwestern Polytechnical University||Yunfeng Zhang, Yiwen Wang, Ying Li |
|FRFPN||16.50||Nanjing University of Science and Technology||Zhifan Zhu, Zechao Li [101, 151]|
|HRDet+||23.03||South China University of Technology||Jingkai Zhu, Weida Qin, Qiong Liu, Haitao Xiong |
|Libra-HBR||18.29||Zhejiang University||Chunfang Deng, Qinghong Zeng, Zhizhao Duan, Bolun Zhang [106, 94, 11]|
|Sniper+||18.16||Xi’an Jiaotong University||Xingjie Zhao, Ting Sun, Guizhong Liu |
|VCL-CRCNN||21.61||Tsinghua University||Zhibin Xiao |
VisDrone 2018 challenge.
The entries received in the VID track of the VisDrone-2018 challenge are shown in Table IV. Four methods are directly derived from image object detectors, i.e., CERTH-ODV, CFE-SSDv2, RetinaNet_s, and RD. The EODST method is constructed based on SSD  and uses the ECO tracker  to exploit temporal coherence. FGFA+ is modified from the video object detection framework  by enhancing the contrast and brightness of frames. CFE-SSDv2 achieves the top AP score, using a comprehensive feature enhancement module to enhance the features of small objects.
VisDrone 2019 challenge.
As presented in Table IV, the video detection methods submitted in this track are similar in spirit to those in the VisDrone-VID 2018 challenge: the majority of submissions are directly derived from object detectors for static images. For instance, DBAI-Det, DetKITSY, and VCL-CRCNN are based on Cascade R-CNN . Libra-HBR combines the improved SNIPER , Libra R-CNN , and Cascade R-CNN . CN-DhVaSa and CornerNet-lite-FS are based on the anchor-free methods CenterNet  and CornerNet , respectively. AFSRNet integrates the feature selected anchor-free head (FSAF)  into RetinaNet  to improve accuracy. FRFPN is derived from Faster R-CNN  with data augmentation . EODST++ improves the EODST method from the VisDrone-VID 2018 challenge, using SSD  and FCOS  for detection in individual frames, and ECO  and SiamRPN++  to track objects and recall false negatives in detection. FT improves Faster R-CNN  with three-dimensional convolutions to exploit temporal information for better performance.
We evaluate state-of-the-art video object detection methods, i.e., FGFA  and D&T , and state-of-the-art image object detection methods, i.e., Faster R-CNN , FPN , CornerNet , and CenterNet , in this track. Specifically, FPN  and Faster R-CNN  are anchor-based methods, and CornerNet  and CenterNet  are anchor-free methods. The FGFA  and D&T  methods attempt to exploit temporal coherence of objects in consecutive frames to improve the performance.
5.4 Results and Analysis
Results on the test-challenge set.
We report the evaluation results of the submissions in the VisDrone-VID 2018  and 2019  challenges in Table V. CFE-SSDv2 obtains the best AP score in the VisDrone-VID 2018 challenge; it improves on SSD  by integrating a comprehensive feature enhancement module for accurate results, especially on small objects. Different from CFE-SSDv2, EODST exploits temporal information to associate object detections in individual frames using the ECO tracking algorithm , achieving the second best AP score. FGFA+ ranks third in AP; it is a variant of the video object detection method FGFA  with various data augmentation strategies.
In the VisDrone-VID 2019 challenge, researchers propose more powerful algorithms, which benefit from several state-of-the-art detectors, such as HRDet , Cascade R-CNN , CenterNet , RetinaNet , and FPN . All top detectors, i.e., DBAI-Det, AFSRNet, HRDet+, VCL-CRCNN, and CN-DhVaSa, surpass the top detector CFE-SSDv2 of the VisDrone-VID 2018 challenge. We witness a significant improvement in the performance of video object detection methods; however, there still remains much room for improvement. DBAI-Det achieves the best AP score; it is constructed based on Cascade R-CNN  with ResNeXt101 , and integrates the deformable convolution operation  and the global context module  to improve performance. AFSRNet ranks second in AP, integrating the feature selected anchor-free head  into RetinaNet . HRDet+, VCL-CRCNN, and CN-DhVaSa rank in third, fourth, and fifth places; they are improved from HRDet , Cascade R-CNN , and CenterNet , respectively. To deal with large scale variations of objects, other top detectors, such as DetKITSY and EODST++, employ multi-scale features and proposals for detection, and perform better than the state-of-the-art video object detector FGFA . Notably, most of the video object detection methods are too computationally expensive for practical applications, with running speeds below real-time on a workstation with a GTX 1080Ti GPU.
Results on the test-dev set.
The evaluation results of state-of-the-art video object detection methods [149, 44] and state-of-the-art image object detection methods [78, 144, 65, 101] on the test-dev set are presented in Table V. We find that the two video object detectors perform much better than the four image object detectors. For example, the second best video object detector D&T  improves the AP score compared to the top image object detector FPN , which demonstrates the importance of exploiting temporal information in video object detection. FGFA  leverages temporal coherence to enhance the features of objects for accurate results. D&T  simultaneously solves detection and tracking with an end-to-end trained convolutional neural network, using a multi-task loss for frame-based object detection and across-frame track regression. However, how to best exploit temporal information is still an open question for video object detection.
Different from the DET task, the accuracy of detection methods suffers from degenerated object appearances in videos, such as motion blur, pose variations, and video defocus. Exploiting temporal coherence and aggregating features across consecutive frames might be two effective ways to handle this issue.
A feasible way to exploit temporal coherence is to integrate object trackers, e.g., ECO  and SiamRPN++ , into detection algorithms. Specifically, we can assign a tracker to each object instance detected in individual frames to guide detection in consecutive frames, which is effective in reducing false negatives in detection. Meanwhile, integrating a re-identification module is another promising way to exploit temporal coherence for better performance, as described in D&T .
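The tracker-assisted recall idea can be sketched as a toy merge step: tracker-propagated boxes that no detection covers are appended as recovered detections. The function names, the fixed confidence assigned to recovered boxes, and the simple overlap test are all illustrative assumptions.

```python
def iou_xywh(a, b):
    """IoU of two (x, y, w, h) boxes."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def merge_tracks_into_detections(detections, tracked_boxes,
                                 iou_thresh=0.5, track_score=0.5):
    """Add tracker-propagated boxes that no current detection covers.

    `detections`: list of (box, score) for the current frame;
    `tracked_boxes`: boxes predicted by per-object trackers from the
    previous frame. A tracked box overlapping no detection likely marks
    a false negative, so it is appended with a fixed `track_score`.
    """
    merged = list(detections)
    for tb in tracked_boxes:
        if all(iou_xywh(tb, d[0]) < iou_thresh for d in detections):
            merged.append((tb, track_score))
    return merged
```

In a real system, the recovered boxes would instead inherit a confidence from the tracker and be re-scored by the detector.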
Aggregating features across consecutive frames is also a useful way to improve performance. As stated in FGFA , aggregating nearby features along motion paths to leverage temporal coherence significantly improves detection accuracy. Thus, we can take several consecutive frames as input and feed them into deep neural networks to extract temporally salient features using 3D convolution operations or optical flow algorithms.
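The FGFA-style aggregation can be illustrated with a minimal numpy sketch. It assumes the per-frame feature maps have already been aligned (warped) to the reference frame, which FGFA does with optical flow; the softmax weighting over per-location cosine similarities is a simplified stand-in for the learned embedding FGFA uses to compute adaptive weights.

```python
import numpy as np

def aggregate_features(feats, ref_idx):
    """Aggregate per-frame feature maps with cosine-similarity weights.

    `feats`: array of shape (T, C, H, W) holding features of T
    consecutive frames, assumed already aligned to the reference frame.
    Each frame's contribution at every spatial location is weighted by
    the cosine similarity of its feature vector to the reference one.
    """
    feats = np.asarray(feats, dtype=float)
    ref = feats[ref_idx]                                   # (C, H, W)
    norms = np.linalg.norm(feats, axis=1) + 1e-8           # (T, H, W)
    ref_norm = np.linalg.norm(ref, axis=0) + 1e-8          # (H, W)
    # Per-location cosine similarity of each frame to the reference.
    sims = (feats * ref).sum(axis=1) / (norms * ref_norm)  # (T, H, W)
    weights = np.exp(sims)
    weights /= weights.sum(axis=0, keepdims=True)          # softmax over T
    return (feats * weights[:, None]).sum(axis=0)          # (C, H, W)
```

Frames whose (aligned) features resemble the reference contribute more, which suppresses blurred or defocused frames.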
6 SOT Track
For the SOT track, we focus on generic single object tracking, also known as model-free tracking [123, 130, 143]. In particular, for an input video sequence and the initial bounding box of the target object in the first frame, the SOT track requires the algorithms to locate the target bounding boxes in the subsequent video frames. The tracking targets in these sequences include pedestrians, cars, buses, and animals.
6.1 Data Collection and Annotation
In 2018, we provide video sequences with fully annotated frames, split into four subsets, i.e., the training, validation, test-challenge 2018, and test-dev sets. Notably, the test-challenge 2018 subset is designed to evaluate the algorithms submitted to the VisDrone-SOT 2018 challenge competition. To thoroughly evaluate the performance of algorithms in long-term tracking, we add newly collected sequences to the test-challenge 2018 set to form the test-challenge 2019 set, which is used in the VisDrone-SOT 2019 challenge competition. The tracking targets in all these sequences include pedestrians, cars, and animals. The statistics of the target objects, i.e., the aspect ratio in different frames, the area change ratio, and the sequence length, are presented in Fig. 7.
The enclosing bounding box of the target object in each video frame is annotated to evaluate the performance of trackers. To thoroughly analyze tracking performance, we also annotate sequence attributes , i.e., aspect ratio change, background clutter, camera motion, fast motion, full occlusion, illumination variation, low resolution, out-of-view, partial occlusion, scale variation, similar object, and viewpoint change, described as follows.
Aspect ratio change: the ratio between the ground-truth aspect ratio in the first frame and that in at least one subsequent frame is outside the preset range.
Background clutter: the background near the target has a similar appearance to the target.
Camera motion: abrupt motion of the camera.
Fast motion: the motion of the ground-truth bounding box between two consecutive frames is larger than the preset number of pixels.
Full occlusion: the target is fully occluded.
Illumination variation: the illumination of the target changes significantly.
Low resolution: at least one ground-truth bounding box has fewer than the preset number of pixels.
Out-of-view: some portion of the target leaves the view.
Partial occlusion: the target is partially occluded.
Scale variation: the ratio between the initial bounding box area and that of at least one subsequent bounding box is outside the preset range.
Similar object: there are objects of similar shape or the same type near the target.
Viewpoint change: the viewpoint affects the target appearance significantly.
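The threshold-based attributes above could be derived automatically from the ground-truth boxes along these lines. The threshold constants are illustrative stand-ins for the unspecified preset values, chosen to resemble common choices in tracking benchmarks.

```python
import numpy as np

# Illustrative thresholds; the exact preset values are not restated here.
ASPECT_RANGE = (0.5, 2.0)   # aspect ratio change
MOTION_THRESH = 20.0        # fast motion, pixels between frames
AREA_THRESH = 400.0         # low resolution, pixels
SCALE_RANGE = (0.5, 2.0)    # scale variation

def sequence_attributes(boxes):
    """Derive per-sequence attributes from ground-truth (x, y, w, h) boxes."""
    boxes = np.asarray(boxes, dtype=float)
    w, h = boxes[:, 2], boxes[:, 3]
    aspect, area = w / h, w * h
    centers = boxes[:, :2] + boxes[:, 2:] / 2.0
    motion = np.linalg.norm(np.diff(centers, axis=0), axis=1)
    return {
        "aspect_ratio_change": bool(np.any(
            (aspect / aspect[0] < ASPECT_RANGE[0]) |
            (aspect / aspect[0] > ASPECT_RANGE[1]))),
        "fast_motion": bool(np.any(motion > MOTION_THRESH)),
        "low_resolution": bool(np.any(area < AREA_THRESH)),
        "scale_variation": bool(np.any(
            (area / area[0] < SCALE_RANGE[0]) |
            (area / area[0] > SCALE_RANGE[1]))),
    }
```

Appearance-based attributes (background clutter, illumination variation, viewpoint change) cannot be computed from boxes alone and are annotated manually.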
6.2 Evaluation Protocol
Following the evaluation methodology in , we use the success and precision scores to evaluate the performance of trackers. The success score is defined as the area under the success plot. That is, for each bounding box overlap threshold in the interval [0, 1], we compute the percentage of successfully tracked frames to generate the plot of successfully tracked frames vs. bounding box overlap threshold. The overlap between the tracker prediction B_tr and the ground-truth bounding box B_gt is computed as |B_tr ∩ B_gt| / |B_tr ∪ B_gt|, where ∩ and ∪ represent the intersection and union of the two regions, and |·| computes the number of pixels within a region. Meanwhile, the precision score is computed as the percentage of frames whose estimated location is within a given distance threshold of the ground truth, based on the Euclidean distance in the image plane. Here, we set the distance threshold to a fixed number of pixels in evaluation. Notably, the success score is used as the primary metric for ranking methods.
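The two scores can be sketched as follows, given precomputed per-frame overlaps and center errors. The 21-point threshold sampling and the 20-pixel default are common choices in OTB-style evaluation rather than values stated here.

```python
import numpy as np

def success_and_precision(overlaps, center_errors, dist_thresh=20.0):
    """OTB-style scores from per-frame overlap and center-error values.

    Success score: area under the curve of (fraction of frames with
    overlap > t) over thresholds t in [0, 1]. Precision score: fraction
    of frames whose center error is below `dist_thresh` pixels.
    """
    overlaps = np.asarray(overlaps, dtype=float)
    thresholds = np.linspace(0.0, 1.0, 21)
    success_curve = [(overlaps > t).mean() for t in thresholds]
    success_score = float(np.mean(success_curve))
    precision_score = float(
        (np.asarray(center_errors, dtype=float) < dist_thresh).mean())
    return success_score, precision_score
```

Because the success score integrates over all overlap thresholds, it rewards consistently tight boxes rather than boxes that merely touch the target.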
|Codename||Success Score||Institutions||Contributors and References|
|AST||56.2||Beihang University, Lancaster University, Shenyang Aerospace University||Chunlei Liu, Wenrui Ding, Jinyu Yang, Baochang Zhang, Jungong Han, Hanlin Chen |
|BTT||60.5||Shandong University||Ke Song, Xixi Hu, Wenhao Wang, Yaxuan Li, and Wei Zhang |
|C3DT||53.6||South China University of Technology||Haojie Li, Sihang Wu |
|CFCNN||55.2||Karlsruhe Institute of Technology||Wei Tian, Martin Lauer |
|CFWCRKF||50.6||Beijing University of Posts and Telecommunications||Shengyin Zhu, Yanyun Zhao |
|CKCF||32.3||Centre for Research & Technology Hellas||Emmanouil Michail, Konstantinos Avgerinakis, Panagiotis Giannakeris, Stefanos Vrochidis, Ioannis Kompatsiaris |
|DCFNet||47.4||Civil Aviation University of China, Institute of Automation, Chinese Academy of Sciences||Jing Li, Qiang Wang, Weiming Hu |
|DCST||-||Nanjing Artificial Intelligence Chip Research, IACAS, Institute of Automation, Chinese Academy of Sciences, Nanjing University of Information Science and Technology||Jiaqing Fan, Yifan Zhang, Jian Cheng, Kaihua Zhang, Qingshan Liu |
|DeCoM||56.9||Seoul National University, NAVER Corp||Byeongho Heo, Sangdoo Yun, Jin Young Choi |
|IMT3||17.6||University of South Australia||Asanka G. Perera|
|LZZ-ECO||68.0||Xidian University||Xiaotong Li, Jie Zhang, Xin Zhang |
|OST||50.3||University of Ottawa||Yong Wang, Lu Ding, Robert Laganière, Xinbin Luo |
|TRACA+||45.7||Seoul National University, Samsung R&D Campus||Kyuewang Lee, Jongwon Choi, Jin Young Choi |
|SDRCO||56.3||Beijing University of Posts and Telecommunications, Tencent, Sun Yat-sen University, Tsinghua University||Zhiqun He, Ruixin Zhang, Peizhen Zhang, Xiaohao He |
|SECFNet||51.1||National University of Defense Technology, Shanghai Jiao Tong University||Dongdong Li, Yangliu Kuai, Hao Liu, Zhipeng Deng, Juanping Zhao |
|STAPLE_SRCA||61.9||Xidian University||Wenhua Zhang, Yang Meng |
|VITALD||62.8||Harbin Institute of Technology, University of Chinese Academy of Sciences, Institute of Computing Technology, Chinese Academy of Sciences||Yuankai Qi, Yifan Yang, Weidong Chen, Kaiwen Duan, Qianqian Xu, Qingming Huang [93, 108]|
|ACNT||53.2||Jiangnan University, University of Surrey||Tianyang Xu, Xiaojun Wu, Zhenhua Feng, Josef Kittler |
|AST||51.9||Nanjing University of Information Science and Technology||Kang Yang, Xianhai Wang, Ning Wang, Jiaqing Fan, Kaihua Zhang |
|ATOMFR||61.7||Xidian University||Wenhua Zhang, Haoran Wang, Jinliu Zhou |
|ATOMv2||46.8||Institute of Automation, Chinese Academy of Sciences||Lianghua Huang, Xin Zhao, Kaiqi Huang |
|DATOM_AC||54.1||Northwestern Polytechnical University||Xizhe Xue, Xiaoyue Yin, Shanrong Zou, Ying Li |
|DC-Siam||46.3||Northwestern Polytechnical University||Jinghao Zhou, Peng Wang [26, 69, 70]|
|DR-V-LT||57.9||Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences, University of Chinese Academy of Sciences||Shiyu Xuan, Shengyang Li |
|ED-ATOM||63.5||Institute of Information Engineering, Chinese Academy of Sciences, University of Chinese Academy of Sciences, CloudWalk Technology||Chunhui Zhang, Shengwei Zhao, Kangkai Zhang, Shikun Li, Hao Wen, Shiming Ge |
|flow_MDNet_RPN||52.6||Xi’an Jiaotong University||Han Wu, Xueyuan Yang, Yong Yang, Guizhong Liu |
|HCF||36.1||Yuneec Aviation Technology, University of Ottawa, Institute of Information Engineering, Chinese Academy of Sciences||Zhuojin Sun, Yong Wang, Chunhui Zhang |
|MATOM||40.9||Institute of Optics and Electronics, Chinese Academy of Sciences||Lijun Zhou, Qintao Hu |
|PTF||54.4||Xidian University||Ruohan Zhang, Jie Chen, Jie Gao, Xiaoxue Li, Lingling Shi |
|SE-RPN||41.9||Wuhan University||Xu Lei, Jinwang Wang |
|SiamDW-FC||38.3||Institute of Automation, Chinese Academy of Sciences||Zhipeng Zhang, Weiming Hu |
|SiamFCOT||47.2||Zhejiang University||Yinda Xu, Zeyu Wang |
|Siam-OM||59.3||Xidian University||Xin Zhang, Xiaotong Li, Jie Zhang [26, 150]|
|SOT-SiamRPN++||56.8||Zhejiang University||Zhizhao Duan, Wenjun Zhu, Xi Yu, Bo Han, Zhiyong Yu, Ting He |
|SMILE||59.4||Xidian University||Ruiyan Ma, Yanjie Gao, Yuting Yang, Wei Song, Yuxuan Li [26, 69]|
|SSRR||44.7||Nanjing University of Information Science and Technology||Ning Wang, Kaihua Zhang |
|Stable-DL||38.2||University of Ottawa, Shanghai Jiao Tong University, Beihang University, INSKY Lab, Leotail Intelligent Tech||Yong Wang, Lu Ding, Robert Laganière, Jiuqing Wan, Wei Shi|
|TDE||37.2||Institute of Information Engineering, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Yuneec Aviation Technology, University of Ottawa||Chunhui Zhang, Shengwei Zhao, Zhuojin Sun, Yong Wang, Shiming Ge |
|TIOM||55.3||Beijing University of Posts and Telecommunications||Shengyin Zhu, Yanyun Zhao |
VisDrone 2018 challenge.
We present the results and team information for this track in Table VI, including entries from different institutes. CFWCRKF, CKCF, DCST, and STAPLE_SRCA are based on correlation filters, while C3DT, VITALD, DeCoM, and BTT are improved from the deep MDNet method . Seven other trackers combine CNN models and correlation filter algorithms, i.e., CFCNN, DCFNet, LZZ-ECO, OST, TRACA+, SDRCO, and SECFNet. Notably, OST, CFCNN, and LZZ-ECO use object detectors to perform target re-detection for more robustness. AST predicts the target using a saliency map, and IMT3 is based on the normalized cross-correlation filter. The LZZ-ECO method produces the best success score, using YOLOv3  to re-detect the drifted target and ECO  to track the target object.
VisDrone 2019 challenge.
As shown in Table VI, trackers from different institutes are submitted in this track. Among them, nine trackers are constructed based on ATOM , i.e., ACNT, AST, ATOMFR, ATOMv2, DATOM_AC, ED-ATOM, MATOM, SSRR, and TIOM. Notably, ED-ATOM achieves the best performance in both success and precision scores. PTF follows the ECO algorithm , and Siam-OM and SMILE use Siamese networks based on ATOM . Other trackers also use the Siamese network architecture, including DC-Siam, DR-V-LT, SiamDW-FC, SiamFCOT, and SOT-SiamRPN++.
State-of-the-art single object tracking methods are evaluated for comparison in this track. We roughly divide them into three categories, i.e., correlation filter based, Siamese network based, and convolutional network based approaches, listed as follows.
Correlation filter based approaches: KCF , CSRDCF , LCT , DSST , ECO , SRDCF , SCT , fDSST , Staple , Staple_CA , BACF , PTAV , STRCF , and HCFT . (Since HCFT  adaptively learns correlation filters on each convolutional layer to encode the target appearance, we categorize it into both the correlation filter based and convolutional network based approaches.)
6.4 Results and Analysis
Results on the test-challenge set.
The overall success and precision scores of the top submissions in the VisDrone-SOT 2018  and 2019  challenges are shown in Fig. 8(a) and (b), respectively. Notably, several challenging factors in the collected dataset, such as background clutter, large scale variation, and occlusion, make the trackers prone to drift. To that end, some trackers integrate state-of-the-art detectors to re-detect the target when drifting occurs. For example, in the VisDrone-SOT 2018 challenge, LZZ-ECO combines YOLOv3  and ECO  to achieve the best success and precision scores. VITALD trains RefineDet  as a reference for the VITAL tracker , obtaining the second best success score and the third best precision score. Another solution to the drifting problem is STAPLE_SRCA , which develops a sparse context-aware response scheme to recognize whether the target moves out of the scene or is covered by other objects; it obtains the third best success score and the second best precision score. DCST learns spatio-temporal regularized correlation filters using a color clustering based histogram model without a re-detection module, resulting in inferior success and precision scores.
We notice that the correlation filter based methods do not perform well in the VisDrone-SOT 2018 challenge. Thus, in the VisDrone-SOT 2019 challenge, the researchers shift their focus from correlation filter based methods to deep neural network based methods, such as ATOM  and Siamese networks [7, 50, 69]. Specifically, ATOMFR combines SENet  and ATOM  to capture the interdependencies within feature channels and suppress feature channels that are of little use in estimating the current target size and location, achieving the top success and precision scores on the test-challenge 2018 set.
Another critical engine for the performance improvements is the creation and utilization of large-scale datasets (e.g., MS COCO , Got-10k , ImageNet DET/VID , LaSOT , TrackingNet , VOT , and YoutubeBB ) for deep neural network training. For example, ED-ATOM achieves the best success and precision scores in the VisDrone-SOT 2019 challenge. This is because ED-ATOM is constructed based on ATOM  with a low-light image enhancement algorithm  and an online data augmentation scheme [137, 9]. Meanwhile, the model is trained on ImageNet DET/VID , MS COCO , Got-10k , and LaSOT .
Moreover, tracker combination is an effective strategy to improve performance. Siam-OM uses ATOM  to handle short-term tracking and DaSiam  with a ResNet backbone to handle long-term tracking, ranking fourth in the VisDrone-SOT 2019 challenge. SMILE combines two state-of-the-art trackers, ATOM  and SiamRPN++ , to improve performance, ranking fifth. DR-V-LT integrates a distractor-aware verification network into SiamRPN++ , which makes it robust to the similar object challenge, ranking eighth.
In addition, comparing the results of the submitted trackers on the test-challenge 2018 and test-challenge 2019 sets, we find that the tracking accuracy is significantly degraded: the best tracker ED-ATOM achieves considerably higher success and precision scores on the test-challenge 2018 set than on the test-challenge 2019 set. This demonstrates the difficulty of the newly collected long-term tracking sequences and suggests the need to develop more effective trackers for challenging scenarios on drones.
Results on the test-dev set.
We evaluate state-of-the-art trackers on the test-dev set in Fig. 8(c). As shown in Fig. 8(c), ATOM  (marked as the orange cross in the top-right corner) obtains the best success score and the third best precision score. This is attributed to its network being trained offline on large-scale datasets to directly predict the IoU overlap between the target and a bounding box estimate. However, it does not perform well on the low resolution and out-of-view attributes. MDNet  and SiamRPN++  rank second and third in terms of success score, respectively. In summary, training on large-scale datasets brings significant performance improvements for trackers.
The state-of-the-art SOT algorithms on the VisDrone-SOT dataset are inspired by algorithms from the object detection and re-identification fields. They benefit greatly from offline training on large-scale datasets, such as MS COCO , Got-10k , ImageNet DET/VID , LaSOT , TrackingNet , VOT , and YoutubeBB . However, fast motion, low resolution, and occlusion still challenge the performance of SOT algorithms.
Several SOT algorithms [69, 15, 122] formulate object tracking as a one-shot detection task, using the bounding box in the first frame as the only exemplar. These methods rely on pre-set anchor boxes to regress the bounding box of the target in consecutive frames. However, the pre-defined anchor boxes cannot adapt to the various motion patterns and scales of targets, especially when fast motion or occlusion occurs. To this end, we can attempt to integrate motion information or a re-detection module to improve the accuracy of tracking algorithms.
Low resolution is another challenging factor that greatly affects tracking accuracy. Most of the state-of-the-art methods [26, 50, 69] merely focus on the appearance variations of the target region, producing unstable and inaccurate results. We believe that exploiting the context information surrounding the target and super-resolution techniques can help improve tracking performance.
Occlusion happens frequently in the tracking process and is a major obstacle to accurate tracking. Some previous algorithms [1, 12, 32, 33] attempt to use part-based representations to handle the appearance changes caused by occlusion. Meanwhile, using a re-initialization module  is an effective strategy to recover from occlusion, i.e., the re-initialization module is able to re-detect the target after it reappears in the scene. In addition, predicting the motion pattern of the target based on its historical trajectory is also a promising direction worth exploring.
7 MOT Track
The MOT track aims to recover the trajectories of objects in video sequences, which is an important problem in computer vision with many applications, such as surveillance, activity analysis, and sports video analysis. In the VisDrone-2018 challenge, we divide this track into two sub-tracks depending on whether prior detection results in individual frames are used. Specifically, for the first sub-track, a submitted algorithm is required to recover the trajectories of objects in video sequences without taking object detection results as input; the evaluation protocol presented in  (i.e., the average precision (AP) of trajectories per object class) is used to evaluate the performance of trackers. In contrast, for the second sub-track, prior object detection results in individual frames are provided, and the participating algorithms can work on top of the input detections. In the VisDrone-2019 challenge, we merge these two sub-tracks and do not distinguish submitted algorithms according to whether they use object detections in each video frame as input. The average precision (AP) of the recovered trajectories in  is used to evaluate the performance of submitted trackers. Notably, this track uses the same data as the VID track; specifically, five categories of objects (i.e., pedestrian, car, van, bus, and truck) in the video clips are considered in evaluation.
| Codename | AP | MOTA | Institutions | Contributors and References |
| --- | --- | --- | --- | --- |
| Ctrack | 16.12 | 30.80 | Centre for Research & Technology Hellas | Emmanouil Michail, Konstantinos Avgerinakis, Panagiotis Giannakeris, Stefanos Vrochidis, Ioannis Kompatsiaris |
| deep-sort_d2 | 10.47 | - | Beijing University of Posts and Telecommunications | Jianfei Zhao, Yanyun Zhao [79, 127] |
| FRMOT | - | 33.10 | Universidad Autónoma de Madrid | Elena Luna, Diego Ortego, Juan C. San Miguel, José M. Martínez |
| GOG_EOC | - | 36.90 | Harbin Institute of Technology, University of Chinese Academy of Sciences | Hongyang Yu, Guorong Li, Qingming Huang |
| MAD | 7.27 | - | Xidian University | Wei Song, Yuxuan Li, Zhaoliang Pi, Wenhua Zhang [100, 116] |
| SCTrack | - | 35.80 | University of Missouri-Columbia, U.S. Naval Research Laboratory | Noor M. Al-Shakarji, Filiz Bunyak, Guna Seetharaman, Kannappan Palaniappan |
| TrackCG | - | 42.60 | Karlsruhe Institute of Technology | Wei Tian, Zhiming Ma, Martin Lauer |
| V-IOU | - | 40.20 | Technische Universität Berlin | Erik Bochinski, Tobias Senst, Thomas Sikora |
| DBAI-Tracker | 43.94 | - | DeepBlue Technology (Shanghai) | Zhipeng Luo, Yuehan Yao, Zhenyu Xu, Feng Ni, Bing Dong [11, 10, 96, 54] |
| Flow-Tracker | 30.87 | - | Xi’an Jiaotong University | Weiqiang Li, Jiatong Mu, Guizhong Liu [11, 109, 10] |
| GGDTRACK | 23.09 | - | Axis Communications, Centre for Mathematical Sciences | Håkan Ardö, Mikael Nilsson [101, 4] |
| HMTT | 28.67 | - | Beijing University of Posts and Telecommunications | Siyang Pan, Zhihang Tong, Yanyun Zhao [10, 150, 144, 142] |
| IITD_DeepSort | 13.88 | - | Indian Institute of Information Technology, Indian Institute of Technology | Ajit Jadhav, Prerana Mukherjee, Vinay Kaushik, Brejesh Lall [79, 128] |
| OS-MOT | 0.16 | - | University of Ottawa, Shanghai Jiao Tong University, YUNEEC Aviation Technology, Institute of Information Engineering, Chinese Academy of Sciences, INSKY Lab, Leotail Intelligent Tech | Yong Wang, Lu Ding, Robert Laganière, Zhuojin Sun, Chunhui Zhang, Wei Shi |
| SCTrack | 10.09 | - | University of Technology, University of Missouri-Columbia, U.S. Naval Research Laboratory | Noor M. Al-Shakarji, Filiz Bunyak, Guna Seetharaman, Kannappan Palaniappan [101, 2] |
| SGAN | 2.54 | - | Harbin Institute of Technology, University of Chinese Academy of Sciences | Hongyang Yu, Guorong Li, Qingming Huang |
| T&D-OF | 12.37 | - | Dalian University of Technology | Xinyu Zhang, Xin Chen, Shuhao Chen, Chang Liu, Dong Wang, Huchuan Lu [22, 58, 17] |
| TNT_DRONE | 27.32 | - | University of Washington, Beijing University of Posts and Telecommunications | Haotian Zhang, Yanting Zhang, Gaoang Wang, Jenq-Neng Hwang |
| TrackKITSY | 39.19 | - | Karlsruhe Institute of Technology, Sun Yat-sen University | Wei Tian, Jinrong Hu, Yuduo Song, Zhaotang Chen, Long Chen, Martin Lauer [11, 112] |
| VCLDAN | 7.50 | - | Tsinghua University | Zhibin Xiao |
7.1 Evaluation Protocol
In the VisDrone-2018 challenge, the MOT track is divided into two sub-tracks depending on whether prior detection results in individual frames are used. For multi-object tracking without detection input, we use the tracking evaluation protocol in  to evaluate the performance of the algorithms. That is, each algorithm is required to output a list of bounding boxes with confidence scores and the corresponding identities. We sort the tracklets (formed by the bounding box detections with the same identity) according to the average confidence of their bounding box detections. A tracklet is considered correct if its intersection-over-union (IoU) overlap with the ground-truth tracklet is larger than a threshold. Similar to , we use three thresholds in evaluation. The performance of an algorithm is evaluated by averaging the mean average precision (mAP) per object class over the different thresholds.
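One plausible reading of the tracklet-overlap criterion above can be sketched as follows (the helper names and the handling of unmatched frames are our assumptions, not the official evaluation code):

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def tracklet_iou(pred, gt):
    """Mean per-frame IoU between a predicted and a ground-truth tracklet.
    Both are dicts mapping frame index -> box; frames missing from either
    tracklet count as zero overlap."""
    frames = set(pred) | set(gt)
    return sum(box_iou(pred[f], gt[f]) if f in pred and f in gt else 0.0
               for f in frames) / len(frames)

def is_correct(pred, gt, threshold):
    """A tracklet is correct if its overlap with the ground truth exceeds
    the threshold; mAP is then averaged over several such thresholds."""
    return tracklet_iou(pred, gt) > threshold
```

The same tracklet can thus count as correct at a loose threshold but incorrect at a strict one, which is why the protocol averages mAP over multiple thresholds.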
For multi-object tracking with detection input, we follow the evaluation protocol in  to evaluate the performance of the algorithms. That is, the average rank over ten metrics (i.e., MOTA, MOTP, IDF1, FAF, MT, ML, FP, FN, IDS, and FM) is used to rank the algorithms. The MOTA metric combines three error sources: FP, FN, and IDS. The MOTP metric is the average dissimilarity between all true positives and the corresponding ground-truth targets. The IDF1 metric is the ratio of correctly identified detections over the average number of ground-truth and computed detections. The FAF metric is the average number of false alarms per frame. The FP metric is the total number of tracker outputs that are false alarms, and FN is the total number of targets missed by the tracked trajectories over all frames. The IDS metric counts the total number of times the matched identity of a tracked trajectory changes, while FM counts the number of times trajectories are fragmented. Both the IDS and FM metrics reflect the accuracy of the tracked trajectories. The MT and ML metrics measure the percentage of ground-truth trajectories that are covered by the tracker for more than and less than given fractions of their time span, respectively.
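The MOTA definition above combines the three error counts into a single score; a toy computation (with hypothetical counts, purely for illustration) looks like this:

```python
def mota(fp, fn, ids, num_gt):
    """MOTA = 1 - (FP + FN + IDS) / (total ground-truth objects).
    Perfect tracking gives 1.0; heavy errors can push it below zero."""
    return 1.0 - (fp + fn + ids) / num_gt

# Toy sequence with 1000 ground-truth boxes in total.
score = mota(fp=50, fn=120, ids=30, num_gt=1000)  # -> 0.8
```

Note that MOTA weighs all three error types equally, which is why the protocol also reports IDF1, IDS, and FM to expose identity-preservation quality separately.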
In the VisDrone-2019 challenge, we do not distinguish submitted algorithms according to whether they use object detection in each video frame as input or not. Similar to the evaluation protocol used in the multi-object tracking without detection input in the VisDrone-2018 challenge, we use the protocol in  to evaluate the performance of algorithms.
7.2 Submitted Trackers
VisDrone 2018 challenge.
The multi-object tracking algorithms submitted in this track are shown in Table VII. Ctrack aggregates the predicted events of grouped targets and uses temporal constraints to stitch short tracklets . V-IOU  uses spatial overlap to associate input detections in consecutive frames. GOG_EOC develops a context harmony model that exchanges object context patches via a Siamese network, and tracks the objects using the algorithm in . SCTrack  uses a color correlation cost matrix to maintain object identities. TrackCG  achieves the best MOTA score among all trackers using the public input detections. It first estimates the target state using the motion patterns of grouped objects to build short tracklets, and then uses a graph model to generate long trajectories. Two other methods use private input detections: MAD, which uses YOLOv3  for detection and CFNet  for association, and deep-sort_v2, which uses RetinaNet  for detection and Deep-SORT  for association.
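The overlap-based association at the heart of V-IOU-style trackers can be sketched in a few lines (a simplified illustration; the threshold name `sigma` and the greedy matching order are our assumptions, not the authors' exact implementation):

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def extend_tracks(tracks, detections, sigma=0.3):
    """Greedily extend each track with the unmatched detection whose IoU
    with the track's last box is highest and above `sigma`."""
    used = set()
    for track in tracks:
        best, best_iou = None, sigma
        for di, det in enumerate(detections):
            if di not in used and iou(track[-1], det) > best_iou:
                best, best_iou = di, iou(track[-1], det)
        if best is not None:
            used.add(best)
            track.append(detections[best])
    return tracks
```

Because it uses no image information at all, this scheme is extremely fast, but it relies on the detections being dense and the inter-frame motion being small.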
VisDrone 2019 challenge.
In this track, we received entries from different institutes, shown in Table VII. Most of the submissions are based on the tracking-by-detection framework, i.e., the trackers exploit temporal coherence to associate detections in individual frames to recover the trajectories of targets. First, several submissions use state-of-the-art detectors, such as R-FCN , RetinaNet , Cascade R-CNN , and CenterNet , to generate object detections in individual frames. After that, some submitted methods use single object tracking methods, such as KCF  and DaSiameseRPN , to recover the false negatives of the detectors. Other methods, such as GGDTRACK, Flow-Tracker, OS-MOT, T&D-OF, TrackKITSY, and SGAN, attempt to exploit low-level or mid-level temporal information to improve tracking performance. The HMTT, IITD_DeepSort, SCTrack, T&D-OF, TNT_DRONE, and VCLDAN methods use metric learning algorithms to compute the similarities between detections in consecutive frames, which is effective in handling occlusions and missed detections.
We also evaluate existing multi-object tracking methods in this track for comparison, including GOG , IOUT , SORT , and MOTDT . Notably, the FPN  object detection method is used to generate the input detections in each individual frame.
7.3 Results and Analysis
Results on the test-challenge set.
We report the evaluation results of the trackers in the VisDrone-VDT 2018  and VisDrone-MOT 2019  challenges with the evaluation protocols  and  in Tables VIII and IX, respectively. As shown in Table VIII, in the sub-track without prior input detections in the VisDrone-VDT 2018 challenge, Ctrack achieves the best AP score by aggregating the predicted events of grouped targets and stitching the tracks with temporal constraints. In this way, targets in crowded scenarios can be recovered after being occluded. In the VisDrone-MOT 2019 challenge, the submitted algorithms achieve significant improvements, e.g., DBAI-Tracker substantially improves the top AP score. Notably, the top three trackers, i.e., DBAI-Tracker, TrackKITSY, and Flow-Tracker, use Cascade R-CNN  to generate detections in individual frames and integrate temporal information, e.g., FlowNet  and the IoU tracker , to complete the association. Similarly, HMTT, ranked fourth in the challenge, combines CenterNet , the IoU tracker , and DaSiameseRPN  for multiple object tracking.
For the sub-track using the provided input detections, i.e., generated by Faster R-CNN  in the VisDrone-VDT 2018 challenge, TrackCG achieves the best MOTA and IDF1 scores. V-IOU achieves slightly inferior MOTA and IDF1 scores to TrackCG, but produces the best IDS score. It associates detections based on spatial overlap, i.e., intersection-over-union, in consecutive frames. We speculate that the overlap-based measurement is sufficiently reliable in drone-captured videos taken from high altitude, which do not contain large displacements of objects. GOG_EOC, which uses both the detection overlap and a context harmony degree to measure the similarities between detections in consecutive frames, obtains the best FAF, FP, and FM scores. SCTrack designs a color correlation cost matrix to maintain object identities; however, the color information is not reliable enough, resulting in inferior results, i.e., ranked fourth in terms of MOTA. FRMOT is an online tracker using the Hungarian algorithm to associate detections, leading to relatively large IDS and FM scores.
Results on the test-dev set.
We evaluate the multi-object tracking methods on the test-dev set with the evaluation protocols  and , shown in Tables VIII and IX, respectively. Notably, FPN  is used to generate object detections in individual frames for the sub-track using prior input detections.
GOG  and IOUT  benefit from the global information of whole sequences and the spatial overlap between frame detections, achieving the best tracking results under both evaluation protocols  and . SORT  approximates the inter-frame displacement of each object with a linear constant-velocity model that is independent of object category and camera motion, which significantly degrades its performance. MOTDT  computes the similarities between objects using an appearance model trained on other large-scale person re-identification datasets without fine-tuning, leading to inferior accuracy.
Most MOT algorithms formulate the tracking task as a data association problem, which aims to associate object detections in individual frames to generate object trajectories. Thus, the accuracy of object detection in individual frames significantly influences MOT performance. Intuitively, integrating object detection and tracking into a unified framework is a promising way to improve performance. In the following, we discuss two potential research directions to further boost performance.
For the data association problem, similarity computation between detections in individual frames is crucial for tracking performance. Both appearance and motion information should be considered when computing the similarities. For example, a Siamese network offline-trained on the ImageNet VID dataset  can be used to extract temporally discriminative features of objects; the network can then be fine-tuned during tracking to further improve accuracy. Meanwhile, several low-level and mid-level motion features, such as KLT tracks and optical flow, are also effective and useful for MOT algorithms.
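The combination of appearance and motion cues described above can be sketched as a single affinity score (a minimal illustration with a hypothetical blending weight `w_app`; the embeddings stand in for features a Siamese network might produce):

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format, used as the motion cue."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def affinity(feat_a, feat_b, box_a, box_b, w_app=0.7):
    """Blend appearance similarity (cosine similarity of embedding vectors)
    with a motion cue (IoU of the two boxes)."""
    cos = float(np.dot(feat_a, feat_b) /
                (np.linalg.norm(feat_a) * np.linalg.norm(feat_b) + 1e-8))
    return w_app * cos + (1.0 - w_app) * box_iou(box_a, box_b)
```

The resulting affinity matrix between existing tracks and new detections can then be fed to a standard assignment solver (e.g., the Hungarian algorithm) to complete the association.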
Scene understanding is another effective way to improve MOT performance. For example, based on a scene understanding module, we can infer the entry and exit ports of a scene. The locations of these ports provide a strong prior for trackers to distinguish occlusion, termination, and re-appearance of a target. Meanwhile, the tracker can also suppress false trajectories based on general knowledge of the scene, e.g., vehicles are driven on roads rather than on buildings. In summary, this area is worth further study to improve MOT performance.
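As a toy sketch of the last point (the road mask, threshold, and helper names are our own illustration, assuming a binary drivable-area mask is available from a scene understanding module), vehicle trajectories whose centers rarely fall on the road can be suppressed as likely false alarms:

```python
import numpy as np

def filter_vehicle_tracks(tracks, road_mask, min_on_road=0.8):
    """Keep a vehicle trajectory only if most of its box centers fall on
    the binary road mask; off-road vehicle tracks are likely false alarms."""
    kept = []
    for centers in tracks:  # each track: list of (cx, cy) box centers
        on_road = sum(road_mask[int(cy), int(cx)] for cx, cy in centers)
        if on_road / len(centers) >= min_on_road:
            kept.append(centers)
    return kept

mask = np.zeros((100, 100), dtype=int)
mask[:, 40:60] = 1                        # a vertical road strip
good = [(50, y) for y in range(10, 20)]   # trajectory on the road
bad = [(5, y) for y in range(10, 20)]     # trajectory over a building
```

Calling `filter_vehicle_tracks([good, bad], mask)` keeps only the on-road trajectory in this toy scene.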
We introduce a new large-scale benchmark, VisDrone, to facilitate research on object detection and tracking on drone-captured imagery. A vast collection of object instances has been gathered, annotated, and organized to drive the advancement of object detection and tracking algorithms. We place emphasis on capturing images and video clips in real-life environments. Notably, the dataset is recorded over different cities in China with various drone platforms, featuring diverse real-world scenarios. We provide a rich set of annotations, comprising a large number of annotated object instances along with several important attributes. The VisDrone benchmark is made available to the research community through the project website: www.aiskyeye.com. Nevertheless, the best submissions in the four tracks are still far from satisfactory for real applications.
We would like to thank Jiayu Zheng and Tao Peng for valuable and constructive suggestions to improve the quality of this paper.
-  (2006) Robust fragments-based tracking using the integral histogram. In CVPR, pp. 798–805. Cited by: §6.5.
-  (2017) Robust multi-object tracking with semantic color correlation. In AVSS, pp. 1–7. Cited by: §7.2, TABLE VII.
-  (2016) Social LSTM: human trajectory prediction in crowded spaces. In CVPR, pp. 961–971. Cited by: TABLE VII.
-  (2019) Multi target tracking by learning from generalized graph differences. CoRR abs/1908.06646. Cited by: TABLE VII.
-  (2017) Okutama-action: an aerial view video dataset for concurrent human action detection. In CVPRWorkshops, pp. 2153–2160. Cited by: TABLE I, §2.2.
-  (2016) Staple: complementary learners for real-time tracking. In CVPR, pp. 1401–1409. Cited by: §6.3, TABLE VI.
-  (2016) Fully-convolutional siamese networks for object tracking. In ECCV, pp. 850–865. Cited by: §6.3, §6.4.
-  (1992) Auction algorithms for network flow problems: A tutorial introduction. Comp. Opt. and Appl. 1 (1), pp. 7–66. Cited by: TABLE VII.
-  (2018) Unveiling the power of deep tracking. In ECCV, pp. 493–509. Cited by: §6.4.
-  (2017) High-speed tracking-by-detection without using image information. In AVSS, pp. 1–6. Cited by: §7.2, §7.2, §7.3, §7.3, TABLE VII, TABLE VIII, TABLE IX.
-  (2018) Cascade R-CNN: delving into high quality object detection. In CVPR, pp. 6154–6162. Cited by: §4.3, §4.3, §4.3, TABLE II, TABLE III, §5.3, §5.4, TABLE IV, §7.2, §7.3, TABLE VII.
-  (2014) Robust deformable and occluded object tracking with dynamic graph. TIP 23 (12), pp. 5497–5509. Cited by: §6.5.
-  (2019) GCNet: non-local networks meet squeeze-excitation networks and beyond. CoRR abs/1904.11492. Cited by: §4.3, §4.4, §5.4.
-  (2016) Visual object tracking performance measures revisited. TIP 25 (3), pp. 1261–1274. Cited by: §1.
-  (2019) Fast visual object tracking with rotated bounding boxes. CoRR abs/1907.03892. Cited by: §6.5.
-  (2019) Hybrid task cascade for instance segmentation. In CVPR, Cited by: §4.3, TABLE II.
-  (2018) Real-time multiple people tracking with deeply learned candidate selection and person re-identification. In ICME, pp. 1–6. Cited by: §7.2, §7.3, TABLE VII, TABLE VIII, TABLE IX.
-  (2017) An equalized global graph model-based approach for multicamera object tracking. TCSVT 27 (11), pp. 2367–2381. Cited by: §2.1.
-  (2018) Context-aware deep feature compression for high-speed visual tracking. In CVPR, pp. 479–488. Cited by: §6.3, TABLE VI.
-  (2016) Visual tracking using attention-modulated disintegration and integration. In CVPR, pp. 4321–4330. Cited by: §6.3.
-  (2016) R-FCN: object detection via region-based fully convolutional networks. In NeurIPS, pp. 379–387. Cited by: TABLE II, §7.2, TABLE VII.
-  (2017) Deformable convolutional networks. In ICCV, pp. 764–773. Cited by: §4.3, §4.4, §4.4, §5.4.
-  (2017) ECO: efficient convolution operators for tracking. In CVPR, pp. 6931–6939. Cited by: §5.3, §5.4, §5.5, TABLE IV, §6.3, §6.4, TABLE VI.
-  (2018) ATOM: accurate tracking by overlap maximization. CoRR abs/1811.07628. Cited by: §6.3, §6.3, §6.4, §6.4, §6.4, §6.4, §6.5, TABLE VI.
-  (2014) Accurate scale estimation for robust visual tracking. In BMVC, Cited by: §6.3.
-  (2015) Learning spatially regularized correlation filters for visual tracking. In ICCV, pp. 4310–4318. Cited by: §6.3.
-  (2017) Discriminative scale space tracking. TPAMI 39 (8), pp. 1561–1575. Cited by: §6.3.
-  (2009) ImageNet: A large-scale hierarchical image database. In CVPR, pp. 248–255. Cited by: §2.1.
-  (2012) Pedestrian detection: an evaluation of the state of the art. TPAMI 34 (4), pp. 743–761. Cited by: TABLE I, §1, §2.1, §4.4.
-  (2016) Online deformable object tracking based on structure-aware hyper-graph. TIP 25 (8), pp. 3572–3584. Cited by: §2.1, §6.5.
-  (2017) Geometric hypergraph learning for visual tracking. TCYB 47 (12), pp. 4182–4195. Cited by: §6.5.
-  (2018) The unmanned aerial vehicle benchmark: object detection and tracking. In ECCV, pp. 375–391. Cited by: TABLE I, §2.2.
-  (2019) VisDrone-det2019: the vision meets drone object detection in image challenge results. In ICCV Workshops, Cited by: §4.4.
-  (2019) VisDrone-sot2019: the vision meets drone single object tracking challenge results. In ICCV Workshops, Cited by: §6.4.
-  (2009) Monocular pedestrian detection: survey and experiments. TPAMI 31 (12), pp. 2179–2195. Cited by: §2.1.
-  (2016) The thermal infrared visual object tracking VOT-TIR2016 challenge results. In ECCVWorkshops, pp. 824–849. Cited by: §2.1.
-  (2012) The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. Note: http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html Cited by: TABLE I.
-  (2015) The pascal visual object classes challenge: A retrospective. IJCV 111 (1), pp. 98–136. Cited by: §2.1.
-  (2010) The pascal visual object classes (VOC) challenge. IJCV 88 (2), pp. 303–338. Cited by: §2.1.
-  (2018) LaSOT: A high-quality benchmark for large-scale single object tracking. CoRR abs/1809.07845. Cited by: §1, §6.4, §6.5.
-  (2017) Parallel tracking and verifying: A framework for real-time and high accuracy visual tracking. In ICCV, pp. 5487–5495. Cited by: §6.3.
-  (2017) Detect to track and track to detect. In ICCV, pp. 3057–3065. Cited by: §5.3, §5.4, §5.5, TABLE V.
-  (2015) The thermal infrared visual object tracking VOT-TIR2015 challenge results. In ICCVWorkshops, pp. 639–651. Cited by: §2.1.
-  (2009) PETS2009: dataset and challenge. In AVSS, pp. 1–6. Cited by: §2.1, §2.1.
-  (2017) Need for speed: A benchmark for higher frame rate object tracking. In ICCV, pp. 1134–1143. Cited by: TABLE I, §2.1.
-  (2017) Learning background-aware correlation filters for visual tracking. In ICCV, pp. 1144–1152. Cited by: §6.3, TABLE VI.
-  (2012) Are we ready for autonomous driving? the KITTI vision benchmark suite. In CVPR, pp. 3354–3361. Cited by: TABLE I, §1, §2.1, §2.1.
-  (2017) Learning dynamic siamese network for visual object tracking. In ICCV, pp. 1781–1789. Cited by: §6.3, §6.4, §6.5.
-  (2018) Deep back-projection networks for super-resolution. In CVPR, pp. 1664–1673. Cited by: §4.4.
-  (2017) Mask R-CNN. In ICCV, pp. 2980–2988. Cited by: §4.4.
-  (2017) Correlation filters with weighted convolution responses. In ICCVWorkshops, pp. 1992–2000. Cited by: TABLE VI.
-  (2015) High-speed tracking with kernelized correlation filters. TPAMI 37 (3), pp. 583–596. Cited by: §6.3, TABLE VI, §7.2, TABLE VII.
-  (2017) Drone-based object counting by spatially regularized regional proposal network. In ICCV, Cited by: TABLE I, §1, §2.2.
-  (2018) Squeeze-and-excitation networks. In CVPR, Cited by: §4.3, §4.4, §6.4.
-  (2018) GOT-10k: A large high-diversity benchmark for generic object tracking in the wild. CoRR abs/1810.11981. Cited by: §6.4, §6.5.
-  (2017) FlowNet 2.0: evolution of optical flow estimation with deep networks. In CVPR, pp. 1647–1655. Cited by: TABLE VII.
-  (2012) Tracking-learning-detection. TPAMI 34 (7), pp. 1409–1422. Cited by: §6.5.
-  (2016) Analysing domain shift factors between videos and images for object detection. TPAMI 38 (11), pp. 2327–2334. Cited by: §2.1.
-  (2019) DroneSURF: benchmark dataset for drone-based face recognition. In FG, pp. 1–7. Cited by: TABLE I.
-  (2016) The visual object tracking VOT2016 challenge results. In ECCVWorkshops, pp. 777–823. Cited by: TABLE I, §2.1.
-  (2018) The sixth visual object tracking VOT2018 challenge results. In ECCVWorkshops, pp. 3–53. Cited by: §2.1, §6.4, §6.5.
-  (2015) The visual object tracking VOT2015 challenge results. In ICCVWorkshops, pp. 564–586. Cited by: §2.1.
-  (2018) CornerNet: detecting objects as paired keypoints. In ECCV, pp. 765–781. Cited by: §4.3, §4.4, TABLE III, §5.3, §5.4, TABLE V.
-  (2019) CornerNet-lite: efficient keypoint based object detection. CoRR abs/1904.08900. Cited by: §5.3, TABLE IV.
-  (2015) MOTChallenge 2015: towards a benchmark for multi-target tracking. CoRR abs/1504.01942. Cited by: TABLE I, §1, §2.1.
-  (2015) NUS-PRO: a new visual tracking challenge. In TPAMI, pp. 1–15. Cited by: §2.1.
-  (2018) SiamRPN++: evolution of siamese visual tracking with very deep networks. In CVPR, Cited by: §5.3, §5.5, TABLE IV, §6.3, §6.4, §6.4, §6.4, §6.5, §6.5, TABLE VI.
-  (2018) High performance visual tracking with siamese region proposal network. In CVPR, pp. 8971–8980. Cited by: TABLE VI.
-  (2018) Learning spatial-temporal regularized correlation filters for visual tracking. In CVPR, pp. 4904–4913. Cited by: §6.3, TABLE VI.
-  (2017) Visual object tracking for unmanned aerial vehicles: A benchmark and new motion models. In AAAI, pp. 4140–4146. Cited by: §2.2.
-  (2019) Scale-aware trident networks for object detection. CoRR abs/1901.01892. Cited by: §4.3, TABLE II.
-  (2017) Light-head R-CNN: in defense of two-stage object detector. CoRR abs/1711.07264. Cited by: §4.3, TABLE II, TABLE III.
-  (2018) DetNet: design backbone for object detection. In ECCV, pp. 339–354. Cited by: §4.3, §4.4, TABLE III.
-  (2015) Encoding color information for visual tracking: algorithms and benchmark. TIP 24 (12), pp. 5630–5644. Cited by: TABLE I, §2.1.
-  (2018) Planar object tracking in the wild: A benchmark. In ICRA, pp. 651–658. Cited by: TABLE I.
-  (2017) Feature pyramid networks for object detection. In CVPR, pp. 936–944. Cited by: §4.3, §4.3, §4.4, §4.4, TABLE II, TABLE III, §5.3, §5.4, §5.4, TABLE V, §7.2, §7.3.
-  (2017) Focal loss for dense object detection. In ICCV, pp. 2999–3007. Cited by: §4.3, §4.3, §4.3, §4.4, TABLE II, TABLE III, §5.3, §5.4, TABLE IV, §7.2, §7.2, TABLE VII.
-  (2014) Microsoft COCO: common objects in context. In ECCV, pp. 740–755. Cited by: TABLE I, §1, §2.1, §4.2, §4.4, §5.2, §6.4, §6.5.
-  (2016) SSD: single shot multibox detector. In ECCV, pp. 21–37. Cited by: