VisDrone-CC2020: The Vision Meets Drone Crowd Counting Challenge Results

Crowd counting on the drone platform is an interesting topic in computer vision, which brings new challenges such as small object inference, background clutter and wide viewpoint. However, few algorithms focus on crowd counting in drone-captured data due to the lack of comprehensive datasets. To this end, we collect a large-scale dataset and organize the Vision Meets Drone Crowd Counting Challenge (VisDrone-CC2020) in conjunction with the 16th European Conference on Computer Vision (ECCV 2020) to promote developments in the related fields. The collected dataset consists of 3,360 images, including 2,460 images for training and 900 images for testing. Specifically, we manually annotate persons with points in each video frame. In total, 14 algorithms from 15 institutes were submitted to the VisDrone-CC2020 Challenge. We provide a detailed analysis of the evaluation results and conclude the challenge. More information can be found at the challenge website.


1 Introduction

Crowd counting aims to precisely estimate the number of objects in images, e.g., pedestrians [45], vehicles [15], commodities [3], animals [1], and cells [21]. It has wide applications in video surveillance, crowd analysis, and traffic monitoring, to name a few.

With the developments of deep learning techniques in recent years, many researchers formulate the crowd counting problem as the regression of density maps using deep neural networks. For example, Zhang et al. [45] design a multi-column network architecture with three branches to deal with different scales of objects. After that, various algorithms [22, 23, 42, 19] achieve significant advances on several challenging datasets captured on surveillance scenes, such as UCF_CC_50 [17], WorldExpo [43], ShanghaiTech [45], and UCF-QNRF [18].

In contrast to images captured in surveillance scenarios, drone-captured video sequences involve different challenges, including wide viewpoint variations, small objects and cluttered backgrounds, which place higher demands on crowd counting algorithms. However, few datasets in the community focus on such scenarios. To advance the developments in crowd counting, we organize the Vision Meets Drone Crowd Counting (VisDrone-CC2020) challenge, which is one track of the “Vision Meets Drone: A Challenge” held on August 28, 2020, in conjunction with the 16th European Conference on Computer Vision (ECCV 2020). In particular, we provide a dataset recorded by various drone-mounted cameras in different scenarios across different cities in China (i.e., Tianjin, Guangzhou, Daqing, and Hong Kong). The objects of interest are pedestrians. We invite researchers to submit the results of counting algorithms and share their latest research in the workshop. There are 14 algorithms from 15 institutes considered in the VisDrone-CC2020 Challenge. The detailed evaluation results can be found on the challenge website.

2 Related Work

In this section, we review the related crowd counting datasets and algorithms briefly. More details can be found in the survey [13].

2.1 Crowd Counting Datasets

Recently, numerous datasets have been proposed to deal with the challenges in crowd counting, such as scale variations, background clutter, and illumination variation in the wild. The most frequently used crowd counting datasets include UCF_CC_50 [17], WorldExpo [43], ShanghaiTech [45], and UCF-QNRF [18]. UCF_CC_50 [17] contains only images from different scenes with various densities and different perspective distortions. WorldExpo [43] is collected from videos of Shanghai 2010 WorldExpo, and include frames in total. ShanghaiTech [45] may be the most popular dataset (Part A and Part B) and composed of images with annotations. UCF-QNRF [18] is a large-scale high-resolution dataset that contains images and about million people heads.

However, the above-mentioned datasets are of relatively small size, which limits the power of deep learning. To avoid overfitting, recently proposed datasets collect larger-scale data in both the number of images and the number of persons, e.g., GCC [35] and Crowd Surveillance [41]. The GCC dataset [35] is a diverse synthetic dataset collected from scenes in Grand Theft Auto V (GTA5). It includes images and persons, with resolution of . The Crowd Surveillance [41] dataset contains high-resolution images and annotated people. Instead of surveillance scenes, in this work, we focus on crowd counting on drone-captured scenes. Specifically, our proposed VisDrone-CC2020 dataset is formed by 3,360 images, including 2,460 images for training and 900 images for testing, which contains more than annotated people heads. In Table 1, we summarize the statistical comparison between our dataset and the previous datasets.

2.2 Crowd Counting Methods

The majority of early crowd counting methods [38, 20, 34] rely on sliding-window detectors that scan still images or video frames to detect pedestrians based on hand-crafted features. However, these methods are easily affected by heavy occlusion and by scale and viewpoint variations in crowded scenarios. Benefiting from the great success of deep learning, many modern methods [45, 22] formulate the crowd counting problem as the regression of density maps by neural networks. Zhang et al. [45] develop a multi-column architecture with three branches to deal with different scales of objects. To handle small objects, Li et al. [22] employ dilated convolution layers as the backend network to expand the receptive field while maintaining the resolution. In [23], a multi-scale deformable network is proposed to generate high-quality density maps, which captures crowd features more effectively and is more resistant to various noises. To capture the interdependence of pixels in density maps, Zhang et al. [42] propose a Relational Attention Network with a self-attention mechanism. Jiang et al. [19] develop the trellis encoder-decoder network, which includes a multi-scale encoder and a multi-path decoder, to generate high-quality density estimation maps.
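As a rough illustration of the density-map formulation described above, the following sketch converts point annotations into a density map whose integral equals the head count. The function names, the fixed `sigma`, and the kernel radius of three standard deviations are illustrative assumptions and are not taken from any of the cited methods.

```python
import numpy as np


def gaussian_kernel(size: int, sigma: float) -> np.ndarray:
    """Return a 2-D Gaussian kernel of odd side length `size` that sums to 1."""
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()


def points_to_density(points, height, width, sigma=4.0):
    """Place one unit-mass Gaussian per annotated head given as (x, y)."""
    radius = int(3 * sigma)  # assumed truncation radius
    size = 2 * radius + 1
    kernel = gaussian_kernel(size, sigma)
    # Work on a padded canvas so kernels near the border can be pasted whole.
    padded = np.zeros((height + 2 * radius, width + 2 * radius))
    for x, y in points:
        col, row = int(round(x)), int(round(y))
        if 0 <= row < height and 0 <= col < width:
            padded[row:row + size, col:col + size] += kernel
    # Cropping discards the mass that falls outside the image.
    return padded[radius:radius + height, radius:radius + width]


density = points_to_density([(20, 15), (40, 30)], height=64, width=64)
estimated_count = density.sum()  # close to 2 for heads away from the border
```

A regression network is then trained to predict such a map from the image, and the predicted count is obtained by summing the output.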

To achieve better performance, some recent methods exploit unlabeled or synthetic data for training. Based on the generated synthetic data [35], the supervised learning and domain adaptation strategies are proposed to improve the counting accuracy significantly. In [25], the ranked image sets are generated from unlabeled data for counting applications suffering from a shortage of labeled data. Sam et al. [29] propose the Grid Winner-Take-All autoencoder to learn features from unlabeled images such that weight update of neurons in convolutional output maps is restricted to the maximally activated neuron in a fixed spatial cell.

3 The VisDrone-CC2020 Challenge

The VisDrone-CC2020 Challenge aims to count pedestrian heads in video frames taken from drones. Participants are required to submit their algorithms and evaluate them on the released VisDrone-CC2020 dataset. They are allowed to use external training data to improve the model. However, it is forbidden to submit different variants of the same algorithm. Meanwhile, submissions with a detailed algorithm description are granted authorship in the ECCV 2020 workshop proceedings.

Dataset Type Frames Max Min Ave Total Year
UCSD [5] surveillance 2008
UCF_CC_50 [17] surveillance 2013
Mall [26] surveillance 2013
WorldExpo [43] surveillance 2015
Shanghaitech A [45] surveillance 2016
Shanghaitech B [45] surveillance 2016
AHU-Crowd [16] surveillance 2016
CARPK [15] aerial 2017
Smart-City [44] surveillance 2018
UCF-QNRF [18] surveillance 2018
FDST [11] surveillance 2019
GCC [35] synthetic 2019
Crowd Surveillance [41] surveillance - - 2019
VisDrone-CC2020 aerial 2020
Table 1: Comparison of existing crowd counting datasets. We summarize the maximal, minimal, average and total count in the datasets.

3.1 Dataset

The VisDrone-CC2020 dataset is formed by 3,360 images with the resolution of . As shown in Fig. 1, the data is captured by various drone-mounted cameras to keep diversity, for different scenarios across different cities in China (i.e., Tianjin, Guangzhou, Daqing, and Hong Kong). Moreover, we divide the dataset into the training and testing subsets, with 2,460 and 900 images, respectively. To avoid overfitting to particular scenes, we collect the images in the training and testing subsets at different but similar locations. To analyze the performance of algorithms thoroughly, we define three visual attributes, described as follows.

  • Scale indicates the size of objects. We define two categories of scales including Large (the diameter of objects pixels) and Small (the diameter of objects pixels).

  • Illumination has significant influence on the appearance of objects. We define three kinds of illumination conditions, i.e., Cloudy, Sunny, and Night.

  • Density is the number of objects in each frame. According to the average number of objects in each frame, we divide the dataset into two density levels. Crowded density indicates that the number of objects in each frame is larger than , and Sparse density indicates that the number of objects in each frame is less than .

Figure 1: Annotation exemplars in the VisDrone-CC2020 dataset. Different colors indicate different person heads.

3.1.1 Comparison to previous datasets.

Compared with the previous datasets focusing on crowd counting on surveillance scenes, our proposed dataset brings new challenges on drone-captured scenes as follows.

  • Compared to objects in video sequences recorded on surveillance scenes, the scales of objects in our dataset are extremely small (even less than pixels) because of high shooting altitude by drones. It is difficult for the model to extract sufficient appearance information to describe the objects.

  • In our dataset, the crowds are scattered on video frames (see Fig. 1). Each crowd contains a few to dozens of people.

  • Since the crowds are dynamically scattered on video frames, each crowd is surrounded by various backgrounds. The cluttered background is another challenge in the proposed dataset.

3.2 Evaluation Protocol

Following the previous works [43, 45], each algorithm is evaluated by the mean absolute error (MAE) and mean squared error (MSE) between the predicted and ground-truth numbers of people heads, which are defined as follows.

$$\mathrm{MAE} = \frac{1}{\sum_{i=1}^{K} N_i} \sum_{i=1}^{K} \sum_{j=1}^{N_i} \left| z_{i,j} - \hat{z}_{i,j} \right|, \qquad \mathrm{MSE} = \sqrt{\frac{1}{\sum_{i=1}^{K} N_i} \sum_{i=1}^{K} \sum_{j=1}^{N_i} \left( z_{i,j} - \hat{z}_{i,j} \right)^{2}},$$

where $K$ is the number of video clips, $N_i$ is the number of frames in the $i$-th video clip, and $z_{i,j}$ and $\hat{z}_{i,j}$ are the ground-truth and estimated number of people in the $j$-th frame of the $i$-th video clip, respectively. MAE and MSE describe the accuracy and robustness of the estimation, where MAE is the primary metric to rank the counting algorithms.
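The evaluation protocol can be sketched in a few lines of Python. This is a minimal sketch, assuming that errors are pooled over all frames of all clips and that MSE denotes the root of the mean squared error, as is common in the crowd counting literature. The function name `evaluate_counts` and the nested-list input layout are hypothetical and do not describe the official evaluation toolkit.

```python
import math


def evaluate_counts(gt_counts, pred_counts):
    """Compute (MAE, MSE) over all frames of all video clips.

    gt_counts and pred_counts are lists with one entry per clip; each
    entry is a list of per-frame head counts for that clip.
    """
    abs_sum, sq_sum, n_frames = 0.0, 0.0, 0
    for gt_clip, pred_clip in zip(gt_counts, pred_counts):
        if len(gt_clip) != len(pred_clip):
            raise ValueError("each clip needs one prediction per frame")
        for z, z_hat in zip(gt_clip, pred_clip):
            diff = z - z_hat
            abs_sum += abs(diff)
            sq_sum += diff * diff
            n_frames += 1
    mae = abs_sum / n_frames
    mse = math.sqrt(sq_sum / n_frames)  # root of the mean squared error
    return mae, mse


# Two clips with two frames and one frame, respectively.
mae, mse = evaluate_counts([[10, 20], [30]], [[12, 17], [30]])
```

Because the squared term penalizes large per-frame errors more heavily, the second value is never smaller than the first.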

4 Results and Analysis

In this section, we evaluate the crowd counting methods submitted to the VisDrone-CC2020 Challenge and discuss the results thoroughly in terms of different attributes. Then, we point out several potential research directions in this field.

4.1 Submitted Methods

Among the entries received in the VisDrone-CC2020 Challenge, 14 submitted results with the correct format and a complete algorithm description. In the following, we briefly overview the submitted algorithms included in the crowd counting task of the VisDrone2020 Challenge and provide the corresponding descriptions in Appendix 0.A.

The majority of the submitted algorithms are improved from state-of-the-art methods such as AutoScale [39], CSRNet [22] and SANet [4]. FPNCC (0.A.1) is based on AutoScale [39]. BVCC (0.A.2) is a double-stream network that extracts optical flow and frame difference information. Five algorithms are variants of CSRNet [22], including PDCNN (0.A.4), CSRNet+ (0.A.6), SCNet (0.A.8), CSR-SSOF (0.A.9) and Soft-CSRNET (0.A.10). To extract multi-scale features of the target object and incorporate larger context, M-SFANet (0.A.7) improves SFANet [46] by adding two modules called ASPP and CAN. MILLENNIUM (0.A.12) uses multi-view data (i.e., the real-world RGB image and the corresponding crowd heatmap) to construct two deep neural networks for crowd counting. DevaNetv2 (0.A.5) employs an attention mechanism and feature pyramids to deal with different scales of people heads. SANet (0.A.13) is an encoder-decoder based Scale Aggregation Network [4] that extracts multi-scale features with scale aggregation modules and generates high-resolution density maps by using a set of transposed convolutions. Besides, two submissions are state-of-the-art methods trained on the VisDrone-CC2020 dataset, i.e., CFF (0.A.3) and CANet (0.A.11). CFF (0.A.3) proposes supervised focus from segmentation to focus on areas of interest and from global density to learn a matching global density. CANet (0.A.11) combines features obtained using multiple receptive field sizes and learns the importance of each such feature at each image location [24]. ResNet-FPN101 (0.A.14) is a baseline method that uses a ResNet-101 backbone to regress the density maps.

Figure 2: Comparison of submissions in the VisDrone-CC2020 Challenge.

4.2 Overall Results

The overall results of all submissions are shown in Fig. 2. FPNCC (0.A.1) obtains the best overall MAE score of and MSE score of . This is attributed to the Learning to Scale (L2S) module, which rescales dense regions into similar density levels and thus mitigates the imbalance of density values in the dataset. BVCC (0.A.2) ranks second by using a double-stream network and introducing external synthetic data generated by GTA5 [35]. After that, CFF (0.A.3) benefits from two kinds of supervision derived from point annotations (i.e., segmentation and global density) for density-based counting, achieving MAE score of and MSE score of . PDCNN (0.A.4) focuses on accurate counting under different illumination conditions and obtains similar performance to CFF (0.A.3). The following two methods focus on enhancing the training data. DevaNetv2 (0.A.5) (rank 5) splits each image into sub-images, where each sub-image is processed by several data augmentation steps such as random rotation, random flip, random color processing (including brightness, saturation, and contrast), and normalization. Finally, sub-images are randomly chosen to merge into one new image. In contrast, CSRNet+ (0.A.6) (rank 6) upsamples the training sequences to equalize the illumination distribution and is pretrained on the DLR-Aerial Crowd Dataset [2].
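The split-augment-merge strategy described for DevaNetv2 can be illustrated with a simplified sketch. It is not the authors' implementation: the 2 x 2 grid, the restriction to horizontal flips, and the shuffling of tiles within a single image are assumptions made here for brevity, and the function name `mosaic_augment` is hypothetical. The point of the sketch is that the head annotations must be transformed together with the pixels so that the count label stays valid.

```python
import numpy as np


def mosaic_augment(image, points, rng, grid=2):
    """Split an image into grid x grid tiles, flip some, and reassemble them.

    image:  array of shape (H, W) or (H, W, C).
    points: list of integer head coordinates (x, y).
    rng:    numpy random Generator.
    Returns the new image and the transformed head coordinates.
    """
    h, w = image.shape[:2]
    th, tw = h // grid, w // grid  # tile size; any remainder is cropped away
    tiles = []
    for gr in range(grid):
        for gc in range(grid):
            y0, x0 = gr * th, gc * tw
            tile = image[y0:y0 + th, x0:x0 + tw].copy()
            pts = [(x - x0, y - y0) for x, y in points
                   if x0 <= x < x0 + tw and y0 <= y < y0 + th]
            if rng.random() < 0.5:  # random horizontal flip
                tile = tile[:, ::-1]
                pts = [(tw - 1 - x, y) for x, y in pts]
            tiles.append((tile, pts))
    order = rng.permutation(len(tiles))  # random placement of the tiles
    out = np.zeros_like(image[:grid * th, :grid * tw])
    out_points = []
    for slot, idx in enumerate(order):
        y0, x0 = (slot // grid) * th, (slot % grid) * tw
        tile, pts = tiles[idx]
        out[y0:y0 + th, x0:x0 + tw] = tile
        out_points.extend((x + x0, y + y0) for x, y in pts)
    return out, out_points


rng = np.random.default_rng(0)
image = np.arange(64).reshape(8, 8)
new_image, new_points = mosaic_augment(image, [(1, 1), (6, 2), (3, 7)], rng)
```

When the image size is divisible by the grid size, the number of annotated heads is unchanged, so the augmented sample remains a valid training example for counting.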

4.3 Attribute-based Results

For thorough evaluation, we report the results in terms of different attributes in Table 2. It can be seen that the best five performers achieve the best performance on various subsets. Specifically, improved from the state-of-the-art AutoScale method [39], FPNCC (0.A.1) obtains the best MAE score in four attribute subsets, i.e., Small, Cloudy, Sunny, and Sparse. With the help of the synthetic dataset [35] to simulate diverse environments, BVCC (0.A.2) achieves the best MSE score in two attribute subsets, namely Cloudy and Crowded. PDCNN (0.A.4) achieves the best MSE score in terms of the Large and Night attributes. This is because PDCNN (0.A.4) uses different networks to handle the day and night illumination scenarios. DevaNetv2 (0.A.5) obtains the best MAE score of 12.84 in the Large attribute, which shows the effectiveness of the data augmentation strategy.

Method Large (MAE MSE) Small (MAE MSE) Cloudy (MAE MSE) Sunny (MAE MSE) Night (MAE MSE) Crowded (MAE MSE) Sparse (MAE MSE)
FPNCC(0.A.1) 13.74 18.37 10.27 13.15 10.83 15.70 11.04 13.16 15.39 18.63 15.40 19.77 9.49 12.28
BVCC(0.A.2) 13.48 18.03 11.61 13.76 11.78 15.55 12.32 15.29 14.19 16.37 13.66 18.23 11.61 13.87
CFF(0.A.3) 13.73 18.38 13.58 16.58 13.28 17.73 13.35 15.99 15.32 18.58 18.69 21.97 10.72 13.93
PDCNN(0.A.4) 13.01 15.66 14.32 17.18 14.84 17.62 13.43 16.56 11.36 13.10 15.79 18.80 12.63 15.16
DevaNetv2(0.A.5) 12.84 15.84 17.18 22.19 13.59 17.97 15.54 18.66 20.82 26.65 14.29 17.35 16.12 21.23
CSRNet+(0.A.6) 14.52 20.38 16.31 22.21 12.06 17.55 13.59 16.91 30.23 35.75 17.82 22.70 14.31 20.78
M-SFANet(0.A.7) 20.59 27.08 12.65 16.08 15.56 19.02 11.22 14.81 25.88 34.16 17.24 21.40 15.01 21.04
SCNet(0.A.8) 19.64 37.02 14.64 22.15 13.77 22.49 11.53 14.78 35.47 55.69 18.40 26.61 15.62 30.34
CSR-SSOF(0.A.9) 21.73 30.23 18.34 25.86 12.43 18.53 15.94 19.66 49.02 52.90 17.47 24.23 20.98 29.51
Soft-CSRNET(0.A.10) 29.31 44.10 25.21 32.92 15.37 20.12 19.73 23.34 75.53 79.15 21.86 26.02 29.74 43.16
CANet(0.A.11) 30.92 38.33 24.36 38.61 11.97 16.95 23.40 27.22 79.18 80.92 24.06 29.25 28.67 42.95
MILLENNIUM(0.A.12) 47.39 53.70 54.46 76.89 65.24 85.26 44.71 51.83 24.65 32.00 90.33 103.25 29.22 35.36
SANet(0.A.13) 56.81 66.56 57.91 64.52 44.77 52.55 69.91 77.17 70.67 73.68 70.14 78.74 50.13 56.16
ResNet-FPN(0.A.14) 98.64 109.53 74.18 85.09 80.94 94.27 84.51 96.65 91.95 97.54 131.11 133.18 56.67 64.56
Table 2: Results on the VisDrone-CC2020 dataset.

4.4 Discussion

As presented in Table 2, it can be seen that the best method FPNCC (0.A.1) achieves MAE score. That is, there are errors in average to count the people heads. This is still not satisfactory for real applications. We summarize some topics worth exploring in crowd counting on drone-captured scenes as follows.

  • Groundtruth Density Map. The majority of existing methods convert point-based groundtruth to a density map using a Gaussian kernel for model training. Although the geometry-adaptive kernel-based density map generation [45] is widely used on surveillance scenes, it may fail in drone-captured scenarios where the crowd is relatively sparse.

  • Unsupervised Learning. Since point-level annotation is expensive to collect, some methods leverage unlabeled data or synthetic external data to improve the crowd counting performance. To narrow the gap between different datasets, we believe the domain adaptation technique will attract much interest in the field.

  • Head Localization. Besides crowd counting, head localization is also an important task in safety control applications. However, the submitted algorithms focus on estimating the number of people heads in a frame rather than their accurate locations. Previous works [27, 18, 37] usually output a localization map and post-process it by finding the local maxima above a threshold. The two sub-tasks should be complementary and support each other.
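The post-processing step mentioned in the head localization item, finding local maxima above a threshold, can be sketched as follows. This is a generic illustration rather than the procedure of [27, 18, 37]: the 8-connected neighbourhood, the threshold value, and the function name `local_maxima` are assumptions.

```python
import numpy as np


def local_maxima(loc_map, threshold):
    """Return (row, col) positions that exceed `threshold` and are not
    smaller than any of their 8 neighbours."""
    loc_map = np.asarray(loc_map, dtype=float)
    h, w = loc_map.shape
    # Pad with -inf so border pixels only compete with real neighbours.
    padded = np.pad(loc_map, 1, mode="constant", constant_values=-np.inf)
    is_peak = loc_map > threshold
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            neighbour = padded[1 + dr:1 + dr + h, 1 + dc:1 + dc + w]
            is_peak &= loc_map >= neighbour
    # Note: a flat plateau above the threshold yields several detections.
    return [(int(r), int(c)) for r, c in np.argwhere(is_peak)]


response = np.zeros((5, 5))
response[1, 1] = 0.9   # confident head
response[3, 3] = 0.6   # weaker head
response[3, 4] = 0.2   # below the threshold, ignored
heads = local_maxima(response, threshold=0.5)
```

The number of detected peaks gives a count, which is one simple way in which localization and counting can support each other.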

5 Conclusions

In this paper, we summarize the results of all submitted crowd counting algorithms in the VisDrone-CC2020 Challenge. To evaluate the performance of algorithms, we provide a dataset formed by 3,360 images, i.e., 2,460 images for training and 900 images for testing. We provide annotated coordinates for people. Specifically, 14 crowd counting algorithms from 15 institutes are submitted to the VisDrone-CC2020 Challenge. The top three performers are FPNCC (0.A.1), BVCC (0.A.2), and CFF (0.A.3), achieving , and MAE scores, respectively. However, there still remains much room for improvement, such as the localization accuracy of heads. For future work, we plan to extend the dataset with more attributes and annotations to advance the state-of-the-art. We hope our work can largely boost the development of crowd counting on drone-captured scenes.


This work was supported in part by the National Natural Science Foundation of China under Grant 61876127 and Grant 61732011, in part by Natural Science Foundation of Tianjin under Grant 17JCZDJC30800.

Appendix 0.A Submitted Crowd Counting Algorithms

In this appendix, we provide a short summary of all crowd counting algorithms that were considered in the VisDrone-CC2020 Challenge.

0.A.1 Feature Pyramid Network for Crowd Counting (FPNCC)

Dingkang Liang, Chenfeng Xu, Yongchao Xu, Xiang Bai

FPNCC is based on AutoScale [39] (an extension of L2S [40]) with the VGG16-based FPN backbone, which automatically scales dense regions into similar and appropriate density levels (see Fig. 3). Meanwhile, we separate the overlapped blobs and decompose the original accumulated density values in the density map. To preserve sufficient spatial information for accurate counting, we discard the last pooling layer and all following fully connected layers, as well as the pooling layer between stage4 and stage5. Note that we use a model pre-trained on the ImageNet database. Besides the traditional MSE loss, we also use an SSIM loss to improve the quality of the density map and thereby aid the final counting performance.

Figure 3: The framework of FPNCC.

0.A.2 Bi-Path Video Crowd Counting (BVCC)

Zhiyuan Zhao, Tao Han, Junyu Gao, Qi Wang, Xuelong Li
{gjy3035, crabwq}

To deal with the challenges such as varying scale, unstable exposure, and scene migration, BVCC is proposed to automatically understand the crowd from the visual data collected from drones. First, to alleviate the background noise generated in cross-scene testing, a double-stream crowd counting model is proposed, which extracts optical flow and frame difference information as an additional branch. Besides, to improve the generalization ability of the model at different scales and time, we randomly combine a variety of data transformation methods to simulate some unseen environments. To tackle the crowd density estimation problem under extreme dark environments, we introduce synthetic data generated by GTAV [35].

0.A.3 Counting with Focus for Free (CFF)

Wei Xu

CFF [30] proposes two kinds of free supervision including segmentation maps and global density. Besides, an improved kernel size estimator is proposed to facilitate density estimation and the focus from segmentation. During training, we augment the images by randomly cropping patches. Code is available at

0.A.4 Parallel Dilated Convolution Neural Network (PDCNN)

Pei Lyu, Lei Zhao, Jieru Wang, Yingnan Lin
{, michael.zhao,, lynn.lin}

PDCNN is an illumination-aware counting model based on CSRNet [22]. Since there are three categories of illumination conditions in the dataset (cloudy, sunny and night), we first distinguish between day and night illumination. Then, we use different networks to handle the different scenarios. We use geometry-adaptive kernels to tackle the different congested scenes, and a Gaussian kernel to blur each head annotation. We crop patches from each image at different locations for data augmentation.
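A geometry-adaptive kernel sets the Gaussian bandwidth of each head from the distances to its nearest annotated neighbours. The sketch below is a minimal illustration of that idea in the spirit of [45], not the PDCNN code: the defaults `k=3` and `beta=0.3` are commonly used values assumed here, and `fallback` is a hypothetical bandwidth for a frame with a single head.

```python
import math


def adaptive_sigmas(points, k=3, beta=0.3, fallback=4.0):
    """Return one Gaussian sigma per head.

    sigma_i = beta * (mean distance from head i to its k nearest heads).
    points is a list of (x, y) coordinates.
    """
    sigmas = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        if nearest:
            sigmas.append(beta * sum(nearest) / len(nearest))
        else:
            sigmas.append(fallback)  # no neighbour to measure against
    return sigmas


# Three heads on a line, 5 pixels apart; with k=1 every sigma is 0.3 * 5.
sigmas = adaptive_sigmas([(0, 0), (3, 4), (6, 8)], k=1)
```

In sparse drone footage the nearest neighbours can be far away, which makes the bandwidth much larger than the head itself. This illustrates why such kernels may be unsuitable when the crowd is sparse.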

0.A.5 Crowd Counting with Attentional Mechanism and Feature Pyramids (DevaNetv2)

Ye Tian, Chenzhen Duan, Xiaoqing Xu, Zhiwei Wei
{19s151092, 18s151541, 19s051052, 19s051024}

In the field of aerial image crowd counting, grid density maps are commonly used as labels together with an “unprejudiced” neural network block, which leads to a series of problems. On the one hand, the scale and shape of people vary considerably with the shooting angle and flight height of the UAV. On the other hand, the luminance of the image changes with the shooting time. To address these problems, we use Gaussian density maps and data augmentation methods to improve the robustness of our model. We use an attention mechanism and feature pyramids so that the model adjusts its predictions based on global information. We also design an auxiliary loss to ease model optimization. The network is implemented by the framework

0.A.6 CSRNet on Drone-based Scenarios (CSRNet+)

Florian Krüger, Thomas Golda
{florian.krueger, thomas.golda}

CSRNet+ is modified from CSRNet [22] to perform crowd counting on drone-based scenarios. We pretrain our model using only the training set of the DLR-Aerial Crowd Dataset (DLRACD) [2]. Then we finetune our model on the VisDrone2020-CC training dataset. For training on the VisDrone2020-CC training dataset, we annotate the ground sampling distances (GSD) for the training set by hand and then generate density maps from those as in [2]. Furthermore, we upsample the training sequences to equalize the illumination distribution. For night sequences, we only double them since there are only in the training set.
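Equalizing the illumination distribution by upsampling amounts to repeating sequences of under-represented classes. The following sketch shows one way to do this under stated assumptions: the rounding rule, the optional per-label cap (mirroring the remark that night sequences are only doubled), and the name `oversample_sequences` are illustrative choices rather than the CSRNet+ implementation.

```python
from collections import Counter


def oversample_sequences(sequences, max_repeats=None):
    """Repeat sequences so that illumination classes are roughly balanced.

    sequences:   list of (sequence_id, illumination_label) pairs.
    max_repeats: optional dict that caps the repeat factor per label.
    """
    max_repeats = max_repeats or {}
    counts = Counter(label for _, label in sequences)
    target = max(counts.values())  # size of the largest class
    balanced = []
    for seq_id, label in sequences:
        repeats = max(1, round(target / counts[label]))
        repeats = min(repeats, max_repeats.get(label, repeats))
        balanced.extend([(seq_id, label)] * repeats)
    return balanced


train = [("s1", "sunny"), ("s2", "sunny"), ("s3", "sunny"), ("s4", "sunny"),
         ("c1", "cloudy"), ("c2", "cloudy"), ("n1", "night")]
balanced = oversample_sequences(train, max_repeats={"night": 2})
```

Capping the repeat factor limits how often the same few sequences are seen, which can reduce the risk of overfitting to them.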

0.A.7 Multi-Scale Aware based SFANet (M-SFANet)

Zhipeng Luo, Bin Dong, Yuehan Yao, Zhenyu Xu
{luozp, dongb, yaoyh, xuzy}

We use M-SFANet [33], which modifies the neural network architecture of SFANet [46] for crowd counting. Specifically, the method integrates two modules to improve the performance, i.e., atrous spatial pyramid pooling (ASPP) [6] for multi-scale features and a context-aware network (CAN) [24] for the corresponding contextual information. During the training stage, the MSE loss is used for the density map and the cross entropy loss is used for the attention map. The whole dataset is split into folds, which means that independent models are trained. During the inference stage, we use separated min clip to deal with videos in night scenes.

0.A.8 Scaled Cascade Network for crowd counting on Drone data (SCNet)

Omar Elharrouss, Noor Almaadeed, Khalid Abualsaud, Amr Mohamed, Tamer Khattab, Ali Al-Ali, Somaya Al-Maadeed

We propose a crowd counting method based on deep convolutional neural networks (CNN) that extracts high-level features to generate density maps, which represent an estimation of the crowd count with respect to the scale variations in the scene. Specifically, SCNet is a CNN-based model that adds a cascade network after the frontend block, inspired by those used in SPN [7], CSRNet [22], and AGRD [28]. The cascade block is inspired by the bi-directional cascade network used for edge detection in [14]. To handle the scale variations, a Scale Enhancement Module (SEM) is introduced. The architecture employs sequential dilated convolution blocks with different kernels.

0.A.9 CSRNet on Segmented Scenes with Optical Flow (CSR-SSOF)

Siyang Pan, Shidong Liu, Binyu Zhang, Yanyun Zhao
{pansiyang, lsd215, jl-lagrange, zyy}

CSR-SSOF is derived from CSRNet [22] with several additional modules offering improvements. First, optical flow is computed based on the Gunnar Farneback algorithm [12] to extract temporal information. RGB images concatenated with the corresponding optical flow are fed into the modified CSRNet, which takes 5-channel data as input. Furthermore, we replace the last layer with the density map estimator (DME) in SANet [4] in order to generate high-resolution density maps. The patch-based test scheme in SANet is also applied in our method. Considering the interference of irrelevant regions in which crowds rarely appear, an improved semantic segmentation model HRNetV2 [31] is used to block out building areas. We extract the contours of segmentation maps following [32] and fill the small holes. Morphological erosion and dilation are used to smooth and narrow down the boundaries of building areas. Besides, the V channel of the HSV model is chosen for the division of day and night sequences, and we take a series of measures to cope with the night ones. Night images are first enhanced by RetinexNet [36] to make the buried details visible for counting. In addition, we assume that buildings in night sequences exist in pixels with below-average V values, so we directly binarize the original night images accordingly to perform segmentation. All final counting results are obtained by the integral of density maps within non-building areas. With partial parameters pre-trained on ImageNet, the modified CSRNet is finetuned on the ShanghaiTech Part B [45] dataset and the VisDrone2020 train set successively. For segmentation, HRNetV2 is first trained on the UDD5 [8] dataset, making use of Cityscapes [9] pre-trained weights. Then we manually annotate some images of the VisDrone2020 train set to distinguish building areas and use the annotation to finetune our model. For the night image enhancement, we use the RetinexNet model pre-trained on the LOL [36] dataset and the RAISE [10] dataset.
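Two of the steps above lend themselves to a short sketch: using the V channel of HSV to separate day from night and to binarize night frames, and integrating the density map only over non-building pixels. The code is a hedged illustration and not the CSR-SSOF implementation. The day/night threshold `v_threshold=0.3` and all function names are assumptions, and the V channel is computed directly as the per-pixel maximum of R, G, and B.

```python
import numpy as np


def value_channel(rgb):
    """HSV value channel of an 8-bit RGB image, scaled to [0, 1]."""
    return rgb.astype(np.float64).max(axis=2) / 255.0


def is_night(rgb, v_threshold=0.3):
    """Classify a frame as night when its mean V is low (assumed threshold)."""
    return float(value_channel(rgb).mean()) < v_threshold


def night_building_mask(rgb):
    """Mark pixels with below-average V as building area in a night frame."""
    v = value_channel(rgb)
    return v < v.mean()


def masked_count(density, building_mask):
    """Integrate the density map over non-building pixels only."""
    return float(density[~building_mask].sum())


# Toy night frame: dark left half (treated as building), brighter right half.
frame = np.full((4, 4, 3), 20, dtype=np.uint8)
frame[:, 2:] = 60
mask = night_building_mask(frame)
count = masked_count(np.full((4, 4), 0.5), mask)
```

Masking removes density that a network may hallucinate on facades or rooftops, at the cost of missing people who actually stand in the masked regions.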

0.A.10 Soft Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes (Soft-CSRNET)

Bakour Imene, Bouchali Hadia Nesma
{imene.bakour, hadianesma.bouchali}

Soft-CSRNET is modified from the network for Congested Scene Recognition called CSRNet [22] to provide a data-driven deep learning method that can understand highly congested scenes and perform accurate count estimation as well as present high-quality density maps. We try several possible configurations of the network to optimize it for real-time use. To that end, we focus on the counting accuracy while sacrificing as little resolution of the estimated density maps as possible. The proposed network is composed of two major components: a convolutional neural network (CNN) as the front-end for 2D feature extraction and a dilated CNN for the back-end, which uses dilated kernels to deliver larger receptive fields and to replace pooling operations. Our Soft-CSRNET contains fewer parameters and convolutional layers in the backend.

0.A.11 Context-Aware Crowd Counting (CANet)

Laihui Ding

CANet [24] is an end-to-end network that leverages multiple receptive field scales to learn and combine features at each image location. Thus the contextual information is encoded for predicting accurate crowd density.

0.A.12 MultI-view fuLly convoLutional nEural Network for crowd couNting In drone-captUred iMages (MILLENNIUM)

Giovanna Castellano, Ciro Castiello, Marco Cianciotta, Corrado Mencar, Gennaro Vessio
{giovanna.castellano, ciro.castiello, marco.cianciotta, corrado.mencar}

We couple multi-view data and multi-functional deep learning for efficient crowd counting in aerial images. Specifically, we exploit the real-world RGB image and the corresponding crowd heatmap as multiple views of the same scene containing a crowd to create a powerful regression model for crowd counting. Two deep neural networks, one for each view, are jointly trained so that their weights are updated at the same time. After training, only the network that processes the real-world images is retained as final model for crowd counting. This provides an accurate light-weight model that is suitable to meet the limited computational resources of a UAV. The final model is able to provide the crowd count with an average processing speed of about frames per second.

0.A.13 Scale Aggregation Network (SANet)

Shuang Qiu, Zhijian Zhao
{qiushuang, zhaozhijian}

SANet is the encoder-decoder based Scale Aggregation Network [4]. Specifically, the encoder is used to exploit multi-scale features by scale aggregation while the decoder is used to generate high-resolution density maps by transposed convolutions. To consider local correlation in density maps, both Euclidean loss and local pattern consistency loss are used to train the network for better performance. We finetune the network on additional datasets such as Shanghaitech B [45], and Crowd surveillance [41].

0.A.14 Residual Network Feature Pyramid Network_101 (ResNet-FPN101)

Muhammad Saqib, Sultan Daud Khan

ResNet-FPN101 is the baseline crowd counting method. Specifically, we convert the dot-level annotation to bounding box-level annotation. We train and evaluate the network with ResNet-FPN-101 as a backbone architecture.
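Converting dot-level annotations to bounding boxes can be done by centring a box on each annotated head. The sketch below assumes a fixed square box of `box_size` pixels clipped to the image. The baseline description does not state the box size or the conversion rule, so both, as well as the function name, are hypothetical.

```python
def points_to_boxes(points, width, height, box_size=16):
    """Turn head points (x, y) into (x1, y1, x2, y2) boxes clipped to the image."""
    half = box_size / 2.0
    boxes = []
    for x, y in points:
        x1 = max(0.0, x - half)
        y1 = max(0.0, y - half)
        x2 = min(float(width), x + half)
        y2 = min(float(height), y + half)
        boxes.append((x1, y1, x2, y2))
    return boxes


# The second head lies near the top-left corner, so its box is clipped.
boxes = points_to_boxes([(100, 50), (2, 3)], width=200, height=100)
```

A detector trained on such boxes yields a count as the number of detections above a confidence threshold, which is a different route to counting than density regression.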


  • [1] C. Arteta, V. S. Lempitsky, and A. Zisserman (2016) Counting in the wild. In ECCV, Vol. 9911, pp. 483–498. Cited by: §1.
  • [2] R. Bahmanyar, E. Vig, and P. Reinartz (2019) MRCNet: crowd counting and density map estimation in aerial and ground imagery. CoRR abs/1909.12743. Cited by: §0.A.6, §4.2.
  • [3] Y. Cai, L. Wen, L. Zhang, D. Du, W. Wang, and P. Zhu (2020) Rethinking object detection in retail stores. CoRR abs/2003.08230. Cited by: §1.
  • [4] X. Cao, Z. Wang, Y. Zhao, and F. Su (2018) Scale aggregation network for accurate and efficient crowd counting. In ECCV, V. Ferrari, M. Hebert, C. Sminchisescu, and Y. Weiss (Eds.), Vol. 11209, pp. 757–773. Cited by: §0.A.13, §0.A.9, §4.1.
  • [5] A. B. Chan, Z. J. Liang, and N. Vasconcelos (2008) Privacy preserving crowd monitoring: counting people without people models or tracking. In CVPR, Cited by: Table 1.
  • [6] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2018) DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. TPAMI 40 (4), pp. 834–848. Cited by: §0.A.7.
  • [7] X. Chen, Y. Bin, N. Sang, and C. Gao (2019) Scale pyramid network for crowd counting. In WACV, pp. 1941–1950. Cited by: §0.A.8.
  • [8] Y. Chen, Y. Wang, P. Lu, Y. Chen, and G. Wang (2018) Large-scale structure from motion with semantic constraints of aerial images. In PRCV, J. Lai, C. Liu, X. Chen, J. Zhou, T. Tan, N. Zheng, and H. Zha (Eds.), Vol. 11256, pp. 347–359. Cited by: §0.A.9.
  • [9] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele (2016) The Cityscapes dataset for semantic urban scene understanding. In CVPR, pp. 3213–3223. Cited by: §0.A.9.
  • [10] D. Dang-Nguyen, C. Pasquini, V. Conotter, and G. Boato (2015) RAISE: a raw images dataset for digital image forensics. In MMSys, W. T. Ooi, W. Feng, and F. Liu (Eds.), pp. 219–224. Cited by: §0.A.9.
  • [11] Y. Fang, B. Zhan, W. Cai, S. Gao, and B. Hu (2019) Locality-constrained spatial transformer network for video crowd counting. In ICME, pp. 814–819. Cited by: Table 1.
  • [12] G. Farnebäck (2003) Two-frame motion estimation based on polynomial expansion. In SCIA, J. Bigün and T. Gustavsson (Eds.), Vol. 2749, pp. 363–370. Cited by: §0.A.9.
  • [13] G. Gao, J. Gao, Q. Liu, Q. Wang, and Y. Wang (2020) CNN-based density estimation and crowd counting: A survey. CoRR abs/2003.12783. Cited by: §2.
  • [14] J. He, S. Zhang, M. Yang, Y. Shan, and T. Huang (2019) Bi-directional cascade network for perceptual edge detection. In CVPR, pp. 3828–3837. Cited by: §0.A.8.
  • [15] M. Hsieh, Y. Lin, and W. H. Hsu (2017) Drone-based object counting by spatially regularized regional proposal network. In ICCV, Cited by: §1, Table 1.
  • [16] Y. Hu, H. Chang, F. Nian, Y. Wang, and T. Li (2016) Dense crowd counting from still images with convolutional neural networks. J. Visual Communication and Image Representation 38, pp. 530–539. Cited by: Table 1.
  • [17] H. Idrees, I. Saleemi, C. Seibert, and M. Shah (2013) Multi-source multi-scale counting in extremely dense crowd images. In CVPR, pp. 2547–2554. Cited by: §1, §2.1, Table 1.
  • [18] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Máadeed, N. M. Rajpoot, and M. Shah (2018) Composition loss for counting, density map estimation and localization in dense crowds. In ECCV, pp. 544–559. Cited by: §1, §2.1, Table 1, 3rd item.
  • [19] X. Jiang, Z. Xiao, B. Zhang, X. Zhen, X. Cao, D. S. Doermann, and L. Shao (2019) Crowd counting and density estimation by trellis encoder-decoder networks. In CVPR, pp. 6133–6142. Cited by: §1, §2.2.
  • [20] B. Leibe, E. Seemann, and B. Schiele (2005) Pedestrian detection in crowded scenes. In CVPR, pp. 878–885. Cited by: §2.2.
  • [21] V. S. Lempitsky and A. Zisserman (2010) Learning to count objects in images. In NeurIPS, J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta (Eds.), pp. 1324–1332. Cited by: §1.
  • [22] Y. Li, X. Zhang, and D. Chen (2018) CSRNet: dilated convolutional neural networks for understanding the highly congested scenes. In CVPR, pp. 1091–1100. Cited by: §0.A.10, §0.A.4, §0.A.6, §0.A.8, §0.A.9, §1, §2.2, §4.1.
  • [23] N. Liu, Y. Long, C. Zou, Q. Niu, L. Pan, and H. Wu (2019) ADCrowdNet: an attention-injective deformable convolutional network for crowd understanding. In CVPR, pp. 3225–3234. Cited by: §1, §2.2.
  • [24] W. Liu, M. Salzmann, and P. Fua (2019) Context-aware crowd counting. In CVPR, pp. 5099–5108. Cited by: §0.A.11, §0.A.7, §4.1.
  • [25] X. Liu, J. van de Weijer, and A. D. Bagdanov (2019) Exploiting unlabeled data in CNNs by self-supervised learning to rank. TPAMI 41 (8), pp. 1862–1878. Cited by: §2.2.
  • [26] C. C. Loy, S. Gong, and T. Xiang (2013) From semi-supervised to transfer counting of crowds. In ICCV, pp. 2256–2263. Cited by: Table 1.
  • [27] Z. Ma, L. Yu, and A. B. Chan (2015) Small instance detection by integer programming on object density maps. In CVPR, pp. 3689–3697. Cited by: 3rd item.
  • [28] X. Pan, H. Mo, Z. Zhou, and W. Wu (2020) Attention guided region division for crowd counting. In ICASSP, pp. 2568–2572. Cited by: §0.A.8.
  • [29] D. B. Sam, N. N. Sajjan, H. Maurya, and R. V. Babu (2019) Almost unsupervised learning for dense crowd counting. In AAAI, pp. 8868–8875. Cited by: §2.2.
  • [30] Z. Shi, P. Mettes, and C. Snoek (2019) Counting with focus for free. In ICCV, pp. 4199–4208. Cited by: §0.A.3.
  • [31] K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X. Wang, W. Liu, and J. Wang (2019) High-resolution representations for labeling pixels and regions. CoRR abs/1904.04514. Cited by: §0.A.9.
  • [32] S. Suzuki and K. Abe (1985) Topological structural analysis of digitized binary images by border following. CVGIP 30 (1), pp. 32–46. Cited by: §0.A.9.
  • [33] P. Thanasutives, K. Fukui, M. Numao, and B. Kijsirikul (2020) Encoder-decoder based convolutional neural networks with multi-scale-aware modules for crowd counting. In ICPR, pp. 2382–2389. Cited by: §0.A.7, §4.1.
  • [34] M. Wang and X. Wang (2011) Automatic adaptation of a generic pedestrian detector to a specific traffic scene. In CVPR, pp. 3401–3408. Cited by: §2.2.
  • [35] Q. Wang, J. Gao, W. Lin, and Y. Yuan (2019) Learning from synthetic data for crowd counting in the wild. In CVPR, pp. 8198–8207. Cited by: §0.A.2, §2.1, §2.2, Table 1, §4.2, §4.3.
  • [36] C. Wei, W. Wang, W. Yang, and J. Liu (2018) Deep Retinex decomposition for low-light enhancement. In BMVC, pp. 155. Cited by: §0.A.9.
  • [37] L. Wen, D. Du, P. Zhu, Q. Hu, Q. Wang, L. Bo, and S. Lyu (2019) Drone-based joint density map estimation, localization and tracking with space-time multi-scale attention network. CoRR abs/1912.01811. Cited by: 3rd item.
  • [38] B. Wu and R. Nevatia (2005) Detection of multiple, partially occluded humans in a single image by Bayesian combination of edgelet part detectors. In ICCV, pp. 90–97. Cited by: §2.2.
  • [39] C. Xu, D. Liang, Y. Xu, S. Bai, W. Zhan, M. Tomizuka, and X. Bai (2019) AutoScale: learning to scale for crowd counting. CoRR abs/1912.09632. Cited by: §0.A.1, §4.1, §4.3.
  • [40] C. Xu, K. Qiu, J. Fu, S. Bai, Y. Xu, and X. Bai (2019) Learn to scale: generating multipolar normalized density maps for crowd counting. In ICCV, pp. 8381–8389. Cited by: §0.A.1.
  • [41] Z. Yan, Y. Yuan, W. Zuo, X. Tan, Y. Wang, S. Wen, and E. Ding (2019) Perspective-guided convolution networks for crowd counting. In ICCV, pp. 952–961. Cited by: §0.A.13, §2.1, Table 1.
  • [42] A. Zhang, J. Shen, Z. Xiao, F. Zhu, X. Zhen, X. Cao, and L. Shao (2019) Relational attention network for crowd counting. In ICCV, pp. 6787–6796. Cited by: §1, §2.2.
  • [43] C. Zhang, H. Li, X. Wang, and X. Yang (2015) Cross-scene crowd counting via deep convolutional neural networks. In CVPR, pp. 833–841. Cited by: §1, §2.1, §3.2, Table 1.
  • [44] L. Zhang, M. Shi, and Q. Chen (2018) Crowd counting via scale-adaptive convolutional neural network. In WACV, pp. 1113–1121. Cited by: Table 1.
  • [45] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma (2016) Single-image crowd counting via multi-column convolutional neural network. In CVPR, pp. 589–597. Cited by: §0.A.13, §0.A.9, §1, §1, §2.1, §2.2, §3.2, Table 1, 1st item.
  • [46] L. Zhu, Z. Zhao, C. Lu, Y. Lin, Y. Peng, and T. Yao (2019) Dual path multi-scale fusion networks with attention for crowd counting. CoRR abs/1902.01115. Cited by: §0.A.7.