Drone-based Joint Density Map Estimation, Localization and Tracking with Space-Time Multi-Scale Attention Network

by   Longyin Wen, et al.
JD.com, Inc.
Tianjin University

This paper proposes a space-time multi-scale attention network (STANet) to solve density map estimation, localization and tracking in dense crowds of video clips captured by drones with arbitrary crowd density, perspective, and flight altitude. Our STANet method aggregates multi-scale feature maps in sequential frames to exploit the temporal coherency, and then predicts the density maps, localizes the targets, and associates them in crowds simultaneously. A coarse-to-fine process is designed to gradually apply the attention module on the aggregated multi-scale feature maps, forcing the network to exploit the discriminative space-time features for better performance. The whole network is trained in an end-to-end manner with a multi-task loss formed by three terms, i.e., the density map loss, localization loss and association loss. Non-maximal suppression followed by the min-cost flow framework is used to generate the trajectories of targets in the scenes. Since existing crowd counting datasets merely focus on crowd counting with static cameras rather than density map estimation, counting and tracking in crowds on drones, we have collected a new large-scale drone-based dataset, DroneCrowd, formed by 112 video clips with 33,600 high resolution frames (i.e., 1920x1080) captured in 70 different scenarios. With an intensive amount of effort, our dataset provides 20,800 people trajectories with 4.8 million head annotations and several video-level attributes in sequences. Extensive experiments are conducted on two challenging public datasets, i.e., Shanghaitech and UCF-QNRF, and our DroneCrowd, to demonstrate that STANet achieves favorable performance against state-of-the-art methods. The datasets and code can be found at https://github.com/VisDrone.



1 Introduction

Drones, or general unmanned aerial vehicles (UAVs), equipped with cameras have been rapidly deployed to a wide range of applications, such as video surveillance for crowd control and public safety [21]. In recent years, many massive stampedes around the world have claimed many victims, making automatic density map estimation, counting and tracking in crowds on drones an important task. These tasks have recently drawn great attention from the computer vision research community.

Although great progress has been achieved in recent years, existing algorithms still have room for improvement when dealing with video sequences captured by drones, due to various challenges such as viewpoint and scale variations, background clutter, and small object scales. The development and evaluation of crowd counting and tracking algorithms for drones are impeded by the lack of publicly available large-scale benchmarks. Some recent efforts [45, 10, 43] have been devoted to constructing datasets for crowd counting. However, these datasets are still limited in size and in the scenarios covered. They only focus on crowd counting in still images captured by surveillance cameras, due to difficulties in data collection and annotation for drone-based crowd counting and tracking.

To fill this gap, we collect a large-scale drone-based dataset for density map estimation, crowd localization and tracking. Our DroneCrowd dataset consists of 112 video clips formed by 33,600 total frames with a resolution of 1920x1080 pixels, captured by various drone-mounted cameras, for 70 different scenarios across different cities in China (i.e., Tianjin, Guangzhou, Daqing, and Hong Kong). These video clips are manually annotated with more than 4.8 million head annotations and several video-level attributes. To the best of our knowledge, this is the largest and most thoroughly annotated density map estimation, localization, and tracking dataset to date, see Table 1.

Dataset Type Resolution Frames Max count Min count Ave count Total count Trajectory Year
UCF_CC_50 [9] image - 2013
Shanghaitech A [45] image - 2016
Shanghaitech B [45] image 2016
AHU-Crowd [7] image 2016
CARPK [6] image 2017
Smart-City [43] image 2018
UCF-QNRF [10] image - 2018
UCSD [2] video 2008
Mall [19] video 2013
WorldExpo [42] video 2015
FDST [4] video 2019
Ours video 2019
  • This dataset includes video sequences captured by surveillance cameras. Only sampled video frames are annotated with a total of labeled heads of people.

Table 1: Comparison of the DroneCrowd dataset with existing datasets.

To handle this challenging dataset, we design a new space-time multi-scale attention network (STANet) to solve the density map estimation, localization, and tracking on drones. Specifically, we first aggregate multi-scale feature maps in sequential frames to exploit the temporal coherency, and then generate the enhanced space-time multi-scale features for the prediction of density and localization maps as well as association between consecutive frames. Meanwhile, we gradually apply the attention module on the aggregated feature maps to enforce the network to exploit discriminative space-time features for better performance. The whole network is trained in an end-to-end manner with the multi-task loss, formed by three terms, i.e., the density map loss, localization loss and association loss. After that, we use the non-maximal suppression method to localize the targets, which post-processes the localization results for each video frame by finding the local peaks or maximums based on a threshold. The min-cost flow algorithm [24] is further used to associate the nearest localized head points to generate the trajectories of people’s heads in the video. Extensive experiments are carried out on two challenging public datasets (i.e., Shanghaitech [45] and UCF-QNRF [10]) and our DroneCrowd dataset to demonstrate the effectiveness of our method.

Contributions. (1) We design a space-time multi-scale attention network to solve the density map estimation, localization, and tracking tasks simultaneously, which gradually applies the attention module on the aggregated multi-scale feature maps to force the network to exploit discriminative space-time features for better performance. (2) We present a large-scale drone-based dataset for density map estimation, localization, and tracking in dense crowds, which significantly surpasses existing datasets in terms of data type and volume, annotation quality, and difficulty. (3) Extensive experiments are carried out on three challenging datasets to validate the effectiveness of our STANet method.

Figure 1: Some annotated example frames in the DroneCrowd dataset. Different colors indicate different object instances and the corresponding trajectories. The video-level attributes are presented in the top-left corner of each video frame.

2 Related Work

Existing datasets.

To date, there exist only a handful of crowd counting, density map estimation, crowd localization, or crowd tracking datasets. UCF_CC_50 [9] is formed by images containing annotated humans, with the head counts ranging from to . Shanghaitech [45] includes images with a total number of labeled people. More recently, UCF-QNRF [10] was released with images and million annotated people’s heads in various scenarios. Hsieh et al. [6] present a drone-based car counting dataset, which contains approximately cars captured in different parking lots. However, these datasets are still limited in size and in the scenarios covered, due to the difficulties in data collection and annotation.

To evaluate counting algorithms in videos, Chan et al. [2] present the UCSD counting dataset with a resolution of . It contains low-density crowds and offers limited counting difficulty. Mall [19] consists of frames with a resolution of . Similar to UCSD, it is collected by a surveillance camera in a single location. Zhang et al. [42] present the WorldExpo dataset with annotated video frames in total, which is captured in different scenes during the 2010 Shanghai WorldExpo. Recently, Fang et al. [4] collect a video dataset with frames and annotated heads captured from different scenes. In contrast to the aforementioned datasets, our DroneCrowd dataset is a large-scale drone-based dataset for density map estimation, crowd localization and tracking, which consists of 112 video sequences with more than 4.8 million head annotations on 20,800 people trajectories.

Crowd counting and density map estimation.

The majority of early crowd counting methods [39, 14, 36] rely on sliding-window detectors that scan still images or video frames to detect pedestrians based on hand-crafted appearance features. However, detector-based methods are easily affected by heavy occlusion and by scale and viewpoint variations in crowded scenarios.

Recently, some methods [15, 45, 22, 29, 1, 16, 18] formulate crowd counting as the estimation of density maps. Lempitsky and Zisserman [15] learn to infer the density estimation by minimizing a regularized risk quadratic cost function. Zhang et al. [45] use a multi-column CNN to estimate the crowd density map, where each column learns the features for a different head size. Oñoro-Rubio and López-Sastre [22] design the Hydra CNN, which learns a multi-scale non-linear regression model using a pyramid of image patches extracted at multiple scales to generate the final density prediction. Sam et al. [29] develop the switching CNN model to handle the variations of crowd density. Cao et al. [1] propose an encoder-decoder network, where the encoder extracts multi-scale features with scale aggregation and the decoder generates high-resolution density maps using transposed convolutions. Li et al. [16] employ dilated convolution layers to enlarge receptive fields and extract deeper features without losing resolution. In contrast to existing methods, our STANet uses a coarse-to-fine process, which sequentially applies the attention module on multi-scale feature maps to force the network to exploit discriminative features.

In terms of crowd counting in videos, spatio-temporal information is critical to improve the counting accuracy. Xiong et al. [40] design a convolutional LSTM model to fully capture both spatial and temporal dependencies for crowd counting. Zhang et al. [44] combine fully convolutional neural networks and LSTM by residual learning to perform vehicle counting. Different from these two methods, our STANet combines multi-scale feature maps in sequential frames and outputs the enhanced features by deformable convolution, which is effective in exploiting the temporal coherency across frames for better performance.

Crowd localization and tracking.

Besides crowd counting, crowd localization and tracking are also important tasks in safety control scenarios. Rodriguez et al. [26] formulate an energy minimization framework by jointly optimizing the density and the location of people, with the temporal-spatial constraints of person tracks in video. Ma et al. [20] first obtain local counts from sliding windows over the density map and then use integer programming to recover the locations of individual objects. In [10], crowd counting and localization tasks are simultaneously solved with a CNN model trained by a composition loss.

3 DroneCrowd Dataset

Figure 2: (a) The distribution of the number of objects per frame, (b) the distribution of the length of object trajectories, and (c) the attribute statistics, of the training and testing sets in the DroneCrowd dataset.

3.1 Data Collection and Annotation

Our DroneCrowd dataset is captured by drone-mounted cameras (DJI Phantom 4, Phantom 4 Pro and Mavic), covering a wide range of scenarios, e.g., campus, street, park, parking lot, playground and plaza. The videos are recorded at frames per second (FPS) with a resolution of 1920x1080 pixels. As presented in Figure 2 (a) and (b), the maximal and minimal numbers of people in each video frame are and respectively, and the average number of objects is . More than 20 thousand head trajectories of people are annotated with more than 4.8 million head points in individual frames of the 112 video clips. In terms of annotation, over domain experts annotate and double-check the dataset using the vatic software [35] for more than two months. Figure 1 shows some frames with annotated trajectories of people's heads in video sequences.

We divide the DroneCrowd dataset into training and testing sets, with and sequences, respectively. Notably, training videos are taken at different locations from testing videos to reduce the chance that algorithms overfit to particular scenes. The DroneCrowd dataset contains video sequences with large variations in scale, viewpoint, and background clutter. To analyze the performance of algorithms thoroughly, we define three video-level attributes of the dataset, i.e.,

Illumination. Under different illumination conditions, objects differ in appearance. We consider three categories of illumination conditions in DroneCrowd, namely Cloudy, Sunny, and Night.

Altitude is the flying height of drones, which significantly affects the scales of objects. According to the scales of objects, we define two altitude levels, namely High () and Low ().

Density indicates the number of objects in each frame. Based on the average number of objects in each frame, we divide the dataset into two density levels, i.e., Crowded (with the number of objects in each frame larger than ), and Sparse (with the number of objects in each frame less than ). The distribution of video sequences based on the attributes is shown in Figure 2 (c).

3.2 Evaluation Metrics and Protocols

Density map estimation.

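Following previous works [42, 45, 10], the density map estimation task aims to compute the per-pixel density at each location in the image, while preserving spatial information about the distribution of people. We use the mean absolute error (MAE) and mean squared error (MSE) to evaluate the performance, i.e.,

$$\mathrm{MAE} = \frac{1}{\sum_{i=1}^{K} N_i} \sum_{i=1}^{K} \sum_{j=1}^{N_i} \left| z_{i,j} - \hat{z}_{i,j} \right|, \qquad \mathrm{MSE} = \sqrt{\frac{1}{\sum_{i=1}^{K} N_i} \sum_{i=1}^{K} \sum_{j=1}^{N_i} \left| z_{i,j} - \hat{z}_{i,j} \right|^{2}}, \tag{1}$$

where $K$ is the number of video clips and $N_i$ is the number of frames in the $i$-th video. $z_{i,j}$ and $\hat{z}_{i,j}$ are the ground-truth and estimated number of people in the $j$-th frame of the $i$-th video clip, respectively. As stated in [45], MAE and MSE describe the accuracy and robustness of the estimation.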
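As a minimal sketch of how these two metrics can be computed, the snippet below pools the per-frame counting errors over all frames of all clips. It assumes the convention common in the crowd counting literature in which the reported "MSE" is the square root of the mean squared error; the function name `counting_errors` and the toy counts are illustrative and not taken from the paper.

```python
import math

def counting_errors(gt_counts, est_counts):
    """Return (MAE, MSE) pooled over all frames of all clips.

    gt_counts / est_counts: one list per video clip, each holding the
    per-frame ground-truth / estimated people counts.
    """
    abs_sum, sq_sum, n_frames = 0.0, 0.0, 0
    for gt_clip, est_clip in zip(gt_counts, est_counts):
        for z, z_hat in zip(gt_clip, est_clip):
            abs_sum += abs(z - z_hat)
            sq_sum += (z - z_hat) ** 2
            n_frames += 1
    mae = abs_sum / n_frames
    mse = math.sqrt(sq_sum / n_frames)  # root taken, per the assumed convention
    return mae, mse

# Toy example: two clips with two frames and one frame, respectively.
mae, mse = counting_errors([[10, 20], [30]], [[12, 17], [30]])
```

Because the errors are pooled over frames rather than averaged per clip first, longer clips contribute proportionally more to both scores.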

Crowd localization.

According to [10], the ideal approach for crowd counting is to detect all people in an image and report the number of detections, which is critical in several applications such as safety and surveillance. Specifically, each evaluated algorithm is required to output a series of detected points with confidence scores for each test image. The estimated localizations determined by the confidence threshold are associated to the ground-truth localizations using a greedy method. Then, we compute the mean average precision (L-mAP) at various distance thresholds ( pixels) to evaluate the localization results. We also report the performance with three specific distance thresholds, i.e., L-AP, L-AP, and L-AP pixels. These criteria penalize missed detections of people as well as duplicate detections (i.e., two detection results for the same person).
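The greedy association and the averaging over distance thresholds can be sketched as follows. This is an illustration under stated assumptions rather than the benchmark's official evaluation code: detections are processed in order of decreasing confidence, each is matched to the nearest still-unmatched ground-truth point within the distance threshold, and AP is computed as the non-interpolated average of the precision at each true positive. The function names and the threshold values in the example are hypothetical.

```python
import math

def localization_ap(detections, gt_points, dist_thresh):
    """Average precision of point detections at one distance threshold.

    detections: list of (x, y, score); gt_points: list of (x, y).
    """
    matched = [False] * len(gt_points)
    tp_flags = []
    # Greedy matching: most confident detections claim ground truth first.
    for x, y, _ in sorted(detections, key=lambda d: -d[2]):
        best, best_d = -1, dist_thresh
        for k, (gx, gy) in enumerate(gt_points):
            d = math.hypot(x - gx, y - gy)
            if not matched[k] and d <= best_d:
                best, best_d = k, d
        if best >= 0:
            matched[best] = True
        tp_flags.append(best >= 0)  # duplicates and far-off points are false positives
    # Non-interpolated AP: mean precision at each true positive, over all GT points.
    ap, tp = 0.0, 0
    for rank, is_tp in enumerate(tp_flags, start=1):
        if is_tp:
            tp += 1
            ap += tp / rank
    return ap / len(gt_points) if gt_points else 0.0

def mean_localization_ap(detections, gt_points, thresholds):
    """Mean of the AP values over a list of distance thresholds (in pixels)."""
    return sum(localization_ap(detections, gt_points, t) for t in thresholds) / len(thresholds)

gt = [(0, 0), (10, 10)]
dets = [(0, 1, 0.9), (0, 2, 0.8), (10, 11, 0.7)]  # the second one duplicates the first head
ap_5 = localization_ap(dets, gt, 5)
l_map = mean_localization_ap(dets, gt, [0.5, 5])
```

In the example, the duplicate detection lowers the precision at the second true positive, which is how this kind of criterion penalizes two responses for the same person.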

Crowd tracking.

Crowd tracking requires an evaluated algorithm to recover the trajectories of people in a video sequence. We use the tracking evaluation protocol in [23] to evaluate the algorithms. Specifically, each tracker is required to output a series of head points with confidence scores and the corresponding identities. We sort the tracklets, formed by the detected locations with the same identity, based on the average confidence of their detections. A tracklet is considered to be correct if the matched ratio between the predicted and ground-truth tracklets is larger than a threshold. Similar to [23], we use three thresholds in evaluation, i.e., , , and . The matching distance threshold between the predicted and ground-truth locations on the tracklets is set to pixels. The mean average precision (T-mAP) scores over different thresholds (i.e., T-AP, T-AP, and T-AP) are used to measure the performance. Please refer to [23] for more details.

Figure 3: The architecture of our space-time multi-scale attention network for crowd counting. The pink rectangles indicate the convolution groups in VGG-16. The light blue rectangle indicates the deformable convolution layer [47].

4 Space-Time Multi-Scale Attention Network

Our STANet combines multi-scale feature maps in sequential frames, see Figure 3. Meanwhile, we gradually apply the attention module on the combined feature maps to force the network to focus on the discriminative space-time features. Finally, the non-maximal suppression and min-cost flow association algorithms [24] are used to localize the heads of people and generate their trajectories in the video sequence.

Network architecture.

As shown in Figure 3, our STANet method is built on the first four groups of convolution layers in the VGG-16 network [32], the backbone network of STANet, to extract multi-scale features. Motivated by [27], we use a U-Net style architecture to fuse multi-scale features for prediction. Meanwhile, to exploit the temporal coherency, we merge the multi-scale features of the -th frame, and concatenate the features of the -th frame and the -th frame, where is a predefined parameter determining the frame gap between the two frames in temporal coherency (footnote 1: for the time index , we use the feature of the first frame to exploit the temporal coherency). We gradually apply the spatial attention module [38] on multi-scale features to force the network to focus on the discriminative features (see the black dashed bounding box in Figure 3). After each spatial attention module, one convolution layer is used to compress the number of channels for efficiency. The multi-scale feature maps of the network are concatenated, then merged by the channel and spatial attention modules and one convolution layer to predict the final density and localization maps. In addition, one convolution layer is used to extract appearance features from the shared backbone in consecutive frames. The targets with the same identity are then associated based on the normalized features.

Multi-task loss function.

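The overall loss function consists of the density map loss, the localization loss and the association loss, and is formulated as

$$\mathcal{L} = \sum_{n=1}^{N} \Big( \lambda_{den}\, \mathcal{L}_{den}\big(\hat{\Phi}_{D}^{(n)}, \Phi_{D}^{(n)}\big) + \lambda_{loc}\, \mathcal{L}_{loc}\big(\hat{\Phi}_{L}^{(n)}, \Phi_{L}^{(n)}\big) + \lambda_{ass}\, \mathcal{L}_{ass}\big(D_{s}^{(n)}, D_{d}^{(n)}\big) \Big), \tag{2}$$

where $N$ is the batch size and $n$ is the index of the sample. $\hat{\Phi}_{D}^{(n)}$ and $\Phi_{D}^{(n)}$ are the estimated and ground-truth density maps, while $\hat{\Phi}_{L}^{(n)}$ and $\Phi_{L}^{(n)}$ are the estimated and ground-truth localization maps. $D_{s}^{(n)}$ and $D_{d}^{(n)}$ are the feature distances between the same targets and between different targets in consecutive frames, respectively. $\lambda_{den}$, $\lambda_{loc}$ and $\lambda_{ass}$ are balancing factors for the three terms.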

Specifically, we use the same pixel-wise Euclidean loss on multi-scale density and localization maps, making different branches (i.e., S1, S2, S3, and S4 in Figure 3) in the network focus on different scales of objects to generate more accurate predictions. For example, the density loss is computed as

$$\mathcal{L}_{den}\big(\hat{\Phi}_{D}^{(n)}, \Phi_{D}^{(n)}\big) = \sum_{s=1}^{S} \omega_{s} \sum_{i=1}^{W} \sum_{j=1}^{H} \Big( \hat{\Phi}_{D}^{(n)}(i,j,s) - \Phi_{D}^{(n)}(i,j,s) \Big)^{2}, \tag{3}$$

where $W$ and $H$ are the width and height of the map, and $S$ is the number of scales in the network. $\hat{\Phi}_{D}^{(n)}(i,j,s)$ and $\Phi_{D}^{(n)}(i,j,s)$ are the estimated and ground-truth density maps at pixel location $(i,j)$ of the $n$-th training sample with scale $s$, respectively. $\omega_{s}$ is the pre-set weight used to balance the losses of different scales of density maps. Notably, following [45], the geometry-adaptive Gaussian kernel method is used to generate the ground-truth density map $\Phi_{D}^{(n)}$. Similar to [46], we also generate localization maps using a fixed Gaussian kernel. If two Gaussians overlap, we take the maximal values.
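The two ingredients described above, a localization map built from fixed Gaussians merged by a pixel-wise maximum, and a scale-weighted pixel-wise Euclidean loss, can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' implementation: the unit peak height of the Gaussian, the value of `sigma`, the absence of any normalization by map size, and both function names are assumptions.

```python
import math

def make_localization_map(points, width, height, sigma):
    """Place a fixed Gaussian at each head point; overlaps keep the maximum."""
    loc = [[0.0] * width for _ in range(height)]
    for px, py in points:
        for y in range(height):
            for x in range(width):
                g = math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2.0 * sigma ** 2))
                if g > loc[y][x]:  # maximum instead of sum where Gaussians overlap
                    loc[y][x] = g
    return loc

def multiscale_euclidean_loss(est_maps, gt_maps, weights):
    """Weighted sum over scales of the pixel-wise squared differences."""
    total = 0.0
    for est, gt, w in zip(est_maps, gt_maps, weights):
        sq = sum((e - g) ** 2
                 for est_row, gt_row in zip(est, gt)
                 for e, g in zip(est_row, gt_row))
        total += w * sq
    return total

# Two heads two pixels apart: the pixel between them keeps exp(-0.5), not twice that.
loc = make_localization_map([(1, 1), (3, 1)], width=5, height=3, sigma=1.0)
loss = multiscale_euclidean_loss([[[1.0, 2.0]]], [[[0.0, 0.0]]], [0.5])
```

Taking the maximum keeps every head as a distinct peak of equal height, which is what makes the later peak-finding step meaningful in dense regions; a summed density map would instead merge nearby heads into one blob.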

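Inspired by [5], we train the association head using the batch hard triplet loss, which samples hard positives and hard negatives for each target. The loss is computed as

$$\mathcal{L}_{ass}\big(D_{s}^{(n)}, D_{d}^{(n)}\big) = \sum_{k=1}^{M} \max\Big( 0,\; \alpha + D_{s}^{(n)}(k) - D_{d}^{(n)}(k) \Big), \tag{4}$$

where $\alpha$ is the margin between $D_{s}^{(n)}$ and $D_{d}^{(n)}$, and $M$ is the number of targets in the current frame. Each target with the id $k$ contains an association feature; $D_{s}^{(n)}(k)$ is its distance to the hardest positive (the same target) and $D_{d}^{(n)}(k)$ is its distance to the hardest negative (a different target).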
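A batch hard triplet loss in the sense of [5] can be sketched as below: for every anchor feature, the farthest feature with the same identity is the hard positive and the closest feature with a different identity is the hard negative. The sketch is an assumption-laden illustration rather than the paper's code: it pools features from the two frames into one list, uses plain Euclidean distance, sums rather than averages over anchors, and skips anchors that lack a positive or a negative. The function name and the toy features are hypothetical.

```python
import math

def batch_hard_triplet_loss(features, ids, margin):
    """Sum over anchors of max(0, margin + hardest_pos_dist - hardest_neg_dist)."""
    total = 0.0
    for a, (fa, id_a) in enumerate(zip(features, ids)):
        pos = [math.dist(fa, fb) for b, (fb, id_b) in enumerate(zip(features, ids))
               if id_b == id_a and b != a]
        neg = [math.dist(fa, fb) for fb, id_b in zip(features, ids) if id_b != id_a]
        if not pos or not neg:
            continue  # no valid triplet for this anchor
        total += max(0.0, margin + max(pos) - min(neg))
    return total

# Identity 1 appears twice (e.g. once per frame), identity 2 once.
feats = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
ids = [1, 1, 2]
loss_small_margin = batch_hard_triplet_loss(feats, ids, margin=2.0)
loss_large_margin = batch_hard_triplet_loss(feats, ids, margin=5.0)
```

With the small margin the two features of identity 1 are already well separated from identity 2, so the loss vanishes; the larger margin makes both anchors violate the constraint.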

Method Venue & Year Shanghaitech Part A [45] Shanghaitech Part B [45] UCF-QNRF [10]
MCNN [45] CVPR 2016
C-MTL [33] AVSS 2017
SwitchCNN [29] CVPR 2017
CP-CNN [34] ICCV 2017 - -
SaCNN [43] WACV 2018 - -
ACSCP [30] CVPR 2018 - -
IG-CNN [28] CVPR 2018 - -
Deep-NCL [31] CVPR 2018 - -
CSRNet [16] CVPR 2018 - -
CL-CNN [10] ECCV 2018 - - - -
ic-CNN [25] ECCV 2018 - -
SANet [1] ECCV 2018 - -
SFCN [37] CVPR 2019
ADCrowdNet [17] CVPR 2019 - -
TEDnet [11] CVPR 2019
Ours -
Table 2: Comparison of our approach with the state-of-the-art methods on three public datasets.

Data augmentation.

We randomly flip and crop the training images to increase diversity in training data. Due to limited computation resources, for the image size larger than , we first resize the image such that its size is smaller than . Then we equally divide it into patches, and use the divided patches for training.
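A minimal sketch of joint flip-and-crop augmentation is given below. The point it illustrates is that head annotations must be transformed together with the pixels, otherwise the ground-truth maps generated afterwards would be misaligned. The flip probability of 0.5, the horizontal flip direction, the crop size, and the function name are assumptions for illustration; the image is represented as a plain list of rows.

```python
import random

def flip_and_crop(image, points, crop_w, crop_h, rng=random):
    """Randomly flip horizontally, then take a random crop; points follow the pixels.

    image: list of rows; points: list of integer (x, y) head locations.
    """
    h, w = len(image), len(image[0])
    if rng.random() < 0.5:
        image = [row[::-1] for row in image]           # mirror the columns
        points = [(w - 1 - x, y) for x, y in points]   # mirror the x coordinates
    x0 = rng.randint(0, w - crop_w)
    y0 = rng.randint(0, h - crop_h)
    crop = [row[x0:x0 + crop_w] for row in image[y0:y0 + crop_h]]
    kept = [(x - x0, y - y0) for x, y in points
            if x0 <= x < x0 + crop_w and y0 <= y < y0 + crop_h]
    return crop, kept

# Toy 6x4 "image" in which head pixels are marked with 1.
heads = [(1, 1), (4, 2), (5, 0)]
img = [[0] * 6 for _ in range(4)]
for hx, hy in heads:
    img[hy][hx] = 1
crop, kept = flip_and_crop(img, heads, crop_w=3, crop_h=2, rng=random.Random(0))
```

Passing a seeded `random.Random` instance makes the augmentation reproducible, which is convenient when debugging annotation alignment.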


In (4), the margin is set as , and the pre-set weights are set as , and for balance. The pre-set weight in (3) is set as empirically. The Gaussian normalization method is used to randomly initialize the parameters in the other (de)convolution layers. We set the batch size to in training. The network is trained with the learning rate of in the first epochs, and trained with the learning rate of in the epochs using the Adam optimization algorithm [12].

Localization and tracking.

After obtaining the localization map of each frame, we use the non-maximal suppression method to localize the heads of people in each frame based on a preset threshold. That is, we find the local peaks whose values exceed this threshold on the predicted localization map of each video frame to determine the head locations of people. Then, we calculate the Euclidean distance between different pairs of heads in sequential frames and use the min-cost flow algorithm [24] to associate the nearest head points to generate their trajectories.
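The two post-processing steps can be sketched as follows. The first function keeps pixels that exceed the threshold and are strictly greater than all of their 8 neighbours. The second associates the points of two consecutive frames by solving a small min-cost max-flow problem with successive shortest paths. Note that this is a two-frame simplification written for illustration: the formulation in [24] operates over a whole sequence, and the gating distance `max_dist`, the function names, and the toy data are assumptions.

```python
import math
from collections import deque

def find_peaks(loc_map, threshold):
    """Return (x, y) of pixels above threshold and strictly above all 8 neighbours."""
    h, w = len(loc_map), len(loc_map[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            v = loc_map[y][x]
            if v <= threshold:
                continue
            nbrs = [loc_map[yy][xx]
                    for yy in range(max(0, y - 1), min(h, y + 2))
                    for xx in range(max(0, x - 1), min(w, x + 2))
                    if (yy, xx) != (y, x)]
            if all(v > nb for nb in nbrs):
                peaks.append((x, y))
    return peaks

def associate_min_cost_flow(prev_pts, curr_pts, max_dist):
    """Match points of two frames via min-cost max-flow (unit capacities)."""
    n, m = len(prev_pts), len(curr_pts)
    src, snk, size = n + m, n + m + 1, n + m + 2
    graph = [[] for _ in range(size)]
    to, cap, cost = [], [], []

    def add_edge(u, v, w):
        # Forward edge gets an even id; its residual twin is id ^ 1.
        graph[u].append(len(to)); to.append(v); cap.append(1); cost.append(w)
        graph[v].append(len(to)); to.append(u); cap.append(0); cost.append(-w)

    for i in range(n):
        add_edge(src, i, 0.0)
    for j in range(m):
        add_edge(n + j, snk, 0.0)
    for i, p in enumerate(prev_pts):
        for j, q in enumerate(curr_pts):
            d = math.dist(p, q)
            if d <= max_dist:  # gating: distant pairs are never linked
                add_edge(i, n + j, d)

    while True:
        # Queue-based Bellman-Ford shortest path in the residual graph.
        dist = [math.inf] * size
        prev_edge = [-1] * size
        in_queue = [False] * size
        dist[src] = 0.0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            in_queue[u] = False
            for e in graph[u]:
                v = to[e]
                if cap[e] > 0 and dist[u] + cost[e] < dist[v] - 1e-12:
                    dist[v] = dist[u] + cost[e]
                    prev_edge[v] = e
                    if not in_queue[v]:
                        queue.append(v)
                        in_queue[v] = True
        if prev_edge[snk] == -1:
            break  # no augmenting path left
        v = snk
        while v != src:  # push one unit of flow along the path
            e = prev_edge[v]
            cap[e] -= 1
            cap[e ^ 1] += 1
            v = to[e ^ 1]

    # Saturated forward edges between the two point sets are the matches.
    return sorted((i, to[e] - n) for i in range(n) for e in graph[i]
                  if e % 2 == 0 and n <= to[e] < n + m and cap[e] == 0)

loc = [[0, 0, 0, 0, 0],
       [0, 0.9, 0.2, 0, 0],
       [0, 0.2, 0, 0.6, 0],
       [0, 0, 0, 0, 0]]
peaks = find_peaks(loc, 0.5)
links = associate_min_cost_flow([(0, 0), (3, 0)], [(2, 0), (4.5, 0)], max_dist=5)
```

In the example, a purely greedy nearest-neighbour rule would first link the second previous point to the first current point (distance 1) and then be forced into a long link of 4.5 pixels; the flow formulation reroutes and returns the assignment with the smaller total distance.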

5 Experiments

We evaluate our method on three challenging datasets. The experiments are conducted on a workstation with Intel E5-2609 CPU, 32GB RAM, and NVIDIA GeForce GTX 1080Ti GPUs.

5.1 Public Datasets

As presented in Table 2, we evaluate our STANet method on Shanghaitech [45] and UCF-QNRF [10]. Since they only focus on crowd counting on images, we remove the association head in our STANet method for evaluation.

The Shanghaitech dataset [45] is formed by images, with a total of annotated people, which is divided into Part A ( images) and Part B ( images). Table 2 shows the errors of our STANet method as well as state-of-the-art methods. As shown in Table 2, our method performs favorably against state-of-the-art methods with MAE and MSE in Part A, and MAE and MSE in Part B. ADCrowdNet [17] performs better than our method in MAE ( vs. ) of Part A, but worse than our method in MAE ( vs. ) of Part B.

The UCF-QNRF dataset [10] contains challenging images with annotated people, which is divided into training ( images) and testing sets ( images). We compare the proposed method with state-of-the-art methods (i.e., [45, 33, 29, 10, 37, 11]) in Table 2. Our method achieves MAE and MSE , surpassing most state-of-the-art methods, which demonstrates that our method produces more accurate density maps.

5.2 DroneCrowd Dataset

Besides the above two public datasets, we also evaluate the proposed method on our DroneCrowd dataset for crowd counting, localization and tracking. We report the density map estimation results and speed, i.e., frames per second (FPS), of the proposed STANet method and state-of-the-art methods in Table 3. The code of all evaluated methods is publicly available or provided by the authors of the corresponding publications (footnote 2: since there is no public code for the other video-based methods [40, 44], we do not evaluate them on the DroneCrowd dataset). All methods are trained on the training set and evaluated on the testing set. Every frames are sampled from the video clips in the training set to train the evaluated methods.

Meanwhile, we choose the top two previous methods in density map estimation, and post-process the predicted localization maps by finding the local peaks or maximums based on a preset threshold to solve the crowd localization task in Table 4. For our STANet method, we directly post-process the predicted localization maps to localize the targets. After that, we use the min-cost flow algorithm [24] to recover the people’s trajectories. The evaluation results of the crowd tracking task are shown in Table 5. Some qualitative results are shown in Figure 4 and more results can be found in the supplementary materials.

Method Speed Overall High Low Cloudy Sunny Night Crowded Sparse
MCNN [45]
C-MTL [33]
MSCNN [41]
LCFCN [13]
SwitchCNN [29]
ACSCP [30]
CSRNet [16]
StackPooling [8]
DA-Net [48]
STANet (w/o ms)
STANet (w/o loc)
STANet (w/o ass)
Table 3: Estimation errors of the density map on the DroneCrowd dataset.
Figure 4: Qualitative results of STANet on three sequences in our DroneCrowd. Best view in color version.

Density map estimation.

As shown in Table 3, our STANet performs favorably against the state-of-the-art methods, with an improvement of MAE and MSE in comparison to the second best CSRNet [16] in the overall testing set, i.e., MAE of STANet vs. MAE of CSRNet [16], and MSE of STANet vs. MSE of CSRNet [16]. MCNN [45] achieves the third best MAE score of with the speed of FPS. This suggests that our method generates more accurate and robust density maps in different scenarios.

To further analyze the results, we report the performance on several break-down subsets based on the video-level attributes, i.e., the High and Low subsets based on the Altitude attribute, the Cloudy, Sunny, and Night subsets based on the Illumination attribute, and the Crowded and Sparse subsets based on the Density attribute. As shown in the sixth column of Table 3, LCFCN [13] and AMDCN [3] fail to perform well in the Crowded subset, producing the two worst MAE and MSE scores, i.e., and MAE, and and MSE, respectively. We speculate that this may be due to several reasons. LCFCN [13] uses a loss function that encourages the network to output a segmentation blob for each object in crowd counting. However, in crowded scenarios, each object may occupy only a few pixels, making it difficult to separate objects accurately. AMDCN [3] uses multiple columns of large dilated convolution operations, which inevitably integrate considerable background noise, affecting the accuracy of density map estimation. In contrast, MCNN [45] uses multi-column CNNs to learn features adaptive to variations in object size due to perspective effects or image resolution. Meanwhile, STANet achieves the best result by gradually applying the attention module on the combined multi-scale feature maps. This observation demonstrates the effectiveness and importance of exploiting multi-scale features in density map estimation.

Methods L-mAP L-AP L-AP L-AP
STANet-L (w/o ms)
STANet-L (w/o loc)
STANet-L (w/o ass)
Table 4: Crowd localization accuracy on DroneCrowd.

To study the influence of each module in STANet, we construct three variants and evaluate them on the DroneCrowd dataset, i.e., STANet (w/o ass), STANet (w/o loc), and STANet (w/o ms), in Table 3. For a fair comparison, we use the same parameter settings and input size in evaluation. All variants are trained on the training set and tested on the testing set. STANet (w/o ass) indicates the method that removes the association head from STANet. STANet (w/o loc) indicates the method that removes the localization head from STANet (w/o ass). STANet (w/o ms) denotes the method that further removes the multi-scale features in prediction, i.e., only using the first four groups of convolution layers in VGG-16.

As shown in Table 3, our STANet achieves better results than its variants. After removing the association head, the MSE score on the overall set increases ( vs. ), demonstrating that temporal association helps improve the robustness. If we remove the localization head, the MAE increases ( vs. ). This performance drop validates the importance of the localization head. If we further remove the multi-scale feature module, the MAE score increases from to . This sharp decline (i.e., ) in accuracy demonstrates that the multi-scale features significantly improve the performance in density map estimation. In addition, we notice that STANet (w/o ass) performs better than STANet in the Cloudy subset. We speculate that this phenomenon is caused by the inaccuracy of temporal information in cloudy scenarios. Similar cases are also observed in the Sparse subset.

Crowd localization.

We conduct non-maximal suppression to localize people’s heads in videos. Specifically, we find the local peaks above a preset threshold on the predicted localization map in each frame. The crowd localization results of the two methods with top results in density map estimation (i.e., MCNN [45] and CSRNet [16]) and of our STANet variants are shown in Table 4, named MCNN-L, CSRNet-L, STANet-L (w/o ms), STANet-L (w/o loc), STANet-L (w/o ass), and STANet-L, respectively.

As shown in Table 4, STANet achieves the best localization accuracy of L-mAP, surpassing the second best CSRNet [16] by L-mAP. This indicates that our method is not only able to predict the distribution of objects in the scenes, but also generates relatively more accurate localizations of each object instance. Without the association head, the L-mAP score decreases by , indicating that temporal coherence helps improve the localization accuracy slightly. If we remove both the association and localization heads, the L-mAP score decreases by . This demonstrates that the localization head forces the network to focus on more discriminative features to localize people’s heads. If we further remove the multi-scale feature design, the L-mAP score drops from to , which validates that multi-scale features play a critical role in crowd localization performance.

Crowd tracking.

Moreover, we also evaluate the crowd tracking results on our DroneCrowd dataset in Table 5. Specifically, we construct six crowd tracking methods, i.e. MCNN-T, CSRNet-T, STANet-T (w/o ms), STANet-T (w/o loc), STANet-T (w/o ass), and STANet-T, using the min-cost flow method [24] to associate the location points generated by the corresponding crowd localization methods.

As shown in Table 5, our STANet-T produces the best results with the top T-mAP score, i.e., , which is higher than that of the second best method CSRNet-T. STANet-T (w/o ass) produces results comparable to STANet-T ( vs. ). We speculate that the association head is effective in using temporal association information to recover the trajectories of people. The T-mAP score of STANet-T (w/o loc) decreases compared with STANet-T (w/o ass), and STANet-T (w/o ms) only achieves a T-mAP score of . These results indicate that the association and localization heads and the multi-scale representation are critical in crowd tracking. However, all the results are still far from satisfactory for real applications. This indicates that DroneCrowd is extremely challenging for crowd tracking and that much effort is needed to develop more effective methods for real scenarios.

Methods T-mAP T-AP T-AP T-AP
STANet-T (w/o ms)
STANet-T (w/o loc)
STANet-T (w/o ass)
Table 5: Crowd tracking accuracy on DroneCrowd.

6 Conclusion

In this work, we propose the STANet method to jointly solve density map estimation, localization, and tracking in crowds of video clips captured by drones. To better evaluate the performance of density map estimation, localization, and tracking on drones, we collect and annotate a new dataset, DroneCrowd, which consists of 112 video clips with 33,600 high-resolution frames and 4.8 million head annotations. This is the largest dataset to date in terms of annotated head trajectories for density map estimation, crowd localization, and tracking on drones. Our model performs favorably against the state-of-the-art crowd counting methods on the three challenging datasets, demonstrating its effectiveness. We hope the dataset and the proposed method can facilitate research and development in crowd counting, localization, and tracking on drones.

References


  • [1] X. Cao, Z. Wang, Y. Zhao, and F. Su (2018) Scale aggregation network for accurate and efficient crowd counting. In ECCV, pp. 757–773. Cited by: §2, Table 2.
  • [2] A. B. Chan, Z. J. Liang, and N. Vasconcelos (2008) Privacy preserving crowd monitoring: counting people without people models or tracking. In CVPR, Cited by: Table 1, §2.
  • [3] D. Deb and J. Ventura (2018) An aggregated multicolumn dilated convolution network for perspective-free counting. In CVPR Workshops, pp. 195–204. Cited by: §5.2, Table 3.
  • [4] Y. Fang, B. Zhan, W. Cai, S. Gao, and B. Hu (2019) Locality-constrained spatial transformer network for video crowd counting. In ICME, pp. 814–819. Cited by: Table 1, §2.
  • [5] A. Hermans, L. Beyer, and B. Leibe (2017) In defense of the triplet loss for person re-identification. CoRR abs/1703.07737. Cited by: §4.
  • [6] M. Hsieh, Y. Lin, and W. H. Hsu (2017) Drone-based object counting by spatially regularized regional proposal network. In ICCV, Cited by: Table 1, §2.
  • [7] Y. Hu, H. Chang, F. Nian, Y. Wang, and T. Li (2016) Dense crowd counting from still images with convolutional neural networks. J. Visual Communication and Image Representation 38, pp. 530–539. Cited by: Table 1.
  • [8] S. Huang, X. Li, Z. Cheng, Z. Zhang, and A. G. Hauptmann (2018) Stacked pooling: improving crowd counting by boosting scale invariance. CoRR abs/1808.07456. Cited by: Table 3.
  • [9] H. Idrees, I. Saleemi, C. Seibert, and M. Shah (2013) Multi-source multi-scale counting in extremely dense crowd images. In CVPR, pp. 2547–2554. Cited by: Table 1, §2.
  • [10] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Máadeed, N. M. Rajpoot, and M. Shah (2018) Composition loss for counting, density map estimation and localization in dense crowds. In ECCV, pp. 544–559. Cited by: Table 1, §1, §1, §2, §2, §3.2, §3.2, Table 2, §5.1, §5.1.
  • [11] X. Jiang, Z. Xiao, B. Zhang, X. Zhen, X. Cao, D. S. Doermann, and L. Shao (2019) Crowd counting and density estimation by trellis encoder-decoder networks. In CVPR, pp. 6133–6142. Cited by: Table 2, §5.1.
  • [12] D. P. Kingma and J. Ba (2014) Adam: A method for stochastic optimization. CoRR abs/1412.6980. Cited by: §4.
  • [13] I. H. Laradji, N. Rostamzadeh, P. O. Pinheiro, D. Vázquez, and M. W. Schmidt (2018) Where are the blobs: counting by localization with point supervision. In ECCV, Cited by: §5.2, Table 3.
  • [14] B. Leibe, E. Seemann, and B. Schiele (2005) Pedestrian detection in crowded scenes. In CVPR, pp. 878–885. Cited by: §2.
  • [15] V. S. Lempitsky and A. Zisserman (2010) Learning to count objects in images. In NeurIPS, pp. 1324–1332. Cited by: §2.
  • [16] Y. Li, X. Zhang, and D. Chen (2018) CSRNet: dilated convolutional neural networks for understanding the highly congested scenes. In CVPR, pp. 1091–1100. Cited by: §2, Table 2, §5.2, §5.2, §5.2, Table 3.
  • [17] N. Liu, Y. Long, C. Zou, Q. Niu, L. Pan, and H. Wu (2019) ADCrowdNet: an attention-injective deformable convolutional network for crowd understanding. In CVPR, pp. 3225–3234. Cited by: Table 2, §5.1.
  • [18] W. Liu, M. Salzmann, and P. Fua (2018) Context-aware crowd counting. CoRR abs/1811.10452. Cited by: §2.
  • [19] C. C. Loy, S. Gong, and T. Xiang (2013) From semi-supervised to transfer counting of crowds. In ICCV, pp. 2256–2263. Cited by: Table 1, §2.
  • [20] Z. Ma, L. Yu, and A. B. Chan (2015) Small instance detection by integer programming on object density maps. In CVPR, pp. 3689–3697. Cited by: §2.
  • [21] N. H. Motlagh, M. Bagaa, and T. Taleb (2017) UAV-based iot platform: A crowd surveillance use case. IEEE Communications Magazine 55 (2), pp. 128–134. Cited by: §1.
  • [22] D. Oñoro-Rubio and R. J. López-Sastre (2016) Towards perspective-free object counting with deep learning. In ECCV, pp. 615–629. Cited by: §2.
  • [23] E. Park, W. Liu, O. Russakovsky, J. Deng, F. Li, and A. Berg Large Scale Visual Recognition Challenge 2017. Note: http://image-net.org/challenges/LSVRC/2017 Cited by: §3.2.
  • [24] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes (2011) Globally-optimal greedy algorithms for tracking a variable number of objects. In CVPR, pp. 1201–1208. Cited by: §1, §4, §4, §5.2, §5.2.
  • [25] V. Ranjan, H. Le, and M. Hoai (2018) Iterative crowd counting. In ECCV, pp. 278–293. Cited by: Table 2.
  • [26] M. Rodriguez, I. Laptev, J. Sivic, and J. Audibert (2011) Density-aware person detection and tracking in crowds. In ICCV, pp. 2423–2430. Cited by: §2.
  • [27] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241. Cited by: §4.
  • [28] D. B. Sam, N. N. Sajjan, R. V. Babu, and M. Srinivasan (2018) Divide and grow: capturing huge diversity in crowd images with incrementally growing CNN. In CVPR, pp. 3618–3626. Cited by: Table 2.
  • [29] D. B. Sam, S. Surya, and R. V. Babu (2017) Switching convolutional neural network for crowd counting. In CVPR, pp. 4031–4039. Cited by: §2, Table 2, §5.1, Table 3.
  • [30] Z. Shen, Y. Xu, B. Ni, M. Wang, J. Hu, and X. Yang (2018) Crowd counting via adversarial cross-scale consistency pursuit. In CVPR, pp. 5245–5254. Cited by: Table 2, Table 3.
  • [31] Z. Shi, L. Zhang, Y. Liu, X. Cao, Y. Ye, M. Cheng, and G. Zheng (2018) Crowd counting with deep negative correlation learning. In CVPR, pp. 5382–5390. Cited by: Table 2.
  • [32] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556. Cited by: §4.
  • [33] V. A. Sindagi and V. M. Patel (2017) CNN-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In AVSS, pp. 1–6. Cited by: Table 2, §5.1, Table 3.
  • [34] V. A. Sindagi and V. M. Patel (2017) Generating high-quality crowd density maps using contextual pyramid cnns. In ICCV, pp. 1879–1888. Cited by: Table 2.
  • [35] C. Vondrick, D. J. Patterson, and D. Ramanan (2013) Efficiently scaling up crowdsourced video annotation - A set of best practices for high quality, economical video labeling. IJCV 101 (1), pp. 184–204. Cited by: §3.1.
  • [36] M. Wang and X. Wang (2011) Automatic adaptation of a generic pedestrian detector to a specific traffic scene. In CVPR, pp. 3401–3408. Cited by: §2.
  • [37] Q. Wang, J. Gao, W. Lin, and Y. Yuan (2019) Learning from synthetic data for crowd counting in the wild. In CVPR, pp. 8198–8207. Cited by: Table 2, §5.1.
  • [38] S. Woo, J. Park, J. Lee, and I. S. Kweon (2018) CBAM: convolutional block attention module. In ECCV, pp. 3–19. Cited by: §4.
  • [39] B. Wu and R. Nevatia (2005) Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors. In ICCV, pp. 90–97. Cited by: §2.
  • [40] F. Xiong, X. Shi, and D. Yeung (2017) Spatiotemporal modeling for crowd counting in videos. In ICCV, pp. 5161–5169. Cited by: §2, footnote 2.
  • [41] L. Zeng, X. Xu, B. Cai, S. Qiu, and T. Zhang (2017) Multi-scale convolutional neural networks for crowd counting. In ICIP, pp. 465–469. Cited by: Table 3.
  • [42] C. Zhang, H. Li, X. Wang, and X. Yang (2015) Cross-scene crowd counting via deep convolutional neural networks. In CVPR, pp. 833–841. Cited by: Table 1, §2, §3.2.
  • [43] L. Zhang, M. Shi, and Q. Chen (2018) Crowd counting via scale-adaptive convolutional neural network. In WACV, pp. 1113–1121. Cited by: Table 1, §1, Table 2.
  • [44] S. Zhang, G. Wu, J. P. Costeira, and J. M. F. Moura (2017) FCN-rlstm: deep spatio-temporal neural networks for vehicle counting in city cameras. In ICCV, pp. 3687–3696. Cited by: §2, footnote 2.
  • [45] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma (2016) Single-image crowd counting via multi-column convolutional neural network. In CVPR, pp. 589–597. Cited by: Table 1, §1, §1, §2, §2, §3.2, Table 2, §4, §5.1, §5.1, §5.1, §5.2, §5.2, §5.2, Table 3.
  • [46] X. Zhou, D. Wang, and P. Krähenbühl (2019) Objects as points. CoRR abs/1904.07850. Cited by: §4.
  • [47] X. Zhu, H. Hu, S. Lin, and J. Dai (2019) Deformable convnets V2: more deformable, better results. In CVPR, pp. 9308–9316. Cited by: Figure 3.
  • [48] Z. Zou, X. Su, X. Qu, and P. Zhou (2018) DA-net: learning the fine-grained density distribution with deformation aggregation network. IEEE Access 6, pp. 60745–60756. Cited by: Table 3.