## 1 Introduction

Drones, or more generally unmanned aerial vehicles (UAVs), equipped with cameras have been rapidly deployed in a wide range of applications, such as video surveillance for crowd control [DBLP:conf/cvpr/ZhouWT12] and public safety [DBLP:journals/cm/MotlaghBT17]. In recent years, massive stampedes around the world have claimed many victims, making automatic density map estimation, counting, and tracking in crowds from drones important tasks that draw great attention from the computer vision community.

Despite significant progress, crowd counting and tracking algorithms still have room for improvement on drone-captured videos because of challenges such as viewpoint and scale variations, background clutter, and small object sizes. Developing and evaluating these algorithms for drones is impeded by the lack of publicly available large-scale benchmarks. Some recent efforts [DBLP:conf/cvpr/ZhangLWY15, DBLP:conf/cvpr/ZhangZCGM16, DBLP:conf/eccv/IdreesTAZARS18, DBLP:conf/wacv/ZhangSC18, DBLP:conf/icmcs/FangZCGH19, DBLP:journals/corr/abs-2001-03360] have been devoted to constructing datasets for crowd counting. However, the majority of them focus on crowd counting in still images or in inconsistent frames from surveillance cameras, owing to the difficulty of collecting and annotating data for drone-based crowd counting and tracking.

To fill this gap, we collect a large-scale drone-based dataset for density map estimation, crowd localization and tracking. Our DroneCrowd dataset consists of video clips formed by total frames, captured by various drone-mounted cameras, in different scenarios across different cities in China (i.e., Tianjin, Guangzhou, Daqing, and Hong Kong). These video clips are annotated with more than million head annotations and several video-level attributes. To the best of our knowledge, this is the largest and most thoroughly annotated density map estimation, localization, and tracking dataset to date, see Table 1.

Dataset | Type | Trajectory | Resolution | Frames | Max count | Min count | Avg count | Total count | Year |
---|---|---|---|---|---|---|---|---|---|
UCF_CC_50 [DBLP:conf/cvpr/IdreesSSS13] | image | | - | | | | | | 2013 |
Shanghaitech A [DBLP:conf/cvpr/ZhangZCGM16] | image | | - | | | | | | 2016 |
Shanghaitech B [DBLP:conf/cvpr/ZhangZCGM16] | image | | | | | | | | 2016 |
AHU-Crowd [DBLP:journals/jvcir/HuCNWL16] | image | | | | | | | | 2016 |
CARPK [DBLP:conf/iccv/HsiehLH17] | image | | | | | | | | 2017 |
Smart-City [DBLP:conf/wacv/ZhangSC18] | image | | | | | | | | 2018 |
UCF-QNRF [DBLP:conf/eccv/IdreesTAZARS18] | image | | - | | | | | | 2018 |
NWPU [DBLP:journals/corr/abs-2001-03360] | image | | | | | | | | 2020 |
UCSD [DBLP:conf/cvpr/ChanLV08] | video | | | | | | | | 2008 |
Mall [DBLP:conf/iccv/LoyGX13] | video | | | | | | | | 2013 |
WorldExpo [DBLP:conf/cvpr/ZhangLWY15] | video | | | | | | | | 2015 |
FDST [DBLP:conf/icmcs/FangZCGH19] | video | | | | | | | | 2019 |
DroneCrowd | video | | | | | | | | 2021 |

To handle this challenging dataset, we design a Space-Time Neighbor-Aware Network (STNNet) as a strong baseline, which solves density map estimation, localization, and tracking simultaneously. Specifically, STNNet is formed by four modules: the feature extraction subnetwork, followed by the density map estimation heads, the localization subnet, and the association subnet. The feature extraction subnetwork first uses two-branch CNNs to extract multi-scale features, and then computes the correlations between the features extracted from two consecutive frames to exploit temporal relations. Using the density map estimation heads, we estimate the density of objects in video frames to perform crowd counting. Inspired by object detection [DBLP:conf/nips/RenHGS15, DBLP:journals/pami/RenHG017, DBLP:conf/cvpr/ZhangWBLL18], we introduce the localization subnet, formed by classification and regression branches, to output accurate locations of targets in each individual frame. To exploit temporal consistency, the association subnet is designed to predict motion offsets of targets in consecutive frames for tracking. In addition, we develop the neighboring context loss, which integrates the spatial-temporal context of neighboring targets to guide the training of the association subnet. Specifically, the neighboring context loss penalizes large displacements of the relative positions of adjacent objects in the temporal domain, and guides the association subnet to generate accurate motion offsets. The whole network is trained end to end with the multi-task loss and the Adam optimizer [DBLP:journals/corr/KingmaB14]. After that, multi-object tracking methods [DBLP:conf/cvpr/PirsiavashRF11, DBLP:conf/cvpr/AlahiGRRLS16] are used to predict long trajectories of targets. Extensive experiments on our DroneCrowd dataset, in comparison with state-of-the-art algorithms, demonstrate the effectiveness of the proposed STNNet method for the density map estimation, crowd localization, and tracking tasks.

Contributions. (1) We collect a large-scale drone-captured dataset for density map estimation, localization, and tracking in dense crowds, which significantly surpasses existing datasets in terms of data type and volume, annotation quality, and difficulty. (2) We propose a space-time neighbor-aware network to solve the density map estimation, localization, and tracking tasks simultaneously. (3) To exploit the spatial-temporal context, we design the neighboring context loss, which penalizes large displacements of the relative positions of adjacent objects in the temporal domain during network training.

## 2 Related Work

Existing datasets. To date, only a handful of crowd counting, crowd localization, or crowd tracking datasets exist. UCF_CC_50 [DBLP:conf/cvpr/IdreesSSS13] is formed by images containing annotated humans, with the head counts ranging from to . Shanghaitech [DBLP:conf/cvpr/ZhangZCGM16] includes images with a total number of labeled people. More recently, UCF-QNRF [DBLP:conf/eccv/IdreesTAZARS18] was released with images and million annotated people’s heads in various scenarios. Hsieh et al. [DBLP:conf/iccv/HsiehLH17] present a drone-based car counting dataset, which approximately contains cars captured in different parking lots. Wang et al. [DBLP:journals/corr/abs-2001-03360] collect a large-scale congested crowd counting and localization dataset, which includes more than images and million annotated heads with points and boxes. However, these datasets are still limited in size and in the scenarios covered.

To evaluate counting algorithms in videos, Chan et al. [DBLP:conf/cvpr/ChanLV08] present the UCSD counting dataset, which features low-density crowds and low counting difficulty. Similar to the UCSD dataset, Mall [DBLP:conf/iccv/LoyGX13] is collected by a surveillance camera in a single location. Zhang et al. [DBLP:conf/cvpr/ZhangLWY15] present the WorldExpo dataset with annotated frames in total, captured in different scenes during the 2010 Shanghai World Expo. Fang et al. [DBLP:conf/icmcs/FangZCGH19] collect a video dataset with frames and annotated heads captured from different scenes. In contrast to the aforementioned datasets, our DroneCrowd dataset is a large-scale drone-captured dataset for density map estimation, crowd localization and tracking, which consists of sequences with more than million head annotations on people trajectories.

Crowd counting and density map estimation. Modern crowd counting methods [DBLP:conf/nips/LempitskyZ10, DBLP:conf/cvpr/ZhangZCGM16, DBLP:conf/cvpr/SamSB17, DBLP:conf/eccv/CaoWZS18, DBLP:conf/cvpr/LiZC18, DBLP:conf/cvpr/LiuSF19, DBLP:conf/aaai/LuoYLNJZC20, DBLP:conf/nips/WangLSH20] formulate crowd counting as density map estimation. Lempitsky and Zisserman [DBLP:conf/nips/LempitskyZ10] learn to infer the density estimate by minimizing a regularized risk quadratic cost function. Zhang et al. [DBLP:conf/cvpr/ZhangZCGM16] use a multi-column CNN to estimate the crowd density map, where each column learns features for a different head size. Sam et al. [DBLP:conf/cvpr/SamSB17] develop the switching CNN model to handle variations in crowd density. Cao et al. [DBLP:conf/eccv/CaoWZS18] propose an encoder-decoder network, where the encoder extracts multi-scale features with scale aggregation and the decoder generates high-resolution density maps using transposed convolutions. Li et al. [DBLP:conf/cvpr/LiZC18] employ dilated convolution layers to enlarge receptive fields and extract deeper features without losing resolution. Liu et al. [DBLP:conf/cvpr/LiuSF19] adaptively encode the scale of the contextual information for accurate crowd density prediction. In [DBLP:conf/iros/LiuLSF19], physically-inspired temporal consistency constraints are built into the network to handle the viewpoint changes of drones. Besides, Luo et al. [DBLP:conf/aaai/LuoYLNJZC20] propose a hybrid graph neural network to capture dependencies among multi-scale counting and localization features. To avoid hurting the generalization bound of a model, Wang et al. [DBLP:conf/nips/WangLSH20] use optimal transport to measure the similarity between the normalized predicted and ground-truth density maps.

In terms of crowd counting in videos, spatio-temporal information is critical to improving counting accuracy. Xiong et al. [DBLP:conf/iccv/XiongSY17] design a convolutional LSTM model to capture both spatial and temporal dependencies for crowd counting. Zhang et al. [DBLP:conf/iccv/ZhangWCM17] combine fully convolutional neural networks and LSTM by residual learning to perform vehicle counting. Liu et al. [DBLP:conf/eccv/LiuSF20] first compute people flows between consecutive frames and then estimate the densities from these flows. Different from existing methods, our STNNet outputs both crowd density and target locations in crowds using the proposed localization subnet.

Crowd localization and tracking. Besides crowd counting, crowd localization and tracking are also important tasks in safety control scenarios. Rodriguez et al. [DBLP:conf/iccv/RodriguezLSA11] formulate an energy minimization framework that jointly optimizes density and location, with temporal-spatial constraints on person tracks in video. Ma et al. [DBLP:conf/cvpr/MaYC15] first obtain local counts from sliding windows over the density map and then use integer programming to recover the locations of individual objects. In [DBLP:conf/eccv/IdreesTAZARS18], crowd counting and localization tasks are solved simultaneously with a CNN model trained by a composition loss. In contrast, our method captures context information among neighboring targets and estimates motion offsets of targets between consecutive frames, trained by the proposed neighboring context loss.

Attribute | Min count | Max count | Avg count | Frames |
---|---|---|---|---|
Small | | | | |
Large | | | | |
Cloudy | | | | |
Sunny | | | | |
Night | | | | |
Crowded | | | | |
Sparse | | | | |

## 3 DroneCrowd Dataset

### 3.1 Data Collection and Annotation

Our DroneCrowd dataset is captured by drone-mounted cameras (i.e., DJI Phantom 4, Phantom 4 Pro and Mavic), covering a wide range of scenarios, e.g., campus, street, park, parking lot, playground and plaza^{1}. The videos are recorded at frames per second (FPS) with a resolution of pixels. As presented in Figure 2 (a) and (b), the maximal and minimal numbers of people in each video frame are and respectively, and the average number of objects is . Moreover, more than thousands of head trajectories of people are annotated with more than million head points in individual frames of video clips. Over domain experts annotate and double-check the dataset using the vatic software [DBLP:journals/ijcv/VondrickPR13] for more than two months. Figure 1 shows some frames of video clips with annotated trajectories of people's heads.

^{1} We strictly comply with local laws and regulations in China when using unmanned aircraft/drones, and avoid restricted areas when capturing videos. Since the scales of objects are extremely small, no identity information such as faces and vehicle plates can be retrieved. After a careful check, we confirm that no data in our dataset leaks any personal information.

We divide DroneCrowd into training and testing sets, with and sequences, respectively. Notably, training videos are taken at different locations from testing videos to reduce the chance that algorithms overfit to particular scenes. The dataset contains video sequences with large variations in scale, viewpoint, and background clutter. To analyze the performance of algorithms thoroughly, we define three video-level attributes of the dataset, described as follows. (1) Illumination: under different illumination conditions, the objects are assumed to differ in appearance. Three categories of illumination conditions are considered in our dataset: Cloudy, Sunny, and Night. (2) Scale indicates the size of objects. Two categories of scales are defined: Large (the diameter of objects pixels) and Small (the diameter of objects pixels). (3) Density indicates the number of objects in each frame. Based on the average number of objects in each frame, we divide the dataset into two density levels, i.e., Crowded (with the number of objects in each frame larger than ), and Sparse (with the number of objects in each frame less than ). The statistics on different attributes are shown in Figure 2 (c) and Table 2.

### 3.2 Evaluation Metrics and Protocols

Density map estimation. Following previous works [DBLP:conf/cvpr/ZhangLWY15, DBLP:conf/cvpr/ZhangZCGM16, DBLP:conf/eccv/IdreesTAZARS18], the density map estimation task aims to compute the per-pixel density at each location in the image, while preserving spatial information about the distribution of people. We use the mean absolute error (MAE) and mean squared error (MSE) to evaluate the performance, i.e.,

$$\mathrm{MAE} = \frac{1}{\sum_{i=1}^{K} N_i} \sum_{i=1}^{K} \sum_{j=1}^{N_i} \left| z_{i,j} - \hat{z}_{i,j} \right|, \qquad \mathrm{MSE} = \sqrt{\frac{1}{\sum_{i=1}^{K} N_i} \sum_{i=1}^{K} \sum_{j=1}^{N_i} \left( z_{i,j} - \hat{z}_{i,j} \right)^{2}},$$

where $K$ is the number of video clips and $N_i$ is the number of frames in the $i$-th video. $z_{i,j}$ and $\hat{z}_{i,j}$ are the ground-truth and estimated numbers of people in the $j$-th frame of the $i$-th video clip, respectively. As stated in [DBLP:conf/cvpr/ZhangZCGM16], MAE and MSE describe the accuracy and robustness of the estimation, respectively.
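The two counting metrics can be sketched in a few lines of Python. This is a minimal illustration rather than the benchmark's evaluation code: the function name `counting_errors` and the nested-list input layout are assumptions, and MSE is taken as the root of the mean squared error, which is the usual convention in crowd counting.

```python
import math


def counting_errors(gt_counts, est_counts):
    """Return (MAE, MSE) pooled over every frame of every clip.

    gt_counts / est_counts: list of clips, each clip a list of per-frame
    people counts (hypothetical layout chosen for this sketch).
    MSE is computed as the root of the mean squared error (assumption).
    """
    abs_sum, sq_sum, n_frames = 0.0, 0.0, 0
    for gt_clip, est_clip in zip(gt_counts, est_counts):
        if len(gt_clip) != len(est_clip):
            raise ValueError("clips must have the same number of frames")
        for z, z_hat in zip(gt_clip, est_clip):
            diff = z - z_hat
            abs_sum += abs(diff)
            sq_sum += diff * diff
            n_frames += 1
    if n_frames == 0:
        raise ValueError("no frames to evaluate")
    return abs_sum / n_frames, math.sqrt(sq_sum / n_frames)
```

Because the errors are pooled over all frames before averaging, long clips contribute more than short ones; averaging per clip first would give a different number.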

Crowd localization. The goal of crowd localization is to detect the locations of all people in an image. Each evaluated crowd localization algorithm is required to output a series of detected points with confidence scores for each test image. The estimated locations that pass the confidence threshold are associated with the ground-truth locations using a greedy method. Then, we compute the L-mAP at various distance thresholds ( pixels) to evaluate the localization results. We also report the performance with three specific distance thresholds, i.e., L-AP, L-AP, and L-AP pixels. These criteria penalize missed detections of people as well as duplicate detections.
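The greedy association step can be illustrated as follows. This is a sketch under stated assumptions, not the official evaluation script: predictions are processed in descending confidence, each one claims the nearest unused ground-truth point within the distance threshold, and average precision is computed without interpolation. The names `greedy_match` and `average_precision` are hypothetical.

```python
import math


def greedy_match(preds, gts, dist_thresh):
    """Greedily match predictions to ground-truth points.

    preds: list of (x, y, score); gts: list of (x, y).
    Returns true-positive flags in descending score order.
    """
    order = sorted(preds, key=lambda p: p[2], reverse=True)
    used = [False] * len(gts)
    flags = []
    for x, y, _ in order:
        best, best_d = -1, dist_thresh
        for k, (gx, gy) in enumerate(gts):
            if used[k]:
                continue
            d = math.hypot(x - gx, y - gy)
            if d <= best_d:
                best, best_d = k, d
        if best >= 0:
            used[best] = True  # each ground-truth point is matched once
            flags.append(True)
        else:
            flags.append(False)  # duplicate or spurious detection
    return flags


def average_precision(flags, num_gt):
    """Non-interpolated AP: mean precision at each true positive."""
    if num_gt == 0:
        return 0.0
    tp, ap = 0, 0.0
    for rank, hit in enumerate(flags, start=1):
        if hit:
            tp += 1
            ap += tp / rank
    return ap / num_gt
```

Marking a ground-truth point as used is what makes duplicate detections count as false positives, and dividing by the number of ground-truth points is what penalizes misses.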

Crowd tracking. Crowd tracking requires an algorithm to recover the trajectories of people in a video sequence, and is evaluated with the metric in [isvrc-2017]. Specifically, each tracker is required to output a series of head points with confidence scores and the corresponding identities. We sort the tracklets, formed by the locations with the same identity, based on the average confidence of their detections. A tracklet is considered correct if the matched ratio between the predicted and ground-truth tracklets is larger than a threshold. We use thresholds in evaluation, i.e., , , and . The matching distance threshold between the predicted and ground-truth locations on the tracklets is set to pixels. The T-mAP scores over different thresholds (i.e., T-AP, T-AP, and T-AP) are used to measure the performance.
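One plausible reading of the matched ratio is sketched below. The exact definition is given by the cited metric, so this is an assumption: the ratio is taken as the number of frames in which the predicted and ground-truth points lie within the distance threshold, divided by the number of frames covered by either tracklet. The function name `matched_ratio` and the dictionary layout are hypothetical.

```python
import math


def matched_ratio(pred, gt, dist_thresh):
    """Matched ratio between one predicted and one ground-truth tracklet.

    pred, gt: dict mapping frame index -> (x, y) head point.
    Assumed definition: matched frames / frames in the union of both.
    """
    union = set(pred) | set(gt)
    if not union:
        return 0.0
    hits = 0
    for frame in set(pred) & set(gt):
        (px, py), (gx, gy) = pred[frame], gt[frame]
        if math.hypot(px - gx, py - gy) <= dist_thresh:
            hits += 1
    return hits / len(union)
```

Under this reading, a predicted tracklet is penalized both for drifting away from the target and for being shorter or longer than the ground-truth trajectory.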

## 4 Our Method

Our STNNet sequentially takes a pair of frames as input, and outputs the density maps, the locations, and the motion offsets of objects in these two frames, see Figure 3. After that, the association method is used to generate long trajectories of objects in videos.

### 4.1 Network Architecture

As shown in Figure 3, the Siamese feature extraction subnetwork in our STNNet is built on the first groups of convolution layers of a two-branch VGG-16 network [DBLP:journals/corr/SimonyanZ14a] with shared parameters to extract multi-scale features. Inspired by [DBLP:conf/miccai/RonnebergerFB15], a U-Net style architecture is used to fuse multi-scale features for prediction. Using the density map estimation heads, we determine the number of targets based on multi-scale features. Meanwhile, the correlation operation [DBLP:conf/cvpr/IlgMSKDB17] is applied to the extracted features to exploit temporal coherence at different stages. In addition, the localization and association subnets are introduced to predict the locations of target points and the corresponding motion offsets, as described below.

Localization subnet. The localization subnet consists of classification and regression branches. To generate accurate locations of objects, we tile an object proposal at each pixel. The classification branch predicts the probability that each proposal is an object, and the regression branch generates the accurate locations of the positive proposals. As shown in Figure 4, we fuse multi-scale feature maps (i.e., , , ) with both channel and spatial attention [DBLP:conf/eccv/WooPLK18] for each branch. After that, we resize the multi-scale feature maps and then output the fused classification and regression maps. The classification map denotes the probability that each proposal contains an object and the regression map contains the regressed offsets of the positive proposals. Finally, we perform non-maximal suppression to predict the accurate object locations.

Association subnet. As mentioned above, we introduce the association subnet to predict the motion offsets of each object to complete the tracking task. As shown in Figure 5(a), given the top scored post-processed object proposals generated by the localization subnet in the -th frame and the fused multi-scale correlation features, we use stacked PointConv [DBLP:conf/cvpr/WuQL19] and multi-layer perceptron (MLP) operations to construct the association subnet, which generates the motion offsets in a cycle, i.e., from the -th frame to -th frame and vice versa. Note that only the nearest points are considered in each PointConv operation.

### 4.2 Multi-Task Loss Function

We use the multi-task loss to guide the training of our STNNet method, which consists of three terms, including the neighboring context loss , the localization loss , and the density loss , i.e.,

(1) |

where is the index of the batch, is the batch size. and are the predicted and ground-truth density maps. and are the predicted and ground-truth labels (i.e., objects or background) of the object proposals, and are the predicted and ground-truth offsets of the object proposals. and are the predicted and ground-truth locations of the objects, and is the predicted motion offsets of the objects. In the following, we describe each loss term in detail.

Density loss. Inspired by [DBLP:conf/cvpr/ZhangZCGM16], we use the pixel-wise Euclidean loss for the density loss. The geometry-adaptive Gaussian kernel method is used to generate the ground-truth density map . The density loss term is computed as

(2) |

where and are the values of the predicted and ground-truth density maps at of layer at time of the -th batch, and is the parameter to balance the influence of each layer.
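The density term can be illustrated with a short sketch. It assumes the loss is a weighted sum, over output layers, of squared per-pixel differences between predicted and ground-truth density maps; any normalization by map size or batch size is omitted because it is not recoverable from the text. The name `density_loss` and the list-of-lists map layout are hypothetical.

```python
def density_loss(pred_maps, gt_maps, layer_weights):
    """Weighted pixel-wise Euclidean loss over multi-scale density maps.

    pred_maps, gt_maps: one 2-D list (rows of floats) per output layer.
    layer_weights: one balancing weight per layer.
    Normalization is left out on purpose (assumption of this sketch).
    """
    total = 0.0
    for pred, gt, weight in zip(pred_maps, gt_maps, layer_weights):
        squared = 0.0
        for pred_row, gt_row in zip(pred, gt):
            for p, g in zip(pred_row, gt_row):
                squared += (p - g) ** 2
        total += weight * squared
    return total
```

The per-layer weights let coarse and fine density maps contribute unequally, which is the role the balancing parameter plays in the text.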

Localization loss. Motivated by object detection [DBLP:conf/nips/RenHGS15, DBLP:journals/pami/RenHG017, DBLP:conf/cvpr/ZhangWBLL18], the localization loss is formed by a classification loss and a regression loss. We tile the point proposals on each pixel and match each proposal to the ground-truth points. If a proposal lies in the neighboring region of a ground-truth point, we assign it as a positive proposal (i.e., for the proposal at in the -th layer in the -th batch); otherwise it is assigned to the background (i.e., ). Thus, the localization loss is computed as

(3) |

where and are the predicted and ground-truth labels at of layer . and are the predicted and ground-truth offsets at of layer . We use the log loss to compute , and the squared loss to compute . Notably, the regression loss is only activated for the positive proposals.
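The structure of this loss, a log loss over all proposals plus a squared loss restricted to positives, can be sketched as below. The unweighted sum of the two terms, the clamping constant `eps`, and the function name `localization_loss` are assumptions made for illustration.

```python
import math


def localization_loss(pred_probs, gt_labels, pred_offsets, gt_offsets, eps=1e-7):
    """Classification log loss plus regression squared loss on positives.

    pred_probs: predicted object probability per proposal.
    gt_labels: 1 for a positive proposal, 0 for background.
    pred_offsets, gt_offsets: (dx, dy) per proposal.
    """
    cls_loss, reg_loss = 0.0, 0.0
    for p, y, (dx, dy), (gx, gy) in zip(pred_probs, gt_labels, pred_offsets, gt_offsets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        cls_loss += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
        if y == 1:
            # The regression term is active only for positive proposals.
            reg_loss += (dx - gx) ** 2 + (dy - gy) ** 2
    return cls_loss + reg_loss
```

Gating the regression term on the label means background proposals never pull the offset branch toward arbitrary targets.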

Neighboring context loss. In crowded scenes, the objects are generally clustered in a small region and usually share similar motion patterns in consecutive frames. To exploit the motion consistency of neighboring objects, we design a neighboring context loss, which is formed by two parts, i.e., the temporal prediction constraint, and the relation constraint, see Figure 5(b).

Specifically, the temporal prediction constraint enforces the proposals in the consecutive frames projected by the predicted motion offsets to approach the ground-truth points. Let be the location of the -th proposal at time , be the object location in the neighboring region of the proposal at , and be the predicted offset corresponding to the proposal at from time to . Thus, the temporal prediction constraint aims to minimize the -norm of the differences, i.e., . Meanwhile, the relation constraint enforces the relation vectors between the target and neighboring objects to approach the relation vectors of their corresponding associated ground-truth points. Let be the relation vector between the target and neighboring objects projected to the second frame (the relation vector is computed as ), and be the relation vector between the ground-truth points at and . Thus, the relation constraint aims to minimize . The cycle strategy is used to compute the neighboring context loss, i.e.,

(4) |

where and are the projected targets.
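A rough sketch of the two constraints and the cycle strategy follows. Several details are assumptions, because the formulas did not survive in the text: the relation vector is taken as the position difference between a neighbor and the target, Euclidean distance is used for both terms, and the forward and backward passes are simply added. All function and parameter names are hypothetical.

```python
import math


def one_way_context_loss(points, offsets, gt_next, neighbors):
    """Temporal prediction plus relation constraint in one direction.

    points: (x, y) proposals in the source frame.
    offsets: predicted (dx, dy) toward the other frame.
    gt_next: matched ground-truth (x, y) in the other frame.
    neighbors: for each proposal, indices of its neighboring proposals.
    """
    projected = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(points, offsets)]
    # Temporal prediction: projected points should land on ground truth.
    temporal = sum(
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(projected, gt_next)
    )
    # Relation: relative positions of neighbors should match ground truth.
    relation = 0.0
    for i, nbrs in enumerate(neighbors):
        for j in nbrs:
            rel = (projected[j][0] - projected[i][0], projected[j][1] - projected[i][1])
            gt_rel = (gt_next[j][0] - gt_next[i][0], gt_next[j][1] - gt_next[i][1])
            relation += math.hypot(rel[0] - gt_rel[0], rel[1] - gt_rel[1])
    return temporal + relation


def neighboring_context_loss(pts_t, off_fwd, gt_t1, nbrs_t,
                             pts_t1, off_bwd, gt_t, nbrs_t1):
    """Cycle strategy: forward (t to t+1) plus backward (t+1 to t)."""
    forward = one_way_context_loss(pts_t, off_fwd, gt_t1, nbrs_t)
    backward = one_way_context_loss(pts_t1, off_bwd, gt_t, nbrs_t1)
    return forward + backward
```

In this sketch an offset error on a single target is penalized once by the temporal term and again through every neighbor pair it belongs to, which is how the relation constraint discourages neighbors from drifting apart.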

### 4.3 Optimization

To increase diversity in the training data, we randomly flip and crop the training images. Due to limited computation resources, we equally divide each frame into patches, and use the divided patches with the resolution of for training. For the PointConv layer, we use nearest points to capture the context information. In Eq. (2), the pre-set weights are set to . The matching threshold between the proposals and ground-truth points is set to pixels. Meanwhile, the threshold used to determine the neighboring regions of pixels is set to pixels. The total number of proposal objects is set to . In addition, we set the batch size in the training phase.

Two-stage training. We use a two-stage strategy to train our network. In the first stage, we remove the association subnet and train the network to generate accurate density maps and object proposals. After that, we fix the parameters in the density map estimation heads, and add the association subnet to fine-tune the whole network. We use the Adam optimization algorithm [DBLP:journals/corr/KingmaB14] with the learning rate in both stages.

## 5 Experiment

As discussed above, we conduct the experiment on our DroneCrowd for crowd counting, localization and tracking. We report the density map estimation results and speeds of STNNet and existing methods. Meanwhile, the ablation study is conducted to verify the effectiveness of important components in our method. Besides, some visual results are shown in Figure 6.

Method | Speed (FPS) | Overall MAE | Overall MSE | Large MAE | Large MSE | Small MAE | Small MSE | Cloudy MAE | Cloudy MSE | Sunny MAE | Sunny MSE | Night MAE | Night MSE | Crowded MAE | Crowded MSE | Sparse MAE | Sparse MSE |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MCNN [DBLP:conf/cvpr/ZhangZCGM16] | | | | | | | | | | | | | | | | | |
C-MTL [DBLP:conf/avss/SindagiP17] | | | | | | | | | | | | | | | | | |
MSCNN [DBLP:conf/icip/ZengXCQZ17] | | | | | | | | | | | | | | | | | |
LCFCN [DBLP:conf/eccv/LaradjiRPVS18] | | | | | | | | | | | | | | | | | |
SwitchCNN [DBLP:conf/cvpr/SamSB17] | | | | | | | | | | | | | | | | | |
ACSCP [DBLP:conf/cvpr/ShenXNWHY18] | | | | | | | | | | | | | | | | | |
AMDCN [DBLP:conf/cvpr/DebV18] | | | | | | | | | | | | | | | | | |
StackPooling [DBLP:journals/corr/abs-1808-07456] | | | | | | | | | | | | | | | | | |
DA-Net [DBLP:journals/access/ZouSQZ18] | | | | | | | | | | | | | | | | | |
CSRNet [DBLP:conf/cvpr/LiZC18] | | | | | | | | | | | | | | | | | |
CAN [DBLP:conf/cvpr/LiuSF19] | | | | | | | | | | | | | | | | | |
DM-Count [DBLP:conf/nips/WangLSH20] | | | | | | | | | | | | | | | | | |
STNNet (w/o loc) | | | | | | | | | | | | | | | | | |
STNNet | | | | | | | | | | | | | | | | | |

Methods | L-mAP | L-AP | L-AP | L-AP |
---|---|---|---|---|
MCNN [DBLP:conf/cvpr/ZhangZCGM16] | | | | |
CAN [DBLP:conf/cvpr/LiuSF19] | | | | |
CSRNet [DBLP:conf/cvpr/LiZC18] | | | | |
DM-Count [DBLP:conf/nips/WangLSH20] | | | | |
STNNet (w/o loc) | | | | |
STNNet (w/o ass) | | | | |
STNNet (w/o rel) | | | | |
STNNet (w/o cyc) | | | | |
STNNet | | | | |

Methods | T-mAP | T-AP | T-AP | T-AP |
---|---|---|---|---|
MCNN [DBLP:conf/cvpr/ZhangZCGM16] | | | | |
CAN [DBLP:conf/cvpr/LiuSF19] | | | | |
CSRNet [DBLP:conf/cvpr/LiZC18] | | | | |
DM-Count [DBLP:conf/nips/WangLSH20] | | | | |
STNNet (w/o loc) | | | | |
STNNet (w/o ass) | | | | |
STNNet (w/o rel) | | | | |
STNNet (w/o cyc) | | | | |
STNNet | | | | |

Density map estimation. As shown in Table 3, our STNNet performs favorably against the state-of-the-art methods, with an improvement of MAE and MSE in comparison to the second best DM-Count [DBLP:conf/nips/WangLSH20] in the overall testing set. This indicates that our method generates more accurate and robust density maps in different scenarios. To further analyze the results, we report the performance on several subsets based on the video-level attributes (see Section 3.1). LCFCN [DBLP:conf/eccv/LaradjiRPVS18] and AMDCN [DBLP:conf/cvpr/DebV18] do not perform well in the Crowded subset, producing the two worst MAE and MSE scores. This may be because LCFCN [DBLP:conf/eccv/LaradjiRPVS18] uses a loss function that encourages the network to output a segmentation blob for each object in crowd counting. However, in drone-captured scenarios, each object may contain only a few pixels, making it difficult to separate objects accurately. AMDCN [DBLP:conf/cvpr/DebV18] uses multiple columns of large dilated convolution operations, which inevitably integrate considerable background noise and affect the accuracy of density map estimation. In contrast, MCNN [DBLP:conf/cvpr/ZhangZCGM16] uses multi-column CNNs to learn features adaptive to variations in object size due to perspective effect or image resolution, resulting in better performance. CAN [DBLP:conf/cvpr/LiuSF19] achieves the best performance in both the Cloudy and Crowded subsets by exploiting multi-scale contextual information in density maps. DM-Count [DBLP:conf/nips/WangLSH20] obtains the best MAE and MSE scores in the Sunny subset without imposing Gaussians on annotations. Our STNNet achieves the best result in the other four subsets, which demonstrates the effectiveness and importance of exploiting multi-scale features in density map estimation.

Furthermore, to study the effectiveness of the localization subnet in STNNet for density map estimation, we construct a variant of STNNet, i.e., STNNet (w/o loc), which removes the localization subnet from STNNet. As shown in Table 3, our STNNet achieves better results than STNNet (w/o loc) by decreasing MAE score and MSE score, which validates the importance of the localization subnet.

Crowd localization. As presented in Table 4, we compare the localization results of the methods with top density estimation results (i.e., MCNN [DBLP:conf/cvpr/ZhangZCGM16], CSRNet [DBLP:conf/cvpr/LiZC18], CAN [DBLP:conf/cvpr/LiuSF19], and DM-Count [DBLP:conf/nips/WangLSH20]) and our STNNet variants, i.e., STNNet (w/o loc), STNNet (w/o ass) and STNNet (w/o cyc). STNNet (w/o loc) denotes the method that removes both the association and localization subnets from STNNet, STNNet (w/o ass) denotes the method that removes the association subnet from STNNet, and STNNet (w/o cyc) denotes the method that only considers the forward motion offsets in the neighboring context loss computation. Meanwhile, for the density map estimation based methods such as MCNN, CSRNet, CAN, DM-Count, and STNNet (w/o loc), similar to [DBLP:conf/eccv/IdreesTAZARS18], we post-process the predicted density maps to find local peaks using a preset threshold.
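The peak-finding post-processing for density-based methods can be sketched as follows. This is an illustrative version under assumptions, not the procedure of the cited work: a pixel is kept when its value reaches the threshold and no pixel in its 3x3 neighborhood is strictly larger. The function name `find_local_peaks` and the window size are hypothetical choices.

```python
def find_local_peaks(density, threshold):
    """Return (row, col, value) for thresholded local maxima.

    density: 2-D list of floats (a predicted density map).
    A pixel is a peak if value >= threshold and no 3x3 neighbor
    is strictly larger (window size is an assumption of this sketch).
    """
    height = len(density)
    width = len(density[0]) if height else 0
    peaks = []
    for r in range(height):
        for c in range(width):
            value = density[r][c]
            if value < threshold:
                continue
            is_peak = True
            for rr in range(max(0, r - 1), min(height, r + 2)):
                for cc in range(max(0, c - 1), min(width, c + 2)):
                    if (rr, cc) != (r, c) and density[rr][cc] > value:
                        is_peak = False
            if is_peak:
                peaks.append((r, c, value))
    return peaks
```

Equal-valued neighbors both count as peaks here, so a flat plateau yields several points; a real pipeline would need a tie-breaking rule.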

As shown in Table 4, we find that STNNet achieves the best accuracy with L-mAP and surpasses the second best DM-Count [DBLP:conf/nips/WangLSH20] L-mAP. This indicates that our method generates more accurate localizations of each target. Compared to STNNet (w/o cyc), STNNet improves the localization accuracy by , which shows the effectiveness of the cycle strategy in the neighboring context loss for the localization task. Without the association subnet, the L-mAP score decreases ( of STNNet vs. ), indicating that temporal coherence helps improve the localization accuracy. If we remove both the association and localization subnets, the L-mAP score decreases more than . This demonstrates that the localization subnet enforces the network to focus on more discriminative features to localize people’s heads.

Crowd tracking. For object tracking, two association methods, i.e., the min-cost flow method [DBLP:conf/cvpr/PirsiavashRF11] and the social-LSTM method [DBLP:conf/cvpr/AlahiGRRLS16], are used to generate long trajectories of objects. To validate the effectiveness of STNNet for crowd tracking, we compare it to several methods including MCNN, CSRNet, CAN, DM-Count, STNNet (w/o loc), STNNet (w/o ass), STNNet (w/o cyc) and STNNet. It is worth mentioning that STNNet (w/o loc) performs crowd tracking based on the localized points from density maps, similar to MCNN, CSRNet, CAN, and DM-Count. Without predicting motion offsets, STNNet (w/o ass) directly associates the targets from the localization results. STNNet (w/o cyc) and STNNet first connect short tracklets in two consecutive frames based on the predicted offsets, and then generate long trajectories using the same data association methods [DBLP:conf/cvpr/PirsiavashRF11, DBLP:conf/cvpr/AlahiGRRLS16].
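The first step, connecting points in two consecutive frames through the predicted offsets, could look like the sketch below. The matching rule is an assumption: each point is projected by its offset and candidate pairs within `max_dist` are accepted greedily from the closest pair outward. The function name `link_by_offsets` and its parameters are hypothetical.

```python
import math


def link_by_offsets(points_t, offsets, points_t1, max_dist):
    """Link points in frame t to points in frame t+1 via predicted offsets.

    points_t: (x, y) points in frame t; offsets: predicted (dx, dy).
    points_t1: (x, y) points in frame t+1.
    Returns sorted (i, j) index pairs; each point is linked at most once.
    """
    candidates = []
    for i, ((x, y), (dx, dy)) in enumerate(zip(points_t, offsets)):
        for j, (qx, qy) in enumerate(points_t1):
            d = math.hypot(x + dx - qx, y + dy - qy)
            if d <= max_dist:
                candidates.append((d, i, j))
    candidates.sort()  # closest projected pairs are accepted first
    used_i, used_j, links = set(), set(), []
    for _, i, j in candidates:
        if i in used_i or j in used_j:
            continue
        used_i.add(i)
        used_j.add(j)
        links.append((i, j))
    return sorted(links)
```

The resulting two-frame tracklets would then be handed to a global association method, such as the min-cost flow or social-LSTM approaches named above, to form long trajectories.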

From Table 5, we notice that STNNet achieves T-mAP score, which is higher than the second best DM-Count. Meanwhile, STNNet (w/o cyc) produces higher T-mAP score than our method. STNNet (w/o ass) produces inferior results to STNNet, i.e., vs. . The T-mAP score of STNNet (w/o loc) decreases compared to STNNet (w/o ass). These results indicate that the association and localization subnets are critical in crowd tracking. However, these results are still far from satisfactory. Besides, we find that the method using social-LSTM [DBLP:conf/cvpr/AlahiGRRLS16] performs comparably with that using min-cost flow [DBLP:conf/cvpr/PirsiavashRF11]. This indicates that it is possible to predict the motion patterns of objects based on the observed trajectories. In summary, our DroneCrowd dataset is extremely challenging for crowd tracking, and much effort is needed to develop more effective methods for real scenarios.

Effectiveness of neighboring context loss. To further demonstrate the effectiveness of the relation constraint in the neighboring context loss, we construct a variant STNNet (w/o rel) by removing the relation constraint from STNNet (w/o cyc). As shown in Tables 4 and 5, STNNet (w/o rel) produces and L-mAP and T-mAP scores, respectively. STNNet (w/o cyc) improves and L-mAP and T-mAP scores compared with STNNet (w/o rel).

## 6 Conclusion

In this work, we propose the STNNet method to jointly solve density map estimation, localization, and tracking in drone-captured crowded scenes. Notably, we design the neighboring context loss to capture relations among neighboring targets in consecutive frames, which is effective for localization and tracking. To better evaluate the performances on drones, we collect and annotate a new dataset, DroneCrowd. To the best of our knowledge, it is the largest dataset to date in terms of annotated trajectories of heads for density map estimation, crowd localization, and tracking on drones. We hope the dataset and the proposed method can facilitate the research and development in crowd localization, tracking and counting on drones.