Crowd counting and crowd density estimation play essential roles in safety monitoring and behavior analysis, especially in the case of mass events. They can lead to early detection of congestion or security-related abnormalities, informing and helping organizers and decision makers to avoid crowd disasters [Idrees2018]. Closed-Circuit Television (CCTV) surveillance cameras have conventionally been used for crowd monitoring, and they have become ubiquitous in recent years, providing a large number of images with various perspectives, scales, and illumination conditions. However, for mass events spread over wide open areas with thousands of people attending, monitoring the crowd from above using aerial imagery (e.g., using airborne platforms) was shown to be advantageous due to the wider field of view and smaller occlusion effects as compared to CCTV images [Cui2017]. Nevertheless, in spite of the increasing volume of available aerial images due to the advances in airborne and UAV platforms, crowd counting and density estimation datasets and methods for aerial imagery are still lacking. Therefore, as one of the two main contributions of this work, we introduce a novel crowd dataset, the DLR Aerial Crowd Dataset (DLR-ACD), which is composed of 33 large aerial images (the average image size is 3619×5226 pixels) acquired by standard DSLR cameras installed on an airborne platform on a helicopter. The images come from 16 flight campaigns, i.e. different mass events, and the dataset contains 226,291 person annotations. Figure 1 shows example images from DLR-ACD. To the best of our knowledge, DLR-ACD is the first aerial image crowd dataset and, with it, we hope to promote research on aerial crowd analysis. The dataset will be released at https://www.dlr.de/eoc/en/desktopdefault.aspx/tabid-12760/22294_read-58354/.
Despite the many benefits, crowd counting and density estimation remain complex tasks in practice. For example, detecting and counting people in low-resolution surveillance images, in which each person may only cover a few pixels and occlusion is frequent, is very difficult even for human experts [Kang2019, Liu2018]. Therefore, developing automatic methods for precise crowd counting and density estimation is of high interest. In recent years, methods based on Convolutional Neural Networks (CNNs) have achieved promising results [Boominathan2016, Zhang2016, Sam2017, Sam2018, Sindagi2018, Zeng2017, Cao2018, Wang2018, Idrees2018, Huang2017, Liu2018b, Shi2018, Ranjan2018]. In contrast to using preset features, CNN-based methods learn features during training, which allows them to better cope with images with arbitrary perspectives, scales, and crowd densities [Sindagi2017].
Considering the advantages of the existing crowd counting and density estimation methods and the new challenges presented by our aerial dataset, such as extremely small objects and complex backgrounds, in this work we propose the Multi-Resolution Crowd Network (MRCNet), which relies on a pre-trained VGG-16 model as an encoder and a combination of bilinear upsampling and convolution layers as a decoder. In addition, in order to preserve as much high-resolution signal as possible while extracting multi-scale features, the encoder and decoder are connected through lateral connections with element-wise addition, similar to the Feature Pyramid Network (FPN) architecture [Lin2017]. While high-level features provide contextual information, the low-level features extract detailed information. In order to collect information from people with various sizes and in different crowd density conditions, MRCNet propagates information in all levels from the bottom to the top. Taking advantage of this structure, MRCNet shows high robustness and transferability by achieving superior crowd counting and density map estimation results on both aerial and CCTV-based datasets. It outperforms the state-of-the-art methods on the ShanghaiTech dataset [Zhang2016], a challenging CCTV-based benchmark, by achieving the smallest Mean Absolute count Error (MAE). In addition, on our proposed DLR-ACD dataset, MRCNet achieves the smallest counting MAE and the highest F1-score in person detection.
In summary, our main contributions are:
We introduce a novel aerial crowd dataset, the so-called DLR-ACD.
We propose MRCNet which is able to deal with the existing challenges in both aerial and CCTV-based crowd images.
MRCNet achieves state-of-the-art results on the DLR-ACD and ShanghaiTech datasets.
2 Related Works
There is a vast literature on crowd counting and crowd density estimation in computer vision [Sindagi2018]. In order to deal with scale variation, a number of previous works proposed employing networks with multi-column architectures to obtain receptive fields with various sizes and extract features at different scales. A three-column architecture developed by Zhang et al. [Zhang2016], the so-called MCNN, and a combination of shallow and deep networks proposed by Boominathan et al. [Boominathan2016] are examples of multi-column CNNs. As an approach toward improving the network performance in the presence of significant crowd density variations, Sam et al. [Sam2017] proposed a switching multi-column CNN in which each column is trained separately on image patches of specific crowd densities. Ranjan et al. [Ranjan2018] proposed ic-CNN, a two-branch network with a feed-forward structure, which incorporates low-resolution prediction maps for generating high-resolution crowd density maps.
Despite the improvement achieved by the multi-column approaches, their scale-invariance highly depends on the number of columns and their receptive field sizes [Sindagi2018]. Furthermore, their depths are usually limited as they incur heavy computational overhead. Therefore, single-column architectures have been favoured in most of the recent crowd counting networks. Zeng et al. [Zeng2017] proposed a single-column CNN, the so-called MSCNN, composed of scale aggregation blocks to tackle scale variations. Later, Cao et al. [Cao2018] used more sophisticated scale aggregation blocks in a deeper CNN, the so-called SANet, together with a composition loss (Euclidean and local pattern consistency loss), and improved the count accuracy significantly. Wang et al. [Wang2018] also relied on scale aggregation blocks in designing scale-aware residual modules for SCNet, a single-column crowd counting CNN. In order to tackle the variations in crowd density and appearance, Sam et al. [Sam2018] proposed a growing CNN, the so-called IG-CNN, in which a base CNN is recursively split into two child CNNs, each becoming an expert on certain crowd types during training.
The advantages of multi-task CNNs have inspired Sindagi et al. [Sindagi2017] to develop a cascaded multi-task CNN, the so-called CMTL, which simultaneously classifies crowds into different density levels and estimates density maps. Idrees et al. [Idrees2018] developed a CNN that solves crowd counting, density estimation, and localization simultaneously. The network consists of a series of DenseNet [Huang2017] blocks with a composition of multiple intermediate losses, each optimizing the network for a ground-truth crowd density map smoothed with a different Gaussian kernel, and a final count loss.
A number of previous works took advantage of pre-trained networks as backbones for crowd counting networks. In CSRNet, Li et al. [Li2018] coupled VGG-16 with a dilated CNN as the back-end to obtain larger receptive fields and thus better count accuracy. Later, Liu et al. [Liu2018b] combined the high- and low-level features by employing the VGG-16 network and FPN. This allows preserving and propagating the fine-grained details of small targets while also incorporating a large degree of context information. To deal with the over-fitting problem of crowd counting networks, Shi et al. [Shi2018] proposed a learning strategy based on deep negative correlation learning applied to a modified VGG-16 network [Simonyan2015], which results in generalizable features through learning a pool of regressors to estimate the crowd density.
Our proposed MRCNet is based on a single-column encoder-decoder architecture similar to SegNet [Badrinarayanan2017], U-Net [Ronneberger2015], and Liu et al. [Liu2018b]. It uses a pre-trained VGG-16 as encoder. MRCNet is different from SegNet and U-Net in the decoder structure and the lateral connections. While the lateral connections of SegNet and U-Net are based on max-pooling indices and concatenations of earlier convolutional layers, respectively, MRCNet takes advantage of FPN-based lateral connections; however, it is different from [Liu2018b] in the number and wiring of the lateral connections and the decoder structure (e.g., the technique and the level of upsampling). It is different from [Li2018] in the number of convolutional blocks used and the decoder structure. MRCNet considers crowd counting and density map estimation as two interrelated tasks and performs a multi-resolution prediction. It is different from the multi-task networks [Idrees2018, Sindagi2017] in the network structure and the task formulations.
3 DLR’s Aerial Crowd Dataset
DLR’s Aerial Crowd Dataset (DLR-ACD) is a collection of 33 large RGB aerial images with an average size of 3619×5226 pixels, acquired through 16 different flight campaigns performed between 2011 and 2017. The aerial images were captured at various mass events and over urban scenes involving crowds, such as sport events, city centers, open-air fairs and festivals. The images were recorded using a camera system composed of three standard DSLR cameras (one nadir-looking and two side-looking cameras) mounted on an airborne platform installed on a helicopter flying at an altitude between 500 m and 1600 m. The different flight altitudes resulted in a range of spatial resolutions (ground sampling distances, GSD) from 4.5 cm/pixel to 15 cm/pixel, and we also consider different viewing angles. Furthermore, the images were selected so that they represent different crowd densities and crowd behavior, from the sparse moving crowds in city centers to the very dense (mostly) stationary ones at concerts. The dataset was labeled manually with point annotations on individual people, which took about 80 hours and resulted in 226,291 person annotations, ranging from 285 to 24,368 annotations per image. Crowd annotation in aerial images is a challenging task due to the large image sizes as well as the large number and the small size of the people in the images. While in dense crowd areas discriminating each person from adjacent people is difficult, in sparse crowd areas localizing and discriminating each person from similar-looking objects is also challenging and time consuming.
Table 1 shows detailed information about the images and the annotations. Our images come from four types of events: sports events, city centers, fairs (e.g. trade fairs, Oktoberfest, etc.), and (music) festivals. To ensure that all scenes are covered in our train/test splits and that images from the same campaign are either in the training or in the test set, the dataset was split manually rather than randomly into 19 training and 14 test images. The counts in the training and test sets are 138,151 and 88,140 persons, respectively.
|Images||Scenes||Campaigns||GSD (cm/pixel)||Size (pixels)||Person Count||Train/Test|
Table 2 and Figure 2 show the statistics of existing crowd datasets as well as our dataset. Among them, DLR-ACD is the first dataset that provides aerial views of crowds and therefore presents different challenges from existing datasets. Its images are much larger in size, which might lead to higher computational costs and memory requirements. For example, when converted to match the image size of the widely-used ShanghaiTech-A dataset, DLR-ACD is larger. In addition, as Figure 2 shows, most of the images in DLR-ACD contain a large number of people, which is very different from the other crowd datasets. Furthermore, crowd densities vary significantly within and between images due to their large fields of view.
|Dataset||Data Collection||Average Image Size (px)||Number of Images||Total Person Count||Average Count||Maximum Count||Minimum Count|
|UCSD [Chan2008]||CCTV cameras||158×238||2,000||49,885||25||46||11|
|UCF_CC_50 [Idrees2013]||Web search||2101×2888||50||63,974||1,279||4,633||94|
|WorldExpo’10 [Zhang2015]||CCTV cameras||576×720||3,980||225,216||56||334||1|
|ShanghaiTech-A [Zhang2016]||Web search||589×868||482||241,677||501||3,139||33|
|ShanghaiTech-B [Zhang2016]||CCTV cameras||768×1024||716||88,488||123||578||9|
|UCF-QNRF [Idrees2018]||Web search||2013×2902||1,535||1,251,642||815||12,865||49|
The Multi-Resolution Crowd Network (MRCNet) utilizes an encoder-decoder structure to extract image features and generate crowd density maps. It takes a single image of arbitrary size and, in a fully-convolutional manner, predicts two density maps, one with 1/4 of the input image size for the people counting task and the other one with the input image size for the density map estimation task. For the encoder, MRCNet relies on a pre-trained VGG-16 network [Simonyan2015] (without batch normalization) composed of five CNN blocks, where the spatial size is halved after each block by a max-pooling layer. The decoder is composed of five CNN blocks, each preceded by an up-sampling layer based on bilinear interpolation which increases the spatial size by a factor of two. Figure 3 illustrates the network structure. The number of feature maps and the convolution kernel sizes are given below each box. After each convolution, a ReLU nonlinearity is applied, except for the layers with 1×1 kernels.
To deal with the diverse backgrounds of the crowd images (and of the aerial images in particular), using a deep CNN with multiple pooling layers helps reduce the influence of high-frequency content and small irrelevant background objects by increasing the receptive field size and extracting more contextual information. However, this could also remove the relevant details of the target objects (people), which is critical due to their small sizes. Therefore, MRCNet employs an FPN-based mechanism to combine the contextual information of the higher-level features and the detailed information provided by the lower-level features by element-wise adding the feature maps from the earlier stages to the ones in the later stages. This also helps avoid the vanishing gradient problem. Using convolution layers with 1×1 kernels on each lateral connection allows linear transformation and dimension reduction in the filter space.
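The lateral connection described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the channel counts (64 and 128), and the random weights are hypothetical and only show how a 1×1 convolution reduces the channel dimension of an encoder feature map before it is added element-wise to a bilinearly upsampled decoder feature map.

```python
import numpy as np

def conv1x1(x, weight):
    """Apply a 1x1 convolution (no bias) to a (C_in, H, W) feature map.

    weight has shape (C_out, C_in); a 1x1 kernel is a per-pixel linear map
    across channels, which is how it can reduce the channel dimension.
    """
    return np.einsum("oc,chw->ohw", weight, x)

def bilinear_upsample2x(x):
    """Upsample a (C, H, W) feature map by a factor of two, bilinearly."""
    _, h, w = x.shape
    # Centers of the output pixels expressed in input pixel coordinates.
    ys = np.clip((np.arange(2 * h) + 0.5) / 2 - 0.5, 0, h - 1)
    xs = np.clip((np.arange(2 * w) + 0.5) / 2 - 0.5, 0, w - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[None, :, None]
    wx = (xs - x0)[None, None, :]
    top = x[:, y0][:, :, x0] * (1 - wx) + x[:, y0][:, :, x1] * wx
    bottom = x[:, y1][:, :, x0] * (1 - wx) + x[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bottom * wy

def lateral_merge(decoder_feat, encoder_feat, lateral_weight):
    """Upsample the coarse decoder map and add the projected encoder map."""
    return bilinear_upsample2x(decoder_feat) + conv1x1(encoder_feat, lateral_weight)

# Hypothetical shapes: a coarse 64-channel decoder map and a finer
# 128-channel encoder map at twice the spatial size.
rng = np.random.default_rng(0)
decoder_feat = rng.normal(size=(64, 8, 8))
encoder_feat = rng.normal(size=(128, 16, 16))
lateral_weight = rng.normal(scale=0.01, size=(64, 128))
merged = lateral_merge(decoder_feat, encoder_feat, lateral_weight)
print(merged.shape)  # (64, 16, 16)
```

Because the merge is an addition rather than a concatenation, the projected encoder map must have the same number of channels as the decoder map, which is the role the 1×1 projection plays in this sketch.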
While most of the proposed methods focus on the counting task, MRCNet considers crowd counting and density map estimation as two interrelated tasks. To this end, inspired by the effective composition of multiple losses in [Idrees2018, Ranjan2018, Sindagi2017], MRCNet takes advantage of two losses at different resolutions for the counting and density map estimation tasks. A number of previous works [Zhang2016, Zeng2017, Sam2018, Liu2018b] have shown that crowd counting can be performed without upsampling the prediction maps to the size of the input images. This reduces the prediction complexity, as the fine details do not have to be predicted. Taking this into account, MRCNet generates a low-resolution prediction (1/4 of the input image size) at an earlier stage of the decoder and compares it to a downsampled ground truth by optimizing a pixel-wise Mean Squared Error (MSE) loss, the count loss. The network is supposed to predict the image count at this stage and to use the rest of the decoder for predicting the full-resolution crowd density maps with high localization precision, while keeping the count close to the ground truth by optimizing a second pixel-wise MSE loss, the density loss. MSE is widely used by crowd counting networks. The total loss is then computed as a weighted combination of the two losses, where the weighting factor is empirically set to 0.0001. The number of network parameters is 20.3 M.
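The two-resolution objective can be sketched as follows. This is an illustration under stated assumptions rather than the authors' formulation: the text does not recover which of the two MSE terms the 0.0001 factor multiplies, so the sketch applies it to the full-resolution term, and it downsamples the ground truth by summing 4×4 blocks so that the total count is preserved. The names `sum_pool` and `two_resolution_loss` are hypothetical.

```python
import numpy as np

def sum_pool(density, factor=4):
    """Downsample a density map by summing factor x factor blocks.

    Summing (rather than averaging) keeps the total person count unchanged.
    ASSUMPTION: the paper does not state how the ground truth is downsampled.
    """
    h, w = density.shape
    assert h % factor == 0 and w % factor == 0
    return density.reshape(h // factor, factor, w // factor, factor).sum(axis=(1, 3))

def mse(pred, target):
    """Pixel-wise mean squared error."""
    return float(np.mean((pred - target) ** 2))

def two_resolution_loss(pred_low, pred_full, gt_full, weight=1e-4):
    """Combine the low-resolution and full-resolution MSE terms.

    ASSUMPTION: `weight` scales the full-resolution (density) term; the
    source only states that a factor of 0.0001 is used.
    """
    loss_count = mse(pred_low, sum_pool(gt_full, 4))
    loss_density = mse(pred_full, gt_full)
    return loss_count + weight * loss_density, loss_count, loss_density

# Toy example: one person on an 8x8 map, perfect full-resolution prediction,
# empty low-resolution prediction.
gt_full = np.zeros((8, 8))
gt_full[3, 5] = 1.0
total, loss_count, loss_density = two_resolution_loss(
    pred_low=np.zeros((2, 2)), pred_full=gt_full.copy(), gt_full=gt_full
)
print(total, loss_count, loss_density)
```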
5 Results and Discussion
In this section, MRCNet is evaluated on the DLR-ACD and ShanghaiTech datasets. In the training, the Adam optimization algorithm with a learning rate of
was employed and the batch sizes were empirically set to 60 and 40 for the DLR-ACD and ShanghaiTech datasets, respectively. In addition, apart from the VGG’s parameters, all network parameters were randomly initialized by a Gaussian distribution with a zero mean and a standard deviation of 0.01.
5.1 Evaluation Metrics
Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) are the metrics that have been widely used for evaluating crowd counting performance. However, these metrics treat all images equally without considering the differences between their true counts. This could be problematic for datasets with large variations in the image counts. For example, in the DLR-ACD dataset, the images with the fewest and the most annotations contain 285 and 24,368 people, respectively. If the prediction for each image has a count error of 1K, MAE and RMSE consider them as equal errors for both images. Nevertheless, an error of 1K is about 350% of the true count of the first image, whereas for the second image it is only about 4%. Thus, in order to represent model performance on each image in single-image crowd counting scenarios, evaluation metrics should also consider differences between images, such as in their counts. In this work, for analyzing model counting performance on the DLR-ACD dataset, in addition to the conventional metrics, we use the Mean Normalized Absolute Error (MNAE), which evaluates the predictions for each image differently by considering the image’s true count. Assuming $c_i$ and $\hat{c}_i$ as the ground-truth and predicted counts for image $i$ ($i = 1, \dots, N$), respectively, the metrics are computed as:
$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |\hat{c}_i - c_i|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (\hat{c}_i - c_i)^2}, \qquad \mathrm{MNAE} = \frac{1}{N}\sum_{i=1}^{N} \frac{|\hat{c}_i - c_i|}{c_i}.$$
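The counting metrics discussed above can be sketched in a few lines of Python. MAE and RMSE follow their standard definitions; the MNAE form used here (absolute error divided by the true count, averaged over images) is an assumption based on the description of the metric. The function name `counting_metrics` is hypothetical, and the example reuses the two extreme image counts from the text with a count error of 1,000 each.

```python
import numpy as np

def counting_metrics(true_counts, pred_counts):
    """Return MAE, RMSE, and MNAE for per-image person counts."""
    true_counts = np.asarray(true_counts, dtype=float)
    pred_counts = np.asarray(pred_counts, dtype=float)
    abs_err = np.abs(pred_counts - true_counts)
    mae = float(abs_err.mean())
    rmse = float(np.sqrt((abs_err ** 2).mean()))
    # ASSUMPTION: MNAE normalizes each error by the image's true count,
    # which requires every true count to be greater than zero.
    mnae = float((abs_err / true_counts).mean())
    return {"MAE": mae, "RMSE": rmse, "MNAE": mnae}

# Both images are over-counted by 1,000 people.
metrics = counting_metrics([285, 24368], [1285, 25368])
print(metrics)
```

In this example MAE and RMSE are both 1,000 and cannot tell the two images apart, whereas the per-image normalized errors (about 3.5 and 0.04) differ by two orders of magnitude, which is the effect MNAE is meant to capture.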
For person detection, people locations are extracted by detecting local maxima in the predicted high-resolution density maps. For MRCNet, the number of people (or maxima) to be extracted is given by the person count estimated from the low-resolution prediction (i.e. the output of the earlier stage of the decoder). To evaluate the detections, we use the standard precision, recall, and F1-score. A detection is counted as a true positive if it lies within half a meter of a ground-truth person location, which is quite a strict requirement. Here, we assume that the GSD (in m/pixel) of the aerial image is known at test time.
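A minimal sketch of this detection and evaluation procedure is given below. Several details are assumptions, since the text does not specify them: local maxima are found with a strict 8-neighbour comparison, detections are matched to ground-truth points greedily and one-to-one, and the GSD is given in meters per pixel. The function names are hypothetical.

```python
import numpy as np

def extract_peaks(density, num_people):
    """Return (row, col) of the num_people strongest local maxima."""
    h, w = density.shape
    padded = np.pad(density.astype(float), 1, mode="constant",
                    constant_values=-np.inf)
    is_peak = np.ones((h, w), dtype=bool)
    # A pixel is a peak if it is strictly larger than all 8 neighbours.
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            neighbor = padded[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
            is_peak &= density > neighbor
    rows, cols = np.nonzero(is_peak)
    order = np.argsort(density[rows, cols])[::-1][:num_people]
    return np.stack([rows[order], cols[order]], axis=1)

def detection_scores(detections, ground_truth, gsd_m, max_dist_m=0.5):
    """Precision, recall, and F1 with greedy one-to-one matching (assumed)."""
    detections = np.asarray(detections, dtype=float).reshape(-1, 2)
    ground_truth = np.asarray(ground_truth, dtype=float).reshape(-1, 2)
    unmatched = list(range(len(ground_truth)))
    tp = 0
    for det in detections:
        if not unmatched:
            break
        # Pixel distances are converted to meters using the GSD.
        dists = np.linalg.norm(ground_truth[unmatched] - det, axis=1) * gsd_m
        best = int(np.argmin(dists))
        if dists[best] <= max_dist_m:
            tp += 1
            unmatched.pop(best)
    precision = tp / len(detections) if len(detections) else 0.0
    recall = tp / len(ground_truth) if len(ground_truth) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Toy density map with three peaks; the estimated count is 2.
density = np.zeros((10, 10))
density[2, 2], density[7, 5], density[4, 8] = 1.0, 0.8, 0.3
peaks = extract_peaks(density, num_people=2)
precision, recall, f1 = detection_scores(
    peaks, [(2, 3), (7, 5), (0, 0)], gsd_m=0.1
)
print(peaks.tolist(), precision, recall, f1)
```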
5.2 Experiments on the DLR-ACD dataset
In order to generate the ground-truth density maps, we adopt the standard approach of smoothing the person locations with a small 2D Gaussian [Zhang2016], a procedure aimed at avoiding the imbalance between the number of positive and negative samples (one pixel on each person versus all other image pixels). As the spatial resolution (GSD) of our aerial images is known and the distortions caused by the homography between the ground and image planes are negligible, the Gaussian smoothing is adapted according to the GSDs of the images. To this end, assuming that the area covered by each person (seen from above) is roughly a square of 0.5×0.5 m, the standard deviation $\sigma$ (in pixels) of the Gaussian is computed from the GSD (in meters) such that each person lies within 3$\sigma$ of the Gaussian distribution. Since the Gaussian kernels are normalized (each kernel sums to one), a sum over the ground-truth density map gives the total person count in the image.
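The ground-truth generation can be sketched as follows. The exact formula for the standard deviation is not recoverable from the text, so this sketch assumes that 3σ equals the 0.5 m person footprint expressed in pixels; the renormalization of kernels clipped at the image border is a further assumption made so that the map still sums to the person count. All function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(sigma):
    """Return a normalized 2D Gaussian kernel truncated at 3 sigma."""
    radius = max(1, int(np.ceil(3 * sigma)))
    ax = np.arange(-radius, radius + 1)
    yy, xx = np.meshgrid(ax, ax, indexing="ij")
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return kernel / kernel.sum()

def density_map(points, shape, gsd_m, person_size_m=0.5):
    """Build a density map from integer (row, col) person annotations."""
    # ASSUMPTION: 3 * sigma equals the person footprint in pixels.
    sigma = person_size_m / (3.0 * gsd_m)
    kernel = gaussian_kernel(sigma)
    radius = kernel.shape[0] // 2
    h, w = shape
    density = np.zeros(shape, dtype=float)
    for row, col in points:
        # Clip the kernel window to the image bounds.
        r0, r1 = max(0, row - radius), min(h, row + radius + 1)
        c0, c1 = max(0, col - radius), min(w, col + radius + 1)
        patch = kernel[r0 - row + radius:r1 - row + radius,
                       c0 - col + radius:c1 - col + radius]
        # ASSUMPTION: clipped kernels are renormalized to keep unit mass.
        density[r0:r1, c0:c1] += patch / patch.sum()
    return density

# Three annotated people on a 64x64 tile with a GSD of 10 cm/pixel.
points = [(0, 0), (20, 30), (63, 63)]
gt = density_map(points, (64, 64), gsd_m=0.10)
print(gt.sum())
```

Since every kernel contributes unit mass, summing the resulting map recovers the number of annotated people, which is the property the counting loss relies on.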
For training, the images were tiled evenly into patches of 320×320 pixels with 50% overlap, which resulted in 11,908 patches. Then, from each patch, two samples of size 256×256 were randomly cropped, where one sample was used as it was and the other was randomly augmented. For the augmentation, three rotations (90°, 180°, and 270°), two flips (left-right and up-down), and two scalings (up- and downsampling) were considered on a random basis.
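The tiling and augmentation steps can be sketched as below. How the image borders are handled during tiling is not stated, so the sketch assumes that the last tile is aligned with the border; the scaling augmentation is omitted because it would require interpolating the image and rescaling the density map to preserve the count. The function names and the probabilities of 0.5 for each flip are hypothetical.

```python
import numpy as np

def tile_origins(length, tile=320, overlap=0.5):
    """Top-left coordinates of tiles along one image axis."""
    stride = int(tile * (1 - overlap))
    if length <= tile:
        return [0]
    origins = list(range(0, length - tile + 1, stride))
    # ASSUMPTION: a final tile is aligned to the border to cover the remainder.
    if origins[-1] != length - tile:
        origins.append(length - tile)
    return origins

def random_crop(image, density, size, rng):
    """Crop the same random size x size window from image and density map."""
    h, w = density.shape
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return (image[top:top + size, left:left + size],
            density[top:top + size, left:left + size])

def random_rot_flip(image, density, rng):
    """Apply the same random 90-degree rotation and flips to both arrays."""
    k = int(rng.integers(0, 4))  # number of 90-degree rotations
    image = np.rot90(image, k, axes=(0, 1))
    density = np.rot90(density, k, axes=(0, 1))
    if rng.random() < 0.5:  # left-right flip
        image, density = image[:, ::-1], density[:, ::-1]
    if rng.random() < 0.5:  # up-down flip
        image, density = image[::-1], density[::-1]
    return image, density

rng = np.random.default_rng(0)
patch = rng.random((320, 320, 3))
patch_density = rng.random((320, 320))
crop_img, crop_den = random_crop(patch, patch_density, 256, rng)
aug_img, aug_den = random_rot_flip(crop_img, crop_den, rng)
print(tile_origins(640), aug_img.shape, aug_den.shape)
```

Rotations by multiples of 90° and flips only permute pixels, so the sum of the density map, and hence the person count of the sample, is unchanged by this augmentation.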
As the qualitative results in Figure 4 show, MRCNet performs well in estimating high-resolution crowd maps in both dense and sparse crowd scenarios. However, it misses some people due to background clutter. Furthermore, the quantitative results in Table 3 demonstrate that MRCNet outperforms the other methods, with a better count estimation (lowest MNAE) and a higher quality of the estimated density maps for detection tasks (highest F1-score). In addition, with 20.3 M parameters, MRCNet has an average inference time of 0.03 ms per image patch of 256×256 pixels.
|Liu et al||833.3||0.25||1085.9||0.45||0.44||0.44|
5.3 Experiments on the ShanghaiTech dataset
We also trained and validated MRCNet on the ShanghaiTech crowd dataset [Zhang2016], which is one of the most widely used crowd benchmarks. This dataset is composed of two parts: Part-A contains 482 images (300 training and 182 test images) and Part-B contains 716 images (400 training and 316 test images). Statistics about this dataset can be seen in Table 2 and Figure 2. In order to generate the ground-truth density maps, we followed the approach proposed in [Zhang2016]. In order to avoid over-fitting, we randomly cropped 20 patches of size 224×224 from each training image. Then, as data augmentation, we applied left-right flipping to 30% of the patches on a random basis. In addition, the images were converted to grayscale.
Table 4 shows crowd counting performance of MRCNet compared to the state of the art on the ShanghaiTech dataset. MRCNet outperforms all other methods on Part-A by achieving the lowest MAE and RMSE values, and achieves competitive results on Part-B.
|Switching CNN [Sam2017]||1:4||90.4||135.0||21.6||33.4|
|Liu et al [Liu2018b]||1:4||67.6||110.6||10.1||18.8|
This work proposed the Multi-Resolution Crowd Network (MRCNet), a convolutional neural network for accurate crowd counting and density map estimation in aerial and ground imagery. MRCNet considers crowd counting and density map estimation as two interrelated tasks addressed at different resolutions. In addition, a novel aerial crowd dataset, the so-called DLR-ACD, was introduced, which promotes crowd monitoring and management from aerial imagery. The superior performance of MRCNet on the DLR-ACD and ShanghaiTech (a ground imagery benchmark) datasets was shown through quantitative and qualitative results. Furthermore, the results demonstrated that the estimated crowd maps can also be used for person detection thanks to their high resolution and accuracy.