DecideNet: Counting Varying Density Crowds Through Attention Guided Detection and Density Estimation

12/18/2017 ∙ by Jiang Liu, et al. ∙ Xi'an Jiaotong University ∙ Carnegie Mellon University

In real-world crowd counting applications, the crowd densities vary greatly in spatial and temporal domains. A detection based counting method will estimate crowds accurately in low density scenes, while its reliability in congested areas is downgraded. A regression based approach, on the other hand, captures the general density information in crowded regions. Without knowing the location of each person, it tends to overestimate the count in low density areas. Thus, exclusively using either one of them is not sufficient to handle all kinds of scenes with varying densities. To address this issue, a novel end-to-end crowd counting framework, named DecideNet (DEteCtIon and Density Estimation Network) is proposed. It can adaptively decide the appropriate counting mode for different locations on the image based on its real density conditions. DecideNet starts with estimating the crowd density by generating detection and regression based density maps separately. To capture inevitable variation in densities, it incorporates an attention module, meant to adaptively assess the reliability of the two types of estimations. The final crowd counts are obtained with the guidance of the attention module to adopt suitable estimations from the two kinds of density maps. Experimental results show that our method achieves state-of-the-art performance on three challenging crowd counting datasets.




1 Introduction

The crowd counting task in the computer vision community aims at obtaining the number of individuals appearing in specific scenes. It is an essential building block for high-level crowd analysis, including crowd monitoring, scene understanding [41] and public safety management [5].

Figure 1: Ablation studies of detection and regression based crowd counting on the ShanghaiTech PartB (SHB) dataset [42]. Detection reliability decreases as the crowd density increases, resulting in underestimated counts in those areas. Counts from density estimation tend to be overestimated in scenes with low densities. (a) Visualization of the detection results on an image from a Faster R-CNN [25] detector. (b) The density map on the same image from a CNN regression network [28]. (c) The median object detection scores from the detector used in (a) versus the ground-truth counts. (d) The predictions from the network used in (b) versus the true crowd counts.

Various methods have been proposed to tackle this problem. They can generally be classified into detection based and regression based approaches. The detection based methods [8, 18, 34, 10, 43, 13] employ object detectors to localize each person. The number of detections is then treated as the crowd count. Early works [8, 34, 10] employ low-level features as region descriptors, followed by a classifier. Benefiting from the recent progress in object detection using deep neural networks [14, 25, 24, 12], in ideal images with relatively large individual sizes and sparse crowd densities, detection based counting could surpass human performance. Different from crowd counting by detection, regression based methods [19, 23, 40, 28, 42, 2] obtain the crowd count without explicitly detecting and localizing each individual. Preliminary works directly learn the mapping from features of image patches to crowd counts [19, 23, 40]. Recent regression based works improve the performance with Convolutional Neural Networks (CNNs) [2, 28, 42, 21, 22] that output density maps of image patches. Integrating over the map gives the count for the patch. Regression based methods usually work well in crowded patches since they can capture the general density information by benefiting from the rich context in local patches.

In real-world counting applications, the crowd density varies enormously in spatial and temporal domains. Spatially, even within the same image, the density in some regions may be much higher than in others. In some background regions, there may be no person present at all. Meanwhile, it is also natural for the crowd volume to change over time: a business street may have very high crowd volumes on workdays and much lower volumes on weekends. This raises a question: can crowd counting based exclusively on either regression or detection handle high and low density scenes at the same time?

To answer this question, we study the performance of the two types of approaches on the ShanghaiTech PartB (SHB) dataset [42], collected from real street scenes with great variation in crowd densities. The result is illustrated in Figure 1. Figure 1(a) gives the detections from a fine-tuned Faster R-CNN head detector on a specific image: as the distance to the camera increases, both the crowd density and the number of missed detections rise. Figure 1(c) shows the relationship between median detection scores and ground-truth counts for 10,000 fixed-size image patches. It is clear that the score drops rapidly as the ground-truth count rises. The reliability of detection based counting, reflected by the detection score, is therefore highly correlated with the crowd density. In scenes with sparse crowds, the estimations are reliable, and the detection scores are higher than those of congested scenes. In crowded scenes, on the other hand, the object sizes tend to be very small. Detection in these scenes is less reliable, leading to low detection scores and recall rates. Consequently, the predicted crowd counts are underestimated, while regression based counting methods can perform better on these occasions. Figure 1(b) visualizes the crowd density map on the same image as in (a), output by a 5-layer CNN based regression network with a structure similar to the one employed in [28]. We find that the estimations in remote congested areas are quite reasonable. However, in background regions near the camera viewpoint, there are false alarm hot spots on the pavement. The relationship between ground-truth counts and corresponding predictions is plotted in Figure 1(d). Note that the prediction dots for patches with lower ground-truth counts are mostly above the dashed line. This indicates that predicted counts in these scenes are mostly larger than the ground-truth. Hence, because regression based approaches are not aware of the location of each individual, directly applying them to low density scenes may lead to overestimated results.

Based on the above ablation analysis, the detection and regression based counting approaches have different strengths at different crowd densities. The regression based method is preferred for congested scenes. Without localization information for each person, however, it tends to overestimate counts in low density scenes. The detection based approach can localize and count each person precisely on these occasions, since they are the expected settings for object detectors. However, its reliability degrades in crowded scenes due to small target sizes and occlusion.

Therefore, a conventional crowd counting method that relies only on either detection or regression is limited when handling real scenes with unavoidable density variations. An ideal counting method, on the other hand, should be able to adaptively choose the appropriate counting mode according to the crowd density: in low density scenes, it is expected to count by localizing, as an object detector does; whereas in congested scenes, it should behave in a regression manner. Motivated by this understanding, we propose a novel crowd counting framework named DecideNet (DEteCtIon and Density Estimation Network), as shown in Figure 2. To the best of our knowledge, DecideNet is the first framework capable of perceiving the crowd density for each pixel in a scene and adaptively deciding the relative weights for detection and regression based estimations.

In detail, for a given scene, DecideNet first estimates two kinds of crowd density maps by detecting individuals and regressing pixel-wise densities, respectively. To capture the subtle variation in crowd densities, an attention module, QualityNet, is proposed to assess the reliability of the two types of density maps with the additional supervision of detection scores. The final count is obtained under guidance from QualityNet, which allocates adaptive attention weights to the two density maps. Parameters in our proposed DecideNet are end-to-end learnable by minimizing a joint loss function.

In summary, we make the following contributions:

  • We find that real-world crowd counting scenarios frequently involve great density variations, while existing estimation methods, which rely exclusively on either detection or regression, are unable to provide precise estimations along the whole density range.

  • Based on the complementary property of two types of crowd counting methods, we design a novel framework DecideNet, which can capture this variation and estimate optimal counts by assigning adaptive weights for both detection and regression based estimations.

  • Experimental results reveal that our method achieves state-of-the-art performance on public datasets with varying crowd densities.

2 Related works

Crowd counting by detection.

Early works addressing the crowd counting problem mainly follow the counting by detection framework. Region proposal generators [9, 33] are first used to propose potential regions that include persons. Low-level features [8, 18, 27, 34] are then used for feature representation. Different binary classifiers including Naive Bayes, Random Forest [23] and their variations [40, 11] are trained with these features. The crowd count is the number of positive samples output by the classifier on a test image. Global detection scores are employed to estimate crowd densities and utilized for object tracking in [26]. Recent approaches seek an end-to-end crowd counting solution with CNN based object detectors [14, 25, 24, 7] and greatly improve the counting accuracy. Though detection based crowd counting is successful for scenes with low crowd density, its performance in highly congested environments is still problematic. On these occasions, usually only parts of the objects are visible, posing a great challenge to object detectors for localization. Therefore, part and shape based models are introduced in [10, 20, 39], where ensembles of classifiers are built for specific body parts and regions. Although these methods mitigate the issue to some degree, counting in evidently crowded scenes remains challenging, since objects in those areas are too small to be detected.

Crowd counting by regression.

Different from counting by detection, counting by regression estimates crowd counts without knowing the location of each person. Preliminary works employ edge and texture features such as HOG and LBP to learn the mapping from image patterns to corresponding crowd counts [19, 23, 40]. Multi-source information is utilized in [15] to regress the crowd counts in extremely dense crowd images. An end-to-end CNN model adapted from AlexNet is constructed in [36] for counting in extremely crowded scenes. Later, instead of directly regressing the count, the spatial information of crowds is taken into consideration by regressing the CNN feature maps as crowd density maps [41]. Observing that the densities and appearances of image patches vary widely, a multi-column CNN architecture is developed for density map regression [42]. Three CNN columns with different receptive fields are explicitly constructed for counting crowds with robustness to density and appearance changes. Similar frameworks are also developed in [22], where a Hydra-CNN architecture is designed to estimate crowd densities in a variety of scenes. Better performance can be obtained by further exploiting switching structures [31, 28, 17] or contextual correlations using LSTM [29]. Though counting by regression is reliable in crowded settings, without object location information its predictions for low density crowds tend to be overestimated. The soundness of this kind of method relies on the statistical stability of the data, while in such scenarios the number of instances is too small to reveal the underlying statistical regularity.

Figure 2: The architecture of our proposed DecideNet. Image patches are sent to the RegNet and DetNet blocks to estimate two types of density maps. The final density map is output by the QualityNet, which adaptively decides the attention weight between the two density maps for each pixel. The three blocks are jointly learned on the training data.

3 Crowd Counting by DecideNet

3.1 Problem formulation

Our solution formulates the crowd counting task as a density map estimation problem. It requires $N$ training images $I_1, I_2, \dots, I_N$ as inputs. For a specific image $I_i$, a collection of $c_i$ 2D points $P^{gt}_i = \{P_1, P_2, \dots, P_{c_i}\}$ is provided by the dataset [6, 41, 42], indicating the ground-truth head positions in the image $I_i$. The ground-truth crowd density map $D^{gt}_i$ of $I_i$ is generated by convolving the annotated points with a Gaussian kernel $\mathcal{N}^{gt}(p \mid \mu, \sigma^2)$ [22]. Therefore, the density at a specific pixel $p$ of $I_i$ is obtained by summing the contributions of all the Gaussian functions centered at the annotation points, i.e.,

$$D^{gt}_i(p \mid I_i) = \sum_{P \in P^{gt}_i} \mathcal{N}^{gt}(p \mid \mu = P, \sigma^2). \tag{1}$$

Summing the density values of all pixels over the entire image $I_i$ gives the total person count $c_i$ of $I_i$: $\sum_{p \in I_i} D^{gt}_i(p \mid I_i) = c_i$. For a counting model parameterized by $\Theta$, the objective is to learn a non-linear mapping from $I_i$ to a predicted density map $D^{final}_i(p \mid \Theta, I_i)$ such that its difference from the ground-truth $D^{gt}_i(p \mid I_i)$ is minimized.
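The construction of a ground-truth density map from head annotations can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `gaussian_density_map`, the value `sigma=4.0` and the toy head positions are hypothetical, and each Gaussian is normalized over the full image grid (an assumption made here) so that the map sums to the number of annotated heads.

```python
import numpy as np

def gaussian_density_map(points, height, width, sigma=4.0):
    """Sum one 2D Gaussian per annotated head position (x, y).

    Each Gaussian is normalized over the image grid, so the whole map
    sums to the number of points. sigma is a hypothetical value.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    density = np.zeros((height, width), dtype=np.float64)
    for px, py in points:
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma ** 2))
        density += g / g.sum()
    return density

# Three hypothetical head annotations in a 96 x 128 patch.
heads = [(20, 30), (25, 32), (70, 60)]
gt_map = gaussian_density_map(heads, height=96, width=128)
print(round(float(gt_map.sum()), 6))  # 3.0, the number of annotated heads
```

Summing the map recovers the person count, which is the property that lets a density regressor be evaluated by its integral.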

Traditional density estimation methods for crowd counting regress density maps by minimizing the pixel-wise Euclidean loss to the ground-truth [36, 2, 42, 22]. However, as analyzed in the introduction, counting purely by regression results in overestimation in scenes with low crowd densities. Conversely, counting by detection works comparably better in those scenes, since low crowd density is the expected setting for an object detector.

In practical applications, the crowd density varies both spatially and temporally. Hence, deciding the crowd counts exclusively based on either regression or detection is insufficient. DecideNet exploits the complementary properties of the two to address this problem. As shown in Figure 2, instead of counting people either by merely regressing density maps or by applying an object detector over the whole image, DecideNet simultaneously estimates crowd counts with both detection and regression modules. An attention block is then used to decide which estimation result should be adopted for a specific pixel. Three CNN blocks are included in our framework: the RegNet, the DetNet and the QualityNet, parameterized by $\Omega_{reg}$, $\Omega_{det}$ and $\Omega_{qua}$, respectively. The parameters of the three CNN blocks can be jointly learned on the training set.

3.2 The RegNet block

Figure 3: The RegNet block consisting of 5 fully convolutional layers. It outputs the crowd density map of each pixel in image patches without predicting the head locations.

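The RegNet block counts crowds without localizing each individual. Without knowing the specific location of each head in the input image patch $I_i$, it directly estimates the crowd density for all the pixels $p$ in $I_i$ with a fully convolutional network $f_{reg}$ parameterized by $\Omega_{reg}$:

$$D^{reg}_i(p \mid \Omega_{reg}, I_i) = f_{reg}(I_i \mid \Omega_{reg})(p). \tag{2}$$
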
As shown in Figure 3, the RegNet block consists of 5 convolutional layers. Because it is designed to capture the general crowd density information, filters with larger receptive fields grasp more contextual details, which benefits the modeling of density maps. Therefore, our implemented RegNet block uses large kernels in its early layers: the “conv1” layer has 20 filters and is followed by the “conv2” layer. In order to capture scale and orientation invariant person density features, the “conv1” and “conv2” layers are each followed by a max-pooling layer. The “conv3” and “conv4” layers have 20 and 10 filters, respectively. Since the density estimation result can be viewed as a CNN feature map with only one channel, we add a “conv5” layer with only one filter. This layer returns the regression based crowd density map $D^{reg}_i$, in which the value at each pixel represents the estimated count at that point. A ReLU unit is applied after the “conv5” layer, ensuring that the output density map does not contain negative values.

3.3 The DetNet block

Figure 4: The proposed DetNet block is built upon the Faster R-CNN network. A Gaussian convolutional layer is plugged in after the bounding box outputs to generate the detection based crowd density map $D^{det}_i$.

To handle varying perspectives, crowd densities and appearances, existing density estimation methods [41, 42, 28, 17] consist of several CNN structures like the RegNet block. However, without prior knowledge about the exact position of each person in the image patches, the network decides the crowd density purely from the raw image pixels. This regression methodology may be accurate in image patches with relatively large crowd densities, while it tends to overestimate the crowd counts in sparse or even “no-person” (background) scenes. In our proposed DecideNet architecture, the DetNet is designed to address this issue by generating the “location aware” detection based density map $D^{det}_i$. The motivation is intuitive and simple: sparse and non-crowded image patches are the expected settings for present CNN based object detectors. Therefore, compared to using regression networks to count on these patches, using the prior knowledge from the outputs of object detectors should substantially relieve the overestimation problem.

The DetNet block, illustrated in Figure 4, is built on the above assumption. It can be viewed as an extension of the Faster R-CNN network [14] for head detection on the basis of the ResNet-101 architecture. Specifically, we design a Gaussian convolutional layer and plug it in after the bounding box outputs of the original Faster R-CNN network. The Gaussian convolutional layer employs a constant Gaussian function $\mathcal{N}^{det}(p \mid \mu, \sigma^2)$ to convolve over the centers $P^{det}_i$ of the detected bounding boxes on the original image patch. The detection based density map $D^{det}_i$ is obtained by this layer, i.e.,

$$D^{det}_i(p \mid \Omega_{det}, I_i) = \sum_{P \in P^{det}_i} \mathcal{N}^{det}(p \mid \mu = P, \sigma^2). \tag{3}$$

Since the pixel values of $D^{det}_i$ are obtained by considering the impact of the points in the detection output $P^{det}_i$, $D^{det}_i$ is a “location aware” density map. Compared to the regression based map $D^{reg}_i$ output by the RegNet, the responses of $D^{det}_i$ are more concentrated on specific head locations. The difference between them is obvious when comparing the density map in Figure 3 with the one in Figure 4.
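The idea of turning detector outputs into a location aware density map can be sketched as follows. This is an illustration under stated assumptions rather than the DetNet implementation: the boxes and scores are made up, `sigma` and the `min_score` threshold are hypothetical (the text does not state whether low-score detections are discarded), and the Gaussian is evaluated directly on the pixel grid instead of through a convolutional layer.

```python
import numpy as np

def detection_density_map(boxes, scores, height, width, sigma=4.0, min_score=0.5):
    """Place one normalized Gaussian at the center of each kept box.

    boxes are (x1, y1, x2, y2) tuples. min_score is a hypothetical
    threshold; detections scoring below it are ignored.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    density = np.zeros((height, width), dtype=np.float64)
    for (x1, y1, x2, y2), score in zip(boxes, scores):
        if score < min_score:
            continue
        cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0  # bounding box center
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
        density += g / g.sum()
    return density

# Hypothetical head detections: two confident boxes and one weak box.
boxes = [(10, 10, 20, 22), (40, 30, 52, 44), (60, 5, 66, 12)]
scores = [0.9, 0.8, 0.3]
det_map = detection_density_map(boxes, scores, height=64, width=80)
print(round(float(det_map.sum()), 6))  # 2.0, one unit of mass per kept detection
```

Because all mass sits around detected centers, background pixels far from any box stay close to zero, which is the behavior contrasted with the regression map above.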

3.4 Quality-aware density estimation

Figure 5: The QualityNet block: stacking the two density maps and the original image as input, it outputs a probabilistic attention map $K_i$. The final density estimation $D^{final}_i$ is jointly determined by $D^{det}_i$, $D^{reg}_i$ and $K_i$.

Thus far, we have described how to obtain two kinds of density maps, $D^{det}_i$ and $D^{reg}_i$, for a given image $I_i$. The detection based map $D^{det}_i$ employs object detection results for density estimation. Therefore, it can count persons precisely in sparse scenes by localizing their head positions. However, counting via $D^{det}_i$ is not accurate in crowded scenes, due to the low detection confidence resulting from small object sizes and occlusion. In contrast, the regression based map $D^{reg}_i$, which is unaware of individual locations, is the preferred estimation for these scenes: the fully convolutional network is capable of capturing rich contextual crowd density information. Intuitively, one may think that fusing $D^{det}_i$ and $D^{reg}_i$ by average or max pooling [37] would give better results on crowds of varying density. Nevertheless, even in the same scene, the density may differ significantly across regions or time intervals. Therefore, the relative importance of $D^{det}_i$ and $D^{reg}_i$ also changes from pixel to pixel in $I_i$. In DecideNet, we propose an attention block, QualityNet, shown in Figure 5, to model the selection of the optimal counting estimation. It captures the different importance weights of the two density maps by dynamically assessing their quality at each pixel.

For a given $I_i$, the QualityNet block first upsamples $D^{reg}_i$ and $D^{det}_i$ to the same size as $I_i$. Then $I_i$, $D^{reg}_i$ and $D^{det}_i$ are stacked together as the QualityNet input with 5 channels. Four fully convolutional layers followed by a pixel-wise sigmoid layer output a probabilistic attention map $K_i(p \mid \Omega_{qua}, I_i)$. The value of $K_i$ at pixel $p$ reflects the importance of the detection based density map $D^{det}_i$ compared to its regression counterpart $D^{reg}_i$. As a result, the QualityNet block can decide the relative reliability (i.e., the quality) of $D^{det}_i$ and $D^{reg}_i$. A higher $K_i(p)$ means that we should rely more on the detection based estimate than on the regression based estimate at pixel $p$. Hence, we define the final density map estimation $D^{final}_i$ as a weighted sum of the two density maps $D^{det}_i$ and $D^{reg}_i$, guided by the attention map $K_i$:

$$D^{final}_i = K_i \odot D^{det}_i + (J - K_i) \odot D^{reg}_i, \tag{4}$$

where $\odot$ is the Hadamard product of two matrices and $J$ is an all-one matrix with the same size as $K_i$.
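The attention guided fusion is a per-pixel convex combination of the two density maps, which can be sketched in a few lines of NumPy. The function name `fuse_density_maps` and the 2 x 2 toy values are hypothetical; in DecideNet the attention map comes from the QualityNet, whereas here it is simply given.

```python
import numpy as np

def fuse_density_maps(d_det, d_reg, attention):
    """Per-pixel weighted sum: attention weights the detection map,
    (1 - attention) weights the regression map."""
    if not (d_det.shape == d_reg.shape == attention.shape):
        raise ValueError("all three maps must share one shape")
    if attention.min() < 0.0 or attention.max() > 1.0:
        raise ValueError("attention values must lie in [0, 1]")
    return attention * d_det + (1.0 - attention) * d_reg

# Toy maps: the top row is sparse (trust detection), the bottom row is
# crowded (trust regression).
d_det = np.array([[1.0, 0.0], [0.0, 0.0]])
d_reg = np.array([[0.6, 0.2], [0.5, 0.7]])
k = np.array([[0.9, 0.9], [0.1, 0.1]])

d_final = fuse_density_maps(d_det, d_reg, k)
print(d_final)                         # approx. [[0.96 0.02] [0.45 0.63]]
print(round(float(d_final.sum()), 2))  # 2.06, the fused count estimate
```

With an attention value near 1 the regression false alarm at the top-right pixel is largely suppressed, while the crowded bottom row keeps most of the regression mass.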

4 Model Learning

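Parameters of DecideNet consist of three parts: $\Omega_{reg}$, $\Omega_{det}$ and $\Omega_{qua}$. Hence, we formulate the training process as a multi-task learning problem. The overall loss function $\mathcal{L}_{decide}$ is given by Eq. (5):

$$\mathcal{L}_{decide} = \mathcal{L}_{reg} + \mathcal{L}_{det} + \mathcal{L}_{qua}, \tag{5}$$

where $\mathcal{L}_{reg}$, $\mathcal{L}_{det}$ and $\mathcal{L}_{qua}$ are the losses for RegNet, DetNet and QualityNet, respectively.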

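$\mathcal{L}_{decide}$ can be optimized via Stochastic Gradient Descent with annotated training data. In each iteration, gradients for $\Omega_{reg}$, $\Omega_{det}$ and $\Omega_{qua}$ are alternately calculated and used to update the corresponding parameters. Specifically, for the RegNet component, we employ the pixel-wise mean square error as the loss function. That is:

$$\mathcal{L}_{reg} = \frac{1}{N} \sum_{i=1}^{N} \sum_{p \in I_i} \left[ D^{reg}_i(p \mid \Omega_{reg}, I_i) - D^{gt}_i(p \mid I_i) \right]^2, \tag{6}$$

where $N$ is the total number of training images.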

For the DetNet block, different from its regression counterpart, the responses on the density map $D^{det}_i$ mostly concentrate on the detected head centers. Directly minimizing the difference between $D^{det}_i$ and $D^{gt}_i$ involves an overwhelming number of negative pixel samples, i.e., background pixels without head detections. Hence, instead of using the pixel-wise Euclidean loss as the error measurement, we employ the bounding boxes as supervision. In this way, optimizing $\mathcal{L}_{det}$ is equivalent to minimizing the classification error $\mathcal{L}_{cls}$ and the localization error $\mathcal{L}_{loc}$ of the original Faster R-CNN [25]:

$$\mathcal{L}_{det} = \mathcal{L}_{cls} + \mathcal{L}_{loc}. \tag{7}$$

Because only the centers of individuals’ heads are provided as annotations in crowd density estimation datasets, we manually label bounding boxes for a subset of the annotated points in the training set. We then employ their average width and height for the bounding box supervision in Eq. (7).

The loss function for the attention module QualityNet should measure two kinds of errors. One is the difference between the final crowd density map $D^{final}_i$ and the ground-truth density map $D^{gt}_i$. This error is similar to the one we have defined in $\mathcal{L}_{reg}$. The second error measures the quality of the probabilistic map $K_i$ output by QualityNet. Recall that $K_i(p)$ is the confidence of how reliable the detection result is at pixel $p$ in the image $I_i$. As we analyzed in Figure 1(c), this confidence is reflected by the object detection score at $p$. Therefore, we employ the Euclidean distance between the probabilistic attention map $K_i$ and the object detection score map $S^{det}_i$ as the second error component in $\mathcal{L}_{qua}$. From another perspective, this error can be considered a regularization term over the QualityNet parameters $\Omega_{qua}$ that incorporates detection scores as prior information. In the experimental evaluation, we show that this regularization clearly benefits the performance of our proposed architecture. Since the object detection qualities are brought into this loss function, we name it the “quality-aware” loss. The final formulation of this loss is defined as follows:

$$\mathcal{L}_{qua} = \frac{1}{N} \sum_{i=1}^{N} \sum_{p \in I_i} \left\{ \left[ D^{final}_i(p) - D^{gt}_i(p) \right]^2 + \lambda \left[ K_i(p) - S^{det}_i(p) \right]^2 \right\}, \tag{8}$$

where $\lambda$ is the hyper-parameter that balances the importance of the two errors.
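The two-term structure of the quality-aware loss can be made concrete with a small NumPy sketch. It assumes the loss is the squared density error plus a weighted squared difference between the attention map and the detection score map, averaged over images; the function name and the weight `lam=0.5` are hypothetical, since the value used by the authors is not given in this text.

```python
import numpy as np

def quality_aware_loss(d_final, d_gt, attention, score_map, lam=0.5):
    """Density error plus lam times the attention/score-map error.

    All inputs have shape (N, H, W); the result is averaged over the
    N images. lam is a hypothetical balancing weight.
    """
    n = d_final.shape[0]
    density_term = np.sum((d_final - d_gt) ** 2) / n
    prior_term = np.sum((attention - score_map) ** 2) / n
    return density_term + lam * prior_term

# One toy 2 x 2 training example.
d_final = np.array([[[0.5, 0.0], [0.0, 0.5]]])
d_gt = np.array([[[1.0, 0.0], [0.0, 0.0]]])
attention = np.array([[[0.8, 0.1], [0.1, 0.3]]])
score_map = np.array([[[0.9, 0.1], [0.1, 0.1]]])

loss = quality_aware_loss(d_final, d_gt, attention, score_map)
print(round(float(loss), 3))  # 0.525 = 0.5 density error + 0.5 * 0.05 prior error
```

Setting `lam=0` recovers a plain density regression loss on the fused map, which corresponds to training the attention module without the detection score prior.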

5 Experimental Results

5.1 Evaluation settings

Our proposed method is evaluated on three major crowd counting datasets [6, 41, 42] collected from real-world surveillance cameras. For all datasets, DecideNet is optimized for 40k iterations. We set the initial learning rate to 0.005 and halve it every 10k steps. The best model is then selected on the validation data. Instead of sending the whole image to DecideNet during training, we follow the strategy used in [28, 2, 22] and crop images into patches. In this way, the number of samples for training the regression network is boosted. Each patch is then augmented by random vertical and horizontal flipping. We also randomly add bounded uniform noise to each pixel in the patch for data augmentation. To optimize the parameters of the RegNet and the QualityNet, the ground-truth density maps are obtained by applying the Gaussian kernel with a window size of 15. In each iteration, the object detection score map $S^{det}_i$ is acquired by evaluating $I_i$ on the DetNet. Each pixel inside a detected bounding box is filled with the corresponding detection score. The remaining pixels, which are not included in any bounding box, are filled with a default value of 0.1. The score map is downsampled to match the attention map in order to calculate the “quality-aware” loss $\mathcal{L}_{qua}$. We follow the convention of existing works [32, 41, 23] and use the mean absolute error (MAE) and mean squared error (MSE) as the evaluation metrics. The MAE metric reveals the accuracy of the algorithm for crowd estimation, while the MSE metric indicates the robustness of the estimation.
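Three pieces of the evaluation setup lend themselves to a short sketch: the step-wise learning rate schedule, the detection score map with its default fill value of 0.1, and the two evaluation metrics. The function names, the integer box format and the rule that later boxes overwrite earlier ones on overlap are assumptions made for illustration. Note also that some crowd counting papers report the square root of the squared error under the name MSE; the sketch computes the plain mean squared error as worded above.

```python
import numpy as np

def learning_rate(step, base_lr=0.005, halve_every=10_000):
    """Initial rate 0.005, cut by half after every 10k steps."""
    return base_lr * 0.5 ** (step // halve_every)

def detection_score_map(boxes, scores, height, width, default=0.1):
    """Fill pixels inside each box (x1, y1, x2, y2) with its score.

    Pixels outside every box keep the default value. Overlaps are
    resolved by letting later boxes overwrite earlier ones (assumption).
    """
    score_map = np.full((height, width), default, dtype=np.float64)
    for (x1, y1, x2, y2), score in zip(boxes, scores):
        score_map[y1:y2, x1:x2] = score
    return score_map

def mae_mse(pred_counts, true_counts):
    """Mean absolute error and mean squared error over test images."""
    err = np.asarray(pred_counts, dtype=float) - np.asarray(true_counts, dtype=float)
    return float(np.mean(np.abs(err))), float(np.mean(err ** 2))

print(learning_rate(25_000))                        # 0.00125
s_map = detection_score_map([(1, 1, 3, 3)], [0.9], height=4, width=4)
print(s_map[0, 0], s_map[1, 1])                     # 0.1 0.9
mae, mse = mae_mse([10, 22, 31], [12, 20, 30])
print(round(mae, 3), round(mse, 3))                 # 1.667 3.0
```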

5.2 The Mall dataset

The Mall dataset [6] contains 2000 frames collected in a shopping mall. All frames share one fixed resolution. We follow the pre-defined settings and use the first 800 frames as the training set and the remaining 1200 frames as the test set. The validation set consists of 100 images randomly selected from the training set. We compare our DecideNet with both detection based approaches: SquareChn Detector [1], R-FCN [7], Faster R-CNN [25]; and regression based approaches: Count Forest [23], Exemplary Density [38], Boosting CNN [35], MoCNN [17], Weighted VLAD [30]. The evaluation results are shown in Table 1.

| Method | MAE | MSE |
| --- | --- | --- |
| SquareChn Detector [1] | 20.55 | 439.1 |
| R-FCN [7] | 6.02 | 5.46 |
| Faster R-CNN [25] | 5.91 | 6.60 |
| Count Forest [23] | 4.40 | 2.40 |
| Exemplary Density [38] | 1.82 | 2.74 |
| Boosting CNN [35] | 2.01 | N/A |
| MoCNN [17] | 2.75 | 13.40 |
| Weighted VLAD [30] | 2.41 | 9.12 |
| DecideNet | 1.52 | 1.90 |

Table 1: Comparison results of different methods on the Mall dataset. The MAE and MSE of our proposed DecideNet are lower than those of the other approaches.

From Table 1, we can observe that the detection based approaches [1, 7, 25] generally perform worse than their regression counterparts. Even the most recent CNN based object detectors [7, 25] still have a large performance gap to the CNN based regression approaches [35, 17, 30]. Our proposed DecideNet obtains the minimum error on both the MAE and MSE metrics. Compared to the regression based “Boosting CNN” (MAE 2.01), DecideNet shows a 0.49 point improvement in MAE; compared to the lowest baseline MAE in Table 1 (1.82, “Exemplary Density”), the improvement is 0.30 points. This is achieved without using the ensemble scheme employed by the “MoCNN” and “Boosting CNN” methods. Moreover, the MSE of DecideNet is merely 1.90, lower than that of the other state-of-the-art methods, which use either a detection or a regression approach. This gain plausibly results from our density estimations being formed from both detection and regression results.

5.3 The ShanghaiTech PartB dataset

| Method | MAE | MSE |
| --- | --- | --- |
| R-FCN [7] | 52.35 | 70.12 |
| Faster R-CNN [25] | 44.51 | 53.22 |
| Cross-scene [41] | 32.00 | 49.80 |
| M-CNN [42] | 26.40 | 41.30 |
| FCN [21] | 23.76 | 33.12 |
| Switching-CNN [28] | 21.60 | 33.40 |
| CP-CNN [31] | 20.1 | 30.1 |
| DecideNet | 21.53 | 31.98 |
| DecideNet+R3 | 20.75 | 29.42 |

Table 2: Comparison results of different methods on the ShanghaiTech PartB dataset.

We also perform evaluation experiments on the ShanghaiTech PartB (SHB) [42] crowd counting dataset, which is among the largest datasets captured in real outdoor scenes. It consists of 716 images taken from business streets in Shanghai, of which 400 form the pre-defined training set and the rest form the test set. Compared to the Mall dataset, it covers very diverse scene and perspective types with greatly changing crowd densities. We use 50 randomly selected images from the training set for validation. Since all images share one resolution, fixed-size patches are cropped from the original images during training. Our evaluation result and the comparison to other state-of-the-art methods are shown in Table 2. Due to the large variation in density and object size on the SHB dataset, the detection based approaches [7, 25] perform worse than the others relying on regression. Specifically, the ensemble and fusion strategy is employed by M-CNN [42], Switching-CNN [28] and CP-CNN [31] in Table 2. Compared to the Mall dataset, the challenging SHB dataset leads to much higher MAE and MSE for all methods. Even so, our proposed method (DecideNet; DecideNet+R3, which is trained with an additional R3 stream from Switching-CNN) is very competitive with existing approaches.

5.4 The WorldExpo’10 dataset

The WorldExpo’10 dataset [41] includes 1132 annotated video sequences collected from 103 different scenes in the World Expo 2010 event. There are 3980 frames in total, with sizes normalized to a common resolution. Fixed-size patches are used for training. The training set consists of 3380 frames and the rest are used for testing. Since the Regions Of Interest (ROI) are provided for the test scenes (S1-S5), we follow the previous method [32] and only count persons within the ROI area. We use the same metric, namely MAE, suggested by the authors of [41] for evaluation. The results of our proposed approach on each test scene and the comparisons to other methods are listed in Table 3.

| Method | S1 | S2 | S3 | S4 | S5 | Ave |
| --- | --- | --- | --- | --- | --- | --- |
| Cross-scene [41] | 2.00 | 29.50 | 9.70 | 9.30 | 3.10 | 12.90 |
| M-CNN [42] | 3.40 | 20.60 | 12.90 | 13.00 | 8.10 | 11.60 |
| Local&Global [29] | 7.80 | 15.40 | 15.30 | 25.60 | 4.10 | 11.70 |
| CNN-pixel [16] | 2.90 | 18.60 | 14.10 | 24.60 | 6.90 | 13.40 |
| Switching-CNN [28] | 4.40 | 15.70 | 10.00 | 11.00 | 5.90 | 9.40 |
| DecideNet | 2.00 | 13.14 | 8.90 | 17.40 | 4.75 | 9.23 |

Table 3: Comparison results (MAE) of different methods on 5 scenes in the WorldExpo’10 dataset.

From Table 3, we can see that our proposed approach achieves an average MAE of 9.23 across all 5 scenes. This is the best performance among all compared methods, a 0.17 improvement over the second best “Switching-CNN” approach. The margin is modest because our error on S4 is somewhat higher. The reason may lie in the fact that people in S4 mainly gather in crowds at remote areas, posing a great challenge for the DetNet to output meaningful estimations. Therefore, the estimation on S4 mostly relies on the outputs of the RegNet, and without an ensemble regression structure, the RegNet alone may not be able to deliver superior counting precision. We can also see that the errors of different state-of-the-art methods vary considerably across the 5 scenes, revealing that different approaches have their own strengths on specific scenes. However, DecideNet obtains the minimum MAE on three scenes (S1, tied with Cross-scene, S2 and S3) when compared to other approaches. This indicates that DecideNet has good generalization ability and prediction robustness on different scenes.

| Method | MAE (Mall) | MAE (SHB) | MSE (Mall) | MSE (SHB) |
| --- | --- | --- | --- | --- |
| RegNet only | 3.37 | 42.85 | 4.22 | 63.63 |
| DetNet only | 4.50 | 44.90 | 5.60 | 73.18 |
| RegNet+DetNet (Late Fusion) | 3.93 | 38.63 | 4.96 | 65.27 |
| RegNet+DetNet+QualityNet | 1.83 | 24.93 | 2.27 | 41.86 |
| RegNet+DetNet+QualityNet (quality-aware loss) | 1.52 | 21.53 | 1.90 | 31.98 |

Table 4: Quantitative results of different DecideNet components on the Mall and SHB datasets.

5.5 Effects of different components in DecideNet

To analyze the effect of each component of the proposed DecideNet, we conduct ablation studies on the Mall and SHB datasets. The quantitative results are listed in Table 4, which shows several interesting observations. First, using the estimations exclusively from either the RegNet (“RegNet only”) or the DetNet (“DetNet only”) obtains only fair results on both datasets. The estimations from the RegNet have lower errors than their detection counterparts. This is possibly because most image regions have high crowd density in both datasets. Further, late fusion by averaging the two classes of density maps (“RegNet+DetNet (Late Fusion)”) improves the MAE over “RegNet only” and “DetNet only” on the SHB dataset. On the Mall dataset, however, it only achieves a mediocre result that lies between the two kinds of density estimations. This indicates that direct late fusion is not robust enough to obtain better results across all kinds of datasets. Second, with DecideNet, even when training without the object detection score regularization (“RegNet+DetNet+QualityNet”), we obtain significant decreases in MAE and MSE compared with the previous configurations. Compared to late fusion, it decreases the MAE by roughly half on the Mall dataset (3.93 to 1.83) and by about a third on the SHB dataset (38.63 to 24.93), revealing the power of the attention mechanism. Last but not least, adopting the “quality-aware” loss during training (“RegNet+DetNet+QualityNet (quality-aware loss)”) further reduces the MAE and MSE on both datasets. In particular, the MSE decreases from 41.86 to 31.98 on the SHB dataset, which shows that the loss can substantially increase the prediction stability on challenging datasets with great variations.
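The difference between late fusion and attention guided fusion can be made concrete with a small sketch. It assumes that the attention map takes values in [0, 1] and that the final map is the per-pixel convex combination of the two density maps; this form and the toy values are assumptions made for illustration, and the networks that would produce the maps are not modeled.

```python
def count(density_map):
    """A crowd count is the sum over all pixels of a density map."""
    return sum(sum(row) for row in density_map)


def late_fusion(det_map, reg_map):
    """Plain average of the detection and regression density maps."""
    return [
        [(d + r) / 2.0 for d, r in zip(d_row, r_row)]
        for d_row, r_row in zip(det_map, reg_map)
    ]


def attention_fusion(det_map, reg_map, attention):
    """Per-pixel weighting: attention 1 trusts detection, 0 trusts regression."""
    return [
        [a * d + (1.0 - a) * r for d, r, a in zip(d_row, r_row, a_row)]
        for d_row, r_row, a_row in zip(det_map, reg_map, attention)
    ]


# Toy 2x2 example (hypothetical values). Top-left: one nearby person, which
# the regression branch overestimates. Bottom-right: a congested patch,
# which the detection branch underestimates.
det = [[1.0, 0.0],
       [0.0, 0.2]]
reg = [[1.6, 0.0],
       [0.0, 3.0]]
attn = [[1.0, 1.0],
        [1.0, 0.0]]

late_count = count(late_fusion(det, reg))                 # 1.3 + 1.6
attended_count = count(attention_fusion(det, reg, attn))  # 1.0 + 3.0
```

In this toy case averaging inherits half of each branch's error in every region, whereas the per-pixel weighting keeps the detection value for the sparse pixel and the regression value for the congested one.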

Figure 6: Prediction and the ground-truth crowd counts on the test sets of the Mall (left) and SHB (right) datasets.

In Figure 6, we show the relationships between different crowd count predictions and the ground-truth crowd counts on the test sets of the two datasets. Note that the horizontal axes (“image id”) are sorted in ascending order of the ground-truth crowd counts. Clearly, when the ground-truth crowd counts are small, the regression based results from the RegNet overestimate the counts: the blue lines are above the ground-truth red lines in the first half of the horizontal axis in both figures. In contrast, the detection based result curves (the green lines) fit the red lines well in that region on both datasets. However, as the ground-truth counts increase, the estimations from the detection based density maps become considerably lower than the red lines, particularly in the second half of the horizontal axis. The blue lines fit the ground-truth lines best in the middle part of the horizontal axes. This verifies our observation that regression based estimations are more suitable for highly crowded patches. Directly applying late fusion (the purple curves) helps to a certain extent, but its predicted counts are not stable across all images. Finally, the cyan lines, which represent the DecideNet outputs, show the smallest differences to the ground-truth curves along all parts of the horizontal axes. That is, the DecideNet trained with the “quality-aware” loss exhibits the best estimation results for all kinds of crowd densities on both datasets.
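The analysis behind Figure 6 can be sketched as follows: images are sorted by ground-truth count, and the mean signed error (prediction minus ground truth) is computed separately for the lower and upper halves. A positive value indicates overestimation and a negative value underestimation. The counts below are hypothetical toy values, not results from the Mall or SHB datasets.

```python
def signed_error_by_half(gt_counts, pred_counts):
    """Mean signed error on the low-count and high-count halves.

    Assumes at least two images so that both halves are non-empty.
    """
    # Sort image pairs in ascending order of the ground-truth count.
    pairs = sorted(zip(gt_counts, pred_counts), key=lambda pair: pair[0])
    mid = len(pairs) // 2

    def mean_signed(chunk):
        return sum(pred - gt for gt, pred in chunk) / len(chunk)

    return mean_signed(pairs[:mid]), mean_signed(pairs[mid:])


# Hypothetical counts for four test images.
gt = [30, 5, 12, 45]
reg_pred = [31, 9, 15, 44]   # regression-like: too high on sparse images
det_pred = [22, 5, 12, 30]   # detection-like: too low on crowded images

reg_low, reg_high = signed_error_by_half(gt, reg_pred)
det_low, det_high = signed_error_by_half(gt, det_pred)
```

With these toy values the regression-like predictions have a positive error on the sparse half and almost none on the crowded half, while the detection-like predictions show the opposite pattern.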

5.6 Visualization on density maps

To better understand what is learned by our proposed model, we visualize three categories of crowd density maps on the SHB dataset from three blocks: RegNet, DetNet and QualityNet in Figure 7 (best viewed in color).

Figure 7: The visualization results of three types of density maps on the SHB dataset (best viewed in color).

We can observe that the regression based density maps (the leftmost column) exhibit diffused density estimations across the image regions. For remote areas with highly congested crowds, such predictions from the RegNet are reliable. However, in nearby regions with lower crowd density, the results are not satisfactory: some single person bodies are erroneously predicted with very high density. The predicted counts of the RegNet are also larger than the ground-truth (GT) counts, implying an overestimation issue. Compared to the regression based maps, the detection based density maps (the middle column) are very different: the predicted peak regions are concentrated on the centers of heads. This results from the fact that these maps are generated from the outputs of head detectors. We can further observe that the detection based density results are quite good in nearby low density regions of the given image, while not all heads are marked with high prediction peaks in the remote areas. The underestimated predicted counts of the DetNet also reflect this phenomenon. With the attention information from the QualityNet, the final density maps in the right column show very good characteristics: in the nearby regions, the estimation prefers the detection results. Persons in those areas share very similar estimation patterns with the detection based density maps. Conversely, in remote and congested regions, the density maps are diffused instead of showing the “concentrated dot” patterns. DecideNet considers the regression based results more reliable in those cases. This confirms that the QualityNet block is able to assess the reliability of the corresponding density map value at a specific pixel.
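The “concentrated dot” pattern of the detection based maps can be reproduced with a minimal sketch that places one normalized Gaussian kernel at each detected head center. The fixed kernel width, the truncation radius and the head positions below are assumptions made for illustration; the actual kernel settings are not specified in this section.

```python
import math


def detection_density_map(height, width, head_centers, sigma=1.0, radius=3):
    """Build a density map with one unit-mass Gaussian per head center.

    head_centers: list of (row, col) pixel positions inside the image.
    """
    density = [[0.0] * width for _ in range(height)]
    for cy, cx in head_centers:
        weights = {}
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                dist_sq = (y - cy) ** 2 + (x - cx) ** 2
                weights[(y, x)] = math.exp(-dist_sq / (2.0 * sigma ** 2))
        total = sum(weights.values())
        if total == 0.0:
            continue  # center far outside the image: nothing to add
        # Normalize so that each detected head contributes exactly 1.
        for (y, x), w in weights.items():
            density[y][x] += w / total
    return density


# Two hypothetical head detections on a 10x10 grid.
heads = [(2, 2), (6, 7)]
det_map = detection_density_map(10, 10, heads)
total_count = sum(sum(row) for row in det_map)
```

Because each kernel is normalized to unit mass, the map sums to the number of detections, so a missed head in a congested region directly lowers the count, which matches the underestimation described above.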

6 Conclusion

In this paper, a novel end-to-end crowd counting architecture named DecideNet has been proposed. It is motivated by the complementary performance of detection and regression based counting methods in situations with varying crowd densities. To the best of our knowledge, DecideNet is the first framework to estimate crowd counts by adaptively adopting detection and regression based count estimations under the guidance of an attention mechanism. We evaluate the framework on three challenging public crowd counting benchmarks collected from real-world scenes with high variation in crowd densities. Experimental results confirm that our method achieves state-of-the-art performance on all three datasets.

7 Acknowledgment

Jiang Liu and Alexander Hauptmann are supported by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) contract number D17PC00340. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of IARPA, DOI/IBC, or the U.S. Government. Chenqiang Gao and Deyu Meng are supported by the China NSFC projects No. 61571071, No. 61661166011 and No. 61721002.

References
  • [1] R. Benenson, M. Omran, J. H. Hosang, and B. Schiele. Ten years of pedestrian detection, what have we learned? CoRR, abs/1411.4304, 2014.
  • [2] L. Boominathan, S. S. Kruthiventi, and R. V. Babu. Crowdnet: a deep convolutional network for dense crowd counting. In ACM MM, 2016.
  • [3] A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos. Privacy preserving crowd monitoring: Counting people without people models or tracking. In CVPR, 2008.
  • [4] A. B. Chan and N. Vasconcelos. Bayesian poisson regression for crowd counting. In ICCV, 2009.
  • [5] J. Chen, J. Liang, H. Lu, S.-I. Yu, and A. Hauptmann. Videos from the 2013 boston marathon: An event reconstruction dataset for synchronization and localization. 2016.
  • [6] K. Chen, C. C. Loy, S. Gong, and T. Xiang. Feature mining for localised crowd counting. In BMVC, 2012.
  • [7] J. Dai, Y. Li, K. He, and J. Sun. R-fcn: Object detection via region-based fully convolutional networks. In NIPS, 2016.
  • [8] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, 2005.
  • [9] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. TPAMI, 2012.
  • [10] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. TPAMI, 2010.
  • [11] J. Gall, A. Yao, N. Razavi, L. Van Gool, and V. Lempitsky. Hough forests for object detection, tracking, and action recognition. TPAMI, 2011.
  • [12] C. Gao, P. Li, Y. Zhang, J. Liu, and L. Wang. People counting based on head detection combining adaboost and cnn in crowded surveillance environment. Neurocomputing, 2016.
  • [13] W. Ge and R. T. Collins. Marked point processes for crowd counting. In CVPR, 2009.
  • [14] R. Girshick. Fast r-cnn. In ICCV, 2015.
  • [15] H. Idrees, I. Saleemi, C. Seibert, and M. Shah. Multi-source multi-scale counting in extremely dense crowd images. In CVPR, 2013.
  • [16] D. Kang, Z. Ma, and A. B. Chan. Beyond counting: Comparisons of density maps for crowd analysis tasks-counting, detection, and tracking. arXiv preprint arXiv:1705.10118, 2017.
  • [17] S. Kumagai, K. Hotta, and T. Kurita. Mixture of counting cnns: Adaptive integration of cnns specialized to specific appearance for crowd counting. arXiv preprint arXiv:1703.09393, 2017.
  • [18] B. Leibe, E. Seemann, and B. Schiele. Pedestrian detection in crowded scenes. In CVPR, volume 1, pages 878–885. IEEE, 2005.
  • [19] V. Lempitsky and A. Zisserman. Learning to count objects in images. In NIPS, 2010.
  • [20] S.-F. Lin, J.-Y. Chen, and H.-X. Chao. Estimation of number of people in crowded scenes using perspective transformation. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 2001.
  • [21] M. Marsden, K. McGuiness, S. Little, and N. E. O’Connor. Fully convolutional crowd counting on highly congested scenes. arXiv preprint arXiv:1612.00220, 2016.
  • [22] D. Onoro-Rubio and R. J. López-Sastre. Towards perspective-free object counting with deep learning. In ECCV, 2016.
  • [23] V.-Q. Pham, T. Kozakaya, O. Yamaguchi, and R. Okada. Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation. In ICCV, 2015.
  • [24] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
  • [25] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • [26] M. Rodriguez, I. Laptev, J. Sivic, and J.-Y. Audibert. Density-aware person detection and tracking in crowds. In ICCV, 2011.
  • [27] P. Sabzmeydani and G. Mori. Detecting pedestrians by learning shapelet features. In CVPR, 2007.
  • [28] D. B. Sam, S. Surya, and R. V. Babu. Switching convolutional neural network for crowd counting. In CVPR, 2017.
  • [29] C. Shang, H. Ai, and B. Bai. End-to-end crowd counting via joint learning local and global count. In ICIP, 2016.
  • [30] B. Sheng, C. Shen, G. Lin, J. Li, W. Yang, and C. Sun. Crowd counting via weighted vlad on dense attribute feature maps. TCSVT, 2016.
  • [31] V. A. Sindagi and V. M. Patel. Generating high-quality crowd density maps using contextual pyramid cnns. In ICCV, 2017.
  • [32] V. A. Sindagi and V. M. Patel. A survey of recent advances in cnn-based single image crowd counting and density estimation. Pattern Recognition Letters, 2017.
  • [33] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders. Selective search for object recognition. IJCV, 2013.
  • [34] P. Viola, M. J. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. In ICCV, 2003.
  • [35] E. Walach and L. Wolf. Learning to count with cnn boosting. In ECCV, 2016.
  • [36] C. Wang, H. Zhang, L. Yang, S. Liu, and X. Cao. Deep people counting in extremely dense crowds. In ACM MM, 2015.
  • [37] L. Wang, C. Gao, J. Liu, and D. Meng. A novel learning-based frame pooling method for event detection. Signal Processing, 2017.
  • [38] Y. Wang and Y. Zou. Fast visual object counting via example-based density estimation. In ICIP, 2016.
  • [39] B. Wu and R. Nevatia. Detection and tracking of multiple, partially occluded humans by bayesian combination of edgelet based part detectors. IJCV, 2007.
  • [40] B. Xu and G. Qiu. Crowd density estimation based on rich features and random projection forest. In WACV, 2016.
  • [41] C. Zhang, H. Li, X. Wang, and X. Yang. Cross-scene crowd counting via deep convolutional neural networks. In ICCV, 2015.
  • [42] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma. Single-image crowd counting via multi-column convolutional neural network. In CVPR, 2016.
  • [43] T. Zhao, R. Nevatia, and B. Wu. Segmentation and tracking of multiple humans in crowded environments. TPAMI, 2008.