I Introduction
Counting people in crowded scenes is a highly challenging and important problem in computer vision and video surveillance. According to public statistics¹, several serious human stampedes occurred in 2017, causing many deaths and thousands of injuries. Such tragedies could be prevented if we could estimate the crowd density and take preventive measures in time.

¹ https://en.wikipedia.org/wiki/List_of_human_stampedes
Precisely detecting or recognizing each person is intractable in crowded scenes due to the heavy occlusions and cluttered visual patterns, as shown in Figure 1. Therefore, most existing methods [28, 11, 26, 25, 22, 2, 21, 24, 17, 27]
choose to estimate a density map, whose integral is equal to the number of people in a given image. With the recent breakthroughs of deep learning in vision-related tasks, most cutting-edge counting methods are based on the Convolutional Neural Network (CNN), e.g., MCNN [28] and Hydra [17]. However, our study reveals an interesting phenomenon: in the current “going deeper” era, the network architectures in most counting methods are rather shallow. For example, MCNN only uses a few convolutional layers with a limited number of filters. The same situation exists even in many multi-subnet networks [2, 21, 22]; e.g., the core component of SwitchCNN [2]
for generating density maps is exactly the same as in MCNN. Yet this is not simply because of hardware limitations; e.g., the auxiliary component in SwitchCNN, a density-level classifier, uses the very deep VGG16
[19] structure. In fact, learning a Deep Neural Network (DNN) to estimate density maps is not so straightforward and can even be frustrating. In both [17] and [24], the difficulty of optimizing their own DNN is mentioned. Besides, earlier research in [3] also points out that learning a DNN with the ℓ2 loss, which is the major objective in most density map based methods, is vulnerable to outliers.
Based on the above considerations, we conduct an in-depth study to dig out the main reason behind the disadvantages of learning DNNs for density map estimation. After delving into the learning process of existing methods, we attribute the main reason to the inhomogeneous density distribution in crowd images. By inhomogeneous density distribution, we refer to the various intra-image and inter-image density levels. As we will see in later sections, it not only introduces outliers, but also results in other problems in the optimization process, such as the dying ReLU phenomenon.
Fortunately, with a comprehensive analysis of the inhomogeneous density distribution problem, we are able to find an intuitive and natural solution for it: we simply transform our learning labels from 2D density maps to structured (3D) density maps. This solution sufficiently reduces the effects of outliers, and with a few modifications to the VGG16 network, we successfully train a single Density-Aware Network (DAN), which obtains performance comparable to that of much more complicated multi-subnet networks.
In summary, this paper focuses on the inhomogeneous density distribution problem in the high-density crowd counting task. The major contributions of this paper are twofold: (a) We present a detailed analysis of how the inhomogeneous density distribution affects density map based methods. We also point out the reasons for several abnormal phenomena in the optimization process, and provide solutions for them. We hope this can motivate other researchers facing similar problems in their optimization processes. (b) We propose the structured inhomogeneous density learning method as a practical solution for crowd counting, implemented by our DAN model. Extensive experiments on several datasets show that our simple method obtains state-of-the-art performance.
II Related Work
Earlier research on crowd counting mainly aims at low-density scenes, and can be roughly divided into detection-based methods [9, 7] and regression-based methods [6, 15, 5, 4, 1]. In recent years, as high-density datasets [12, 27, 28] have been built, density map based methods, which were first introduced in [14], have gained more attention. Due to the page limitation, we focus on density map based methods.
Most of the state-of-the-art density map based methods are implemented with CNNs and their variants; a comprehensive survey of CNN-based counting models can be found in [20]. Based on the network architecture, we divide existing methods into (a) single-column networks, (b) multi-column networks and (c) multi-subnet networks, as shown in Figure 4:
Single-Column Networks: we consider networks [17, 25] with a straight structure similar to the Fully Convolutional Networks [18] to be single-column networks. A typical example of this kind is the Counting CNN (CCNN, [17]), which is simply trained by minimizing the ℓ2 loss between its outputs and the ground-truth density maps. In [25], a CNN is combined with an LSTM to capture spatial and temporal dependencies. The advantage of this kind of networks is the simplicity of their architectures.
Multi-Column Networks: multi-column networks, or multi-scale networks, are introduced to tackle the scale variance and distorted perspective map problems [28, 17, 24, 26]. This type of network usually uses multiple columns with the same function to capture features at different resolutions. For example, MCNN [28] uses three columns of filters of different sizes; Hydra [17] uses a pyramid of input patches; and in [26], a multi-scale blob with different kernel sizes is introduced. Multi-column networks are usually more robust than single-column networks, since they can integrate features at different resolutions. However, as we will point out later, multi-column networks are actually global models, so they will still be affected by the outliers in the training data.

Multi-Subnet Networks: multi-subnet networks [22, 11, 2, 21, 24], or multi-task networks, consist of multiple subnets, and the main difference from multi-column networks is that each subnet has its own objective function. For example, SwitchCNN [2] is composed of three local regressors and one density-level classifier. The three regressors are trained to generate their own density maps, while the density-level classifier is trained to select the optimal density map. A more complicated example is the CP-CNN [22] model, which has a global-context classifier, a local-context classifier, a density map regressor, and a fusion network to combine these subnets. Indeed, among the three kinds of networks, multi-subnet networks obtain the most impressive performance. However, learning a multi-subnet network is difficult, since it requires individual tuning of each component.
III Crowd Counting with Density Maps
In this section, we first briefly overview density map based counting methods in Section III-A, and then present a detailed analysis of how the inhomogeneous density distribution problem affects existing methods in Section III-B, so that we can solve it accordingly in our method.
III-A Problem Formulation
Given an image I belonging to the image domain, the primary goal of this paper is to learn a function F to estimate the number of people in I, i.e., c = F(I). To this end, we are given a training set consisting of N images and their corresponding ground-truth head-based annotations, i.e., {(I_n, A_n)}, n = 1, …, N, where A_n ∈ {0, 1}^{H×W}, H and W denote the height and width of the image, respectively, and A_n(p) = 1 if the center of a target appears at position p, otherwise A_n(p) = 0. Yet the information provided by the count alone is limited; therefore we are also interested in inferring the density maps of the images. Specifically, given the annotation A_n, the corresponding ground-truth density map D_n is generated as follows:
D_n(p) = Σ_{P ∈ P_n} N(p; P, σ),  (1)
where P_n denotes the set of nonzero points in A_n, and N(p; P, σ) denotes a 2D normalized Gaussian function centered at point P with smoothing factor σ (for compactness of expression, we have omitted the spatial transformation process of D_n). The size of D_n depends on the specific feature representation, and we consider it as (H/s) × (W/s), where s is the scaling factor. Calculating the count based on the density map can be done easily by summing over the elements in D_n, i.e., c_n = Σ_p D_n(p). In this way, most density map based methods can be summarized as learning a regression model F by minimizing the following loss function:
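As a concrete illustration, the density map construction described above can be sketched in a few lines of NumPy. This is not the authors' code: the image size, the smoothing factor, the truncation radius, and the head positions below are illustrative choices.

```python
import numpy as np

def gaussian_density_map(points, height, width, sigma=4.0, radius=12):
    """Build a ground-truth density map by placing one normalized 2D
    Gaussian (summing to 1) at each annotated head position, so that the
    integral of the map equals the head count."""
    density = np.zeros((height, width), dtype=np.float64)
    for (y, x) in points:
        # Truncated Gaussian window around the annotation, clipped to the image.
        y0, y1 = max(0, y - radius), min(height, y + radius + 1)
        x0, x1 = max(0, x - radius), min(width, x + radius + 1)
        yy, xx = np.mgrid[y0:y1, x0:x1]
        kernel = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma ** 2))
        kernel /= kernel.sum()  # normalize so each head contributes exactly 1
        density[y0:y1, x0:x1] += kernel
    return density

heads = [(20, 30), (25, 32), (80, 90)]  # hypothetical head annotations (y, x)
dmap = gaussian_density_map(heads, 128, 128)
count = dmap.sum()  # summing the density map recovers the count
```

Because each truncated kernel is re-normalized after clipping, the sum of the map equals the number of annotated heads exactly, which is the property the count estimate relies on.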
L(Θ) = (1/2N) Σ_{n=1}^{N} ||F(I_n; Θ) − D_n||₂² + λ R(Θ),  (2)
where ||·||₂² denotes the ℓ2 loss, Θ denotes the model parameters, and R(Θ) can be any differentiable regularizer. As in other deep learning models, gradient descent or its variants can be adopted to minimize L(Θ).
III-B The Inhomogeneous Density Distribution
From the experimental results in previous research [28, 24, 2, 22], we observe three interesting phenomena: (a) Compared with global models, local models show superior performance. This is observed from the fact that the three columns of CNNs in SwitchCNN [2] and those in MCNN [28] are exactly the same, and the major difference between the two networks is that SwitchCNN selectively picks one column for prediction, while MCNN merges all three columns into a global regressor. (b) Compared with shallow networks, quite surprisingly, DNNs perform poorly, especially on datasets with various density levels [24]. (c) The mean absolute error (MAE) of current methods [28, 22] grows with the distance from the medium density level: if we divide images into groups based on their density levels, from extremely low to extremely high, and consider the group with the medium density level as the baseline, then groups far from the baseline suffer higher MAE than those near the baseline. All these phenomena indicate that blindly increasing the capacity of networks cannot solve our task. In order to increase the robustness of our solution, we must dig out the main reason behind the above phenomena, which, in fact, is the inhomogeneous density distribution in crowd images.
To understand how the inhomogeneous density distribution problem affects the performance of current methods, note that in the loss function defined in Eq.(2), the major objective is to minimize the ℓ2 loss term, which is easily affected by outliers. Without loss of generality, we take MCNN as an example. For the convenience of discussion, we only consider its last layer, which contains a single 1×1 filter group with ReLU as the activation function, and we assume the input of the last layer is noise-free. Let W and b denote the weights and bias of the filters, x denote the input of the last convolutional layer, and r be short for the residual between the predicted and the ground-truth density map. Based on the chain rule, we can calculate the partial derivatives of the ℓ2 loss term with respect to W and b as

∂ℓ/∂W = Σ_p r(p) · 1[(W∗x + b)(p) > 0] · x(p),  ∂ℓ/∂b = Σ_p r(p) · 1[(W∗x + b)(p) > 0],  (3)

where 1[·] is the indicator function. From Eq.(3) we can see that, for all positions p, the gradient is dominated by the residual r(p). Therefore, if outliers exist, e.g., if the ground-truth density map is affected by additive noise ε, then r(p) is equally affected by the bias ε, which consequently disrupts the optimization process.
But where do the outliers come from? Remember that the density map is generated by convolving the head annotation map with the 2D Gaussian kernel (Eq.(1)), the essential idea being that the sum of density values in the area of one head is equal to 1. Therefore, in the ideal situation, we should generate a high and compact Gaussian response for people far from the camera (high-density areas), and a low and flat response for those near the camera (low-density areas). However, due to the lack of annotation, current methods have to use a single “moderate” Gaussian template to generate the density map. In this way, for high-density / low-density areas, the density values of the real training template are lower / higher than those of the ideal one, and consequently the learned model will underestimate / overestimate the counts, which is exactly in accordance with the experimental results in [22].
With the inhomogeneous density distribution in mind, the aforementioned phenomena can now be explained well: (a) Since SwitchCNN selects the training images for each of its regressors individually, it actually removes the outliers to a certain extent. On the other hand, the geometry-adaptive kernel in MCNN does help to reduce the bias, but the outliers remain in the training process. (b) With deeper architectures, networks are more likely to overfit the outliers, degrading their generalization ability. (c) The real training template is quite different from the ideal response in areas with extreme density levels, therefore the trained models are more affected by the outliers in these areas.
Besides, by analyzing the gradient of the ℓ2 loss term, we can also infer that the following problems will occur if we train a deep neural network to minimize the ℓ2 loss without deliberate consideration:
Dying ReLU: For any r(p) > 0, i.e., when the network overestimates the density, if the value of r(p) or the learning rate for gradient descent is too large, then b will decrease dramatically, causing the activation value (W∗x + b)(p) to fall below 0 for a wide range of inputs. Since the gradient can only pass through a ReLU when the activation value is larger than 0, it is hard to update b in this case, and consequently the output of the ReLU will always be 0.
Exploding Gradients: Similar to Eq.(3), the gradient with respect to the input x also scales with the residual r, so if the value of r is too large, e.g., at the beginning of training, it may cause the gradient to explode. In fact, Oñoro-Rubio and López-Sastre [17] mention that their CCNN cannot converge with the Xavier initialization [10]. Although they did not explain the reason behind it, we can see clearly that it is caused by the residual term.
Saddle Points: The gradient is dominated by r, and all positions in r contribute equally. Hence, if there are a few elements of r with unbounded, extremely high values, they may cancel out the partial derivatives of the other elements in r, namely reaching a critical point. Furthermore, [8] points out that if the critical points have an error much larger than the global minimum, then they are exponentially likely to be saddle points and will slow down the learning process (for the relationship between critical points and saddle points, interested readers can refer to [8] for more details).
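The outlier sensitivity analyzed above can be demonstrated numerically. The snippet below is an illustrative NumPy sketch (the residual values are made up): under an ℓ2-style loss the per-position gradient equals the residual, so a single outlier dominates the update of shared parameters, whereas a bounded-gradient loss such as the Huber loss clips each contribution.

```python
import numpy as np

# Residuals at a handful of density-map positions: one outlier among
# small, well-behaved residuals (hypothetical values).
residuals = np.array([0.1, -0.2, 0.15, 50.0])

# Gradient of the elementwise L2 term 0.5 * r**2 is simply r, so the
# summed gradient used to update a shared bias is dominated by the outlier.
l2_grad_sum = residuals.sum()

# A bounded-gradient (Huber-style) loss clips each per-position gradient
# to +/- delta, so no single residual can dominate the update.
delta = 1.0
huber_grad = np.clip(residuals, -delta, delta)
huber_grad_sum = huber_grad.sum()
```

Here the ℓ2 gradient sum is driven almost entirely by the single large residual, while the clipped gradient sum stays on the same scale as the well-behaved positions.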
IV Density-Aware Network
Based on the aforementioned analysis of the inhomogeneous density distribution problem, we can see that the ways of tackling it are quite straightforward: we can either (a) minimize the difference between the ideal templates and the real training templates, or (b) reduce the effects of outliers. Or, as we will show in Section IV-A, we can combine and accomplish both goals simultaneously within a unified structured learning framework, and implement the framework easily via the proposed Density-Aware Network (DAN) in Section IV-C.
IV-A Separating Inhomogeneous Density Distribution as Structured Learning
Our solution for the inhomogeneous density distribution problem is intuitive and natural: we extend the 2D density maps to the structured density maps (3D), of which the last dimension implicitly indicates the density levels, and we use an individual Gaussian kernel on each density level.
Formally, our goal now is to learn a structured regressor whose output has K channels, where K denotes the number of predefined density levels. In order to do so, we need to convert the original head annotation A into a new structured label S over positions p and levels k. Here we propose a heuristic soft mapping rule to determine the value of S(p, k) for annotated positions p:

S(p, k) = (1/Z) exp(−(k − k*_p)² / (2β²)),  (4)

where Z is the normalization term to make sure Σ_k S(p, k) = 1, β is the empirical smoothing parameter, and k*_p denotes the index of the optimal density level for p. To determine k*_p, notice that the average distances between annotated points in high-density areas are usually much smaller than those in low-density areas. Therefore, we simply define a strictly increasing set of thresholds t_1 < t_2 < … < t_K, calculate the average distance from point p to its top-5 nearest annotated points, and take k*_p as the index of the minimum threshold that is larger than or equal to this average distance. Similar to Eq.(1), we generate the structured density map on level k by convolving S(·, k) with a normalized Gaussian filter with the empirical smoothing parameter σ_k.
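The level-assignment rule described above can be sketched as follows. This is a hypothetical NumPy implementation: the thresholds, the smoothing parameter, and the exact Gaussian form of the soft mapping are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def assign_density_levels(points, thresholds=(8.0, 16.0, 32.0, 1e9), beta=1.0):
    """For each annotated point, average the distances to its 5 nearest
    neighbours, pick the first threshold that is >= this average as the
    optimal level, then soft-assign the point over all levels with a
    normalized Gaussian on the level index."""
    pts = np.asarray(points, dtype=np.float64)
    n, k = len(pts), len(thresholds)
    soft = np.zeros((n, k))
    # Pairwise Euclidean distances between all annotated points.
    dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    for i in range(n):
        nn = np.sort(dists[i])[1:6]  # skip self, take up to 5 neighbours
        avg = nn.mean() if len(nn) else np.inf
        # Last threshold is a large sentinel, so argmax always finds a level.
        star = int(np.argmax(np.asarray(thresholds) >= avg))
        levels = np.arange(k)
        w = np.exp(-((levels - star) ** 2) / (2 * beta ** 2))
        soft[i] = w / w.sum()  # normalize so each row sums to 1
    return soft

# A tight cluster of heads plus one isolated head (hypothetical annotations).
pts = [(0, 0), (0, 3), (3, 0), (3, 3), (1, 1), (100, 100)]
soft = assign_density_levels(pts)
```

The isolated point has a large average neighbour distance and therefore peaks on the last (lowest-density) level, while clustered points peak on earlier levels; every row of the soft label sums to one, as Eq.(4) requires.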
Remark: Figure 4 shows a toy example of the generating process of the structured density maps. The benefits of learning such structured density maps are threefold: (a) Compared with the single Gaussian template solution, multiple templates are apparently less biased, and hence are more suitable for modeling the inhomogeneous density distribution. (b) Each level learns to generate its own density maps, which resembles a local model and thus helps to reduce the effects of outliers. (c) The outputs of all levels are generated simultaneously, which naturally serves as a regularizer, e.g., it is unlikely for a position to have high response values on both the high-density and the low-density levels.
IV-B Robust Regression
With the structured density maps, we are now ready to define our objective function. As we have pointed out in previous sections, directly learning the inhomogeneous density distribution with the ℓ2 loss causes many problems, e.g., sensitivity to outliers. Therefore, we propose to minimize the following cost-sensitive Huber loss:
L_H = (1/N) Σ_{n=1}^{N} w_n Σ_{k=1}^{K} H_δ(r_n^k),  (5)
where r_n^k denotes the residual on the k-th density level of the n-th image, and H_δ(·) is the element-wise Huber loss:

H_δ(r) = r²/2 if |r| ≤ δ, and δ(|r| − δ/2) otherwise,  (6)
and w_n is the weight defined as:
(7) 
where α, γ, and ε are hyperparameters, and ĉ_n and c_n denote the predicted and ground-truth counts, respectively. To see how optimizing L_H yields a more robust model, one should notice that ∂H_δ(r)/∂r = r if |r| ≤ δ, and δ·sgn(r) otherwise, so the Huber loss clips the gradient to at most δ no matter how large the bias is. This property reduces the effects of outliers and provides relief from the exploding gradient problem. Furthermore, the weight w_n is introduced to emphasize hard, high-residual examples. The term |ĉ_n − c_n| / (c_n + ε) is the relative absolute residual, and ε is used to prevent division by zero. It is easy to see that w_n is inversely proportional to c_n + ε, so that examples with high relative estimation error gain more attention in our learning process. The hyperparameter α controls the range of w_n, and γ adjusts the slope of the curve of w_n, as demonstrated in Figure 5. The Huber loss and the weight w_n together ensure that the gradient stays within a reasonable range, and hence reduce the difficulty of optimization.

IV-C One Network, Two Goals
In order to realize structured density map learning, we propose a simple yet effective Density-Aware Network (DAN) in this section. The intuition behind our design of DAN is straightforward: as to regression, since shallow neural networks can generate rather good density maps, DNNs shall be able to do so as well, thanks to their higher capacity; as to classification, DNNs are also capable of distinguishing various density levels, e.g., in cutting-edge multi-subnet networks [2, 22], the very deep VGG16 [19] is adopted as the density-level classifier. Therefore, we believe that a single deep architecture is sufficient for learning the structured density maps. Besides, classifying density levels and learning the density maps are highly related tasks, so using a single network lets them share visual cues more effectively.
The proposed DAN is a multi-branch network and its architecture is shown in Figure 3: the “trunk” of DAN, namely the shared blocks before splitting into multiple branches, stems from the VGG16 network. Specifically, we adopt the Conv1 to Conv5 blocks in VGG16 and retain only the first and the second pooling layers; in this way, the downscaling factor of DAN is 4. We attach K branches to the end of the “trunk”, and each branch consists of two convolutional blocks to generate its corresponding density maps. The specific number of branches and the sizes of filters in the branches can be adjusted based on the dataset, and we find that K = 4 branches with small filters perform well in most situations. Note that the whole DAN is fully convolutional; therefore it can handle images of arbitrary size.
Besides, in order to prevent the dying ReLU phenomenon, we also change all the activation functions in DAN to the leaky ReLU [16], which keeps a small non-zero slope for negative inputs. As to the weight initialization, the weights of the filters in Conv1 to Conv5 are initialized by the VGG16 model pretrained on ImageNet. For the filters in the branches, as we have analyzed above, a high initial value of W will easily cause the exploding gradient problem. Therefore, after generating W with the Xavier initialization method [10], we multiply W by a small constant factor. In our experiments, we find that this scaling can prevent gradient explosion effectively. Table I summarizes all major modifications in DAN, which can be implemented easily. The architecture of DAN is flexible and compatible with most current deep learning frameworks. Unlike other density map based methods, DAN can be trained in a totally end-to-end way, which makes it a considerable solution for the crowd counting problem.
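A minimal sketch of the two modifications above, leaky ReLU activations and scaled-down Xavier initialization, in NumPy. The scaling factor, the leaky slope, and the layer sizes are illustrative choices, not the paper's values.

```python
import numpy as np

rng = np.random.default_rng(0)

def scaled_xavier(fan_in, fan_out, eta=0.01):
    """Xavier (Glorot) uniform initialization, then scaled down by a small
    factor eta so the initial outputs (and hence the residual-driven
    gradients of the counting loss) stay small."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    w = rng.uniform(-limit, limit, size=(fan_out, fan_in))
    return eta * w

def leaky_relu(x, slope=0.01):
    """Leaky ReLU keeps a small gradient for negative inputs, so units can
    recover instead of 'dying'."""
    return np.where(x > 0, x, slope * x)

w = scaled_xavier(512, 64)
x = rng.standard_normal(512)
y = leaky_relu(w @ x)  # one branch layer: scaled init + leaky activation
```

Because the weights are shrunk by eta after sampling, the initial residuals are correspondingly small, which is the mechanism the text credits for avoiding gradient explosion at the start of training.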
Table I: Major modifications in DAN.

Architecture:  VGG16 → multi-branch network
Loss function:  ℓ2 loss → cost-sensitive Huber loss
Activation function:  ReLU → leaky ReLU
Output:  2D density map → structured (3D) density maps
Initialization:  Xavier → scaled-down Xavier
Table II: Statistics of the evaluation datasets. Num is the number of training : test images; Avg and Std denote the average and standard deviation of counts, respectively.

Dataset  Num  Avg  Std
UCF [12]  40 : 10  1279.48  960.13
WorldExpo [27]  3380 : 600  48.29  38.22
ShanghaiTech A [28]  300 : 182  500.29  456.56
ShanghaiTech B [28]  400 : 316  123.59  94.52
V Experiments
This section provides a detailed analysis of our experiments on several datasets [12, 27, 28]. We mainly focus on three aspects: (a) The comparison between the proposed structured inhomogeneous density learning method and other state-of-the-art methods (Section V-C), which shows that the proposed method is a simple yet considerable solution for crowd counting in high-density scenes. (b) The ablation study of the proposed Density-Aware Network (DAN) for implementing structured learning on ShanghaiTech Part A [28] (Section V-D), which validates our opinions on how the inhomogeneous density distribution affects the learning process. (c) Qualitative results of the proposed method (Section V-E), in which not only high-quality results but also the worst cases of the proposed method are demonstrated.
V-A Implementation Details
To minimize the cost-sensitive Huber loss L_H, we adopt the Adam optimizer [13] with a small learning rate. In each iteration, we select a training image and randomly crop a patch from it as the training example. We perform data augmentation by horizontally flipping images and adding Gaussian noise to images.
As to the hyperparameters, the smoothing parameter β for soft mapping, the density thresholds (a customary choice depends on the image sizes), the smoothing parameters σ_k for the Gaussian filters, and the threshold δ in the Huber loss are all set empirically.
Our implementation of DAN is in MATLAB with the MatConvNet framework [23]; the source code will be released in the future.
V-B Setup
To validate the effectiveness of our method, we perform extensive experiments on four publicly available datasets, including UCF [12], WorldExpo [27], and ShanghaiTech A and B [28]. The statistics of these datasets are summarized in Table II. The counts in our experiments cover a wide range, which reduces the possibility of overfitting. For the purpose of fair comparison, following most existing methods, we adopt the mean absolute error (MAE) and the mean squared error (MSE) as the evaluation metrics, defined as follows:
MAE = (1/N) Σ_{n=1}^{N} |ĉ_n − c_n|,  MSE = sqrt( (1/N) Σ_{n=1}^{N} (ĉ_n − c_n)² ).  (8)
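These metrics can be computed as below. Note that, following the convention of the crowd counting literature, “MSE” here is the root of the mean squared count error.

```python
import numpy as np

def mae_mse(pred_counts, gt_counts):
    """MAE and MSE as commonly defined in crowd counting: MAE is the mean
    absolute count error; MSE is the root of the mean squared count error."""
    pred = np.asarray(pred_counts, dtype=np.float64)
    gt = np.asarray(gt_counts, dtype=np.float64)
    mae = np.abs(pred - gt).mean()
    mse = np.sqrt(((pred - gt) ** 2).mean())
    return mae, mse

# Hypothetical predicted vs. ground-truth counts on three test images.
mae, mse = mae_mse([100, 210, 95], [110, 200, 100])
```

With the hypothetical counts above, the absolute errors are 10, 10, and 5, so MAE is their mean and MSE is the root mean of their squares.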
V-C Quantitative Analysis
ShanghaiTech: The first dataset we evaluate our method on is the ShanghaiTech dataset [28], which includes Part A and Part B. Part A includes 300 training images and 182 test images, while Part B includes 400 training images and 316 test images (see Table II). We compare the proposed DAN with deep learning based state-of-the-art methods on this dataset, including Zhang et al. [27], MCNN [28], Cascaded-MTL [21], SwitchCNN [2], Huang et al. [11], MSCNN [26] and CP-CNN [22].
From the experimental results demonstrated in Table III, we can see that the performance of DAN is impressive: it outperforms all state-of-the-art methods by a large margin on Part B, considering both MAE (13.2) and MSE (20.1). As to Part A, DAN only obtains the second best performance with respect to MAE. However, note that CP-CNN is a multi-subnet network with higher capacity than DAN; therefore we also compare DAN with the fused network consisting of the global context model (VGG16) and the density map regressor (MCNN) in CP-CNN (denoted as CP-CNN (G)), and find that DAN achieves better accuracy (MAE reduced by 8.1). Besides, SwitchCNN [2] also employs the VGG16 architecture to construct one of its subnets, and the proposed DAN outperforms SwitchCNN on both Part A and Part B. These results indicate that, with our method, a simple, end-to-end trained deep network can achieve state-of-the-art performance.
Table III: Comparison on the ShanghaiTech dataset.

Method  Part A MAE  Part A MSE  Part B MAE  Part B MSE
Zhang et al. [27]  181.8  277.7  32.0  49.8
MCNN [28]  110.2  173.2  26.4  41.3
Cascaded-MTL [21]  101.3  152.4  20.0  31.1
SwitchCNN [2]  90.4  135.0  21.6  33.4
Huang et al. [11]  –  –  20.2  35.6
MSCNN [26]  83.8  127.4  17.7  30.2
CP-CNN (G) [22]  89.9  127.9  –  –
CP-CNN [22]  73.6  106.4  20.1  30.1
DAN (ours)  81.8  134.7  13.2  20.1
Table IV: MAE on each scene of the WorldExpo dataset.

Method  Scene 1  Scene 2  Scene 3  Scene 4  Scene 5  Avg
Chen et al. [6]  2.1  55.9  9.6  11.3  3.4  16.5
Zhang et al. [27]  9.8  14.1  14.3  22.2  3.7  12.9
MCNN [28]  3.4  20.6  12.9  13.0  8.1  11.6
ConvLSTM [25]  7.1  15.2  15.2  13.9  3.5  10.9
Huang et al. [11]  4.1  21.7  11.9  11.0  3.5  10.5
SwitchCNN with perspective map [2]  4.2  14.9  14.2  18.7  4.3  11.2
SwitchCNN [2]  4.4  15.7  10.0  11.0  5.9  9.4
CP-CNN [22]  2.9  14.7  10.5  10.4  5.8  8.86
DAN (ours)  4.1  11.1  10.7  16.2  5.0  9.4
WorldExpo: The second dataset for evaluation is WorldExpo [27], which contains 600 test images from five scenes. We compare DAN with state-of-the-art methods on this dataset, including Chen et al. [6], Zhang et al. [27], MCNN [28], ConvLSTM [25], Huang et al. [11], SwitchCNN [2] and CP-CNN [22].
As demonstrated in Table IV, the performance of all methods is very close (average MAE from 8.86 to 16.5). The major reasons behind this are twofold: on the one hand, this dataset has the smallest standard deviation of counts (38.22) among all the datasets, so it is hard to distinguish the performance of different methods based on the MAE metric; on the other hand, this dataset provides ground-truth perspective maps for generating training templates, which obviously reduces the effects of outliers. Nevertheless, the proposed DAN still obtains the lowest MAE in Scene 2, and the second lowest average MAE across all scenes. Furthermore, we do not use the ground-truth perspective maps in the training process of DAN, which indicates that our method is robust to varying perspectives.
UCF: The UCF dataset [12] is the most difficult one, because it has only 50 images in total and the highest standard deviation of counts (960.13). Similar to other methods, we perform five-fold cross validation on this dataset. We compare DAN with state-of-the-art methods on UCF, including Lempitsky et al. [14], Idrees et al. [12], Zhang et al. [27], MCNN [28], Huang et al. [11], CCNN [17], MSCNN [26], CNN Boosted [24], Hydra 2s [17], Cascaded-MTL [21], SwitchCNN [2] and CP-CNN [22].
The experimental results are shown in Table V. From these results we can see that DAN achieves rather high accuracy: its MAE is 309.6, which is between that of SwitchCNN (318.1) and CP-CNN (295.8). This again validates our opinion that a single deep network with proper learning settings can reduce the effects of outliers significantly.
Table V: Comparison on the UCF dataset.

Method  MAE  MSE
Lempitsky et al. [14]  493.4  487.1
Idrees et al. [12]  419.5  541.6
Zhang et al. [27]  467.0  498.5
MCNN [28]  377.6  425.2
Huang et al. [11]  409.5  563.7
CCNN [17]  488.67  646.68
MSCNN [26]  363.7  468.4
CNN Boosted [24]  364.4  –
Hydra 2s [17]  333.73  425.26
Cascaded-MTL [21]  322.8  397.9
SwitchCNN [2]  318.1  439.2
CP-CNN [22]  295.8  320.9
ConvLSTM [25]  284.5  297.1
DAN (ours)  309.6  402.64
V-D Ablation Study
Remember that the major conclusion of this paper is that the inhomogeneous density distribution problem results in bias toward outliers and other difficulties in the learning process. To validate this, we conduct an ablation study on ShanghaiTech Part A. We compare the performance of DAN under different settings: training the VGG16 part of DAN with the ℓ2 loss (denoted as VGG16), DAN adopting the Huber loss with only one branch (denoted as DAN1H), and DAN with different numbers of branches (denoted as DAN1 to DAN4). All the variants of DAN are trained with the same settings.
Table VI: Ablation study on ShanghaiTech Part A.

Method  MAE
VGG16  –
DAN1H  87.0
DAN1  84.3
DAN2  83.0
DAN3  83.8
DAN4  81.8
The experimental results are shown in Table VI. Firstly, we tried to train the VGG16 variant with the ℓ2 loss several times, but unfortunately, all attempts ended with exploding gradients. This confirms our opinion that outliers can cause difficulties in the learning process. With the Huber loss and the other modifications in our network, we succeed in training the deep network (DAN1H), which also outperforms most current networks. Secondly, the weighting strategy in our loss function improves the performance significantly, as DAN1 performs better than DAN1H. Lastly, we test DAN with different numbers of branches, whose density thresholds are set to be distributed as equally as possible. We find that with more branches, the performance of DAN gets better. This can be understood as follows: with more branches, the differences between the training templates and the ideal templates are smaller, and more local models are ensembled; consequently the network is more robust.
V-E Qualitative Analysis
We demonstrate representative examples of our experimental results on the ShanghaiTech dataset for qualitative analysis in this section. Figures 6 and 7 show images with their ground-truth structured density maps and predicted density maps on Part A and Part B, respectively. As in [22], we also calculate the Peak Signal-to-Noise Ratio (PSNR) as the metric to evaluate the quality of our predicted density maps.
From these results we can see that the proposed method performs well in most situations, no matter whether for extremely high-density scenes (Part A) or less crowded scenes (Part B). Besides, the PSNR values of most of our predicted density maps are high, which indicates that DAN can generate high-quality density maps. More importantly, notice that our predicted structured density maps are clearly divided into levels, which validates our opinion that DAN can accomplish the regression task and the classification task at the same time.
Yet there are a few failure cases for the proposed method. To analyze them, we show the top-4 images ranked by MAE in Figures 6 and 7. We notice that a common characteristic of these images is that in their extremely high-density areas, due to the restricted resolution, even humans cannot precisely distinguish the targets. Especially for images in Part A, the sizes of the targets can be just a few pixels, e.g., the last row in Figure 6. Therefore, for these images, the resolution of our network is not high enough to handle them. Fortunately, such images are rare in all datasets, hence we still consider DAN a practical solution for the crowd counting problem.
VI Conclusion
In this paper, we focus on the crowd counting problem in extremely high-density scene images and conduct an in-depth analysis of the phenomenon that existing deep learning based methods do not work well when deeper networks are employed. Our comprehensive study reveals that this can be heavily attributed to the inhomogeneous density distribution problem, and a feasible solution is provided by extending the density map from 2D to 3D, with an extra dimension implicitly indicating the density level. Based on this, we also present a single Density-Aware Network that is simple and easy to train. Extensive experiments demonstrate that it achieves state-of-the-art performance on several challenging datasets.
References
 [1] C. Arteta, V. S. Lempitsky, J. A. Noble, and A. Zisserman. Interactive object counting. In Proceedings of European Conference on Computer Vision (ECCV), pages 504–518, 2014.
 [2] D. Babu Sam, S. Surya, and R. Venkatesh Babu. Switching convolutional neural network for crowd counting. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
 [3] V. Belagiannis, C. Rupprecht, G. Carneiro, and N. Navab. Robust optimization for deep regression. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pages 2830–2838, 2015.
 [4] A. B. Chan and N. Vasconcelos. Counting people with low-level features and Bayesian regression. IEEE Transactions on Image Processing, 21(4):2160–2177, 2012.
 [5] K. Chen, S. Gong, T. Xiang, and C. C. Loy. Cumulative attribute space for age and crowd density estimation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2467–2474, 2013.
 [6] K. Chen, C. C. Loy, S. Gong, and T. Xiang. Feature mining for localised crowd counting. In Proceedings of British Machine Vision Conference (BMVC), pages 1–11, 2012.
 [7] S. Chen, A. Fern, and S. Todorovic. Person count localization in videos from noisy foreground and detections. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1364–1372, 2015.
 [8] Y. N. Dauphin, R. Pascanu, Ç. Gülçehre, K. Cho, S. Ganguli, and Y. Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Annual Conference on Neural Information Processing Systems (NIPS), pages 2933–2941, 2014.
 [9] P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1627–1645, 2010.
 [10] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), pages 249–256, 2010.
 [11] S. Huang, X. Li, Z. Zhang, F. Wu, S. Gao, R. Ji, and J. Han. Body structure aware deep crowd counting. IEEE Transactions on Image Processing, 2017.
 [12] H. Idrees, I. Saleemi, C. Seibert, and M. Shah. Multi-source multi-scale counting in extremely dense crowd images. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2547–2554, 2013.
 [13] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
 [14] V. S. Lempitsky and A. Zisserman. Learning to count objects in images. In Annual Conference on Neural Information Processing Systems (NIPS), pages 1324–1332, 2010.
 [15] C. C. Loy, S. Gong, and T. Xiang. From semi-supervised to transfer counting of crowds. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2013.
 [16] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.
 [17] D. Oñoro-Rubio and R. J. López-Sastre. Towards perspective-free object counting with deep learning. In Proceedings of European Conference on Computer Vision (ECCV), pages 615–629, 2016.
 [18] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(4):640–651, 2017.
 [19] K. Simonyan and A. Zisserman. Very deep convolutional networks for largescale image recognition. CoRR, abs/1409.1556, 2014.
 [20] V. Sindagi and V. M. Patel. A survey of recent advances in CNN-based single image crowd counting and density estimation. Pattern Recognition Letters, 2017.
 [21] V. A. Sindagi and V. M. Patel. CNN-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 1–6, 2017.
 [22] V. A. Sindagi and V. M. Patel. Generating high-quality crowd density maps using contextual pyramid CNNs. In Proceedings of IEEE International Conference on Computer Vision (ICCV), 2017.
 [23] A. Vedaldi and K. Lenc. Matconvnet: Convolutional neural networks for MATLAB. In Proceedings of the Annual ACM Conference on Multimedia Conference (ACM MM), pages 689–692, 2015.
 [24] E. Walach and L. Wolf. Learning to count with CNN boosting. In Proceedings of European Conference on Computer Vision (ECCV), pages 660–676, 2016.
 [25] F. Xiong, X. Shi, and D.Y. Yeung. Spatiotemporal modeling for crowd counting in videos. In Proceedings of IEEE International Conference on Computer Vision (ICCV), pages 5151–5159, 2017.
 [26] L. Zeng, X. Xu, B. Cai, S. Qiu, and T. Zhang. Multi-scale convolutional neural networks for crowd counting. In Proceedings of IEEE International Conference on Image Processing (ICIP), pages 465–469, 2017.
 [27] C. Zhang, H. Li, X. Wang, and X. Yang. Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 833–841, 2015.
 [28] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 589–597, 2016.