Crowd counting aims to count the number of people in a single image by estimating the density distribution of the crowd. It is a useful computer vision technique that facilitates a variety of applications, including crowd control, disaster management and public safety monitoring. However, it is not a trivial task, owing to great challenges in real-world situations caused by cluttered backgrounds and non-uniform scales of people within an image.
Numerous algorithms [23, 12, 7] have been proposed in the literature for estimating the crowd density distribution. The majority of them focus on addressing two problems when learning the mapping from image features to density distribution maps: 1) how to eliminate the impact of cluttered backgrounds, and 2) how to deal with varying scales of people within an image. Figure 1 illustrates both problems. Specifically, in Figure 1(a), the right picture depicts the estimated density map of the left image, derived by the MCNN model. It can be noticed that background objects, e.g., umbrellas, can be mistakenly regarded as people on the density map, thus decreasing the estimation accuracy. Meanwhile, as illustrated in Figure 1(b), the sizes of human heads can vary greatly within an image because of their different distances from the camera.
To eliminate the noise caused by cluttered backgrounds, an attention mechanism is usually introduced to re-weigh features or regions according to their probabilities of belonging to the crowd. Generally, additional training samples and parameters are employed to train standalone classifiers that indicate density levels or head probabilities, which serve as the metric to evaluate the importance of different features/regions within an image; different weights are then assigned to the features/regions accordingly. However, standalone networks with complex structures usually require millions of extra to-be-learned parameters, which can be a heavy burden for real-life applications.
By exploring the relationship between images and their corresponding normalized shallow feature maps generated by several baselines [23, 1, 12] (Figure 2), we observe, for the first time, that backgrounds such as stairs, trees and buildings tend to have significantly different responses from those of human crowds. For example, in Figure 2 the backgrounds have stronger responses on some feature maps but weaker ones on others, whereas the responses of human crowds are exactly the opposite. This tells us that backgrounds and human crowds are more separable on shallow-layer feature maps, so an attention model based on shallow features has the potential to generate more accurate attention maps. Therefore, instead of involving a sophisticated standalone attention model as previous works do, we incorporate an attention module into our feature extraction networks, which effectively reuses the shallow features and enjoys a less complex structure to diminish background noise.
Regarding the problem of varying scales of people within an image, some works [23, 4] adopted “multi-column” frameworks to extract multi-scale information from images, where each branch extracts features of a specific scale by adopting filters of a certain size. Others exploit convolutional operations, such as dilated [12, 4] and deformable convolution kernels [14, 25], to capture multi-scale information by expanding the receptive field of filters. Yet most of them extract features layer by layer, so the features of the current layer may lose information from preceding layers.
Actually, the most representative features of people across different scales can appear in different layers of the feature extraction networks. For example, the most representative features of people at a smaller scale are likely extracted in an earlier layer, while those of people at a larger scale are likely extracted in a later layer. Thus, it is vital to keep the feature information of all layers. A densely-connected structure, which enables each layer to process the features from all preceding layers, is therefore an appropriate choice: on it, features corresponding to all scales can be well preserved and better encoded to facilitate the estimation of the crowd density.
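To make the dense-connectivity idea concrete, the following is a minimal NumPy sketch (not the actual SDANet implementation): each "layer" receives the channel-wise concatenation of the block input and all preceding layers' outputs, so information from every earlier layer stays available. The random channel-mixing matrices stand in for learned convolutions.

```python
import numpy as np

def dense_block(x, num_layers=3, growth=16, rng=np.random.default_rng(0)):
    """Toy dense connectivity: each 'layer' sees the concatenation of the
    block input and all preceding layers' outputs along the channel axis."""
    features = [x]  # running list of (C, H, W) feature maps
    for _ in range(num_layers):
        inp = np.concatenate(features, axis=0)          # concat over channels
        # stand-in for a learned conv layer: random channel mixing + ReLU
        w = rng.standard_normal((growth, inp.shape[0]))
        out = np.maximum(np.einsum('oc,chw->ohw', w, inp), 0.0)
        features.append(out)
    return np.concatenate(features, axis=0)

x = np.ones((16, 8, 8))   # a 16-channel input "feature map"
y = dense_block(x)
print(y.shape)            # (16 + 3*16, 8, 8) = (64, 8, 8)
```

Note how the output stacks the input together with every layer's features, rather than keeping only the last layer's output as a plain feed-forward stack would.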
Based on the observations above, we propose a new method for crowd counting, termed Shallow feature based Dense Attention Network (SDANet). SDANet consists of three components, i.e., low-level feature extractor, high-level feature encoder, and attention map generator. As mentioned above, the attention map generator reduces the noises caused by backgrounds via re-weighing specific regions with attention maps generated with shallow features. Moreover, multi-scale information is well preserved via densely connecting the features of different layers in the high-level feature encoder. Extensive experiments on benchmark datasets also clearly demonstrate the superiority of SDANet.
Contributions of our work are summarized as follows:
We observe, for the first time, that shallow features contain distinguishable information between backgrounds and human crowds, which allows us to utilize a lightweight network to generate even more accurate attention maps.
We propose to employ densely connected structures in feature extraction/encoding networks, such that multi-scale information in different layers can be well kept to facilitate the estimation of the crowd density.
We propose a novel crowd counting method termed SDANet. Experiments conducted on three benchmark datasets show that SDANet achieves state-of-the-art performance for crowd counting.
Over the last few years, researchers have attempted to address the issue of crowd counting by density estimation with a variety of approaches, where a mapping from image features to crowd density is learned and the final count is the summation over an estimated density map. Existing density estimation methods can generally be categorized into hand-crafted feature based ones and deep feature based ones, where the latter have recently tended to incorporate attention mechanisms.
Hand-Crafted Feature based Methods
Early works usually extract hand-crafted features reflecting global image characteristics, such as local binary patterns (LBP) and gray-level co-occurrence matrices (GLCM), and learn a mapping to the density via regression models ranging from linear to non-linear ones. Lempitsky et al. utilized linear models to describe the mapping from image features to the density in a local region, which was applied to bacteria counting and crowd counting with relatively sparse densities. Idrees et al. explored features from three sources, i.e., Fourier analysis, interest points and head detection, combined with their respective confidences to obtain counts at localized patches, and adopted a Markov Random Field (MRF) framework to obtain an estimated count for the entire image.
Deep Feature based Methods
Inspired by the huge success of convolutional neural networks (CNN) in image classification, recently deep features have been leveraged for density estimation. Owing to their superior performance, deep learning based methods [20, 23, 4, 12, 5] quickly dominate the research in crowd counting.
Zhang et al. proposed a multi-column architecture (MCNN), where each column adopts a filter of a certain size to extract features of the corresponding scale. Instead of training all patches with the same parallel network, Sam et al. proposed a switching CNN that adaptively selects the optimal branch for an image patch according to its density; a classifier indicating patch density is trained beforehand and empowers the density estimation networks by providing prior knowledge. Recently, dilated kernels have also been incorporated into multi-column frameworks to further deliver larger receptive fields.
Attention mechanism in crowd counting
Recently, attention mechanisms have been widely incorporated to enhance crowd counting performance [13, 8, 5, 14, 24]. The idea is to roughly approximate the regions in the image where people are likely to appear. To do so, an attention model is learned to assign larger weights to pixels/regions that are likely to be human crowds.
ADCrowdNet employs an attention map generator trained on additional negative samples and then applies it to detect crowd regions in the images. Hossain et al. proposed a Scale-Aware Attention Network (SAAN), which utilizes an attention mechanism to re-weigh multi-scale features learned by multiple columns. SFANet generates an attention map of the same size as the image via an additional CNN branch, where each pixel indicates its probability of belonging to a head. Alternatively, DecideNet uses a learned attention map to combine the two maps generated by its regression branch and its detection branch.
The proposed SDANet is also a deep feature based method with an attention mechanism incorporated. However, different from previous works that learn a standalone attention model with sophisticated structures, we observe that shallow features carry strong signals for distinguishing backgrounds from human crowds, and therefore build the attention module of SDANet directly on shallow features with a much simpler network structure. Moreover, instead of encoding multi-scale features layer by layer, which risks losing feature information from preceding layers, we densely connect the outputs of each layer in SDANet, so that multi-scale features of different layers can be better preserved and encoded to facilitate the estimation of crowd density.
The framework of SDANet is illustrated schematically in Figure 3(a), which mainly consists of three components: Low-level Feature Extractor (LFE), High-level Feature Encoder (HFE), and Attention Map Generator (AMG).
Low-level Feature Extractor (LFE)
Most existing methods use separate branches with filters of different sizes to extract multi-scale information from images, which may introduce redundant structures into the pipeline. Inspired by the success of SANet in feature extraction, the Inception module, a tool for processing visual information at various scales, is used as the shallow feature extractor of SDANet.
Specifically, LFE consists of two feature extractor blocks, each of which contains four branches with filters of four different sizes, as shown in Figure 4. Each branch focuses on a certain scale and generates the same number of feature maps. To further enhance the model’s capability to capture multi-scale information, dilated convolution, which enlarges the receptive field without involving extra computation, is employed in the second block. Additionally, except for the branch with the smallest filter, an extra 1×1 convolution layer is placed ahead of each branch to reduce the channel dimension.
As a departure from most existing works, we remove the pooling layers between the Inception modules to avoid the reduction in spatial resolution caused by pooling and the additional complexity brought by subsequent deconvolutional layers. Considering the trade-off between resource consumption and model accuracy, we instead adopt dilated filters with a dilation rate of 2 in place of the pooling layer. Features from different branches, covering the multi-scale appearance of people in images, are subsequently concatenated together for feature encoding.
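The benefit of dilation claimed above can be checked with the standard receptive-field arithmetic: a k×k filter with dilation rate d covers an effective extent of k + (k − 1)(d − 1) pixels.

```python
def effective_kernel_size(k, d):
    """Effective spatial extent of a k x k filter with dilation rate d."""
    return k + (k - 1) * (d - 1)

# A 3x3 filter with dilation rate 2 covers a 5x5 region, i.e. the same
# extent as a 5x5 filter, with only 9 instead of 25 weights.
assert effective_kernel_size(3, 2) == 5
assert effective_kernel_size(3, 1) == 3
```

This is why replacing a pooling layer with dilation-2 filters keeps the receptive field growing without shrinking the spatial resolution.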
High-level Feature Encoder (HFE)
The structure of HFE is shown in Figure 3(b), which takes shallow features extracted from the second block of LFE as input. While encoding features, such a structure can well preserve multi-scale information.
HFE is composed of two blocks, each of which consists of three convolution layers, each followed by a ReLU activation function. In particular, the input of each convolution layer is the concatenation of the outputs of all preceding layers in the block, which are indicated by different colors in the figure. This dense connection between layers ensures that the multi-scale information in the shallow features is preserved. At the bottom of each block, a 1×1 convolution layer is applied to integrate the concatenated hierarchical features and reduce the number of feature channels to the same dimension as the input.
Finally, the input of each block is added onto the output, which will in turn become the input of the next block.
On top of that, to further preserve multi-scale information, the shallow features obtained by the low-level feature extractor and the outputs of both blocks in HFE are concatenated together as the input for feature integration at the global level. In this integration, two convolution layers are employed to integrate the high-level features globally and produce the output of HFE.
Rather than widening the network, the proposed densely-connected structure takes full advantage of features from all layers and well preserves the scale information in shallow features, which efficiently alleviates the problem of scale variation. In this paper, the channel dimensions of the shallow features and the encoded features are both set to 64 according to extensive experiments, which is fewer than in most state-of-the-art methods.
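The HFE data flow described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: learned convolutions are replaced by per-pixel random channel mixings, and the block count, layer count and 64-channel width follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, out_ch):
    """Stand-in for a 1x1 convolution: per-pixel channel mixing."""
    w = rng.standard_normal((out_ch, x.shape[0])) / np.sqrt(x.shape[0])
    return np.einsum('oc,chw->ohw', w, x)

def hfe_block(x, num_layers=3):
    feats = [x]
    for _ in range(num_layers):
        inp = np.concatenate(feats, axis=0)               # dense connection
        feats.append(np.maximum(conv1x1(inp, x.shape[0]), 0.0))  # "conv+ReLU"
    fused = conv1x1(np.concatenate(feats, axis=0), x.shape[0])   # 1x1 fusion
    return x + fused                                      # residual: input added on

shallow = np.ones((64, 8, 8))   # LFE output, 64 channels as in the paper
b1 = hfe_block(shallow)
b2 = hfe_block(b1)
# global-level integration input: shallow features + both block outputs
global_in = np.concatenate([shallow, b1, b2], axis=0)
print(global_in.shape)          # (192, 8, 8)
```

Each block returns a 64-channel map so blocks can be chained, and the global concatenation keeps the shallow features available alongside both encoded stages.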
Attention Map Generator (AMG)
In light of the observation that backgrounds tend to have significantly different responses from crowds on shallow feature maps, we generate attention maps based on low-level features only. Specifically, AMG takes the shallow features from the first block of LFE as input and generates pixel-wise attention maps in which crowd regions are always “brighter” than the backgrounds.
Here, two convolution layers followed by a sigmoid function, as shown in Figure 3(c), are used to ensure that all the computed weights lie within the range of 0 to 1. The attention loss, i.e., the summation of the pixel-wise Euclidean distance between the refined feature maps and the ground-truth density map, conveys supervision information to the learning process of the attention module. Subsequently, the attention map is employed to refine the encoded features by element-wise multiplication, and the refined features are taken as the input of the last two convolution layers to generate a high-quality density map under the supervision of a combination of several losses.
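The attention-and-refinement step above amounts to a sigmoid-squashed single-channel map multiplied element-wise into the encoded features. A minimal NumPy sketch (the single channel-collapsing weight vector stands in for AMG's two convolutions; shapes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
shallow = rng.standard_normal((64, 8, 8))   # features from the first LFE block
encoded = rng.standard_normal((64, 8, 8))   # output of HFE

# Stand-in for the two AMG convolutions: collapse channels to one map,
# then squash to (0, 1) with a sigmoid.
w = rng.standard_normal(64) / 8.0
attention = sigmoid(np.einsum('c,chw->hw', w, shallow))  # pixel-wise weights

refined = encoded * attention[None, :, :]   # element-wise re-weighting

assert 0.0 < attention.min() and attention.max() < 1.0
assert refined.shape == encoded.shape
```

Background pixels with attention values near 0 are suppressed in every channel of the encoded features, while crowd pixels pass through nearly unchanged.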
The density map generator in SDANet adopts a coarse-to-fine strategy. Concretely, the loss is composed of two terms, a coarse estimation loss and a refinement loss, both shown in Figure 3(a).
First, a convolution layer is employed to learn a coarse mapping from the combined feature maps produced by HFE and AMG to the density maps and, meanwhile, to prepare coarse density maps for further processing. To supervise the learning of the attention maps and the generation of the coarse density maps, the coarse estimation loss, measuring the Euclidean distance between the coarse density maps and the ground-truth density map, is adopted.
Here the channel dimension of the coarse density maps is set to 32 throughout all experiments.
Subsequently, two convolution layers are involved to further refine the quality of the coarse density maps, thus enhancing the accuracy of crowd counting. Notably, a ReLU activation function is employed after the convolution layers to avoid negative values. Finally, the refinement loss is introduced to supervise the refinement process and generate the fine-grained density map. Concretely, the refinement loss is composed of a Euclidean loss L_E and a counting loss L_C, which are complementary to each other. L_E improves the quality of the density map by minimizing the Euclidean distance between the fine-grained density map and the ground truth.
Here the estimated density map and the ground truth of the i-th image are compared over all N training samples. However, sharp edges and outliers in the coarse density maps might become blurry in the fine-grained maps. To remedy this, the counting loss L_C is added as a compensation term.
The counting loss compares, for the i-th training sample, the estimated number of people with the ground-truth count, both obtained as the integral over all pixels of the corresponding density map, and a small constant δ in the denominator avoids division by zero. L_C not only accelerates the convergence process but also improves the counting accuracy.
The refinement loss is then the sum of L_E and the counting loss weighted by an empirical coefficient λ, i.e., L_E + λ·L_C.
Therefore, the overall loss of SDANet is the sum of the coarse estimation loss and the refinement loss.
The Adam algorithm with an initial learning rate of 1e-4 is adopted to optimize SDANet.
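The loss terms above can be sketched in NumPy as follows. This is an illustrative reconstruction from the text, not the paper's exact formulation: the 1/2 factor in the Euclidean term, δ = 1.0, and λ = 0.1 are assumed values.

```python
import numpy as np

def euclidean_loss(pred, gt):
    """Pixel-wise Euclidean (squared L2) loss between density maps,
    averaged over the batch. Maps have shape (N, H, W)."""
    return 0.5 * np.mean(np.sum((pred - gt) ** 2, axis=(1, 2)))

def counting_loss(pred, gt, delta=1.0):
    """Relative counting error; delta keeps the denominator nonzero
    for (nearly) empty scenes."""
    c_pred = pred.sum(axis=(1, 2))  # count = integral over the density map
    c_gt = gt.sum(axis=(1, 2))
    return np.mean(np.abs(c_pred - c_gt) / (c_gt + delta))

def total_loss(coarse, fine, gt, lam=0.1):
    l_coarse = euclidean_loss(coarse, gt)                          # coarse supervision
    l_ref = euclidean_loss(fine, gt) + lam * counting_loss(fine, gt)
    return l_coarse + l_ref
```

A perfect prediction drives every term to zero, while the counting term penalizes global count drift even when the per-pixel error is spread thinly over the map.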
Similar to previous works, the mean absolute error (MAE) and mean squared error (MSE) metrics are used for evaluation, defined as MAE = (1/T) Σ_i |C_i − C_i^GT| and MSE = sqrt((1/T) Σ_i (C_i − C_i^GT)²), where T represents the total number of images involved in testing, and C_i^GT and C_i are the ground-truth and estimated numbers of people for the i-th image, respectively.
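These two metrics are straightforward to compute from per-image counts; a short NumPy reference implementation with made-up example counts:

```python
import numpy as np

def mae(pred_counts, gt_counts):
    pred, gt = np.asarray(pred_counts, float), np.asarray(gt_counts, float)
    return np.mean(np.abs(pred - gt))

def mse(pred_counts, gt_counts):
    """Root of the mean squared count error, as conventional in crowd counting."""
    pred, gt = np.asarray(pred_counts, float), np.asarray(gt_counts, float)
    return np.sqrt(np.mean((pred - gt) ** 2))

print(mae([100, 210], [103, 200]))  # 6.5
print(mse([100, 210], [103, 200]))  # ~7.38
```

MAE reflects the average counting accuracy, while MSE, being quadratic, is more sensitive to occasional large failures and is often read as a robustness indicator.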
In the experiment, three crowd counting benchmark datasets, the UCF_CC_50 dataset, the WorldExpo10 dataset, and the ShanghaiTech dataset, are used to evaluate the performance of SDANet, each being elaborated below.
The UCF_CC_50 dataset contains 50 images with various perspectives and resolutions. The number of annotated people per image ranges from 94 to 4543, with an average of 1280, making it a challenging dataset in the field of crowd counting.
The WorldExpo10 dataset consists of 3980 annotated frames from 1132 video sequences captured by 108 different surveillance cameras, divided into a training set (3380 frames) and a test set (600 frames). Regions of interest (ROI) are also provided for the whole dataset.
The ShanghaiTech dataset consists of 1198 annotated images with a total of 330,165 annotated people. The dataset contains two parts: PartA and PartB. PartA includes 482 internet images with highly congested scenes, while PartB includes 716 images of relatively sparse crowd scenes taken on the streets of Shanghai.
Table 4: Ablation study on the WorldExpo10 dataset.

| Method | MAE | MSE |
|---|---|---|
| SDANet without AMG | 12.89 | 15.28 |
| SDANet without Dense Structure | 10.14 | 13.25 |
| SDANet without Refinement | 9.64 | 13.19 |
Taking computation cost and data variety into account, we adopt a patch-wise training strategy. Following previous work, 9 patches, each 1/4 of the image size, are cropped from each image to generate the training set: the first four patches are the four non-overlapping quarters of the image, while the other five are cropped from random locations. During testing, non-overlapping patches are cropped from each image in the test set and processed individually; the final density map of an image is the concatenation of its patches’ predictions. Additionally, images are further augmented by random horizontal flipping.
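The cropping scheme above can be sketched as follows (a NumPy illustration under the quarter-size reading of the text, not the paper's exact code):

```python
import numpy as np

def make_training_patches(img, rng=np.random.default_rng(0)):
    """9 training patches, each 1/4 of the image area: the four
    non-overlapping quarters plus five random crops of the same size."""
    h, w = img.shape[:2]
    ph, pw = h // 2, w // 2                      # quarter-area patch size
    patches = [img[:ph, :pw],        img[:ph, pw:pw * 2],
               img[ph:ph * 2, :pw],  img[ph:ph * 2, pw:pw * 2]]
    for _ in range(5):                           # five random crops
        y = int(rng.integers(0, h - ph + 1))
        x = int(rng.integers(0, w - pw + 1))
        patches.append(img[y:y + ph, x:x + pw])
    return patches

patches = make_training_patches(np.zeros((240, 320)))
assert len(patches) == 9
assert all(p.shape == (120, 160) for p in patches)
```

The four deterministic quarters guarantee full coverage of every image, while the five random crops add variety across epochs.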
Besides, we generate the ground truth from the head annotations provided by the datasets. Each head annotation is blurred with a Gaussian kernel whose summation is normalized to one, so that the number of people equals the integral over the density map.
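A minimal sketch of this ground-truth generation, assuming a fixed kernel size and sigma (the paper does not state the values used; 15 and 4.0 here are illustrative):

```python
import numpy as np

def gaussian_kernel(size=15, sigma=4.0):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
    return k / k.sum()                       # normalized to sum to one

def density_map(shape, head_points, size=15, sigma=4.0):
    """Place one normalized Gaussian per head annotation; the integral of
    the map then equals the number of annotated people (heads very close
    to the border lose a little mass to truncation)."""
    h, w = shape
    pad = size // 2
    dm = np.zeros((h + 2 * pad, w + 2 * pad))  # pad so kernels never overflow
    k = gaussian_kernel(size, sigma)
    for (y, x) in head_points:                 # (row, col) head coordinates
        dm[y:y + size, x:x + size] += k        # kernel centered at (y, x)
    return dm[pad:pad + h, pad:pad + w]

dm = density_map((64, 64), [(20, 20), (30, 40), (50, 10)])
print(round(float(dm.sum()), 6))               # 3.0 — three annotated heads
```

Because each kernel integrates to one, summing the resulting map recovers the head count, which is exactly the property the counting loss and the evaluation metrics rely on.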
Results and Analysis
On each dataset, we follow the standard protocol to generate ground truth and compare our method with the state-of-the-art algorithms. Furthermore, we conduct extensive ablation experiments on the WorldExpo10 dataset to analyze the effects of different components in SDANet. We explain experimental settings and show results as follows.
On the UCF_CC_50 dataset, we performed 5-fold cross-validation to evaluate the proposed method, following the standard protocol. Table 1 compares the results of our method with contemporary state-of-the-art works on the UCF_CC_50 dataset and illustrates that the proposed SDANet is able to deal with crowd scenes of varying densities, achieving superior performance over the other approaches. Specifically, our method achieves an 11.91 reduction in MAE and a 5.52 reduction in MSE. This clearly demonstrates that SDANet is highly robust against scale and density changes.
The comparison of SDANet with contemporary state-of-the-art works on the 5 scenes (S1–S5) in the test set of the WorldExpo10 dataset is shown in Table 2. This challenging test set combines different densities, ranging from sparse to dense, and various backgrounds including squares, stations, etc. The results show that the proposed SDANet scores the best in Scene 1, Scene 4 and Scene 5, as well as the best accuracy on average, which again proves the strong adaptability of SDANet to different scenarios with varying density levels.
On the ShanghaiTech dataset, SDANet is evaluated and compared with other recent works, with results shown in Table 3. Again, the proposed method attains the lowest MAE and MSE. Specifically, our approach outperforms the latest work TEDnet by 4.87 and 20.31 on the MAE and MSE metrics, respectively, on the ShanghaiTech PartB dataset.
We first analyzed the attention maps generated by AMG and obtained some statistical results. Taking the attention map of Figure 2 as an example, the average attention value of the crowd region (center-right) is 0.874 (GT = 1), while that of the background region (left corner) is 0.253 (GT = 0), which shows that the attention maps reduce background noise by assigning relatively low weights to background regions.
To demonstrate the performance of SDANet on scenes with cluttered backgrounds and varying head sizes, we choose the ShanghaiTech dataset in particular for visualizing estimated density maps, as shown in Figure 5. For each group of images, the pictures in the middle and on the right are the corresponding ground truth and estimated density map of the image on the left, where the numbers in the top-right corners indicate the ground truth (GT) and the estimated number of people (PRE), respectively. Here, we display the estimated density maps of various scenarios, ranging from 103 to 1067 persons, to demonstrate that the proposed SDANet performs well in both dense and sparse scenes. It can be seen that SDANet adapts strongly to different density levels, with an error of less than 4.
To validate the effectiveness of the key components of SDANet, we also conducted ablation studies on the WorldExpo10 dataset, which is more realistic and challenging since all its images are acquired from real surveillance scenes.
Effectiveness of AMG
We explore the performance improvement offered by AMG by removing the attention module from SDANet and comparing the result with the full network, indicated by “SDANet without AMG” in Table 4. There is a 37% increase in MAE and a 15% increase in MSE when AMG is removed, clearly demonstrating that AMG makes a significant contribution to diminishing background noise.
Effectiveness of densely-connected structure
To shed light on how the densely-connected structure preserves multi-scale features, we conduct an experiment on the same dataset without the dense connections between layers; the result is indicated by “SDANet without Dense Structure” in Table 4. It can be seen that removing the dense connections leads to an over 20.1% drop in counting accuracy, which means that the densely-connected structure reinforces the diversity of features and improves the performance of SDANet.
Effectiveness of estimation refined layers
Furthermore, we study the refinement ability of the last two layers and the corresponding loss term. We remove the last two convolution layers in SDANet and train the network solely with the coarse estimation loss; the result is indicated by “SDANet without Refinement” in Table 4. Without the refinement layers, the MAE degrades by nearly 16%. Therefore, the coarse-to-fine strategy embodied in the loss function further enhances the performance of the network.
In this paper, we have presented a brand-new Shallow feature based Dense Attention Network (SDANet) that automatically counts the number of people in an image. SDANet is characterized by: 1) diminishing the impact of backgrounds via a lightweight attention model, and 2) capturing multi-scale information via densely connecting hierarchical image features. Extensive experiments have been carried out, and the results on three benchmark datasets validate the adaptability and robustness of SDANet across crowd scenes varying from sparse to dense.
- (2016) Crowdnet: a deep convolutional network for dense crowd counting. In Proceedings of the 2016 ACM on Multimedia Conference, pp. 640–644.
- (2018) Scale aggregation network for accurate and efficient crowd counting. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750.
- (2018) Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (4), pp. 834–848.
- (2018) An aggregated multicolumn dilated convolution network for perspective-free counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 195–204.
- (2019) Crowd counting using scale-aware attention networks. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1280–1288.
- (2013) Multi-source multi-scale counting in extremely dense crowd images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2547–2554.
- (2019) Crowd counting and density estimation by trellis encoder-decoder network. arXiv preprint arXiv:1903.00853.
- (2018) Crowd counting by adaptively fusing predictions from an image pyramid. arXiv preprint arXiv:1805.06115.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- (2012) Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097–1105.
- (2010) Learning to count objects in images. In Advances in Neural Information Processing Systems, pp. 1324–1332.
- (2018) CSRNet: dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1091–1100.
- (2018) DecideNet: counting varying density crowds through attention guided detection and density estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5197–5206.
- (2019) ADCrowdNet: an attention-injective deformable convolutional network for crowd understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3225–3234.
- (2017) Switching convolutional neural network for crowd counting. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4031–4039.
- (2018) Crowd counting via adversarial cross-scale consistency pursuit. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5245–5254.
- (2017) Generating high-quality crowd density maps using contextual pyramid CNNs. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1861–1870.
- (2018) A survey of recent advances in CNN-based single image crowd counting and density estimation. Pattern Recognition Letters 107, pp. 3–16.
- (2015) Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9.
- (2015) Deep people counting in extremely dense crowds. In Proceedings of the 23rd ACM International Conference on Multimedia, pp. 1299–1302.
- (2015) Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 833–841.
- (2018) Crowd counting via scale-adaptive convolutional neural network. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1113–1121.
- (2016) Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 589–597.
- (2019) Dual path multi-scale fusion networks with attention for crowd counting. arXiv preprint arXiv:1902.01115.
- (2018) DA-Net: learning the fine-grained density distribution with deformation aggregation network. IEEE Access 6, pp. 60745–60756.