Crowd counting has become an important task for a variety of applications, such as traffic control, public safety, and scene understanding [33, 52]. As a result, density estimation techniques have become a research trend for various counting tasks. These techniques use trained regressors to estimate the people density in each area, so that summing the resulting density functions yields the final crowd count.
A variety of regressors, such as Gaussian Processes [18] and, more recently, deep learning based networks [23, 19, 22], have been used for crowd counting and density estimation. However, the state-of-the-art approaches are mostly deep learning based, owing to their ability to generate accurate density maps and produce precise crowd counts.
Generally, approaches based on deep neural networks (DNNs) use standard and dilated convolutions at the heart of their models to learn local patterns and density maps [22, 1]. Most of them use the same filters, pooling matrices, and settings across the whole image, and thus implicitly assume the same congestion level everywhere. However, this assumption often does not hold in reality.
To illustrate the effect of this incorrect assumption, consider some examples with clearly different levels of crowdedness. Fig. 1 presents exemplar images of different congestion scenarios. Fig. 1(a) shows a highly crowded image with more than 1,000 people, while Fig. 1(c) presents a less crowded scene with fewer than 70 people. However, within Fig. 1(a) there is a relatively more congested area, which is shown in Fig. 1(b). The same situation can be seen in Fig. 1(c), where a small area within the crowd, shown in Fig. 1(d), is clearly more crowded.
Due to this dynamic variation in crowded scenes, it is natural to use different features and branches to capture details at different levels of crowdedness. In the past, this has been attempted by four major types of approaches, i.e., defining separate pathways from the earlier layers and utilizing different sizes of convolutional filters, image pyramid-based methods [7, 26], detection-based crowd counting, patch-based crowd counting [31, 1], and multi-level feature based methods. Although these methods show robust performance with different tactics, there is still considerable room to improve their performance by designing highly efficient convolutional layer structures for better feature extraction.
First, generally speaking, a small kernel size for a convolution filter is more effective than larger ones for extracting valuable features, because more details can be captured with lower complexity and without making the network more difficult to train [36, 39, 40]. Kang et al. showed that smaller receptive fields gave better performance. Secondly, patch-based and multi-patch processing is time-consuming, because the same features have to pass through different paths and patches multiple times. To benefit from multi-patch or multi-column based approaches, it is better to extract coarse features in the initial layers and then pass them to several branches that zoom in to find more sophisticated features. To utilize a deeper network for crowd counting, we need an approach that can deploy the aforementioned proposals on the multi-column structure to achieve better performance.
In this paper, we present a deep encoder-decoder based architecture called PDANet that combines pyramid feature extraction with spatial and channel attention to produce richer features for crowd counting and density estimation. We use VGG16 as the feature extractor of the encoder, which produces features for the decoder of the model. To learn multi-scale features, we first use a cascade of global adaptive pooling (GAP), convolutions and dilated convolutions to extract more mature features at different scales from the VGG16 features. Then, we apply channel and spatial attention in different layers to enhance the quality of the features in order to obtain more accurate density maps. In addition, to make the model more adaptive, we introduce a classification module that detects the crowdedness of the input scene. Finally, to address the intra-scene variation in the crowdedness level, we generate separate low-density and high-density maps within the decoder.
This work differs in several ways from existing crowd counting approaches that use pyramid contextual information and attention modules. (a) Unlike previous patch-based work, we do not separate the input scene into different patches to address intra-congested areas within the input scene. (b) We utilize a new combination of global average pooling, convolution and Atrous convolution. Our approach differs in terms of the order and parameters of these operations, and is more effective than the existing methods. In the experimental section, we show that this combination better aggregates local scale features and further increases the performance on all datasets. (c) The architecture of our end-to-end attention model also differs from ADCrowdNet, which does not use spatial-based attention modules that are embedded within the architecture and trained in an end-to-end way. (d) We treat low and high density scenes differently from previous work. We introduce a two-level density management module to manage high-density inputs and the high-density areas within an input scene. We do not separate patches of the low and high density scenes at the beginning. Instead, we introduce a classification module that sends low and high crowded scenes to different pathways, using the first-layer pyramid features, to optimize the network weights and address dense scenes effectively. (e) Furthermore, we separate the decoder part of our model into two parts to manage the intra-density within a scene. (f) Compared with previous work, we use a new combination of low and high density features and outputs to create a unique spatial-attention correction module that generates the final density map from a mix of low and high density features.
To summarize, our contributions are as follows.
We propose a new deep encoder-decoder based network, dubbed as Pyramid Density-aware Attention Net (PDANet) for accurate crowd counting that incorporates pyramid contextual information and attention modules into an end-to-end trainable density estimation pipeline, and learns to exploit the right context at each location within a scene.
We introduce two-stage decoder modules that separately address low and high level crowded scenes as well as intra-congested areas within the scene.
We use a combination of classification and regression losses to address the whole and within-the-scene changes in the density maps.
Extensive experiments on several challenging benchmarks are conducted to demonstrate the superior performance of our approach over the state-of-the-art solutions.
2 Related Works
In this section, we review the literature related to our PDANet model. Early solutions to crowd counting focused on counting by detection; however, they are incapable of handling highly congested scenes and therefore fail on the more challenging, very crowded scenes. In recent years, counting by regression has become the most popular approach to crowd counting, in which a regressor is learned to capture the relationship between image characteristics and the density or object count [26, 31, 44, 23]. More recently, regression models based on deep neural networks (DNNs) have become the dominant ones for density estimation and crowd counting.
Owing to the excellent ability of CNNs to learn local patterns, researchers have started to utilize them for regressing density maps and crowd counts [20, 4, 7]. Earlier research on crowd counting focused on single-branch or single-scale models [17, 18]. These CNN-based models superseded previous studies that used traditional regressors such as Gaussian Processes or Random Forests [6, 18]. One of the best single-column methods was proposed by Li et al., who combined VGG-16 and dilated convolution layers to handle multi-scale contextual information.
However, despite these achievements, a significant issue remains, namely the huge variation of people sizes among different datasets and within the input scene. To address this, some researchers utilized patch-based processing [27, 1]. They divided the scenes into overlapping patches and fed them to CNN-based models to estimate the final density map. Although these approaches improved the accuracy significantly, they had a big drawback: their time cost. Therefore, researchers turned to multi-scale networks for the density estimation task [4, 26]. Cao et al. introduced a CNN-based network (SANet), in which the encoder extracts scale diversity in its features by using an aggregation module and the decoder uses transposed convolutions to generate high-resolution density maps. Experimental results demonstrated that their model could achieve superior performance to state-of-the-art methods.
Many studies have been based on multi-column architectures [50, 31]. One of the initial works was done by Zhang et al., who proposed a structure of three CNN columns (MCNN), each with a different receptive field to handle a range of head sizes, which improved the state-of-the-art results remarkably. Based on the idea of MCNN, Sam et al. proposed Switch-CNN [31, 50], a multi-column patch-based model. Their approach used patch classification and multi-scale regressors for generating the density map. IG-CNN extended their previous work by combining clustering and crowd counting to estimate the density map more adaptively, training a mixture of experts that could incrementally adapt and grow based on the complexity of the dataset. Sindagi et al. proposed a new multi-column network, CP-CNN, that added two branches which classify the image-wise density to provide global and local context information to the MCNN model. Recently, Deb et al. incorporated Atrous convolutions into a multi-branch network by assigning different dilation rates to the various branches.
Most recently, Kang et al. proposed a model that uses image pyramids to handle multiple scales within the scenes. They created an image pyramid of the input scene, passed each image through an FCN to obtain the output density maps, and then fused them adaptively at every pixel location. Although they improved the performance, the improvement was not significant enough to compete with state-of-the-art methods. Shi et al. proposed a perspective-aware CNN-based model for crowd counting (PACNN). Their model combined perspective information with density regression to address the change of person scale within an image. They generated ground truth perspective maps and used them to produce perspective-aware weighting layers that combine the multi-scale density results adaptively. Wan et al. proposed a new model that utilizes the correlation information among the training data (residual information) for accurate crowd counting (RRSP). They fused all the residual predictions and created the final density map from the appearance-based map and the combined residual maps of the input scene. Although they achieved an excellent result, their model incurred a considerable training cost.
Recent studies have mostly focused on pyramid and attention-based modules. Pyramid modules were introduced by Zhao et al. to produce high-quality features for the scene semantic segmentation task. They introduced an efficient method to estimate the head size and combined it with an attention module to aggregate density maps from different layers and generate the final density map. Liu et al. presented another end-to-end multi-scale solution based on fusing multi-scale pyramid features (CAN). They used modified PSP modules to extract multi-scale features from the VGG16 features in order to address rapid scale changes within the scenes. Their model leveraged multi-scale adaptive pooling operations to cover a wide range of receptive fields. In contrast to CAN, Chen et al. proposed an end-to-end single-column structure, the Scale Pyramid Network (SPN), which extracts multi-scale features from the VGG16 backbone features using dilated convolutions with various dilation rates. The experimental results showed that their idea worked well on some well-known datasets (ShanghaiTech Part A).
On the other hand, the attention module aims to re-calibrate the features adaptively, so as to highlight the effect of valuable features while suppressing the impact of weak ones. Recently, researchers have incorporated this module and its variations into their models to improve performance in several tasks such as object detection, object classification, and medical image processing [14, 32, 47]. Rahul et al. proposed an attention-based model to regress multi-scale density maps from several intermediate layers. ADCrowdNet is one of the latest works in the area of crowd counting that uses attention modules to generate accurate density maps. Liu et al. utilized a two-step cascade encoder-decoder architecture, one step for detecting the crowded areas and producing the attention map (AMG), and the other for generating density maps (DME). Their method achieved excellent results on the ShanghaiTech Part A dataset. Although the idea of using the attention map is interesting, it has some significant drawbacks: (a) it needs an external dataset to train AMG to detect the crowd area, and (b) after producing the attention map, it applies the map to the input crowd image to create masked input data for DME, which is redundant and time-consuming. Wu et al. proposed an adaptive multi-pass model for crowd counting (ASD). Their model has three branches: two for sparse and dense crowd counting with different receptive fields, and a third for adaptively recalibrating the effect of each density map to produce the final density output.
3 Pyramid Density-aware Attention Net
In this section, we first present the general structure of our proposed PDANet for adaptively addressing the challenges in crowd counting. This new structure uses pyramid-scale feature extraction, which consists of adaptive pooling together with standard and dilated convolutions, to enrich the feature maps for handling objects of various scales within a scene. In the following subsections, we give more details about the attention modules, pyramid feature modules, decoders and loss functions.
As discussed above, we formulate crowd counting as regressing a people density map from a scene.
The overall architecture of PDANet for regressing the density map of the crowd from an image is illustrated in Fig. 2. This framework contains five main components, i.e., Feature extractor, Pyramid Feature Extractor (PFE), Classifier, Density Aware Decoder (DAD), and Attention Modules. Each of these sections has an impact on the accuracy and efficiency of the model for the crowd counting.
The backbone of PDANet is a network based on VGG16, which is widely used for the extraction of low-level features. We eliminate the layers between the last two pooling layers, considering the trade-off between resource cost and accuracy. We then apply a channel and spatial attention module to highlight essential features, and feed the result into the Pyramid Feature Extractor (PFE). The PFE module combines adaptive pooling with standard and dilated convolutions to produce scale-aware mature features for the last layers of the decoder section. In the next step, we use Global Average Pooling (GAP) and a fully connected layer to classify the input scene as high-density or sparse. We then pass this information to the respective decoder; both decoders share the same structure (our studies showed that using the same receptive field is better than using different ones). Each decoder contains four dilated convolution layers, each followed by an attention module.
Furthermore, to address the congestion within the dense or sparse scenes, we divide the decoder module into another two branches to generate the low and high-density maps within the input scene and assign them with the corresponding regression losses. In the final step, we use the dense and sparse features from the last layer of the decoders to produce the final output density map (DM). Our PDANet uses the same loss for sparse, dense and final output DM, and a classification loss to train the model in an end-to-end manner.
To summarize, in our proposed PDANet, each part plays a role in the overall performance:
The role of the Attention Module is to put more emphasis on the significant features (crowded areas).
The Pyramid Feature Extractor generates more productive features which are more suitable for the crowd counting task with scale variation, by a combination of different scale adaptive pooling and dilated convolution.
The Classifier helps find the proper branch of the decoder according to the crowdedness level of the scene.
The mid-branch Decoder is to address congestion change within the input image.
3.2 Channel and Content based Attention Modules
The attention block was first introduced as the squeeze-and-excitation (SE) block, which can be easily integrated within a CNN architecture. It uses global average pooling to capture the spatial dependency and builds a channel-specific descriptor to emphasize useful channels and re-calibrate the feature map. Building on this foundation, the "concurrent spatial and channel squeeze and excitation" block was proposed to apply channel-based and spatial re-calibration concurrently.
In this study, we re-calibrate the feature maps adaptively by mixing attention modules to augment the effect of essential features while suppressing the weak ones. We use a combination of spatial and channel-based attention to find and separate the crowded areas within the input image. As shown in Fig. 2, we use two types of attention modules in our model. In most cases, we apply channel and spatial attention after the convolution layers, shown in brown in Fig. 2. This module combines channel and spatial attention to produce the final attention features in each layer, taking the maximum value at each index location between the channel and spatial attention outputs.
The other attention module is a spatial attention map generated from the density maps of the sparse and dense crowded areas within the image. We apply a sigmoid to this attention map and multiply it with the joint convolution feature maps from the last layers of the sparse and dense decoders.
Fig. 3 illustrates this attention module. As shown in the figure, it has two branches, i.e., the channel attention branch on the top and the spatial attention branch on the bottom. The channel attention module uses a cascade of Global Average Pooling (GAP) and two fully connected layers, the second of which has size C (C is the channel size of a convolution layer).
A sigmoid layer then brings the values into the range (0, 1) to measure the impact of each channel in the feature maps.
Finally, the channel-based attention features are obtained by multiplying the encoded channel-wise dependencies with the input feature map.
On the other hand, to obtain the spatial attention map, we perform a convolution on the input feature map. This lets us measure the importance of the spatial information of each area within the feature map. In the next stage, we multiply the spatial attention map with the input feature maps to get the final spatial attention features, which augment relevant spatial locations and suppress irrelevant ones.
Finally, we combine the results of these two attentions by taking the element-wise max of the channel and spatial excitations. The resulting feature maps amplify the input feature map data and re-calibrate the crowded area within each input convolution layer.
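As a minimal sketch of the channel and spatial attention described above, the following plain-Python code operates on a feature map stored as nested lists of shape C x H x W. Several details are assumptions that are not stated in the text: the reduced size of the first fully connected layer (set by the shape of the hypothetical weight matrix `w1`), the ReLU between the two fully connected layers, the use of a 1x1 convolution with a single output channel for the spatial branch, and the omission of all bias terms. The function and parameter names are illustrative only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(feat, w1, w2):
    """feat: C x H x W nested lists; w1: R x C and w2: C x R weight matrices (assumed shapes)."""
    # Squeeze: global average pooling gives one scalar per channel.
    z = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]
    # Excitation: FC -> ReLU (assumed) -> FC -> sigmoid, biases omitted.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z))) for row in w1]
    gates = [sigmoid(sum(w * v for w, v in zip(row, hidden))) for row in w2]
    # Re-scale every channel by its gate.
    return [[[g * v for v in row] for row in ch] for g, ch in zip(gates, feat)]


def spatial_attention(feat, w):
    """w: C weights of an assumed 1x1 convolution that outputs a single channel."""
    C, H, W = len(feat), len(feat[0]), len(feat[0][0])
    gate = [[sigmoid(sum(w[c] * feat[c][i][j] for c in range(C)))
             for j in range(W)] for i in range(H)]
    # Re-scale every spatial location by its gate.
    return [[[gate[i][j] * feat[c][i][j] for j in range(W)]
             for i in range(H)] for c in range(C)]


def sc_attention(feat, w1, w2, w):
    """Element-wise max of the channel-attended and spatially attended features."""
    ca = channel_attention(feat, w1, w2)
    sa = spatial_attention(feat, w)
    return [[[max(a, b) for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(ch_a, ch_b)]
            for ch_a, ch_b in zip(ca, sa)]
```

With all weights set to zero, both gates equal 0.5 and the output is half of the input, which is a convenient sanity check for the two branches and the max combination.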
3.3 Pyramid Feature Extractor (PFE)
As discussed in Section 3, we need to capture details at various scales of crowd density within the input images to overcome the limitation of a single receptive field. In this section, we use a Pyramid Feature Extractor (PFE), which is a modified Spatial Pyramid Pooling module, to address this issue. The PFE fuses features under various pyramid scales by combining GAP with shared 2D convolution layers that mix standard and dilated kernels. The general operation of PFE is illustrated in Fig. 4.
We extract contextual features as follows. For each scale, adaptive average pooling averages the input feature maps over blocks and produces contextual features for each channel. The contextual features at the various scales form the pooled representation for different areas and provide rich information about the density level in various sub-regions of the input image.
In the Experiment section, we consider two scenarios for GAP. In the first case, we pass six different pyramid scales to the first shared convolution layers (Conv Module). In the other scenario, we pass three pyramid scales whose GAP output sizes are defined relative to the size of the input feature map. We also pass the input feature to the Conv Module. We consider these two scenarios to evaluate the effect of different GAP scales on the density estimation. The results presented in the Experiment section are based on the second approach.
Then, we feed the pooled features to the Conv Module to improve the representation power of the feature map. This procedure is different from the architectures that reduce the dimension with a convolution.
As illustrated in Fig. 4, for each scale, the shared Conv module is followed by a bilinear interpolation that up-samples the contextual features to the same size as the input feature map. These operations reduce the number of parameters to learn in PFE, speed up the processing, and increase the model efficiency.
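The pooling and up-sampling steps of the PFE can be sketched on a single channel as follows. This is an illustration under stated assumptions rather than the exact module: the shared Conv module between pooling and interpolation is omitted, the pooling bins follow the common floor/ceil rule for adaptive average pooling, the bilinear interpolation aligns the corner pixels, and the scales `(1, 2)` in `pyramid_context` are hypothetical placeholders for the scales used in the experiments.

```python
def adaptive_avg_pool(ch, s):
    """Average an H x W channel (nested lists) down to s x s bins."""
    H, W = len(ch), len(ch[0])
    out = []
    for i in range(s):
        r0, r1 = (i * H) // s, ((i + 1) * H + s - 1) // s   # floor / ceil bin edges
        row = []
        for j in range(s):
            c0, c1 = (j * W) // s, ((j + 1) * W + s - 1) // s
            block = [ch[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def bilinear_upsample(ch, H, W):
    """Bilinear interpolation of a small map to H x W (corner pixels aligned)."""
    h, w = len(ch), len(ch[0])
    out = []
    for i in range(H):
        y = i * (h - 1) / (H - 1) if H > 1 and h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        dy = y - y0
        row = []
        for j in range(W):
            x = j * (w - 1) / (W - 1) if W > 1 and w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            dx = x - x0
            top = ch[y0][x0] * (1 - dx) + ch[y0][x1] * dx
            bottom = ch[y1][x0] * (1 - dx) + ch[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out


def pyramid_context(ch, scales=(1, 2)):
    """One contextual map per scale, each restored to the input resolution."""
    H, W = len(ch), len(ch[0])
    return [bilinear_upsample(adaptive_avg_pool(ch, s), H, W) for s in scales]
```

Each returned map has the same spatial size as the input, so the maps can be concatenated with the input features along the channel dimension.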
On the other hand, by passing the input features through the shared layers, we extract local feature patterns invariantly, as the kernels travel across all image positions at the different scales and detect learned local patterns. The shared layer contains one convolution that reduces the number of channels. We do this to reduce the number of parameters that need to be trained and the computational cost of PFE.
In the next stage, we sum the outputs of a convolution and a dilated convolution to provide additional information about the contextual features. Experimentally, we verify that this combination of convolution filters improves the performance of the PFE module in the density estimation task.
Finally, we concatenate all the contextual features with the input features and, with a convolution, reduce the number of channels to that of the original VGG features. The number of concatenated feature maps equals the number of pyramid contextual features plus the original input feature map.
Then, we utilize a special attention module, which is the combination of the Conv module and the attention module explained in Section 3.2. We pass the resulting features to two separate attention sections. As illustrated at the bottom of Fig. 4, one section feeds the features to a GAP and then applies the Conv Module to the result.
We apply this GAP to highlight the most important parts of the output feature maps. Finally, we combine the results of the two sections by taking the element-wise max of the Conv Module output and the attention module output at each point in the feature map.
Altogether, as illustrated in Fig. 4, the PFE module extracts contextual features as discussed above, which are then fed to the classification module and a Density Aware Decoder (DAD) module that produces the density map.
3.4 Classification Module
The next step in our overall framework, as illustrated in Fig. 2, is to decide whether the input contextual features are dense or sparse. We do this to address the huge variation of crowd densities among different images. We pass input features to the suitable DAD to adaptively react to the density level of the input image and provide better estimation for crowd density.
To model this, we need a binary classification module that learns to classify the input feature maps into two classes, dense or non-dense. Therefore, we apply global average pooling, which produces a feature vector, followed by a fully connected layer.
We use this combination to produce a class probability, which is a value in the range [0, 1]. If the output probability is below the decision threshold, the model considers the input to be a non-dense crowd image and passes it to the low-density DAD module; otherwise, it passes it to the high-density DAD.
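The routing decision can be sketched as below. This is a simplified illustration: the pooling is assumed to reduce each channel to a single value, the fully connected layer is assumed to have a single output followed by a sigmoid, and the default `threshold=0.5` is a hypothetical value, since the threshold is not recoverable from the text. The names `classify_density`, `fc_weights` and `fc_bias` are illustrative.

```python
import math


def classify_density(feat, fc_weights, fc_bias=0.0, threshold=0.5):
    """feat: C x H x W nested lists. Returns (probability, branch name)."""
    # Global average pooling: one value per channel (assumed 1x1 output).
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feat]
    # Single-output fully connected layer followed by a sigmoid.
    logit = sum(w * v for w, v in zip(fc_weights, pooled)) + fc_bias
    prob = 1.0 / (1.0 + math.exp(-logit))
    # Route to the high-density decoder when the probability reaches the threshold.
    branch = "high" if prob >= threshold else "low"
    return prob, branch
```

In a full model the returned branch name would select which decoder receives the PFE features.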
3.5 Density Aware Decoder (DAD)
DAD is one of the special parts of the proposed PDANet model, as it dynamically handles intra-variation of the density level within the input image. To achieve this, we use four dilated convolution layers with the attention module attached to each layer, similar to the one introduced in Section 3.2. Through passing the high density areas to the high density branch, and low density to the sparse branch, we achieve a model that is able to address the density variation of the input image adaptively. Furthermore, we break DAD into two parts, i.e., the shared layers and the low or high-density branches. This separation enables us to cope with various occlusion, internal change, and diversified crowd distribution, as illustrated in Fig. 1.
The structure of DAD is illustrated in Fig. 5. As shown in the figure, we treat the first two layers as shared layers and then pass their output feature maps to two separate paths, each with two further convolution layers, to manage the within-image density variation. The dilated convolutions in DAD use a fixed number of channels, kernel size and dilation rate. Furthermore, to reduce the number of training parameters, we use a convolution to decrease the number of input channels and then perform the 2D dilated convolution on the reduced-channel feature maps. This speeds up the training and convergence of our model. Each branch ends with a convolution that produces the density map for the low or high crowded areas. One detail is worth noting: for the high-density regions within the image, we use separate output layers for highly dense and sparse input images, whereas for the low-density regions, within either a sparse or a highly dense input image, we use a shared layer. This design lets us use more information to train the model to map the low-density regions of the input image, and therefore gives a better density estimate for the less crowded areas. On the other hand, by using a different output layer for the highly dense areas within sparse and highly crowded input images, our DAD module is able to improve its estimate for these areas.
By utilizing this architecture in the DAD, we obtain two density maps, one for the low and one for the high crowded areas of the input image. In addition, we take the feature maps of the last layer in the dense and non-dense branches and sum them. We also use the sum of the low-density and high-density maps as an attention map. The final overall feature map is produced by multiplying the summed feature maps by the sigmoid scaling of this attention map, and it is fed to the final layer to produce the overall density map. This novel design enables DAD to handle various occlusions as well as inter- and intra-crowd density variation.
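The fusion step at the end of the DAD can be illustrated with a short sketch. It assumes that the attention map is the sigmoid of the sum of the low-density and high-density maps and that it is broadcast over all channels of the summed branch features; the argument names are hypothetical.

```python
import math


def fuse_decoder_outputs(f_low, f_high, dm_low, dm_high):
    """f_low, f_high: C x H x W branch features; dm_low, dm_high: H x W density maps."""
    H, W = len(dm_low), len(dm_low[0])
    # Spatial attention: sigmoid of the summed low- and high-density maps.
    att = [[1.0 / (1.0 + math.exp(-(dm_low[i][j] + dm_high[i][j])))
            for j in range(W)] for i in range(H)]
    # Summed branch features, re-weighted at every location by the attention map.
    return [[[att[i][j] * (c_low[i][j] + c_high[i][j]) for j in range(W)]
             for i in range(H)]
            for c_low, c_high in zip(f_low, f_high)]
```

The result would then be passed to the final layer that produces the overall density map.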
3.6 Implementation Details
3.6.1 Regression Loss and Ground Truth
The last part of our method is about the loss function.
PDANet uses four losses, which fall into two categories, i.e., the regression and classification losses. For the regression loss, we use a pixel-wise loss for training, and we define the high, low and overall density losses based on it.
The loss is computed between the ground-truth and estimated density maps. We rely on the same methodology as previous work to obtain the ground-truth density map, which is generated by convolving the delta function at each annotated point with a normalized Gaussian kernel.
Every annotated point in the image contributes one such kernel. Note that the summation of the density map is equal to the crowd count in the image. Instead of using geometry-adaptive kernels, we use a fixed spread parameter of the Gaussian kernel for generating ground truth density maps.
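The ground-truth generation can be sketched in plain Python as follows. The default `sigma=4.0` and the kernel radius of three standard deviations are hypothetical choices, and the re-normalisation of kernels that are cut off by the image border is an assumption made here so that the map sums exactly to the number of annotated heads.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized 2D Gaussian kernel of size (2 * radius + 1) squared."""
    k = [[math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
          for dx in range(-radius, radius + 1)]
         for dy in range(-radius, radius + 1)]
    total = sum(map(sum, k))
    return [[v / total for v in row] for row in k]


def density_map(points, H, W, sigma=4.0):
    """points: (row, col) head annotations inside an H x W image."""
    radius = int(3 * sigma)
    kernel = gaussian_kernel(sigma, radius)
    dm = [[0.0] * W for _ in range(H)]
    for y, x in points:
        # Kernel weights that fall inside the image.
        inside = [(y + dy, x + dx, kernel[dy + radius][dx + radius])
                  for dy in range(-radius, radius + 1)
                  for dx in range(-radius, radius + 1)
                  if 0 <= y + dy < H and 0 <= x + dx < W]
        mass = sum(w for _, _, w in inside)
        for r, c, w in inside:
            # Re-normalize so that every head still contributes exactly 1 (assumption).
            dm[r][c] += w / mass
    return dm
```

Summing all pixels of the returned map recovers the number of annotated points, which is the property used to read the count off a density map.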
To obtain the ground-truth density maps for the high-density and sparse regions within the input image, we utilize a simple rule. We then apply the loss to the overall, high-density and sparse density maps to produce the overall, high-density and sparse losses, respectively.
3.6.2 Classification Loss and Ground Truth
In addition, our model needs to classify the scene. Thus, we introduce an actual class tag for each image. To obtain it, we define a rule that decides whether the input image is highly crowded or not.
We consider a measure of how dense the input scene is, which is based on counting the number of pixels that have a density value.
Then, according to the change in the number of people in each dataset, we define a threshold, and if the number of people is larger than that threshold, we can consider it as a high density input scene; otherwise it is a low density one. We have tested different threshold values, and found that our model is not too sensitive to it and able to classify the input scenes correctly.
Then, we use a binary cross entropy (BCE) loss to train the model to detect sparse and dense input images. The BCE loss is computed between the actual class and the class predicted by the model.
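For reference, the standard binary cross entropy between class tags and predicted probabilities can be written as the short sketch below. The averaging over a batch and the clipping constant `eps` are assumptions added for numerical safety and are not taken from the text.

```python
import math


def bce_loss(y_true, y_pred, eps=1e-7):
    """Mean binary cross entropy; y_true holds 0/1 tags, y_pred holds probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)   # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)
```

A confident correct prediction yields a loss close to zero, while a confident wrong prediction is penalized heavily, which is what drives the classifier to route scenes to the correct decoder.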
3.6.3 Total Loss
Finally, we need to define a rule that combines these losses to train the model efficiently. As is evident from the structure of the model, we need to detect high-density and sparse inputs and pass them to the corresponding DAD correctly. Therefore, we need to penalize the model whenever it fails to detect the density level of the input scene, and we combine the different losses accordingly.
By adding the classification loss, we are able to overcome the mis-classification of the input scene. With the high-density and sparse regression losses, the model can learn the dense and sparse areas within an input image precisely.
In this section, we evaluate the performance of our proposed approach. We first introduce the evaluation metrics and then report experimental results obtained on benchmark datasets. The experiments are conducted on four benchmark datasets, and the results are compared with recently published state-of-the-art approaches. We then perform a detailed ablation study.
4.1 Evaluation Metrics
We use the mean absolute error (MAE) and the mean squared error (MSE) as evaluation metrics. Both are computed over all test images from the exact number of people inside the ROI of each image and the estimated number of people. In the benchmark datasets discussed below, the ROI is the whole image except when explicitly stated otherwise. Note that the number of people can be calculated by summation over the pixels of the ground truth, as defined in Section 3.6.1, and of the predicted density maps. We followed the same methodology as previous work to prepare the ground truth density data.
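The two metrics can be sketched as follows. The exact formulas are not recoverable from the text, so this sketch assumes the convention that is common in the crowd counting literature, in which the quantity reported as MSE is the square root of the mean squared count error. The helper `count_from_density` simply sums the pixels of a density map.

```python
import math


def count_from_density(dm):
    """Crowd count as the sum over all pixels of a density map (H x W nested lists)."""
    return sum(map(sum, dm))


def mae(gt_counts, pred_counts):
    """Mean absolute error between ground-truth and estimated counts."""
    return sum(abs(g - p) for g, p in zip(gt_counts, pred_counts)) / len(gt_counts)


def mse(gt_counts, pred_counts):
    """Root of the mean squared count error (assumed convention for 'MSE')."""
    n = len(gt_counts)
    return math.sqrt(sum((g - p) ** 2 for g, p in zip(gt_counts, pred_counts)) / n)
```

For example, with ground-truth counts 10, 20, 30 and estimates 12, 18, 33, the absolute errors are 2, 2 and 3, so the MAE is 7/3 and the MSE under this convention is the square root of 17/3.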
4.2 Data Augmentation
We use data augmentation to avoid the risk of overfitting to the small number of training images. We use five types of cropping along with resizing as data augmentations. Each crop covers a fixed fraction of the original image dimension. The first four crops are non-overlapping patches anchored at the four corners of the original image, and the fifth crop is taken at a random position in the input scene. For resizing, we resize the input image to one of two fixed dimensions, depending on the scale of the input data.
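The five crop windows can be computed as in the sketch below. The crop fraction is not recoverable from the text; `frac=0.5` per side is a hypothetical default, chosen because it is the largest value for which the four corner patches do not overlap. The function name and the `(left, top, right, bottom)` box format are illustrative.

```python
import random


def five_crops(width, height, frac=0.5, rng=None):
    """Return five (left, top, right, bottom) boxes: four corners plus one random crop."""
    rng = rng or random.Random()
    cw, ch = int(width * frac), int(height * frac)
    # Top-left corners of the four corner crops.
    origins = [(0, 0), (width - cw, 0), (0, height - ch), (width - cw, height - ch)]
    # Fifth crop: uniformly random position that keeps the box inside the image.
    origins.append((rng.randint(0, width - cw), rng.randint(0, height - ch)))
    return [(left, top, left + cw, top + ch) for left, top in origins]
```

The same boxes have to be applied to the image and to its ground-truth density map, so that the count inside each crop stays consistent with the cropped image.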
4.3 Experimental Results on the ShanghaiTech Dataset
The ShanghaiTech dataset is one of the most popular large-scale crowd counting datasets; it contains 1,198 annotated images with a total of 330,165 people. It consists of two parts: Part A (ShanghaiTech-A), with 482 images randomly collected from the Internet, and Part B (ShanghaiTech-B), with 716 images taken in urban areas of Shanghai. Each part is divided into two subsets for training and testing. Because of the diversity of scenarios and the variation in congestion, it is difficult to estimate the number of pedestrians precisely.
We use the KNN method to calculate the average distance between each head and its three nearest heads, and is set to . For Part B, we set a fixed value for . We compare our method with state-of-the-art methods recently published on this dataset.
The quantitative results for ShanghaiTech-A are listed in Table I. Our PDANet achieves an MAE of and an MSE of in this experiment. Our proposed method also exhibits significant advantages over many top-ranked methods such as CSRNet, SANet, ADCrowdNet, HA-CCN, and SPN.
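A minimal sketch of the KNN step follows, assuming the geometry-adaptive kernel convention in which the Gaussian spread for each head is a scaling factor times the mean distance to its k nearest annotated heads. The name `beta` for that factor, the function name, and the `fallback` argument for images with a single annotation are assumptions made for illustration; no value for the factor is implied.

```python
import math


def adaptive_sigmas(heads, beta, k=3, fallback=None):
    """Return one Gaussian sigma per head from its k nearest neighbours.

    heads: list of (x, y) head positions.
    beta:  scaling factor applied to the mean neighbour distance (assumed name).
    fallback: sigma to use when a head has no neighbour at all.
    """
    sigmas = []
    for i, head in enumerate(heads):
        # Brute-force distances to all other heads, nearest first.
        dists = sorted(
            math.dist(head, other) for j, other in enumerate(heads) if j != i
        )
        nearest = dists[:k]
        if not nearest:
            if fallback is None:
                raise ValueError("single head: provide a fallback sigma")
            sigmas.append(fallback)
        else:
            sigmas.append(beta * sum(nearest) / len(nearest))
    return sigmas
```

The brute-force search is quadratic in the number of heads; for very dense images a spatial index such as a k-d tree would be the usual replacement. A fixed-kernel setting corresponds to using the same sigma for every head instead of calling this function.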
|Method||MAE||MSE|
|PACNN+CSRNet||62.4||102.0|
Table II presents the results of our PDANet on the ShanghaiTech-B dataset, which is less crowded than ShanghaiTech-A. The experimental results show that our method outperforms the existing best-performing approaches. Our proposed PDANet achieves an MAE of and an MSE of , both of which are better than the state-of-the-art results.
These results suggest that our proposed PDANet copes with both sparse and dense scenes, thanks to the combination of the pyramid module (Section 3.3) and the two-step DAD (Section 3.5). As a result, the model can distinguish the crowd level of the input scene and analyze the crowd accordingly for better estimation.
|Method||MAE||MSE|
|PACNN||8.9||13.5|
4.4 Experimental Results on the WorldExpo10 Dataset
The WorldExpo10 dataset is another large-scale crowd counting benchmark. It was built from video clips captured by surveillance cameras during the Shanghai WorldExpo 2010. We follow the standard procedure and take annotated images from 103 scenes as the training set and the remaining frames ( images) from the remaining scenes as the test set. We prune the crowd density map of the last layer with the Regions of Interest (RoI) at both training and test time.
Table III summarizes the prediction results of our PDANet compared with twenty state-of-the-art methods, reporting the MAE for five different scenes. The best-performing state-of-the-art methods are CAN, ADCrowdNet, and PACNN, each with an average MAE below 8. However, as shown in the table, our proposed PDANet achieves an average MAE of , which surpasses the state-of-the-art results with a margin of over the result achieved by CAN. Furthermore, our PDANet yields the lowest MAE in out of all scenes, with an MAE equal to (), respectively. Overall, the performance of our PDANet across the various scenes is superior to that of the state-of-the-art approaches.
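One plausible reading of the RoI step is a masking operation: density outside the region of interest is set to zero before the count (and, during training, the loss) is computed. The sketch below illustrates that reading only; the binary-mask representation and the function names are assumptions, not a description of the actual implementation.

```python
def mask_density(density_map, roi_mask):
    """Zero out density outside the RoI.

    Both arguments are 2-D lists of the same shape; roi_mask holds 1 inside
    the region of interest and 0 outside (assumed representation).
    """
    if len(density_map) != len(roi_mask):
        raise ValueError("density map and mask must have the same height")
    masked = []
    for d_row, m_row in zip(density_map, roi_mask):
        if len(d_row) != len(m_row):
            raise ValueError("density map and mask must have the same width")
        masked.append([d * m for d, m in zip(d_row, m_row)])
    return masked


def roi_count(density_map, roi_mask):
    """Count people inside the RoI by summing the masked density map."""
    return sum(sum(row) for row in mask_density(density_map, roi_mask))
```

Applying the same mask to the ground truth and to the prediction keeps the two counts comparable.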
4.5 Experimental Results on the UCF Dataset
UCF CC 50 is one of the most challenging datasets in crowd counting research due to its limited number of training images and the significant variation in the number of people across images (from 94 to 4,543). The standard procedure for training and evaluating models on this small dataset is 5-fold cross-validation. We use a setting similar to that of ShanghaiTech-A to generate the ground truth density maps.
We present the results achieved on this dataset in Table IV. The table shows that our PDANet outperforms the state-of-the-art models by a significant margin. We achieve an MAE of with an MSE of , which is about 28 percent better than SPN+L2SM, the best competing model. In our experiments, we observed that our PDANet estimates the number of people accurately in all subsets. We explore these results in detail in the ablation study (Section 5).
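The 5-fold protocol can be sketched as follows. The sketch assumes the folds are formed by shuffling image indices with a fixed seed and dealing them into five equal groups; the actual fold assignment of the standard protocol is not specified here, and the function name and seed are hypothetical.

```python
import random


def k_fold_splits(num_items, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k folds.

    Assumption: folds come from a seeded shuffle, which may differ from the
    fold assignment used in published results.
    """
    indices = list(range(num_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # deal indices round-robin
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

With 50 images, each round trains on 40 images and tests on the remaining 10, and the reported metric is the average over the five rounds.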
Overall, it can be concluded that our proposed PDANet can work well on both the sparse and dense scenarios.
|Method||MAE||MSE|
|PACNN||267.9||357.8|
4.6 Experimental Results on the UCSD Dataset
The UCSD dataset is the last dataset on which we conduct experiments. It contains 2,000 annotated frames of pedestrians on a walkway, captured by a CCTV camera. The dataset comes with ROIs, and most existing crowd counting approaches report results based on them. In our experiment, we used Frames 601 through 1400 for training and the remaining frames of the 2,000 for testing. Table V shows the MAE and MSE results obtained on this dataset in comparison with other state-of-the-art approaches. Compared with 16 approaches, our PDANet is the second best on this dataset with an MAE of and MSE=, which is very close to the result of PACNN.
|Method||MAE||MSE|
|Density Learning||1.70||1.28|
|Learning to Count||1.70||2.16|
|Count Forest||1.43||1.30|
|Arteta et al.||1.24||1.31|
|Zhang et al.||1.70||1.2|
|Bidirectional ConvLSTM||1.13||1.43|
5 Ablation Study
To further demonstrate the effectiveness of each component of our PDANet model, we conduct a series of ablation studies.
In this section, we first visualize some of the results achieved, and then examine individual model components and discuss their outputs to analyze the effectiveness of each component. The ablation studies were conducted on the UCF CC 50 and ShanghaiTech datasets.
5.1 Density Map Visualization
Qualitatively, we visualize the density maps generated by our proposed PDANet on the ShanghaiTech Part A dataset in comparison with the original ground truth (GT) in Fig. 6. The figure shows three input scenes with the crowd count varying from 2,240 to 1,358, their corresponding GT, and their overall, dense, and sparse estimated density maps (Est) in Fig. 6 (a) to (e).
As illustrated, the estimated counts and the actual ground truth are close to each other, and the model performs well at various levels of crowdedness. For instance, for the third row, the ground truth is 2,244 and the prediction is 2,161, which is a reasonable estimate for such a highly crowded input scene. Looking further into the results for the dense and sparse scenes, we can conclude that they provide useful information for the overall density map estimation.
5.2 Effect of the PFE Module
In the first experiment, we compare two different procedures for producing the PFE with BaselineAD (i.e., PDANet without the PFE module). The first is the combination of three GAPs proposed in Section 3.3 (GAP3), and the other uses six GAPs with the corresponding size of (GAP6). Table VI presents the MAE and MSE for the different parts of the UCF dataset. As shown in this table, our PDANet (GAP3) outperforms both PDANet (GAP6) and BaselineAD. BaselineAD already provides better results than most of the state-of-the-art methods, and the PFE modules improve the crowd counting results further. Between the two PFE modules, the GAP3 PFE provides better crowd level predictions in Part0, Part3, Part4, and on average. However, the difference between them is only about 2 in the MAE metric.
We have also tested the effect of different numbers of pyramid GAPs on the ShanghaiTech dataset, and the test results are shown in Table VII. The results again show that the proposed PDANet (GAP3) outperforms BaselineAD and PDANet (GAP6).
In summary, the PFE with three GAPs works better for crowd counting. To better understand this, let us look at the difference between the two PFEs. In the PFE with three GAPs, we captured feature maps based on 1/5, 1/10, and 1/15 of the original feature maps, whereas the other PFE considers feature maps with sizes of 1 to 6. We believe that the scales used in PDANet (GAP3) yield output feature maps with more accurate scale information than those of PDANet (GAP6).
5.3 Effect of Attention Module
To gain insight into the effect of the Attention Module, we performed an ablation study to demonstrate the contribution of the module to the performance of the proposed model. We compared the performance of our design choices with the baseline with the PFE and DAD modules. Tables VIII and IX present the results for the UCF CC 50 and ShanghaiTech datasets. Part0 of the UCF CC 50 dataset shows the greatest improvement in terms of MAE/MSE, whereas the improvement for Part1 to Part4 was small. As shown in Table IX, we achieved more or less the same improvement in crowd counting on ShanghaiTech by adopting the attention module.
Overall, we used the attention module to localize the crowd area and improve the performance of our model. As shown in these tables, we achieved this goal by combining spatial/channel-based attention with attention based on sparse and dense crowded areas. These results demonstrate the usefulness of the attention module for improving the accuracy of the crowd counting model. However, by comparing the results in Tables VI, VII, VIII, and IX, we can conclude that the attention module contributes less to the performance improvement than the pyramid module.
5.4 Effect of Classification and DAD Modules
To address the variation in density within and between different input images, we have proposed a DAD module. In this section, we aim to understand the contribution of this module to the overall performance improvement. As in the previous sections, we compare the results of our PDANet with DAD and without DAD (passing the data to only one branch) on both the UCF CC 50 and ShanghaiTech datasets.
Tables X and XI show the experimental results for UCF and ShanghaiTech datasets, respectively. As seen from Table X, we were able to boost the accuracy of crowd counting by around 20 percent for the UCF dataset in all subsets. With the ShanghaiTech dataset, we achieved a noticeable improvement in accuracy with the help of the DAD module.
These results support our initial idea of processing the sparse and dense crowded feature maps separately. We believe that the DAD module helps the PDANet generate a proper density map for both highly and lightly crowded areas in the images and, at the same time, guides the proposed model to react to differences in crowdedness between input images.
In this work, we have introduced a novel deep architecture called the Pyramid Density-Aware Attention-based network (PDANet) for crowd counting. The PDANet incorporates pyramid features and attention modules with a density-aware decoder to address the huge density variation within crowded scenes. The proposed PDANet uses a classification module to pass the pyramid features to the most suitable decoder branch, providing more accurate crowd counting with two-scale density maps. To aggregate these density maps, we use the sigmoid function to produce a gating mask that yields the final density map. Extensive experiments on various benchmark datasets have demonstrated the performance of our PDANet in terms of robustness, accuracy, and generalization. Our approach achieved superior performance compared with the state-of-the-art results on three challenging crowd counting datasets (ShanghaiTech, UCF CC 50, and WorldExpo10), especially on UCF CC 50, with an improvement of more than 25 percent on all evaluation metrics.
- (2018) A-ccnn: adaptive ccnn for density estimation and crowd counting. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 948–952.
- (2014) Interactive object counting. In Proceedings of the ECCV, pp. 504–518.
- (2018) Divide and grow: capturing huge diversity in crowd images with incrementally growing cnn. pp. 3618–3626.
- (2018) Scale aggregation network for accurate and efficient crowd counting. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750.
- (2008) Privacy preserving crowd monitoring: counting people without people models or tracking. In Proceedings of the CVPR, pp. 1–7.
- (2009) Bayesian poisson regression for crowd counting. In 2009 IEEE 12th international conference on computer vision, pp. 545–551.
- (2019) Scale pyramid network for crowd counting. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1941–1950.
- (2019) Scale pyramid network for crowd counting. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1941–1950.
- (2018) An aggregated multicolumn dilated convolution network for perspective-free counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 195–204.
- (2012) Learning to count with regression forest and structured labels. In Proceedings of the ICPR, Vol. 21, pp. 2685–2688.
- (2018) Squeeze-and-excitation networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 7132–7141.
- (2017) Body structure aware deep crowd counting. IEEE Transactions on Image Processing 27 (3), pp. 1049–1059.
- (2013) Multi-source multi-scale counting in extremely dense crowd images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2547–2554.
- (2018) Learn to pay attention.
- (2019) Crowd counting and density estimation by trellis encoder-decoder networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6133–6142.
- (2018) Crowd counting by adaptively fusing predictions from an image pyramid. arXiv preprint arXiv:1805.06115.
- (2014) Fully convolutional neural networks for crowd segmentation. arXiv preprint arXiv:1411.4464.
- (2010) Learning to count objects in images. In Proceedings of the NIPS, pp. 1324–1332.
- (2018) Structured inhomogeneous density map learning for crowd counting. arXiv preprint arXiv:1801.06642.
- (2018) Csrnet: dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1091–1100.
- (2018) Decidenet: counting varying density crowds through attention guided detection and density estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5197–5206.
- (2019) Performance-enhancing network pruning for crowd counting. Neurocomputing.
- (2019) DENet: a universal network for counting crowd with varying densities and scales. arXiv preprint arXiv:1904.08056.
- (2018) Crowd counting using deep recurrent spatial-aware network. arXiv preprint arXiv:1807.00601.
- (2019) ADCrowdNet: an attention-injective deformable convolutional network for crowd understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3225–3234.
- (2019) Context-aware crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5099–5108.
- (2016) Towards perspective-free object counting with deep learning. In Proceedings of the ECCV, pp. 615–629.
- (2015) Count forest: co-voting uncertain number of targets using random forest for crowd density estimation. In Proceedings of the ICCV, pp. 3253–3261.
- (2018) Iterative crowd counting. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 270–285.
- (2018) Concurrent spatial and channel ‘squeeze & excitation’ in fully convolutional networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 421–429.
- (2017) Switching convolutional neural network for crowd counting. In CVPR, Vol. 1/3, pp. 6.
- (2018) Attention-gated networks for improving ultrasound scan plane detection.
- (2015) Deeply learned attributes for crowded scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4657–4666.
- (2018) Crowd counting via adversarial cross-scale consistency pursuit. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5245–5254.
- (2019) Revisiting perspective information for efficient crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7279–7288.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
- (2017) Generating high-quality crowd density maps using contextual pyramid cnns. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1879–1888.
- (2019) HA-ccn: hierarchical attention-based crowd counting network. IEEE Transactions on Image Processing.
- (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9.
- (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826.
- (2019) Scale-aware attention network for crowd counting. arXiv preprint arXiv:1901.06026.
- (2019) Residual regression with semantic prior for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4036–4045.
- (2019) Learning from synthetic data for crowd counting in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8198–8207.
- (2019) Adaptive scenario discovery for crowd counting. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2382–2386.
- (2017) Spatiotemporal modeling for crowd counting in videos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5151–5159.
- (2019) Learn to scale: generating multipolar normalized density map for crowd counting. arXiv preprint arXiv:1907.12428.
- (2016) Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer.
- (2015) Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of the CVPR, pp. 833–841.
- (2019) Nonlinear regression via deep negative correlation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- (2016) Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the CVPR, pp. 589–597.
- (2017) Pyramid scene parsing network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2881–2890.
- (2012) Understanding collective crowd behaviors: learning a mixture model of dynamic pedestrian-agents. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2871–2878.