1 Introduction
The problem of crowd counting is described in [16]. Unlike visual object detection, it is infeasible to provide bounding boxes for all pedestrians due to the extremely dense crowds. On the other hand, when only the total crowd count of each image is provided, the training process becomes notably difficult since spatial awareness is completely ignored. Therefore, to preserve as many spatial constraints as possible while reducing annotation cost, the previous work [16] annotates only the center points of heads and utilizes a Gaussian distribution to generate ground truth density maps. This annotation scheme has been widely adopted by subsequent studies.
Existing crowd counting approaches mainly focus on improving the scale invariance of feature representations, including multi-column networks [13, 38, 39, 42, 52, 6], scale aggregation modules [3, 47], and scale-invariant networks [9, 17, 20, 39, 45]. Although the architectures of these methods differ, the $\ell_2$ loss function is employed by most of them. As a result, the spatial awareness in crowd images is largely ignored, even though more scale information is embedded into their features.
We have examined three state-of-the-art approaches (i.e., MCNN [52], CSRNet [17], and SANet [3]) on four crowd counting datasets (i.e., ShanghaiTech [52], UCF_CC_50 [11], WorldExpo’10 [48], and UCSD [4]). Two examples are shown in Figure 1. Similar to [3, 19, 20], we observe that dense-crowd regions are usually underestimated, while sparse-crowd regions are overestimated. This phenomenon is due to two main factors. First, the pixel-wise $\ell_2$ loss struggles to retain the high-frequency variation in the density map: minimizing $\ell_2$ loss encourages finding pixel-wise averages of plausible solutions, which are typically overly smooth and thus have poor spatial awareness [15]. Second, the $\ell_2$ loss is highly sensitive to typical noises in crowd counting, including zero-mean noise, head size changes, and head occlusions. A simple count shows that the co-occurrence of zero-mean noise and overestimation reaches 96% (6,776 out of 7,044 testing images). We further find that almost all estimated density maps inaccurately predict head positions or sizes when occlusion occurs, which can result in underestimation in high-density areas. Moreover, the generated ground truth density can itself be imprecise due to annotation error and the fixed variance of the Gaussian kernel. The corresponding improvements of our method are illustrated in Figure 5.
To fully utilize the spatial awareness, the previous work [16] proposes a loss named Maximum Excess over Sub-Arrays (MESA) to handle the above problems. Generally speaking, the MESA loss attempts to find the rectangular subregion whose predicted density map has the maximum difference from the ground truth. It directly optimizes the count of this subregion instead of the pixel-level density. Since the set of subregions could include the full image, the MESA loss is an upper bound for the count estimation of the entire image. Besides, this loss is only sensitive to the spatial layout of pedestrians and is robust to various noises. However, the complexity of the MESA loss function is extremely high. [16] utilizes cutting-plane optimization to obtain an approximate solution. Since this method cannot be solved by conventional gradient descent, the MESA loss has not been employed in any existing CNN-based approach.
Motivated by the MESA loss, in this paper we present a novel deep architecture called SPatial Awareness Network (SPANet) to retain the high-frequency spatial variations of density. Instead of finding the mismatched rectangular subregion as in MESA, a Maximum Excess over Pixels (MEP) loss is proposed to optimize the pixel-level subregion with high discrepancy from the ground truth density map. To obtain such a pixel-level subregion, weakly-supervised ranking information [23] is exploited to generate a mask indicating the pixels with high discrepancies. We further devise a multi-branch architecture that leverages the full image for discrepancy detection by imitating salience region detection [33, 50, 54], where patches with increasing areas are used for ranking. The proposed framework can be easily integrated into existing CNN-based methods and is end-to-end trainable.
The main contribution of this work is the proposed Spatial Awareness Network with the Maximum Excess over Pixels loss for crowd counting. The solution also provides elegant views on what kind of spatial context should be exploited and how to effectively utilize such spatial awareness in crowd images, problems not yet fully understood in the literature.
2 Related Work
2.1 Detection-based Methods
The methods in this category use object detectors to locate people in images. Given the individual localization of each person, crowd counting becomes trivial. There are two directions in this line, i.e., detection on 1) whole pedestrians [2, 7, 53] and 2) parts of pedestrians [8, 12, 18, 43]. Typically, local features [7, 18] are first extracted and then exploited to train various detectors (e.g., SVM [18] and AdaBoost [41]). Though spatial information is well learned in these methods, they are not applicable in challenging situations, such as highly congested crowds.
2.2 Regression-based Methods
Different from detection-based methods, regression-based approaches avoid the hard detection problem and estimate crowd counts from image features. Earlier methods [4, 5, 11, 28] usually predict the counts directly from the features, which leads to poor performance since the spatial awareness is completely ignored. Later methods estimate a density map for counting [16, 26, 29], where the crowd count is obtained by integrating all pixel values over the density map. Though learning the density map provides some spatial information, these models still have difficulty preserving the high-frequency variation in the density map.
2.3 CNN-based Methods
Deep CNN based crowd counting methods have shown strong performance improvements over their shallow learning counterparts. Existing methods mainly focus on coping with the large variation in pedestrian scales, where many multi-column networks have been extensively studied. A dual-column network is proposed by [1] to combine shallow and deep layers for estimating the count. Inspired by this work, the well-known three-column network MCNN is proposed by [52], which employs different filters on separate columns to obtain features at various scales. Many works have improved MCNN [13, 38, 39, 42] to further enhance scale adaptation. Sam et al. [32] introduce a switching structure, which uses a classifier to assign input image patches to appropriate columns. Recently, Liu et al. [19] propose a multi-column network to simultaneously estimate crowd density with detection- and regression-based models. Ranjan et al. [27] utilize a two-column network to iteratively train their model with images of different resolutions. There are many other attempts to further improve scale invariance, including 1) the fusion of various scale information [22, 40, 45, 46], 2) multi-blob based scale aggregation networks [3, 47], 3) the design of scale-invariant convolutional or pooling layers [9, 17, 20, 39, 45], and 4) automated scale-adaptive networks [30, 31, 49]. Typically, Li et al. [17] propose CSRNet, which exploits dilated convolutional layers to enlarge receptive fields for boosting performance. Cao et al. [3] propose SANet to aggregate multi-scale features for more accurate crowd counts. These two approaches have achieved state-of-the-art performance. Additionally, there also exist studies devoted to the utilization of perspective maps [35], geometric constraints [21, 51], and regions of interest (ROI) [20] to improve counting accuracy.
The aforementioned methods utilize the Euclidean distance, i.e., the $\ell_2$ loss, to optimize the model. Although these methods can obtain scale-invariant features, their performance is still unsatisfactory since the spatial awareness is largely ignored. Note that SANet [3] also tries to alleviate the problem of the $\ell_2$ loss and adds a local pattern consistency loss ($L_C$) in the training phase. However, we find that $L_C$ still cannot learn the spatial context well. In our experiments, when integrating our MEP loss ($L_{MEP}$) into SANet, we achieve significant performance improvement. Our proposed MEP loss can fully utilize the spatial awareness, which is a key factor for the task of crowd counting.
3 Our Method
In this section, we first review the problem of crowd counting and two loss functions (i.e., the MESA loss and the $\ell_2$ loss). Then we present the proposed SPANet and MEP loss in detail. It is worth noting that our method can be directly applied to all CNN-based crowd counting networks.
3.1 Problem Formulation
Recent technologies define the crowd counting task as a density regression problem [3, 16, 52]. Given $N$ images $\{I_i\}_{i=1}^{N}$ as the training set, each image $I$ is annotated with a total of $M$ center points of pedestrians’ heads $\{P_1, \ldots, P_M\}$. Typically, the ground truth density map $D^{gt}$ for each pixel $p$ in image $I$ is defined as

$D^{gt}(p) = \sum_{j=1}^{M} \mathcal{N}(p; P_j, \sigma^2),$    (1)

where $\mathcal{N}(p; P_j, \sigma^2)$ is a Gaussian distribution centered at $P_j$ with variance $\sigma^2$. The number of people in image $I$ is equal to the sum of density values over all pixels, i.e., $C = \sum_{p} D^{gt}(p)$. With these training data, the aim of the crowd counting task is to learn a predicted density map $D^{est}$ towards the ground truth density map $D^{gt}$.
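Under this formulation each annotated head contributes a unit-mass Gaussian, so the density map integrates to the crowd count. A minimal NumPy sketch (the kernel width `sigma` is an assumed fixed value; datasets differ on this choice):

```python
import numpy as np

def density_map(shape, points, sigma=4.0):
    """Ground-truth density map: one normalized 2-D Gaussian per head point."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    D = np.zeros(shape, dtype=np.float64)
    for (px, py) in points:  # (x, y) head annotations
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2 * sigma ** 2))
        D += g / g.sum()  # normalize so each head contributes exactly 1
    return D
```

Because each kernel is renormalized, the map sums to the number of annotated heads even when points lie near the image border.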
MESA loss. To make use of the spatial awareness in annotations (i.e., the center head positions $\{P_j\}$), the previous work [16] has proposed the Maximum Excess over Sub-Arrays (MESA) loss as follows,

$L_{MESA} = \max_{B \in \mathcal{B}} \Big| \sum_{p \in B} D^{gt}(p) - \sum_{p \in B} D^{est}(p) \Big|,$    (2)

where $\mathcal{B}$ is the set of all potential rectangular subregions in the image. As illustrated in Figure 2, the MESA loss tries to find the box subregion whose predicted density map has the maximum difference from the ground truth. It can be treated as an upper bound for the count estimation of the entire image, as $\mathcal{B}$ could include the full image. Besides, this loss is directly related to the counting objective instead of the pixel-level density, and is only sensitive to the spatial layout of pedestrians. In the 1D case, the Kolmogorov-Smirnov distance [24] can be seen as a special case of $L_{MESA}$.
Despite the above merits, it is difficult to optimize the MESA loss due to the hard process of finding such a subregion. One would have to traverse all potential subregions, which is computationally infeasible in practical applications. To address this, the previous approach [16] converts the optimization of the MESA loss into a convex quadratic program with limited constraints and utilizes cutting-plane optimization to obtain an approximate solution. However, since this method cannot be solved by traditional gradient descent, the MESA loss has not been exploited in any existing CNN-based crowd counting method.
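To make the complexity concrete, an exhaustive evaluation of the MESA objective can be sketched with a summed-area table; even with that speed-up it enumerates O(H²W²) rectangles, which illustrates why [16] resorts to cutting-plane optimization rather than direct search:

```python
import numpy as np

def mesa_loss(d_est, d_gt):
    """Brute-force MESA: max |count difference| over every axis-aligned
    rectangular subregion. Only feasible for tiny maps."""
    diff = d_gt - d_est
    # summed-area table with a zero border row/column
    S = np.zeros((diff.shape[0] + 1, diff.shape[1] + 1))
    S[1:, 1:] = diff.cumsum(0).cumsum(1)
    best = 0.0
    H, W = diff.shape
    for y0 in range(H):
        for y1 in range(y0 + 1, H + 1):
            for x0 in range(W):
                for x1 in range(x0 + 1, W + 1):
                    # count difference inside the rectangle [y0:y1, x0:x1]
                    box = S[y1, x1] - S[y0, x1] - S[y1, x0] + S[y0, x0]
                    best = max(best, abs(box))
    return best
```

This is a reference implementation for intuition only; it is not differentiable and does not scale to realistic density-map sizes.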
$\ell_2$ loss. To facilitate the computation in deep frameworks, existing CNN-based methods [17, 27, 52] all directly use the $\ell_2$ loss to minimize the difference between the estimated and ground truth density maps,

$L_2 = \frac{1}{2} \sum_{p} \big( D^{est}(p) - D^{gt}(p) \big)^2.$    (3)
However, as discussed in Sec. 1, the $\ell_2$ loss can hardly retain the high-frequency variation in the density map, leading to poor spatial awareness. Furthermore, it is also highly sensitive to typical noises in crowd counting, including zero-mean noise, head size changes, and head occlusions. For example, existing methods always overestimate the density value in low-density areas and underestimate it in high-density regions.
3.2 Spatial Awareness Network
The proposed Spatial Awareness Network (SPANet) aims to leverage the spatial context for accurately predicting the density values. Instead of searching for the mismatched rectangular subregion as in the MESA loss, which is the main obstacle for optimization, we try to find the pixel-level subregion with high discrepancy from the ground truth density map. Since there is no annotation of such a region, this problem is unsupervised and still significantly difficult to solve. Inspired by the recent weakly-supervised method [23], we exploit an obvious ranking relation to achieve this, i.e., a patch of a crowded scene image is guaranteed to contain the same number of or fewer persons than the original image. By sampling a pair of patches (where one is the sub-patch of the other), the network is optimized with the ranking objective and outputs a new density map, which is in turn utilized, together with the previous one, to produce the subregion with high discrepancy. We further devise a multi-branch architecture to leverage the full image by sampling multiple pairs of patches. Note that the whole SPANet can be trained end-to-end.
Figure 3 illustrates the framework of our proposed SPANet. Input images are first fed into the backbone network to generate the predicted density maps $D^{est}$. The desired pixel-level subregion generation, i.e., $S^k$, is conducted by branch $k$ using a pair of patches sampled from the density map. To leverage the full image for discrepancy detection, a multi-branch architecture with $K$ branches is devised to produce multiple subregions by imitating salience region detection [50, 54]. Finally, the subregions $S^k$ ($k = 1, \ldots, K$) are combined to produce the final mask $M$, which is then exploited to compute our proposed Maximum Excess over Pixels (MEP) loss. We present these three submodules in detail below.
Pixel-level Subregion Generation. The subregion indicates the area with high density discrepancy from the ground truth. Unfortunately, directly subtracting the predicted density map from the ground truth would make the problem circular – the bias is usually large enough to prevent it from providing an accurate region. Consequently, we instead look for the region with high changes over the course of network training. It is natural to pick two density maps of the same image from different iterations. However, the obtained area only reflects the region that has already been “revised”, which still seriously suffers from the poor spatial perception of the original $\ell_2$ loss. To this end, we exploit weakly supervised ranking clues to produce the subregion. Instead of considering the pixel-level density, the ranking clue is directly related to the comparison of crowd counts.
In each branch $k$, two parallel image patches are first sampled. As the feature maps of deep convolutional layers already contain rich location information, we treat the sampling process as a mask pooling operation on the density map. The strategy of selecting patches is described later. Without loss of generality, suppose the two masks $M_1^k$ and $M_2^k$ are 2-dimensional matrices with entries 0 or 1 (1 indicates the patch area), and $M_2^k$ is the sub-patch of $M_1^k$. The crowd counts $C_1^k$ and $C_2^k$ under the masks $M_1^k$ and $M_2^k$ are obtained by integrating the values of the density map over the individual masks, implemented as mask pooling as follows,

$C_1^k = \sum_{p} \big( M_1^k \odot D^{est} \big)(p), \qquad C_2^k = \sum_{p} \big( M_2^k \odot D^{est} \big)(p),$    (4)

where $\odot$ is the element-wise product, and $p$ indicates a pixel on the density map $D^{est}$. It is worth noting that we utilize the same predicted density map when calculating the counts for the two masks, rather than generating individual maps at two consecutive iterations. The reason is that the density map is not restricted to be positive, thus pooling the pair of patches from the same map can also provide the ranking information. We have conducted an experiment showing that the two schemes yield similar results. Besides, directly pooling on the same map is more efficient.
With the assumption that $M_2^k$ is the sub-patch of $M_1^k$, the explicit constraint is that the number of people in $M_2^k$ is no more than that in $M_1^k$. Therefore, we employ a pairwise ranking hinge loss to model this relationship, which is formulated as

$L_{rank}^k = \max\big(0,\; C_2^k - C_1^k + \epsilon\big),$    (5)

where $\epsilon$ is a margin value that is set to the upper bound of the difference in the ground truth. The gradient of the ranking loss with respect to the network parameters $\theta$ is calculated as

$\frac{\partial L_{rank}^k}{\partial \theta} = \begin{cases} 0, & C_2^k - C_1^k + \epsilon \le 0, \\ \dfrac{\partial C_2^k}{\partial \theta} - \dfrac{\partial C_1^k}{\partial \theta}, & \text{otherwise}. \end{cases}$    (6)
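The mask pooling of Eq. (4) and the ranking constraint of Eq. (5) can be sketched as follows; the function names and the default margin of 0 are illustrative choices, not the paper's:

```python
import numpy as np

def mask_pool(d_est, mask):
    """Crowd count under a binary patch mask: sum of mask ⊙ density (Eq. (4))."""
    return float((mask * d_est).sum())

def ranking_hinge(d_est, m1, m2, margin=0.0):
    """Pairwise ranking hinge loss (Eq. (5)): since m2 is a sub-patch of m1,
    the count under m2 must not exceed the count under m1 (up to a margin)."""
    c1, c2 = mask_pool(d_est, m1), mask_pool(d_est, m2)
    return max(0.0, c2 - c1 + margin)
```

In a deep framework the same computation would be expressed with differentiable tensor ops so that Eq. (6)'s subgradient flows back through the masked sums.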
Once the network parameters $\theta$ are updated to $\theta'$ with $L_{rank}^k$ by backpropagation, the renewed density map estimated by the network is computed by

$\tilde{D}^{est} = F(I; \theta'),$    (7)

where $I$ is the input image, and $F(\cdot)$ refers to a forward pass of the network. Given the updated density map $\tilde{D}^{est}$ and the old one $D^{est}$, the desired subregion is obtained by thresholding the difference between them. To make it differentiable, we utilize a Sigmoid thresholding function, and $S^k$ is given by

$S^k = \frac{1}{1 + \exp\big(-\beta\,(|\tilde{D}^{est} - D^{est}| - T)\big)},$    (8)

where $T$ is a threshold matrix with all elements being the threshold $t$, and $\beta$ is a parameter ensuring that the value of $S^k$ is approximately equal to 1 where $|\tilde{D}^{est} - D^{est}| > t$, and 0 otherwise.
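The differentiable thresholding of Eq. (8) is a steep sigmoid applied to the change magnitude; the threshold and steepness used below are illustrative values, not the paper's settings:

```python
import numpy as np

def discrepancy_mask(d_new, d_old, t=0.1, beta=50.0):
    """Soft threshold of Eq. (8): pushes the mask towards 1 where the density
    changed by more than t between the two maps, towards 0 elsewhere."""
    return 1.0 / (1.0 + np.exp(-beta * (np.abs(d_new - d_old) - t)))
```

A larger `beta` makes the sigmoid approach a hard 0/1 threshold while keeping gradients defined everywhere.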
Multi-branch Architecture. Note that in the above section, only one pair of patches is sampled for generating the subregion. In principle, we hope that the full density map can be leveraged to provide more information. Instead of sampling only a single small-large pair of patches, which may involve large bias error due to the large difference between the two patches, we adopt a multi-branch architecture as shown in Figure 3. The bottom right corners of all patches are located at the same position, i.e., the bottom right corner of the density map. The area of the patch is gradually enlarged along with the branches, until it reaches the size of the full density map. Such a design guarantees both a small bias error in each branch and full utilization of the training images.
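The patch schedule just described can be sketched as follows; the linear growth of patch side lengths is an assumption, as the paper only specifies shared bottom-right corners and gradually enlarged areas:

```python
import numpy as np

def nested_masks(h, w, k_branches):
    """Binary patch masks for the multi-branch design: all patches share the
    bottom-right corner of the density map, with areas growing per branch
    until the last patch covers the full map."""
    masks = []
    for k in range(1, k_branches + 1):
        ph = max(1, round(h * k / k_branches))
        pw = max(1, round(w * k / k_branches))
        m = np.zeros((h, w))
        m[h - ph:, w - pw:] = 1.0  # anchored at the bottom-right corner
        masks.append(m)
    return masks
```

Each mask is a sub-patch of the next larger one, so consecutive branches can also supply the (sub-patch, patch) pairs needed by the ranking loss.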
To eliminate the influence of the detected subregion for better optimization in later branches, we imitate salience region detection [50] and erase the density values within $S^k$ in the next branch, which is formulated as

$D^{est}_{k+1} = D^{est}_{k} \odot \big( \mathbf{1} - S^k \big),$    (9)

where $\mathbf{1}$ is the matrix with all elements being 1, and $\odot$ is the element-wise product.
Maximum Excess over Pixels (MEP) loss. In the end, the subregions $S^k$ ($k = 1, \ldots, K$) are generated by the $K$ branches. The final desired pixel-level subregion $M$ is computed by simply combining them together as

$M(p) = \max_{k \in \{1, \ldots, K\}} S^k(p),$    (10)

where the maximum merges pixels with values close to 1 in all subregion masks $S^k$, rather than taking their direct summation. In practice, we take the maximum value at each pixel position over all masks. The final output $M$ is the mask that indicates the pixels which should be optimized. Based on that, our proposed MEP loss is then given by

$L_{MEP} = \Big| \sum_{p} M(p)\,\big( D^{gt}(p) - D^{est}(p) \big) \Big|.$    (11)
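The mask merging and the resulting objective can be sketched as below, assuming the masked count-excess form of Eq. (11) as reconstructed here (a pixel-level analogue of the MESA objective); implementation details may differ:

```python
import numpy as np

def mep_loss(d_est, d_gt, submasks):
    """MEP loss sketch: merge per-branch discrepancy masks by a pixel-wise
    maximum (Eq. (10)) and measure the count excess on that pixel-level
    subregion, mirroring MESA's rectangle excess (Eq. (11))."""
    M = np.max(np.stack(submasks), axis=0)  # pixel-wise max over branches
    return abs(float((M * (d_gt - d_est)).sum()))
```

Unlike MESA's exhaustive rectangle search, the subregion here is produced by the branches themselves, so the loss stays compatible with gradient descent.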
3.3 Model Learning
Our SPANet can be easily integrated into existing crowd counting methods, which is equivalent to adding a pooling layer with different masks on the final convolutional layer. It is trained by sequentially optimizing the $K$ ranking losses, the MEP loss, and the original loss of the existing method. When calculating the original loss, the mask pooling layer is removed. The overall training objective is formulated as

$L = L_{org} + \lambda_1 \sum_{k=1}^{K} L_{rank}^k + \lambda_2 L_{MEP},$    (12)

where $L_{org}$ refers to the original loss of the existing approach and $\lambda_1$, $\lambda_2$ are weighting factors. In most cases, $L_{org}$ is the $\ell_2$ loss. More details of the ground truth generation and data augmentation are described in the supplementary material.
Method | Venue & Year | ShanghaiTech A (MAE / MSE) | ShanghaiTech B (MAE / MSE) | UCF_CC_50 (MAE / MSE) | UCSD (MAE / MSE)
--- | --- | --- | --- | --- | ---
Idrees et al. [11] | CVPR 2013 | – | – | 419.5 / 541.6 | –
Zhang et al. [48] | CVPR 2015 | 181.8 / 277.7 | 32.0 / 49.8 | 467.0 / 498.5 | 1.60 / 3.31
CCNN [25] | ECCV 2016 | – | – | – | 1.51 / –
Hydra-2s [25] | ECCV 2016 | – | – | 333.7 / 425.3 | –
CMTL [38] | AVSS 2017 | 101.3 / 152.4 | 20.0 / 31.1 | 322.8 / 397.9 | –
Switch-CNN [32] | CVPR 2017 | 90.4 / 135.0 | 21.6 / 33.4 | 318.1 / 439.2 | 1.62 / 2.10
CP-CNN [39] | ICCV 2017 | 73.6 / 106.4 | 20.1 / 30.1 | 295.8 / 320.9 | –
Huang et al. [10] | TIP 2018 | – | 20.2 / 35.6 | 409.5 / 563.7 | 1.00 / 1.40
SaCNN [49] | WACV 2018 | 86.8 / 139.2 | 16.2 / 25.8 | 314.9 / 424.8 | –
ACSCP [34] | CVPR 2018 | 75.7 / 102.7 | 17.2 / 27.4 | 291.0 / 404.6 | –
IG-CNN [31] | CVPR 2018 | 72.5 / 118.2 | 13.6 / 21.1 | 291.4 / 349.4 | –
DeepNCL [36] | CVPR 2018 | 73.5 / 112.3 | 18.7 / 26.0 | 288.4 / 404.7 | –
MCNN [52] | CVPR 2016 | 110.2 / 173.2 | 26.4 / 41.3 | 377.6 / 509.1 | 1.07 / 1.35
CSRNet [17] | CVPR 2018 | 68.2 / 115.0 | 10.6 / 16.0 | 266.1 / 397.5 | 1.16 / 1.47
SANet [3] | ECCV 2018 | 67.0 / 104.5 | 8.4 / 13.6 | 258.4 / 334.9 | 1.02 / 1.29
MCNN+SPANet | – | 99.7 / 146.3 | 19.1 / 28.7 | 292.5 / 401.3 | 1.00 / 1.33
CSRNet+SPANet | – | 62.4 / 99.5 | 8.4 / 13.2 | 245.8 / 333.1 | 1.12 / 1.42
SANet+SPANet | – | 59.4 / 92.5 | 6.5 / 9.9 | 232.6 / 311.7 | 1.00 / 1.28
4 Experiment
4.1 Experiment Settings
Networks. We evaluate our method by combining it with three networks, i.e., MCNN [52], CSRNet [17], and SANet [3]. The implementations of MCNN (https://github.com/svishwa/crowdcount-mcnn) and CSRNet (https://github.com/leeyeehoo/CSRNet-pytorch/tree/master) are from others, while SANet is implemented by us. In general, there are four main differences between them: (1) Network size. MCNN, SANet, and CSRNet correspond to small, medium, and large crowd counting networks, respectively. (2) Architecture. MCNN and SANet are multi-column/multi-blob networks, while CSRNet is a single-column network. In addition, SANet uses Instance Normalization (IN) layers and deconvolutional layers, while CSRNet utilizes dilated convolutional layers. (3) Density map size. The density maps of MCNN and CSRNet are 1/4 and 1/8 of the original images, while SANet produces density maps of the same size as the input images. (4) Testing scheme. SANet is tested on image patches, while CSRNet and MCNN are tested on whole images.
Learning settings. For MCNN and SANet, the parameters are randomly initialized from a Gaussian distribution. The Adam optimizer [14] is used to train the model. For CSRNet, the first ten convolutional layers are from a pretrained VGG-16 [37]; the other layers are initialized in the same way as MCNN. Stochastic gradient descent (SGD) with a fixed learning rate is applied during training.
Datasets. We evaluate our method on four datasets, including ShanghaiTech [52], UCF_CC_50 [11], WorldExpo’10 [48], and UCSD [4]. Typically, ShanghaiTech Part A is congested and noisy, while ShanghaiTech Part B is noisy but not highly congested. UCF_CC_50 consists of extremely congested scenes with heavy background noise. WorldExpo’10 and UCSD contain sparse crowd scenes, with WorldExpo’10 noisier than UCSD.
Evaluation details. MCNN and CSRNet are tested on whole images, while SANet is tested on image patches. Following previous works [17, 27, 52], Mean Absolute Error (MAE) and Mean Square Error (MSE) are used to evaluate the performance:

$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \big| C_i - C_i^{gt} \big|, \qquad \mathrm{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \big( C_i - C_i^{gt} \big)^2},$    (13)

where $C_i$ is the estimated crowd count, $C_i^{gt}$ is the ground truth count of the $i$-th image, and $N$ is the number of test images. Additionally, PSNR (Peak Signal-to-Noise Ratio, https://en.wikipedia.org/wiki/Peak_signal-to-noise_ratio) and SSIM (Structural Similarity, https://en.wikipedia.org/wiki/Structural_similarity) [44] are utilized to measure the quality of density maps. For fair comparison, similar to [17], bilinear interpolation is employed to resize the estimated density maps to the same size as the input images.
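The two counting metrics of Eq. (13) amount to a few lines of NumPy (note that, following the convention in crowd counting papers, "MSE" is reported as the root of the mean squared error):

```python
import numpy as np

def mae_mse(pred_counts, gt_counts):
    """Mean Absolute Error and (root) Mean Squared Error over per-image counts."""
    pred = np.asarray(pred_counts, dtype=np.float64)
    gt = np.asarray(gt_counts, dtype=np.float64)
    mae = np.abs(pred - gt).mean()
    mse = np.sqrt(((pred - gt) ** 2).mean())
    return mae, mse
```

MAE reflects accuracy while the squared term makes MSE more sensitive to occasional large miscounts, so it is often read as a robustness indicator.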
4.2 Comparisons with State-of-the-art
Tables 1 and 2 report the results on the four challenging datasets. In summary, our method significantly improves all baselines and outperforms the other state-of-the-art methods. This result demonstrates the effectiveness of our SPANet, which provides accurate density estimation on both dense and sparse crowd scenes and can be applied to all CNN-based crowd counting networks.
On the ShanghaiTech dataset, our SPANet boosts MCNN, CSRNet, and SANet with relative MAE improvements of 9.5%, 8.5%, and 11.3% on Part A, and 27.7%, 20.8%, and 22.7% on Part B. Note that Part A is collected from the internet while Part B is from busy streets and has more spatial constraints. Since our SPANet can fully utilize spatial awareness, it brings larger improvements on Part B. On UCF_CC_50, SPANet provides relative MAE improvements of 22.5%, 7.6%, and 10.0% for the three baselines. Note that the improved MCNN is even comparable with other state-of-the-art methods, which clearly shows that SPANet can handle extremely dense crowd scenes. Similar to the above two datasets, SPANet also achieves significant improvements on UCSD and WorldExpo’10, verifying the effectiveness of our method on sparse-crowd scenes.
Method | S1 | S2 | S3 | S4 | S5 | Avg.
--- | --- | --- | --- | --- | --- | ---
Zhang et al. [48] | 9.8 | 14.1 | 14.3 | 22.2 | 3.7 | 12.9
Huang et al. [10] | 4.1 | 21.7 | 11.9 | 11.0 | 3.5 | 10.5
Switch-CNN [32] | 4.4 | 15.7 | 10.0 | 11.0 | 5.9 | 9.4
SaCNN [49] | 2.6 | 13.5 | 10.6 | 12.5 | 3.3 | 8.5
CP-CNN [39] | 2.9 | 14.7 | 10.5 | 10.4 | 5.8 | 8.9
MCNN [52] | 3.4 | 20.6 | 12.9 | 13.0 | 8.1 | 11.6
CSRNet [17] | 2.9 | 11.5 | 8.6 | 16.6 | 3.4 | 8.6
SANet [3] | 2.6 | 13.2 | 9.0 | 13.3 | 3.0 | 8.2
MCNN+SPANet | 3.4 | 14.9 | 15.1 | 12.8 | 4.5 | 10.1
CSRNet+SPANet | 2.6 | 11.1 | 8.9 | 13.5 | 3.3 | 7.9
SANet+SPANet | 2.3 | 12.3 | 7.9 | 12.9 | 3.2 | 7.7
4.3 Ablation Studies
Sampling positions. We first evaluate the impact of different starting positions when sampling patches for mask pooling. The results are listed in Table 3. We find that starting at the bottom is always better than the top, and the right is better than the left; a possible reason is that this is closely related to camera calibration. These results encourage us to sample patches from the bottom right corner. Note that the differences between these sampling schemes are quite small, which demonstrates the robustness of our method. Additionally, we also compare performing mask pooling on the same or different density maps in each branch, as discussed in Section 3.2 and Eq. (4). As shown in Table 3, the results of the two strategies are similar. For efficiency, we directly pool patches from the same density map.
Configurations | MAE | MSE
--- | --- | ---
Center point | 101.2 | 153.3
Top left corner | 101.5 | 153.7
Bottom left corner | 100.7 | 149.2
Top right corner | 100.5 | 149.4
Bottom right corner | 99.7 | 146.3
Different density map | 100.3 | 147.4
Same density map | 99.7 | 146.3
$\ell_2$ | 110.2 | 173.2
$\ell_2 + L_{MEP}$ | 99.3 | 145.3
$\ell_2 + \sum_k L_{rank}^k$ | 107.2 | 164.5
$\ell_2 + \sum_k L_{rank}^k + L_{MEP}$ | 99.7 | 146.3
Random weights | 105.4 | 162.2
Grid search | 98.3 | 142.5
Dataset | MCNN (PSNR / SSIM) | CSRNet (PSNR / SSIM) | SANet (PSNR / SSIM)
--- | --- | --- | ---
ShanghaiTech A [52] | 21.42 → 22.18 / 0.52 → 0.66 | 23.79 → 24.88 / 0.76 → 0.85 | 23.36 → 25.33 / 0.78 → 0.85
ShanghaiTech B [52] | 23.43 → 26.19 / 0.78 → 0.85 | 27.02 → 29.50 / 0.89 → 0.92 | 27.44 → 29.17 / 0.89 → 0.91
UCF_CC_50 [11] | 14.44 → 18.25 / 0.37 → 0.51 | 18.76 → 20.17 / 0.52 → 0.78 | 18.35 → 20.01 / 0.51 → 0.76
UCSD [4] | 17.43 → 18.52 / 0.75 → 0.83 | 20.02 → 21.80 / 0.86 → 0.89 | 21.33 → 22.20 / 0.84 → 0.90
WorldExpo’10 [48] | 23.53 → 25.97 / 0.76 → 0.85 | 26.94 → 29.05 / 0.92 → 0.93 | 26.22 → 28.54 / 0.90 → 0.92

Values show baseline → baseline+SPANet.
Different losses/weights. We next evaluate the effect of different losses and weight schemes. As shown in Table 3, adding the ranking loss alone provides only a slight improvement, while the significant improvement comes from the MEP loss. Besides, there is no significant difference whether the ranking loss is used alongside the MEP loss. This demonstrates that our MEP loss can effectively learn spatial awareness to boost crowd counting. We further conduct experiments on two weight schemes: random weights and grid search with step 0.1. As shown in Table 3, our method is not sensitive to the weights; even grid search brings only a very slight improvement.
Number of branches. We measure the performance of SPANet with different branch numbers $K$. As illustrated in Figure 4, the performance first improves but then drops as $K$ increases. This observation is not surprising. On one side, a small $K$ would involve large bias error due to the large difference between the two patches. On the other side, a large $K$ (e.g., equal to the height of the estimated density map) implies that the difference between the two patches in each branch is very small, which cannot provide enough discrepancy for subregion generation. In experiments, $K$ is set separately for MCNN/SANet and for CSRNet, determined via cross validation.
Size of estimated density maps. We further validate the effect of the size of estimated density maps. We add deconvolutional layers on top of MCNN to increase the size of the estimated density maps, obtaining two variants of MCNN whose estimated density maps are of enlarged size and of the same size as the input images, respectively. As shown in Figure 4, the performance improves as the size of the density maps increases. The results indicate that predicting high-resolution density maps can bring considerable improvement.
4.4 Studies on Estimated Density Maps
We now evaluate the estimated density maps to verify whether our method can fully utilize spatial awareness. Table 4 summarizes the results. Our SPANet significantly improves PSNR and SSIM across all baselines and datasets, which indicates that the quality of the generated density maps is significantly improved. To further verify that our method can indeed learn spatial awareness, we showcase the density maps generated by different methods for four examples in Figure 5. These four examples contain different crowd densities, occlusions, and scale changes. We observe that the baseline models are consistently affected by zero-mean noise, which leads to overestimation in low-density areas. In contrast, zero-mean noise is effectively suppressed in our SPANet. Besides, baseline models typically underestimate high-density areas, while ours obtains a more accurate estimation for them. Note that the ground truth itself is generated from center points of pedestrians’ heads and thus inherently contains inaccurate information, which means our method still cannot produce a density map identical to the ground truth.
4.5 Studies on Learning Curves
Finally, we study the learning curves to further evaluate our method. Figure 6 shows the training and validation mean absolute error (MAE) at every epoch on the ShanghaiTech Part A dataset. For better viewing, we smooth the learning curves by an exponential moving average (EMA). Compared with the original results, baselines integrated with our SPANet exhibit lower MAE on both the training and testing sets. Since the performance on the training and testing sets generally reflects the fitting and generalization ability, this result demonstrates promising capability on both sides. In addition, it also means that our method can significantly improve stability during model training.
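The EMA smoothing used for the curves can be sketched as below; the smoothing factor is illustrative, since the original value is not given here:

```python
def ema_smooth(values, alpha=0.6):
    """Exponential moving average for display-only smoothing of a learning
    curve; larger alpha keeps more of the previous smoothed value."""
    out, prev = [], values[0]
    for v in values:
        prev = alpha * prev + (1 - alpha) * v
        out.append(prev)
    return out
```

Smoothing is applied only for visualization; the reported MAE numbers are unaffected.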
5 Conclusion
In this paper we have presented a novel deep architecture called SPatial Awareness Network (SPANet) for crowd counting, which captures spatial variations by finding the pixel-level subregion with high discrepancy from the ground truth. It can be integrated into all CNN-based methods and is end-to-end trainable. Experiments on four datasets and three different networks demonstrate that it significantly improves all baselines and outperforms the state-of-the-art methods. It provides elegant views on effectively using spatial awareness to improve crowd counting. In future work we will study how to preserve spatial awareness as much as possible during ground truth generation.
Acknowledgements
This research was supported in part through the financial assistance award 60NANB17D156 from U.S. Department of Commerce, National Institute of Standards and Technology and by the Intelligence Advanced Research Projects Activity (IARPA) via Department of Interior/Interior Business Center (DOI/IBC) contract number D17PC00340, National Natural Science Foundation of China (61772436), Foundation for Department of Transportation of Henan Province, China (2019J22), Sichuan Science and Technology Innovation Seedling Fund (2017RZ0015), China Scholarship Council (201707000083) and Cultivation Program for the Excellent Doctoral Dissertation of Southwest Jiaotong University (DYB 201707).
References
 [1] Lokesh Boominathan, Srinivas S. S. Kruthiventi, and R. Venkatesh Babu. Crowdnet: A deep convolutional network for dense crowd counting. In Proceedings of ACM International Conference on Multimedia, pages 640–644, 2016.
 [2] Gabriel J. Brostow and Roberto Cipolla. Unsupervised bayesian detection of independent motion in crowds. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 594–601, 2006.
 [3] Xinkun Cao, Zhipeng Wang, Yanyun Zhao, and Fei Su. Scale aggregation network for accurate and efficient crowd counting. In Proceedings of European Conference on Computer Vision, pages 757–773, 2018.
 [4] Antoni B Chan, ZhangSheng John Liang, and Nuno Vasconcelos. Privacy preserving crowd monitoring: Counting people without people models or tracking. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1–7, 2008.
 [5] Antoni B Chan and Nuno Vasconcelos. Counting people with low-level features and bayesian regression. IEEE Transactions on Image Processing, 21(4):2160–2177, 2012.
 [6] Zhi-Qi Cheng, Jun-Xiu Li, Qi Dai, Xiao Wu, Jun-Yan He, and Alexander Hauptmann. Improving the learning of multi-column convolutional neural network for crowd counting. In Proceedings of the 26th ACM International Conference on Multimedia, 2019.
 [7] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, volume 1, pages 886–893, 2005.
 [8] Piotr Dollár, Boris Babenko, Serge Belongie, Pietro Perona, and Zhuowen Tu. Multiple component learning for object detection. In Proceedings of European Conference on Computer Vision, pages 211–224, 2008.
 [9] Siyu Huang, Xi Li, Zhiqi Cheng, Zhongfei Zhang, and Alexander G. Hauptmann. Stacked pooling: Improving crowd counting by boosting scale invariance. CoRR, abs/1808.07456, 2018.
 [10] Siyu Huang, Xi Li, Zhongfei Zhang, Fei Wu, Shenghua Gao, Rongrong Ji, and Junwei Han. Body structure aware deep crowd counting. IEEE Transactions on Image Processing, 27(3):1049–1059, 2018.
 [11] Haroon Idrees, Imran Saleemi, Cody Seibert, and Mubarak Shah. Multi-source multi-scale counting in extremely dense crowd images. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 2547–2554, 2013.
 [12] Haroon Idrees, Khurram Soomro, and Mubarak Shah. Detecting humans in dense crowds using locally-consistent scale prior and global occlusion reasoning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(10):1986–1998, 2015.
 [13] Di Kang and Antoni B. Chan. Crowd counting by adaptively fusing predictions from an image pyramid. In Proceedings of British Machine Vision Conference, page 89, 2018.
 [14] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

 [15] Christian Ledig, Lucas Theis, Ferenc Huszár, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, Alykhan Tejani, Johannes Totz, Zehan Wang, et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 4681–4690, 2017.
 [16] Victor S. Lempitsky and Andrew Zisserman. Learning to count objects in images. In Proceedings of Conference on Neural Information Processing Systems, pages 1324–1332, 2010.
 [17] Yuhong Li, Xiaofan Zhang, and Deming Chen. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 1091–1100, 2018.
 [18] Sheng-Fuu Lin, Jaw-Yeh Chen, and Hung-Xin Chao. Estimation of number of people in crowded scenes using perspective transformation. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 31(6):645–654, 2001.
 [19] Jiang Liu, Chenqiang Gao, Deyu Meng, and Alexander G. Hauptmann. Decidenet: Counting varying density crowds through attention guided detection and density estimation. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 5197–5206, 2018.
 [20] Ning Liu, Yongchao Long, Changqing Zou, Qun Niu, Li Pan, and Hefeng Wu. Adcrowdnet: An attention-injective deformable convolutional network for crowd understanding. CoRR, abs/1811.11968, 2018.
 [21] Weizhe Liu, Krzysztof Lis, Mathieu Salzmann, and Pascal Fua. Geometric and physical constraints for head plane crowd density estimation in videos. CoRR, abs/1803.08805, 2018.
 [22] Weizhe Liu, Mathieu Salzmann, and Pascal Fua. Contextaware crowd counting. CoRR, abs/1811.10452, 2018.
 [23] Xialei Liu, Joost van de Weijer, and Andrew D. Bagdanov. Leveraging unlabeled data for crowd counting by learning to rank. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 7661–7669, 2018.
 [24] Frank J Massey Jr. The Kolmogorov-Smirnov test for goodness of fit. Journal of the American Statistical Association, 46(253):68–78, 1951.
 [25] Daniel Oñoro-Rubio and Roberto Javier López-Sastre. Towards perspective-free object counting with deep learning. In Proceedings of European Conference on Computer Vision, pages 615–629, 2016.

 [26] Viet-Quoc Pham, Tatsuo Kozakaya, Osamu Yamaguchi, and Ryuzo Okada. COUNT forest: Co-voting uncertain number of targets using random forest for crowd density estimation. In Proceedings of International Conference on Computer Vision, pages 3253–3261, 2015.
 [27] Viresh Ranjan, Hieu Le, and Minh Hoai. Iterative crowd counting. In Proceedings of European Conference on Computer Vision, pages 278–293, 2018.
 [28] Carlo S Regazzoni and Alessandra Tesei. Distributed data fusion for real-time crowding estimation. Signal Processing, 53(1):47–63, 1996.
 [29] David Ryan, Simon Denman, Clinton Fookes, and Sridha Sridharan. Crowd counting using multiple local features. In Digital Image Computing: Techniques and Applications, pages 81–88, 2009.

 [30] Deepak Babu Sam and R. Venkatesh Babu. Top-down feedback for crowd counting convolutional neural network. In Proceedings of Conference on Artificial Intelligence, pages 7323–7330, 2018.
 [31] Deepak Babu Sam, Neeraj N. Sajjan, R. Venkatesh Babu, and Mukundhan Srinivasan. Divide and grow: Capturing huge diversity in crowd images with incrementally growing CNN. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 3618–3626, 2018.
 [32] Deepak Babu Sam, Shiv Surya, and R. Venkatesh Babu. Switching convolutional neural network for crowd counting. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 4031–4039, 2017.
 [33] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pages 618–626, 2017.
 [34] Zan Shen, Yi Xu, Bingbing Ni, Minsi Wang, Jianguo Hu, and Xiaokang Yang. Crowd counting via adversarial cross-scale consistency pursuit. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 5245–5254, 2018.
 [35] Miaojing Shi, Zhaohui Yang, Chao Xu, and Qijun Chen. Perspective-aware CNN for crowd counting. CoRR, abs/1807.01989, 2018.
 [36] Zenglin Shi, Le Zhang, Yun Liu, Xiaofeng Cao, Yangdong Ye, Ming-Ming Cheng, and Guoyan Zheng. Crowd counting with deep negative correlation learning. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 5382–5390, 2018.
 [37] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
 [38] Vishwanath A. Sindagi and Vishal M. Patel. CNN-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In Proceedings of International Conference on Advanced Video and Signal Based Surveillance, pages 1–6, 2017.
 [39] Vishwanath A. Sindagi and Vishal M. Patel. Generating high-quality crowd density maps using contextual pyramid cnns. In Proceedings of International Conference on Computer Vision, pages 1879–1888, 2017.
 [40] Yukun Tian, Yimei Lei, Junping Zhang, and James Z. Wang. Padnet: Pan-density crowd counting. CoRR, abs/1811.02805, 2018.
 [41] Paul Viola, Michael J Jones, and Daniel Snow. Detecting pedestrians using patterns of motion and appearance. International Journal of Computer Vision, 63(2):153–161, 2005.
 [42] Elad Walach and Lior Wolf. Learning to count with CNN boosting. In Proceedings of European Conference on Computer Vision, pages 660–676, 2016.
 [43] Meng Wang and Xiaogang Wang. Automatic adaptation of a generic pedestrian detector to a specific traffic scene. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 3401–3408, 2011.
 [44] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
 [45] Ze Wang, Zehao Xiao, Kai Xie, Qiang Qiu, Xiantong Zhen, and Xianbin Cao. In defense of singlecolumn networks for crowd counting. In Proceedings of British Machine Vision Conference, page 78, 2018.
 [46] Xingjiao Wu, Yingbin Zheng, Hao Ye, Wenxin Hu, Jing Yang, and Liang He. Adaptive scenario discovery for crowd counting. CoRR, abs/1812.02393, 2018.
 [47] Lingke Zeng, Xiangmin Xu, Bolun Cai, Suo Qiu, and Tong Zhang. Multi-scale convolutional neural networks for crowd counting. In Proceedings of International Conference on Image Processing, pages 465–469, 2017.
 [48] Cong Zhang, Hongsheng Li, Xiaogang Wang, and Xiaokang Yang. Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 833–841, 2015.
 [49] Lu Zhang, Miaojing Shi, and Qiaobo Chen. Crowd counting via scale-adaptive convolutional neural network. In Proceedings of Winter Conference on Applications of Computer Vision, pages 1113–1121, 2018.
 [50] Xiaolin Zhang, Yunchao Wei, Jiashi Feng, Yi Yang, and Thomas S Huang. Adversarial complementary learning for weakly supervised object localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1325–1334, 2018.
 [51] Youmei Zhang, Chunluan Zhou, Faliang Chang, and Alex C. Kot. Attention to head locations for crowd counting. CoRR, abs/1806.10287, 2018.
 [52] Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, and Yi Ma. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 589–597, 2016.
 [53] Tao Zhao, Ram Nevatia, and Bo Wu. Segmentation and tracking of multiple humans in crowded environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(7):1198–1211, 2008.

 [54] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.