Learn to Scale: Generating Multipolar Normalized Density Map for Crowd Counting

07/29/2019 ∙ by Chenfeng Xu, et al. ∙ Huazhong University of Science & Technology ∙ Microsoft

Dense crowd counting aims to predict thousands of human instances from an image by integrating a density map over image pixels. Existing approaches mainly suffer from extreme density variances. Such density pattern shift poses challenges even for multi-scale model ensembling. In this paper, we propose a simple yet effective approach to tackle this problem. First, a patch-level density map is extracted by a density estimation model and is further grouped into several density levels, which are determined over the full dataset. Second, each patch density map is automatically normalized by an online center learning strategy with a multipolar center loss (MPCL). Such a design significantly condenses the density distribution into several clusters and enables a single model to learn the density variance. Extensive experiments show that the proposed framework achieves the best accuracy on several crowd counting datasets, with relative accuracy gains of 4.2% and 14.3% on ShanghaiTech Part A and Part B, respectively, and further gains on the UCF_CC_50 and UCF-QNRF datasets.




1 Introduction

A robust crowd counting system is of significant value in many real-world applications such as video surveillance, security alerting, and event planning. In recent years, deep learning based approaches have become the mainstream of crowd counting, because of the powerful representation learning ability of convolutional neural networks (CNNs). To estimate the count, predominant approaches generate a density map with a CNN, from which the count of instances is obtained by integrating over image pixels.

Figure 1: (a) Three examples from ShanghaiTech-Part A dataset, which show extreme density variances. (b) The Mean Relative Error (MRE) over four crowd counting datasets (the scale variances get larger from left to right) of different approaches. Results show the robustness of the proposed approach to extreme scale variances, compared with recent works. [best viewed in color].

Although crowd counting has been extensively studied, handling the large density variances that cause huge density pattern shifts in crowd images is still an open issue. As illustrated in Fig. 1 (a), the densities of crowd image patches can vary significantly, from fairly sparse (e.g., ShanghaiTech Part B) to extremely dense (e.g., UCF-QNRF). Such large density pattern shifts pose great challenges to density prediction by a single CNN model, due to its fixed receptive field sizes. Remarkable progress has been achieved by learning a density map through multi-scale architectures [23] or by aggregating multi-scale features [3, 33], which indicates that the ability to cope with density variation is crucial for crowd counting methods. Although density maps at multiple scales can be generated and aggregated, robustness is still hard to ensure when the density variance grows large. As shown in Fig. 1 (b), most recent works obtain a higher MRE on datasets with larger density variances (MRE is calculated as MAE/P, where MAE denotes the standard Mean Absolute Error and P is the average count of a dataset), which indicates that extreme density variance and pattern shift remain a huge challenge in crowd counting.

In this paper, we propose a simple yet effective method to relieve the problem of extreme density variances. The core idea is learning to scale (denoted as L2S) image patches so that the density distribution condenses into several clusters, which reduces the density variance. The scale factor of each image patch is learned automatically during training, under the supervision of a novel multipolar center loss (MPCL). More specifically, all the patches from each density level are optimized to approach a density center, which is updated online as the mean value of its density level.

In particular, the proposed L2S framework consists of two closely-related steps. First, given an image, an initial density map is generated by a CNN model. After that, each density map is divided into patches, and all the patch-level density maps are further evenly divided into groups, according to their density levels. Second, each patch is scaled by a learned scale factor, and thus the density of this patch can converge to a center of its density level. The final density map for the input image can be obtained by concatenating the patch-level density maps.

We conduct experiments on several popular benchmark datasets, including ShanghaiTech [33], UCF_CC_50 [9], and UCF-QNRF [11]. Extensive evaluations show significantly superior performance over prior art. Moreover, cross-dataset validation further demonstrates that the proposed L2S framework transfers well. In summary, the main contributions of this paper are two-fold:

  • We propose a Learning to Scale Module (L2SM) to solve the density variation issue in crowd counting. With L2SM, different regions can be automatically scaled so that they have similar densities, while the quality of their density maps is significantly improved. L2SM is end-to-end trainable when added to a CNN model for density estimation.

  • The proposed L2SM significantly outperforms state-of-the-art methods for crowd counting on three widely-adopted challenging datasets, demonstrating its effectiveness in handling density variation. Furthermore, L2SM also has a good transferability under cross validation on different datasets, showing the generality of the proposed method.

2 Related Work

Crowd counting has attracted much attention in computer vision. Early methods frame the counting problem as a detection task [7, 29] that explicitly detects individual heads, which has major difficulty with occlusion and dense areas. The regression-based methods [4, 6, 8, 10] greatly improve the counting performance on dense areas via different regression functions such as Gaussian process, ridge regression, and random forest regression. Recently, with the development of deep learning, mainstream crowd counting has switched to CNN-based methods [21, 32, 2, 33, 31, 5, 18]. These CNN-based methods address crowd counting by regressing the density map representation introduced by Lempitsky et al. [14], and achieve better accuracy and transferability than the classical methods. Recent methods mainly focus on two challenges faced by current CNN-based methods: huge scale and density variance, and severe over-fitting.

Methods addressing huge scale and density variance. Multi-scale change is a challenging problem for many vision tasks including crowd counting. It is difficult to accurately count the small heads in dense areas. There are many methods attempting to handle huge scale variance. The existing methods can be roughly divided into two categories: methods that explicitly rely on scale information and methods that implicitly cope with multi-scale changes.

1) Some methods explicitly make use of scale information for crowd counting. For instance, the methods in [32, 19] adopt deep CNNs using provided geometric or perspective information. Yet, such scale-related information is not always readily available. In [28, 23], the authors first use a network to estimate the scale level and density degree for the corresponding whole or partial region based on a manually set scale degree. Then, the obtained scale-related information is fused with another network or used to design different networks for dividing and counting. To overcome the difficulty of manually setting the scale degree, the authors of [1] design an incrementally growing CNN to deal with areas of different density degrees without involving any handcrafted steps.

2) Some other works aim to implicitly cope with the multi-scale problem. Zhang et al. [33] propose a multi-column CNN to extract multi-scale features and fuse them together for density map estimation. Each column is limited to covering a subset of the scale variance. To improve on the multi-column structure, the authors of [3] propose a multi-stage structure in which each stage exploits multi-column convolution and combines the multi-scale features to regress the density map. The authors of [17] encode the scale of the contextual information required to accurately predict crowd density. In [15], the authors propose to increase the receptive field size of the CNN to better leverage multi-scale information. In addition to these specific network designs for implicitly handling the multi-scale problem, Shen et al. [24] introduce an ad hoc term in the training loss function to pursue cross-scale consistency. In [11], the authors propose to adopt variant ground-truth density map representations with different Gaussian kernel sizes to better deal with density map estimation in areas of different density levels.

Methods alleviating severe over-fitting. It is well-known that deep CNN [13, 26] usually struggles with over-fitting problem on small datasets. Current CNN-based crowd counting methods also face this challenge due to small size and limited variety of existing datasets, leading to weak performance and transferability. To overcome the over-fitting, Liu et al. [18] propose a learning-to-rank framework to leverage abundantly available unlabeled crowd imagery and the self-learning method. In [25], the authors build a set of decorrelated regressors with reasonable generalization capabilities through managing their intrinsic diversities to avoid severe over-fitting.

Though many methods have been proposed to tackle the large scale and density variation issue, this problem still remains challenging for crowd counting. The proposed method also attempts to address this issue. Different from previous methods [33, 23, 27, 1, 3, 16], we mimic a rational human behavior in crowd counting through learning to scale dense region counting. We compute the scale ratios with a novel use of multipolar center loss [30] to explicitly bring all the regions of significantly varied density to multiple similar density levels. This results in a robust density estimation on dense regions and appealing transfer ability.

Figure 2: A rational human behaviour. For a given image, we are prone to first count in the regions of large heads (e.g., region on the bottom of image), then zoom in the regions of dense small heads for precise counting (see for example the region in the middle and its zoomed version on top right).
Figure 3: Overall pipeline of the proposed method consisting of two modules: 1) Scale Preserving Network (SPN), which generates an initial density map from the stacked feature, and 2) Learn to Scale Module (L2SM), which computes scale ratios for dense regions selected (based on the initial density map) from a non-overlapping division of the image domain, and then re-predicts the density map for the selected dense regions from the scaled feature. We adopt multipolar center loss (MPCL) on the relative density level of each region to explicitly centralize all the selected dense regions into multiple similar density levels. This alleviates the density pattern shift issue caused by the large density variation between sparse and dense regions.

3 Method

3.1 Overview

The mainstream crowd counting methods model the problem as density map regression using CNN. For a given image, the ground-truth density map is given by spreading binary head locations to nearby regions with Gaussian kernels. For sparse regions, the ground-truth density only depends on a specific person, resulting in regular Gaussian blobs. For dense regions, multiple crowded heads may spread to the same nearby pixel, yielding high ground-truth density with very different density pattern compared with sparse regions. This density pattern shift makes it difficult to accurately predict the density map for both dense and sparse regions in the same way.

To improve the counting accuracy, we aim to tackle the problem of pattern shift due to large density variations, improving the prediction for highly dense regions. Specifically, the proposed method mimics a rational behaviour when humans count crowds. For a given crowd image, we are prone to begin with dividing the image into partitions of different crowding level before attempting to count the people. For sparse regions of large heads, it is easy to directly count the people on the original region. Whereas, for dense regions composed of crowded small heads, we need to zoom in the region for more accurate counting. An example of this counting behaviour is depicted in Fig. 2.

We propose a network to mimic such human behaviour for crowd counting. The overall pipeline is depicted in Fig. 3 and consists of two modules: 1) Scale preserving network (SPN), presented in Sec. 3.2. We leverage multi-scale feature fusion to generate an initial density map prediction, which provides accurate prediction on sparse regions and indicates the density distribution over the image; 2) Learning to scale module (L2SM), detailed in Sec. 3.3. We divide the image into non-overlapping regions and select some dense regions (based on the initial density estimation) whose density map is re-predicted. Specifically, we leverage SPN to compute a scaling factor for each selected dense region, and scale the ground-truth density map by changing the distance between blobs while keeping the same peaks. The density re-prediction for the selected regions is then performed on the scaled features. The key to this re-prediction process lies in computing appropriate scaling factors. For that, we adopt the center loss to centralize the density distributions into multipolar centers, alleviating the density pattern shift issue and thus improving the prediction accuracy. The whole network is end-to-end trainable, and its training objective is described in Sec. 3.4.

3.2 Scale Preserving Network

We follow the mainstream crowd counting methods by regressing a density map. Precisely, we use geometry-adaptive kernels to generate ground-truth density maps in highly congested scenes. For a given image containing $N$ persons, the ground-truth annotation can be represented via a delta function on each pixel $p$: $H(p) = \sum_{i=1}^{N} \delta(p - p_i)$, where $p_i$ is the annotated location of the $i$-th person. The density map on each pixel is then generated by convolving $H$ with a Gaussian kernel $G_\sigma$: $D(p) = \sum_{i=1}^{N} \delta(p - p_i) * G_\sigma$, where $\sigma$ is the spread parameter of the Gaussian kernel.
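The relation between head annotations, the density map, and the count can be illustrated with a minimal sketch. It is not the geometry-adaptive procedure used in the paper: it assumes a fixed spread `sigma` and normalizes each Gaussian over the image so that every person contributes exactly 1 to the sum. The function names are hypothetical.

```python
import math


def gaussian_density_map(height, width, heads, sigma=2.0):
    """Spread each annotated head (row, col) with a Gaussian that is
    normalized over the image, so each person adds exactly 1 to the sum."""
    density = [[0.0] * width for _ in range(height)]
    for head_row, head_col in heads:
        # Unnormalized Gaussian weights centred on the head location.
        weights = [
            [
                math.exp(-((r - head_row) ** 2 + (c - head_col) ** 2) / (2 * sigma ** 2))
                for c in range(width)
            ]
            for r in range(height)
        ]
        total = sum(map(sum, weights))
        for r in range(height):
            for c in range(width):
                density[r][c] += weights[r][c] / total
    return density


def count_from_density(density):
    """The crowd count is the discrete integral (sum) of the density map."""
    return sum(map(sum, density))


heads = [(5, 5), (6, 7), (20, 12)]  # toy annotations, fixed sigma (assumption)
density = gaussian_density_map(32, 32, heads, sigma=2.0)
estimated_count = count_from_density(density)
print(round(estimated_count, 6))
```

The two nearby heads at (5, 5) and (6, 7) overlap into a single merged blob, which is the kind of pattern that differs from the isolated blob at (20, 12).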

We develop a CNN to regress the density map. For a fair comparison with most methods, we adopt VGG16 [26] as the backbone network. We discard the pooling layer between stage4 and stage5, as well as the last pooling layer and the fully connected layers that follow, to preserve accurate spatial information. It is well known that deep layers in a CNN encode more semantic, high-level information, while shallow layers provide more precise localization information. We extract features from different stages by applying convolutions on the last layer of each stage. We then pool the features extracted from stage1 to stage5 to five different spatial sizes, one per stage, which results in a pyramid structure. Each spatial unit in a pooled feature indicates the density level, and hence the scale, of the underlying region mapped to the original image. These pooled scale-preserving features are then upsampled to the size of conv5 by bilinear interpolation and stacked together with the features in conv5. We then feed the stacked feature to three successive convolutions and one deconvolution layer to regress the initial density map.

3.3 Learning to Scale Module

The initial density prediction is accurate on sparse regions thanks to the regular individual Gaussian blobs, but it is less accurate on dense regions composed of crowded heads lying very close to each other. As indicated in Sec. 3.1, this triggers the pattern shift in the target density map. Following the rational human behaviour in crowd counting, we zoom in on the dense regions for better counting accuracy. On the zoomed version, the distance between nearby heads is enlarged, which results in regular individual Gaussian blobs in the target density map and alleviates the density pattern shift. Such density pattern modulation facilitates the prediction. Inspired by this, we first evenly divide the image domain into $K \times K$ (e.g., $4 \times 4$) non-overlapping regions. We then select the dense regions based on the average initial density of each region, i.e., the sum of the initial density over the region divided by the area of the region.
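The division and selection step can be sketched as follows. This is a minimal illustration under stated assumptions: the density map is a plain nested list, the grid is `k` by `k`, and the selection threshold is a free parameter chosen here only for the toy example. The function names are hypothetical.

```python
def region_average_density(density, k):
    """Split an H x W density map into k x k non-overlapping regions and
    return a k x k grid holding (sum of density) / (region area)."""
    height, width = len(density), len(density[0])
    averages = [[0.0] * k for _ in range(k)]
    for i in range(k):
        r0, r1 = i * height // k, (i + 1) * height // k
        for j in range(k):
            c0, c1 = j * width // k, (j + 1) * width // k
            total = sum(density[r][c] for r in range(r0, r1) for c in range(c0, c1))
            averages[i][j] = total / ((r1 - r0) * (c1 - c0))
    return averages


def select_dense_regions(averages, threshold):
    """Return grid indices (i, j) of regions whose average density exceeds the threshold."""
    return [
        (i, j)
        for i, row in enumerate(averages)
        for j, value in enumerate(row)
        if value > threshold
    ]


# Toy 8 x 8 initial prediction: only the top-left quadrant is crowded.
initial = [[0.5 if (r < 4 and c < 4) else 0.01 for c in range(8)] for r in range(8)]
averages = region_average_density(initial, k=2)
dense_regions = select_dense_regions(averages, threshold=0.1)
print(dense_regions)
```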

We propose to mimic human behaviour in crowd counting by learning to scale the selected dense regions. For that, we first leverage the scale preserving pyramid features described in Sec. 3.2 to compute the scaling ratio for each selected region. Precisely, we downsample/upsample the pooled features described in Sec. 3.2 to the size of the region grid and concatenate them. This is followed by a convolution that produces the scale factor map. Each value in this map represents the scaling ratio for the underlying region.

Once we have the scale factor map, we scale the feature on the selected regions accordingly through bilinear upsampling. Based on the scaled feature map corresponding to each selected region, we apply five successive convolutions to re-predict the density map for the scaled region. We then resize the re-predicted density map to the original size of the region and multiply the density on each pixel by the square of the scaling ratio to preserve the same counting result. The initial prediction on the selected regions is replaced with the resized density map re-prediction.
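Why a multiplier is needed when resizing back can be checked numerically. The sketch below is only an illustration: it assumes an integer scaling ratio `r`, uses average pooling as a simple stand-in for bilinear resizing, and assumes the multiplier is `r * r` (the region area shrinks by that factor when resized back). The function names are hypothetical.

```python
def average_pool(density, r):
    """Downsample by an integer factor r with average pooling
    (a stand-in for bilinear resizing, chosen for simplicity)."""
    out_h, out_w = len(density) // r, len(density[0]) // r
    return [
        [
            sum(density[i * r + a][j * r + b] for a in range(r) for b in range(r)) / (r * r)
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def resize_back_preserving_count(scaled_density, r):
    """Resize a density map predicted on an r-times zoomed region back to the
    original size, then multiply by r^2 so the total count is unchanged."""
    pooled = average_pool(scaled_density, r)
    return [[value * r * r for value in row] for row in pooled]


r = 2
scaled_prediction = [[0.1 * ((i + j) % 3) for j in range(8)] for i in range(8)]
restored = resize_back_preserving_count(scaled_prediction, r)

count_on_scaled_region = sum(map(sum, scaled_prediction))
count_after_resizing = sum(map(sum, restored))
count_without_multiplier = sum(map(sum, average_pool(scaled_prediction, r)))
print(count_on_scaled_region, count_after_resizing, count_without_multiplier)
```

Without the multiplier, the resized map would report only a fraction `1 / r**2` of the count predicted on the zoomed region.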

To guide the density map re-prediction on the selected regions, we also adjust the ground-truth density map for each region accordingly. For each selected region, instead of straightforwardly scaling the ground-truth density map in the same way as the feature map, we first scale the binary head location map and then recompute the ground-truth density map for the scaled region from the scaled locations of the people it contains. As shown in Fig. 4, such ground-truth transformation for density map re-computation reduces the density pattern gap between sparse regions and dense regions, facilitating the density map re-prediction.
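The effect of this ground-truth transformation can be illustrated in one dimension. The sketch below is a simplification, not the paper's procedure: it assumes a 1D signal, a fixed normalized Gaussian kernel with spread `sigma`, and an integer scaling ratio `r`. The function names are hypothetical.

```python
import math


def density_1d(length, heads, sigma=1.5):
    """1D ground-truth density: one normalized Gaussian per head position."""
    norm = 1.0 / (math.sqrt(2 * math.pi) * sigma)
    return [
        sum(norm * math.exp(-((x - h) ** 2) / (2 * sigma ** 2)) for h in heads)
        for x in range(length)
    ]


def transform_ground_truth(heads, r, length, sigma=1.5):
    """Scale the head locations by r, then recompute the density with the
    same kernel (instead of stretching the original density map)."""
    scaled_heads = [h * r for h in heads]
    return density_1d(length * r, scaled_heads, sigma)


heads = [10, 12]  # two heads only 2 pixels apart
original = density_1d(24, heads)
zoomed = transform_ground_truth(heads, r=3, length=24)  # heads move to 30 and 36

single_peak = 1.0 / (math.sqrt(2 * math.pi) * 1.5)  # peak of one isolated blob
print(max(original) / single_peak, max(zoomed) / single_peak)
```

In the original signal the two blobs merge into one peak well above that of an isolated head, while after the transformation each head has its own blob with the isolated-head peak, and the total count stays at 2.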

Figure 4: An example of ground-truth transformation for density map re-computation by enlarging the distance between blobs while keeping the original peaks, alleviating the density pattern shift between sparse and dense regions.

The main issue in this density map re-prediction is computing appropriate scale ratios for the selected dense regions, since there is no explicit target scale indicating how much each region should ideally be zoomed. We would like the estimated average density of each selected region to approach its ground-truth average density. The relative density level of a region after scaling is determined by its average density together with its scaling ratio. If we make this value for each region close to one of multiple learnable centers, we centralize all the selected regions to multiple similar density levels, alleviating the large density pattern shift and thus improving the prediction accuracy. This motivates us to apply a center loss with multipolar centers to this relative density level. Put simply, we centralize all the selected regions into multiple centers according to their average density, which acts as an unsupervised clustering.

We first initialize the centers with increasing random values for increasingly dense regions. Then, for each center, we follow the standard center-loss procedure and update the center at each iteration from the regions assigned to it. The update involves the number of regions that will be centralized to the $j$-th center in the underlying image, the average density and the scaling ratio of the $i$-th such region, and a learning rate for updating each center. During each iteration, we use the selected dense regions to compute the center loss with multiple centers and update the network parameters as well as the centers. The supervision provided by the center loss with multiple centers is the key to bringing all the selected regions to multiple similar density levels, leading to robust density estimation.
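The center update can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes that the relative density level of a region is its average density divided by the square of its scaling ratio, that each region is already assigned to one center, and that a center moves toward the mean of its assigned values at a rate `lr`. Only the centers are updated here, whereas in the paper the network parameters are updated as well. The names `update_centers` and `center_loss` are hypothetical.

```python
def update_centers(centers, values, assignments, lr=0.5):
    """One online update of the multipolar centers.

    centers: current center values, one per density level.
    values: relative density level of each selected region
            (assumed here: average density / scaling_ratio ** 2).
    assignments: index of the center each region is centralized to.
    """
    new_centers = list(centers)
    for j, center in enumerate(centers):
        members = [v for v, a in zip(values, assignments) if a == j]
        if members:  # centers without regions in this image stay unchanged
            mean_offset = sum(v - center for v in members) / len(members)
            new_centers[j] = center + lr * mean_offset
    return new_centers


def center_loss(centers, values, assignments):
    """Sum of squared distances between each region's value and its center."""
    return sum((v - centers[a]) ** 2 for v, a in zip(values, assignments))


average_density = [0.8, 1.0, 4.0]
scaling_ratio = [2.0, 2.0, 2.0]
values = [d / r ** 2 for d, r in zip(average_density, scaling_ratio)]
assignments = [0, 0, 1]
centers = [0.1, 0.5]

loss_before = center_loss(centers, values, assignments)
centers = update_centers(centers, values, assignments, lr=0.5)
loss_after = center_loss(centers, values, assignments)
print(centers, loss_before, loss_after)
```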

3.4 Training objective

The whole network is end-to-end trainable and involves three loss functions: 1) an L2 loss on the initial density map prediction; 2) an L2 loss on the density map re-prediction for the selected regions, computed between the re-predicted density map on each scaled selected region and its transformed ground truth; and 3) a center loss on the relative density level of the selected regions, which penalizes the distance between the relative density level of each selected region and its assigned center.

The final loss function for the whole network, Eq. (3), is the combination of the above three losses weighted by two hyperparameters. Note that we optimize the loss function in Eq. (3) to update not only the overall network parameters but also the centers.
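One way to write an objective consistent with this description is sketched below. All symbols are assumptions introduced for this sketch: $\hat{D}_0$ and $D$ are the initial prediction and the ground-truth density map, $\mathcal{S}$ is the set of selected regions, $\hat{D}'_k$ and $D'_k$ are the re-predicted and transformed ground-truth density maps on the scaled region $R'_k$, $x_k$ is the relative density level of region $k$, $c_{j(k)}$ is the center it is assigned to, and $\lambda_1$, $\lambda_2$ are the two hyperparameters. Which term each hyperparameter weights, and the absence of normalization constants, are also assumptions.

```latex
\begin{aligned}
L_{\mathrm{init}} &= \sum_{p} \bigl(\hat{D}_0(p) - D(p)\bigr)^2, \\
L_{\mathrm{re}}   &= \sum_{k \in \mathcal{S}} \sum_{p \in R'_k} \bigl(\hat{D}'_k(p) - D'_k(p)\bigr)^2, \\
L_{\mathrm{c}}    &= \sum_{k \in \mathcal{S}} \bigl(x_k - c_{j(k)}\bigr)^2, \\
L                 &= L_{\mathrm{init}} + \lambda_1 L_{\mathrm{re}} + \lambda_2 L_{\mathrm{c}}.
\end{aligned}
```

Under this reading, setting $\lambda_2 = 0$ removes the center loss, so the scaling ratios would be learned without any specific supervision.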

Method              Part A          Part B          UCF_CC_50       UCF-QNRF
                    MAE     MSE     MAE     MSE     MAE     MSE     MAE     MSE
MCNN [33]           110.2   173.2   26.4    41.3    377.6   509.1   277     -
CMTL [27]           101.3   152.4   20.0    31.1    322.8   397.9   252     514
Switch-CNN [23]     90.4    135.0   21.6    33.4    318.1   439.2   228     445
CP-CNN [28]         73.6    112.0   20.1    30.1    298.8   320.9   -       -
ACSCP [24]          75.7    102.7   17.2    27.4    291.0   404.6   -       -
L2R [18]            73.6    112.0   13.7    21.4    279.6   388.9   -       -
D-ConvNet-v1 [25]   73.5    112.3   18.7    26.0    288.4   404.7   -       -
CSRNet [15]         68.2    115.0   10.6    16.0    266.1   397.5   -       -
ic-CNN [22]         69.8    117.3   10.7    16.0    260.9   365.5   -       -
SANet [3]           67.0    104.5   8.4     13.6    258.4   334.9   -       -
CL [11]             -       -       -       -       -       -       132     191
VGG16 (ours)        72.9    114.5   12.1    20.5    225.4   372.5   120.6   205.2
SPN (ours)          70.0    106.3   9.1     14.6    204.7   340.4   110.3   184.6
SPN+L2SM (ours)     64.2    98.4    7.2     11.1    188.4   315.3   104.7   173.6
Table 1: Quantitative comparison of the proposed method with state-of-the-art methods on three widely adopted datasets.
Method           SPN      L2SM (G=3)   L2SM (G=4)   L2SM/S2AD (G=5)
                                                    1 center      2 centers     3 centers     4 centers     5 centers
MAE              70.0     65.1         66.1         67.2/68.9     65.4/68.1     64.2/67.0     67.1/69.2     69.8/73.6
MSE              106.3    100.4        103.5        102.3/110.3   100.7/107.3   98.4/105.4    101.6/108.7   104.5/113.5
Cost time (s)    0.524    0.576        0.569        0.539/0.540   0.550/0.551   0.565/0.563   0.583/0.580   0.592/0.587
Table 2: Ablation study on different settings of dense region selection, number of centers, and different ways of learning to scale: the proposed learning to scale module (L2SM) and straightforwardly scaling to the average density (S2AD).
setting MAE MSE
68.0 107.1
67.2 106.3
67.9 106.9
68.5 109.1
Table 3: Ablation study on image domain division for selecting dense region to re-predict under one center setting.
Method              Part A → Part B     Part B → Part A     Part A → UCF_CC_50  UCF-QNRF → Part A   Part A → UCF-QNRF
                    MAE     MSE         MAE     MSE         MAE     MSE         MAE     MSE         MAE     MSE
MCNN [33]           85.2    142.3       221.4   357.8       397.7   624.1       -       -           -       -
D-ConvNet-v1 [25]   49.1    99.2        140.4   226.1       364     545.8       -       -           -       -
L2R [18]            -       -           -       -           337.6   434.3       -       -           -       -
SPN (ours)          23.8    44.2        131.2   219.3       368.3   588.4       87.9    126.3       236.3   428.4
SPN+L2SM (ours)     21.2    38.7        126.8   203.9       332.4   425.0       73.4    119.4       227.2   405.2
Table 4: Cross-dataset experiments on the ShanghaiTech, UCF_CC_50, and UCF-QNRF datasets for assessing the transferability of different methods.

4 Experiments

4.1 Datasets and Evaluation Metrics

We conduct experiments on three widely adopted benchmark datasets, ShanghaiTech [33], UCF_CC_50 [9], and UCF-QNRF [11], to demonstrate the effectiveness of the proposed method. These three datasets and the adopted evaluation metrics are briefly described in the following.

ShanghaiTech Dataset. The ShanghaiTech crowd counting dataset [33] consists of 1198 annotated images with a total of 330,165 people, divided into two parts. Part A contains 482 images which are randomly crawled from the Internet, among which 300 images are used for training and the remaining 182 images are used for testing. Part B includes 716 images which are taken from the busy streets of metropolitan area in Shanghai, among which 400 images are used for training and 316 images for testing. The dataset also provides annotation in terms of coordinates of people heads in each image.

UCF_CC_50 Dataset. This dataset is a collection of 50 images of very crowded scenes [9], containing a significantly varied number of people, ranging from 94 to 4543 per image. The large variance in the total number of people and the small number of images make this dataset very challenging. Following classical benchmarks on this dataset, we use 5-fold cross-validation to evaluate the performance of our method.

UCF-QNRF Dataset. The UCF-QNRF dataset is the largest dataset to date [11], containing 1535 images which are divided into 1201 training and 334 testing images. The number of people in an image varies from 49 to 12,865, so this dataset features huge density variation. Furthermore, the images in this dataset also vary widely in resolution.

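Evaluation metrics. We employ two standard metrics, i.e., Mean Absolute Error (MAE) and Mean Squared Error (MSE). MAE and MSE are defined as

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - \hat{C}_i \right|, \qquad \mathrm{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( C_i - \hat{C}_i \right)^2},$$

where $C_i$ (resp. $\hat{C}_i$) represents the ground-truth (resp. estimated) number of pedestrians in the $i$-th image, and $N$ is the total number of test images.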
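The metrics can be computed as in the short sketch below. It assumes the convention common in crowd counting that the reported "MSE" is the square root of the mean squared count error, and it also includes the Mean Relative Error used in Fig. 1 (b), computed as MAE divided by the average ground-truth count. The counts are made-up toy values, not results from the paper.

```python
import math


def mae(ground_truth, estimated):
    """Mean Absolute Error over per-image counts."""
    return sum(abs(g - e) for g, e in zip(ground_truth, estimated)) / len(ground_truth)


def mse(ground_truth, estimated):
    """'MSE' as commonly reported in crowd counting: root of the mean squared error."""
    return math.sqrt(
        sum((g - e) ** 2 for g, e in zip(ground_truth, estimated)) / len(ground_truth)
    )


def mre(ground_truth, estimated):
    """Mean Relative Error: MAE divided by the average ground-truth count."""
    average_count = sum(ground_truth) / len(ground_truth)
    return mae(ground_truth, estimated) / average_count


ground_truth_counts = [100, 200, 300]  # toy values for illustration only
estimated_counts = [110, 190, 330]

mae_value = mae(ground_truth_counts, estimated_counts)
mse_value = mse(ground_truth_counts, estimated_counts)
mre_value = mre(ground_truth_counts, estimated_counts)
print(mae_value, mse_value, mre_value)
```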

4.2 Implementation Details

We follow the setting in [15] to generate the ground-truth density maps. For a given dataset, we first evenly divide the regions of all the images in the dataset into $G$ groups of increasing density, and then attempt to centralize the densest groups of regions to similar density levels, with one center of the center loss per group. In the following, unless otherwise specified, $G$ is set to 5 and the number of centers is set to 3 for all involved datasets except UCF_CC_50. Since images in the UCF_CC_50 dataset contain crowded people over the whole image domain, we centralize all groups of regions for that dataset. Unless otherwise specified, the hyperparameter $K$ that controls the division of each image into $K \times K$ regions is set to 4.
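The dataset-level grouping can be sketched as follows. This is an illustrative reading of the description, not the authors' code: regions are sorted by average density, split evenly into `num_groups` groups, and the lower bound of the sparsest selected group serves as the selection threshold. The density values are made-up toy numbers, and the function names are hypothetical.

```python
def density_group_thresholds(region_densities, num_groups):
    """Evenly split regions (sorted by average density) into num_groups groups
    and return the lower density bound of each group, from sparse to dense."""
    ordered = sorted(region_densities)
    n = len(ordered)
    return [ordered[g * n // num_groups] for g in range(num_groups)]


def assign_group(density, thresholds):
    """Index of the densest group whose lower bound does not exceed the density."""
    group = 0
    for g, lower_bound in enumerate(thresholds):
        if density >= lower_bound:
            group = g
    return group


# Toy average densities of regions collected over a whole dataset.
densities = [0.01, 0.02, 0.05, 0.1, 0.2, 0.4, 0.8, 1.0, 1.5, 3.0]
num_groups, num_centers = 5, 3

thresholds = density_group_thresholds(densities, num_groups)
# Only the num_centers densest groups are centralized and re-predicted.
selection_threshold = thresholds[num_groups - num_centers]
selected = [d for d in densities if d >= selection_threshold]
print(thresholds, selection_threshold, len(selected))
```

The same threshold can then be reused at inference time to decide which regions of a new image are re-predicted.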

The loss function described in Eq. (3) is used for model training. We set one of the two hyperparameters in Eq. (3) to 1 and discuss the impact of the other, the weight of the center loss, in the following. We use the Adam [12] optimizer to optimize the whole architecture, with the learning rate initialized to 1e-4. When training on the UCF-QNRF dataset, which contains images of very high resolution, we first down-sample the images whose resolution is larger than 1080p. Then we divide each image into patches and combine them into a tensor with batch size equal to 4. When training on the other datasets, we directly input the whole image to our network.

During inference, we first generate an initial density map for the whole input image, and then select dense regions from the image division based on the average initial density of each region. If the average density of a region is larger than the pre-defined value used for selecting the top groups of regions in training, we replace its initial density map prediction with the scaled re-prediction.

The proposed method is implemented in PyTorch [20]. All experiments are carried out on a workstation with an Intel Xeon 16-core CPU (3.5GHz), 64GB RAM, and a single Titan Xp GPU.

Figure 5: Ablation study on the effect of weight of the center loss under one center and on whether using ground-truth transformation when scaling for re-prediction.
Figure 6: Qualitative visualization of predicted density map on two examples. From left to right: original image, prediction given by SPN, re-predicted density map with L2SM on selected regions (englobed by black boxes), and ground-truth density map.

4.3 Experimental Comparisons

We evaluate the proposed method on ShanghaiTech dataset, UCF_CC_50, and UCF-QNRF datasets. The proposed method outperforms all the other competing methods on all the benchmarks. The quantitative comparison with the state-of-the-art methods on these three datasets is depicted in Table 1.

ShanghaiTech. The proposed method outperforms the state-of-the-art method SANet [3] by 2.8 MAE and 6.1 MSE on ShanghaiTech Part A, and by 1.2 MAE and 2.5 MSE on ShanghaiTech Part B. Table 1 also shows that L2SM improves the performance of our SPN baseline by 5.8 MAE and 7.9 MSE on ShanghaiTech Part A, and by 1.9 MAE and 3.5 MSE on ShanghaiTech Part B. ShanghaiTech Part A contains more crowded images than ShanghaiTech Part B, and its density distribution varies more significantly. This may explain why the improvement brought by the proposed L2SM is larger on ShanghaiTech Part A than on ShanghaiTech Part B.

UCF_CC_50. We then compare the proposed method with other related methods on UCF_CC_50 dataset. To the best of our knowledge, UCF_CC_50 dataset is currently the most dense dataset publicly available for crowd counting. The proposed method achieves significant improvement over the state-of-the-art methods. Precisely, the proposed method improves SANet [3] from 258.4 MAE to 188.4 MAE, and from 334.9 MSE to 315.3 MSE.

UCF-QNRF. We have also conducted experiments on the recent UCF-QNRF dataset, which contains images of significantly varied density distribution and resolution. By limiting the maximal image size, our VGG16 baseline already achieves state-of-the-art performance. The proposed SPN brings an improvement of 10.3 MAE and 20.6 MSE compared with the VGG16 baseline. The proposed L2SM further boosts the performance by 5.6 MAE and 11.0 MSE.
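The absolute improvements above can also be expressed as relative MAE gains over the best previously reported result on each dataset. The sketch below only re-derives these percentages from the MAE values in Table 1; the choice of the strongest prior method per dataset (SANet [3] on ShanghaiTech and UCF_CC_50, CL [11] on UCF-QNRF) is read off that table.

```python
def relative_gain(previous_best, ours):
    """Relative error reduction in percent."""
    return 100.0 * (previous_best - ours) / previous_best


# (best prior MAE, SPN+L2SM MAE), taken from Table 1.
mae_pairs = {
    "ShanghaiTech Part A": (67.0, 64.2),   # SANet [3]
    "ShanghaiTech Part B": (8.4, 7.2),     # SANet [3]
    "UCF_CC_50": (258.4, 188.4),           # SANet [3]
    "UCF-QNRF": (132.0, 104.7),            # CL [11]
}

gains = {name: round(relative_gain(*pair), 1) for name, pair in mae_pairs.items()}
for name, gain in gains.items():
    print(f"{name}: {gain}% relative MAE reduction")
```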

4.4 Ablation Study

The ablation studies are mainly conducted on the ShanghaiTech Part A dataset, as it is a moderate dataset, neither too dense nor too sparse, and covers a diverse range of crowd counts.

Effectiveness of different learning to scale settings. For the learning to scale process, we first evenly divide the regions of all images in a dataset into $G$ groups of increasing density, and then attempt to centralize the densest groups of regions to similar density levels. As shown in Table 2, the number of groups and the number of centers are important for accurate counting. For a fixed number of groups (e.g., $G = 5$), centralizing more regions at first leads to a slightly improved counting result. Yet, when we attempt to centralize every image region, we also re-predict the density map for very sparse or background regions, which brings more background noise and thus yields slightly decreased performance. A relatively finer group division with a proper number of centers performs slightly better. As depicted in Table 2, the proposed learning to scale using multipolar center loss performs much better than straightforwardly scaling to the average density (S2AD) in each group.

Time overhead. To analyze the time overhead of the proposed L2SM, we conduct experiments under seven different settings (see Table 2). The time overhead is measured as the average inference time over the whole ShanghaiTech Part A test set. The batch size is set to 1 and only one Titan-X GPU is used during inference. The average runtime of SPN is about 0.524s per image. When we increase the number of centers and the number of regions to be re-predicted, the runtime slightly increases. When using 5 centers and re-predicting all the regions, the proposed L2SM increases the runtime by 0.068s per image (about 13% of the SPN runtime), which is a small fraction of the whole runtime.

Effectiveness of the weight of the center loss. We study the effectiveness of the center loss on ShanghaiTech Part A using one center by changing its weight in Eq. (3). Note that when the weight is set to 0, the center loss is not used, which means that the scale ratio is learned automatically without any specific supervision. As shown in Fig. 5, using the center loss to bring regions of significantly varied density distributions to similar density levels plays an important role in improving the counting accuracy. It is also worth noting that the performance improvement is rather stable over a wide range of center loss weights.

Effectiveness of the ground-truth transformation. We also study the effect of the ground-truth transformation involved in the scale-to-re-predict process. As shown in Fig. 5, the ground-truth transformation that enlarges the distance between crowded heads (WT TransedGT) is more accurate than straightforwardly scaling the ground-truth density map (WO TransedGT). This is as expected, since enlarging the distance between crowded heads results in regular Gaussian density blobs for dense regions, which reduces the density pattern shift and thus facilitates the density map prediction.

Effectiveness of the division. We also conduct experiments by varying the image domain division. As depicted in Table 3, the performance is rather stable across different image domain divisions.

4.5 Evaluation of Transferability

To demonstrate the transferability of the proposed method across datasets, we conduct experiments under cross dataset settings, where the model is trained on the source domain and tested on the target domain.

The cross dataset experimental results are presented in Table 4. We can observe that the proposed method generalizes well to unseen datasets. In particular, the proposed method consistently outperforms the state-of-the-art method in [25] and MCNN [33] by a large margin. The proposed method also performs slightly better than the method in [18] when transferring models trained on ShanghaiTech Part A to UCF_CC_50. Yet, the improvement over [18] is not as significant as the improvement over [33, 25] when transferring between ShanghaiTech Part A and Part B. This is probably because the method in [18] also relies on extra data, which may help to reduce the gap between the two datasets. As depicted in Table 4, the proposed L2SM plays an important role in ensuring the transferability of the proposed method. Furthermore, as depicted in Table 1 and Table 4, the proposed method under cross-dataset settings performs competitively with, or even outperforms, some methods [23, 28, 27, 33] trained on the proper training set. This also confirms the generalizability of the proposed method.
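The cross-dataset protocol amounts to running a model trained on the source dataset, unchanged, over the target test set and scoring the predicted counts. The sketch below assumes the metrics commonly reported in crowd counting, namely mean absolute error and root mean squared error of the per-image counts. The function name and the `(image, ground_truth_count)` sample layout are hypothetical.

```python
import math

def evaluate_counts(predict_count, samples):
    """Return (MAE, root MSE) of predicted counts over (image, gt_count) pairs.

    predict_count is any callable mapping an image to a predicted count,
    e.g. a model trained on the source dataset and applied to the target one.
    """
    errors = [predict_count(image) - gt for image, gt in samples]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse
```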

5 Conclusion

In this paper, we propose a Learn to Scale Module (L2SM) to tackle the problem of large density variation in crowd counting. We achieve density centralization by a novel use of multipolar center loss. The L2SM can effectively learn to scale significantly varied dense regions to multiple similar density levels, making the density estimation on dense regions more robust. Extensive experiments on three challenging datasets demonstrate that the proposed method achieves consistent and significant improvements over the state-of-the-art methods. L2SM also shows noteworthy generalization ability to unseen datasets with significantly varied density distributions, demonstrating its effectiveness in real applications.


This work was supported in part by the National Key Research and Development Program of China under Grant 2018YFB1004600, in part by NSFC 61703171, and in part by NSF of Hubei Province of China under Grant 2018CFB199. Dr. Yongchao Xu was supported by the Young Elite Scientists Sponsorship Program by CAST.


  • [1] D. Babu Sam, N. N. Sajjan, R. Venkatesh Babu, and M. Srinivasan. Divide and grow: Capturing huge diversity in crowd images with incrementally growing cnn. In CVPR, pages 3618–3626, 2018.
  • [2] L. Boominathan, S. S. Kruthiventi, and R. V. Babu. Crowdnet: A deep convolutional network for dense crowd counting. In ACM-MM, pages 640–644, 2016.
  • [3] X. Cao, Z. Wang, Y. Zhao, and F. Su. Scale aggregation network for accurate and efficient crowd counting. In ECCV, pages 734–750, 2018.
  • [4] A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos. Privacy preserving crowd monitoring: Counting people without people models or tracking. In CVPR, pages 1–7, 2008.
  • [5] P. Chattopadhyay, R. Vedantam, R. R. Selvaraju, D. Batra, and D. Parikh. Counting everyday objects in everyday scenes. In CVPR, pages 1135–1144, 2017.
  • [6] K. Chen, C. C. Loy, S. Gong, and T. Xiang. Feature mining for localised crowd counting. In BMVC, volume 1, page 3, 2012.
  • [7] P. Dollar, C. Wojek, B. Schiele, and P. Perona. Pedestrian detection: An evaluation of the state of the art. TPAMI, 34(4):743–761, 2012.
  • [8] W. Ge and R. T. Collins. Marked point processes for crowd counting. In CVPR, pages 2913–2920, 2009.
  • [9] H. Idrees, I. Saleemi, C. Seibert, and M. Shah. Multi-source multi-scale counting in extremely dense crowd images. In CVPR, pages 2547–2554, 2013.
  • [10] H. Idrees, K. Soomro, and M. Shah. Detecting humans in dense crowds using locally-consistent scale prior and global occlusion reasoning. TPAMI, 37(10):1986–1998, 2015.
  • [11] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, and M. Shah. Composition loss for counting, density map estimation and localization in dense crowds. In ECCV, 2018.
  • [12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2014.
  • [13] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • [14] V. Lempitsky and A. Zisserman. Learning to count objects in images. In NIPS, pages 1324–1332, 2010.
  • [15] Y. Li, X. Zhang, and D. Chen. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In CVPR, pages 1091–1100, 2018.
  • [16] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin. Crowd counting using deep recurrent spatial-aware network. IJCAI, 2018.
  • [17] W. Liu, M. Salzmann, and P. Fua. Context-aware crowd counting. In CVPR, pages 5099–5108, 2019.
  • [18] X. Liu, J. van de Weijer, and A. D. Bagdanov. Leveraging unlabeled data for crowd counting by learning to rank. In CVPR, 2018.
  • [19] D. Onoro-Rubio and R. J. López-Sastre. Towards perspective-free object counting with deep learning. In ECCV, pages 615–629, 2016.
  • [20] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
  • [21] V.-Q. Pham, T. Kozakaya, O. Yamaguchi, and R. Okada. Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation. In ICCV, pages 3253–3261, 2015.
  • [22] V. Ranjan, H. Le, and M. Hoai. Iterative crowd counting. In ECCV, 2018.
  • [23] D. B. Sam, S. Surya, and R. V. Babu. Switching convolutional neural network for crowd counting. In CVPR, volume 1, page 6, 2017.
  • [24] Z. Shen, Y. Xu, B. Ni, M. Wang, J. Hu, and X. Yang. Crowd counting via adversarial cross-scale consistency pursuit. In CVPR, pages 5245–5254, 2018.
  • [25] Z. Shi, L. Zhang, Y. Liu, X. Cao, Y. Ye, M.-M. Cheng, and G. Zheng. Crowd counting with deep negative correlation learning. In CVPR, pages 5382–5390, 2018.
  • [26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [27] V. A. Sindagi and V. M. Patel. Cnn-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In AVSS, pages 1–6, 2017.
  • [28] V. A. Sindagi and V. M. Patel. Generating high-quality crowd density maps using contextual pyramid cnns. In ICCV, 2017.
  • [29] P. Viola, M. J. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. IJCV, 63(2):153–161, 2005.
  • [30] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, pages 499–515, 2016.
  • [31] F. Xiong, X. Shi, and D.-Y. Yeung. Spatiotemporal modeling for crowd counting in videos. In ICCV, pages 5161–5169, 2017.
  • [32] C. Zhang, H. Li, X. Wang, and X. Yang. Cross-scene crowd counting via deep convolutional neural networks. In CVPR, pages 833–841, 2015.
  • [33] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma. Single-image crowd counting via multi-column convolutional neural network. In CVPR, pages 589–597, 2016.