1 Introduction
Counting is the process of estimating the number of instances of a particular object. With the expansion of the urban population and the convenience of modern transportation, large crowds are common at specific events or in specific scenarios, and crowd counting from images or videos has become crucial for applications ranging from traffic control to public safety.
Previous methods of crowd counting may be roughly divided into two categories: detection-based and regression-based. Detection-based methods have been studied with pedestrian detectors [1, 2]. However, it is challenging for these methods to model very dense crowds or crowds in a clustered environment. Regression-based approaches were first proposed in [3]. With the recent development of convolutional neural networks (CNNs), the regression framework based on density map estimation has been widely used. Compared with systems employing a single CNN regressor (e.g., [4]), networks with multiple columns/branches learn more contextual information and achieve state-of-the-art performance [5, 6, 7, 8]. Even though different receptive fields are used in the multiple branches, it remains difficult to represent highly variable crowd images. As shown in Fig. 1, gaps still exist between the ground truth and the prediction for some crowd images. From the figure we also observe that images from similar scenarios seem to share the same prediction pattern: images with lower camera viewpoints and more background usually yield predicted counts smaller than the ground truth (Fig. 1, left), while those with high viewpoints yield larger predicted values (Fig. 1, right).
The central issue addressed in this paper is the following: can we design a model that discovers the scenarios and models the crowd images simultaneously? One intuitive idea is to increase the number of network branches with well-designed convolution filters. The limitations are that such a CNN model is difficult to train on the current crowd counting datasets, and that it is hard to define the scenarios directly. In this paper, we present an adaptive scenario discovery framework for crowd counting. Our network adopts the VGG model [9] as the backbone and is structured with two parallel pathways that are trained with different receptive field sizes to serve different scales and crowd densities. We treat a scenario as a linear combination of the two pathways with discretized weights, and design a third adaption branch to learn these scenario-aware responses and discover the scenarios implicitly. Our contributions are summarized as follows.

From the perspective of scenario discovery, a novel adaptive framework for crowd counting is proposed. Unlike previous multi-column/multi-branch frameworks, ours can represent highly variable crowd images with only two branches by incorporating discretized pathway-wise responses.
1.1 Related Work
Numerous efforts have been devoted to the design of crowd counting models. A detailed survey of recent progress can be found in [10]. In this section, we mainly discuss the literature on models with multi-branch representations, which are most closely related to this work. In [5], Zhang et al. proposed the MCNN, which uses three columns of convolutional neural networks with filters of different sizes. Sam et al. [6] proposed the Switching-CNN, which decouples the three columns into separate CNNs (each trained with a subset of the patches) and uses a density selector to exploit their structural and functional differences. Several works have studied the context information of crowd images under the multi-branch setting. For instance, Sindagi et al. [7] applied local and global context coding to crowd density estimation, and Zhang et al. [11] proposed a scale-adaptive CNN architecture with a backbone of fixed small receptive fields. Another work related to ours is CSRNet [8], where convolutional layers with dilation operations are employed after the backbone of a pre-trained deep model.
These existing approaches construct density estimation models with multiple branches to represent different receptive fields or scales. Our framework also follows this general process, with one branch designed for dense predictions and another for relatively sparse crowds. However, instead of using fixed branch weights or explicitly selecting one column, we learn the branch weights. The responses of the dense and sparse pathways are adaptively recalibrated by a third branch, which explicitly models the interdependencies between the pathways. Moreover, with the discretization of these pathway-wise responses, the crowd scenarios are implicitly discovered, and the framework responds to different crowd images in a highly scenario-specific manner. The whole framework can be trained end-to-end and, as will be shown in the experiments, it is more accurate than previous approaches.
2 Framework
The overall architecture of our framework is illustrated in Fig. 2. We start by introducing the design of adaptive scenario discovery, followed by implementation details of the framework.
2.1 Adaptive Scenario Discovery
The selection of a suitable network structure is important to the success of a crowd counting system. There are generally two categories of networks: those with a newly designed structure learned from scratch (e.g., [5, 12]), and those transferred from part of a pre-trained network (e.g., [8, 13]). Our framework belongs to the second category: it employs the convolutional layers of a VGG-16 model [9] pre-trained on the ImageNet dataset [14] and fine-tunes them with crowd images. We choose this strategy for the outstanding performance of the model in crowd counting as well as in other computer vision tasks, and the results in the evaluation also confirm the effectiveness of the pre-trained model.
Our counting network consists of two parallel pathways after the backbone module. The first pathway starts with a deconvolution layer that amplifies the inputs, followed by a few convolutional layers with larger receptive fields and a max pooling layer. This pathway is designed to model highly congested scenarios with dense crowds, while the second pathway targets sparse scenarios. The convolution filters in this subnet have a size of . Note that the notions of dense and sparse are relative, and both pathways can output a density map.
There are several approaches to fusing the density maps; here we use a dynamic weighting strategy. Inspired by the excitation operation in SENet [15], we propose the adaption branch. The outputs of the last convolutional layer in the backbone go through global average pooling and two fully connected layers to produce an initial response . We expect this response to adaptively recalibrate the weights of the dense and sparse pathways; therefore we normalize it into the interval [0,1) with the following formula:
(1) 
Experiments in Sec. 3.2 will show the effect compared with a single branch or average fusion. However, we find that this architecture converges slowly, probably because the response is continuous while the crowd counting datasets are small.
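As a rough illustration of the adaption branch described above, the following sketch pools the backbone features, passes them through two fully connected layers, and squashes the result into a fusion weight. Eq. (1) is not reproduced in this text, so the logistic sigmoid, the ReLU between the two layers, the complementary weighting with `weight` and `1 - weight`, and all function names are assumptions made for illustration, not the exact formulation of the framework.

```python
import math


def global_average_pool(feature_maps):
    """Reduce each channel (a 2-D list of floats) to its mean value."""
    return [
        sum(sum(row) for row in channel) / (len(channel) * len(channel[0]))
        for channel in feature_maps
    ]


def fully_connected(inputs, weights, biases):
    """Dense layer: one output per row of `weights`."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def sigmoid(value):
    """Numerically stable logistic function (assumed stand-in for Eq. (1))."""
    if value >= 0:
        return 1.0 / (1.0 + math.exp(-value))
    exp_value = math.exp(value)
    return exp_value / (1.0 + exp_value)


def adaption_weight(feature_maps, fc1, fc2):
    """Map backbone features to a single pathway weight between 0 and 1.

    `fc1` and `fc2` are (weights, biases) pairs; the ReLU in between is an
    assumption borrowed from the SENet excitation block.
    """
    pooled = global_average_pool(feature_maps)
    hidden = [max(0.0, h) for h in fully_connected(pooled, *fc1)]
    response = fully_connected(hidden, *fc2)[0]  # initial scalar response
    return sigmoid(response)


def fuse(dense_map, sparse_map, weight):
    """Weighted sum of the two pathway density maps (assumed convention)."""
    return [
        [weight * d + (1.0 - weight) * s for d, s in zip(d_row, s_row)]
        for d_row, s_row in zip(dense_map, sparse_map)
    ]
```

With all layer parameters set to zero the response is 0, the weight is 0.5, and `fuse` reduces to plain average fusion, which is the baseline the dynamic weighting is compared against.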
Our solution is to divide the response value into bins, borrowing the idea from traditional visual features such as the color histogram [16], SIFT [17], and HoG [18]. The benefits of discretization are twofold. First, the model is easier to train and converges faster. Second, similar attributes are clearly observed among the images within the same bin (see Fig. 5), indicating that the discretization operation is able to discover the dynamic scenarios implicitly.
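The binning step can be sketched as follows. This is only an illustration: the equal-width bins, the use of the bin centre as the quantized weight, and the names `bin_index` and `quantize` are assumptions, since the text does not specify these details.

```python
def bin_index(response, num_bins):
    """Return the index of the equal-width bin that contains `response`.

    `response` is expected to lie in [0, 1]; the value 1.0 is folded into
    the last bin so that every response receives a valid index.
    """
    if not 0.0 <= response <= 1.0:
        raise ValueError("response must lie in [0, 1]")
    return min(int(response * num_bins), num_bins - 1)


def quantize(response, num_bins):
    """Replace a continuous response by the centre of its bin (assumed)."""
    return (bin_index(response, num_bins) + 0.5) / num_bins
```

For example, with 10 bins two images whose responses are 0.31 and 0.38 fall into the same bin and therefore share one quantized weight, which is how images with similar responses end up grouped into one scenario.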
2.2 Implementation Details
Ground Truth Generation. We follow [8] to generate the density maps from the ground truth. The density map is generated with the formula:
(2) 
where is a targeted object in the ground truth and is a Gaussian kernel with standard deviation of . For the datasets with highly congested scenes (such as ShanghaiTech Part A [5] and UCF_CC_50 [3]), is defined as a geometry-adaptive kernel with . Here is the average distance of nearest neighbors of targeted object . For less congested scenes (i.e., ShanghaiTech Part B [5]), we set .
Training Details. We define the loss function as follows:
(3) 
where is the ground truth density map of image from Equ. (2) and is the estimated density map of with the parameters learned by the proposed network.
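The ground-truth generation and the loss described in this subsection can be sketched as below. Eqs. (2) and (3) are not reproduced in this text, so the sketch assumes the usual density-map recipe of one normalized Gaussian per annotated head and a Euclidean loss averaged over images and halved. The parameters `k` and `beta`, the 3-sigma kernel truncation, and all function names are hypothetical; the source does not state their values.

```python
import math


def adaptive_sigma(points, index, k, beta):
    """Geometry-adaptive spread: beta times the mean distance from
    points[index] to its k nearest annotated neighbours.

    `k` and `beta` are placeholders; the source does not state their values.
    """
    distances = sorted(
        math.dist(points[index], other)
        for j, other in enumerate(points)
        if j != index
    )
    nearest = distances[:k]
    if not nearest:
        raise ValueError("at least two annotated points are required")
    return beta * sum(nearest) / len(nearest)


def add_gaussian(density, center, sigma):
    """Add one Gaussian blob of unit mass around `center` = (row, col)."""
    height, width = len(density), len(density[0])
    radius = max(1, int(3 * sigma))  # truncate the kernel at about 3 sigma
    row0, col0 = center
    cells = [
        (r, c, math.exp(-((r - row0) ** 2 + (c - col0) ** 2) / (2 * sigma ** 2)))
        for r in range(row0 - radius, row0 + radius + 1)
        for c in range(col0 - radius, col0 + radius + 1)
        if 0 <= r < height and 0 <= c < width
    ]
    total = sum(value for _, _, value in cells)
    for r, c, value in cells:
        density[r][c] += value / total  # normalise so each person sums to 1


def density_map(points, height, width, sigmas):
    """Sum one Gaussian per annotated head; `sigmas` holds one spread per point."""
    density = [[0.0] * width for _ in range(height)]
    for center, sigma in zip(points, sigmas):
        add_gaussian(density, center, sigma)
    return density


def euclidean_loss(predicted_maps, target_maps):
    """Squared L2 distance between maps, averaged over images and halved."""
    total = 0.0
    for predicted, target in zip(predicted_maps, target_maps):
        total += sum(
            (p - t) ** 2
            for p_row, t_row in zip(predicted, target)
            for p, t in zip(p_row, t_row)
        )
    return total / (2 * len(target_maps))
```

Because every blob is normalized to unit mass, the sum over a generated density map equals the number of annotated people, which is what makes the map usable as a counting target. For the less congested dataset a single fixed spread would be passed for every point instead of calling `adaptive_sigma`.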
To preserve the spatial features and the context of the crowd images, we do not extract image patches for data augmentation, nor do we apply any additional image copying or conversion. During training, we employ stochastic gradient descent (SGD) for its good generalization ability.
3 Evaluations
We conduct the experiments on the ShanghaiTech dataset [5] and the UCF_CC_50 dataset [3]. The ShanghaiTech dataset [5] is divided into Part A and Part B. ShanghaiTech Part A contains 482 crowd images (300 for training and 182 for testing), with an average of 501 pedestrians per image. ShanghaiTech Part B contains 716 images (400 for training and 316 for testing). The resolution of the images is fixed at pixels, and the number of pedestrians is generally smaller than in Part A, with an average of 123. The UCF_CC_50 dataset [3] contains 50 images with high crowd density. The number of pedestrians per image varies from 94 to 4,543. For both datasets, we follow the standard experimental protocols, and the mean absolute error (MAE) and mean squared error (MSE) are reported as the evaluation metrics. We implement our framework in PyTorch.
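The two metrics can be sketched as follows. The text does not spell out their formulas; this sketch assumes the convention common in the crowd counting literature, where the value reported as "MSE" is the square root of the mean squared count error, and where the predicted count is the sum of the estimated density map. The function names are illustrative.

```python
import math


def count_from_density(density):
    """The predicted count is the sum over all pixels of the density map."""
    return sum(sum(row) for row in density)


def mean_absolute_error(predicted_counts, true_counts):
    """Average absolute difference between predicted and true counts."""
    errors = [abs(p - t) for p, t in zip(predicted_counts, true_counts)]
    return sum(errors) / len(errors)


def mean_squared_error(predicted_counts, true_counts):
    """Root of the average squared count difference.

    Crowd counting papers commonly report this root under the name "MSE";
    that convention is assumed here.
    """
    squared = [(p - t) ** 2 for p, t in zip(predicted_counts, true_counts)]
    return math.sqrt(sum(squared) / len(squared))
```

MAE reflects the accuracy of the estimates, while the squared term makes MSE more sensitive to occasional large errors, so it is often read as a measure of robustness.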
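Table 1. MAE and MSE on ShanghaiTech Part A, ShanghaiTech Part B, and UCF_CC_50 ("-" means not reported).

| Method             | Part A MAE | Part A MSE | Part B MAE | Part B MSE | UCF_CC_50 MAE | UCF_CC_50 MSE |
|--------------------|------------|------------|------------|------------|---------------|---------------|
| Zhang et al. [19]  | 181.8      | 277.7      | 32.0       | 49.8       | 467.0         | 498.5         |
| MCNN [5]           | 110.2      | 173.2      | 26.4       | 41.3       | 377.6         | 509.1         |
| Cascaded-MTL [20]  | 101.3      | 152.4      | 20.0       | 31.1       | 322.8         | 397.9         |
| Switching-CNN [6]  | 90.4       | 135.0      | 21.6       | 33.4       | 318.1         | 439.2         |
| CP-CNN [7]         | 73.6       | 106.4      | 20.1       | 30.1       | 295.8         | 320.9         |
| Huang et al. [21]  | -          | -          | 20.2       | 35.6       | 409.5         | 563.7         |
| D-ConvNet [13]     | 73.5       | 112.3      | 18.7       | 26.0       | 288.4         | 404.7         |
| ACSCP [12]         | 75.7       | 102.7      | 17.2       | 27.4       | 291.0         | 404.6         |
| DecideNet [22]     | -          | -          | 20.8       | 29.4       | -             | -             |
| SaCNN [11]         | 86.8       | 139.2      | 16.2       | 25.8       | 314.9         | 424.8         |
| CSRNet [8]         | 68.2       | 115.0      | 10.6       | 16.0       | 266.1         | 397.5         |
| ASD [ours]         | 65.6       | 98.0       | 8.5        | 13.7       | 196.2         | 270.9         |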
3.1 Results and Comparison
We first evaluate the overall results of our proposed framework for crowd counting. We compare our framework with several state-of-the-art approaches, including the multi-column CNN with different receptive fields [5], the Switching-CNN that leverages the variation of crowd density [6], and CSRNet [8], a very recent model based on dilated convolutions. The number of grouped scenarios is 15; the effect of this parameter is evaluated in the next subsection. We denote our approach as ASD (Adaptive Scenario Discovery) in the following comparisons.
ShanghaiTech. Table 1 summarizes the MAE and MSE of previous approaches and ours on the datasets. On Part A of ShanghaiTech, we achieve an absolute MAE improvement of 24.8 over Switching-CNN [6] and of 2.6 over the state-of-the-art CSRNet [8]. On Part B, our ASD framework also achieves the best MAE (8.5) and MSE (13.7) among the compared approaches. Fig. 3(a) and (b) illustrate the density maps and the prediction results for some crowd images from Part A and Part B, respectively.
UCF_CC_50. We now report results on the UCF_CC_50 dataset, as summarized in Table 1 and shown in Fig. 3(c). As in the experiments on ShanghaiTech, the ASD framework shows better results than the other approaches. Relative to the state-of-the-art CSRNet [8], we improve the MAE by 26.3% and the MSE by 31.8%, which indicates the low variance of our predictions across images with high crowd density.
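The improvement figures quoted in this subsection can be checked directly against Table 1. The short script below is only an arithmetic check; the helper names are illustrative, and the numbers are copied from the CSRNet [8], Switching-CNN [6], and ASD rows of the table.

```python
def absolute_gain(baseline, ours):
    """Reduction in error, in the same units as the metric."""
    return baseline - ours


def relative_gain(baseline, ours):
    """Reduction in error as a percentage of the baseline error."""
    return 100.0 * (baseline - ours) / baseline


# UCF_CC_50 values copied from Table 1.
csrnet_ucf = {"MAE": 266.1, "MSE": 397.5}
asd_ucf = {"MAE": 196.2, "MSE": 270.9}

for metric in ("MAE", "MSE"):
    gain = relative_gain(csrnet_ucf[metric], asd_ucf[metric])
    print(f"UCF_CC_50 {metric}: {gain:.1f}% lower than CSRNet")

# ShanghaiTech Part A MAE values copied from Table 1.
print(f"Part A MAE vs Switching-CNN: {absolute_gain(90.4, 65.6):.1f}")
print(f"Part A MAE vs CSRNet: {absolute_gain(68.2, 65.6):.1f}")
```

The percentages of 26.3% and 31.8% correspond to a comparison with the CSRNet row. Measured against the lowest previously reported UCF_CC_50 MSE in Table 1 (320.9, CP-CNN [7]), the same function gives a relative gain of about 15.6%.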
3.2 Ablation Study
In this part, we evaluate a few parameters and an alternative implementation of the proposed framework. We report results on ShanghaiTech Part A.
Network Architecture. We first evaluate the effect of the two parallel pathways on the whole framework. Fig. 4 (left) compares different network architectures, including the single pathways and their fusion. Fusion with a fixed pathway weight gives an MAE of 74.1 and an MSE of 114.0, which does not improve on the results of the single pathways. We observe significant performance gains when adding the dynamic pathway-wise responses and the discretization.
The effect of Scenario Discovery. Recall that the discretization on the adaption branch is applied to discover the dynamic scenarios implicitly; here we consider different parameter choices. The output response after the operation of Equ. (1) falls in the interval (0,1) and is divided into 2, 10, 100, and 1000 bins. Note that, owing to the size of the dataset, only a proportion of the bins contain images after model training; the number of scenarios is therefore usually smaller than the number of bins. Discretization with 2 bins can be considered a simplified version of Switching-CNN [6], and our learning strategy still achieves a lower MAE (74.4 vs. 90.4). Without discretization, we obtain an MAE of 69.4, which is not as good as scenario discovery with 15 and 81 scenarios (MAE of 65.6 and 68.7, respectively). Fig. 5 illustrates some crowd images from the different scenarios.
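The relation between the number of bins and the number of discovered scenarios can be illustrated with a small sketch. It assumes equal-width bins and counts a scenario for every bin that receives at least one image; the response values and image names below are made up for illustration and are not taken from the experiments.

```python
from collections import defaultdict


def group_by_scenario(responses, num_bins):
    """Group image ids by the equal-width bin their response falls into.

    `responses` maps an image id to its normalized response in [0, 1).
    Only bins that receive at least one image appear in the result.
    """
    scenarios = defaultdict(list)
    for image_id, response in responses.items():
        index = min(int(response * num_bins), num_bins - 1)
        scenarios[index].append(image_id)
    return dict(scenarios)


# Hypothetical responses for five images (illustrative values only).
responses = {
    "img_a": 0.12,
    "img_b": 0.14,
    "img_c": 0.55,
    "img_d": 0.58,
    "img_e": 0.91,
}

for num_bins in (2, 10, 100):
    occupied = len(group_by_scenario(responses, num_bins))
    print(f"{num_bins} bins -> {occupied} scenarios")
```

In this toy example the five images occupy 2, 3, and 5 bins respectively, so the number of scenarios stays far below the number of bins once the bins outnumber the images, which mirrors the observation above.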
4 Conclusions
In this paper, we have presented a novel architecture for high-density crowd counting. Our approach focuses on the implicit discovery and dynamic modeling of scenarios. In addition, we have reformulated the crowd counting problem as a scenario classification problem, so that the semantic scenario models are combined into prediction subtasks. ASD fuses the outputs of two parallel pathways with different receptive field sizes using dynamically learned weights. Experiments show that our model achieves state-of-the-art results on three crowd counting benchmarks.
References

[1] M. Wang and X. Wang, “Automatic adaptation of a generic pedestrian detector to a specific traffic scene,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2011, pp. 3401–3408.
[2] R. Stewart and M. Andriluka, “End-to-end people detection in crowded scenes,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2325–2333.
[3] H. Idrees, I. Saleemi, C. Seibert, and M. Shah, “Multi-source multi-scale counting in extremely dense crowd images,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 2547–2554.
[4] D. Oñoro-Rubio and R. J. López-Sastre, “Towards perspective-free object counting with deep learning,” in European Conference on Computer Vision (ECCV), 2016, pp. 615–629.
[5] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, “Single-image crowd counting via multi-column convolutional neural network,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 589–597.
[6] D. B. Sam, S. Surya, and R. V. Babu, “Switching convolutional neural network for crowd counting,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[7] V. A. Sindagi and V. M. Patel, “Generating high-quality crowd density maps using contextual pyramid CNNs,” in International Conference on Computer Vision (ICCV), 2017.
[8] Y. Li, X. Zhang, and D. Chen, “CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[9] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.
[10] V. A. Sindagi and V. M. Patel, “A survey of recent advances in CNN-based single image crowd counting and density estimation,” Pattern Recognition Letters, vol. 107, pp. 3–16, 2018.
[11] L. Zhang, M. Shi, and Q. Chen, “Crowd counting via scale-adaptive convolutional neural network,” in IEEE Winter Conference on Applications of Computer Vision (WACV), 2018, pp. 1113–1121.
[12] Z. Shen, Y. Xu, B. Ni, M. Wang, J. Hu, and X. Yang, “Crowd counting via adversarial cross-scale consistency pursuit,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[13] Z. Shi, L. Zhang, Y. Liu, X. Cao, Y. Ye, M.-M. Cheng, and G. Zheng, “Crowd counting with deep negative correlation learning,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[14] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
[15] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[16] M. A. Stricker and M. Orengo, “Similarity of color images,” in Storage and Retrieval for Image and Video Databases III, 1995, vol. 2420, pp. 381–393.
[17] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[18] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2005, vol. 1, pp. 886–893.
[19] C. Zhang, H. Li, X. Wang, and X. Yang, “Cross-scene crowd counting via deep convolutional neural networks,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 833–841.
[20] V. A. Sindagi and V. M. Patel, “CNN-based cascaded multi-task learning of high-level prior and density estimation for crowd counting,” in IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 2017.
[21] S. Huang, X. Li, Z. Zhang, F. Wu, S. Gao, R. Ji, and J. Han, “Body structure aware deep crowd counting,” IEEE Transactions on Image Processing, vol. 27, no. 3, pp. 1049–1059, 2018.
[22] J. Liu, C. Gao, D. Meng, and A. G. Hauptmann, “DecideNet: Counting varying density crowds through attention guided detection and density estimation,” in Conference on Computer Vision and Pattern Recognition (CVPR), 2018.