I. Introduction
With the growing deployment of surveillance video cameras, surveillance applications have received increasing attention from the research community. Different from problems such as face and vehicle detection [1, 2, 3], crowd counting, i.e., estimating the head count from video frames, has proved to be a critical functionality in various traffic and public security scenarios [4, 5, 6, 7, 8, 9, 10], and applications in zoology [11] and neuroscience [12] further extend the usability and importance of the problem.
Multiple challenges must be overcome to produce accurate and efficient crowd counting results. Heavy occlusion, lighting and camera perspective changes are common issues. Moreover, as shown in Figure 1, head counts vary dramatically across scenarios; the area of a head ranges from hundreds of pixels to only a few pixels, and the crowd is often unevenly distributed due to camera perspective or physical barriers. Recent approaches provide rough head count estimates based on multi-scale and context-aware cues [4, 5, 10, 6, 7, 8, 9]; a universal normalization or scaling mechanism is often applied across different density domains. In practice, this can be suboptimal and leads to inaccuracy, especially when the scenario is highly dynamic. Better adaption to different density domains is a first-order question for current crowd counting algorithms.
In this paper, we propose a deep learning pipeline that automatically infers the crowd density level of a single input image, and our framework adaptively chooses a counter network that is explicitly trained for the target density domain. The counter networks and the density level estimator are combined through a spatial gating unit for end-to-end crowd counting. On this basis, our framework addresses the density adaption problem and produces more satisfactory results. The idea of constructing representations at multiple levels has been explored in several previous works, e.g., Switch-CNN [6]. Compared to these approaches, which use different network structures as regressors and classifier, our sub-networks share a similar design and are easy to train.
To evaluate our proposed framework, we report results on the recent ShanghaiTech [5] and UCF_CC_50 [13] crowd counting datasets, and we compare with several recent approaches including MCNN [5] and CP-CNN [7]. Notably, we achieve a significant 35.8 MAE improvement over the state-of-the-art Shang et al. [14] on UCF_CC_50, and a 2.5 MAE gain over CP-CNN [7] on ShanghaiTech Part B. Meanwhile, a 20 FPS processing speed is obtained on an Nvidia Titan X GPU (Maxwell).
I-A. Related Work
A number of studies have addressed the real-world crowd counting problem [15, 16]. They can be grouped into three categories by methodology: detection-based, regression-based and density-based, which are briefly reviewed below.
Detection-based crowd counting is straightforward and utilizes off-the-shelf detectors [17, 18, 19] to detect and count target objects in images or videos. However, in crowded scenarios, objects are highly occluded and many are too small to detect, which makes the counts inaccurate.
Regression-based crowd counting. Regression-based approaches such as [20, 21, 22, 23, 14, 10, 24] are proposed to bypass the occlusion problem that can be critical for detection-based methods. Specifically, a mapping between image features and the head count is recovered, and the system benefits from better feature extraction and count regression algorithms [21, 22, 14, 20, 23]. Moreover, [25, 26, 27, 28] leverage spatial or depth information and use segmentation to filter out the background, regressing counts only on foreground segments. These methods are sensitive to different crowd density levels and depend heavily on a normalization strategy that is universally good.

Density-based crowd counting. [29, 4, 5, 6, 7, 8] use continuous density maps instead of discrete counts as the optimization objective and learn a mapping between image features and the density map. Specifically, [4] presents a data-driven method to count in unseen scenarios. [5] proposes a multi-column network to generate the density map directly from input images. [30] introduces a boosting process that yields significant improvements in both accuracy and runtime. To address the perspective-free problem, [31] feeds a pyramid of input patches into a custom-designed network. [6] improves over [5] with a switch layer that exploits the variation of crowd density. [8] jointly estimates the density map and the count with FCN and LSTM layers. [7] uses global and local context to generate high-quality density maps. The drawback of such methods is that the mapping between image and density may deviate, so the derived count can often be inaccurate.
In this paper, our framework leverages both continuous density map and discrete head count annotations in training; a density-level domain adaption network explicitly recognizes the domain allocation of each image patch. Different from previous works, we do not focus on pursuing a better density estimation but count directly; a direct global count, however, totally drops the local details and is hard to learn. Instead, we propose a count map, which to some extent preserves the local details and can be calculated analytically. Besides, if only a single network were used to predict the count map, the result would be dominated by patches with high-density crowds. Therefore, we classify each patch as low- or high-density. This step has two advantages: i) counting on patches with low-density crowds becomes more accurate; ii) the classifier becomes easier to train.
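The count map idea can be illustrated with a short sketch (the function name and sizes are ours, not the paper's): summing a density map over grid cells preserves local head counts while discarding pixel-level detail, and the reduction is a simple analytic operation.

```python
import numpy as np

def density_to_count_map(density, grid=(8, 8)):
    """Sum the density values inside each grid cell to form a count map."""
    h, w = density.shape
    gh, gw = grid
    ch, cw = h // gh, w // gw
    # Each cell becomes a (ch, cw) block; summing a block gives its head count.
    blocks = density[:gh * ch, :gw * cw].reshape(gh, ch, gw, cw)
    return blocks.sum(axis=(1, 3))

# A toy density map: the total sum equals the head count.
density = np.zeros((512, 512))
density[10, 10] = 1.0    # one head, lands in the top-left cell
density[300, 400] = 2.0  # two heads, land in the cell at row 4, column 6
count_map = density_to_count_map(density)
```

The same local sum can equivalently be expressed as a convolution with an all-ones kernel whose stride equals the cell size, which is one way a convolutional layer could be initialized to reproduce sum pooling before fine-tuning.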
II. Framework
The overall architecture of our framework is shown in Figure 2. To handle the dynamic crowd density across images and patches, we propose an adaptive network structure that includes a Density Adaption Network (DAN), a Low-density Counter Network (LCN) and a High-density Counter Network (HCN). DAN identifies whether each patch in the image belongs to the low- or high-density domain; LCN and HCN generate accurate head counts for low- and high-density patches, respectively. We note that using only the continuous density map or only the discrete count map often leads to inaccurate estimates; our framework therefore considers both.
Type          Stride / Dilation   #Params
conv1_1       1 / 1               2.6
conv1_2       1 / 1               20.3
conv1_3       1 / 1               20.3
conv2         1 / 2               40.7
conv3         1 / 4               40.6
conv4         1 / 2               10.2
conv5         1 / 1               5.1
density map   1 / 1               0.05
LCN/HCN (after the density map layer)
              64 / 1              16.0
DAN (after conv5)
              64 / 1              2304
              1 / 1               0.1
Here we assume a fixed input image size. Note that conv1–5 correspond to the Basic Architecture in Figure 2, while the layers under LCN/HCN and DAN correspond to the Local Sum and Local Classify modules in Figure 2, respectively.
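As a side note on the dilated layers listed above, the receptive field of a stride-1 convolution stack can be computed analytically. The sketch below assumes 3×3 kernels throughout, which is our assumption — the table does not specify kernel sizes:

```python
def receptive_field(layers):
    """Receptive field of a stack of stride-1 convolutions.

    layers: list of (kernel_size, dilation) pairs; each layer adds
    (kernel_size - 1) * dilation pixels to the receptive field.
    """
    rf = 1
    for k, d in layers:
        rf += (k - 1) * d
    return rf

# conv1_1-conv1_3 (dilation 1) followed by conv2-5 with dilations 2, 4, 2, 1,
# assuming 3x3 kernels throughout (an assumption; kernel sizes are not given).
stack = [(3, 1), (3, 1), (3, 1), (3, 2), (3, 4), (3, 2), (3, 1)]
rf = receptive_field(stack)
```

Under this assumption the dilated layers more than triple the receptive field of the plain conv1 stack without adding pooling or extra parameters.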
As shown in Figure 2 and Table I, each network uses a similar CNN architecture to estimate the density map $\hat{D}$, where the ground-truth map $D$ places an independent Gaussian centered at each head (symbols with a hat denote network estimates, while symbols without a hat denote the ground truth). The spread $\sigma$ is predefined as in [5]. Like [13], we divide the input image into grids. LCN leverages one additional convolutional layer to generate the count map $\hat{M}_L$ locally; $\hat{M}_H$ and the DAN output $\hat{P}$ are defined similarly for HCN and DAN. We polarize the DAN output and generate the density class map $\hat{C}$ by:
$$\hat{C}_{ij} = \mathbf{1}\left[\hat{P}_{ij} \ge t\right] \qquad (1)$$
where $t$ is a threshold and $\hat{P}$ is the DAN output; the spatial gating unit thus switches between LCN and HCN, and the final head count in the image is given by:
$$\hat{N} = \sum_{i,j}\left[\hat{C} \odot \hat{M}_H + (1-\hat{C}) \odot \hat{M}_L\right]_{ij} \qquad (2)$$
where $\odot$ denotes element-wise product and $\hat{M}_L$, $\hat{M}_H$ are the count maps produced by LCN and HCN.
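The gating in Eqs. (1)–(2) can be sketched in a few lines of NumPy; the variable names and the toy threshold below are ours, chosen only for illustration:

```python
import numpy as np

def gated_count(dan_out, count_low, count_high, t=20.0):
    """Sketch of Eqs. (1)-(2): threshold the DAN output into a binary
    class map, then gate element-wise between the two count maps."""
    c = (dan_out >= t).astype(float)                 # Eq. (1): class map
    fused = c * count_high + (1.0 - c) * count_low   # element-wise gating
    return fused.sum()                               # Eq. (2): total count

# Two grid cells: the first is high-density, the second low-density.
dan_out = np.array([[30.0, 5.0]])
count_low = np.array([[0.0, 3.0]])     # LCN underestimates dense cells
count_high = np.array([[28.0, 99.0]])  # HCN overshoots sparse cells
total = gated_count(dan_out, count_low, count_high, t=20.0)
```

Each cell thus receives its count from the counter trained for its own density domain, which is the point of the spatial gating unit.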
Training strategy. We note that although the structures of DAN, LCN and HCN are similar, they are treated as different parts of the network, because they are explicitly trained on different density domains and multiple supervision signals are involved in training. We observe in our experiments that DAN and the two counters produce reasonable results only with good initializations. Therefore, we first train the Basic Architecture with density map annotations only; after convergence, density class and count map annotations are added to fine-tune DAN, LCN and HCN on top of the basic model. To convert a density map into a count map during network inference, we use a consecutive convolutional layer, which we expect to learn, after a few epochs, a better mapping than a simple sum pooling. The losses for the density map, head count and density class of each network unit are defined as follows:
$$L_D(\Theta) = \frac{1}{2N}\sum_{i=1}^{N}\left\|\hat{D}(X_i;\Theta) - D_i\right\|_2^2 \qquad (3)$$

$$L_M(\Theta) = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{M}(X_i;\Theta) - M_i\right| \qquad (4)$$

$$L_C(\Theta) = -\frac{1}{N}\sum_{i=1}^{N}\left[C_i\log\hat{C}_i + (1-C_i)\log\left(1-\hat{C}_i\right)\right] \qquad (5)$$
where $\Theta$ denotes the network parameters, $N$ is the number of training samples, and $X_i$ is the $i$-th image; $D_i$, $M_i$ and $C_i$ are the ground-truth density map, count map and density class, respectively.^2 Therefore, the multi-part loss functions for the different network parts are defined as combinations of the density loss and the count (or density class) loss:

^2 It is worth noting that the least-squares error (L2) and the least-absolute-deviation error (L1) are applied in Eq. (3) and Eq. (4), respectively. A patch's count value is usually much larger than a density value, so the head count loss is more sensitive to outliers; if an L2 loss were used for the head count, this variance would be further amplified and a sample with thousands of people could dominate the final loss.
$$L_{LCN} = L_D + \lambda L_M \qquad (6)$$

$$L_{HCN} = L_D + \lambda L_M \qquad (7)$$

$$L_{DAN} = L_D + \lambda L_C \qquad (8)$$

where $\lambda$ balances the two loss terms.
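A minimal NumPy sketch of the three per-unit losses (function names are ours, and the normalization is simplified to per-element means):

```python
import numpy as np

def density_loss(pred, gt):
    # Density-map loss: L2 (least squares), as in Eq. (3).
    return 0.5 * np.mean((pred - gt) ** 2)

def count_loss(pred, gt):
    # Count-map loss: L1, as in Eq. (4); L1 keeps samples with
    # thousands of people from dominating the objective.
    return np.mean(np.abs(pred - gt))

def class_loss(prob, label, eps=1e-7):
    # Density-class loss: binary cross-entropy, as in Eq. (5).
    prob = np.clip(prob, eps, 1.0 - eps)
    return -np.mean(label * np.log(prob) + (1 - label) * np.log(1 - prob))
```

In training, each sub-network's objective combines the density loss with its own count or class loss, mirroring the multi-part losses above.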
Implementation details. We implement our framework in PyTorch^3 and note a few practical issues here. First, all input images are resized to a fixed size with 3 channels as the network input, and the aspect ratio is kept with zero padding. The first three layers in conv1 are three consecutive convolutional layers. As dilated convolutional layers have been shown to be effective in many computer vision tasks [32, 33, 34, 35], the following four layers (conv2–5) are dilated layers with dilation parameters 2, 4, 2 and 1, respectively. The last convolutional layer is a bottleneck that regresses the density level. Second, to choose appropriate thresholds for different datasets, we add up the count values of patches in the training images, and the final thresholds are set according to the intermediate value of these statistics. Then, DAN attaches two consecutive convolutional layers after conv5, whose output serves as the density gate; LCN and HCN each attach one convolutional layer after the density map layer to obtain the corresponding count maps. Finally, we augment the training data with random flips only, and we optimize with Adam at a fixed learning rate.

^3 http://pytorch.org/

III. Experiments
We demonstrate crowd counting results compared with previous works on two recent datasets: the ShanghaiTech Dataset [5] and the UCF_CC_50 Dataset [13]. The effectiveness of each component of our model is evaluated, and the influence of the density class map size is explored. We further consider transfer learning between the datasets.
The metrics we use are the Mean Absolute Error, $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}|z_i - \hat{z}_i|$, and the Mean Squared Error, $\mathrm{MSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(z_i - \hat{z}_i)^2}$, where $N$ is the number of test images and $z_i$ and $\hat{z}_i$ are the ground-truth and predicted counts for the $i$-th test image.
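The two metrics are easy to compute directly (a sketch; note that MSE in the crowd counting literature is the root of the mean squared error):

```python
import numpy as np

def mae(gt_counts, pred_counts):
    """Mean absolute error over per-image head counts."""
    gt = np.asarray(gt_counts, dtype=float)
    pred = np.asarray(pred_counts, dtype=float)
    return np.mean(np.abs(gt - pred))

def mse(gt_counts, pred_counts):
    """Root of the mean squared error, following the crowd-counting
    convention used in this paper's tables."""
    gt = np.asarray(gt_counts, dtype=float)
    pred = np.asarray(pred_counts, dtype=float)
    return np.sqrt(np.mean((gt - pred) ** 2))
```

Because of the square root, the MSE column in the tables is on the same scale as MAE but penalizes large per-image errors more heavily.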
III-A. Datasets and Results
The UCF_CC_50 Dataset [13] contains 50 images with head counts ranging from 94 to 4,543, and a total of 63,974 individuals are annotated. Although the number of images is small, the diversity of the scenarios makes the dataset extremely challenging. We conduct five-fold cross-validation for training and testing, the standard evaluation setting used in [13]. In training, we generate the density maps using the same spread ($\sigma$) in the Gaussian kernel, and the threshold for the density boundary that decides patch sparsity is set to 40 due to the high-density crowds in this dataset.
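Generating a ground-truth density map with a fixed spread amounts to placing one normalized Gaussian per annotated head; the sketch below uses an illustrative sigma, not the paper's exact value:

```python
import numpy as np

def make_density_map(shape, heads, sigma=4.0):
    """Place one normalized Gaussian per annotated head position.

    Each kernel is renormalized to sum to 1, so the map's total equals
    the head count. sigma=4.0 is an illustrative value only.
    """
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    density = np.zeros(shape)
    for y, x in heads:
        g = np.exp(-((ys - y) ** 2 + (xs - x) ** 2) / (2.0 * sigma ** 2))
        density += g / g.sum()  # renormalize so each head contributes 1
    return density
```

Renormalizing each kernel keeps the density-to-count relationship exact even when a Gaussian is truncated at the image border.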
Table II shows the results on UCF_CC_50. We compare with [4, 5, 6, 7, 14], all state-of-the-art CNN-based approaches except for [14], which uses an LSTM over a sequence of video frames. Recall that Shang et al. [14] use additional training images; our method still achieves state-of-the-art MAE and MSE, as our network routes patches of different density levels to the appropriate counter and thus produces more accurate results. Examples of the testing results are shown in Figure 3(c).
Method              MAE    MSE
Zhang et al. [4]    467.0  498.5
MCNN [5]            377.6  509.1
Switching-CNN [6]   318.1  439.2
CP-CNN [7]          295.8  320.9
ConvLSTM [9]        284.5  297.1
Shang et al. [14]   270.3  -
Our Method          234.5  289.6
The ShanghaiTech Dataset [5] is one of the largest available datasets in terms of annotation. It contains 1,198 annotated images with a total of 330,165 people. The dataset consists of two subsets: Part A and Part B. Part A has 482 images collected from the Internet, while Part B includes 716 images captured in downtown Shanghai. The dynamic scenarios make the dataset even more challenging. We follow the setting of [5]: Part A is divided into 300 images for training and 182 for testing, and Part B into 400 images for training and the rest for testing. In training, we generate the ground-truth density maps as in [5], with geometry-adaptive kernels for Part A and a fixed Gaussian spread for Part B. The density boundary threshold is set to 20 for Part A and 10 for Part B.
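For Part A's geometry-adaptive kernels, the spread varies per head with local crowd geometry. A hedged sketch of the idea from [5] follows; the parameter values (k, beta) are the commonly cited MCNN choices and are assumptions here:

```python
import numpy as np

def adaptive_sigmas(heads, k=3, beta=0.3):
    """Geometry-adaptive spreads in the spirit of MCNN [5]: each head's
    sigma is proportional to its mean distance to the k nearest other
    heads, so dense regions get narrow kernels and sparse regions wide
    ones. k and beta are illustrative values."""
    pts = np.asarray(heads, dtype=float)
    sigmas = []
    for p in pts:
        d = np.sqrt(((pts - p) ** 2).sum(axis=1))
        nearest = np.sort(d)[1:k + 1]  # drop the zero distance to itself
        sigmas.append(beta * nearest.mean())
    return np.array(sigmas)
```

Each per-head sigma would then feed the Gaussian placement step in place of a single fixed spread.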
Method              Part A         Part B
                    MAE    MSE     MAE    MSE
Zhang et al. [4]    181.8  277.7   32.0   49.8
MCNN [5]            110.2  173.2   26.4   41.3
Switching-CNN [6]   90.4   135.0   21.6   33.4
CP-CNN [7]          73.6   106.4   20.1   30.1
Our Method          88.5   147.6   17.6   26.8
Table III compares our model with state-of-the-art approaches on the ShanghaiTech dataset: Zhang et al. [4], MCNN [5], Switching-CNN [6] and CP-CNN [7]. Our approach achieves a promising improvement of 2.5 MAE and 3.3 MSE on Part B while producing comparable results on Part A. We note that our network structure is much simpler than CP-CNN and hence much faster: our framework runs at 20 FPS on an Nvidia Titan X GPU (Maxwell). Qualitative results on Part A and Part B are shown in Figures 3(a) and 3(b).
III-B. Ablation Study
Method                 MAE    MSE
LCN                    660.9  867.6
HCN                    647.2  747.5
LCN + HCN + DAN        234.5  289.6
Ideal classification   157.3  195.6
Grid size        4×4             8×8             16×16
Input size       MAE    MSE      MAE    MSE      MAE    MSE
256×256          97.1   165.5    92.2   150.6    105.3  181.2
512×512          100.0  180.5    88.5   147.6    99.8   155.3
This section evaluates the effectiveness of each part. The LCN and HCN rows of Table IV report MAE/MSE on the UCF_CC_50 dataset when only LCN or only HCN is used for crowd counting; a counter trained on a single density domain produces much worse results due to the lack of context from the other density level. DAN achieves a density domain allocation accuracy of 0.96. The LCN + HCN + DAN row combines LCN and HCN according to DAN's classification, while the last row shows the MAE/MSE obtained using the ground-truth density level instead of DAN's predicted class map. Clearly, the density domain is a critical factor, and there is still a gap between our results and the optimum.
We also examine how the grid size of the density class map affects the results. Table V shows that the best performance is obtained with a grid size of 8×8 and an input size of 512×512. Note that a 16×16 grid is suboptimal for images with few heads, since a single head may be divided across several cells, whereas a 4×4 grid is suboptimal for images with high-density crowds due to the loss of local detail. Overall, an input size of 512×512 preserves more detail and achieves better results than 256×256. We also tried larger input sizes, but training became suboptimal.
Method                    MAE    MSE
W/O fine-tune             378.2  434.9
Step learning on target   234.5  289.6
Fine-tune on target       228.9  283.2
MCNN [5]                  295.1  490.2
III-C. Dataset Transfer
We also examine how well the proposed framework generalizes. Similar to Zhang et al. [5], we evaluate dataset transfer using ShanghaiTech Part A as the source domain and UCF_CC_50 as the target domain; the results are reported in Table VI. We compare three training strategies. (i) W/O fine-tune: the base model pretrained on the source domain with density map annotations only is tested directly on the target domain with the basic architecture. (ii) Step learning on target: the base model is pretrained with density maps on the target dataset only, and the three sub-networks (DAN, LCN, HCN) are then fine-tuned from this pretrained base model, also on the target dataset. (iii) Fine-tune on target: the base model pretrained on the source domain initializes the entire framework, which is then fine-tuned on the target domain. The transfer results of MCNN [5] are shown for comparison. W/O fine-tune already achieves reasonable performance compared with MCNN [5], and fine-tuning on the target domain further improves results by 5.6 MAE and 6.4 MSE. This indicates that our model is flexible and can transfer between datasets with dynamic scenarios.
IV. Conclusions
In this paper, we present density adaption networks for crowd counting in dynamic scenarios. The framework leverages a density level estimator to adaptively choose between counter networks that are explicitly trained for different crowd density domains. Experiments on two major crowd counting benchmarks show promising results for the proposed approach.
References

[1] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, "Joint face detection and alignment using multitask cascaded convolutional networks," IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
[2] L. Wang, Y. Lu, H. Wang, Y. Zheng, H. Ye, and X. Xue, "Evolving boxes for fast vehicle detection," in ICME, 2017, pp. 1135–1140.
[3] S. Lyu et al., "UA-DETRAC 2017: Report of AVSS2017 & IWT4S challenge on advanced traffic monitoring," in AVSS, 2017, pp. 1–7.
[4] C. Zhang, H. Li, X. Wang, and X. Yang, "Cross-scene crowd counting via deep convolutional neural networks," in CVPR, 2015, pp. 833–841.
[5] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, "Single-image crowd counting via multi-column convolutional neural network," in CVPR, 2016, pp. 589–597.
[6] D. B. Sam, S. Surya, and R. V. Babu, "Switching convolutional neural network for crowd counting," in CVPR, vol. 1, no. 3, 2017, p. 6.
[7] V. A. Sindagi and V. M. Patel, "Generating high-quality crowd density maps using contextual pyramid CNNs," in CVPR, 2017, pp. 1–14.
[8] S. Zhang, G. Wu, J. P. Costeira, and J. M. Moura, "FCN-rLSTM: Deep spatio-temporal neural networks for vehicle counting in city cameras," in ICCV, 2017.
[9] F. Xiong, X. Shi, and D.-Y. Yeung, "Spatiotemporal modeling for crowd counting in videos," in ICCV, 2017.
[10] S. Kumagai, K. Hotta, and T. Kurita, "Mixture of counting CNNs: Adaptive integration of CNNs specialized to specific appearance for crowd counting," arXiv preprint arXiv:1703.09393, 2017.
[11] C. Arteta, V. Lempitsky, and A. Zisserman, "Counting in the wild," in ECCV, 2016, pp. 483–498.
[12] W. Xie, J. A. Noble, and A. Zisserman, "Microscopy cell counting with fully convolutional regression networks," in MICCAI, 2015, pp. 1–8.
[13] H. Idrees, I. Saleemi, C. Seibert, and M. Shah, "Multi-source multi-scale counting in extremely dense crowd images," in CVPR, 2013, pp. 2547–2554.
[14] C. Shang, H. Ai, and B. Bai, "End-to-end crowd counting via joint learning local and global count," in ICIP, 2016, pp. 1215–1219.
[15] C. C. Loy, K. Chen, S. Gong, and T. Xiang, "Crowd counting and profiling: Methodology and evaluation," in Modeling, Simulation and Visual Analysis of Crowds. Springer, 2013, pp. 347–382.
[16] V. A. Sindagi and V. M. Patel, "A survey of recent advances in CNN-based single image crowd counting and density estimation," Pattern Recognition Letters, 2017.
[17] M. Li, Z. Zhang, K. Huang, and T. Tan, "Estimating the number of people in crowded scenes by MID based foreground segmentation and head-shoulder detection," in ICPR, 2008, pp. 1–4.
[18] B. Leibe, E. Seemann, and B. Schiele, "Pedestrian detection in crowded scenes," in CVPR, vol. 1, 2005, pp. 878–885.
[19] L. Wang and N. H. Yung, "Crowd counting and segmentation in visual surveillance," in ICIP, 2009, pp. 2573–2576.
[20] A. B. Chan and N. Vasconcelos, "Bayesian Poisson regression for crowd counting," in ICCV, 2009, pp. 545–551.
[21] K. Chen, C. C. Loy, S. Gong, and T. Xiang, "Feature mining for localised crowd counting," in BMVC, 2012.
[22] S. Seguí, O. Pujol, and J. Vitrià, "Learning to count with deep object features," in CVPR Workshops, 2015, pp. 90–96.
[23] C. Arteta, V. Lempitsky, J. A. Noble, and A. Zisserman, "Interactive object counting," in ECCV, 2014, pp. 1–15.
[24] Q. Wang, J. Wan, and Y. Yuan, "Deep metric learning for crowdedness regression," IEEE Transactions on Circuits and Systems for Video Technology, 2017.
[25] A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos, "Privacy preserving crowd monitoring: Counting people without people models or tracking," in CVPR, 2008, pp. 1–7.
[26] D. Ryan, S. Denman, C. Fookes, and S. Sridharan, "Crowd counting using multiple local features," in Digital Image Computing: Techniques and Applications, 2009, pp. 81–88.
[27] A. B. Chan and N. Vasconcelos, "Counting people with low-level features and Bayesian regression," IEEE Transactions on Image Processing, vol. 21, no. 4, pp. 2160–2177, 2012.
[28] H. Fu, H. Ma, and H. Xiao, "Real-time accurate crowd counting based on RGB-D information," in ICIP, 2012, pp. 2685–2688.
[29] V. Lempitsky and A. Zisserman, "Learning to count objects in images," in NIPS, 2010, pp. 1324–1332.
[30] E. Walach and L. Wolf, "Learning to count with CNN boosting," in ECCV. Springer, 2016, pp. 660–676.
[31] D. Onoro-Rubio and R. J. López-Sastre, "Towards perspective-free object counting with deep learning," in ECCV, 2016, pp. 615–629.
[32] F. Yu, V. Koltun, and T. Funkhouser, "Dilated residual networks," arXiv preprint arXiv:1705.09914, 2017.
[33] Y. Zheng, H. Ye, L. Wang, and J. Pu, "Learning multi-viewpoint context-aware representation for RGB-D scene classification," IEEE Signal Processing Letters, vol. 25, no. 1, pp. 30–34, 2018.
[34] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, "Temporal convolutional networks for action segmentation and detection," in CVPR, 2017, pp. 1003–1012.
[35] B. Xu, H. Ye, Y. Zheng, H. Wang, T. Luwang, and Y.-G. Jiang, "Dense dilated network for few shot action recognition," in ICMR, 2018, pp. 379–387.