With the growing deployment of surveillance cameras, surveillance applications have received increasing attention from the research community. Different from other problems such as face and vehicle detection [1, 2, 3], crowd counting, i.e., estimating the head count from video frames, has proven to be a critical functionality in various traffic and public security scenarios [4, 5, 6, 7, 8, 9, 10], and applications in zoology and neural science further extend the usability and importance of the problem.
Producing accurate and efficient crowd counting results faces multiple challenges. Heavy occlusion, lighting and camera perspective changes are common issues. Moreover, as shown in Figure 1, head counts vary dramatically across scenarios; the area of a head ranges from hundreds of pixels to only a few pixels, and the crowd is often unevenly distributed due to camera perspective or physical barriers. Recent approaches provide rough head count estimates based on multi-scale and context-aware cues [4, 5, 10, 6, 7, 8, 9]; a universal normalization or scaling mechanism is often used for different density domains. In practice, this can be suboptimal and lead to inaccuracy, especially when the scenario is highly dynamic. Better adaption to different density domains is therefore a first-order question for current crowd counting algorithms.
In this paper, we propose a deep learning pipeline that automatically infers the crowd density level from a single input image and adaptively chooses a counter network that is explicitly trained for the target density domain. The counter networks and the density level estimator are combined in a spatial gating unit for end-to-end crowd counting. On this basis, our framework addresses the density adaption problem and produces more accurate results. The idea of constructing representations from multiple levels has been explored in several previous works, e.g., Switch-CNN. Compared to these approaches, which use different network structures for the regressors and the classifier, our subnetworks share a similar design and are easy to train.
To evaluate our framework, we report results on the recent ShanghaiTech and UCF_CC_50 crowd counting datasets and compare with several recent approaches, including MCNN and CP-CNN. Notably, we achieve a significant improvement of 35.8 MAE over the state-of-the-art method of Shang et al. on UCF_CC_50, and a gain of 2.5 MAE over CP-CNN on ShanghaiTech Part B. Meanwhile, a processing speed of 20 FPS is obtained on an Nvidia Titan X GPU (Maxwell).
I-A Related Work
A number of studies have addressed crowd counting as a real-world problem [15, 16]. They fall into three categories depending on the methodology: detection-based, regression-based and density-based, which are briefly reviewed below.
Detection-based crowd counting is straightforward and utilizes off-the-shelf detectors [17, 18, 19] to detect and count target objects in images or videos. However, in crowded scenarios, objects are highly occluded and many objects are too small to detect. Both factors make the counting inaccurate.
Regression-based crowd counting methods are proposed to bypass the occlusion problem that can be critical for detection-based methods. Specifically, a mapping between image features and the head count is learned, and the system benefits from better feature extraction and count regression algorithms [21, 22, 14, 20, 23]. Moreover, [25, 26, 27, 28] leverage spatial or depth information and use segmentation methods to filter out the background region and regress count numbers only on foreground segments. These types of methods are sensitive to different crowd density levels and depend heavily on a normalization strategy that works well universally.
Density-based crowd counting. [29, 4, 5, 6, 7, 8] use continuous density maps instead of discrete count numbers as the optimization objective and learn a mapping between the image features and the density map. Specifically,  presents a data-driven method to count in unseen scenarios.  proposes a multi-column network to directly generate the density map from input images.  introduces a boosting process which yields a significant improvement in both accuracy and runtime. To address the perspective-free problem,  feeds a pyramid of input patches into a purpose-built network.  improves over  and uses a switch layer to exploit the variation of the crowd density.  jointly estimates the density map and count number with FCN and LSTM layers.  uses global and local context to generate high-quality density maps. The shortcoming of this type of method is that the mapping between density and image may deviate, so the actual count can often be inaccurate.
In this paper, our framework leverages both continuous density map and discrete head count annotations in training; a density-level domain adaption network is used to explicitly recognize the domain of each image patch. Different from previous works, we do not focus on pursuing better density estimation but count directly. Regressing a single global count, however, drops the local details entirely and is hard to learn. Instead, we propose a count map, which preserves the local details to some extent and can be calculated analytically. Besides, if only a single network is used to predict the count map, the result will be dominated by the patches with high-density crowds. Therefore, we classify each patch as low- or high-density. This step has two advantages: i) the counts for patches with low-density crowds become more accurate; ii) the classifier becomes easier to train.
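As a minimal sketch of why a count map can be calculated analytically, the snippet below sum-pools a density map over a regular grid, so that each cell holds the number of heads in its patch. The function name `density_to_count_map` and the toy 4×4 density map are hypothetical and serve only as illustration; this is the simple sum-pooling baseline, not a learned mapping.

```python
def density_to_count_map(density, grid):
    """Sum-pool a 2-D density map (list of rows) into a grid x grid count map."""
    height, width = len(density), len(density[0])
    if height % grid or width % grid:
        raise ValueError("map size must be divisible by the grid size")
    cell_h, cell_w = height // grid, width // grid
    count_map = [[0.0] * grid for _ in range(grid)]
    for y in range(height):
        for x in range(width):
            # Each pixel contributes its density to the cell that contains it.
            count_map[y // cell_h][x // cell_w] += density[y][x]
    return count_map


# Hypothetical toy density map: one head in the top-left patch,
# one head spread over the bottom-right patch.
density = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.25, 0.25],
    [0.0, 0.0, 0.25, 0.25],
]
count_map = density_to_count_map(density, grid=2)
total = sum(sum(row) for row in count_map)
print(count_map, total)
```

Because pooling only regroups the density values, the sum of the count map equals the sum of the density map, which is why the global head count is preserved while some spatial detail is kept.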
The overall architecture of our framework is shown in Figure 2. Because crowd density varies between images and patches, we propose an adaptive network structure that includes a Density Adaption Network (DAN), a Low-density Counter Network (LCN) and a High-density Counter Network (HCN). DAN identifies whether each patch in the image belongs to the low- or high-density domain; LCN and HCN generate accurate head counts for low- and high-density patches, respectively. We note that using only the continuous density map or only the discrete count map often leads to inaccurate estimates. Instead, both are considered in our framework.
|Type||Kernel Size / Stride / Dilation||In/Output Channels||#Params|
|conv1_1||/ 1 / 1|| ||2.6|
|conv1_2||/ 1 / 1|| ||20.3|
|conv1_3||/ 1 / 1|| ||20.3|
|conv2||/ 1 / 2|| ||40.7|
|conv3||/ 1 / 4|| ||40.6|
|conv4||/ 1 / 2|| ||10.2|
|conv5||/ 1 / 1|| ||5.1|
|density map||/ 1 / 1|| ||0.05|
| ||/ 64 / 1|| ||16.0|
| ||/ 64 / 1|| ||2304|
| ||/ 1 / 1|| ||0.1|
Here we suppose the input image size is . Note that conv1-5 correspond to the Basic Architecture in Figure 2, while the layers in LCN/HCN and DAN correspond to Local Sum and Local Classify in Figure 2, respectively.
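The stride and dilation settings in Table I can be checked with the standard convolution output-size formula. The sketch below is only illustrative: the 3×3 kernel, the 512×512 input and the 64×64 non-overlapping kernel of the stride-64 layer are assumptions, since the kernel sizes and the input size are not recoverable from the table as printed here.

```python
def effective_kernel(kernel, dilation):
    """Extent covered by a dilated kernel along one axis."""
    return dilation * (kernel - 1) + 1


def conv_output_size(size, kernel, stride=1, dilation=1, padding=0):
    """Standard output-size formula for a convolution along one axis."""
    return (size + 2 * padding - effective_kernel(kernel, dilation)) // stride + 1


ASSUMED_INPUT = 512   # assumption: the 512x512 setting used in the grid-size study
ASSUMED_KERNEL = 3    # assumption: kernel sizes are missing from the table
DILATIONS = [("conv2", 2), ("conv3", 4), ("conv4", 2), ("conv5", 1)]

size = ASSUMED_INPUT
report = []
for name, dilation in DILATIONS:
    extent = effective_kernel(ASSUMED_KERNEL, dilation)
    padding = (extent - 1) // 2  # padding that keeps the spatial size at stride 1
    size = conv_output_size(size, ASSUMED_KERNEL, dilation=dilation, padding=padding)
    report.append((name, extent, padding, size))

# Assumption: the stride-64 layer uses a non-overlapping 64x64 kernel.
grid = conv_output_size(size, kernel=64, stride=64)

for row in report:
    print(row)
print("grid cells per side:", grid)
```

Under these assumptions the dilated layers enlarge the receptive field without reducing resolution, and the stride-64 layer turns a 512×512 map into an 8×8 grid.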
As shown in Figure 2 and Table I, each network uses a similar CNN architecture to estimate the density map, in which an independent Gaussian is centered at each head. (Symbols with a hat denote network estimates, while symbols without a hat represent the ground truth.) The Gaussian kernel is pre-defined as in . Like , we divide the input image into grids. LCN leverages one additional convolutional layer to generate the count map locally, and the corresponding outputs are defined similarly for HCN and DAN. We polarize the DAN output against a threshold to generate the density class of each grid cell. The spatial gating unit therefore switches between LCN and HCN, and the final head count in the image is obtained by summing the gated count maps, where the gating is an element-wise product.
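The gating step described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function name `gated_head_count`, the default threshold of 0.5, the strict comparison, and the convention that class 1 means high density are all hypothetical choices.

```python
def gated_head_count(dan_scores, low_counts, high_counts, threshold=0.5):
    """Polarize DAN scores and switch between LCN and HCN count maps per cell.

    All three inputs are grids (lists of rows) of the same shape.
    Returns the binary density class map and the total head count.
    """
    density_class = []
    total = 0.0
    for score_row, low_row, high_row in zip(dan_scores, low_counts, high_counts):
        # Assumed convention: 1 = high-density cell, 0 = low-density cell.
        class_row = [1 if score > threshold else 0 for score in score_row]
        density_class.append(class_row)
        for k, low, high in zip(class_row, low_row, high_row):
            # Element-wise gate: take the LCN count when k == 0, the HCN count when k == 1.
            total += (1 - k) * low + k * high
    return density_class, total


# Hypothetical 2x2 example.
dan_scores = [[0.25, 0.75], [0.5, 1.0]]
low_counts = [[1.0, 2.0], [3.0, 4.0]]
high_counts = [[10.0, 20.0], [30.0, 40.0]]
density_class, total = gated_head_count(dan_scores, low_counts, high_counts)
print(density_class, total)
```

In this toy example the left column is routed to the low-density counter and the right column to the high-density counter, so the total is 1 + 20 + 3 + 40.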
Training strategy. We note that although the structures of DAN, LCN and HCN are similar, they are treated as different parts of the network, because they are explicitly trained on different density domains and multiple supervision signals are involved in the training. We observe in our experiments that DAN and the two counters produce reasonable results only with good initialization. Therefore, we start by training the Basic Architecture with density map annotations only; after convergence, density class and count map annotations are added to fine-tune DAN, LCN and HCN on top of the basic model. To convert a density map into a count map during inference, we use a subsequent convolutional layer, which we expect can learn a better mapping after a few epochs than a simple sum pooling. Separate losses are defined for the density map, head count and density class of each network unit.
These losses are functions of the network parameters and are computed over the training images, using the ground truth density map, count map and density class of each image, respectively. It is worth noting that the least squares error (L2) and the least absolute deviation (L1) are applied to Eq. (3) and Eq. (4), respectively. Count values are usually much larger than density values, meaning that the head count loss is more sensitive to outliers; if an L2 loss were employed in the head count loss, the variance would be further amplified and samples with thousands of people could dominate the final objective. Therefore, the multi-part loss function for each network part is defined as a combination of the density loss and the count (or density class) loss.
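The argument for using L1 rather than L2 on head counts can be illustrated with a small arithmetic example. The numbers below are hypothetical: three sparse images with small errors and one dense image with a 1% error. The helper names are also hypothetical.

```python
def mean_l1(pred, target):
    """Mean absolute deviation between predicted and true counts."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)


def mean_l2(pred, target):
    """Mean squared error between predicted and true counts."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def largest_term_share(pred, target, power):
    """Fraction of the summed loss contributed by the single largest error."""
    terms = [abs(p - t) ** power for p, t in zip(pred, target)]
    return max(terms) / sum(terms)


# Hypothetical counts: errors are 2, 3, 4 and 30 heads.
target_counts = [20.0, 35.0, 50.0, 3000.0]
predicted_counts = [18.0, 38.0, 46.0, 2970.0]

l1 = mean_l1(predicted_counts, target_counts)
l2 = mean_l2(predicted_counts, target_counts)
share_l1 = largest_term_share(predicted_counts, target_counts, power=1)
share_l2 = largest_term_share(predicted_counts, target_counts, power=2)
print(l1, l2, round(share_l1, 3), round(share_l2, 3))
```

Even though the dense image has the smallest relative error, it accounts for roughly 77% of the L1 loss and roughly 97% of the L2 loss, which is the domination effect the L1 choice is meant to dampen.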
We implement our framework in PyTorch (http://pytorch.org/) and note a few practical issues here. First, all input images are resized to a fixed size with 3 channels as the network input, and the aspect ratio is kept with zero padding. The first three layers in conv1 are three consecutive convolutional layers. As dilated convolutional layers have been shown to be effective in many computer vision tasks [32, 33, 34, 35], the following 4 layers (conv2-5) are dilated layers with dilation parameters of 2, 4, 2 and 1, respectively. The last convolutional layer is a bottleneck to regress the density level. Second, to choose appropriate thresholds for different datasets, we add up the count values of patches in the training images and set the final thresholds according to the intermediate value of the statistics. Third, DAN appends two consecutive convolutional layers after conv5, and their output serves as the density gate. LCN and HCN each append one convolutional layer after the density map layer to obtain the corresponding count maps. Finally, we augment the training data with random flips only and optimize with Adam.
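One way to read the threshold selection described above is as a median over per-patch counts. The sketch below follows that reading, which is an assumption: the text only speaks of an "intermediate value" of the patch statistics, and the function names and toy count maps are hypothetical.

```python
from statistics import median


def patch_counts(count_maps):
    """Flatten a list of per-image count maps into one list of patch counts."""
    return [count for count_map in count_maps for row in count_map for count in row]


def density_threshold(count_maps):
    """Assumed rule: use the median patch count as the low/high boundary."""
    return median(patch_counts(count_maps))


def label_patches(count_map, threshold):
    """Ground-truth density class per patch: 1 = high density, 0 = low density."""
    return [[1 if count > threshold else 0 for count in row] for row in count_map]


# Hypothetical 2x2 count maps from two training images.
training_count_maps = [
    [[0, 4], [10, 50]],
    [[2, 6], [30, 80]],
]
threshold = density_threshold(training_count_maps)
labels = [label_patches(cm, threshold) for cm in training_count_maps]
print(threshold, labels)
```

With a median boundary, the two density domains receive the same number of training patches, which is one plausible motivation for choosing a dataset-specific threshold.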
We report crowd counting results and compare with previous works on two recent datasets: the ShanghaiTech dataset and the UCF_CC_50 dataset. We evaluate the effectiveness of each component of our framework and explore the influence of the density class map size. We further consider transfer learning between the datasets.
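The metrics we use are the Mean Absolute Error, $\mathrm{MAE}=\frac{1}{N}\sum_{i=1}^{N}|C_i-\hat{C}_i|$, and the Mean Squared Error, $\mathrm{MSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}(C_i-\hat{C}_i)^2}$, where $N$ is the number of test images and $C_i$ and $\hat{C}_i$ are the ground truth and the predicted count in the $i$-th test image.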
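The two metrics can be computed as in the following sketch. It assumes the convention common in the crowd counting literature, in which the reported MSE is the square root of the mean squared count error; the function names and the example counts are hypothetical.

```python
import math


def mean_absolute_error(ground_truth, predicted):
    """MAE over per-image head counts."""
    n = len(ground_truth)
    return sum(abs(c - p) for c, p in zip(ground_truth, predicted)) / n


def mean_squared_error(ground_truth, predicted):
    """MSE in the crowd counting sense: root of the mean squared count error (assumed)."""
    n = len(ground_truth)
    return math.sqrt(sum((c - p) ** 2 for c, p in zip(ground_truth, predicted)) / n)


# Hypothetical counts for four test images; errors are 0, 0, 30 and 40 heads.
ground_truth = [94, 500, 1200, 4543]
predicted = [94, 500, 1230, 4503]

mae = mean_absolute_error(ground_truth, predicted)
mse = mean_squared_error(ground_truth, predicted)
print(mae, mse)
```

MAE reflects the average accuracy of the estimates, while the squared term makes MSE more sensitive to a few large errors, so the two are usually reported together.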
III-A Datasets and Results
The UCF_CC_50 dataset contains 50 images with head counts ranging from 94 to 4,543, and a total of 63,974 individuals are annotated. Although the number of images is small, the diversity of the scenarios makes the dataset extremely challenging. We conduct five-fold cross-validation for training and testing, which is the standard evaluation setting used in . In training, we generate the density map using the same spread in the Gaussian kernel, and the density boundary threshold that decides the patch sparsity is set to 40 due to the high-density crowds in the dataset.
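Generating a ground truth density map with a fixed Gaussian spread can be sketched as below. The spread value, the image size, the head positions and the per-head normalization over the image area are all assumptions made for illustration; the value of the spread used in the experiments is not recoverable from the text above.

```python
import math


def gaussian_density_map(height, width, heads, sigma):
    """Place one normalized Gaussian per annotated head, given as (row, col)."""
    density = [[0.0] * width for _ in range(height)]
    for head_y, head_x in heads:
        weights = [
            [
                math.exp(-((y - head_y) ** 2 + (x - head_x) ** 2) / (2.0 * sigma ** 2))
                for x in range(width)
            ]
            for y in range(height)
        ]
        # Assumed design choice: normalize inside the image so each head sums to 1,
        # even when the Gaussian is truncated by the image border.
        norm = sum(sum(row) for row in weights)
        for y in range(height):
            for x in range(width):
                density[y][x] += weights[y][x] / norm
    return density


# Hypothetical example: two heads in a 32x32 image with an assumed spread of 4 pixels.
heads = [(8, 8), (20, 24)]
density = gaussian_density_map(32, 32, heads, sigma=4.0)
total = sum(sum(row) for row in density)
print(round(total, 6))
```

Because every Gaussian is normalized, integrating the density map recovers the number of annotated heads, which is the property that lets density maps and count maps supervise the same quantity.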
Table II shows the results on UCF_CC_50. We compare with [4, 5, 6, 7, 14], which are state-of-the-art CNN-based approaches, except for  that uses an LSTM over a sequence of video frames. Note that although Shang et al. use additional training images, our method still achieves state-of-the-art MAE and MSE, as our framework routes patches of different density levels to the appropriate counter and thereby obtains more accurate results. Examples of the testing results can be seen in Figure 3(c).
|Method||MAE||MSE|
|Zhang et al. ||467.0||498.5|
|Shang et al. ||270.3||-|
The ShanghaiTech dataset is one of the largest datasets available in terms of annotation. It contains 1,198 annotated images with a total of 330,165 people. The dataset consists of two subsets: Part A and Part B. Part A has 482 images collected from the Internet, while Part B includes 716 images captured in downtown Shanghai. The dynamic scenarios make the dataset even more challenging. We conduct our experiments following the setting of , where Part A is divided into 300 images for training and 182 images for testing, and Part B into 400 images for training and the rest for testing. In the training process, we generate the ground truth density map as in  with geometry-adaptive kernels for Part A and the same spread in the Gaussian kernel for Part B. The density boundary threshold is set to 20 for Part A and 10 for Part B.
|Method||Part A MAE||Part A MSE||Part B MAE||Part B MSE|
|Zhang et al. ||181.8||277.7||32.0||49.8|
Table III compares our model with state-of-the-art approaches on the ShanghaiTech dataset: Zhang et al., MCNN, Switching-CNN and CP-CNN. Our approach achieves a promising improvement of 2.5 MAE and 3.3 MSE on Part B while producing comparable results on Part A. We note that our network structure is much simpler than CP-CNN and hence much faster. Our framework runs at 20 FPS on an Nvidia Titan X GPU (Maxwell), and qualitative results on Part A and Part B can be seen in Figure 3(a) and 3(b).
III-B Ablation Study
|Method||MAE||MSE|
|LCN + HCN + DAN||234.5||289.6|
|Grid size||4×4||8×8||16×16|
The effectiveness of each part is evaluated in this section. As shown in Table IV, the LCN and HCN rows report the MAE/MSE on the UCF_CC_50 dataset when only LCN or only HCN is used for crowd counting; using the counter from a single density domain produces much worse counting results due to the lack of context from the other density level. The DAN in our framework achieves a density domain allocation accuracy of 0.96. The LCN + HCN + DAN row reports the performance of combining LCN and HCN according to the classification results of DAN, while the last row shows the MAE/MSE obtained with the ground truth density level rather than the predicted class map from DAN. It is clear that the density domain is a critical factor and that there is still a gap between our results and the optimum.
We also examine how the grid size of the density class map affects the results. Table V shows that the best performance is obtained when the grid size and input size are set to 8×8 and 512×512, respectively. Note that a grid size of 16×16 is suboptimal for images with few heads because a head is divided into several parts, while a grid size of 4×4 is suboptimal for images with high-density crowds due to the loss of local details. Overall, an input size of 512×512 preserves more details and achieves better results than 256×256. We have also tried larger input sizes, but the training becomes suboptimal.
III-C Dataset Transfer
We further examine how well the proposed framework generalizes. Similar to Zhang et al., we verify dataset transfer by using ShanghaiTech Part A as the source domain and UCF_CC_50 as the target domain. The results are reported in Table VI. We compare three training strategies. (i) W/O fine-tune: the base model is pre-trained on the source domain with density map annotations only and tested on the target domain using the base architecture. (ii) Step learning on target: the base model is pre-trained with density maps only on the target dataset, and the three subnetworks (DAN, LCN, HCN) are then fine-tuned from the pre-trained base model, also on the target dataset. (iii) Fine-tune on target: the base model pre-trained on the source domain is used to initialize the entire framework, which is then fine-tuned on the target domain. The transfer results of MCNN are listed for comparison. Without fine-tuning, our model achieves reasonable performance compared with MCNN; fine-tuning on the target domain further improves the results by 5.6 MAE and 6.4 MSE. These results indicate that our model is flexible and can transfer between datasets with dynamic scenarios.
In this paper, we present density adaption networks for crowd counting in dynamic scenarios. The framework leverages a density level estimator to adaptively choose between different counter networks that are explicitly trained for different crowd density domains. Experiments on two major crowd counting benchmarks show promising results for the proposed approach.
-  K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
-  L. Wang, Y. Lu, H. Wang, Y. Zheng, H. Ye, and X. Xue, “Evolving boxes for fast vehicle detection,” in ICME, 2017, pp. 1135–1140.
-  S. Lyu et al., “Ua-detrac 2017: Report of avss2017 & iwt4s challenge on advanced traffic monitoring,” in AVSS, 2017, pp. 1–7.
-  C. Zhang, H. Li, X. Wang, and X. Yang, “Cross-scene crowd counting via deep convolutional neural networks,” in CVPR, 2015, pp. 833–841.
-  Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, “Single-Image Crowd Counting via Multi-Column Convolutional Neural Network,” in CVPR, 2016, pp. 589–597.
-  D. B. Sam, S. Surya, and R. V. Babu, “Switching convolutional neural network for crowd counting,” in CVPR, vol. 1, no. 3, 2017, p. 6.
-  V. A. Sindagi and V. M. Patel, “Generating High-Quality Crowd Density Maps using Contextual Pyramid CNNs,” in CVPR, 2017, pp. 1–14.
-  S. Zhang, G. Wu, J. P. Costeira, and J. M. Moura, “Fcn-rlstm: Deep spatio-temporal neural networks for vehicle counting in city cameras,” in ICCV, 2017.
-  F. Xiong, X. Shi, and D.-Y. Yeung, “Spatiotemporal Modeling for Crowd Counting in Videos,” ICCV, 2017.
-  S. Kumagai, K. Hotta, and T. Kurita, “Mixture of counting cnns: Adaptive integration of cnns specialized to specific appearance for crowd counting,” arXiv preprint arXiv:1703.09393, 2017.
-  C. Arteta, V. Lempitsky, and A. Zisserman, “Counting in the wild,” in ECCV, 2016, pp. 483–498.
-  W. Xie, J. A. Noble, and A. Zisserman, “Microscopy Cell Counting with Fully Convolutional Regression Networks,” in MICCAI, 2015, pp. 1–8.
-  H. Idrees, I. Saleemi, C. Seibert, and M. Shah, “Multi-source multi-scale counting in extremely dense crowd images,” in CVPR, 2013, pp. 2547–2554.
-  C. Shang, H. Ai, and B. Bai, “End-to-end crowd counting via joint learning local and global count,” in ICIP, 2016, pp. 1215–1219.
-  C. C. Loy, K. Chen, S. Gong, and T. Xiang, “Crowd counting and profiling: Methodology and evaluation,” in Modeling, Simulation and Visual Analysis of Crowds. Springer, 2013, pp. 347–382.
-  V. A. Sindagi and V. M. Patel, “A survey of recent advances in cnn-based single image crowd counting and density estimation,” Pattern Recognition Letters, 2017.
-  M. Li, Z. Zhang, K. Huang, and T. Tan, “Estimating the number of people in crowded scenes by mid based foreground segmentation and head-shoulder detection,” in ICPR, 2008, pp. 1–4.
-  B. Leibe, E. Seemann, and B. Schiele, “Pedestrian detection in crowded scenes,” in CVPR, vol. 1, 2005, pp. 878–885.
-  L. Wang and N. H. Yung, “Crowd counting and segmentation in visual surveillance,” in ICIP, 2009, pp. 2573–2576.
-  A. B. Chan and N. Vasconcelos, “Bayesian poisson regression for crowd counting,” in ICCV, 2009, pp. 545–551.
-  K. Chen, C. C. Loy, S. Gong, and T. Xiang, “Feature mining for localised crowd counting,” in BMVC, 2012.
-  S. Seguí, O. Pujol, and J. Vitrià, “Learning to count with deep object features,” in CVPR Workshops, 2015, pp. 90–96.
-  C. Arteta, V. Lempitsky, J. A. Noble, and A. Zisserman, “Interactive Object Counting,” in ECCV, 2014, pp. 1–15.
-  Q. Wang, J. Wan, and Y. Yuan, “Deep metric learning for crowdedness regression,” IEEE Transactions on Circuits and Systems for Video Technology, 2017.
-  A. B. Chan, Z.-S. J. Liang, and N. Vasconcelos, “Privacy preserving crowd monitoring: Counting people without people models or tracking,” in CVPR, 2008, pp. 1–7.
-  D. Ryan, S. Denman, C. Fookes, and S. Sridharan, “Crowd counting using multiple local features,” in Digital Image Computing: Techniques and Applications, 2009, pp. 81–88.
-  A. B. Chan and N. Vasconcelos, “Counting people with low-level features and bayesian regression,” IEEE Transactions on Image Processing, vol. 21, no. 4, pp. 2160–2177, 2012.
-  H. Fu, H. Ma, and H. Xiao, “Real-time accurate crowd counting based on rgb-d information,” in ICIP, 2012, pp. 2685–2688.
-  V. Lempitsky and A. Zisserman, “Learning to count objects in images,” in NIPS, 2010, pp. 1324–1332.
-  E. Walach and L. Wolf, “Learning to count with cnn boosting,” in ECCV. Springer, 2016, pp. 660–676.
-  D. Onoro-Rubio and R. J. López-Sastre, “Towards perspective-free object counting with deep learning,” in ECCV, 2016, pp. 615–629.
-  F. Yu, V. Koltun, and T. Funkhouser, “Dilated residual networks,” arXiv preprint arXiv:1705.09914, 2017.
-  Y. Zheng, H. Ye, L. Wang, and J. Pu, “Learning multiviewpoint context-aware representation for rgb-d scene classification,” IEEE Signal Processing Letters, vol. 25, no. 1, pp. 30–34, 2018.
-  C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks for action segmentation and detection,” in CVPR, 2017, pp. 1003–1012.
-  B. Xu, H. Ye, Y. Zheng, H. Wang, T. Luwang, and Y.-G. Jiang, “Dense dilated network for few shot action recognition,” in ICMR, 2018, pp. 379–387.