1 Introduction
Gatherings of large crowds are commonplace nowadays, and estimating the size of a crowd is an important problem for purposes ranging from journalism to public safety. Without turnstiles to provide a precise count, the media and crowd-safety specialists must estimate the size of the crowd from images and videos. Manual visual estimation, however, is difficult and laborious. Humans are good at subitizing, i.e., producing fast and accurate counts for a small number of items, but the accuracy with which humans count deteriorates as the number of items increases [7]. Furthermore, each additional item beyond a few adds an extra processing time of around 250 to 300 milliseconds [17]. As a result, any crowd-monitoring system that relies on humans for counting people in crowded scenes will be slow and unreliable. There is a need for an automatic computer vision algorithm that can accurately count the number of people in crowded scenes from images and videos of the crowds.
There exist a number of computer vision algorithms for crowd counting, and the current state-of-the-art methods are based on density estimation rather than detection-then-counting. Density-estimation methods use Convolutional Neural Networks (CNNs) [9, 8] to output a map of density values, one for each pixel of the input image. The final count estimate is obtained by summing over the predicted density map. Unlike the detection-then-counting approach (e.g., [5]), the output of the density-estimation approach at each pixel is not necessarily binary. Density estimation has proven more robust than detection-then-counting because it does not have to commit to binarized decisions at an early stage.
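To make the density-map-to-count relation concrete, here is a toy sketch in pure Python; the 4×4 map and its values are made up for illustration and are not the output of any model described here.

```python
# Toy illustration: in density estimation, the final crowd count is the
# sum of the predicted per-pixel density values. The map below is a
# made-up 4x4 prediction, not the output of a real model.
density_map = [
    [0.0, 0.1, 0.2, 0.0],
    [0.1, 0.5, 0.4, 0.1],
    [0.0, 0.3, 0.2, 0.0],
    [0.0, 0.0, 0.1, 0.0],
]

def count_from_density(density):
    """Estimated count = sum over all pixel densities."""
    return sum(sum(row) for row in density)

print(round(count_from_density(density_map), 1))  # 2.0
```

Note that no per-pixel decision is binarized: a head spread over several pixels contributes fractional density to each, and only the sum is interpreted as a count.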
Estimating the crowd density per pixel is a challenging task due to the large variation in crowd density values. As shown in Figure 1, some images contain hundreds of people, while others have only a few. It is difficult for a single CNN to handle the entire spectrum of crowd densities. Earlier works [20, 15] tackled this challenge by using a multi-column or a switching CNN architecture. These architectures consist of three parallel CNN branches with different receptive field sizes: a branch with smaller receptive fields can handle high-density images well, while a branch with larger receptive fields can handle low-density images. More recently, a five-branch CNN architecture was proposed [16] in which three of the branches resembled the previous multi-column CNN [20], while the remaining two branches acted as global and local context estimators. These context-estimator branches were trained beforehand on the related task of classifying the image into different density categories. The key takeaways from these previous approaches are: (1) using a multi-column CNN model with varying kernel sizes improves the performance of crowd density estimation; and (2) augmenting the feature set with features learned from a task related to density estimation, such as count-range classification, improves the performance of the density estimation task.
In this work, we propose the iterative counting Convolutional Neural Network (icCNN), a CNN-based iterative approach for crowd counting. Unlike previous approaches, where three [20, 15] or more [16] columns are needed to achieve good performance, our icCNN has a simpler architecture comprising two columns/branches. The first branch predicts a density map at a lower resolution than the original image, and passes the predicted map and a set of convolutional features to the second branch. The second branch predicts a high-resolution density map at the size of the original image. Density maps contain information about the spatial distribution of the crowd in an image; hence, the first-stage map serves as an important feature for the high-resolution prediction task. We also propose a multi-stage extension of icCNN in which multiple icCNNs are combined sequentially to further improve the quality of the predicted density map. Each icCNN in the multi-stage pipeline provides both its low- and high-resolution density predictions to all subsequent stages. Figure 2 illustrates the schematic architecture of icCNN. icCNN has two branches: the Low Resolution CNN (LRCNN), which predicts the density map at a low resolution, and the High Resolution CNN (HRCNN), which predicts the density map at the original image resolution. The key highlights of our work are:

- We propose icCNN, a two-stage CNN framework for crowd density estimation and counting.
- icCNN achieves state-of-the-art results on multiple crowd counting datasets. On the Shanghaitech Part B dataset, icCNN yields a large improvement in mean absolute error over the previously published results [16].
- We also propose a multi-stage extension of icCNN, which can combine predictions from multiple icCNN models.
2 Related Work
Crowd counting is an important research problem, and a number of approaches have been proposed by the computer vision community. Earlier work tackled crowd counting as an object detection problem [11, 12]. Lin et al. [12] extracted Haar features for head-like contours and used an SVM classifier to decide whether a feature corresponds to the contour of a head. Li et al. [11] proposed a detection-based approach in which the input image was first segmented into foreground and background regions, and a HOG-based head-shoulder detector was used to detect each person in the crowd. These detection-based methods often fail to accurately count people in extremely dense scenes. To handle images of dense crowds, some methods [2, 3] adopted a regression approach to avoid the harder detection problem: they extracted local patch-level features and learned a regression function to directly estimate the total count for an input image patch. These regression approaches, however, do not fully utilize the annotation available with the training data; they ignore the spatial density and distribution of people in training images. Several researchers [10, 14] proposed a density-estimation approach to take advantage of the provided crowd density annotation maps. Lempitsky and Zisserman [10] learned a linear mapping between crowd images and the corresponding ground-truth density maps. Pham et al. [14] learned a more robust mapping by using a random decision forest to estimate the crowd density map. These density-based methods address some of the challenges faced by the earlier detection- and regression-based approaches, by avoiding the harder detection problem while utilizing the spatial annotation and correlation. All aforementioned methods predated the deep-learning era and used hand-crafted features for crowd counting.
More recent methods [18, 4, 20, 15, 13, 16] used CNNs to tackle crowd counting. Wang et al. [18] posed crowd counting as a regression problem and used a CNN model to map an input crowd image to its corresponding count. Instead of predicting the overall count, Fu et al. [4] classified an image into five broad crowd-density categories and used a cascade of two CNNs in a boosting-like strategy, where the second CNN was trained on the images misclassified by the first. These methods also overlooked the benefits provided by the crowd density annotation maps.
The methods most closely related to our work are [20, 15, 16]. Zhang et al. [20] proposed a CNN-based method to predict crowd density maps. To handle the large variation in crowd densities and sizes across images, they proposed a multi-column CNN architecture (MCNN) with filters and receptive fields of various sizes. The CNN column with smaller receptive fields and filter sizes was responsible for the denser crowd images, while the columns with larger receptive fields and filter sizes were meant for the less dense images. The features from the three columns were concatenated and processed by a convolution layer to predict the final density map. To handle the variations in density and size within an image, the authors divided each image into non-overlapping patches and trained the MCNN architecture on these patches. Given that the number of training samples in annotated crowd counting datasets is much smaller than in datasets for image classification and segmentation, training a CNN from scratch on full images might lead to overfitting. Hence, patch-based training of MCNN was essential for preventing overfitting, and it also improved overall performance by serving as a data augmentation strategy. One issue with MCNN was that it fused the features from all three CNN columns when predicting the density map. For a given patch, counting performance can be expected to improve by choosing the one CNN column that specializes in images of similar density. Sam et al. [15] built on this idea and decoupled the three columns into separate CNNs, each focused on a subset of the training patches. To decide which CNN to assign a patch to, the authors trained a CNN-based switch classifier. However, since the ground-truth labels needed to train the switch classifier were unavailable, the authors resorted to a multi-stage training strategy: 1) training the three density-predicting CNNs on the entire set of training patches; 2) training the switch classifier, using the counts from the previous stage to decide the switch labels; and 3) retraining the three CNNs on the patches assigned by the switch classifier. In a more recent work, Sindagi et al. [16] further modified the MCNN architecture by adding two more branches for estimating global and local context maps. The global/local context branches were trained beforehand on the related task of classifying an image/patch into five count categories. The classification scores were used to create a feature map of the same size as the image/patch, which served as the global/local context map. These context maps were fused with the convolutional feature maps obtained from a three-branch multi-column CNN, and the resulting features were further processed by convolutional layers to obtain the final density map.
3 Proposed Approach
In this section, we describe the architecture of icCNN, its multi-stage extension, and the training strategy. icCNN is discussed in Section 3.1, the multi-stage extension in Section 3.2, and the training details in Section 3.3.
3.1 Iterative Counting CNN
Let $\mathcal{T} = \{(X_i, Y_i, Z_i)\}_{i=1}^{n}$ be the training set of (image, high-resolution density map, low-resolution density map) triplets, where $X_i$ is the image, $Y_i$ is the corresponding crowd density map at the same resolution as $X_i$, and $Z_i$ is a low-resolution version of the crowd density map. $Y_i$ and $Z_i$ have the same overall count. Let $f_{lr}$ and $f_{hr}$ be the mapping functions which transform the image into the low-resolution and high-resolution density maps, respectively. Let the parameters of the low-resolution branch (LRCNN) and the high-resolution branch (HRCNN) be $\Theta_{lr}$ and $\Theta_{hr}$, respectively. Note that $f_{lr}$ depends only on $\Theta_{lr}$, while $f_{hr}$ depends on both $\Theta_{lr}$ and $\Theta_{hr}$. Given an input image $X$, the low-resolution density map $\hat{Z}$ can be obtained by doing a forward pass through the LRCNN branch:

$$\hat{Z} = f_{lr}(X; \Theta_{lr}). \qquad (1)$$

The inputs to the high-resolution branch HRCNN are: the image $X$, the features computed by the low-resolution branch LRCNN, and the low-resolution prediction $\hat{Z}$. HRCNN predicts a high-resolution density map $\hat{Y}$ of the same size as the original image:

$$\hat{Y} = f_{hr}(X, \hat{Z}; \Theta_{lr}, \Theta_{hr}). \qquad (2)$$
The low-resolution prediction $\hat{Z}$ contains information about the spatial distribution of the crowd in the image $X$, and it serves as an important feature map for the high-resolution prediction task. We can learn the parameters $\Theta_{lr}$ and $\Theta_{hr}$ by minimizing the loss function $L(\Theta_{lr}, \Theta_{hr})$:

$$L(\Theta_{lr}, \Theta_{hr}) = \sum_{i=1}^{n} \left[ \lambda_{lr}\, \ell(\hat{Z}_i, Z_i) + \lambda_{hr}\, \ell(\hat{Y}_i, Y_i) \right], \qquad (3)$$

where $\ell$ denotes the loss between an estimated and a ground-truth density map; a reasonable choice is the squared error. $\lambda_{lr}$ and $\lambda_{hr}$ are scalar hyper-parameters which can be used to give more importance to one of the loss terms. Using Equations (1) and (2), the right-hand side can be further expanded as:

$$L(\Theta_{lr}, \Theta_{hr}) = \sum_{i=1}^{n} \left[ \lambda_{lr}\, \ell(f_{lr}(X_i; \Theta_{lr}), Z_i) + \lambda_{hr}\, \ell(f_{hr}(X_i, f_{lr}(X_i; \Theta_{lr}); \Theta_{lr}, \Theta_{hr}), Y_i) \right]. \qquad (4)$$
At test time, given an image $X$, we first obtain the low-resolution output $\hat{Z}$ by doing a forward pass through LRCNN, and then pass the convolutional features and the low-resolution map $\hat{Z}$ to HRCNN, which predicts the high-resolution map $\hat{Y}$. We use the high-resolution output predicted by HRCNN as the final output of icCNN. The overall crowd count is obtained by summing over all the pixels of the density map $\hat{Y}$.
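The two-branch flow and the combined objective of Equations (1)-(3) can be sketched as follows. This is a hedged toy sketch: `f_lr` and `f_hr` are stand-in functions (a 2×2 sum-pool and a count-preserving upsampler), not the paper's trained CNNs, and the loss weights default to 1.

```python
# Stand-in LRCNN: "predict" a low-resolution density map by 2x2 sum-pooling,
# which preserves the overall count of the toy input map.
def f_lr(image):
    h, w = len(image), len(image[0])
    return [[sum(image[2 * i + di][2 * j + dj] for di in range(2) for dj in range(2))
             for j in range(w // 2)] for i in range(h // 2)]

# Stand-in HRCNN: upsample the low-res prediction back to full resolution,
# spreading each low-res value over 4 pixels so the count is preserved.
def f_hr(image, z_lr):
    return [[z_lr[i // 2][j // 2] / 4.0 for j in range(2 * len(z_lr[0]))]
            for i in range(2 * len(z_lr))]

def sq_loss(pred, gt):
    """Squared error between two density maps (a reasonable choice for l)."""
    return sum((p - g) ** 2 for rp, rg in zip(pred, gt) for p, g in zip(rp, rg))

def ic_cnn_loss(image, y_gt, z_gt, lam_lr=1.0, lam_hr=1.0):
    """Combined objective of Eq. (3): weighted low- and high-res losses."""
    z_hat = f_lr(image)           # Eq. (1): low-resolution prediction
    y_hat = f_hr(image, z_hat)    # Eq. (2): refinement using the LR map
    return lam_lr * sq_loss(z_hat, z_gt) + lam_hr * sq_loss(y_hat, y_gt)
```

At test time, the same chain is followed without the loss: `f_lr` produces $\hat{Z}$, `f_hr` produces $\hat{Y}$, and the count is the sum over $\hat{Y}$.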
Below we provide the architecture details for the LRCNN and HRCNN branches.
LRCNN. The LRCNN branch takes an image as input and predicts a density map at a lower resolution than the original image. LRCNN has the following architecture: Conv3-64, Conv3-64, MaxPool, Conv3-128, Conv3-128, MaxPool, Conv3-256, Conv3-256, Conv3-256, Conv7-196, Conv5-96, Conv3-32, Conv1-1. Here, ConvX-Y denotes a convolution layer having Y filters with an X×X kernel, and MaxPool denotes a max-pooling layer. We use a ReLU non-linearity after each convolutional layer.
HRCNN. The HRCNN branch predicts the high-resolution density map at the same size as the input image. HRCNN has the following architecture: Conv7-16, MaxPool, Conv5-24, MaxPool, Conv3-48, Conv3-48, Conv3-24, Conv7-196, Conv5-96, Upsampling-2, Conv3-32, Upsampling-2, Conv1-1. Here, Upsampling-2 is a bilinear interpolation layer which upsamples its input to twice its size.
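As a sanity check on the two layer lists, one can track the spatial size of a feature map through each branch. This sketch assumes padded (size-preserving) convolutions, 2×2 max pooling that halves each dimension, and 2× upsampling that doubles it; under those assumptions, LRCNN outputs at 1/4 of the input side length, while HRCNN (two pools, two upsamplings) returns to the input resolution.

```python
# Layer lists transcribed from the architecture descriptions above.
LRCNN_LAYERS = ["Conv3-64", "Conv3-64", "MaxPool", "Conv3-128", "Conv3-128",
                "MaxPool", "Conv3-256", "Conv3-256", "Conv3-256",
                "Conv7-196", "Conv5-96", "Conv3-32", "Conv1-1"]
HRCNN_LAYERS = ["Conv7-16", "MaxPool", "Conv5-24", "MaxPool", "Conv3-48",
                "Conv3-48", "Conv3-24", "Conv7-196", "Conv5-96",
                "Upsampling-2", "Conv3-32", "Upsampling-2", "Conv1-1"]

def output_size(layers, size):
    """Track one spatial dimension of a feature map through the layer list."""
    for layer in layers:
        if layer == "MaxPool":
            size //= 2        # assumed 2x2 max pooling halves each dimension
        elif layer.startswith("Upsampling"):
            size *= 2         # bilinear upsampling doubles each dimension
        # padded convolutions are assumed to leave the spatial size unchanged
    return size

print(output_size(LRCNN_LAYERS, 224))  # 56: LRCNN predicts at 1/4 resolution
print(output_size(HRCNN_LAYERS, 224))  # 224: HRCNN matches the input size
```

The 1/4 output resolution of LRCNN is consistent with the low-resolution setting that works best in Table 2.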
3.2 Multistage Crowd Counting
A multi-stage icCNN is a network that combines multiple icCNN building blocks as described in the previous section. Each icCNN block takes as input the low- and high-resolution prediction maps from all previous blocks. Given an input image $X$, the low-resolution branch of the $k^{th}$ block, represented by the function $f_{lr}^{k}$, outputs the low-resolution prediction:

$$\hat{Z}^{k} = f_{lr}^{k}(X, \mathcal{Z}^{k-1}, \mathcal{Y}^{k-1}; \Theta_{lr}^{k}), \qquad (5)$$

where $\Theta_{lr}^{k}$ represents the parameters of the $k^{th}$ LRCNN, and $\mathcal{Z}^{k-1} = \{\hat{Z}^{1}, \ldots, \hat{Z}^{k-1}\}$ and $\mathcal{Y}^{k-1} = \{\hat{Y}^{1}, \ldots, \hat{Y}^{k-1}\}$ represent the sets of low- and high-resolution predictions from the first $k-1$ blocks for the input $X$. The high-resolution branch of the $k^{th}$ block, represented by the function $f_{hr}^{k}$, takes as input the image $X$, the feature maps computed by the low-resolution branch, the low-resolution prediction $\hat{Z}^{k}$, and the entire set of low- and high-resolution prediction maps from the first $k-1$ blocks. Hence, the output of the $k^{th}$ HRCNN can be computed as:

$$\hat{Y}^{k} = f_{hr}^{k}(X, \hat{Z}^{k}, \mathcal{Z}^{k-1}, \mathcal{Y}^{k-1}; \Theta_{lr}^{k}, \Theta_{hr}^{k}). \qquad (6)$$

Note that $\hat{Z}^{k}$ and $\hat{Y}^{k}$ do not depend on the parameters of the first $k-1$ blocks: $\mathcal{Z}^{k-1}$ and $\mathcal{Y}^{k-1}$ are treated as fixed inputs (i.e., the parameters of the corresponding network blocks are frozen). We can learn the parameters $\Theta_{lr}^{k}$ and $\Theta_{hr}^{k}$ by minimizing the loss function:

$$L^{k}(\Theta_{lr}^{k}, \Theta_{hr}^{k}) = \sum_{i=1}^{n} \left[ \lambda_{lr}\, \ell(\hat{Z}_i^{k}, Z_i) + \lambda_{hr}\, \ell(\hat{Y}_i^{k}, Y_i) \right]. \qquad (7)$$
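The stage-wise recurrence can be sketched as follows. This is a hedged toy sketch: the stage functions are hypothetical averaging stand-ins (real stages are trained CNNs), and density maps are flattened to 1-D lists for brevity. What it illustrates is the data flow: each stage receives the image plus all low- and high-resolution predictions from earlier, frozen stages.

```python
def make_stage(k):
    """Return toy (f_lr, f_hr) stand-ins for stage k; real stages are CNNs."""
    def f_lr(image, prev_z, prev_y):
        # A real LRCNN would learn from prev_z/prev_y; here we just average
        # the earlier low-res predictions together with the image.
        maps = prev_z + [image]
        return [sum(m[i] for m in maps) / len(maps) for i in range(len(image))]

    def f_hr(image, z_k, prev_z, prev_y):
        # Average the earlier high-res predictions with the current LR map.
        maps = prev_y + [z_k]
        return [sum(m[i] for m in maps) / len(maps) for i in range(len(image))]

    return f_lr, f_hr

def multi_stage_predict(image, num_stages):
    prev_z, prev_y = [], []               # frozen predictions of earlier stages
    for k in range(num_stages):
        f_lr, f_hr = make_stage(k)
        z_k = f_lr(image, prev_z, prev_y)        # Eq. (5)
        y_k = f_hr(image, z_k, prev_z, prev_y)   # Eq. (6)
        prev_z.append(z_k)                # later stages see all earlier
        prev_y.append(y_k)                # low- and high-res predictions
    return prev_y[-1]                     # final high-resolution output
```

During training, only the $k^{th}$ stage's parameters would be updated while `prev_z` and `prev_y` are treated as fixed inputs, mirroring Equation (7).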
3.3 Training Details
An icCNN is trained by minimizing the loss function from Equation (3). We use the Stochastic Gradient Descent algorithm with the following hyper-parameters (unless specified otherwise): a fixed learning rate and momentum, and batch size 1. We give more importance to the high-resolution loss term in Equation (3) by setting $\lambda_{hr}$ larger than $\lambda_{lr}$. We train a multi-stage icCNN in multiple stages. In the $k^{th}$ stage, we train the $k^{th}$ icCNN block by minimizing the loss function given in Equation (7), using the Stochastic Gradient Descent algorithm with the same hyper-parameters as above. Once the training for the $k^{th}$ stage has converged, we freeze its parameters and proceed to the next stage.
The training data consist of crowd images and corresponding ground-truth annotation files. A ground-truth annotation for an image specifies the location of each person with a single dot on the person. We convert this annotation into a binary map consisting of 0's at all locations except the annotated points, which are assigned the value 1. We convolve this binary map with a Gaussian filter of fixed standard deviation, and use the resulting density map for training the networks.
4 Experiments
We conduct experiments on three challenging datasets: Shanghaitech [20], WorldExpo’10 [19], and UCF Crowd Counting Dataset [6].
4.1 Evaluation Metrics
Following previous works on crowd counting, we use the Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) to evaluate the performance of our proposed method. If the predicted count for image $i$ is $\hat{y}_i$ and the ground-truth count is $y_i$, the MAE and RMSE are computed as:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |\hat{y}_i - y_i|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2}, \qquad (8)$$

where $N$ is the number of test images.
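Both metrics of Equation (8) can be computed directly from per-image counts; the counts below are illustrative numbers, not results from the paper.

```python
import math

def mae(pred, gt):
    """Mean Absolute Error over per-image counts."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(gt)

def rmse(pred, gt):
    """Root Mean Squared Error over per-image counts."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))

pred_counts = [105.0, 48.0, 210.0]   # illustrative predicted counts
gt_counts   = [100.0, 50.0, 200.0]   # illustrative ground-truth counts
print(mae(pred_counts, gt_counts))   # (5 + 2 + 10) / 3 ≈ 5.67
print(rmse(pred_counts, gt_counts))  # sqrt((25 + 4 + 100) / 3) ≈ 6.56
```

RMSE penalizes large per-image errors more heavily than MAE, which is why the two metrics can rank methods differently (as in Table 1).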
4.2 Experiments on the Shanghaitech Dataset
Table 1: Comparison with previous approaches on the Shanghaitech dataset (lower is better).
Method | Part A MAE | Part A RMSE | Part B MAE | Part B RMSE
Crowd CNN [19] | 181.8 | 277.7 | 32.0 | 49.8
MCNN [20] | 110.2 | 173.2 | 26.4 | 41.3
Switching CNN [15] | 90.4 | 135.0 | 21.6 | 33.4
CPCNN [16] | 73.6 | 106.4 | 20.1 | 30.1
icCNN (one stage) | 69.8 | 117.3 | 10.4 | 16.7
icCNN (two stages) | 68.5 | 116.2 | 10.7 | 16.0
The Shanghaitech dataset [20] consists of 1,198 annotated crowd images. The dataset is divided into two parts, Part A containing 482 images and Part B containing 716 images. Part A is split into train and test subsets of 300 and 182 images, respectively; Part B is split into train and test subsets of 400 and 316 images. Each person in a crowd image is annotated with one point close to the center of the head. In total, the dataset contains 330,165 annotated people. Images from Part A were collected from the Internet, while images from Part B were collected on the busy streets of Shanghai. To avoid the risk of overfitting to the small number of training images, we trained icCNNs on random crops whose size is a fixed fraction of $H \times W$, where $H$ and $W$ are the height and width of a training image. In Table 1, we compare icCNN with the previous state-of-the-art approaches. icCNN outperforms the previous approaches in three out of four cases by a large margin. On Part B of the Shanghaitech dataset, using the one-stage icCNN, which has a simpler architecture than the five-branch CPCNN [16], we improve on the previously reported state-of-the-art results by a large margin on both the MAE and the RMSE metric. On Part A of the Shanghaitech dataset, we achieve a 3.8 absolute improvement in MAE over CPCNN. Furthermore, for Part A, the two-stage icCNN yields a further 1.3 improvement in MAE over the one-stage icCNN. We also trained a three-stage icCNN on Part A, which resulted in MAE = 69.4 and RMSE = 116.0. Since adding the third stage did not yield a significant performance gain, we did not experiment with more than three stages.
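The crop-based augmentation used for training can be sketched as follows. The half-side crop fraction here is an illustrative assumption, not the paper's exact value; the key point is that the image crop and the density-map crop share the same window, so the crop's ground-truth count stays consistent with its contents.

```python
import random

def random_crop(image, density, frac=0.5, rng=random):
    """Take a random crop of an image and the SAME window of its density map.

    `frac` (fraction of each side) is an illustrative choice; both arrays are
    plain nested lists here, standing in for image and density-map tensors.
    """
    h, w = len(image), len(image[0])
    ch, cw = int(h * frac), int(w * frac)
    top = rng.randrange(h - ch + 1)          # random top-left corner
    left = rng.randrange(w - cw + 1)
    img_crop = [row[left:left + cw] for row in image[top:top + ch]]
    den_crop = [row[left:left + cw] for row in density[top:top + ch]]
    return img_crop, den_crop
```

Because the density map is cropped with the identical window, summing the cropped density map still gives the correct count for the cropped region, so each crop is a valid training sample.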
In Table 2, we analyze the effect of varying the resolution of the intermediate prediction on overall performance. Using any low-resolution setting other than 1/4 of the image size leads to a drop in performance.
Table 2: Effect of the intermediate (low-resolution) prediction size, Shanghaitech Part A.
LR Resolution | HR Resolution | MAE | RMSE
1/8 | 1 | 74.9 | 131.6
1/4 | 1 | 69.8 | 117.3
1/2 | 1 | 73.3 | 124.4
1 | 1 | 74.4 | 128.3
Table 3: MAE of the LRCNN and HRCNN branches as $\lambda_{hr}$ is varied, Shanghaitech Part A.
LRCNN MAE | HRCNN MAE
73.7 | 78.8
73.0 | 73.6
75.1 | 73.3
79.9 | 69.8
432.6 | 74.4
In Table 3, we analyze the effect of the hyper-parameter $\lambda_{hr}$ on the performance of icCNN, using the Shanghaitech Part A dataset. We show the MAE of the high- and low-resolution branches as the scalar weight $\lambda_{hr}$ is varied while $\lambda_{lr}$ is kept fixed. The LRCNN branch performs better when $\lambda_{hr}$ is comparable to $\lambda_{lr}$, and its performance degrades when $\lambda_{hr}$ is too large. The performance of HRCNN improves as $\lambda_{hr}$ is increased. In the extreme case, when $\lambda_{hr}$ is made very large, there is a large degradation in the performance of the LRCNN branch, which in turn affects the performance of the HRCNN branch: the low-resolution prediction task is essentially ignored, the network focuses solely on the high-resolution task, and the low-resolution prediction no longer contains useful information for the high-resolution branch. We obtain the best results for the HRCNN branch when $\lambda_{hr}$ emphasizes the high-resolution loss without forcing the network to completely ignore the low-resolution task.
In Table 4, we show the training time and the number of parameters of icCNN, MCNN, Switching CNN, and CPCNN. An icCNN takes 10 hours to train, while a Switching CNN takes around 22 hours. An icCNN also has significantly fewer parameters than a CPCNN or a Switching CNN. We contacted the authors of MCNN and CPCNN, but did not receive the training times for these networks.
Table 4: Training time, number of parameters, and MAE on Shanghaitech Part A.
Model | Training Time | Number of Parameters | MAE
MCNN [20] | unknown | | 110.2
Switching CNN [15] | 22 hrs | | 90.4
CPCNN [16] | unknown | | 73.6
icCNN (proposed) | 10 hrs | | 69.8
In Table 5, we analyze the importance of each component of our proposed icCNN model. Both the feature sharing and the feedback of the low-resolution prediction are important: removing either component leads to a significant drop in performance.
Method  MAE  RMSE 

LRCNN alone  78.5  133.2 
HRCNN alone  136.2  204.0 
HRCNN + LRCNN features (no lowres prediction)  75.1  129.0 
HRCNN + LRCNN lowres prediction (no features)  77.4  130.4 
icCNN (proposed)  69.8  117.3 
In Figure 3, we analyze the performance of icCNN across different groups of images with varying crowd counts.
4.3 Experiments on the WorldExpo’10 Dataset
The WorldExpo’10 dataset consists of annotated video sequences captured by surveillance cameras. Annotated frames from a subset of the cameras are used for training, and annotated frames from the remaining cameras, covering five test scenes (S1-S5), are used for testing. We trained icCNN networks using random crops, and used the networks trained on Shanghaitech Part A to initialize the models for the experiments on the WorldExpo dataset. In Table 6, we compare icCNN with other state-of-the-art approaches. icCNN outperforms these previous approaches in three out of five test scenes.
Method  S1  S2  S3  S4  S5  Avg 

Crowd CNN [19]  9.8  14.1  14.3  22.2  3.7  12.9 
MCNN [20]  3.4  20.6  12.9  13.0  8.1  11.6 
Switching CNN (sans perspective) [15]  4.4  15.7  10.0  11.0  5.9  9.4 
Switching CNN (with perspective) [15]  4.2  14.9  14.2  18.7  4.3  11.2 
CPCNN[16]  2.9  14.7  10.5  10.4  5.8  8.8 
icCNN (proposed)  17.0  12.3  9.2  8.1  4.7  10.3 
4.4 Experiments on the UCF Dataset
Method  MAE  RMSE 

Lempitsky & Zisserman [10]  493.4  487.1 
Idrees et al. [6]  419.5  487.1 
Crowd CNN [19]  467.0  498.5 
Crowdnet [1]  452.5   
MCNN [20]  377.6  509.1 
Hydra2s [13]  333.7  425.6 
Switch CNN [15]  318.1  439.2 
CPCNN [16]  295.8  320.9 
icCNN (proposed)  260.9  365.5 
The UCF Crowd Counting dataset [6] consists of 50 crowd images collected from the web. Each person in the dataset is annotated with a single dot. The number of people per image varies from 94 to 4,543, with an average of 1,280 people per image; the average count is much larger than in the previous two datasets. Following previous works using this dataset, we performed five-fold cross-validation and report the MAE and RMSE values. We trained icCNN networks using random crops. We compare icCNN with previous approaches in Table 7. Since the dataset is small, adding multiple stages to icCNN could lead to overfitting; hence, we only use the one-stage icCNN on the UCF dataset. icCNN achieves the best MAE on this dataset, outperforming CPCNN by a large margin.
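The five-fold evaluation protocol can be sketched as a standard partition of the 50 images: each fold serves once as the test set while the remaining folds are used for training, and the metrics are averaged. Fold assignment by index order is an assumption for illustration; the actual folds used in prior work may differ.

```python
def five_fold_splits(num_images=50, num_folds=5):
    """Partition image indices into (train, test) pairs for k-fold CV."""
    indices = list(range(num_images))
    fold_size = num_images // num_folds
    splits = []
    for k in range(num_folds):
        test = indices[k * fold_size:(k + 1) * fold_size]
        train = indices[:k * fold_size] + indices[(k + 1) * fold_size:]
        splits.append((train, test))
    return splits

splits = five_fold_splits()
assert len(splits) == 5                    # five folds
assert all(len(t) == 10 for _, t in splits)  # 10 test images per fold
```

Each image thus appears in exactly one test fold, so the reported MAE/RMSE cover the whole dataset.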
4.5 Qualitative Results
In Figure 4, we show some qualitative results obtained using icCNN on images from the Shanghaitech Part A dataset. The first three are success cases, while the last two are failure cases: in the failure cases, icCNN misclassifies tree leaves as tiny people in a crowd. In Figure 5, we show some qualitative results on images from the Shanghaitech Part B dataset.
5 Conclusions
In this paper, we have proposed icCNN, a two-branch architecture for crowd counting via crowd density estimation. We have also proposed a multi-stage pipeline comprising multiple icCNNs, where each stage takes into account the predictions of all previous stages. Experiments on three challenging crowd counting benchmarks demonstrate the effectiveness of our iterative approach.
Figure 4 (Shanghaitech Part A) per-image counts: Image  Ground truth  LR output  HR output 
502  793  512  
270  346  280  
86  114  89  
172  493  317  
566  961  744 
Figure 5 (Shanghaitech Part B) per-image counts: Image  Ground truth  LR output  HR output 
23  26  24  
252  257  252  
183  191  186  
181  167  164  
84  109  103 
Acknowledgements. This work was supported by the SUNY2020 Infrastructure Transportation Security Center. The authors would like to thank Boyu Wang for participating in the discussions and experiments related to an earlier version of the proposed technique. The authors would also like to thank NVIDIA for their GPU donation.
References
 [1] Boominathan, L., Kruthiventi, S.S., Babu, R.V.: Crowdnet: A deep convolutional network for dense crowd counting. In: Proceedings of the ACM Multimedia Conference (2016)
 [2] Chan, A.B., Vasconcelos, N.: Bayesian poisson regression for crowd counting. In: Proceedings of the International Conference on Computer Vision (2009)
 [3] Chen, K., Loy, C.C., Gong, S., Xiang, T.: Feature mining for localised crowd counting. In: Proceedings of the British Machine Vision Conference (2012)

 [4] Fu, M., Xu, P., Li, X., Liu, Q., Ye, M., Zhu, C.: Fast crowd density estimation with convolutional neural networks. Engineering Applications of Artificial Intelligence 43, 81–88 (2015)
 [5] Hoai, M., Zisserman, A.: Talking heads: Detecting humans and recognizing their interactions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014)
 [6] Idrees, H., Saleemi, I., Seibert, C., Shah, M.: Multi-source multi-scale counting in extremely dense crowd images. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2013)
 [7] Kaufman, E.L., Lord, M.W., Reese, T.W., Volkmann, J.: The discrimination of visual number. The American Journal of Psychology 62(4), 498–525 (1949)

 [8] Krizhevsky, A., Sutskever, I., Hinton, G.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems (2012)

 [9] LeCun, Y., Boser, B., Denker, J.S., Henderson, D.: Backpropagation applied to handwritten zip code recognition. Neural Computation 1(4), 541–551 (1989)
 [10] Lempitsky, V., Zisserman, A.: Learning to count objects in images. In: Advances in Neural Information Processing Systems (2010)
 [11] Li, M., Zhang, Z., Huang, K., Tan, T.: Estimating the number of people in crowded scenes by MID based foreground segmentation and head-shoulder detection. In: Proceedings of the International Conference on Pattern Recognition (2008)
 [12] Lin, S.F., Chen, J.Y., Chao, H.X.: Estimation of number of people in crowded scenes using perspective transformation. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans 31(6), 645–654 (2001)
 [13] Oñoro-Rubio, D., López-Sastre, R.J.: Towards perspective-free object counting with deep learning. In: Proceedings of the European Conference on Computer Vision (2016)

 [14] Pham, V.Q., Kozakaya, T., Yamaguchi, O., Okada, R.: Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation. In: Proceedings of the International Conference on Computer Vision (2015)
 [15] Sam, D.B., Surya, S., Babu, R.V.: Switching convolutional neural network for crowd counting. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
 [16] Sindagi, V.A., Patel, V.M.: Generating high-quality crowd density maps using contextual pyramid CNNs. In: Proceedings of the International Conference on Computer Vision (2017)
 [17] Trick, L.M., Pylyshyn, Z.W.: Why are small and large numbers enumerated differently? A limited-capacity preattentive stage in vision. Psychological Review 101(1), 80 (1994)
 [18] Wang, C., Zhang, H., Yang, L., Liu, S., Cao, X.: Deep people counting in extremely dense crowds. In: Proceedings of the ACM Multimedia Conference (2015)
 [19] Zhang, C., Li, H., Wang, X., Yang, X.: Cross-scene crowd counting via deep convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015)
 [20] Zhang, Y., Zhou, D., Chen, S., Gao, S., Ma, Y.: Single-image crowd counting via multi-column convolutional neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016)