Crowd counting is important for applications such as video surveillance and traffic control. In recent years, the emphasis has been on developing counting-by-density
algorithms that rely on regressors trained to estimate the people density per unit area, so that the total number of people can be obtained by integration, without requiring explicit detection. The regressors can be based on Random Forests, Gaussian Processes, or, more recently, Deep Nets [41, 42, 26, 31, 40, 36, 32, 24, 19, 30, 33, 22, 15, 28, 5], with most state-of-the-art approaches now relying on the latter.
Standard convolutions are at the heart of these deep-learning-based approaches. By using the same filters and pooling operations over the whole image, these implicitly rely on the same receptive field everywhere. However, due to perspective distortion, one should instead change the receptive field size across the image. In the past, this has been addressed by combining either density maps extracted from image patches at different resolutions or feature maps obtained with convolutional filters of different sizes [42, 5]. However, by indiscriminately fusing information at all scales, these methods ignore the fact that scale varies continuously across the image. While this was addressed in [31, 30] by training classifiers to predict the size of the receptive field to use locally, the resulting methods are not end-to-end trainable; cannot account for rapid scale changes because they assign a single scale to relatively large patches; and can only exploit a small range of receptive fields for the networks to remain of a manageable size.
In this paper, we introduce a deep architecture that explicitly extracts features over multiple receptive field sizes and learns the importance of each such feature at every image location, thus accounting for potentially rapid scale changes. In other words, our approach adaptively encodes the scale of the contextual information necessary to predict crowd density. This is in contrast to crowd-counting approaches that also use contextual information to account for scaling effects, but do so only in the loss function rather than by computing true multi-scale features as we do. We will show that our approach works better on uncalibrated images. When calibration data is available, we will also show that it can be leveraged to infer suitable local scales and further increase performance.
Our contribution is therefore an approach that incorporates multi-scale contextual information directly into an end-to-end trainable crowd counting pipeline, and learns to exploit the right context at each image location. As shown by our experiments, we consistently outperform the state of the art on all standard crowd counting benchmarks, such as ShanghaiTech, WorldExpo’10, UCF_CC_50 and UCF_QNRF, as well as on our own Venice dataset, which features strong perspective distortion.
2 Related Work
Early crowd counting methods [39, 38, 20] tended to rely on counting-by-detection, that is, explicitly detecting individual heads or bodies and then counting them. Unfortunately, in very crowded scenes, occlusions make detection difficult, and these approaches have been largely displaced by counting-by-density-estimation ones, which rely on training a regressor to estimate people density in various parts of the image and then integrating. This trend began in [7, 18, 10], using either Gaussian Process or Random Forests regressors. Even though approaches relying on low-level features [9, 6, 4, 27, 7, 14] can yield good results, they have now mostly been superseded by CNN-based methods [42, 31, 5], a survey of which can be found in the literature. The same can be said about methods that count objects instead of people [1, 2, 8].
The people density we want to measure is the number of people per unit area on the ground. However, deep nets operate in the image plane and, as a result, the density estimate can be severely affected by the local scale of a pixel, that is, the ratio between image area and corresponding ground area. This problem has long been recognized. For example, the algorithms of [41, 17] use geometric information to adapt the network to different scene geometries. Because this information is not always readily available, other works have focused on handling the scale implicitly within the model. One way to do this is to learn to predict pre-defined density levels. These levels, however, need to be provided by a human annotator at training time. By contrast, the algorithms of [26, 32] use image patches extracted at multiple scales as input to a multi-stream network. They then either fuse the features for the final density prediction, without accounting for continuous scale changes, or introduce an ad hoc term in the training loss function to enforce prediction consistency across scales. This, however, does not encode contextual information into the features produced by the network and therefore has limited impact. While [42, 5] aim to learn multi-scale features by using different receptive fields, they indiscriminately combine all of these features to predict the density.
In other words, while the previous methods account for scale, they ignore the fact that the suitable scale varies smoothly over the image and should be handled adaptively. This has been addressed by weighting different density maps generated from input images at various scales. However, the density map at each scale only depends on features extracted at this particular scale, and thus may already be corrupted by the lack of adaptive-scale reasoning. Here, we argue that one should rather extract features at multiple scales and learn how to adaptively combine them. While this, in essence, was also the motivation of [31, 30], which train an extra classifier to assign the best receptive field to each image patch, these methods remain limited in several important ways. First, they rely on classifiers, which requires pre-training the network before training the classifier and is thus not end-to-end trainable. Second, they typically assign a single scale to an entire image patch, which can still be large, and thus do not account for rapid scale changes. Last, but not least, the range of receptive field sizes they rely on remains limited, in part because using much larger ones would require much deeper architectures, which may not be easy to train given the kind of networks being used.
By contrast, in this paper, we introduce an end-to-end trainable architecture that adaptively fuses multi-scale features, without requiring the explicit definition of patches, by learning how to weigh these features for each individual pixel, thus allowing us to accommodate rapid scale changes. By leveraging multi-scale pooling operations, our framework can cover an arbitrarily large range of receptive fields, thus enabling us to account for much larger context than the multiple receptive fields used by the above-mentioned methods. In Section 4, we will demonstrate that it delivers superior performance.
As discussed above, we aim to exploit context, that is, the large-scale consistencies that often appear in images. However, properly assessing what the scope and extent of this context should be in images that have undergone perspective distortion is a challenge. To meet it, we introduce a new deep net architecture that adaptively encodes multi-level contextual information into the features it produces. We then show how to use these scale-aware features to regress to a final density map, both when the cameras are not calibrated and when they are.
3.1 Scale-Aware Contextual Features
We formulate crowd counting as regressing a people density map from an image. Given a set of training images $\{I_i\}$ with corresponding ground-truth density maps $\{D_i^{gt}\}$, our goal is to learn a non-linear mapping $\mathcal{F}$, parameterized by $\Theta$, that maps an input image $I$ to an estimated density map $\mathcal{F}(I; \Theta)$ that is as similar as possible to $D^{gt}$ in terms of the $L_2$ norm. To this end, we first encode the image with the front-end convolutional layers of a VGG-16 network, yielding

$f_v = F_{vgg}(I)$ , (1)

which we take as base features to build our scale-aware ones.
As discussed in Section 2, the limitation of $f_v$ is that it encodes the same receptive field over the entire image. To remedy this, we compute scale-aware features by performing Spatial Pyramid Pooling to extract multi-scale context information from the VGG features of Eq. 1. Specifically, as illustrated at the bottom of Fig. 1, we compute these scale-aware features as

$s_j = U_{bi}\big(F_j(P_{ave}(f_v, j))\big)$ , (2)

where, for each scale $j$, $P_{ave}(f_v, j)$ averages the VGG features into $k(j) \times k(j)$ blocks; $F_j$ is a convolutional network with kernel size 1 that weighs the context features without changing their dimensions, which is in contrast to earlier architectures that convolve to reduce the dimension [37, 43]; and $U_{bi}$ represents bilinear interpolation to up-sample the array of contextual features to the same size as $f_v$. In practice, we use $S$ different scales, with corresponding block sizes $k(j)$.
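As an illustration, the pooling-and-upsampling step of Eq. 2 can be sketched as follows. This is a minimal NumPy sketch, not our actual implementation: it omits the learned 1×1 convolution $F_j$, substitutes nearest-neighbor up-sampling for the bilinear interpolation $U_{bi}$, assumes block counts that divide the feature-map size, and the block sizes {1, 2, 3, 6} are purely illustrative.

```python
import numpy as np

def scale_aware_feature(f, k):
    """Average-pool a (C, H, W) feature map into k x k blocks, then
    up-sample back to (C, H, W). The learned 1x1 convolution of Eq. 2
    is omitted, and nearest-neighbour up-sampling stands in for the
    bilinear interpolation."""
    C, H, W = f.shape
    assert H % k == 0 and W % k == 0, "simplification: k must divide H and W"
    pooled = f.reshape(C, k, H // k, k, W // k).mean(axis=(2, 4))  # (C, k, k)
    return pooled.repeat(H // k, axis=1).repeat(W // k, axis=2)    # (C, H, W)

f_v = np.random.rand(4, 24, 24)   # stand-in for the VGG features of Eq. 1
s = [scale_aware_feature(f_v, k) for k in (1, 2, 3, 6)]  # illustrative scales

assert all(sj.shape == f_v.shape for sj in s)
# k = 1 reduces to the global average, i.e., the coarsest possible context
assert np.allclose(s[0], f_v.mean(axis=(1, 2), keepdims=True))
```

Larger block counts preserve finer spatial context, while small ones summarize the whole scene, which is what lets later stages choose an appropriate receptive field per location.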
The simplest way to use our scale-aware features would be to concatenate all of them to the original VGG features $f_v$. This, however, would not account for the fact that scale varies across the image. To model this, we propose to learn to predict weight maps that set the relative influence of each scale-aware feature at each spatial location. To this end, we first define contrast features as

$c_j = s_j - f_v$ . (3)
They capture the differences between the features at a specific location and those in its neighborhood, which often is an important visual cue denoting saliency. For example, in the image of Fig. 2, the eye is naturally drawn to the woman at the center, in part because edges in the rest of the image all point in her direction whereas edges at her location do not. In our context, these contrast features provide us with important information to understand the local scale of each image region. We therefore exploit them as input to auxiliary networks $F_j^{sa}$ that compute the weights assigned to each one of the $S$ different scales we use. Each such network outputs a scale-specific weight map of the form

$w_j = F_j^{sa}(c_j)$ . (4)
We then employ these weights to compute our final contextual features as

$f^{I} = \big[\, f_v \,\big|\, \textstyle\sum_{j=1}^{S} w_j \odot s_j \,/\, \sum_{j=1}^{S} w_j \,\big]$ , (5)

where $[\,\cdot\,|\,\cdot\,]$ denotes the channel-wise concatenation operation, and $\odot$ is the element-wise product between a weight map and a feature map.
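To make the contrast, weighting, and fusion steps concrete, the following NumPy sketch chains them on toy data. The random 1×1 projections followed by a sigmoid are illustrative stand-ins for the learned auxiliary networks, and the single-channel weight maps are a simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W, S = 4, 8, 8, 3
f_v = rng.random((C, H, W))                    # base features
s = [rng.random((C, H, W)) for _ in range(S)]  # up-sampled scale-aware features

# contrast features: local features vs. their multi-scale context
c = [sj - f_v for sj in s]

# toy auxiliary "networks": a 1x1 projection followed by a sigmoid,
# yielding one spatial weight map per scale (illustrative stand-in)
theta = [rng.standard_normal(C) for _ in range(S)]
w = [1.0 / (1.0 + np.exp(-np.einsum('c,chw->hw', t, cj)))
     for t, cj in zip(theta, c)]

# normalised weighted sum of the scale-aware features,
# concatenated channel-wise with the base features
num = sum(wj[None] * sj for wj, sj in zip(w, s))
den = sum(w)
f_ctx = np.concatenate([f_v, num / den[None]], axis=0)

assert f_ctx.shape == (2 * C, H, W)
assert all(0.0 < wj.min() and wj.max() < 1.0 for wj in w)
```

Because the weights are produced per pixel, neighboring locations can favor different scales, which is precisely what fixed per-patch scale assignment cannot do.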
Altogether, as illustrated in Fig. 1, the network extracts the contextual features as discussed above, which are then passed to a decoder consisting of several dilated convolutions that produces the density map. The specific architecture of the network is described in Table 1. As shown by our experiments, this network already outperforms the state of the art on all benchmark datasets, without explicitly using information about camera geometry. As discussed below, however, these results can be further improved when such information is available.
3.2 Geometry-Guided Context Learning
Because of perspective distortion, the contextual scope suitable for each region varies across the image plane. Hence, scene geometry is highly related to contextual information and could be used to guide the network to better adjust to the scene context it needs.
We therefore extend the previous approach to exploit geometry information when it is available. To this end, we represent the scene geometry of image $I$ with a perspective map $M$, which encodes the number of pixels per meter in the image plane. Note that this perspective map has the same spatial resolution as the input image. We therefore use it as input to a truncated VGG-16 network. In other words, the base features of Eq. 1 are then replaced by features of the form

$f_g = F_{vgg}^{g}(M)$ , (6)
where $F_{vgg}^{g}$ is a modified VGG-16 network with a single input channel. To initialize the weights corresponding to this channel, we average those of the original three RGB channels. We also normalize the perspective map to lie within the same range as the RGB images. Even though this initialization does not make any obvious difference in the final counting accuracy, it makes the network converge much faster.
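The single-channel initialization can be sketched as follows; the filter shapes match the first VGG-16 convolution, but the weight values here are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)
# first conv layer of a VGG-16-like network: 64 filters, 3 input channels, 3x3
w_rgb = rng.standard_normal((64, 3, 3, 3))

# single-channel variant for the perspective-map input,
# initialised by averaging the three RGB input channels
w_mono = w_rgb.mean(axis=1, keepdims=True)     # (64, 1, 3, 3)
assert w_mono.shape == (64, 1, 3, 3)

# sanity check: on a grey patch (identical channels), the mono filter
# reproduces the RGB response up to the channel-count factor of 3
patch = rng.random((3, 3))
grey = np.broadcast_to(patch, (3, 3, 3))
r_rgb = (w_rgb[0] * grey).sum()
r_mono = 3.0 * (w_mono[0, 0] * patch).sum()
assert np.isclose(r_rgb, r_mono)
```

This explains why normalizing the perspective map to the RGB range is useful: the averaged filters then see inputs of the magnitude they were pre-trained on, which speeds up convergence.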
To further propagate the geometry information to later stages of our network, we exploit the modified VGG features described above, which inherently contain geometry information, as an additional input to the auxiliary networks of Eq. 4. Specifically, the weight map for each scale $j$ is then computed as

$w_j = F_j^{sa}\big([\, c_j \,|\, f_g \,]\big)$ . (7)
| Layers | Front end (from VGG-16) | Layer | Back end (decoder) |
|---|---|---|---|
| 1–2 | 3×3×64 conv-1 | 1–3 | 3×3×512 conv-2 |
| | 2×2 max pooling | 4 | 3×3×256 conv-2 |
| 3–4 | 3×3×128 conv-1 | 5 | 3×3×128 conv-2 |
| | 2×2 max pooling | 6 | 3×3×64 conv-2 |
| 5–7 | 3×3×256 conv-1 | 7 | 1×1×1 conv-1 |
| | 2×2 max pooling | | |
| 8–10 | 3×3×512 conv-1 | | |

Table 1: Architecture of our network. conv-d denotes a convolution with dilation rate d.
3.3 Training Details and Loss Function
Whether with or without geometry information, our networks are trained using the $L_2$ loss defined as

$L(\Theta) = \frac{1}{2B} \sum_{i=1}^{B} \big\| \mathcal{F}(I_i; \Theta) - D_i^{gt} \big\|_2^2$ , (8)

where $B$ is the batch size. To obtain the ground-truth density maps $D_i^{gt}$, we rely on the same strategy as previous work [19, 31, 42, 30]. Specifically, to each image $I_i$, we associate a set of 2D points $P_i$ that denote the position of each human head in the scene. The corresponding ground-truth density map is obtained by convolving an image containing ones at these locations and zeroes elsewhere with a Gaussian kernel of bandwidth $\sigma$. We write

$D_i^{gt}(p) = \sum_{P \in P_i} \mathcal{N}(p; P, \sigma)$ , (9)

where $\mathcal{N}(p; P, \sigma)$ denotes a 2D Gaussian centered at head location $P$ and evaluated at pixel $p$. As discussed in Section 4, we use the same $\sigma$ as the methods we compare against.
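A minimal sketch of this ground-truth generation, with each head's Gaussian normalized so that every annotated person contributes exactly one to the integral (fixed-kernel case; the value of sigma and the head positions are illustrative):

```python
import numpy as np

def density_map(shape, heads, sigma):
    """Ground-truth density map: one normalised 2D Gaussian per
    annotated head position (fixed-kernel case)."""
    H, W = shape
    ys, xs = np.mgrid[0:H, 0:W]
    d = np.zeros(shape)
    for (r, c) in heads:
        g = np.exp(-((ys - r) ** 2 + (xs - c) ** 2) / (2.0 * sigma ** 2))
        d += g / g.sum()   # each person integrates to exactly 1
    return d

heads = [(10, 12), (30, 40), (32, 41)]   # illustrative head annotations
D = density_map((64, 64), heads, sigma=2.0)

# integrating the density map recovers the annotated people count
assert np.isclose(D.sum(), len(heads))
```

Normalizing each Gaussian explicitly, rather than trusting an unnormalized kernel, guarantees that the density integrates to the head count even near image borders.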
To minimize the loss of Eq. 8, we train on patches cropped from the original images at different locations, each covering a fixed fraction of the image size. These patches are further mirrored to double the training set.
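The cropping-and-mirroring augmentation can be sketched as follows; the patch fraction of 1/2 per dimension is illustrative, and the image and its density map are cropped identically so that counts stay consistent.

```python
import numpy as np

rng = np.random.default_rng(2)

def random_patch(img, den, frac=0.5):
    """Crop the same random patch from an image and its density map.
    The per-dimension fraction `frac` is illustrative."""
    H, W = img.shape[:2]
    ph, pw = int(H * frac), int(W * frac)
    r = int(rng.integers(0, H - ph + 1))
    c = int(rng.integers(0, W - pw + 1))
    return img[r:r + ph, c:c + pw], den[r:r + ph, c:c + pw]

img = rng.random((64, 96, 3))
den = rng.random((64, 96))
p_img, p_den = random_patch(img, den)

# horizontal mirroring doubles the training set
m_img, m_den = p_img[:, ::-1], p_den[:, ::-1]

assert p_img.shape == (32, 48, 3) and p_den.shape == (32, 48)
assert np.isclose(m_den.sum(), p_den.sum())   # mirroring preserves the count
```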
In this section, we evaluate the proposed approach. We first introduce the evaluation metrics and benchmark datasets we use in our experiments. We then compare our approach to state-of-the-art methods, and finally perform a detailed ablation study.
4.1 Evaluation Metrics
As in previous work, we use the mean absolute error (MAE) and the root mean squared error (RMSE), defined as

$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |z_i - \hat{z}_i|$ , $\quad \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (z_i - \hat{z}_i)^2}$ ,

where $N$ is the number of test images, $z_i$ denotes the true number of people inside the ROI of the $i$-th image and $\hat{z}_i$ the estimated number of people. In the benchmark datasets discussed below, the ROI is the whole image except when explicitly stated otherwise. Note that the number of people can be recovered by integrating the predicted density map over the pixels of the ROI, that is, $\hat{z}_i = \sum_{p \in \mathrm{ROI}_i} D_i^{est}(p)$.
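These metrics, together with the count-by-integration step, can be sketched as follows (the uniform density maps are toy inputs):

```python
import numpy as np

def mae_rmse(true_counts, density_maps):
    """MAE and RMSE between the true counts and the counts obtained by
    summing each predicted density map over its ROI (here: whole image)."""
    z = np.asarray(true_counts, dtype=float)
    z_hat = np.array([d.sum() for d in density_maps])
    err = z - z_hat
    return np.abs(err).mean(), np.sqrt((err ** 2).mean())

true = [10, 20, 30]
# uniform toy density maps integrating to 9, 22 and 30 people
preds = [np.full((4, 4), t / 16.0) for t in (9, 22, 30)]
mae, rmse = mae_rmse(true, preds)

assert np.isclose(mae, 1.0)               # (|1| + |-2| + |0|) / 3
assert np.isclose(rmse, np.sqrt(5 / 3))   # sqrt((1 + 4 + 0) / 3)
```

Because RMSE squares the errors, it penalizes the rare large miscounts more heavily than MAE, which is why both numbers are conventionally reported together.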
4.2 Benchmark Datasets and Ground-truth Data
We use five different datasets to compare our approach to recent ones. The first four were released along with recent papers and have been used for comparison purposes since then. We created the fifth one ourselves and will make it publicly available as well.
ShanghaiTech. It comprises 1,198 annotated images with 330,165 people in them. It is divided into Part A, with 482 images, and Part B, with 716. In Part A, 300 images form the training set and, in Part B, 400. The remainder are used for testing purposes. For a fair comparison with earlier work [42, 32, 19, 33], we created the ground-truth density maps in the same manner as they did. Specifically, for Part A, we used geometry-adaptive kernels and, for Part B, fixed ones. In Fig. 4, we show one image from each part, along with the ground-truth density maps and those estimated by our algorithm.
UCF_QNRF. It comprises 1,535 jpeg images with 1,251,642 people in them. The training set is made of 1,201 of these images. Unlike in ShanghaiTech, there are dramatic variations in both crowd density and image resolution. The ground-truth density maps were generated by adaptive Gaussian kernels, as in the paper that introduced the dataset.
UCF_CC_50. It contains only 50 images, with people counts varying from 94 to 4,543, which makes it challenging for a deep-learning approach. For a fair comparison again, the ground-truth density maps were generated using fixed kernels, and we follow the standard 5-fold cross-validation protocol: We partition the images into 5 groups of 10 images. In turn, we then pick four groups for training and the remaining one for testing. This gives us 5 sets of results, and we report their average.
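The 5-fold protocol above amounts to the following split logic (pure-Python sketch with image indices 0–49):

```python
# 50 images -> 5 disjoint groups of 10; each group serves once as the
# test set while the other four groups are used for training
groups = [list(range(i * 10, (i + 1) * 10)) for i in range(5)]

folds = []
for t in range(5):
    test = groups[t]
    train = [i for g in range(5) if g != t for i in groups[g]]
    folds.append((train, test))

assert all(len(tr) == 40 and len(te) == 10 for tr, te in folds)
# every image is tested exactly once across the 5 folds
assert sorted(i for _, te in folds for i in te) == list(range(50))
```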
WorldExpo’10. It comprises 1,132 annotated video sequences collected from 103 different scenes. There are 3,980 annotated frames, 3,380 of which are used for training purposes. Each scene contains a Region Of Interest (ROI) in which people are counted. The bottom row of Fig. 5 depicts three of these images and the associated camera calibration data. We generate the ground-truth density maps as in our baselines [31, 19, 5]. As in previous work [41, 42, 31, 30, 19, 5, 21, 36, 32, 28, 33] on this dataset, we report the MAE of each scene, as well as the average over all scenes.
Venice. The four datasets discussed above have the advantage of being publicly available but do not contain precise calibration information. In practice, however, such information can be readily obtained using either standard photogrammetry techniques or onboard sensors, for example when using a drone to acquire the images. To test this kind of scenario, we used a cellphone to film additional sequences of Piazza San Marco in Venice, as seen from various viewpoints on the second floor of the basilica, as shown in the top two rows of Fig. 5. We then used the white lines on the ground to compute camera models. As shown in the bottom two rows of Fig. 5, this yields a more accurate calibration than in WorldExpo’10. The resulting dataset contains 4 different sequences and, in total, 167 annotated frames with a fixed 1,280 × 720 resolution. 80 images from a single long sequence are used as training data, and we use the images from the remaining 3 sequences for testing purposes. The ground-truth density maps were generated using fixed Gaussian kernels, as for Part B of the ShanghaiTech dataset.
4.3 Comparing against Recent Techniques
ShanghaiTech (Table 2): MAE and RMSE on Part A and Part B.

| Method | Part A MAE | Part A RMSE | Part B MAE | Part B RMSE |
|---|---|---|---|---|
| Zhang et al. | 181.8 | 277.7 | 32.0 | 49.8 |
| Liu et al. | 73.6 | 112.0 | 13.7 | 21.4 |

UCF_QNRF (Table 3): MAE and RMSE.

| Method | MAE | RMSE |
|---|---|---|
| Idrees et al. | 315 | 508 |
| Idrees et al. | 132 | 191 |

UCF_CC_50 (Table 4): MAE and RMSE.

| Method | MAE | RMSE |
|---|---|---|
| Idrees et al. | 419.5 | 541.6 |
| Zhang et al. | 467.0 | 498.5 |
| Liu et al. | 337.6 | 434.3 |

WorldExpo’10 (Table 5): per-scene MAE and average.

| Method | Scene 1 | Scene 2 | Scene 3 | Scene 4 | Scene 5 | Avg. |
|---|---|---|---|---|---|---|
| Zhang et al. | 9.8 | 14.1 | 14.3 | 22.2 | 3.7 | 12.9 |
In Tables 2, 3, 4, and 5, we compare our results to those of the methods that report the best results in the literature for each of the 4 public datasets. In each case, we reprint the results as given in the corresponding papers and add those of OURS-CAN, that is, our method as described in Section 3.1. On the first three datasets, we consistently and clearly outperform all other methods. On the WorldExpo’10 dataset, we also outperform them on average, but not in every scene. More specifically, in Scenes 2 and 4, which are crowded, we do very well. By contrast, the crowds are far less dense in Scenes 1 and 5. This makes context less informative, and our approach, while still performing honorably, loses its edge over the others. Interestingly, as can be seen in Table 5, in such uncrowded scenes, a detection-based method such as DecideNet becomes competitive, whereas it is not in the more crowded ones. In Fig. 6, we use a Venice image to show how well our approach does compared to the others in the crowded parts of the scene.
Figure 6: Original image, region of interest, ground truth, and density maps estimated by MCNN, Switch-CNN, CSRNet, OURS-CAN, and OURS-ECAN.
The first three datasets do not have any associated camera calibration data, whereas WorldExpo’10 comes with a rough estimate of the image-plane to ground-plane homography and Venice with an accurate one. We therefore used these homographies to run OURS-ECAN, our method as described in Section 3.2. We report the results in Tables 5 and 6. Unsurprisingly, OURS-ECAN clearly further improves on OURS-CAN when the calibration data is accurate, as for Venice, and even when it is less so, as for WorldExpo’10, albeit by a smaller margin. We refer the interested reader to the supplementary material for additional results that did not fit in the main paper due to space limitations.
4.4 Ablation Study
Finally, we perform an ablation study to confirm the benefits of encoding multi-level contextual information and of introducing contrast features.
Concatenating and Weighting VGG Features.
We compare our complete model without geometry, OURS-CAN, against two simplified versions of it. The first one, VGG-SIMPLE, directly uses the VGG-16 base features as input to the decoder subnetwork. In other words, it does not adapt for scale. The second one, VGG-CONCAT, concatenates all scale-aware features to the base features instead of computing their weighted linear combination, and then passes the resulting features to the decoder. As can be seen in Table 7, both simplified versions perform substantially worse than our complete model.
Importance of Contrast Features.
To demonstrate the importance of using contrast features to learn the network weights, we compare OURS-CAN against VGG-NCONT, which uses the scale-aware features instead of the contrast ones to learn the weight maps. As can be seen in Table 7, this also results in a substantial performance loss.
5 Conclusion and Future Perspectives
In this paper, we have shown that encoding multi-scale context adaptively, along with providing an explicit model of perspective distortion effects as input to a deep net, substantially increases crowd counting performance. In particular, it yields much better density estimates in high-density regions.
This is of particular interest for crowd counting from mobile cameras, such as those carried by drones. In future work, we will therefore augment the image data with the information provided by the drone’s inertial measurement unit to compute perspective distortions on the fly and allow monitoring from the moving drone.
We will also expand our approach to process consecutive images simultaneously and to enforce temporal consistency. Among other things, this implies correcting the ground-truth densities to also account for perspective distortion, so as to reason properly in terms of ground-plane densities instead of image-plane ones. None of the approaches discussed in this paper do this, and we did not do it either, so that our results could be properly compared to the state of the art. However, as shown in Fig. 7, the price to pay is that the estimated densities, because they are close to this image-based ground truth, need to be corrected for perspective distortion before they can be treated as ground-plane densities. An obvious improvement would therefore be to regress directly to ground-plane densities.
-  C. Arteta, V. Lempitsky, J. Noble, and A. Zisserman. Interactive Object Counting. In European Conference on Computer Vision, 2014.
-  C. Arteta, V. Lempitsky, and A. Zisserman. Counting in the Wild. In European Conference on Computer Vision, 2016.
-  V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. arXiv Preprint, 2015.
-  G. J. Brostow and R. Cipolla. Unsupervised Bayesian Detection of Independent Motion in Crowds. In Conference on Computer Vision and Pattern Recognition, pages 594–601, 2006.
-  X. Cao, Z. Wang, Y. Zhao, and F. Su. Scale Aggregation Network for Accurate and Efficient Crowd Counting. In European Conference on Computer Vision, 2018.
-  A. Chan, Z. Liang, and N. Vasconcelos. Privacy Preserving Crowd Monitoring: Counting People Without People Models or Tracking. In Conference on Computer Vision and Pattern Recognition, 2008.
-  A. Chan and N. Vasconcelos. Bayesian Poisson Regression for Crowd Counting. In International Conference on Computer Vision, pages 545–551, 2009.
-  P. Chattopadhyay, R. Vedantam, R. Selvaraju, D. Batra, and D. Parikh. Counting Everyday Objects in Everyday Scenes. In Conference on Computer Vision and Pattern Recognition, 2017.
-  K. Chen, C. Loy, S. Gong, and T. Xiang. Feature Mining for Localised Crowd Counting. In British Machine Vision Conference, page 3, 2012.
-  L. Fiaschi, U. Koethe, R. Nair, and F. Hamprecht. Learning to Count with Regression Forest and Structured Labels. In International Conference on Pattern Recognition, pages 2685–2688, 2012.
-  K. He, X. Zhang, S. Ren, and J. Sun. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. In European Conference on Computer Vision, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
-  G. Huang, Z. Liu, K. Weinberger, and L. van der Maaten. Densely Connected Convolutional Networks. In Conference on Computer Vision and Pattern Recognition, 2017.
-  H. Idrees, I. Saleemi, C. Seibert, and M. Shah. Multi-Source Multi-Scale Counting in Extremely Dense Crowd Images. In Conference on Computer Vision and Pattern Recognition, pages 2547–2554, 2013.
-  H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, and M. Shah. Composition Loss for Counting, Density Map Estimation and Localization in Dense Crowds. In European Conference on Computer Vision, 2018.
-  D. Kang and A. Chan. Crowd Counting by Adaptively Fusing Predictions from an Image Pyramid. In British Machine Vision Conference, 2018.
-  D. Kang, D. Dhar, and A. Chan. Incorporating Side Information by Adaptive Convolution. In Advances in Neural Information Processing Systems, 2017.
-  V. Lempitsky and A. Zisserman. Learning to Count Objects in Images. In Advances in Neural Information Processing Systems, 2010.
-  Y. Li, X. Zhang, and D. Chen. CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes. In Conference on Computer Vision and Pattern Recognition, 2018.
-  Z. Lin and L. Davis. Shape-Based Human Detection and Segmentation via Hierarchical Part-Template Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(4):604–618, 2010.
-  J. Liu, C. Gao, D. Meng, and A. Hauptmann. DecideNet: Counting Varying Density Crowds through Attention Guided Detection and Density Estimation. In Conference on Computer Vision and Pattern Recognition, 2018.
-  L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin. Crowd Counting Using Deep Recurrent Spatial-Aware Network. In International Joint Conference on Artificial Intelligence, 2018.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. Berg. SSD: Single Shot Multibox Detector. In European Conference on Computer Vision, 2016.
-  X. Liu, J. Weijer, and A. Bagdanov. Leveraging Unlabeled Data for Crowd Counting by Learning to Rank. In Conference on Computer Vision and Pattern Recognition, 2018.
-  J. Long, E. Shelhamer, and T. Darrell. Fully Convolutional Networks for Semantic Segmentation. In Conference on Computer Vision and Pattern Recognition, 2015.
-  D. Onoro-Rubio and R. López-Sastre. Towards Perspective-Free Object Counting with Deep Learning. In European Conference on Computer Vision, pages 615–629, 2016.
-  V. Rabaud and S. Belongie. Counting Crowded Moving Objects. In Conference on Computer Vision and Pattern Recognition, pages 705–711, 2006.
-  V. Ranjan, H. Le, and M. Hoai. Iterative Crowd Counting. In European Conference on Computer Vision, 2018.
-  S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems, 2015.
-  D. Sam, N. Sajjan, R. Babu, and M. Srinivasan. Divide and Grow: Capturing Huge Diversity in Crowd Images with Incrementally Growing CNN. In Conference on Computer Vision and Pattern Recognition, 2018.
-  D. Sam, S. Surya, and R. Babu. Switching Convolutional Neural Network for Crowd Counting. In Conference on Computer Vision and Pattern Recognition, page 6, 2017.
-  Z. Shen, Y. Xu, B. Ni, M. Wang, J. Hu, and X. Yang. Crowd Counting via Adversarial Cross-Scale Consistency Pursuit. In Conference on Computer Vision and Pattern Recognition, 2018.
-  Z. Shi, L. Zhang, Y. Liu, and X. Cao. Crowd Counting with Deep Negative Correlation Learning. In Conference on Computer Vision and Pattern Recognition, 2018.
-  K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations, 2015.
-  V. Sindagi and V. Patel. CNN-based Cascaded Multi-task Learning of High-level Prior and Density Estimation for Crowd Counting. In International Conference on Advanced Video and Signal Based Surveillance, 2017.
-  V. Sindagi and V. Patel. Generating High-Quality Crowd Density Maps Using Contextual Pyramid CNNs. In International Conference on Computer Vision, pages 1879–1888, 2017.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going Deeper with Convolutions. In Conference on Computer Vision and Pattern Recognition, pages 1–9, June 2015.
-  X. Wang, B. Wang, and L. Zhang. Airport Detection in Remote Sensing Images Based on Visual Attention. In International Conference on Neural Information Processing, 2011.
-  B. Wu and R. Nevatia. Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors. In International Conference on Computer Vision, 2005.
-  F. Xiong, X. Shi, and D. Yeung. Spatiotemporal Modeling for Crowd Counting in Videos. In International Conference on Computer Vision, pages 5161–5169, 2017.
-  C. Zhang, H. Li, X. Wang, and X. Yang. Cross-Scene Crowd Counting via Deep Convolutional Neural Networks. In Conference on Computer Vision and Pattern Recognition, pages 833–841, 2015.
-  Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma. Single-Image Crowd Counting via Multi-Column Convolutional Neural Network. In Conference on Computer Vision and Pattern Recognition, pages 589–597, 2016.
-  H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid Scene Parsing Network. In Conference on Computer Vision and Pattern Recognition, 2017.