Crowd counting is important for applications such as video surveillance and traffic control. In recent years, the emphasis has been on developing counting-by-density algorithms that rely on regressors trained to estimate the crowd density per unit area, so that the total number of people can be obtained by integration, without explicit detection being required. The regressors can be based on Random Forests, Gaussian Processes , or, more recently, Deep Nets [3, 4, 5, 6, 7], with most state-of-the-art approaches now relying on the latter.
While effective, these algorithms all estimate density in the image plane. As a consequence, and as can be seen in Fig. 1(a,b), two regions of the scene containing the same number of people per square meter can be assigned different densities. However, for most practical purposes, one is more interested in estimating the density of people on the ground in the real world, shown in Fig. 1(c), which is not subject to such distortions.
In this paper, we therefore introduce a crowd density estimation method that explicitly accounts for perspective distortion to produce a real-world density map, as opposed to an image-based one. This contrasts with methods that implicitly deal with perspective effects by either learning scale-invariant features [3, 6, 8] or estimating density in patches of different sizes . Unlike these, we model perspective distortion globally and account for the fact that people’s projected size changes consistently across the image. To this end, we feed to our density-estimation CNN not only the original image but also an identically-sized image that contains the local scale, which is a function of the camera orientation with respect to the ground plane.
An additional benefit of reasoning in the real world is that we can encode physical constraints to model the motion of people in a video sequence. Specifically, given a short sequence as input to our network, we impose temporal consistency by forcing the densities in the various images to correspond to physically possible people flows. In other words, we explicitly model the motion of people with physically-justified constraints, instead of implicitly learning long-term dependencies via LSTMs only across annotated frames, which are typically sparse over time, as is commonly done in the literature .
Our contribution is therefore an approach that incorporates geometric and physical constraints directly into an end-to-end learning formalism for crowd counting. As evidenced by our experiments, this enables us to outperform the state-of-the-art on standard benchmarks, in particular on difficult scenes that are subject to severe perspective distortion.
2 Related Work
Early crowd counting methods [9, 10, 11] tended to rely on counting-by-detection, that is, explicitly detecting individual heads or bodies and then counting them. Unfortunately, in very crowded scenes, occlusions make detection difficult, and these approaches have been largely displaced by counting-by-density-estimation ones, which rely on training a regressor to estimate people density in various parts of the image and then integrating. This trend essentially started with [1, 12] and , which used Random Forests and Gaussian Process regressors, respectively. Even though approaches relying on low-level features [13, 14, 15, 16, 2, 17] can yield good results under favorable conditions, they have now mostly been superseded by CNN-based methods, a survey of which can be found in . The same can be said about methods that count objects instead of people [18, 19, 20].
An early approach to handling perspective distortions, proposed in , learns to regress to both a crowd count and a density map. It employs a perspective map as a metric to retrieve candidate training scenes with similar distortions before tuning the model, whereas our approach directly passes the perspective map as an input to the deep network. This retrieval step limits the algorithm's performance and complicates the training process, which is not end-to-end.
This approach was recently extended by , whose SwitchCNN exploits a classifier that greedily chooses the sub-network that yields the best crowd counting performance. Max pooling is used extensively to down-scale the density map output, which improves the overall accuracy of the counts but decreases that of the density maps, because pooling incurs a loss in localization precision.
Perspective distortion is also addressed in  via a scale-aware model called HydraCNN, which uses different-sized patches as input to the CNN to achieve scale-invariance. Similarly, extensive data augmentation by sampling patches from multi-scale image representations is performed in . In , different kernel sizes are used instead. In the even more recent method of , a network dubbed CP-CNN combines local and global information obtained by learning density at different resolutions. It also accounts for density map quality by adding extra information about the pre-defined density level of different patches and images. While useful, this information is highly scene specific and would make generalization difficult.
In any event, all the approaches mentioned above rely on the network learning about perspective effects without explicitly modeling them. As evidenced by our results, this is suboptimal given the finite amounts of data available in practical situations. Furthermore, while learning about perspective effects to account for the varying people sizes, these methods still predict density in the image plane, thus leading to the unnatural phenomenon that real-world regions with the same number of people are assigned different densities, as shown in Fig. 1(b). By contrast, we produce densities expressed in terms of the number of people per square meter of ground, such as the ones shown in Fig. 1(c), which are immune to this problem.
The recent method of  is representative of current approaches in enforcing temporal consistency by incorporating an LSTM module [22, 23] to perform feature fusion over time. This helps but can only capture temporal correlation across annotated frames, which are widely separated in most existing training sets. In other words, the network can only learn long-term dependencies at the expense of shorter-term ones.
By contrast, since we reason about crowd density in real world space, we can model physically-correct temporal dependencies via frame-to-frame feasibility constraints, without any additional annotations.
3 Perspective Distortion
All existing approaches estimate the crowd density in the image plane and in terms of people per square pixel, which changes across the image even if the true crowd density per square meter is constant. For example, in many scenes such as the one of Fig. 1(a), the people density in farther regions is higher than that in closer regions, as can be seen in Fig. 1(b).
In this work, we train the system to directly predict the crowd density in the physical world, which does not suffer from this problem and is therefore unaffected by perspective distortion, assuming that people are standing on an approximately flat surface. Our approach could easily be extended to a non-flat one given a terrain model. In a crowded scene, people's heads are more often visible than their feet. Consequently, it is common practice to provide annotations in the form of a dot on the head for supervision purposes. To account for this, we define a head plane, parallel to the ground and lifted above it by the average person's height. We assume that the camera has been calibrated so that we are given the homography between the image and the head plane. We will see in the results section that, in practice, this homography can easily be estimated with sufficient accuracy when the ground is close to being planar.
3.1 Image Plane versus Head Plane Density
Let $H$ be the homography from an image $I$ to its corresponding head plane. We define the ground-truth density as a sum of Gaussian kernels centered on people's heads in the head plane. Because we work in the physical world, we can use the same kernel size across the entire scene and across all scenes. A head annotation $\mathbf{p}_i$, that is, a 2D image point expressed in projective coordinates, is mapped to $H\mathbf{p}_i$ in the head plane. Given a set of $N$ such annotations, we take the head plane density at a point $\mathbf{p}'$ expressed in head plane coordinates to be
$$ C'(\mathbf{p}') = \sum_{i=1}^{N} \mathcal{N}(\mathbf{p}'; \mu_i, \sigma^2) \;, $$
where $\mathcal{N}(\cdot\,; \mu_i, \sigma^2)$ is a 2D Gaussian kernel with mean $\mu_i = H\mathbf{p}_i$ and variance $\sigma^2$. We can then map this head plane density to the image coordinates, which yields a density at pixel location $\mathbf{p}$ given by
$$ C(\mathbf{p}) = C'(H\mathbf{p}) \;. $$
An example density is shown in Fig. 1(c). Note that, while the density is Gaussian in the head plane, it is not in the image plane.
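The density definition above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the code used in the paper: the function name, the discretization of the head plane into a grid, and the kernel normalization are our own choices.

```python
import numpy as np

def head_plane_density(annotations, H, plane_shape, sigma=1.0):
    """Render a head-plane density map from dot annotations.

    annotations: (N, 2) array of head positions in image coordinates.
    H:           3x3 image-to-head-plane homography.
    plane_shape: (rows, cols) of the discretized head plane.
    sigma:       Gaussian kernel std in head-plane units; because we work
                 in the physical world, it is the same for every head.
    """
    rows, cols = plane_shape
    ys, xs = np.mgrid[0:rows, 0:cols]            # head-plane grid
    density = np.zeros(plane_shape)
    for (u, v) in annotations:
        p = H @ np.array([u, v, 1.0])            # projective mapping
        mx, my = p[0] / p[2], p[1] / p[2]        # Gaussian mean H p_i
        g = np.exp(-((xs - mx) ** 2 + (ys - my) ** 2) / (2 * sigma ** 2))
        density += g / (2 * np.pi * sigma ** 2)  # each head integrates to ~1
    return density
```

Summing the returned map over the whole plane then recovers the number of annotated people, which is what makes counting-by-integration possible.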
3.2 Geometry-Aware Crowd Counting
Since the head plane density map can be transformed into an image of the same size as that of the original image, we could simply train a deep network to take a 3-channel RGB image as input and output the corresponding density map. However, this would mean neglecting the geometry encoded by the ground plane homography, namely the fact that the local scale does not vary arbitrarily across the image and must remain globally consistent.
To account for this, we associate to each image $I$ a perspective map $P$ of the same size as $I$, containing the local scale of each pixel, that is, the factor by which a small area around the pixel is multiplied when projected to the head plane. We then use a UNet  with 4 input channels instead of only 3. The first three are the usual RGB channels, while the fourth contains the perspective map. We will show in the results section that this substantially increases accuracy over using the RGB channels only. This network is one of the spatial streams depicted by Fig. 2. To learn its weights $\Theta_s$, we minimize the head plane loss $L_s$, which we take to be the mean square error between the predicted head plane density and the ground-truth one.
To compute the perspective map $P$, let us first consider an image pixel $\mathbf{p}$ and an infinitesimal area $da$ surrounding it. Let $\mathbf{p}'$ and $da'$ be their respective projections on the head plane. We take $P(\mathbf{p})$, the scale at $\mathbf{p}$, to be $da'/da$, which we compute as follows. Using the variable substitution equation, we write
$$ da' = \left| \det J_H(\mathbf{p}) \right| \, da \;, $$
where $J_H(\mathbf{p})$ is the Jacobian matrix of the coordinate transformation $(u, v) \mapsto (x, y)$ induced by $H$ at the point $\mathbf{p}$:
$$ J_H(\mathbf{p}) = \begin{pmatrix} \partial x / \partial u & \partial x / \partial v \\ \partial y / \partial u & \partial y / \partial v \end{pmatrix} \;. $$
The scale map is therefore equal to
$$ P(\mathbf{p}) = \left| \det J_H(\mathbf{p}) \right| \;. $$
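For a planar homography, this determinant has a closed form: if the third row of $H$ is $(h_{31}, h_{32}, h_{33})$ and $w = h_{31}u + h_{32}v + h_{33}$, then $\det J_H = \det H / w^3$. Below is a minimal NumPy sketch of the perspective map computation; it is our own helper, not the authors' code.

```python
import numpy as np

def perspective_map(H, image_shape):
    """Local scale |det J_H| at every pixel for an image-to-head-plane
    homography H, using det J_H = det(H) / w**3, w = h31*u + h32*v + h33."""
    rows, cols = image_shape
    vs, us = np.mgrid[0:rows, 0:cols].astype(float)
    w = H[2, 0] * us + H[2, 1] * vs + H[2, 2]
    return np.abs(np.linalg.det(H) / w ** 3)
```

For a pure scaling homography diag(2, 2, 1), for instance, every pixel area is multiplied by 4 on the head plane, so the map is constant and equal to 4.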
The detailed solution can be found in . This expression enables us to compute the perspective map that we use as an input to our network, as discussed above. It also allows us to convert between people density in image space, that is, people per square pixel, and people density on the head plane. More precisely, let us consider a surface element $da$ in the image around point $\mathbf{p}$. It is scaled by $P(\mathbf{p})$ into $da' = P(\mathbf{p}) \, da$. Since the projection does not change the number of people, we have
$$ D(\mathbf{p}) \, da = C'(\mathbf{p}') \, da' \;, $$
where $D(\mathbf{p})$ denotes the image plane density at pixel $\mathbf{p}$. Expressed in image coordinates, this becomes
$$ D(\mathbf{p}) = P(\mathbf{p}) \, C'(H\mathbf{p}) \;, $$
which we use in the results section to compare our algorithm, which produces head plane densities, against the baselines that estimate image plane densities.
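This conversion can be checked numerically: transferring a head plane density to the image plane must preserve the total count. The sketch below uses nearest-neighbor sampling and our own function name; it is an illustration of the conversion, not the paper's implementation.

```python
import numpy as np

def image_plane_density(head_density, H, image_shape):
    """Transfer a head-plane density map to the image plane via
    D(p) = P(p) * C'(Hp), with nearest-neighbor sampling of C'."""
    rows, cols = image_shape
    vs, us = np.mgrid[0:rows, 0:cols].astype(float)
    pts = np.stack([us, vs, np.ones_like(us)])         # 3 x rows x cols
    mapped = np.einsum('ij,jrc->irc', H, pts)          # apply homography
    x, y = mapped[0] / mapped[2], mapped[1] / mapped[2]
    w = H[2, 0] * us + H[2, 1] * vs + H[2, 2]
    P = np.abs(np.linalg.det(H) / w ** 3)              # local scale map
    xi = np.clip(np.round(x).astype(int), 0, head_density.shape[1] - 1)
    yi = np.clip(np.round(y).astype(int), 0, head_density.shape[0] - 1)
    return P * head_density[yi, xi]
```

Integrating the returned image plane density over the image then yields, up to sampling error, the same people count as integrating the head plane density.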
4 Temporal Consistency
The spatial stream network introduced in Section 3.2 (top of Fig. 2) operates on single frames of a video sequence. To increase robustness, we now show how to enforce temporal consistency across triplets of frames. Unlike in an LSTM-based approach, such as , we can do this across any three frames instead of only across annotated frames. Furthermore, by working in the real world plane instead of the image plane, we can explicitly exploit physical constraints on people’s motion.
4.1 People Conservation
An important constraint is that people do not appear or disappear from the head plane except at the edges or at specific points that can be marked as exits or entrances. To model this, we partition the head plane into $K$ blocks $B_1, \ldots, B_K$, as depicted by Fig. 3. Let $N(j)$ for $1 \le j \le K$ denote the neighborhood of block $B_j$, including $B_j$ itself. Let $m_j(t)$ be the number of people in $B_j$ at time $t$ and let $t_1 < t_2 < t_3$ be three different time instants.
If we take the blocks to be large enough for people not to be able to traverse more than one block between two time instants, people in the interior blocks shown in orange in Fig. 3 can only come from a block in $N(j)$ at the previous instant and move to a block in $N(j)$ at the next. As a consequence, we can write
$$ m_j(t_2) \le \sum_{k \in N(j)} m_k(t_1) \quad \text{and} \quad m_j(t_2) \le \sum_{k \in N(j)} m_k(t_3) \;. $$
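These inequalities are straightforward to evaluate on discretized density maps. The sketch below is our own formulation, assuming square blocks and 8-connected neighborhoods; it returns the total amount by which interior-block counts at the middle instant exceed what the neighborhoods at the earlier and later instants allow, so a return value of 0 means the people flow is feasible.

```python
import numpy as np

def block_counts(density, block):
    """Sum a head-plane density map over a grid of block x block cells."""
    rows, cols = density.shape
    return density[:rows - rows % block, :cols - cols % block] \
        .reshape(rows // block, block, cols // block, block).sum(axis=(1, 3))

def neighborhood_sums(counts):
    """For each block, total count over its 8-neighborhood plus itself."""
    padded = np.pad(counts, 1)
    out = np.zeros_like(counts)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += padded[1 + dy: 1 + dy + counts.shape[0],
                          1 + dx: 1 + dx + counts.shape[1]]
    return out

def conservation_violation(prev, cur, nxt, block):
    """Amount by which interior-block counts at the middle instant exceed
    what the neighborhoods at the two other instants can supply or absorb."""
    m_prev, m_cur, m_next = (block_counts(d, block) for d in (prev, cur, nxt))
    cap = np.minimum(neighborhood_sums(m_prev), neighborhood_sums(m_next))
    return np.maximum(0.0, m_cur[1:-1, 1:-1] - cap[1:-1, 1:-1]).sum()
```

A person moving one block per instant produces no violation, whereas a person materializing in an interior block does.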
In fact, an even stronger equality constraint could be imposed as in  by explicitly modeling people flows from one block to the next with additional variables predicted by the network. However, not only would this increase the number of variables to be estimated, but it would also require enforcing hard constraints between the different networks' outputs.
4.2 Siamese architecture
To enforce these constraints, we introduce the siamese architecture depicted by Fig. 2, with shared weights $\Theta$. It comprises three identical streams, which take as input images acquired at times $t_1$, $t_2$, and $t_3$, along with their corresponding perspective maps, as described in Section 3.2. Each one produces a head plane density estimate, and we define the temporal loss term as
$$ L_t = \sum_{j} \max\left(0, \, \hat{m}_j(t_2) - \hat{M}_j(t_1)\right) + \max\left(0, \, \hat{m}_j(t_2) - \hat{M}_j(t_3)\right) \;, $$
where $\hat{m}_j(t)$ is the sum of predicted densities in block $B_j$ at time $t$ and $\hat{M}_j(t)$ is the sum of predicted densities in the neighborhood of $B_j$, that is,
$$ \hat{M}_j(t) = \sum_{k \in N(j)} \hat{m}_k(t) \;. $$
In other words, $L_t$ penalizes violations of the people conservation constraints of Section 4.1. At training time, we minimize the composite loss
$$ L = L_s + \lambda L_t \;, $$
where $L_s$ is the head plane loss introduced in Section 3.2 and $\lambda$ balances the two terms. Since $L_s$ only uses the ground-truth density for the central frame $t_2$, we only need annotations for that frame. Therefore, as will be shown in the following section, we can use arbitrarily-spaced and unannotated frames to impose temporal consistency and improve robustness, which is not something LSTM-based methods can do.
5.1 Datasets and Scene Geometry
Our approach is designed both to handle perspective effects and to enforce temporal consistency. We first test our algorithm's ability to handle perspective effects on the publicly available ShanghaiTech Part_B dataset , which is widely used for single image crowd estimation. To assess the performance of our temporal consistency constraints as well, we evaluate our model on the WorldExpo dataset , which has been extensively used to benchmark video-based methods. Finally, to further challenge our algorithm, we filmed additional sequences of the Piazza San Marco in Venice, as seen from various viewpoints on the second floor of the basilica. They feature much stronger perspective distortions than the other two datasets, and we will make this Venice dataset publicly available. In Fig. 4, we show three images from each dataset and list their characteristics.
These three datasets are both different and complementary. ShanghaiTech only contains single images, while the other two feature video sequences. The WorldExpo images were acquired by fixed surveillance cameras and the Venice ones by a moving cell phone. This means that perspective distortion changes from frame to frame in Venice but not in WorldExpo. Furthermore, the WorldExpo images come with a pre-defined region of interest (ROI), denoting a fairly small image plane area for which annotations are available. This inherently limits the perspective changes. By contrast, the corresponding ROI in the Venice images is much larger and much more affected by them. Since the horizon line in an image maps to infinity in the head plane, we limit our ROI to an area below the horizon whose size in the head plane is reasonable. The ROI we introduce for the ShanghaiTech dataset is shown in Fig. 6(b).
For ShanghaiTech, we were able to compute an accurate image-to-head-plane homography on the basis of the square patterns visible on the ground. However, since such patterns are not always visible, for WorldExpo and Venice we used an easier-to-generalize, if approximate, approach to estimating $H$. In both cases, the image-to-head-plane homography can be inferred up to an affine transform from the horizon line, that is, the line connecting the vanishing points of two non-parallel directions, as shown in Fig. 5. The result is therefore an approximation of the real homography, but, as we will see, this suffices in practice.
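The projective part of this rectification has a particularly simple form: a homography whose third row is the horizon line sends that line to the line at infinity, and is therefore determined by the horizon only up to an affine transform. A sketch of this standard construction follows; the unit normalization of the line is our own choice.

```python
import numpy as np

def affine_rectifying_homography(horizon):
    """Homography sending the horizon line l = (a, b, c), given in
    homogeneous line coordinates, to the line at infinity. Lines map as
    l' = H^{-T} l, so choosing l as the third row of H gives
    H^T (0, 0, 1)^T = l, i.e. the horizon becomes the line at infinity."""
    a, b, c = np.asarray(horizon, float) / np.linalg.norm(horizon)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [a,   b,   c]])
```

Any image point on the horizon is mapped to a point at infinity (third homogeneous coordinate 0), which is why the ROI must be kept below the horizon.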
We benchmark our approach against three recent methods for which the code is publicly available: HydraCNN , MCNN  and SwitchCNN . As discussed in the related work section, they are representative of current approaches to handling the fact that people’s sizes vary depending on their distance to the camera.
We will refer to our complete approach as OURS. To tease out the individual contributions of its components, we also evaluate two degraded versions of it. OURS-NoGeom uses the CNN to predict densities but does not feed it the perspective map as input. OURS-GeomOnly uses the full approach described in Section 3 but does not impose temporal consistency.
5.3 Evaluation Metrics
Previous works in crowd density estimation use the mean absolute error (MAE) and the mean squared error (MSE) as their evaluation metrics [3, 4, 5, 6, 7, 8]. They are defined as
$$ \text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| z_i - \hat{z}_i \right| \quad \text{and} \quad \text{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( z_i - \hat{z}_i \right)^2} \;, $$
where $N$ is the number of test images, $z_i$ denotes the true number of people inside the ROI of the $i$th image and $\hat{z}_i$ the estimated number of people. While indicative, these two metrics are very coarse. First, it is widely accepted that the MSE is an ambiguous metric, because its magnitude is a function not only of the average error but also of other factors, such as the variance in the error distribution. As such, as argued in , the MAE is a more natural measure of average error. Second, and more importantly, these two metrics only take into consideration the total number of people irrespective of where in the scene they may be, so they are incapable of evaluating the correctness of the spatial distribution of crowd density. A false positive in one region, coupled with a false negative in another, can still yield a perfect total number of people. Furthermore, since they rely on absolute numbers of people as opposed to relative numbers, they cannot easily be used to compare performance on different datasets featuring different crowd sizes.
We therefore introduce two additional metrics that provide finer grained measures, accounting for localization errors. We name them the relative total counting accuracy (RTA) and relative pixel-level accuracy (RPA) and take them to be
$$ \text{RTA} = \max\left(0, \, 1 - \frac{\sum_{i=1}^{N} |z_i - \hat{z}_i|}{\sum_{i=1}^{N} z_i}\right) \quad \text{and} \quad \text{RPA} = \max\left(0, \, 1 - \frac{\sum_{i=1}^{N} \sum_{\mathbf{p}} \mathbb{1}[\mathbf{p} \in R_i] \, |C_i(\mathbf{p}) - \hat{C}_i(\mathbf{p})|}{\sum_{i=1}^{N} \sum_{\mathbf{p}} \mathbb{1}[\mathbf{p} \in R_i] \, C_i(\mathbf{p})}\right) \;, $$
where $C_i(\mathbf{p})$ is the ground-truth density of the $i$th image at pixel $\mathbf{p}$, $\hat{C}_i(\mathbf{p})$ is the corresponding estimated density, $R_i$ is the ROI of the $i$th image, $\mathbb{1}[\cdot]$ is the indicator function, and the inner sums run over all pixel locations of the image. The max operation ensures that RTA and RPA remain between 0 and 1. RTA captures how well the model estimates the number of people in relative terms, instead of in absolute terms as the standard measures do, while RPA quantifies how correctly localized the densities are.
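All four metrics are easy to compute from per-image counts and density maps. The sketch below follows the definitions above, with the MSE in its usual crowd-counting (root) form; the function names and the boolean-mask representation of the ROI are our own choices.

```python
import numpy as np

def count_metrics(gt_counts, pred_counts):
    """MAE, MSE (root form), and relative total counting accuracy."""
    gt = np.asarray(gt_counts, float)
    pred = np.asarray(pred_counts, float)
    mae = np.mean(np.abs(gt - pred))
    mse = np.sqrt(np.mean((gt - pred) ** 2))
    rel_count_acc = max(0.0, 1.0 - np.sum(np.abs(gt - pred)) / np.sum(gt))
    return mae, mse, rel_count_acc

def pixel_accuracy(gt_maps, pred_maps, rois):
    """Relative pixel-level accuracy over density maps, restricted to
    each image's boolean ROI mask."""
    err = total = 0.0
    for gt, pred, roi in zip(gt_maps, pred_maps, rois):
        err += np.abs(gt[roi] - pred[roi]).sum()
        total += gt[roi].sum()
    return max(0.0, 1.0 - err / total)
```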
Note that the baseline models [3, 5, 6] are designed to predict density in the image plane instead of the head plane, as our model does. However, densities in the image plane and the head plane can easily be converted into one another, as shown in Section 3. For our comparison to be fair, we therefore train the baseline models [3, 5, 6] as described in their respective papers to estimate density in the image plane and then transfer it to the head plane. In this way, all the models can be evaluated with the same metrics, defined in terms of head plane density.
These are the numbers we will report below because they are the ones relevant to our work. However, since most other authors report their results in the image plane, we also computed image plane versions, by transferring our head plane estimates to the image plane, and show in the supplementary material that they yield the same ranking of the methods we tested.
Since ShanghaiTech only contains individual images instead of sequences, OURS-GeomOnly and OURS are the same in this case. In Table 1, we report our results and those of the baselines in terms of the metrics of Section 5.3 and illustrate them in Fig. 6.
Recall that the MAE and MSE are absolute error measures that should be small, while the relative total counting and pixel-level accuracies should be large. Using the head plane density already confers a small advantage to OURS-NoGeom, but also taking the perspective map explicitly as an input gives an even more significant one to OURS-GeomOnly in all four metrics. Note also the much smoother density map shown in Fig. 6(h).
| Method | MAE | MSE | Rel. count acc. (%) | Rel. pixel acc. (%) |
| --- | --- | --- | --- | --- |
| OURS (frame interval 1) | 5.9 | 7.9 | 83.1 | 70.9 |
| OURS (frame interval 5) | 4.9 | 6.9 | 85.9 | 73.3 |
| OURS (frame interval 10) | 4.9 | 6.6 | 85.8 | 72.8 |
Since WorldExpo provides video sequences, we can enforce temporal consistency in addition to geometric awareness. Recall that the method of Section 4 requires the central frame to be annotated but that the other two can be chosen arbitrarily. In Table 2, we therefore report results obtained using triplets of images temporally separated by 1, 5, or 10 frames and we illustrate them in Fig. 7.
OURS-GeomOnly again outperforms the baselines, and temporal consistency gives a further boost to OURS, with a frame interval of 5 appearing to be the optimal choice. Note in particular the very significant jump in pixel-level accuracy terms, which underscores the benefits of our approach when it comes to localizing people as opposed to simply counting them.
| Method | MAE | MSE | Rel. count acc. (%) | Rel. pixel acc. (%) |
| --- | --- | --- | --- | --- |
| OURS (frame interval 1) | 16.8 | 20.4 | 91.6 | 59.5 |
| OURS (frame interval 5) | 15.2 | 17.9 | 92.4 | 58.6 |
| OURS (frame interval 10) | 18.6 | 24.8 | 90.7 | 58.9 |
The Venice dataset has more significant perspective distortion than the other two because the ROI is much larger. However, the ranking of the different methods is similar, as reported in Table 3 and depicted in Fig. 8. Note in particular the large jump in pixel-level accuracy when going from OURS-NoGeom to OURS-GeomOnly and OURS, which underscores the importance of using the perspective map as an input to the CNN.
In this paper, we have shown that providing an explicit model of perspective distortion effects as an input to a deep net, along with enforcing physics-based spatio-temporal constraints, substantially increases performance. In particular, it yields not only an accurate count of the total number of the people in the scene but also a much better localization of the high-density areas.
This is of particular interest for crowd counting from mobile cameras, such as those carried by drones. In future work, we will augment the purely image-based data with the information provided by the drone’s inertial measurement unit to compute perspective distortions on the fly and allow monitoring from the moving drone.
-  Lempitsky, V., Zisserman, A.: Learning to Count Objects in Images. In: Advances in Neural Information Processing Systems. (2010)
-  Chan, A., Vasconcelos, N.: Bayesian Poisson Regression for Crowd Counting. In: International Conference on Computer Vision. (2009) 545–551
-  Zhang, Y., Zhou, D., Chen, S., Gao, S., Ma, Y.: Single-Image Crowd Counting via Multi-Column Convolutional Neural Network. In: Conference on Computer Vision and Pattern Recognition. (2016) 589–597
-  Zhang, C., Li, H., Wang, X., Yang, X.: Cross-Scene Crowd Counting via Deep Convolutional Neural Networks. In: Conference on Computer Vision and Pattern Recognition. (2015) 833–841
-  Onoro-Rubio, D., López-Sastre, R.: Towards Perspective-Free Object Counting with Deep Learning. In: European Conference on Computer Vision. (2016) 615–629
-  Sam, D., Surya, S., Babu, R.: Switching Convolutional Neural Network for Crowd Counting. In: Conference on Computer Vision and Pattern Recognition. (2017) 6
-  Xiong, F., Shi, X., Yeung, D.: Spatiotemporal Modeling for Crowd Counting in Videos. In: International Conference on Computer Vision. (2017) 5161–5169
-  Sindagi, V., Patel, V.: Generating High-Quality Crowd Density Maps Using Contextual Pyramid CNNs. In: International Conference on Computer Vision. (2017) 1879–1888
-  Wu, B., Nevatia, R.: Detection of Multiple, Partially Occluded Humans in a Single Image by Bayesian Combination of Edgelet Part Detectors. In: International Conference on Computer Vision. (2005)
-  Wang, X., Wang, B., Zhang, L.: Airport Detection in Remote Sensing Images Based on Visual Attention. In: International Conference on Neural Information Processing. (2011)
-  Lin, Z., Davis, L.: Shape-Based Human Detection and Segmentation via Hierarchical Part-Template Matching. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(4) (2010) 604–618
-  Fiaschi, L., Koethe, U., Nair, R., Hamprecht, F.: Learning to Count with Regression Forest and Structured Labels. In: International Conference on Pattern Recognition. (2012) 2685–2688
-  Chen, K., Loy, C., Gong, S., Xiang, T.: Feature Mining for Localised Crowd Counting. In: British Machine Vision Conference. (2012) 3
-  Chan, A., Liang, Z., Vasconcelos, N.: Privacy Preserving Crowd Monitoring: Counting People Without People Models or Tracking. In: Conference on Computer Vision and Pattern Recognition. (2008)
-  Brostow, G.J., Cipolla, R.: Unsupervised Bayesian Detection of Independent Motion in Crowds. In: Conference on Computer Vision and Pattern Recognition. (2006) 594–601
-  Rabaud, V., Belongie, S.: Counting Crowded Moving Objects. In: Conference on Computer Vision and Pattern Recognition. (2006) 705–711
-  Idrees, H., Saleemi, I., Seibert, C., Shah, M.: Multi-Source Multi-Scale Counting in Extremely Dense Crowd Images. In: Conference on Computer Vision and Pattern Recognition. (2013) 2547–2554
-  Arteta, C., Lempitsky, V., Noble, J., Zisserman, A.: Interactive Object Counting. In: European Conference on Computer Vision. (2014)
-  Arteta, C., Lempitsky, V., Zisserman, A.: Counting in the Wild. In: European Conference on Computer Vision. (2016)
-  Chattopadhyay, P., Vedantam, R., Selvaju, R., Batra, D., Parikh, D.: Counting Everyday Objects in Everyday Scenes. In: Conference on Computer Vision and Pattern Recognition. (2017)
-  Boominathan, L., Kruthiventi, S., Babu, R.: Crowdnet: A Deep Convolutional Network for Dense Crowd Counting. In: Multimedia Conference. (2016) 640–644
-  Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory. Neural Computation 9(8) (1997) 1735–1780
-  Shi, X., Chen, Z., Wang, H., Yeung, D., Wong, W., Woo, W.: Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. In: Advances in Neural Information Processing Systems. (2015) 802–810
-  Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional Networks for Biomedical Image Segmentation. In: Conference on Medical Image Computing and Computer Assisted Intervention. (2015)
-  Chum, O., Matas, J.: Planar affine rectification from change of scale. In: Asian Conference on Computer Vision. (2010) 347–360
-  Berclaz, J., Fleuret, F., Türetken, E., Fua, P.: Multiple Object Tracking Using K-Shortest Paths Optimization. IEEE Transactions on Pattern Analysis and Machine Intelligence 33(11) (2011) 1806–1819
-  Liebowitz, D., Zisserman, A.: Metric Rectification for Perspective Images of Planes. In: Conference on Computer Vision and Pattern Recognition. (1998) 482–488
-  Willmott, C., Matsuura, K.: Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate research 30 (2005) 79–82