Analyzing large crowds quickly is a highly sought-after capability nowadays, especially for public security and planning. But automated reasoning over crowd images or videos is a challenging Computer Vision task. The difficulty is so extreme in dense crowds that the task is typically narrowed down to estimating the number of people. Since the count or distribution of people in a scene is itself valuable information, this field of research has gained traction.
There exists a huge body of work on crowd counting, ranging from initial detection based methods ([5, 6, 7, 8], etc.) to later models regressing crowd density ([9, 10, 4, 11, 12, 13, 14], etc.). The detection approaches, in general, seem to scale poorly across the entire spectrum of diversity evident in typical crowd scenes. Note the crucial difference between the normal face detection problem and crowd counting; faces may not be visible for people in all cases (see Figure 1). In fact, due to extreme pose, scale and view point variations, learning a consistent feature set to discriminate people seems difficult. Though faces might be largely visible in sparse assemblies, people become tiny blobs in highly dense crowds. This makes it cumbersome to put bounding boxes in dense crowds, not to mention the sheer number of people, in the order of thousands, that need to be annotated per image. Consequently, the problem is more conveniently reduced to that of density regression.
In density estimation, a model is trained to map an input image to its crowd density, where the spatial values indicate the number of people per unit pixel. To facilitate this, the heads of people are annotated, which is much easier than specifying bounding boxes for crowd images. These point annotations are converted to a density map by convolving with a Gaussian kernel such that simple spatial summation gives the crowd count. Though regression is the dominant paradigm in crowd analysis and delivers excellent count estimation, it has some serious limitations. The first is the inability to pinpoint persons, as these models predict crowd density, which is a regional feature (see the density maps in Figure 7). Simple post-processing of density maps to extract positions of people does not scale across the density ranges and results in poor counting performance (Section 4.2). Ideally, we expect the model to deliver accurate localization of every person in the scene, possibly with a bounding box. Such a system paves the way for downstream applications beyond predicting the crowd distribution. With accurate bounding boxes for heads of people in dense crowds, one could do person recognition, tracking etc., which are practically more valuable. Hence, we try to go beyond the popular density regression framework and create a dense detection system for crowd counting.
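As a concrete illustration of this ground truth scheme, the following sketch (our own minimal example, not code from any particular counting work) converts point head annotations into a density map whose spatial sum equals the person count:

```python
# Point annotations -> density map via Gaussian convolution.
# Summing the map recovers the head count (for interior points,
# since scipy's normalized kernel preserves total mass).
import numpy as np
from scipy.ndimage import gaussian_filter

def points_to_density(points, height, width, sigma=4.0):
    """points: iterable of (row, col) head locations."""
    density = np.zeros((height, width), dtype=np.float32)
    for r, c in points:
        r, c = int(round(r)), int(round(c))
        if 0 <= r < height and 0 <= c < width:
            density[r, c] += 1.0
    # Blurring spreads each unit impulse into a Gaussian blob.
    return gaussian_filter(density, sigma, mode='constant')

heads = [(60, 60), (64, 70), (30, 40)]
dmap = points_to_density(heads, 128, 128)
```

In practice, counting works often use adaptive (density-dependent) kernel widths; a fixed `sigma` is used here purely for brevity.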
Basically, our objective is to locate and predict bounding boxes on heads of people, irrespective of any kind of variations. Developing such a detection framework is a challenging task and cannot be easily achieved with trivial changes to existing detection frameworks ([15, 16, 17, 18, 1], etc.). This is because of the following reasons:
Diversity: Any counting model has to handle huge diversity in the appearance of individual persons and their assemblies. There exists an interplay of multiple variables, including but not limited to pose, view-point and illumination variations, within the same crowd as well as across crowd images.
Scale: The extreme scale and density variations in crowd scenes pose unique challenges in formulating a dense detection framework. In normal detection scenarios, scale variation could be mitigated with a multi-scale architecture, where images are fed to the model at different scales during training. But a large face in a sparse crowd is not simply a scaled-up version of the heads of persons in dense regions; the pattern of appearance itself changes across scales or density.
Resolution: Usual detection models predict at a down-sampled spatial resolution, typically one-sixteenth or one-eighth of the input dimensions. But this approach does not scale across density ranges. In particular, highly dense regions require fine-grained detection of people, with the possibility of hundreds of instances in a small region, at a level difficult for conventional frameworks.
Extreme box sizes: Since the densities vary drastically, so must the box sizes. They must range from as small as 1 pixel in highly dense crowds to more than 300 pixels in sparser regions, several folds beyond the setting under which normal detectors operate.
Data imbalance: Another problem due to density variation is the imbalance in box sizes for people across the dataset. The distribution is so skewed that the majority of samples are concentrated in a certain set of box sizes, while only a few are available for the remaining ones.
Only point annotation: Since only point head annotation is available with crowd datasets, bounding boxes are absent for training detectors.
Local minima: Training the model to predict at higher resolutions causes the gradient updates to be averaged over a larger spatial area. This, especially with diverse data, increases the chances of the optimization being stuck in local minima, leading to suboptimal performance.
Hence, we tackle these challenges and develop a tailor-made detection framework for dense crowd counting. Our objective is to locate every person in the scene, size each detection with a bounding box on the head and finally give the crowd count. The resulting model, LSC-CNN, is functionally trained for a pixel-wise classification task and detects the presence of persons along with the sizes of their heads. Cross entropy loss is used for training instead of the regression loss widely employed in density estimation. We devise novel solutions to each of the problems listed above, including a method to dynamically estimate bounding box sizes from point annotations. In summary, this work contributes the following:
A radical shift to the prevalent density regression paradigm with dense detection for crowd counting.
A novel CNN framework, different from conventional object detectors, that provides fine-grained localization of persons at very high resolution.
A unique fusion configuration with top-down feedback that facilitates joint processing of multi-scale information to better resolve people.
A practical training regime that only requires point annotations, but can dynamically estimate bounding boxes for heads of people.
A new winner-take-all based loss formulation for better training at higher resolutions.
A benchmarked model that delivers impressive performance in localization, sizing and counting.
2 Previous Work
Crowd counting arguably began with the detection of people in crowded scenes. These methods use appearance features from still images or motion vectors in videos to detect individual persons ([5, 6, 7]). Idrees et al. 
leverage a local scale prior and global occlusion reasoning to detect humans in crowds. Another approach runs a recurrent framework over features extracted from a deep network to sequentially detect and count people. In general, the person detection based methods are limited by their inability to operate faithfully in highly dense crowds and require bounding box annotations for training. Consequently, density regression became popular.
Density Regression: Idrees et al.  introduce an approach where features from head detections, interest points and frequency analysis are used to regress the crowd density. A shallow CNN is employed as the density regressor in 
, where the training is done by alternately backpropagating the regression and direct count losses. There are works where the model directly regresses the crowd count instead of a density map. But such methods are shown to perform worse due to the lack of spatial information in the loss.
Multiple and Multi-column CNNs: The next wave of methods focuses on addressing the huge diversity in appearance of people in crowd images through multiple networks. Walach et al.  use a cascade of CNN regressors to sequentially correct the errors of previous networks. The outputs of multiple networks, each trained with images of a different scale, are fused in  to generate the final density map. Extending the trend, architectures with multiple columns of CNNs having different receptive fields start to emerge. The receptive field determines the affinity towards certain density types. For example, the deep network in  is supposed to capture sparse crowds, while the shallow network is for the blob-like people in dense regions. The MCNN  model leverages three networks with filter sizes tuned for different density types. The specialization acquired by the individual columns in these approaches is improved through a differential training procedure in . On a similar theme, Sam et al.  create a hierarchical tree of expert networks for fine-tuned regression. Going further, the multiple columns are combined into a single network with parallel convolutional blocks of different filter sizes in , trained along with an additional local pattern consistency loss.
Some other works () supply local or global auxiliary information through a dedicated classifier. Sam et al.  show that a top-down feedback mechanism can effectively provide global context to iteratively improve the density prediction made by a CNN regressor. A similar incremental refinement of density estimation is employed in .
Better and easier architectures: Since density regression suits denser crowds better, the Decide-Net architecture from  adaptively leverages predictions from a Faster RCNN  detector in sparse regions to improve performance. Though the predictions seem better in sparse crowds, any improvement on dense datasets is not evident. Also note that the focus of that work is to aid better regression with a detector; it is not a person detection model. In fact, Decide-Net requires some bounding box annotations for training, which is infeasible for dense crowds. Striving for simpler architectures has always been a theme. Li et al.  employ a VGG based model with additional dilated convolutional layers and obtain better count estimation. Further, a DenseNet  model is trained in  for density regression at different resolutions with a composition loss.
Other counting works:
An alternative set of works try to incorporate flavours of unsupervised learning and mitigate the issue of annotation difficulty. Liu et al.
use count ranking as a self-supervision task on unlabeled data in a multi-task framework along with regression supervision on labeled data. The Grid Winner-Take-All autoencoder, introduced in , trains almost 99% of the model parameters with unlabeled data, and the acquired features are shown to be better for density regression. Other counting works employ Negative Correlation Learning  and adversarial training  to improve regression. In contrast to all these regression approaches, we move to the paradigm of dense detection, where the objective is to predict bounding boxes on the heads of people in crowds of any density.
Object/face Detectors: Since our model is a detector tailor-made for dense crowds, we briefly compare with other detection works as well. Object detection has seen a shift from early methods relying on interest points (like SIFT ) to CNNs. Early CNN based detectors operate on the philosophy of first extracting features from a deep CNN and then having a classifier on the region proposals ([36, 37, 38]) or a Region Proposal Network (RPN)  to jointly predict the presence and boxes of objects. But the current dominant methods ([17, 16]) have simpler end-to-end architectures without region proposals. They divide the input image into a grid of cells, and boxes are predicted with confidences for each cell. These works generally suit relatively large objects with few instances per image. Hence, to capture multiple small objects (like faces), many models have been proposed. Zhu et al.  adapt Faster RCNN with multi-scale ROI features to better detect small faces. On similar lines, a pyramid of images is leveraged in , with each scale being separately processed to detect faces of varied sizes. The SSH model  detects faces from multiple scales in a single stage using features from different layers. Our proposed architecture differs from these models in many aspects, as described in Section 1. Though it shares some similarity with the SSH model in its single stage architecture, we output predictions at resolutions higher than any face detector. This is to handle extremely small heads (of a few pixels in size) occurring very close to each other, a typical characteristic of dense crowds. Moreover, bounding box annotations are not available per se in crowd datasets, so we have to rely on pseudo box data. Due to this approximated box data, we prefer not to regress or adjust template box sizes as normal detectors do; instead, we just classify every head to one of the predefined boxes.
Above all, dense crowd analysis is generally considered a harder problem due to the large diversity.
A concurrent work: We note a recent paper  proposing a detection framework, PSDNN, for crowd counting. This is a concurrent work that appeared while this manuscript was under preparation. PSDNN uses a Faster RCNN model trained on crowd datasets with pseudo ground truth generated from point annotations. A locally constrained regression loss and an iterative ground truth box update scheme are employed to improve performance. Though the idea of generating pseudo ground truth boxes is similar, we do not actually create (or store) the annotations; instead, a box is chosen from head locations dynamically (as given by equation 2 in Section 3.2). We do not regress box location or size as normal detectors do, and we avoid any complicated ground truth update schemes. Also, PSDNN employs Faster RCNN with minimal changes, whereas we use a custom, completely end-to-end, single stage architecture tailor-made for the nuances of dense crowd detection and achieve better performance on almost all benchmarks (Section 4.4).
3 Our Approach
As motivated in Section 1, we drop the prevalent density regression paradigm and develop a dense detection model for dense crowd counting. Our model, named LSC-CNN, predicts accurately localized boxes on the heads of people in crowd images. Though this seems like a multi-stage task of first locating and then sizing each person, we formulate it as an end-to-end, single stage process. Figure 2 depicts a high-level view of our architecture. LSC-CNN has three functional parts: the first extracts features at multiple resolutions with the Feature Extractor. These feature maps are fed to a set of Multi-scale Feedback Reasoning (MFR) networks, where information across the scales is fused and box predictions are made. Then Non-Maximum Suppression (NMS) selects valid detections from the multiple resolutions, which are combined to generate the final output. For training the model, the last stage is replaced with the GWTA Training module, where the winner-take-all (WTA) loss backpropagation and adaptive ground truth box selection are implemented. We describe each functional part of LSC-CNN and the training algorithm in the following sections.
3.1 Locate Heads
3.1.1 Feature Extractor
Almost all existing CNN object detectors operate on a backbone deep feature extractor network, and the quality of features seems to directly affect the detection performance. For crowd counting, VGG-16 based networks are widely used in a variety of ways ([12, 23, 13, 25]) and deliver near state-of-the-art performance . In line with this trend, we also employ several of the VGG-16 convolution layers for better crowd feature extraction. But, as shown in Figure 3, some blocks are replicated and manipulated to facilitate feature extraction at multiple resolutions. The first five
convolution blocks of VGG-16, initialized with ImageNet trained weights, form the backbone network. The input to the network is RGB crowd image of fixed size (
) with the output at each block being downsampled due to max-pooling. At every block, except the last, the network branches, with the next block being duplicated (weights are copied at initialization, not tied). We tap these copied blocks to create feature maps at one-half, one-fourth, one-eighth and one-sixteenth of the input resolution. This is in slight contrast to typical hypercolumn features and helps specialize each scale branch by sharing low-level features without conflict. The low-level features, with half the spatial size of the input, could potentially capture and resolve highly dense crowds. The lower resolution scale branches have progressively higher receptive fields and are suitable for less packed crowds. In fact, people appearing large in very sparse crowds can be faithfully discriminated by the one-sixteenth features.
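A minimal sketch of this branched extractor, under our own assumptions about the exact block arrangement and with VGG-16 channel widths (class and variable names are ours, not the authors'):

```python
# Hypothetical sketch of the branched VGG-style feature extractor:
# a shared trunk of convolution blocks, with duplicated ("copied,
# not tied") blocks tapped to give features at 1/2, 1/4, 1/8 and
# 1/16 of the input resolution.
import torch
import torch.nn as nn

def vgg_block(cin, cout, n):
    layers = []
    for i in range(n):
        layers += [nn.Conv2d(cin if i == 0 else cout, cout, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2))  # halves spatial resolution
    return nn.Sequential(*layers)

class FeatureExtractor(nn.Module):
    def __init__(self):
        super().__init__()
        # Trunk blocks (VGG-16 channel widths).
        self.block1 = vgg_block(3, 64, 2)
        self.block2 = vgg_block(64, 128, 2)
        self.block3 = vgg_block(128, 256, 3)
        # Duplicated blocks that the scale branches tap.
        self.tap2 = vgg_block(64, 128, 2)
        self.tap3 = vgg_block(128, 256, 3)
        self.tap4 = vgg_block(256, 512, 3)

    def forward(self, x):
        t1 = self.block1(x)           # trunk, 1/2 resolution
        s_half = t1                   # highest-resolution branch features
        t2 = self.block2(t1)          # trunk, 1/4
        s_quarter = self.tap2(t1)     # copied block -> 1/4 branch
        t3 = self.block3(t2)          # trunk, 1/8
        s_eighth = self.tap3(t2)      # copied block -> 1/8 branch
        s_sixteenth = self.tap4(t3)   # copied block -> 1/16 branch
        return s_half, s_quarter, s_eighth, s_sixteenth
```

In the actual model, the trunk weights would be initialized from ImageNet-trained VGG-16 and copied into the duplicated blocks at initialization.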
The multi-scale architecture of the feature extractor is designed to overcome several roadblocks in dense detection. It simultaneously addresses the Diversity, Scale and Resolution issues mentioned in Section 1. The diversity aspect is taken care of by having multiple scale columns, so that each one can specialize to a different crowd type. Since the typical multi-scale input paradigm is replaced with extraction of multi-resolution features, the Scale issue is mitigated to a certain extent. Further, the increased resolution of the branches helps to better resolve people appearing very close to each other.
3.1.2 Multi-scale Feedback Reasoning
One major issue with the multi-scale representations from the feature extractor is that the higher resolution feature maps have limited context to discriminate persons. More clearly, many patterns in an image, formed by leaves of trees, structures of buildings, cluttered backgrounds etc., resemble the formation of people in highly dense crowds . As a result, these crowd-like patterns could be misclassified as people, especially at the higher resolution scales, which have low receptive fields for making predictions. We cannot avoid these low-level representations, as they are crucial for resolving people in highly dense crowds. The problem mainly arises from the absence of high-level context about the crowd regions in the scene. Hence, we evaluate global context from the scales with higher receptive fields and jointly process these feedbacks to detect persons.
As shown in Figure 2, a set of Multi-scale Feedback Reasoning (MFR) modules feed on the output of the crowd feature extractor. There is one MFR network for each scale branch, acting as a person detector at that scale. Each MFR also has feedback connections from all previous lower resolution scale branches. For example, the one-fourth branch MFR receives connections from the one-eighth as well as the one-sixteenth branches and generates feedback for the one-half scale branch. The number of feedback connections uniquely identifies an MFR network as MFR().  is also indicative of the scale and takes values from zero to , where  is the number of scale branches. For instance, the MFR with  is for the lowest resolution scale (one-sixteenth) and takes no feedback inputs. Any MFR() with  receives feedback connections from all MFR() modules with . Functionally, the MFR predicts the presence of a person at every pixel of its scale branch by coalescing all the scale feedbacks. This multi-scale feedback processing helps drive global context information to all the scale branches and suppress spurious detections, apart from aiding scale specialization.
Figure 4 illustrates the internal implementation of the MFR module. The MFR can take a variable number of feedback feature maps (through the input terminal labeled ). The main input to any MFR module is one of the scale feature sets (input terminal ) and is passed through a convolution layer. We set the number of filters for this convolution, , to one-half that of the incoming scale branch ( channels from terminal ). To be specific, , where  denotes the floor operation. This reduction in feature maps accommodates the feedback representations and decreases the computational overhead of the final layers. Note that the feedback for the next MFR module is also drawn from this convolution layer (output terminal ). For the feedback processing, each of the feedback inputs is operated on by a set of two convolution layers. The first is a transposed convolution (also known as deconvolution) that upsamples the feedback feature maps to the same size as the scale branch. The upsampling is followed by a convolution with  filters. Each processed feature set has the same number of channels () as the scale input, which forces them to be weighed equally by the subsequent layers. All these feature maps are concatenated along the channel dimension and fed to a series of convolutions with a progressive reduction in the number of filters to give the final prediction. This series of layers fuses the crowd features with feedback from other scales to improve the discrimination of people. The output classifies every pixel either as background or into one of the predefined bounding boxes for the detected head. Softmax nonlinearity is applied on these output maps to generate per-pixel confidences over the  classes, where  is the number of predefined boxes.  is a hyper-parameter to control the fineness of the sizes and is typically set to 3, making a total of  boxes for all the branches. 
The first channel of the prediction for scale , , is for background and remaining maps are for the boxes (see Section 3.2.1 for more details).
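A schematic, hypothetical implementation of one MFR module could look as follows; the terminal wiring, layer counts and channel arithmetic are our assumptions based on the description above:

```python
# Sketch of one MFR module: scale features are reduced to half their
# channels, each feedback map is upsampled by a transposed convolution
# to the branch resolution and projected to the same channel count,
# then everything is concatenated and classified per pixel into
# background + predefined box classes.
import torch
import torch.nn as nn

class MFR(nn.Module):
    def __init__(self, in_ch, feedback=(), num_boxes=3):
        super().__init__()
        m = in_ch // 2                       # reduced channel count
        self.scale_conv = nn.Conv2d(in_ch, m, 3, padding=1)
        self.fb = nn.ModuleList()
        for ch, up in feedback:              # (channels, upsample factor)
            self.fb.append(nn.Sequential(
                nn.ConvTranspose2d(ch, ch, up, stride=up),  # upsample
                nn.ReLU(inplace=True),
                nn.Conv2d(ch, m, 3, padding=1),             # match channels
                nn.ReLU(inplace=True)))
        fused = m * (1 + len(feedback))
        self.head = nn.Sequential(           # fuse and classify per pixel
            nn.Conv2d(fused, m, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(m, num_boxes + 1, 1))  # background + box classes

    def forward(self, x, feedbacks=()):
        s = torch.relu(self.scale_conv(x))
        feats = [s] + [f(t) for f, t in zip(self.fb, feedbacks)]
        logits = self.head(torch.cat(feats, dim=1))
        return logits, s    # s also serves as feedback for the next MFR
```

Softmax over the `num_boxes + 1` output channels would then give the per-pixel class confidences described in the text.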
The feedback processing architecture helps in fine-grained localization of persons, spatially as well as in the scale pyramid. The Diversity, Scale and Resolution bottlenecks (Section 1) are further mitigated by the feedback mechanism, which can selectively identify the appropriate scale branch for a person to resolve it more faithfully. This is further ensured through the training regime we employ (Section 3.2.2). Scaling across the extreme box sizes is also made possible to a certain extent, as each branch can focus on an appropriate subset of the box sizes.
3.2 Size Heads
3.2.1 Box classification
As described previously, LSC-CNN locates people with the help of the MFR modules and has to put appropriate bounding boxes on the detected heads. For this sizing, we choose a per-pixel classification paradigm. Basically, a set of bounding boxes with predefined sizes is fixed, and the model simply classifies each head to one of the boxes or as background. This is in contrast to the anchor box paradigm typically employed in detectors ([15, 17]), where the parameters of the boxes are regressed. Every scale branch () of the model outputs a set of maps, , indicating the per-pixel confidences for the box classes (see Figure 4). We now require the ground truth sizes of heads to train the model, which are not available and are inconvenient to annotate for typical dense crowd datasets. Hence, we devise a method to approximate the sizes of the heads.
For ground truth generation, we rely on the point annotations available with crowd datasets. These point annotations specify the locations of heads of people. The location is approximately at the center of the head, but can vary significantly for sparse crowds (where the point could be anywhere on the large face or head). Apart from locating every person in the crowd, the point annotations also carry some scale information: under the assumption of locally uniform crowd density, the distance between two adjacent persons indicates the bounding box size for their heads. Note that we consider only square boxes. In short, the size of any head can simply be taken as the distance to its nearest neighbour. While this makes sense in medium to dense crowds, it might give incorrect box sizes for people in sparse crowds, where the nearest neighbour is typically far away. Nevertheless, it is empirically seen to give reasonable head sizes over a wide range of densities.
Here we mathematically explain the pseudo ground truth creation. Let be the set of all annotated locations of people in the given image patch. Then for every point in , the box size is defined as,
the distance to the nearest neighbour. If there is only one person in the image patch, the box size is taken as . Now we discretize the values to predefined bins, which specifies the box sizes. If are the predefined box sizes for scale , then the box label for classification at location is given by,
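The nearest-neighbour sizing and discretization described above can be sketched as follows (a minimal interpretation; the exact binning rule and the single-person fallback are not fully specified here, so nearest-size assignment is our assumption):

```python
# Pseudo ground truth sizing: each head's box size is its distance to
# the nearest annotated neighbour, then discretized to the closest of
# the branch's predefined box sizes.
import numpy as np

def pseudo_box_labels(points, box_sizes):
    """points: (N, 2) head locations; box_sizes: sorted predefined sizes."""
    points = np.asarray(points, dtype=np.float64)
    # Pairwise distances; ignore self-distance on the diagonal.
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn_dist = d.min(axis=1)               # nearest-neighbour distance
    sizes = np.asarray(box_sizes, dtype=np.float64)
    # Label = index of the predefined size closest to the estimated size.
    labels = np.abs(nn_dist[:, None] - sizes[None, :]).argmin(axis=1)
    return nn_dist, labels

pts = [(0, 0), (0, 3), (10, 10)]
dist, lab = pseudo_box_labels(pts, box_sizes=[1, 4, 7, 10])
```

In the full model these labels would further be painted onto the per-pixel classification targets of the appropriate scale branch.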
A general philosophy is followed in choosing the box sizes s for all the scales. The size of the first box () at the highest resolution scale () is always fixed to one, which improves the resolving capacity for highly dense crowds (Resolution issue in Section 1). We choose larger sizes for the remaining boxes in the same scale with a constant increment. This increment is fine-grained in higher resolution branches, but the coarseness progressively increases for low resolution scales. To be specific, if represent the size increment for scale , then the box sizes are set as,
The typical values of the size increment for different scales are . Note that the high resolution branches (one-half & one-fourth) have boxes with finer sizes than the low resolution ones (one-sixteenth & one-eighth), where coarse resolving capacity would suffice (see Figure 5).
There are many reasons to discretize the head sizes and classify the boxes instead of regressing size values. The first is the use of pseudo ground truth. Since the head sizes themselves are approximate, tight estimation of sizes proves difficult (see Section 5.2). Similar sized heads in two images could have different ground truths depending on the density. This might lead to inconsistencies in training and suboptimal performance. Moreover, the sizes of heads vary extremely across density ranges, at a level not expected for normal detectors. This would require heavy normalization of value ranges along with complex data balancing schemes. Our per-pixel box classification paradigm effectively addresses these Extreme box sizes and Only point annotation issues (Section 1).
3.2.2 GWTA Training
Loss: We train the LSC-CNN by back-propagating per-pixel cross entropy loss. The loss for a pixel is defined as,
where is the set of probability values (softmax outputs) for the predefined box classes and refers to corresponding ground truth labels. All s take zero value except for the correct class. The s are weights to class balance the training. Now the loss for the entire prediction of scale branch would be,
where the inputs are the set of predictions and pseudo ground truths (the set limits might be dropped for convenience). Note that are the spatial sizes of these prediction maps and the cross-entropy loss is averaged over it. The final loss for LSC-CNN after combining losses from all the branches is,
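The elided loss expressions can be reconstructed from the surrounding description as follows (notation ours): a per-pixel weighted cross entropy, its spatial average per branch, and the sum over branches,

$$ \ell(p, y) = -\sum_{c=0}^{n_b} \alpha_c\, y_c \log p_c, \qquad L_s = \frac{1}{w_s h_s} \sum_{x=1}^{w_s} \sum_{y=1}^{h_s} \ell\big(p^s_{x,y}, y^s_{x,y}\big), \qquad L = \sum_{s} L_s, $$

where $p^s_{x,y}$ are the softmax confidences at pixel $(x, y)$ of branch $s$, $y^s_{x,y}$ the one-hot pseudo ground truth labels, $\alpha_c$ the class-balancing weights and $w_s, h_s$ the prediction map dimensions. This is a reconstruction consistent with the text, not the paper's original typesetting.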
Weighting: As mentioned in Section 1, the data imbalance issue is severe in the case of crowd datasets. Class-wise weighting assumes prime importance for effective backpropagation of  (see Section 5.2). We follow a simple formulation to fix the  values. Once the box sizes are set, the number of data points available for each class is computed over the entire train set. Let  denote this frequency count for box  in scale . For every scale branch, we sum the box counts as  and identify the scale with the minimum number of samples. This minimum value  is used to balance the training data across the branches. Basically, we scale down the weights for the branches with higher counts such that the minimum count branch has weight one. Note that training points for all the branches as well as the classes within a branch need to be balanced. Usually the data samples are highly skewed towards the background class () in all the scales. To mitigate this, we scale up the weights of all box classes based on their ratio with the background frequency of the same branch. Numerically, the balancing is done jointly as,
The term  can be large since the frequency of background to box samples is usually skewed, so we limit its value to 10 for better training stability. Further note that for some box size settings, the  values themselves could be very skewed, depending on the distribution of the dataset under consideration. Any difference in the  values of more than an order of magnitude is found to affect proper training. Hence, the box size increments (s) are chosen not only to roughly cover the density ranges in the dataset, but also such that the s remain within an order of magnitude of each other.
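The weighting computation can be sketched as below; the variable names and the exact joint normalization are our interpretation of the description, with the background-to-box ratio capped at 10 as stated:

```python
# Class-balancing weights: branch counts are normalized to the
# minimum-count branch, and box classes within a branch are boosted
# by their (capped) ratio to the background frequency.
import numpy as np

def class_weights(freq, cap=10.0):
    """freq[s][b]: sample count for box class b of scale s (b=0: background)."""
    freq = [np.asarray(f, dtype=np.float64) for f in freq]
    branch_totals = np.array([f.sum() for f in freq])
    c_min = branch_totals.min()
    weights = []
    for f, total in zip(freq, branch_totals):
        w = np.full_like(f, c_min / total)    # balance across branches
        ratio = np.minimum(f[0] / f, cap)     # background/box ratio, capped
        weights.append(w * ratio)             # boost rare box classes
    return weights
```

For example, a branch whose background class outnumbers a box class 100:1 would see that box class weighted up by the capped factor of 10.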
GWTA: However, even after this balancing, training LSC-CNN by optimizing the joint loss does not achieve acceptable performance (see Section 5.2). This is because the model predicts at a higher resolution than any typical crowd counting network and the loss is averaged over a relatively larger spatial area. The weighting scheme only makes sure that the averaged loss values across branches and classes are in a similar range. But the scales with larger spatial area could have more instances of one particular class than others. For instance, in dense regions, the one-half resolution scale () would have more person instances, which are typically very diverse. This causes the optimization to focus on all instances equally and might lead to a local minimum. A strategy is needed to focus on a small region at a time, update the weights, and repeat this for another region.
For solving this local minima issue (Section 1), we rely on the Grid Winner-Take-All (GWTA) approach introduced in . Though GWTA is originally used for unsupervised learning, we repurpose it for our loss formulation. The basic idea is to divide the prediction map into a grid of cells of fixed size and compute the loss only for one cell. Since only a small region contributes to the loss, this acts as a tie-breaker and avoids the gradient averaging effect, reducing the chances of the optimization reaching a local minimum. The question then is how to select the cells. The 'winner' cell is chosen as the one that incurs the highest loss. At every iteration of training, we thus concentrate more on the high-loss regions of the image and learn better features. This resembles hard mining approaches, where difficult instances are sampled more during training. In short, GWTA training selects 'hard' regions and tries to improve the prediction on them.
Figure 6 shows the implementation of GWTA training. For each scale, we apply the GWTA non-linearity on the loss. The cell size for all branches is taken as the dimensions of the lowest resolution prediction map . There is only one cell for scale  (the one-sixteenth branch), but the number of cells grows by powers of four () for subsequent branches as the spatial dimensions consecutively double. Now we compute the cross-entropy loss for any cell at location  (top-left corner) in the grid as,
where the summation of losses runs over for all pixels in the cell under consideration. Also note that is computed using equation 6 with , in order to account for the change in spatial size of the predictions. The winner cell is the one with the highest loss and the location is given by,
Note that the argmax operator finds an pair that identifies the top-left corner of the cell. The combined loss becomes,
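The winner-cell selection can be sketched as follows (a simplified single-map version with our own function names; per the equations, the cell loss is the spatial average over the cell):

```python
# Grid Winner-Take-All: split a per-pixel loss map into fixed-size
# cells and return the top-left corner of the highest-loss cell,
# which alone contributes to backpropagation.
import numpy as np

def gwta_winner(loss_map, cell_h, cell_w):
    h, w = loss_map.shape
    best, best_xy = -np.inf, (0, 0)
    for i in range(0, h - cell_h + 1, cell_h):
        for j in range(0, w - cell_w + 1, cell_w):
            # Average loss over the cell, matching the 1/(w'h') factor.
            cell_loss = loss_map[i:i + cell_h, j:j + cell_w].mean()
            if cell_loss > best:
                best, best_xy = cell_loss, (i, j)
    return best_xy, best

loss = np.zeros((8, 8))
loss[4:8, 0:4] = 1.0   # synthetic high-loss region in the bottom-left cell
(top, left), val = gwta_winner(loss, 4, 4)
```

During training, only the gradients flowing through this winner cell would update the weights for that branch.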
We optimize the parameters of LSC-CNN by backpropagating using standard mini-batch gradient descent with momentum. Batch size is typically 4. Momentum parameter is set as 0.9 and a fixed learning rate schedule of is used. The training is continued till the counting performance (MAE in Section 4.2) on a validation set saturates.
3.3 Count Heads
For testing the model, the GWTA training module is replaced with the prediction fusion operation, as shown in Figure 2. The input image is evaluated by all the branches, resulting in predictions at multiple resolutions. Box locations are extracted from these prediction maps and linearly scaled to the input resolution. Then standard Non-Maximum Suppression (NMS) is applied to remove boxes with overlap more than a threshold. The boxes remaining after NMS form the final prediction of the model and are enumerated to output the crowd count. Note that to facilitate intermediate evaluations during training, the NMS threshold is set to 0.3 (30% area overlap). But for the best model after training, we run a threshold search to minimize the counting error (MAE, Section 4.2) over the validation set (typical values range from 0.2 to 0.3).
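A minimal greedy NMS over the fused detections might look like this (standard IoU-based suppression; treating the stated 30% area overlap as an IoU threshold is our assumption):

```python
# Greedy Non-Maximum Suppression on square head boxes: keep the
# highest-scoring box, drop boxes overlapping it beyond the threshold,
# and repeat on the survivors.
import numpy as np

def nms(boxes, scores, thresh=0.3):
    """boxes: (N, 4) as (x1, y1, x2, y2); returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]      # highest score first
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        # Intersection of the kept box with all remaining boxes.
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                 (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter)
        order = order[1:][iou <= thresh]  # survivors for the next round
    return keep

boxes = np.array([[0, 0, 4, 4], [1, 1, 5, 5], [10, 10, 14, 14]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)
```

The crowd count is then simply the number of kept boxes, `len(kept)`.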
4 Performance Evaluation
4.1 Experimental Setup and Datasets
We evaluate LSC-CNN for localization and counting performance on all major crowd datasets. Since these datasets have only point head annotations, sizing capability cannot be benchmarked on them. Hence, we use a face detection dataset where bounding box ground truth is available. Further, LSC-CNN is trained on a vehicle counting dataset to show generalization. Figure 7 displays some of the box detections by our model on all datasets. Note that unless otherwise specified, we use the same architecture and hyper-parameters given in Section 3. The remaining part of this section introduces the datasets along with any changes to the hyper-parameters.
[Table I: localization comparison of CSR-A-thr, CSR-A and LSC-CNN]
Shanghaitech: The Shanghaitech dataset  consists of a total of 1,198 crowd images with more than 0.3 million head annotations. It is divided into two sets, namely, Part_A and Part_B. Part_A has density variations ranging from 33 to 3139 people per image, with the average count being 501.4. In contrast, images in Part_B are relatively less diverse and sparser, with an average density of 123.6.
UCF-QNRF: Idrees et al.  introduce the UCF-QNRF dataset, by far the largest, with 1,201 images for training and 334 images for testing. The 1.2 million head annotations come from diverse crowd images, with density varying from as low as 49 people per image to a massive 12,865.
UCF_CC_50: UCF_CC_50  is a challenging dataset on multiple counts; the first is due to being a small set of 50 images and the second results from the extreme diversity, with crowd counts ranging from 50 to 4543. The small size poses a serious problem for training deep neural networks. Hence, to reduce the number of parameters for training, we only use the one-eighth and one-fourth scale branches for this dataset. The prediction at one-sixteenth scale is avoided as sparse crowds are rare, but the feedback connections are kept as in Figure 2. The box increments are chosen accordingly. Following prior works, we perform 5-fold cross validation for evaluation.
WorldExpo’10: An array of 3,980 frames collected from different video sequences of surveillance cameras forms the WorldExpo’10 dataset . It has sparse crowds with an average density of only 50 people per image. There are 3,380 images for training, and 600 images from five different scenes form the test set. Region of Interest (RoI) masks are also provided for every scene. Since the dataset has mostly sparse crowds, only the low-resolution one-sixteenth and one-eighth scales are used. We use a higher batch size of 32, as there are many images with no people, and follow the standard training/testing protocols for this dataset.
TRANCOS: The vehicle counting dataset, TRANCOS , has 1244 images captured by various traffic surveillance cameras. In total, there are 46,796 vehicle point annotations. Also, RoIs are specified on every image for evaluation. We use the same architecture and box sizes as that of WorldExpo’10.
WIDERFACE: WIDERFACE  is a face detection dataset with more than 0.3 million bounding box annotations spanning 32,203 images. The images, in general, have sparse crowds with variations in pose and scale and some level of occlusion. We remove the one-half scale branch for this dataset as highly dense images are not present. To compare with existing methods on the fitness of bounding box predictions, the fineness of the box sizes is increased by using five boxes per scale. The learning rate is also lowered. Note that for fair comparison, we train LSC-CNN without using the actual ground truth bounding boxes. Instead, point face annotations are created by taking the centers of the boxes, from which pseudo ground truth is generated as per the training regime of LSC-CNN. But the performance is evaluated with the actual ground truth.
4.2 Evaluation of Localization
The widely used metric for crowd counting is the Mean Absolute Error or MAE. MAE is simply the absolute difference between the predicted and actual crowd counts averaged over all the images in the test set. Mathematically,
$$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|,$$
where $C_i$ is the count predicted for input image $i$, for which the ground truth is $C_i^{GT}$. The counting performance of a model is directly evident from the MAE value. Further, to estimate the variance and hence robustness of the count prediction, the Mean Squared Error or MSE is used. It is given by
$$\text{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( C_i - C_i^{GT} \right)^2}.$$
Though these metrics measure the accuracy of the overall count prediction, localization of the predictions is not evident from them. Hence, apart from the standard MAE, we evaluate the ability of LSC-CNN to accurately pinpoint individual persons. An existing metric called Grid Average Mean absolute Error or GAME  can roughly indicate coarse localization of count predictions. To compute GAME, the prediction map is divided into a grid of cells, and the absolute count errors within the cells are summed over the grid and averaged over the test set. Table I compares the GAME values of LSC-CNN against a regression baseline model for different grid sizes. Note that GAME with only one cell, GAME(0), is the same as MAE. We take as baseline the CSRNet-A  (labeled CSR-A) model, as it is similar to our Feature Extractor and delivers near state-of-the-art results. Clearly, LSC-CNN has superior count localization compared to the density regression based CSR-A.
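The three metrics can be made concrete with a short NumPy sketch (a generic implementation, with MSE taken as the root mean squared error, the crowd counting convention; function names are ours):

```python
import numpy as np

def mae_mse(pred_counts, gt_counts):
    """MAE: mean absolute count error over the test set.
    MSE: root of the mean squared count error (crowd counting usage)."""
    d = np.asarray(pred_counts, float) - np.asarray(gt_counts, float)
    return np.abs(d).mean(), np.sqrt((d ** 2).mean())

def game(pred_map, gt_map, L):
    """Grid Average Mean absolute Error for one image: split both maps
    into a 2^L x 2^L grid and sum the per-cell absolute count errors.
    With L = 0 (one cell) this reduces to the plain count error."""
    n = 2 ** L
    H, W = pred_map.shape
    err = 0.0
    for i in range(n):
        for j in range(n):
            ys = slice(i * H // n, (i + 1) * H // n)
            xs = slice(j * W // n, (j + 1) * W // n)
            err += abs(pred_map[ys, xs].sum() - gt_map[ys, xs].sum())
    return err

mae, mse = mae_mse([100, 210], [110, 200])
print(mae, mse)  # 10.0 10.0

# GAME penalizes misplaced density even when the total count matches:
gt = np.zeros((4, 4)); gt[0, 0] = 2.0   # people in the top-left
pr = np.zeros((4, 4)); pr[3, 3] = 2.0   # predicted in the bottom-right
print(game(pr, gt, 0), game(pr, gt, 1))  # 0.0 vs 4.0
```

The last example shows why GAME indicates localization: the total counts agree (GAME(0) is zero), yet the finer grid exposes that the density was placed in the wrong region.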
One could also measure localization in terms of how closely the prediction matches the ground truth point annotation. For this, we define a metric named Mean Localization Error (MLE), which computes the distance in pixels between the predicted person location and its ground truth, averaged over the test set. The predictions are matched to head annotations in a one-to-one fashion, and a fixed penalty of 16 pixels is added for absent or spurious detections. Since CSR-A or any other density regression based counting model does not individually locate persons, we apply a threshold on the density map to get detections (CSR-A-thr). But it is difficult to threshold density maps without loss of counting accuracy. We choose a threshold such that the resultant MAE is minimum over the validation set. For CSR-A, the best thresholded MAE comes out to 167.1, instead of the original 72.6. As expected, MLE scores for LSC-CNN are significantly better than for CSR-A, indicating sharp localization capacity.
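A simplified MLE computation is sketched below. We use greedy nearest-neighbour matching in place of a full optimal one-to-one assignment (an approximation; the 16-pixel penalty for absent or spurious detections follows the text, but the function and its exact form are ours):

```python
import numpy as np

def mle(pred_pts, gt_pts, penalty=16.0):
    """Mean Localization Error sketch: greedily match each predicted
    point to its nearest unused ground-truth point and average the
    pixel distances; every unmatched prediction (spurious) and every
    unmatched ground-truth point (missed person) adds a fixed penalty."""
    gt = [np.asarray(g, float) for g in gt_pts]
    dists, used = [], set()
    for p in pred_pts:
        p = np.asarray(p, float)
        best_j, best_d = None, None
        for j, g in enumerate(gt):
            if j in used:
                continue
            d = float(np.linalg.norm(p - g))
            if best_d is None or d < best_d:
                best_j, best_d = j, d
        if best_j is None:
            dists.append(penalty)          # spurious detection
        else:
            used.add(best_j)
            dists.append(best_d)           # matched distance in pixels
    dists += [penalty] * (len(gt) - len(used))  # missed persons
    return float(np.mean(dists)) if dists else 0.0

# Two detections near two heads, one head missed: (3 + 4 + 16) / 3
print(mle([(0, 3), (10, 10)], [(0, 0), (10, 14), (50, 50)]))
```

A production implementation would use an optimal assignment (e.g. Hungarian matching) rather than the greedy pass, but the metric's behaviour is the same: distances for matched pairs, a flat penalty for everything unmatched.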
|Method||Easy||Medium||Hard|
|Two Stage CNN ||68.1||61.4||32.3|
|Hu et al. ||92.5||91.0||80.6|
|Najibi et al. ||93.1||92.1||84.5|
4.3 Evaluation of Sizing
We follow other face detection works ([18, 1]) and use the standard mean Average Precision or mAP metric to assess the sizing ability of our model. For this, LSC-CNN is trained on the WIDERFACE face dataset without the actual box ground truth, as mentioned in Section 4.1. Table II reports the comparison of mAP scores obtained by our model against other works. Despite using pseudo ground truth for training, LSC-CNN achieves competitive performance against methods that use full box supervision, especially on the Hard and Medium test sets. We also compare with the PSDNN model , which trains on pseudo box ground truth similar to our model. Interestingly, LSC-CNN has higher mAP than PSDNN on the two difficult sets. Note that the images in the Easy set are mostly of very sparse crowds with faces appearing large. We lose out in mAP mainly due to the high discretization of box sizes on large faces. This is not unexpected, as LSC-CNN is designed for dense crowds without bounding box annotations. But the fact that it works well on the relatively denser other two test sets clearly shows the effectiveness of our proposed framework.
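For reference, the mAP numbers above rest on per-class average precision over ranked detections. The following is a minimal, non-interpolated AP computation (a generic sketch; benchmark evaluators such as the WIDERFACE toolkit add IoU-based matching and interpolation on top of this):

```python
def average_precision(scores, is_true, n_gt):
    """Area under the precision-recall curve for one class: rank the
    detections by confidence, sweep the ranked list accumulating
    true/false positives, and sum precision x (recall increment).
    is_true[i] marks whether detection i matched a ground-truth box;
    n_gt is the total number of ground-truth boxes."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    ap, prev_recall = 0.0, 0.0
    for i in order:
        if is_true[i]:
            tp += 1
        else:
            fp += 1
        recall = tp / n_gt
        precision = tp / (tp + fp)
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap

# Three ranked detections, two correct, two ground-truth faces:
print(average_precision([0.9, 0.8, 0.7], [True, False, True], 2))
```

Averaging this quantity over classes (here, the single face class) and over images yields the mAP values reported in Table II.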
|UCF-QNRF (crowd)||TRANCOS (vehicle)|
|Method||MAE||MSE||Method||MAE|
|Idrees et al. ||315||508||Guerrero et al. ||13.3|
|MCNN ||277||426||Hydra CNN ||10.9|
|CMTL ||252||514||Zhang et al. ||4.2|
|SCNN ||228||445||Li et al. ||3.7|
|Idrees et al. ||132||191||PSDNN ||4.8|
|LSC-CNN (Ours)||120.5||218.2||LSC-CNN (Ours)||4.6|
We also compute the average classification accuracy of boxes with respect to the pseudo ground truth on the test set. LSC-CNN has an accuracy of around 94.56% on ST Part_A and 93.97% on UCF-QNRF, indicative of proper data fitting.
4.4 Evaluation of Counting
|ST Part_A||ST Part_B||UCF_CC_50|
|Method||MAE||MSE||MAE||MSE||MAE||MSE|
|Zhang et al. ||181.8||277.7||32.0||49.8||467.0||498.5|
|Liu et al. ||72.0||106.6||14.4||23.8||279.6||388.9|
Here we compare LSC-CNN with other crowd counting models on the standard MAE and MSE metrics. Table III lists the evaluation results on the UCF-QNRF dataset. Our model achieves an MAE of 120.5, lower than the best existing result by a significant margin of 11.5. Evaluation on the next set of datasets is available in Table IV. On Part_A of Shanghaitech, LSC-CNN performs better than all the other density regression methods and has a very competitive MAE to that of PSDNN , with the difference being just 0.5. But note that PSDNN is trained with a curriculum learning strategy, and its MAE without it appears to be significantly higher (above 80). This, along with the fact that LSC-CNN has lower count error than PSDNN on all other datasets, indicates the strength of our proposed architecture. In fact, state-of-the-art performance is obtained on both Shanghaitech Part_B and UCF_CC_50. Despite UCF_CC_50 having just 50 images with extreme diversity, our model delivers a substantial decrease of about 33 points in MAE. A similar trend is observed on the WorldExpo dataset as well, with LSC-CNN achieving lower MAE than existing methods (Table V). Further, to explore the generalization of LSC-CNN, we evaluate on the vehicle counting dataset TRANCOS. The results in Table III evidence a lower MAE than PSDNN, and it is highly competitive with the best method. These experiments evidence the top-notch crowd counting ability of LSC-CNN compared to other density regressors, with all the merits of a detection model.
|Method||Scene1||Scene2||Scene3||Scene4||Scene5||Average|
|Zhang et al. ||9.8||14.1||14.3||22.2||3.7||12.9|
|Liu et al. ||2.0||13.1||8.9||17.4||4.8||9.2|
5 Analysis and Ablations
5.1 Effect of Multi-Scale Box Classification
As mentioned in Section 3.2, in general we use 3 box sizes for each scale branch and employ 4 scale branches. Here we ablate over these choices. The results of the experiments are presented in Table VI. It is intuitive to expect higher counting accuracy with more scale branches, as people at all scales are resolved better. Although this is true in theory, as the number of scales increases, so does the number of trainable parameters for the same amount of data. This might be the cause of the slight increase in counting error beyond the chosen number of scales. Regarding the ablations on the number of boxes, we train LSC-CNN with varying numbers of boxes per scale (maintaining the same size increments as specified in Section 3.2 for all). Initially, we observe a progressive gain in counting accuracy as boxes are added, but it seems to saturate beyond the default setting. This could be attributed to the decrease in training samples per box class as the number of boxes increases.
[Table VI: scale & box and architecture ablations]
5.2 Architectural Ablations
In this section, the advantage of the various architectural choices made for our model is established through experiments. LSC-CNN employs multi-scale top-down feedback through the MFR modules (Section 3.1.2). We train LSC-CNN without these feedback connections (terminal 2 in Figure 4 is removed for all MFR networks) and label the resultant MAE as No MFR in Table VI. We also ablate with a sequential MFR (Seq MFR), in which every branch gets only one feedback connection from its previous scale, as opposed to feedback from all lower resolution scales in LSC-CNN. The results evidence that having feedback is effective in leveraging high-level scene context and helps improve count accuracy. But the improvement is drastic with the proposed multiple feedback connections, which seem to aid better extraction of context information. The top-down feedback can be incorporated in many ways, with LSC-CNN using concatenation of feedback features with the bottom-up ones. Following prior work, we also generate feedback to gate the bottom-up feature maps (Mult MFR). Specifically, we modify the second convolution layer for feedback processing in Figure 4 with a Sigmoid activation. The Sigmoid output from each feedback branch is element-wise multiplied with the incoming scale feature maps. A slight performance drop is observed with this setup, but the MAE is close to that of LSC-CNN, again emphasizing that feedback in any form could be useful.
Now we ablate the training regime of LSC-CNN. The experiment labeled No GWTA in Table VI corresponds to LSC-CNN trained with just the plain loss (equation 5). A significant improvement in MAE is observed with GWTA, validating the hypothesis that it aids better optimization of the model. Another important aspect of the training is the class balancing scheme employed. We train LSC-CNN with no weighting, essentially with all the class-balancing weights set to 1. As expected, the counting error reaches an unacceptable level, mainly due to the skewness in the distribution of persons across scales. Lastly, instead of our per-pixel box classification framework, we train LSC-CNN to regress the box sizes. Box regression is done for all the branches by replicating the last five convolutional layers of the MFR (Figure 4) into two arms; one for the per-pixel binary classification to locate persons and the other for estimating the corresponding head sizes (the sizes are normalized to 0-1 for all scales). However, this setting could not achieve good MAE, possibly due to class imbalance across box sizes (Section 3.2).
5.3 Comparison with Object/Face Detectors
To further demonstrate the utility of our framework beyond any doubt, we train existing detectors like FRCNN , SSH  and TinyFaces  on dense crowd datasets. The anchors for these models are adjusted to match the box sizes of LSC-CNN for fair comparison. The models are optimized with the pseudo box ground truth generated from point annotations (equation 2). For these models, we report the counting metrics MAE and MSE along with the point localization measure MLE in Table VII. Note that the SSH and TinyFaces face detectors are also trained with the default anchor box settings as specified by their authors (labeled def). The evaluation points to the poor counting performance of the detectors, which incur high MAE scores. This is mainly due to their inability to capture dense crowds, as evident from Figure 8. LSC-CNN, on the other hand, works well across density ranges, with quite convincing detections even on sparse crowd images from WIDERFACE .
6 Conclusion
This paper introduces a new dense detection framework for crowd counting and renders the prevalent paradigm of density regression obsolete. The proposed LSC-CNN model uses a multi-column architecture with top-down feedback processing to resolve people in dense crowds. Though only point head annotations are available for training, LSC-CNN puts a bounding box on every located person. Experiments indicate that the model not only achieves better crowd counting performance than existing regression methods, but also has superior localization with all the merits of a detection system. We hope that the community will switch from the current regression approach to the more practical dense detection. Future research could address spurious detections and make the sizing of heads more accurate.
Acknowledgments
This work was supported by SERB, Dept. of Science and Technology, Govt. of India (Proj: SB/S3/EECE/0127/2015).
References
-  P. Hu and D. Ramanan, “Finding tiny faces,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
-  S. Yang, P. Luo, C. C. Loy, and X. Tang, “Wider face: A face detection benchmark,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
-  “World’s largest selfie,” https://www.gsmarena.com/nokia_lumia_730_captures_worlds_largest_selfie-news-10285.php, accessed: 2019-05-31.
-  Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma, “Single-image crowd counting via multi-column convolutional neural network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 589–597.
-  B. Wu and R. Nevatia, “Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors,” in IEEE International Conference on Computer Vision, vol. 1, 2005, pp. 90–97.
-  P. Viola, M. J. Jones, and D. Snow, “Detecting pedestrians using patterns of motion and appearance,” International Journal of Computer Vision, vol. 63, no. 2, pp. 153–161, 2005.
-  M. Wang and X. Wang, “Automatic adaptation of a generic pedestrian detector to a specific traffic scene,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 3401–3408.
-  H. Idrees, K. Soomro, and M. Shah, “Detecting humans in dense crowds using locally-consistent scale prior and global occlusion reasoning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1986–1998, 2015.
-  H. Idrees, I. Saleemi, C. Seibert, and M. Shah, “Multi-source multi-scale counting in extremely dense crowd images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2547–2554.
-  C. Zhang, H. Li, X. Wang, and X. Yang, “Cross-scene crowd counting via deep convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 833–841.
-  D. Onoro-Rubio and R. J. López-Sastre, “Towards perspective-free object counting with deep learning,” in European Conference on Computer Vision. Springer, 2016, pp. 615–629.
-  D. B. Sam, S. Surya, and R. V. Babu, “Switching convolutional neural network for crowd counting,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 3, 2017, p. 6.
-  V. A. Sindagi and V. M. Patel, “Generating high quality crowd density maps using contextual pyramid CNNs,” in The IEEE International Conference on Computer Vision (ICCV), 2017.
-  Y. Li, X. Zhang, and D. Chen, “CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems (NIPS), 2015.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision. Springer, 2016.
-  J. Redmon and A. Farhadi, “Yolo9000: Better, faster, stronger,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
-  M. Najibi, P. Samangouei, R. Chellappa, and L. S. Davis, “Ssh: Single stage headless face detector,” in The IEEE International Conference on Computer Vision (ICCV), 2017.
-  R. Stewart and M. Andriluka, “End-to-end people detection in crowded scenes,” arXiv preprint arXiv:1506.04878, 2015.
-  C. Wang, H. Zhang, L. Yang, S. Liu, and X. Cao, “Deep people counting in extremely dense crowds,” in Proceedings of the 23rd ACM international conference on Multimedia, 2015, pp. 1299–1302.
-  E. Walach and L. Wolf, “Learning to count with cnn boosting,” in European Conference on Computer Vision. Springer, 2016, pp. 660–676.
-  L. Boominathan, S. S. Kruthiventi, and R. V. Babu, “Crowdnet: A deep convolutional network for dense crowd counting,” in Proceedings of the 2016 ACM on Multimedia Conference, 2016, pp. 640–644.
-  D. Babu Sam, N. N. Sajjan, R. Venkatesh Babu, and M. Srinivasan, “Divide and grow: Capturing huge diversity in crowd images with incrementally growing cnn,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
-  X. Cao, Z. Wang, Y. Zhao, and F. Su, “Scale aggregation network for accurate and efficient crowd counting,” in The European Conference on Computer Vision (ECCV), September 2018.
-  V. A. Sindagi and V. M. Patel, “CNN-based cascaded multi-task learning of high-level prior and density estimation for crowd counting,” in Advanced Video and Signal Based Surveillance (AVSS), 2017 14th IEEE International Conference on. IEEE, 2017, pp. 1–6.
-  D. Babu Sam and R. V. Babu, “Top-down feedback for crowd counting convolutional neural network,” in AAAI Conference on Artificial Intelligence, 2018.
-  V. Ranjan, H. Le, and M. Hoai, “Iterative crowd counting,” in The European Conference on Computer Vision (ECCV), September 2018.
-  J. Liu, C. Gao, D. Meng, and A. G. Hauptmann, “Decidenet: Counting varying density crowds through attention guided detection and density estimation,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
-  G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
-  H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, and M. Shah, “Composition loss for counting, density map estimation and localization in dense crowds,” in The European Conference on Computer Vision (ECCV), September 2018.
-  X. Liu, J. Van De Weijer, and A. D. Bagdanov, “Exploiting unlabeled data in cnns by self-supervised learning to rank,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
-  D. B. Sam, N. Sajjan, H. Maurya, and R. V. Babu, “Almost unsupervised learning for dense crowd counting,” in Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019.
-  Z. Shi, L. Zhang, Y. Liu, X. Cao, Y. Ye, M.-M. Cheng, and G. Zheng, “Crowd counting with deep negative correlation learning,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
-  Z. Shen, Y. Xu, B. Ni, M. Wang, J. Hu, and X. Yang, “Crowd counting via adversarial cross-scale consistency pursuit,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
-  D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, 2004.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling in deep convolutional networks for visual recognition,” IEEE transactions on pattern analysis and machine intelligence, vol. 37, no. 9, pp. 1904–1916, 2015.
-  S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
-  C. Zhu, Y. Zheng, K. Luu, and M. Savvides, “Cms-rcnn: contextual multi-scale region-based cnn for unconstrained face detection,” in Deep Learning for Biometrics. Springer, 2017.
-  Y. Liu, M. Shi, Q. Zhao, and X. Wang, “Point in, box out: Beyond counting persons in crowds,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), 2015.
-  R. Guerrero-Gómez-Olmedo, B. Torre-Jiménez, R. López-Sastre, S. M. Bascón, and D. Oñoro-Rubio, “Extremely overlapping vehicle counting,” in Iberian Conference on Pattern Recognition and Image Analysis (IbPRIA), 2015.
-  S. Yang, P. Luo, C.-C. Loy, and X. Tang, “From facial parts responses to face detection: A deep learning approach,” in The IEEE International Conference on Computer Vision (ICCV), 2015.
-  S. Zhang, G. Wu, J. P. Costeira, and J. M. F. Moura, “Fcn-rlstm: Deep spatio-temporal neural networks for vehicle counting in city cameras,” in The IEEE International Conference on Computer Vision (ICCV), 2017.
-  X. Liu, J. van de Weijer, and A. D. Bagdanov, “Leveraging unlabeled data for crowd counting by learning to rank,” in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.