Crowd counting is a heavily researched topic employed in many fields, such as video surveillance. It aims to predict the crowd density over a scene image captured by cameras, which enables a video surveillance system to reason about both the count and the locations of people.
Most current state-of-the-art methods for crowd counting are based on density map estimation [29, 15], in which a dotted annotation is normalized with a Gaussian kernel. Lempitsky et al. and Zhang et al. proposed density map generation methods based on a fixed convolution kernel and an adaptive convolution kernel, respectively. In recent years, scholars have designed many models based on these two kinds of density maps, continuously reducing errors on several commonly used benchmark datasets.
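For concreteness, the fixed-kernel construction can be sketched in plain Python as follows; the function name and the sigma/radius values here are illustrative choices, not taken from any of the cited methods:

```python
import math

def density_map(points, h, w, sigma=4.0, radius=12):
    """Build an h-by-w density map from head annotations (x, y).

    Each annotated head contributes a Gaussian normalized to sum to 1,
    so the integral of the map equals the person count.
    """
    dmap = [[0.0] * w for _ in range(h)]
    for px, py in points:
        # Accumulate an un-normalized kernel over a local window,
        # clipped to the image bounds.
        kernel, total = [], 0.0
        for y in range(max(0, py - radius), min(h, py + radius + 1)):
            for x in range(max(0, px - radius), min(w, px + radius + 1)):
                g = math.exp(-((x - px) ** 2 + (y - py) ** 2) / (2 * sigma ** 2))
                kernel.append((y, x, g))
                total += g
        for y, x, g in kernel:
            dmap[y][x] += g / total  # each head sums to exactly 1
    return dmap
```

With sigma fixed, every head is blurred identically; the adaptive-kernel variant instead sets sigma per head, for example from the distance to its nearest annotated neighbors.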
Multi-column architectures [29, 23] and VGG-16 [25, 15, 2] are the mainstream backbones for most current methods. A question arises: why do strong backbones like ResNet or DenseNet seem less effective for crowd counting? We found that the performance of CrowdNet and CSRNet on UCF_CC_50 differs greatly, with mean absolute errors of 452.5 and 266.1 respectively, although both use VGG-16 as the feature extractor. Thus, we speculate that these works might be trapped in poor local minima.
We identify the main bottleneck as the data imbalance issue, which prevents density regression from achieving satisfactory accuracy. A density map generated with a fixed Gaussian kernel is shown in Figure 1. The histogram of its values reveals a severe frequency imbalance of pixel values; even the peak value in the density map is only 0.003. This phenomenon results in a very small cost during the training phase, which hinders the optimization of the deep neural network. Furthermore, the distances among different examples, both between foreground and background and among different pixel values, are very small, which also challenges the optimization of the network.
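A back-of-envelope calculation shows how weak the training signal can be on such a map; all numbers below are assumed for illustration except the peak value of 0.003 mentioned above:

```python
# Hypothetical numbers: a 256x256 map containing 50 heads, each head's
# Gaussian covering roughly a 9x9 patch whose values are at most the
# observed peak of 0.003.
peak = 0.003
pixels = 256 * 256
foreground = 50 * 9 * 9  # pixels with non-zero density

# Upper bound on the per-pixel MSE if the network predicted all zeros:
# even a prediction that misses every head yields a near-zero cost.
mse_bound = foreground * peak ** 2 / pixels
```

Here `mse_bound` comes out on the order of 1e-7, so the gradient magnitude available to the optimizer is vanishingly small.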
In this paper, we propose a novel learning strategy that learns from an error-driven curriculum, which can be seen as a particular form of continuation method. Our learning strategy starts with the target training set and uses a tutoring network called TutorNet to repeatedly indicate the errors that the main network makes, which enhances the effectiveness of learning. Specifically, TutorNet generates a weight map to adjust the learning progress of the main network: for a training example with a smaller error, TutorNet assigns a smaller weight to balance the severity of the examples. To enlarge the distance between different examples, we scale the density map by a factor without losing its counting information.
To evaluate the effectiveness of our learning strategy, four networks previously used for crowd counting are trained under TutorNet's guidance on the ShanghaiTech dataset. The results show that our approach significantly improves the performance of different backbone networks. Notably, we evaluate our approach with a modified DenseNet as the main network on two challenging benchmark datasets; compared to current methods, our approach achieves state-of-the-art performance.
The main contributions of our work can be summarized as follows: i) We propose a learning strategy for curriculum formulation, which uses weight maps to adjust the learning progress of the main network; ii) We introduce a scale factor to enlarge the distance among different examples; iii) Extensive experiments validate the effectiveness of our approach.
II Related Work
II-A Deep learning for crowd counting
Recently, deep learning has greatly stimulated progress in crowd counting. Inspired by its success on other tasks, Zhang et al. were among the first to use Convolutional Neural Network (CNN) based approaches to predict crowd counts. However, their model ignored the important perspective geometry of scene images, and its fully connected layers discard spatial coordinates. Boominathan et al. used a fully convolutional framework that combined deep and shallow networks. Zhang et al. proposed a multi-column convolutional neural network (MCNN) in which each column has convolution kernels of a different size for a different scale. Afterward, multi-column architectures became a common way to deal with the scale issue [23, 26]. For example, Switching-CNN [24] relays non-overlapping patches to the most suitable column, and an adversarial loss has been proposed to extend the traditional Euclidean loss [26]. Xiong et al. proposed ConvLSTM, the first model to incorporate a temporal stream for crowd counting. More recently, Li et al. proposed CSRNet, a single-column network with fewer parameters that replaces the last two pooling layers of VGG-16 with dilated convolutions and outperformed most previous methods. Liu et al. incorporated deformable convolution to address the multi-scale problem.
II-B Data imbalance
The problem of data imbalance, common in many vision tasks, has been extensively studied in recent years. One solution is re-sampling, including over-sampling and down-sampling methods. Another is cost-sensitive learning, represented by the focal loss, which down-weights the numerous easy negatives. The data imbalance problem is rarely studied in regression networks, however. Lu et al. proposed the shrinkage loss, which penalizes easy samples without decreasing the weight of hard examples. Although the shrinkage loss may help with data imbalance, its extra hyperparameters must be adjusted manually, which requires complicated experiments.
II-C Continuation methods
Continuation methods tackle optimization problems that involve minimizing non-convex criteria. Curriculum learning is a type of continuation method, and self-paced learning is an extension of curriculum learning. Both suggest that samples should be selected in a meaningful order for training, but their curricula need to be predefined. Jiang et al. and Kim et al. addressed this limitation with an attached network, called MentorNet and ScreenerNet respectively; the former uses other datasets for pre-training, while the latter uses the training data directly. Inspired by MentorNet and ScreenerNet, we design an additional network to develop curricula for the main density-regression network. Unlike curriculum learning, which starts with a small set of easy examples, our learning strategy starts with the target training set and adjusts the curriculum according to the severity of the errors.
III Our Approach
We propose an effective learning strategy in which TutorNet tutors the main network based on an error record established during the training phase. Moreover, we scale the density maps by a factor to enlarge the distance between different examples during training.
III-A TutorNet
Inspired by the recent success of curriculum learning [3, 13], we tackle data imbalance by learning a curriculum. Students build a knowledge system as they learn, and targeted study of the gaps in that system is more effective than extensive review; in other words, the errors a student makes are repeatedly pointed out during the curriculum, which enhances the effectiveness of learning. Our learning strategy has two components: a main network that generates the density map, and a tutoring network. We design TutorNet as the tutoring network to generate the weight map, in which each value represents the pixel-level learning rate of the error between the ground truth and the density map predicted by the main network. The predicted density map and the weight map have the same shape. We bound the output weights w between alpha and 1. When w is close to 1, optimization needs to continue at that pixel; conversely, when w is close to alpha, the network has already achieved good performance there. Formally, we define an activation function to generate w from TutorNet's raw output o:

    w = alpha + (1 - alpha) * sigmoid(o),

where alpha denotes the adjustable weight for penalizing easy examples. If this weight is too small, the imbalance will tilt from one extreme to the other. We set alpha to 0.5 in this paper.
TutorNet is used to dynamically generate a weight map, which serves as a curriculum for the main network. To accomplish this, we propose a loss function to optimize TutorNet:

    L_T(w, e) = (1 - w)^2 * e + w^2 * max(m - e, 0),

where m is a margin hyperparameter and e denotes the error computed between the predicted density map and the ground truth; we use the mean squared error (MSE) for e in this paper. A visualization of our loss function is shown in Figure 2. The gradient of the objective function with respect to w is given by:

    dL_T/dw = -2 * (1 - w) * e + 2 * w * max(m - e, 0).
This non-negative error is provided by the main network and is used only to optimize TutorNet. A gradient descent step is

    w <- w - eta * dL/dw,

where eta is the learning rate. As the gradient above shows, when the error is larger than m, the gradient is negative, which increases the value of w; when the error is less than m, the weight begins its descent. We set m to 0.8 when using a Gaussian kernel (sigma = 15) to generate the ground truth. The error of one sample during the training phase is shown in Figure 3.
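The per-pixel weight update described above can be sketched as follows. The closed form of the loss is our assumption (a ScreenerNet-style margin objective chosen to match the gradient behavior just described), and the function names are illustrative:

```python
def tutor_loss_and_grad(w, err, m=0.8):
    """Margin objective on a single weight w in [0, 1] given error err.

    Hard pixels (err > m) get a negative gradient, pushing w toward 1;
    easy pixels (err < m) eventually push w back down.
    """
    loss = (1 - w) ** 2 * err + w ** 2 * max(m - err, 0.0)
    grad = -2 * (1 - w) * err + 2 * w * max(m - err, 0.0)
    return loss, grad

def sgd_step(w, err, lr=0.1, m=0.8):
    """One gradient descent step on w, clipped back into [0, 1]."""
    _, g = tutor_loss_and_grad(w, err, m)
    return min(1.0, max(0.0, w - lr * g))
```

For an error above the margin the step raises the weight, so that pixel keeps contributing to the main network's loss; for a small error the step lowers it.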
The optimization paths of both TutorNet and the main network are shown in Figure 4. The loss function to optimize the main network is

    L(Theta) = (1 / 2N) * sum_i || W_i * (F(X_i; Theta) - Y_i) ||^2,

where the function F represents the main network and Theta denotes its parameters; X_i denotes the input image, Y_i is its ground-truth density map, and W_i is the weight map generated by TutorNet, applied element-wise.
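The weighted supervision of the main network can be sketched as a pixel-wise weighted Euclidean loss over flattened maps; this exact form is our assumption for illustration:

```python
def weighted_mse(pred, gt, wmap):
    """Pixel-wise weighted Euclidean loss over flattened density maps.

    wmap holds TutorNet's per-pixel weights: pixels with larger weights
    contribute more to the cost and are therefore optimized harder.
    """
    n = len(pred)
    return sum(w * (p - g) ** 2 for w, p, g in zip(wmap, pred, gt)) / (2 * n)
```

With all weights equal to 1 this reduces to the standard Euclidean loss commonly used for density regression.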
TutorNet architecture: TutorNet serves as a tutor during the training phase. The network needs to quickly find the weight appropriate for the current sample; otherwise, it will reduce the performance of the main network. Our experiments show that a pre-trained ResNet is more likely to converge quickly and achieve good performance. We therefore design TutorNet based on ResNet and evaluate TutorNets of different depths in Section IV-C2.
III-B Density map with scale factor
As introduced in Section I, the values in the density map are very small, leading to small distances between foreground and background examples, which is not conducive to network learning. It is well understood that good training samples are characterized by large inter-class distances and small intra-class distances. It is therefore necessary to enlarge the values in the density map.
Note that simply normalizing each density map is not viable: the count of objects is given by the integral over any image region, and normalization would change what the density map represents. We address the problem by scaling the density map by a factor, which linearly transforms the ground truth by amplifying the values of the non-zero region. Experiments on choosing the scale factor are presented in Section IV-C1.
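Because the scaling is purely linear, the true count is recovered at inference time by dividing the integral of the predicted map by the same factor. A minimal sketch (function names are ours):

```python
SCALE = 1000  # scale factor; the paper's ablation selects this value

def scale_targets(dmap, k=SCALE):
    """Amplify the ground-truth density map before training."""
    return [[v * k for v in row] for row in dmap]

def count_from_prediction(pred, k=SCALE):
    """Integral over the predicted map, undoing the linear scaling."""
    return sum(sum(row) for row in pred) / k
```

Scaling targets and dividing predictions by the same constant leaves the counting semantics intact while moving the regression targets away from zero.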
From a more intuitive perspective, we round each value in the density map to four decimal places, which clusters the values into groups. We neglect the number of samples in each group and compute only the mean of the group values; the Euclidean distance from each group to this mean is then calculated, as shown in Figure 5. The scale factor effectively enlarges these distances, making it easier for the network to distinguish different examples. In particular, the scale factor forces the network to fit the foreground (non-zero region) rather than the background (zero region).
IV Experiments
We first perform ablation experiments to determine an appropriate scale factor and an effective TutorNet structure. Our best results are then compared with state-of-the-art methods.
IV-A Evaluation Metrics
The count error is measured using two metrics: Mean Absolute Error (MAE) and Mean Squared Error (MSE). The MAE is defined as

    MAE = (1 / N) * sum_i |C_i - C_i^GT|,

and the MSE is defined as

    MSE = sqrt((1 / N) * sum_i (C_i - C_i^GT)^2),

where N is the number of images in the test set, C_i is the estimated number of people in the i-th image, and C_i^GT is the number of people in the ground truth.
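Given lists of estimated and ground-truth per-image counts, the two metrics can be computed as follows (a straightforward sketch):

```python
import math

def mae(est, gt):
    """Mean Absolute Error over per-image counts."""
    return sum(abs(e - g) for e, g in zip(est, gt)) / len(gt)

def mse(est, gt):
    """Root of the mean squared count error, as conventionally
    reported under the name MSE in crowd counting papers."""
    return math.sqrt(sum((e - g) ** 2 for e, g in zip(est, gt)) / len(gt))
```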
IV-B Datasets

IV-B1 ShanghaiTech dataset
We first evaluate our method on ShanghaiTech. It contains 1,198 annotated images with a total of 330,165 people, each annotated at the center of the head. The dataset consists of two parts: the images in Part A are collected from the Internet, while those in Part B are taken from the busy streets of metropolitan areas in Shanghai. Part A contains more congested scenes and Part B more sparse ones. We choose Part B, which is closer to real surveillance scenes, for our experiments. Part B contains 716 images, of which 400 are used for training and 316 for testing.
IV-B2 Fudan-ShanghaiTech dataset
The FDST dataset is the largest video crowd counting dataset. It contains 100 video sequences captured by 13 surveillance cameras in different scenes. The training set contains 9,000 annotated frames from 60 videos; the test set consists of the remaining 40 videos, totaling 6,000 frames.
IV-C Ablation Study
IV-C1 Scale Factor
As introduced in Section III-B, the scale factor linearly transforms the density map. We evaluated the performance of different scale factors with DenseNet, as shown in Table II. The results show the best performance when the scale factor is 1000; we therefore set the scale factor to 1000 for the following experiments.
Table I: Configurations of TutorNet at different depths. Each variant starts with a 7x7, 64, stride-2 convolution and a 3x3 max pooling layer with stride 2, and ends with 1x1 convolutional layers.
Table II: Performance of different scale factors.

| Scale factor       | MAE  | MSE  |
|--------------------|------|------|
| Density Map x 1    | 13.0 | 22.7 |
| Density Map x 10   | 8.3  | 15.0 |
| Density Map x 100  | 8.2  | 15.8 |
| Density Map x 1000 | 7.5  | 12.8 |
| Density Map x 2000 | 7.6  | 13.7 |
IV-C2 TutorNet architecture
The configurations of TutorNet are shown in Table I. We perform experiments on several TutorNet architectures using 1000 as the scale factor. The baseline is the main network alone (DenseNet with scale factor 1000), the same as in Section IV-C1. An excellent tutor knows how to adjust the learning progress, neither neglecting the student's mastery nor repeating a lot of learned material, and the same is true of TutorNet. A shallow network struggles to track the learning progress of the main network, while the slow convergence of a very deep network also hinders the main network. The experimental results in Table III confirm this analysis, so we use the 43-layer TutorNet for the following experiments.
IV-C3 Different architectures
The results of the ablation study are shown in Table IV. We choose four typical architectures as the main network to demonstrate our method: a multi-column network (MCNN), a VGG-based network with pre-training (CSRNet), a fully convolutional network (U-Net), and a very deep convolutional network (DenseNet). The scale factor and TutorNet are added individually to the training process. The experiments verify the effectiveness of our methods in overcoming data imbalance. To evaluate the quality of the generated density maps, we compare the original methods with those trained with the scale factor and TutorNet; sample test cases are shown in Figure 6. The weight maps in the figure were generated during the training phase of MCNN. Our method generates more accurate density maps than the individual main networks.
IV-D Comparison with state-of-the-art methods
IV-D1 Experiments on ShanghaiTech
Table V compares the performance of our best approach, DenseNet+SF+TN, with state-of-the-art methods. The results indicate that our DenseNet-based network outperforms most previous methods.
IV-D2 Experiments on the FDST dataset
Our best approach, DenseNet+SF+TN, is compared against a single-image crowd counting method, MCNN, and three video crowd counting methods: ConvLSTM, LSTN w/o LST, and LSTN. The latter three exploit the spatial-temporal consistency between frames. The results are shown in Table VI: our model outperforms the previous methods while exploiting only spatial information.
V Conclusion
In this work, we identify data imbalance as the primary obstacle preventing crowd counting methods from achieving state-of-the-art accuracy. We therefore propose TutorNet for curriculum formulation, which uses weight maps to adjust the learning progress of the main network. The experiments in Section IV-C1 validate the effect of the scale factor and our analysis in Section III-B, and an ablation study on four architectures verifies the effectiveness of our methods against data imbalance. Notably, our approach can easily be extended to other previous and future networks. Future work will focus on exploiting the spatial-temporal consistency between video frames.
Acknowledgments
This work was supported by Military Key Research Foundation Project (No. AWS15J005), National Natural Science Foundation of China (No. 61672165 and No. 61732004), Shanghai Municipal Science and Technology Major Project (2018SHZDZX01) and ZJLab.
References
-  (1990) Numerical continuation methods: an introduction. Cited by: §II-C.
-  (2018) Divide and grow: capturing huge diversity in crowd images with incrementally growing cnn. In , pp. 3618–3626. Cited by: §I, TABLE V.
-  (2009) Curriculum learning. In International Conference on Machine Learning, ICML, pp. 41–48. Cited by: §II-C, §III-A.
-  (2016) Crowdnet: a deep convolutional network for dense crowd counting. In International Conference on Multimedia, ACM Multimedia, pp. 640–644. Cited by: §I, §II-A.
-  (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, pp. 321–357. Cited by: §II-B.
-  (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on Learning from Imbalanced Datasets II, Vol. 11, pp. 1–8. Cited by: §II-B.
-  (2019) Locality-constrained spatial transformer network for video crowd counting. In International Conference on Multimedia and Expo, ICME, pp. 814–819. Cited by: §IV-B2, §IV-B, §IV-D2, TABLE VI.
-  (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 770–778. Cited by: §I, §III-A.
-  (2017) Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 4700–4708. Cited by: §I, §I, §IV-C1, §IV-C3, TABLE IV.
-  (2013) Multi-source multi-scale counting in extremely dense crowd images. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 2547–2554. Cited by: §I.
-  (2018) Mentornet: regularizing very deep neural networks on corrupted labels. Cited by: §II-C.
-  (2018) ScreenerNet: learning self-paced curriculum for deep neural networks. arXiv preprint arXiv:1801.00904. Cited by: §II-C.
-  (2010) Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, NIPS, pp. 1189–1197. Cited by: §II-C, §III-A.
-  (2010) Learning to count objects in images. In Advances in Neural Information Processing Systems, NIPS, pp. 1324–1332. Cited by: §I.
-  (2018) CSRNet: dilated convolutional neural networks for understanding the highly congested scenes. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 1091–1100. Cited by: §I, §I, §II-A, §IV-C3, TABLE IV, TABLE V.
-  (2017) Focal loss for dense object detection. In IEEE International Conference on Computer Vision, ICCV, pp. 2980–2988. Cited by: §II-B.
-  (2018) Crowd counting using deep recurrent spatial-aware network. In International Joint Conference on Artificial Intelligence, IJCAI, Cited by: TABLE V.
-  (2019) ADCrowdNet: an attention-injective deformable convolutional network for crowd understanding. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 3225–3234. Cited by: §II-A, TABLE V.
-  (2018) Leveraging unlabeled data for crowd counting by learning to rank. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 7661–7669. Cited by: TABLE V.
-  (2018) Deep regression tracking with shrinkage loss. In European Conference on Computer Vision, ECCV, pp. 353–369. Cited by: §II-B.
-  (2019) Bayesian loss for crowd count estimation with point supervision. In IEEE International Conference on Computer Vision, ICCV, pp. 6142–6151. Cited by: TABLE V.
-  (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer Assisted Intervention, MICCAI, pp. 234–241. Cited by: §IV-C3, TABLE IV.
-  (2017) Switching convolutional neural network for crowd counting. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 4031–4039. Cited by: §I, §II-A, TABLE V.
-  (2018) Crowd counting via adversarial cross-scale consistency pursuit. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 5245–5254. Cited by: §II-A, TABLE V.
-  (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, ICLR. Cited by: §I, §II-A.
-  (2017) Generating high-quality crowd density maps using contextual pyramid cnns. In IEEE International Conference on Computer Vision, ICCV, pp. 1861–1870. Cited by: §II-A.
-  (2017) Spatiotemporal modeling for crowd counting in videos. In IEEE International Conference on Computer Vision, ICCV, pp. 5151–5159. Cited by: §II-A, §IV-D2, TABLE VI.
-  (2015) Cross-scene crowd counting via deep convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 833–841. Cited by: §II-A.
-  (2016) Single-image crowd counting via multi-column convolutional neural network. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 589–597. Cited by: §I, §I, §I, §II-A, Fig. 6, §IV-B1, §IV-B, §IV-C3, §IV-D2, TABLE IV, TABLE V, TABLE VI.