Learning Error-Driven Curriculum for Crowd Counting

by   Wenxi Li, et al.
FUDAN University

Density regression has been widely employed in crowd counting. However, the frequency imbalance of pixel values in the density map is still an obstacle to improving performance. In this paper, we propose a novel learning strategy for learning an error-driven curriculum, which uses an additional network to supervise the training of the main network. A tutoring network called TutorNet is proposed to repetitively indicate the critical errors of the main network. TutorNet generates pixel-level weights to formulate the curriculum for the main network during training, so that the main network assigns a higher weight to hard examples than to easy examples. Furthermore, we scale the density map by a factor to enlarge the distance between examples, which is well known to improve performance. Extensive experiments on two challenging benchmark datasets show that our method achieves state-of-the-art performance.




I Introduction

Crowd counting is an active research topic with applications in many fields, such as video surveillance. It aims to predict the average densities over the scene image captured by cameras, which enables a video surveillance system to reason about both the count and the location of people.

Most current state-of-the-art methods for crowd counting are based on density map estimation [29, 15], which uses a fixed Gaussian kernel to normalize a dotted annotation. Lempitsky et al. [14] and Zhang et al. [29] proposed density map generation methods based on a fixed convolution kernel and an adaptive convolution kernel, respectively. In recent years, researchers have designed many different models based on these two kinds of density map, continuously reducing errors on several commonly used benchmark datasets.

Fig. 1: (a) Input image. (b) Density map obtained by normalizing a dotted annotation with a fixed Gaussian kernel. (c) Histogram of the values in the density map.

Multi-column architectures [29, 23] and VGG-16 [25, 15, 2] are the mainstream backbones for most current methods. A question arises: why do strong backbones such as ResNet [8] or DenseNet [9] seem less effective for crowd counting? We found that the performance of CrowdNet [4] and CSRNet [15] on UCF_CC_50 [10] differs greatly, with mean absolute errors of 452.5 and 266.1 respectively, although both use VGG-16 [25] as the feature extractor. Thus, we speculate that these works might be trapped in poor local minima.

We identify the main bottleneck as the data imbalance issue, which stops density regression from achieving satisfactory accuracy. A density map generated with a fixed Gaussian kernel is shown in Figure 1. The histogram of the values in the density map shows a serious frequency imbalance of pixel values, and even the peak value in the density map is only 0.003. This results in a very small cost during the training phase, which hinders the optimization of a deep neural network. Furthermore, different examples lie very close to each other, both foreground versus background and pixels with different values. This also makes the network harder to optimize.
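The imbalance described above can be reproduced on a toy example. The sketch below is an illustration under stated assumptions, not the authors' preprocessing code: the image size, the head positions, and the kernel parameters (`sigma=4.0`, `radius=12`) are hypothetical and chosen only to keep the example small. It places one unit-mass Gaussian per annotated head, so the map still sums to the head count while most pixels remain zero and the peak stays tiny.

```python
import math

def gaussian_kernel(radius, sigma):
    """Return a (2*radius+1) x (2*radius+1) Gaussian kernel that sums to 1."""
    size = 2 * radius + 1
    kernel = [[math.exp(-((x - radius) ** 2 + (y - radius) ** 2) / (2 * sigma ** 2))
               for x in range(size)] for y in range(size)]
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]

def density_map(height, width, heads, sigma=4.0, radius=12):
    """Add one unit-mass Gaussian per annotated head position (row, col)."""
    kernel = gaussian_kernel(radius, sigma)
    dmap = [[0.0] * width for _ in range(height)]
    for r, c in heads:
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                y, x = r + dy, c + dx
                if 0 <= y < height and 0 <= x < width:
                    dmap[y][x] += kernel[dy + radius][dx + radius]
    return dmap

# Hypothetical annotations, placed away from the image border so no mass is clipped.
heads = [(20, 20), (20, 26), (40, 45)]
dmap = density_map(64, 64, heads)

values = [v for row in dmap for v in row]
count = sum(values)                                   # integrates to the number of heads
peak = max(values)                                    # largest pixel value in the map
zero_fraction = sum(v < 1e-6 for v in values) / len(values)
```

In this toy setup the integral still recovers the three annotated heads, yet the largest pixel value is far below 1 and more than half of the pixels are effectively zero, which is the kind of histogram shown in Figure 1(c).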

In this paper, we propose a novel learning strategy to learn from an error-driven curriculum, which can be seen as a particular form of Continuation Method. Our learning strategy starts with the target training set and uses a tutoring network called TutorNet to repetitively indicate the errors that the main network makes, which can enhance the effectiveness of learning. Specifically, TutorNet generates a weight map to adjust the learning progress of the main network. For a training example with a smaller error, TutorNet will assign a smaller weight to balance the severity of the examples. In order to enlarge the distance between different examples, we scale the density map by a factor without losing the counting information of the density map.

To evaluate the effectiveness of our learning strategy, four networks previously used for crowd counting are trained under TutorNet’s guidance on the ShanghaiTech dataset [29]. The results show that our approach can significantly improve the performance of different backbone networks. Notably, we evaluate our approach with a modified DenseNet [9] as our main network on two challenging benchmark datasets. Compared to current methods, our approach achieves state-of-the-art performance.

The main contributions of our work can be summarized as follows: i) We propose a learning strategy for curriculum formulation, which uses weight maps to adjust the learning progress of the main network; ii) We introduce a scale factor to enlarge the distance between different examples; iii) Extensive experiments validate the effectiveness of our approach.

II Related Work

II-A Deep learning for crowd counting

Recently, deep learning has greatly advanced crowd counting. Inspired by the success of deep learning on other tasks, Zhang et al. [28] were the first to use Convolutional Neural Network (CNN) based approaches to predict the number of people in a crowd. However, their model ignored the perspective geometry of scene images, and its fully connected layers throw away spatial coordinates. Boominathan et al. [4] used a fully convolutional framework that combined deep and shallow networks. Zhang et al. [29] proposed a multi-column convolutional neural network (MCNN) in which each column uses convolution kernels of a different size to handle a different scale. Afterward, the multi-column architecture became a common way to deal with the scale issue [23, 26]. For example, Switching-CNN [23] divided each image into non-overlapping patches and used a switch classifier to choose a column for each patch. Since Euclidean loss is sensitive to outliers and suffers from image blur, ACSCP [24] used non-overlapping patches and proposed an adversarial loss to extend the traditional Euclidean loss. Xiong et al. [27] proposed a model called ConvLSTM, the first to incorporate a temporal stream for crowd counting. More recently, Li et al. [15] proposed CSRNet, which uses dilated convolution in place of the last two pooling layers of VGG-16 [25] and outperformed most previous methods. They developed a single-column network, which has fewer parameters, to replace the multi-column network. Liu et al. [18] incorporated deformable convolution to address the multi-scale problem.

II-B Data imbalance

The problem of data imbalance, which is common in many vision tasks, has been extensively studied in recent years. One way to solve this problem is re-sampling, including the over-sampling method [5] and the down-sampling method [6]. Another solution is cost-sensitive learning, represented by focal loss [16], which down-weights the numerous easy negatives. However, the data imbalance problem is rarely studied in regression networks. Lu et al. [20] proposed Shrinkage loss, which penalizes the easy samples without decreasing the weight of hard examples. Although Shrinkage loss may help with data imbalance, its extra hyperparameters need to be tuned manually, which requires extensive experimentation.
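To make the idea of down-weighting easy examples concrete, here is a minimal sketch of the binary focal loss of [16], FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The default values `gamma=2.0` and `alpha=0.25` follow common practice for that loss; the example probabilities are hypothetical. It is shown only as background for the classification setting and is not part of the method proposed in this paper.

```python
import math

def cross_entropy(p, y):
    """Binary cross-entropy for predicted foreground probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss: cross-entropy scaled by alpha_t * (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# Hypothetical negatives: one classified confidently (easy), one poorly (hard).
easy_ce, hard_ce = cross_entropy(0.05, 0), cross_entropy(0.60, 0)
easy_fl, hard_fl = focal_loss(0.05, 0), focal_loss(0.60, 0)

# Ratio of hard to easy loss under each criterion.
ce_ratio = hard_ce / easy_ce
fl_ratio = hard_fl / easy_fl
```

The hard-to-easy ratio is much larger under focal loss than under plain cross-entropy, which is the sense in which easy negatives are down-weighted.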

II-C Continuation methods

Continuation methods [1] tackle optimization problems that involve minimizing non-convex criteria. Curriculum learning [3] is a type of continuation method, and self-paced learning [13] is an extension of curriculum learning. Both suggest that samples should be selected in a meaningful order for training. However, their curricula need to be predefined. Jiang et al. [11] and Kim et al. [12] address this limitation with an attached network, called MentorNet and ScreenerNet respectively. The difference is that the former uses other datasets for pre-training while the latter uses the training data directly. Inspired by MentorNet and ScreenerNet, we design an additional network to develop curricula for the main density regression network. Unlike curriculum learning, which starts with a small set of easy examples, our learning strategy starts with the target training set and adjusts the curriculum according to the severity of the error.

III Our Approach

We propose an effective learning strategy that uses TutorNet to tutor the main network from an error record established during the training phase. Moreover, we scale the density maps by a factor to enlarge the distance between different examples during the training phase.

III-A TutorNet

Inspired by the recent success of curriculum learning [3, 13], we tackle data imbalance by learning a curriculum. Students build a knowledge system as they learn, and targeted learning that addresses the gaps in this system is more effective than broad, undirected learning. In other words, the errors that a student makes on particular knowledge points are repetitively indicated during the curriculum, so as to enhance the effectiveness of learning. There are two important components in our learning strategy: a main network that generates the density map, and a tutoring network. We design TutorNet as the tutoring network to generate the weight map, in which each value represents the pixel-level learning rate of the error between the ground truth and the density map predicted by the main network. The predicted density map and the weight map have the same shape. We define the output weights $w$ to be bounded between $\lambda$ and 1. When $w$ is close to 1, optimization needs to continue at that pixel. Conversely, when $w$ is close to $\lambda$, the network has already achieved good performance there. Formally, we define an activation function that generates $w$ in this range, where $\lambda$ denotes the adjustable weight for penalizing easy examples. If this weight is too small, it will cause the imbalance to tilt from one extreme to the other. We set $\lambda$ to 0.5 in this paper.
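The text above only fixes the range of the weights (from the lower bound up to 1) and the value 0.5 for that lower bound. As one possible illustration, the sketch below assumes a rescaled sigmoid; the function name `bounded_weight`, the parameter name `lam`, and the choice of a sigmoid are assumptions made here and are not confirmed by the paper.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def bounded_weight(x, lam=0.5):
    """Map a raw TutorNet output x to a weight between lam and 1.

    Assumed form: lam + (1 - lam) * sigmoid(x). Any monotone squashing
    function with the same range would satisfy the stated bounds.
    """
    return lam + (1.0 - lam) * sigmoid(x)

# Strongly negative scores approach the lower bound, strongly positive ones approach 1.
low, mid, high = bounded_weight(-50.0), bounded_weight(0.0), bounded_weight(50.0)
```

With the lower bound at 0.5, even the easiest pixel keeps half of its original contribution, which matches the stated concern that a very small weight would tilt the imbalance toward the other extreme.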

TutorNet dynamically generates a weight map that serves as a curriculum for the main network. To this end, we propose a loss function $L_T$ to optimize TutorNet, where $M$ is a margin hyperparameter and $e$ denotes the error computed between the predicted density map and the ground truth. We use mean squared error (MSE) for $e$ in this paper. The visualization of our loss function is shown in Figure 2.

Fig. 2: Loss function with M = 1. The error $e$ is given by the main network and used to optimize the weight $w$ from TutorNet.

This non-negative error is given by the main network and is only used to optimize TutorNet. A gradient descent step is

$$w \leftarrow w - \eta \frac{\partial L_T}{\partial w},$$

where $\eta$ is the learning rate. According to the gradient of the objective, when the error is larger than $M$, the gradient is negative, which increases the value of $w$. When the error is less than $M$, the weight begins to decrease. We set $M$ to 0.8 when using a Gaussian kernel (sigma=15) to generate the ground truth. The error of one sample during the training phase is shown in Figure 3.
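The behaviour described above can be illustrated with a margin-based objective. The sketch below assumes that the TutorNet loss takes the same form as the ScreenerNet objective of [12], (1 - w)^2 e + w^2 max(M - e, 0); this form, the function names, the learning rate, and the clamping of the weight to the range from 0.5 to 1 are assumptions for illustration, not the authors' confirmed formulation.

```python
def tutor_loss(w, e, margin=0.8):
    """Assumed ScreenerNet-style objective for one weight w and error e >= 0."""
    return (1.0 - w) ** 2 * e + w ** 2 * max(margin - e, 0.0)

def tutor_grad(w, e, margin=0.8):
    """Derivative of tutor_loss with respect to w."""
    return -2.0 * (1.0 - w) * e + 2.0 * w * max(margin - e, 0.0)

def gradient_step(w, e, lr=0.1, margin=0.8, lam=0.5):
    """One gradient descent step on w, clamped to [lam, 1] (clamping is an assumption)."""
    w_new = w - lr * tutor_grad(w, e, margin)
    return min(1.0, max(lam, w_new))

w0 = 0.7
w_after_large_error = gradient_step(w0, e=1.5)   # error above the margin
w_after_small_error = gradient_step(w0, e=0.1)   # error well below the margin
```

Under this assumed form, an error above the margin always yields a non-positive gradient and pushes the weight up, while a sufficiently small error yields a positive gradient and pushes it down. For errors just below the margin the sign also depends on the current weight.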

Fig. 3: Histogram of the error between the ground truth and the main network prediction.
Fig. 4: Overview of our learning strategy.

The optimization paths of both TutorNet and the main network are shown in Figure 4. The main network is optimized with a loss in which the weight map from TutorNet weights the pixel-level error, where the function $F$ represents the main network and $\theta$ denotes its parameters; $X$ denotes the input image and $Y$ is its ground truth.
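A pixel-weighted regression loss of the kind described above can be sketched as follows. This is an assumed form (a weighted mean of squared pixel errors); the function name `weighted_mse` and the toy arrays are hypothetical, and the weight map is treated as a constant here because the text states that the error is only used to optimize TutorNet.

```python
def weighted_mse(pred, target, weights):
    """Mean over pixels of weights * (pred - target) ** 2 for equally shaped 2-D lists."""
    total, n = 0.0, 0
    for p_row, t_row, w_row in zip(pred, target, weights):
        for p, t, w in zip(p_row, t_row, w_row):
            total += w * (p - t) ** 2
            n += 1
    return total / n

# Toy 2x2 example: one large error (weight 1.0) and one small error (weight 0.5).
pred    = [[0.0, 2.0], [1.0, 1.0]]
target  = [[0.0, 1.0], [1.0, 3.0]]
weights = [[0.5, 0.5], [0.5, 1.0]]

loss = weighted_mse(pred, target, weights)
plain = weighted_mse(pred, target, [[1.0, 1.0], [1.0, 1.0]])
```

Compared with the unweighted loss, the small error contributes only half as much while the large error keeps its full contribution, so the relative share of the hard pixel grows.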

TutorNet architecture: TutorNet acts as a tutor during the training phase. The network needs to quickly find the weight map that suits the current sample; otherwise, it will reduce the performance of the main network. Our experiments show that a pre-trained ResNet [8] is more likely to converge quickly and achieve good performance. We therefore design TutorNet based on ResNet, and we evaluate TutorNet with different depths in Section IV-C2.

III-B Density map with scale factor

As introduced in Section I, the values in the density map are very small, leading to a small distance between foreground and background examples, which is not conducive to network learning. It is common knowledge that good training samples are characterized by a large inter-class distance and a small intra-class distance. It is therefore necessary to enlarge the values in the density map.

Note that simply normalizing each density map is not an option. The count of objects is given by the integral over any image region, and normalization would change what the density map represents. We address the problem by scaling the density map by a factor, which linearly transforms the ground truth by amplifying the values of the non-zero region. More experiments on choosing the scale factor are presented in Section IV-C1.


Fig. 5: The distance of each group to the mean of different scale factor.

From a more intuitive perspective, we round each value in the density map to four decimal places to cluster the values into groups. We ignore the number of samples in each group and only compute the mean of the values represented by the groups. Then, the Euclidean distance of each group to this mean is calculated, as shown in Figure 5. It can be seen that the scale factor effectively enlarges the distance, which makes it easier for the network to distinguish different examples. In particular, the scale factor can force the network to fit the foreground (non-zero region) rather than the background (zero region).
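The two properties discussed in this section, that scaling keeps the counting information and that it spreads the value groups apart, can be checked on a toy map. The sketch below is illustrative only: the 3x3 density values and the helper names are hypothetical, and the grouping follows the rounding-to-four-decimals procedure described above.

```python
def scale_density(dmap, factor):
    """Multiply every pixel of the density map by the scale factor."""
    return [[v * factor for v in row] for row in dmap]

def count_from(dmap, factor=1.0):
    """Recover the object count by integrating the map and undoing the scale."""
    return sum(map(sum, dmap)) / factor

def group_distances(dmap, decimals=4):
    """Group pixel values by rounding, then return each group's distance to the group mean."""
    groups = sorted({round(v, decimals) for row in dmap for v in row})
    mean = sum(groups) / len(groups)          # group sizes are ignored on purpose
    return [abs(g - mean) for g in groups]

# Hypothetical toy density map with a zero background.
dmap = [[0.0, 0.001, 0.002],
        [0.0, 0.003, 0.001],
        [0.0, 0.0,   0.0]]
scaled = scale_density(dmap, 1000)

original_count = count_from(dmap)
scaled_count = count_from(scaled, factor=1000)
spread_before = max(group_distances(dmap))
spread_after = max(group_distances(scaled))
```

Dividing the integral of the scaled map by the factor returns the original count, while the distances between value groups grow in proportion to the factor.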

IV Experiments

We first perform ablation experiments to determine an appropriate scale factor and an effective TutorNet structure. Our best experimental results are then compared with state-of-the-art methods.

IV-A Evaluation Metric

The count error is measured using two metrics: Mean Absolute Error (MAE) and Mean Squared Error (MSE). The MAE is defined as

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|,$$

and the MSE is defined as

$$\mathrm{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( C_i - C_i^{GT} \right)^2},$$

where $N$ is the number of images in the test set, $C_i$ is the estimated number of people, and $C_i^{GT}$ is the number of people in the ground truth.
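The two metrics can be computed from per-image counts as in the sketch below. It assumes the convention common in the crowd counting literature in which "MSE" denotes the square root of the mean squared count error; the function names and the example counts are hypothetical.

```python
import math

def mae(estimated, ground_truth):
    """Mean absolute error between estimated and true per-image counts."""
    n = len(estimated)
    return sum(abs(c - g) for c, g in zip(estimated, ground_truth)) / n

def mse(estimated, ground_truth):
    """Root of the mean squared count error (the 'MSE' convention assumed here)."""
    n = len(estimated)
    return math.sqrt(sum((c - g) ** 2 for c, g in zip(estimated, ground_truth)) / n)

# Hypothetical counts for three test images.
estimated = [10.0, 20.0, 30.0]
ground_truth = [12.0, 18.0, 33.0]

mae_value = mae(estimated, ground_truth)
mse_value = mse(estimated, ground_truth)
```

Because the squared term penalizes large per-image errors more strongly, the second metric is never smaller than the first and is often read as a measure of robustness.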

IV-B Datasets

We evaluate our models on ShanghaiTech [29] and the Fudan-ShanghaiTech (FDST) video crowd counting dataset [7], because both datasets are captured by surveillance cameras.

IV-B1 ShanghaiTech dataset

We first evaluate our method on ShanghaiTech [29]. It contains 1,198 annotated images with a total of 330,165 people, with the centers of their heads annotated. The dataset consists of two parts, Part A and Part B. The images in Part A are from the internet, and those in Part B are from the busy streets of metropolitan areas in Shanghai. Part A contains more congested scenes and Part B more sparse scenes. We choose Part B, which is closer to real surveillance scenes, for the experiments. Part B contains 716 images, of which 400 are used for training and 316 for testing.

IV-B2 Fudan-ShanghaiTech dataset

The FDST dataset [7] is the largest video crowd counting dataset. It contains 100 video sequences captured by 13 surveillance cameras with different scenes. There are 9,000 annotated frames in the training set, which are from 60 videos. The test set consists of 40 videos with 6,000 frames.

IV-C Ablation Study

IV-C1 Scale Factor

As introduced in Section III-B, the scale factor is used to linearly transform the density map. We evaluated the performance of different scale factors with DenseNet [9], as shown in Table II. The experimental results show the best performance when the scale factor is 1000. Therefore, we set the scale factor to 1000 for the following experiments.

layer name          15-layer   29-layer   43-layer            94-layer            layer stride
convolution         7×7, 64, stride 2 (all depths)                                2
pooling             3×3 max pool, stride 2 (all depths)
residual block(1)                                                                 1
residual block(2)                                                                 2
residual block(3)                                                                 1
convolution         1×1, 1     1×1, 1     1×1, 128; 1×1, 1    1×1, 128; 1×1, 1    1
TABLE I: The architecture of TutorNet. Downsampling is performed by the first convolution in residual block(2) with a stride of 2. The downsampling rate for all the networks is 8.

Scale Factor          MAE    MSE
Density Map * 1       13.0   22.7
Density Map * 10      8.3    15.0
Density Map * 100     8.2    15.8
Density Map * 1000    7.5    12.8
Density Map * 2000    7.6    13.7
TABLE II: Comparison of different scale factors on the ShanghaiTech Part B dataset.

Method               MAE    MSE
Baseline             7.5    12.8
Baseline+15-layer    7.6    13.0
Baseline+29-layer    7.3    13.1
Baseline+43-layer    7.0    12.2
Baseline+94-layer    7.4    12.8
TABLE III: Comparison of different TutorNet with scale factor 1000 on the ShanghaiTech Part B dataset.

Method            MAE    MSE
MCNN [29]         26.4   41.3
MCNN+SF           15.3   35.2
MCNN+SF+TN        14.4   25.1
CSRNet [15]       10.6   16.0
CSRNet+SF         10.4   15.9
CSRNet+SF+TN      9.4    15.6
U-net [22]        26.8   39.7
U-net+SF          13.5   23.0
U-net+SF+TN       12.1   19.7
DenseNet [9]      13.0   22.7
DenseNet+SF       7.5    12.8
DenseNet+SF+TN    7.0    12.2
TABLE IV: Comparison among various methods with our learning strategy on the ShanghaiTech Part B dataset. SF and TN respectively refer to the scale factor and TutorNet.

IV-C2 TutorNet

The configurations of TutorNet we designed are shown in Table I. We perform experiments on several TutorNet architectures using 1000 as the scale factor. The baseline is the main network only (DenseNet with scale factor 1000), the same as in Section IV-C1. An excellent tutor knows how to adjust the learning progress, neither neglecting what the students have mastered nor imposing a lot of repeated learning. The same is true of our TutorNet. It is difficult for a shallow network to track the learning progress of the main network, and the slow convergence caused by a very deep network is also an obstacle to the main network. The experimental results shown in Table III confirm our analysis. Therefore, we use the TutorNet with 43 layers for the following experiments.

IV-C3 Different architecture

The results of the ablation study are shown in Table IV. We choose four typical network architectures as the main network to demonstrate our method: a multi-column network called MCNN [29], a VGG-based network with pre-training called CSRNet [15], a fully convolutional network called U-Net [22], and a very deep convolutional network, DenseNet [9]. The scale factor and TutorNet are added to the model training process in turn. The experiments verify the effectiveness of our methods in overcoming the data imbalance. To evaluate the quality of the generated density maps, we compare the original methods with the same methods trained with the scale factor and TutorNet. Samples of the test cases can be found in Figure 6. The weight maps in the figure are generated during the training phase of MCNN [29]. We can see that our method generates more accurate density maps than the main network alone.
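To read the ablation numbers more easily, the relative MAE reduction for each backbone can be computed from Table IV. The sketch below only re-uses the MAE values reported in that table; the dictionary layout and the helper name `relative_reduction` are choices made here for illustration.

```python
# MAE values from Table IV: (original, +SF, +SF+TN).
table_iv_mae = {
    "MCNN": (26.4, 15.3, 14.4),
    "CSRNet": (10.6, 10.4, 9.4),
    "U-net": (26.8, 13.5, 12.1),
    "DenseNet": (13.0, 7.5, 7.0),
}

def relative_reduction(before, after):
    """Percentage reduction of an error metric from `before` to `after`."""
    return 100.0 * (before - after) / before

# Reduction of MAE from the original network to the network trained with SF and TN.
total_reduction = {
    name: round(relative_reduction(original, with_tn), 1)
    for name, (original, with_sf, with_tn) in table_iv_mae.items()
}

# Reduction contributed by TutorNet on top of the scale factor alone.
tutor_reduction = {
    name: round(relative_reduction(with_sf, with_tn), 1)
    for name, (original, with_sf, with_tn) in table_iv_mae.items()
}
```

On these reported values, the combined strategy reduces MAE by roughly 45% for MCNN, 11% for CSRNet, 55% for U-net, and 46% for DenseNet, and TutorNet lowers MAE further on top of the scale factor for every backbone.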

IV-D Comparison with state-of-the-art methods

IV-D1 Experiments on ShanghaiTech

Table V compares the performance of our best approach, DenseNet+SF+TN, with state-of-the-art methods. The results indicate that a DenseNet-based network outperforms most of the previous methods.

Method MAE MSE
MCNN [29] 26.4 41.3
Switching-CNN [23] 21.6 33.4
L2R [19] 13.7 21.4
ACSCP [24] 17.2 27.4
DRSAN [17] 11.1 18.2
IG-CNN [2] 10.7 16.0
CSRNet [15] 10.6 16.0
ADCrowdNet [18] 7.6 13.9
BL [21] 7.7 12.7
Ours 7.0 12.2
TABLE V: Comparison of our best approach with state-of-the-art methods on the ShanghaiTech Part B dataset.

Fig. 6: Samples generated by different models from ShanghaiTech Part B [29].

IV-D2 Experiments on FDST dataset

Our best approach, DenseNet+SF+TN, is evaluated against a method for single-image crowd counting, MCNN [29], and three methods for video crowd counting: ConvLSTM [27], LSTN w/o LST [7], and LSTN [7]. The latter three methods exploit the spatial-temporal consistency between frames. The results are shown in Table VI. Our model outperforms the previous methods while exploiting only spatial information.

Method MAE MSE
MCNN [29] 3.77 4.88
ConvLSTM [27] 4.48 5.82
LSTN w/o LST [7] 3.87 5.16
LSTN [7] 3.35 4.45
Ours 3.05 4.30
TABLE VI: Comparison of our best approach with state-of-the-art on the FDST dataset.

V Conclusion

In this work, we identify data imbalance as the primary obstacle that keeps crowd counting methods from achieving state-of-the-art accuracy. We therefore propose TutorNet for curriculum formulation, which uses weight maps to adjust the learning progress of the main network. Experiments in Section IV-C1 validate the effect of the scale factor and our analysis in Section III-B. An ablation study on four architectures verifies the effectiveness of our methods against data imbalance. It is worth mentioning that our approach can be easily applied to other existing and future networks. Future work will focus on exploiting the spatial-temporal consistency between video frames.


This work was supported by Military Key Research Foundation Project (No. AWS15J005), National Natural Science Foundation of China (No. 61672165 and No. 61732004), Shanghai Municipal Science and Technology Major Project (2018SHZDZX01) and ZJLab.


References

  • [1] E. L. Allgower and K. Georg (1990) Numerical continuation methods: an introduction.
  • [2] D. Babu Sam, N. N. Sajjan, R. Venkatesh Babu, and M. Srinivasan (2018) Divide and grow: capturing huge diversity in crowd images with incrementally growing cnn. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 3618–3626.
  • [3] Y. Bengio, J. Louradour, R. Collobert, and J. Weston (2009) Curriculum learning. In International Conference on Machine Learning, ICML, pp. 41–48.
  • [4] L. Boominathan, S. S. Kruthiventi, and R. V. Babu (2016) Crowdnet: a deep convolutional network for dense crowd counting. In International Conference on Multimedia, ACM Multimedia, pp. 640–644.
  • [5] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, pp. 321–357.
  • [6] C. Drummond, R. C. Holte, et al. (2003) C4.5, class imbalance, and cost sensitivity: why under-sampling beats over-sampling. In Workshop on learning from imbalanced datasets II, Vol. 11, pp. 1–8.
  • [7] Y. Fang, B. Zhan, W. Cai, S. Gao, and B. Hu (2019) Locality-constrained spatial transformer network for video crowd counting. In International Conference on Multimedia and Expo, ICME, pp. 814–819.
  • [8] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 770–778.
  • [9] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 4700–4708.
  • [10] H. Idrees, I. Saleemi, C. Seibert, and M. Shah (2013) Multi-source multi-scale counting in extremely dense crowd images. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 2547–2554.
  • [11] L. Jiang, Z. Zhou, T. Leung, L. Li, and L. Fei-Fei (2018) Mentornet: regularizing very deep neural networks on corrupted labels.
  • [12] T. Kim and J. Choi (2018) ScreenerNet: learning self-paced curriculum for deep neural networks. arXiv preprint arXiv:1801.00904.
  • [13] M. P. Kumar, B. Packer, and D. Koller (2010) Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems, NIPS, pp. 1189–1197.
  • [14] V. Lempitsky and A. Zisserman (2010) Learning to count objects in images. In Advances in Neural Information Processing Systems, NIPS, pp. 1324–1332.
  • [15] Y. Li, X. Zhang, and D. Chen (2018) CSRNet: dilated convolutional neural networks for understanding the highly congested scenes. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 1091–1100.
  • [16] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In IEEE International Conference on Computer Vision, ICCV, pp. 2980–2988.
  • [17] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin (2018) Crowd counting using deep recurrent spatial-aware network. In International Joint Conference on Artificial Intelligence, IJCAI.
  • [18] N. Liu, Y. Long, C. Zou, Q. Niu, L. Pan, and H. Wu (2019) ADCrowdNet: an attention-injective deformable convolutional network for crowd understanding. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 3225–3234.
  • [19] X. Liu, J. van de Weijer, and A. D. Bagdanov (2018) Leveraging unlabeled data for crowd counting by learning to rank. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 7661–7669.
  • [20] X. Lu, C. Ma, B. Ni, X. Yang, I. Reid, and M. Yang (2018) Deep regression tracking with shrinkage loss. In European Conference on Computer Vision, ECCV, pp. 353–369.
  • [21] Z. Ma, X. Wei, X. Hong, and Y. Gong (2019) Bayesian loss for crowd count estimation with point supervision. In IEEE International Conference on Computer Vision, ICCV, pp. 6142–6151.
  • [22] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer Assisted Intervention, MICCAI, pp. 234–241.
  • [23] D. B. Sam, S. Surya, and R. V. Babu (2017) Switching convolutional neural network for crowd counting. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 4031–4039.
  • [24] Z. Shen, Y. Xu, B. Ni, M. Wang, J. Hu, and X. Yang (2018) Crowd counting via adversarial cross-scale consistency pursuit. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 5245–5254.
  • [25] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  • [26] V. A. Sindagi and V. M. Patel (2017) Generating high-quality crowd density maps using contextual pyramid cnns. In IEEE International Conference on Computer Vision, ICCV, pp. 1861–1870.
  • [27] F. Xiong, X. Shi, and D. Yeung (2017) Spatiotemporal modeling for crowd counting in videos. In IEEE International Conference on Computer Vision, ICCV, pp. 5151–5159.
  • [28] C. Zhang, H. Li, X. Wang, and X. Yang (2015) Cross-scene crowd counting via deep convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 833–841.
  • [29] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma (2016) Single-image crowd counting via multi-column convolutional neural network. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR, pp. 589–597.