In Defense of Single-column Networks for Crowd Counting

08/18/2018 ∙ by Ze Wang, et al. ∙ 0

Crowd counting, usually addressed by density estimation, has become an increasingly important topic in computer vision due to its widespread applications in video surveillance, urban planning, and intelligence gathering. However, it is an inherently challenging task because of the greatly varied sizes of objects, coupled with severe occlusions and the vague appearance of extremely small individuals. Existing methods rely heavily on multi-column learning architectures to extract multi-scale features, which suffer from heavy computational cost that is especially undesirable for crowd counting. In this paper, we propose the single-column counting network (SCNet) for efficient crowd counting without relying on multi-column networks. SCNet consists of residual fusion modules (RFMs) for multi-scale feature extraction, a pyramid pooling module (PPM) for information fusion, and a sub-pixel convolutional module (SPCM) followed by a bilinear upsampling layer for resolution recovery. These modules enable SCNet to fully capture multi-scale features in a compact single-column architecture and to estimate high-resolution density maps efficiently. In addition, we provide a principled paradigm for density map generation and data augmentation for training, which further improves performance. Extensive experiments on three benchmark datasets show that SCNet delivers new state-of-the-art performance and surpasses previous methods by large margins, which demonstrates the effectiveness of SCNet as a single-column network for crowd counting.

1 Introduction

Counting the number of people by estimating their density distribution from crowd images has attracted increasing attention because of its wide range of applications, such as safety monitoring, disaster management, public space design, and intelligence gathering [Sindagi and Patel(2017c)], especially in congested scenes like arenas, shopping malls, and airports. However, it is not a trivial task due to the great challenges caused by occlusion, cluttered scenes, irregular object distribution, non-uniform object scale, and varying perspective and background.

Recently, in an attempt to deal with those challenges, CNN-based methods have been developed for crowd counting and density estimation [Walach and Wolf(2016), Zhang et al.(2016)Zhang, Zhou, Chen, Gao, and Ma, Marsden et al.(2016)Marsden, McGuiness, Little, and O’Connor, Sam et al.(2017)Sam, Surya, and Babu, Sindagi and Patel(2017b)], among which CNNs with a multi-column architecture, referred to as multi-column CNN (MCNN) [Zhang et al.(2016)Zhang, Zhou, Chen, Gao, and Ma], have been extensively studied. MCNN typically employs multi-branch sub-networks with filters of different sizes to extract multi-scale features and thereby handle the varying sizes of individuals. Existing methods mainly follow this multi-column architecture and share a similar topology of multi-branch sub-networks. However, networks with the multi-column architecture usually introduce heavy computational overhead and are nontrivial to optimize [Sindagi and Patel(2017c)], especially when the network goes deeper, which makes them unfavorable for the task of crowd counting.

In this paper, we propose the single-column counting network (SCNet) for crowd counting and density estimation, which establishes a simple but effective network without relying on the multi-column architecture. Our SCNet consists of several connected modules designed for efficient crowd counting. Specifically, residual fusion modules (RFMs), composed of several nested dilated layers and short-cut connections, are stacked for multi-scale feature extraction; a pyramid pooling module (PPM) is deployed to fuse hierarchical contextual information and, in conjunction with the RFMs, forms the feature encoder that generates the semantic feature map; a sub-pixel convolutional module (SPCM) with a bilinear interpolation operation is used to decode the semantic feature map for high-resolution density estimation, and provides a parameter-free way to recover resolution without compromising accuracy.

Moreover, we provide a principled paradigm for density map generation and data augmentation, based on which we introduce online sampling and multi-scale training to further enhance the overall performance.

In general, we can summarize our major contributions in the following three aspects:

  • We propose the single-column counting network (SCNet) for crowd counting and density estimation, a simple, easy-to-implement architecture which achieves competitive and even better performance compared to its multi-column counterparts.

  • We design the residual fusion modules and the pyramid pooling module for capturing multi-scale features to handle the great variation of object sizes; and we adopt the sub-pixel convolutional module for feature resolution recovery and density map generation in an efficient, nonparametric way.

  • We conduct a thorough study of the experimental settings and data preparation in crowd counting and provide a principled paradigm to fully utilize the limited training data.

Extensive experiments on three public benchmark datasets show that our SCNet can achieve state-of-the-art performance, surpassing most previous methods, which demonstrates its great effectiveness as a single-column network for crowd counting.

Figure 1: The architecture of our Single-column Counting Net (SCNet).

2 Related Work

We describe the related work of crowd counting in two parts as in [Sindagi and Patel(2017c)]: traditional approaches and CNN-based approaches.

Various methods have been proposed to address the problem of crowd counting. According to [Loy et al.(2013)Loy, Chen, Gong, and Xiang], the traditional approaches can be roughly divided into three categories: detection-based approaches, regression-based approaches, and density estimation-based approaches.

Most of the early work on crowd counting used sliding window detectors to detect people and count them [Dalal and Triggs(2005), Leibe et al.(2005)Leibe, Seemann, and Schiele, Enzweiler and Gavrila(2009)]. These methods extracted features of the whole pedestrian to train their classifiers and achieved successful results in low-density crowd scenes [Viola and Jones(2004), Dalal and Triggs(2005)]. Going a step further, for better performance in high-density scenes, researchers adopted part-based detection methods that detect particular body parts rather than the whole body to estimate the people count [Felzenszwalb et al.(2010)Felzenszwalb, Girshick, McAllester, and Ramanan, Wu and Nevatia(2007)].

Although the part-based detection methods alleviated the problem of occlusion, they performed poorly in extremely dense crowd scenes and in scenes with high background clutter. Researchers then tried to use regression-based approaches to overcome these challenges [Chan et al.(2008)Chan, Liang, and Vasconcelos, Chan and Vasconcelos(2009), Chen et al.(2013)Chen, Gong, Xiang, and Loy, Chen et al.(2012)Chen, Loy, Gong, and Xiang, Idrees et al.(2013)Idrees, Saleemi, Seibert, and Shah]. The regression-based approaches learn mappings between features extracted from images and the number of people in these images [Chan et al.(2008)Chan, Liang, and Vasconcelos, Paragios and Ramesh(2001), Chen et al.(2012)Chen, Loy, Gong, and Xiang].

While the regression-based methods addressed the problems of occlusion and background clutter well, most of them ignored the important spatial information [Sindagi and Patel(2017c)]. Lempitsky et al. [Lempitsky and Zisserman(2010)] proposed to learn a linear mapping between local patch features and corresponding object density maps to make full use of the spatial information. Differently, Pham et al. [Pham et al.(2015)Pham, Kozakaya, Yamaguchi, and Okada] proposed to learn a non-linear one due to the difficulty of learning a linear mapping.

Walach et al. [Walach and Wolf(2016)] proposed a CNN-based method with layered boosting and selective sampling. In contrast to this patch-based training method, Shang et al. [Shang et al.(2016)Shang, Ai, and Bai] used an end-to-end CNN method that took the entire image as input and outputted the total crowd count. Boominathan et al. [Boominathan et al.(2016)Boominathan, Kruthiventi, and Babu] presented a fully convolutional network that combined a deep network and a shallow network to predict the density map while addressing scale variations across images.

Motivated by the multi-column networks for image classification [Ciregan et al.(2012)Ciregan, Meier, and Schmidhuber], the MCNN proposed by Zhang et al. [Zhang et al.(2016)Zhang, Zhou, Chen, Gao, and Ma] generated the density map by merging multi-scale features extracted by networks with different receptive fields. Similarly, Onoro et al. [Onoro-Rubio and López-Sastre(2016)] developed a scale aware counting model named Hydra CNN to estimate object densities. However, Marsden et al. [Marsden et al.(2016)Marsden, McGuiness, Little, and O’Connor] proposed a single column fully convolutional network after observing the optimization difficulties and complicated calculations of earlier scale aware methods.

More recently, Sam et al. [Sam et al.(2017)Sam, Surya, and Babu] proposed a switching CNN architecture that smartly selected the most suitable regressor for the particular input patch. Sindagi et al. [Sindagi and Patel(2017a)] developed a cascaded CNN network that applied the high-level prior to promote the prediction performance. The Contextual Pyramid CNN [Sindagi and Patel(2017b)] extracted global and local contextual information by CNN networks and utilized the contextual information to achieve lower count error and improve the quality of density maps. These CNN-based methods achieved the state-of-the-art performance in crowd counting.

3 Single-column Counting Network

Our SCNet consists of four residual fusion modules (RFM), a pyramid pooling module (PPM), and a sub-pixel convolutional module (SPCM), as shown in Fig. 1. Specifically, RFMs and the PPM as the feature encoder transform input images to high-dimensional feature maps. The SPCM and a bilinear upsampling layer decode the high-dimensional feature maps to the high-resolution density maps to achieve crowd counting.

3.1 Residual fusion module

The major issue is to deal with the highly varied sizes of objects, which poses great challenges to regular convolutional networks because they perform feature extraction by sliding fixed-size convolutional kernels over the input feature map. The multi-column architecture has been extensively explored for this purpose, but it is computationally expensive and hard to optimize when the network goes deeper. Rather than relying on the multi-column structure, we propose the residual fusion module (RFM) for multi-scale feature extraction. We introduce dilated convolution to enlarge the receptive field, which captures contextual information from a larger range compared to standard convolution [Chen et al.(2017)Chen, Papandreou, Schroff, and Adam, Yu and Koltun(2015)]. Specifically, the RFM is built by integrating convolutional kernels with multiple dilation rates, which establishes nested dilated convolutional layers to extract multi-scale features. To be more precise, we divide the kernels in each nested dilated convolutional layer into groups, where each group uses its own dilation rate. We would like to highlight that replacing the standard convolution kernels with dilated kernels introduces no additional parameters or computational cost, which keeps our network computationally affordable.
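To make the effect of dilation concrete, the following is a minimal sketch in one dimension, not the SCNet layer itself. The function names and the dilation rates 1, 2, and 4 are assumptions chosen for illustration. It shows that a kernel with a fixed number of weights covers a wider span as the dilation rate grows, and that one group of kernels per rate yields features at several scales.

```python
def effective_kernel_size(k, d):
    """Span covered by a k-tap kernel with dilation rate d."""
    return k + (k - 1) * (d - 1)


def dilated_conv1d(signal, kernel, d):
    """Valid 1-D correlation with dilation rate d (no padding, stride 1)."""
    span = effective_kernel_size(len(kernel), d)
    out = []
    for start in range(len(signal) - span + 1):
        # Taps are spaced d samples apart; the number of weights stays len(kernel).
        out.append(sum(w * signal[start + j * d] for j, w in enumerate(kernel)))
    return out


def nested_dilated_layer(signal, kernels_by_rate):
    """Apply one kernel group per dilation rate and collect the outputs.

    kernels_by_rate maps a hypothetical dilation rate to a single kernel. In a
    real layer each group would hold many kernels and the outputs would be
    concatenated along the channel dimension.
    """
    return {d: dilated_conv1d(signal, k, d) for d, k in kernels_by_rate.items()}


if __name__ == "__main__":
    signal = [float(v) for v in range(12)]
    kernel = [1.0, 1.0, 1.0]
    groups = {1: kernel, 2: kernel, 4: kernel}  # assumed rates, illustration only
    for rate, out in nested_dilated_layer(signal, groups).items():
        print(rate, effective_kernel_size(len(kernel), rate), len(out))
```

Every group uses three weights per kernel, so the parameter count and the number of multiply-adds per output position are the same for all rates; only the sampling pattern changes.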

Moreover, to leverage residual learning for effective training without suffering from degradation, we adopt the short-cut connection in our RFMs, which is typically implemented by an identity mapping or a 1×1 convolutional layer [He et al.(2016)He, Zhang, Ren, and Sun]. We add a short-cut connection around every two nested convolutional layers. In particular, the projection short-cut implemented by a 1×1 convolutional layer is used for dimension matching when the resolution or the number of channels of the feature maps changes.

Four nested dilated convolutional layers in conjunction with two short-cut connections constitute a residual fusion module, and four RFMs are stacked for the hierarchical multi-scale feature extraction, with sub-sampling operations by a factor of 2 implemented when features are transmitted between residual fusion modules.

In the RFMs, the feature map undergoes 4 downsampling operations, each by a factor of 2, which means that the height and width of the final feature map reduce to H/16 and W/16, respectively, where H and W correspond to the height and width of the input image. For an input image with large resolution, the corresponding largest valid receptive field still covers a limited spatial area, which might limit the contextual information provided by the receptive fields. To further enhance contextual information aggregation without inducing much additional computation, we introduce the pyramid pooling module (PPM) to efficiently fuse the features at multiple scales [Zhao et al.(2017)Zhao, Shi, Qi, Wang, and Jia].

3.2 Pyramid pooling module

Specifically, given the final feature map of , we apply average pooling operations at multiple scales to aggregate sub-regional contextual information at different scales, where the kernel sizes of the average pooling layers are , and . After pooling the feature using kernels and getting the pyramid features of resolutions, we simply resize them back to using nearest neighbour interpolation, which produces a series of features of the same resolution . Together with the original feature, we concatenate them and derive the pyramid feature of size . An convolutional layer is followed to aggregate the feature back to . We set in experiments for a balance between performance and computation.

The PPM further expands the receptive fields to different scales and abstracts the information of sub-regions in different sizes by adopting multi-scale pooling kernels in a few strides. The aggregation of multi-scale contextual information provides more powerful representations to distinguish individuals in different sizes from the background, which can aid our network to make more accurate estimation and get better density maps.
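The pooling, resizing, and stacking steps of the PPM can be sketched on a single-channel feature map as follows. This is an illustration under stated assumptions: the kernel sizes 2 and 4, the non-overlapping pooling stride, and the function names are hypothetical, and the learned convolutional layer that fuses the stacked features is omitted.

```python
def avg_pool2d(fmap, k):
    """Non-overlapping average pooling with kernel size and stride k."""
    h, w = len(fmap), len(fmap[0])
    return [
        [
            sum(fmap[y][x] for y in range(i, i + k) for x in range(j, j + k)) / (k * k)
            for j in range(0, w - k + 1, k)
        ]
        for i in range(0, h - k + 1, k)
    ]


def resize_nearest(fmap, out_h, out_w):
    """Nearest-neighbour resize of a 2-D nested list to out_h x out_w."""
    h, w = len(fmap), len(fmap[0])
    return [
        [fmap[i * h // out_h][j * w // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


def pyramid_pool(fmap, kernel_sizes):
    """Return the original map plus one resized pooled map per kernel size.

    The returned list plays the role of the channel-wise concatenation; a real
    module would pass it through a convolution to reduce the channel count.
    """
    h, w = len(fmap), len(fmap[0])
    levels = [fmap]
    for k in kernel_sizes:
        levels.append(resize_nearest(avg_pool2d(fmap, k), h, w))
    return levels


if __name__ == "__main__":
    fmap = [[float(4 * i + j) for j in range(4)] for i in range(4)]
    for level in pyramid_pool(fmap, (2, 4)):  # assumed kernel sizes
        print(level)
```

Each pooled level summarizes a larger sub-region than the previous one, so after resizing every position carries context from several spatial extents at once.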

3.3 Sub-pixel convolutional module

The convolution operations with pooling layers progressively reduce the feature resolution in exchange for a larger valid receptive field and invariance. Directly generating a low-resolution density prediction from the final convolutional feature would not be optimal, since a low-resolution prediction results in blurry estimates that perform poorly, especially at points where the density varies dramatically. Therefore, it is necessary to generate the density map at the same size as the input image. Deconvolutional layers could be adopted for recovering feature resolution, but their considerable computation cost and training difficulty make them a sub-optimal choice for crowd counting.

To implement an efficient and easy-to-train feature resolution recovery, we introduce the sub-pixel convolutional module (SPCM) to leverage its great effectiveness in resolution recovery [Shi et al.(2016)Shi, Caballero, Huszár, Totz, Aitken, Bishop, Rueckert, and Wang] to crowd counting and density estimation. The SPCM rearranges the elements of the feature map in a size of to a feature map of the shape , which recovers the resolution of the feature maps in a precise way with nearly no computational cost. In practice, the parameter is set to in our network, which means that the SPCM increases the spatial resolution by a factor of . Then a bilinear interpolation operation follows to upsample the feature map to the final density map in the same size as the input image.

The SPCM followed by a bilinear upsampling layer provides a nonparametric way to recover the resolution of the feature maps and generate the final density map. Moreover, in our work, the rearrangement of the elements in the feature map explicitly guides the network to use the information encoded in the channel dimension to compensate for the loss of spatial resolution, so that even though no additional computation cost is introduced, the spatial information can still be well preserved in the channel dimension.
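The rearrangement performed by a sub-pixel convolution can be sketched as below. This is a minimal illustration on nested lists, assuming a channel-first layout and one common index convention in which channel c·r² + i·r + j of the input supplies the pixel at offset (i, j) inside each r × r output block. The upscaling factor r = 2 in the usage lines is a placeholder, not the value used in SCNet.

```python
def pixel_shuffle(fmap, r):
    """Rearrange a (C*r*r, H, W) nested list into a (C, H*r, W*r) nested list."""
    channels, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    if channels % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    c_out = channels // (r * r)
    out = [[[0.0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(r):
            for j in range(r):
                # Each input channel fills one fixed offset of every r x r block.
                src = fmap[c * r * r + i * r + j]
                for y in range(h):
                    for x in range(w):
                        out[c][y * r + i][x * r + j] = src[y][x]
    return out


if __name__ == "__main__":
    # Four 1x1 channels become one 2x2 map when r = 2 (illustrative factor).
    fmap = [[[1.0]], [[2.0]], [[3.0]], [[4.0]]]
    print(pixel_shuffle(fmap, 2))
```

The operation only moves values around: it has no weights and performs no arithmetic, which is why it adds essentially no computation on top of the preceding convolution.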

4 Data Preparation

We train our network to estimate the density map instead of directly predicting the total amount for crowd counting like most of the recent methods, because density map preserves more information and improves the performance of crowd counting [Zhang et al.(2016)Zhang, Zhou, Chen, Gao, and Ma]. To further augment the limited training data while maintaining the preciseness of the data, we put forward two principles for data preparation and propose online sampling and multi-scale training following these principles to train our network, which further enhances the robustness and performance of the network.

4.1 Density map generation

In existing methods [Zhang et al.(2016)Zhang, Zhou, Chen, Gao, and Ma], the Gaussian kernel is widely used for density map generation: a Gaussian kernel normalized to 1 is placed on each of the pedestrian annotations. Based on the generated density maps, data augmentation, e.g., scaling, is usually adopted to generate more training samples. However, this paradigm of data augmentation introduces misleading information. The sizes and peaks of the annotations should relate only to the density of pedestrians and be independent of the sizes of pedestrians, which means that the information provided by the ground truth should depend only on the crowd density rather than the pedestrian sizes.

To sum up, we come up with two rules for data preparation, which should be followed in density map generation for crowd counting:

  • Each pedestrian in the image should be annotated only according to the density distribution rather than the size of the object. This is because it is the number of objects that matters in crowd counting rather than the size of the objects.

  • The information provided by the annotation of each object should be consistent during data augmentation, e.g., scaling, that is, the sizes of Gaussians associated with objects should be the same.
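The two rules above can be illustrated with a small density map generator that places the same unit-sum Gaussian on every annotation, regardless of object size or image scale. The kernel size of 15 and sigma of 4.0 are assumed values for illustration, and the function names are hypothetical.

```python
import math


def gaussian_kernel(size, sigma):
    """Square Gaussian kernel (odd size) whose entries sum to 1."""
    half = size // 2
    raw = [
        [math.exp(-((y - half) ** 2 + (x - half) ** 2) / (2 * sigma ** 2))
         for x in range(size)]
        for y in range(size)
    ]
    total = sum(map(sum, raw))
    return [[v / total for v in row] for row in raw]


def density_map(points, height, width, size=15, sigma=4.0):
    """Place one fixed-size, unit-sum Gaussian at each (x, y) annotation."""
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    dmap = [[0.0] * width for _ in range(height)]
    for px, py in points:
        cx, cy = int(round(px)), int(round(py))
        for ky in range(size):
            for kx in range(size):
                y, x = cy + ky - half, cx + kx - half
                # Mass falling outside the image is dropped, so annotations
                # near the border contribute slightly less than 1.
                if 0 <= y < height and 0 <= x < width:
                    dmap[y][x] += kernel[ky][kx]
    return dmap


if __name__ == "__main__":
    dmap = density_map([(20, 20), (40, 30)], height=64, width=64)
    print(sum(map(sum, dmap)))  # close to the number of annotated people
```

Because every kernel sums to 1, the sum over the map approximates the head count, and because the kernel is fixed, the annotation carries the same information at any image scale.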

Figure 2: Illustration of online sampling. The online sampling achieves a powerful data augmentation without changing the distribution information in the ground truth density map.

4.2 Data augmentation

As mentioned in previous sections, the robustness to various pedestrian sizes is the key to high-quality crowd counting and density estimation, which needs a large amount of training data and full utilization of the data. However, the training samples of the public datasets are limited due to the heavy cost of data annotation. To further augment the training data and make full use of the data without introducing misleading information in conventional augmentation methods, we propose online sampling and multi-scale training by following the aforementioned rules.

Online sampling.

In existing work, data augmentation, e.g., random cropping and scaling the input image, has been widely used, which is realized by generating the density maps for the original images before training in an off-line way. Accordingly, ground truth density maps are generated based on the augmented data; and then normalization is applied to the resized density map to ensure that the sum of the density map remains unchanged after resizing [Zhang et al.(2016)Zhang, Zhou, Chen, Gao, and Ma].

However, ground truth generation in this way changes the sizes of the Gaussian kernels in the density map which violates the aforementioned rules. Therefore, we propose to use online training sample generation. Specifically, the annotations of the objects are not transformed to density maps before training. During training, we first randomly sample a square area in the size of from the input image with the size of and scale the sampled image to the size of , where and we assume that for the sake of discussion. Afterwards, we compute the relative coordinates of the objects in the sampled image and put fixed size Gaussian kernels on these coordinates to get the ground truth density map of size . The online sampling achieves a powerful data augmentation without changing the distribution information in the density map. An illustration of the online sampling is shown in Fig. 2.
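A minimal sketch of the online sampling step is given below. It handles only the annotations: cropping and resizing the image itself would be done with any image library. The parameter names `crop` and `out_size`, the kernel size of 7, and the sigma of 2.0 are assumptions for illustration. The key point is that the Gaussian is generated after scaling, so its size never changes.

```python
import math
import random

KERNEL_SIZE, SIGMA = 7, 2.0  # fixed for every sample; assumed values


def fixed_gaussian():
    """Unit-sum Gaussian kernel that is identical for every training sample."""
    half = KERNEL_SIZE // 2
    raw = [
        [math.exp(-((y - half) ** 2 + (x - half) ** 2) / (2 * SIGMA ** 2))
         for x in range(KERNEL_SIZE)]
        for y in range(KERNEL_SIZE)
    ]
    total = sum(map(sum, raw))
    return [[v / total for v in row] for row in raw]


def online_sample(points, img_w, img_h, crop, out_size, rng=random):
    """Sample a crop x crop window, rescale annotations, build the density map."""
    x0 = rng.randint(0, img_w - crop)  # top-left corner of the square window
    y0 = rng.randint(0, img_h - crop)
    scale = out_size / crop
    kernel, half = fixed_gaussian(), KERNEL_SIZE // 2
    dmap = [[0.0] * out_size for _ in range(out_size)]
    kept = 0
    for px, py in points:
        if not (x0 <= px < x0 + crop and y0 <= py < y0 + crop):
            continue  # annotation lies outside the sampled window
        kept += 1
        # Relative coordinates in the scaled sample.
        cx = min(int((px - x0) * scale), out_size - 1)
        cy = min(int((py - y0) * scale), out_size - 1)
        for ky in range(KERNEL_SIZE):
            for kx in range(KERNEL_SIZE):
                y, x = cy + ky - half, cx + kx - half
                if 0 <= y < out_size and 0 <= x < out_size:
                    dmap[y][x] += kernel[ky][kx]
    return (x0, y0), kept, dmap


if __name__ == "__main__":
    pts = [(120, 80), (300, 210), (310, 220)]
    origin, kept, dmap = online_sample(pts, 640, 480, crop=256, out_size=128)
    print(origin, kept, round(sum(map(sum, dmap)), 3))
```

In contrast to resizing a precomputed density map, the coordinates are scaled and the kernel is not, which keeps the annotation of each person consistent across scales.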

Multi-scale training.

Since our network is built on a fully convolutional architecture, it can take inputs of any size. We further increase the randomness by defining multiple values of the sample size for augmenting training samples. Specifically, we define a list containing all the candidate sample sizes, and randomly select one value from the list to be the common sample size in each iteration. In contrast to existing methods using single-scale training, our multi-scale training introduces much more variation in the data, which improves the ability of our network to handle highly dense scenes.
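The per-iteration size selection can be sketched in a few lines. The candidate sizes below are hypothetical placeholders, not the values used in the experiments, and the function names are assumptions.

```python
import random

CANDIDATE_SIZES = [256, 384, 512]  # hypothetical candidate sample sizes


def pick_sample_size(rng=random):
    """Draw the sample size shared by all samples in one iteration."""
    return rng.choice(CANDIDATE_SIZES)


def iteration_sizes(num_iters, seed=0):
    """Sample sizes for a run of iterations, reproducible via the seed."""
    rng = random.Random(seed)
    return [pick_sample_size(rng) for _ in range(num_iters)]


if __name__ == "__main__":
    print(iteration_sizes(5))
```

All samples within one iteration share the drawn size so that they can be stacked into a batch, while the size changes from one iteration to the next.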

Dataset | No. of images | No. of train | No. of test | Resolution | Total count
--- | --- | --- | --- | --- | ---
ShanghaiTech A [Zhang et al.(2016)Zhang, Zhou, Chen, Gao, and Ma] | 482 | 300 | 182 | Varied | 241677
ShanghaiTech B [Zhang et al.(2016)Zhang, Zhou, Chen, Gao, and Ma] | 716 | 400 | 316 | 768 * 1024 | 88488
UCF CC 50 [Idrees et al.(2013)Idrees, Saleemi, Seibert, and Shah] | 50 | 40 | 10 | Varied | 63705
WorldExpo’10 [Zhang et al.(2015)Zhang, Li, Wang, and Yang] | 3980 | 3380 | 600 | 576 * 720 | 199923

Table 1: Summary of the three datasets

5 Experiments

We conduct extensive experiments on three benchmark datasets, which are widely-used for crowd counting. The statistics of the three datasets, i.e., the ShanghaiTech dataset [Zhang et al.(2016)Zhang, Zhou, Chen, Gao, and Ma], the UCF CC 50 dataset [Idrees et al.(2013)Idrees, Saleemi, Seibert, and Shah] and the WorldExpo’10 dataset [Zhang et al.(2015)Zhang, Li, Wang, and Yang] are provided in Table 1. We also provide comprehensive comparison with state-of-the-art methods.

5.1 Evaluation metrics

Following the traditional protocol of crowd counting works [Sindagi and Patel(2017b), Zhang et al.(2016)Zhang, Zhou, Chen, Gao, and Ma], we evaluate all of the recent methods with the Mean Absolute Error (MAE) and the Mean Squared Error (MSE), which are defined as $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |C_i - \hat{C}_i|$ and $\mathrm{MSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (C_i - \hat{C}_i)^2}$, respectively, where $N$ is the number of images of the test set, and $C_i$ and $\hat{C}_i$ represent the ground truth count and the predicted count of the $i$-th image, which are computed by the sum of the density maps.
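The two metrics can be computed from per-image counts as sketched below. This assumes the convention common in the cited crowd counting protocol, in which the reported MSE is the square root of the mean squared count error. The function names and the sample counts are illustrative only.

```python
import math


def count_from_density(dmap):
    """Predicted or ground truth count: the sum over a 2-D density map."""
    return sum(map(sum, dmap))


def mae(gt_counts, pred_counts):
    """Mean absolute count error over the test images."""
    return sum(abs(g - p) for g, p in zip(gt_counts, pred_counts)) / len(gt_counts)


def mse(gt_counts, pred_counts):
    """Root of the mean squared count error (assumed convention)."""
    sq = sum((g - p) ** 2 for g, p in zip(gt_counts, pred_counts))
    return math.sqrt(sq / len(gt_counts))


if __name__ == "__main__":
    gt = [10.0, 20.0, 30.0]    # made-up counts for illustration
    pred = [12.0, 18.0, 33.0]
    print(mae(gt, pred), mse(gt, pred))
```

MAE reflects the average accuracy of the estimates, while the squared term in MSE penalizes large errors on individual images more heavily and so reflects robustness.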

5.2 Results

We report experimental results on three datasets, and provide comprehensive comparisons with other previous methods. Our SCNet consistently delivers high performances and exceeds recent state-of-the-art methods in most cases, which shows the great effectiveness of our SCNet as a simple but powerful solution for crowd counting. Illustrative results are visualized in Fig 3.

ShanghaiTech.

The ShanghaiTech dataset [Zhang et al.(2016)Zhang, Zhou, Chen, Gao, and Ma] consists of two subsets, Part A and Part B, as shown in Table 1. We evaluate our network on both subsets. The results and comparison with other methods are reported in Table 2. Our SCNet achieves the lowest MAE on Part A among all compared methods, and it produces the lowest MAE and MSE on Part B, outperforming the other methods by large margins.

Method | ShanghaiTech Part A MAE | ShanghaiTech Part A MSE | ShanghaiTech Part B MAE | ShanghaiTech Part B MSE | UCF CC 50 MAE | UCF CC 50 MSE
--- | --- | --- | --- | --- | --- | ---
Idrees et al. | - | - | - | - | 419.5 | 541.6
Zhang et al. | 181.8 | 277.7 | 32.0 | 49.8 | 467.0 | 498.5
Marsden et al. | 126.5 | 173.5 | 23.8 | 33.1 | 338.6 | 424.5
MCNN | 110.2 | 173.2 | 26.4 | 41.3 | 377.6 | 509.1
Cascaded-MTL | 101.3 | 152.4 | 20.0 | 31.1 | 322.8 | 397.9
Switching-CNN | 90.4 | 135.0 | 21.6 | 33.4 | 318.1 | 439.2
CP-CNN | 73.6 | 106.4 | 20.1 | 30.1 | 295.8 | 320.9
Ours | 71.9 | 117.9 | 9.3 | 14.4 | 280.5 | 332.8

Table 2: Estimation errors on the ShanghaiTech dataset and the UCF CC 50 dataset

Method | Scene1 | Scene2 | Scene3 | Scene4 | Scene5 | Average
--- | --- | --- | --- | --- | --- | ---
Chen et al. | 2.1 | 55.9 | 9.6 | 11.3 | 3.4 | 16.5
Zhang et al. | 9.8 | 14.1 | 14.3 | 22.2 | 3.7 | 12.9
MCNN | 3.4 | 20.6 | 12.9 | 13.0 | 8.1 | 11.6
Switching-CNN | 4.4 | 15.7 | 10.0 | 11.0 | 5.9 | 9.4
CP-CNN | 2.9 | 14.7 | 10.5 | 10.4 | 5.8 | 8.86
Ours | 1.8 | 9.6 | 14.2 | 13.3 | 3.2 | 8.4

Table 3: The MAE of the WorldExpo’10 dataset

UCF CC 50.

The UCF CC 50 dataset [Idrees et al.(2013)Idrees, Saleemi, Seibert, and Shah] contains only 50 images collected from diverse scenes with varying perspective and a wide range of densities, which makes the dataset extremely challenging. To cope with the extremely limited number of images, we perform 5-fold cross-validation following the standard setting in [Idrees et al.(2013)Idrees, Saleemi, Seibert, and Shah]. The results and comparisons of our method and other recent methods are shown in Table 2. Again, our SCNet achieves the lowest MAE with a remarkable improvement over the compared methods, and an MSE that is very competitive with the state of the art.

WorldExpo’10.

Frames of the WorldExpo’10 dataset [Zhang et al.(2015)Zhang, Li, Wang, and Yang] are from 108 different scenes with an annotated ROI for each of the scenes. The training frames are taken from 103 scenes and the test frames from the remaining 5 scenes. We follow the settings in [Zhang et al.(2015)Zhang, Li, Wang, and Yang] by only considering the ROI regions. The results are shown in Table 3. Our SCNet delivers the lowest MAE in 3 of the 5 test scenes, and also achieves the best performance in terms of average MAE.

We have also conducted an ablation study to show the effectiveness of our new paradigm of data preparation. Specifically, we perform experiments on the ShanghaiTech Part B dataset by individually removing online sampling and multi-scale training. The experimental results are reported in Table 4, which shows that both online sampling and multi-scale training improve the performance in crowd counting.

Method | MAE | MSE
--- | --- | ---
SCNet | 10.6 | 16.7
SCNet+Online sampling | 9.8 | 15.0
SCNet+Online sampling+Multi-scale training | 9.3 | 14.4

Table 4: Estimation errors of different configurations of our methods on ShanghaiTech Part B dataset.

Figure 3: Illustrative results on three datasets. Original images, ground truth density maps, and the estimations of SCNet are shown from the top row to the bottom row.

6 Conclusion

In this paper, we have presented the single-column counting network (SCNet) for crowd counting via density estimation. We propose the residual fusion module, which adopts dilated convolution for feature extraction at multiple scales, and the pyramid pooling module for efficient feature aggregation. To balance the accuracy of pixel-wise density estimation and computational cost, an efficient sub-pixel convolutional module is adopted to recover feature resolution and encourage spatial information to be encoded in the channel dimension without any parameters. To make full use of the limited training data, we introduce a training setting based on online sampling and multi-scale training, which boosts the robustness of our network and provides a principled setting for the crowd counting task. The experimental results on three benchmarks demonstrate that SCNet is a powerful tool for crowd counting and density estimation.

7 Acknowledgement

This paper was supported in part by the National Key Research and Development Program of China under Grant 2016YFB1200100, the National Natural Science Foundation of China under Grant 91538204 and Grant 61425014, the Foundation for Innovative Research Groups of the National Natural Science Foundation of China under Grant 61521091, and National Science Foundation.

References

  • [Boominathan et al.(2016)Boominathan, Kruthiventi, and Babu] Lokesh Boominathan, Srinivas SS Kruthiventi, and R Venkatesh Babu. Crowdnet: A deep convolutional network for dense crowd counting. In Proceedings of the 2016 ACM on Multimedia Conference, pages 640–644. ACM, 2016.
  • [Chan and Vasconcelos(2009)] Antoni B Chan and Nuno Vasconcelos. Bayesian poisson regression for crowd counting. In Computer Vision, 2009 IEEE 12th International Conference on, pages 545–551. IEEE, 2009.
  • [Chan et al.(2008)Chan, Liang, and Vasconcelos] Antoni B Chan, Zhang-Sheng John Liang, and Nuno Vasconcelos. Privacy preserving crowd monitoring: Counting people without people models or tracking. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–7. IEEE, 2008.
  • [Chen et al.(2012)Chen, Loy, Gong, and Xiang] Ke Chen, Chen Change Loy, Shaogang Gong, and Tony Xiang. Feature mining for localised crowd counting. In BMVC, volume 1, page 3, 2012.
  • [Chen et al.(2013)Chen, Gong, Xiang, and Loy] Ke Chen, Shaogang Gong, Tao Xiang, and Chen Change Loy. Cumulative attribute space for age and crowd density estimation. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2467–2474. IEEE, 2013.
  • [Chen et al.(2017)Chen, Papandreou, Schroff, and Adam] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
  • [Ciregan et al.(2012)Ciregan, Meier, and Schmidhuber] Dan Ciregan, Ueli Meier, and Jürgen Schmidhuber. Multi-column deep neural networks for image classification. In Computer vision and pattern recognition (CVPR), 2012 IEEE conference on, pages 3642–3649. IEEE, 2012.
  • [Dalal and Triggs(2005)] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 886–893. IEEE, 2005.
  • [Enzweiler and Gavrila(2009)] Markus Enzweiler and Dariu M Gavrila. Monocular pedestrian detection: Survey and experiments. IEEE transactions on pattern analysis and machine intelligence, 31(12):2179–2195, 2009.
  • [Felzenszwalb et al.(2010)Felzenszwalb, Girshick, McAllester, and Ramanan] Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence, 32(9):1627–1645, 2010.
  • [He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [Idrees et al.(2013)Idrees, Saleemi, Seibert, and Shah] Haroon Idrees, Imran Saleemi, Cody Seibert, and Mubarak Shah. Multi-source multi-scale counting in extremely dense crowd images. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2547–2554. IEEE, 2013.
  • [Leibe et al.(2005)Leibe, Seemann, and Schiele] Bastian Leibe, Edgar Seemann, and Bernt Schiele. Pedestrian detection in crowded scenes. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 878–885. IEEE, 2005.
  • [Lempitsky and Zisserman(2010)] Victor Lempitsky and Andrew Zisserman. Learning to count objects in images. In Advances in neural information processing systems, pages 1324–1332, 2010.
  • [Loy et al.(2013)Loy, Chen, Gong, and Xiang] Chen Change Loy, Ke Chen, Shaogang Gong, and Tao Xiang. Crowd counting and profiling: Methodology and evaluation. In Modeling, Simulation and Visual Analysis of Crowds, pages 347–382. Springer, 2013.
  • [Marsden et al.(2016)Marsden, McGuiness, Little, and O’Connor] Mark Marsden, Kevin McGuiness, Suzanne Little, and Noel E O’Connor. Fully convolutional crowd counting on highly congested scenes. arXiv preprint arXiv:1612.00220, 2016.
  • [Miao et al.(2018)Miao, Zhen, Liu, Deng, Athitsos, and Huang] Xin Miao, Xiantong Zhen, Xianglong Liu, Cheng Deng, Vassilis Athitsos, and Heng Huang. Direct shape regression networks for end-to-end face alignment. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5040–5049, 2018.
  • [Onoro-Rubio and López-Sastre(2016)] Daniel Onoro-Rubio and Roberto J López-Sastre. Towards perspective-free object counting with deep learning. In European Conference on Computer Vision, pages 615–629. Springer, 2016.
  • [Paragios and Ramesh(2001)] Nikos Paragios and Visvanathan Ramesh. A mrf-based approach for real-time subway monitoring. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, volume 1, pages I–I. IEEE, 2001.
  • [Pham et al.(2015)Pham, Kozakaya, Yamaguchi, and Okada] Viet-Quoc Pham, Tatsuo Kozakaya, Osamu Yamaguchi, and Ryuzo Okada. Count forest: Co-voting uncertain number of targets using random forest for crowd density estimation. In Proceedings of the IEEE International Conference on Computer Vision, pages 3253–3261, 2015.
  • [Sam et al.(2017)Sam, Surya, and Babu] Deepak Babu Sam, Shiv Surya, and R Venkatesh Babu. Switching convolutional neural network for crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 6, 2017.
  • [Shang et al.(2016)Shang, Ai, and Bai] Chong Shang, Haizhou Ai, and Bo Bai. End-to-end crowd counting via joint learning local and global count. In Image Processing (ICIP), 2016 IEEE International Conference on, pages 1215–1219. IEEE, 2016.
  • [Shi et al.(2016)Shi, Caballero, Huszár, Totz, Aitken, Bishop, Rueckert, and Wang] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1874–1883, 2016.
  • [Sindagi and Patel(2017a)] Vishwanath A Sindagi and Vishal M Patel. Cnn-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In Advanced Video and Signal Based Surveillance (AVSS), 2017 14th IEEE International Conference on, pages 1–6. IEEE, 2017a.
  • [Sindagi and Patel(2017b)] Vishwanath A Sindagi and Vishal M Patel. Generating high-quality crowd density maps using contextual pyramid cnns. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 1879–1888. IEEE, 2017b.
  • [Sindagi and Patel(2017c)] Vishwanath A Sindagi and Vishal M Patel. A survey of recent advances in cnn-based single image crowd counting and density estimation. Pattern Recognition Letters, 2017c.
  • [Viola and Jones(2004)] Paul Viola and Michael J Jones. Robust real-time face detection. International journal of computer vision, 57(2):137–154, 2004.
  • [Walach and Wolf(2016)] Elad Walach and Lior Wolf. Learning to count with cnn boosting. In European Conference on Computer Vision, pages 660–676. Springer, 2016.
  • [Wu and Nevatia(2007)] Bo Wu and Ram Nevatia. Detection and tracking of multiple, partially occluded humans by bayesian combination of edgelet based part detectors. International Journal of Computer Vision, 75(2):247–266, 2007.
  • [Yu and Koltun(2015)] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
  • [Zhang et al.(2015)Zhang, Li, Wang, and Yang] Cong Zhang, Hongsheng Li, Xiaogang Wang, and Xiaokang Yang. Cross-scene crowd counting via deep convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 833–841. IEEE, 2015.
  • [Zhang et al.(2016)Zhang, Zhou, Chen, Gao, and Ma] Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, and Yi Ma. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 589–597, 2016.
  • [Zhao et al.(2017)Zhao, Shi, Qi, Wang, and Jia] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages 2881–2890, 2017.
  • [Zhen et al.(2018a)Zhen, Yu, He, and Li] Xiantong Zhen, Mengyang Yu, Xiaofei He, and Shuo Li. Multi-target regression via robust low-rank learning. IEEE transactions on pattern analysis and machine intelligence, 40(2):497–504, 2018a.
  • [Zhen et al.(2018b)Zhen, Yu, Zheng, Nachum, Bhaduri, Laidley, and Li] Xiantong Zhen, Mengyang Yu, Feng Zheng, Ilanit Ben Nachum, Mousumi Bhaduri, David Laidley, and Shuo Li. Multitarget sparse latent regression. IEEE transactions on neural networks and learning systems, 29(5):1575–1586, 2018b.