Crowd counting is the task of predicting the number of people present in an image, and it has attracted growing interest in the last decade due to its utility in real-world use cases. The computer vision community has tackled this task in a variety of ways: early works either counted based on the outputs of a body or head detector [1, 2, 3] or learned a mapping from the global or local features of an image to the actual count [4, 5, 6]. More recently, thanks to the ability of convolutional neural networks (CNNs) to learn local patterns, works have started to learn density maps that predict not only the count, but also the spatial extent of the crowd [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18].
Despite this progress, crowd counting remains an extremely difficult task, due to background clutter, heavy occlusions and scale variations. Of these, scale is the issue that has received the largest amount of attention in the recent literature [7, 10, 13, 11, 9, 8, 12, 18, 14].
In this paper, we tackle two notions of scale: (i) variation in the distance from the camera (fig. 1a-b) and (ii) variation in the size of the image plane projection, in other words variation in the image resolution (fig. 1c-d). We address the latter problem with a simple, yet effective image size regularization approach. For the former issue, we propose a novel scale-aware deep convolutional neural network. The hierarchical structure of convolutional neural networks progressively expands the receptive field of the network feature maps, implicitly capturing information at different scales. Inspired by the skip branches in FCN [19] and SSD [20], we propose to directly generate multiple density maps from these intermediate feature maps. As the feature map generated by the last convolutional layer has the largest receptive field, it carries high-level semantic information that can be used to differentiate foreground from background and to localize large heads; on the other hand, the feature maps generated by the intermediate layers are more accurate and robust for counting extremely small heads (i.e., dense crowds), and they contain important details about the spatial layout of the people and low-level texture patterns.
In order to aggregate the density maps generated from different layers of our network, we propose a novel soft attention mechanism that learns a set of gating masks, one for each map. Our masks learn to attend to large heads from the density map predicted by the last convolutional layer and to smaller ones from earlier layers. While this can be trained by only providing supervision to the final density estimate, we found that performance improves by supervising the intermediate density estimates as well. We propose a new scale-aware loss function to further regularize our multi-scale estimates and guide them to specialize on a particular head size. Furthermore, as head size information is not available in any crowd counting dataset, we also propose a novel approach to automatically estimate it. Our approach combines the geometry-adaptive technique of [7] with a new bounding-box-adaptive technique.
In our experiments we show that our approach achieves state-of-the-art results on four major public crowd counting datasets: UCF-QNRF [17], ShanghaiTech A & B [7] and UCF_CC_50 [21], by a substantial margin (24% relative MAE improvement on UCF-QNRF, 2.5-9% on the others). Moreover, in our ablation study we analyze the density maps generated by different layers of our network and show that each specializes on different scale variations.
To summarize, we make the following contributions: (i) we propose a new network architecture that generates multi-scale density maps from its intermediate layers (sec. 3); (ii) we propose a new scale-aware attention mechanism to aggregate these maps into our final prediction (sec. 3.2); (iii) we propose a new scale-aware loss function to further help regularize these maps during training; and (iv) we propose a simple, yet effective technique to estimate the size of each head in an image, in a completely automatic way.
2 Related work
Multi-scale models for crowd counting.
Crowd counting datasets contain very large variations in person size, due to large perspective changes. In order to address this issue, many recent works on crowd counting have focused on learning multi-scale models.
Most previous works use a multi-column architecture [7, 10, 13, 11, 9, 8, 22]. Zhang et al. trained a custom network with three CNN columns, each with a different receptive field to capture a specific range of head sizes (MCNN [7]). Running three CNN columns was however slow, and Sam et al. proposed to predict which column to run for each input image patch (Switch-CNN [10]). Later, Sam et al. further extended their previous work by training a mixture of experts (each one equivalent to a column) in an incrementally growing fashion (IG-CNN [13]). Furthermore, Sindagi et al. proposed a new architecture where MCNN is enriched with two additional columns capturing global and local context (CP-CNN [11]). Differently, instead of designing each column with a different receptive field, Boominathan et al. proposed using columns of different depths, where the deep CNN captures large crowds and the shallow CNN smaller ones (CrowdNet [8]). Finally, Onoro-Rubio et al. (Hydra CNN [9]) and Kang et al. (AFS-FCN [22]) represented columns as pyramid levels over image patches at multiple scales (the former) or over the full image fed to the same network multiple times at different resolutions (the latter). While all these multi-column architectures have shown promising results, they present several disadvantages: they have a large number of model parameters, which often results in training difficulties, and they are slow at inference, as multiple CNNs need to be run.
To overcome these limitations, recent works have focused on multi-scale, single-column architectures [12, 18, 14]. Zhang et al. proposed an architecture that combines the feature maps of two layers through a skip connection (saCNN [12]). Cao et al. proposed an encoder-decoder network, where the encoder learns scale diversity in its features by using an aggregation module that combines filters of different sizes (SANet [18]). Finally, Li et al. replaced some pooling layers in the CNN with dilated convolutional filters at different rates, which enlarge the receptive field of the feature maps without losing spatial resolution (CSRNet [14]).
In this paper, we present a single-column network architecture that mimics multi-column ones by predicting multi-scale density maps from different layers of the network. Our architecture retains the multi-column approaches' ability to predict multi-scale density maps, yet it is much faster to compute and requires far fewer parameters. Moreover, differently from previous multi-column approaches, our architecture aggregates its predictions using a novel attention-based mechanism that selects each column based on the size of each head in an image.
Attention models.
Attention models have been widely used for many computer vision tasks, like image classification [23, 24], object detection [25, 26], semantic segmentation [27, 28], saliency detection [29] and, very recently, crowd counting [15]. These models work by learning an intermediate attention map that is used to select the most relevant piece of information for visual analysis. The works most similar to ours are those of Chen et al. [27] and Kang et al. [22]. Both approaches extract multi-scale features from several resized input images and use an attention mechanism to weight the importance of each pixel of each feature map. One clear drawback of these approaches is that their inference is slow, as each test image needs to be resized and fed into the CNN model multiple times. Instead, our approach is much faster: it requires a single input image and a single pass through the model, as our multi-scale features are generated by pooling information from different layers of the same network instead of multiple passes through the same network.
3 Our approach
In sec. 3.1 we present our baseline network for estimating the density map and its training loss. In sec. 3.2 we describe how we extend this baseline with our novel multi-branch density prediction architecture and an attention mechanism for selecting between these branches. In sec. 3.3 we describe our novel scale-aware loss function, which guides each density prediction branch to specialize on a particular head size. This loss requires a head size estimate during training. As head size information is not available in any public dataset for crowd counting, in sec. 3.4 we present our novel approach to automatically estimate it.
3.1 Baseline network for crowd counting
Like other density-based approaches [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18] for crowd counting, given an image, we feed it to a fully convolutional network and estimate a density map (fig. 2). Then, we sum all the values in this map to obtain the final count.
Our baseline network consists of three components: a backbone network, a regression head and an upsampling layer. The image is fed into the backbone, which progressively downsamples the spatial resolution to produce a feature map with a large receptive field, but at a fraction of the image resolution. These features are fed into the regression head to produce a density map. Then, bi-linear upsampling is used to bring the density estimates back to the original image resolution.
During training, we use a pixel-wise Euclidean loss on the density map output:

$L_{den} = \sum_{p} \| \hat{D}(p) - D(p) \|_2^2$,   (1)

where $\hat{D}(p)$ is the estimated density map at pixel location $p$ predicted by our model and $D(p)$ is its corresponding ground truth value. We follow the method of MCNN [7] to generate the ground truth density map $D$ and blur each head point in an image with a Gaussian kernel.
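As a concrete sketch of this pipeline, the MCNN-style ground-truth generation and the loss of eq. 1 can be written in a few lines of numpy. Function names, kernel size and sigma below are illustrative choices, not the paper's exact settings:

```python
import numpy as np

def gaussian_kernel(size, sigma):
    """Normalized 2D Gaussian (sums to 1, so each head adds exactly 1 to the count)."""
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / k.sum()

def ground_truth_density(head_points, shape, ksize=15, sigma=4.0):
    """Blur each annotated head point with a Gaussian kernel, as in MCNN."""
    D = np.zeros(shape, dtype=np.float64)
    k = gaussian_kernel(ksize, sigma)
    r = ksize // 2
    H, W = shape
    for (y, x) in head_points:
        # Clip the kernel stamp against the image borders.
        y0, y1 = max(0, y - r), min(H, y + r + 1)
        x0, x1 = max(0, x - r), min(W, x + r + 1)
        ky0, kx0 = y0 - (y - r), x0 - (x - r)
        D[y0:y1, x0:x1] += k[ky0:ky0 + (y1 - y0), kx0:kx0 + (x1 - x0)]
    return D

def euclidean_loss(pred, gt):
    """Pixel-wise Euclidean loss of eq. 1."""
    return float(((pred - gt) ** 2).sum())

# Two heads placed away from the border each contribute exactly 1 to the count.
D = ground_truth_density([(32, 32), (20, 40)], (64, 64))
count = D.sum()
```

Summing the density map recovers the annotated count, which is the property the network is trained to reproduce.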
3.2 Scale-aware soft attention masks
Our approach enriches the baseline with $S$ density estimates $\hat{D}_1, \dots, \hat{D}_S$, with the idea that each map will be specialized to perform well on a specific range of head sizes. Our network estimates all density maps in a single forward pass by branching the features from intermediate layers of our backbone and sending each into its own regression head. Then, to aggregate these density estimates and produce a single density estimate $\hat{D}$, we use a soft attention mechanism that learns a set of gating masks $A_1, \dots, A_S$, one for each branch. Each mask is used to re-weight the pixels of its corresponding density estimate to produce the final density estimate as follows:

$\hat{D} = \sum_{s=1}^{S} A_s \odot \hat{D}_s$,   (2)

where $\odot$ refers to the element-wise product. The attention masks are generated by the attention block, which takes as input the last feature map from our backbone network, passes it through an attention head and produces $S$-channel logit maps $U_1, \dots, U_S$. These are then fed to a softmax layer to produce the masks:

$A_s(p) = \exp(U_s(p)) \, / \, \sum_{s'=1}^{S} \exp(U_{s'}(p))$,   (3)

where $A_s(p)$ and $U_s(p)$ are the values of the corresponding maps at pixel location $p$. The softmax ensures the attention masks act as a weighted average over the density predictions.
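The attention-weighted aggregation of eq. 2 can be sketched in numpy as follows; shapes and names are illustrative, and in the real network the logit maps come from the attention head rather than random noise:

```python
import numpy as np

def softmax_masks(logits):
    """logits: (S, H, W) per-branch logit maps -> masks that sum to 1 at every pixel."""
    z = logits - logits.max(axis=0, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

def aggregate(density_maps, logits):
    """Final density as the per-pixel attention-weighted sum over branch estimates."""
    A = softmax_masks(logits)
    return (A * density_maps).sum(axis=0), A

rng = np.random.default_rng(0)
S, H, W = 3, 4, 4
D_s = rng.random((S, H, W))        # stand-in multi-scale density estimates
U = rng.standard_normal((S, H, W)) # stand-in attention logits
D_final, A = aggregate(D_s, U)
```

Because the masks form a convex combination at each pixel, the final estimate is always bounded by the per-pixel minimum and maximum of the branch predictions.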
We train this network end-to-end with the same loss used in sec. 3.1, applied only to the final density estimate $\hat{D}$ (eq. 1). The intuition is that the attention masks $A_s$ will learn to attend to large heads from the density map predicted by the last branch ($\hat{D}_S$), as it is derived from a feature map with a large receptive field and thus should perform well on large heads. Conversely, smaller heads will be attended from the density estimates of earlier branches ($\hat{D}_1, \dots, \hat{D}_{S-1}$), as these branches have smaller receptive fields and higher spatial resolution, and thus capture finer details in the image.
3.3 Scale-aware loss regularization
In eq. 2, the error signal propagated back to the $s$-th branch is modulated by the attention mask $A_s$, i.e. $\partial L / \partial \hat{D}_s = A_s \odot \partial L / \partial \hat{D}$. Instead of propagating the whole error signal back to every branch, these masks force each branch to focus only on improving the crowd counting accuracy in the areas selected by its corresponding mask. While fig. 3 shows that the attention masks mostly attend to heads of different sizes, as intended, the network has no explicit regularization that enforces this to happen.
Here we present a new scale-aware loss function to further regularize each branch estimate and guide them to specialize on a particular head size. To achieve this, we add a scale-aware loss to each branch, which measures the distance between the branch’s predicted density map and our ground truth density map only in areas of the image with heads in a target size range for that density map. In this way, each branch only needs to perform well on its scale.
For each ground truth head point $j$ we estimate the head size $d_j$ and assign it to one of $S$ head size bins. The method for predicting $d_j$ is described in sec. 3.4. We then generate $S$ scale supervision masks $M_s$ by first setting all their values to 0. Then, for each head $j$ assigned to bin $s$, we set a circular region with a diameter equal to $d_j$ and center at the head point on the scale mask $M_s$ to 1. Finally, we set all background pixels to 1 in each mask. This supervision guides each map to correctly predict the heads of its scale and the background, but does not give any penalty to heads out of its scale. We compute the new scale-aware loss as follows:

$L_s = \sum_{p} M_s(p) \, \| \hat{D}_s(p) - D(p) \|_2^2$.   (4)
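A minimal sketch of this mask construction, assuming the three size bins reported in our implementation details (the helper name and the binary-mask representation are ours):

```python
import numpy as np

def scale_masks(heads, sizes, bins, shape):
    """Supervision masks M_s: 1 inside a circle of diameter d_j around each head whose
    size falls in bin s, and 1 on background (pixels covered by no head) in every mask."""
    S = len(bins)
    H, W = shape
    M = np.zeros((S, H, W), dtype=np.float64)
    yy, xx = np.mgrid[0:H, 0:W]
    covered = np.zeros((H, W), dtype=bool)  # union of all head circles
    for (y, x), d in zip(heads, sizes):
        circle = (yy - y) ** 2 + (xx - x) ** 2 <= (d / 2.0) ** 2
        covered |= circle
        for s, (lo, hi) in enumerate(bins):
            if lo <= d < hi:  # head belongs to bin s
                M[s][circle] = 1.0
    M[:, ~covered] = 1.0  # background is supervised in every mask
    return M

bins = [(0, 10), (10, 20), (20, 32)]
# One small head (d=6, bin 0) and one large head (d=24, bin 2).
M = scale_masks([(10, 10), (30, 30)], [6.0, 24.0], bins, (40, 40))
```

Each branch's loss is then simply the masked Euclidean distance, so a branch is never penalized on heads outside its bin.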
Our final loss then becomes the combination of the loss on the final density (eq. 1) and our scale-aware losses on the intermediate layers of the network:

$L = L_{den} + \lambda \sum_{s=1}^{S} L_s$,   (5)

where $\lambda$ refers to the regularization weight.
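Putting the pieces together, the total training objective can be sketched as follows; the default value of the regularization weight here is a placeholder, not the one used in our experiments:

```python
import numpy as np

def scale_aware_total_loss(pred_final, preds_branch, gt, masks, lam=0.1):
    """Loss on the final density plus lambda-weighted per-branch losses, each
    restricted by its scale supervision mask (lam is a hypothetical value)."""
    L = ((pred_final - gt) ** 2).sum()  # eq. 1 on the aggregated estimate
    for Ds, Ms in zip(preds_branch, masks):
        L += lam * (Ms * (Ds - gt) ** 2).sum()  # masked per-branch term
    return float(L)

gt = np.ones((4, 4))
branch = [np.ones((4, 4)) for _ in range(3)]
masks = [np.ones((4, 4)) for _ in range(3)]
loss_perfect = scale_aware_total_loss(gt, branch, gt, masks)       # all-correct case
loss_shifted = scale_aware_total_loss(gt + 1.0, branch, gt, masks) # final map off by 1
```

With a perfect prediction the loss is zero; shifting only the final map by one adds exactly one unit of squared error per pixel.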
3.4 Estimating the size of each head
The scale-aware loss regularization presented in the previous section requires an estimate of the diameter $d_j$ of each head $j$; however, head size is not available in any crowd counting dataset. In this section we present a new method to estimate it. We combine the popular geometry-adaptive technique [7] with a new bounding-box-adaptive technique that estimates head sizes based on the output of a head detector. More specifically, given head $j$, we combine the two estimates $d_j^{box}$ and $d_j^{geo}$, defined below, into our final size estimate $d_j$.
We compute $d_j^{box}$ by first running a person head detector. Then, for each ground truth head point, we estimate its scale as the median size prediction over the $K$ nearest head detections:

$d_j^{box} = \mathrm{median}_{k=1,\dots,K} \, (w_k + h_k)/2$,   (6)

where $b_1, \dots, b_K$ are the $K$ detected bounding boxes with the closest centers to head $j$, and $w_k$ and $h_k$ are the width and height of bounding box $b_k$, respectively. This estimate is only as good as the detector. We found that our detector works well most of the time, but it fails when people are too small and too close together. Thus, we augment this prediction with the geometry-adaptive approach ($d_j^{geo}$) from Zhang et al. [7]. For each head, this measure is computed as half the mean distance to the $K$ nearest heads:

$d_j^{geo} = \frac{1}{2K} \sum_{k=1}^{K} \| x_j - x_{j_k} \|_2$,   (7)

where $K$ is the number of neighbors, $x_j$ is the location of the ground truth head annotation $j$ and $x_{j_k}$ is the location of its $k$-th nearest annotated head. This measure works well for crowded scenes but not when people are further apart, thus complementing our $d_j^{box}$ measure well.
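Both estimators are simple enough to sketch directly in numpy. The helper names and the (x, y, w, h) box format below are our own conventions:

```python
import numpy as np

def d_box(head, boxes, K=3):
    """Bounding-box-adaptive size: median of (w+h)/2 over the K detections
    whose centers are closest to the ground-truth head point."""
    centers = np.array([[x + w / 2.0, y + h / 2.0] for (x, y, w, h) in boxes])
    dist = np.linalg.norm(centers - np.asarray(head, dtype=float), axis=1)
    nearest = np.argsort(dist)[:K]
    return float(np.median([(boxes[k][2] + boxes[k][3]) / 2.0 for k in nearest]))

def d_geo(point, all_points, K=3):
    """Geometry-adaptive size of Zhang et al.: half the mean distance
    to the K nearest annotated heads."""
    others = np.array([p for p in all_points if tuple(p) != tuple(point)], dtype=float)
    dist = np.sort(np.linalg.norm(others - np.asarray(point, dtype=float), axis=1))[:K]
    return float(0.5 * dist.mean())

# Toy detections: head at (5, 5) is surrounded by boxes of sizes 10, 12 and 8.
boxes = [(0, 0, 10, 10), (20, 0, 12, 12), (40, 0, 8, 8), (60, 0, 30, 30)]
size_box = d_box((5, 5), boxes, K=3)

# Toy annotations: two neighbors at distance 4, one far outlier.
pts = [(0, 0), (0, 4), (4, 0), (100, 100)]
size_geo = d_geo((0, 0), pts, K=2)
```

The median in `d_box` makes the estimate robust to a single spurious detection, while `d_geo` ignores detections entirely and relies only on annotation geometry.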
4.1 Evaluation metrics
In crowd counting, the count error is measured by two metrics: Mean Absolute Error (MAE) and Mean Squared Error (MSE), which are defined as follows:

$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} | \hat{C}_i - C_i |, \qquad \mathrm{MSE} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} ( \hat{C}_i - C_i )^2 }$,

where $N$ is the number of test images, $\hat{C}_i$ the predicted count for image $i$ and $C_i$ the ground-truth count.
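These metrics are straightforward to implement; note that, as is conventional in the crowd counting literature, the reported MSE includes the square root (i.e., it is a root mean squared error):

```python
import numpy as np

def mae(pred, gt):
    """Mean Absolute Error over per-image counts."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    return float(np.mean(np.abs(pred - gt)))

def mse(pred, gt):
    """(Root) Mean Squared Error over per-image counts, as reported in crowd counting."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

m1 = mae([10, 20, 30], [12, 18, 30])  # mean of |−2|, |2|, |0|
m2 = mse([10, 20, 30], [12, 18, 30])  # sqrt of mean of 4, 4, 0
```

MSE penalizes large per-image errors more heavily than MAE, which is why the two are always reported together.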
4.2 Datasets
UCF-QNRF [17] (2018) is the latest released dataset and it consists of 1535 challenging images from Flickr, Web Search and Hajj footage. The number of people in an image (i.e., the count) varies from 49 to 12,865, making this the dataset with the largest crowd variation. Furthermore, the average image resolution is also larger compared to all other datasets, causing the absolute size of a person's head to vary drastically, from a few pixels to more than 1500.
ShanghaiTech [7] (2016) consists of two parts: A and B. Part A contains 482 images capturing dense scenes like stadiums and parades; its counts vary from 33 to 3139. Part B contains 716 images of street scenes from fixed cameras capturing sparser crowds; its counts vary from 12 to 578.
UCF_CC_50 [21] (2013) consists of 50 black and white, low resolution images, and its counts vary from 94 to 4543. We follow the dataset instructions and evaluate our results using 5-fold cross-validation.
4.3 Implementation details
For our experiments, we start with a VGG-16 network [30] pre-trained on the ImageNet classification challenge [31]. We use three branches ($S = 3$) from the VGG features conv3_3, conv4_3 and conv5_3 of blocks 3, 4 and 5, respectively. Our regression head consists of two convolutions with 128 and 64 channels each, followed by a final convolutional regression layer. For the scale-aware loss regularization, we empirically split the person scale space into the following three bins based on the head size $d_j$: [0, 10], [10, 20], [20, 32]. The network is trained with an initial learning rate of 1e-4 and a batch size of 64, and the inputs to the network are fixed-size crops randomly sampled from different locations in the image. At test time we do not extract image crops and instead feed the whole image to the network. For all datasets, we follow their official training/test splits. Finally, we implemented our approach and conducted our experiments with the MXNet deep learning framework [33].
For both the geometry-adaptive and bounding-box-adaptive estimations, we set the number of neighbors $K$ to the same value. For the bounding-box-adaptive estimation, we trained a Faster R-CNN [34] head detector with a ResNet-50 backbone [35]. We used the same hyper-parameters as Detectron [36], but we reduced the smallest anchor box size from 32 pixels to 8, in order to be able to localize extremely small heads. We trained our detector on the combination of two public datasets: SCUT-HEAD [37] and Pascal-Parts [38]. SCUT-HEAD contains annotations for around 111k heads, which are visually similar to those in crowd counting images. Pascal-Parts, on the other hand, contains annotations for only 7.5k heads, but it offers a large selection of extremely useful and difficult backgrounds. We found the combination of these two complementary datasets to lead to strong detection performance (fig. 4).
4.4 Validation of our method
In this section we present incremental results of our model architecture and its components (sec. 3). We use the UCF-QNRF dataset for these experiments as it is the largest both in number of images and diversity in crowd count (sec. 4.2). Results discussed in this section are presented in table 1.
As a baseline, we train a VGG-16 backbone architecture with the loss of eq. 1 and the settings of sec. 3.1. This generates a single density map from the last convolutional layer of the network and achieves an MAE of 128.5 (table 1, row 2), the highest error across all entries in the table. Still, this simple baseline achieves competitive results, comparable to the state-of-the-art (table 3).
By enriching the baseline with three branches that predict multi-scale density maps and our novel scale-aware attention mechanism (sec. 3.2), the error decreases to 116.7 (table 1 row 3), which is a significant improvement. This indicates that (i), using multi-scale feature maps is beneficial for crowd counting and (ii), the inferred attention masks are performing well on aggregating multi-scale predictions from our multi-branch network.
Further adding our scale-aware loss regularization (sec. 3.3) brings an additional improvement, and the error decreases to 113.3 (table 1, row 4). This indicates that our regularization helps each branch output accurate density maps for the people within its assigned scale range, which collectively contributes to the accuracy improvement of the crowd estimate.
Adding image resolution regularization.
As mentioned earlier, scale variation due to image resolution can cause similar people to look considerably different. In fig. 1c-d we show an example of the same person represented by 20 pixels (c) vs. a few hundred pixels (d). These two instances look extremely different. While this intra-class variation can be learned during training, it sometimes exceeds the capability of the network. We observed this to be the case for the UCF-QNRF dataset: some of its images are 6k×9k pixels and contain heads of up to 1.5k×1.5k pixels, which are clearly outside the range of our network's receptive field. To overcome this resolution issue, we propose to down-sample large images to a maximum size of 1080p (i.e., 1920 pixels on the longer side). This simple normalization improves performance considerably, lowering our final MAE to 97.5 (table 1, row 5).
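The size cap is a pure pre-processing step; the target-size computation can be sketched as follows (the function name is ours):

```python
def capped_size(width, height, max_side=1920):
    """Shrink an image (preserving aspect ratio) so its longer side does not
    exceed max_side; images already small enough are left untouched."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / float(longest)
    return int(round(width * scale)), int(round(height * scale))

# A 6k x 9k UCF-QNRF image is reduced so the longer side becomes 1920 pixels.
w, h = capped_size(6000, 9000)
```

The actual resampling would then be done by any standard image library at this target size.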
4.5 Ablation study
In this section we explore some of our model components and analyze their outputs. As in the previous section, all the experiments are conducted on the UCF-QNRF dataset.
Our multi-scales density predictions.
Li et al. [14] showed that the three columns of MCNN [7] learned similar information instead of specializing to a specific scale. Here we investigate the predictions of our branches and to what extent they learn different scale information. As shown in fig. 3, branch 1 has stronger activations for smaller people, as it relates small-size people to low-level texture patterns. On the other hand, branches 2 and 3 make far fewer errors on medium and large people, as they operate on a larger receptive field. From this perspective, different branches learn complementary scale information for inferring person counts.
Attention masks $A_s$.
In fig. 3 we also show the attention masks generated by our approach. Interestingly, our network learns distinctive attention masks for each branch. In general, $A_3$ has higher weights for large-size people and background regions, while $A_1$ gives higher weights to small-scale people. However, without the scale-aware loss regularization, the masks learn to attend mostly to the density map predicted by branch 3 (i.e., the red region in $A_3$). After the scale-aware loss is used, $A_1$ and $A_2$ get higher weights for small and medium-size people, respectively. This demonstrates the effectiveness and usefulness of the proposed scale-aware loss regularization.
Aggregating multi-scale maps.
We compare our soft attention mechanism (sec. 3.2), used to aggregate our multi-scale density predictions, against other popular aggregation methods: 'average', which is popular in semantic segmentation [27], 'max', which is popular in human pose estimation [39], and 'concatenation+conv', which has been used in several multi-column works for crowd counting [7, 11, 12]. Results are presented in table 2. 'Max' produces the largest error, as it tends to over-estimate the count; 'concatenation+conv' and 'average' work better, but the best performing is our attention mechanism. This result shows the effectiveness of our scale-aware aggregation mechanism for fusing multi-scale density maps.
(Table 2 excerpt: Concatenation + Conv achieves 128.3 MAE and 210.1 MSE.)
Our head size estimation approach.
Finally, we present some visual results (fig. 4) for our head size estimation approach (sec. 3.4). The figure shows two images with the bounding boxes detected by our head detector and the corresponding head sizes estimated by the popular $d^{geo}$, our $d^{box}$ and our final $d$. For visualization purposes, we color each head based on its size, where dark red is used for the largest head in each map and dark blue for the smallest. $d^{geo}$ performs relatively well on very crowded scenes (fig. 4, bottom row), but it performs rather poorly on sparse regions with small heads far away from each other (fig. 4, top row). This is probably the reason why CSRNet, among other works, uses the geometry-adaptive Gaussian for the extremely dense ShanghaiTech A dataset, but a fixed Gaussian for all other crowd counting datasets. On the other hand, $d^{box}$ performs very well on both sparse and dense scenes, but it tends to predict slightly larger sizes than in reality for undetected heads (fig. 4, bottom image, end of the tail). By combining these two techniques into our novel $d$, we are able to overcome most of their limitations and produce highly accurate maps (fig. 4, last column).
4.6 Comparison to other crowd counting methods
We now compare our model against several approaches in the literature, on the datasets introduced in sec. 4.2. Results are presented in table 3 and fig. 5. Our baselines and our models are pre-trained on UCF-QNRF, as it helps performance slightly (e.g., on ShanghaiTech A the pre-training improves the MAE of our baseline from 72.5 to 68.0). Moreover, image resolution regularization was used only on the UCF-QNRF dataset, as the other ones do not contain many images larger than 1080p. Overall, our approach always performs better than the baseline, showing the importance of learning multi-scale features (fig. 5). Furthermore, our approach also outperforms all previous methods in the literature, on all datasets and all metrics. We observe the largest improvement on the UCF-QNRF dataset (24%: from 132 to 97.5 MAE), which is the largest dataset and the one with the largest variation in head size. Our model is clearly capable of handling such large scale variations and produces accurate results (fig. 5). Moreover, our approach also brings a moderate improvement on ShanghaiTech A and UCF_CC_50 (9% MAE) and a small improvement on ShanghaiTech B (2.5% MAE). These results clearly show the significance of our novel approach.
Finally, fig. 5 presents some visual results of our baseline and our approach. Interestingly, in addition to performing better on counting the number of people in an image, our approach also shows better localized predictions. Its density maps are much sharper than those output by the baseline, which tends to oversmooth regions with large crowds. This is especially evident in fig. 5d. It also validates our hypothesis that directly using low-level layers to output intermediate density maps is beneficial for localizing small-scale people, as these low-level feature maps have detailed spatial layout.
5 Conclusions
In this work, we proposed a novel multi-branch architecture that generates multi-scale density maps from its intermediate layers. To aggregate these density maps into our final prediction, we developed a new soft attention mechanism that learns a set of gating masks, one for each map. We further introduced a scale-aware loss to guide each branch to specialize on different scale ranges. Finally, we proposed a simple, yet effective technique to estimate the size of each head in an image. Our approach achieved state-of-the-art results on four challenging crowd counting datasets, on all evaluation metrics.
-  Bo Wu and Ram Nevatia. Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors. In ICCV, 2005.
-  Meng Wang and Xiaogang Wang. Automatic adaptation of a generic pedestrian detector to a specific traffic scene. In CVPR, 2011.
-  Mikel Rodriguez, Ivan Laptev, Josef Sivic, and Jean-Yves Audibert. Density-aware person detection and tracking in crowds. In ICCV, 2011.
-  Antoni B Chan, Zhang-Sheng John Liang, and Nuno Vasconcelos. Privacy preserving crowd monitoring: Counting people without people models or tracking. In CVPR, 2008.
-  Antoni B Chan and Nuno Vasconcelos. Bayesian poisson regression for crowd counting. In CVPR, 2009.
-  David Ryan, Simon Denman, Clinton Fookes, and Sridha Sridharan. Crowd counting using multiple local features. In DICTA, 2009.
-  Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, and Yi Ma. Single-image crowd counting via multi-column convolutional neural network. In CVPR, 2016.
-  Lokesh Boominathan, Srinivas SS Kruthiventi, and R Venkatesh Babu. Crowdnet: A deep convolutional network for dense crowd counting. In ACM MM, 2016.
-  Daniel Onoro-Rubio and Roberto J López-Sastre. Towards perspective-free object counting with deep learning. In ECCV, 2016.
-  Deepak Babu Sam, Shiv Surya, and R Venkatesh Babu. Switching convolutional neural network for crowd counting. In CVPR, 2017.
-  Vishwanath A Sindagi and Vishal M Patel. Generating high-quality crowd density maps using contextual pyramid cnns. In ICCV, 2017.
-  Lu Zhang and Miaojing Shi. Crowd counting via scale-adaptive convolutional neural network. In WACV, 2018.
-  Deepak Babu Sam, Neeraj N Sajjan, R Venkatesh Babu, and Mukundhan Srinivasan. Divide and grow: Capturing huge diversity in crowd images with incrementally growing cnn. In CVPR, 2018.
-  Yuhong Li, Xiaofan Zhang, and Deming Chen. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In CVPR, 2018.
-  Jiang Liu, Chenqiang Gao, Deyu Meng, and Alexander G Hauptmann. Decidenet: Counting varying density crowds through attention guided detection and density estimation. In CVPR, 2018.
-  Zan Shen, Yi Xu, Bingbing Ni, Minsi Wang, Jianguo Hu, and Xiaokang Yang. Crowd counting via adversarial cross-scale consistency pursuit. In CVPR, 2018.
-  Haroon Idrees, Muhmmad Tayyab, Kishan Athrey, Dong Zhang, Somaya Al-Maadeed, Nasir Rajpoot, and Mubarak Shah. Composition loss for counting, density map estimation and localization in dense crowds. In ECCV, 2018.
-  Xinkun Cao, Zhipeng Wang, Yanyun Zhao, and Fei Su. Scale aggregation network for accurate and efficient crowd counting. In ECCV, 2018.
-  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
-  Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. Ssd: Single shot multibox detector. In ECCV, 2016.
-  Haroon Idrees, Imran Saleemi, Cody Seibert, and Mubarak Shah. Multi-source multi-scale counting in extremely dense crowd images. In CVPR, 2013.
-  Di Kang and Antoni Chan. Crowd counting by adaptively fusing predictions from an image pyramid. In BMVC, 2018.
-  Tianjun Xiao, Yichong Xu, Kuiyuan Yang, Jiaxing Zhang, Yuxin Peng, and Zheng Zhang. The application of two-level attention models in deep convolutional neural network for fine-grained image classification. In CVPR, 2015.
-  Chunshui Cao, Xianming Liu, Yi Yang, Yinan Yu, Jiang Wang, Zilei Wang, Yongzhen Huang, Liang Wang, Chang Huang, Wei Xu, et al. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks. In ICCV, 2015.
-  Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. In ICLR, 2015.
-  Donggeun Yoo, Sunggyun Park, Joon-Young Lee, Anthony S Paek, and In So Kweon. Attentionnet: Aggregating weak directions for accurate object detection. In CVPR, 2015.
-  Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and Alan L Yuille. Attention to scale: Scale-aware semantic image segmentation. In CVPR, 2016.
-  Hanchao Li, Pengfei Xiong, Jie An, and Lingxue Wang. Pyramid attention network for semantic segmentation. In BMVC, 2018.
-  Nian Liu, Junwei Han, and Ming-Hsuan Yang. Picanet: Learning pixel-wise contextual attention for saliency detection. In CVPR, 2018.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.
-  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
-  Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems. In arXiv preprint arXiv:1512.01274, 2015.
-  Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  Ross Girshick, Ilija Radosavovic, Georgia Gkioxari, Piotr Dollár, and Kaiming He. Detectron. https://github.com/facebookresearch/detectron, 2018.
-  Dezhi Peng, Zikai Sun, Zirong Chen, Zirui Cai, Lele Xie, and Lianwen Jin. Detecting heads using feature refine net and cascaded multi-scale architecture. In ICPR, 2018.
-  Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, 2014.
-  Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
-  Vishwanath A Sindagi and Vishal M Patel. Cnn-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In AVSS, 2017.
-  Zenglin Shi, Le Zhang, Yun Liu, Xiaofeng Cao, Yangdong Ye, Ming-Ming Cheng, and Guoyan Zheng. Crowd counting with deep negative correlation learning. In CVPR, 2018.