
Encoder-Decoder Based Convolutional Neural Networks with Multi-Scale-Aware Modules for Crowd Counting

In this paper, we propose two modified neural network architectures, based on SFANet and SegNet respectively, for accurate and efficient crowd counting. Inspired by SFANet, the first model, called M-SFANet, is attached with two novel multi-scale-aware modules: ASPP and CAN. The encoder of M-SFANet is enhanced with ASPP, which contains parallel atrous convolutions with different sampling rates and is hence able to extract multi-scale features of the target object and incorporate larger context. To further deal with scale variation throughout an input image, we leverage the contextual module CAN, which adaptively encodes the scales of the contextual information. The combination yields an effective model for counting in both dense and sparse crowd scenes. Based on the SFANet decoder structure, the M-SFANet decoder has dual paths, one for density map generation and one for attention map generation. The second model, called M-SegNet, simply replaces the bilinear upsampling used in SFANet with the max unpooling originally from SegNet, yielding a faster model with competitive counting performance. Designed for high-speed surveillance applications, M-SegNet has no additional multi-scale-aware module so as not to increase complexity. Both models are encoder-decoder based architectures and are end-to-end trainable. We conduct extensive experiments on four crowd counting datasets and one vehicle counting dataset to show that these modifications yield algorithms that outperform some state-of-the-art crowd counting methods.





1 Introduction

Crowd counting is an important task due to its wide range of applications, such as public safety, surveillance monitoring, traffic control and intelligent transportation. However, it is a challenging computer vision task that is not trivial to solve efficiently, owing to heavy occlusion, perspective distortion, scale variation and diverse crowd distributions throughout real-world images. These problems are emphasized especially when the target objects are crowded. Some of the early methods

[12] treat crowd counting as a detection problem. Handcrafted features from multiple sources are also investigated in [17]. These approaches are not suitable when the target objects overlap each other, and the handcrafted features cannot properly handle the diversity of crowd distributions in input images. To take the characteristics of the crowd distribution into account, one should not develop models that predict only the number of people in the target image, because those characteristics are then neglected. Thus, more recent methods instead rely on a density map generated from the head-annotation ground truth. On the other hand, the authors of [28]

consider the density map as the likelihood describing “how the spatial pixels would be” given the annotation ground truth and propose a novel Bayesian loss. In our experiments, however, we use the density map ground truth as the learning target in order to isolate the accuracy improvements caused by the architecture modifications and to better compare with state-of-the-art methods. In the age of deep learning, Convolutional Neural Networks (CNNs) have been utilized to estimate accurate density maps. Viewing convolutional filters as sliding windows, CNNs extract features throughout the various regions of an input image, so the diversity of crowd distribution in the image is handled more properly. To cope with the head scale variation caused by camera perspective, previous works mostly make use of multi-column/multi-resolution based architectures

[45, 30, 4, 40] and achieve higher accuracy. However, the study in [23] shows that the features learned by each column of MCNN [45] are nearly identical and that such architectures are inefficient to train as networks go deeper. As opposed to multi-column architectures, a deep single-column network based on a truncated VGG16 [35] feature extractor and a decoder with dilated convolutional layers is proposed in [23] and achieves breakthrough counting performance on the ShanghaiTech [45] dataset. The proposed architecture demonstrates the strength of a VGG16 [35]

encoder pretrained on ImageNet

[10] for extracting higher-level semantic information and for transferring knowledge across vision tasks. Moreover, the study shows how to attach atrous convolutional layers to the network instead of adding more pooling layers, which cause loss of spatial information. Nevertheless, [27] raised the issue of using the same filters and pooling operations over the whole image. The authors of [27] pointed out that the receptive field size should change across the image due to perspective distortion. To deal with this problem, a scale-aware contextual module capable of extracting features over multiple receptive field sizes is also proposed in [27]; by design, the importance of each such extracted feature at every image location is learnable. Aside from crowd counting, object overlap is also a crucial problem for image segmentation. As a result, spatial pyramid pooling modules such as SPP [16] and ASPP [6] are used to capture contextual information at multiple scales. Encoder-decoder based CNNs [32, 1, 6] have had prior success in image segmentation thanks to their ability to reconstruct precise object boundaries. After the success of CSRNet [23], more variants of encoder-decoder networks have been proposed for crowd counting, such as [4, 40, 39, 26, 41]. Bridging the gap between image segmentation and crowd counting, SFANet [46] integrates a UNet[32]-like decoder with a dual-path structure [46] to predict the head regions among noisy background and then regress the head counts.

In this paper, we propose two modified networks for crowd counting that generate high-quality density maps. The first proposed model is called “M-SFANet” (Modified SFANet), in which the multi-scale-aware modules CAN [27] and ASPP [6] are additionally connected to the VGG16-bn [35] encoder of SFANet [46] to handle occlusion in input images by capturing the context around the target objects at different scales. By fusing these structures together, M-SFANet is effective on both sparse and dense crowd scenes. Second, we integrate the dual-path structure [46] into a ModSegNet[13]-like encoder-decoder network instead of UNet [32] and call this model “M-SegNet” (Modified SegNet). Designed for medical image segmentation, ModSegNet [13] is similar to UNet [32] but leverages max unpooling [1] instead of transposed convolution, which is not parameter-free. M-SegNet is designed to be faster than SFANet [46]

while providing similar performance. Furthermore, we also test the performance of an ensemble of M-SegNet and M-SFANet obtained by averaging their predictions. For surveillance applications where speed is not a constraint, the ensemble model should be considered because of its lower-variance predictions.

2 Related Works

Solutions to crowd counting can be classified into traditional approaches and CNN-based approaches. The traditional approaches include detection-based methods and regression-based methods. The CNN-based approaches use CNNs to predict density maps and often outperform the traditional approaches.

2.1 Traditional Approaches

Some early approaches rely on sliding-window detection algorithms. These require features extracted from human heads or bodies, such as HOG [9] (histograms of oriented gradients) or Haar wavelets [37]. Unfortunately, these methods fail to detect people under the heavy occlusion common in crowded images. Regression-based methods attempt to learn a mapping from low-level information [5], generated by features such as foreground and texture, to the number of target objects. A linear mapping function has been studied in [22]. Then, due to the difficulty of learning a linear mapping, Pham et al. [31] proposed a non-linear mapping using random forest regression instead.

2.2 CNN-based Approaches

Lately, CNN-based methods have shown significant improvements over traditional methods on the crowd counting task. Walach et al. [38] demonstrated the use of CNNs with layered boosting and selective sampling. Shang et al. [34] studied CNNs that directly regress the whole input image to the final crowd count. Zhang et al. [43] proposed a deep CNN trained to estimate the crowd count while predicting the crowd density level. Shang et al. [34] proposed a GoogLeNet[36]-based model that predicts the global count while employing LSTM memory cells exploiting contextual information to predict local counts. Boominathan et al. [3] used dual-column CNNs to generate density maps. Zhang et al. [45] proposed a multi-column CNN (MCNN), with each column designed to respond to a different scale. Onoro et al. [30] presented a scale-aware multi-column network, called Hydra, trained with a pyramid of image patches. Switching CNN was proposed by Sam et al. [33] to select the best CNN for each image patch. Li et al. [23] proposed a deep single-column CNN based on a truncated VGG16 encoder and a decoder of dilated convolutional layers to aggregate multi-scale contextual information. Cao et al. [4] presented an encoder-decoder network, called scale aggregation network (SANet), whose encoder extracts multi-scale features using scale aggregation modules inspired by the Inception structure [36]. Wu et al. [40] proposed a counting network with two parallel branches containing different receptive field sizes and a third control branch for adaptive recalibration of the pathway-wise responses. Liu et al. [26] applied an attention mechanism to crowd counting by integrating an attention-aware network into a multi-scale deformable network to detect crowd regions. Wang et al. [39] boosted counting performance by pretraining a crowd counter on synthetic data and proposed a domain-adaptation method to deal with the lack of labelled data. Jiang et al. [20] proposed a trellis encoder-decoder network (TEDnet) that incorporates multiple decoding paths to hierarchically aggregate features from different encoding layers. Liu et al. [27] proposed a VGG16-based model with a scale-aware contextual structure (the CAN module in this paper) that combines features extracted over multiple receptive field sizes and learns the importance of each such feature at every image location. Zhu et al. [46] proposed an encoder-decoder network with a dual-path multi-scale fusion decoder that reuses coarse features and high-level features from the encoding stages, similar to the UNet [32] architecture for medical image segmentation. Instead of employing density maps as learning targets, Ma et al. [28] constructed a density contribution model and trained a VGG19[35]-based network using a Bayesian loss instead of the vanilla mean squared error (MSE) loss.

3 Proposed Approach

Since there are two base neural network architectures that we modify and experiment with, SFANet and SegNet, we call our models “M-SFANet” (Modified SFANet) and “M-SegNet” (Modified SegNet) respectively. Both are encoder-decoder based deep convolutional neural networks. They share a VGG16-bn [35] encoder, which gradually reduces the feature map size and captures high-level semantic information. In the case of M-SFANet, the features are passed through the CAN [27] module and ASPP [6] to extract scale-aware contextual features [27]. Finally, the decoders recover the spatial information to generate the final high-resolution density map. Combining SFANet [46] with the CAN [27] module and ASPP [6], M-SFANet is more heavy-weight and usually predicts more accurate crowd counts than the proposed M-SegNet. On the other hand, M-SegNet is based on SegNet [1] and has no additional multi-scale-aware module, yet it achieves competitive results on some crowd counting benchmarks.

Figure 1: The architecture of the proposed M-SFANet. The convolutional layers’ parameters are denoted as Conv (kernel size)-(number of filters). Max pooling is conducted over a 2×2 pixel window with stride 2.

3.1 Modified SFANet (M-SFANet)

The model design is inspired by successful models for image segmentation (UNet [32] and DeepLabv3 [6]) and crowd counting (CAN [27] and SFANet [46]). The architecture consists of three components: the VGG16-bn [35] feature map encoder, the multi-scale-aware modules [27, 6], and the dual-path multi-scale fusing decoder [46]. First, the input images are fed into the encoder to learn useful high-level semantic information. The feature maps are then fed into the multi-scale-aware modules to highlight the multi-scale features of the target objects and their context. There are two multi-scale-aware modules in the M-SFANet architecture: ASPP [6], connected to the 13th layer of VGG16-bn, and CAN [27], connected to the 10th layer. Finally, the decoder paths use concatenation and bilinear upsampling to fuse the multi-scale features into density maps and attention maps. Before producing the final density maps, the crowd regions are segmented from the background by the attention maps. This mechanism suppresses noisy background and lets the model focus on the regions of interest. We leverage the multi-task loss function in [46] to gain the advantage of the attention map generator. The overall architecture of M-SFANet is shown in Fig. 1.

Every convolutional layer is followed by batch normalization and ReLU [29], except the last convolutional layer, which produces the final density map.
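The multi-task objective from [46] mentioned above can be sketched as follows. This is a hedged illustration, not the authors' code: a density-map MSE term plus a weighted binary cross-entropy on the attention map, with the weight alpha = 0.1 following the SFANet paper; the tensor shapes in the toy check are made up.

```python
import torch
import torch.nn.functional as F

def multi_task_loss(pred_density, gt_density, pred_attention_logits,
                    gt_attention, alpha=0.1):
    # MSE between predicted and ground-truth density maps,
    # plus a weighted binary cross-entropy on the attention (head-region) map.
    density_loss = F.mse_loss(pred_density, gt_density)
    attention_loss = F.binary_cross_entropy_with_logits(
        pred_attention_logits, gt_attention)
    return density_loss + alpha * attention_loss

# Toy check: all-zero predictions and targets leave only the BCE term,
# which equals log(2) at zero logits, scaled by alpha.
loss = multi_task_loss(torch.zeros(1, 1, 8, 8), torch.zeros(1, 1, 8, 8),
                       torch.zeros(1, 1, 8, 8), torch.zeros(1, 1, 8, 8))
```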

Feature map encoder (13 layers of VGG16-bn): We leverage the first 13 pretrained layers of VGG-16 with batch normalization as the feature map encoder, because a stack of 3x3 convolutional layers is able to extract multi-scale features and multi-level semantic information [46]. This is a more efficient way to deal with scale variation throughout the input images than a multi-column architecture with different kernel sizes [23]. The high-level feature maps (1/8 the size of the original input) from the 10th layer are fed into CAN to adaptively encode the scales of the contextual information [27]. Moreover, the top feature maps (1/16 the size of the original input) from the 13th layer are fed into the ASPP module to learn image-level features (e.g. human heads in our experiments) and contextual information at multiple rates.

CAN module: The CAN module produces scale-aware contextual features using average pooling over multiple receptive field sizes [27]. The module extracts those features and learns the importance of each at every image location, thus accounting for potentially rapid scale changes within an image [27]. The importance of an extracted feature varies according to its difference from its neighborhood. Because information from different scales is fused discriminately, the CAN module performs very well under perspective distortion in the input image.
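A simplified sketch of such a scale-aware module follows. This is our reading of [27]: the scale set {1, 2, 3, 6} follows that paper, while the channel widths, the 1x1 convolutions, and the fusion details are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CANSketch(nn.Module):
    """Sketch of a scale-aware contextual module: average pooling at several
    scales, per-pixel importance weights derived from the contrast between
    pooled and original features, and a weighted fusion."""
    def __init__(self, channels=512, scales=(1, 2, 3, 6)):
        super().__init__()
        self.scales = scales
        self.pool_convs = nn.ModuleList(
            nn.Conv2d(channels, channels, 1, bias=False) for _ in scales)
        self.weight_convs = nn.ModuleList(
            nn.Conv2d(channels, 1, 1) for _ in scales)

    def forward(self, f):
        h, w = f.shape[2:]
        weighted, weights = 0.0, 0.0
        for scale, pc, wc in zip(self.scales, self.pool_convs, self.weight_convs):
            pooled = F.adaptive_avg_pool2d(f, scale)          # context at this scale
            s = F.interpolate(pc(pooled), size=(h, w),
                              mode='bilinear', align_corners=False)
            wgt = torch.sigmoid(wc(s - f))                    # importance from contrast
            weighted = weighted + wgt * s
            weights = weights + wgt
        context = weighted / (weights + 1e-6)
        return torch.cat([f, context], dim=1)                 # (B, 2*channels, H, W)

can = CANSketch(channels=8)
out = can(torch.randn(1, 8, 24, 24))  # -> (1, 16, 24, 24)
```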

ASPP module: The ASPP [6] module applies atrous convolutions with several effective fields-of-view, together with image-level pooling, to the incoming features, thereby capturing multi-scale information. Thanks to atrous convolution, the loss of information about object boundaries (between human heads and background) through the encoder’s convolutional layers is alleviated: atrous convolution enlarges the field of view of the filters and incorporates larger context without reducing image resolution. The module was experimentally proven effective for the image segmentation task in [6] by exploiting multi-scale information.
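A minimal sketch of the module's structure is shown below; the dilation rates (1, 6, 12, 18) follow DeepLabv3, while the channel widths and the bare convolutions (no BN/ReLU) are simplifying assumptions of ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPPSketch(nn.Module):
    """Sketch of atrous spatial pyramid pooling: parallel 3x3 atrous
    convolutions at several dilation rates plus a global image-pooling
    branch, concatenated and projected with a 1x1 convolution."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates)
        self.image_pool = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(in_ch, out_ch, 1))
        self.project = nn.Conv2d(out_ch * (len(rates) + 1), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [b(x) for b in self.branches]           # same spatial size, larger context
        pooled = F.interpolate(self.image_pool(x), size=(h, w),
                               mode='bilinear', align_corners=False)
        return self.project(torch.cat(feats + [pooled], dim=1))

aspp = ASPPSketch(in_ch=8, out_ch=4)
y = aspp(torch.randn(1, 8, 32, 32))  # -> (1, 4, 32, 32)
```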

Dual-path multi-scale fusion decoder: The decoder consists of a density map path and an attention map path, as described in [46]. The following strategy is applied to both paths. First, the output feature maps from ASPP are upsampled by a factor of 2 using bilinear interpolation and concatenated with the output feature maps from the CAN module. The concatenated feature maps are then passed through 1x1x256 and 3x3x256 convolutional layers. The fused features are again upsampled by a factor of 2 and concatenated with conv3-3 and the (4x upsampled) feature maps from ASPP, before passing through 1x1x128 and 3x3x128 convolutional layers. This strategy reminds the network of the multi-scale features learned from the high-level image representation. Finally, the 128 fused features are upsampled by a factor of 2 and concatenated with conv2-2 before passing through 1x1x64, 3x3x64 and 3x3x32 convolutional layers respectively. Owing to the three upsampling layers, the model retrieves high-resolution feature maps at 1/2 the size of the original input. Element-wise multiplication is applied to the attention map and the last density feature maps to generate the refined density map.
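One fusion step of this strategy might look like the following sketch; the layer widths follow the text, while the function name and everything else are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse(deep, skip, conv1x1, conv3x3):
    # Upsample the deeper features x2, concatenate with the skip features,
    # then apply the 1x1 and 3x3 convolutions of this decoder stage.
    up = F.interpolate(deep, scale_factor=2, mode='bilinear', align_corners=False)
    return conv3x3(conv1x1(torch.cat([up, skip], dim=1)))

deep = torch.randn(1, 512, 14, 14)   # e.g. ASPP output (1/16 scale)
skip = torch.randn(1, 512, 28, 28)   # e.g. CAN output (1/8 scale)
conv1x1 = nn.Conv2d(1024, 256, 1)
conv3x3 = nn.Conv2d(256, 256, 3, padding=1)
fused = fuse(deep, skip, conv1x1, conv3x3)  # -> (1, 256, 28, 28)
```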


Figure 2: The architecture of the proposed M-SegNet. The convolutional layers’ parameters are denoted as Conv (kernel size)-(number of filters). Max pooling is conducted over a 2×2 pixel window with stride 2.

3.2 Modified SegNet (M-SegNet)

M-SegNet shares almost the same components as M-SFANet, except that there is no CAN [27] module or ASPP [6] to additionally emphasize multi-scale information, and the bilinear upsampling is replaced with max unpooling using the memorized max-pooling indices [1] from the corresponding encoder layer. Hence, M-SegNet requires less computational resources than M-SFANet and is more suitable for real-world applications. The overall architecture of M-SegNet is shown in Fig. 2.
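The swap can be illustrated directly with PyTorch's pooling operators: MaxPool2d records the argmax indices, and MaxUnpool2d later uses them to place values back at their original locations (all other positions become zero), which is parameter-free like bilinear upsampling but preserves where the maxima were.

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.tensor([[[[1., 2.],
                    [3., 4.]]]])
pooled, indices = pool(x)           # pooled value 4. and its remembered location
restored = unpool(pooled, indices)  # 4. returns to its original cell; the rest are 0
```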

4 Training method

In this section, we explain how the density map ground truth and the attention map ground truth are generated in our experiments. Training settings for each dataset are shown in Table 1.

4.1 Density map ground truth

To generate the density map ground truth, we follow the fixed-kernel Gaussian method described in [45]. Assuming there is a head annotation at pixel x_i, represented as \delta(x - x_i), the density map can be constructed by convolving it with a Gaussian kernel [22]. This process is formulated as:

D(x) = \sum_{i=1}^{N} \delta(x - x_{i}) * G_{\sigma}(x)

That is, we convolve each \delta(x - x_i) in the ground truth annotation with a Gaussian kernel of standard deviation \sigma (blurring each head annotation), where N is the total head count. In our experiments we set \sigma = 5, 4, 4, 10 for the ShanghaiTech [45], UCF_CC_50 [2], WorldExpo’10 [43], and TRANCOS [15] datasets respectively. For the Beijing BRT [11] dataset, we use the provided code for density map ground truth generation; for UCF_CC_50 and WE, we borrow the code from [14].
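The fixed-kernel construction can be sketched in a few lines. This is a hedged numpy/scipy illustration; the image size and head positions are made up. A useful property is that the density map integrates (approximately, up to boundary truncation) to the head count.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def density_map(shape, head_points, sigma=5):
    # Place a unit impulse at each annotated head, then blur with a Gaussian.
    impulses = np.zeros(shape, dtype=np.float64)
    for y, x in head_points:
        impulses[y, x] = 1.0
    return gaussian_filter(impulses, sigma=sigma, mode='constant')

dm = density_map((64, 64), [(20, 20), (40, 45)], sigma=5)
# dm.sum() is ~2.0, the total head count, minus a little mass lost at borders.
```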

4.2 Attention map ground truth

Following [46], the attention map ground truth is generated by applying a threshold to the corresponding density map ground truth:

A_{i} = \begin{cases} 1, & D_{i} > t \\ 0, & \text{otherwise} \end{cases}

where i is a coordinate in the density map ground truth D. The threshold t is set to 0.001 according to [46].
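In code, the thresholding is a one-liner (numpy sketch; the example density values are made up):

```python
import numpy as np

def attention_map(density_gt, threshold=1e-3):
    # 1 where the density exceeds the threshold (head region), 0 elsewhere.
    return (density_gt > threshold).astype(np.float32)

am = attention_map(np.array([[0.0, 0.002], [0.0005, 0.01]]))
```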

4.3 Training details

We leverage the same image augmentation strategy as described in [46], but the size of the cropped image differs across datasets. Hence, when training on each dataset, we use slightly different learning rates and batch sizes. The main strategy can be summarized as random resizing by a small factor, image cropping, horizontal flipping, and gamma adjustment. The main difference from [46] is that we train the models with Adam [21] wrapped in the Lookahead optimizer [44], since it converges faster than standard Adam and was experimentally shown to improve model performance in [44].

Dataset learning rate batch size* size of cropped image
ShanghaiTech [45] 5e-4 8 | 8 400x400
UCF_CC_50 [2] 8e-4 5 | 8 512x512
WE [43] 8e-4 | 5e-4 42 | 45 224x224
BRT [11] 6e-4 42 | 45 224x224
TRANCOS [15] 5e-4 5 | 8 full image**

Note: “|” separates batch size for M-SFANet (left) and M-SegNet(right).
*We recommend using a batch size as large as possible; we selected these values due to GPU memory limits.
**Full image training means no random resize and no image cropping.

Table 1: Training settings for each dataset
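For reference, the Lookahead scheme [44] used above can be sketched as a thin wrapper around any inner optimizer. This is our minimal illustration, not the implementation used in the paper: every k inner steps, the slow weights move a fraction alpha toward the fast weights, and the fast weights are reset to them.

```python
import torch

class LookaheadSketch:
    """Minimal Lookahead wrapper sketch (k and alpha defaults follow [44])."""
    def __init__(self, optimizer, k=5, alpha=0.5):
        self.optimizer, self.k, self.alpha, self.step_count = optimizer, k, alpha, 0
        self.slow = [p.detach().clone() for group in optimizer.param_groups
                     for p in group['params']]

    def step(self):
        self.optimizer.step()                      # fast-weight update
        self.step_count += 1
        if self.step_count % self.k == 0:          # slow-weight sync every k steps
            params = [p for group in self.optimizer.param_groups
                      for p in group['params']]
            for slow, fast in zip(self.slow, params):
                slow += self.alpha * (fast.detach() - slow)
                fast.data.copy_(slow)

# Toy run with SGD: two fast steps of -0.1 each, then slow moves halfway.
p = torch.nn.Parameter(torch.zeros(3))
look = LookaheadSketch(torch.optim.SGD([p], lr=0.1), k=2, alpha=0.5)
for _ in range(2):
    p.grad = torch.ones(3)
    look.step()
# p is now -0.1 (halfway between the initial 0 and the fast weights -0.2)
```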

5 Experimental Evaluation

In this section, we show the results of our M-SFANet and M-SegNet on four challenging crowd counting datasets. We also test our models’ performance on one congested vehicle counting dataset, TRANCOS [15]. We mostly evaluate performance using the mean absolute error (MAE) and the root mean squared error (RMSE), defined as follows:

\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|C_{i} - C_{i}^{GT}\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(C_{i} - C_{i}^{GT}\right)^{2}}

where N is the number of test images, and C_i and C_i^{GT} refer to the predicted head count and the ground truth head count for the i-th test image.
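Both metrics reduce to a few numpy operations over per-image counts (a sketch with made-up counts):

```python
import numpy as np

def mae_rmse(pred_counts, gt_counts):
    pred = np.asarray(pred_counts, dtype=float)
    gt = np.asarray(gt_counts, dtype=float)
    mae = np.mean(np.abs(pred - gt))              # mean absolute error
    rmse = np.sqrt(np.mean((pred - gt) ** 2))     # root mean squared error
    return mae, rmse

mae, rmse = mae_rmse([100, 210], [110, 200])  # errors of -10 and +10
```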

Figure 3: Visualization of estimated density maps. The first row shows sample images from ShanghaiTech Part A. The second row is the ground truth. The third to fifth rows correspond to the estimated density maps from M-SegNet, M-SFANet and M-SegNet+M-SFANet respectively.

5.1 ShanghaiTech dataset

The ShanghaiTech [45] dataset consists of 1198 labelled images with 330,165 annotated people. The dataset is divided into Part A and Part B. Part A contains 482 (train: 300, test: 182) highly congested images downloaded from the internet. Part B includes 716 (train: 400, test: 316) relatively sparse crowd scenes taken from streets in Shanghai. Table 2 shows the results of our models and state-of-the-art methods on this dataset. Compared to the base model, SFANet [46], M-SFANet reduces MAE by 3.76% on Part A and 2.03% on Part B. M-SegNet also outperforms SFANet [46] on Part B, with 1.45%/4.50% MAE/RMSE improvement. Note that, unlike SFANet [46], our models are not pre-trained on the UCF-QNRF dataset [18]. M-SFANet and M-SegNet both show competitive results compared to the best methods on Part A (S-DCNet [41]) and Part B (SANet+SPANet [8]). By averaging the predictions of M-SFANet and M-SegNet, we gain 3.11% and 2.77% relative MAE improvements on Part A and Part B respectively. The visualization of estimated density maps by our models on ShanghaiTech Part A is shown in Fig. 3.

Method Part A (MAE RMSE) Part B (MAE RMSE)
MCNN [45] 110.2 173.2 26.4 41.3
CSRNet [23] 68.2 115.0 10.6 16.0
DRSAN [25] 69.3 96.4 11.1 18.2
CAN [27] 62.3 100.0 7.8 12.2
SFANet [46] 59.8 99.3 6.9 10.9
BL [28] 62.8 101.8 7.7 12.7
S-DCNet [41] 58.3 95.0 6.7 10.7
SANet+SPANet [8] 59.4 92.5 6.5 9.9
M-SegNet 60.55 100.80 6.80 10.41
M-SFANet 59.69 95.66 6.76 11.89
M-SFANet+M-SegNet 57.55 94.48 6.32 10.06

Note: “M-SFANet+M-SegNet” denotes the average prediction of the two models. The best performance is in boldface.

Table 2: Comparison with state-of-the-art methods on ShanghaiTech [45] dataset

5.2 UCF_CC_50 dataset

Proposed by [2], the UCF_CC_50 dataset contains extremely crowded scenes with limited training samples. It includes only 50 high-resolution images, with head counts ranging from 94 to 4543. Because of the limited number of training samples, we pre-train our models on ShanghaiTech Part A. To evaluate model performance, 5-fold cross-validation is performed following the standard setting in [2]. The results compared with state-of-the-art methods are listed in Table 3. M-SFANet obtains the best MAE, a 20.5% improvement over the second-best approach, S-DCNet [41]. The visualization of the predicted density maps on a dense scene of this dataset is depicted in the left column of Fig. 4.

Figure 4: Visualization of estimated density maps. The first row shows sample images from the UCF_CC_50 [2], Beijing BRT [11] and TRANCOS [15] datasets (left→right). The second row is the ground truth. The third to fifth rows correspond to the estimated density maps from M-SegNet, M-SFANet and M-SegNet+M-SFANet respectively.
Method MAE RMSE
CSRNet [23] 266.1 397.5
DRSAN [25] 219.2 250.2
CAN [27] 212.2 243.7
SFANet [46] 219.6 316.2
BL [28] 229.3 308.2
SANet+SPANet [8] 232.6 311.7
S-DCNet [41] 204.2 301.3
M-SegNet 188.40 262.21
M-SFANet 162.33 276.76
M-SFANet+M-SegNet 167.51 256.26

Note: “M-SFANet+M-SegNet” denotes the average prediction of the two models. The best performance is in boldface.

Table 3: Comparison with state-of-the-art methods on UCF_CC_50 [2] dataset

5.3 WorldExpo’10 dataset

The WorldExpo’10 dataset [43] includes 1,132 annotated video sequences collected from 103 different scenes. There are 3,980 annotated frames, 3,380 of which are used for model training. Each scene has a region of interest (ROI). Having no access to the original dataset, we use the images and the density map ground truth generated by [14] to train our models. In Table 4, the performance on each test scene is reported in MAE. M-SegNet and M-SFANet achieve the best performance in scene 1 (sparse crowd) and scene 4 (dense crowd) respectively. Fig. 5 depicts the visualization of predicted density maps from M-SFANet and M-SegNet.

Figure 5: Visualization of estimated density maps from M-SFANet and M-SegNet on test samples of the WorldExpo’10 dataset [43].


Method Sce.1 Sce.2 Sce.3 Sce.4 Sce.5 Ave.
MCNN [45] 3.4 20.6 12.9 13.0 8.1 11.6
CSRNet [23] 2.9 11.5 8.6 16.6 3.4 8.6
CAN [27] 2.9 12.0 10.0 7.9 4.3 7.4
PGCNet [42] 2.5 12.7 8.4 13.7 3.2 8.1
DSSINet [24] 1.57 9.51 9.46 10.35 2.49 6.67
M-SegNet 1.45 11.72 10.29 21.15 5.47 10.03
M-SFANet 1.88 13.24 10.07 7.5 3.87 7.32

Note: The result of “M-SFANet+M-SegNet” is not included because it showed no improvement over state-of-the-art methods. The best performance is in boldface.

Table 4: Comparison with state-of-the-art methods on WE [43] dataset

5.4 Beijing BRT dataset

The Beijing BRT dataset [11] is a new crowd counting dataset applicable to intelligent transportation. The number of heads per image varies from 1 to 64. The images are all 640×360 pixels and were taken at Bus Rapid Transit (BRT) stations in Beijing. The images span morning to night, and therefore contain shadows, glare, and sunlight interference. Table 5 reports our models’ performance on this dataset. M-SFANet+M-SegNet obtains the new best performance, with a 17.27%/9.50% relative MAE/RMSE improvement on this dataset. The visualization of the estimated density maps on a sample of this dataset is shown in the middle column of Fig. 4.

Method MAE RMSE
MCNN [45] 2.24 3.35
FCN [11] 1.74 2.43
ResNet-14 [11] 1.48 2.22
DR-ResNet [11] 1.39 2.00
M-SegNet 1.26 1.98
M-SFANet 1.16 1.90
M-SFANet+M-SegNet 1.15 1.81

Note: “M-SFANet+M-SegNet” denotes the average prediction of the two models. The best performance is in boldface.

Table 5: Comparison with state-of-the-art methods on Beijing BRT [11] dataset

5.5 TRANCOS dataset

Apart from crowd counting, we also evaluate our models on TRANCOS [15], a vehicle counting dataset, to demonstrate the robustness and generalization of our approaches. The dataset contains 1244 images of congested traffic scenes taken by surveillance cameras. Each image has a region of interest (ROI) used for evaluation. Following the work in [15], we use the Grid Average Mean Absolute Error (GAME) for model performance evaluation; the metric is defined in equation 5. Our approaches, especially M-SFANet, surpass the previous best method, as shown in Table 6. The results also show that averaged density map estimation might improve counting accuracy (MAE) but does not provide better localization of the target objects. The generated density maps from our models are shown in the right column of Fig. 4.

Method GAME(0) GAME(1) GAME(2) GAME(3)
Hydra-3s [30] 10.99 13.75 16.69 19.32
CSRNet [23] 3.56 5.49 8.57 15.04
SPN [7] 3.35 4.94 6.47 9.22
ADCrowdNet(AMG-attn-DME) [26] 2.44 4.14 6.78 13.58
S-DCNet [41] 2.92 4.29 5.54 7.05
M-SegNet 2.51 5.43 7.59 9.49
M-SFANet 2.23 3.46 4.86 6.91
M-SFANet+M-SegNet 2.22 3.87 5.51 7.37

Note: “M-SFANet+M-SegNet” denotes the average prediction of the two models. The best performance is in boldface.

Table 6: Comparison with state-of-the-art methods on TRANCOS [15] dataset

\mathrm{GAME}(L) = \frac{1}{N}\sum_{n=1}^{N}\sum_{l=1}^{4^{L}}\left|c_{n}^{l} - c_{n}^{l,GT}\right| \qquad (5)

where N corresponds to the number of test images, and c_n^l and c_n^{l,GT} are the predicted and ground truth counts of the l-th sub-region of the n-th test image.
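For a single image, the GAME(L) error can be computed by splitting the density maps into a 2^L x 2^L grid of sub-regions (numpy sketch; note that GAME(0) is just the whole-image absolute count error):

```python
import numpy as np

def game(pred_density, gt_density, L):
    # Sum of absolute count errors over the 4^L non-overlapping sub-regions
    # of a single image; averaging over test images gives GAME(L).
    splits = 2 ** L
    err = 0.0
    for rows_p, rows_g in zip(np.array_split(pred_density, splits, axis=0),
                              np.array_split(gt_density, splits, axis=0)):
        for p, g in zip(np.array_split(rows_p, splits, axis=1),
                        np.array_split(rows_g, splits, axis=1)):
            err += abs(p.sum() - g.sum())
    return err

# Same total count, wrong location: GAME(0) is 0 but GAME(1) penalizes it.
pred = np.zeros((4, 4)); pred[0, 0] = 1.0
gt = np.zeros((4, 4)); gt[3, 3] = 1.0
```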

6 Conclusions

In this paper, we propose two modified end-to-end trainable neural networks, M-SFANet and M-SegNet, built by combining novel architectures designed for crowd counting, image segmentation and deep learning in general. For M-SFANet, we add two multi-scale-aware modules [27, 6] to the SFANet [46] architecture to better tackle the scale changes of target objects throughout an input image; as a result, the model shows superior performance over state-of-the-art methods on both crowd counting and vehicle counting datasets. Furthermore, the decoder structure of SFANet is adjusted to have more residual connections, ensuring that the learned multi-scale features of high-level semantic information influence how the model regresses the final density map. For M-SegNet, we change the upsampling algorithm from bilinear interpolation to the max unpooling with memorized indices employed in SegNet [1]. This yields a computationally cheaper model with competitive counting performance, applicable to real-world applications.


  • [1] V. Badrinarayanan, A. Kendall, and R. Cipolla (2017) Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence 39 (12), pp. 2481–2495. Cited by: §1, §1, §3.2, §3, §6.
  • [2] A. Bansal and K. Venkatesh (2015) People counting in high density crowds from still images. arXiv preprint arXiv:1507.08445. Cited by: §4.1, Table 1, Figure 4, §5.2, Table 3.
  • [3] L. Boominathan, S. S. Kruthiventi, and R. V. Babu (2016) Crowdnet: a deep convolutional network for dense crowd counting. In Proceedings of the 24th ACM international conference on Multimedia, pp. 640–644. Cited by: §2.2.
  • [4] X. Cao, Z. Wang, Y. Zhao, and F. Su (2018) Scale aggregation network for accurate and efficient crowd counting. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734–750. Cited by: §1, §2.2.
  • [5] A. B. Chan and N. Vasconcelos (2009) Bayesian poisson regression for crowd counting. In 2009 IEEE 12th international conference on computer vision, pp. 545–551. Cited by: §2.1.
  • [6] L. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam (2018) Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European conference on computer vision (ECCV), pp. 801–818. Cited by: §1, §1, §3.1, §3.1, §3.2, §3, §6.
  • [7] X. Chen, Y. Bin, N. Sang, and C. Gao (2019) Scale pyramid network for crowd counting. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1941–1950. Cited by: Table 6.
  • [8] Z. Cheng, J. Li, Q. Dai, X. Wu, and A. G. Hauptmann (2019) Learning spatial awareness to improve crowd counting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6152–6161. Cited by: §5.1.
  • [9] N. Dalal and B. Triggs (2005) Histograms of oriented gradients for human detection. In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05), Vol. 1, pp. 886–893. Cited by: §2.1.
  • [10] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: §1.
  • [11] X. Ding, Z. Lin, F. He, Y. Wang, and Y. Huang (2018) A deeply-recursive convolutional network for crowd counting. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1942–1946. Cited by: §4.1, Table 1, Figure 4, §5.4, Table 5.
  • [12] P. Dollar, C. Wojek, B. Schiele, and P. Perona (2011) Pedestrian detection: an evaluation of the state of the art. IEEE transactions on pattern analysis and machine intelligence 34 (4), pp. 743–761. Cited by: §1.
  • [13] P. Ganaye, M. Sdika, and H. Benoit-Cattin (2018) Semi-supervised learning for segmentation under semantic constraint. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 595–602. Cited by: §1.
  • [14] J. Gao, W. Lin, B. Zhao, D. Wang, C. Gao, and J. Wen (2019) C^3 framework: an open-source pytorch code for crowd counting. arXiv preprint arXiv:1907.02724. Cited by: §4.1, §5.3.
  • [15] R. Guerrero-Gómez-Olmedo, B. Torre-Jiménez, R. López-Sastre, S. Maldonado-Bascón, and D. Onoro-Rubio (2015) Extremely overlapping vehicle counting. In Iberian Conference on Pattern Recognition and Image Analysis, pp. 423–431. Cited by: §4.1, Table 1, Figure 4, §5.5, Table 6, §5.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2015) Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE transactions on pattern analysis and machine intelligence 37 (9), pp. 1904–1916. Cited by: §1, Table 2, Table 3.
  • [17] H. Idrees, I. Saleemi, C. Seibert, and M. Shah (2013) Multi-source multi-scale counting in extremely dense crowd images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2547–2554. Cited by: §1.
  • [18] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, and M. Shah (2018) Composition loss for counting, density map estimation and localization in dense crowds. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 532–546. Cited by: §5.1.
  • [19] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167. Cited by: §3.1.
  • [20] X. Jiang, Z. Xiao, B. Zhang, X. Zhen, X. Cao, D. Doermann, and L. Shao (2019) Crowd counting and density estimation by trellis encoder-decoder networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6133–6142. Cited by: §2.2.
  • [21] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.3.
  • [22] V. Lempitsky and A. Zisserman (2010) Learning to count objects in images. In Advances in neural information processing systems, pp. 1324–1332. Cited by: §2.1, §4.1.
  • [23] Y. Li, X. Zhang, and D. Chen (2018) Csrnet: dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1091–1100. Cited by: §1, §2.2, §3.1, Table 2, Table 3, Table 4, Table 6.
  • [24] L. Liu, Z. Qiu, G. Li, S. Liu, W. Ouyang, and L. Lin (2019) Crowd counting with deep structured scale integration network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1774–1783. Cited by: Table 4.
  • [25] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin (2018) Crowd counting using deep recurrent spatial-aware network. arXiv preprint arXiv:1807.00601. Cited by: Table 2, Table 3.
  • [26] N. Liu, Y. Long, C. Zou, Q. Niu, L. Pan, and H. Wu (2019) Adcrowdnet: an attention-injective deformable convolutional network for crowd understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3225–3234. Cited by: §1, §2.2, Table 6.
  • [27] W. Liu, M. Salzmann, and P. Fua (2019) Context-aware crowd counting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5099–5108. Cited by: §1, §1, §2.2, §3.1, §3.1, §3.1, §3.2, §3, Table 2, Table 3, Table 4, §6.
  • [28] Z. Ma, X. Wei, X. Hong, and Y. Gong (2019) Bayesian loss for crowd count estimation with point supervision. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6142–6151. Cited by: §1, §2.2, Table 2, Table 3.
  • [29] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pp. 807–814. Cited by: §3.1.
  • [30] D. Onoro-Rubio and R. J. López-Sastre (2016) Towards perspective-free object counting with deep learning. In European Conference on Computer Vision, pp. 615–629. Cited by: §1, §2.2, Table 6.
  • [31] V. Pham, T. Kozakaya, O. Yamaguchi, and R. Okada (2015) Count forest: co-voting uncertain number of targets using random forest for crowd density estimation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3253–3261. Cited by: §2.1.
  • [32] O. Ronneberger, P. Fischer, and T. Brox (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §1, §1, §2.2, §3.1.
  • [33] D. B. Sam, S. Surya, and R. V. Babu (2017) Switching convolutional neural network for crowd counting. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4031–4039. Cited by: §2.2.
  • [34] C. Shang, H. Ai, and B. Bai (2016) End-to-end crowd counting via joint learning local and global count. In 2016 IEEE International Conference on Image Processing (ICIP), pp. 1215–1219. Cited by: §2.2.
  • [35] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §1, §1, §2.2, §3.1, §3.
  • [36] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §2.2.
  • [37] P. Viola and M. J. Jones (2004) Robust real-time face detection. International journal of computer vision 57 (2), pp. 137–154. Cited by: §2.1.
  • [38] E. Walach and L. Wolf (2016) Learning to count with cnn boosting. In European conference on computer vision, pp. 660–676. Cited by: §2.2.
  • [39] Q. Wang, J. Gao, W. Lin, and Y. Yuan (2019) Learning from synthetic data for crowd counting in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8198–8207. Cited by: §1, §2.2.
  • [40] X. Wu, Y. Zheng, H. Ye, W. Hu, J. Yang, and L. He (2019) Adaptive scenario discovery for crowd counting. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2382–2386. Cited by: §1, §2.2.
  • [41] H. Xiong, H. Lu, C. Liu, L. Liu, Z. Cao, and C. Shen (2019) From open set to closed set: counting objects by spatial divide-and-conquer. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8362–8371. Cited by: §1, §5.1, §5.2, Table 2, Table 3, Table 6.
  • [42] Z. Yan, Y. Yuan, W. Zuo, X. Tan, Y. Wang, S. Wen, and E. Ding (2019) Perspective-guided convolution networks for crowd counting. In Proceedings of the IEEE International Conference on Computer Vision, pp. 952–961. Cited by: Table 4.
  • [43] C. Zhang, H. Li, X. Wang, and X. Yang (2015) Cross-scene crowd counting via deep convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 833–841. Cited by: §2.2, §4.1, Table 1, Figure 5, Table 4.
  • [44] M. Zhang, J. Lucas, J. Ba, and G. E. Hinton (2019) Lookahead optimizer: k steps forward, 1 step back. In Advances in Neural Information Processing Systems, pp. 9593–9604. Cited by: §4.3.
  • [45] Y. Zhang, D. Zhou, S. Chen, S. Gao, and Y. Ma (2016) Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 589–597. Cited by: §1, §2.2, §4.1, §4.1, Table 1, §5.1, Table 2, Table 4, Table 5.
  • [46] L. Zhu, Z. Zhao, C. Lu, Y. Lin, Y. Peng, and T. Yao (2019) Dual path multi-scale fusion networks with attention for crowd counting. arXiv preprint arXiv:1902.01115. Cited by: §1, §1, §2.2, §3.1, §3.1, §3.1, §3, §4.2, §4.2, §4.3, §5.1, Table 2, Table 3, §6.