Multi-Level Attentive Convolutional Neural Network for Crowd Counting

05/24/2021 · Mengxiao Tian et al. · China Agricultural University, Beijing Institute of Technology, JD.com, Inc.

Crowd counting has recently received increasing attention. Counting in high-density environments in particular has become an important research topic, and existing methods remain suboptimal for extremely dense crowds. In this paper, we propose a multi-level attentive Convolutional Neural Network (MLAttnCNN) for crowd counting. We extract high-level contextual information with multiple pooling scales and use multi-level attention modules to enrich the features at different layers, achieving more efficient multi-scale feature fusion; the fused representation is used to generate a more accurate density map with dilated convolutions and a 1×1 convolution. Extensive experiments on three publicly available datasets show that our proposed network outperforms state-of-the-art approaches.

Introduction

With a wide range of surveillance applications, crowd counting has been investigated continuously in recent years. It can be applied to safety monitoring, crowd estimation, traffic management, disaster relief, urban planning, etc. However, due to drastic scale variations, illumination changes, complex backgrounds, and the irregular distribution of people, crowd counting remains a very challenging task in practice.

Figure 1: Visualization of the effect of multi-level attention maps. The predicted counts are 250.5, 598.9, and 673.9, from top to bottom. The ground-truth count is 699.

Various advanced deep learning techniques based on detection [7], regression [10, 25], and density estimation [39, 2] have been proposed to deal with this challenge. In particular, density-estimation-based methods have achieved great advances. These techniques are able to localize the crowd by predicting a density map as a pixel-wise regression and aggregating to a final count as the integral of the density map. To generate maps that retain the spatial size of the input, deep Convolutional Neural Network (CNN) architectures [1, 28, 39] are widely applied to handle scale variations by combining network structures with different receptive field sizes, or by fusing multi-context information to formulate a semantic feature representation and facilitate accurate pixel-wise regression. This motivates us to further investigate multi-scale and multi-context designs to exploit the potential of the underlying representation for more accurate pixel-wise estimation. It is worth noting that attention mechanisms [5, 33, 12, 20, 38, 9, 13, 37] are able to heighten sensitivity to the features containing the most valuable information. Several attempts have been made to incorporate attention mechanisms into crowd counting [14, 40, 16]. However, most existing works only use simple attentions or a single complicated attention to adjust the model weights or to determine the weights for generating a density map at different scales, which cannot handle complicated scenes well in the crowd counting problem. This motivates us to further explore a multi-level attention mechanism that focuses on key pieces of the feature space for crowd counting and suppresses irrelevant information.

In this paper, we propose a multi-level attentive Convolutional Neural Network (MLAttnCNN) for crowd counting, as shown in Figure 2. In order to extract global and sub-regional contextual features, we introduce multi-scale spatial pooling with multiple bin sizes, which correspond to different receptive fields for capturing different sizes of human heads and overcome the fixed-size constraints of deep networks with fully connected layers. Three levels of attention modules are then fully explored to refine the contextual feature representation before we produce a density map with dilated convolutions and a 1×1 convolution.

We shall emphasize that the 1st-level channel-wise attention module and the 2nd-level spatial attention module are designed to recover more detailed information after upsampling at each pooling scale, formulating more informative feature maps. With multi-scale fusion, we obtain rich characteristics for the feature representation. This enables the entire learning model to perceive multi-scale targets and incorporate contextual semantic features, which better preserves the underlying details for the upper levels.

Inspired by the success of channel-wise attention and spatial attention, we design the 3rd-level triplet attention module, which explores channel, row, and column attention, rescaling the intermediate features by applying channel, row, and column multiplications to further improve the quality of our feature representation. The intuition behind this is that a one-dimensional attention mechanism should be helpful for imposing an attention weight on each element along the corresponding dimension, and full consideration should also cover row and column information. The effects of the three-level attention maps are illustrated in Figure 1. We experimentally observe that with the triplet attention, our MLAttnCNN achieves more accurate estimation in crowd counting.

To sum up, our contributions are three-fold:

  • We propose a multi-level attentive Convolutional Neural Network to enhance selectivity over spatial and channel information, improving the effectiveness of the multi-scale cascade.

  • We design an elegant triplet attention module that extracts contextual features by exploring channel, row, and column attention, to further improve the quality of our feature representation.

  • We experimentally demonstrate that our proposed method achieves excellent performance on three common benchmark datasets, with comparable or better results than state-of-the-art approaches.

Related Work

The related work can be divided into two categories: density map regression for crowd counting and attention mechanisms for crowd counting.

Density map regression for crowd counting starts from [18], in which density maps are generated from minimal annotations of single points blurred by a Gaussian kernel for training counting networks. HydraCNN [24] was proposed to solve the multi-scale issue by extracting multiple overlapping image patches from a multi-scale pyramid of each input image. [1] combines deep and shallow fully convolutional networks to capture high-level semantic features and address scale changes in high-density crowds. [28] and [39] use a multi-column network to handle variations in head size within an image. These networks add extra computational cost, and their inefficient branches do not adapt well to large changes in crowd density, do not adequately address multi-scale variation at all levels, and do not make good use of spatial and semantic features. As a result, the estimation accuracy in dense scenarios does not meet the needs of real-world applications.

Attention mechanisms for crowd counting are popular in recent works. [14] uses an attention mechanism to adjust the model weights adaptively according to changes in crowd density. [40] proposes an attention mechanism to estimate a probability map and uses the predicted probability map to suppress non-head regions in density estimation. [16] feeds the final feature map at each scale to obtain an attention map, multiplies the softmax of each scale's attention map with the density map of that scale, and finally fuses all density maps with a 1×1 convolution. [8] achieves dense crowd counting with a global attention network with local scale perception. [41] generates a score map using a multi-resolution attention model in which head positions receive higher scores, so that the network can pay attention to head areas in complex backgrounds while suppressing non-head areas.

In this work, we utilize a multi-scale pooling module with multi-level attentions to choose relevant and important multi-scale features. We employ dilated convolutions [36, 19, 4], which have been shown to provide more global multi-scale sensing information without the use of multiple inputs or complex networks.

Methodology

As shown in Figure 2, the pipeline of our proposed multi-level attentive convolutional neural network contains four modules, i.e., feature extraction with a VGG-16 backbone, a multi-scale pooling module, multi-level attention modules (the 1st-level channel-wise attention, the 2nd-level spatial attention, and the 3rd-level triplet attention), and dilated convolutions with a 1×1 convolution to generate a density map.

Figure 2: The pipeline of our proposed multi-level attentive convolutional neural network for crowd counting. It consists of a VGG-16 backbone for feature extraction, multi-scale average pooling, 1st-level channel-wise attention, 2nd-level spatial attention, 3rd-level triplet attention, and dilated convolutions with a 1×1 convolution to generate a density map.

Given an input image, we employ the VGG-16 network [30] as the backbone for feature extraction. Note that we remove three pooling layers to preserve larger feature maps, which reduces the loss of spatial information. The final feature map is 1/4 the size of the input image, and it is fed into the multi-scale average pooling module to obtain context information. In this paper, we use five different pooling scales with bin sizes of 1×1, 3×3, 5×5, 7×7, and 9×9, each with the same number of channels. The multi-scale features are then upsampled to the same size and fed into the 1st-level channel-wise attention module and the 2nd-level spatial attention module for further processing. All the scale features obtained after applying the first two attention levels are concatenated together with the original feature map to form the contextual feature representation, which is then fed into the 3rd-level triplet attention module, and finally into a series of dilated convolution layers and a 1×1 convolution layer before the density map is generated.
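
To make the multi-scale pooling concrete, the following is a minimal PyTorch sketch of such a module, assuming the backbone outputs a feature map at 1/4 resolution; the module name, the per-scale 1×1 convolutions, and the bilinear upsampling are illustrative choices rather than details taken from the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAvgPooling(nn.Module):
    """Pool the backbone feature map at several bin sizes and upsample back.

    The bin sizes (1, 3, 5, 7, 9) follow the text; the per-scale 1x1 convolutions
    and bilinear upsampling are illustrative choices.
    """
    def __init__(self, in_channels, bin_sizes=(1, 3, 5, 7, 9)):
        super().__init__()
        self.bin_sizes = bin_sizes
        # One 1x1 conv per scale keeps the channel count identical across scales.
        self.convs = nn.ModuleList(
            [nn.Conv2d(in_channels, in_channels, kernel_size=1) for _ in bin_sizes]
        )

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = []
        for bin_size, conv in zip(self.bin_sizes, self.convs):
            p = F.adaptive_avg_pool2d(x, output_size=bin_size)  # global / sub-regional context
            p = conv(p)
            p = F.interpolate(p, size=(h, w), mode='bilinear', align_corners=False)
            pooled.append(p)
        return pooled  # five context maps, each the same spatial size as x
```

Each of the five upsampled context maps would then pass through the first two attention levels before being concatenated with the original feature map.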

After obtaining five feature maps at different scales, simply merging them may not be the most effective method. The scale of each object varies with its location in the image because of scene perspective. Especially for the scale changes of human heads, direct fusion easily loses spatial information and causes blurring. Therefore, each position of the feature map needs different fusion weights. So we employ multi-level attentions to estimate the attention map for each scale, which enriches the information and enables more efficient multi-scale integration.

We discuss the three levels of attention modules and their implementation details as follows.

The 1st-Level Channel-wise Attention Module

We use a channel-wise attention module that takes the high-level feature map as input and generates a channel attention map, which is then used to rescale the feature map along the channel dimension, as illustrated in Figure 3. For a convolutional feature map, we first squeeze it along the spatial dimensions by global average pooling to get a 1×1×C descriptor, which is then followed by two FC layers and a sigmoid layer. The first FC layer produces a feature vector of size 1×1×(C/r), where C is the number of channels and r is a scaling parameter. The purpose of r is to reduce the number of channels and the amount of computation. It is followed by a ReLU activation, which leaves the dimension of its output unchanged. The output dimension of the second FC layer is 1×1×C. Finally, we obtain the attention vector via the sigmoid, and the final channel-wise attention map by an element-wise multiplication between this vector and the input feature map. We merge the original CNN feature map and the channel-wise attention map as the input to the 2nd-level spatial attention module.

Figure 3: The 1st-level channel-wise attention module.
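
As a concrete reference, here is a minimal PyTorch sketch of such a squeeze-and-excitation style channel attention; the class name and the reduction ratio r = 16 are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class ChannelWiseAttention(nn.Module):
    """1st-level channel attention: global average pooling -> FC -> ReLU -> FC -> sigmoid."""
    def __init__(self, channels, reduction=16):  # reduction ratio r is an assumed value
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # 1 x 1 x (C/r)
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # back to 1 x 1 x C
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))           # squeeze: global average pooling -> (B, C)
        w = self.fc(w).view(b, c, 1, 1)  # channel attention vector
        return x * w                     # rescale the input along the channel dimension
```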

The 2nd-Level Spatial Attention Module

We feed the output channel-wise attention maps into the spatial attention module. Different from the 1st-level channel-wise attention, which emphasizes informative channels, the 2nd-level spatial attention is designed to focus on location information, selecting attentive areas to enhance the response of the feature map, as shown in Figure 4. We adopt the spatial attention module of [34] and make an element-wise multiplication between the channel-wise attention map and the spatial attention map to actuate the feature map before concatenating it with the original CNN feature. First, it applies max pooling and average pooling along the channel dimension to get two feature maps, which are merged by concatenation. The spatial attention map is then generated by a convolution operation, followed by a batch normalization (BN) layer and ReLU.

Figure 4: The 2nd-level spatial attention module.
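
A minimal sketch of this spatial attention, following the CBAM-style design of [34]; the 7×7 kernel size and the use of a sigmoid to normalize the attention map are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """2nd-level spatial attention in the spirit of CBAM [34]: channel-wise max and
    average pooling, concatenation, then conv + BN and a sigmoid to normalize the map."""
    def __init__(self, kernel_size=7):  # kernel size is an assumed value
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)
        self.bn = nn.BatchNorm2d(1)

    def forward(self, x):
        max_map, _ = x.max(dim=1, keepdim=True)  # (B, 1, H, W)
        avg_map = x.mean(dim=1, keepdim=True)    # (B, 1, H, W)
        attn = torch.sigmoid(self.bn(self.conv(torch.cat([max_map, avg_map], dim=1))))
        return x * attn                          # element-wise multiplication
```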

The 3rd-Level Triplet Attention Module

We concatenate the five multi-scale attended feature maps with the original CNN features as the input to the 3rd-level triplet attention. As demonstrated in Figure 5, our 3rd-level triplet attention has three branches, composed of a channel attention, a row attention, and a column attention. We execute the three branches' attention mechanisms respectively to get three attended feature maps.

The aim of the triplet attention is to perform feature recalibration in a global way, where per-channel, per-row, and per-column summary statistics are calculated and then used to selectively emphasize informative feature maps and suppress useless (e.g., redundant) ones. We first normalize the feature map with a sigmoid activation and multiply it with the original feature map, and then derive two transposed views of the input feature map for the row and column branches. We multiply these two feature maps with the further normalized feature maps, and then transpose them back to restore the original shape. Finally, we merge the three feature maps together to obtain the normalized feature map.

The normalized feature map is first passed through three convolutional layers to obtain three feature maps. We reshape the first feature map to $A \in \mathbb{R}^{C \times N}$, where $N = H \times W$, multiply $A$ by its transpose, and obtain the channel attention map $X \in \mathbb{R}^{C \times C}$ via a softmax. Each element of the attention map is computed as follows:

$$x_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{C} \exp(A_i \cdot A_j)} \qquad (1)$$

where $x_{ji}$ indicates the effect of the feature of the $i$-th channel on the $j$-th channel. We then perform a matrix multiplication between the channel attention map and the transpose of $A$, and multiply the result by a scale factor $\alpha$. Finally, we reshape it to the same shape as $A$ and add the two feature maps to get the final output feature map with selective channels $E^{ch}$:

$$E^{ch}_j = \alpha \sum_{i=1}^{C} \left( x_{ji} A_i \right) + A_j \qquad (2)$$
Figure 5: The 3rd-level triplet attention module.

where $\alpha$ is initialized to 0 and gradually learns to assign a larger weight. Similarly, we obtain the row attention and the column attention: we rescale the input feature with a row-wise multiplication to get the row attention map and repeat the multiplication and addition above to obtain the recalibrated output $E^{row}$, and with a column-wise multiplication we get the column attention map and the recalibrated output $E^{col}$. It is worth mentioning that we can transpose both the row attention and the column attention into the form of the channel attention, and then flexibly transpose back.

The three outputs are summed to obtain the final feature representation:

$$F = a\,E^{ch} + b\,E^{row} + c\,E^{col} \qquad (3)$$

where $F$ is the final feature map merging the three attention maps from the three branches, and $a$, $b$, and $c$ represent the weight of each attention map. Figure 12 shows the validation loss curves of 6 groups of different $a$, $b$, and $c$ settings trained on the UCF_CC_50 dataset. With the selected setting, the overall model converges better.
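
For illustration, the sketch below implements the attention branch of Eqs. (1)-(2) and reuses it for rows and columns via transposition, fusing the three outputs as in Eq. (3); the preliminary convolutions mentioned above are omitted for brevity, and the fusion weights are placeholder constants rather than the values used in the paper.

```python
import torch
import torch.nn as nn

class ChannelBranch(nn.Module):
    """Self-attention over one dimension (Eqs. (1)-(2)); which dimension plays the
    'channel' role depends on how the input is permuted before the call."""
    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(1))  # scale factor, initialized to 0

    def forward(self, x):
        b, c, h, w = x.shape
        a = x.reshape(b, c, -1)                                     # A: (B, C, N), N = H*W
        attn = torch.softmax(torch.bmm(a, a.transpose(1, 2)), -1)   # (B, C, C), Eq. (1)
        out = torch.bmm(attn, a).reshape(b, c, h, w)                # re-weighted maps
        return self.alpha * out + x                                 # Eq. (2)

class TripletAttention(nn.Module):
    """Channel, row, and column branches fused as in Eq. (3)."""
    def __init__(self, weights=(1.0, 1.0, 1.0)):  # placeholder fusion weights a, b, c
        super().__init__()
        self.ch, self.row, self.col = ChannelBranch(), ChannelBranch(), ChannelBranch()
        self.weights = weights

    def forward(self, x):
        a, b, c = self.weights
        e_ch = self.ch(x)
        # Rows / columns play the 'channel' role after permutation, then permute back.
        e_row = self.row(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        e_col = self.col(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        return a * e_ch + b * e_row + c * e_col                     # Eq. (3)
```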

Finally, the merged feature map is passed through a 3×3 convolution with a BN layer, followed by four dilated convolution layers whose dilation rates are set to 2 and whose kernel sizes are all 3×3 to limit the complexity of the network structure. The entire network adopts ReLU as the activation function. To map the feature map to the density map, we use a 1×1 filter to produce the final predicted density map at 1/4 of the resolution of the original image.
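
A minimal sketch of this output head follows; the intermediate channel width is an assumed value, not one specified in the text.

```python
import torch.nn as nn

def make_output_head(in_channels, mid_channels=64):
    """Density-map head: 3x3 conv + BN, four 3x3 dilated convs (dilation 2) with ReLU,
    and a final 1x1 conv. The intermediate channel width is an assumed value."""
    layers = [
        nn.Conv2d(in_channels, mid_channels, kernel_size=3, padding=1),
        nn.BatchNorm2d(mid_channels),
        nn.ReLU(inplace=True),
    ]
    for _ in range(4):  # four dilated convolution layers with dilation rate 2
        layers += [
            nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=2, dilation=2),
            nn.ReLU(inplace=True),
        ]
    layers.append(nn.Conv2d(mid_channels, 1, kernel_size=1))  # 1x1 conv -> density map
    return nn.Sequential(*layers)
```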

Implementation Details and Loss Function

In order to supervise our regression model, we use the Euclidean distance between the estimated density map and the ground-truth density map. The loss function is defined as:

$$L(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| D(X_i; \Theta) - D_i^{GT} \right\|_2^2 \qquad (4)$$

where $N$ represents the number of training image patches, $D_i^{GT}$ is the ground-truth density map, and $D(X_i; \Theta)$ is the density map estimated by our regression module with learnable parameters $\Theta$. The ground-truth density map $D^{GT}$, i.e., the density distribution of objects in an image, is computed as below:

$$D^{GT}(x) = \sum_{x_i \in S} \mathcal{N}(x;\, x_i,\, \sigma^2 I) \qquad (5)$$

where $\mathcal{N}(x;\, x_i,\, \sigma^2 I)$ represents a two-dimensional Gaussian distribution, $x_i$ is an annotated head position from the annotation set $S$, and the covariance matrix is $\sigma^2 I$. With this density map $D^{GT}$, the number of people can be computed as below:

$$C = \sum_{x} D^{GT}(x) \qquad (6)$$

Note that the predicted number of people is obtained analogously by integrating the pixels of the predicted density map.
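
A possible implementation of the ground-truth density map generation in Eqs. (5)-(6) is sketched below, assuming a fixed Gaussian standard deviation; the placeholder value of sigma is not the covariance used in the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def generate_density_map(points, height, width, sigma=4.0):
    """Build a ground-truth density map from head annotations (Eq. (5)).

    Each annotated point contributes a unit mass blurred by a fixed Gaussian, so
    summing the map recovers the head count (Eq. (6)). The value of sigma here is
    only a placeholder for the fixed covariance used in the paper."""
    density = np.zeros((height, width), dtype=np.float32)
    for x, y in points:  # (x, y) head-center annotations
        col, row = min(int(x), width - 1), min(int(y), height - 1)
        density[row, col] += 1.0
    return gaussian_filter(density, sigma)  # density.sum() ~ number of annotated heads
```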

At the training stage, we initialize the end-to-end model with parameters from the pre-trained VGG network and use the Adam optimizer [17]. For training the whole network, the learning rate is set to 0.0001, and the batch size is 1 on the UCF_CC_50 dataset and 6 on the other benchmark datasets.
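
A minimal training-step sketch under these settings is given below; `model` stands for any network mapping an image batch to a density map batch, and the loss follows Eq. (4).

```python
import torch

def train_step(model, optimizer, image, gt_density):
    """One optimization step with the Euclidean loss of Eq. (4); `model` stands for
    any network mapping an image batch to a density map batch."""
    model.train()
    optimizer.zero_grad()
    pred = model(image)                                            # estimated density map
    loss = ((pred - gt_density) ** 2).sum() / (2 * image.size(0))  # Eq. (4)
    loss.backward()
    optimizer.step()
    return loss.item()

# Example optimizer setup with the learning rate from the text:
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
```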

At the testing stage, we feed the whole input image into the network instead of extracting image patches from the original input image. Our approach is implemented with the PyTorch framework [26].

Experiments

We evaluate our proposed method on three different crowd counting datasets, i.e., UCF_CC_50 dataset  [10], ShanghaiTech dataset  [39], and UCF-QNRF  [11].

The UCF_CC_50 dataset is very challenging: the distribution of crowd density varies considerably, and there are serious problems with occlusion and rapid changes in the number of people per image. More specifically, the dataset has only 50 images with a total of 63,974 head center annotations provided. The number of heads per image varies from 84 to 4,543 (1,280 on average). Following [10], we conduct five-fold cross validation and report the average test performance.

The ShanghaiTech Part-B dataset is one of the largest open crowd counting datasets in terms of the number of annotated people. It contains 716 fixed-size images taken from busy streets, covering 88,488 people in total with head center annotations provided. The number of people per image ranges from 9 to 578. Following [39], we take 400 images for training and the remaining 316 for evaluation.

The UCF-QNRF dataset is a new crowd counting dataset consisting of 1,525 images with a total of 1.25 million annotations, with crowd counts between 49 and 12,865. It is split into training and testing subsets of 1,201 and 534 images, respectively.

Note that we fix the covariance matrix of the Gaussian function used to generate the ground-truth density maps for the UCF_CC_50, ShanghaiTech Part-B, and UCF-QNRF datasets. For all benchmark datasets, we follow the data augmentation and data processing techniques used in [6].

Regarding the evaluation metrics, we adopt the Mean Absolute Error (MAE) and the rooted Mean Squared Error (MSE), which are defined as

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - \hat{C}_i \right|, \qquad \mathrm{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( C_i - \hat{C}_i \right)^2} \qquad (7)$$

where $N$ represents the number of images in the test set, $C_i$ represents the actual count in the $i$-th image, and $\hat{C}_i$ represents the predicted count in the $i$-th image. MAE measures the accuracy of a crowd counting algorithm, while MSE measures its robustness.
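
A small sketch of how these two metrics can be computed over a test set:

```python
import numpy as np

def mae_mse(pred_counts, gt_counts):
    """Compute MAE and rooted MSE over a test set as in Eq. (7)."""
    pred = np.asarray(pred_counts, dtype=np.float64)
    gt = np.asarray(gt_counts, dtype=np.float64)
    mae = np.mean(np.abs(pred - gt))
    mse = np.sqrt(np.mean((pred - gt) ** 2))
    return mae, mse
```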

Comparison with State-of-the-art

We compare our proposed MLAttnCNN with a series of recent advanced approaches, i.e., MCNN [39], Cascade-CNN [31], Switch-CNN [28], AT-CSRNet [42], CL-CNN [11], DRSAN [21], CSRNet [19], TEDNet [15], SANet [2], TDF-CNN [27], ASD [35], SL2R [23], PACNN [29], and CAN [22].

For fair comparison, we train each model on the same training set with the same data augmentation tricks, and evaluate on the same testing set.

UCF_CC_50 dataset

We summarize the quantitative results in Table 1, from which we can clearly observe that our approach outperforms the state-of-the-art methods, achieving the best result with an MAE of 200.8. This suggests that our proposed method is able to achieve good performance even when the training data is limited.

Method Venue/Year MAE MSE
DRSAN IJCAI/2018 219.2 250.2
SANet ECCV/2018 258.4 334.9
TDF-CNN AAAI/2018 354.7 491.4
SL2R PAMI/2019 279.6 408.1
PACNN CVPR/2019 241.7 320.7
CAN CVPR/2019 212.2 243.7
MLAttnCNN AAAI/2020 200.8 273.8
Table 1: Estimation errors on the UCF_CC_50 dataset.
Figure 6: Visualization of the predicted density maps on the UCF_CC_50 dataset.

Figure 7: Comparisons of estimated count in UCF_CC_50 dataset.

To better understand the advantages of our proposed approach, we visualize some examples of density maps estimated on the UCF_CC_50 dataset in Figure 6. We can see that our predicted density maps are of high quality and closely follow the actual crowd distribution of the corresponding crowd images.

We also compare the predicted and actual counts on the UCF_CC_50 dataset, as shown in Figure 7. Our method is superior to CAN across most data splits, which demonstrates the robustness and effectiveness of our proposed method. For most images, the predicted count is relatively close to the actual count, but when the crowd density of an image is particularly high, the prediction error is also relatively large. We believe the reason may be that the training data is insufficient and lacks highly crowded images, so our model cannot learn such image features well during training.

ShanghaiTech Part-B dataset

Table 2 presents all the quantitative results. As we can observe, compared with other methods, we achieve the best MAE of 7.5 and MSE of 11.6. We also provide visualization results in Figure 9. We divide all the testing images into 10 groups according to the crowd count and report the average count of each group; the estimated count comes from the prediction of our network. As shown in Figure 8, our method is superior to CAN across the data splits, which demonstrates its robustness and effectiveness.

Method Venue/Year MAE MSE
MCNN CVPR/2016 26.4 41.3
Cascade-CNN AVSS/2017 20.0 31.1
Switch-CNN CVPR/2017 21.6 33.4
CSRNet CVPR/2018 10.6 16.0
SANet ECCV/2018 8.4 13.6
AT-CSRNet CVPR/2019 8.11 13.53
TEDNet CVPR/2019 8.2 12.8
CAN CVPR/2019 7.8 12.2
MLAttnCNN AAAI/2020 7.5 11.6
Table 2: Estimation errors on the ShanghaiTech Part-B dataset.

Figure 8: Comparisons of average counting estimates over 10 splits of the ShanghaiTech Part-B dataset, grouped by increasing number of people per image.

Figure 9: Visual comparison of density maps on the ShanghaiTech Part-B dataset.

UCF-QNRF dataset

We compare our proposed MLAttnCNN with several baselines and report the results in Table 3. On this dataset, we achieve the best MAE of 101.2 and MSE of 174.6. We also visualize some examples from this dataset in Figure 10. Again, we can observe that our proposed MLAttnCNN model consistently improves the counting performance.

Method Venue/Year MAE MSE
MCNN CVPR/2016 277 426
Cascade-CNN AVSS/2017 252 514
Switch-CNN CVPR/2017 228 445
CL-CNN ECCV/2018 132 191
TEDNet CVPR/2019 113 188
CAN CVPR/2019 107 183
MLAttnCNN AAAI/2020 101 175
Table 3: Estimation errors on the UCF-QNRF dataset.

Figure 10: Visualization of the predicted density maps on the UCF-QNRF dataset.

Ablation Study

We conduct an ablation study on the UCF_CC_50 dataset to justify the design of our proposed MLAttnCNN.

Effectiveness of backbone structure

We choose the VGG-16 structure as our backbone for initial feature extraction. Previous density counting research [19, 21, 3, 32] generally retains the first 5 convolutional blocks of VGG and deletes the last two pooling layers, giving a downsampling factor of 8. Since too many pooling layers lead to the loss of spatial information, we remove three pooling layers and change the downsampling factor to 4 in order to avoid this and improve accuracy.
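
A sketch of how such a truncated backbone can be built from torchvision's VGG-16 is given below; the exact layer indices are our assumption about where to cut, following the 2+MP, 2+MP, 3, 3 configuration reported in Table 4.

```python
import torch.nn as nn
from torchvision.models import vgg16

def build_backbone():
    """Truncated VGG-16 backbone with a downsampling factor of 4: keep the first four
    convolutional blocks (2+MP, 2+MP, 3, 3) and drop the third max-pooling layer.
    The torchvision layer indices below are our assumption about the exact cut."""
    features = vgg16(pretrained=True).features  # on newer torchvision, use the `weights` argument
    kept = [layer for i, layer in enumerate(features[:23]) if i != 16]  # index 16 = 3rd max pool
    return nn.Sequential(*kept)
```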

VGG-16 structure MAE MSE
3 Pool (2+MP, 2+MP, 3+MP, 3) 233.4 315.6
2 Pool (2+MP, 2+MP, 3, 3) 200.8 273.8
Table 4: Results for different numbers of pooling layers on the UCF_CC_50 dataset. An entry n denotes n consecutive convolution layers with 3×3 kernels; "+MP" represents a max pooling layer.

The results of the comparison experiments are reported in Table 4, which shows that reducing the number of pooling layers lowers both the MAE and MSE of our proposed MLAttnCNN, resulting in a more accurate density map and improved crowd counting performance.

Effectiveness of Multi-Scale Pooling Module

Scales UCF_CC_50
MAE MSE
MS1 (4 avg pooling layers) 221.2 276.6
MS2 (4 avg pooling layers) 241.3 324.0
MS3 (5 avg pooling layers) 200.8 273.8
MS4 (4 avg pooling layers) 244.1 338.5
Table 5: Comparison of results on the UCF_CC_50 dataset for varying pooling bin sizes.

We studied four varying multi-scale (MS) pooling bin sizes to assemble the model, i.e.,

  • MS1: 1×1, 2×2, 3×3, and 6×6 avg pooling layers.

  • MS2: 3×3, 5×5, 7×7, and 9×9 avg pooling layers.

  • MS3: 1×1, 3×3, 5×5, 7×7, and 9×9 avg pooling layers.

  • MS4: 1×1, 3×3, 5×5, and 7×7 avg pooling layers.

We compared these four options on both the ShanghaiTech dataset and the UCF_CC_50 dataset. As shown in Table 5, MS3 performs the best on the UCF_CC_50 dataset, and MS4 works the best on the ShanghaiTech dataset. We also observe that odd pooling bin sizes are more suitable for learning the density map than even-sized pooling kernels: with more effective semantic information, odd-sized pooling kernels can more accurately locate the spatial distribution of crowd density and improve counting accuracy. The UCF_CC_50 dataset has fewer images than ShanghaiTech Part-B and UCF-QNRF, and its crowds are denser with highly variable distributions, so filters with large receptive fields are more suitable for ground-truth density maps with larger heads.

Effectiveness of attention modules

We also study the multi-level attention modules on the UCF_CC_50 dataset with three combinations, as shown in Table 6 and Figure 11. We can see that combining the 1st-level channel-wise attention with the 2nd-level spatial attention works better than using the 1st-level attention only, and the combination of all three attention levels achieves the best performance.

Methods MAE MSE
1st-Level Attn 260.5 333.1
1st-Level + 2nd-Level Attn 246.2 330.3
1st-Level + 2nd-Level + 3rd-Level Attn 200.8 273.8
Table 6: Effect of multi-level attentive modules for crowd counting in our proposed MLAttnCNN on the UCF_CC_50 dataset.
Figure 11: Visualization of the effect of multi-level attention maps.

Effectiveness of hyperparameters in the 3rd-level triplet attention

We conduct six different sets of hyperparameter experiments on the UCF_CC_50 dataset. From the results in Figure 12, we can observe that our proposed MLAttnCNN achieves faster and better convergence when we use the selected hyperparameters in the 3rd-level triplet attention.

Figure 12: The convergence performance for different network hyperparameters in the UCF_CC_50 dataset.

Conclusion

In this paper, we propose a multi-level attentive Convolutional Neural Network for crowd counting. We employ multi-scale pooling and multi-level attentions to explore the underlying contextual details for generating a high-quality density map. Our approach not only facilitates the semantic perception of heads at different scales, but also enriches the features of different layers for fusion. Experiments show that it is particularly suitable for extremely dense crowds. Extensive experiments strongly demonstrate that our approach is robust and outperforms state-of-the-art approaches on multiple crowd counting benchmark datasets.

Our future work includes extending the current work to solve crowd counting problems in videos.

Acknowledgments

This research is supported by the National Natural Science Foundation of China [grant numbers 42071449, 41601491].

References

  • [1] L. Boominathan, S. S. Kruthiventi, and R. V. Babu (2016) Crowdnet: a deep convolutional network for dense crowd counting. In ACM MM, pp. 640–644. Cited by: Introduction, Related Work.
  • [2] X. Cao, Z. Wang, Y. Zhao, and F. Su (2018) Scale aggregation network for accurate and efficient crowd counting. In ECCV, pp. 734–750. Cited by: Introduction, Comparison with State-of-the-art.
  • [3] X. Chen, Y. Bin, N. Sang, and C. Gao (2019-01) Scale pyramid network for crowd counting. In WACV, Vol. , pp. 1941–1950. External Links: Document, ISSN 1550-5790 Cited by: Effectiveness of backbone structure.
  • [4] F. Dai, H. Liu, Y. Ma, J. Cao, Q. Zhao, and Y. Zhang (2019-06) Dense Scale Network for Crowd Counting. arXiv e-prints, pp. arXiv:1906.09707. External Links: 1906.09707 Cited by: Related Work.
  • [5] B. Ding, C. Long, L. Zhang, and C. Xiao (2019) ARGAN: attentive recurrent generative adversarial network for shadow detection and removal. In ICCV, pp. 10213–10222. Cited by: Introduction.
  • [6] J. Gao, W. Lin, B. Zhao, D. Wang, C. Gao, and J. Wen (2019) C^3 framework: an open-source PyTorch code for crowd counting. arXiv preprint arXiv:1907.02724. Cited by: Experiments.
  • [7] W. Ge and R. T. Collins (2009) Marked point processes for crowd counting. In CVPR 2009, Cited by: Introduction.
  • [8] M. Hossain, M. Hosseinzadeh, O. Chanda, and Y. Wang (2019) Crowd counting using scale-aware attention networks. Cited by: Related Work.
  • [9] T. Hu, C. Long, and C. Xiao (2021) A novel visual representation on text using diverse conditional gan for visual recognition. T-IP 30, pp. 3499–3512. Cited by: Introduction.
  • [10] H. Idrees, I. Saleemi, C. Seibert, and M. Shah (2013) Multi-source multi-scale counting in extremely dense crowd images. In CVPR, Cited by: Introduction, Experiments, Experiments.
  • [11] H. Idrees, M. Tayyab, K. Athrey, D. Zhang, S. Al-Maadeed, N. Rajpoot, and M. Shah (2018) Composition loss for counting, density map estimation and localization in dense crowds. Cited by: Comparison with State-of-the-art, Experiments.
  • [12] A. Islam, C. Long, A. Basharat, and A. Hoogs (2020) DOA-gan: dual-order attentive generative adversarial network for image copy-move forgery detection and localization. In CVPR, Cited by: Introduction.
  • [13] A. Islam, C. Long, and R. Radke (2021) A hybrid attention mechanism for weakly-supervised temporal action localization. In AAAI, Cited by: Introduction.
  • [14] L. Jiang, C. Gao, D. Meng, and A. G. Hauptmann (2017) DecideNet: counting varying density crowds through attention guided detection and density estimation. Cited by: Introduction, Related Work.
  • [15] X. Jiang, Z. Xiao, B. Zhang, X. Zhen, X. Cao, D. Doermann, and L. Shao (2019) Crowd counting and density estimation by trellis encoder-decoder networks. In CVPR, pp. 6133–6142. Cited by: Comparison with State-of-the-art.
  • [16] D. Kang and A. Chan (2018) Crowd counting by adaptively fusing predictions from an image pyramid. arXiv preprint arXiv:1805.06115. Cited by: Introduction, Related Work.
  • [17] D. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. Computer Science. Cited by: Implementation Details and Loss Function.
  • [18] V. Lempitsky and A. Zisserman (2010) Learning to count objects in images. In NIPS, pp. 1324–1332. Cited by: Related Work.
  • [19] Y. Li, X. Zhang, and D. Chen (2018) Csrnet: dilated convolutional neural networks for understanding the highly congested scenes. In CVPR, pp. 1091–1100. Cited by: Related Work, Comparison with State-of-the-art, Effectiveness of backbone structure.
  • [20] D. Liu, C. Long, H. Zhang, H. Yu, X. Dong, and C. Xiao (2020) ARShadowGAN: shadow generative adversarial network for augmented reality in single light scenes. In CVPR, Cited by: Introduction.
  • [21] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin (2018) Crowd counting using deep recurrent spatial-aware network. In IJCAI, Cited by: Comparison with State-of-the-art, Effectiveness of backbone structure.
  • [22] W. Liu, M. Salzmann, and P. Fua (2019-06) Context-aware crowd counting. In CVPR, Cited by: Comparison with State-of-the-art.
  • [23] X. Liu, J. Van De Weijer, and A. D. Bagdanov (2019) Exploiting unlabeled data in CNNs by self-supervised learning to rank. PAMI. Cited by: Comparison with State-of-the-art.
  • [24] D. Oñoro-Rubio and R. J. López-Sastre (2016) Towards perspective-free object counting with deep learning. In ECCV, Cited by: Related Work.
  • [25] N. Paragios and V. Ramesh (2001) A mrf-based approach for real-time subway monitoring. In CVPR, Cited by: Introduction.
  • [26] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer (2017) Automatic differentiation in pytorch. Cited by: Implementation Details and Loss Function.
  • [27] D. B. Sam and R. V. Babu (2018) Top-down feedback for crowd counting convolutional neural network. In AAAI, Cited by: Comparison with State-of-the-art.
  • [28] D. B. Sam, S. Surya, and R. V. Babu (2017) Switching convolutional neural network for crowd counting. In CVPR, pp. 4031–4039. Cited by: Introduction, Related Work, Comparison with State-of-the-art.
  • [29] M. Shi, Z. Yang, C. Xu, and Q. Chen (2019) Revisiting perspective information for efficient crowd counting. In CVPR, pp. 7279–7288. Cited by: Comparison with State-of-the-art.
  • [30] K. Simonyan and A. Zisserman (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: Methodology.
  • [31] V. A. Sindagi and V. M. Patel (2017) Cnn-based cascaded multi-task learning of high-level prior and density estimation for crowd counting. In AVSS, pp. 1–6. Cited by: Comparison with State-of-the-art.
  • [32] V. A. Sindagi and V. M. Patel (2019) HA-ccn: hierarchical attention-based crowd counting network. TIP. Cited by: Effectiveness of backbone structure.
  • [33] J. Wei, C. Long, H. Zou, and C. Xiao (2019) Shadow inpainting and removal using generative adversarial networks with slice convolutions. CGF 38 (7), pp. 381–392. Cited by: Introduction.
  • [34] S. Woo, J. Park, J. Lee, and I. So Kweon (2018) Cbam: convolutional block attention module. In ECCV, pp. 3–19. Cited by: The 2nd-Level Spatial Attention Module.
  • [35] X. Wu, Y. Zheng, H. Ye, W. Hu, J. Yang, and L. He (2019) Adaptive scenario discovery for crowd counting. In ICASSP, pp. 2382–2386. Cited by: Comparison with State-of-the-art.
  • [36] F. Yu and V. Koltun (2015) Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. Cited by: Related Work.
  • [37] J. Zhang, C. Long, Y. Wang, H. Piao, H. Mei, X. Yang, and B. Yin (2021) A two-stage attentive network for single image super-resolution. TCSVT. Cited by: Introduction.
  • [38] J. Zhang, C. Long, Y. Wang, X. Yang, H. Mei, and B. Yin (2020) Multi-context and enhanced reconstruction network for single image super resolution. In ICME, Cited by: Introduction.
  • [39] Y. Zhang, D. Zhou, S. Chen, S. Gao, and M. Yi (2016) Single-image crowd counting via multi-column convolutional neural network. In CVPR, Cited by: Introduction, Related Work, Comparison with State-of-the-art, Experiments, Experiments.
  • [40] Y. Zhang, C. Zhou, F. Chang, and A. C. Kot (2018) Attention to head locations for crowd counting. arXiv preprint arXiv:1806.10287. Cited by: Introduction, Related Work.
  • [41] Y. Zhang, C. Zhou, F. Chang, and A. C. Kot (2019) Multi-resolution attention convolutional neural network for crowd counting. Neurocomputing 329, pp. 144–152. Cited by: Related Work.
  • [42] M. Zhao, J. Zhang, C. Zhang, and W. Zhang (2019) Leveraging heterogeneous auxiliary tasks to assist crowd counting. In CVPR, pp. 12736–12745. Cited by: Comparison with State-of-the-art.