Introduction
With a wide range of surveillance applications, crowd counting has been investigated continuously in recent years. It can be applied to safety monitoring, crowd estimation, traffic management, disaster relief, urban planning, and more. However, due to drastic scale variations, illumination changes, complex backgrounds, and the irregular distribution of people, crowd counting remains a very challenging task in practice.
Various advanced deep learning techniques based on detection [7], regression [10, 25], and density estimation [39, 2] have been proposed to deal with this challenge. In particular, density estimation based methods have achieved great advances. These techniques localize the crowd by predicting a density map as a pixel-wise regression and aggregate a final count as the integral of the density map. To generate maps that retain the spatial size of the input, deep Convolutional Neural Network (CNN) architectures [1, 28, 39] are widely applied to handle scale variations, either by combining network structures with different receptive field sizes or by fusing multi-context information to formulate a semantic feature representation that facilitates accurate pixel-wise regression. This motivates us to further investigate multi-scale and multi-context cues to exploit the potential of the underlying representation for more accurate pixel-wise estimation.
It is worth noting that the attention mechanism [5, 33, 12, 20, 38, 9, 13, 37] is able to heighten the sensitivity to the features containing the most valuable information. Several attempts have been made to incorporate attention mechanisms into crowd counting [14, 40, 16]. However, most existing works only use a simple attention or a single complicated attention to adjust the model weights or to determine the weights for generating a density map at different scales, which cannot handle complicated scenes well. This motivates us to further explore a multi-level attention mechanism that focuses on the key pieces of the feature space for crowd counting while filtering out irrelevant information.
In this paper, we propose a multi-level attentive Convolutional Neural Network (MLAttnCNN) for crowd counting, as shown in Figure 2.
In order to extract global and sub-regional contextual features, we introduce multi-scale spatial pooling with multiple bin sizes, which correspond to different perception fields for capturing different sizes of human heads and overcome the constraint of a fixed input size in deep networks with fully connected layers. Three-level attention modules are then fully explored to refine the contextual feature representation before we produce a density map with dilated convolutions and a 1×1 convolution.
We shall emphasize that the 1st-level channel-wise attention module and the 2nd-level spatial attention module are designed to recover more detailed information after upsampling at each pooling scale and to formulate more informative feature maps. With multi-scale fusion, we obtain rich characteristics for the feature representation. This enables the entire learning model to perceive multi-scale targets and incorporate contextual semantic features, which better preserves the underlying details for the upper levels.
Inspired by the success of channel-wise attention and spatial attention, we design the 3rd-level triplet attention module by exploring channel, row, and column attention, rescaling the intermediate features through channel, row, and column multiplications to further improve the quality of our feature representation. The intuition behind this design is that a one-dimensional attention mechanism imposes an attention weight on each element along the corresponding dimension, and a full consideration should therefore also cover the row and column information. The effects of the three-level attention maps are illustrated in Figure 1. We experimentally observe that with the triplet attention, our MLAttnCNN achieves more accurate estimation in crowd counting.
To sum up, our contributions are threefold:

We propose a multi-level attentive Convolutional Neural Network to enhance the selectivity in spatial and channel information, improving the effectiveness of the multi-scale cascade.

We design an elegant triplet attention module that extracts contextual features by exploring channel, row, and column attention, to further improve the quality of our feature representation.

We experimentally demonstrate that our proposed method achieves excellent performance on three common benchmark datasets, comparable to and even better than state-of-the-art approaches.
Related Work
The related work can be divided into two threads: density map regression for crowd counting and attention mechanisms for crowd counting.
Density map regression for crowd counting starts from [18], in which density maps are generated, with the minimal annotation of a single point blurred by a Gaussian kernel, for training counting networks. Hydra-CNN [24] was proposed to solve the multi-scale issue by extracting multiple overlapping image patches from a multi-scale pyramid of each input image. [1] combines a deep and a shallow fully convolutional network to capture high-level semantic features and address scale changes in high-density crowds. [28] and [39] use a multi-column network to handle the varying head sizes within an image. These networks add extra computational cost, and their inefficient branches do not adapt well to large-scale changes in crowd density, nor do they adequately address multi-scale changes at all levels or make good use of spatial and semantic features. As a result, the estimation accuracy in dense scenarios does not meet the needs of real-world applications.
Attention mechanisms for crowd counting are popular in recent works. [14] uses an attention mechanism to adjust the model weights adaptively according to the change of crowd density. [40] proposes an attention mechanism to estimate a probability map and uses the predicted probability map to suppress the non-head regions in density estimation. [16] feeds the final feature map into each scale to obtain an attention map, multiplies the softmax of each scale's attention map with the density map of that scale, and finally fuses all density maps with a 1×1 convolution. [8] achieves dense crowd counting with a global attentional network with local scale perception. [41] generates a score map using a multi-resolution attention model, in which head positions receive higher scores so that the network can focus on the head areas in a complex background while suppressing non-head areas.
In this work, we utilize a multi-scale pooling module with multi-level attentions to choose relevant and important multi-scale features. We employ dilated convolutions [36, 19, 4], which have been proven to provide more global multi-scale sensing information without using multiple inputs or complex networks.
Methodology
As shown in Figure 2, the pipeline of our proposed multi-level attentive convolutional neural network contains the following modules: feature extraction with a VGG-16 backbone, a multi-scale pooling module, the 1st-level channel-wise attention module, the 2nd-level spatial attention module, the 3rd-level triplet attention module, and dilated convolutions followed by a 1×1 convolution to generate a density map.
Given an input image, we employ the VGG-16 network [30] as the backbone for feature extraction. Note that we remove three pooling layers to preserve larger feature maps, which reduces the loss of spatial information. The final feature map is 1/4 the size of the input image, and it is fed into the multi-scale average pooling module to obtain context information. In this paper, we use five pooling scales with bin sizes of 1×1, 3×3, 5×5, 7×7, and 9×9, each with the same number of channels. The multi-scale features are then upsampled to the same size and fed into the 1st-level channel-wise attention module and the 2nd-level spatial attention module for further processing. All the scale features obtained after the first two attention levels are concatenated with the original feature map to form the contextual feature representation, which is fed into the 3rd-level triplet attention module, and finally into a series of dilated convolution layers and a 1×1 convolution layer to generate the density map.
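As a rough sketch of this multi-scale pooling stage (illustrative only, not the authors' implementation: the feature map, its resolution, and the upsampling mode are made-up stand-ins, with bin boundaries following the usual adaptive-average-pooling convention):

```python
import numpy as np

def adaptive_avg_pool(fmap, bins):
    """Average-pool a (C, H, W) feature map into a (C, bins, bins) grid."""
    C, H, W = fmap.shape
    out = np.zeros((C, bins, bins))
    for i in range(bins):
        for j in range(bins):
            h0, h1 = (i * H) // bins, -(-(i + 1) * H // bins)   # floor / ceil
            w0, w1 = (j * W) // bins, -(-(j + 1) * W // bins)
            out[:, i, j] = fmap[:, h0:h1, w0:w1].mean(axis=(1, 2))
    return out

def upsample_nearest(fmap, H, W):
    """Nearest-neighbour upsampling of a (C, h, w) map back to (C, H, W)."""
    C, h, w = fmap.shape
    rows = (np.arange(H) * h) // H
    cols = (np.arange(W) * w) // W
    return fmap[:, rows][:, :, cols]

rng = np.random.default_rng(0)
feat = rng.random((64, 24, 24))        # stand-in backbone output at 1/4 resolution
scales = (1, 3, 5, 7, 9)               # the five bin sizes used in the paper
pooled = [upsample_nearest(adaptive_avg_pool(feat, b), 24, 24) for b in scales]
context = np.concatenate([feat] + pooled)   # channel-wise concatenation
```

The 1×1 scale reduces to a global average broadcast over the whole map, which is what provides the fully global context among the five scales.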
After obtaining five feature maps at different scales, simply merging them may not be the most effective strategy. The scale of each object varies with its location in the image because of the scene perspective. Especially for the scale changes of human heads, direct fusion easily loses spatial information and causes blurring. Therefore, each position of the feature map needs different fusion weights, so we employ multi-level attentions to estimate an attention map for each scale, which enriches the information and enables more efficient multi-scale integration.
We discuss the three levels of attention modules with implementation details as follows.
The 1st-Level Channel-wise Attention Module
We use a channel-wise attention module that takes the high-level feature map as input, generates a channel attention map, and then uses it to rescale the feature map along the channel dimension, as illustrated in Figure 3. For a convolutional feature map F, we first squeeze it in the spatial dimension by global average pooling to get a 1×1×C descriptor, which is then passed through two fully connected (FC) layers and a Sigmoid layer. The first FC layer produces a feature vector of size 1×1×C/r, where C is the number of channels and r is a scaling parameter whose purpose is to reduce the number of channels and thus the amount of computation. It is followed by a ReLU activation, which leaves the dimension of its output unchanged. The output dimension of the second FC layer is 1×1×C. Finally, we obtain the channel weight vector via the Sigmoid, and the channel-wise attention map by an element-wise multiplication between the weight vector and F. We merge the original CNN feature map and the channel-wise attention map as the input for the 2nd-level spatial attention module.
The 2nd-Level Spatial Attention Module
We then make the output channel-wise attention maps focus on spatial attention. Different from the 1st-level channel-wise attention, which enhances the inter-channel correlation, the 2nd-level spatial attention is designed to focus on location information, selecting attentive areas to enhance the response of the feature map, as shown in Figure 4. We adopt the spatial attention module of [34] and make an element-wise multiplication between the channel-wise attention map and the spatial attention map to actuate the feature map before concatenating it with the original CNN feature. First, max pooling and average pooling are applied along the channel dimension to get two feature maps, which are merged by concatenation. Finally, the spatial attention map is generated by a convolution operation, with each spatial layer followed by a batch normalization (BN) layer and ReLU.
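The first two attention levels can be sketched in NumPy roughly as follows. This is an illustrative approximation, not the authors' implementation: the FC and convolution weights are random stand-ins for learned parameters, and the 7×7 spatial kernel size is an assumption borrowed from CBAM [34].

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(fmap, W1, W2):
    """1st-level (SE-style) attention: squeeze (C, H, W) -> (C,), excite, rescale."""
    z = fmap.mean(axis=(1, 2))            # global average pooling, 1x1xC
    s = np.maximum(z @ W1, 0.0)           # FC down to C/r units + ReLU
    w = sigmoid(s @ W2)                   # FC back up to C units + Sigmoid
    return fmap * w[:, None, None]        # per-channel rescaling

def spatial_attention(fmap, kernel):
    """2nd-level (CBAM-style) attention: pool over channels, k x k conv, sigmoid."""
    pooled = np.stack([fmap.max(axis=0), fmap.mean(axis=0)])   # (2, H, W)
    k = kernel.shape[-1]
    p = k // 2
    padded = np.pad(pooled, ((0, 0), (p, p), (p, p)))
    H, W = fmap.shape[1:]
    attn = np.zeros((H, W))
    for i in range(H):                    # naive 2-D cross-correlation
        for j in range(W):
            attn[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return fmap * sigmoid(attn)[None]     # broadcast the (H, W) map over channels

rng = np.random.default_rng(0)
C, r = 64, 16
feat = rng.random((C, 12, 12))            # stand-in for one upsampled scale feature
out = spatial_attention(
    channel_attention(feat, rng.standard_normal((C, C // r)),
                      rng.standard_normal((C // r, C))),
    rng.standard_normal((2, 7, 7)) * 0.1)
```

In the pipeline this pair of operations is applied independently to each of the five upsampled scale features before concatenation.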
The 3rd-Level Triplet Attention Module
We concatenate the five multi-scale actuated attention maps with the original CNN features and use the result as the input to the 3rd-level triplet attention. As demonstrated in Figure 5, the 3rd-level triplet attention has three branches, composed of a channel attention, a row attention, and a column attention. We execute the three branches' attention mechanisms respectively to obtain three feature maps.
The aim of the triplet attention is to perform feature recalibration in a global way, where per-channel, per-row, and per-column summary statistics are calculated and then used to selectively emphasize informative feature maps as well as suppress useless ones (e.g., redundant feature maps). We first normalize the feature map with a sigmoid activation and multiply it with the original feature map, and then perform two conversions from the input feature map. We multiply these two feature maps with the further normalized feature maps, then transpose to restore the original shape. Finally, we merge the three feature maps together to obtain the normalized feature map.
The normalized feature map first passes through three convolutional layers to obtain three feature maps. We reshape the first feature map to $A \in \mathbb{R}^{C \times N}$, where $N = H \times W$, multiply $A$ by its transpose, and obtain the channel attention map $X \in \mathbb{R}^{C \times C}$ by a softmax. Each element of the attention map is

$$x_{ji} = \frac{\exp(A_i \cdot A_j)}{\sum_{i=1}^{C} \exp(A_i \cdot A_j)}, \qquad (1)$$

where $x_{ji}$ indicates the effect of the feature of the $i$-th channel on the $j$-th channel. We then multiply the channel attention map $X$ and $A$ by matrix multiplication and scale the result by a factor $\beta$. Finally, we reshape it to the same shape as the input and add the two feature maps to get the final output feature map with selective channels:

$$E_j = \beta \sum_{i=1}^{C} \left( x_{ji} A_i \right) + A_j, \qquad (2)$$

where $\beta$ is initialized to 0 and gradually learns to be assigned a larger weight. Similarly, we obtain the row attention and the column attention: we rescale the input feature with a row multiplication to get the row attention map and repeat the previous multiplication and addition to obtain the recalibrated output $E^{row}$, and likewise with a column multiplication to get the column attention map and the recalibrated output $E^{col}$. It is worth mentioning that we can transpose both the row attention and the column attention into the form of the channel attention, and then transpose back flexibly.
The three outputs are summed to obtain the final feature representation:
$$F = a E^{ch} + b E^{row} + c E^{col}, \qquad (3)$$

where $F$ is the final feature map merging the three attention branches, and $a$, $b$, and $c$ represent the weights of the attention maps from the three branches. Figure 12 shows the validation loss curves of six groups of different $a$, $b$, and $c$ settings trained on the UCF_CC_50 dataset; with a suitable setting of $a$, $b$, and $c$, the overall model converges better.
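A minimal NumPy sketch of the channel branch of Eqs. (1)–(2), with the row and column branches obtained by transposing the corresponding axis into the leading position, as described above. The learned convolutions are omitted, $\beta$ is fixed instead of learned, and the branch weights are made-up values:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def triplet_channel_branch(fmap, beta=0.5):
    """Channel branch: attention over the leading axis of a (C, H, W) map.
    x_ji = softmax_i(A_i . A_j); E_j = beta * sum_i(x_ji * A_i) + A_j."""
    C, H, W = fmap.shape
    A = fmap.reshape(C, H * W)
    X = softmax(A @ A.T, axis=-1)        # (C, C) channel attention map, eq. (1)
    E = beta * (X @ A) + A               # recalibration + residual, eq. (2)
    return E.reshape(C, H, W)

feat = np.random.default_rng(2).random((16, 8, 8))
E_ch = triplet_channel_branch(feat)
# Row / column branches: move the row (or column) axis to the front,
# apply the same mechanism, then transpose back.
E_row = np.transpose(triplet_channel_branch(np.transpose(feat, (1, 0, 2))), (1, 0, 2))
E_col = np.transpose(triplet_channel_branch(np.transpose(feat, (2, 1, 0))), (2, 1, 0))
a, b, c = 1.0, 1.0, 1.0                  # branch weights of eq. (3), made-up values
F = a * E_ch + b * E_row + c * E_col
```

With $\beta = 0$ each branch reduces to the identity, which is why initializing $\beta$ at 0 lets the network start from the unmodified features and learn how much recalibration to apply.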
Finally, the merged feature map is passed through a 3×3 convolution kernel with a BN layer. We then construct four dilated convolution layers with the dilation rate set to 2 and all kernel sizes set to 3×3 to reduce the complexity of the network structure. The entire network adopts ReLU as the activation function. To map the features to the density map, we use a 1×1 filter to produce the final predicted density map and upsample it to 1/4 of the resolution of the original image.
Implementation Details and Loss Function
To supervise our regression model, we use the Euclidean distance to measure the difference between the estimated density map and the ground-truth density map. The loss function is defined as
$$L(\Theta) = \frac{1}{2N} \sum_{i=1}^{N} \left\| D(X_i; \Theta) - D_i^{GT} \right\|_2^2, \qquad (4)$$

where $N$ represents the number of training image patches, $D_i^{GT}$ is the ground-truth density map, and $D(X_i; \Theta)$ is the density map estimated by our regression module with learnable parameters $\Theta$. The ground-truth density map, which describes the density distribution of the objects in an image, is computed as

$$D^{GT}(x) = \sum_{x_i \in S} \mathcal{N}\!\left(x; x_i, \sigma^2 I\right), \qquad (5)$$

where $\mathcal{N}(x; x_i, \sigma^2 I)$ represents a two-dimensional Gaussian distribution, $x_i$ is an annotated head position in the annotation set $S$, and $\sigma^2 I$ is the covariance matrix. With this density map, the number of people can be computed as

$$C = \sum_{x} D(x). \qquad (6)$$

Note that the number of people is obtained by integrating over the pixels of the predicted density map.
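A minimal sketch of the ground-truth density map construction of Eqs. (5)–(6): each head annotation contributes a Gaussian blob normalized to sum to one, so integrating the map recovers the head count. The head positions and $\sigma$ below are made-up values for illustration:

```python
import numpy as np

def density_map(points, H, W, sigma=4.0):
    """Ground-truth density map: one 2-D Gaussian per annotated head,
    each normalized to sum to 1 so the map integrates to the head count."""
    D = np.zeros((H, W))
    ys, xs = np.mgrid[0:H, 0:W]
    for (px, py) in points:
        g = np.exp(-((xs - px) ** 2 + (ys - py) ** 2) / (2.0 * sigma ** 2))
        D += g / g.sum()                 # each blob sums to exactly 1
    return D

heads = [(10, 12), (30, 8), (22, 40)]    # made-up (x, y) head annotations
D = density_map(heads, 48, 64)
count = D.sum()                          # recovers len(heads)
```

Normalizing each blob by its discrete sum (rather than using the analytic Gaussian constant) keeps the integral exact even when a head lies near the image border.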
At the training stage, we initialize the end-to-end model with some parameters from the pre-trained VGG network and use the Adam optimizer [17] for training. The learning rate is set to 0.0001; the batch size is 1 on the UCF_CC_50 dataset and 6 on the other benchmark datasets.
Experiments
We evaluate our proposed method on three different crowd counting datasets, i.e., the UCF_CC_50 dataset [10], the ShanghaiTech dataset [39], and UCF-QNRF [11].
The UCF_CC_50 dataset is very challenging: the distribution of crowd density varies considerably, and there are serious occlusions and rapid changes in the number of people per image. More specifically, the dataset has only 50 images with a total of 63,974 head-center annotations provided. The head count varies from 94 to 4,543 (1,280 on average) per image. Following [10], we conduct five-fold cross-validation and report the average test performance.
The ShanghaiTech Part B dataset is one of the largest open crowd counting datasets in terms of the number of annotated people. It contains 716 fixed-size images taken from busy streets, covering 88,488 people in total with head-center annotations provided. The number of people per image ranges from 9 to 578. Following [39], we take 400 images for training and the remaining 316 for evaluation.
The UCF-QNRF dataset is a newer crowd counting dataset consisting of 1,525 images with a total of 1.25 million annotations, with crowd counts between 49 and 12,865. It is split into training and testing subsets of 1,201 and 534 images, respectively.
Note that we fix the covariance matrix of the Gaussian kernel used to generate the ground-truth density maps on the UCF_CC_50, ShanghaiTech Part B, and UCF-QNRF datasets. For all benchmark datasets, we follow the data augmentation and data processing techniques used in [6].
Regarding the evaluation metrics, we adopt the Mean Absolute Error (MAE) and the rooted Mean Square Error (MSE), which are defined as
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| C_i - C_i^{GT} \right|, \qquad \mathrm{MSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( C_i - C_i^{GT} \right)^2}, \qquad (7)$$

where $N$ represents the number of images in the test set, $C_i^{GT}$ represents the actual count in the $i$-th image, and $C_i$ represents the predicted count in the $i$-th image. MAE measures the accuracy of a crowd counting algorithm, while MSE measures its robustness.
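The two metrics can be computed directly from the per-image counts; a small worked example with made-up counts:

```python
import numpy as np

def mae(pred, gt):
    """Mean Absolute Error over per-image counts."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(gt))))

def rmse(pred, gt):
    """Rooted Mean Square Error over per-image counts."""
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(gt)) ** 2)))

pred = [105.0, 210.0, 48.0]    # made-up predicted counts
gt = [100.0, 200.0, 50.0]      # made-up ground-truth counts
# mae: (5 + 10 + 2) / 3;  rmse: sqrt((25 + 100 + 4) / 3)
```

Because the square root is taken after averaging, MSE penalizes a single large miscount more heavily than MAE, which is why it is read as a robustness indicator.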
Comparison with State-of-the-Art
We compare our proposed MLAttnCNN with a series of recent advanced approaches, i.e., MCNN [39], CascadeCNN [31], SwitchCNN [28], ATCSRNet [42], CLCNN [11], DRSAN [21], CSRNet [19], TEDNet [15], SANet [2], TDFCNN [27], ASD [35], SL2R [23], PACNN [29], and CAN [22].
For fair comparison, we train each model on the same training set with the same data augmentation tricks, and evaluate on the same testing set.
UCF_CC_50 dataset
We summarize the quantitative results in Table 1, from which we can clearly observe that our approach outperforms the state-of-the-art methods, achieving the best MAE of 200.8. This suggests that our proposed method achieves good performance even when the training data is insufficient.
Method  Venue/Year  MAE  MSE 

DRSAN  IJCAI/2018  219.2  250.2 
SANet  ECCV/2018  258.4  334.9 
TDFCNN  AAAI/2018  354.7  491.4 
SL2R  PAMI/2019  279.6  408.1 
PACNN  CVPR/2019  241.7  320.7 
CAN  CVPR/2019  212.2  243.7 
MLAttnCNN  AAAI/2020  200.8  273.8 
To better understand the advantage of our proposed approach, we visualize some examples of density maps estimated on the UCF_CC_50 dataset in Figure 6. Our predicted density maps are of high quality and closely follow the actual crowd distributions of the corresponding crowd images.
We also compare the predicted and actual counts on the UCF_CC_50 dataset, as shown in Figure 7. Our method is superior to CAN across most data splits, which demonstrates the robustness and effectiveness of our proposed method. For most images, the predicted count is relatively close to the actual count, but when the crowd density of an image is particularly large, the prediction error is also relatively large. We believe the reason may be that the training data is insufficient and lacks highly crowded images, so our model cannot learn such image features well during training.
ShanghaiTech Part B dataset
Table 2 presents the quantitative results. As we can observe, compared with the other methods, we achieve the best MAE of 7.5 and MSE of 11.6. We also provide visualization results in Figure 9: we divide all the testing images into 10 groups on the basis of the crowd count and average the counts within each group, where the estimated count comes from the prediction of our network. Our method is superior to CAN across the data splits, which demonstrates its robustness and effectiveness, as shown in Figure 8.
Method  Venue/Year  MAE  MSE 

MCNN  CVPR/2016  26.4  41.3 
CascadeCNN  AVSS/2017  20.0  31.1 
SwitchCNN  CVPR/2017  21.6  33.4 
CSRNet  CVPR/2018  10.6  16.0 
SANet  ECCV/2018  8.4  13.6 
ATCSRNet  CVPR/2019  8.11  13.53 
TEDNet  CVPR/2019  8.2  12.8 
CAN  CVPR/2019  7.8  12.2 
MLAttnCNN  AAAI/2020  7.5  11.6 
UCF-QNRF dataset
We compare our proposed MLAttnCNN with the baselines and report the results in Table 3. On this dataset, we achieve the best MAE of 101.2 and MSE of 174.6. We also visualize some examples from this dataset in Figure 10. Again, we observe that our proposed MLAttnCNN model consistently improves the counting performance.
Method  Venue/Year  MAE  MSE 

MCNN  CVPR/2016  277  426 
CascadeCNN  AVSS/2017  252  514 
SwitchCNN  CVPR/2017  228  445 
CLCNN  ECCV/2018  132  191 
TEDNet  CVPR/2019  113  188 
CAN  CVPR/2019  107  183 
MLAttnCNN  AAAI/2020  101  175 
Ablation Study
We conduct an ablation study on the UCF_CC_50 dataset to justify the design of our proposed MLAttnCNN.
Effectiveness of backbone structure
We choose the VGG-16 structure as our backbone for initial feature extraction. Previous density counting research [19, 21, 3, 32] generally retains the first five sets of convolutional blocks in VGG and deletes the last two pooling layers, giving a downsampling factor of 8. Since too many pooling layers lead to the loss of spatial information, we remove three pooling layers and change the downsampling factor to 4 in order to avoid this loss and improve accuracy.
VGG-16 structure  Configuration  MAE  MSE 

3-Pool  (2+MP, 2+MP, 3+MP, 3)  233.4  315.6 
2-Pool  (2+MP, 2+MP, 3, 3)  200.8  273.8 
The comparison results are reported in Table 4, which shows that reducing the number of pooling layers lowers both the MAE and MSE of our proposed MLAttnCNN, resulting in a more accurate density map and improved crowd counting performance.
Effectiveness of the Multi-Scale Pooling Module
Scales  MAE  MSE 

MS1 (4 avg pooling layers)  221.2  276.6 
MS2 (4 avg pooling layers)  241.3  324.0 
MS3 (5 avg pooling layers)  200.8  273.8 
MS4 (4 avg pooling layers)  244.1  338.5 
We studied four multi-scale (MS) pooling bin size configurations, i.e.,

MS1: 1×1, 2×2, 3×3, and 6×6 avg pooling layers.

MS2: 3×3, 5×5, 7×7, and 9×9 avg pooling layers.

MS3: 1×1, 3×3, 5×5, 7×7, and 9×9 avg pooling layers.

MS4: 1×1, 3×3, 5×5, and 7×7 avg pooling layers.
We compared these four options on both the ShanghaiTech dataset and the UCF_CC_50 dataset. As shown in Table 5, MS3 performs the best on the UCF_CC_50 dataset. We also observe that odd pooling bin sizes are more suitable than even ones for learning the density map: with more effective semantic information, odd-size pooling kernels can accurately locate the spatial distribution of crowd density and improve the counting accuracy. The UCF_CC_50 dataset has fewer images than ShanghaiTech Part B and UCF-QNRF, and its crowds are denser with highly variable distributions, so filters with large receptive fields are more suitable for ground-truth density maps with larger heads.
Effectiveness of attention modules
We also study the multi-level attention modules on the UCF_CC_50 dataset with three combinations, as shown in Table 6 and Figure 11. Combining the 1st-level channel-wise attention with the 2nd-level spatial attention works better than using the 1st-level attention only, and the combination of all three attention levels achieves the best performance.
Methods  MAE  MSE 

1st-Level Attn  260.5  333.1 
1st-Level + 2nd-Level Attn  246.2  330.3 
1st-Level + 2nd-Level + 3rd-Level Attn  200.8  273.8 
Effectiveness of hyperparameters in the 3rd-level triplet attention
We run six different sets of hyperparameter experiments on the UCF_CC_50 dataset. From the results in Figure 12, we can observe that our proposed MLAttnCNN converges faster and better with a suitable choice of the hyperparameters in the 3rd-level triplet attention.
Conclusion
In this paper, we propose a multi-level attentive Convolutional Neural Network for crowd counting. We employ multi-scale pooling and multi-level attentions to explore the underlying contextual details for generating a high-quality density map. Our approach not only facilitates the semantic perception of heads at different scales but also enriches the features of different layers for fusion, and it proves experimentally well suited to extremely dense crowds. Extensive experiments strongly demonstrate that our approach is robust and outperforms state-of-the-art approaches on multiple crowd counting benchmark datasets.
Our future work includes extending the current work to solve crowd counting problems in videos.
Acknowledgments
This research is supported by the National Natural Science Foundation of China [grant numbers 42071449, 41601491].
References
 [1] (2016) Crowdnet: a deep convolutional network for dense crowd counting. In ACM MM, pp. 640–644. Cited by: Introduction, Related Work.
 [2] (2018) Scale aggregation network for accurate and efficient crowd counting. In ECCV, pp. 734–750. Cited by: Introduction, Comparison with Stateoftheart.
 [3] (2019) Scale pyramid network for crowd counting. In WACV, pp. 1941–1950. Cited by: Effectiveness of backbone structure.
 [4] (2019) Dense scale network for crowd counting. arXiv preprint arXiv:1906.09707. Cited by: Related Work.

 [5] (2019) ARGAN: attentive recurrent generative adversarial network for shadow detection and removal. In ICCV, pp. 10213–10222. Cited by: Introduction.
 [6] (2019) C^3 framework: an open-source PyTorch code for crowd counting. arXiv preprint arXiv:1907.02724. Cited by: Experiments.
 [7] (2009) Marked point processes for crowd counting. In CVPR, Cited by: Introduction.
 [8] (2019) Crowd counting using scaleaware attention networks. Cited by: Related Work.
 [9] (2021) A novel visual representation on text using diverse conditional gan for visual recognition. TIP 30, pp. 3499–3512. Cited by: Introduction.
 [10] (2013) Multisource multiscale counting in extremely dense crowd images. In CVPR, Cited by: Introduction, Experiments, Experiments.
 [11] (2018) Composition loss for counting, density map estimation and localization in dense crowds. Cited by: Comparison with Stateoftheart, Experiments.
 [12] (2020) DOAgan: dualorder attentive generative adversarial network for image copymove forgery detection and localization. In CVPR, Cited by: Introduction.
 [13] (2021) A hybrid attention mechanism for weaklysupervised temporal action localization. In AAAI, Cited by: Introduction.
 [14] (2017) DecideNet: counting varying density crowds through attention guided detection and density estimation. Cited by: Introduction, Related Work.
 [15] (2019) Crowd counting and density estimation by trellis encoderdecoder networks. In CVPR, pp. 6133–6142. Cited by: Comparison with Stateoftheart.
 [16] (2018) Crowd counting by adaptively fusing predictions from an image pyramid. arXiv preprint arXiv:1805.06115. Cited by: Introduction, Related Work.
 [17] (2014) Adam: a method for stochastic optimization. Computer Science. Cited by: Implementation Details and Loss Function.
 [18] (2010) Learning to count objects in images. In NIPS, pp. 1324–1332. Cited by: Related Work.
 [19] (2018) Csrnet: dilated convolutional neural networks for understanding the highly congested scenes. In CVPR, pp. 1091–1100. Cited by: Related Work, Comparison with Stateoftheart, Effectiveness of backbone structure.
 [20] (2020) ARShadowGAN: shadow generative adversarial network for augmented reality in single light scenes. In CVPR, Cited by: Introduction.
 [21] (2018) Crowd counting using deep recurrent spatialaware network. In IJCAI, Cited by: Comparison with Stateoftheart, Effectiveness of backbone structure.
 [22] (2019) Context-aware crowd counting. In CVPR, Cited by: Comparison with State-of-the-Art.

 [23] (2019) Exploiting unlabeled data in CNNs by self-supervised learning to rank. PAMI. Cited by: Comparison with State-of-the-Art.
 [24] (2016) Towards perspective-free object counting with deep learning. In ECCV, Cited by: Related Work.
 [25] (2001) A mrfbased approach for realtime subway monitoring. In CVPR, Cited by: Introduction.
 [26] (2017) Automatic differentiation in pytorch. Cited by: Implementation Details and Loss Function.
 [27] (2018) Topdown feedback for crowd counting convolutional neural network. In AAAI, Cited by: Comparison with Stateoftheart.
 [28] (2017) Switching convolutional neural network for crowd counting. In CVPR, pp. 4031–4039. Cited by: Introduction, Related Work, Comparison with Stateoftheart.
 [29] (2019) Revisiting perspective information for efficient crowd counting. In CVPR, pp. 7279–7288. Cited by: Comparison with Stateoftheart.
 [30] (2014) Very deep convolutional networks for largescale image recognition. arXiv preprint arXiv:1409.1556. Cited by: Methodology.
 [31] (2017) Cnnbased cascaded multitask learning of highlevel prior and density estimation for crowd counting. In AVSS, pp. 1–6. Cited by: Comparison with Stateoftheart.
 [32] (2019) HAccn: hierarchical attentionbased crowd counting network. TIP. Cited by: Effectiveness of backbone structure.
 [33] (2019) Shadow inpainting and removal using generative adversarial networks with slice convolutions. CGF 38 (7), pp. 381–392. Cited by: Introduction.
 [34] (2018) Cbam: convolutional block attention module. In ECCV, pp. 3–19. Cited by: The 2ndLevel Spatial Attention Module.
 [35] (2019) Adaptive scenario discovery for crowd counting. In ICASSP, pp. 2382–2386. Cited by: Comparison with Stateoftheart.
 [36] (2015) Multiscale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122. Cited by: Related Work.

[37]
(2021)
A twostage attentive network for single image superresolution
. TCSVT. Cited by: Introduction.  [38] (2020) Multicontext and enhanced reconstruction network for single image super resolution. In ICME, Cited by: Introduction.
 [39] (2016) Singleimage crowd counting via multicolumn convolutional neural network. In CVPR, Cited by: Introduction, Related Work, Comparison with Stateoftheart, Experiments, Experiments.
 [40] (2018) Attention to head locations for crowd counting. arXiv preprint arXiv:1806.10287. Cited by: Introduction, Related Work.
 [41] (2019) Multiresolution attention convolutional neural network for crowd counting. Neurocomputing 329, pp. 144–152. Cited by: Related Work.
 [42] (2019) Leveraging heterogeneous auxiliary tasks to assist crowd counting. In CVPR, pp. 12736–12745. Cited by: Comparison with Stateoftheart.