Harnessing Perceptual Adversarial Patches for Crowd Counting

by Shunchang Liu, et al.

Crowd counting, which estimates the number of people in safety-critical scenes, has been shown to be vulnerable to adversarial examples in the physical world (e.g., adversarial patches). Though harmful, adversarial examples are also valuable for assessing and better understanding model robustness. However, existing adversarial example generation methods in crowd counting scenarios lack strong transferability among different black-box models. Motivated by the fact that transferability is positively correlated with model-invariant characteristics, this paper proposes the Perceptual Adversarial Patch (PAP) generation framework to learn the shared perceptual features between models by exploiting both the model scale perception and position perception. Specifically, PAP exploits differentiable interpolation and density attention to help learn the invariance between models during training, leading to better transferability. In addition, we find, surprisingly, that our adversarial patches can also be utilized to benefit the performance of vanilla models, alleviating several challenges including cross-dataset generalization and complex backgrounds. Extensive experiments under both digital and physical world scenarios demonstrate the effectiveness of our PAP.








Figure 1: Illustration of our adversarial patches. (a) Digital world attacks (left: clean scene; right: perturbed scene). Our adversarial patch looks natural, i.e., like a sticker or poster. (b) Physical world attacks (left: clean scene; right: perturbed scene). Our patch has strong attacking ability. (c) Model performance on complex backgrounds is improved by training with our adversarial patches (left: vanilla model; right: the model trained with PAP).

Crowd counting, which estimates the number of people in unconstrained scenes, is becoming increasingly important in many safety-critical scenarios in practice (e.g., illegal-gathering or pedestrian-density monitoring, as shown in Figure 1 (a)). To date, research has focused on designing different crowd counting methods, including detection-based approaches and density-map-estimation-based methods with Convolutional Neural Networks (CNNs). Among them, the estimation-based approach has become the de facto solution for crowd counting due to its better performance.

Unfortunately, current estimation-based crowd counting models are highly vulnerable to adversarial examples, i.e., small perturbations that are imperceptible to humans but can easily lead deep neural networks (DNNs) to wrong predictions Szegedy et al. (2013); Liu et al. (2020a); Wu et al. (2021). Though harmful, adversarial attacks provide valuable insights into the blind spots of DNNs while helping to improve their robustness. Wu et al. (2021) successfully attacked crowd counting models by generating adversarial patches (perturbations confined to a small patch Brown et al. (2017)), which challenges their real-world applications. To our knowledge, as the only adversarial patch attack on crowd counting models, Wu et al. (2021) achieved only weakly transferable attacks. This, in turn, limits their ability to evaluate the robustness of black-box crowd counting systems in practice.

Recent studies Dong et al. (2019); Lennon, Drenkow, and Burlina (2021) have shown that model-invariant characteristics greatly influence transferable attacks in vision tasks. In light of this, we aim to find intrinsic characteristics that are shared (invariant) between models in order to generate adversarial patches with strong transferability. For crowd counting, we reach two key insights: (1) different models contain various receptive fields and tend to show different perceptual preferences for crowds of different scales, i.e., multiple scale perception; (2) different models show similar attention patterns at the same crowd positions, i.e., shared position perception.

Based on the above investigation, we propose the Perceptual Adversarial Patch (PAP) generation framework to learn model-invariant features by exploiting the model scale and position perceptions, thereby promoting the transferability of our adversarial patches. For scale perception, PAP introduces a differentiable interpolation module that randomly resizes the adversarial examples during the attack for better adaptation to different receptive fields, which helps capture the scale invariance between models (i.e., the adversarial patches can adapt to models with different crowd scale perceptions). For position perception, PAP draws the model-shared attention of the target model from the spatially dispersed crowd patterns to the patch region, which helps capture position invariance among models (i.e., forcing the position perception of different models to focus on the patch). Overall, our approach improves the transferable attacking ability of adversarial patches by exploiting both scale and position perception. Figure 1 (b) shows the application of our proposed adversarial patches in the physical world.

Furthermore, while most studies have found that adversarial training reduces model performance on the original task Madry et al. (2018); Tsipras et al. (2019), we find, intriguingly, that training with our adversarial patches benefits crowd counting performance. Since the generated adversarial patches encode model-invariant characteristics (i.e., scale perception and position perception), adversarial training with our patches forces the vanilla model to better focus on crowds at the perception level (e.g., Figure 1 (c)).

Our contributions are summarized as follows:

  • We propose the Perceptual Adversarial Patch (PAP) generation framework, which exploits differentiable interpolation and density attention to capture model-invariant features, i.e., scale perception and position perception, achieving strong transferable attacking ability.

  • We utilize a simple, standard adversarial training scheme with our adversarial patches to improve the performance of vanilla crowd counting models in several aspects, e.g., cross-dataset generalization and complex-background robustness.

  • We demonstrate that PAP generates adversarial patches with strong transferability (up to +1497.1% MAE, +957.9% MSE) through extensive experiments in both the digital and physical world. In addition, adversarial training with our patches improves model performance by large margins (at most -26.3% MAE and -23.4% MSE for cross-dataset generalization, and -28.5% MAE and -19.3% MSE for complex-background robustness).

Related Works

Crowd Counting

Image- or video-based crowd counting aims to automatically estimate the number of people in unconstrained scenes. Early works mainly focused on detection-based methods Topkaya, Erdogan, and Porikli (2014); Li et al. (2008), which show unsatisfactory results in extremely dense crowds. Nowadays, density-map-estimation-based approaches with Convolutional Neural Networks, which we focus on, are widely used due to their better performance. These estimation-based methods can be roughly divided into two categories Gao et al. (2020) based on the branches used for feature extraction: multi-column strategies Zhang et al. (2016); Song et al. (2021) and single-column strategies Li, Zhang, and Chen (2018); Liu, Salzmann, and Fua (2019a).

Despite these promising results, Gao et al. (2020) pointed out that current crowd counting models still face multiple challenges that hinder the deployment of crowd counting systems in practice: (1) weak generalization across datasets, which causes sub-optimal results when the model is applied to unseen scenes with non-uniform distributions; (2) weak robustness to complex backgrounds, e.g., weather conditions (rain, snow, haze, etc.) and hard samples that resemble crowds (leaves, birds, etc.).

Adversarial Attacks

Adversarial examples are inputs intentionally designed to mislead DNNs but are imperceptible to humans Szegedy et al. (2013). A long line of work has been devoted to performing adversarial attacks in different scenarios by generating imperceptible perturbations Goodfellow, Shlens, and Szegedy (2014); Athalye, Carlini, and Wagner (2018); Uesato et al. (2018); Croce and Hein (2020). Besides perturbations, adversarial patch Brown et al. (2017), where noises are confined to a small and localized patch, emerged for its easy accessibility in real-world scenarios. Adversarial patches have been widely studied and applied to attack different real-world applications Karmon, Zoran, and Goldberg (2018); Eykholt et al. (2018); Thys, Van Ranst, and Goedemé (2019); Liu et al. (2019, 2020b).

In the crowd counting scenario, few attempts have been made to perform adversarial attacks. Liu, Salzmann, and Fua (2019b) generated adversarial perturbations using FGSM Goodfellow, Shlens, and Szegedy (2014) and studied defenses against them in the digital world. Wu et al. (2021) proposed APAM, the first and so far only method that uses adversarial patches to attack crowd counting models. However, these studies fail to generate adversarial examples with strong transferability, and thus offer limited ability to evaluate black-box crowd counting models in practice.

Figure 2: Illustration of our Perceptual Adversarial Patch (PAP) generation framework. We optimize our patches using the differentiable interpolation module (randomly resizing the adversarial examples to adapt to different scale perceptions) and the density attention module (drawing the model-shared position perception to focus on the patch). The generated adversarial patches can mislead crowd counting models in both the digital and physical world, and can further improve model performance through standard adversarial training.


Approach

In this section, we first provide the problem definition, then elaborate on our proposed Perceptual Adversarial Patch (PAP) generation framework, and finally describe how we improve model performance with PAP.

Problem Definition

For density-map-estimation crowd counting models, given input images $x_i$, a model $f_\theta$ is trained to approximate the ground-truth density maps $y_i$ by solving the following optimization problem:

$$\min_\theta \frac{1}{N}\sum_{i=1}^{N} \mathcal{L}\big(f_\theta(x_i),\, y_i\big), \tag{1}$$

where $N$ denotes the number of input samples and $\mathcal{L}$ is the estimation loss.

In this paper, we aim to generate an adversarial patch $\delta$, a localized perturbation, to fool the crowd counting model into wrong predictions. Specifically, given the crowd counting model $f_\theta$, we generate the adversarial patch by maximizing the model loss:

$$\max_\delta \mathcal{L}\big(f_\theta(x^{adv}),\, y\big), \tag{2}$$

where the adversarial example $x^{adv}$ is composed of a clean image $x$, an additive adversarial patch $\delta$, and a location mask $M \in \{0,1\}^{H \times W}$. It can be formulated as

$$x^{adv} = (1 - M) \odot x + M \odot \delta, \tag{3}$$

where $\odot$ is the element-wise multiplication.

As a result, the generated adversarial patch can mislead the crowd counting models into wrong predictions.
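As a concrete illustration of the masked composition in Eqn (3), here is a minimal NumPy sketch; the function name and the (H, W, C) array shapes are our assumptions, not the paper's code:

```python
import numpy as np

def apply_patch(image, patch, top, left):
    """Compose x_adv = (1 - M) * x + M * delta, where M is a binary
    location mask that is 1 inside the patch region and 0 elsewhere.
    `image` is (H, W, C), `patch` is (h, w, C); both are illustrative."""
    h, w = patch.shape[:2]
    mask = np.zeros(image.shape[:2])
    mask[top:top + h, left:left + w] = 1.0
    # Lift the patch onto a full-size canvas so the formula applies directly.
    delta = np.zeros_like(image)
    delta[top:top + h, left:left + w] = patch
    m = mask[..., None]                       # broadcast mask over channels
    x_adv = (1.0 - m) * image + m * delta
    return x_adv, mask
```

Because the mask is binary, pixels outside the patch are untouched and pixels inside are replaced wholesale, which is what makes the patch printable in the physical world.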

Perceptual Adversarial Patch (PAP) Framework

Existing studies reveal that the model-invariant characteristics largely influence the transferability of attacks Dong et al. (2019); Lennon, Drenkow, and Burlina (2021). Thus, we aim to find the model-shared characteristics which highly influence model performance and then learn model-invariant features from them to generate transferable adversarial patches across models. Driven by this belief, we propose the Perceptual Adversarial Patch (PAP) generation framework by exploiting the model intrinsic perceptual characteristics, i.e., scale perception and position perception, to help adversarial patches capture model-invariant features. Thus, our generated adversarial patches could enjoy better transfer attacking abilities and can be further utilized to improve the model performance. The overall framework is shown in Figure 2.

Scale Perception via Differentiable Interpolation.

Recent studies Zhang et al. (2016); Gao et al. (2020) illustrated that crowd scale variation highly influences the design of estimation-based methods. Different models contain various receptive fields and tend to show different perception preferences for crowds with different scales. Capturing the scale-invariant features between models could benefit adversarial patches for better adaptation to different crowd scale perceptions, which results in stronger transferability among models. Therefore, we introduce a differentiable interpolation module to allow our adversarial patches to adapt to different model receptive fields. Through this specially designed module, the adversarial patch will be randomly resized to perform attacks during the optimization, which forces it to capture the scale invariance between models that have different scale perceptions.

Specifically, given an input image $x$ and a randomly initialized patch $\delta$, we create an adversarial example $x^{adv}$ by adding the patch onto the image via Eqn (3). We propose an interpolation module $\mathcal{I}(\cdot)$ that randomly resizes $x^{adv}$ with probability $p$ and then feeds it to the source model, imitating a change of receptive fields. Thus, our adversarial patches can adapt to different scale perceptions and transfer better across models. The module can be written as

$$\mathcal{I}(x^{adv}) = \begin{cases} \mathrm{resize}(x^{adv}, s), & \text{with probability } p, \\ x^{adv}, & \text{otherwise,} \end{cases} \tag{4}$$

where $\mathrm{resize}(\cdot, s)$ randomly conducts an upsampling or downsampling operation on the image with scale factor $s$ ranging in $[s_{min}, s_{max}]$.

Since our goal is to drive the model to wrong predictions, we force it to recognize the adversarial patch as crowds to a large extent. Therefore, we take the sum of all values of the model output (the predicted density map) and introduce the scale perception loss, which requires no labels or ground truth:

$$\mathcal{L}_s = \sum_{i,j} f_\theta\big(\mathcal{I}(x^{adv})\big)_{i,j}, \tag{5}$$

where $f_\theta(\cdot)_{i,j}$ is the pixel value at position $(i,j)$ of the predicted density map.
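The random-resize step and the scale perception loss can be sketched as follows. This is a NumPy stand-in: nearest-neighbour index selection replaces the paper's differentiable interpolation, and the hyperparameter defaults mirror the reported settings (p = 0.2, scale range [0.9, 1.1]):

```python
import numpy as np

def random_interpolate(x, p=0.2, s_min=0.9, s_max=1.1, rng=None):
    """With probability p, rescale x by a random factor in [s_min, s_max].
    Nearest-neighbour indexing stands in for differentiable (e.g. bilinear)
    interpolation; in a real attack the resize must carry gradients."""
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() >= p:
        return x                              # keep the original scale
    s = rng.uniform(s_min, s_max)
    H, W = x.shape[:2]
    nh, nw = max(1, round(H * s)), max(1, round(W * s))
    rows = np.minimum((np.arange(nh) / s).astype(int), H - 1)
    cols = np.minimum((np.arange(nw) / s).astype(int), W - 1)
    return x[np.ix_(rows, cols)]

def scale_perception_loss(density_map):
    """L_s: the sum of all predicted density values; maximizing it pushes
    the model to 'see' crowds wherever the patch is."""
    return float(np.sum(density_map))
```

Note that the loss needs no ground truth: it only asks the model to predict as many people as possible on the interpolated adversarial example.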

Position Perception via Density Attention.

Previous work reveals that different models share similar position perceptions towards the same image Wang et al. (2021). We find that crowd counting models also exhibit similar spatially dispersed attention patterns at the same crowd positions. Therefore, we disturb the position perception of the target model by attracting the model-shared attention patterns to the adversarial patch region through saliency map aggregation. In this way, the generated adversarial patches can capture position-invariant features and perform better transferable attacks.

Specifically, inspired by Grad-CAM Selvaraju et al. (2017), given the interpolated image $\mathcal{I}(x^{adv})$ and a target model $f_\theta$, we compute the attention map $A$ with a density attention module as

$$A = \mathrm{ReLU}\Big(\sum_k \alpha_k F^k\Big), \quad \alpha_k = \frac{1}{Z}\sum_i\sum_j \frac{\partial \sum_{u,v} f_\theta\big(\mathcal{I}(x^{adv})\big)_{u,v}}{\partial F^k_{i,j}}, \tag{6}$$

where $F^k_{i,j}$ is the pixel value at position $(i,j)$ of the $k$-th feature map, $\mathrm{ReLU}$ denotes the ReLU function, and $Z$ is the normalization constant for global average pooling.

To draw the model attention to the patch region, in turn leading to wrong estimations, we introduce the position perception loss, which directly increases the attention value on the patch region:

$$\mathcal{L}_p = \sum_{i,j} (M \odot A)_{i,j}, \tag{7}$$

where $A_{i,j}$ is the pixel value at position $(i,j)$ of the attention map. Note that we enhance the attention towards the patch region through backpropagation with the location mask $M$; the attention towards other regions is relatively suppressed due to normalization.
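A rough NumPy sketch of the density attention module and the position perception loss. In the real framework the per-channel gradients of the density sum come from autograd; passing them in explicitly here is an assumption made to keep the example self-contained:

```python
import numpy as np

def density_attention(feature_maps, grads):
    """Grad-CAM-style attention: weight each feature map F^k by the
    global-average-pooled gradient of the density sum, apply ReLU over
    the weighted sum, then normalize to [0, 1]. `feature_maps` and
    `grads` are both (K, H, W) arrays (grads supplied, not computed)."""
    alphas = grads.mean(axis=(1, 2))          # GAP of gradients per channel
    cam = np.maximum((alphas[:, None, None] * feature_maps).sum(axis=0), 0.0)
    return cam / (cam.max() + 1e-8)           # normalization step

def position_perception_loss(attention, mask):
    """L_p: total attention mass inside the patch region M; maximizing it
    drags the model-shared position perception onto the patch."""
    return float((attention * mask).sum())
```

Because the attention map is normalized, raising the values inside the mask implicitly suppresses attention everywhere else, matching the suppression effect described above.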

Overall Training.

We generate the adversarial patches by jointly optimizing the scale perception loss $\mathcal{L}_s$ and the position perception loss $\mathcal{L}_p$. Specifically, we use a gradient-based iterative algorithm to optimize our adversarial patches. In each iteration, we first generate adversarial examples with an initial adversarial patch at a random position; then we enhance the model-shared attention towards the interpolated adversarial examples and mislead the model into predicting them as crowds to a large extent; finally, we update the adversarial patch through backpropagation with the location mask. In summary, we generate the transferable adversarial patch by optimizing

$$\max_\delta \; \mathcal{L}_s + \lambda \mathcal{L}_p, \tag{8}$$

where $\lambda$ is a hyperparameter controlling the contributions of each term. The overall training procedure is described in Algorithm 1.


Algorithm 1 Perceptual Adversarial Patch Generation
Input: initial patch $\delta$, image set $X$, and target model $f_\theta$
Output: adversarial patch $\delta$
for each training epoch do
     select images $x$ from $X$
     for $T$ steps do
        randomly generate a location mask $M$
        generate $x^{adv}$ by Eqn (3)
        conduct the interpolation operation by Eqn (4)
        get the density attention map $A$ by Eqn (6)
        calculate $\mathcal{L}_s$, $\mathcal{L}_p$ by Eqn (5), (7)
        optimize the adversarial patch $\delta$ by Eqn (8)
     end for
end for

Improving Crowd Counting with PAP

Recent studies have revealed that crowd counting models still face several challenges, including weak generalization across datasets and weak robustness to complex backgrounds, which cast a shadow over their applications in practice Gao et al. (2020). Some studies Xie et al. (2020); Chen et al. (2021) have shown that adversarial examples can also be used to improve model performance if harnessed in the right manner. Inspired by them, we aim to take advantage of our perceptual adversarial patches and use them to improve the performance of crowd counting models. However, to improve image recognition and object detection models, current studies Xie et al. (2020); Chen et al. (2021) adopt multiple Batch Normalization (BN) branches (mixtureBN) to separately handle clean and adversarial examples during adversarial training, which modifies the model architecture. This cannot be simply applied to the crowd counting task, where most models do not have BN layers. Therefore, we adversarially train crowd counting models with our perceptual adversarial patches to improve performance without modifying architectures.

Specifically, we modify the standard adversarial training scheme Madry et al. (2018) to adapt it to our PAP framework, which can be defined as follows:

$$\min_\theta \frac{1}{N}\sum_{i=1}^{N} \max_\delta \mathcal{L}\big(f_\theta(x_i^{adv}),\, y_i\big), \tag{9}$$

where $x^{adv}$ is the adversarial example (composed of $x$ and $\delta$ via Eqn (3)), $y$ is the ground-truth density map, $\theta$ denotes the crowd counting model parameters, and $\mathcal{L}$ represents the loss function. In practice, instead of solving the min-max optimization problem iteratively, we simply generate all the adversarial examples with the pretrained model at the beginning, which achieves better performance and takes less time (see Supplementary Material for more analyses).
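The one-shot scheme described above, where every patched example is generated up front and then mixed 1:1 with clean data, can be sketched as follows; `place_fn` is a hypothetical helper that pastes the patch onto an image:

```python
def build_adv_training_set(clean_pairs, patch, place_fn):
    """One-shot adversarial training data: for each (image, density_map)
    pair, keep the clean sample and add a patched copy with the SAME
    label (the patch adds no real people), yielding a 1:1 mixture.
    No min-max loop and no architecture change are needed."""
    mixed = []
    for x, y in clean_pairs:
        mixed.append((x, y))                 # clean sample
        mixed.append((place_fn(x, patch), y))  # patched sample, same label
    return mixed
```

Training then proceeds with the model's ordinary loss over the mixed set, which is why the scheme works even for architectures without BN layers.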

Our perceptual adversarial patches can attack models with different crowd scale perceptions and disturb their position perception. Adversarial training with our patches further enhances the model's tolerance to perturbations of scale and position. In other words, a crowd counting model enhanced with our adversarial patches gains perception generalization across multiple crowd scales and rectifies its perception to better focus on the crowd itself under noise. Therefore, it generalizes better to unseen scenarios with different crowd scales and pays more attention to crowd regions rather than complex backgrounds in natural scenes.


Experiments

In this section, we first outline the experimental settings, then demonstrate the effectiveness of our proposed attack through thorough evaluations in both the digital and physical world, and finally use our adversarial patches to improve crowd counting performance.

Experimental Settings

Datasets and Models.

For attacks, we choose the Shanghai Tech dataset Zhang et al. (2016) following Wu et al. (2021), which is divided into two parts: Part A and Part B. We attack six commonly used and SoTA crowd counting models: MCNN Zhang et al. (2016), CSRNet Li, Zhang, and Chen (2018), CAN Liu, Salzmann, and Fua (2019a), BL Ma et al. (2019), DM-Count Wang et al. (2020a), and SASNet Song et al. (2021). For model improvements, we use three datasets following Gao et al. (2020): Shanghai Tech, NWPU Wang et al. (2020b), and JHU-CROWD++ Sindagi, Yasarla, and Patel (2020). We select the SoTA model DM-Count for evaluation.

Evaluation Metrics.

We use the widely used crowd counting metrics Mean Absolute Error (MAE) and Mean Squared Error (MSE) following Li, Zhang, and Chen (2018). For attacks, higher MAE and MSE values indicate stronger adversarial attacks; for model improvements, lower MAE and MSE values indicate better models.
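The two metrics can be computed from per-image counts as below; note that crowd counting papers, including Li, Zhang, and Chen (2018), conventionally report the root of the mean squared error under the name MSE:

```python
import numpy as np

def mae_mse(pred_counts, gt_counts):
    """Standard crowd-counting metrics over per-image people counts.
    MAE = mean |pred - gt|; 'MSE' follows the field's convention of
    being the root mean squared error."""
    d = np.asarray(pred_counts, dtype=float) - np.asarray(gt_counts, dtype=float)
    mae = float(np.mean(np.abs(d)))
    mse = float(np.sqrt(np.mean(d ** 2)))
    return mae, mse
```

Because MSE squares the errors before averaging, it penalizes large per-image miscounts more heavily than MAE, which is why attacks that blow up a few predictions inflate MSE fastest.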

Compared Methods.

For attacks, we compare with the only adversarial patch generation method for crowd counting, i.e., APAM Wu et al. (2021). For model improvements, we compare with two adversarial training methods, adversarial training with APAM-generated patches (APAM-AT) and PGD adversarial training Madry et al. (2018) (PAT), and three data augmentation methods: Cutout DeVries and Taylor (2017), Cutmix Yun et al. (2019), and Augmix Hendrycks et al. (2019). Note that all methods use the same amount of extra data to train models.

Implementation Details.

For attacks, we randomly initialize the adversarial patch with a fixed size and train with batch size 1 for 100 iterations per epoch with an attack step size of 0.01, for a maximum of 3 epochs. The position and orientation of the patch are randomly chosen, so that our adversarial patches can universally attack all images. We set the interpolation hyperparameters $p$, $s_{min}$, and $s_{max}$ to 0.2, 0.9, and 1.1, and the position perception loss weight $\lambda$ to 10 (more details can be found in the Supplementary Materials). For model improvements, we first generate adversarial patches on each image in the original training set and mix them to obtain the new training set (the ratio of adversarial to clean examples is 1:1). Then, we train the crowd counting model on the new training set. All of our code is implemented in PyTorch, and all experiments are conducted on an NVIDIA Tesla V100 GPU cluster.

Digital World Attack

We evaluate the performance of our adversarial patches in the digital world under both white-box and black-box settings. For white-box attacks, adversaries have complete knowledge of the target model and can fully access it; for black-box attacks, adversaries have limited model knowledge and cannot directly access the model. For APAM, we use the released code and keep the same settings for fair comparisons. Due to limited space, we hereby only report results on Shanghai Tech Part A for adversarial patches of a fixed size that accounts for only 0.83% of the image size in the Shanghai Tech dataset; for other patch sizes and datasets, please refer to the Supplementary Materials. Note that in the main paper we only report the results of attacks that increase the counting number; we can also generate adversarial patches that decrease the crowd counts, and we defer those results to the Supplementary Materials.

White-box attacks.

For the white-box attack, we generate adversarial patches using the specific target model and perform attacks on it accordingly. As shown in Table 1 (diagonal), in contrast to APAM, our method achieves higher MAE and MSE in the white-box settings on different models. Therefore, our method is able to generate adversarial patches with much stronger white-box attacking ability.

MAE / MSE Target Model
Source model Method MCNN CSRNet CAN BL DM-Count SASNet
Clean 108.0 / 165.0 67.0 / 105.2 59.9 / 94.1 61.8 / 94.1 58.2 / 93.2 52.8 / 86.2
MCNN APAM 116.5 / 174.7 66.9 / 105.3 59.7 / 94.3 61.5 / 93.7 58.3 / 93.3 52.6 / 86.2
Ours 1249.1 / 1287.8 67.1 / 105.5 60.0 / 95.2 64.6 / 96.3 60.8 / 94.1 52.8 / 87.2
CSRNet APAM 107.9 / 164.7 67.1 / 105.7 60.0 / 94.2 61.9 / 94.0 58.3 / 93.0 53.5 / 86.6
Ours 116.6 / 167.9 459.0 / 471.7 164.7 / 184.3 300.9 / 314.2 185.9 / 203.0 56.7 / 88.5
CAN APAM 107.2 / 163.2 67.0 / 105.6 60.5 / 95.6 62.0 / 93.9 58.5 / 92.6 53.5 / 87.0
Ours 154.4 / 196.4 355.5 / 369.6 581.2 / 616.2 347.4 / 359.0 206.8 / 221.3 57.4 / 88.3
BL APAM 107.5 / 164.6 67.1 / 105.6 60.1 / 94.5 61.6 / 93.9 58.2 / 93.1 53.2 / 86.4
Ours 118.4 / 168.7 73.2 / 107.6 61.6 / 94.7 1648.6 / 1658.2 635.0 / 644.9 53.9 / 87.1
DM-Count APAM 107.4 / 164.3 67.0 / 105.5 60.0 / 94.5 61.7 / 94.0 58.2 / 93.1 53.4 / 86.8
Ours 117.2 / 167.9 71.5 / 107.0 65.0 / 97.3 987.0 / 995.5 999.5 / 1007.6 54.9 / 88.3
SASNet APAM 107.9 / 164.5 68.2 / 104.4 60.3 / 93.5 62.2 / 94.2 59.1 / 93.4 55.1 / 89.7
Ours 109.9 / 165.0 68.7 / 105.4 61.2 / 94.2 74.4 / 104.6 72.0 / 107.6 232.5 / 244.6
Table 1: Attacks on Shanghai Tech dataset (Part A). The first row is the results for clean samples. The results on the diagonal are under white-box settings while the others are under black-box settings. We outperform APAM by large margins with higher MAE and MSE.

Black-box attacks.

In the black-box setting, we first generate adversarial patches based on one specific model, and then transfer the attacks to other models and test their attacking ability. As illustrated in Table 1, we can draw some observations as follows:

(1) Compared to APAM, we achieve stronger black-box attacking ability, showing higher MAE and MSE values on different models and outperforming APAM by large margins (at most by 1497.1% MAE and 957.9% MSE, transferring from DM-Count to BL).

(2) We find that adversarial attacks can hardly transfer between multi-column (e.g., SASNet and MCNN) and single-column models. We conjecture that multi-column models have more complex architectures with several branches and more information redundancy Li, Zhang, and Chen (2018); these architectures might cause the weak black-box transferability of adversarial attacks, and we leave detailed analyses as future work.

Figure 3: Physical world attack in a real-world scenario. Our adversarial patches can mislead the crowd counting model under different scenes in practice.

Physical World Attack

Here, we further evaluate the practical performance of our adversarial patches in the physical world, which is also more challenging and meaningful.

We first generate an adversarial patch using the CSRNet model and print it with an HP Color LaserJet Pro MFP M281fdw printer. We then take 96 pictures with a Huawei P40 mobile phone, holding the patch or sticking it up as a flag or poster. To prove its effectiveness in complex real-world scenarios, we take photos with different patch sizes (15cm×15cm and 20cm×20cm), distances (1.5m and 3m), and scenes (indoor and outdoor). In each setting, we take six pairs of photos with and without the patch. Our adversarial patches are able to attack crowd counting models in real-world scenarios, increasing the average MAE and MSE from 0.76 and 1.03 to 232.23 and 395.09, respectively, against the black-box SoTA crowd counting model DM-Count. As shown in Figure 3, the generated adversarial patches appear quite natural in the real world and will pose safety problems for systems deployed in practice.

Improving Crowd Counting Performance

In this section, we aim to demonstrate the effectiveness of our perceptual adversarial patches for benefiting model performance. Specifically, we evaluate the enhanced crowd counting model's generalization across datasets and its robustness on scenes with complex backgrounds.

MAE / MSE Cross-dataset Evaluation
Method Part A→Part B Part B→Part A
Vanilla 22.8 / 34.3 142.4 / 241.3
Cutout 18.0 / 27.9 153.1 / 272.3
Cutmix 22.1 / 34.1 147.5 / 241.9
Augmix 17.9 / 29.0 145.3 / 243.2
PAT 23.7 / 35.1 145.7 / 249.6
APAM-AT 23.3 / 35.7 143.3 / 246.7
Ours 16.8 / 27.8 128.9 / 230.0
Table 2: Cross-dataset evaluation (results are shown as “training dataset→test dataset”). Our enhanced model has better generalization (lower MAE and MSE) across datasets.

Generalization across Datasets.

We use Shanghai Tech Part A and Shanghai Tech Part B (whose photos are taken from different scenarios in different ways) for the cross-dataset evaluation. As shown in Table 2, owing to its strong recognition ability for crowds of different scales, the model trained with our PAP significantly improves generalization across datasets by large margins (at most -26.3% MAE and -23.4% MSE). We also outperform the adversarial training baselines (e.g., APAM-AT and PAT), which deteriorate model generalization, as well as the data augmentation techniques (e.g., Cutout, Cutmix, and Augmix).

Robustness for Complex Backgrounds.

Following Gao et al. (2020), we test model performance on scenes with complex backgrounds using three test sets: distractors, special weather samples, and negative samples. Distractors and special weather samples are carefully selected from JHU-CROWD++, and negative samples are built from NWPU. Specifically, distractors contain complex backgrounds that may be mistaken for crowds; special weather samples are images taken under rain, snow, and haze; negative samples contain no crowds at all.

As shown in Figure 4, models trained with our PAP improve robustness on all three test sets (-8.0% MAE and -4.2% MSE on distractors, -2.1% MAE and -1.0% MSE on special weather samples, and -28.5% MAE and -19.3% MSE on negative samples), and we also outperform the other methods. Intuitively, adversarial training with our patches helps models resist crowd-like noise and focus on real crowd patterns, resulting in stronger robustness on negative samples. As visualized in Figure 5, the attention of models trained with our PAP focuses on human areas more accurately, while the density maps of the vanilla model are drawn to distractors.

(a) Distractors
(b) Weathers
(c) Negative
Figure 4: Model performance on images with complex backgrounds (i.e., distractors, special weathers, and negative samples). Model robustness can be improved by training with our adversarial patches (lowest MAE and MSE).
Figure 5: Density maps on scenes with complex backgrounds. The model trained with our adversarial patches focuses on the crowd more precisely, leading to better robustness.

Ablation Studies

In this section, we conduct ablation studies to further investigate the contributions of the scale perception and position perception losses, i.e., $\mathcal{L}_s$ and $\mathcal{L}_p$. We generate adversarial patches with or without these two loss terms from CSRNet and then transfer the attacks to other models on the Shanghai Tech dataset. As shown in Table 3, the MAE and MSE values for attacking all target models increase after adding the scale perception loss $\mathcal{L}_s$; meanwhile, the transfer attacking ability also improves after introducing the position perception loss $\mathcal{L}_p$. We achieve the highest MAE and MSE values when both modules are added. These results demonstrate the effectiveness of our scale perception and position perception for improving the transferability of attacks.

MAE / MSE Dataset
Target Model   Losses        Part A          Part B
MCNN           none          108.0 / 165.0   28.3 / 38.7
               + L_s         116.1 / 167.6   141.5 / 147.1
               + L_p         116.2 / 167.9   32.9 / 40.9
               + L_s + L_p   116.6 / 167.9   151.0 / 156.3
CAN            none          59.9 / 94.1     7.5 / 11.9
               + L_s         155.9 / 174.9   146.5 / 147.6
               + L_p         94.6 / 121.5    40.5 / 42.8
               + L_s + L_p   164.7 / 184.3   149.2 / 150.1
BL             none          61.8 / 94.1     7.3 / 12.0
               + L_s         290.3 / 304.7   66.1 / 67.6
               + L_p         251.9 / 266.3   23.1 / 27.0
               + L_s + L_p   300.9 / 314.2   66.9 / 68.4
DM-Count       none          58.2 / 93.2     7.3 / 11.8
               + L_s         182.8 / 201.7   107.6 / 109.4
               + L_p         152.9 / 169.8   8.0 / 12.6
               + L_s + L_p   185.9 / 203.0   110.0 / 111.9
SASNet         none          52.8 / 86.2     6.4 / 9.9
               + L_s         56.2 / 88.2     6.5 / 10.0
               + L_p         54.3 / 87.0     6.5 / 10.2
               + L_s + L_p   56.7 / 88.5     6.5 / 10.2
Table 3: Ablation study on the dual perception modules. “L_s” and “L_p” respectively denote the scale perception loss and the position perception loss; “none” is the clean (unattacked) baseline. Both loss terms improve the transfer attacking ability.


Conclusion

To generate strongly transferable attacks against crowd counting models, this paper proposes the Perceptual Adversarial Patch (PAP) generation framework, which learns model-invariant features by exploiting both model scale perception and position perception. Moreover, our adversarial patches can be exploited to improve crowd counting performance via adversarial training. To validate the proposed method, we conduct extensive experiments in both the digital and physical world, showing that PAP achieves state-of-the-art performance.

In contrast to previous studies, we surprisingly found that adversarial training with our patches can benefit model performance. Though we provide a preliminary explanation, the underlying cause of this observation remains open, and we leave its investigation as future work.


References

  • Athalye, Carlini, and Wagner (2018) Athalye, A.; Carlini, N.; and Wagner, D. 2018. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning.
  • Brown et al. (2017) Brown, T. B.; Mané, D.; Roy, A.; Abadi, M.; and Gilmer, J. 2017. Adversarial patch. arXiv preprint arXiv:1712.09665.
  • Chen et al. (2021) Chen, X.; Xie, C.; Tan, M.; Zhang, L.; Hsieh, C.-J.; and Gong, B. 2021. Robust and Accurate Object Detection via Adversarial Learning. In CVPR.
  • Croce and Hein (2020) Croce, F.; and Hein, M. 2020. Reliable Evaluation of Adversarial Robustness with an Ensemble of Diverse Parameter-free Attacks. In International Conference on Machine Learning.
  • DeVries and Taylor (2017) DeVries, T.; and Taylor, G. W. 2017. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.
  • Dong et al. (2019) Dong, Y.; Pang, T.; Su, H.; and Zhu, J. 2019. Evading defenses to transferable adversarial examples by translation-invariant attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 4312–4321.
  • Eykholt et al. (2018) Eykholt, K.; Evtimov, I.; Fernandes, E.; Li, B.; Rahmati, A.; Xiao, C.; Prakash, A.; Kohno, T.; and Song, D. 2018. Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1625–1634.
  • Gao et al. (2020) Gao, G.; Gao, J.; Liu, Q.; Wang, Q.; and Wang, Y. 2020. CNN-based density estimation and crowd counting: A survey. arXiv preprint arXiv:2003.12783.
  • Goodfellow, Shlens, and Szegedy (2014) Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
  • Hendrycks et al. (2019) Hendrycks, D.; Mu, N.; Cubuk, E. D.; Zoph, B.; Gilmer, J.; and Lakshminarayanan, B. 2019. Augmix: A simple data processing method to improve robustness and uncertainty. arXiv preprint arXiv:1912.02781.
  • Karmon, Zoran, and Goldberg (2018) Karmon, D.; Zoran, D.; and Goldberg, Y. 2018. Lavan: Localized and visible adversarial noise. In International Conference on Machine Learning, 2507–2515. PMLR.
  • Lennon, Drenkow, and Burlina (2021) Lennon, M.; Drenkow, N.; and Burlina, P. 2021. Patch Attack Invariance: How Sensitive are Patch Attacks to 3D Pose? arXiv preprint arXiv:2108.07229.
  • Li et al. (2008) Li, M.; Zhang, Z.; Huang, K.; and Tan, T. 2008. Estimating the number of people in crowded scenes by MID based foreground segmentation and head-shoulder detection. In 2008 19th International Conference on Pattern Recognition, 1–4. IEEE.
  • Li, Zhang, and Chen (2018) Li, Y.; Zhang, X.; and Chen, D. 2018. Csrnet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, 1091–1100.
  • Liu et al. (2020a) Liu, A.; Huang, T.; Liu, X.; Xu, Y.; Ma, Y.; Chen, X.; Maybank, S.; and Tao, D. 2020a. Spatiotemporal Attacks for Embodied Agents. In European Conference on Computer Vision.
  • Liu et al. (2019) Liu, A.; Liu, X.; Fan, J.; Ma, Y.; Zhang, A.; Xie, H.; and Tao, D. 2019. Perceptual-Sensitive GAN for Generating Adversarial Patches. In 33rd AAAI Conference on Artificial Intelligence.
  • Liu et al. (2020b) Liu, A.; Wang, J.; Liu, X.; Cao, B.; Zhang, C.; and Yu, H. 2020b. Bias-based Universal Adversarial Patch Attack for Automatic Check-out. In European Conference on Computer Vision.
  • Liu, Salzmann, and Fua (2019a) Liu, W.; Salzmann, M.; and Fua, P. 2019a. Context-aware crowd counting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5099–5108.
  • Liu, Salzmann, and Fua (2019b) Liu, W.; Salzmann, M.; and Fua, P. 2019b. Using depth for pixel-wise detection of adversarial attacks in crowd counting. arXiv preprint arXiv:1911.11484.
  • Ma et al. (2019) Ma, Z.; Wei, X.; Hong, X.; and Gong, Y. 2019. Bayesian loss for crowd count estimation with point supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 6142–6151.
  • Madry et al. (2018) Madry, A.; Makelov, A.; Schmidt, L.; Tsipras, D.; and Vladu, A. 2018. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations.
  • Selvaraju et al. (2017) Selvaraju, R. R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; and Batra, D. 2017. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, 618–626.
  • Sindagi, Yasarla, and Patel (2020) Sindagi, V.; Yasarla, R.; and Patel, V. M. 2020. Jhu-crowd++: Large-scale crowd counting dataset and a benchmark method. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Song et al. (2021) Song, Q.; Wang, C.; Wang, Y.; Tai, Y.; Wang, C.; Li, J.; Wu, J.; and Ma, J. 2021. To Choose or to Fuse? Scale Selection for Crowd Counting. In Proceedings of the AAAI Conference on Artificial Intelligence, 2576–2583.
  • Szegedy et al. (2013) Szegedy, C.; Zaremba, W.; Sutskever, I.; Bruna, J.; Erhan, D.; Goodfellow, I.; and Fergus, R. 2013. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
  • Thys, Van Ranst, and Goedemé (2019) Thys, S.; Van Ranst, W.; and Goedemé, T. 2019. Fooling automated surveillance cameras: adversarial patches to attack person detection. In CVPRW.
  • Topkaya, Erdogan, and Porikli (2014) Topkaya, I. S.; Erdogan, H.; and Porikli, F. 2014. Counting people by clustering person detector outputs. In 2014 11th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), 313–318. IEEE.
  • Tsipras et al. (2019) Tsipras, D.; Santurkar, S.; Engstrom, L.; Turner, A.; and Madry, A. 2019. Robustness may be at odds with accuracy. In International Conference on Learning Representations.
  • Uesato et al. (2018) Uesato, J.; O’Donoghue, B.; van den Oord, A.; and Kohli, P. 2018. Adversarial Risk and the Dangers of Evaluating Against Weak Attacks. In International Conference on Machine Learning.
  • Wang et al. (2020a) Wang, B.; Liu, H.; Samaras, D.; and Hoai, M. 2020a. Distribution matching for crowd counting. arXiv preprint arXiv:2009.13077.
  • Wang et al. (2021) Wang, J.; Liu, A.; Yin, Z.; Liu, S.; Tang, S.; and Liu, X. 2021. Dual Attention Suppression Attack: Generate Adversarial Camouflage in Physical World. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 8565–8574.
  • Wang et al. (2020b) Wang, Q.; Gao, J.; Lin, W.; and Li, X. 2020b. NWPU-crowd: A large-scale benchmark for crowd counting and localization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(6): 2141–2149.
  • Wu et al. (2021) Wu, Q.; Zou, Z.; Zhou, P.; Ye, X.; Wang, B.; and Li, A. 2021. Towards Adversarial Patch Analysis and Certified Defense against Crowd Counting. In ACM MM.
  • Xie et al. (2020) Xie, C.; Tan, M.; Gong, B.; Wang, J.; Yuille, A. L.; and Le, Q. V. 2020. Adversarial examples improve image recognition. In IEEE Conference on Computer Vision and Pattern Recognition.
  • Yun et al. (2019) Yun, S.; Han, D.; Oh, S. J.; Chun, S.; Choe, J.; and Yoo, Y. 2019. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 6023–6032.
  • Zhang et al. (2016) Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; and Ma, Y. 2016. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, 589–597.