
Transferable Physical Attack against Object Detection with Separable Attention

by   Yu Zhang, et al.

Transferable adversarial attacks have stayed in the spotlight ever since deep learning models were shown to be vulnerable to adversarial samples. However, existing physical attack methods do not pay enough attention to transferability to unseen models, which leads to poor black-box attack performance. In this paper, we put forward a novel method of generating physically realizable adversarial camouflage to achieve transferable attack against detection models. More specifically, we first introduce multi-scale attention maps based on detection models to capture features of objects with various resolutions. Meanwhile, we adopt a sequence of composite transformations to obtain the averaged attention maps, which curbs model-specific noise in the attention and thus further boosts transferability. Unlike general visualization and interpretation methods, where model attention should be put on the foreground object as much as possible, we carry out attack on separable attention from the opposite perspective, i.e. suppressing attention of the foreground and enhancing that of the background. Consequently, transferable adversarial camouflage can be generated efficiently with our novel attention-based loss function. Extensive comparison experiments verify the superiority of our method over state-of-the-art methods.


1 Introduction

In recent years, deep neural networks (DNNs) have achieved great success in various scenarios, e.g. object detection Redmon and Farhadi (2018); Zhu et al. (2020); Ren et al. (2015), image classification Gong et al. (2019, 2020) and image segmentation He et al. (2017). However, it has been found that DNNs are vulnerable to adversarial samples Goodfellow et al. (2014), and the predicted results may be distorted dramatically by adding human-imperceptible perturbations to the input samples.

Plenty of methods have been proposed for generating adversarial samples Wu et al. (2020a); Wang et al. (2021c); Huang et al. (2020), which can be classified into white-box vs. black-box attack according to the transparency of model information. White-box attack permits intruders to access the model structure, model parameters and even the training dataset when generating adversarial samples. However, this information is inaccessible for black-box attack. Adversarial attack can also be categorized as untargeted vs. targeted attack and digital vs. physical attack. Untargeted attack is conducted with the aim of altering the model prediction only, regardless of what the final predicted label is. On the contrary, targeted attack requires that the prediction be misclassified to a specific label. Digital attack assumes that the adversary can manipulate image pixels in the digital space, where perturbation constraints (e.g. the norm bounds of Su et al. (2019) or Yang et al. (2020)) are adopted to avoid suspicion. Physical attack tries to modify objects and confuse DNNs in the physical world or a relevant simulation environment, where a host of practical influence factors must be taken into account, making it much more difficult and challenging than digital attack Wang et al. (2021a); Jiang et al. (2021). For the sake of practicality, this paper mainly focuses on untargeted physical black-box attack.

Figure 1: Typical detection results and attention maps of a vehicle covered with different adversarial patterns. All the patterns are generated on Yolo-V3 and tested on Faster R-CNN. (a)-(c) are patterns generated by adopting adversarial losses from FCA Jiang et al. (2021), DAS Wang et al. (2021a) and ATA Wu et al. (2020a) respectively (refer to subsection 4.2 for more details). (d) displays the camouflage manufactured with our method. (e)-(h) are the corresponding attention maps of (a)-(d).

Although a great number of attack methods have sprung up, limitations still exist: 1) The majority of methods target classification Wu et al. (2020a); Wang et al. (2021c); Chen et al. (2020) and cannot be applied to object detection directly. 2) It is difficult for existing 2D adversarial physical patches Eykholt et al. (2018); Thys et al. (2019) to adapt to the multi-view scenario of 3D space. 3) Previous physical attack methods Maesumi et al. (2021); Huang et al. (2020); Wang et al. (2021a) paid limited attention to transferability, which degrades black-box attack performance. To address those issues, our work focuses on crafting a physically realizable 3D adversarial pattern that can deceive black-box detectors with high transferability. Specifically, the pattern is initialized given the sampled faces of a 3D vehicle model mesh, and then optimized by attacking the averaged multi-scale attention maps. Besides, unlike the general middle-layer attack strategies Wu et al. (2020a); Wang et al. (2021a), which take the map as a whole during optimization, we carry out attention attack by considering the contributions of foreground and background attention separately. Consequently, a physical structured camouflage with high transferability can be generated with our proposed method. Typical comparisons of detection results and attention distraction are displayed in Figure 1, which shows that our method performs better than the baseline methods, i.e. our adversarial texture brings about the greatest reduction in both detected class probability and model attention.

Our contributions include the following: 1) To the best of our knowledge, we are the first to conduct Attack on Separable Attention (ASA) with a novel attention-based loss function, which is mainly made up of two modules, the Foreground Attention Suppression module and the Background Attention Amplification module. 2) Based on detection models, multi-scale attention maps are developed to capture features of objects at different resolutions. In order to further boost transferability, the maps are smoothed with a sequence of compound transformations, which alleviates model-specific noise. 3) Experiments show that our method outperforms other state-of-the-art approaches in terms of black-box transferability.

2 Related work

Physical Adversarial Attacks

The purpose of physical adversarial attack is to craft localized visible perturbations that have the potential to deceive DNN-based vision systems. According to space dimension, physical attack can be classified into 2D physical attack Eykholt et al. (2018); Thys et al. (2019); Sharif et al. (2016); Wu et al. (2020b); Wang et al. (2021b); Du et al. (2022) and 3D physical attack Athalye et al. (2018); Jiang et al. (2021); Maesumi et al. (2021); Wang et al. (2021a). Sharif et al. (2016) developed a method of deceiving face-recognition systems by generating physical eyeglass frames. Eykholt et al. (2018) came up with Robust Physical Perturbations (RP2), a general attack method that could mislead the classification of stop signs under different environmental conditions. Thys et al. (2019) paid attention to physical attack against persons in the real world by putting one 2D adversarial patch on the torso. The method performed well in the front view, but does not work well for larger shooting angles Tarchoun et al. (2021). Fortunately, this problem can be overcome by 3D physical attack. Athalye et al. (2018) developed Expectation Over Transformation, the first approach of generating 3D robust adversarial samples. Maesumi et al. (2021) presented a universal 3D-to-2D adversarial attack method, where a structured patch was sampled from the reference human model and the human pose could be adjusted freely during training. Wang et al. (2021a) proposed the Dual Attention Suppression (DAS) attack based on an open-source 3D virtual environment, and Jiang et al. (2021) extended DAS by presenting a full-coverage adversarial attack.

Black-box Attack

The intent of black-box attack is to yield a pattern that not only has a striking effect in white-box attack, but can also misguide black-box models with high transferability. Generally, black-box attack can be divided into query-based attack and transfer-based attack. The former estimates the gradient roughly by observing the variation of model output under disturbed inputs, and then updates adversarial samples as in white-box attacks Yang et al. (2020); Xiang et al. (2021); Guo et al. (2019). Due to the complexity of DNNs, a large number of queries is normally unavoidable (which might be impermissible in the real world) if an attacker wants a better gradient estimate. The latter, i.e. transfer-based attack, relies on the transferability of adversarial samples Zhou et al. (2018); Inkawhich et al. (2019); Wang et al. (2021c); Huang et al. (2019); Inkawhich et al. (2020). Zhou et al. (2018) showed that attack performance could be improved by enlarging the distance between the intermediate feature maps of benign images and those of the corresponding adversarial samples. Inkawhich et al. (2019) found that one particular layer may play a more important role than others in adversarial attack. Feature importance-aware attack was proposed in Wang et al. (2021c) with the purpose of enhancing transferability by destroying crucial object-aware features. Chen et al. (2022) realized transferable attack on object detection by manipulating relevance maps. Specifically, they first computed two attention maps corresponding to the original class and another class respectively, and then minimized their ratio. Although their method works for multi-class datasets, it is difficult to generate heatmaps of the second class if there is only one category to be detected. Wang et al. (2021a) conducted attack by shattering the "heated" region with a recursive method. However, our experiments show that searching for the "heated" regions is inefficient for some high-resolution maps.

3 Proposed Framework

In this section, we present a framework of physically realizable adversarial attack on object detection. The main purpose of our work is to generate a structured 3D adversarial pattern such that, when pasted on the surface of one vehicle, the pattern can fool black-box detectors with high probability. The overall pipeline of our method is shown in Figure 2.

3.1 Preliminaries

Unlike 2D adversarial attack, the input image in 3D physical attack is usually a rendered result of one object model based on renderer , i.e. , where denotes object mesh that consists of a large number of triangular faces, is the corresponding texture map and represents camera parameter. This paper adopts a modular differentiable renderer provided by PyTorch3D Ravi et al. (2020). Since the rendered 2D image is made up of a foreground object and a totally white background, we can get the object mask easily by setting values of background pixels to zeros and the remaining pixels to ones. In order to obtain a physical world background , we take CARLA Dosovitskiy et al. (2017) as our 3D virtual simulation environment, just like the previous work of Wang et al. (2021a). Then the synthetic 2D input based on the 3D model could be obtained as below:
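The masking and compositing steps in this paragraph can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' implementation: images are small grayscale grids stored as nested lists, white is encoded as 1.0, and the helper names `object_mask` and `composite` are hypothetical.

```python
WHITE = 1.0  # assumed encoding of the renderer's white background

def object_mask(rendered):
    """Return 1.0 for object pixels and 0.0 for white background pixels."""
    return [[0.0 if px == WHITE else 1.0 for px in row] for row in rendered]

def composite(rendered, background, mask):
    """Keep rendered pixels where the mask is 1 and background pixels where it is 0."""
    return [
        [m * r + (1.0 - m) * b for r, b, m in zip(r_row, b_row, m_row)]
        for r_row, b_row, m_row in zip(rendered, background, mask)
    ]

# Toy 2x3 example: the rendered object occupies three non-white pixels.
rendered = [[1.0, 0.2, 1.0],
            [1.0, 0.4, 0.6]]
background = [[0.5, 0.5, 0.5],
              [0.7, 0.7, 0.7]]
mask = object_mask(rendered)
synthetic = composite(rendered, background, mask)
```

One caveat of this simple thresholding is that a pure white pixel on the object itself would be treated as background; a renderer that also returns a silhouette or alpha channel avoids that ambiguity.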


For one detection model parameterized by , the aim of physical attack is to update pattern that can mislead the model by optimizing the following equation:


where indicates object ground truth and means a specific loss function that encourages the object detector to generate an incorrect output with either a misaligned bounding box or a low class (object) confidence. To realize high transferable attack, our work develops a novel attention-based loss function, which will be presented in the following subsections.

Figure 2: The proposed pipeline. We first sample a set of triangular faces from the mesh, then initialize them with random noise, and finally render the 3D model on a virtual background. After that, several compound transformations are applied to the original inputs before they are fed into the detection model, from which multi-scale attention maps are obtained. To suppress model-specific noise in the maps, we align the maps with the original input images by several inverse transformations, and subsequently obtain the averaged map by linearly combining the aligned maps. Finally, the perturbation pattern is optimized by suppressing foreground attention and amplifying background attention.

3.2 Generating Multi-scale Attention Maps with Detection Model

Motivated by Grad-CAM Selvaraju et al. (2017), we propose a detection-suitable and noise-reduced attention generation method that is suited to transferable attack on object detection. To get attention maps, the original Grad-CAM first calculates the gradients of (the predicted probability of class ) with respect to feature maps of certain activation layer , and then accumulates the feature maps with gradient-based weights. However, unlike classification models, which only output a vector of probabilities corresponding to different classes, object detection models usually yield a certain number of vectors including the predicted bounding boxes and probabilities.

Taking Yolo-V3 Redmon and Farhadi (2018) as an example, which is also one of the white-box models in our experiment, the model generates detection result of before Non-Maximum Suppression (NMS), where ,, and are parameters of the predicted boxes, indicates object confidence and represents class probability. It should be noticed that every element of is a concatenated vector of values corresponding to all the candidate objects. Similar to the original Grad-CAM, we have to first figure out what content of includes. For one-stage detection model with input , is chosen as follows:


where denotes the probability vector of class . It is worth noting that we do not consider NMS because it may yield null values with adversarial samples. With regard to two-stage detectors like Faster R-CNN, we just choose , where denotes intersection over union between detected boxes and ground truth boxes. Then we can get a coarse map of layer by linearly combining activation maps and corresponding weights in channel direction:


where represents channel index, denotes pixel index of activation map and is the total number of pixels in . For simplicity purposes, the symbol of class is ignored in the calculated result. It is obvious that the dimension of attention map in equation (4) is equal to that of , which would vary with the location of layer. To merge the attention of different layers, the maps should be adjusted to align with input images.
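The channel-weighted combination described above follows the general Grad-CAM recipe, which can be sketched as follows. This is a hedged illustration, not the authors' code: the function name `grad_cam_map` is hypothetical, the gradients are assumed to be given (in practice they would come from automatic differentiation of the chosen detection score), and the final ReLU is taken from the original Grad-CAM, so whether it is kept in the method above is an assumption.

```python
def grad_cam_map(activations, gradients):
    """Combine K channel maps (each H x W nested lists) into one coarse attention map."""
    h, w = len(activations[0]), len(activations[0][0])
    n_pixels = h * w
    # Channel weight: the gradient averaged over all pixels of that channel.
    weights = [sum(sum(row) for row in g) / n_pixels for g in gradients]
    cam = [[0.0] * w for _ in range(h)]
    for wk, act in zip(weights, activations):
        for i in range(h):
            for j in range(w):
                cam[i][j] += wk * act[i][j]
    # ReLU as in the original Grad-CAM (assumed here).
    return [[max(0.0, v) for v in row] for row in cam]

# Two channels on a 2x2 layer; the second channel has a negative weight.
activations = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
gradients = [
    [[0.5, 0.5], [0.5, 0.5]],      # channel weight 0.5
    [[-1.0, -1.0], [-1.0, -1.0]],  # channel weight -1.0
]
cam = grad_cam_map(activations, gradients)
```

The resulting map has the spatial size of the chosen layer, which is why maps from different layers have to be resized to the input resolution before they can be merged.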

In general, the produced attention maps based on the above procedure would bring about model-specific noise, which may hinder the transferability of attack among different models. To alleviate this issue, we first introduce several compound transformations to the input of formula (1):


where is the total number of compound transformations, and transformation consists of a total number of base random transformations (including horizontal flip, translation, scaling, etc.). After that, the transformed inputs are concatenated along the batch dimension before being fed into the detectors. Subsequently, we can get the corresponding attention map according to formulas (3) and (4). Because the generated attention is transformed together with the input by the adopted random transformations, we need to introduce inverse transformations to make aligned with each other:


where denotes the aligned attention map and is compound inverse transformation that consists of serial inverse base transformations (corresponding to respectively).

In order to merge different attention maps, we propose an efficient and direct method as below:


where is the averaged attention map of layer and is the total number of attention scales. One advantage of equation (7) is that the noise could be smoothed without extra forward propagation, which can enhance the efficiency of optimization procedure remarkably.
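The transform, align and average procedure of this subsection can be illustrated with a toy example. This is a sketch under explicit assumptions: `toy_attention` is a hypothetical stand-in for a detector's attention that follows the object but also fires at a fixed, input-independent position, and only a horizontal flip and a circular column shift are used because both have exact inverses on a small grid.

```python
def hflip(m):
    """Mirror every row; a horizontal flip is its own inverse."""
    return [row[::-1] for row in m]

def shift_cols(m, k):
    """Circularly shift every row by k columns (negative k shifts left)."""
    w = len(m[0])
    return [[row[(j - k) % w] for j in range(w)] for row in m]

def averaged_attention(attn_fn, image, transforms):
    """Average attention maps after mapping each one back to the input frame."""
    h, w = len(image), len(image[0])
    total = [[0.0] * w for _ in range(h)]
    for forward, inverse in transforms:
        aligned = inverse(attn_fn(forward(image)))
        for i in range(h):
            for j in range(w):
                total[i][j] += aligned[i][j]
    n = len(transforms)
    return [[v / n for v in row] for row in total]

def toy_attention(img):
    """Hypothetical attention: follows the object plus a fixed spurious response."""
    out = [row[:] for row in img]
    out[0][0] += 1.0  # model-specific noise at a fixed position
    return out

image = [[0.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0]]  # the object sits at row 1, column 1
transforms = [
    (lambda m: m, lambda m: m),                                  # identity
    (hflip, hflip),                                              # flip and its inverse
    (lambda m: shift_cols(m, 2), lambda m: shift_cols(m, -2)),   # shift and its inverse
]
avg = averaged_attention(toy_attention, image, transforms)
```

In this toy setting the object response stays at full strength after alignment, while the fixed-position noise is spread over three different locations and each copy is reduced to one third.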

Following the processes above, the averaged attention map could be yielded for any convolutional layer. Intuitively, attention of a single layer may not be enough to capture objects at different scales. Especially, the attention of a certain layer may even vanish for some small objects due to the convolution operation in the encoding process. Thus, it is necessary to combine multi-scale attentions to capture features of objects with various resolutions.

3.3 Attacking on Separable Attention (ASA)

Unlike previous attention-based attacks which take the generated map as a whole, we propose to attack on separable attention, i.e. our method suppresses attention of the foreground and amplifies that of the background simultaneously. Therefore, our novel attention loss is made up of two parts and can be formulated as follows:


where denotes Foreground Attention Suppression () loss and indicates Background Attention Amplification () loss. Apparently, those losses are closely related to the foreground attention and the background attention , which could be obtained with the object mask :


where denotes attention map of layer deduced in the last subsection, indicates some all-1 matrix with the same dimension as mask , and the summation on the right side hints that the separable maps cover different scales of attention.

To alter the distribution of foreground and background attention, one direct and efficient method is to change their global average values, which are defined as and respectively:


where and are pixel values of foreground and background attention at position , and means the total number of nonzero elements. Actually, adversarial texture can be optimized by minimizing and maximizing simultaneously.
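The separation by mask and the global averages can be sketched as below. This is a minimal illustration with hypothetical helper names; it assumes the per-layer maps have already been resized to the mask resolution, and it interprets "the total number of nonzero elements" as the nonzero entries of the separated map, which is an assumption.

```python
def sum_maps(maps):
    """Add attention maps of different scales pixel by pixel."""
    h, w = len(maps[0]), len(maps[0][0])
    return [[sum(m[i][j] for m in maps) for j in range(w)] for i in range(h)]

def split_attention(attention, mask):
    """Return (foreground, background) attention using a 0/1 object mask."""
    fg = [[a * m for a, m in zip(a_row, m_row)]
          for a_row, m_row in zip(attention, mask)]
    bg = [[a * (1.0 - m) for a, m in zip(a_row, m_row)]
          for a_row, m_row in zip(attention, mask)]
    return fg, bg

def global_average(att):
    """Mean over the nonzero entries of a separated attention map."""
    values = [v for row in att for v in row]
    nonzero = sum(1 for v in values if v != 0.0)
    return sum(values) / nonzero if nonzero else 0.0

layer_maps = [
    [[0.2, 0.8], [0.0, 0.4]],
    [[0.0, 0.2], [0.2, 0.0]],
]
mask = [[0.0, 1.0],
        [0.0, 1.0]]  # the object covers the right column
attention = sum_maps(layer_maps)
foreground, background = split_attention(attention, mask)
g_fg = global_average(foreground)
g_bg = global_average(background)
```

An attack in the spirit of this subsection would push `g_fg` down and `g_bg` up.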

However, a low (high) global average value of pixels does not guarantee that all the local pixels have low (high) values. To solve the problem, we define by sorting all the pixels of or in descending order, taking out the top values and calculating their mean value. Moreover, the ratio of to is adopted to avoid the situation where attention loss becomes too small to guide the optimization. By doing so, the ratio could be always larger than 1 unless all pixels have the same value. Thus, we can get and as below:


where and are hyperparameters, and the minus sign in implies that we expect the background attention to be enhanced during optimization.
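The top-value statistic and the ratio to the global average described above can be sketched as follows. The helper names are hypothetical, the maps are assumed to contain at least one nonzero entry, and the statistics are computed over nonzero entries only, which is an assumption. How the ratio is weighted by the hyperparameters in the final losses is not reproduced here.

```python
def nonzero_values(att):
    """Flatten a separated attention map and drop masked-out zeros."""
    return [v for row in att for v in row if v != 0.0]

def global_mean(att):
    vals = nonzero_values(att)
    return sum(vals) / len(vals)

def top_k_mean(att, k):
    """Mean of the k largest attention values."""
    vals = sorted(nonzero_values(att), reverse=True)[:k]
    return sum(vals) / len(vals)

def peak_ratio(att, k):
    """Ratio of the top-k mean to the global mean; at least 1 by construction."""
    return top_k_mean(att, k) / global_mean(att)

peaked = [[0.0, 1.0],
          [0.0, 0.4]]
flat = [[0.5, 0.5],
        [0.0, 0.5]]
ratio_peaked = peak_ratio(peaked, 1)
ratio_flat = peak_ratio(flat, 2)
```

The ratio equals 1 only when all retained pixels share the same value, as in `flat`, so it keeps providing a signal when the global average alone has become very small.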

3.4 The Overall Algorithm

To promote the naturalness of the pattern, we resort to the 3D smooth loss Maesumi et al. (2021) shown below:


where is RGB color vector of , represents all the edges of sampled triangular faces except for boundary edges, is the length of edge , and are the adjacent sampled triangular faces. To reduce color difference, we also bring in non-printability score (NPS) Thys et al. (2019) as follows:


where is a color vector of printable RGB triplets set , indicates the color vector of one sampled triangular face, and denotes the set of sampled faces.
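Both regularizers can be sketched on a tiny mesh. This is a hedged illustration, not the formulation used above: it assumes the smooth loss weights the absolute color difference of adjacent faces by the shared edge length, and that the non-printability score sums, over faces, the Euclidean distance to the nearest printable color. The function names and the data layout are hypothetical.

```python
import math

def smooth_loss(edges, face_colors):
    """edges: (length, face_a, face_b) triples for interior edges only."""
    return sum(
        length * sum(abs(ca - cb) for ca, cb in zip(face_colors[a], face_colors[b]))
        for length, a, b in edges
    )

def non_printability_score(face_colors, printable):
    """Sum over faces of the distance to the closest printable RGB triplet."""
    return sum(
        min(math.dist(color, p) for p in printable)
        for color in face_colors.values()
    )

face_colors = {
    0: (1.0, 0.0, 0.0),
    1: (1.0, 0.0, 0.0),
    2: (0.0, 0.0, 0.0),
}
edges = [(2.0, 0, 1), (1.0, 1, 2)]  # faces 0-1 share a long edge, faces 1-2 a short one
printable = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]

l_smooth = smooth_loss(edges, face_colors)
l_nps = non_printability_score(face_colors, printable)
```

Here the identically colored faces 0 and 1 contribute nothing to the smooth loss, and only the black face 2 is penalized by the non-printability score because black is not in the toy printable set.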

On the whole, we obtain transferable physical adversarial pattern by taking separable attention loss, smooth loss and non-printability score into consideration simultaneously. Therefore, our adversarial pattern is updated by optimizing the following formula:


where and are hyperparameters.

4 Experiments and Results

In this section, we first present experimental settings, and then conduct extensive experimental evaluations of black-box transferability to verify the effectiveness of our proposed method. Afterwards, a series of ablation studies is conducted to support further analysis of our method.

4.1 Experimental Settings

Data Collection.

Both the training set and the testing set are collected with CARLA Dosovitskiy et al. (2017), which is an open-source 3D simulator developed for autonomous driving research. Unlike some works where images are captured at relatively nearby locations Wang et al. (2021a); Jiang et al. (2021), we carry out attack on a more complicated dataset with longer camera-object distances. Specifically, we take photos with five distance values (R=10, 20, 30, 40 and 50), three camera pitch values (50, 70 and 90 degrees) and eight camera yaw values (0, 45, 90, 135, 180, 225, 270 and 315 degrees). In other words, we get 120 images for a single vehicle location (24 images per distance value). To increase the diversity of the dataset, we randomly select 31 points to place our vehicle in the virtual environment. Finally, we get 3600 images in total, 3000 of which are set as the training set and 600 as the testing set. Besides, all the images are obtained with a resolution of 1024×1024 in CARLA and then resized to 608×608 before being fed into the models.
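The per-location image count follows directly from the camera grid, as the short enumeration below shows. The variable names are illustrative, and treating the pitch and yaw values as degrees is an assumption.

```python
from itertools import product

distances = [10, 20, 30, 40, 50]
pitches = [50, 70, 90]                       # assumed to be degrees
yaws = [0, 45, 90, 135, 180, 225, 270, 315]  # assumed to be degrees

# One camera pose per (distance, pitch, yaw) combination.
poses = list(product(distances, pitches, yaws))
poses_per_location = len(poses)
poses_per_distance = sum(1 for d, _, _ in poses if d == distances[0])
```

This gives 5 × 3 × 8 = 120 poses per vehicle location and 3 × 8 = 24 poses per distance value, matching the counts stated above.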

Detectors Training.

We first optimize the camouflage pattern in white-box detection models, which consist of Yolo-V3 Redmon and Farhadi (2018), Yolo-V5, Faster R-CNN Ren et al. (2015) and Retinanet Lin et al. (2017) (Faster R-CNN and Retinanet are implemented in PyTorch). We conduct evaluation experiments on six black-box models, which include two classic one-stage detectors (one SSD Liu et al. (2016) model with a fixed input size of 512 and one Yolo-V5, the same as the white-box model above), an FPN-based one-stage detector (Retinanet Lin et al. (2017)), a couple of two-stage detectors (Faster R-CNN Ren et al. (2015) and Mask R-CNN He et al. (2017)) and finally one transformer-based detector (Deformable DETR Zhu et al. (2020)). All the evaluation models (except for Yolo-V5) are based on MMDetection Chen et al. (2019). Besides, all the models above are first trained on the VisDrone2019 dataset Zhu et al. (2021) and then fine-tuned on our dataset captured from CARLA.

Evaluation Metrics.

In this paper, we take AP as our first metric, like previous work Wu et al. (2020b). Given that AP is calculated over various confidence thresholds, our work also takes attack success rate (ASR) Jiang et al. (2021) as another metric, which involves only one confidence threshold.

Baseline Methods.

To evaluate the effectiveness of our method, we compare our transferability from white-box detectors to evaluation models with several state-of-the-art works, including Dual Attention Suppression Attack (DAS) Wang et al. (2021a), Full-coverage Camouflage Attack (FCA) Jiang et al. (2021) and Attention-guided Transfer Attack (ATA) Wu et al. (2020a).

Implementation Details.

All the adversarial patterns are optimized with the following settings: the batch size is 2, the maximum number of epochs is 5, and the learning rate of the SGD optimizer is 0.01. Our experiments are conducted on a desktop with an Intel Core i9 CPU and one Nvidia RTX-3090 GPU. All experiments are carried out in the PyTorch framework (with torch 1.8.0 and PyTorch3D 0.6.0).

4.2 Comparison of Transferability

In this section, we compare the performance of our method with that of baseline methods. To analyze the performance objectively, we reproduce baselines by adopting their critical loss functions under the same controlled conditions (e.g. pattern size, 3D smooth loss, etc.). In particular, we do not consider distracting human attention as in the original DAS, where a visually natural texture was pasted on the pattern. In addition, our reproduced ATA permits the attacker to yield localized and amplitude-unconstrained perturbations. Figure 1 displays typical comparisons of detection results and attention distraction, which show the superiority of our method. Moreover, Table 1 illustrates the transferability of different attack methods from white-box models (in the first column) to black-box detectors (in the first row). As for criteria, the lower the AP, the better the transferability, and the opposite holds for the ASR. The results on clean images are also displayed in the table (denoted as Raw).

White-box     Method   Faster R-CNN    Mask R-CNN      Yolo-V5         SSD             Retinanet       Deformable Detr
                       AP(%)  ASR(%)   AP(%)  ASR(%)   AP(%)  ASR(%)   AP(%)  ASR(%)   AP(%)  ASR(%)   AP(%)  ASR(%)
-             Raw      89.1   0.0      92.5   0.0      85.1   0.0      84.3   0.0      90.1   0.0      90.3   0.0
Yolo-V3       FCA      62.1   35.5     66.5   34.5     76.3   28.8     31.5   76.7     75.6   26.3     63.7   71.7
Yolo-V3       DAS      64.4   36.0     66.0   34.7     78.1   20.8     34.8   73.4     75.2   29.3     68.2   66.8
Yolo-V3       ATA      49.5   53.7     55.8   43.0     76.8   9.7      41.3   62.8     57.7   53.0     78.4   43.3
Yolo-V3       Ours     20.6   83.3     33.8   68.3     70.7   31.0     28.2   80.5     44.9   72.2     68.0   68.5
Yolo-V5       FCA      38.1   68.2     44.2   59.8     51.1   63.0     31.5   75.3     63.1   61.0     63.6   70.8
Yolo-V5       DAS      37.8   66.0     44.3   57.5     58.4   54.0     37.0   66.8     65.7   57.7     65.2   66.7
Yolo-V5       ATA      77.0   12.2     81.9   6.2      82.3   0.8      62.4   40.0     81.5   4.7      82.8   3.8
Yolo-V5       Ours     31.7   77.3     32.9   72.0     44.2   78.2     17.7   86.4     69.0   53.3     59.7   83.3
Faster R-CNN  FCA      16.6   88.5     32.0   69.8     75.1   12.7     34.3   71.0     32.7   84.7     76.5   50.8
Faster R-CNN  DAS      20.3   84.8     35.1   65.7     75.6   10.5     38.6   67.7     33.9   80.5     76.8   44.8
Faster R-CNN  ATA      44.0   58.3     55.5   41.3     77.5   2.0      46.5   59.1     51.2   58.7     81.8   16.8
Faster R-CNN  Ours     12.6   92.7     27.3   77.0     77.1   12.3     24.9   81.9     40.9   77.0     74.8   56.2
Retinanet     FCA      13.5   88.8     28.4   72.2     75.3   13.8     29.9   74.8     26.7   85.0     78.6   45.5
Retinanet     DAS      19.0   87.2     37.1   65.7     75.8   11.0     31.2   75.7     38.6   79.0     79.4   44.8
Retinanet     ATA      27.3   77.7     37.0   64.3     78.7   7.3      39.7   66.4     33.4   82.0     79.3   41.7
Retinanet     Ours     15.3   89.8     28.2   75.2     74.2   15.3     31.1   72.7     24.8   85.8     75.9   52.2
Table 1: Transferability of different attack methods from white-box models (the first column) to typical black-box models (the first row). The higher (lower) the ASR (AP), the better the transferability.

According to the Yolo-V3 rows of Table 1, our method outperforms the other methods when transferring from Yolo-V3 to both two-stage detectors and one-stage detectors. Specifically, the ASR (AP) of our method is roughly double (half) that of FCA and DAS when attacking Faster R-CNN, Mask R-CNN and Retinanet. In addition, the ASR (AP) of our method still takes the lead when attacking Yolo-V5 and SSD. Although our method does not have the best transferability when attacking Deformable Detr, our ASR (68.5%) is only about 3 percentage points lower than that of FCA (71.7%).

Similarly, according to the APs and ASRs in the Yolo-V5 rows, our method remains clearly superior when transferring from Yolo-V5 to other models. In particular, our ASRs are about 10 percentage points higher than those of the baseline methods for all the models except Retinanet. The Faster R-CNN rows represent the transferability of adversarial patterns generated with Faster R-CNN as the white-box detector, and the results reveal that our method achieves the best performance when attacking both of the two-stage detectors, SSD and Deformable Detr. Meanwhile, FCA gets two first places (when transferring to Yolo-V5 and Retinanet) despite its simplicity. In addition, the comparisons in the Retinanet rows once again highlight the superiority of our method.

4.3 Effect of Averaging on Attention Map

Figure 3: Effect of compound transformations on attention map. (a)-(d) are the averaged attention maps with 0-3 compound transformations respectively.

Because the inherent noise is model-specific and hinders adversarial examples from transferring to black-box models, we introduce averaged attention maps to suppress it. Figure 3(b)-(d) display the averaged attention maps with 1-3 compound transformations respectively, where the "heated" regions of the background are noticeably weaker than those of Figure 3(a).

To verify the effectiveness of averaged attention on transferability, we compute all the APs and ASRs of adversarial patterns manufactured with different transformations, covering from zero transformation (i.e. original images) to four compound transformations, each of which includes horizontal flip, random scaling and random translation. The final result is demonstrated in Figure 4, where the left graph denotes correlation between APs and number of compound transformations for different black-box models, and the right graph indicates the variation of ASRs under identical conditions. According to the left graph, APs show a downward trend as we increase the number of compound transformation from zero to three (except for a slight difference for SSD under 3 transformations), and a conspicuous turning point could be seen for Retinanet and Faster R-CNN if we continue to increase transformation number. On the contrary, the ASR curves in the right graph witness an opposite variation tendency, i.e. ASRs go up along with the increase of transformations and the highest point can be achieved at 3 compound transformations for most models.

Figure 4: Effect of image transformations on attack transferability. The left (right) graph denotes the correlation between APs (ASRs) and number of compound transformations.

The experiment above suggests that attack transferability can be enhanced substantially (ASRs could be enhanced by about 20% and APs reduced by more than 10%) by smoothing attention maps. A well-trained detector keeps its attention on objects no matter how we adjust the input images with basic transformations, but the location of noise may differ from one transformation to another. Therefore, averaging has limited effect on the foreground attention but suppresses background noise, which finally brings about the enhancement of attack transferability.

4.4 Ablation Studies

In this subsection, we conduct ablation studies to figure out how hyper-parameters affect transferability, as well as the effect of different attention loss items.

The effect of hyper-parameters on transferability.

According to equation (11), is used for controlling the global average value of separable attention, and is devised to adjust the local attention with top pixels. We evaluate the effect of those hyper-parameters with variable-controlling method and the results are displayed in Table 2. The second row denotes the effect of (=1.0, 2.5, 5.0, 10) on transferability with , from where we can see that the best performance could be achieved when (except for transferring to Yolo-V5). The third row displays the impact of on transferability with and . By comparing the results of the second row with that of the third row, it is not hard to find that the existence of could enhance the attack performance of Yolo-V5, SSD and Deformable Detr. Besides, the last row (, ) gives a guidance of choosing an appropriate ().

Hyper-parameter  Faster R-CNN    Mask R-CNN      Yolo-V5         SSD             Retinanet       Deformable Detr
                 AP(%)  ASR(%)   AP(%)  ASR(%)   AP(%)  ASR(%)   AP(%)  ASR(%)   AP(%)  ASR(%)   AP(%)  ASR(%)
                 45.5   61.3     50.4   54.2     73.4   31.2     29.8   74.4     65.1   48.0     65.0   71.5
                 27.7   77.7     37.4   64.8     69.8   35.2     33.0   72.7     49.8   69.7     65.1   66.2
                 23.7   81.8     35.1   65.7     70.8   30.5     27.5   79.1     46.5   73.8     67.3   71.5
                 29.2   75.7     43.6   59.7     72.9   17.5     32.1   73.9     50.5   69.3     74.5   60.8
                 63.8   35.2     66.1   33.3     79.6   16.7     27.4   77.6     74.2   32.0     65.3   74.8
                 65.0   31.3     65.4   36.5     77.1   28.5     24.0   82.4     75.1   28.2     64.4   76.8
                 69.0   24.8     71.4   27.5     75.6   37.7     13.6   93.6     77.7   18.2     64.3   76.5
                 74.4   13.5     77.7   13.2     74.7   30.7     14.6   94.1     81.0   7.5      71.6   68.8
                 73.0   20.0     75.6   19.7     75.3   31.8     10.7   95.5     79.2   15.0     68.4   73.2
                 69.0   24.8     71.4   27.5     75.6   37.7     13.6   93.6     77.7   18.2     64.3   76.5
                 72.1   20.8     73.8   23.5     75.2   33.7     12.1   93.6     78.8   14.5     68.8   73.5
                 72.0   20.0     72.9   23.2     75.8   37.2     16.4   92.9     78.4   15.2     68.9   75.2
Table 2: The comparison of transferability with different hyper parameters. The higher the ASR (or the lower the AP), the better the transferability.
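As a rough illustration of the two roles the text assigns to the hyper-parameters of equation (11), the sketch below combines a global-average term with a local term computed over the top pixels. The function name `separable_attention_loss`, the weights `w_global` and `w_local`, the pixel count `k`, and the exact form of each term are assumptions made for illustration only; equation (11) itself may be defined differently.

```python
import numpy as np


def separable_attention_loss(attention, fg_mask, w_global=1.0, w_local=1.0, k=10):
    """Hypothetical loss: lower values mean less foreground and more background attention.

    attention : 2D array of non-negative attention values.
    fg_mask   : boolean array of the same shape, True on the foreground object.
    w_global  : assumed weight of the global-average term.
    w_local   : assumed weight of the top-k local term.
    k         : assumed number of top pixels used by the local term.
    """
    fg = attention[fg_mask]
    bg = attention[~fg_mask]

    # Global term: average foreground attention minus average background attention.
    global_term = fg.mean() - bg.mean()

    # Local term: the same contrast, restricted to the k strongest pixels of each region.
    top_fg = np.sort(fg)[-min(k, fg.size):].mean()
    top_bg = np.sort(bg)[-min(k, bg.size):].mean()
    local_term = top_fg - top_bg

    return w_global * global_term + w_local * local_term


# Toy check: attention on the object scores higher (worse for the attacker)
# than attention spread over the background.
fg_mask = np.zeros((8, 8), dtype=bool)
fg_mask[2:5, 3:6] = True
loss_on_object = separable_attention_loss(fg_mask.astype(float), fg_mask)
loss_on_background = separable_attention_loss((~fg_mask).astype(float), fg_mask)
print(loss_on_object, loss_on_background)
```

Minimizing such a quantity with respect to the camouflage texture would push attention away from the object and towards the background, which is the direction the separable-attention attack aims for.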

The effect of different attention loss items.

Attention Loss   Faster R-CNN    Mask R-CNN      Yolo-V5         SSD             Retinanet       Deformable Detr
                 AP(%)  ASR(%)   AP(%)  ASR(%)   AP(%)  ASR(%)   AP(%)  ASR(%)   AP(%)  ASR(%)   AP(%)  ASR(%)
                 28.2   76.2     41.0   61.7     69.1   29.0     26.6   78.4     50.9   66.8     68.9   64.8
                 70.7   24.2     73.7   22.8     77.1   26.7     20.8   88.7     77.3   20.0     68.1   71.2
                 20.6   83.3     33.8   68.3     70.7   31.0     28.2   80.5     44.9   72.2     68.0   68.5
Table 3: The comparison of transferability under different attention loss items.

To verify whether a single foreground or background attention loss could make a difference, we carry out ablation studies on the effect of different attention loss items. Concretely, adversarial patterns are optimized under the foreground attention loss alone, the background attention loss alone, and their combination, respectively (with , and ). According to Table 3, combining the foreground and background losses yields the best performance (both AP and ASR) for Faster R-CNN, Mask R-CNN and Retinanet. Meanwhile, the ASR of Yolo-V5 (31.0%) and the AP of Deformable Detr (68.0%) are also the best in the last row of Table 3. Besides, the results in the third row show that the background loss alone brings about a weaker attack performance among black-box models, with the exception of SSD and Deformable Detr. Although the foreground loss alone does not lead to a striking result, its gap from the integrated loss (i.e. the difference between the second row and the last row) is much smaller than that of the background loss for the majority of detectors, which means that the foreground attention loss plays a more important role than the background attention loss.
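The gap comparison in the paragraph above can be checked with a few lines of arithmetic on the ASR columns of Table 3. In this sketch the assignment of the three data rows to foreground-only, background-only and combined losses is inferred from the paragraph rather than read from the table, and the variable names are illustrative.

```python
DETECTORS = ["Faster R-CNN", "Mask R-CNN", "Yolo-V5", "SSD", "Retinanet", "Deformable Detr"]

# ASR(%) values copied from Table 3; the row-to-loss mapping is an assumption
# inferred from the accompanying paragraph.
ASR_FOREGROUND_ONLY = [76.2, 61.7, 29.0, 78.4, 66.8, 64.8]
ASR_BACKGROUND_ONLY = [24.2, 22.8, 26.7, 88.7, 20.0, 71.2]
ASR_COMBINED = [83.3, 68.3, 31.0, 80.5, 72.2, 68.5]


def asr_gaps(single, combined):
    """ASR of the combined loss minus ASR of a single loss item, per detector."""
    return [round(c - s, 1) for s, c in zip(single, combined)]


fg_gaps = asr_gaps(ASR_FOREGROUND_ONLY, ASR_COMBINED)
bg_gaps = asr_gaps(ASR_BACKGROUND_ONLY, ASR_COMBINED)

# Detectors for which the foreground-only result sits closer to the combined result.
closer = [name for name, f, b in zip(DETECTORS, fg_gaps, bg_gaps) if abs(f) < abs(b)]

for name, f, b in zip(DETECTORS, fg_gaps, bg_gaps):
    print(f"{name:16s} foreground gap {f:+5.1f}   background gap {b:+5.1f}")
print(f"Foreground-only is closer for {len(closer)} of {len(DETECTORS)} detectors")
```

Under this row assignment, the foreground-only gap is the smaller one for five of the six detectors, with Deformable Detr as the exception.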

5 Conclusions

In this work, we propose a framework for generating 3D physical adversarial camouflage against object detection by attacking multi-scale attention maps. Because the noise in those maps is model-specific and hinders the resulting adversarial pattern from transferring to black-box detectors, our method first applies a series of compound transformations to the input images and then aligns the resulting attention maps with the inverse transformations. We then suppress the noise by averaging the aligned maps directly. To overcome the limitations of existing attention-based attack methods, we propose a direct and efficient optimization strategy based on separable attention, and develop a novel attention loss function that simultaneously suppresses the foreground attention and amplifies that of the background. Extensive comparison experiments evaluate transferability to black-box detectors, and the results show that our method has a notable advantage over other advanced attack methods. Finally, some models (e.g. Yolo-V5) remain hard to attack; we will explore solutions in future work.
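The transform, align and average procedure summarized above can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the paper's pipeline: the transformation set (identity, flips, a 90-degree rotation), the helper names, and the stand-in attention function (object signal plus noise fixed to the model's own frame) are all hypothetical.

```python
import numpy as np

# (forward, inverse) pairs of simple, exactly invertible transformations.
TRANSFORMS = [
    (lambda x: x, lambda x: x),
    (np.fliplr, np.fliplr),
    (np.flipud, np.flipud),
    (lambda x: np.rot90(x, 1), lambda x: np.rot90(x, -1)),
]


def averaged_attention(image, attention_fn, transforms=TRANSFORMS):
    """Transform the input, compute attention, undo the transform, then average."""
    aligned = [inverse(attention_fn(forward(image))) for forward, inverse in transforms]
    return np.mean(aligned, axis=0)


def make_toy_attention(shape, seed=0):
    """Hypothetical stand-in for a detector's attention: object signal plus fixed noise."""
    noise = 0.5 * np.random.default_rng(seed).random(shape)

    def attention_fn(image):
        # The noise pattern stays put while the object moves with the transformation.
        return image + noise

    return attention_fn


image = np.zeros((8, 8))
image[2:5, 3:6] = 1.0  # square "object" on an empty background
attention_fn = make_toy_attention(image.shape)

single = attention_fn(image)
smoothed = averaged_attention(image, attention_fn)

background = image == 0
print("background noise std, single map:  ", single[background].std())
print("background noise std, averaged map:", smoothed[background].std())
```

Because the object response follows each transformation while the noise does not, the inverse transformations realign the object and scatter the noise, so averaging leaves the foreground largely intact and flattens the background.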


  • [1] A. Athalye, L. Engstrom, A. Ilyas, and K. Kwok (2018) Synthesizing robust adversarial examples. In International conference on machine learning, pp. 284–293. Cited by: §2.
  • [2] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, et al. (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155. Cited by: §4.1.
  • [3] S. Chen, F. He, X. Huang, and K. Zhang (2022) Relevance attack on detectors. Pattern Recognition 124, pp. 108491. Cited by: §2.
  • [4] S. Chen, Z. He, C. Sun, J. Yang, and X. Huang (2020) Universal adversarial attack on attention and the resulting dataset damagenet. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §1.
  • [5] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun (2017) CARLA: an open urban driving simulator. In Conference on robot learning, pp. 1–16. Cited by: §3.1, §4.1.
  • [6] A. Du, B. Chen, T. Chin, Y. W. Law, M. Sasdelli, R. Rajasegaran, and D. Campbell (2022) Physical adversarial attacks on an aerial imagery object detector. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 1796–1806. Cited by: §2.
  • [7] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song (2018) Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1625–1634. Cited by: §1, §2.
  • [8] Z. Gong, P. Zhong, and W. Hu (2020) Statistical loss and analysis for deep learning in hyperspectral image classification. IEEE Transactions on Neural Networks and Learning Systems 32 (1), pp. 322–333. Cited by: §1.
  • [9] Z. Gong, P. Zhong, Y. Yu, W. Hu, and S. Li (2019) A cnn with multiscale convolution and diversified metric for hyperspectral image classification. IEEE Transactions on Geoscience and Remote Sensing 57 (6), pp. 3599–3618. Cited by: §1.
  • [10] I. J. Goodfellow, J. Shlens, and C. Szegedy (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1.
  • [11] C. Guo, J. Gardner, Y. You, A. G. Wilson, and K. Weinberger (2019) Simple black-box adversarial attacks. In International Conference on Machine Learning, pp. 2484–2493. Cited by: §2.
  • [12] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: §1, §4.1.
  • [13] L. Huang, C. Gao, Y. Zhou, C. Xie, A. L. Yuille, C. Zou, and N. Liu (2020) Universal physical camouflage attacks on object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 720–729. Cited by: §1, §1.
  • [14] Q. Huang, I. Katsman, H. He, Z. Gu, S. Belongie, and S. Lim (2019) Enhancing adversarial example transferability with an intermediate level attack. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 4733–4742. Cited by: §2.
  • [15] N. Inkawhich, K. Liang, B. Wang, M. Inkawhich, L. Carin, and Y. Chen (2020) Perturbing across the feature hierarchy to improve standard and strict blackbox attack transferability. Advances in Neural Information Processing Systems 33, pp. 20791–20801. Cited by: §2.
  • [16] N. Inkawhich, W. Wen, H. H. Li, and Y. Chen (2019) Feature space perturbations yield more transferable adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7066–7074. Cited by: §2.
  • [17] T. Jiang, J. Sun, W. Zhou, X. Zhang, Z. Gong, W. Yao, X. Chen, et al. (2021) FCA: learning a 3d full-coverage vehicle camouflage for multi-view physical adversarial attack. arXiv preprint arXiv:2109.07193. Cited by: Figure 1, §1, §2, §4.1, §4.1, §4.1.
  • [18] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pp. 2980–2988. Cited by: §4.1.
  • [19] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) Ssd: single shot multibox detector. In European conference on computer vision, pp. 21–37. Cited by: §4.1.
  • [20] A. Maesumi, M. Zhu, Y. Wang, T. Chen, Z. Wang, and C. Bajaj (2021) Learning transferable 3d adversarial cloaks for deep trained detectors. arXiv preprint arXiv:2104.11101. Cited by: §1, §2, §3.4.
  • [21] N. Ravi, J. Reizenstein, D. Novotny, T. Gordon, W. Lo, J. Johnson, and G. Gkioxari (2020) Accelerating 3d deep learning with pytorch3d. arXiv preprint arXiv:2007.08501. Cited by: §3.1.
  • [22] J. Redmon and A. Farhadi (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §1, §3.2, §4.1.
  • [23] S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster r-cnn: towards real-time object detection with region proposal networks. Advances in neural information processing systems 28. Cited by: §1, §4.1.
  • [24] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pp. 618–626. Cited by: §3.2.
  • [25] M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter (2016) Accessorize to a crime: real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 acm sigsac conference on computer and communications security, pp. 1528–1540. Cited by: §2.
  • [26] J. Su, D. V. Vargas, and K. Sakurai (2019) One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation 23 (5), pp. 828–841. Cited by: §1.
  • [27] B. Tarchoun, I. Alouani, A. B. Khalifa, and M. A. Mahjoub (2021) Adversarial attacks in a multi-view setting: an empirical study of the adversarial patches inter-view transferability. In 2021 International Conference on Cyberworlds (CW), pp. 299–302. Cited by: §2.
  • [28] S. Thys, W. Van Ranst, and T. Goedemé (2019) Fooling automated surveillance cameras: adversarial patches to attack person detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp. 0–0. Cited by: §1, §2, §3.4.
  • [29] J. Wang, A. Liu, Z. Yin, S. Liu, S. Tang, and X. Liu (2021) Dual attention suppression attack: generate adversarial camouflage in physical world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8565–8574. Cited by: Figure 1, §1, §1, §2, §2, §3.1, §4.1, §4.1.
  • [30] Y. Wang, H. Lv, X. Kuang, G. Zhao, Y. Tan, Q. Zhang, and J. Hu (2021) Towards a physical-world adversarial patch for blinding object detection models. Information Sciences 556, pp. 459–471. Cited by: §2.
  • [31] Z. Wang, H. Guo, Z. Zhang, W. Liu, Z. Qin, and K. Ren (2021) Feature importance-aware transferable adversarial attacks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 7639–7648. Cited by: §1, §1, §2.
  • [32] W. Wu, Y. Su, X. Chen, S. Zhao, I. King, M. R. Lyu, and Y. Tai (2020) Boosting the transferability of adversarial samples via attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1161–1170. Cited by: Figure 1, §1, §1, §4.1.
  • [33] Z. Wu, S. Lim, L. S. Davis, and T. Goldstein (2020) Making an invisibility cloak: real world adversarial attacks on object detectors. In European Conference on Computer Vision, pp. 1–17. Cited by: §2, §4.1.
  • [34] T. Xiang, H. Liu, S. Guo, T. Zhang, and X. Liao (2021) Local black-box adversarial attacks: a query efficient approach. arXiv preprint arXiv:2101.01032. Cited by: §2.
  • [35] J. Yang, Y. Jiang, X. Huang, B. Ni, and C. Zhao (2020) Learning black-box attackers with transferable priors and query feedback. Advances in Neural Information Processing Systems 33, pp. 12288–12299. Cited by: §1, §2.
  • [36] W. Zhou, X. Hou, Y. Chen, M. Tang, X. Huang, X. Gan, and Y. Yang (2018) Transferable adversarial perturbations. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 452–467. Cited by: §2.
  • [37] P. Zhu, L. Wen, D. Du, X. Bian, H. Fan, Q. Hu, and H. Ling (2021) Detection and tracking meet drones challenge. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1. Cited by: §4.1.
  • [38] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai (2020) Deformable detr: deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159. Cited by: §1, §4.1.