ScaleCert: Scalable Certified Defense against Adversarial Patches with Sparse Superficial Layers

Adversarial patch attacks, which craft the pixels in a confined region of the input image, have demonstrated powerful attack effectiveness in physical environments even under noise or deformation. Existing certified defenses against adversarial patch attacks work well on small images such as MNIST and CIFAR-10, but achieve very poor certified accuracy on higher-resolution images such as ImageNet. It is urgent to design defenses that are both robust and effective against such a practical and harmful attack on industry-scale, larger images. In this work, we propose a certified defense methodology that achieves high provable robustness for high-resolution images and largely improves the practicality of certified defenses for real adoption. The basic insight of our work is that an adversarial patch intends to leverage localized superficial important neurons (SINs) to manipulate the prediction results. Hence, we leverage SIN-based DNN compression techniques to significantly improve the certified accuracy, by reducing the adversarial region searching overhead and filtering the prediction noises. Our experimental results show that the certified accuracy is increased from 36.3% (the best prior certified detection result) to 60.4% on the ImageNet dataset, which largely pushes certified defenses toward practical use.


1 Introduction

Despite the promising opportunities and great success of deep neural network (DNN) techniques in computer vision tasks [9, 13], DNNs are vulnerable to adversarial attacks that craft the input data with adversarial perturbations to manipulate the output [2, 20, 26]. Recent studies reach a consensus that, compared to adversarial example attacks, adversarial patch attacks are more practical and have been well demonstrated in physical environments, with good effectiveness and transferability in both classification and detection systems [5, 14, 15, 23, 24, 28]. The key difference between the adversarial example attack and the patch attack is the constraint on the perturbation. In an adversarial example attack, the adversary perturbs all pixels of the image while keeping the $\ell_p$-norm of the perturbation within a prescribed bound, whereas an adversarial patch attack only changes the pixels in a confined region, without a magnitude bound and free to choose the values. Considering real-world conditions, it is hard to define an $\ell_p$-norm constraint on the perturbation due to environment, viewpoint, and object variation. Hence, the adversarial patch attack is much more feasible to mount in the physical world.

The success of practical adversarial patch attacks raises the importance and urgency of defense methodologies. Existing defense methodologies can be classified into two categories: heuristic defenses and certified defenses. Heuristic defense approaches have good efficiency but lack robustness against a strong adaptive attack. For example, digital watermarking (DW) [8] utilizes the magnitude of the saliency maps to detect unusually dense regions and mask them out of the input. Local gradient smoothing (LGS) [19] pre-processes the image gradient of the classification function with a normalization and a thresholding step and then suppresses the adversarial noise based on the gradient. However, these empirical defenses may be invalidated when confronted with a strong adaptive attacker that has white-box knowledge of the defense [3]. Certified defenses, in contrast, are proved to have provable robustness. For example, [3] proposes the first certified defense based on interval bound propagation (IBP) [7], which is commonly used in adversarial robustness certification problems. De-randomized smoothing (DS) [15] extends the randomized smoothing robustness scheme [4] with structured ablation. Moreover, [33] proposes a provably robust defense based on clipped BagNet [1] with small receptive fields. However, all of these certified studies achieve distinctly low certified accuracy on large-scale datasets such as ImageNet [13]: a certified accuracy of 20.5%-36.3% (covering both certified recovery and detection methodologies; see Table 2 for more details) can hardly be applied in real-world systems. It is a dilemma that certified robustness against such a practical and pernicious attack model can hardly be scaled for realistic usage.

To address this issue, in this paper we propose a defense with both certified robustness and good scalability for the large-scale datasets used in practical scenarios. Inspired by existing analyses of the unique distribution of the activation map yielded by adversarial-patch inputs [16, 19, 24], and by substantial neural network compression techniques that do not sacrifice natural accuracy [10, 31], we observe that adversarial patches rely on localized superficial important neurons (SINs) to poison the output, and that this effect can be largely alleviated by pruning techniques. We then propose the scalable certified (ScaleCert) defense with a SIN-based neural network sparsity approach. The results show that the certified accuracy of ScaleCert on the ImageNet dataset reaches 60.4% against a 1%-pixel patch, significantly surpassing the previous best record of 36.3% among state-of-the-art certified detection methods [25]. This is the first certified work that largely reduces the gap between clean and certified accuracy while retaining high certified accuracy with good computing efficiency, which pushes certified defenses toward common practice in real adoption.

2 Problem Setup

In this section, we describe the formulation and important terminology of adversarial patch attack and certified defenses against it.

2.1 Adversarial Patch Attack Formulation

Given an input $x$ with true label $y$, an adversarial patch attack manipulates the output label of the victim model $F$ by adding a patch perturbation to $x$. The goal of the attack is to find an adversarial patch that can be added to $x$ to generate an image $x'$ satisfying a constraint $x' \in \mathcal{A}(x)$ such that $F(x') = y' \neq y$. Here $y'$ can be a targeted label (targeted attack) or any arbitrary label not equal to the benign label (untargeted attack). The adversarial input can be generated by solving the following optimization:

$$\operatorname*{arg\,max}_{x' \in \mathcal{A}(x)} \; \log \Pr\big[F(x') = y'\big] \qquad (1)$$

The constraint set $\mathcal{A}(x)$ is determined by the threat model. In this paper, the adversary is allowed to arbitrarily change pixels within a square contiguous region and to place this region anywhere on the image, similar to previous works [15, 17, 24]. Formally, we use a binary pixel block $P$ to represent the restricted square region, where the pixels located in the region are set to one and the others to zero. For the purpose of stealthiness, the patch region is much smaller than the image, usually 1% to 5% of its pixels. The constraint set can be expressed as:

$$\mathcal{A}(x) = \big\{\, x' \;\big|\; x' = (1 - P) \odot x + P \odot \delta \,\big\} \qquad (2)$$

where $\odot$ refers to the element-wise product operator and $\delta$ is the adversarial patch generated from Eq. (3).
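For concreteness, the constraint in Eq. (2) amounts to blending patch values into the image under a binary mask. Below is a minimal PyTorch sketch of that operation; the function and argument names are ours, not from the paper.

```python
import torch

def apply_patch(x, delta, top_left, patch_side):
    """Place a square adversarial patch onto an image.

    Implements x' = (1 - P) * x + P * delta from Eq. (2): P is the binary
    pixel block that is 1 inside the restricted square region, 0 elsewhere.
    x, delta: (C, H, W) tensors; top_left: (row, col) of the patch region.
    """
    _, H, W = x.shape
    P = torch.zeros(1, H, W)
    r, c = top_left
    P[:, r:r + patch_side, c:c + patch_side] = 1.0
    return (1 - P) * x + P * delta
```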

2.2 Certified Defenses

Bounding the range of a neural network's outputs under an $\ell_p$-norm-bounded input perturbation has become an important theme in neural network verification and certified adversarial defense [4, 7, 21, 22, 27, 32]. Recently, researchers have also shed light on generalizing the conventional input perturbation constraint to the adversarial patch formulation. Existing certified defenses against adversarial patch attacks can be classified into two major categories with different certification problems:

1) Certified defense for attack recovery. The classifier returns the assurance that the classification result will provably not change under any distortion within an adversarial constraint set for the input. It is a heritage from certified defenses against adversarial example attacks. Formally, the classifier ensures $F(x') = F(x) = y$ to produce a certificate for a clean input $x$ and any adversarial example $x' \in \mathcal{A}(x)$, where $\mathcal{A}(x)$ is the adversarial constraint set defined in Eq. (2). Many existing studies extend certified techniques from adversarial example attacks with the new constraint of patch perturbation to propose certified defenses against patch attacks. For example, [3] proposes a certified defense by extending the interval bound propagation (IBP) [7] defense to patch attacks. [15] proposes the derandomized smoothing methodology based on a structured ablation scheme, extending the randomized smoothing technique [4]. PatchGuard [24] leverages DNN models with small receptive fields to localize the adversarial features and proposes a robust masking defense to remove the adversarial effect of these features.

2) Certified defense for attack detection. The classifier returns the assurance that the image is clean, or raises an alert when the image is an adversarial input. It is a relaxed certification problem compared to certified recovery. However, it is also an important and effective defense because occluding the adversarial patch can eliminate the adversarial effect. MRD [17] is the first work to propose a certified attack detection study: by examining the pattern of prediction labels obtained when occluding windows of the image, adversarial images can be certifiably detected. PatchGuard++ [25] aims to improve the certified detection rate by extending PatchGuard with the same insight as MRD.

For both certified recovery and detection defenses, the state-of-the-art approaches obtain low certified accuracy on large-scale datasets and can hardly be adopted for practical use. As shown in Table 1, the best certified accuracy in recent literature is about 36.3% when the patch contains 1% of the pixels of the input image. Therefore, we confront the dilemma of practicality versus low certified provable robustness for this harmful and practical adversarial patch attack. To address this issue, we aim to propose a defense methodology that bridges the gap between clean and certified accuracy and achieves both provable robustness and good scalability for large-scale datasets.

(ImageNet, PatchSize = 1%)

                        Certified Detection    Certified Recovery
Certified Defense       PatchGuard++ [25]      PG-Mask-BN [24]   PG-Mask-DS [24]   CBN [1]   DS [15]
Certified Accuracy      36.30%                 32.30%            22.50%            21.90%    20.50%

Table 1: Certified accuracy of state-of-the-art defenses on ImageNet.

3 ScaleCert Defense

In this section, we first use an empirical study to motivate the use of the superficial important neuron (SIN) distribution to distinguish adversarial patches and eliminate the adversarial effect and noise. Then we propose ScaleCert, which defends against adversarial patch attacks with a SIN-based sparsification methodology. Note that Section 3.1 focuses on an empirical analysis based on the results of a specific adversarial patch attack methodology; however, ScaleCert's robustness applies to any arbitrary adversarial patch attack. The provable robustness of this defense is demonstrated and analyzed in Section 3.3.

3.1 SIN in Patched and Benign Images

Observing that the adversarial patch determines the prediction results with a very small region of pixels for a relatively broad range of input images, we perform a superficial feature importance distribution analysis of patched images and benign images. Though neuron importance analysis has been widely used for abnormal input detection [6, 16, 29, 30], the metric in our methodology (superficial feature importance) is distinct from previous studies (deep feature importance). The latter focuses on the neurons that contribute significantly to the inference output, while superficial important neurons refer to the neurons that contribute significantly to the shallow feature map values in the first (several) layer(s). Compared to deep important neurons (DIN), SINs provide better discrimination of patched images, a more straightforward correlation with the adversarial input regions, and smaller computing overhead, since no gradient calculation is required. More comparison between SIN and DIN is given in Appendix A.3.

Specifically, we discover that the SINs of patched images exhibit an extremely localized pattern and that the adversarial patch effect can be eliminated if the localized SINs are removed. For benign images, however, the prediction results do not rely on localized SINs and are thus more stable. The intuition is that the prediction of typical DNN inference on benign inputs depends on the deep features of the input image, while adversarial inputs have to strengthen the superficial features; otherwise, their adversarial effect will fade out through the computation of all DNN layers. We quantitatively evaluate the impact of SINs during predictions of benign and patched images as follows.

Figure 1: SIN analysis in benign and patched images. (a) A showcase of SINs in a benign and a patched image. (b) The distance deviation of the SINs. (c) The average important-neuron cluster number distribution. (d) The prediction stability when moving out localized SINs in benign and patched images. (e) Adaptive attack effectiveness when steering around the detection of SINs.

1) Localized SIN Feature Analysis: SINs in patched images are much more localized than those in benign images. We show the Top-200 neurons in the activation map of the first layer for benign images and patched images in Figure 1(a). Then, we compare the standard deviation of the distance between the Top-200 neurons and the central neuron for 10,000 randomly selected ImageNet images, for both benign and patched versions. The distance deviation of patched images is much smaller than that of benign images, as shown in Figure 1(b). Additionally, we cluster the Top-k neurons with the classic MeanShift clustering algorithm and observe that the number of SIN clusters in patched images is much smaller than in benign images. As shown in Figure 1(c), for patched images about 75% of cases have only one cluster and 99% of cases have no more than two clusters, while for benign images more than 75% of cases have more than three clusters. The results indicate that SINs are much more localized in patched images in terms of both distance and cluster number (a code sketch of this measurement is given after analysis 3 below).

2) Prediction Stability Analysis: When the localized SINs are occluded with a small window, the prediction results of patched images are unstable, while those of benign images seldom change. Specifically, we examine how the prediction results of patched and benign images are affected when the localized SIN features are masked out. We compare the prediction accuracy for benign and patched objects before and after moving out the localized SINs (l-SIN), as shown in Figure 1(d). In this experiment, we consider 10,000 randomly selected ImageNet images that can be correctly classified by the model, to eliminate the interference of other factors. The results show that for benign images, the prediction accuracy is not sensitive to the localized SIN features. For the patched images, the prediction accuracy is originally extremely low because of the patch effect; after moving out the localized important features, the accuracy increases drastically, which indicates that the patch effect has been effectively eliminated. The results show that natural images are classified robustly without relying on extremely localized SIN features, while patches rely on extremely localized SIN features to deceive the model and induce incorrect results.

3) Adaptive Attack Against SIN-based Detection: Adversarial patch attacks confront the dilemma of either introducing large SINs or hardly affecting the prediction results. We perform adaptive attacks on the ImageNet dataset and the results are shown in Figure 1(e). To steer clear of SIN-based detection, the adversary trains the adversarial patch by taking the superficial activation value into consideration. Compared to Eq. (1), a penalty loss on the activation value is added with weight $\lambda$: when $\lambda$ is larger, the attack pays more attention to stealthiness than to attack effectiveness. In this way, the adversary aims to build an adversarial patch with good poisoning effect that also escapes the adversarial candidate search. The adversary detection rate refers to the rate at which the detector correctly identifies the adversarial region in adversarial inputs, or reports no adversarial region for benign inputs. The results show that the adversary detection rate indeed decreases to 82.1% when the adversary attempts to reduce the activation magnitude in the first layer ($\lambda$ increases from 0 to 0.01). However, compared to this gentle slope of the adversary detection rate, the attack success rate decreases much more drastically: when $\lambda$ is increased from 0 to 0.01, the adversarial attack success rate drops to 7.9%. These results indicate that the adversary cannot simultaneously maintain a high attack success rate and good stealthiness in SINs. The detailed attack setup and datasets are covered in Appendix A.1.
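To make analysis 1) concrete, the sketch below computes the Top-k SIN positions of a first-layer activation map, a spread measure of their distance to the centroid, and the MeanShift cluster count. It is an approximation under our own assumptions (function names, the exact spread formula, and the default MeanShift bandwidth are ours).

```python
import numpy as np
from sklearn.cluster import MeanShift

def sin_statistics(act, k=200):
    """Summarize how localized the Top-k SINs of a first-layer activation are.

    act: numpy array of shape (C, H, W) -- first-layer activations.
    Returns (spread, n_clusters): RMS distance of the Top-k positions to
    their centroid, and the number of MeanShift clusters they form.
    """
    H, W = act.shape[1:]
    saliency = act.sum(axis=0)                     # channel-summed activation
    flat_idx = np.argsort(saliency.ravel())[-k:]   # indices of the Top-k neurons
    coords = np.stack(np.unravel_index(flat_idx, (H, W)), axis=1).astype(float)

    center = coords.mean(axis=0)                   # centroid of the SIN positions
    dists = np.linalg.norm(coords - center, axis=1)
    spread = np.sqrt((dists ** 2).mean())          # small for localized (patched) SINs

    n_clusters = len(MeanShift().fit(coords).cluster_centers_)
    return spread, n_clusters
```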

These results provide empirical evidence for the hypothesis that adversarial patch attacks rely on strong localized SINs to effectively damage the prediction results of victim models. We leverage this insight to design the ScaleCert certified defense methodology. Different from the motivation of previous studies [24], we do not constrain the receptive field to localize the adversarial feature effect in the deep features, but focus on the elimination of poisoned SIN features. More comparison and discussion of these two design philosophies are provided in Appendix A.3.

3.2 ScaleCert Framework

SIN: Given the feature map of the superficial important layer, we sort the neurons by the sum of their activations across channels and regard the Top-k neurons as the SINs. The SIN set is not static but dynamically computed for each image during its inference. Suppose we use layer $l$ as the superficial layer and $A_c^l$ denotes the feature map of the $c$-th channel of layer $l$. The SIN set is defined as:

$$\mathrm{SIN} = \Big\{ (i, j) \;\Big|\; \textstyle\sum_{c} A_c^l(i, j) \text{ ranks in the Top-}k \Big\}$$

i.e., the $k$ spatial positions with the largest channel-summed activation.

SIN-mask: The SIN-mask is dynamically generated from the SINs for each input. Specifically, the SIN-mask is two-dimensional and has the same spatial size as the feature map of the superficial important layer: given that feature map, the SIN-mask is set to 1 at the positions of the SINs and 0 elsewhere.
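A minimal sketch of the SIN and SIN-mask computation just defined, assuming the superficial feature map is an ordinary PyTorch tensor (the function and argument names are ours).

```python
import torch

def compute_sin_mask(feature_map, winner_rate=0.2):
    """Compute the SIN-mask for one superficial feature map.

    feature_map: (C, H, W) activations of the chosen superficial layer.
    winner_rate: fraction of spatial positions kept as SINs (top-ranking ratio).
    Returns an (H, W) binary mask: 1 at SIN positions, 0 elsewhere.
    """
    C, H, W = feature_map.shape
    saliency = feature_map.sum(dim=0)              # sum activations across channels
    k = max(1, int(winner_rate * H * W))
    topk_idx = saliency.flatten().topk(k).indices  # Top-k spatial positions
    mask = torch.zeros(H * W, device=feature_map.device)
    mask[topk_idx] = 1.0
    return mask.view(H, W)
```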

ScaleCert Overview. ScaleCert consists of three stages. 1) SIN-based Pruning: compute the SIN-mask and prune the remaining neurons. We first select layer $l$ as the superficial important layer, then keep only the Top-k activation neurons and prune the other, unimportant neurons of this layer, where the Top-k fraction is determined by the top-ranking ratio $r$. 2) Occluding Prediction: with the SIN-mask of the superficial layer, we calculate the candidate searching region $R$ on the input image that contributes to the SINs. Then we impose sliding windows on the input and check their overlaps with the candidate searching region $R$. ScaleCert makes an inference for every sliding window that overlaps $R$, masking out that window with dynamic SIN-based pruning independently, and obtains the occluding prediction map. The side of the sliding window equals the sum of the patch side and an additional step $s$ ($s$ is 3 in the evaluation setting); therefore, there always exist windows that completely occlude the adversarial patch. 3) ScaleCert Detection: finally, check the labels of the prediction map to detect whether there is an adversarial patch. If all the labels in the masked predictions are consistent, there is no attack and the original model prediction is returned. ScaleCert leverages SIN-based sparsity not only to reduce the number of occluding windows but, more importantly, to filter out the noisy labels in trivial prediction maps, achieving state-of-the-art certified accuracy.

Figure 2: ScaleCert framework with three stages: SIN-based Pruning, Occluding Prediction, and ScaleCert Detection.

Neural Network Sparsification. The key success factor of ScaleCert is leveraging the pruning technique on the superficial layer to improve certified accuracy and efficiency. In order to maintain the clean accuracy of ScaleCert, we need to finetune the model with the sparse superficial layers. Figure 2 illustrates the proposed winner-take-all dropout. We add the SIN-mask after the superficial layer(s) and prune the non-important neurons according to the activation magnitude. Only the top-ranking neurons (winners) forward their outputs to the following layers. To finetune the sparsified neural network, we only update the top-ranking neurons selected by the SIN-mask. This finetuning procedure can be treated as a form of shallow-layer pruning; the SIN-mask is also updated in each iteration.
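The winner-take-all sparsification can be realized as a small module placed after the superficial layer. Since losing positions are multiplied by zero, gradients during finetuning only flow through the retained winners, which matches the finetuning rule above. This is a sketch under our own naming, not the authors' code.

```python
import torch
import torch.nn as nn

class WinnerTakeAllDropout(nn.Module):
    """Keep only the top-ranking spatial positions (by channel-summed
    activation) of a superficial feature map and zero out the rest."""

    def __init__(self, winner_rate=0.2):
        super().__init__()
        self.winner_rate = winner_rate

    def forward(self, x):                          # x: (N, C, H, W)
        N, C, H, W = x.shape
        saliency = x.sum(dim=1).view(N, -1)        # (N, H*W) channel-summed activation
        k = max(1, int(self.winner_rate * H * W))
        topk_idx = saliency.topk(k, dim=1).indices
        mask = torch.zeros_like(saliency)
        mask.scatter_(1, topk_idx, 1.0)            # 1 for winners, 0 for pruned neurons
        return x * mask.view(N, 1, H, W)
```

In practice such a module would be inserted right after the chosen superficial layer of the backbone before finetuning.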

Occluding Prediction. With the SIN-mask, we first calculate the input region that contributes to the SINs. The calculation procedure is as follows: for a neuron with coordinate $(u, v)$ in the superficial layer $l$, its antecedent neurons in layer $l-1$ that contribute to the value of this neuron lie in the square region with top-left coordinate $(u \cdot s_l - p_l,\; v \cdot s_l - p_l)$ and bottom-right coordinate $(u \cdot s_l - p_l + k_l - 1,\; v \cdot s_l - p_l + k_l - 1)$, where $p_l$, $s_l$ and $k_l$ are the padding size, stride size and kernel size of layer $l$. We then iteratively back-calculate this region through the inference pyramid until we reach the input layer, obtaining region $R$. $R$ is the candidate region for occluding prediction; all pixels falling outside this region cannot affect the prediction results because of the SIN-mask.
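The per-axis back-calculation can be written as a short routine that walks the (kernel, stride, padding) parameters from the superficial layer back to the input; the routine and its clamping convention are our own sketch.

```python
def backproject_range(lo, hi, layer_params, input_size=224):
    """Map an inclusive coordinate range [lo, hi] in the superficial layer
    back to the inclusive input-pixel range that can influence it (one axis).

    layer_params: [(kernel, stride, padding), ...] ordered from the first
    layer up to (and including) the superficial layer.
    """
    for kernel, stride, padding in reversed(layer_params):
        lo = lo * stride - padding                 # left edge of the antecedent region
        hi = hi * stride - padding + kernel - 1    # right edge of the antecedent region
    return max(lo, 0), min(hi, input_size - 1)

# Example with ResNet50's first convolution (7x7 kernel, stride 2, padding 3):
# a neuron at index 10 maps back to input pixels [10*2-3, 10*2-3+7-1] = [17, 23].
print(backproject_range(10, 10, [(7, 2, 3)]))      # (17, 23)
```

Applying the routine to the rows and columns of the SIN positions yields the candidate searching region R on the input.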

We then slide the occluding window across the candidate region $R$ and obtain the occluding prediction map by masking out the pixels in each sliding window. Note that we also mask the superficial features of the occluded region when masking the input pixels, to make sure the occluded region lies outside the SIN area and therefore does not affect the prediction. The number of candidate windows is reduced because of the SIN-mask. To further reduce the overhead of the inference procedure, we additionally propose to merge adjacent occluding windows when the overlapping area between two windows exceeds a threshold ($\theta$). Note that the merging operation is guaranteed to cover every occluding window that overlaps with $R$, which preserves the certified robustness.

Figure 3: Merge sliding occluding windows. (a) Original windows. (b) Merged windows.
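One possible greedy merging rule consistent with the description above (the paper's exact policy may differ): two windows are merged into their bounding box whenever their overlap ratio exceeds θ, so every original window remains covered by some merged window.

```python
def merge_windows(windows, theta=0.3):
    """Greedily merge occluding windows; each window is (top, left, bottom, right), inclusive."""

    def overlap_ratio(a, b):
        top, left = max(a[0], b[0]), max(a[1], b[1])
        bottom, right = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, bottom - top + 1) * max(0, right - left + 1)
        area_a = (a[2] - a[0] + 1) * (a[3] - a[1] + 1)
        area_b = (b[2] - b[0] + 1) * (b[3] - b[1] + 1)
        return inter / min(area_a, area_b)

    def union(a, b):  # bounding box of the two windows, so coverage is preserved
        return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

    merged = []
    for w in windows:
        for i, m in enumerate(merged):
            if overlap_ratio(w, m) > theta:
                merged[i] = union(w, m)
                break
        else:
            merged.append(w)
    return merged
```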

Detection Methodology. The detailed ScaleCert detection algorithm is given in Algorithm 1. ScaleCert only slides occluding windows over the candidate region derived from the SIN area, since other neurons will be masked out during inference. Interestingly, the SIN-mask not only saves time but also filters out most of the noisy areas of the original image, which makes the occluding predictions more consistent in the SIN area and helps improve the clean accuracy on benign images. For patch-attacked cases, there must be a region whose occlusion changes the predicted label, so the algorithm alerts the user (line 12) and provides the potential patch location, which can be used to empirically recover the true label by occluding that area. In some rare cases, e.g., when there is noise in the input even within the SIN area, we cannot make a definite decision and conservatively raise an alarm (line 14).

1: Inputs: image $x$, model $F$, sliding windows $\{w_i\}$ on the input, and candidate searching region $R$
2: Output: Prediction or alert
3: $y_0 \leftarrow F(x)$ ▷ Original prediction with SIN-based pruning
4: $V \leftarrow \emptyset$
5: for each $w_i$ do
6:      if $w_i$ overlaps $R$ then ▷ Filter out some potential sliding windows
7:           $y_i \leftarrow F(x \odot (1 - w_i))$ ▷ Obtain the label according to the masked-out input
8:           if $y_i \neq y_0$ then
9:                $V \leftarrow V \cup \{(w_i, y_i)\}$ ▷ Collect potential malicious windows and prediction labels
10: if $V \neq \emptyset$ then
11:      if $V$ contains $(w_a, y_a)$ and $(w_b, y_b)$ with $y_a = y_b$ for some $a \neq b$ then
12:           return alert ▷ Adversarial patch detected at the area of $w_a$ and $w_b$
13:      else if the majority vote over the occluding predictions differs from $y_0$ then
14:           return alert ▷ Possible adversarial patch with noises in $R$
15:      else
16:           return $y_0$ ▷ Benign image with noises in $R$
17: else
18:      return $y_0$ ▷ Benign image
Algorithm 1 Our adversarial patch detection algorithm
1: Inputs: image $x$, label $y$, model $F$, sliding windows $\{w_i\}$ on the input, and candidate searching region $R$
2: Output: Whether the image has provable robustness to label $y$
3: for each $w_i$ do
4:      if $w_i$ overlaps $R$ then
5:           $y_i \leftarrow F(x \odot (1 - w_i))$
6:           if $y_i \neq y$ then
7:                return False
8: return True
Algorithm 2 Our certify algorithm
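For concreteness, a simplified Python rendering of Algorithms 1 and 2 is given below. It collapses lines 10-16 of Algorithm 1 into a single disagreement check and only masks the input side; the SIN-based pruning is assumed to happen inside the model (e.g., via the winner-take-all module sketched earlier). Names are ours.

```python
def occluded_prediction(model, x, window):
    """Predict after zeroing out one occluding window of the input."""
    top, left, bottom, right = window
    x_masked = x.clone()
    x_masked[..., top:bottom + 1, left:right + 1] = 0.0
    return model(x_masked.unsqueeze(0)).argmax(dim=1).item()

def detect(model, x, windows, overlaps_region):
    """Sketch of Algorithm 1: alert if occluding any candidate window changes
    the prediction; overlaps_region(w) says whether w overlaps the region R."""
    y0 = model(x.unsqueeze(0)).argmax(dim=1).item()
    suspicious = []
    for w in windows:
        if overlaps_region(w):
            y = occluded_prediction(model, x, w)
            if y != y0:
                suspicious.append((w, y))
    if suspicious:
        return "alert", suspicious      # possible adversarial patch in these windows
    return y0, []

def certify(model, x, y_true, windows, overlaps_region):
    """Sketch of Algorithm 2: certified only if no single occluding window
    changes the (pruned) prediction away from the true label."""
    for w in windows:
        if overlaps_region(w) and occluded_prediction(model, x, w) != y_true:
            return False
    return True
```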

3.3 Provable Robust Analysis

ScaleCert Certification. In this paper, we assume the attacker has full knowledge of our defense method and is allowed to arbitrarily change pixels in a square region based on any needed information in the threat model. Our detection defense ScaleCert aims to provably ensure that a certified image is either correctly classified or yields an alarm together with the potential patch location, while maintaining high accuracy on benign images.

Theorem 1

If Algorithm 2 returns True for a given image $x$, our defense in Algorithm 1 either makes a correct prediction or raises an alert on any adversarial image $x' \in \mathcal{A}(x)$.

Figure 4: The certified benign image with only the Top-k region retained (a), and occluded images in which the benign image is masked with occluding windows (b, c), are all classified as the true label (Algorithm 2). Attacked images whose patch locates in the Top-k region (d, e, f) or outside the Top-k region (g, h, i) will trigger an alert (f, i), which ensures the correctness of the certification.

Proof: Algorithm 2 returning True indicates that masking out any single sliding window that overlaps the candidate region does not change the prediction result for image $x$, as in Figure 4(b), while the effect of all other windows is eliminated by the Top-k pruning, as in Figure 4(c). Now apply an adversarial patch attack on $x$, i.e., feed any $x' \in \mathcal{A}(x)$ to Algorithm 1; there are two possible cases:

  • Case 1: The adversarial patch overlaps with the SIN area of the certified benign image, as in Figure 4(d), and leads to a malicious label. For occluding windows that do not overlap with the patch, the outputs are unchanged, as in Figure 4(e). However, since the side of the sliding window exceeds the patch side by the additional step $s$, there must exist sliding windows that completely mask out the adversarial patch, as in Figure 4(f). The newly pruned SIN area is then the same as that of the benign image with the same occluding window, as in Figure 4(b), so the masked prediction is exactly the true label. Therefore, in this case Algorithm 1 returns an alert (line 12).

  • Case 2: The adversarial patch locates outside the SIN area, as in Figure 4(g), and produces a malicious label. Occluded images whose windows do not cover the patch still keep the malicious label. However, there also exist windows that mask out the patch completely and recover the correct label, as in Figure 4(i), which triggers the alarm (Algorithm 1, line 12).

Note that lines 13-16 in Algorithm 1 are not necessary for the certified defense, since the patch is provably caught in line 11 when the input returns True in Algorithm 2. They are designed for benign images with noise in the SIN area, or for cases where both a patch and noise appear in the SIN area. In summary, Algorithm 1 is a general empirical adversarial patch detection method, but when combined with Algorithm 2 it conclusively outputs normal predictions on clean certified inputs and alerts when an attack takes place.

4 Experiments

4.1 Experiment Setup

Evaluation Metrics. Certified accuracy: note that Algorithm 1 returns a provable answer if the input image is benign or can be recovered by our algorithm, but it may fail on rare examples that violate our assumptions. The certified accuracy we define here is therefore a lower bound of the real certified accuracy.

Datasets and Models. We evaluate the clean and certified robustness results on a high-resolution dataset with 224×224 images (1000-class ImageNet [13]) and a low-resolution dataset with 32×32 images (10-class CIFAR-10 [12]). In our evaluation, we use ResNet50 [9] for ImageNet and ResNet18 [9] for CIFAR-10. We start from pretrained weights and finetune the models with SIN-based pruning and window occlusion at random locations to improve inference performance.

Baselines. We compare the defense performance with the state-of-the-art studies: PatchGuard++ [25], PG-Mask-BN [24], PG-Mask-DS [24], Clipped BagNet (CBN) [1], De-randomized Smoothing (DS) [15], and the Interval Bound Propagation based certified defense (IBP) [3]. We use the optimal parameter settings reported in their papers.

Attack Patch Size. We adopt configurations similar to previous studies, defending against a single square adversarial patch that occupies up to 1%, 2%, or 3% of the pixels of the ImageNet images, and 0.4% or 2.4% of the pixels of the CIFAR-10 images. Additionally, we report results for extremely large patches of 5% and 8% in Appendix A.4.

4.2 Provable Robustness Results

The certified accuracy results of ScaleCert and other recent studies are presented in Table 2. For ScaleCert, 10 trials with different random seeds were run to obtain average results. The small variance (0.89 for clean accuracy and 0.27 for certified accuracy on ImageNet with a 1%-pixel patch) indicates that ScaleCert gives stable results. For PatchGuard++, MRD, and ScaleCert, the certified accuracy refers to certified detection accuracy; for the others, it refers to certified recovery accuracy. These results hold for any attack within the patch size constraint. We draw the following conclusions from the results.

Existing studies achieve good certified accuracy on low-resolution images, but very poor provable robustness on high-resolution images. For example, when the patch covers 2.4% of the CIFAR-10 images, PatchGuard++, PG-Mask-DS, and PG-Mask-BN achieve 74.1%, 57.7%, and 47.3% certified accuracy, respectively. On the ImageNet dataset with a 2%-pixel patch, their certified accuracy decreases to 33.9%, 19.0%, and 26.0%. There are two potential reasons behind this phenomenon: 1) Deeper models are used for the classification of high-resolution images. With deeper layers, the receptive field becomes larger even when using BagNet with a limited kernel size, so it is hard to distinguish and eliminate the adversarial effect at the deep feature extraction layers. 2) For high-resolution images, the random ablation in DS-based algorithms is unable to produce satisfactory accuracy due to the loss of most of the information of the input images.

MRD and IBP are computationally infeasible for high-resolution images. MRD needs to compute the prediction for every occluding window, which introduces infeasible computational overhead from the sheer number of inference rounds. For example, MRD requires about 36,481 inference rounds when the patch covers 2% of the pixels, which costs about 49 seconds on a V100 GPU to verify a single 224×224 input image even after parallelization optimization. As a comparison, our study reduces the number of candidate windows to fewer than 100, which significantly reduces the computation overhead.
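As a rough check of the inference-round count quoted above: if MRD slides an occlusion window of side 34 with stride 1 over a 224×224 image (our assumption for a 2%-pixel patch plus margin), the number of masked inferences is:

```python
image_side, window_side = 224, 34
n_windows = (image_side - window_side + 1) ** 2   # 191 * 191
print(n_windows)                                  # 36481 inference rounds per image
```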

ScaleCert outperforms the other defenses with much higher certified accuracy on ImageNet, which pushes certified defenses toward practical usage. ScaleCert achieves good certified accuracy for both low-resolution and high-resolution images. For high-resolution images in particular, the experimental results show that ScaleCert achieves the best certified accuracy: when the patch contains 1% of the pixels of the input images, ScaleCert achieves 60.4%, largely surpassing PatchGuard++ with 36.3% certified accuracy. CIFAR-10 images are very small, with a size of only 32×32, and are thus much more sensitive to pruning and masking operations; therefore, the certified accuracy of ScaleCert drops slightly compared to MRD.

Dataset                          ImageNet                                              CIFAR-10
Patch size               1% pixels        2% pixels        3% pixels        0.4% pixels      2.4% pixels
Accuracy                 clean   robust   clean   robust   clean   robust   clean   robust   clean   robust

Certified Recovery
  IBP                    computationally infeasible                         65.8    51.9     47.8    30.3
  CBN                    53.5    21.9     53.5    13.7     53.5    7.4      83.2    51.0     83.2    16.2
  DS                     44.4    20.5     44.4    17.0     44.4    14.9     83.9    68.9     83.9    56.3
  PG-Mask-DS             44.5    22.5     44.1    19.0     43.7    16.6     84.7    69.2     84.6    57.7
  PG-Mask-BN             55.0    32.3     54.6    26.0     54.0    19.8     84.5    63.8     83.9    47.3

Certified Detection
  MR                     computationally infeasible                         87.6    82.5     84.2    78.1
  PatchGuard++           61.8    36.3     61.6    33.9     61.5    31.1     82.0    78.8     78.2    74.1
  ScaleCert              62.8    60.4     58.5    55.4     56.4    52.8     83.1    81.0     78.9    75.3

Table 2: Comparison of clean and certified accuracy under different defenses

4.3 Detailed Analysis of ScaleCert

The insight of ScaleCert is to leverage the intrinsic differences of SINs in patched and benign images to achieve an effective defense. Two key parameters affect the clean and certified accuracy: the pruning rate in the shallow layer and the occluding window size used to target the adversarial patch (more discussion in Appendix A.2). In the ScaleCert algorithm, several hyperparameters determine these two parameters: the winner rate of the SIN-mask ($r$), the overlap ratio for merging the searching windows ($\theta$), and the superficial layer selection ($l$). We analyze the impact of $r$, $\theta$, and $l$ ($l$ in Appendix A.4) on the certified defense effectiveness and efficiency.

Pruning Rate of SIN Mask: We test the vanilla model accuracy, clean accuracy, and certified accuracy with different winner rates ranging from 10% to 20%, as shown in Table 3. The winner rate is the fraction of top-ranking neurons kept in the superficial layer; the other neurons are pruned. The results show that: 1) the vanilla model accuracy is maintained when the top-ranking rate is 10% or larger; 2) the selection of the winner rate is a tradeoff between the activation sparsification level and inference accuracy. On one hand, a smaller $r$ reduces the overhead of the certified defense more, by reducing the number of candidate sliding windows and filtering more noisy prediction labels. On the other hand, it is more challenging to maintain the accuracy of the neural network with a heavily pruned activation map.

Top-Ranking Rate            10%                     15%                     20%
Patch size           Model  Clean  Cert.     Model  Clean  Cert.     Model  Clean  Cert.
1%                   73.6   61.4   56.0      74.2   60.5   58.4      75.0   62.8   60.4
2%                   73.8   58.7   51.8      74.6   58.7   54.9      74.4   55.9   53.3
3%                   73.2   54.0   48.6      74.3   57.3   49.6      74.2   56.5   51.6

Table 3: Effect of Top-Ranking rate

Occluding Window Sizes. We also test the impact of the overlap ratio threshold ($\theta$) used when merging the occluding windows to reduce the number of searching windows for better computing efficiency. When $\theta$ is larger, fewer windows are merged, resulting in a larger number of candidate searching windows and smaller occluding window sizes. Larger occluding windows may hurt the model prediction accuracy, while fewer candidate searching windows accelerate the detection procedure and remove noise from the prediction map. Hence, the selection of the overlap ratio is a tradeoff between the occluding window number and the window size. As shown in Table 4, ScaleCert achieves both good certified accuracy and the smallest computing overhead when the overlap ratio threshold is 0.3. We achieve over 100× speedup compared to MRD [18] and comparable computing overhead to PG-Mask-DS [24].

Overlap Ratio (θ)                0.3       0.5       0.6
Candidates                       25        36        64
Execution Latency (V100 GPU)     219 ms    306 ms    508 ms
Occluding Window Size            77        64        58
Clean Accuracy                   62.5      59.8      61.1
Certified Accuracy               56.6      55.2      56.0

Table 4: Effect of the overlap ratio threshold (PatchSize = 2%, $r$ = 10%)

4.4 Empirical Recovery of ScaleCert

We evaluate the empirical recovery ability of our method on 10,000 randomly selected ImageNet images that are correctly classified by the ResNet50 model. First, for each image, a square patch with 5% of the pixels is generated and placed at a random location; the classification accuracy drops to 14.0% after the patch attack. Then, we localize the patch of each image using Algorithm 1, mask the localized patch area out of the image, and run inference on the masked image to obtain the recovered label. The classification accuracy is significantly improved to 85.8%, which empirically shows that ScaleCert can effectively recover the true label from patch attacks by occluding the patch regions.
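A sketch of the recovery step, reusing the conventions of the detection sketch above (names are ours): the window reported by Algorithm 1 is zeroed out and the masked image is re-classified.

```python
def empirical_recover(model, x, detected_window):
    """Re-classify after occluding the patch area localized by the detector."""
    top, left, bottom, right = detected_window
    x_rec = x.clone()
    x_rec[..., top:bottom + 1, left:right + 1] = 0.0   # tear the suspected patch off
    return model(x_rec.unsqueeze(0)).argmax(dim=1).item()
```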

5 Conclusion and Future Work

In this paper, we propose ScaleCert, a certified defense methodology for detecting adversarial patches in high-resolution images. We observe that adversarial patches intend to leverage localized superficial important neurons (SINs) to manipulate the prediction results. Hence, we propose SIN-based sparsification techniques to reduce the adversarial region searching overhead and filter the prediction noise, which significantly improves the certified accuracy with good computing efficiency.

Acknowledgements

This work is partially supported by the Beijing Natural Science Foundation (JQ18013), the NSF of China (under Grants 61925208, 62002338, U19B2019), the Beijing Academy of Artificial Intelligence (BAAI), the Beijing Nova Program of Science and Technology (Z191100001119093), the CAS Project for Young Scientists in Basic Research (YSBR-029), the Youth Innovation Promotion Association CAS, and the Xplore Prize.

Broader Impact

With the wide usage of deep learning systems in the real world, our method can practically improve the robustness of DNN applications against adversarial patch attacks. Overall, we believe our paper has a positive impact on society, but it may potentially be misused to identify weaknesses of DNNs.

References

  • [1] W. Brendel and M. Bethge (2019) Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. In 7th International Conference on Learning Representations (ICLR).
  • [2] N. Carlini and D. Wagner (2017) Adversarial examples are not easily detected: bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 3–14.
  • [3] P. Chiang, R. Ni, A. Abdelkader, C. Zhu, C. Studor, and T. Goldstein (2020) Certified defenses for adversarial patches. In 8th International Conference on Learning Representations (ICLR).
  • [4] J. M. Cohen, E. Rosenfeld, and J. Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. In Proceedings of the 36th International Conference on Machine Learning (ICML), Vol. 97, pp. 1310–1320.
  • [5] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song (2018) Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1625–1634.
  • [6] Y. Gan, Y. Qiu, J. Leng, M. Guo, and Y. Zhu (2020) Ptolemy: architecture support for robust deep learning. In 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 241–255.
  • [7] S. Gowal, K. Dvijotham, R. Stanforth, R. Bunel, C. Qin, J. Uesato, R. Arandjelovic, T. A. Mann, and P. Kohli (2019) Scalable verified training for provably robust image classification. pp. 4841–4850.
  • [8] J. Hayes (2018) On visible adversarial perturbations & digital watermarking. In 2018 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPR Workshops), pp. 1597–1604.
  • [9] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
  • [10] Y. He, X. Zhang, and J. Sun (2017) Channel pruning for accelerating very deep neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397.
  • [11] S. M. Jayakumar, R. Pascanu, J. W. Rae, S. Osindero, and E. Elsen (2020) Top-KAST: top-k always sparse training. In NeurIPS.
  • [12] A. Krizhevsky, G. Hinton, et al. (2009) Learning multiple layers of features from tiny images. Technical Report TR-2009.
  • [13] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pp. 1097–1105.
  • [14] M. Lee and Z. Kolter (2019) On physical adversarial patches for object detection. arXiv preprint arXiv:1906.11897.
  • [15] A. Levine and S. Feizi (2020) (De)randomized smoothing for certifiable defense against patch attacks. In Conference on Neural Information Processing Systems (NeurIPS).
  • [16] F. Li, X. Liu, X. Zhang, Q. Li, K. Sun, and K. Li (2021) Detecting localized adversarial examples: a generic approach using critical region analysis. arXiv preprint arXiv:2102.05241.
  • [17] M. McCoyd, W. Park, S. Chen, N. Shah, R. Roggenkemper, M. Hwang, J. X. Liu, and D. A. Wagner (2020a) Minority reports defense: defending against adversarial patches. In Applied Cryptography and Network Security Workshops (ACNS Workshops), Vol. 12418, pp. 564–582.
  • [18] M. McCoyd, W. Park, S. Chen, N. Shah, R. Roggenkemper, M. Hwang, J. X. Liu, and D. Wagner (2020b) Minority reports defense: defending against adversarial patches. In International Conference on Applied Cryptography and Network Security, pp. 564–582.
  • [19] M. Naseer, S. Khan, and F. Porikli (2019) Local gradients smoothing: defense against localized adversarial attacks. In IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1300–1307.
  • [20] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks.
  • [21] S. Wang, H. Zhang, K. Xu, X. Lin, S. Jana, C. Hsieh, and J. Z. Kolter (2021) Beta-CROWN: efficient bound propagation with per-neuron split constraints for neural network robustness verification. In ICML 2021 Workshop on Adversarial Machine Learning.
  • [22] E. Wong and J. Z. Kolter (2018) Provable defenses against adversarial examples via the convex outer adversarial polytope. In ICML.
  • [23] Z. Wu, S. Lim, L. S. Davis, and T. Goldstein (2020) Making an invisibility cloak: real world adversarial attacks on object detectors. In European Conference on Computer Vision, pp. 1–17.
  • [24] C. Xiang, A. N. Bhagoji, V. Sehwag, and P. Mittal (2021) PatchGuard: a provably robust defense against adversarial patches via small receptive fields and masking.
  • [25] C. Xiang and P. Mittal (2021) PatchGuard++: efficient provable attack detection against adversarial patches. arXiv preprint arXiv:2104.12609.
  • [26] K. Xu, S. Liu, P. Zhao, P. Chen, H. Zhang, Q. Fan, D. Erdogmus, Y. Wang, and X. Lin (2019a) Structured adversarial attack: towards general implementation and better interpretability. In International Conference on Learning Representations.
  • [27] K. Xu, Z. Shi, H. Zhang, Y. Wang, K. Chang, M. Huang, B. Kailkhura, X. Lin, and C. Hsieh (2020a) Automatic perturbation analysis for scalable certified robustness and beyond. In Advances in Neural Information Processing Systems (NeurIPS).
  • [28] K. Xu, G. Zhang, S. Liu, Q. Fan, M. Sun, H. Chen, P. Chen, Y. Wang, and X. Lin (2020b) Adversarial t-shirt! Evading person detectors in a physical world. In ECCV, pp. 665–681.
  • [29] Z. Xu, F. Yu, and X. Chen (2019b) DoPa: a comprehensive CNN detection methodology against physical adversarial attacks. arXiv preprint arXiv:1905.08790.
  • [30] Z. Xu, F. Yu, and X. Chen (2020c) LanCe: a comprehensive and lightweight CNN defense methodology against physical adversarial attacks on embedded multimedia applications. In 2020 25th Asia and South Pacific Design Automation Conference (ASP-DAC), pp. 470–475.
  • [31] R. Yu, A. Li, C. Chen, J. Lai, V. I. Morariu, X. Han, M. Gao, C. Lin, and L. S. Davis (2018) NISP: pruning networks using neuron importance score propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9194–9203.
  • [32] H. Zhang, H. Chen, C. Xiao, B. Li, D. Boning, and C. Hsieh (2020a) Towards stable and efficient training of verifiably robust neural networks. In International Conference on Learning Representations.
  • [33] Z. Zhang, B. Yuan, M. McCoyd, and D. Wagner (2020b) Clipped BagNet: defending against sticker attacks with clipped bag-of-features. In 3rd Deep Learning and Security Workshop (DLS).
  • [34] B. Zhou, A. Khosla, À. Lapedriza, A. Oliva, and A. Torralba (2016) Learning deep features for discriminative localization. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921–2929.

Appendix A

A.1 Experimental Implementation Details

In Section 3.1, we empirically analyze the differences between the SINs of patched and benign images. The benign image set consists of 10,000 images randomly selected from the ImageNet validation set. The patched images are generated by applying the adversarial patch at a random location of each image in the benign set. The adversarial patch $\delta$ is generated by maximizing the expected probability that ResNet50 outputs the targeted malicious label $y'$ (label 859, toaster, in the evaluation) over the adversarial inputs derived from these 10,000 images:

$$\delta = \operatorname*{arg\,max}_{\delta}\; \mathbb{E}_{x}\Big[\log \Pr\big(F\big((1-P)\odot x + P\odot \delta\big) = y'\big)\Big] \qquad (3)$$
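A minimal PyTorch sketch of the patch optimization in Eq. (3). The patch location is fixed here for brevity (the evaluation places it randomly), and the hyperparameters (patch side, steps, learning rate) are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def train_patch(model, loader, target_label, patch_side=45, steps=100, lr=0.05,
                top_left=(90, 90)):
    """Optimize a targeted adversarial patch by maximizing the probability of
    target_label on patched inputs; model parameters are kept frozen."""
    delta = torch.rand(3, patch_side, patch_side, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    r, c = top_left
    model.eval()
    for _ in range(steps):
        for x, _ in loader:                        # x: (N, 3, 224, 224), values in [0, 1]
            x_adv = x.clone()
            x_adv[:, :, r:r + patch_side, c:c + patch_side] = delta.clamp(0, 1)
            target = torch.full((x.size(0),), target_label, dtype=torch.long)
            loss = F.cross_entropy(model(x_adv), target)   # minimize CE = maximize Pr(y')
            opt.zero_grad()
            loss.backward()
            opt.step()
    return delta.detach().clamp(0, 1)
```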

SIN distribution analysis. We analyze the distribution of SINs in terms of both the standard deviation of distance and the cluster number. We first cluster the Top-200 neurons in the first layer with the MeanShift clustering algorithm and obtain the cluster number and the coordinates of the cluster central nodes for the benign and patched images. Then we calculate the standard deviation distance as follows: as formulated in Eq. (4), for each cluster $C$, we obtain the coordinate of the central point $(\bar{i}, \bar{j})$ by averaging the coordinates of the SINs in cluster $C$; then we calculate the standard deviation distance $d_{std}$ as in Eq. (5). Figure 1(b) illustrates the resulting distributions for the benign and patched image sets.

$$(\bar{i}, \bar{j}) = \frac{1}{|C|}\sum_{(i,j)\in C} (i, j) \qquad (4)$$

$$d_{std} = \sqrt{\frac{1}{|C|}\sum_{(i,j)\in C} \big\| (i,j) - (\bar{i},\bar{j}) \big\|_2^2} \qquad (5)$$

Prediction stability analysis. In the prediction stability analysis (Figure 1(d)), we only consider the benign images (out of the 10000 selected images) that can be predicted correctly to avoid the interference of other factors.

Adaptive attack methodology. In Figure 1(e), we evaluate the adversarial patch detection rate of the SIN-based detection methodology under an adaptive attack. In this evaluation, the detection methodology slides windows to find a patch candidate region in which the SIN ratio exceeds a threshold of 90%. Under the adaptive attack, the adversary knows the full details of the defensive strategy. To steer clear of the localized-SIN-based detection, the adversary trains the adversarial patch with the following optimization function, which takes the superficial activation value into consideration. Compared to Eq. (3), a penalty loss on the activation value is added. In this way, the adversary aims to build an adversarial patch with good poisoning effect that also tries to escape the adversarial candidate detection by reducing the superficial activation value in the patch region. Here $\lambda$ is the parameter controlling the scale of the penalty on the superficial activation value, and $W_1$ is the weight of the first layer. The patch $\delta$ is updated during the optimization of the loss function.

$$\delta = \operatorname*{arg\,max}_{\delta}\; \mathbb{E}_{x}\Big[\log \Pr\big(F(x') = y'\big) - \lambda\,\big\| W_1 * x' \big\|_1\Big], \qquad x' = (1-P)\odot x + P\odot \delta \qquad (6)$$
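A sketch of the adaptive objective in Eq. (6): the cross-entropy attack term is combined with a penalty on the first-layer activation magnitude weighted by λ. Whether the penalty is restricted to the patch region is our simplification; the paper's exact penalty may differ.

```python
import torch
import torch.nn.functional as F

def adaptive_patch_loss(model, first_conv, x_adv, target_label, lam=0.01):
    """Attack loss plus a stealth penalty that suppresses first-layer (W1) activations."""
    target = torch.full((x_adv.size(0),), target_label, dtype=torch.long)
    attack_loss = F.cross_entropy(model(x_adv), target)   # drives F(x') toward y'
    act = first_conv(x_adv)                                # first-layer activation, W1 * x'
    stealth_penalty = act.abs().mean()                     # discourages large SINs
    return attack_loss + lam * stealth_penalty             # minimized w.r.t. the patch
```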

A.2 Key Insight of ScaleCert

The insights of ScaleCert are shown in Figure 5. Figure 5(a) illustrates the superficial neuron importance of the first layer for one patched and one benign image. For both patched and benign images, the neurons falling outside the Top-k rate are not essential for the prediction results and are pruned for computing efficiency and less prediction noise. SIN-based pruning is compatible with well-studied activation-based neural network compression techniques, which are proved to introduce negligible accuracy loss to DNN models. The distribution of the retained SINs is distinct for patched and benign images (as shown in Figure 5(b) and Figure 5(c)). The basic idea of ScaleCert is based on the prediction stability after removing localized SINs in benign images.

Two factors affect the certified accuracy of ScaleCert: 1) the Top-Ranking rate, which trades off computing efficiency against model accuracy; and 2) the occluding window size, whose lower bound is the patch size and whose upper bound is determined by the tradeoff between computing efficiency and certified accuracy. Therefore, we evaluate the clean and certified accuracy under different parameters in Section 4.3.

Figure 5: Key idea of ScaleCert. The sorted superficial importance of the patched image is much larger than that of the benign image on the retained Top-k part, while similar on the pruned rest (a). The benign image has dispersed SINs and a stable prediction when those SINs are moved out (b). In contrast, the patched image has localized SINs and is sensitive to the occlusion (c).

A.3 DIN vs. SIN in Adversarial Patch Defenses

DIN vs. SIN. Deep neuron importance has been widely used for abnormal adversarial example detection in previous studies [6, 16, 29, 30]. Specifically, the class activation map (CAM) is a commonly used technique to indicate the discriminative image regions that identify a particular class [34]. It leverages the weights of deep features (activations in the last convolutional layer) to indicate the importance of deep features for specific classes. We argue that deep feature importance is not a good candidate for adversarial patch detection because it cannot leverage the unique perturbation localization restriction of adversarial patches: 1) both benign images and adversarial images exhibit localized discriminative regions, so the discrimination between deep features of benign and adversarial images is not as intuitive as that of superficial features (as shown in Figure 5(a, b, c)), and it is challenging to eliminate the adversarial effect in deep features since adversarial and benign features are accumulated together; 2) calculating such deep feature importance is time-consuming, as it demands gradient information and a complete backward propagation pass.

We propose superficial feature importance as the metric for discrimination analysis, based on the intuition that, in order to efficiently manipulate the output prediction with a very small region of the input data, the adversarial patch must incur large activations from the first layer onward, rather than relying on the accumulation of deep feature extraction.

PatchGuard vs. ScaleCert. To defend against adversarial patch attacks, PatchGuard leverages deep feature importance analysis and clips the potentially malicious deep features (those with large values) using robust aggregation. To isolate the adversarial features from benign features, PatchGuard restricts the adversarial effect using small-receptive-field techniques. However, a smaller receptive field may introduce large accuracy degradation (BagNet loses about 20% accuracy compared to ResNet [24]). Additionally, for deeper neural networks, small kernel sizes still result in a large receptive field, raising the difficulty of isolating and distinguishing malicious features from benign features. ScaleCert, on the other hand, builds its defense on SIN-based neural network sparsity, which exploits the localization restriction of patch attacks and introduces negligible loss to model accuracy [11].

A.4 Sensitivity Study of ScaleCert

Large patch defenses. We also test the certified accuracy for cases with very large adversarial patches, as shown in Table 5 and Table 6. As the patch size increases, the certified accuracy of all defenses decreases. However, on the ImageNet dataset, ScaleCert still retains a certified accuracy above 40% even in the extreme case where the patch contains 8% of the input image pixels.

Dataset                          ImageNet                           CIFAR-10
Patch size               5% pixels        8% pixels        5% pixels        8% pixels
Accuracy                 clean   robust   clean   robust   clean   robust   clean   robust

Certified Recovery
  IBP                    computationally infeasible        24.8    17.8     19.0    13.8
  CBN                    53.5    1.4      53.5    0.5      83.2    0.8      83.2    0.1
  DS                     44.4    11.7     44.4    8.8      83.9    46.3     83.9    34.8
  PG-Mask-DS             43.1    13.2     42.2    9.5      84.7    47.6     84.3    35.5
  PG-Mask-BN             52.8    9.2      52.0    5.4      83.4    26.8     82.6    16.9

Certified Detection
  MR                     computationally infeasible        82.1    75.4     79.9    72.0
  PatchGuard++           61.4    25.3     61.5    22.2     73.5    67.6     70.8    64.2
  ScaleCert              54.0    48.7     50.6    42.7     76.3    71.2     73.5    67.6

Table 5: Certified accuracy comparison with large patch sizes

Top-Ranking Rate            10%                     15%                     20%
Patch size           Model  Clean  Cert.     Model  Clean  Cert.     Model  Clean  Cert.
5%                   72.9   53.0   44.4      74.2   51.3   47.6      74.6   54.0   48.7
8%                   72.7   48.4   40.2      73.3   46.0   41.6      73.8   50.6   42.7

Table 6: Certified accuracy with different Top-Ranking rate

Superficial layer selection. Additionally, we test the impact of using different superficial layers for the SIN-mask computation; the results are shown in Table 7. When the superficial layer is closer to the input, the targeted searching region is more precise and smaller; otherwise, the searching region is larger due to receptive field effects. The performance drop of Layer 2 is introduced by the MaxPooling layer between Layer 1 and Layer 2 (ResNet50), which leads to information loss, so pruning in Layer 2 hurts the performance. We suggest not selecting a superficial layer right after a MaxPooling layer, and recommend using the first layer because of both good accuracy and lower computing overhead.

Superficial Layer 1 2 3
Model Accuracy 73.8 69.8 73.4
Clean Accuracy 58.7 47.0 54.0
Certified Accuracy 51.8 42.4 49.4
Table 7: Effect of Superficial Layer Selection on Certified Accuracy