# (De)Randomized Smoothing for Certifiable Defense against Patch Attacks

Patch adversarial attacks on images, in which the attacker can distort pixels within a region of bounded size, are an important threat model since they provide a quantitative model for physical adversarial attacks. In this paper, we introduce a certifiable defense against patch attacks that guarantees, for a given image and patch attack size, that no patch adversarial examples exist. Our method is related to the broad class of randomized smoothing robustness schemes, which provide high-confidence probabilistic robustness certificates. By exploiting the fact that patch attacks are more constrained than general sparse attacks, we derive meaningfully large robustness certificates. Additionally, the algorithm we propose is de-randomized, providing deterministic certificates. To the best of our knowledge, there exists only one prior method for certifiable defense against patch attacks, which relies on interval bound propagation. While this sole existing method performs well on MNIST, it has several limitations: it requires computationally expensive training, does not scale to ImageNet, and performs poorly on CIFAR-10. In contrast, our proposed method effectively addresses all of these issues: our classifier can be trained quickly, achieves high clean and certified robust accuracy on CIFAR-10, and provides certificates at the ImageNet scale. For example, for a 5 × 5 patch attack on CIFAR-10, our method achieves up to around 57.8% certified accuracy (with a classifier with around 83.9% clean accuracy), which exceeds the certified accuracy of the existing method (with a classifier with around 47.8% clean accuracy), effectively establishing a new state-of-the-art. Code is available at https://github.com/alevine0/patchSmoothing.

## Code Repositories

### patchSmoothing

Code for the paper "(De)Randomized Smoothing for Certifiable Defense against Patch Attacks" by Alexander Levine and Soheil Feizi.

## 1 Introduction

In recent years, adversarial attacks have become a topic of great interest in machine learning (Szegedy et al., 2013; Madry et al., 2017; Carlini and Wagner, 2017). However, in many instances the threat models considered for these attacks, such as small $\ell_p$ distortions to every pixel of an image, implicitly require the attacker to be able to directly interfere with the input to a neural network: this represents only a limited practical threat. (We recognize that (Kurakin et al., 2018) has shown that attacks may be effective even when the distorted image has been printed and re-photographed, but this is clearly a limited setting.) However, the development of physical adversarial attacks (Eykholt et al., 2018), in which small visible changes are made to real-world objects in order to disrupt classification of images of these objects, represents a more concerning security threat.

Physical adversarial attacks can often be modeled as patch adversarial attacks, in which the attacker can make arbitrary changes to pixels within a region of bounded size. Indeed, there is often a direct relationship between the two: for example, the universal patch attack proposed by (Brown et al., 2017) is effective as a physical sticker. In that attack, the pixels of the attack do not depend on the attacked image. Image-specific patch attacks have also been proposed, such as LaVAN (Karmon et al., 2018), which reduces ImageNet classification accuracy to 0% using only a $42 \times 42$ pixel square patch (on images of size $299 \times 299$). In this paper, we also consider attacks on square patches, of size $m \times m$.

Practical defenses against patch adversarial attacks have been proposed (Hayes, 2018; Naseer et al., 2018). For the aforementioned $42 \times 42$ pixel attacks on ImageNet, (Naseer et al., 2018) claims the current state-of-the-art practical defense. However, (Chiang et al., 2020) has recently broken this defense, reducing classification accuracy on ImageNet to 14%. In the same work, (Chiang et al., 2020) also proposes the first certified defense against patch adversarial attacks, which uses interval bound propagation (Gowal et al., 2018). In a certifiably robust classification scheme, in addition to providing a classification for each image, the classifier may also return an assurance that the classification will provably not change under any distortion of a certain magnitude and threat model. One then reports both the clean accuracy (normal accuracy) of the model and the certified accuracy (the percentage of images which are both correctly classified and for which it is guaranteed that the classification will not change under a certain attack type). Certified defenses are preferable to practical defenses because they guarantee that no future adversary (under a certain threat model) will be effective against the defense.

The certified defense proposed by (Chiang et al., 2020), however, cannot defend against the attack proposed by (Chiang et al., 2020). Specifically, while this certified defense performs well on MNIST, it achieves poor certified accuracy on CIFAR-10 and, to quote, “is unlikely to scale to ImageNet.” In this work, we propose a certified defense against patch adversarial attacks which overcomes these issues.

We note that our method has top-1 certified accuracy on ImageNet classification which is approximately equal to the 14% empirical accuracy of the state-of-the-art practical defense (Naseer et al., 2018) under the attack proposed by (Chiang et al., 2020) (although our clean accuracy is lower, 43% vs. 71%). The certified defense proposed by (Chiang et al., 2020) also has a computationally expensive training algorithm: the training time for the reported best model was 8.4 GPU hours using NVIDIA 2080 Ti GPUs. Our MNIST models, by contrast, took approximately 1.0 GPU hour to train on the same model of GPU.

Our certifiably robust classification scheme is based on randomized smoothing, a class of techniques for certifiably robust classification which has been proposed for various threat models, including $\ell_2$ (Li et al., 2018; Cohen et al., 2019; Salman et al., 2019), $\ell_1$ (Lecuyer et al., 2019; Teng et al., 2020), $\ell_0$ (Lee et al., 2019; Levine and Feizi, 2019a), and Wasserstein (Levine and Feizi, 2019b) metrics. All of these methods rely on a similar mechanism where noisy versions of an input image $x$ are used in the classification. Such noisy inputs are created either by adding random noise to all pixels (Lecuyer et al., 2019) or by removing (ablating) some of the pixels (Levine and Feizi, 2019a). A large number of noisy images are then classified by a base classifier, and the consensus of these classifications is reported as the final classification result. For an adversarial image $x'$ at a bounded distance from $x$, the probability distributions of possible noisy images which can be produced from $x$ and $x'$ will substantially overlap. This implies that, if a sufficiently large fraction of noisy images derived from $x$ are classified to some class $c$, then with high probability, a plurality of noisy images derived from $x'$ will also be assigned to this class. While recent work (Kumar et al., 2020; Blum et al., 2020; Yang et al., 2020) has shown that these techniques may not extend to certain metrics (particularly the $\ell_\infty$ metric) in the classification setting, randomized smoothing has also been extended to protect against threat models beyond inference-time attacks on classification, to defend against attacks on machine learning interpretation (Levine et al., 2019) and poisoning attacks (Rosenfeld et al., 2020).

In this paper, we focus on patch adversarial attacks, which can be considered as a special case of $\ell_0$ (sparse) adversarial attacks: in an $\ell_0$ attack, the adversary can choose a limited number of pixels and apply unbounded distortions to them. A patch adversarial attack is therefore a restricted version of a sparse adversarial attack: the attacker is additionally constrained to selecting only a block of adjacent pixels to attack, rather than arbitrary pixels.

The current state-of-the-art certified defense against sparse adversarial attacks is a randomized smoothing method proposed by (Levine and Feizi, 2019a). In this method, a base classifier, $f$, is trained to make classifications based on only a small number of randomly-selected pixels: the rest of the image is ablated, meaning that it is encoded as a null value. At test time, the final classification is taken as the class most likely to be returned by $f$ on a randomly ablated version of the image. (In practice, this classification is estimated by evaluating $f$ on a large random sample of possible ablations.) Note that regardless of the choice of pixels that an adversary might distort, the probability that any of these pixels is also one of the pixels present in each particular ablated sample used by the base classifier is bounded. Then if the base classifier returns the correct classification on a sufficiently large proportion of ablated images, one can conclude that no adversarial examples with fewer than a certain number of distorted pixels exist.

In practice, we find that applying this scheme directly to patch attacks yields poor results (Table 2). This is because the defense proposed in (Levine and Feizi, 2019a) does not incorporate the additional structure of the attack. In particular, for patch attacks, we can use the fact that the attacked pixels form a contiguous square to develop a more effective defense. In this paper, we propose a structured ablation scheme, where instead of independently selecting pixels to use for classification, we select pixels in a correlated way in order to reduce the probability that the adversarial patch is sampled. In Theorems 1 and 2, we characterize the robustness certificates of the proposed structured ablation methods against patch adversarial attacks. Empirically, structured ablation certificates yield much improved certified accuracy under patch attacks, compared to the naive $\ell_0$ certificate (52.83% for $5 \times 5$ patches on MNIST, compared to 8.04%).

By reducing the total number of possible ablations of an image, structured ablation also allows us to de-randomize our algorithm, yielding improved, deterministic certificates. For $\ell_0$ robustness, (Levine and Feizi, 2019a) achieves the largest median certificates on MNIST by using a base classifier which classifies using only a small number $k$ out of $hw = 784$ pixels. Note that there are $\binom{784}{k}$ ways to make this selection. It is therefore not feasible to evaluate precisely the probability that $f$ returns any particular class $c$: one must estimate this based on random samples, following techniques originally developed by (Cohen et al., 2019) for $\ell_2$ randomized smoothing. Using our proposed methods, the number of possible ablations is small enough that it is tractable to classify using all possible ablations: we can exactly evaluate the probability that $f$ returns each class. Our certificate is therefore exact, rather than probabilistic: we know what $f$ returns given each possible ablated image as input, and we know the maximum number of these classifications which could possibly be distorted by the adversary, so we can determine whether or not the plurality classification could change using simple counting. Determinism provides additional benefits, which we explore in Sections 2.6 and 3.2. In a concurrent work, (Rosenfeld et al., 2020) has also independently proposed a de-randomized version of a randomized smoothing technique. However, the threat model of that work is quite different: (Rosenfeld et al., 2020) develops a smoothing defense against label-flipping poisoning (training-time) attacks, where the adversary is able to change the label of a bounded number of training samples. Notably, the result of (Rosenfeld et al., 2020) only applies directly to linear base classifiers (although in the particular threat model considered, the inputs to this linear classifier may be nonlinear features learned from the unlabeled training data, which the adversary cannot perturb). By restricting to linear classifiers, (Rosenfeld et al., 2020) is able to analytically determine the probabilities of returning each class. By contrast, our de-randomized smoothing technique for inference-time patch attacks makes no restriction on the architecture of the base classifier $f$, in practice a deep convolutional network.

## 2 Certifiable Defenses against Patch Attacks

### 2.1 Sparse Ablation (Levine and Feizi, 2019a)

As mentioned in the introduction, patch attacks can be regarded as a restricted case of $\ell_0$ attacks. In particular, let $\rho$ be the magnitude of an $\ell_0$ adversarial attack: the attacker modifies $\rho$ pixels and leaves the rest unchanged. A patch attack, with an $m \times m$ adversarial patch, is also an $\ell_0$ attack, with $\rho = m^2$. We can then attempt to apply existing certifiably robust classification schemes for the $\ell_0$ threat model to the patch attack threat model: we simply need to certify to an $\ell_0$ radius of $m^2$. Consider specifically the smoothing-based certifiably robust classifier introduced by (Levine and Feizi, 2019a). In this classification scheme, given an input image $x$, the base classifier $f$ classifies a large number of distinct randomly-ablated versions of $x$, in each of which only $k$ pixels of the original image are randomly and independently selected to be retained and used by the base classifier $f$. Therefore, for any choice of $\rho$ pixels that the attacker could choose to attack, the probability that any of these pixels is also one of the $k$ pixels used in $f$'s classification is:

$$\Delta := \Pr(f \text{ uses attacked pixels}), \qquad \Delta = 1 - \frac{\binom{hw - \rho}{k}}{\binom{hw}{k}} \approx \frac{k\rho}{hw} = \frac{k m^2}{hw} \quad (k, \rho \ll hw)$$

where $\rho$ is the number of attacked pixels, $k$ is the number of retained pixels used by the base classifier, and the overall dimensions of the input image are $h \times w$. To understand this, note that the classifier has $k$ opportunities to choose an attacked pixel, and $\rho$ out of $hw$ pixels are attacked. Clearly, if $f$ does not use any of the attacked pixels, then its output will not be corrupted by the attacker. Therefore, the attacker can change the output of $f$ with probability at most $\Delta$. Let $c$ be the majority classification at $x$ (i.e., $f(x) = c$ with probability greater than $0.5$). If $f(x) = c$ with probability greater than $0.5 + \Delta$, then for any distorted image $x'$, one can conclude that $f(x') = c$ with probability greater than $0.5$, and therefore that $c$ remains the majority classification at $x'$. While this technique produces state-of-the-art guarantees against general $\ell_0$ attacks, it yields rather poor certified accuracies when applied to patch attacks, because it does not take advantage of the structure of the attack (Table 2).
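To make the size of this probability concrete, the following is a minimal sketch that evaluates the exact expression and its first-order approximation. The function names and the illustrative setting (a 28 × 28 image, a 5 × 5 patch, and `k = 10` retained pixels) are assumptions chosen for illustration, not values taken from the experiments.

```python
from math import comb


def sparse_ablation_delta(h: int, w: int, k: int, rho: int) -> float:
    """Exact probability that at least one of the rho attacked pixels is among
    the k pixels retained uniformly at random from an h x w image."""
    total = h * w
    return 1.0 - comb(total - rho, k) / comb(total, k)


def sparse_ablation_delta_approx(h: int, w: int, k: int, rho: int) -> float:
    """First-order approximation k * rho / (h * w), accurate when k, rho << h * w."""
    return k * rho / (h * w)


# Hypothetical setting: 28 x 28 image, 5 x 5 patch, k = 10 retained pixels.
h, w, m, k = 28, 28, 5, 10
exact = sparse_ablation_delta(h, w, k, rho=m * m)
approx = sparse_ablation_delta_approx(h, w, k, rho=m * m)
print(f"exact = {exact:.4f}, approx = {approx:.4f}")
```

The approximation is a union bound over the attacked pixels, so it is never smaller than the exact value; even with very few retained pixels, a substantial fraction of ablations can touch the patch.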

### 2.2 Structured Ablation

In order to exploit the restricted nature of patch attacks, we propose two structured ablation methods, which select correlated groups of pixels in order to reduce the probability that the adversarial patch is sampled:

• Block Smoothing: In this method, we select a single $s \times s$ square block of pixels, and ablate the rest of the image. The number of retained pixels is then $k = s^2$. Note that for an $m \times m$ adversarial patch, out of the $hw$ possible selections for blocks to use for classification, $(m + s - 1)^2$ of them will intersect the patch. Then we have:

$$\Delta_{\text{block}} = \frac{(m + s - 1)^2}{hw} = \frac{(m + \sqrt{k} - 1)^2}{hw} < \frac{4 \max(m^2, k)}{hw}. \tag{1}$$

As illustrated in Figure 1, this implies a substantially decreased probability of intersecting the adversarial patch, compared to sampling pixels independently.

• Band Smoothing: In this method, we select a single band (a column or a row) of pixels of width $s$, and ablate the rest of the image. In the case of a column, the number of retained pixels is then $k = sh$. For an $m \times m$ adversarial patch, out of the $w$ possible selections for bands to use for classification, $m + s - 1$ of them will intersect the patch. Then we have:

$$\Delta_{\text{col.}} = \frac{m + s - 1}{w} = \frac{m + k/h - 1}{w} \tag{2}$$

For both of these methods, it is tractable to use the base classifier to classify all possible ablated versions of an image (there are $hw$ possible ablations for block smoothing, and $w$ for column smoothing): this allows us to exactly compute the fraction of ablations on which $f$ returns each class, and to compute deterministic certificates. Our experiments show that structured ablation produces higher certified accuracy than sparse ablation. This is because, for similar values of $\Delta$, structured ablation methods yield much higher base classifier accuracies (Figure 2). Empirically, we find that the band method (and specifically, column smoothing) produces the most certifiably robust classifiers (Tables 4 and 5). In Section 3.3, we explore structured ablation using multiple blocks or bands of pixels.
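The counting claims behind Equations 1 and 2 can be checked by brute force. The sketch below is illustrative only: the helper names are hypothetical, indices are 0-based, the patch is assumed to lie fully inside the image, and $m + s - 1$ is assumed not to exceed the image dimensions. Retained blocks and columns wrap around the image edge.

```python
def wrapped_span(start: int, length: int, size: int) -> set:
    """Indices covered by a span of the given length that wraps modulo size."""
    return {(start + d) % size for d in range(length)}


def count_intersecting_blocks(h: int, w: int, s: int, m: int, i: int, j: int) -> int:
    """Number of s x s block positions that overlap an m x m patch at (i, j)."""
    patch_rows = set(range(i, i + m))
    patch_cols = set(range(j, j + m))
    # A block overlaps the patch iff it overlaps in rows AND in columns.
    row_hits = sum(1 for a in range(h) if wrapped_span(a, s, h) & patch_rows)
    col_hits = sum(1 for b in range(w) if wrapped_span(b, s, w) & patch_cols)
    return row_hits * col_hits


def count_intersecting_columns(w: int, s: int, m: int, j: int) -> int:
    """Number of width-s column positions that overlap patch columns j..j+m-1."""
    patch_cols = set(range(j, j + m))
    return sum(1 for b in range(w) if wrapped_span(b, s, w) & patch_cols)


# Hypothetical sizes: 28 x 28 image, 5 x 5 patch, block size 12, column width 4.
h, w, m = 28, 28, 5
delta_block = count_intersecting_blocks(h, w, 12, m, i=0, j=23) / (h * w)
delta_col = count_intersecting_columns(w, 4, m, j=23) / w
print(f"delta_block = {delta_block:.4f}, delta_col = {delta_col:.4f}")
```

Because of the wrap-around, the counts do not depend on where the patch is placed, which is the property the wrapping convention is meant to guarantee.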

### 2.3 Notation for Algorithms

Following the notation of (Levine and Feizi, 2019a), let $\mathcal{S}$ be the set of possible values for a pixel in an input image. Because our base classifier only uses a subset of the image, we must have a way to encode missing pixels: let NULL represent the absence of information about a pixel, and let $\mathcal{S}_{\text{NULL}}$ be the set $\mathcal{S} \cup \{\text{NULL}\}$. Let $n$ be the number of classes in the classification problem. We will assume that our base classifier is a deterministic function $f : \mathcal{S}_{\text{NULL}}^{h \times w} \to [0, 1]^n$. In other words, the base classifier will take an image with some pixels ablated to NULL and others retained, and output softmaxed logits representing the confidence that the classifier has in each class. As a practical matter, $f$ can easily be represented as a neural network: we encode the additional NULL symbol in the input in the same manner described by (Levine and Feizi, 2019a) for each dataset tested. For the block smoothing method, we will define the operation $\text{Ablate}(x, s, a, b)$, which will take an image $x$ and return an ablated image in $\mathcal{S}_{\text{NULL}}^{h \times w}$ that retains only the pixels in an $s \times s$ block with upper-left corner $(a, b)$, and ablates the rest. For the column (or row, by symmetry) smoothing method, similarly define $\text{Ablate}(x, s, b)$, which retains only a column of width $s$ starting at column $b$. In both cases, if the retained block or column would extend beyond the size of the image, the retained region will wrap around; see Figure 3. (This is necessary to ensure that an adversarial patch at the center of an image is not more likely to be sampled than one at the edge of the image.)
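As an illustration of the two ablation operations with wrap-around, here is a minimal sketch on nested Python lists. The function names are hypothetical, indices are 0-based, and using `None` as the NULL symbol is an assumption made for readability: the actual input encoding follows (Levine and Feizi, 2019a) and is not reproduced here.

```python
NULL = None  # stand-in for the NULL symbol (assumption for this sketch)


def ablate_block(x, s, a, b):
    """Keep only the s x s block with upper-left corner (a, b), wrapping at edges."""
    h, w = len(x), len(x[0])
    rows = {(a + d) % h for d in range(s)}
    cols = {(b + d) % w for d in range(s)}
    return [[x[r][c] if (r in rows and c in cols) else NULL for c in range(w)]
            for r in range(h)]


def ablate_column(x, s, b):
    """Keep only the width-s band of columns starting at column b, wrapping at edges."""
    w = len(x[0])
    cols = {(b + d) % w for d in range(s)}
    return [[v if c in cols else NULL for c, v in enumerate(row)] for row in x]


# 4 x 4 toy image whose pixel values are 0..15 in row-major order.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
block = ablate_block(image, s=2, a=3, b=3)   # wraps onto row 0 and column 0
column = ablate_column(image, s=2, b=3)      # wraps onto column 0
```

Starting the block at the bottom-right pixel retains the four corner pixels of the image, which shows the wrapping behavior directly.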

Additionally, we define StableArgSort of a list to return the indices that would sort the list in descending order such that, in the event of a tie, the lower index is listed first.
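One way to realize this definition is sketched below, relying on the fact that Python's built-in `sorted` is stable. The function name and the sample list are illustrative choices, not taken from the original text.

```python
def stable_argsort(values):
    """Indices that sort `values` in descending order; ties keep the lower index first."""
    # sorted() is stable, so indices with equal keys stay in increasing order.
    return sorted(range(len(values)), key=lambda i: -values[i])


print(stable_argsort([1, 3, 3, 2]))  # -> [1, 2, 3, 0]
```

The two entries equal to 3 are listed in index order (1 before 2), which is the tie-breaking rule the certification argument relies on.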

### 2.4 Block Smoothing Algorithm

The “block” variation of our algorithm is presented as Algorithm 1. In short, we evaluate the base classifier on all possible blocks. On each input, rather than simply taking the maximum class returned by the base classifier, we count every class for which the logit exceeds a threshold $\theta$. The classifier can abstain entirely if no class reaches this threshold (for example, if the block contains no useful information). If the gap in the counts between the top class and the next class is greater than twice the number of classifications which could possibly be affected by an adversarial patch, then we can certify that no such patch can change the final smoothed classification output. During training, as in prior smoothing works, we train on ablated samples, using a single randomly-determined ablation pattern (selection of block to retain) on all samples in each batch.
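The description above can be sketched in a few lines of Python. This is an illustrative reading of the text, not the released implementation: the function names, the `None` encoding of ablated pixels, the toy classifier, and the use of `>=` for the threshold comparison are assumptions, and the tie-breaking rule for the equality case follows the stable-sorting argument in the proof of Theorem 1.

```python
def ablate_block(x, s, a, b):
    """Keep the s x s block with upper-left corner (a, b), wrapping at the edges."""
    h, w = len(x), len(x[0])
    rows = {(a + d) % h for d in range(s)}
    cols = {(b + d) % w for d in range(s)}
    return [[x[r][c] if (r in rows and c in cols) else None for c in range(w)]
            for r in range(h)]


def block_smooth(x, f, s, m, theta, n_classes):
    """Return (predicted class, certified?) for an m x m patch threat model."""
    h, w = len(x), len(x[0])
    counts = [0] * n_classes
    for a in range(h):                      # evaluate f on every block position
        for b in range(w):
            scores = f(ablate_block(x, s, a, b))
            for c in range(n_classes):
                if scores[c] >= theta:      # a block may vote for 0, 1, or several classes
                    counts[c] += 1
    order = sorted(range(n_classes), key=lambda c: -counts[c])  # stable argsort
    top, runner_up = order[0], order[1]
    affected = (m + s - 1) ** 2             # blocks a patch could intersect
    gap = counts[top] - counts[runner_up]
    certified = gap > 2 * affected or (gap == 2 * affected and top < runner_up)
    return top, certified


def toy_classifier(ablated):
    """Hypothetical base classifier: class 0 if the retained pixels are bright."""
    kept = [v for row in ablated for v in row if v is not None]
    return [0.9, 0.1] if sum(kept) / len(kept) > 0.5 else [0.1, 0.9]


bright_image = [[1.0] * 8 for _ in range(8)]
small_patch = block_smooth(bright_image, toy_classifier, s=2, m=1, theta=0.5, n_classes=2)
large_patch = block_smooth(bright_image, toy_classifier, s=2, m=5, theta=0.5, n_classes=2)
```

On this toy input all 64 blocks vote for class 0. A 1 × 1 patch can affect only 4 blocks, so the prediction is certified; a 5 × 5 patch can affect 36 blocks, and since 64 is less than 72 the same prediction is returned without a certificate.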

###### Theorem 1.

For any image $x$, base classifier $f$, smoothing block size $s$, and threshold $\theta$, if Algorithm 1 returns class $c$ on input $x$ and certifies this classification, then Algorithm 1 will return class $c$ on all inputs $x'$ that differ from $x$ only within an $m \times m$ region of pixels.

###### Proof.

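Let $(i, j)$ represent the upper-left corner of the $m \times m$ patch in which $x$ and $x'$ differ. Note that the output of $f(\text{Ablate}(x, s, a, b))$ will be equal to the output of $f(\text{Ablate}(x', s, a, b))$, unless the block retained (starting at $(a, b)$) intersects with the adversarial patch (starting at $(i, j)$). This condition occurs only when both $a$ is in the range between $i - s + 1$ and $i + m - 1$, inclusive, and $b$ is in the range between $j - s + 1$ and $j + m - 1$, inclusive. Note that there are $m + s - 1$ values each for $a$ and $b$ which meet this condition, and therefore $(m + s - 1)^2$ such pairs $(a, b)$. Therefore $f(\text{Ablate}(x, s, a, b)) = f(\text{Ablate}(x', s, a, b))$ in all but $(m + s - 1)^2$ cases.

(If the range for $a$ extends past the edge of the image, for example if $i - s + 1$ falls below the first row index, then the intersecting values for $a$, taking into account the wrapping behavior of the Ablate operator, will run from the first index through $i + m - 1$ and from the wrapped position of $i - s + 1$ through the last index: there are still $m + s - 1$ such values, and similar logic applies to $b$.)

As a consequence, the thresholded outputs of $f$ on the ablations of $x$ will equal those on the ablations of $x'$ in all but $(m + s - 1)^2$ cases, so the Counts which are incremented will be the same in all but at most $(m + s - 1)^2$ (= AffectedBlocks) iterations. Let $\text{Counts}_x$ and $\text{Counts}_{x'}$ represent the state of the Counts array after the main loop of the algorithm on inputs $x$ and $x'$, respectively, and similarly subscript the other variables of the algorithm by the input on which they are computed. Because, for each class $k$, $\text{Counts}[k]$ is incremented by at most $1$ per iteration, we have that:

$$\left| \text{Counts}_x[k] - \text{Counts}_{x'}[k] \right| \leq \text{AffectedBlocks}. \tag{3}$$

Let $c$ and $c'$ denote the first and second classes in the stable sort of $\text{Counts}_x$. If $\text{Counts}_x[c] - \text{Counts}_x[c'] > 2 \cdot \text{AffectedBlocks}$, then by sorting, we have that for all classes $k \neq c$, $\text{Counts}_x[c] > \text{Counts}_x[k] + 2 \cdot \text{AffectedBlocks}$. Then by Equation 3, $\text{Counts}_{x'}[c] > \text{Counts}_{x'}[k]$, and so Algorithm 1 returns $c$ on input $x'$.

If $\text{Counts}_x[c] - \text{Counts}_x[c'] = 2 \cdot \text{AffectedBlocks}$ and $c < c'$, then for all $k \neq c$, $\text{Counts}_x[c] \geq \text{Counts}_x[k] + 2 \cdot \text{AffectedBlocks}$. In the inequality case, we already have that $\text{Counts}_{x'}[c] > \text{Counts}_{x'}[k]$. For the equality case, consider any class $k$ for which $\text{Counts}_x[c] = \text{Counts}_x[k] + 2 \cdot \text{AffectedBlocks}$. Note that $\text{Counts}_x[k] = \text{Counts}_x[c']$ and also, by stable sorting of $\text{Counts}_x$, $c' \leq k$. Then because $c < c'$, we have that $c < k$. By Equation 3, $\text{Counts}_{x'}[c] \geq \text{Counts}_{x'}[k]$. Then, by stable sorting of $\text{Counts}_{x'}$, class $c$ is listed before class $k$, so Algorithm 1 again returns $c$ on input $x'$.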

### 2.5 Column Smoothing Algorithm

The column version of our algorithm is presented as Algorithm 2. (An algorithm using rows rather than columns can be derived by simply transposing the input.) Note that this algorithm is nearly identical to the block variation: the difference is that we only need to consider columns rather than blocks. Formally, we claim that:

###### Theorem 2.

For any image $x$, base classifier $f$, smoothing column size $s$, and threshold $\theta$, if Algorithm 2 returns class $c$ on input $x$ and certifies this classification, then Algorithm 2 will return class $c$ on all inputs $x'$ that differ from $x$ only within an $m \times m$ region of pixels.

This theorem can be proved similarly to Theorem 1.
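The only step that changes relative to Theorem 1 is the count of affected ablations, which can be sketched as follows. Here $j$ denotes the leftmost column of the patch and $b$ the starting column of the retained band; these labels, and the name AffectedColumns (by analogy with AffectedBlocks), are introduced for this sketch only.

```latex
\begin{align*}
&\text{the band starting at } b \text{ intersects the patch}
  \iff b \in \{\, j - s + 1, \dots, j + m - 1 \,\} \pmod{w}, \\
&\#\{b\} = (j + m - 1) - (j - s + 1) + 1 = m + s - 1 =: \text{AffectedColumns}, \\
&\bigl|\text{Counts}_x[k] - \text{Counts}_{x'}[k]\bigr| \le \text{AffectedColumns}
  \quad \text{for every class } k .
\end{align*}
```

From this bound, the two cases of the sorting argument (strict gap, and equality with tie-breaking by stable sort) go through as in the block case, with AffectedColumns in place of AffectedBlocks.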

### 2.6 Changes from Standard Randomized Smoothing

In conventional randomized smoothing algorithms (Lecuyer et al., 2019; Levine and Feizi, 2019a; Cohen et al., 2019; Salman et al., 2019), rather than computing the probability that $f$ returns each class directly by exhaustive iteration, one must instead lower-bound, with high confidence, the probability that $f$ returns the plurality class $c$ and upper-bound the probabilities that $f$ returns all other classes, based on samples. This leads to decreased certified accuracy due to estimation error. Additionally, all of these bounds must hold simultaneously: in order to ensure with high confidence that the gap between $\Pr(f(x) = c)$ and $\Pr(f(x) = c')$ is sufficiently large for each $c' \neq c$ to prove robustness, one must bound the population probabilities for every class. (Lecuyer et al., 2019) does this directly using a union bound, leading to increased error as the number of classes increases. However, later works, following (Cohen et al., 2019), use a simpler method: one only needs to use samples to lower-bound the probability that the base classifier returns the top class. One can then upper-bound all other class probabilities by observing that $\Pr(f(x) = c') \leq 1 - \Pr(f(x) = c)$. In other words, rather than determining whether $c$ will stay the plurality class at an adversarial point, one instead determines whether $c$ will stay the majority class. This works well if the probability that $f$ returns a class other than $c$ is concentrated in a single “runner-up” class, which (Cohen et al., 2019) finds to be typical, at least in the $\ell_2$ smoothing case. This is also the estimation method used by (Levine and Feizi, 2019a) for $\ell_0$ certificates: this is why, when describing that method in Section 2.1, we gave the condition for certification as $\Pr(f(x) = c) > 0.5 + \Delta$. In our deterministic method, we can use a less strict condition, that $\Pr(f(x) = c) - \Pr(f(x) = c') > 2\Delta$ for every class $c' \neq c$. (As detailed in the above proof, we can sometimes even certify in the equality case, if it is assured that, by stable sorting, $c$ will be selected if there is a tie between the class probabilities at the distorted point.)
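The difference between the two certification conditions can be seen on a small numerical example. The sketch below compares them on exact class fractions; the function names and the numbers are illustrative assumptions, and the statistical estimation step of randomized smoothing (confidence bounds from samples) is deliberately left out.

```python
def majority_condition(fractions, delta):
    """Condition used with majority-style estimation: top fraction > 0.5 + delta."""
    return max(fractions) > 0.5 + delta


def plurality_condition(fractions, delta):
    """Less strict deterministic condition: gap to the runner-up > 2 * delta."""
    ordered = sorted(fractions, reverse=True)
    return ordered[0] - ordered[1] > 2 * delta


# Hypothetical fractions of ablations voting for each of four classes.
fractions = [0.45, 0.20, 0.20, 0.15]
delta = 0.10

print(majority_condition(fractions, delta))   # False: 0.45 is not above 0.60
print(plurality_condition(fractions, delta))  # True: the gap 0.25 exceeds 0.20
```

Whenever the majority condition holds, every other class has fraction below $0.5 - \Delta$, so the plurality condition holds as well; the converse fails when the remaining votes are spread over several classes, as in this example.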

In this work, we sidestep the estimation problem entirely by computing the population probabilities exactly. Moreover, by avoiding the assumption of (Cohen et al., 2019) that all probability not assigned to $c$ is instead assigned to a single adversarial class, we can make an important additional optimization: we can add an ‘abstain’ option. If there is no compelling evidence for any particular class in an ablated image (i.e., if all logits are below a threshold value $\theta$), our classifier abstains, not incrementing the ‘Counts’ for any class. This prevents blocks which contain no information from being assigned to an arbitrary, likely incorrect class. Table 5 shows that this significantly increases the certified accuracy. Our threshold system also allows the base classifier to select multiple classes, if there is strong evidence for each of them. This is intended to increase certified accuracy in the case of a large number of classes (e.g., ImageNet), where the top-1 accuracy of the base classifier might be very low: if the correct class consistently occurs within the top several classes, it may still be possible to certify robustness.
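The effect of abstention and multiple selection on the counts can be illustrated with a toy example. The sketch below contrasts top-1 counting with thresholded counting; the function names and the base classifier outputs are made up for illustration, and `>=` is an assumed convention for the threshold comparison.

```python
def count_top1(outputs):
    """Each ablation votes for exactly one class: its highest-scoring class."""
    n = len(outputs[0])
    counts = [0] * n
    for scores in outputs:
        counts[max(range(n), key=lambda c: scores[c])] += 1
    return counts


def count_thresholded(outputs, theta):
    """Each ablation votes for every class scoring at least theta (possibly none)."""
    n = len(outputs[0])
    counts = [0] * n
    for scores in outputs:
        for c in range(n):
            if scores[c] >= theta:
                counts[c] += 1
    return counts


# Hypothetical softmax outputs: three informative ablations, three uninformative ones.
outputs = [
    [0.90, 0.05, 0.05],
    [0.85, 0.10, 0.05],
    [0.80, 0.15, 0.05],
    [0.30, 0.36, 0.34],
    [0.32, 0.35, 0.33],
    [0.33, 0.34, 0.33],
]

print(count_top1(outputs))              # [3, 3, 0]
print(count_thresholded(outputs, 0.5))  # [3, 0, 0]
print(count_thresholded(outputs, 0.3))  # [6, 3, 3]
```

With top-1 counting, the uninformative ablations hand three arbitrary votes to class 1 and erase the gap. With a threshold of 0.5 they abstain, and with a lower threshold of 0.3 they vote for several classes at once; in both thresholded cases class 0 keeps a gap of three over the runner-up.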

## 3 Results

### 3.1 MNIST, CIFAR-10, ImageNet

Certified robustness against patch attacks is presented for $5 \times 5$ patches on MNIST and CIFAR-10 in Tables 5 and 4, respectively, and for $42 \times 42$ patches on ImageNet in Table 3. On MNIST and CIFAR-10, we tested using both block and column smoothing, for all block/column sizes $s$ for which certificates are mathematically possible (that is, for which the fraction $\Delta$ of ablations that a patch can affect is small enough for the certification condition to be satisfiable). (On MNIST, we also tested smoothing with rows rather than columns, with slightly worse results: this is presented in Appendix A.) On ImageNet (Table 3), we tested with column smoothing, as this worked best for both CIFAR-10 and MNIST. We chose a single column width for ImageNet, on the rough intuition given by the optimal column widths for smoothing found, on average, on CIFAR-10 and MNIST: further exploration of this parameter space would likely yield improved certificates on ImageNet.

### 3.2 Advantages due to De-randomization

As discussed in Section 2.6, there are two benefits to de-randomization: first, we can eliminate estimation error, and second, it allows the classifier to abstain or select multiple classes without complicating estimation. In order to distinguish these effects, we present in Table 6 the MNIST column certificates using randomized column smoothing (with the estimation scheme from (Cohen et al., 2019)), versus deterministic column smoothing without abstentions or multiple selections: that is to say, counting just the top-1 class returned by the base classifier. The “Top-1” deterministic values are also presented in Table 5, on the right. We see that, while determinism alone provides some benefit, the thresholding system provides an even greater improvement.

### 3.3 Multiple Blocks, Multiple Bands

In Section 2.2, we argued for using a single contiguous group of pixels on the grounds that, compared to selecting individual pixels, it provides for a smaller risk of intersecting the adversarial patch. However, there may be some benefit to getting information from multiple distinct areas of an image, even if there is some associated increase in $\Delta$. Rather than just looking at the extremes of entirely independent pixels (Table 2) versus a single band or block (Table 5), we also explored, on MNIST, the intermediate case of using a small number of bands or blocks (Table 7). We show all mathematically possible multiple-column certificates on MNIST, as well as several certificates for multiple blocks. Note that in order to avoid overlapping columns/blocks, we select columns/blocks aligned to a grid: details on the certificates (which are deterministic) are provided in Appendix D. Interestingly, while the certificates using multiple columns are far below optimal, the certified accuracy for two blocks is only marginally below the best single-block certified accuracy.

## 4 Conclusion

In this paper, we proposed two related methods for image classification which are provably robust to patch adversarial attacks. These methods, which are adaptations of randomized smoothing, far exceed the current state-of-the-art certified accuracy to patch attacks on CIFAR-10. One of our methods, column smoothing, provides certified robustness on ImageNet comparable to the empirical robustness of state-of-the-art empirical defenses against patch attacks, with very little parameter tuning: it is likely possible to tune this technique to provide even greater certified accuracy.

## References

• A. Blum, T. Dick, N. Manoj, and H. Zhang (2020) Random smoothing might be unable to certify robustness for high-dimensional images. arXiv preprint arXiv:2002.03517. External Links: 2002.03517 Cited by: §1.
• T. B. Brown, D. Mané, A. Roy, M. Abadi, and J. Gilmer (2017) Adversarial patch. External Links: 1712.09665 Cited by: §1.
• N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 38th IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: §1.
• P. Chiang, R. Ni, A. Abdelkader, C. Zhu, C. Studor, and T. Goldstein (2020) Certified defenses for adversarial patches. In International Conference on Learning Representations, External Links: Link Cited by: Table 1, §1, §1, §1.
• J. Cohen, E. Rosenfeld, and Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. In International Conference on Machine Learning, pp. 1310–1320. Cited by: Appendix E, §1, §1, §2.6, §2.6, §3.2.
• K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song (2018)

Robust physical-world attacks on deep learning visual classification

.
In

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

,
pp. 1625–1634. Cited by: §1.
• S. Gowal, K. Dvijotham, R. Stanforth, R. Bunel, C. Qin, J. Uesato, R. Arandjelovic, T. Mann, and P. Kohli (2018) On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715. Cited by: §1.
• J. Hayes (2018) On visible adversarial perturbations & digital watermarking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1597–1604. Cited by: §1.
• D. Karmon, D. Zoran, and Y. Goldberg (2018) Lavan: localized and visible adversarial noise. arXiv preprint arXiv:1801.02608. Cited by: §1.
• A. Kumar, A. Levine, T. Goldstein, and S. Feizi (2020) Curse of dimensionality on randomized smoothing for certifiable robustness. arXiv preprint arXiv:2002.03239. External Links: 2002.03239 Cited by: §1.
• A. Kurakin, I. J. Goodfellow, and S. Bengio (2018) Adversarial examples in the physical world. In Artificial Intelligence Safety and Security, pp. 99–112. Cited by: §1.
• M. Lecuyer, V. Atlidakis, R. Geambasu, D. Hsu, and S. Jana (2019) Certified robustness to adversarial examples with differential privacy. In 2019 2019 IEEE Symposium on Security and Privacy (SP), Vol. , Los Alamitos, CA, USA, pp. 726–742. External Links: ISSN 2375-1207, Document, Link Cited by: §1, §2.6.
• G. Lee, Y. Yuan, S. Chang, and T. S. Jaakkola (2019) Tight certificates of adversarial robustness for randomly smoothed classifiers. arXiv preprint arXiv:1906.04948. Cited by: §1.
• A. Levine and S. Feizi (2019a) Robustness certificates for sparse adversarial attacks by randomized ablation. arXiv preprint arXiv:1911.09272. Cited by: Appendix E, Appendix E, §1, §1, §1, §1, Figure 1, Figure 2, §2.1, §2.1, §2.3, §2.6, Table 2.
• A. Levine and S. Feizi (2019b) Wasserstein smoothing: certified robustness against wasserstein adversarial attacks. arXiv preprint arXiv:1910.10783. Cited by: §1.
• A. Levine, S. Singla, and S. Feizi (2019) Certifiably robust interpretation in deep learning. arXiv preprint arXiv:1905.12105. Cited by: §1.
• B. Li, C. Chen, W. Wang, and L. Carin (2018) Second-order adversarial attack and certifiable robustness. arXiv preprint arXiv:1809.03113. Cited by: §1.
• A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §1.
• M. Naseer, S. Khan, and F. M. Porikli (2018) Local gradients smoothing: defense against localized adversarial attacks. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 1300–1307. Cited by: §1, §1.
• E. Rosenfeld, E. Winston, P. Ravikumar, and J. Z. Kolter (2020) Certified robustness to label-flipping attacks via randomized smoothing. arXiv preprint arXiv:2002.03018. Cited by: §1, §1.
• H. Salman, G. Yang, J. Li, P. Zhang, H. Zhang, I. Razenshteyn, and S. Bubeck (2019) Provably robust deep learning via adversarially trained smoothed classifiers. arXiv preprint arXiv:1906.04584. Cited by: §1, §2.6.
• C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1.
• J. Teng, G. Lee, and Y. Yuan (2020) $\ell_1$ adversarial robustness certificates: a randomized smoothing approach. Cited by: §1.
• G. Yang, T. Duan, E. Hu, H. Salman, I. Razenshteyn, and J. Li (2020) Randomized smoothing of all shapes and sizes. arXiv preprint arXiv:2002.08118. Cited by: §1.

## Appendix A Results for Row Smoothing

We tested smoothing with rows, rather than columns, on MNIST. This resulted in slightly lower certified accuracy under patch attacks (45.35% certified accuracy, versus 53.98% using column smoothing). Full results are presented in Table 8.

## Appendix B Results for Multi-column and Multi-block Smoothing for All Tested Values of Parameter θ

In Table 9, we present complete results for multi-column and multi-block smoothing on MNIST, for all tested values of the threshold parameter $\theta$.

## Appendix C Results on CIFAR-10 for All Tested Values of Parameter θ

In Table 12, we present complete results on CIFAR-10, for all tested values of the threshold parameter $\theta$.

## Appendix D Certificates for Multi-column and Multi-block Smoothing

For smoothing with multiple blocks or multiple columns, we consider only blocks or columns aligned to a grid starting at the upper-left corner of the image. For example, with block size $s$, we consider only retaining blocks with upper-left corner $(i, j)$, where $i$ and $j$ are both multiples of $s$. This prevents retained blocks from overlapping, and also reduces the (large) number of possible selections of multiple blocks, allowing for derandomized smoothing. Let the number of retained blocks or bands be $\kappa$, and, as in the paper, let the block or band size be $s$, the image size be $h \times w$, and the adversarial patch size be $m \times m$.

For the block case, note that there are $\lceil h/s \rceil \times \lceil w/s \rceil$ such axis-aligned blocks. Of these, the adversarial patch will overlap at most $(\lceil (m-1)/s \rceil + 1)^2$ blocks. For example, for a adversarial patch, using block size , the adversarial patch will overlap exactly blocks, regardless of position: see Figure 4.

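When performing derandomized smoothing, we classify all $\binom{\lceil h/s \rceil \times \lceil w/s \rceil}{\kappa}$ possible choices of $\kappa$ blocks. Of these classifications, at least

$$\binom{\lceil h/s \rceil \times \lceil w/s \rceil - \left(\lceil (m-1)/s \rceil + 1\right)^2}{\kappa}$$

will use none of the at most $(\lceil (m-1)/s \rceil + 1)^2$ blocks which may be affected by the adversary. Therefore, the number of classifications which might be affected by the adversary is at most:

$$\binom{\lceil h/s \rceil \times \lceil w/s \rceil}{\kappa} - \binom{\lceil h/s \rceil \times \lceil w/s \rceil - \left(\lceil (m-1)/s \rceil + 1\right)^2}{\kappa}.$$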

We can then use the above quantity in place of the variable AffectedBlocks in the certification algorithm (Algorithm 1). This modification, in addition to classifying all selections of axis-aligned blocks, is sufficient to adapt the certification algorithm to a multi-block setting.
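The multi-block bound above can be evaluated directly. The following is a minimal sketch rather than the released implementation: the function names `ceil_div` and `affected_block_selections` are hypothetical, and the example setting (a 28 × 28 image, block size 4, a 5 × 5 patch, and two retained blocks) is chosen only for illustration.

```python
from math import comb


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def affected_block_selections(h: int, w: int, s: int, m: int, kappa: int) -> int:
    """Upper bound on selections of kappa grid-aligned blocks that an m x m patch can touch."""
    total_blocks = ceil_div(h, s) * ceil_div(w, s)
    # Worst-case number of grid-aligned s x s blocks overlapped by the patch.
    hit_blocks = (ceil_div(m - 1, s) + 1) ** 2
    # Selections that avoid every possibly-affected block (none if too few blocks remain).
    untouched = comb(max(total_blocks - hit_blocks, 0), kappa)
    return comb(total_blocks, kappa) - untouched


# Hypothetical setting: 28 x 28 image, 4 x 4 blocks, 5 x 5 patch, 2 retained blocks.
print(affected_block_selections(28, 28, 4, 5, 2))  # 1176 - 990 = 186
```

With a single retained block ($\kappa = 1$), the bound reduces to the number of blocks the patch can overlap, as expected.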

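The column case is similar: there are $\lceil w/s \rceil$ axis-aligned bands (defined as bands which start at a column index which is a multiple of $s$). Of these, the adversarial patch will overlap at most $\lceil (m-1)/s \rceil + 1$ bands. When performing smoothing, we classify all $\binom{\lceil w/s \rceil}{\kappa}$ possible choices of $\kappa$ bands. Of these classifications, at least

$$\binom{\lceil w/s \rceil - \left(\lceil (m-1)/s \rceil + 1\right)}{\kappa}$$

will use none of the at most $\lceil (m-1)/s \rceil + 1$ bands which may be affected by the adversary. Therefore, the number of classifications which might be affected by the adversary is at most:

$$\binom{\lceil w/s \rceil}{\kappa} - \binom{\lceil w/s \rceil - \left(\lceil (m-1)/s \rceil + 1\right)}{\kappa}.$$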

Full results for multi-block and multi-band smoothing are shown in Table 9.
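The overlap bound used in the column case can be checked by brute force. The sketch below assumes zero-indexed columns and a patch that lies entirely within the image; the function names and the example values (image width 28, band size 4, patch width 5, two retained bands) are hypothetical and chosen only for illustration.

```python
from math import comb


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)


def max_bands_overlapped(w: int, s: int, m: int) -> int:
    """Brute force: the most grid-aligned bands of width s that a width-m patch touches."""
    worst = 0
    for j in range(w - m + 1):  # j is the leftmost column of the patch
        first_band = j // s
        last_band = (j + m - 1) // s
        worst = max(worst, last_band - first_band + 1)
    return worst


def affected_band_selections(w: int, s: int, m: int, kappa: int) -> int:
    """Upper bound on selections of kappa grid-aligned bands that a width-m patch can touch."""
    n_bands = ceil_div(w, s)
    hit_bands = ceil_div(m - 1, s) + 1
    return comb(n_bands, kappa) - comb(max(n_bands - hit_bands, 0), kappa)


# The closed-form overlap bound is never exceeded for these small settings.
for s in range(1, 8):
    for m in range(1, 11):
        assert max_bands_overlapped(28, s, m) <= ceil_div(m - 1, s) + 1

print(affected_band_selections(28, 4, 5, 2))  # 21 - 10 = 11
```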

## Appendix E Architecture and Training Details

As discussed in the paper, we used the method introduced by Levine and Feizi (2019a) to represent images with pixels ablated: this requires increasing the number of input channels from one to two for greyscale images (MNIST) and from three to six for color images. For MNIST, we used the simple CNN architecture from the released code of Levine and Feizi (2019a), consisting of two convolutional layers and three fully-connected layers. For CIFAR-10 and ImageNet, we used modified versions of ResNet-18 and ResNet-50, respectively, with the number of input channels increased to six. Training details are presented in Tables 10, 11 and 13.
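To illustrate why the channel count doubles, the sketch below encodes a greyscale image with a single retained column band. It assumes, as our reading of the encoding in Levine and Feizi (2019a), that a retained pixel with value $v \in [0, 1]$ is represented as the pair $(v, 1 - v)$ and an ablated pixel as $(0, 0)$, so that ablated pixels are distinguishable from black ones. The function name `ablate_band` and the list-of-rows image format are hypothetical, and the band is assumed to wrap around the image edge.

```python
def ablate_band(image, s, c):
    """Keep the width-s column band starting at column c (wrapping); ablate the rest.

    `image` is a list of rows of floats in [0, 1]. Each output pixel is a
    two-channel pair: (v, 1 - v) if retained, (0, 0) if ablated (assumed encoding).
    """
    width = len(image[0])
    kept = {(c + d) % width for d in range(s)}  # retained columns, with wrap-around
    return [
        [(v, 1.0 - v) if col in kept else (0.0, 0.0) for col, v in enumerate(row)]
        for row in image
    ]


# One-row example: the band of width 2 starting at column 3 wraps to column 0.
row_image = [[0.0, 0.25, 0.5, 1.0]]
print(ablate_band(row_image, s=2, c=3))
# [[(0.0, 1.0), (0.0, 0.0), (0.0, 0.0), (1.0, 0.0)]]
```

Under this encoding a retained black pixel becomes $(0, 1)$, which differs from the ablated value $(0, 0)$; for color images the same pairing applied to each of the three channels gives six input channels.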

For randomized smoothing experiments, we follow the empirical estimation methods proposed by (Cohen et al., 2019). We certify to confidence, using 1000 random samples to select the putative top class, and 10000 random samples to lower-bound the probability of this class. For sparse randomized ablation on MNIST, we use released pretrained models from (Levine and Feizi, 2019a).

## Appendix F Proof for Column Smoothing Algorithm (Theorem 2)

For completeness, we include a proof of correctness for the column smoothing algorithm (Algorithm 2). The argument closely follows the proof of Theorem 1.

###### Proof.

Let $(i, j)$ represent the upper-left corner of the patch in which $\mathbf{x}$ and $\mathbf{x}'$ differ. Note that the base classifier's output on the ablated version of $\mathbf{x}$ will be equal to its output on the ablated version of $\mathbf{x}'$, unless the retained band (of width $s$), starting at column $c$, intersects the adversarial patch (starting at column $j$). This condition occurs only when $c$ is in the range between $j - s + 1$ and $j + m - 1$, inclusive. Note that there are $m + s - 1$ values of $c$ which meet this condition. Therefore the two outputs agree in all but $m + s - 1$ cases.

(If $j - s + 1 < 0$, then the intersecting values for $c$, taking into account the wrapping behavior of the Ablate operator, will be $0$ through $j + m - 1$ and $w + j - s + 1$ through $w - 1$: there are still $m + s - 1$ such values.)

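As a consequence, the base classifier's outputs on the two ablated inputs will be equal in all but $m + s - 1$ cases, so the Counts which are incremented will be the same in all but at most $m + s - 1$ (= AffectedBands) iterations. Let $\mathrm{Counts}_{\mathbf{x}}$ and $\mathrm{Counts}_{\mathbf{x}'}$ represent the state of the Counts array after the main loop of the algorithm on inputs $\mathbf{x}$ and $\mathbf{x}'$, respectively, and similarly define , , and . Because, for each class $k$, $\mathrm{Counts}[k]$ is incremented by at most $1$ per iteration, we have that: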

$$\left|\mathrm{Counts}_{\mathbf{x}}[k] - \mathrm{Counts}_{\mathbf{x}'}[k]\right| \le \mathrm{AffectedBands}. \tag{4}$$

If , then by sorting, we have that for all classes , . Then by Equation 4, , and so .

If and then for all , . In the inequality case, we already have that . For the equality case, consider any class for which . Note that and also, by stable sorting of , . Then because , we have that . By Equation 4, . Then, by stable sorting of ,