# Wasserstein Smoothing: Certified Robustness against Wasserstein Adversarial Attacks

In the last couple of years, several adversarial attack methods based on different threat models have been proposed for the image classification problem. Most existing defenses consider additive threat models in which sample perturbations have bounded L_p norms. These defenses, however, can be vulnerable to adversarial attacks under non-additive threat models. An example of an attack method based on a non-additive threat model is the Wasserstein adversarial attack proposed by Wong et al. (2019), where the distance between an image and its adversarial example is determined by the Wasserstein metric ("earth-mover distance") between their normalized pixel intensities. Until now, there has been no certifiable defense against this type of attack. In this work, we propose the first defense with certified robustness against Wasserstein adversarial attacks using randomized smoothing. We develop this certificate by considering the space of possible flows between images, and representing this space such that the Wasserstein distance between images is upper-bounded by the L_1 distance in this flow-space. We can then apply existing randomized smoothing certificates for the L_1 metric. On the MNIST and CIFAR-10 datasets, we find that our proposed defense is also practically effective, demonstrating significantly improved accuracy under Wasserstein adversarial attack compared to unprotected models.


## 1 Introduction

In recent years, adversarial attacks against machine learning systems, and defenses against these attacks, have been heavily studied (Szegedy et al., 2013; Madry et al., 2017; Carlini and Wagner, 2017). Although these attacks have been applied in a variety of domains, image classification tasks remain a major focus of research. In general, for a specified image classifier $f$, the goal of an adversarial attack on an image $x$ is to produce a perturbed image $\tilde{x}$ that is imperceptibly 'close' to $x$, such that $f$ classifies $\tilde{x}$ differently than $x$. This notion of 'closeness' can be measured in a variety of ways under different threat models. Most existing attacks and defenses consider additive threat models where the $L_p$ norm of $\tilde{x} - x$ is bounded.

Recently, non-additive threat models (Wong et al., 2019; Laidlaw and Feizi, 2019; Engstrom et al., 2019; Assion et al., 2019) have been introduced which aim to minimize the distance between $x$ and $\tilde{x}$ according to other metrics. Among these is the attack introduced by Wong et al. (2019), which considers the Wasserstein distance between $x$ and $\tilde{x}$, normalized such that the pixel intensities of each image can be treated as a probability distribution. Informally, the Wasserstein distance between probability distributions $x$ and $\tilde{x}$ measures the minimum cost to 'transport' probability mass in order to transform $x$ into $\tilde{x}$, where the cost scales with both the amount of mass transported and the distance over which it is transported with respect to some underlying metric. The intuition behind this threat model is that shifting pixel intensity a short distance across an image is less perceptible than moving the same amount of pixel intensity a larger distance. (See Figure 1 for an example of a Wasserstein adversarial attack.)

A variety of practical approaches have been proposed to make classifiers robust against adversarial attack, including adversarial training (Madry et al., 2017), defensive distillation (Papernot et al., 2016), and obfuscated gradients (Papernot et al., 2017). However, as new defenses are proposed, new attack methodologies are often rapidly developed which defeat these defences (Tramèr et al., 2017; Athalye et al., 2018; Carlini and Wagner, 2016). While updated defences are often then proposed (Tramèr et al., 2017), in general, we cannot be confident that new attacks will not in turn defeat these defences.

To escape this cycle, approaches have been proposed to develop certifiably robust classifiers (Wong and Kolter, 2018; Gowal et al., 2018; Lecuyer et al., 2019; Li et al., 2018; Cohen et al., 2019; Salman et al., 2019): in these classifiers, for each image $x$, one can calculate a radius $\rho$ such that it is provably guaranteed that any other image $\tilde{x}$ with distance less than $\rho$ from $x$ will be classified similarly to $x$. This means that no adversarial attack can ever be developed which produces adversarial examples to the classifier within the certified radius.

One effective approach to develop certifiably robust classification is to use randomized smoothing with a probabilistic robustness certificate (Lecuyer et al., 2019; Li et al., 2018; Cohen et al., 2019; Salman et al., 2019). In this approach, one uses a smoothed classifier $\bar{f}(x)$, which represents the expectation of $f$ over random perturbations of $x$. Based on this smoothing, one can derive an upper bound on how steeply the scores assigned to each class by $\bar{f}$ can change, which can then be used to derive a radius $\rho$ in which the highest class score must remain highest. (Footnote 1: In practice, samples are used to estimate the expectation $\bar{f}(x)$, producing an empirical smoothed classifier: the certification is therefore probabilistic, with a degree of certainty dependent on the number of samples.)

In this work, we present the first certified defence against Wasserstein adversarial attacks using an adapted randomized smoothing approach, which we call Wasserstein smoothing. To develop the robustness certificate, we define a (non-unique) representation of the difference between two images, based on the flow of pixel intensity necessary to construct one image from another. In this representation, we show that the $L_1$ norm of the minimal flow between two images is equal to the Wasserstein distance between the images. This allows us to apply existing smoothing-based defences, by adding noise in the space of these representations of flows. We show empirically that this gives improved robustness certificates, compared to using a weak upper bound on the Wasserstein distance given by randomized smoothing in the feature space of images directly. We also show that our Wasserstein smoothing defence protects against Wasserstein adversarial attacks empirically, with significantly improved robustness compared to baseline models. For small adversarial perturbations on the MNIST dataset, our method achieves higher accuracy under adversarial attack than all existing practical defences for the Wasserstein threat model.

In summary, we make the following contributions:

• We develop a novel certified defence for the Wasserstein adversarial attack threat model. This is the first certified defence, to our knowledge, that has been proposed for this threat model.

• We demonstrate that our certificate is nonvacuous, in that it can certify Wasserstein radii larger than those which can be certified by exploiting a trivial upper bound on Wasserstein distance.

• We demonstrate that our defence effectively protects against existing Wasserstein adversarial attacks, compared to an unprotected baseline.

## 2 Background

Let $x$ denote a two-dimensional image, of height $n$ and width $m$. We will normalize the image such that $\sum_{i}\sum_{j} x_{i,j} = 1$, so that $x$ can be interpreted as a probability distribution on the discrete support of pixel coordinates of the two-dimensional image. (Footnote 2: In the case of multi-channel color images, the attack proposed by Wong et al. (2019) does not transport pixel intensity between channels. This allows us to defend against these attacks using our 2D Wasserstein smoothing with little modification. See Section 6.3, and Corollary 2 in the appendix.) Following the notation of Wong et al. (2019), we define the p-Wasserstein distance between $x$ and $x'$ as:

###### Definition 2.1.

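Given two distributions $x, x'$, and a distance metric $d$ on pixel positions, the p-Wasserstein distance is defined as:

$$W_p(x, x') = \min_{\Pi \in \mathbb{R}_{+}^{(n \cdot m) \times (n \cdot m)}} \langle \Pi, C \rangle, \tag{1}$$

$$\Pi \mathbf{1} = x, \quad \Pi^{T} \mathbf{1} = x', \quad C_{(i,j),(i',j')} := \left[ d\big((i,j),(i',j')\big) \right]^{p}.$$

$C_{(i,j),(i',j')}$ is the cost of transporting a mass unit from the position $(i,j)$ to $(i',j')$ in the image.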

Note that, for the purpose of matrix multiplication, we are treating $x, x'$ as vectors of length $nm$. Similarly, the transport plan matrix $\Pi$ and the cost matrix $C$ are in $\mathbb{R}^{nm \times nm}$.

Intuitively, $\Pi_{(i,j),(i',j')}$ represents the amount of probability mass to be transported from pixel $(i,j)$ to $(i',j')$, while $C_{(i,j),(i',j')}$ represents the cost per unit probability mass to transport this probability. We can choose $d$ to be any measure of distance between pixel positions in an image. For example, in order to represent the $L_1$ distance metric between pixel positions, we can choose:

$$d\big((i,j),(i',j')\big) = |i - i'| + |j - j'|. \tag{2}$$

Moreover, to represent the $L_2$ distance metric between pixel positions, we can choose:

$$d\big((i,j),(i',j')\big) = \sqrt{(i - i')^2 + (j - j')^2}. \tag{3}$$

Our defence directly applies to the 1-Wasserstein metric using the $L_1$ distance as the metric $d$, while the attack developed by Wong et al. (2019) uses the $L_2$ distance. However, because images are two-dimensional, these differ by at most a constant factor of $\sqrt{2}$, so we adapt our certificates to the setting of Wong et al. (2019) by simply scaling our certificates by $1/\sqrt{2}$. All experimental results will be presented with this scaling. We emphasize that this is not the distinction between 1-Wasserstein and 2-Wasserstein distances: this paper uses the 1-Wasserstein metric, to match the majority of the experimental results of Wong et al. (2019).
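The constant-factor relationship between the two ground metrics can be checked numerically. The following is a minimal sketch (the helper names `l1`, `l2` and `max_l1_over_l2` are illustrative and not taken from the original work) that enumerates all pairs of distinct pixel positions on a small grid and reports the largest ratio of $L_1$ to $L_2$ distance.

```python
import math
from itertools import product


def l1(p, q):
    """L1 (Manhattan) distance between pixel positions p and q, as in Equation 2."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def l2(p, q):
    """L2 (Euclidean) distance between pixel positions p and q, as in Equation 3."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def max_l1_over_l2(n, m):
    """Largest ratio L1/L2 over all pairs of distinct positions in an n x m grid."""
    pixels = list(product(range(n), range(m)))
    return max(l1(p, q) / l2(p, q) for p in pixels for q in pixels if p != q)


if __name__ == "__main__":
    # The maximum is attained on diagonal displacements and equals sqrt(2).
    print(max_l1_over_l2(5, 5), math.sqrt(2))
```

Because the cost of every unit of transported mass changes by at most this factor, a certificate in one ground metric converts to the other by a factor of at most $\sqrt{2}$.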

To develop our certificate, we rely on an alternative linear program formulation for the 1-Wasserstein distance on a two-dimensional image with the $L_1$ distance metric, provided by Ling and Okada (2007):

$$W_1(x, x') = \min_{g} \sum_{(i,j)} \sum_{(i',j') \in \mathcal{N}(i,j)} g_{(i,j),(i',j')} \tag{4}$$

where $g \geq 0$ and, for all $(i,j)$,

$$\sum_{(i',j') \in \mathcal{N}(i,j)} g_{(i,j),(i',j')} - g_{(i',j'),(i,j)} = x'_{i,j} - x_{i,j}$$

Here, $\mathcal{N}(i,j)$ denotes the (up to) four immediate (non-diagonal) neighbors of the position $(i,j)$; in other words, $\mathcal{N}(i,j) = \{(i',j') : |i - i'| + |j - j'| = 1\}$. For the $L_1$ distance in two dimensions, Ling and Okada (2007) prove that this formulation is in fact equivalent to the linear program given in Equation 1. Note that only elements of $g$ with $(i',j') \in \mathcal{N}(i,j)$ need to be defined: this means that the number of variables in the linear program is approximately $4nm$, compared to the $n^2 m^2$ elements of $\Pi$ in Equation 1. While this was originally used to make the linear program more tractable to be solved directly, we exploit the form of this linear program to devise a randomized smoothing scheme in the next section.
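To make the difference in problem size concrete, the sketch below counts the variables of the neighbor-based linear program and of the full transport plan for a 28 x 28 grid (the size of an MNIST image). The function names `neighbors` and `count_local_variables` are hypothetical helpers introduced only for this illustration.

```python
def neighbors(i, j, n, m):
    """The (up to) four non-diagonal neighbors of (i, j) inside an n x m grid."""
    candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(a, b) for a, b in candidates if 0 <= a < n and 0 <= b < m]


def count_local_variables(n, m):
    """Number of directed neighbor pairs, i.e. variables g in Equation 4."""
    return sum(len(neighbors(i, j, n, m)) for i in range(n) for j in range(m))


if __name__ == "__main__":
    n, m = 28, 28
    local_vars = count_local_variables(n, m)  # slightly below 4 * n * m
    full_vars = (n * m) ** 2                  # entries of the transport plan
    print(local_vars, 4 * n * m, full_vars)
```

The exact count is $2[(n-1)m + n(m-1)]$, which falls slightly below $4nm$ because border pixels have fewer than four neighbors.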

## 3 Robustness Certificate

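In order to present our robustness certificate, we first introduce some notation. Let $\delta = \{\delta^{\text{vert.}}, \delta^{\text{horiz.}}\}$ denote a local flow plan, with $\delta^{\text{vert.}} \in \mathbb{R}^{(n-1) \times m}$ and $\delta^{\text{horiz.}} \in \mathbb{R}^{n \times (m-1)}$. It specifies a net flow between adjacent pixels in an image $x$, which, when applied, transforms $x$ to a new image $x'$. See Figure 2 for an explanation of the indexing. For compactness, we treat $\delta$ as a single vector with $(n-1)m + n(m-1) \approx 2nm$ components, and in general refer to the space of possible local flow plans as the flow domain. We define the function $\Delta$, which applies a local flow to a distribution.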

###### Definition 3.1.

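The local flow plan application function $\Delta$ is defined as:

$$\Delta(x, \delta)_{i,j} = x_{i,j} + \delta^{\text{vert.}}_{i-1,j} - \delta^{\text{vert.}}_{i,j} + \delta^{\text{horiz.}}_{i,j-1} - \delta^{\text{horiz.}}_{i,j} \tag{5}$$

where we let $\delta^{\text{vert.}}_{0,j} = \delta^{\text{vert.}}_{n,j} = \delta^{\text{horiz.}}_{i,0} = \delta^{\text{horiz.}}_{i,m} = 0$. (Footnote 3: Note that the new image $\Delta(x, \delta)$ is not necessarily a probability distribution because it may have negative components. However, note that normalization is preserved: $\sum_{i}\sum_{j} \Delta(x, \delta)_{i,j} = \sum_{i}\sum_{j} x_{i,j} = 1$. This is because every component of $\delta$ is added once and subtracted once to elements in $x$.)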

Note that local flow plans are additive:

$$\Delta(\Delta(x, \delta), \delta') = \Delta(x, \delta + \delta') \tag{6}$$
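As an illustration of Equations 5 and 6, the following is a minimal pure-Python sketch of the flow application function using zero-based indices. The names `apply_flow` and `add_flows`, and the nested-list layout of the flow plan, are assumptions made for this example rather than part of the original work.

```python
def apply_flow(x, d_vert, d_horiz):
    """Apply a local flow plan to an n x m image (Equation 5, zero-based).

    d_vert has shape (n-1) x m: d_vert[i][j] is the flow from (i, j) to (i+1, j).
    d_horiz has shape n x (m-1): d_horiz[i][j] is the flow from (i, j) to (i, j+1).
    Flows that would cross the image border are treated as zero.
    """
    n, m = len(x), len(x[0])

    def v(i, j):
        return d_vert[i][j] if 0 <= i < n - 1 else 0.0

    def h(i, j):
        return d_horiz[i][j] if 0 <= j < m - 1 else 0.0

    return [
        [x[i][j] + v(i - 1, j) - v(i, j) + h(i, j - 1) - h(i, j) for j in range(m)]
        for i in range(n)
    ]


def add_flows(a, b):
    """Component-wise sum of two flow components of identical shape."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


if __name__ == "__main__":
    x = [[0.5, 0.5], [0.0, 0.0]]
    v1, h1 = [[0.5, 0.0]], [[0.0], [0.0]]      # move 0.5 down from pixel (0, 0)
    v2, h2 = [[0.0, 0.25]], [[0.125], [0.0]]   # a second, arbitrary flow
    two_steps = apply_flow(apply_flow(x, v1, h1), v2, h2)
    one_step = apply_flow(x, add_flows(v1, v2), add_flows(h1, h2))
    print(two_steps, one_step)                 # identical, per Equation 6
    print(sum(map(sum, two_steps)))            # total mass is still 1
```

In this example the second flow drives one pixel negative, which illustrates that the result need not be a probability distribution even though its total mass is unchanged.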

Using this notation, we make a simple transformation of the linear program given in Equation 4, removing the positivity constraint from the variables and reducing the number of variables to approximately $2nm$:

###### Lemma 1.

For any normalized probability distributions $x, x'$:

$$W_1(x, x') = \min_{\delta \,:\, x' = \Delta(x, \delta)} \|\delta\|_1 \tag{7}$$

where $W_1$ denotes the 1-Wasserstein metric, using the $L_1$ distance as the underlying distance metric $d$.

In other words, we can upper-bound the Wasserstein distance between two images using the $L_1$ norm of any feasible local flow plan between the two images. This enables us to extend existing results for smoothing-based certificates (Lecuyer et al., 2019) to the Wasserstein metric, by adding noise in the flow domain.

###### Definition 3.2.

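We denote by $\mathcal{L}(\sigma)$ the Laplace noise with parameter $\sigma$ in the flow domain of dimension $(n-1)m + n(m-1)$.

Given a classification score function $f$, we define $\bar{f}$ as the Wasserstein-smoothed classification function as follows:

$$\bar{f}(x) = \mathbb{E}_{\delta \sim \mathcal{L}(\sigma)} \left[ f(\Delta(x, \delta)) \right]. \tag{8}$$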

Let $i$ be the class assignment of $x$ using the Wasserstein-smoothed classifier (i.e. $i = \arg\max_{i'} \bar{f}_{i'}(x)$).
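The expectation in Equation 8 can be approximated by Monte Carlo sampling. The sketch below is illustrative only: it assumes that $\sigma$ is the standard deviation of each Laplace component (so the Laplace scale is $\sigma/\sqrt{2}$, consistent with the factor $2\sqrt{2}\rho/\sigma$ in the certificate), and `toy_classifier` is a hypothetical stand-in for a real base classifier.

```python
import math
import random


def apply_flow(x, d_vert, d_horiz):
    """Equation 5 with zero-based indices; flows across the border are zero."""
    n, m = len(x), len(x[0])
    v = lambda i, j: d_vert[i][j] if 0 <= i < n - 1 else 0.0
    h = lambda i, j: d_horiz[i][j] if 0 <= j < m - 1 else 0.0
    return [[x[i][j] + v(i - 1, j) - v(i, j) + h(i, j - 1) - h(i, j)
             for j in range(m)] for i in range(n)]


def sample_laplace(rng, scale):
    """The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale)."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def smoothed_scores(f, x, sigma, num_samples, rng):
    """Monte Carlo estimate of the Wasserstein-smoothed scores (Equation 8)."""
    n, m = len(x), len(x[0])
    scale = sigma / math.sqrt(2.0)  # assumption: sigma is the standard deviation
    totals = None
    for _ in range(num_samples):
        d_vert = [[sample_laplace(rng, scale) for _ in range(m)] for _ in range(n - 1)]
        d_horiz = [[sample_laplace(rng, scale) for _ in range(m - 1)] for _ in range(n)]
        scores = f(apply_flow(x, d_vert, d_horiz))
        totals = list(scores) if totals is None else [t + s for t, s in zip(totals, scores)]
    return [t / num_samples for t in totals]


def toy_classifier(img):
    """Hypothetical hard classifier: class 0 if the top half holds at least half the mass."""
    half = len(img) // 2
    top = sum(sum(row) for row in img[:half])
    bottom = sum(sum(row) for row in img[half:])
    return [1.0, 0.0] if top >= bottom else [0.0, 1.0]


if __name__ == "__main__":
    rng = random.Random(0)
    x = [[0.5, 0.5], [0.0, 0.0]]
    print(smoothed_scores(toy_classifier, x, sigma=0.1, num_samples=200, rng=rng))
```

Note that horizontal flows never change the top-half total used by this toy classifier, since they only move mass within a row.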

###### Theorem 1.

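For any normalized probability distribution $x$, if

$$\bar{f}_i(x) \geq e^{2\sqrt{2}\rho/\sigma} \max_{i' \neq i} \bar{f}_{i'}(x) \tag{9}$$

then for any perturbed probability distribution $\tilde{x}$ such that $W_1(x, \tilde{x}) \leq \rho$, we have:

$$\bar{f}_i(\tilde{x}) \geq \max_{i' \neq i} \bar{f}_{i'}(\tilde{x}). \tag{10}$$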

All proofs are presented in the appendix.
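Rearranging the condition of Theorem 1 gives the certified radius explicitly. The short derivation below is a sketch that only solves Equation 9 for $\rho$; it assumes $\bar{f}_i(x) > \max_{i' \neq i} \bar{f}_{i'}(x) > 0$, so that the logarithm is well defined and positive.

$$\bar{f}_i(x) \geq e^{2\sqrt{2}\rho/\sigma} \max_{i' \neq i} \bar{f}_{i'}(x) \quad\Longleftrightarrow\quad \rho \leq \frac{\sigma}{2\sqrt{2}} \ln \frac{\bar{f}_i(x)}{\max_{i' \neq i} \bar{f}_{i'}(x)}$$

For instance, if the top smoothed score is $e$ times the runner-up score, the largest radius satisfying the condition is $\sigma / (2\sqrt{2}) \approx 0.354\,\sigma$.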

## 4 Intuition: One-Dimensional Case

To provide an intuition about the proposed Wasserstein smoothing certified robustness scheme, we consider a simplified model, in which the support of $x$ is a one-dimensional array of length $n$, rather than a two-dimensional grid (i.e. $x \in \mathbb{R}^{n}$). In this case, we can denote a local flow plan $\delta \in \mathbb{R}^{n-1}$, so that for $x' = \Delta(x, \delta)$:

$$x'_i = x_i + \delta_{i-1} - \delta_i \tag{11}$$

where $\delta_0 = \delta_n = 0$. In this one-dimensional case, for any fixed $x, x'$ (with the normalization constraint that $\sum_i x_i = \sum_i x'_i = 1$), there is a unique solution $\delta$ to $x' = \Delta(x, \delta)$:

$$\delta_i = \sum_{j=1}^{i} x_j - \sum_{j=1}^{i} x'_j \tag{12}$$

Note that this is reminiscent of a well-known identity describing optimal transport between two distributions which share a continuous, one-dimensional support (see section 2.6 of Peyré et al. (2019), for example):

$$W_1(X, Y) = \int_{-\infty}^{\infty} |F_X(z) - F_Y(z)| \, dz \tag{13}$$

where $F_X, F_Y$ denote cumulative distribution functions. If we apply this result to our discretized case, with the index $i$ taking the place of $z$, and apply the identity to $x$ and $x'$, this becomes:

$$W_1(x, x') = \sum_{i=1}^{n} \left| \sum_{j=1}^{i} x_j - \sum_{j=1}^{i} x'_j \right| = \sum_{i=1}^{n} |\delta_i| = \|\delta\|_1 \tag{14}$$
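The one-dimensional identities above are straightforward to verify in code. The following minimal sketch computes the unique flow of Equation 12 from cumulative sums and the resulting 1-Wasserstein distance of Equation 14; the function names `flow_1d`, `wasserstein_1d` and `apply_flow_1d` are hypothetical.

```python
from itertools import accumulate


def flow_1d(x, x_new):
    """Unique flow delta_1..delta_{n-1} transforming x into x_new (Equation 12)."""
    cum_x = list(accumulate(x))
    cum_new = list(accumulate(x_new))
    # delta_n is omitted: it is zero whenever x and x_new have equal total mass.
    return [a - b for a, b in zip(cum_x[:-1], cum_new[:-1])]


def wasserstein_1d(x, x_new):
    """1-Wasserstein distance as the L1 norm of the unique flow (Equation 14)."""
    return sum(abs(d) for d in flow_1d(x, x_new))


def apply_flow_1d(x, delta):
    """Equation 11, with the boundary convention delta_0 = delta_n = 0."""
    padded = [0.0] + list(delta) + [0.0]
    return [x[k] + padded[k] - padded[k + 1] for k in range(len(x))]


if __name__ == "__main__":
    x, x_new = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]
    delta = flow_1d(x, x_new)
    # One unit of mass is moved two positions to the right.
    print(delta, wasserstein_1d(x, x_new), apply_flow_1d(x, delta))
```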

By the uniqueness of the solution given in Equation 12, for any $x$, we can define $\delta_x$ as the solution to $x = \Delta(r, \delta_x)$, where $r$ is an arbitrary fixed reference distribution. Therefore, instead of operating on the images $x, x'$ directly, we can equivalently operate on $\delta_x$ and $\delta_{x'}$ in the flow domain instead. We will therefore define a flow-domain version of our classifier $f$:

$$f^{\text{flow}}(\delta) := f(\Delta(r, \delta)). \tag{15}$$

We will now perform classification entirely in the flow-domain, by first calculating $\delta_x$ and then using $f^{\text{flow}}$ as our classifier. Now, consider $x$ and an adversarial perturbation $\tilde{x}$, and let $\delta$ be the unique solution to $\tilde{x} = \Delta(x, \delta)$. By Equation 14, $\|\delta\|_1 = W_1(x, \tilde{x})$. Then:

$$\tilde{x} = \Delta(x, \delta) = \Delta(\Delta(r, \delta_x), \delta) = \Delta(r, \delta_x + \delta) \tag{16}$$

where the last equality is by Equation 6. Moreover, by the uniqueness of Equation 12, $\delta_{\tilde{x}} = \delta_x + \delta$, or $\delta = \delta_{\tilde{x}} - \delta_x$. Therefore

$$\|\delta_{\tilde{x}} - \delta_x\|_1 = W_1(x, \tilde{x}). \tag{17}$$

In other words, if we classify in the flow-domain, using $f^{\text{flow}}$, the $L_1$ distance between the points $\delta_x$ and $\delta_{\tilde{x}}$ is the Wasserstein distance between the distributions $x$ and $\tilde{x}$. Then, we can perform smoothing in the flow-domain, and use the existing $L_1$ robustness certificate provided by Lecuyer et al. (2019), to certify robustness. Extending this argument to two-dimensional images adds some complication: images can no longer be represented uniquely in the flow domain, and the relationship between $L_1$ distance and the Wasserstein distance is now an upper bound. Nevertheless, the same conclusion still holds for 2D images as we state in Theorem 1. Proofs for the two-dimensional case are given in the appendix.

## 5 Practical Certification Scheme

To generate probabilistic robustness certificates from randomly sampled evaluations of the base classifier $f$, we adapt the procedure outlined by Cohen et al. (2019) for $L_2$ certificates. We consider a hard smoothed classifier approach: we set $f_i(x) = 1$ if the base classifier selects class $i$ at point $x$, and $f_i(x) = 0$ otherwise. We also use a stricter form of the condition given as Equation 9:

$$\bar{f}_i(x) \geq e^{2\sqrt{2}\rho/\sigma} \left(1 - \bar{f}_i(x)\right) \tag{18}$$

This means that we only need to provide a probabilistic lower bound of the expectation of the largest class score, rather than bounding every class score. This reduces the number of samples necessary to estimate a high-confidence lower bound on $\bar{f}_i(x)$, and therefore to estimate the certificate with high confidence. Cohen et al. (2019) provide a statistically sound procedure for this, which we use: refer to that paper for details. Note that, when simply evaluating the classification given by $\bar{f}(x)$, we will also need to approximate $\bar{f}$ using random samples. Cohen et al. (2019) also provide a method to do this which yields the expected classification with high confidence, but may abstain from classifying. We will also use this method when evaluating accuracies.
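The abstaining prediction rule can be sketched as follows. This is an illustration based on our reading of the prediction procedure of Cohen et al. (2019), which compares the two most frequent classes with a two-sided binomial test; the function names and the list-of-labels input format are assumptions made for this example, and the base classifier evaluations are taken as given.

```python
from collections import Counter
from math import comb


def binom_two_sided_pvalue(k, n):
    """Two-sided p-value for k successes in n trials of a fair coin."""
    k = max(k, n - k)
    tail = sum(comb(n, j) for j in range(k, n + 1))
    # Integer true division stays accurate even when the integers are huge.
    return min(1.0, 2 * tail / 2 ** n)


def predict(votes, alpha):
    """Return the majority class, or None (abstain) if the lead is not significant.

    votes: class labels returned by the base classifier on noised samples.
    alpha: allowed probability of returning a class other than the true top class.
    """
    counts = Counter(votes).most_common(2)
    top_label, n_a = counts[0]
    n_b = counts[1][1] if len(counts) > 1 else 0
    if binom_two_sided_pvalue(n_a, n_a + n_b) <= alpha:
        return top_label
    return None


if __name__ == "__main__":
    print(predict([0] * 90 + [1] * 10, alpha=0.05))  # clear majority
    print(predict([0] * 52 + [1] * 48, alpha=0.05))  # too close to call
```

Abstaining when the top two counts are statistically indistinguishable is what allows the returned class to be trusted with high confidence.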

Since the Wasserstein adversarial attack introduced by Wong et al. (2019) uses the $L_2$ distance metric, to have a fair performance evaluation against this attack, we are interested in certifying a radius in the 1-Wasserstein distance with underlying $L_2$ distance metric, rather than $L_1$. Let us denote this radius as $\rho_2$. In two-dimensional images, the elements of the cost matrix $C$ in this metric may be smaller by up to a factor of $\sqrt{2}$, so we have:

$$\rho_2 \geq \frac{1}{\sqrt{2}} \rho \tag{19}$$

Therefore, by certifying to a radius of $\rho = \sqrt{2}\rho_2$, we can effectively certify against the $L_2$ metric 1-Wasserstein attacks of radius $\rho_2$; our condition then becomes:

$$\bar{f}_i(x) \geq e^{4\rho_2/\sigma} \left(1 - \bar{f}_i(x)\right). \tag{20}$$
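Solving Equation 20 for $\rho_2$ gives a closed-form certified radius. The sketch below assumes that a high-confidence lower bound `p_lower` on $\bar{f}_i(x)$ has already been obtained by a separate statistical procedure (not shown here); the function name `certified_radius` is hypothetical.

```python
import math


def certified_radius(p_lower, sigma):
    """Largest rho_2 satisfying Equation 20, given a lower bound on the top class score.

    Returns 0.0 when p_lower <= 0.5, since the condition then fails for any
    positive radius.
    """
    if p_lower <= 0.5:
        return 0.0
    if p_lower >= 1.0:
        return math.inf
    return (sigma / 4.0) * math.log(p_lower / (1.0 - p_lower))


if __name__ == "__main__":
    for p in (0.6, 0.9, 0.99):
        print(p, certified_radius(p, sigma=0.02))
```

The radius grows linearly in $\sigma$ for a fixed score, but in practice a larger $\sigma$ also lowers the score that the base classifier can achieve on noised inputs.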

## 6 Experimental Results

In all experiments, we use 10,000 random noised samples to predict the smoothed classification of each image; to generate certificates, we first use 1,000 samples to infer which class has the highest smoothed score, and then 10,000 samples to lower-bound this score. All probabilistic certificates and classifications are reported to confidence. The model architectures used for the base classifiers for each data set are the same as used in Wong et al. (2019). When reporting results, median certified accuracy refers to the maximum radius $\rho_2$ such that at least 50% of classifications for images in the data set are certified to be robust to at least this radius, and these certificates are for the correct ground truth class. If over 50% of images are not certified for the correct class, this statistic is reported as .

### 6.1 Comparison to naive Laplace Smoothing

Note that one can derive a trivial but sometimes tight bound: under any $L_p$ distance metric, if $W_1(x, x') \leq \rho$, then $\|x - x'\|_1 \leq 2\rho$. (See Corollary 1 in the appendix.) This enables us to write a condition for $\rho_2$-radius Wasserstein certified robustness by applying Laplace smoothing directly, and simply converting the certificate. In our notation, this condition is:

$$\bar{f}^{\text{Laplace}}_i(x) \geq e^{4\sqrt{2}\rho_2/\sigma} \left(1 - \bar{f}^{\text{Laplace}}_i(x)\right) \tag{21}$$

where $\bar{f}^{\text{Laplace}}$ is a smoothed classifier with Laplace noise added to every pixel independently. It may appear as if our Wasserstein-smoothed bound should only be an improvement over this bound by a factor of $\sqrt{2}$ in the certified radius $\rho_2$. However, as shown in Table 1, we in fact improve our certificates by a larger factor. This is because, for a fixed noise standard deviation, the base classifier is able to achieve a higher accuracy after adding noise in the flow-domain, compared to adding noise directly to the pixels. When adding noise in the flow-domain, we add and subtract noise in equal amounts between adjacent pixels, preserving more information for the base classifier.
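The factor of $\sqrt{2}$ can be traced through the exponents. The following is a sketch of that bookkeeping under two assumptions that are consistent with Equations 18, 20 and 21: pixel-space Laplace smoothing certifies an $L_1$ radius $\epsilon$ with a condition of the same form as Equation 18, and a Wasserstein perturbation of radius $\rho_2$ changes the image by at most $\epsilon = 2\rho_2$ in $L_1$ norm.

$$e^{2\sqrt{2}\epsilon/\sigma} \Big|_{\epsilon = 2\rho_2} = e^{4\sqrt{2}\rho_2/\sigma}, \qquad \rho_2^{\text{Laplace}} = \frac{\sigma}{4\sqrt{2}} \ln \frac{p}{1-p}, \qquad \rho_2^{\text{Wasserstein}} = \frac{\sigma}{4} \ln \frac{p}{1-p}$$

Here $p$ stands for the smoothed score of the top class. At an equal score $p$ the two radii differ by exactly $\sqrt{2}$, so any larger improvement must come from the Wasserstein-smoothed classifier reaching a higher $p$ at the same $\sigma$.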

To give a concrete example, consider some square patch of an image. Suppose that the overall aggregate pixel intensity in this patch (i.e. the sum of the pixel values) is a salient feature for classification. (This is a highly plausible situation: for example, in MNIST, this may indicate whether or not some region of an image is occupied by part of a digit.) Let us call this feature the patch sum, and calculate its variance in smoothing samples under Laplace and Wasserstein smoothing, both with variance $\sigma^2$. Under Laplace smoothing (Figure 4-a), an independent instance of Laplace noise is added to every pixel in the patch, so the resulting variance is proportional to the area of the region. In the case of Wasserstein smoothing, by contrast, probability mass exchanged between pixels in the interior of the patch has no effect on the patch sum. Instead, only noise on the perimeter will affect the total feature value: the variance is therefore proportional to the perimeter of the region (Figure 4-b). Wasserstein smoothing then reduces the effective noise variance on the feature by a factor proportional to the ratio of the patch's area to its perimeter.
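As a worked instance of this argument, suppose (purely for illustration) that the patch is $k \times k$ pixels and does not touch the image border, and write $S$ for its pixel sum; both $k$ and $S$ are symbols introduced here. Under Laplace smoothing, $k^2$ independent noise terms of variance $\sigma^2$ are added to $S$. Under Wasserstein smoothing, only the $4k$ flow components that cross the patch boundary change $S$, each contributing one independent term of variance $\sigma^2$.

$$\operatorname{Var}_{\text{Laplace}}(S) = k^2 \sigma^2, \qquad \operatorname{Var}_{\text{Wasserstein}}(S) = 4k\,\sigma^2, \qquad \frac{\operatorname{Var}_{\text{Laplace}}(S)}{\operatorname{Var}_{\text{Wasserstein}}(S)} = \frac{k}{4}$$

Under these assumptions, an $8 \times 8$ patch sees half the noise variance on $S$ under Wasserstein smoothing, and the advantage grows linearly with the patch side length.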

We measure the performance of our smoothed classifier against the Wasserstein-metric adversarial attack proposed in Wong et al. (2019), and compare to models tested in that work. Results are presented in Figure 5. For testing, we use the same attack parameters as in Wong et al. (2019): the 'Standard' and 'Adversarial Training' results are therefore replications of the experiments from that paper, using the publicly available code and pretrained models.

In order to attack our hard smoothed classifier, we adapt the method proposed by Salman et al. (2019): in particular, note that we cannot directly calculate the gradient of the classification loss with respect to the image for a hard smoothed classifier, because the derivatives of the logits of the base classifier are not propagated. Therefore, we must instead attack a soft smoothed classifier: we take the expectation over samples of the softmaxed logits of the base classifier, instead of the final classification output. In each step of the attack, we use 128 noised samples to estimate this gradient, as used in Salman et al. (2019).

In the attack proposed by Wong et al. (2019), the images are attacked over 200 iterations of projected gradient descent, projected onto a Wasserstein ball, with the radius of the ball increased every 10 iterations. The attack succeeds, and the final radius is recorded, once the classifier misclassifies the image. In order to preserve as much of the structure (and code) of the attack as possible to provide a fair comparison, it is thus necessary for us to evaluate each image using our hard classifier, with the full 10,000 smoothing samples, at each iteration of the attack. We count the classifier abstaining as a misclassification for these experiments. Note that this may somewhat underestimate the true robustness of our classifier: recall that our classifier is nondeterministic; therefore, because we are repeatedly evaluating the classifier and reporting a perturbed image as adversarial the first time it is misclassified, we may tend to over-count misclassifications. Because we are using a large number of noise samples to generate our classifications, this is only likely to happen with examples which are close to being adversarial. Still, the presented data should be regarded as a lower bound on the true accuracy under attack of our Wasserstein smoothed classifier.

In Figure 5, we note two things: first, our Wasserstein smoothing technique appears to be an effective empirical defence against Wasserstein adversarial attacks, compared to an unprotected ('Standard') network. (It is also more robust than the binarized and $L_\infty$-robust models tested by Wong et al. (2019): see appendix.) However, for large perturbations, our defence is less effective than the adversarial training defence proposed by Wong et al. (2019). This suggests a promising direction for future work: Salman et al. (2019) proposed an adversarial training method for smoothed classifiers, which could be applied in this case. Note however that both Wasserstein adversarial attacks and smoothed adversarial training are computationally expensive, so this may require significant computational resources.

Second, the median radius of attack to which our smoothed classifier is empirically robust is larger than the median certified robustness of our smoothed classifier by two orders of magnitude. This calls for future work both to develop improved robustness certificates and to develop more effective attacks in the Wasserstein metric.

### 6.3 Experiments on color images (CIFAR-10)

Wong et al. (2019) also apply their attack to color images in CIFAR-10. In this case, the attack does not transport probability mass between color channels: therefore, in our defence, it is sufficient to add noise in the flow domain to each channel independently to certify robustness. (See Corollary 2 in the appendix for a proof of the validity of this method.) Certificates are presented in Table 2, while empirical robustness is shown in Figure 6. Again, we compare directly to models from Wong et al. (2019). We note that again, empirically, our model significantly outperforms an unprotected model, but is not as robust as a model trained adversarially. We also note that the certified robustness is orders of magnitude smaller than that computed for MNIST: however, the unprotected model is also significantly less robust empirically than the equivalent MNIST model.

## 7 Conclusion

In this paper, we developed a smoothing-based certifiably robust defence for Wasserstein-metric adversarial examples. To do this, we add noise in the space of possible flows of pixel intensity between images. To our knowledge, this is the first certified defence method specifically tailored to the Wasserstein threat model. Our method proves to be an effective practical defence against Wasserstein adversarial attacks, with significantly improved empirical adversarial robustness compared to a baseline model.

## References

• F. Assion, P. Schlicht, F. Greßner, W. Gunther, F. Huger, N. Schmidt, and U. Rasheed (2019) The attack generator: a systematic approach towards constructing adversarial attacks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0.
• A. Athalye, N. Carlini, and D. Wagner (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In International Conference on Machine Learning, pp. 274–283.
• N. Carlini and D. Wagner (2016) Defensive distillation is not robust to adversarial examples. arXiv preprint arXiv:1607.04311.
• N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57.
• J. M. Cohen, E. Rosenfeld, and J. Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. arXiv preprint arXiv:1902.02918.
• L. Engstrom, B. Tran, D. Tsipras, L. Schmidt, and A. Madry (2019) Exploring the landscape of spatial robustness. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 1802–1811.
• S. Gowal, K. Dvijotham, R. Stanforth, R. Bunel, C. Qin, J. Uesato, T. Mann, and P. Kohli (2018) On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715.
• C. Laidlaw and S. Feizi (2019) Functional adversarial attacks. arXiv preprint arXiv:1906.00001.
• M. Lecuyer, V. Atlidakis, R. Geambasu, D. Hsu, and S. Jana (2019) Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy (SP), Los Alamitos, CA, USA, pp. 726–742.
• B. Li, C. Chen, W. Wang, and L. Carin (2018) Second-order adversarial attack and certifiable robustness. arXiv preprint arXiv:1809.03113.
• H. Ling and K. Okada (2007) An efficient earth mover's distance algorithm for robust histogram comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (5), pp. 840–853.
• A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
• N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami (2017) Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pp. 506–519.
• N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami (2016) Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597.
• G. Peyré, M. Cuturi, et al. (2019) Computational optimal transport. Foundations and Trends® in Machine Learning 11 (5-6), pp. 355–607.
• H. Salman, G. Yang, J. Li, P. Zhang, H. Zhang, I. Razenshteyn, and S. Bubeck (2019) Provably robust deep learning via adversarially trained smoothed classifiers. arXiv preprint arXiv:1906.04584.
• C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
• F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel (2017) Ensemble adversarial training: attacks and defenses. arXiv preprint arXiv:1705.07204.
• E. Wong and Z. Kolter (2018) Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pp. 5283–5292.
• E. Wong, F. Schmidt, and Z. Kolter (2019) Wasserstein adversarial examples via projected Sinkhorn iterations. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 6808–6817.

## References

• F. Assion, P. Schlicht, F. Greßner, W. Gunther, F. Huger, N. Schmidt, and U. Rasheed (2019) The attack generator: a systematic approach towards constructing adversarial attacks. In

Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops

,
pp. 0–0. Cited by: §1.
• A. Athalye, N. Carlini, and D. Wagner (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. In International Conference on Machine Learning, pp. 274–283. Cited by: §1.
• N. Carlini and D. Wagner (2016) Defensive distillation is not robust to adversarial examples. arXiv preprint arXiv:1607.04311. Cited by: §1.
• N. Carlini and D. Wagner (2017)

Towards evaluating the robustness of neural networks

.
In 2017 IEEE Symposium on Security and Privacy (SP), pp. 39–57. Cited by: §1.
• J. M. Cohen, E. Rosenfeld, and J. Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. arXiv preprint arXiv:1902.02918. Cited by: §1, §1, §5.
• L. Engstrom, B. Tran, D. Tsipras, L. Schmidt, and A. Madry (2019) Exploring the landscape of spatial robustness. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 1802–1811. Cited by: §1.
• S. Gowal, K. Dvijotham, R. Stanforth, R. Bunel, C. Qin, J. Uesato, T. Mann, and P. Kohli (2018) On the effectiveness of interval bound propagation for training verifiably robust models. arXiv preprint arXiv:1810.12715. Cited by: §1.
• C. Laidlaw and S. Feizi (2019) Functional adversarial attacks. arXiv preprint arXiv:1906.00001. Cited by: §1.
• M. Lecuyer, V. Atlidakis, R. Geambasu, D. Hsu, and S. Jana (2019) Certified robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on Security and Privacy (SP), Los Alamitos, CA, USA, pp. 726–742. External Links: ISSN 2375-1207. Cited by: Appendix A, Appendix C, §1, §1, §3, §4.
• B. Li, C. Chen, W. Wang, and L. Carin (2018) Second-order adversarial attack and certifiable robustness. arXiv preprint arXiv:1809.03113. Cited by: §1, §1.
• H. Ling and K. Okada (2007) An efficient earth mover’s distance algorithm for robust histogram comparison. IEEE transactions on pattern analysis and machine intelligence 29 (5), pp. 840–853. Cited by: Appendix A, §2.
• A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §1, §1.
• N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami (2017) Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security, pp. 506–519. Cited by: §1.
• N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami (2016) Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pp. 582–597. Cited by: §1.
• G. Peyré, M. Cuturi, et al. (2019) Computational optimal transport. Foundations and Trends® in Machine Learning 11 (5-6), pp. 355–607. Cited by: §4.
• H. Salman, G. Yang, J. Li, P. Zhang, H. Zhang, I. Razenshteyn, and S. Bubeck (2019) Provably robust deep learning via adversarially trained smoothed classifiers. arXiv preprint arXiv:1906.04584. Cited by: §1, §1, §6.2, §6.2.
• C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2013) Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199. Cited by: §1.
• F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel (2017) Ensemble adversarial training: attacks and defenses. arXiv preprint arXiv:1705.07204. Cited by: §1.
• E. Wong and Z. Kolter (2018) Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pp. 5283–5292. Cited by: §1.
• E. Wong, F. Schmidt, and Z. Kolter (2019) Wasserstein adversarial examples via projected Sinkhorn iterations. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 6808–6817. Cited by: Appendix B, Figure 7, Appendix B, Appendix C, Figure 1, §1, §2, §2, §5, Figure 5, Figure 6, §6.2, §6.2, §6.2, §6.3, §6, footnote 2.

## Appendix A Proofs

###### Lemma 1.

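For any normalized probability distributions $x, x'$, there exists at least one local flow plan $\delta$ such that $x' = \Delta(x, \delta)$. Furthermore:

$$\min_{\delta:\, x' = \Delta(x,\delta)} \|\delta\|_1 = W_1(x, x') \tag{22}$$

Where $W_1$ denotes the 1-Wasserstein metric, using the $L_1$ distance as the underlying distance metric.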

###### Proof.

We first show the equivalence of the above minimization problem with the linear program proposed by Ling and Okada (2007), restated here:

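$$W_1(x, x') = \min_{g} \sum_{(i,j)} \sum_{(i',j') \in \mathcal{N}((i,j))} g_{(i,j),(i',j')} \tag{23}$$

where $g \geq 0$ and, for all $(i,j)$,

$$\sum_{(i',j') \in \mathcal{N}((i,j))} g_{(i,j),(i',j')} - g_{(i',j'),(i,j)} = x'_{i,j} - x_{i,j}$$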

It suffices to show that (1) there is a transformation from the variables in Equation 23 to the variables in Equation 22, such that all points which are feasible in Equation 23 are feasible in Equation 22 and the minimization objective in Equation 22 is less than or equal to the minimization objective in Equation 23, and (2) there is a transformation from the variables in Equation 22 to the variables in Equation 23, such that all points which are feasible in Equation 22 are feasible in Equation 23 and the minimization objective in Equation 23 is less than or equal to the minimization objective in Equation 22. For (1), we give the transformation as:

$$\begin{aligned}
\delta^{\text{vert.}}_{i,j} &:= g_{(i,j),(i+1,j)} - g_{(i+1,j),(i,j)} \\
\delta^{\text{horiz.}}_{i,j} &:= g_{(i,j),(i,j+1)} - g_{(i,j+1),(i,j)}
\end{aligned} \tag{24}$$

Where we let $\delta = (\delta^{\text{vert.}}, \delta^{\text{horiz.}})$. To show feasibility, we write out fully the flow constraint of Equation 23:

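$$\begin{aligned}
& g_{(i,j),(i+1,j)} - g_{(i+1,j),(i,j)} + g_{(i,j),(i-1,j)} - g_{(i-1,j),(i,j)} \\
& \quad + g_{(i,j),(i,j+1)} - g_{(i,j+1),(i,j)} + g_{(i,j),(i,j-1)} - g_{(i,j-1),(i,j)} = x'_{i,j} - x_{i,j}
\end{aligned} \tag{25}$$

Substituting in Equation 24:

$$\delta^{\text{vert.}}_{i,j} - \delta^{\text{vert.}}_{i-1,j} + \delta^{\text{horiz.}}_{i,j} - \delta^{\text{horiz.}}_{i,j-1} = x'_{i,j} - x_{i,j} \tag{26}$$

But by Definition 3.1, this is exactly:

$$\Delta(x, \delta)_{i,j} = x'_{i,j} \tag{27}$$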

Which is the sole constraint in Equation 22: then any solution which is feasible in Equation 23 is feasible in Equation 22. Also note that:

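$$\begin{aligned}
\|\delta\|_1 &= \sum_{i,j} \left( |\delta^{\text{vert.}}_{i,j}| + |\delta^{\text{horiz.}}_{i,j}| \right) \\
&= \sum_{i,j} \left( |g_{(i,j),(i+1,j)} - g_{(i+1,j),(i,j)}| + |g_{(i,j),(i,j+1)} - g_{(i,j+1),(i,j)}| \right) \\
&\leq \sum_{i,j} \left( g_{(i,j),(i+1,j)} + g_{(i,j),(i,j+1)} \right) + \sum_{i,j} \left( g_{(i+1,j),(i,j)} + g_{(i,j+1),(i,j)} \right) \\
&= \sum_{i,j} \left( g_{(i,j),(i+1,j)} + g_{(i,j),(i,j+1)} \right) + \sum_{i,j} \left( g_{(i,j),(i-1,j)} + g_{(i,j),(i,j-1)} \right) \\
&= \sum_{(i,j)} \sum_{(i',j') \in \mathcal{N}((i,j))} g_{(i,j),(i',j')}
\end{aligned} \tag{28}$$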

Where the inequality follows from the triangle inequality applied to Equation 24, together with the non-negativity of $g$, and in the second sum in the fourth line, we exploit the fact that flows to or from out-of-range indices are zero to shift indices. This shows that the minimization objective in Equation 22 is less than or equal to the minimization objective in Equation 23.
Moving on to (2), we give the transformation as:

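$$\begin{aligned}
g_{(i,j),(i+1,j)} &:= \max(\delta^{\text{vert.}}_{i,j}, 0) \\
g_{(i,j),(i-1,j)} &:= \max(-\delta^{\text{vert.}}_{i-1,j}, 0) \\
g_{(i,j),(i,j+1)} &:= \max(\delta^{\text{horiz.}}_{i,j}, 0) \\
g_{(i,j),(i,j-1)} &:= \max(-\delta^{\text{horiz.}}_{i,j-1}, 0)
\end{aligned} \tag{29}$$

Note that the non-negativity constraint of Equation 23 is automatically satisfied by the form of these definitions. Shifting indices, we also have:

$$\begin{aligned}
g_{(i-1,j),(i,j)} &= \max(\delta^{\text{vert.}}_{i-1,j}, 0) \\
g_{(i+1,j),(i,j)} &= \max(-\delta^{\text{vert.}}_{i,j}, 0) \\
g_{(i,j-1),(i,j)} &= \max(\delta^{\text{horiz.}}_{i,j-1}, 0) \\
g_{(i,j+1),(i,j)} &= \max(-\delta^{\text{horiz.}}_{i,j}, 0)
\end{aligned} \tag{30}$$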

From the constraint on Equation 22, we have:

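$$\begin{aligned}
x'_{i,j} - x_{i,j} &= \delta^{\text{vert.}}_{i,j} - \delta^{\text{vert.}}_{i-1,j} + \delta^{\text{horiz.}}_{i,j} - \delta^{\text{horiz.}}_{i,j-1} \\
&= \max(\delta^{\text{vert.}}_{i,j}, 0) - \max(-\delta^{\text{vert.}}_{i,j}, 0) + \max(-\delta^{\text{vert.}}_{i-1,j}, 0) - \max(\delta^{\text{vert.}}_{i-1,j}, 0) \\
&\quad + \max(\delta^{\text{horiz.}}_{i,j}, 0) - \max(-\delta^{\text{horiz.}}_{i,j}, 0) + \max(-\delta^{\text{horiz.}}_{i,j-1}, 0) - \max(\delta^{\text{horiz.}}_{i,j-1}, 0) \\
&= g_{(i,j),(i+1,j)} - g_{(i+1,j),(i,j)} + g_{(i,j),(i-1,j)} - g_{(i-1,j),(i,j)} \\
&\quad + g_{(i,j),(i,j+1)} - g_{(i,j+1),(i,j)} + g_{(i,j),(i,j-1)} - g_{(i,j-1),(i,j)}
\end{aligned} \tag{31}$$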

Which is exactly the second constraint of Equation 23: then any solution which is feasible in Equation 22 is feasible in Equation 23. Also note that:

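$$\begin{aligned}
& \sum_{(i,j)} \sum_{(i',j') \in \mathcal{N}((i,j))} g_{(i,j),(i',j')} \\
&= \sum_{i,j} \left( g_{(i,j),(i+1,j)} + g_{(i,j),(i,j+1)} \right) + \sum_{i,j} \left( g_{(i,j),(i-1,j)} + g_{(i,j),(i,j-1)} \right) \\
&= \sum_{i,j} \left( \max(\delta^{\text{vert.}}_{i,j}, 0) + \max(\delta^{\text{horiz.}}_{i,j}, 0) \right) + \sum_{i,j} \left( \max(-\delta^{\text{vert.}}_{i-1,j}, 0) + \max(-\delta^{\text{horiz.}}_{i,j-1}, 0) \right) \\
&= \sum_{i,j} \left( \max(\delta^{\text{vert.}}_{i,j}, 0) + \max(-\delta^{\text{vert.}}_{i,j}, 0) \right) + \sum_{i,j} \left( \max(\delta^{\text{horiz.}}_{i,j}, 0) + \max(-\delta^{\text{horiz.}}_{i,j}, 0) \right) \\
&= \sum_{i,j} \left( |\delta^{\text{vert.}}_{i,j}| + |\delta^{\text{horiz.}}_{i,j}| \right) = \|\delta\|_1
\end{aligned} \tag{32}$$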

Where we again exploit the fact that out-of-range entries of $\delta$ are zero to shift indices, in the fourth line. This shows that the minimization objective in Equation 23 is less than or equal to the minimization objective in Equation 22, completing (2).
Finally, now that we have shown that Equations 22 and 23 are in fact equivalent minimizations (i.e., we have proven Equation 22 correct), we would like to show that there is always a feasible solution to Equation 22, as claimed. By the above transformations, it suffices to show that there is always a feasible solution to Equation 23. Ling and Okada (2007) show that any feasible solution to the general Wasserstein minimization LP (Definition 1) can be transformed into a solution to Equation 23, so it suffices to show that the LP in Definition 1 always has a feasible solution. This is trivially satisfied by taking $\Pi = x x'^{T}$, where we note that $x x'^{T}$ is non-negative because $x$ and $x'$ are probability distributions. ∎
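The two transformations used in this proof can be checked numerically on a small grid. The sketch below is illustrative only: the function names are hypothetical, the sign convention for $\Delta$ is the one read off Equations 26 and 27 (with out-of-range flows treated as zero), and the grid size and random values are arbitrary choices.

```python
import random

def apply_flow(x, dv, dh):
    """Delta(x, delta), using the sign convention read off Equations 26-27 (assumed)."""
    n, m = len(x), len(x[0])
    out = [row[:] for row in x]
    for i in range(n):
        for j in range(m):
            if i < n - 1:
                out[i][j] += dv[i][j]
            if i > 0:
                out[i][j] -= dv[i - 1][j]
            if j < m - 1:
                out[i][j] += dh[i][j]
            if j > 0:
                out[i][j] -= dh[i][j - 1]
    return out

def delta_to_g(dv, dh):
    """Equation 29: split each signed edge flow into two non-negative directed flows."""
    g = {}
    for i, row in enumerate(dv):
        for j, d in enumerate(row):
            g[(i, j), (i + 1, j)] = max(d, 0.0)
            g[(i + 1, j), (i, j)] = max(-d, 0.0)
    for i, row in enumerate(dh):
        for j, d in enumerate(row):
            g[(i, j), (i, j + 1)] = max(d, 0.0)
            g[(i, j + 1), (i, j)] = max(-d, 0.0)
    return g

def g_to_delta(g, n, m):
    """Equation 24: net flow across each vertical and horizontal edge."""
    dv = [[g[(i, j), (i + 1, j)] - g[(i + 1, j), (i, j)] for j in range(m)]
          for i in range(n - 1)]
    dh = [[g[(i, j), (i, j + 1)] - g[(i, j + 1), (i, j)] for j in range(m - 1)]
          for i in range(n)]
    return dv, dh

def l1_norm(dv, dh):
    return sum(abs(d) for row in dv + dh for d in row)

def net_outflow(g, i, j):
    """Left-hand side of Equation 25 at pixel (i, j)."""
    out_flow = sum(v for (src, _), v in g.items() if src == (i, j))
    in_flow = sum(v for (_, dst), v in g.items() if dst == (i, j))
    return out_flow - in_flow

rng = random.Random(0)
n, m = 3, 4
x = [[1.0 / (n * m)] * m for _ in range(n)]

# Direction (2): delta -> g preserves the objective and satisfies the flow constraint.
dv = [[rng.uniform(-0.02, 0.02) for _ in range(m)] for _ in range(n - 1)]
dh = [[rng.uniform(-0.02, 0.02) for _ in range(m - 1)] for _ in range(n)]
x_new = apply_flow(x, dv, dh)
g = delta_to_g(dv, dh)
cost_matches = abs(sum(g.values()) - l1_norm(dv, dh)) < 1e-12
constraint_holds = all(
    abs(net_outflow(g, i, j) - (x_new[i][j] - x[i][j])) < 1e-12
    for i in range(n) for j in range(m)
)

# Direction (1): an arbitrary non-negative g maps to a delta with no larger cost.
g_rand = {edge: rng.uniform(0.0, 0.02) for edge in g}
dv2, dh2 = g_to_delta(g_rand, n, m)
cost_not_larger = l1_norm(dv2, dh2) <= sum(g_rand.values()) + 1e-12

print(cost_matches, constraint_holds, cost_not_larger)
```

This only exercises the two variable transformations; it does not solve either minimization, so it is a sanity check on Equations 24 through 32 rather than a computation of $W_1$.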

###### Theorem 1.

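Consider a normalized probability distribution $x$, and a classification score function $f$. Let $\bar{f}$ refer to the Wasserstein-smoothed classification function:

$$\bar{f}(x) = \mathbb{E}_{\delta}\left[ f(\Delta(x, \delta)) \right] \tag{33}$$

where $\delta$ is Laplace smoothing noise on the local flow plan, with noise level $\sigma$. Let $i$ be the class assignment of $x$ using the smoothed classifier $\bar{f}$ (i.e. $i = \arg\max_{i'} \bar{f}_{i'}(x)$). If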

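$$\bar{f}_i(x) \geq e^{2\sqrt{2}\rho/\sigma} \max_{i' \neq i} \bar{f}_{i'}(x) \tag{34}$$

Then for any perturbed probability distribution $\tilde{x}$ such that $W_1(x, \tilde{x}) \leq \rho$:

$$\bar{f}_i(\tilde{x}) \geq \max_{i' \neq i} \bar{f}_{i'}(\tilde{x}) \tag{35}$$

###### Proof.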

Let $u$ be the uniform probability vector. As a consequence of Lemma 1, for any distribution $x$, there exists a nonempty set of local flow plans $S_x$:

$$S_x = \{\delta \mid x = \Delta(u, \delta)\} \tag{36}$$

Also, we may define a version of the classifier on the local flow plan domain:

$$f^{\text{flow}}(\delta) = f(\Delta(u, \delta)) \tag{37}$$

Let $\delta_x$ be an arbitrary element in $S_x$, and consider any perturbed $\tilde{x}$ such that $W_1(x, \tilde{x}) \leq \rho$. By Lemma 1:

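$$\min_{\delta:\, \tilde{x} = \Delta(x, \delta)} \|\delta\|_1 = W_1(x, \tilde{x}) \tag{38}$$

Then, using Equation 6:

$$\min_{\delta:\, \tilde{x} = \Delta(u, \delta_x + \delta)} \|\delta\|_1 = W_1(x, \tilde{x}) \tag{39}$$

Let the minimum be achieved at $\delta^*$. Making a change of variables ($\delta_{\tilde{x}} = \delta_x + \delta^*$), we have:

$$\|\delta_{\tilde{x}} - \delta_x\|_1 = W_1(x, \tilde{x}) \quad \text{where } \tilde{x} = \Delta(u, \delta_{\tilde{x}}) \tag{40}$$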

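Note that for any $x$ (for $\delta_x \in S_x$):

$$\begin{aligned}
\bar{f}(x) &= \mathbb{E}_{\delta'}\left[ f(\Delta(x, \delta')) \right] \\
&= \mathbb{E}_{\delta'}\left[ f(\Delta(\Delta(u, \delta_x), \delta')) \right] \\
&= \mathbb{E}_{\delta'}\left[ f(\Delta(u, \delta_x + \delta')) \right] \\
&= \mathbb{E}_{\delta'}\left[ f^{\text{flow}}(\delta_x + \delta') \right]
\end{aligned} \tag{41}$$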
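The step above relies on flow plans composing additively (the property invoked via Equation 6), so that smoothing $x$ in image space agrees with smoothing $f^{\text{flow}}$ around any $\delta_x \in S_x$. The following sketch checks this sample by sample. Everything in it is an assumption made for illustration: the sign convention for $\Delta$ is the one read off Equations 26 and 27, `toy_score` is a made-up two-class score function, and the Laplace scale $\sigma/\sqrt{2}$ assumes that $\sigma$ denotes the noise standard deviation.

```python
import math
import random

def apply_flow(x, dv, dh):
    """Delta(x, delta), using the sign convention read off Equations 26-27 (assumed)."""
    n, m = len(x), len(x[0])
    out = [row[:] for row in x]
    for i in range(n):
        for j in range(m):
            if i < n - 1:
                out[i][j] += dv[i][j]
            if i > 0:
                out[i][j] -= dv[i - 1][j]
            if j < m - 1:
                out[i][j] += dh[i][j]
            if j > 0:
                out[i][j] -= dh[i][j - 1]
    return out

def add_plans(a, b):
    """Entrywise sum of two flow plans, each stored as a (dv, dh) pair."""
    return tuple([[p + q for p, q in zip(ra, rb)] for ra, rb in zip(pa, pb)]
                 for pa, pb in zip(a, b))

def laplace_plan(n, m, scale, rng):
    """I.i.d. Laplace(0, scale) noise on every edge (difference of two exponentials)."""
    def lap():
        return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return ([[lap() for _ in range(m)] for _ in range(n - 1)],
            [[lap() for _ in range(m - 1)] for _ in range(n)])

def toy_score(x):
    """Hypothetical two-class score comparing mass in the top and bottom rows."""
    p = 1.0 / (1.0 + math.exp(sum(x[-1]) - sum(x[0])))
    return [p, 1.0 - p]

n, m, sigma = 3, 3, 0.1
scale = sigma / math.sqrt(2.0)  # assumption: sigma is the noise standard deviation
rng = random.Random(0)
u = [[1.0 / (n * m)] * m for _ in range(n)]
delta_x = ([[0.02] * m for _ in range(n - 1)],
           [[-0.01] * (m - 1) for _ in range(n)])
x = apply_flow(u, *delta_x)  # by construction, delta_x is an element of S_x

samples = 2000
image_sum = flow_sum = worst_gap = 0.0
for _ in range(samples):
    noise = laplace_plan(n, m, scale, rng)
    a = toy_score(apply_flow(x, *noise))[0]                      # f(Delta(x, delta'))
    b = toy_score(apply_flow(u, *add_plans(delta_x, noise)))[0]  # f_flow(delta_x + delta')
    image_sum += a
    flow_sum += b
    worst_gap = max(worst_gap, abs(a - b))

smoothed_image = image_sum / samples  # Monte Carlo estimate of the smoothed score
smoothed_flow = flow_sum / samples    # same estimate, computed in flow space
print(smoothed_image, smoothed_flow, worst_gap)
```

Because the two scores agree for every noise sample up to floating-point rounding, the two Monte Carlo averages coincide as well, which is the content of the identity used in the proof.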

We can now apply Proposition 1 from Lecuyer et al. (2019), restated here:

###### Proposition.

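Consider a vector $v$, and a classification score function $h$. Let $\epsilon$ be Laplace noise with noise level $\sigma$, and let $i$ be the class assignment of $v$ using a Laplace-smoothed version of the classifier $h$:

$$i = \arg\max_{i'} \mathbb{E}_{\epsilon}\left[ h_{i'}(v + \epsilon) \right] \tag{42}$$

If:

$$\mathbb{E}_{\epsilon}\left[ h_i(v + \epsilon) \right] \geq e^{2\sqrt{2}\rho/\sigma} \max_{i' \neq i} \mathbb{E}_{\epsilon}\left[ h_{i'}(v + \epsilon) \right] \tag{43}$$

Then for any perturbed vector $\tilde{v}$ such that $\|v - \tilde{v}\|_1 \leq \rho$:

$$\mathbb{E}_{\epsilon}\left[ h_i(\tilde{v} + \epsilon) \right] \geq \max_{i' \neq i} \mathbb{E}_{\epsilon}\left[ h_{i'}(\tilde{v} + \epsilon) \right] \tag{44}$$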

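We apply this proposition to $f^{\text{flow}}$, noting that $\|\delta_{\tilde{x}} - \delta_x\|_1 = W_1(x, \tilde{x}) \leq \rho$:

$$\begin{aligned}
& \mathbb{E}_{\delta'}\left[ f^{\text{flow}}_i(\delta_x + \delta') \right] \geq e^{2\sqrt{2}\rho/\sigma} \max_{i' \neq i} \mathbb{E}_{\delta'}\left[ f^{\text{flow}}_{i'}(\delta_x + \delta') \right] \\
& \quad \implies \mathbb{E}_{\delta'}\left[ f^{\text{flow}}_i(\delta_{\tilde{x}} + \delta') \right] \geq \max_{i' \neq i} \mathbb{E}_{\delta'}\left[ f^{\text{flow}}_{i'}(\delta_{\tilde{x}} + \delta') \right]
\end{aligned} \tag{45}$$

Then, using Equation 41:

$$\bar{f}_i(x) \geq e^{2\sqrt{2}\rho/\sigma} \max_{i' \neq i} \bar{f}_{i'}(x) \implies \bar{f}_i(\tilde{x}) \geq \max_{i' \neq i} \bar{f}_{i'}(\tilde{x}) \tag{46}$$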

Which was to be proven. ∎
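Solving the condition of Theorem 1 (Equation 34) for $\rho$ gives the largest Wasserstein radius that a given pair of smoothed scores certifies: $\rho = \frac{\sigma}{2\sqrt{2}} \ln\left(\bar{f}_i(x) / \max_{i' \neq i} \bar{f}_{i'}(x)\right)$. A minimal sketch of that arithmetic follows; the function name and the example scores are hypothetical, and the scores are treated as exact expectations, whereas in practice they would be estimated from samples and replaced by suitable confidence bounds.

```python
import math

def certified_radius(scores, sigma):
    """Largest rho satisfying Equation 34 for exact smoothed scores (hypothetical helper)."""
    top = max(range(len(scores)), key=lambda k: scores[k])
    runner_up = max(s for k, s in enumerate(scores) if k != top)
    if runner_up <= 0.0:
        # Equation 34 holds for every rho when all other scores are zero.
        return top, math.inf
    ratio = scores[top] / runner_up
    return top, sigma * math.log(ratio) / (2.0 * math.sqrt(2.0))

# Illustrative smoothed scores for a three-class problem.
example_scores = [0.80, 0.15, 0.05]
example_sigma = 0.01
label, rho = certified_radius(example_scores, example_sigma)
print(label, rho)
```

At the returned radius, Equation 34 holds with equality, so any larger $\rho$ would no longer be certified by these scores.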

###### Corollary 1.

For any normalized probability distributions $x, x'$, if $W_1(x, x') \leq \rho/2$, then $\|x - x'\|_1 \leq \rho$, where $W_1$ is the 1-Wasserstein metric using any $L_p$ norm as the underlying distance metric. Furthermore, there exist distributions where these inequalities are tight.

###### Proof.

Let $\Pi$ indicate the optimal transport plan between $x$ and $x'$. From Definition 1, we have $\Pi \mathbf{1} = x$ and $\Pi^T \mathbf{1} = x'$. Then:

$$(\Pi^T - \Pi)\mathbf{1} = x' - x \tag{47}$$

Let represent a modified version of , with the diagonal elements set to zero. Note that and . Then, using triangle inequality:

 ∥(Π′)T1∥1+∥(Π′)1∥1≥∥((Π′)T−Π′)1∥1=∥x