# Understanding the Intrinsic Robustness of Image Distributions using Conditional Generative Models

## Abstract

Starting with Gilmer et al. (2018), several works have demonstrated the inevitability of adversarial examples based on different assumptions about the underlying input probability space. It remains unclear, however, whether these results apply to natural image distributions. In this work, we assume the underlying data distribution is captured by some conditional generative model, and prove intrinsic robustness bounds for a general class of classifiers, which solves an open problem in Fawzi et al. (2018). Building upon the state-of-the-art conditional generative models, we study the intrinsic robustness of two common image benchmarks under ℓ_2 perturbations, and show the existence of a large gap between the robustness limits implied by our theory and the adversarial robustness achieved by current state-of-the-art robust models. Code for all our experiments is available at https://github.com/xiaozhanguva/Intrinsic-Rob.


## 1 Introduction

Deep neural networks (DNNs) have achieved remarkable performance on many visual (Sutskever et al., 2012; He et al., 2016) and speech (Hinton et al., 2012) recognition tasks, but recent studies have shown that state-of-the-art DNNs are surprisingly vulnerable to adversarial perturbations: small, imperceptible input transformations designed to switch the prediction of the classifier (Szegedy et al., 2014; Goodfellow et al., 2015). This has led to a vigorous arms race between heuristic defenses (Papernot et al., 2016; Madry et al., 2018; Chakraborty et al., 2018; Wang et al., 2019) that propose ways to defend against existing attacks and newly-devised attacks (Carlini and Wagner, 2017; Athalye et al., 2018; Tramer et al., 2020) that are able to penetrate such defenses. Reliable defenses appear to be elusive, despite progress on provable defenses, including formal verification (Katz et al., 2017; Tjeng et al., 2019) and relaxation-based certification methods (Sinha et al., 2018; Raghunathan et al., 2018; Wong and Kolter, 2018; Gowal et al., 2019; Wang et al., 2018). Even the strongest of these defenses leave large opportunities for adversaries to find adversarial examples, while suffering from high computation costs and scalability issues.

Witnessing the difficulties of constructing robust classifiers, a line of recent works (Gilmer et al., 2018; Fawzi et al., 2018; Mahloujifar et al., 2019a; Shafahi et al., 2019) aims to understand the limitations of robust learning by providing theoretical bounds on adversarial robustness for arbitrary classifiers. By imposing different assumptions on the underlying data distributions and allowable perturbations, all of these theoretical works show that no adversarially robust classifiers exist for an assumed metric probability space, as long as the perturbation strength is sublinear in the typical norm of the inputs. Although such impossibility results seem disheartening to the goal of building robust classifiers, it remains unknown to what extent real image distributions satisfy the assumptions needed to obtain these results.

In this paper, we aim to bridge the gap between the theoretical robustness analyses on well-behaved data distributions and the maximum achievable adversarial robustness, which we call intrinsic robustness (formally defined by Definition 3), for typical image distributions. More specifically, we assume the underlying data lie on a separable low-dimensional manifold, which can be captured using a conditional generative model, then systematically study the intrinsic robustness based on the conditional generating process from both theoretical and experimental perspectives. Our main contributions are:

• We prove a fundamental bound on intrinsic robustness (Section 4), provided that the underlying data distribution can be captured by a conditional generative model, solving an open problem in Fawzi et al. (2018).

• Building upon a trained conditional generative model that mimics the underlying data generating process, we empirically evaluate the intrinsic robustness of image distributions based on MNIST and ImageNet (Section 5.2). Our estimates of intrinsic robustness demonstrate that there is still a large gap between the limits implied by our theory and the state-of-the-art robustness achieved by robust training methods (Section 5.3).

• We theoretically characterize the fundamental relationship between the in-distribution adversarial risk (which restricts adversarial examples to lie on the image manifold, and is formally defined by Definition 3) and the intrinsic robustness (Remark 4), and propose an optimization method to search for in-distribution adversarial examples with respect to a given classifier. Our estimated in-distribution robustness for state-of-the-art adversarially trained classifiers, together with the derived intrinsic robustness bound, provide a better understanding on the intrinsic robustness for natural image distributions (Section 5.4).

Notation. We use lower boldfaced letters such as 𝐱 to denote vectors, and [m] to denote the index set {1, 2, …, m}. For any 𝐱 ∈ 𝒳 and r > 0, denote by ℬ(𝐱, r, Δ) the ball around 𝐱 with radius r in some distance metric Δ(·,·). When the metric Δ is free of context, we simply write ℬ(𝐱, r). We use 𝒩(𝟎, I_d) to denote the d-dimensional standard Gaussian distribution, and let ν_d be its probability measure. For the one-dimensional case, we use Φ(·) to denote the cumulative distribution function (CDF) of 𝒩(0, 1), and use Φ⁻¹(·) to denote its inverse function. For any function g and probability measure μ defined over 𝒳, g∗μ denotes the push-forward measure of μ. The ℓ_p-norm of a vector 𝐱 is defined as ∥𝐱∥_p = (∑_j |x_j|^p)^{1/p}.

## 2 Related Work

Several recent works (Gilmer et al., 2018; Mahloujifar et al., 2019a; Shafahi et al., 2019; Dohmatob, 2019; Bhagoji et al., 2019) derived theoretical bounds on the maximum achievable adversarial robustness using isoperimetric inequalities under different assumptions on the input space. For instance, based on the assumption that the input data are uniformly distributed over two concentric spheres (Gilmer et al., 2018), or that the underlying metric probability space satisfies a concentration property (Mahloujifar et al., 2019a), any classifier with constant test error was proven to be vulnerable to adversarial perturbations sublinear in the input dimension. Shafahi et al. (2019) showed that adversarial examples are inevitable, provided the maximum density of the underlying input distribution is small relative to the uniform density. However, none of the above theoretical works provide experiments to justify that the imposed assumptions hold for real datasets, so it is unclear whether the derived theoretical bounds are meaningful for typical image distributions. Our work belongs to this line of research, but encompasses the practical goal of understanding the robustness limits for real image distributions.

The work most closely related to ours is Fawzi et al. (2018), which proved a classifier-independent upper bound on intrinsic robustness, provided the underlying distribution is well captured by a smoothed generative model with a Gaussian latent space and a small Lipschitz parameter. However, their theory cannot be applied to image distributions that lie on a low-dimensional, non-smooth manifold, as their framework requires examples from different classes to be close enough in the latent space. In contrast, our proposed theoretical bounds on intrinsic robustness are more general in that they can be applied to non-smooth data manifolds, such as image distributions generated by conditional models. In addition, we propose an empirical method to estimate the intrinsic robustness of the generated image distributions under worst-case perturbations.

Mahloujifar et al. (2019b) proposed to understand the inherent limitations of robust learning using heuristic methods to measure the concentration of measure based on a given set of i.i.d. samples. However, it is unclear to what extent the estimated sample-based concentration approximates the actual intrinsic robustness with respect to the underlying data distribution. In comparison, we assume the underlying data distribution can be captured by a conditional generative model and directly study the robustness limit on the generated data distribution.

## 3 Preliminaries

We focus on the task of image classification. Let (𝒳, μ, Δ) be a metric probability space, where 𝒳 denotes the input space, μ is a probability distribution over 𝒳, and Δ is some distance metric defined on 𝒳. Suppose there exists a ground-truth function f∗: 𝒳 → 𝒴 that gives a label to any image 𝐱 ∈ 𝒳, where 𝒴 denotes the set of all possible class labels. The objective of classification is to learn a function f: 𝒳 → 𝒴 that approximates f∗ well. In the context of adversarial examples, f is typically evaluated based on risk, which captures the classification accuracy of f on normal examples, and adversarial risk, which captures the classifier's robustness against adversarial perturbations:

Let (𝒳, μ, Δ) be a metric probability space and f∗ be the ground-truth classifier. For any classifier f, the risk of f is defined as:

    Risk_μ(f) = Pr_{𝐱∼μ}[f(𝐱) ≠ f∗(𝐱)].

The adversarial risk of f against perturbations with strength ε in metric Δ is defined as:

    AdvRisk_μ^ε(f) = Pr_{𝐱∼μ}[∃ 𝐱′ ∈ ℬ(𝐱, ε) such that f(𝐱′) ≠ f∗(𝐱′)].

Other definitions of adversarial risk also exist in the literature, such as the definition used in Madry et al. (2018) and the one proposed in Fawzi et al. (2018). However, these definitions are equivalent to one another under the assumption that small perturbations do not change the ground-truth labels. Another closely-related definition for adversarial robustness is the expected distance to the nearest error (see Diochnos et al. (2018) for the relations between these definitions). Our results can be applied to this definition as well.
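As a concrete illustration of these two definitions, the following toy sketch estimates both quantities empirically for a one-dimensional classifier, replacing the ball search with grid enumeration; the classifier, ground truth, test points, and perturbation budget are all illustrative assumptions:

```python
def risk(f, f_star, xs):
    """Empirical risk: fraction of points where f disagrees with f_star."""
    return sum(f(x) != f_star(x) for x in xs) / len(xs)

def adv_risk(f, f_star, xs, eps, n_grid=51):
    """Empirical adversarial risk: fraction of points with an error
    anywhere in the eps-ball (here a 1-D interval, searched on a grid)."""
    def has_error(x):
        return any(
            f(xp) != f_star(xp)
            for k in range(n_grid)
            for xp in [x - eps + 2 * eps * k / (n_grid - 1)]
        )
    return sum(has_error(x) for x in xs) / len(xs)

f_star = lambda x: int(x >= 0.0)       # ground truth: sign of x
f      = lambda x: int(x >= 0.2)       # learned classifier, shifted boundary
xs = [-1.0, -0.5, 0.1, 0.5, 1.0]       # toy test points

r  = risk(f, f_star, xs)               # only x = 0.1 is misclassified
ar = adv_risk(f, f_star, xs, eps=0.35) # x = 0.5 can be moved to 0.15, which
                                       # crosses f's boundary but not f_star's
```

Note that the adversarial risk can only exceed the risk, since every misclassified point trivially has an error within its own ε-ball.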

Under different assumptions on the input metric probability space, previous works proved model-independent bounds on adversarial robustness. Intrinsic robustness, defined originally by Mahloujifar et al. (2019b), captures the maximum adversarial robustness that can be achieved for a given robust learning problem:

Using the same settings as in Definition 3, let ℱ be some class of classifiers. The intrinsic robustness with respect to ℱ is defined as:

    Rob_μ^ε(ℱ) = 1 − inf_{f∈ℱ} AdvRisk_μ^ε(f).

In this work, we consider the class of imperfect classifiers that have risk at least some α ∈ (0, 1): ℱ_α = {f : Risk_μ(f) ≥ α}.

Motivated by the great success of producing natural-looking images using conditional generative adversarial nets (GANs) (Mirza and Osindero, 2014; Odena et al., 2017; Brock et al., 2019), we assume the underlying data distribution can be modeled by some conditional generative model. A generative model can be seen as a function g: ℝ^d → 𝒳 that maps some latent distribution over ℝ^d, usually assumed to be multivariate Gaussian, to some generated distribution over 𝒳.

Conditional generative models incorporate the additional class information into the data generating process. A conditional generative model can be considered as a set of generative models 𝒢 = {g_i}_{i∈[K]}, where images from the i-th class can be generated by transforming latent Gaussian vectors through g_i. More rigorously, we say a probability distribution μ can be generated by a conditional generative model 𝒢 = {g_i}_{i∈[K]}, if μ = ∑_{i=1}^K p_i · (g_i)∗ν_d, where K is the total number of different class labels and p_i represents the probability of sampling an image from class i.
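In code, sampling from such a mixture amounts to drawing a class label i with probability p_i and pushing a latent Gaussian vector through the corresponding generator g_i. A minimal sketch, where the linear "generators" are toy stand-ins for trained networks:

```python
import random

def sample_conditional(generators, class_probs, latent_dim, rng=random.Random(0)):
    """Draw one sample from mu = sum_i p_i * (g_i)_* nu_d."""
    # Pick a class label i with probability p_i.
    i = rng.choices(range(len(generators)), weights=class_probs)[0]
    # Draw a latent vector z ~ N(0, I_d).
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
    # Push z through the class-conditional generator g_i.
    return generators[i](z), i

# Toy stand-in generators: g_i shifts the latent vector by i. A real g_i
# would be a trained network mapping R^d to image space.
gens = [lambda z, i=i: [zj + i for zj in z] for i in range(3)]
x, label = sample_conditional(gens, [0.5, 0.3, 0.2], latent_dim=4)
```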

Based on the conditional model, we introduce the definition of in-distribution adversarial risk:

Consider the same settings as in Definition 3. Suppose μ can be captured by a conditional generative model 𝒢 = {g_i}_{i∈[K]}. For any given classifier f, the in-distribution adversarial risk of f against ε-perturbations is defined as:

    In-AdvRisk_μ^ε(f) = ∑_{i=1}^K p_i · Pr_{𝐳∼ν_d}[∃ 𝐳′ ∈ ℝ^d such that Δ(g_i(𝐳′), g_i(𝐳)) ≤ ε and f(g_i(𝐳′)) ≠ f∗(g_i(𝐳′))].

Since the in-distribution adversarial risk restricts adversarial examples to lie on the image manifold, it holds for any classifier f that In-AdvRisk_μ^ε(f) ≤ AdvRisk_μ^ε(f). As will be shown in the next section, this notion of in-distribution adversarial risk is closely related to the intrinsic robustness for the considered class of imperfect classifiers.

## 4 Main Theoretical Results

In this section, we present our main theoretical results on intrinsic robustness, provided the underlying distribution can be modeled by some conditional generative model (our results and proof techniques could also be easily applied to unconditional generative models). Based on the underlying generative process, the following local Lipschitz condition connects perturbations in the image space to the latent space.

###### Condition (Local Lipschitz).

Let g: ℝ^d → 𝒳 be a generative model that maps the latent Gaussian distribution ν_d to some generated distribution. Consider Euclidean distance as the distance metric for ℝ^d, and Δ as the metric for 𝒳. Given r > 0, g is said to be L(r)-locally Lipschitz with probability at least 1 − δ, if it satisfies

    Pr_{𝐳∼ν_d}[∀ 𝐳′ ∈ ℬ(𝐳, r), Δ(g(𝐳′), g(𝐳)) ≤ L(r) · ∥𝐳′ − 𝐳∥₂] ≥ 1 − δ.

As the main tool for bounding the intrinsic robustness, we present the Gaussian Isoperimetric inequality for the sake of completeness. This inequality, proved by Borell (1975) and Sudakov and Tsirelson (1978), bounds the minimum expansion of any subset with respect to the standard Gaussian measure.

[Gaussian Isoperimetric Inequality] Consider the metric probability space (ℝ^d, ν_d, ∥·∥₂), where ν_d is the probability measure of the d-dimensional standard Gaussian distribution 𝒩(𝟎, I_d), and ∥·∥₂ denotes the Euclidean distance. For any subset ℰ ⊆ ℝ^d and r ≥ 0, let ℰ_r = {𝐳 ∈ ℝ^d : ∃ 𝐳′ ∈ ℰ such that ∥𝐳 − 𝐳′∥₂ ≤ r} be the r-expansion of ℰ; then it holds that

    ν_d(ℰ_r) ≥ Φ(Φ⁻¹(ν_d(ℰ)) + r), (1)

where Φ is the CDF of 𝒩(0, 1) and Φ⁻¹ denotes its inverse.

In particular, when ℰ belongs to the set of half-spaces, equality is achieved in (1).
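The half-space case of (1) can be checked numerically. The sketch below uses only the standard library, computing Φ via the error function and Φ⁻¹ by bisection, and verifies by Monte Carlo that the measure of the expanded half-space matches the right-hand side of (1); all parameter values are illustrative:

```python
import math, random

def Phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Phi_inv(p):
    """Inverse CDF by bisection (sufficient for illustration)."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Half-space E = {z in R^d : z_1 <= t} has Gaussian measure p = Phi(t).
# Its r-expansion is the half-space {z : z_1 <= t + r}, with measure
# Phi(t + r) -- exactly the right-hand side of (1).
rng = random.Random(0)
p, r = 0.3, 0.5
t = Phi_inv(p)

# Monte Carlo estimate of nu_d(E_r): only the first coordinate matters.
n = 200_000
hits = sum(rng.gauss(0.0, 1.0) <= t + r for _ in range(n))
mc_measure = hits / n
bound = Phi(Phi_inv(p) + r)   # RHS of (1); equality holds for half-spaces
```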

Making use of the Gaussian Isoperimetric Inequality and the local Lipschitz condition of the conditional generator, the following theorem proves a lower bound on the (in-distribution) adversarial risk for any given classifier, provided the underlying distribution can be captured by a conditional generative model.

Let (𝒳, μ, Δ) be a metric probability space and f∗ be the underlying ground-truth. Suppose μ can be generated by a conditional generative model 𝒢 = {g_i}_{i∈[K]}. Given r > 0, suppose there exist constants {L_i(r)}_{i∈[K]} and δ such that for any i ∈ [K], g_i satisfies the L_i(r)-local Lipschitz property with probability at least 1 − δ and ε/L_i(r) ≤ r. Then for any classifier f, it holds that

    AdvRisk_μ^ε(f) ≥ In-AdvRisk_μ^ε(f) ≥ ∑_{i=1}^K p_i · Φ(Φ⁻¹(Risk_{μ_i}(f)) + ε/L_i(r)) − δ,

where μ_i = (g_i)∗ν_d is the pushforward measure of ν_d through g_i, for any i ∈ [K].

We provide a proof in Appendix A.1. Theorem 4 suggests the (in-distribution) adversarial risk is governed by the risk on each data manifold and the ratio between the perturbation strength and the local Lipschitz constant.

The following theorem, proved in Appendix A.2, gives a theoretical upper bound on the intrinsic robustness with respect to the class of imperfect classifiers.

Under the same setting as in Theorem 4, let L_max(r) = max_{i∈[K]} L_i(r). Consider the class of imperfect classifiers ℱ_α = {f : Risk_μ(f) ≥ α}; then the intrinsic robustness with respect to ℱ_α can be bounded as

    Rob_μ^ε(ℱ_α) ≤ 1 + δ − min_{i∈[K]} p_i · Φ(Φ⁻¹(α/p_i) + ε/L_i(r)),

provided that ε/L_i(r) ≤ r for any i ∈ [K]. In addition, if we consider the family of classifiers that have conditional risk at least α for each class, namely ℱ̃_α = {f : Risk_{μ_i}(f) ≥ α for all i ∈ [K]}, then the intrinsic robustness with respect to ℱ̃_α can be bounded by

    Rob_μ^ε(ℱ̃_α) ≤ 1 + δ − ∑_{i=1}^K p_i · Φ(Φ⁻¹(α) + ε/L_max(r)).

Theorem 4 shows that if the data distribution can be captured by a conditional generative model, the intrinsic robustness bound with respect to imperfect classifiers largely depends on the ratio ε/L_max(r). For instance, if this ratio is large, then Theorem 4 suggests that no classifier with initial risk at least α can achieve robust accuracy appreciably better than the derived bound for the assumed data generating process. In addition, if we assume the local Lipschitz parameter L_max(r) is some constant, then adversarial robustness is indeed not achievable for high-dimensional data distributions, provided the perturbation strength ε is sublinear in the input dimension, which is the typical setting considered.
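To make the dependence on ε/L_max(r) concrete, note that since the p_i sum to one, the ℱ̃_α bound collapses to 1 + δ − Φ(Φ⁻¹(α) + ε/L_max(r)), which can be evaluated directly. The values below are purely illustrative, not the paper's estimates:

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Phi_inv(p):
    """Inverse CDF by bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        lo, hi = (mid, hi) if Phi(mid) < p else (lo, mid)
    return (lo + hi) / 2.0

def intrinsic_robustness_bound(alpha, eps, L_max, delta=0.0):
    """Upper bound 1 + delta - Phi(Phi^{-1}(alpha) + eps / L_max)."""
    return 1.0 + delta - Phi(Phi_inv(alpha) + eps / L_max)

# Illustrative values: risk threshold 5%, varying perturbation-to-Lipschitz ratio.
for ratio in (0.5, 1.0, 2.0, 4.0):
    b = intrinsic_robustness_bound(alpha=0.05, eps=ratio, L_max=1.0)
    # The bound shrinks as eps / L_max grows: a larger effective latent-space
    # perturbation leaves less room for any imperfect classifier to be robust.
```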

The intrinsic robustness is closely related to the in-distribution adversarial risk. For the class of classifiers ℱ_α, one can prove that the intrinsic robustness is equivalent to the maximum achievable in-distribution adversarial robustness:

    Rob_μ^ε(ℱ_α) = 1 − inf_{f∈ℱ_α} In-AdvRisk_μ^ε(f). (2)

Trivially, In-AdvRisk_μ^ε(f) ≤ AdvRisk_μ^ε(f) holds for any f, which implies Rob_μ^ε(ℱ_α) ≤ 1 − inf_{f∈ℱ_α} In-AdvRisk_μ^ε(f). For a given f ∈ ℱ_α, one can construct an f̃ such that f̃(𝐱) = f(𝐱) if 𝐱 ∈ ℰ_f ∩ ℳ and f̃(𝐱) = f∗(𝐱) otherwise, where ℰ_f denotes the error region of f and ℳ is the considered image manifold. Since all errors of f̃ lie on the manifold, the construction immediately suggests AdvRisk_μ^ε(f̃) = In-AdvRisk_μ^ε(f̃) = In-AdvRisk_μ^ε(f), which implies Rob_μ^ε(ℱ_α) ≥ 1 − inf_{f∈ℱ_α} In-AdvRisk_μ^ε(f). Combining both directions proves the soundness of (2). This equivalence suggests the in-distribution adversarial robustness of any classifier in ℱ_α can be viewed as a lower bound on the actual intrinsic robustness, which motivates us to study the intrinsic robustness by estimating the in-distribution adversarial robustness of trained robust models in our experiments.

## 5 Experiments

This section presents our empirical evaluation of intrinsic robustness on real image distributions and assesses the tightness of our bound. We test our bound on two image distributions generated using the MNIST (LeCun et al., 1998) and ImageNet (Deng et al., 2009) datasets.

### 5.1 Conditional GAN Models

Instead of directly evaluating robustness on the real datasets, we use conditional GAN models to generate datasets from the learned data distributions and evaluate the robustness of several state-of-the-art robust models trained on the generated datasets, enabling a fair comparison with the theoretical robustness limits. Note that this approach is only feasible with conditional generative models, as unconditional models cannot provide the corresponding labels for the generated data samples. For MNIST, we adopt ACGAN (Odena et al., 2017), which features an additional auxiliary classifier for better conditional image generation. The ACGAN model generates images from a Gaussian latent space concatenated with a one-hot encoding of the conditional class labels. For ImageNet, we adopt the BigGAN model (Brock et al., 2019), the state-of-the-art GAN model for conditional image generation, which generates images from a Gaussian latent space; we down-sampled the generated images for efficiency purposes. We consider a standard Gaussian as the latent distribution for both conditional generative models (the original BigGAN model uses a truncated Gaussian; we adapted it to the standard Gaussian distribution). Figure 1 shows examples of the generated MNIST and ImageNet images; in both figures, each column of images corresponds to a particular label class of the considered dataset.

### 5.2 Local Lipschitz Constant Estimation

From Theorem 4, we observe that given a class of classifiers with risk at least α, the derived intrinsic robustness upper bound is mainly determined by the perturbation strength ε and the local Lipschitz constant L(r). While ε is usually predesignated in common robustness evaluation settings, the local Lipschitz constant is unknown for most real-world tasks. Computing an exact Lipschitz constant of a deep neural network is a difficult open problem. Thus, instead of obtaining the exact value, we approximate L(r) using a sample-based approach with respect to the generative models.

Recalling Definition 4, we consider the ℓ_2 distance in the latent space; g(𝐳) and g(𝐳′) are easy to compute via the generator network. Computing L(r), however, is much more complicated, as it requires obtaining a maximum value within a radius-r ball. To deal with this, our approach approximates L(r) by sampling points in the neighborhood around 𝐳 and taking the maximum observed ratio as the estimate of the true maximum within the ball. Since the definition of local Lipschitzness is probabilistic, we take multiple samples of the latent vectors to estimate the local Lipschitz constant L(r). The estimation procedure is summarized in Algorithm 1, which gives an underestimate of the underlying truth. Developing better Lipschitz estimation methods is an active area of machine learning research, but is not the main focus of this work.
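A sketch of this sampling procedure is given below; the toy linear generator and the parameter values are illustrative assumptions, not the settings used in the experiments:

```python
import math, random

def estimate_local_lipschitz(g, latent_dim, r, n_z=50, n_ball=100, seed=0):
    """Sample-based underestimate of the local Lipschitz constant L(r):
    max over sampled z and neighbors z' in B(z, r) of
    ||g(z') - g(z)||_2 / ||z' - z||_2."""
    rng = random.Random(seed)

    def l2(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    best = 0.0
    for _ in range(n_z):
        # Sample a latent vector z ~ N(0, I_d).
        z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
        gz = g(z)
        for _ in range(n_ball):
            # Sample a neighbor z' uniformly along a random direction in B(z, r).
            d = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
            norm = math.sqrt(sum(x * x for x in d)) or 1.0
            scale = r * rng.random() / norm
            zp = [zi + scale * di for zi, di in zip(z, d)]
            dist = l2(zp, z)
            if dist > 0:
                best = max(best, l2(g(zp), gz) / dist)
    return best

# Toy generator: a fixed linear map whose true Lipschitz constant is 2.
g_lin = lambda z: [2.0 * zi for zi in z]
L_hat = estimate_local_lipschitz(g_lin, latent_dim=8, r=0.5)
```

Since the maximum is taken only over sampled points, the estimate can never exceed the true constant, consistent with the underestimate noted above.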

Tables 1 and 2 summarize the local Lipschitz constants estimated for the trained ACGAN and BigGAN generators conditioned on each class. In particular, we report both the mean estimates averaged over repeated trials and the standard deviations. For both conditional generators, we use the same sampling parameters in Algorithm 1 for Lipschitz estimation. For BigGAN, the specifically selected classes from ImageNet are reported in Table 2.

Compared with unconditional generative models, conditional ones generate each class using a separate generator. Thus, the local Lipschitz constant of each class-conditioned generator is expected to be smaller than that of an unconditional one, as the within-class variation is usually much smaller than the between-class variation for a given classification dataset. For instance, we trained an unconditional GAN generator (Goodfellow et al., 2014) on the MNIST dataset and estimated its overall local Lipschitz constant using Algorithm 1 under the same parameter settings. Plugging this estimated Lipschitz constant into the theoretical results in Fawzi et al. (2018) yields an intrinsic robustness bound that is in fact vacuous under ℓ_2 perturbations.

### 5.3 Comparisons with Robust Classifiers

We compare our derived intrinsic robustness upper bound with the empirical adversarial robustness achieved by current state-of-the-art defense methods under ℓ_2 perturbations. Specifically, we consider three robust training methods: LP-Certify, an optimization-based certified robust defense (Wong et al., 2018); Adv-Train, PGD-attack-based adversarial training (Madry et al., 2018); and TRADES, adversarial training via an accuracy-robustness trade-off (Zhang et al., 2019). We use these robust training methods to train robust classifiers on a set of generated training images and evaluate their robustness on the corresponding generated test set.

For MNIST, we use our trained ACGAN model to generate the 10 classes of hand-written digits, split into training and test images. For ImageNet, we use the BigGAN model to generate 10 selected classes of images, again split into training and test sets. We refer to the 10-class BigGAN-generated dataset as 'ImageNet10'. We train robust models using Adv-Train and TRADES on both generated datasets, whereas we only train the LP-based certified robust classifier on generated MNIST data, as it is not able to scale to ImageNet10, nor to generated MNIST with larger perturbation strength ε (see Appendix B.1 for all the selected hyper-parameters and network architectures).

A commonly-used way to evaluate the robustness of a given model is to perform carefully-designed adversarial attacks. Here we adopt the PGD attack (Madry et al., 2018) and report the robust accuracy (classification accuracy on inputs generated using the PGD attack) as the empirically measured model robustness. We test both the natural classification accuracy and the robustness of the aforementioned adversarially trained classifiers under ℓ_2 perturbations across a range of perturbation strengths ε. See Appendix B.1 for the PGD parameter settings.
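For reference, an ℓ_2-constrained PGD step performs normalized gradient ascent on the loss and then projects the perturbed input back onto the ε-ball around the original input. A minimal sketch on a toy linear loss with a hand-supplied gradient (a real evaluation would backpropagate through the trained network):

```python
import math

def l2_pgd(x0, loss_grad, eps, step, n_steps=20):
    """l2-constrained PGD: ascend the loss, then project onto B(x0, eps)."""
    x = list(x0)
    for _ in range(n_steps):
        g = loss_grad(x)
        gn = math.sqrt(sum(v * v for v in g)) or 1.0
        # Normalized gradient ascent step.
        x = [xi + step * gi / gn for xi, gi in zip(x, g)]
        # Project back onto the eps-ball around x0.
        delta = [xi - x0i for xi, x0i in zip(x, x0)]
        dn = math.sqrt(sum(v * v for v in delta))
        if dn > eps:
            x = [x0i + eps * di / dn for x0i, di in zip(x0, delta)]
    return x

# Toy loss L(x) = sum(x) with constant gradient: PGD walks along the
# all-ones direction until it hits the boundary of the eps-ball.
x0 = [0.0] * 4
grad = lambda x: [1.0] * 4
x_adv = l2_pgd(x0, grad, eps=0.5, step=0.1, n_steps=30)
dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_adv, x0)))
```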

Table 3 compares the empirically measured robustness of the trained robust classifiers with the derived theoretical upper bound on intrinsic robustness. For the empirically measured adversarial robustness, we report both the mean and the standard deviation over repeated trials. For computing our theoretical robustness bounds, we plug the estimated local Lipschitz constants into Theorem 4, with the risk threshold α chosen separately for generated MNIST and ImageNet10 to reflect the best natural accuracy achieved by the considered robust classifiers.

Under most settings, there exists a large gap between the robustness limit implied by our theory and the best adversarial robustness achieved by state-of-the-art robust classifiers. For instance, the robust accuracy that Adv-Train and TRADES achieve on the generated ImageNet10 data falls far below the estimated robustness bound, and the gap becomes even larger as we increase the perturbation strength ε. In contrast to the previous theoretical results on artificial distributions, for these image classification problems we cannot simply conclude from the intrinsic robustness bound that adversarial examples are inevitable. This huge gap between the empirical robustness of the best current image classifiers and the estimated theoretical bound suggests that either there is a way to train better robust models, or there exist other explanations for the inherent limitations of robust learning against adversarial examples.

### 5.4 In-Distribution Adversarial Robustness

In Section 5.3, we empirically showed that the unconstrained robustness of existing robust classifiers is far below the intrinsic robustness upper bound implied by our theory for real distributions. However, it is not clear whether the reason is that current robust training methods are far from perfect, or that our derived upper bound is not tight enough due to the Lipschitz relaxation step used in proving it. In this section, we empirically study the in-distribution adversarial risk for a better characterization of the actual intrinsic robustness. As shown in Remark 4, the in-distribution adversarial robustness of any classifier with risk at least α can be regarded as a lower bound on the intrinsic robustness Rob_μ^ε(ℱ_α). This provides a more accurate characterization of the intrinsic robustness and enables a better understanding of it.

While there are many attack algorithms in the literature that can be used to evaluate the unconstrained robustness of a given classifier in the image space, little work addresses how to evaluate in-distribution robustness. To empirically evaluate the in-distribution robustness, we formulate the following optimization problem to find adversarial examples on the image manifold:

    min_𝐳 ℒ(f(G(𝐳, y)), y)  subject to  ∥G(𝐳, y) − 𝐱∥₂ ≤ ε, (3)

where G(𝐳, y) = g_y(𝐳) denotes the conditional generator, 𝐱 is the data sample in the image space to be attacked, f is the given classifier, and ℒ denotes the adversarial loss function. The goal of (3) is to optimize the latent vector 𝐳 to lower the adversarial loss (i.e., to make the robust classifier mis-classify some generated images) while keeping the distance between the generated image and the test image within the perturbation limit. The key difficulty in solving (3) lies in the fact that we cannot perform any projection operations, as we are optimizing over 𝐳 but the constraints are imposed on the generated image space. This prohibits the use of common attack algorithms such as PGD. In order to solve (3), we transform it into the following Lagrangian formulation:

    min_𝐳 ∥G(𝐳, y) − 𝐱∥₂ + λ · ℒ(f(G(𝐳, y)), y). (4)

This formulation drops the explicit perturbation constraint ε and instead tries to find in-distribution adversarial examples with the smallest possible perturbation. To evaluate the intrinsic robustness under a given perturbation budget, we further check all in-distribution adversarial examples found and only count those whose perturbations fall within the constraint. Note that even though (4) provides a feasible way to compute the in-distribution robustness of a classifier, it can be hard to solve in general. First, it is not obvious how to initialize 𝐳: random initialization could lead to bad local optima that prevent the optimizer from efficiently solving (4), or even from finding a 𝐳 that makes G(𝐳, y) close enough to 𝐱. Second, the hyper-parameter λ can be quite sensitive to different test examples, and failing to choose a proper λ can also lead to failures in finding in-distribution adversarial examples within the ε constraint. To tackle these challenges, we solve another optimization problem to initialize 𝐳 and adopt binary search for the best choice of λ (see Appendix B.2 for more details of our implementation).
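A toy sketch of the Lagrangian formulation (4), using a hypothetical linear "generator", a hinge-style adversarial loss, and numerical gradient descent; a real attack would backpropagate through the trained networks and tune λ by binary search as described above:

```python
import math

# Toy stand-ins (assumptions for illustration): a linear generator G(z) = A z
# mapping R^2 -> R^3, and an adversarial hinge loss that is zero once the
# first output coordinate exceeds a "decision boundary" at 2.
A = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]
G = lambda z: [sum(a * zi for a, zi in zip(row, z)) for row in A]
adv_loss = lambda x_gen: max(0.0, 2.0 - x_gen[0])

def lagrangian_attack(x_target, z0, lam, lr=0.05, n_steps=200, h=1e-5):
    """Minimize ||G(z) - x||_2 + lam * adv_loss(G(z)) by numerical gradient."""
    def objective(z):
        xg = G(z)
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(xg, x_target)))
        return dist + lam * adv_loss(xg)

    z = list(z0)
    for _ in range(n_steps):
        # Forward-difference gradient estimate in each latent coordinate.
        grad = []
        for i in range(len(z)):
            zp = list(z); zp[i] += h
            grad.append((objective(zp) - objective(z)) / h)
        z = [zi - lr * gi for zi, gi in zip(z, grad)]
    return z

x = G([1.0, 1.0])                      # the (generated) example to attack
z_adv = lagrangian_attack(x, z0=[1.0, 1.0], lam=2.0)
perturbation = math.sqrt(sum((a - b) ** 2 for a, b in zip(G(z_adv), x)))
# Only count the attack within a budget eps if perturbation <= eps.
```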

Figure 2 summarizes results from our empirical evaluations of the intrinsic robustness of the generated MNIST and ImageNet10 data. We evaluate the empirical robustness of the three robust training methods at different points during the training procedure; more specifically, we evaluate the robustness of the intermediate models produced at regular intervals of training epochs. For each method, we plot both the unconstrained robustness measured by PGD attacks and the in-distribution robustness measured using the aforementioned strategy. In addition, based on the local Lipschitz constants estimated in Section 5.2, we plot the implied theoretical bound on intrinsic robustness as the dotted curve for direct comparison.

Compared with the intrinsic robustness upper bound (dotted curve), the unconstrained robustness of the various robustly-trained models is much smaller, and the gap becomes more pronounced as we increase ε. This aligns with our observations in Section 5.3. However, under all of the considered settings, the estimated in-distribution adversarial robustness is much higher than the unconstrained robustness and closer to the theoretical upper bound, especially for the ImageNet10 data. Note that according to Remark 4, the actual intrinsic robustness should lie between the in-distribution robustness of any given classifier with risk at least α and the derived intrinsic robustness upper bound. Given the big gap between the estimated in-distribution and unconstrained robustness of various robustly trained models, one would expect that current state-of-the-art robust models are still far from approaching the actual intrinsic robustness limit for real image distributions.

## 6 Conclusions

We studied the intrinsic robustness of typical image distributions using conditional generative models. By deriving theoretical upper bounds on intrinsic robustness and providing empirical estimates on the generated image distributions, we observed a large gap between the theoretical intrinsic robustness limit and the best robustness achieved by state-of-the-art robust classifiers. Our results imply that the inevitability of adversarial examples claimed in recent theoretical studies, such as Fawzi et al. (2018), does not apply to real image distributions, and suggest the need for a deeper understanding of the intrinsic robustness limitations of real data distributions.

## Appendix A Proof of Main Theorem

This section presents the detailed proofs of the theorems in Section 4.

### a.1 Proof of Theorem 4

###### Proof.

Let ℰ_f = {𝐱 ∈ 𝒳 : f(𝐱) ≠ f∗(𝐱)} be the error region in the image space and ℰ_f^ε be its ε-expansion in metric Δ. By Definition 3, we have AdvRisk_μ^ε(f) = μ(ℰ_f^ε) ≥ In-AdvRisk_μ^ε(f).

Since μ = ∑_{i=1}^K p_i · (g_i)∗ν_d according to Definition 3, we have In-AdvRisk_μ^ε(f) = ∑_{i=1}^K p_i · In-AdvRisk_{μ_i}^ε(f) for any classifier f. Thus, it remains to lower bound each term individually. For any classifier f, we have

    In-AdvRisk_{μ_i}^ε(f) = Pr_{𝐳∼ν_d}[∃ 𝐳′ ∈ ℝ^d s.t. Δ(g_i(𝐳′), g_i(𝐳)) ≤ ε and f(g_i(𝐳′)) ≠ f∗(g_i(𝐳′))]
                          ≥ Pr_{𝐳∼ν_d}[∃ 𝐳′ ∈ ℬ(𝐳, ε/L_i(r)) s.t. f(g_i(𝐳′)) ≠ f∗(g_i(𝐳′))] − δ = I − δ, (5)

where I denotes the probability term on the second line. The inequality holds because g_i is L_i(r)-locally Lipschitz with probability at least 1 − δ and ε/L_i(r) ≤ r, so that on the event that the local Lipschitz property holds, any 𝐳′ ∈ ℬ(𝐳, ε/L_i(r)) satisfies Δ(g_i(𝐳′), g_i(𝐳)) ≤ ε.

To further bound the term $I$, we make use of the Gaussian Isoperimetric Inequality as presented in Lemma 4. Let $\cA_f = \{\bz\in\RR^d : f(g_i(\bz))\ne f^*(g_i(\bz))\}$ be the corresponding error region in the latent space. By Lemma 4, we have

 I \ge \Phi\Big(\Phi^{-1}\big(\nu_d(\cA_f)\big) + \frac{\epsilon}{L_i(r)}\Big) = \Phi\Big(\Phi^{-1}\big(\text{Risk}_{\mu_i}(f)\big) + \frac{\epsilon}{L_i(r)}\Big). \quad (6)

Finally, plugging (6) into (5), we complete the proof. ∎
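As a numerical sanity check, the lower bound obtained by combining (5) and (6) can be evaluated directly. The sketch below uses illustrative values (not the paper's experimental numbers) and computes $\Phi^{-1}$ by bisection rather than a library call, so it is self-contained:

```python
import math

def phi(x):
    # Standard normal CDF, Phi(x), via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi_inv(p, lo=-10.0, hi=10.0, iters=80):
    # Inverse CDF by bisection; valid since phi is strictly increasing
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def in_adv_risk_lower_bound(risk, eps, lipschitz, delta):
    # Lower bound from (5)-(6): Phi(Phi^{-1}(Risk) + eps / L_i(r)) - delta
    return phi(phi_inv(risk) + eps / lipschitz) - delta

# Illustrative numbers: 5% clean risk, eps = 1.0,
# local Lipschitz constant 2.0, Lipschitz failure probability delta = 0.01
print(in_adv_risk_lower_bound(0.05, 1.0, 2.0, 0.01))
```

Even with 5% clean risk, the bound forces a noticeably larger in-distribution adversarial risk once ϵ is comparable to the generator's Lipschitz scale.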

### A.2 Proof of Theorem 4

###### Proof.

According to Definition 3 and Theorem 4, for any $\epsilon \ge 0$, we have

 \mathrm{Rob}_{\epsilon}^{\mu}(\cF_\alpha) = \max_{f\in\cF_\alpha}\Big\{1 - \text{In-AdvRisk}_{\epsilon}^{\mu}(f)\Big\} \le \max_{f\in\cF_\alpha}\Big\{1 + \delta - \sum_{i\in[k]} p_i\,\Phi\Big(\Phi^{-1}\big(\text{Risk}_{\mu_i}(f)\big) + \frac{\epsilon}{L_i(r)}\Big)\Big\}, \quad (7)

where the last inequality holds because $\Phi$ is monotonically increasing. For any $i\in[k]$, let $\cA_f$ be the error region of $f$ and $\alpha_i = \mu_i(\cA_f)$ be the measure of $\cA_f$ under the $i$-th conditional distribution.

Thus, to obtain an upper bound on $\mathrm{Rob}_{\epsilon}^{\mu}(\cF_\alpha)$ using (7), it remains to solve the following optimization problem:

 \min_{\alpha_1,\dots,\alpha_k}\ \sum_{i\in[k]} p_i\,\Phi\Big(\Phi^{-1}(\alpha_i) + \frac{\epsilon}{L_i(r)}\Big) \quad \text{s.t.} \quad \sum_{i\in[k]} p_i\,\alpha_i \ge \alpha,\ \ \alpha_i\in[0,1]\ \ \forall i\in[k]. \quad (8)

Note that for any classifier whose risk is exactly $\alpha$, by definition, we can simply replace the inequality constraint in (8) with an equality, which proves the corresponding upper bound.

Next, we are going to show that the optimal value of (8) is achieved only if there exists a class $j\in[k]$ such that $\alpha_j = \alpha/p_j$ and $\alpha_i = 0$ for any $i\ne j$. Consider the simplest case where $k=2$. Note that $\Phi$ and $\Phi^{-1}$ are both monotonically increasing functions, which implies that $p_1\alpha_1 + p_2\alpha_2 = \alpha$ holds when the optimum is achieved; thus the optimization problem for $k=2$ can be formulated as follows:

 \min_{\alpha_1,\alpha_2}\ p_1\,\Phi\Big(\Phi^{-1}(\alpha_1) + \frac{\epsilon}{L_1(r)}\Big) + p_2\,\Phi\Big(\Phi^{-1}(\alpha_2) + \frac{\epsilon}{L_2(r)}\Big) \quad \text{s.t.} \quad p_1\alpha_1 + p_2\alpha_2 = \alpha. \quad (9)

Suppose $\alpha_1 \le \alpha_2$ holds for the initial setting. Now consider another setting where $\alpha_1' = \Phi(\Phi^{-1}(\alpha_1) + s_1)$ and $\alpha_2' = \Phi(\Phi^{-1}(\alpha_2) - s_2)$. Let $s_1, s_2 > 0$ be chosen such that $\Phi^{-1}(\alpha_1) + s_1 \le \Phi^{-1}(\alpha_2) - s_2$. According to the equality constraint of the optimization problem (9), we have

 p_1\int_{\Phi^{-1}(\alpha_1)}^{\Phi^{-1}(\alpha_1)+s_1}\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx \;=\; p_2\int_{\Phi^{-1}(\alpha_2)-s_2}^{\Phi^{-1}(\alpha_2)}\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx. \quad (10)

Let $\eta = \epsilon/L_1(r) = \epsilon/L_2(r)$ for simplicity. By simple algebra, we have

 p_1\int_{\Phi^{-1}(\alpha_1)+\eta}^{\Phi^{-1}(\alpha_1)+s_1+\eta}\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx = p_1\int_{\Phi^{-1}(\alpha_1)}^{\Phi^{-1}(\alpha_1)+s_1}\frac{1}{\sqrt{2\pi}}\,e^{-u^2/2-\eta u-\eta^2/2}\,du
 \ge p_1\,e^{-\eta(\Phi^{-1}(\alpha_1)+s_1)-\eta^2/2}\int_{\Phi^{-1}(\alpha_1)}^{\Phi^{-1}(\alpha_1)+s_1}\frac{1}{\sqrt{2\pi}}\,e^{-u^2/2}\,du
 \ge p_2\,e^{-\eta(\Phi^{-1}(\alpha_2)-s_2)-\eta^2/2}\int_{\Phi^{-1}(\alpha_2)-s_2}^{\Phi^{-1}(\alpha_2)}\frac{1}{\sqrt{2\pi}}\,e^{-u^2/2}\,du
 \ge p_2\int_{\Phi^{-1}(\alpha_2)-s_2}^{\Phi^{-1}(\alpha_2)}\frac{1}{\sqrt{2\pi}}\,e^{-u^2/2-\eta u-\eta^2/2}\,du = p_2\int_{\Phi^{-1}(\alpha_2)-s_2+\eta}^{\Phi^{-1}(\alpha_2)+\eta}\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx,

where the first inequality holds because $e^{-\eta u} \ge e^{-\eta(\Phi^{-1}(\alpha_1)+s_1)}$ for any $u \le \Phi^{-1}(\alpha_1)+s_1$, the second inequality follows from (10) and the fact that $\Phi^{-1}(\alpha_1)+s_1 \le \Phi^{-1}(\alpha_2)-s_2$, and the last inequality holds because $e^{-\eta(\Phi^{-1}(\alpha_2)-s_2)} \ge e^{-\eta u}$ for any $u \ge \Phi^{-1}(\alpha_2)-s_2$. Therefore, the optimal value of (9) will be achieved when $\alpha_1 = 0$ or $\alpha_2 = 1$. For the general setting with $k > 2$, since $\alpha_1,\dots,\alpha_k$ are independent in the objective, we can fix $\alpha_3,\dots,\alpha_k$ and optimize $\alpha_1$ and $\alpha_2$ first, then deal with the remaining variables incrementally using the same technique. ∎
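The extreme-point behavior can also be checked numerically for $k=2$: since $\alpha \mapsto \Phi(\Phi^{-1}(\alpha)+\eta)$ is concave, the constrained objective of (9) attains its minimum at an endpoint of the feasible interval. A small sketch with illustrative values ($p_1 = p_2 = 0.5$, $\alpha = 0.1$, $\eta = 0.5$; not numbers from the paper):

```python
import math

def phi(x):
    # Standard normal CDF via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi_inv(p, lo=-10.0, hi=10.0, iters=80):
    # Inverse CDF by bisection over a clipped range
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def objective(a1, p1, p2, alpha, eta):
    # Objective of (9) with alpha_2 eliminated via p1*a1 + p2*a2 = alpha
    a2 = (alpha - p1 * a1) / p2
    h = lambda a: phi(phi_inv(a) + eta)
    return p1 * h(a1) + p2 * h(a2)

# Sweep alpha_1 over the feasible interval [0, alpha / p1]
p1, p2, alpha, eta = 0.5, 0.5, 0.1, 0.5
grid = [i * (alpha / p1) / 10 for i in range(11)]
vals = [objective(a1, p1, p2, alpha, eta) for a1 in grid]
# The smallest value sits at an endpoint, not in the interior
print(vals.index(min(vals)) in (0, len(vals) - 1))  # prints True
```

Concentrating all error mass in a single class always gives a smaller objective than any balanced split, matching the argument above.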

## Appendix B Experimental Details

This section provides additional details for our experiments.

### B.1 Network Architectures and Hyper-parameter Settings

For the certified robust defense (LP-Certify), we adopt the same four-layer neural network architecture as implemented in Wong et al. (2018), with two convolutional layers and two fully connected layers, and use an Adam optimizer to train the robust classifier. In particular, the adversarial loss function is based on the ℓ_2-norm robust certificate proposed in Wong et al. (2018).

For training the attack-based robust models (Adv-Train and TRADES), we use a seven-layer CNN architecture that contains four convolutional layers and three fully connected layers. We use an SGD optimizer to minimize the attack-based adversarial loss, with separate learning rates for MNIST and ImageNet10. Table 4 summarizes all the hyper-parameters we used for training the robust models, including the additional trade-off parameter specific to TRADES.

For evaluating the unconstrained adversarial robustness, we implemented the PGD attack under the ℓ_2 metric. Table 5 shows all the hyper-parameters we used for robustness evaluation.
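For reference, the core of an ℓ_2-constrained PGD step is normalized gradient ascent followed by projection back onto the ϵ-ball. The sketch below is a generic version of that loop with a stand-in gradient callback, not the exact implementation used in the experiments:

```python
import math

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def project_l2_ball(delta, eps):
    # Project a perturbation onto the l2 ball of radius eps
    n = l2_norm(delta)
    if n <= eps or n == 0.0:
        return delta
    return [x * eps / n for x in delta]

def pgd_l2(x, grad_fn, eps, step_size, n_steps):
    # grad_fn returns the loss gradient at a perturbed input (a
    # hypothetical callback standing in for backprop through the model)
    delta = [0.0] * len(x)
    for _ in range(n_steps):
        g = grad_fn([xi + di for xi, di in zip(x, delta)])
        gn = l2_norm(g) or 1.0  # avoid division by zero
        # Ascent step along the normalized gradient direction
        delta = [di + step_size * gi / gn for di, gi in zip(delta, g)]
        delta = project_l2_ball(delta, eps)
    return [xi + di for xi, di in zip(x, delta)]
```

With a constant gradient, the iterate walks to the boundary of the ϵ-ball and stays there, which is the intended behavior of the projection step.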

### B.2 Strategies for Estimating In-distribution Adversarial Robustness

Initialization of \bz: For the MNIST data, we design an initialization strategy for \bz to make sure the perturbation term can be efficiently optimized. To be more specific, starting from random noise, we first solve another optimization problem:

 \bz_{\text{init}} = \argmin_{\bz} \|G(\bz, y) - \bx\|_2.

By setting \bz_{\text{init}} as our initial point, we minimize the initial perturbation distance. Here \bz can start from any random initial point, as we will then optimize the generated image under the ℓ_2 distance.

For the ImageNet10 data, even applying the above optimization procedure does not result in an initial \bz satisfying the perturbation constraint when ϵ is small. Therefore, we use another strategy: we record the latent vector \bz_0 used when generating the test sample \bx, i.e., \bx = G(\bz_0, y), and adopt \bz_0 as the initial point for \bz in solving (4). This makes sure that the whole optimization procedure can find at least one point satisfying the perturbation constraint. (We did not use \bz_0 as the initialization for the MNIST data, as our empirical study shows that the optimization-based initialization achieves better performance on MNIST.)
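The optimization-based initialization can be illustrated on a toy, purely linear stand-in for the conditional generator. Here G(z, y) = W z + b is hypothetical (the real G is a deep network optimized the same way with autodiff), and gradient descent drives the reconstruction error toward zero:

```python
def matvec(W, z):
    return [sum(wij * zj for wij, zj in zip(row, z)) for row in W]

def init_latent(W, b, x, lr=0.1, n_steps=500):
    # Gradient descent on f(z) = ||W z + b - x||_2^2, mimicking
    # z_init = argmin_z ||G(z, y) - x||_2 for a toy linear generator
    z = [0.0] * len(W[0])
    for _ in range(n_steps):
        r = [gi + bi - xi for gi, bi, xi in zip(matvec(W, z), b, x)]
        # grad f(z) = 2 * W^T r
        grad = [2.0 * sum(W[i][j] * r[i] for i in range(len(W)))
                for j in range(len(W[0]))]
        z = [zj - lr * gj for zj, gj in zip(z, grad)]
    return z

# Toy class-conditional generator parameters (hypothetical numbers)
W = [[1.0, 0.0], [0.0, 2.0]]
b = [0.0, 0.0]
x = [1.0, 2.0]
z_init = init_latent(W, b, x)
print(z_init)  # close to [1.0, 1.0], where W z + b reconstructs x exactly
```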

The choice of the regularization parameter: Inspired by Carlini and Wagner (2017), we adopt a binary search strategy for finding a better regularization parameter. Specifically, we set an initial value and, if we successfully find an adversarial example, we lower the parameter via binary search; otherwise, we raise it. For each batch of examples, we perform a fixed number of binary search steps in order to find qualified in-distribution adversarial examples.
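The search loop can be sketched as follows; `attack_succeeds` is a hypothetical callback that runs the attack once at a given weight, and the double-until-first-success schedule is one common variant of the Carlini-Wagner search, not necessarily the exact settings used here:

```python
def binary_search_reg(attack_succeeds, init=1.0, n_steps=6):
    # Find a small regularization weight that still yields an
    # in-distribution adversarial example: lower it on success,
    # raise it on failure (doubling until the first success).
    lo, hi = 0.0, None
    lam = init
    for _ in range(n_steps):
        if attack_succeeds(lam):
            hi = lam                      # success: try a smaller weight
            lam = 0.5 * (lo + hi)
        else:
            lo = lam                      # failure: raise the weight
            lam = 2.0 * lam if hi is None else 0.5 * (lo + hi)
    return hi  # smallest successful weight found (None if never successful)
```

With a toy success predicate that flips at some threshold, the loop converges to that threshold from above.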

Hyper-parameters: We use an Adam optimizer for finding in-distribution adversarial examples and cap the maximum number of iterations for each binary search step.

### Acknowledgements

This research was sponsored in part by National Science Foundation grants SaTC-1717950 and SaTC-1804603, with additional support from Amazon, Baidu, and Intel. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

## References

• Athalye et al. (2018) Athalye, A., Carlini, N., and Wagner, D. (2018). Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning (ICML).
• Bhagoji et al. (2019) Bhagoji, A. N., Cullina, D., and Mittal, P. (2019). Lower bounds on adversarial robustness from optimal transport. In Advances in Neural Information Processing Systems (NeurIPS).
• Borell (1975) Borell, C. (1975). The Brunn-Minkowski inequality in Gauss space. Inventiones mathematicae, 30(2):207–216.
• Brock et al. (2019) Brock, A., Donahue, J., and Simonyan, K. (2019). Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR).
• Carlini and Wagner (2017) Carlini, N. and Wagner, D. (2017). Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy.
• Chakraborty et al. (2018) Chakraborty, A., Alam, M., Dey, V., Chattopadhyay, A., and Mukhopadhyay, D. (2018). Adversarial attacks and defences: A survey. arXiv preprint arXiv:1810.00069.
• Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In Conference on Computer Vision and Pattern Recognition (CVPR).
• Diochnos et al. (2018) Diochnos, D., Mahloujifar, S., and Mahmoody, M. (2018). Adversarial risk and robustness: General definitions and implications for the uniform distribution. In Advances in Neural Information Processing Systems (NeurIPS).
• Dohmatob (2019) Dohmatob, E. (2019). Generalized no free lunch theorem for adversarial robustness. In International Conference on Machine Learning (ICML).
• Fawzi et al. (2018) Fawzi, A., Fawzi, H., and Fawzi, O. (2018). Adversarial vulnerability for any classifier. In Advances in Neural Information Processing Systems (NeurIPS).
• Gilmer et al. (2018) Gilmer, J., Metz, L., Faghri, F., Schoenholz, S. S., Raghu, M., Wattenberg, M., and Goodfellow, I. (2018). Adversarial spheres. arXiv preprint arXiv:1801.02774.
• Goodfellow et al. (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems (NeurIPS).
• Goodfellow et al. (2015) Goodfellow, I., Shlens, J., and Szegedy, C. (2015). Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR).
• Gowal et al. (2019) Gowal, S., Dvijotham, K., Stanforth, R., Bunel, R., Qin, C., Uesato, J., Mann, T., and Kohli, P. (2019). Scalable verified training for provably robust image classification. In International Conference on Computer Vision (ICCV).
• He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Conference on Computer Vision and Pattern Recognition (CVPR).
• Hinton et al. (2012) Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Kingsbury, B., et al. (2012). Deep neural networks for acoustic modeling in speech recognition. IEEE Signal Processing Magazine, 29.
• Katz et al. (2017) Katz, G., Barrett, C., Dill, D. L., Julian, K., and Kochenderfer, M. J. (2017). Reluplex: An efficient SMT solver for verifying deep neural networks. In International Conference on Computer Aided Verification.
• LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
• Madry et al. (2018) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. (2018). Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations (ICLR).
• Mahloujifar et al. (2019a) Mahloujifar, S., Diochnos, D. I., and Mahmoody, M. (2019a). The curse of concentration in robust learning: Evasion and poisoning attacks from concentration of measure. In AAAI Conference on Artificial Intelligence.
• Mahloujifar et al. (2019b) Mahloujifar, S., Zhang, X., Mahmoody, M., and Evans, D. (2019b). Empirically measuring concentration: Fundamental limits on intrinsic robustness. In Advances in Neural Information Processing Systems (NeurIPS).
• Mirza and Osindero (2014) Mirza, M. and Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
• Odena et al. (2017) Odena, A., Olah, C., and Shlens, J. (2017). Conditional image synthesis with auxiliary classifier gans. In International Conference on Machine Learning (ICML).
• Papernot et al. (2016) Papernot, N., McDaniel, P., Wu, X., Jha, S., and Swami, A. (2016). Distillation as a defense to adversarial perturbations against deep neural networks. In IEEE Symposium on Security and Privacy.
• Raghunathan et al. (2018) Raghunathan, A., Steinhardt, J., and Liang, P. (2018). Certified defenses against adversarial examples. In International Conference on Learning Representations (ICLR).
• Shafahi et al. (2019) Shafahi, A., Huang, W. R., Studer, C., Feizi, S., and Goldstein, T. (2019). Are adversarial examples inevitable? In International Conference on Learning Representations (ICLR).
• Sinha et al. (2018) Sinha, A., Namkoong, H., and Duchi, J. (2018). Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations (ICLR).
• Sudakov and Tsirelson (1978) Sudakov, V. N. and Tsirelson, B. S. (1978). Extremal properties of half-spaces for spherically invariant measures. Journal of Soviet Mathematics, 9(1):9–18.
• Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS).
• Szegedy et al. (2014) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. (2014). Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR).
• Tjeng et al. (2019) Tjeng, V., Xiao, K. Y., and Tedrake, R. (2019). Evaluating robustness of neural networks with mixed integer programming. In International Conference on Learning Representations (ICLR).
• Tramer et al. (2020) Tramer, F., Carlini, N., Brendel, W., and Madry, A. (2020). On adaptive attacks to adversarial example defenses. arXiv preprint arXiv:2002.08347.
• Wang et al. (2018) Wang, S., Chen, Y., Abdou, A., and Jana, S. (2018). MixTrain: Scalable training of formally robust neural networks. arXiv preprint arXiv:1811.02625.
• Wang et al. (2019) Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., and Gu, Q. (2019). On the convergence and robustness of adversarial training. In International Conference on Machine Learning (ICML).
• Wong and Kolter (2018) Wong, E. and Kolter, J. Z. (2018). Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning (ICML).
• Wong et al. (2018) Wong, E., Schmidt, F., Metzen, J. H., and Kolter, J. Z. (2018). Scaling provable adversarial defenses. In Advances in Neural Information Processing Systems (NeurIPS).
• Zhang et al. (2019) Zhang, H., Yu, Y., Jiao, J., Xing, E., El Ghaoui, L., and Jordan, M. (2019). Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning (ICML).