# Adversarially Robust Generalization Requires More Data

Machine learning models are often susceptible to adversarial perturbations of their inputs. Even small perturbations can cause state-of-the-art classifiers with high "standard" accuracy to produce an incorrect prediction with high confidence. To better understand this phenomenon, we study adversarially robust learning from the viewpoint of generalization. We show that already in a simple natural data model, the sample complexity of robust learning can be significantly larger than that of "standard" learning. This gap is information theoretic and holds irrespective of the training algorithm or the model family. We complement our theoretical results with experiments on popular image classification datasets and show that a similar gap exists here as well. We postulate that the difficulty of training robust classifiers stems, at least partially, from this inherently larger sample complexity.


## 1 Introduction

Modern machine learning models achieve high accuracy on a broad range of datasets, yet can easily be misled by small perturbations of their input. While such perturbations are often simple noise to a human or even imperceptible, they cause state-of-the-art models to misclassify their input with high confidence. This phenomenon was first studied in the context of secure machine learning for spam filters and malware classification [DDMSV04, LM05, biggio2017wild]. More recently, researchers have demonstrated the phenomenon under the name of adversarial examples in image classification [SzegedyZSBEGF13, GoodfellowSS14], question answering [jia2017adversarial], voice recognition [CMVZSSWZ16, ZYJZZX17, SM17, carlini2018audio], and other domains (for instance, see [GPMBM16, CANK17, AMT17, BM17, HPGDA17, KFS17, XCLRDS17, HAHO18]). Overall, the existence of such adversarial examples raises concerns about the robustness of trained classifiers. As we increasingly deploy machine learning systems in safety- and security-critical environments, it is crucial to understand the robustness properties of our models in more detail.

A growing body of work is exploring this robustness question from the security perspective by proposing attacks (methods for crafting adversarial examples) and defenses (methods for making classifiers robust to such perturbations). Often, the focus is on deep neural networks, e.g., see [SharifBBR16, MoosDez16, papernot2016distillation, CarliniW16a, TramerKPBM17, madry2017towards, xu2017feature, he2017weak]. While there has been success with robust classifiers on simple datasets [madry2017towards, kolter2017provable, sinha2017certifiable, raghunathan2018certified], more complicated datasets still exhibit a large gap between “standard” and robust accuracy [CarliniW16a, athalye2018obfuscated]. An implicit assumption underlying most of this work is that the same training dataset that enables good standard accuracy also suffices to train a robust model. However, it is unclear if this assumption is valid.

So far, the generalization aspects of adversarially robust classification have not been thoroughly investigated. Since adversarial robustness is a learning problem, the statistical perspective is of integral importance. A key observation is that adversarial examples are not at odds with the standard notion of generalization as long as they occupy only a small total measure under the data distribution. So to achieve adversarial robustness, a classifier must generalize in a stronger sense. We currently do not have a good understanding of how such a stronger notion of generalization compares to standard “benign” generalization, i.e., without an adversary.

In this work, we address this gap and explore the statistical foundations of adversarially robust generalization. We focus on sample complexity as a natural starting point since it underlies the core question of when it is possible to learn an adversarially robust classifier. Concretely, we pose the following question:

How does the sample complexity of standard generalization compare to that of adversarially robust generalization?

To study this question, we analyze robust generalization in two distributional models. By focusing on specific distributions, we can establish information-theoretic lower bounds and describe the exact sample complexity requirements for generalization. We find that even for a simple data distribution such as a mixture of two class-conditional Gaussians, the sample complexity of robust generalization is significantly larger than that of standard generalization. Our lower bound holds for any model and learning algorithm. Hence no amount of algorithmic ingenuity is able to overcome this limitation.

In spite of this negative result, simple datasets such as MNIST have recently seen significant progress in terms of adversarial robustness [madry2017towards, kolter2017provable, sinha2017certifiable, raghunathan2018certified]. The most robust models achieve accuracy around 90% against large $\ell_\infty$-perturbations. To better understand this discrepancy with our first theoretical result, we also study a second distributional model with binary features. This binary data model has the same standard generalization behavior as the previous Gaussian model. Moreover, it also suffers from a significantly increased sample complexity whenever one employs linear classifiers to achieve adversarially robust generalization. Nevertheless, a slightly non-linear classifier that utilizes thresholding turns out to recover the smaller sample complexity of standard generalization. Since MNIST is a mostly binary dataset, our result provides evidence that $\ell_\infty$-robustness on MNIST is significantly easier than on other datasets. Moreover, our results show that distributions with similar sample complexity for standard generalization can still exhibit considerably different sample complexity for robust generalization.

To complement our theoretical results, we conduct a range of experiments on MNIST, CIFAR10, and SVHN. By subsampling the datasets at various rates, we study the impact of sample size on adversarial robustness. When plotted as a function of training set size, our results show that the standard accuracy on SVHN indeed plateaus well before the adversarial accuracy reaches its maximum. On MNIST, explicitly adding thresholding to the model during training significantly reduces the sample complexity, similar to our upper bound in the binary data model. On CIFAR10, the situation is more nuanced because there are no known approaches that achieve more than 50% accuracy even against a mild adversary. But as we show in the next subsection, there is clear evidence for overfitting in the current state-of-the-art methods.

Overall, our results suggest that current approaches may be unable to attain higher adversarial accuracy on datasets such as CIFAR10 for a fundamental reason: the dataset may not be large enough to train a standard convolutional network robustly. Moreover, our lower bounds illustrate that the existence of adversarial examples should not necessarily be seen as a shortcoming of specific classification methods. Already in a simple data model, adversarial examples provably occur for any learning approach, even when the classifier already achieves high standard accuracy. So while vulnerability to adversarial $\ell_\infty$-perturbations might seem counter-intuitive at first, in some regimes it is an unavoidable consequence of working in a statistical setting.

### 1.1 A motivating example: Overfitting on CIFAR10

Before we describe our main results, we briefly highlight the importance of generalization for adversarial robustness via two experiments on MNIST and CIFAR10. In both cases, our goal is to learn a classifier that achieves good test accuracy even under $\ell_\infty$-bounded perturbations. We follow the standard robust optimization approach [wald1945statistical, ben2009robust, madry2017towards], also known as adversarial training [SzegedyZSBEGF13, GoodfellowSS14], and (approximately) solve the saddle point problem

$$\min_{\theta} \; \mathbb{E}_{x} \Big[ \max_{\|x' - x\|_\infty \le \varepsilon} \mathrm{loss}(\theta, x') \Big]$$

via stochastic gradient descent over the model parameters $\theta$. We utilize projected gradient descent for the inner maximization problem over allowed perturbations of magnitude $\varepsilon$ (see [madry2017towards] for details). Figure 1 displays the training curves for three quantities: (i) adversarial training error, (ii) adversarial test error, and (iii) standard test error.

The results show that on MNIST, robust optimization is able to learn a model with around 90% adversarial accuracy and a relatively small gap between training and test error. However, CIFAR10 offers a different picture. Here, the model (a wide residual network [ZK16]) is still able to fully fit the training set even against an adversary, but the generalization gap is significantly larger. The model only achieves 47% adversarial test accuracy, which is about 50 percentage points lower than its training accuracy. (Footnote 1: We remark that this accuracy is still currently the best published robust accuracy on CIFAR10 [athalye2018obfuscated]. For instance, contemporary approaches to architecture tuning do not yield better robust accuracies [ZCLS18].) Moreover, the standard test accuracy is about 87%, so the failure of generalization indeed primarily occurs in the context of adversarial robustness. This failure might be surprising, particularly since properly tuned convolutional networks rarely overfit much on standard vision datasets.

### 1.2 Outline of the paper

In the next section, we describe our main theoretical results at a high level. Sections 3 and 4 then provide more details for our lower bounds on $\ell_\infty$-robust generalization. Section 5 complements these results with experiments. We conclude with a discussion of our results and future research directions.

## 2 Theoretical Results

Our theoretical results concern statistical aspects of adversarially robust classification. In order to understand how properties of data affect the number of samples needed for robust generalization, we study two concrete distributional models. While our two data models are clearly much simpler than the image datasets currently being used in the experimental work on $\ell_\infty$-robustness, we believe that the simplicity of our models is a strength in this context.

After all, the fact that we can establish a separation between standard and robust generalization already in our Gaussian data model is evidence that the existence of adversarial examples for neural networks should not come as a surprise. The same phenomenon (i.e., classifiers with just enough samples for high standard accuracy necessarily being vulnerable to $\ell_\infty$-attacks) already occurs in much simpler settings such as a mixture of two Gaussians.

Also, our main contribution is a lower bound. So establishing a hardness result for a simple problem means that more complicated distributional setups that can “simulate” the Gaussian model directly inherit the same hardness.

Finally, as we describe in the subsection on the Bernoulli model, the benefits of the thresholding layer predicted by our theoretical analysis do indeed appear in experiments on MNIST as well. Since multiple defenses against adversarial examples have been primarily evaluated on MNIST [kolter2017provable, raghunathan2018certified, sinha2017certifiable], it is important to note that $\ell_\infty$-robustness on MNIST is a particularly easy case: adding a simple thresholding layer directly yields nearly state-of-the-art robustness against moderately strong adversaries, without any further changes to the model architecture or training algorithm.

### 2.1 The Gaussian model

Our first data model is a mixture of two spherical Gaussians with one component per class.

###### Definition 1 (Gaussian model).

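Let $\theta^\star \in \mathbb{R}^d$ be the per-class mean vector and let $\sigma > 0$ be the variance parameter. Then the $(\theta^\star, \sigma)$-Gaussian model is defined by the following distribution over $(x, y) \in \mathbb{R}^d \times \{\pm 1\}$: First, draw a label $y \in \{\pm 1\}$ uniformly at random. Then sample the data point $x \in \mathbb{R}^d$ from $\mathcal{N}(y \cdot \theta^\star, \sigma^2 I)$.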

While not explicitly specified in the definition, we will use the Gaussian model in the regime where the norm of the vector $\theta^\star$ is approximately $\sqrt{d}$. Hence the main free parameter for controlling the difficulty of the classification task is the variance $\sigma^2$, which controls the amount of overlap between the two classes.

To contrast the notions of “standard” and “robust” generalization, we briefly recap a standard definition of classification error.

###### Definition 2 (Classification error).

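Let $\mathcal{P}: \mathbb{R}^d \times \{\pm 1\} \to \mathbb{R}$ be a distribution. Then the classification error $\beta$ of a classifier $f: \mathbb{R}^d \to \{\pm 1\}$ is defined as $\beta = \Pr_{(x, y) \sim \mathcal{P}}[f(x) \ne y]$.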

Next, we define our main quantity of interest, which is an adversarially robust counterpart of the above classification error. Instead of counting misclassifications under the data distribution, we allow a bounded worst-case perturbation before passing the perturbed sample to the classifier.

###### Definition 3 (Robust classification error).

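Let $\mathcal{P}: \mathbb{R}^d \times \{\pm 1\} \to \mathbb{R}$ be a distribution and let $\mathcal{B}: \mathbb{R}^d \to \mathscr{P}(\mathbb{R}^d)$ be a perturbation set. (Footnote 2: We write $\mathscr{P}(\mathbb{R}^d)$ to denote the power set of $\mathbb{R}^d$, i.e., the set of subsets of $\mathbb{R}^d$.) Then the $\mathcal{B}$-robust classification error $\beta$ of a classifier $f: \mathbb{R}^d \to \{\pm 1\}$ is defined as $\beta = \Pr_{(x, y) \sim \mathcal{P}}[\exists\, x' \in \mathcal{B}(x) : f(x') \ne y]$.

Since $\ell_\infty$-perturbations have recently received a significant amount of attention, we focus on robustness to $\ell_\infty$-bounded adversaries in our work. For this purpose, we define the perturbation set $\mathcal{B}_\infty^\varepsilon(x) = \{x' \in \mathbb{R}^d \mid \|x' - x\|_\infty \le \varepsilon\}$. To simplify notation, we refer to robustness with respect to this set also as $\ell_\infty^\varepsilon$-robustness. As we remark in the discussion section, understanding generalization for other measures of robustness ($\ell_2$, rotations, etc.) is an important direction for future work.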

#### Standard generalization.

The Gaussian model has one parameter for controlling the difficulty of learning a good classifier. In order to simplify the following bounds, we study a regime where it is possible to achieve good standard classification error from a single sample. (Footnote 3: We remark that it is also possible to study a more general setting where standard generalization requires a larger number of samples.) As we will see later, this also allows us to calibrate our two data models to have comparable standard sample complexity.

Concretely, we prove the following theorem, which is a direct consequence of Gaussian concentration. Note that in this theorem we use a linear classifier: for a vector $w \in \mathbb{R}^d$, the linear classifier $f_w: \mathbb{R}^d \to \{\pm 1\}$ is defined as $f_w(x) = \mathrm{sgn}(\langle w, x \rangle)$.

###### Theorem 4.

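Let $(x, y)$ be drawn from a $(\theta^\star, \sigma)$-Gaussian model with $\|\theta^\star\|_2 = \sqrt{d}$ and $\sigma \le c \cdot d^{1/4}$ where $c$ is a universal constant. Let $\hat{w} \in \mathbb{R}^d$ be the vector $\hat{w} = y \cdot x$. Then with high probability, the linear classifier $f_{\hat{w}}$ has classification error at most 1%.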

To minimize the number of parameters in our bounds, we have set the error probability to 1%. By tuning the model parameters appropriately, it is possible to achieve a vanishingly small error probability from a single sample (see Corollary 19 in Appendix A.1).
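As a rough illustration of the single-sample classifier in Theorem 4, the following sketch samples from a Gaussian model and estimates the standard error of the linear classifier built from one labeled point. The all-ones mean vector, the particular values of `d` and `sigma`, and all function names are assumptions chosen for illustration; they are not taken from the paper.

```python
import random


def sample_gaussian_model(theta, sigma, rng):
    """Draw (x, y): y uniform in {-1, +1}, then x ~ N(y * theta, sigma^2 I)."""
    y = rng.choice([-1, 1])
    x = [y * t + rng.gauss(0.0, sigma) for t in theta]
    return x, y


def linear_classifier(w, x):
    """f_w(x) = sgn(<w, x>), with ties broken towards +1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1


def single_sample_error(d, sigma, trials, seed=0):
    """Estimate the standard error of the classifier w_hat = y * x."""
    rng = random.Random(seed)
    theta = [1.0] * d  # hypothetical mean vector with ||theta||_2 = sqrt(d)
    x0, y0 = sample_gaussian_model(theta, sigma, rng)
    w_hat = [y0 * xi for xi in x0]  # built from one labeled sample only
    errors = 0
    for _ in range(trials):
        x, y = sample_gaussian_model(theta, sigma, rng)
        errors += linear_classifier(w_hat, x) != y
    return errors / trials
```

For example, `single_sample_error(400, 1.0, 500)` estimates the error on 500 fresh samples in dimension 400. Increasing `sigma` makes the two classes overlap more and raises the estimate.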

#### Robust generalization.

As we just demonstrated, we can easily achieve standard generalization from only a single sample in our Gaussian model. We now show that achieving a low $\ell_\infty^\varepsilon$-robust classification error requires significantly more samples. To this end, we begin with a natural strengthening of Theorem 4 and prove that the (class-weighted) sample mean can also be a robust classifier (given sufficient data).

###### Theorem 5.

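Let $(x_1, y_1), \ldots, (x_n, y_n)$ be drawn i.i.d. from a $(\theta^\star, \sigma)$-Gaussian model with $\|\theta^\star\|_2 = \sqrt{d}$ and $\sigma \le c_1 d^{1/4}$. Let $\hat{w} \in \mathbb{R}^d$ be the weighted mean vector $\hat{w} = \frac{1}{n} \sum_{i=1}^{n} y_i x_i$. Then with high probability, the linear classifier $f_{\hat{w}}$ has $\ell_\infty^\varepsilon$-robust classification error at most 1% if

$$n \ge \begin{cases} 1 & \text{for } \varepsilon \le \tfrac{1}{4} d^{-1/4} \\ c_2 \, \varepsilon^2 \sqrt{d} & \text{for } \tfrac{1}{4} d^{-1/4} \le \varepsilon \le \tfrac{1}{4}. \end{cases}$$

We refer the reader to Corollary 23 in Appendix A.1 for the details. As before, $c_1$ and $c_2$ are two universal constants. Overall, the theorem shows that it is possible to learn an $\ell_\infty^\varepsilon$-robust classifier in the Gaussian model as long as $\varepsilon$ is bounded by a small constant and we have a large number of samples.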

Next, we show that this significantly increased sample complexity is necessary. Our main theorem establishes a lower bound for all learning algorithms, which we formalize as functions from data samples to binary classifiers. In particular, the lower bound applies not only to learning linear classifiers.

###### Theorem 6.

Let $g_n$ be any learning algorithm, i.e., a function from $n$ samples to a binary classifier $f_n$. Moreover, let $\sigma = c_1 \cdot d^{1/4}$, let $\varepsilon \ge 0$, and let $\theta \in \mathbb{R}^d$ be drawn from $\mathcal{N}(0, I)$. We also draw $n$ samples from the $(\theta, \sigma)$-Gaussian model. Then the expected $\ell_\infty^\varepsilon$-robust classification error of $f_n$ is at least $(1 - \tfrac{1}{d}) \tfrac{1}{2}$ if

$$n \le c_2 \, \frac{\varepsilon^2 \sqrt{d}}{\log d}.$$

The proof of the theorem can be found in Corollary 23 (Appendix A.2) and we provide a brief sketch in Section 3. It is worth noting that the classification error in the lower bound is tight. A classifier that always outputs a fixed prediction trivially achieves perfect robustness on one of the two classes and hence robust accuracy $\tfrac{1}{2}$.

Comparing Theorems 5 and 6, we see that the sample complexity required for robust generalization is bounded as

$$\frac{c}{\log d} \;\le\; \frac{n}{\varepsilon^2 \sqrt{d}} \;\le\; c'.$$

Hence the lower bound is nearly tight in our regime of interest. When the perturbation has constant $\ell_\infty$-norm, the sample complexity of robust generalization is larger than that of standard generalization by a factor of $\sqrt{d}$ (up to the logarithmic term), i.e., polynomial in the problem dimension. This shows that for high-dimensional problems, adversarial robustness can provably require a significantly larger number of samples.

Finally, we remark that our lower bound also applies to a more restricted adversary. As we outline in Section 3, the proof uses only a single adversarial perturbation per class. As a result, the lower bound provides transferable adversarial examples and applies to worst-case distribution shifts without a classifier-adaptive adversary. We refer the reader to Section 7 for a more detailed discussion.

### 2.2 The Bernoulli model

As mentioned in the introduction, simpler datasets such as MNIST have recently seen significant progress in terms of $\ell_\infty$-robustness. We now investigate a possible mechanism underlying these advances. To this end, we study a second distributional model that highlights how the data distribution can significantly affect the achievable robustness. The second data model is defined on the hypercube $\{\pm 1\}^d$, and the two classes are represented by opposite vertices of that hypercube. When sampling a datapoint for a given class, we flip each bit of the corresponding class vertex with a certain probability. This data model is inspired by the MNIST dataset because MNIST images are close to binary (many pixels are almost fully black or white).

###### Definition 7 (Bernoulli model).

Let $\theta^\star \in \{\pm 1\}^d$ be the per-class mean vector and let $\tau > 0$ be the class bias parameter. Then the $(\theta^\star, \tau)$-Bernoulli model is defined by the following distribution over $(x, y) \in \{\pm 1\}^d \times \{\pm 1\}$: First, draw a label $y \in \{\pm 1\}$ uniformly at random from its domain. Then sample the data point $x \in \{\pm 1\}^d$ by sampling each coordinate $x_i$ from the distribution

$$x_i = \begin{cases} \phantom{-}y \cdot \theta^\star_i & \text{with probability } \tfrac{1}{2} + \tau \\ -y \cdot \theta^\star_i & \text{with probability } \tfrac{1}{2} - \tau. \end{cases}$$

As in the previous subsection, the model has one parameter for controlling the difficulty of learning. A small value of $\tau$ makes the samples less correlated with their respective class vectors and hence leads to a harder classification problem. Note that both the Gaussian and the Bernoulli model are defined by simple sub-Gaussian distributions. Nevertheless, we will see that they differ significantly in terms of robust sample complexity.

#### Standard generalization.

As in the Gaussian model, we first calibrate the distribution so that we can learn a classifier with good standard accuracy from a single sample. (Footnote 4: To be precise, the two distributions have comparable sample complexity for standard generalization in the regime where $\tau \approx \sigma^{-1}$.)

The following theorem is a direct consequence of the fact that bounded random variables exhibit sub-Gaussian concentration.

###### Theorem 8.

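Let $(x, y)$ be drawn from a $(\theta^\star, \tau)$-Bernoulli model with $\tau \ge c \cdot d^{-1/4}$ where $c$ is a universal constant. Let $\hat{w} \in \mathbb{R}^d$ be the vector $\hat{w} = y \cdot x$. Then with high probability, the linear classifier $f_{\hat{w}}$ has classification error at most 1%.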

To simplify the bound, we have set the error probability to be 1% as in the Gaussian model. We refer the reader to Corollary 28 in Appendix B.1 for the proof.

#### Robust generalization.

Next, we investigate the sample complexity of robust generalization in our Bernoulli model. For linear classifiers, a small robust classification error again requires a large number of samples:

###### Theorem 9.

Let be a linear classifier learning algorithm, i.e., a function from samples to a linear classifier . Suppose that we choose uniformly at random from and draw samples from the -Bernoulli model with . Moreover, let and . Then the expected -robust classification error of is at least if

$$n \le c_2 \, \frac{\varepsilon^2 \gamma^2 d}{\log(d / \gamma)}.$$

We provide a proof sketch in Section 4 and the full proof in Appendix B.2. At first, the lower bound for linear classifiers might suggest that $\ell_\infty$-robustness requires an inherently larger sample complexity here as well. However, in contrast to the Gaussian model, non-linear classifiers can achieve a significantly improved robustness. In particular, consider the following thresholding operation $T: \mathbb{R}^d \to \{\pm 1\}^d$, which is defined element-wise as

$$T(x)_i = \begin{cases} +1 & \text{if } x_i \ge 0 \\ -1 & \text{otherwise.} \end{cases}$$

It is easy to see that for $\varepsilon < 1$, the thresholding operator undoes the action of any $\ell_\infty$-bounded adversary, i.e., we have $T(\mathcal{B}_\infty^\varepsilon(x)) = \{x\}$ for any $x \in \{\pm 1\}^d$. Hence we can combine the thresholding operator with the classifier learned from a single sample to get the following upper bound.

###### Theorem 10.

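Let $(x, y)$ be drawn from a $(\theta^\star, \tau)$-Bernoulli model with $\tau \ge c \cdot d^{-1/4}$ where $c$ is a universal constant. Let $\hat{w} \in \mathbb{R}^d$ be the vector $\hat{w} = y \cdot x$. Then with high probability, the classifier $f_{\hat{w}} \circ T$ has $\ell_\infty^\varepsilon$-robust classification error at most 1% for any $\varepsilon < 1$.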

This theorem shows a stark contrast to the Gaussian case. Although both models have similar sample complexity for standard generalization, there is a gap between the $\ell_\infty$-robust sample complexity for the Bernoulli and Gaussian models. This discrepancy provides evidence that robust generalization requires a more nuanced understanding of the data distribution than standard generalization.
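The effect of the thresholding operator on hypercube data can be checked directly. The sketch below is a minimal illustration rather than the paper's experimental code: it samples from a Bernoulli model, applies a random perturbation with $\ell_\infty$-norm below 1, and thresholds the result. The function names and the use of a uniformly random perturbation (instead of a worst-case adversary) are assumptions made here.

```python
import random


def sample_bernoulli_model(theta, tau, rng):
    """Draw (x, y): y uniform in {-1, +1}; each x_i equals y * theta_i
    with probability 1/2 + tau and -y * theta_i otherwise."""
    y = rng.choice([-1, 1])
    x = [y * t if rng.random() < 0.5 + tau else -y * t for t in theta]
    return x, y


def threshold(x):
    """Element-wise T(x)_i = +1 if x_i >= 0, else -1."""
    return [1 if xi >= 0 else -1 for xi in x]


def perturb(x, eps, rng):
    """Add an arbitrary perturbation with l_inf norm at most eps."""
    return [xi + rng.uniform(-eps, eps) for xi in x]
```

Since every coordinate of a hypercube point has magnitude exactly 1, any shift of magnitude below 1 leaves its sign unchanged, so `threshold(perturb(x, eps, rng))` returns `x` whenever `eps < 1`. This holds for every perturbation in the ball, not only the random ones drawn here.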

In isolation, the thresholding step might seem specific to the Bernoulli model studied here. However, our experiments in Section 5 show that an explicit thresholding layer also significantly improves the sample complexity of training a robust neural network on MNIST. We conjecture that the effectiveness of thresholding is behind many of the successful defenses against adversarial examples on MNIST (for instance, see Appendix C in [madry2017towards]).

## 3 Lower Bounds for the Gaussian Model

Recall our main theoretical result: In the Gaussian model, no algorithm can produce a robust classifier unless it has seen a large number of samples. In particular, we give a nearly tight trade-off between the number of samples and the $\ell_\infty$-robustness of the classifier. The following theorem is the technical core of this lower bound. Combined with standard bounds on the $\ell_\infty$-norm of a random Gaussian vector, it gives Theorem 6 from the previous section.

###### Theorem 11.

Let $g_n$ be any learning algorithm, i.e., a function from $n \ge 0$ samples in $\mathbb{R}^d \times \{\pm 1\}$ to a binary classifier $f_n$. Moreover, let $\sigma > 0$, let $\varepsilon \ge 0$, and let $\theta \in \mathbb{R}^d$ be drawn from $\mathcal{N}(0, I)$. We also draw $n$ samples from the $(\theta, \sigma)$-Gaussian model. Then the expected $\ell_\infty^\varepsilon$-robust classification error of $f_n$ is at least

$$\frac{1}{2} \Pr_{v \sim \mathcal{N}(0, I)} \left[ \sqrt{\frac{n}{\sigma^2 + n}} \, \|v\|_\infty \le \varepsilon \right].$$

Several remarks are in order. First, since we lower bound the expected robust classification error for a distribution over the model parameters $\theta$, our result implies a lower bound on the minimax robust classification error (i.e., minimum over learning algorithms, maximum over unknown parameters $\theta$). Second, while we refer to the learning procedure as an algorithm, our lower bounds are information theoretic and hold irrespective of the computational power of this procedure.

Moreover, our proof shows that given the samples, there is a single adversarial perturbation that (a) applies to all learning algorithms, and (b) leads to at least a constant fraction of fresh samples being misclassified. In other words, the same perturbation is transferable across examples as well as across architectures and learning procedures. Hence our simple Gaussian data model already exhibits the transferability phenomenon, which has recently received significant attention in the deep learning literature (e.g., [SzegedyZSBEGF13, TramerPGBM17, dezfooli2017universal]).

We defer a full proof of the theorem to Section A.2 of the supplementary material. Here, we sketch the main ideas of the proof.

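We fix an algorithm $g_n$ and let $S_n$ denote the set of $n$ samples given to the algorithm. We are interested in the expected robust classification error, which can be formalized as

$$\mathbb{E}_{\theta^\star} \, \mathbb{E}_{S_n} \, \mathbb{E}_{y \sim \pm 1} \Pr_{x \sim \mathcal{N}(y \theta^\star, \sigma^2 I)} \big[ \exists\, x' \in \mathcal{B}_\infty^\varepsilon(x) : f_n(x') \ne y \big].$$

We swap the two outer expectations so the quantity of interest becomes

$$\mathbb{E}_{S_n} \, \mathbb{E}_{\theta^\star} \, \mathbb{E}_{y \sim \pm 1} \Pr_{x \sim \mathcal{N}(y \theta^\star, \sigma^2 I)} \big[ \exists\, x' \in \mathcal{B}_\infty^\varepsilon(x) : f_n(x') \ne y \big].$$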

Given the samples $S_n$, the posterior on $\theta^\star$ is a Gaussian distribution with parameters defined by simple statistics of $S_n$ (the sample mean and the number of samples). Since the new data point $x$ (to be classified) is itself drawn from a Gaussian distribution with mean $y \theta^\star$, the posterior distribution $\mu_+$ on the positive examples is another Gaussian with a certain mean $\bar{z}$ and standard deviation $\sigma'$. Similarly, the posterior distribution $\mu_-$ on the negative examples is a Gaussian with mean $-\bar{z}$ and the same standard deviation $\sigma'$. At a high level, we will now argue that the adversary can make the two posterior distributions $\mu_+$ and $\mu_-$ similar enough so that the problem becomes inherently noisy, preventing any classifier from achieving a high accuracy.
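For concreteness, the posterior described above can be written out with the standard conjugate-Gaussian update. The following is a sketch under the assumption that the prior is $\theta^\star \sim \mathcal{N}(0, I)$ and that $n \ge 1$. The label-corrected samples $z_i = y_i x_i$, their average $\hat{z}$, and the explicit expression for the predictive variance are notation introduced here for illustration. Conditioned on $\theta^\star$, each $z_i$ is distributed as $\mathcal{N}(\theta^\star, \sigma^2 I)$, so

$$\theta^\star \mid S_n \;\sim\; \mathcal{N}\!\left( \frac{n}{\sigma^2 + n} \, \hat{z}, \; \frac{\sigma^2}{\sigma^2 + n} \, I \right), \qquad \hat{z} = \frac{1}{n} \sum_{i=1}^{n} z_i .$$

A fresh positive example is $\theta^\star$ plus independent $\mathcal{N}(0, \sigma^2 I)$ noise, which gives the predictive distribution

$$x \mid S_n, \, y = +1 \;\sim\; \mathcal{N}\!\left( \frac{n}{\sigma^2 + n} \, \hat{z}, \; \Big( \sigma^2 + \frac{\sigma^2}{\sigma^2 + n} \Big) I \right),$$

and the same distribution with the mean negated for $y = -1$. Under the same assumption, $\hat{z}$ has marginal covariance $(1 + \sigma^2 / n) \, I$, so the predictive mean is distributed as $\sqrt{n / (\sigma^2 + n)} \cdot v$ with $v \sim \mathcal{N}(0, I)$. This matches the scaling inside the probability in Theorem 11.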

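To this end, define the classification sets of $f_n$ as $A_+ = \{x \mid f_n(x) = +1\}$ and $A_- = \mathbb{R}^d \setminus A_+$. This allows us to write the expected robust classification error as

$$\frac{1}{2} \, \mathbb{E}_{S_n} \Big[ \Pr_{\mu_+}\big[ \mathcal{B}_\infty^\varepsilon(A_-) \big] + \Pr_{\mu_-}\big[ \mathcal{B}_\infty^\varepsilon(A_+) \big] \Big].$$

We now lower bound the inner probabilities by considering the fixed perturbation $\Delta = \bar{z}$. Note that a point $x \sim \mu_+$ is certainly misclassified if we have $\|\Delta\|_\infty \le \varepsilon$ and $x - \Delta \in A_-$. Thus the expected misclassification rate is at least $\Pr_{\mu_+}[A_- + \Delta]$. (Footnote 5: For a set $A$ and a vector $v$, we use the notation $A + v$ to denote the set $\{x + v \mid x \in A\}$.) But since $\mu_+$ is simply a translated version of $\mu_0$, this implies that

$$\Pr_{\mu_+}\big[ \mathcal{B}_\infty^\varepsilon(A_-) \big] \ge \mu_0(A_- + \Delta - \bar{z}) = \mu_0(A_-)$$

where the distribution $\mu_0$ is the centered Gaussian $\mathcal{N}(0, \sigma'^2 I)$. Similarly,

$$\Pr_{\mu_-}\big[ \mathcal{B}_\infty^\varepsilon(A_+) \big] \ge \mu_0(A_+ - \Delta + \bar{z}) = \mu_0(A_+).$$

Since $\mu_0(A_-) + \mu_0(A_+) = 1$, this implies that the adversarial perturbation misclassifies in expectation half of the positively labeled examples, which completes the proof. As mentioned above, the crucial step is that the posteriors $\mu_+$ and $\mu_-$ are similar enough so that we can shift both to the origin while still controlling the measure of the sets $A_-$ and $A_+$.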

## 4 Lower Bounds for the Bernoulli Model

For the Bernoulli model, our lower bound applies only to linear classifiers. As pointed out in Section 2.2, non-linear classifiers do not suffer an increase in sample complexity in this data model. We now give a high-level overview of our proof that the sample complexity for learning a linear classifier must increase as

$$n \ge c \, \frac{\varepsilon^2 d}{\log d}. \tag{1}$$

At first, this lower bound may look stronger than in the Gaussian case, where Theorem 6 established a lower bound of the form , i.e., with only a square root dependence on . However, it is important to note that the relevant -robustness scale for linear classifiers in the Bernoulli model is on the order of , whereas non-linear classifiers can achieve robustness for noise level up to . In particular, we prove that no linear classifier can achieve small -robust classification error for (see Lemma 30 in Appendix B.2 for details). Recall that we focus on the regime. In this case, the lower bound in Equation 1 is on the order of samples, which is comparable to the (nearly) tight bound for the Gaussian case. This is no coincidence: for our noise parameters , one can show that approximately samples suffice to recover to sufficiently good accuracy.

The starting point of our proof of the lower bound for linear classifiers is the following observation. For an example $(x, y)$, a linear classifier with parameter vector $w$ robustly classifies the point $x$ if and only if

$$\inf_{\Delta : \|\Delta\|_\infty \le \varepsilon} \langle y w, x + \Delta \rangle > 0,$$

which is equivalent to

$$\langle y w, x \rangle > \sup_{\Delta : \|\Delta\|_\infty \le \varepsilon} \langle y w, \Delta \rangle.$$

By the definition of dual norms, the supremum on the right-hand side is equal to $\varepsilon \|w\|_1$.
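The dual-norm condition above gives a closed-form robustness check for linear classifiers. The following sketch computes the worst-case margin and the perturbation that attains it. The function names are hypothetical and the code only illustrates the inequality; it is not part of the paper's proof.

```python
def robust_margin(w, x, y, eps):
    """Worst-case value of y * <w, x + delta> over ||delta||_inf <= eps.

    By l_1 / l_inf duality this equals y * <w, x> - eps * ||w||_1, so the
    point is robustly classified exactly when the result is positive.
    """
    clean = y * sum(wi * xi for wi, xi in zip(w, x))
    return clean - eps * sum(abs(wi) for wi in w)


def worst_case_perturbation(w, y, eps):
    """A minimizing delta: push every coordinate against the sign of y * w_i."""
    def sign(v):
        return (v > 0) - (v < 0)

    return [-eps * y * sign(wi) for wi in w]
```

For instance, with `w = [2.0, -1.0, 0.5]`, `x = [1.0, 1.0, 1.0]`, `y = 1`, and `eps = 0.25`, the clean margin is 1.5 and $\varepsilon \|w\|_1 = 0.875$, so the worst-case margin is 0.625 and the point is still robustly classified.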

The learning algorithm infers the parameter vector $w$ from a limited number of samples. Since these samples are noisy copies of the unknown parameters $\theta^\star$, the algorithm cannot be too certain of any single bit in $\theta^\star$ (recall that we draw $\theta^\star$ uniformly from the hypercube). We formalize this intuition in Lemma 29 (Appendix B.2) as a bound on the log odds given a sample $S$:

$$\log \frac{\Pr[\theta^\star_i = +1 \mid S]}{\Pr[\theta^\star_i = -1 \mid S]}.$$

Given such a bound, we can analyze the uncertainty in the estimate

by establishing an upper bound on the posterior for each . This in turn allow us to bound . With control over this expectation, we can then relate the prediction and the -norm via a tail bound argument. We defer the details to Appendix B.2.

## 5 Experiments

We complement our theoretical results by performing experiments on multiple common datasets.

### 5.1 Experimental setup

We consider standard convolutional neural networks and train models on datasets of varying complexity. Specifically, we study the MNIST [lecun1998mnist], CIFAR-10 [krizhevsky2009learning], and SVHN [netzer2011reading] datasets. The latter is particularly well-suited for our analysis since it contains a large number of training images (more than 600,000), allowing us to study adversarially robust generalization in the large dataset regime.

#### Model architecture.

For MNIST, we use the simple convolution architecture obtained from the TensorFlow tutorial

[TFtutorial]. In order to prevent the model from overfitting when trained on small data samples, we regularize the model by adding weight decay with parameter to the training loss. For CIFAR-10, we consider a standard ResNet model [ResnetPaper]. It has 4 groups of residual layers with filter sizes (16, 16, 32, 64) and 5 residual units each. On SVHN, we also trained a network of larger capacity (filter sizes of instead of ) in order to perform well on the harder problems with larger adversarial perturbations. All of our models achieve close to state-of-the-art performance on the respective benchmark.

#### Robust optimization.

We perform robust optimization to train our classifiers. In particular, we train against a projected gradient descent (PGD) adversary, starting from a random initial perturbation of the training datapoint (see [madry2017towards] for more details). We consider adversarial perturbations in $\ell_\infty$ norm, performing PGD updates of the form

$$x_{t+1} = \Pi_{\mathcal{B}_\infty^\varepsilon(x_0)} \big( x_t + \lambda \cdot \mathrm{sgn}(\nabla L(x_t)) \big)$$

for some step size . Here, denotes the loss of the model, while corresponds to projecting onto the ball of radius around . On MNIST, we perform steps of PGD, while on CIFAR-10 and SVHN we perform steps. We evaluate all networks against a -step PGD adversary. We choose the PGD step size to be , where denotes the maximal allowed perturbation and is the total number of steps. This allows PGD to reach the boundary of the optimization region within steps from any starting point.
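The PGD update above can be sketched in a few lines for a generic differentiable loss. This is a minimal illustration and not the training code used in the experiments: the function names are hypothetical, the gradient is supplied by the caller, and the iteration starts at the clean point rather than at a random initial perturbation so that the example is deterministic.

```python
def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def pgd_linf(x0, grad_fn, eps, step, num_steps):
    """Signed-gradient ascent on the loss, projected onto the l_inf ball
    of radius eps around x0 after every step."""
    x = list(x0)
    for _ in range(num_steps):
        g = grad_fn(x)
        x = [xi + step * sign(gi) for xi, gi in zip(x, g)]
        # Projecting onto the l_inf ball is a coordinate-wise clip.
        x = [min(max(xi, ci - eps), ci + eps) for xi, ci in zip(x, x0)]
    return x
```

As a usage example, for the linear loss $L(x) = -y \langle w, x \rangle$ the gradient is the constant vector $-y w$, and the iterates move to the corner of the ball that minimizes the margin. With a nonlinear model, `grad_fn` would instead return the gradient of the training loss with respect to the input.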

### 5.2 Empirical sample complexity evaluation

We study how the generalization performance of adversarially robust networks varies with the size of the training dataset. To do so, we train networks with a specific adversary (for some fixed ) while reducing the size of the training set. The training subsets are produced by randomly sub-sampling the complete dataset in a class-balanced fashion. When increasing the number of samples, we ensure that each dataset is a superset of the previous one.
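One simple way to obtain class-balanced subsets that are nested across sizes is to shuffle each class once and take prefixes. The sketch below illustrates this idea; the helper name `nested_balanced_subsets` and the toy labels are hypothetical, and the text does not specify how the subsets were actually generated.

```python
import random
from collections import defaultdict


def nested_balanced_subsets(labels, sizes_per_class, seed=0):
    """Return one index list per requested size, each a superset of the previous.

    Every class is shuffled once; taking prefixes of the shuffled lists makes
    the subsets class-balanced and nested by construction.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
    subsets = []
    for k in sorted(sizes_per_class):
        subset = []
        for label in sorted(by_class):
            subset.extend(by_class[label][:k])
        subsets.append(subset)
    return subsets


labels = [i % 3 for i in range(30)]  # toy labels: 3 classes, 10 examples each
small, medium, large = nested_balanced_subsets(labels, [2, 5, 10])
print(len(small), len(medium), len(large))
```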

We then evaluate the robustness of each trained network to perturbations of varying magnitude. For each choice of training set size and fixed attack magnitude, we select the best performance achieved across all hyperparameter settings (training perturbations and model size). On all three datasets, we observed that the best natural accuracy is usually achieved for the naturally trained network, while the best adversarial accuracy for almost all values of the test perturbation was achieved when training with the largest training perturbation. We maximize over the hyperparameter settings since we are not interested in the performance of a specific model, but rather in the inherent generalization properties of the dataset independently of the classifier used. The results of these experiments are shown in Figure 2 for each dataset.

The plots clearly demonstrate the need for more data to achieve adversarially robust generalization. For any fixed test set accuracy, the number of samples needed is significantly higher for robust generalization. In the SVHN experiments (where we have sufficient training samples to observe plateauing behavior), the natural accuracy reaches its maximum with significantly fewer samples than the adversarial accuracy. We report more details of our experiments in Section C of the supplementary material.

### 5.3 Thresholding experiments

Motivated by our theoretical study of the Bernoulli model, we investigate whether thresholding can also improve the sample complexity of robust generalization against an adversary on a real dataset. MNIST is a natural candidate here since the images are nearly black-and-white and hence lie close to vertices of a hypercube (as in the Bernoulli model). This is further motivated by experimental evidence indicating that adversarially robust networks on MNIST learn such thresholding filters when trained adversarially [madry2017towards].

We repeat the sample complexity experiments performed in Section 5.2 with networks where thresholding filters are explicitly encoded in the model. Here, we replace the first convolutional layer with a fixed thresholding layer consisting of two channels, and , where is the input image. Results from networks trained with this thresholding layer are shown in Figure 3. For naturally trained networks, we use a value of for the thresholding filters, whereas for adversarially trained networks we set . For each data subset size and test perturbation , we plot the best test accuracy achieved over networks trained with different thresholding filters, i.e., different values of . We separately show the effect of explicit thresholding in such networks when they are trained naturally or adversarially using PGD. As predicted by our theory, the networks achieve good adversarially robust generalization with significantly fewer samples when thresholding filters are added. Further, note that adding a simple thresholding layer directly yields nearly state-of-the-art robustness against moderately strong adversaries (), without any other modifications to the model architecture or training algorithm. It is also worth noting that the thresholding filters could have been learned by the original network architecture, and that this modification only decreases the capacity of the model. Our findings emphasize network architecture as a crucial factor for learning adversarially robust networks from a limited number of samples.

We also experimented with thresholding filters on the CIFAR-10 dataset, but did not observe any significant difference from the standard architecture. This agrees with our theoretical understanding that thresholding helps primarily in the case of (approximately) binary datasets.

## 6 Related Work

Due to the large body of work on adversarial robustness, we focus on related papers that also provide theoretical explanations for adversarial examples. Compared to prior work, the main difference of our approach is the focus on generalization. Most related papers study robustness either without the learning context, or in the limit as the number of samples approaches infinity. As a result, finite sample phenomena do not arise in these theoretical approaches. As we have seen in Figure 1, adversarial examples are currently a failure of generalization from a limited training set. Hence we believe that studying robust generalization is an insightful avenue for understanding adversarial examples.


• wang2017adversarial study the adversarial robustness of nearest neighbor classifiers. In contrast to our work, the authors give theoretical guarantees for a specific classification algorithm. We focus on the inherent sample complexity of adversarially robust generalization independently of the learning method. Moreover, our results hold for finite sample sizes while the results in [wang2017adversarial] are only asymptotic.

• Recent work by gilmer2017spheres explores a specific distribution where robust learning is empirically difficult with overparametrized neural networks. (It is worth noting that the distribution in [gilmer2017spheres] has only one degree of freedom. Hence we conjecture that the observed difficulty of robust learning in their setup is due to the chosen model class and not due to an information-theoretic limit as in our work.) The main phenomenon is that even a small natural error rate on their dataset translates to a large adversarial error rate. Our results give a more nuanced picture that involves the sample complexity required for generalization. In our data models, it is possible to achieve an error rate that is essentially zero by using a very small number of samples, whereas the adversarial error rate is still large unless we have seen a lot of samples.

• FMF16 relate the robustness of linear and non-linear classifiers to adversarial and (semi-)random perturbations. Their work studies the setting where the classifier is fixed and does not encompass the learning task. We focus on generalization aspects of adversarial robustness and provide upper and lower bounds on the sample complexity. Overall, we argue that adversarial examples are inherent to the statistical setup and not necessarily a consequence of a concrete classifier model.

• The work of XCM2009 establishes a connection between robust optimization and regularization for linear classification. In particular, they show that robustness to a specific perturbation set is exactly equivalent to the standard support vector machine. The authors give asymptotic consistency results under a robustness condition, but do not provide any finite sample guarantees. In contrast, our work considers specific distributional models where we can demonstrate a clear gap between robust and standard generalization.

• PMSW18 discuss adversarial robustness at the population level. They assume the existence of an adversary that can significantly increase the loss for any hypothesis in the hypothesis class. By definition, robustness against adversarial perturbations is impossible in this regime. As demonstrated in Figure 1, we instead conjecture that current classification models are not robust to adversarial examples because they fail to generalize. Hence our results concern generalization from a finite number of samples. We show that even when the hypothesis class is large enough to achieve good robust classification error, the sample complexity of robust generalization can still be significantly bigger than that of standard generalization.

• In a recent paper, FFF18 also give provable lower bounds for adversarial robustness. There are several important differences between their work and ours. At a high level, the results in [FFF18] state that there are fundamental limits for adversarial robustness that apply to any classifier. As pointed out by the authors, their bounds also apply to the human visual system. However, an important aspect of adversarial examples is that they often fool current classifiers, yet are still easy to recognize for humans. Hence we believe that the approach in [FFF18] does not capture the underlying phenomenon since it does not distinguish between the robustness of current artificial classifiers and the human visual system.

Moreover, the lower bounds in [FFF18] do not involve the training data and consequently apply in the limit where an infinite number of samples is available. In contrast, our work investigates how the amount of available training data affects adversarial robustness. As we have seen in Figure 1, adversarial robustness is currently an issue of generalization. In particular, we can train classifiers that achieve a high level of robustness on the CIFAR-10 training set, but this robustness does not transfer to the test set. Therefore, our perspective based on adversarially robust generalization more accurately reflects the current challenges in training robust classifiers.

Finally, FFF18 utilize the notion of a latent space for the data distribution in order to establish lower bounds that apply to any classifier. While the existence of generative models such as GANs provides empirical evidence for this assumption, we note that it does not suffice to accurately describe the robustness phenomenon. For instance, there are multiple generative models that produce high-quality samples for the MNIST dataset, yet there are now also several successful defenses against adversarial examples on MNIST. As we have shown in our work, the fine-grained properties of the data distribution can have significant impact on how hard it is to learn a robust classifier.

#### Margin-based theory.

There is a long line of work in machine learning on exploring the connection between various notions of margin and generalization, e.g., see shaishalev2014 and references therein. In this setting, the margin, i.e., how robustly classifiable the data is for -bounded classifiers, enables dimension-independent control of the sample complexity. However, the sample complexity in concrete distributional models can often be significantly smaller than what the margin implies. As we will see next, standard margin-based bounds do not suffice to demonstrate a gap between robust and benign generalization for the distributional models studied in our work.

First, we briefly remind the reader about standard margin-based results (see Theorem 15.4 in shaishalev2014 for details). For a dataset that has bounded norm and margin , the classification error of the hard-margin SVM scales as

where is the number of samples. To illustrate this bound, consider the Gaussian model in the regime where a single sample suffices to learn a classifier with low error (see Theorem 4). The standard bound on the norm of an i.i.d. Gaussian vector shows that we have a data norm bound with high probability. While the Gaussian model is not strictly separable in any regime, we can still consider the probability that a sample achieves at least a certain margin:

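$$\mathbb{P}_{z \sim \mathcal{N}(0, \sigma^2 I)}\left[\frac{\langle z, \theta^\star \rangle}{\|\theta^\star\|_2} \geq \rho\right] \geq 1 - \delta.$$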

A simple calculation shows that for (as in our earlier bounds), the Gaussian model does not achieve margin even at the quantile . Hence the margin-based bound would indicate a sample complexity of already for standard generalization, which obscures the dichotomy between standard and robust sample complexity.

#### Robust statistics.

An orthogonal line of work in robust statistics studies robustness of estimators to corruption of training data [Huber81a]. This notion of robustness, while also important, is not directly relevant to the questions addressed in our work.

## 7 Discussion and Future Directions

The vulnerability of neural networks to adversarial perturbations has recently been a source of much discussion and is still poorly understood. Different works have argued that this vulnerability stems from their discontinuous nature [SzegedyZSBEGF13], their linear nature [GoodfellowSS14], or is a result of high-dimensional geometry and independent of the model class [gilmer2017spheres]. Our work gives a more nuanced picture. We show that for a natural data distribution (the Gaussian model), the model class we train does not matter and a standard linear classifier achieves optimal robustness. However, robustness also strongly depends on properties of the underlying data distribution. For other data models (such as MNIST or the Bernoulli model), our results demonstrate that non-linearities are indispensable to learn from few samples. This dichotomy provides evidence that defenses against adversarial examples need to be tailored to the specific dataset and hence may be more complicated than a single, broad approach. Understanding the interactions between robustness, classifier model, and data distribution from the perspective of generalization is an important direction for future work.

What do our results mean for robust classification of real images? Our Gaussian lower bound implies that if an algorithm works for all (or most) settings of the unknown parameter $\theta^\star$, then achieving strong $\ell_\infty$-robustness requires a sample complexity increase that is polynomial in the dimension. There are a few different ways this lower bound could be bypassed. After all, it is conceivable that the noise scale $\sigma$ is significantly smaller for real image datasets, making robust classification easier. And even if that were not the case, a good algorithm could work for the parameters $\theta^\star$ that correspond to real datasets while not working for most other parameters. To accomplish this, the algorithm would implicitly or explicitly have prior information about the correct $\theta^\star$. While some prior information is already incorporated in the model architectures (e.g., convolutional and pooling layers), the conventional wisdom is often not to bias the neural network with our priors. Our work suggests that there are trade-offs with robustness here and that adding more prior information could help to learn more robust classifiers.

The focus of our paper is on adversarial perturbations in a setting where the test distribution (before the adversary’s action) is the same as the training distribution. While this is a natural scenario from a security point of view, other setups can be more relevant in different robustness contexts. For instance, we may want a classifier that is robust to small changes between the training and test distribution. This can be formalized as the classification accuracy on unperturbed examples coming from an adversarially modified distribution. Here, the power of the adversary is limited by how much the test distribution can be modified, and the adversary is not allowed to perturb individual samples coming from the modified test distribution. Interestingly, our lower bound for the Gaussian model also applies to such worst-case distributional shifts. In particular, if the adversary is allowed to shift the mean by a vector in , our proof sketched in Section 3 transfers to the distribution shift setting. Since the lower bound relies only on a single universal perturbation, this perturbation can also be applied directly to the mean vector.

#### Future directions.

Several questions remain. We now provide a list of concrete directions for future work on robust generalization.

Stronger lower bounds.

An interesting aspect of adversarial examples is that the adversary can often fool the classifier on most inputs [SzegedyZSBEGF13, CarliniW16a]. While our results show a lower bound for classification error , it is conceivable that misclassification rates much closer to 1 are unavoidable for at least one of the two classes (or equivalently, when the adversary is allowed to pick the class label). In order to avoid degenerate cases such as achieving robustness by being the constant classifier, it would be interesting to study regimes where the classifier has high standard accuracy but does not achieve robustness yet. In such a regime, does good standard accuracy imply that the classifier is vulnerable to adversarial perturbations on almost all inputs?

Different perturbation sets.

Depending on the problem setting, different perturbation sets are relevant. Due to the large amount of empirical work on robustness, our paper has focused on such perturbations. From a security point of view, we want to defend against perturbations that are imperceptible to humans. While this is not a well-defined concept, the class of small -norm perturbations should be contained in any reasonable definition of imperceptible perturbations. However, changes in different norms [SzegedyZSBEGF13, MoosDez16, CarliniW16a], sparse perturbations [PapernotMJFCS16, CarliniW16, NK17, SVS17], or mild spatial transformations can also be imperceptible to a human [CJBWMD18]. In less adversarial settings, more constrained and lower-dimensional perturbations such as small rotations and translations may be more appropriate [ETTSM17]. Overall, understanding the sample complexity implications of different perturbation sets is an important direction for future work.

Further notions of test time robustness.

As mentioned above, less adversarial forms of robustness may be better suited to model challenges arising outside security. How much easier is it to learn a robust classifier in more benign settings? This question is naturally related to problems such as transfer learning and domain adaptation.

Our results directly apply to two concrete distributional models. While the results already show interesting phenomena and are predictive of behavior on real data, understanding the robustness properties for a broader class of distributions is an important direction for future work. Moreover, it would be useful to understand what general properties of distributions make robust generalization hard or easy.

Wider sample complexity separations.

In our work, we show a separation of between the standard and robust sample complexity for the Gaussian model. It is open whether larger gaps are possible. Note that for large adversarial perturbations, the data may no longer be robustly separable which leads to trivial gaps in sample complexity, simply because the harder robust generalization problem is impossible to solve. Hence this question is mainly interesting in the regime where a robust classifier exists in the model class of interest.

Robustness in the PAC model.

Our focus has been on robust learning for specific distributions without any limitations on the hypothesis class. A natural dual perspective is to investigate robust learning for specific hypothesis classes, as in the probably approximately correct (PAC) framework. For instance, it is well known that the sample complexity of learning a half space in dimensions is . Does this sample complexity also suffice to learn in the presence of an adversary at test time? While robustness to adversarial training noise has been studied in the PAC setting (e.g., see [KL93, KSS94, BEK99]), we are not aware of similar work on test time robustness.

## Acknowledgements

Ludwig Schmidt is supported by a Google PhD Fellowship. During this research project, Ludwig was also a research fellow at the Simons Institute for the Theory of Computing, an intern in the Google Brain team, and a visitor at UC Berkeley. Shibani Santurkar is supported by the National Science Foundation (NSF) under grants IIS-1447786, IIS-1607189, and CCF-1563880, and the Intel Corporation. Dimitris Tsipras was supported in part by the NSF grant CCF-1553428. Aleksander Mądry was supported in part by an Alfred P. Sloan Research Fellowship, a Google Research Award, and the NSF grant CCF-1553428.

## Appendix A Omitted proofs for the Gaussian model

### A.1 Upper bounds

We begin with standard results about (sub)-Gaussian concentration in Fact 12 and Lemmas 13 to 16. These results show that a class-weighted average of sufficiently many samples from the Gaussian model achieves a large inner product with the unknown mean vector. Lemma 17 then relates the inner product between a linear classifier and the mean vector to the classification accuracy. Theorem 18 uses the lemmas to establish our main theorem for standard generalization. Corollary 19 instantiates the bound for learning from one sample. After further simplification, this yields Theorem 4 from the main text.

For robust generalization, we first relate the inner product between a linear classifier and the unknown mean vector to the robust classification accuracy in Lemma 20. Similar to the standard classification error, Theorem 21 and Corollary 22 then yield our upper bounds for robust generalization. Simplifying Corollary 22 further gives Theorem 5 from the main text.

###### Fact 12.

Let be drawn from a centered spherical Gaussian, i.e., where . Then we have   .

###### Proof.

We refer the reader to Example 5.7 in [BLM2013] for a reference of this standard result. Combined with , which is obtained from Jensen’s Inequality, the aforementioned example gives the desired upper tail bound. ∎

###### Lemma 13.

Let be drawn i.i.d. from a spherical Gaussian, i.e., where and . Let be the sample mean vector . Finally, let be the target probability. Then we have

###### Proof.

Since each has the same distribution as for , we can bound the desired tail probability for

$$\bar{z} = \frac{1}{n}\sum_{i=1}^{n} (\mu + g_i) = \mu + \frac{1}{n}\sum_{i=1}^{n} g_i.$$

Moreover, the average of the $g_i$ has the same distribution as . Hence it suffices to bound the tail of . For any , applying the triangle inequality then gives

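$$\mathbb{P}\left[\|\bar{z}\|_2 \geq \|\mu\|_2 + c\right] = \mathbb{P}\left[\|\mu + \bar{g}\|_2 \geq \|\mu\|_2 + c\right] \leq \mathbb{P}\left[\|\bar{g}\|_2 \geq c\right].$$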

Setting with

$$t = \sigma\sqrt{\frac{2\log(1/\delta)}{n}}$$

and substituting into Fact 12 then gives the desired result. ∎

For convenient use in our later theorems, we instantiate Lemma 13 with the parameters most relevant for our Gaussian model. In particular, the norm of the mean vector is and we are interested in up to exponentially small failure probability (but not necessarily smaller).

###### Lemma 14.

Let be drawn i.i.d. from a spherical Gaussian with mean norm , i.e., where , , and . Let be the sample mean vector . Then we have

###### Proof.

We substitute into Lemma 13 with and . ∎

###### Lemma 15.

Let be drawn i.i.d. from a spherical Gaussian, i.e., where and . Let be the mean vector . Finally, let be the target probability. Then we have

$$\mathbb{P}\left[\langle \bar{z}, \mu \rangle \leq \|\mu\|_2^2 - \sigma\|\mu\|_2\sqrt{\frac{2\log(1/\delta)}{n}}\right] \leq \delta.$$

###### Proof.

As in Lemma 13, we use the fact that has the same distribution as where . For any , this allows us to simplify the tail event to

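$$\mathbb{P}\left[\langle \bar{z}, \mu \rangle \leq \|\mu\|_2^2 - t\right] = \mathbb{P}\left[\langle \bar{g}, \mu \rangle \leq -t\right].$$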

The right hand side can now be simplified to where . Invoking the standard sub-Gaussian tail bound

and substituting then gives the desired result. ∎
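As a sanity check, the tail bound of Lemma 15 can be compared against a small Monte Carlo simulation. The sketch below assumes samples of the form $z_i = \mu + g_i$ with $g_i \sim \mathcal{N}(0, \sigma^2 I)$, as used in the proof; the function name `lemma15_failure_rate` and all parameter values are hypothetical choices for illustration.

```python
import math
import random


def lemma15_failure_rate(mu, sigma, n, delta, trials, seed=0):
    """Empirical frequency of the event <z_bar, mu> <= ||mu||^2 - sigma*||mu||*sqrt(2*log(1/delta)/n)."""
    rng = random.Random(seed)
    norm_sq = sum(m * m for m in mu)
    slack = sigma * math.sqrt(norm_sq) * math.sqrt(2.0 * math.log(1.0 / delta) / n)
    threshold = norm_sq - slack
    failures = 0
    for _ in range(trials):
        inner = 0.0
        for m in mu:
            # Coordinate of the sample mean of n draws from N(m, sigma^2).
            coord_mean = sum(rng.gauss(m, sigma) for _ in range(n)) / n
            inner += coord_mean * m
        if inner <= threshold:
            failures += 1
    return failures / trials


rate = lemma15_failure_rate(mu=[1.0, -1.0, 1.0, 1.0], sigma=2.0, n=5, delta=0.1, trials=2000)
print(rate)
```

The empirical failure rate should stay below `delta`; it is typically noticeably smaller, since the sub-Gaussian tail bound is not tight.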

###### Lemma 16.

Let be drawn i.i.d. from a spherical Gaussian with mean norm , i.e., where , , and . Let be the sample mean vector and let be the unit vector in the direction of , i.e., . Then we have

###### Proof.

We invoke Lemma 14, which yields

with probability . Moreover, we invoke Lemma 15 with and to get

$$\langle \bar{z}, \mu \rangle \geq d - \frac{d}{2\sqrt{n}}$$

with probability . We continue under both events, which yields the desired overall failure probability .

We now have

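$$\langle \hat{w}, \mu \rangle = \frac{\langle \bar{z}, \mu \rangle}{\|\bar{z}\|_2} \geq \frac{2\sqrt{n} - 1}{2\sqrt{n} + 4\sigma}\sqrt{d}$$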

as stated in the lemma. ∎

###### Lemma 17.

Let be drawn from a spherical Gaussian, i.e., where and . Moreover, let be an arbitrary unit vector with where . Then we have

###### Proof.

Since has the same distribution as where , we can bound the tail event as

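$$\mathbb{P}\left[\langle w, z \rangle \leq \rho\right] = \mathbb{P}\left[\langle w, \mu + g \rangle \leq \rho\right] = \mathbb{P}\left[\langle w, g \rangle \leq \rho - \langle w, \mu \rangle\right].$$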

The inner product $\langle w, g \rangle$ is distributed as a univariate normal $\mathcal{N}(0, \sigma^2)$ because the vector $w$ has unit norm. Hence we can invoke the standard sub-Gaussian tail bound to get the desired tail probability. ∎

###### Theorem 18 (Standard generalization in the Gaussian model.).

Let be drawn i.i.d. from a -Gaussian model with . Let be the unit vector in the direction of , i.e., . Then with probability at least , the linear classifier has classification error at most

###### Proof.

Let and note that each is independent and has distribution . Hence we can invoke Lemma 16 and get

$$\langle \hat{w}, \theta^\star \rangle \geq \frac{2\sqrt{n} - 1}{2\sqrt{n} + 4\sigma}\sqrt{d}$$

with probability at least as stated in the theorem.

Next, unwrapping the definition of allows us to write the classification error of as

$$\mathbb{P}\left[f_{\hat{w}}(x) \neq y\right] = \mathbb{P}\left[\langle \hat{w}, \theta^\star \rangle \leq 0\right].$$

Invoking Lemma 17 with then gives the desired bound. ∎

###### Corollary 19 (Generalization from a single sample.).

Let be drawn from a -Gaussian model with

$$\sigma \leq \frac{d^{1/4}}{5\sqrt{\log(1/\beta)}}.$$

Let be the unit vector . Then with probability at least , the linear classifier has classification error at most .

###### Proof.

Invoking Theorem 18 with gives a classification error bound of

It remains to show that .

We now bound the denominator in . First, we have

$$2 + 4\sigma \leq 2d^{1/4} + \frac{4}{5}d^{1/4} \leq 3d^{1/4}.$$

Next, we bound the entire denominator as

$$2(2 + 4\sigma)^2\sigma^2 \leq 2 \cdot 9\sqrt{d} \cdot \frac{\sqrt{d}}{25\log(1/\beta)} \leq \frac{d}{\log(1/\beta)}$$

which yields the desired classification error when substituted back into . ∎

###### Lemma 20.

Assume a -Gaussian model. Let , be robustness parameters, and let be a unit vector such that , where is the dual norm of . Then the linear classifier has -robust classification error at most

###### Proof.

Per Definition 3, we have to upper bound the quantity

$$\mathbb{P}_{(x,y)\sim\mathcal{P}}\left[\exists x' \in B(x) : f_{\hat{w}}(x') \neq y\right].$$

For linear classifiers, we can rewrite this event as follows:

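$$\begin{aligned}
\mathbb{P}_{(x,y)\sim\mathcal{P}}\left[\exists x' \in B_p^\varepsilon(x) : f_{\hat{w}}(x') \neq y\right]
&= \mathbb{P}_{(x,y)\sim\mathcal{P}}\left[\exists x' \in B_p^\varepsilon(x) : \langle y \cdot x', \hat{w} \rangle \leq 0\right] \\
&= \mathbb{P}_{(x,y)\sim\mathcal{P}}\left[\exists \Delta \in B_p^\varepsilon(0) : \langle y \cdot (x + \Delta), \hat{w} \rangle \leq 0\right] \\
&= \mathbb{P}_{(x,y)\sim\mathcal{P}}\left[\min_{\Delta \in B_p^\varepsilon(0)} \langle y \cdot (x + \Delta), \hat{w} \rangle \leq 0\right] \\
&= \mathbb{P}_{(x,y)\sim\mathcal{P}}\left[\langle y \cdot x, \hat{w} \rangle + \min_{\Delta \in B_p^\varepsilon(0)} \langle y \cdot \Delta, \hat{w} \rangle \leq 0\right].
\end{aligned}$$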

We now use the definition of the dual norm. Note that for any , we also have . Since , we can drop the factor. Overall, we get

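$$\mathbb{P}_{(x,y)\sim\mathcal{P}}\left[\langle y \cdot x, \hat{w} \rangle + \min_{\Delta \in B_p^\varepsilon(0)} \langle y \cdot \Delta, \hat{w} \rangle \leq 0\right] = \mathbb{P}_{(x,y)\sim\mathcal{P}}\left[\langle y \cdot x, \hat{w} \rangle - \varepsilon\|\hat{w}\|_p^* \leq 0\right].$$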

By assumption in the lemma, we have . Hence we can invoke Lemma 17 with and to get the desired bound on the robust classification error. ∎
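The dual-norm step in this proof can be checked numerically for the case $p = \infty$, where the dual norm is the $\ell_1$ norm. The following sketch assumes labels $y \in \{-1, +1\}$ and uses hypothetical helper names (`robust_margin_linf`, `worst_case_delta`) and toy values; it compares the closed-form worst case against an explicit minimizer and against random perturbations in the ball.

```python
import random


def robust_margin_linf(w, x, y, eps):
    """Closed form of min over ||delta||_inf <= eps of <y * (x + delta), w>."""
    clean = y * sum(wi * xi for wi, xi in zip(w, x))
    return clean - eps * sum(abs(wi) for wi in w)  # the l_1 norm is dual to l_inf


def worst_case_delta(w, y, eps):
    """A perturbation attaining the minimum: move each coordinate against y * w."""
    return [-eps * y * ((wi > 0) - (wi < 0)) for wi in w]


w, x, y, eps = [0.6, -0.8, 0.0], [1.0, -2.0, 3.0], 1, 0.25
closed_form = robust_margin_linf(w, x, y, eps)
delta = worst_case_delta(w, y, eps)
attained = y * sum(wi * (xi + di) for wi, xi, di in zip(w, x, delta))

# No random perturbation inside the ball should do better than the closed form.
rng = random.Random(0)
sampled = []
for _ in range(1000):
    d = [rng.uniform(-eps, eps) for _ in w]
    sampled.append(y * sum(wi * (xi + di) for wi, xi, di in zip(w, x, d)))

print(closed_form, attained, min(sampled))
```

The classifier is robustly correct on $(x, y)$ exactly when the closed-form value is positive, which is the event bounded in the lemma.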

###### Theorem 21.

Let be drawn i.i.d. from a -Gaussian model with . Let be the unit vector in the direction of , i.e., . Then with probability at least , the linear classifier has -robust classification error at most if

$$\varepsilon \leq \frac{2\sqrt{n} - 1}{2\sqrt{n} + 4\sigma} - \frac{\sigma\sqrt{2\log(1/\beta)}}{\sqrt{d}}.$$

###### Proof.

Let and note that each is independent and has distribution . Hence we can invoke Lemma 16 and get

$$\langle \hat{w}, \theta^\star \rangle \geq \frac{2\sqrt{n} - 1}{2\sqrt{n} + 4\sigma}\sqrt{d}$$

with probability at least as stated in the theorem.

Since , we have . The bound on in the theorem allows us to invoke Lemma 20. This yields an -robust classification error of at most

Since

this simplifies to the robust classification error stated in the theorem. ∎
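To get a feeling for how the admissible perturbation size in Theorem 21 grows with the number of samples, one can evaluate the right-hand side of the condition on $\varepsilon$ directly. The sketch below does only that; the function name `max_robust_eps` and the parameter values are arbitrary illustrations, and the theorem's other preconditions are not checked.

```python
import math


def max_robust_eps(n, sigma, d, beta):
    """Right-hand side of the condition on epsilon in Theorem 21."""
    inner_product_term = (2.0 * math.sqrt(n) - 1.0) / (2.0 * math.sqrt(n) + 4.0 * sigma)
    noise_term = sigma * math.sqrt(2.0 * math.log(1.0 / beta)) / math.sqrt(d)
    return inner_product_term - noise_term


# Hypothetical parameter values, chosen only to show the trend in n.
for n in (1, 4, 16, 64):
    print(n, round(max_robust_eps(n, sigma=1.0, d=10_000, beta=0.01), 4))
```

For fixed $\sigma$, $d$, and $\beta$, the first term increases toward 1 as $n$ grows, so the bound admits larger perturbations with more samples.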

###### Corollary 22.

Let be drawn i.i.d. from a -Gaussian model with and . Let be the unit vector in the direction of