Deep neural networks constitute an extremely powerful architecture for machine learning and have achieved enormous success in several fields such as speech recognition, computer vision and natural language processing, where they can often outperform human abilities mnih2015human; lecun2015deep; radford2015unsupervised; schmidhuber2015deep; goodfellow2016deep. In 2014, a very surprising property of deep neural networks emerged in the context of image classification goodfellow2014explaining: an extremely small perturbation can change the label of a correctly classified image. This property is particularly concerning, since it may be exploited by a malicious adversary to fool a machine learning algorithm by steering its output. For this reason, methods to find such perturbed inputs, or adversarial examples, have been named adversarial attacks. The problem further captured the attention of the deep learning community when it was discovered that real-world images taken with a camera can also constitute adversarial examples kurakin2018adversarial; sharif2016accessorize; brown2017adversarial; evtimov2017robust. To study adversarial attacks, two lines of research have been developed: one aims to develop efficient algorithms to find adversarial examples su2019one; athalye2018synthesizing; liu2016delving, and the other aims to make deep neural networks more robust against adversarial attacks madry2018towards; tsipras2018robustness; nakkiran2019adversarial; lecuyer2019certified; gilmer2019adversarial; algorithms to compute the robustness of a given trained deep neural network against adversarial attacks have also been developed li2019certified; jordan2019provable.
Several theories have been proposed to explain the phenomenon of adversarial examples raghunathan2018certified; wong2018provable; xiao2018training; cohen2019certified; schmidt2018adversarially; tanay2016boundary; kim2019bridging; fawzi2016robustness; shamir2019simple; bubeck2019adversarial; ilyas2019adversarial. One of the most prominent theories states that adversarial examples are an unavoidable feature of the high-dimensional geometry of the input space: Refs. gilmer2018adversarial; fawzi2018adversarial; shafahi2019adversarial; mahloujifar2019curse show that, if the classification error is finite, the classification of a correctly classified input can be changed with an adversarial perturbation of relative size $O(1/\sqrt{n})$, where $n$ is the dimension of the input space.
In this paper, we study the properties of adversarial examples for wide, deep neural networks with random weights and biases. Our main result, presented in section 3, is a probabilistic robustness guarantee on the $\ell^2$ distance (later extended to all the $\ell^p$ distances)\footnote{The $\ell^p$ norm of a vector $x\in\mathbb{R}^n$ is $\|x\|_p=\left(\sum_{i=1}^n|x_i|^p\right)^{1/p}$.} of a given input from the closest classification boundary. We prove that the $\ell^2$ distance from the closest classification boundary of any given input $x\in\mathbb{R}^n$ whose entries are $\Theta(1)$
is with high probability at least $\tilde{\Omega}(\|x\|_2/\sqrt{n})$ (the tilde means that logarithmic factors are hidden), i.e., the $\ell^2$ distance from $x$ of any adversarial example is larger than $\tilde{\Omega}(\|x\|_2/\sqrt{n})$. Since $\|x\|_2=\Theta(\sqrt{n})$ for inputs with entries $\Theta(1)$, our result implies that the relative size of any adversarial perturbation is at least $\tilde{\Omega}(1/\sqrt{n})$. This lower bound on the size of adversarial perturbations matches the upper bound imposed by the high-dimensional geometry proven in Refs. gilmer2018adversarial; fawzi2018adversarial; shafahi2019adversarial; mahloujifar2019curse. Therefore, our result proves that
$\tilde{\Theta}(\|x\|_2/\sqrt{n})$ is the universal scaling of the minimum size of adversarial perturbations. We also prove that, for any given unit vector $v$, with high probability all the inputs $x+tv$ with $0\le t\le T$ have the same classification as $x$, where $T=\tilde{\Omega}(\sqrt{n})$. Since $\|x\|_2=\Theta(\sqrt{n})$, a remarkable consequence of this result is that a finite fraction of the distance to the origin can be traveled without encountering any classification boundary.
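As a quick numerical illustration of the norms involved (our own sketch, not part of the analysis): for inputs with entries of order one, $\|x\|_2$ grows as $\sqrt{n}$, so a perturbation of constant $\ell^2$ size has relative size proportional to $1/\sqrt{n}$.

```python
import numpy as np

# Illustration: for inputs with entries of order one, the Euclidean norm grows
# as sqrt(n), so a perturbation of constant l2 size has relative size ~ 1/sqrt(n).
rng = np.random.default_rng(0)

for n in [100, 1_000, 10_000]:
    x = rng.uniform(0.0, 1.0, size=n)   # entries of order one
    norm = np.linalg.norm(x)            # ~ sqrt(n/3) for uniform [0, 1] entries
    relative = 1.0 / norm               # relative size of a unit-l2 perturbation
    print(n, norm / np.sqrt(n), relative * np.sqrt(n))
```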
Our results encompass a wide variety of network architectures, namely any combination of convolutional or fully connected layers with nonlinear activation, skipped connections and pooling (see section 2).
Our proof builds on the equivalence between deep neural networks with random weights and biases and Gaussian processes lee2018deep; yang2019wide. We prove that the same probabilistic robustness guarantees for the adversarial distance also apply to a broad class of Gaussian processes, namely those whose variance is lower bounded by the squared Euclidean norm of the input and whose associated kernel has a Lipschitz feature map, a result that can be of independent interest.
In section 4 we experimentally validate our theoretically predicted scaling of the adversarial distance for random deep neural networks. In subsection 4.1 we experimentally study the adversarial distance for deep neural networks trained on the MNIST and CIFAR10 datasets. In both cases, the training does not change the order of magnitude of the adversarial distance. While for MNIST the adversarial distances for random and trained networks are very close, for CIFAR10 the training decreases the adversarial distance by roughly half an order of magnitude. As discussed further in subsection 4.1, this can be ascribed to the different nature of the CIFAR10 data with respect to the MNIST data.
The results of our experiments lead us to conjecture that the proof of our adversarial robustness guarantees can be extended to trained deep neural networks. This extension would open the way to the first thorough theoretical study of the relationship between a network architecture and its robustness to adversarial attacks, thus helping to identify which changes in the architecture yield the best improvements.
1.1 Related works
The equivalence between neural networks with random weights and Gaussian processes in the limit of infinite width has been known for a long time in the case of fully connected neural networks with one hidden layer neal1996priors; williams1997computing, but it has only recently been extended to multi-layer schoenholz2016deep; pennington2018emergence; lee2018deep; matthews2018gaussian; poole2016exponential and convolutional deep neural networks garriga-alonso2018deep; xiao2018dynamical; novak2019bayesian. The equivalence is now proved for practically all the existing neural network architectures yang2019wide, and has been extended to trained deep neural networks jacot2018neural; lee2019wide; yang2019scaling; arora2019exact; huang2019dynamics; li2019enhanced; wei2019regularization; cao2019generalization, including adversarial training gao2019convergence.
Robustness guarantees for Bayesian inference with Gaussian processes were first considered in cardelli2019robustness. The smoothness of the feature map of a kernel plays a key role in machine learning applications mallat2012group; oyallon2015deep; bruna2013invariant; bietti2019group, and kernels associated with deep neural networks have been studied from this point of view bietti2019inductive.
In the setting of binary classification of bit strings, the Hamming distance of a given input from the closest classification boundary has been theoretically studied in de2019random, where its scaling with the input size has been determined.
Our inputs are $d$-dimensional images considered as elements of $\left(\mathbb{R}^{c_0}\right)^{P_0}$, where $c_0$ is the number of input channels (e.g., $c_0=3$ for Red-Green-Blue images) and $P_0$ is the set of the input pixels, assumed for simplicity to be periodic. The case $d=2$ recovers standard 2D images. For the sake of a simpler notation, we will sometimes consider the input space as $\mathbb{R}^n$, with $n=c_0\,|P_0|$.
Our architecture allows for any combination of convolutional layers, fully connected layers, skipped connections and pooling. For the sake of a simpler notation, we treat each of the above operations as a layer, even if it does not include any nonlinear activation. For simplicity, we assume that the nonlinear activation function is the ReLU. Our results can be easily extended to other activation functions.
For any layer index $l$ and any input $x$, let $c_l$ be the number of channels and $P_l$ the set of pixels of the output of the $l$-th layer. The layer transformations have the following mathematical expression:
Input layer: We have and
where is the convolutional patch of the first layer. We assume for simplicity that .
Nonlinear layer: If the $(l+1)$-th layer is a nonlinear layer, we have and
where is the activation function and is the convolutional patch of the layer. We assume for simplicity that . Fully connected layers are recovered by .
Skipped connection: If the $(l+1)$-th layer is a skipped connection, we have , and
where is such that the sum in (3) is well defined, i.e., and . For the sake of a simple proof, we assume that the $l$-th layer is either a convolutional or a fully connected layer.
Pooling: If the $(l+1)$-th layer is a pooling layer, we have , and is a partition of , i.e., the elements of are disjoint subsets of whose union is equal to . We assume for simplicity that the $l$-th layer is a convolutional layer and that all the elements of have the same cardinality, which is therefore equal to . We have
Flattening layer: Let the $(l+1)$-th layer be the flattening layer. We notice that we include a fully connected layer directly after the flattening as part of this layer. We have and
Output layer: The final output of the network is , and the output label is . We introduce the other components of for the sake of a simpler notation in the proof of Theorem 2.
Our random deep neural networks draw all the weights and the biases from independent Gaussian probability distributions with zero mean; we denote the variances of the weights and of the biases by $\sigma_w^2$ and $\sigma_b^2$, respectively. We stress that the variances are allowed to depend on the layer.
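A minimal sketch of such a random network, restricted to fully connected layers and using our own names (`sigma_w`, `sigma_b`) for the variances, with the weight variance normalized by the input width of each layer as is customary in this literature:

```python
import numpy as np

# Minimal sketch (our own notation): a fully connected random ReLU network with
# zero-mean Gaussian weights and biases; the weight variance is normalized by
# the input width of each layer, and both variances may differ per layer.
def random_network(widths, sigma_w=1.0, sigma_b=0.1, seed=0):
    rng = np.random.default_rng(seed)
    params = []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W = rng.normal(0.0, sigma_w / np.sqrt(n_in), size=(n_out, n_in))
        b = rng.normal(0.0, sigma_b, size=n_out)
        params.append((W, b))
    return params

def forward(params, x):
    for i, (W, b) in enumerate(params):
        x = W @ x + b
        if i < len(params) - 1:      # ReLU on all layers except the output
            x = np.maximum(x, 0.0)
    return x

params = random_network([50, 500, 500, 1])
y = forward(params, np.ones(50))
label = np.sign(y[0])                # binary classification by the sign of the output
```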
3 Theoretical results
A recent series of works schoenholz2016deep; pennington2018emergence; lee2018deep; matthews2018gaussian; poole2016exponential; garriga-alonso2018deep; xiao2018dynamical; novak2019bayesian; yang2019wide has proved that in the limit of infinite width the random deep neural networks defined in section 2 are centered Gaussian processes, i.e., for any $k\in\mathbb{N}$ and any set of inputs $x_1,\ldots,x_k$, the joint probability distribution of the corresponding outputs is Gaussian with zero mean and covariance given by a kernel that depends on the architecture of the deep neural network. Therefore, studying adversarial perturbations for random deep neural networks is equivalent to studying adversarial perturbations for the corresponding Gaussian processes. We will first prove in Theorem 1 our adversarial robustness guarantee for a broad class of Gaussian processes. Then, we will prove in Theorem 2 that the Gaussian processes corresponding to random deep neural networks fall into this broad class. Thanks to the equivalence, this proves that the guarantee applies to random deep neural networks.
We recall that to any kernel $K$ on $\mathbb{R}^n$ we can associate a Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ with scalar product and norm denoted by $\langle\cdot,\cdot\rangle_K$ and $\|\cdot\|_K$, respectively, and a feature map $\Phi:\mathbb{R}^n\to\mathcal{H}$ such that $K(x,y)=\langle\Phi(x),\Phi(y)\rangle_K$ for any $x,\,y\in\mathbb{R}^n$ rasmussen2006gaussian
A key role will be played by the RKHS distance
\[
d_K(x,y)=\left\|\Phi(x)-\Phi(y)\right\|_K=\sqrt{K(x,x)-2\,K(x,y)+K(y,y)}\,.
\]
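Since $d_K$ depends only on kernel evaluations, it is straightforward to compute; a small sketch (the helper `rkhs_distance` is our own), checked against the linear kernel, whose feature map is the identity so that the RKHS distance reduces to the Euclidean one:

```python
import numpy as np

# The RKHS distance needs only kernel evaluations:
# d_K(x, y) = sqrt(K(x,x) - 2 K(x,y) + K(y,y)) = ||Phi(x) - Phi(y)||.
def rkhs_distance(K, x, y):
    return np.sqrt(K(x, x) - 2.0 * K(x, y) + K(y, y))

# Sanity check with the linear kernel K(x, y) = <x, y>, whose feature map is
# the identity, so the RKHS distance equals the Euclidean distance.
linear = lambda x, y: float(np.dot(x, y))
x = np.array([1.0, 2.0, 3.0])
y = np.array([0.0, 2.0, 5.0])
assert np.isclose(rkhs_distance(linear, x, y), np.linalg.norm(x - y))
```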
We can now state our main result, which we prove in Appendix A.
Theorem 1 ($\ell^2$ adversarial robustness guarantee for Gaussian processes).
Let $f$ be a Gaussian process on $\mathbb{R}^n$ with zero mean and covariance $K$, and let $d_K$ be the associated RKHS distance. Let $L>0$ be such that for any $x,\,y\in\mathbb{R}^n$
\[
K(x,x)\ge\|x\|_2^2\,,\qquad d_K(x,y)\le L\,\|x-y\|_2\,.
\]
Let $x\in\mathbb{R}^n$, and for any $r>0$ let $B_r(x)$ be the $\ell^2$ ball with center $x$ and radius $r$. Then, for any $0<\epsilon<1$ and any
\[
r\le\tilde{O}\!\left(\frac{\epsilon\,\sqrt{K(x,x)}}{L\,\sqrt{n}}\right),
\]
we have $\mathbb{P}\left(\operatorname{sign} f\text{ is constant on }B_r(x)\right)\ge1-\epsilon$.
Moreover, let $v$ be a unit vector in $\mathbb{R}^n$, and for any $t>0$ let $S_t(x)$ be the segment starting in $x$, parallel to $v$ and with length $t$. Then, for any
\[
t\le\tilde{O}\!\left(\frac{\epsilon\,\sqrt{K(x,x)}}{L}\right),
\]
we have $\mathbb{P}\left(\operatorname{sign} f\text{ is constant on }S_t(x)\right)\ge1-\epsilon$.
Recalling that our classifier is $x\mapsto\operatorname{sign} f(x)$, we have $\operatorname{sign} f(y)\neq\operatorname{sign} f(x)$ for some $y$ in $B_r(x)$ iff $B_r(x)$ is crossed by a classification boundary, i.e., iff there exists $y\in B_r(x)$ such that $f(y)=0$.
The proof is based on Dudley's theorem (Theorem 4), which provides an upper bound to the expectation value of the maximum of a Gaussian process over a given region, and on an estimate of the covering number of the unit ball (Theorem 6). Despite employing the best state-of-the-art tools, the prefactors of both these results are not sharp ledoux2013probability; price2016sublinear.
Theorem 2 (smoothness of the DNN Gaussian processes).
Corollary 1 ($\ell^2$ adversarial robustness guarantee for random deep neural networks).
Let $f$ be the output of a random deep neural network as in section 2. Let $x\in\mathbb{R}^n$, and for any $r>0$ let $B_r(x)$ be the $\ell^2$ ball with center $x$ and radius $r$. Then, in the limit of infinite width, for any $0<\epsilon<1$ and any
\[
r\le\tilde{O}\!\left(\frac{\epsilon\,\|x\|_2}{\sqrt{n}}\right),
\]
we have $\mathbb{P}\left(\operatorname{sign} f\text{ is constant on }B_r(x)\right)\ge1-\epsilon$.
Moreover, let $v$ be a unit vector in $\mathbb{R}^n$, and for any $t>0$ let $S_t(x)$ be the segment starting in $x$, parallel to $v$ and with length $t$. Then, for any
\[
t\le\tilde{O}\left(\epsilon\,\|x\|_2\right),
\]
we have $\mathbb{P}\left(\operatorname{sign} f\text{ is constant on }S_t(x)\right)\ge1-\epsilon$.
Remark 3 (asymptotic scaling).
Remark 4 ($\ell^p$ adversarial robustness guarantees).
For any $p\ge1$, the $\ell^p$ norm of a vector $x\in\mathbb{R}^n$ is $\|x\|_p=\left(\sum_{i=1}^n|x_i|^p\right)^{1/p}$. For any $r>0$, let $B^p_r(x)$ be the $\ell^p$ ball with center $x$ and radius $r$. Since $\|y\|_p\ge n^{\min\left(0,\,1/p-1/2\right)}\,\|y\|_2$ for any $y\in\mathbb{R}^n$, we trivially have from Remark 3 that, in the limit of infinite width, the same sign-constancy guarantee holds for the $\ell^p$ balls with correspondingly rescaled radius.
In particular, the $\ell^1$ and $\ell^\infty$ distances from the closest classification boundary scale at least as $\tilde{\Omega}(\sqrt{n})$ and $\tilde{\Omega}(1/\sqrt{n})$, respectively.
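The norm comparisons underlying these conversions can be checked numerically; a small sketch (all names ours):

```python
import numpy as np

# Numerical check of the standard norm inequalities behind the l^p guarantees:
# ||y||_inf <= ||y||_2 <= ||y||_1,  ||y||_1 <= sqrt(n) ||y||_2,
# and ||y||_2 <= sqrt(n) ||y||_inf.
rng = np.random.default_rng(1)
n = 1000
y = rng.normal(size=n)

l1 = np.linalg.norm(y, 1)
l2 = np.linalg.norm(y, 2)
linf = np.linalg.norm(y, np.inf)

assert linf <= l2 <= l1
assert l1 <= np.sqrt(n) * l2
assert l2 <= np.sqrt(n) * linf
```

For a dense vector such as a typical adversarial perturbation, $\|y\|_1$ is close to $\sqrt{n}\,\|y\|_2$, which is consistent with $\ell^1$ adversarial distances growing as $\sqrt{n}$ while $\ell^2$ distances stay constant.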
To summarize, we have proven that the $\ell^2$ distance of any given input $x$ from the closest classification boundary is with high probability at least $\tilde{\Omega}(\|x\|_2/\sqrt{n})$, where $n$ is the dimension of the input. This result applies both to deep neural networks with almost any architecture and random weights and biases, and to smooth Gaussian processes.
To experimentally validate Corollary 1 and Remark 3, we performed adversarial attacks on random inputs for various network architectures with randomly chosen weights. Experimental findings were consistent across a variety of networks, as shown in Appendix E, but for the sake of brevity we only provide figures and results for a simplified residual network in this section. Figure 1 plots the median distance of adversarial examples for a residual network similar to the first proposed residual network he2016deep. This network contains three residual blocks and does not contain a global average pooling layer before the final output (its complete architecture is given in subsection D.2
). Attacks were performed on 2-dimensional images with three channels and pixel values chosen randomly from the standard uniform distribution.
Results from Figure 1, plotting median adversarial distances as a function of the input dimension, are consistent with the expected theoretical scaling in Remark 3. Namely, adversarial distances in the $\ell^1$, $\ell^2$, and $\ell^\infty$ norms scale with the dimension $n$ of the input proportionally to $\sqrt{n}$, a constant (not dependent on $n$), and $1/\sqrt{n}$, respectively (up to logarithmic factors). Adversarial distances relative to the average starting norm of an input are plotted in Figure 2. This adjusted metric, named relative distance, provides a convenient means of understanding the scaling of adversarial distances, since relative adversarial distances scale proportionally to $1/\sqrt{n}$ in all norms.
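The scaling exponents reported above can be estimated by a least-squares fit in log-log coordinates; a sketch on idealized synthetic data with the predicted $\ell^\infty$ scaling $d\propto1/\sqrt{n}$ (the dimensions and the constant are hypothetical):

```python
import numpy as np

# Estimating a scaling exponent from (dimension, median distance) pairs by a
# linear fit in log-log coordinates; synthetic data with d = c / sqrt(n).
ns = np.array([48, 108, 192, 300, 432, 588, 768])   # hypothetical input dimensions
dists = 2.0 / np.sqrt(ns)                           # idealized l-infinity distances

slope, intercept = np.polyfit(np.log(ns), np.log(dists), 1)
print(f"fitted exponent: {slope:.3f}")              # ~ -0.5 for 1/sqrt(n) scaling
```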
4.1 Adversarial Attacks on Trained Neural Networks
In this section, we extend our experimental analysis to networks trained on MNIST and CIFAR10 data. We trained networks with the same residual network architecture given in the prior section on MNIST and CIFAR10 data under the task of binary classification. Networks were trained for 15 and 25 epochs for the MNIST and CIFAR10 datasets, respectively, achieving greater than 98% training set accuracy in all cases. Refer to subsection D.3 for full details on the training of the networks.
Properties of trained neural networks, especially as they relate to adversarial robustness and generalization, depend on the properties of the data used to train them. For example, since neural networks can be trained to “memorize” data choromanska2015loss, Corollary 1 can be forced to fail if the network is trained on a dataset which contains very close inputs with different labels. From Figure 3, the networks trained on CIFAR10 data show a smaller adversarial distance than random networks, both on random images and on images taken from the training or test set. In the case of MNIST, training decreases the adversarial distance for random images, but does not significantly change it for training or test images. One possible explanation for this discrepancy is the conspicuous geometric and visual structure of the MNIST dataset relative to CIFAR10. Digits in MNIST all have the same uniform black background and geometry and roughly fill the whole image, while in CIFAR10 the background and the relative size of the relevant part of the image can vary significantly, and pictures are taken from various angles (e.g., different orientations of a dog or car). Thus, when trained on MNIST, networks can more easily embed training and test points within areas far from classification boundaries. More generally, networks trained on MNIST data achieve low generalization error, and the increased adversarial distances are correlated with those lower errors.
From Corollary 1, we expect the portion of images that have at least one adversarial example within a given distance to increase linearly with the distance. This finding is validated by the results shown in Figure 4, which plots the adversarial distance by percentile (sorted from smallest to largest distance). In the case of random images, the linear increase in adversarial distance by percentile is evident throughout most of the percentiles in the chart, conforming closely to the linear fit (dotted line). Interestingly, this linear correlation is observed even for images in the training and test sets, outside of the smallest and highest percentiles. For training and test set images, networks usually predicted labels with high confidence, thus limiting the percentage of images falling at small distances from a classification boundary.
We have studied the properties of adversarial examples for deep neural networks with random weights and biases and have proved that the $\ell^2$ distance from the closest classification boundary of any given input $x$ is with high probability at least $\tilde{\Omega}(\|x\|_2/\sqrt{n})$, where $n$ is the dimension of the input (Corollary 1). Since this lower bound matches the upper bound of gilmer2018adversarial; fawzi2018adversarial; shafahi2019adversarial; mahloujifar2019curse, our result determines the universal scaling of the minimum size of adversarial perturbations. We have validated our theoretical results with experiments on both random deep neural networks and deep neural networks trained on the MNIST and CIFAR10 datasets. The experiments on random networks are completely in agreement with our theoretical predictions. Networks trained on MNIST and CIFAR10 data are mostly consistent with our main findings; therefore, we conjecture that the proof of our adversarial robustness guarantee can be extended to trained deep neural networks. Such an extension will be the focus of our future research, which could, e.g., exploit the equivalence between trained deep neural networks and Gaussian processes jacot2018neural; lee2019wide; yang2019scaling; arora2019exact; huang2019dynamics; li2019enhanced; wei2019regularization; cao2019generalization. This result will open the way to a more thorough theoretical study of the relation between network architecture and adversarial phenomena, helping to identify which changes in the architecture achieve the best improvements in adversarial robustness. Moreover, our methods can be employed to study the robustness of deep neural networks with respect to adversarial perturbations that keep the data manifold invariant, such as smooth deformations of the input image mallat2012group; oyallon2015deep; bruna2013invariant; bietti2019group; bietti2019inductive.
Appendix A Proof of Theorem 1
A.1 $\ell^2$ adversarial distance
and for any let
Conditioning on , becomes the Gaussian process with average
We put for any
such that is a centered Gaussian process with covariance . Let
and let us assume that . The following theorem provides an upper bound to :
Theorem 3 (Borell–TIS inequality adler2009random).
Let be a centered Gaussian process on , and let be the associated kernel. Then, for any
We have from Theorem 3
Since it is a centered Gaussian random variable with the corresponding variance, we have
We get an upper bound on from the following theorem.
Theorem 4 (Dudley’s theorem bartlett2013theoretical).
Let be a centered Gaussian process on , and let be the RKHS distance of the associated kernel. For any , let be the minimum number of balls of with radius that can cover . Then,
We directly get from Theorem 4
where is the minimum number of balls of with radius that can cover . Let be the minimum number of balls of the Euclidean distance with radius that can cover the unit ball. From Lemma 1 and (8) we get that
for any , therefore
and (28) implies
We get from Theorem 6
and the claim follows from (8).
A.2 Random direction
Appendix B Proof of Theorem 2
The proof of Theorem 2 is based on the following theorem, which formalizes the equivalence between deep neural networks with random weights and biases and Gaussian processes.
Theorem 5 (Master Theorem yang2019wide).
Let be the outputs of the layers of the random deep neural network defined in section 2. Let be the kernels on where is recursively defined as
i.e., as the covariance of , where the expectation is computed assuming that are independent centered Gaussian processes with covariances .
Given , let be such that there exist such that for any ,
Then, in the limit , we have for any
with the covariance matrix given by
For finite width, the outputs of the intermediate layers of the random deep neural networks have a sub-Weibull distribution vladimirova2019understanding.
The main consequence of Theorem 5 is that the final output is a centered Gaussian process:
The final output of the deep neural network is a centered Gaussian process with covariance .
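For fully connected ReLU layers, the kernel recursion defining the covariances of Theorem 5 has a well-known closed form, the arc-cosine kernel of Cho and Saul (see also lee2018deep). The following sketch iterates it; the variance symbols `sigma_w`, `sigma_b` and the normalized input kernel are our own choices, not notation from this paper:

```python
import numpy as np

# Closed-form kernel recursion for fully connected ReLU layers (arc-cosine
# kernel; Cho & Saul, lee2018deep). sigma_w, sigma_b are assumed names.
def relu_kernel_step(Kxx, Kxy, Kyy, sigma_w=1.0, sigma_b=0.0):
    cos_theta = np.clip(Kxy / np.sqrt(Kxx * Kyy), -1.0, 1.0)
    theta = np.arccos(cos_theta)
    Kxy_new = sigma_b**2 + sigma_w**2 / (2 * np.pi) * np.sqrt(Kxx * Kyy) * (
        np.sin(theta) + (np.pi - theta) * cos_theta)
    Kxx_new = sigma_b**2 + sigma_w**2 * Kxx / 2.0   # theta = 0 on the diagonal
    Kyy_new = sigma_b**2 + sigma_w**2 * Kyy / 2.0
    return Kxx_new, Kxy_new, Kyy_new

# Two layers, starting from the normalized linear input kernel <x, y> / n.
x = np.array([1.0, 0.0]); y = np.array([0.0, 1.0]); n = 2
K = (x @ x / n, x @ y / n, y @ y / n)
for _ in range(2):
    K = relu_kernel_step(*K, sigma_w=np.sqrt(2.0))  # this variance keeps K(x, x) fixed
```

With $\sigma_w^2=2$ the diagonal is preserved exactly ($K_{l+1}(x,x)=K_l(x,x)$), which makes the depth-wise behavior of the off-diagonal entry easy to inspect.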
It is convenient to define for any
be the RKHS distance associated with .
B.1 Input layer
B.2 Nonlinear layer
Let the $(l+1)$-th layer be a nonlinear layer. From Theorem 5, we can assume that is the centered Gaussian process with covariance . We then have