
Adversarial robustness guarantees for random deep neural networks

04/13/2020
by   Giacomo De Palma, et al.
MIT

The reliability of most deep learning algorithms is fundamentally challenged by the existence of adversarial examples, which are incorrectly classified inputs that are extremely close to a correctly classified input. We study adversarial examples for deep neural networks with random weights and biases and prove that the ℓ^1 distance of any given input from the classification boundary scales at least as √(n), where n is the dimension of the input. We also extend our proof to cover all the ℓ^p norms. Our results constitute a fundamental advance in the study of adversarial examples, and encompass a wide variety of architectures, which include any combination of convolutional or fully connected layers with skipped connections and pooling. We validate our results with experiments on both random deep neural networks and deep neural networks trained on the MNIST and CIFAR10 datasets. Given the results of our experiments on MNIST and CIFAR10, we conjecture that the proof of our adversarial robustness guarantee can be extended to trained deep neural networks. This extension will open the way to a thorough theoretical study of neural network robustness by classifying the relation between network architecture and adversarial distance.


1 Introduction

Deep neural networks constitute an extremely powerful architecture for machine learning and have achieved an enormous success in several fields such as speech recognition, computer vision and natural language processing, where they can often outperform human abilities mnih2015human; lecun2015deep; radford2015unsupervised; schmidhuber2015deep; goodfellow2016deep. In 2014, a very surprising property of deep neural networks emerged in the context of image classification goodfellow2014explaining: an extremely small perturbation can change the label of a correctly classified image. This property is particularly concerning since it may be exploited by a malicious adversary to fool a machine learning algorithm by steering its output. For this reason, methods to find such perturbed inputs, or adversarial examples, have been named adversarial attacks. This problem further captured the attention of the deep learning community when it was discovered that real-world images taken with a camera can also constitute adversarial examples kurakin2018adversarial; sharif2016accessorize; brown2017adversarial; evtimov2017robust. To study adversarial attacks, two lines of research have been developed: one aims to develop efficient algorithms to find adversarial examples su2019one; athalye2018synthesizing; liu2016delving, and the other aims to make deep neural networks more robust against adversarial attacks madry2018towards; tsipras2018robustness; nakkiran2019adversarial; lecuyer2019certified; gilmer2019adversarial; algorithms to compute the robustness of a given trained deep neural network against adversarial attacks have also been developed li2019certified; jordan2019provable.

Several theories have been proposed to explain the phenomenon of adversarial examples raghunathan2018certified; wong2018provable; xiao2018training; cohen2019certified; schmidt2018adversarially; tanay2016boundary; kim2019bridging; fawzi2016robustness; shamir2019simple; bubeck2019adversarial; ilyas2019adversarial. One of the most prominent theories states that adversarial examples are an unavoidable feature of the high-dimensional geometry of the input space: Refs. gilmer2018adversarial; fawzi2018adversarial; shafahi2019adversarial; mahloujifar2019curse show that, if the classification error is finite, the classification of a correctly classified input can be changed with an adversarial perturbation of relative size O(1/√n), where n is the dimension of the input space.

In this paper, we study the properties of adversarial examples for wide, deep neural networks with random weights and biases. Our main result, presented in section 3, is a probabilistic robustness guarantee on the ℓ^1 distance (later extended to all the ℓ^p distances; the ℓ^p norm of a vector x is ‖x‖_p = (∑_i |x_i|^p)^(1/p)) of a given input from the closest classification boundary. We prove that the ℓ^1 distance from the closest classification boundary of any given input x whose entries are O(1) is with high probability at least Ω̃(√n), where n is the dimension of the input (the tilde means that logarithmic factors are hidden), i.e., the ℓ^1 distance from x of any adversarial example is larger than Ω̃(√n). Since ‖x‖_1 = Θ(n), our result implies that the relative size of any adversarial perturbation is at least Ω̃(1/√n). This lower bound on the size of adversarial perturbations matches the upper bound imposed by the high-dimensional geometry proven in Refs. gilmer2018adversarial; fawzi2018adversarial; shafahi2019adversarial; mahloujifar2019curse. Therefore, our result proves that √n (up to logarithmic factors) is the universal scaling of the minimum ℓ^1 size of adversarial perturbations. We also prove that, for any given unit vector v, with high probability all the inputs x + t v with t at most of order √n (up to logarithmic factors) have the same classification as x. Since the Euclidean norm of x is Θ(√n), a remarkable consequence of this result is that a finite fraction of the distance to the origin can be traveled without encountering any classification boundary.
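For concreteness, the short snippet below (our own illustration, not part of the paper) evaluates the claimed scalings for an MNIST-sized input; the choice n = 28·28 is purely illustrative.

```python
# Worked illustration (ours) of the claimed scalings for an MNIST-sized input:
# the l^1 adversarial distance is predicted to be at least of order sqrt(n)
# (up to logarithmic factors), i.e. a relative perturbation of order 1/sqrt(n).
import math

n = 28 * 28                                # illustrative input dimension
print("sqrt(n)   =", math.sqrt(n))         # ~28: scale of the minimum l^1 distance
print("1/sqrt(n) =", 1 / math.sqrt(n))     # ~0.036: scale of the minimum relative size
```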

Our results encompass a wide variety of network architectures, namely any combination of convolutional or fully connected layers with nonlinear activation, skipped connections and pooling (see section 2).

Our proof builds on the equivalence between deep neural networks with random weights and biases and Gaussian processes lee2018deep; yang2019wide. We prove that the same probabilistic robustness guarantees for the adversarial distance also apply to a broad class of Gaussian processes, namely those whose variance is lower bounded by the squared Euclidean norm of the input and whose kernel has a Lipschitz feature map, a result that can be of independent interest.

In section 4 we experimentally validate our theoretically predicted scaling of the adversarial distance for random deep neural networks. In subsection 4.1 we experimentally study the adversarial distance for deep neural networks trained on the MNIST and CIFAR10 datasets. In both cases, the training does not change the order of magnitude of the adversarial distance. While for MNIST the adversarial distances for random and trained networks are very close, in the case of CIFAR10 the training decreases the adversarial distance by roughly half an order of magnitude. As discussed further in subsection 4.1, this can be ascribed to the different nature of the CIFAR10 data with respect to the MNIST data.

The results of our experiments make us conjecture that the proof of our adversarial robustness guarantees can be extended to trained deep neural networks. This extension will open the way to the first thorough theoretical study of the relationship between a network architecture and its robustness to adversarial attacks, thus leading to an understanding of which changes in the architecture yield the best improvements.

1.1 Related works

The equivalence between neural networks with random weights and Gaussian processes in the limit of infinite width has been known for a long time in the case of fully connected neural networks with one hidden layer neal1996priors; williams1997computing, but it has only recently been extended to multi-layer schoenholz2016deep; pennington2018emergence; lee2018deep; matthews2018gaussian; poole2016exponential and convolutional deep neural networks garriga-alonso2018deep; xiao2018dynamical; novak2019bayesian. The equivalence has now been proved for practically all the existing neural network architectures yang2019wide, and has been extended to trained deep neural networks jacot2018neural; lee2019wide; yang2019scaling; arora2019exact; huang2019dynamics; li2019enhanced; wei2019regularization; cao2019generalization, including adversarial training gao2019convergence.

Robustness guarantees for Bayesian inference with Gaussian processes were first considered in cardelli2019robustness. The smoothness of the feature map of a kernel plays a key role in machine learning applications mallat2012group; oyallon2015deep; bruna2013invariant; bietti2019group, and kernels associated to deep neural networks have been studied from this point of view bietti2019inductive.

In the setup of binary classification of bit strings, the Hamming distance of a given input from the closest classification boundary has been theoretically studied in de2019random, where the scaling has been found.

2 Setup

Our inputs are multi-dimensional images considered as elements of a space determined by the number of input channels (e.g., three for Red-Green-Blue images) and by the set of the input pixels, assumed for simplicity to be periodic; the two-dimensional case recovers standard 2D images. For the sake of a simpler notation, we will sometimes consider the input space as a real vector space whose dimension n is the number of channels times the number of pixels.

Our architecture allows for any combination of convolutional layers, fully connected layers, skipped connections and pooling. For the sake of a simpler notation, we treat each of the above operations as a layer, even if it does not include any nonlinear activation. For simplicity, we assume that the nonlinear activation function is the ReLU. Our results can be easily extended to other activation functions.

For any and any input , let be the number of channels and the set of pixels of the output of the -th layer . The layer transformations have the following mathematical expression:

  • Input layer: We have and

    (1)

    where is the convolutional patch of the first layer. We assume for simplicity that .

  • Nonlinear layer: If the ()-th layer is a nonlinear layer, we have and

    (2)

    where is the activation function and is the convolutional patch of the layer. We assume for simplicity that . Fully connected layers are recovered by .

  • Skipped connection: If the ()-th layer is a skipped connection, we have , and

    (3)

    where is such that the sum in (3) is well defined, i.e., and . For the sake of a simple proof, we assume that the -th layer is either a convolutional or a fully connected layer.

  • Pooling: If the ()-th layer is a pooling layer, we have , and is a partition of , i.e., the elements of are disjoint subsets of whose union is equal to . We assume for simplicity that the -th layer is a convolutional layer and that all the elements of have the same cardinality, which is therefore equal to . We have

    (4)
  • Flattening layer: Let the ()-th layer be the flattening layer. Note that the fully connected layer placed directly after the flattening is included as part of this layer. We have and

    (5)
  • Output layer: The final output of the network is , and the output label is . We introduce the other components of for the sake of a simpler notation in the proof of Theorem 2.

Our random deep neural networks draw all the weights and the biases from independent Gaussian probability distributions with zero mean and with variances that may differ between the weights and the biases. We stress that the variances are allowed to depend on the layer. A code sketch of this setup is given below.
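The following is a minimal sketch of such a random network (our own illustration, not the authors' code): the layer widths, kernel sizes, and variances are arbitrary choices, and PyTorch is used only for convenience.

```python
# Sketch (ours): a random network combining convolution + ReLU, a skipped
# connection, pooling, flattening and a fully connected output layer, with all
# weights and biases drawn from centered Gaussians with per-layer variances.
import torch
import torch.nn as nn
import torch.nn.functional as F

def gaussian_init_(module, weight_var=2.0, bias_var=0.1):
    """Draw weights and biases from centered Gaussians; variances are per-layer choices."""
    fan_in = module.weight[0].numel()
    nn.init.normal_(module.weight, mean=0.0, std=(weight_var / fan_in) ** 0.5)
    if module.bias is not None:
        nn.init.normal_(module.bias, mean=0.0, std=bias_var ** 0.5)

class RandomConvNet(nn.Module):
    def __init__(self, channels=3, width=128, num_outputs=2):
        super().__init__()
        # Periodic pixels are emulated with circular padding.
        self.conv1 = nn.Conv2d(channels, width, 3, padding=1, padding_mode="circular")
        self.conv2 = nn.Conv2d(width, width, 3, padding=1, padding_mode="circular")
        self.fc = nn.Linear(width, num_outputs)
        for m in (self.conv1, self.conv2, self.fc):
            gaussian_init_(m)

    def forward(self, x):
        h = F.relu(self.conv1(x))        # input + nonlinear (convolutional) layers
        h = h + F.relu(self.conv2(h))    # skipped connection
        h = F.adaptive_avg_pool2d(h, 1)  # pooling
        h = torch.flatten(h, 1)          # flattening layer
        return self.fc(h)                # fully connected output layer

x = torch.rand(1, 3, 16, 16)             # random input with O(1) entries
print(RandomConvNet()(x).argmax(dim=1))  # the output label is the argmax
```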

3 Theoretical results

A recent series of works schoenholz2016deep; pennington2018emergence; lee2018deep; matthews2018gaussian; poole2016exponential; garriga-alonso2018deep; xiao2018dynamical; novak2019bayesian; yang2019wide has proved that in the limit of infinite width the random deep neural networks defined in section 2 are centered Gaussian processes, i.e., for any finite set of inputs, the joint probability distribution of the corresponding outputs is Gaussian with zero mean and covariance given by a kernel that depends on the architecture of the deep neural network. Therefore, studying adversarial perturbations for random deep neural networks is equivalent to studying adversarial perturbations for the corresponding Gaussian processes. We will first prove in Theorem 1 our adversarial robustness guarantee for a broad class of Gaussian processes. Then, we will prove in Theorem 2 that the Gaussian processes corresponding to random deep neural networks fall in this broad class. Thanks to the equivalence, this will prove that the guarantee applies to random deep neural networks.
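As an illustration of this equivalence (our own sketch, with an arbitrary one-hidden-layer fully connected architecture and arbitrary variances), one can sample many independent random networks and check that the joint distribution of the outputs at a few fixed inputs is approximately a centered Gaussian:

```python
# Sketch (ours): empirical check of the Gaussian-process behaviour of a wide
# random one-hidden-layer ReLU network at two fixed inputs.
import numpy as np

rng = np.random.default_rng(0)
n, width, draws = 64, 4096, 2000
x1, x2 = rng.uniform(size=n), rng.uniform(size=n)

outs = np.empty((draws, 2))
for k in range(draws):
    W1 = rng.normal(0.0, np.sqrt(2.0 / n), size=(width, n))   # hidden weights
    w2 = rng.normal(0.0, np.sqrt(1.0 / width), size=width)    # output weights
    outs[k] = [w2 @ np.maximum(W1 @ x1, 0.0), w2 @ np.maximum(W1 @ x2, 0.0)]

# Across random draws the two outputs should be close to a centered bivariate
# Gaussian whose covariance approximates the architecture-dependent kernel.
print("empirical mean:      ", outs.mean(axis=0))
print("empirical covariance:\n", np.cov(outs.T))
```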

We recall that to any kernel K on the input space we can associate a Reproducing Kernel Hilbert Space (RKHS) with scalar product and norm denoted by ⟨·, ·⟩ and ‖·‖, respectively, and a feature map φ such that for any couple of inputs x, y rasmussen2006gaussian

K(x, y) = ⟨φ(x), φ(y)⟩.     (6)

A key role will be played by the RKHS distance

d(x, y) = ‖φ(x) − φ(y)‖ = √(K(x, x) + K(y, y) − 2 K(x, y)).     (7)
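Concretely, the RKHS distance can be computed directly from the kernel, without an explicit feature map; a small sketch (ours, with an arbitrary illustrative kernel) is:

```python
# Sketch (ours): the RKHS distance induced by a kernel K, computed via
# d(x, y)^2 = K(x, x) + K(y, y) - 2 K(x, y); the RBF kernel is only illustrative.
import numpy as np

def rbf_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def rkhs_distance(x, y, kernel=rbf_kernel):
    d2 = kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y)
    return np.sqrt(max(d2, 0.0))   # clip tiny negative values from rounding

x, y = np.zeros(10), 0.1 * np.ones(10)
print(rkhs_distance(x, y))
```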

We can now state our main result, which we prove in Appendix A.

Theorem 1 (ℓ^1 adversarial robustness guarantee for Gaussian processes).

Let be a Gaussian process on with zero mean and covariance , and let be the associated RKHS distance. Let be such that for any

(8)

Let , and for any let be the ball with center and radius . Then, for any and any

(9)

we have .

Moreover, let be a unit vector in , and for any let be the segment starting in , parallel to and with length . Then, for any

(10)

we have .

Remark 1.

Recalling that our classifier is , we have for some in iff is crossed by a classification boundary, i.e., iff there exists such that .

Remark 2.

We do not have any reason to believe that the prefactor in (9) is sharp. Indeed, the proof of Theorem 1 relies on Dudley's theorem (Theorem 4), which provides an upper bound to the expectation value of the maximum of a Gaussian process over a given region, and on an estimate of the covering number of the unit ball (Theorem 6). Despite employing the best state-of-the-art tools, the prefactors of both these results are not sharp ledoux2013probability; price2016sublinear.

The following theorem, which we prove in Appendix B, states that the kernels of the Gaussian processes that correspond to random deep neural networks satisfy the hypotheses of Theorem 1.

Theorem 2 (smoothness of the DNN Gaussian processes).

Let be the kernel associated to the output of a random deep neural network as in section 2. Then, satisfies (8) with

(11)

where we recall from section 2 that and are the sets of the pixels of the input and of the layer immediately before the flattening, respectively.

Corollary 1 (ℓ^1 adversarial robustness guarantee for random deep neural networks).

Let be a random deep neural network as in section 2. Let , and for any let . Then, in the limit of infinite width, for any and any

(12)

we have .

Moreover, let be a unit vector in , and for any let . Then, for any

(13)

we have .

Remark 3 (asymptotic scaling).

We stress that Theorem 1 and Corollary 1 hold for any choice of , and . In the limit of large input dimension n, if all the entries of the input are O(1), both (9) and (10) become, up to logarithmic factors,

(14)

Analogously, in the limit of large input dimension both (12) and (13) become

(15)
Remark 4 (ℓ^p adversarial robustness guarantees).

For any p ≥ 1, the ℓ^p norm of a vector x is ‖x‖_p = (∑_i |x_i|^p)^(1/p); ℓ^p balls are defined accordingly. Since ‖x‖_1 ≤ n^(1 − 1/p) ‖x‖_p for any x, we trivially have from Remark 3 that in the limit of large input dimension, for

(16)

In particular, the ℓ^2 and ℓ^∞ distances from the closest classification boundary scale at least as a constant and as 1/√n, respectively (up to logarithmic factors).
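The resulting lower-bound exponents can be read off directly; the following is our own illustration of the calculation, not part of the paper.

```python
# Sketch (ours): predicted lower-bound exponents for the l^p adversarial distance,
# obtained from the l^1 bound ~ sqrt(n) and the inequality |x|_1 <= n**(1-1/p) * |x|_p.
def adversarial_distance_exponent(p):
    """The l^p distance is predicted to scale at least as n**(1/p - 1/2)."""
    return 1.0 / p - 0.5

for p in (1, 2, float("inf")):
    print(p, adversarial_distance_exponent(p))   # 0.5, 0.0, -0.5
```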

To summarize, we have proven that the ℓ^1 distance of any given input from the closest classification boundary is with high probability at least of order √n up to logarithmic factors, where n is the dimension of the input. This result applies both to deep neural networks with almost any architecture and random weights and biases, and to smooth Gaussian processes.

4 Experiments

To experimentally validate Corollary 1 and Remark 3, we performed adversarial attacks on random inputs for various network architectures with randomly chosen weights. Experimental findings were consistent across a variety of networks as shown in Appendix E, but for the sake of brevity, we only provide figures and results for a simplified residual network in this section. Figure 1 plots the median distance of adversarial examples for a residual network similar to the first proposed residual network he2016deep. This network contains three residual blocks and does not contain a global average pooling layer before the final output (its complete architecture is given in subsection D.2). Attacks were performed on 2-dimensional images with three channels and pixel values chosen randomly from the standard uniform distribution.
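The following sketch (ours; the architecture, initialization, attack and step sizes are illustrative stand-ins, not the exact protocol of Appendix D) shows the structure of such an experiment: sample a random network and a random input, move towards the classification boundary, and record the perturbation norms when the label flips.

```python
# Sketch (ours) of one attack on one random network, using a crude
# gradient-based boundary search in place of the attack detailed in Appendix D.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 2),
)
for p in net.parameters():                 # random Gaussian weights and biases
    nn.init.normal_(p, std=0.1)

x0 = torch.rand(1, 3, 16, 16)              # random input, entries in [0, 1]
label = int(net(x0).argmax())

x = x0.clone()
for _ in range(2000):                      # walk towards the closest boundary
    x.requires_grad_(True)
    out = net(x)
    margin = out[0, label] - out[0, 1 - label]
    if margin.item() <= 0:                 # label flipped: adversarial example found
        break
    grad, = torch.autograd.grad(margin, x)
    x = (x - 1e-2 * grad / grad.norm()).detach()

delta = (x - x0).detach().flatten()
print("l1  :", delta.abs().sum().item())
print("l2  :", delta.norm().item())
print("linf:", delta.abs().max().item())
```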

Figure 1: Random untrained networks: Median distances of the closest adversarial examples from their respective inputs scale as predicted in Remark 3 for a residual network (see subsection D.2 for a full description of the network). Error bars span from the 45th percentile to the 55th percentile of adversarial distances. For each input dimension, results are calculated from 2000 samples (200 random networks, each attacked at 10 random points). See Appendix D for further details on how the experiments were performed.

Results from Figure 1, plotting median adversarial distances as a function of the input dimension, are consistent with the expected theoretical scaling in Remark 3. Namely, adversarial distances in the ℓ^1, ℓ^2, and ℓ^∞ norms scale with the dimension n of the input proportionally to √n, a constant (not dependent on n), and 1/√n, respectively (up to logarithmic factors). Adversarial distances relative to the average starting norm of an input are plotted in Figure 2. This adjusted metric, named relative distance, provides a convenient means of understanding the scaling of adversarial distances, since relative adversarial distances scale proportionally to 1/√n in all norms.

Figure 2: Random untrained networks: Median relative distances of the closest adversarial examples from their respective inputs scale with the input dimension as 1/√n in all norms for a residual network with random weights (see subsection D.2 for a full description of the network). Error bars span from the 45th percentile to the 55th percentile of adversarial distances. For each input dimension, results are calculated from 2000 samples (200 random networks, each attacked at 10 random points).

4.1 Adversarial Attacks on Trained Neural Networks

Results from section 4 indicate that adversarial attacks on networks with randomly chosen weights empirically conform with our main findings presented in section 3. In this section, we extend our experimental analysis to networks trained on MNIST and CIFAR10 data. We trained networks with the same residual network architecture given in the prior section on MNIST and CIFAR10 data under the task of binary classification. Networks were trained for 15 and 25 epochs for the MNIST and CIFAR10 datasets, respectively, achieving greater than 98% training set accuracy in all cases. Refer to subsection D.3 for full details on the training of the networks.

Properties of trained neural networks, especially as they relate to adversarial robustness and generalization, are dependent on the properties of the data used to train them. For example, since neural networks can be trained to “memorize” data choromanska2015loss, Corollary 1 can be forced to fail if the network is trained on a dataset which contains very close inputs with different labels. From Figure 3, the networks trained on CIFAR10 data show a smaller adversarial distance with respect to random networks both on random images and on images taken from the training or test set. In the case of MNIST, training decreases the adversarial distance for random images, but does not significantly change it for training or test images. One possible explanation for this discrepancy is the conspicuous geometric and visual structure inherent in the MNIST dataset relative to CIFAR10. Digits in MNIST all have the same uniform black background and geometry and roughly fill the whole image, while in CIFAR10 the background and the relative size of the relevant part of the image can vary significantly, and pictures are taken from various different angles (e.g., different orientations of a dog or car). Thus, when trained on MNIST, networks can more easily embed training and test points within areas far from classification boundaries. More generally, networks trained on MNIST data achieve low generalization error and increased adversarial distances are correlated with those lower errors.

Figure 3: Random vs trained networks: Median distance of adversarial examples for random neural networks and for neural networks of the same architecture trained on MNIST and CIFAR10 data (cf. Figure 1). Analysis is performed for random images (images with randomly chosen pixel values) and for images in the training and test sets. The network architecture is a simplified residual network (see subsection D.2).

From Corollary 1, we expect the portion of images that have at least one adversarial example within a given distance to increase linearly with the distance. This finding is validated by the results shown in Figure 4, which plots the adversarial distance by percentile (sorted from smallest to largest distance). In the case of random images, the linear increase of the adversarial distance with the percentile is evident throughout most of the percentiles in the chart, conforming closely to the linear fit (dotted line). Interestingly, this linear correlation is even observed for images in the training and test sets outside of the smallest and highest percentiles. For training and test set images, networks usually predicted labels with high confidence, thus limiting the percentage of images falling at small distances from a classification boundary.
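A sketch of this percentile analysis (ours, run on synthetic placeholder distances rather than the actual attack results) is:

```python
# Sketch (ours): adversarial distance by percentile and a linear fit over the
# lowest percentiles; the distances below are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)
distances = np.sort(rng.gamma(shape=2.0, scale=1.0, size=2000))   # placeholder data
percentiles = np.arange(1, distances.size + 1) / distances.size

low = percentiles <= 0.25            # Corollary 1 predicts a linear relation here
slope, intercept = np.polyfit(percentiles[low], distances[low], deg=1)
print("linear fit over the lowest percentiles: slope", slope, "intercept", intercept)
```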

Figure 4: Trained networks: Adversarial distance by percentile for random images (images with randomly chosen pixel values) and for images in the training and test sets. The expected linear relationship between distance and percentile is observed for random images apart from the highest percentiles, as is evident from the linear fit over percentiles ranging from 0 to 0.25, shown as a dotted line. Adversarial attacks are performed on the norm. The network architecture is a simplified residual network (see subsection D.2).

5 Discussion

We have studied the properties of adversarial examples for deep neural networks with random weights and biases and have proved that the ℓ^1 distance from the closest classification boundary of any given input is with high probability at least of order √n up to logarithmic factors, where n is the dimension of the input (Corollary 1). Since this lower bound matches the upper bound of gilmer2018adversarial; fawzi2018adversarial; shafahi2019adversarial; mahloujifar2019curse, our result determines the universal scaling of the minimum size of adversarial perturbations. We have validated our theoretical results with experiments on both random deep neural networks and deep neural networks trained on the MNIST and CIFAR10 datasets. The experiments on random networks are completely in agreement with our theoretical predictions. Networks trained on MNIST and CIFAR10 data are mostly consistent with our main findings; therefore, we conjecture that the proof of our adversarial robustness guarantee can be extended to trained deep neural networks. Such an extension will be the focus of our future research, which could e.g. exploit the equivalence between trained deep neural networks and Gaussian processes jacot2018neural; lee2019wide; yang2019scaling; arora2019exact; huang2019dynamics; li2019enhanced; wei2019regularization; cao2019generalization. This result will open the way to a more thorough theoretical study of the relation between network architecture and adversarial phenomena, leading to an understanding of which changes in the architecture achieve the best improvements in adversarial robustness. Moreover, our methods can be employed to study the robustness of deep neural networks with respect to adversarial perturbations that keep the data manifold invariant, such as smooth deformations of the input image mallat2012group; oyallon2015deep; bruna2013invariant; bietti2019group; bietti2019inductive.

Appendix A Proof of Theorem 1

A.1 ℓ^1 adversarial distance

Let

(17)

and for any let

(18)

Conditioning on , becomes the Gaussian process with average

(19)

and covariance

(20)

We put for any

(21)

such that is a centered Gaussian process with covariance . Let

(22)

and let us assume that . The following theorem provides an upper bound to :

Theorem 3 (Borell–TIS inequality adler2009random).

Let f be a centered Gaussian process on a set T, and let K be the associated kernel. Then, for any u ≥ 0,

P( sup_{t ∈ T} f(t) ≥ E[ sup_{t ∈ T} f(t) ] + u ) ≤ exp( − u² / (2 σ_T²) ),     (23)

where

σ_T² = sup_{t ∈ T} K(t, t).     (24)
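A quick numerical illustration of this concentration (our own sketch, with an arbitrary kernel on a one-dimensional grid) is:

```python
# Sketch (ours): Monte Carlo check of the Borell-TIS tail bound for the supremum
# of a centered Gaussian process on a finite grid with an illustrative RBF kernel.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 200)
K = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / 0.1**2)   # kernel matrix on the grid
L = np.linalg.cholesky(K + 1e-8 * np.eye(t.size))            # jitter for stability

sups = np.array([np.max(L @ rng.normal(size=t.size)) for _ in range(5000)])
sigma2 = K.diagonal().max()                                  # largest pointwise variance
u = 1.0
print("empirical tail:   ", np.mean(sups >= sups.mean() + u))
print("Borell-TIS bound: ", np.exp(-u**2 / (2.0 * sigma2)))
```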

We have from Theorem 3

(25)

Recalling that

is a centered Gaussian random variable with variance

, we have

(26)

We get an upper bound on from the following theorem.

Theorem 4 (Dudley's theorem bartlett2013theoretical).

Let f be a centered Gaussian process on a set T, and let d be the RKHS distance of the associated kernel. For any ε > 0, let N(T, d, ε) be the minimum number of balls of d with radius ε that can cover T. Then,

E[ sup_{t ∈ T} f(t) ] ≤ 12 ∫_0^∞ √( ln N(T, d, ε) ) dε.     (27)

We directly get from Theorem 4

(28)

where is the minimum number of balls of with radius that can cover . Let be the minimum number of balls of the Euclidean distance with radius that can cover the unit ball. From Lemma 1 and (8) we get that

(29)

for any , therefore

(30)

and (28) implies

(31)

We get from Theorem 6

(32)

where in the second line we made the change of variable . Putting together (A.1), (31), (A.1), Lemma 3 and Lemma 4 we get

(33)

and the claim follows from (8).

A.2 Random direction

Let

(34)

We define for any

(35)

such that is a centered Gaussian process on with covariance and feature map

(36)

We have

(37)

therefore Theorem 7 and Lemma 5 imply

(38)

and the claim follows.

Appendix B Proof of Theorem 2

The proof of Theorem 2 is based on the following theorem, which formalizes the equivalence between deep neural networks with random weights and biases and Gaussian processes.

Theorem 5 (Master Theorem yang2019wide).

Let be the outputs of the layers of the random deep neural network defined in section 2. Let be the kernels on where is recursively defined as

(39)

i.e., as the covariance of , where the expectation is computed assuming that are independent centered Gaussian processes with covariances .

Given , let be such that there exist such that for any ,

(40)

Then, in the limit , we have for any

(41)

with the covariance matrix given by

(42)
Remark 5.

For finite width, the outputs of the intermediate layers of the random deep neural networks have a sub-Weibull distribution vladimirova2019understanding.
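For fully connected ReLU layers, the recursion (39) can be evaluated in closed form through the standard ReLU ("arc-cosine") Gaussian expectation; the sketch below (ours, with illustrative per-layer variances and depth, and restricted to fully connected layers rather than the general architectures of section 2) computes the resulting kernel for a pair of inputs.

```python
# Sketch (ours): recursive NNGP kernel for fully connected ReLU layers.
import numpy as np

def relu_expectation(kxx, kyy, kxy):
    """E[ReLU(u) ReLU(v)] for centered jointly Gaussian (u, v) with the given (co)variances."""
    c = np.clip(kxy / np.sqrt(kxx * kyy), -1.0, 1.0)
    theta = np.arccos(c)
    return np.sqrt(kxx * kyy) * (np.sin(theta) + (np.pi - theta) * c) / (2.0 * np.pi)

def nngp_kernel(x, y, depth=3, w_var=2.0, b_var=0.1):
    n = x.size
    kxx = w_var * (x @ x) / n + b_var      # kernel of the input layer
    kyy = w_var * (y @ y) / n + b_var
    kxy = w_var * (x @ y) / n + b_var
    for _ in range(depth):                 # one step of the recursion per layer
        new_kxx = w_var * relu_expectation(kxx, kxx, kxx) + b_var
        new_kyy = w_var * relu_expectation(kyy, kyy, kyy) + b_var
        new_kxy = w_var * relu_expectation(kxx, kyy, kxy) + b_var
        kxx, kyy, kxy = new_kxx, new_kyy, new_kxy
    return kxy

rng = np.random.default_rng(0)
x, y = rng.uniform(size=64), rng.uniform(size=64)
print("K(x, y) =", nngp_kernel(x, y))
```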

The main consequence of Theorem 5 is that the final output is a centered Gaussian process:

Corollary 2.

The final output of the deep neural network is a centered Gaussian process with covariance .

Proof.

Given , let be continuous and bounded. For any we have from Theorem 5 in the limit

(43)

with as in (42). Taking the expectation value on both sides of (43) we get, recalling that each has the same probability distribution as ,

(44)

and the claim follows. ∎

It is convenient to define for any

(45)

Let also

(46)

be the RKHS distance associated with .

We will prove by induction that for any there exist such that satisfies (8) with and . The following subsections will prove the inductive step for each of the types of layer defined in section 2.

B.1 Input layer

is a centered Gaussian process with covariance as in (39) with

(47)

and

(48)

therefore

(49)

and satisfies (8) with

(50)

B.2 Nonlinear layer

Let the ()-th layer be a nonlinear layer. From Theorem 5, we can assume that is the centered Gaussian process with covariance . We then have