Gradient Methods Provably Converge to Non-Robust Networks

02/09/2022
by   Gal Vardi, et al.
Weizmann Institute of Science

Despite a great deal of research, it is still unclear why neural networks are so susceptible to adversarial examples. In this work, we identify natural settings where depth-2 ReLU networks trained with gradient flow are provably non-robust (susceptible to small adversarial ℓ_2-perturbations), even when robust networks that classify the training dataset correctly exist. Perhaps surprisingly, we show that the well-known implicit bias towards margin maximization induces bias towards non-robust networks, by proving that every network which satisfies the KKT conditions of the max-margin problem is non-robust.


1 Introduction

In a seminal paper, Szegedy et al. (2013) observed that deep networks are extremely vulnerable to adversarial examples. They demonstrated that in trained neural networks very small perturbations to the input can change the predictions. This phenomenon has attracted considerable interest, and various attacks (e.g., Carlini and Wagner (2017); Papernot et al. (2017); Athalye et al. (2018); Carlini and Wagner (2018); Wu et al. (2020)) and defenses (e.g., Papernot et al. (2016); Madry et al. (2017); Wong and Kolter (2018); Croce and Hein (2020)) were developed. Despite a great deal of research, it is still unclear why neural networks are so susceptible to adversarial examples (Goodfellow et al., 2014; Fawzi et al., 2018; Shafahi et al., 2018; Schmidt et al., 2018; Khoury and Hadfield-Menell, 2018; Bubeck et al., 2019; Allen-Zhu and Li, 2020; Wang et al., 2020; Shah et al., 2020; Shamir et al., 2021; Ge et al., 2021). Specifically, it is not well-understood why gradient methods learn non-robust networks, namely, networks that are susceptible to adversarial examples, even in cases where robust classifiers exist (cf. Daniely and Shacham (2020)).

In a recent string of works, it was shown that small adversarial perturbations can be found for any fixed input in certain ReLU networks with random weights (drawn from a Gaussian distribution). Building on Shamir et al. (2019), it was shown in Daniely and Shacham (2020) that small adversarial perturbations (measured in the Euclidean norm) can be found in random ReLU networks where each layer has vanishing width relative to the previous layer. Bubeck et al. (2021) extended this result to general two-layer ReLU networks, and Bartlett et al. (2021) extended it to a large family of ReLU networks of constant depth. These works aim to explain the abundance of adversarial examples in ReLU networks, since they imply that adversarial examples are common in random networks, and in particular at random initializations of gradient-based methods. However, trained networks are clearly not random, and properties that hold in random networks may not hold in trained networks. Hence, finding a theoretical explanation for the existence of adversarial examples in trained networks remains a major challenge.

In this work, we show that in depth-2 ReLU networks trained with the logistic loss or the exponential loss, gradient flow is biased towards non-robust networks, even when robust networks that classify the training dataset correctly exist. We focus on the setting where we train the network on a binary classification dataset in which the correlation between every pair of examples is small; e.g., this assumption holds w.h.p. if the inputs are drawn i.i.d. from the uniform distribution on the sphere of radius √d. On the one hand, we prove that the training dataset can be correctly classified by a (sufficiently wide) depth-2 ReLU network, where for each example in the dataset flipping the sign of the output requires a perturbation (measured in the Euclidean norm) on the order of the norm of the inputs. On the other hand, we prove that for depth-2 ReLU networks of any width, gradient flow converges to networks such that for every example in the dataset the sign of the output can be flipped with a much smaller perturbation. Moreover, the same adversarial perturbation applies to all examples in the dataset.

For example, if the examples are "almost orthogonal" (their pairwise inner products are small compared to their squared norms), then we show that the trained network admits small adversarial perturbations for each example in the dataset. Also, if the examples are drawn i.i.d. from the uniform distribution on the sphere of radius √d, then w.h.p. the trained network admits small adversarial perturbations for each example in the dataset. In both cases, the dataset can be correctly classified by a depth-2 ReLU network for which perturbations of that size cannot flip the sign of the output for any example in the dataset.

A limitation of our negative result is that it assumes an upper bound on the size of the dataset in terms of the input dimension. Hence, it does not apply directly to large datasets. Therefore, we extend our result to the case where the dataset might be arbitrarily large, but the size of the subset of examples that attain exactly the margin is bounded. Thus, instead of assuming an upper bound on the size of the training dataset, it suffices to assume an upper bound on the size of the subset of examples that attain the margin in the trained network.

The tendency of gradient flow to converge to non-robust networks even when robust networks exist can be seen as an implication of its implicit bias. While existing works mainly consider the implicit bias of neural networks in the context of generalization, we show that it is also a useful technical tool in the context of robustness. In order to prove our negative result, we utilize known properties of the implicit bias in depth-2 ReLU networks trained with the logistic or the exponential loss. By Lyu and Li (2019) and Ji and Telgarsky (2020), if gradient flow in homogeneous models (which include depth-2 ReLU networks) with such losses converges to zero loss, then it converges in direction to a KKT point of the max-margin problem in parameter space. In our proof we show that under our assumptions on the dataset, every network that satisfies the KKT conditions of the max-margin problem is non-robust. This fact may seem surprising, since our geometric intuition from linear predictors suggests that maximizing the margin is equivalent to maximizing the robustness. However, once we consider more complex models, we show that robustness and margin maximization in parameter space are two properties that do not align, and can even contradict each other.

We complement our theoretical results with an empirical study. As we already mentioned, a limitation of our negative result is that it applies to the case where the size of the dataset is smaller than the input dimension. We show empirically that the same small perturbation from our negative result is also able to change the labels of almost all the examples in the dataset, even when the dataset is much larger than the input dimension. In addition, our theoretical negative result holds regardless of the width of the network. We demonstrate this empirically by showing that changing the width does not change the size of the perturbation that flips the labels of the examples in the dataset.

2 Preliminaries

Notations.

We use bold-face letters to denote vectors, e.g., x = (x_1, …, x_d). For a vector x we denote by ‖x‖ the Euclidean norm, and by sgn(·) the sign function. We denote by 1[·] the indicator function; for example, 1[z > 0] equals 1 if z > 0 and 0 otherwise. For an integer n ≥ 1 we denote [n] = {1, …, n}. For a set A we denote by U(A) the uniform distribution over A. We use standard asymptotic notation O(·) and Ω(·) to hide constant factors, and Õ(·) and Ω̃(·) to hide logarithmic factors.

Neural networks.

The ReLU activation function is defined by σ(z) = max(0, z). In this work we consider depth-2 ReLU neural networks. Formally, a depth-2 network of width k is parameterized by θ = (w_1, …, w_k, b_1, …, b_k, v_1, …, v_k), where w_j ∈ ℝ^d and b_j, v_j ∈ ℝ for all j ∈ [k], and for every input x ∈ ℝ^d we have

    N_θ(x) = Σ_{j=1}^{k} v_j · σ(⟨w_j, x⟩ + b_j) .

We sometimes view θ as the vector obtained by concatenating the vectors w_1, …, w_k, b_1, …, b_k, v_1, …, v_k. Thus, ‖θ‖ denotes the Euclidean norm of this concatenated vector.

We say that a network is homogeneous if there exists L > 0 such that for every α > 0, every θ, and every input x we have N_{α·θ}(x) = α^L · N_θ(x). Note that depth-2 ReLU networks as defined above are homogeneous (with L = 2).

Robustness.

Given some function ρ = ρ(d) > 0, we say that a neural network N is ρ-robust w.r.t. inputs x_1, …, x_n if for every i ∈ [n] and every perturbation z with ‖z‖ ≤ ρ, we have sgn(N(x_i + z)) = sgn(N(x_i)). Thus, changing the labels of the examples cannot be done with perturbations of size ρ. Note that we consider here ℓ_2 perturbations.

In this work we focus on the case where the inputs lie on the sphere of radius √d, which generally corresponds to coordinates of size Θ(1). Then, the distance between every two inputs is at most 2√d, and therefore a perturbation of size 2√d clearly suffices for flipping the sign of the output (assuming that there is at least one input with a positive output and one input with a negative output). Hence, the best we can hope for is robustness on the order of √d. In our results we show a setting where such a robust network exists, but gradient flow converges to a network where we can flip the sign of the outputs with perturbations of size much smaller than √d (and hence the network is far from being robust at that scale).

Note that in homogeneous neural networks, for every α > 0 and every x, we have sgn(N_{α·θ}(x)) = sgn(N_θ(x)). Thus, the robustness of the network depends only on the direction of θ, and does not depend on the scaling of θ.

Gradient flow and implicit bias.

Let {(x_i, y_i)}_{i=1}^n ⊆ ℝ^d × {−1, 1} be a binary classification training dataset. Let N_θ be a neural network parameterized by θ. For a loss function ℓ : ℝ → ℝ, the empirical loss of N_θ on the dataset is

    L(θ) = Σ_{i=1}^{n} ℓ(y_i · N_θ(x_i)) .   (1)

We focus on the exponential loss ℓ(q) = e^{−q} and the logistic loss ℓ(q) = log(1 + e^{−q}).

We consider gradient flow on the objective given in Eq. (1). This setting captures the behavior of gradient descent with an infinitesimally small step size. Let θ(t) denote the trajectory of gradient flow. Starting from an initial point θ(0), the dynamics of θ(t) is given by the differential equation dθ(t)/dt = −∇L(θ(t)). Note that the ReLU function is not differentiable at 0. Practical implementations of gradient methods define the derivative σ'(0) to be some constant in [0, 1]. We note that the exact value of σ'(0) has no effect on our results.
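To make the dynamics concrete, here is a minimal sketch that discretizes gradient flow with a very small constant step size (an Euler scheme) for a depth-2 ReLU network trained with the exponential loss. The dimensions, width, step size, and number of steps are illustrative choices of mine, not values used in the paper.

```python
import torch

# Depth-2 ReLU network: N_theta(x) = sum_j v_j * relu(<w_j, x> + b_j)
class DepthTwoReLU(torch.nn.Module):
    def __init__(self, d, width):
        super().__init__()
        self.hidden = torch.nn.Linear(d, width, bias=True)   # w_j and b_j
        self.v = torch.nn.Parameter(torch.randn(width) / width ** 0.5)

    def forward(self, x):
        return torch.relu(self.hidden(x)) @ self.v

def empirical_loss(net, X, y):
    # L(theta) = sum_i exp(-y_i * N_theta(x_i)), i.e., Eq. (1) with the exponential loss
    return torch.exp(-y * net(X)).sum()

d, n, width, eta = 50, 20, 200, 1e-3                  # illustrative sizes and step size
X = torch.randn(n, d)
X = X / X.norm(dim=1, keepdim=True) * d ** 0.5        # inputs on the sphere of radius sqrt(d)
y = (torch.randint(0, 2, (n,)) * 2 - 1).float()       # labels in {-1, +1}

net = DepthTwoReLU(d, width)
for _ in range(20_000):                               # Euler step: theta <- theta - eta * grad L(theta)
    loss = empirical_loss(net, X, y)
    grads = torch.autograd.grad(loss, list(net.parameters()))
    with torch.no_grad():
        for p, g in zip(net.parameters(), grads):
            p -= eta * g
```

With an exponentially tailed loss the parameter norm keeps growing, so one typically tracks the normalized direction θ(t)/‖θ(t)‖, which is the object that Theorem 2.1 below characterizes.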

We say that a trajectory θ(t) converges in direction to a vector θ̃ if lim_{t→∞} θ(t)/‖θ(t)‖ = θ̃/‖θ̃‖. Throughout this work we use the following theorem:

Theorem 2.1 (Paraphrased from Lyu and Li (2019); Ji and Telgarsky (2020)).

Let N_θ be a homogeneous ReLU neural network parameterized by θ. Consider minimizing either the exponential or the logistic loss over a binary classification dataset {(x_i, y_i)}_{i=1}^n using gradient flow. Assume that there exists a time t_0 at which the network classifies the dataset correctly, namely, y_i · N_{θ(t_0)}(x_i) > 0 for every i ∈ [n]. Then, gradient flow converges in direction to a first order stationary point (KKT point) of the following maximum margin problem in parameter space:

    min_θ  ½ ‖θ‖²   s.t.   y_i · N_θ(x_i) ≥ 1  for all i ∈ [n] .   (2)

Moreover, L(θ(t)) → 0 and ‖θ(t)‖ → ∞ as t → ∞.

Note that in ReLU networks Problem (2) is non-smooth. Hence, the KKT conditions are defined using the Clarke subdifferential, which is a generalization of the derivative for non-differentiable functions. See Appendix A for a formal definition. Theorem 2.1 characterizes the implicit bias of gradient flow with the exponential and the logistic loss for homogeneous ReLU networks: even though there are many possible directions that classify the dataset correctly, gradient flow converges only to directions that are KKT points of Problem (2). We note that such a KKT point is not necessarily a global or local optimum (cf. Vardi et al. (2021)). Thus, under the theorem's assumptions, gradient flow may not converge to an optimum of Problem (2), but it is guaranteed to converge to a KKT point.
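For reference, the KKT conditions of Problem (2) have the following standard form, written here with the Clarke subdifferential ∂°. This is a schematic restatement; the precise definitions are the ones given in Appendix A of the paper.

```latex
\begin{align*}
&\theta \in \sum_{i=1}^{n} \lambda_i \,\partial^{\circ}_{\theta}\!\left(y_i N_{\theta}(\mathbf{x}_i)\right)
  && \text{(stationarity)}\\
&y_i N_{\theta}(\mathbf{x}_i) \ge 1 \quad \text{for all } i \in [n]
  && \text{(primal feasibility)}\\
&\lambda_i \ge 0 \quad \text{for all } i \in [n]
  && \text{(dual feasibility)}\\
&\lambda_i \left(y_i N_{\theta}(\mathbf{x}_i) - 1\right) = 0 \quad \text{for all } i \in [n]
  && \text{(complementary slackness)}
\end{align*}
```

In particular, λ_i can be non-zero only for examples that attain the margin exactly, a fact used repeatedly in the proof sketch of Section 5.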

3 Robust networks exist

We first show that for datasets where the correlation between every pair of examples is not too large, there exists a robust depth-2 ReLU network that labels the examples correctly. Intuitively, such a network exists since we can choose the weight vectors and biases such that each neuron is active for exactly one example x_i in the dataset, and hence only one neuron contributes to the gradient of the network at x_i. Also, the weight vectors are not too large, and therefore the gradient at x_i is sufficiently small. Formally, we have the following:

Theorem 3.1.

Let {(x_i, y_i)}_{i=1}^n be a dataset whose inputs lie on the sphere of radius √d, and suppose that for every i ≠ j the inner product |⟨x_i, x_j⟩| is at most c·d for a sufficiently small constant c independent of d. Then, there exists a depth-2 ReLU network N of width n such that N classifies every example correctly (y_i · N(x_i) > 0 for every i ∈ [n]), and for every i flipping the sign of the output requires a perturbation of size on the order of √d. Thus, N is robust w.r.t. x_1, …, x_n against perturbations of that order.

Proof.

Consider the network N(x) = Σ_{i=1}^n y_i · σ(⟨w_i, x⟩ + b_i), where each weight vector w_i is a small multiple of x_i and each bias b_i is negative and chosen such that the i-th neuron is active on x_i while every other neuron is inactive on x_i. Such a choice is possible since ⟨x_i, x_i⟩ = d while |⟨x_i, x_j⟩| is much smaller for j ≠ i. Then, for every i only the i-th neuron is active at x_i and it contributes with sign y_i, and hence y_i · N(x_i) > 0 for every i ∈ [n].

We now prove that N is robust. Let i ∈ [n] and let z be a perturbation whose norm is a sufficiently small constant fraction of √d. Since the weight vectors have small norm, the perturbation changes the input to each neuron only slightly: the i-th neuron remains active with a contribution of sign y_i that is bounded away from 0, while the remaining neurons, whose inputs at x_i are negative and bounded away from 0, remain inactive. Summing these contributions shows that the sign of N(x_i + z) equals the sign of N(x_i). ∎
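The construction can be illustrated numerically. The sketch below builds a width-n network with one neuron per training example on nearly orthogonal inputs; the scaling of the weights (w_i = x_i/d) and the bias value (−1/2) are illustrative choices of mine, not the exact constants from the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 500, 50
X = rng.standard_normal((n, d))
X = X / np.linalg.norm(X, axis=1, keepdims=True) * np.sqrt(d)   # nearly orthogonal inputs of norm sqrt(d)
y = rng.choice([-1.0, 1.0], size=n)

# One neuron per example: w_i proportional to x_i, bias chosen so that the i-th
# neuron is active on x_i but inactive on every other (nearly orthogonal) example.
W = X / d                        # <w_i, x_i> = 1, while |<w_i, x_j>| is O(1/sqrt(d)) for j != i
b = -0.5 * np.ones(n)            # illustrative threshold separating the two regimes
v = y.copy()                     # the outgoing weight carries the label

def net(x):
    return v @ np.maximum(W @ x + b, 0.0)

# The network classifies every training example correctly...
assert all(np.sign(net(X[i])) == y[i] for i in range(n))

# ...and random perturbations of size 0.2*sqrt(d) do not flip any sign.
eps = 0.2 * np.sqrt(d)
flips = 0
for i in range(n):
    z = rng.standard_normal(d)
    z = z / np.linalg.norm(z) * eps
    flips += int(np.sign(net(X[i] + z)) != y[i])
print("label flips under random perturbations of size 0.2*sqrt(d):", flips)
```

In this construction each example lies well inside the active region of its own neuron and far from the active regions of all other neurons, which is exactly the intuition described before Theorem 3.1.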

It is not hard to show that the condition on the inner products in the above theorem holds w.h.p. when the inputs x_1, …, x_n are drawn from the uniform distribution on the sphere of radius √d. Indeed, the following lemma implies that in this case the pairwise inner products are, up to logarithmic factors, of order √d, which is far smaller than d (see Appendix B for the proof).

Lemma 3.1.

Let x_1, …, x_n be drawn i.i.d. from the uniform distribution on the sphere of radius √d, where n is at most polynomial in d. Then, with high probability we have |⟨x_i, x_j⟩| = Õ(√d) simultaneously for all i ≠ j. Moreover, under a mild additional bound on n, the logarithmic factors can be reduced.
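As a quick numerical sanity check of Lemma 3.1 (not taken from the paper), the following snippet samples points uniformly from the sphere of radius √d and compares the largest pairwise inner product to √(d·log n); the two quantities are of the same order, and both are far smaller than ‖x_i‖² = d.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 200
X = rng.standard_normal((n, d))
X = X / np.linalg.norm(X, axis=1, keepdims=True) * np.sqrt(d)   # uniform on the sphere of radius sqrt(d)

G = X @ X.T                                    # Gram matrix; diagonal entries equal d
off_diag = np.abs(G[~np.eye(n, dtype=bool)])
print("||x_i||^2 = d                 :", d)
print("max_{i != j} |<x_i, x_j>|     :", round(float(off_diag.max()), 1))
print("sqrt(d * log n) for comparison:", round(float(np.sqrt(d * np.log(n))), 1))
```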

4 Gradient flow converges to non-robust networks

We now show that even though robust networks exist, gradient flow is biased towards non-robust networks. For homogeneous networks, Theorem 2.1 implies that gradient flow generally converges in direction to a KKT point of Problem (2). Moreover, as discussed previously, the robustness of the network depends only on the direction of the parameter vector. Thus, it suffices to show that every network that satisfies the KKT conditions of Problem (2) is non-robust. We prove this in the following theorem:

Theorem 4.1.

Let {(x_i, y_i)}_{i=1}^n be a training dataset whose inputs lie on the sphere of radius √d and which contains examples with both labels. Assume that the pairwise inner products |⟨x_i, x_j⟩| for i ≠ j are sufficiently small, and that n is sufficiently small relative to d. Let N_θ be a depth-2 ReLU network such that θ is a KKT point of Problem (2). Then, there is a vector z with ‖z‖ much smaller than √d, such that for every i with y_i = 1 we have sgn(N_θ(x_i + z)) ≠ y_i, and for every i with y_i = -1 we have sgn(N_θ(x_i − z)) ≠ y_i.

Example 1.

Assume that the parameter from the above theorem is a constant independent of d. Consider the following cases:

  • If the data points are "almost orthogonal" (their pairwise inner products are much smaller than d) and the dataset is sufficiently large, then the adversarial perturbation from the theorem is very small relative to √d. Thus, in this case gradient flow converges to highly non-robust solutions, since even very small perturbations can flip the signs of the outputs for all examples in the dataset.

  • If the inputs are drawn i.i.d. from the uniform distribution on the sphere of radius √d, then by Lemma 3.1 the pairwise inner products are w.h.p. of order Õ(√d), and hence, for datasets that are not too large, the adversarial perturbation is again much smaller than √d.

Note that in the above cases the size of the adversarial perturbation is much smaller than √d. Also, note that by Theorem 3.1 and Lemma 3.1, there exist networks that classify the dataset correctly and are robust to perturbations of size on the order of √d.

Thus, under the assumptions of Theorem 4.1, gradient flow converges to non-robust networks, even when robust networks exist by Theorem 3.1. We discuss the proof ideas in Section 5. We note that Theorem 4.1 implicitly assumes that the dataset can be correctly classified by a depth-2 ReLU network, which is indeed true (in fact, even by a network of small constant width, since by assumption the pairwise correlations between the examples are small). Moreover, we note that the assumption that the inputs lie on the sphere of radius √d is mostly for technical convenience, and we believe that it can be relaxed to having all points approximately of the same norm (which would happen, e.g., if the inputs were sampled from a standard Gaussian distribution).

The result in Theorem 4.1 has several interesting properties:

  • It does not require any assumptions on the width of the neural network.

  • It does not depend on the initialization, and holds whenever gradient flow converges to zero loss. Note that if gradient flow converges to zero loss then by Theorem 2.1 it converges in direction to a KKT point of Problem (2) (regardless of the initialization of gradient flow) and hence the result holds.

  • It proves the existence of adversarial perturbations for every example in the dataset.

  • The same vector is used as an adversarial perturbation (up to sign) for all examples. It corresponds to the well-known empirical phenomenon of universal adversarial perturbations, where one can find a single perturbation that simultaneously flips the label of many inputs (cf. Moosavi-Dezfooli et al. (2017); Zhang et al. (2021)).

  • The perturbation depends only on the training dataset. Thus, for a given dataset, the same perturbation applies to all depth-2 networks to which gradient flow might converge. It corresponds to the well-known empirical phenomenon of transferability in adversarial examples, where one can find perturbations that simultaneously flip the labels of many different trained networks (cf. Liu et al. (2016); Akhtar and Mian (2018)).

A limitation of Theorem 4.1 is that it holds only for datasets whose size is bounded in terms of the input dimension; e.g., as we discussed in Example 1, both the orthogonal case and the random case require an upper bound on the number of examples. Moreover, the conditions of the theorem do not allow datasets that contain clusters, where the inner products between data points are large, and do not allow data points whose norm differs from √d. In the following corollary we extend Theorem 4.1 to allow for such scenarios. Here, the technical assumptions are only on the subset of data points that attain the margin. In particular, if this subset satisfies the assumptions, then the dataset may be arbitrarily large and may contain clusters and points with different norms.

Corollary 4.1.

Let {(x_i, y_i)}_{i=1}^n be a training dataset. Let N_θ be a depth-2 ReLU network such that θ is a KKT point of Problem (2). Let S be the set of examples in the dataset that attain margin exactly 1, and assume that S contains examples with both labels. Assume that the inputs of the examples in S lie on the sphere of radius √d, that their pairwise inner products are sufficiently small, and that |S| is sufficiently small relative to d. Then, there is a vector z with ‖z‖ much smaller than √d, such that for every example in S with y_i = 1 we have sgn(N_θ(x_i + z)) ≠ y_i, and for every example in S with y_i = -1 we have sgn(N_θ(x_i − z)) ≠ y_i.

The proof of the corollary can be easily obtained by slightly modifying the proof of Theorem 4.1 (see Appendix D for details). Note that in Theorem 4.1 the margin-attaining set contains all the examples in the dataset, and hence for all points in the dataset there are small adversarial perturbations, while in Corollary 4.1 the set S contains only the examples that attain the margin exactly; the adversarial perturbations provably exist for the examples in S, and their size depends on the size of S.

5 Proof sketch of Theorem 4.1

In this section we discuss the main ideas in the proof of Theorem 4.1. For the formal proof see Appendix C.

5.1 A simple example

We start with a simple example to gain some intuition. Consider a dataset {(x_i, y_i)}_{i=1}^n such that x_i = √d · e_i for all i, where e_1, …, e_n are the standard unit vectors in ℝ^d and n is even. Suppose that y_i = 1 for i ≤ n/2 and y_i = -1 for i > n/2.

First, consider the robust network N_1 of width n from Theorem 3.1 that correctly classifies the dataset. In the proof of Theorem 3.1 we constructed the network such that for every j the weight vector w_j is proportional to x_j and the bias b_j is negative. In this network, each input x_i is in the active region (i.e., the region of inputs where the ReLU is active) of exactly one neuron, and has distance on the order of √d from the active regions of the other neurons. Hence, adding a small enough perturbation to an input can affect only the contribution of one neuron to the output, and will not flip the output's sign.

Now, we consider a network N_2 whose weight vectors point in the same directions as those of N_1, and whose bias terms all equal 0. It is easy to verify that N_2 classifies all the examples correctly. Since its parameters are not larger than those of N_1 while its margins are not smaller, the network N_2 is better than N_1 in the sense of margin maximization. However, N_2 is much less robust than N_1. Indeed, note that in the network N_2 each input x_i is on the boundary of the active regions of all neurons j ≠ i, that is, for all j ≠ i the input to neuron j is exactly zero at x_i. As a result, a perturbation can affect the contribution of all neurons to the output. Let i be an index with y_i = 1, and consider adding to x_i a perturbation z that is a positive combination of the inputs x_j with y_j = -1. Thus, z is spanned by all the inputs with negative labels, and it activates the corresponding neurons and increases their (negative) contribution to the output. It is not hard to show that a perturbation of this form with norm much smaller than √d already satisfies N_2(x_i + z) < 0. Therefore, N_2 is much less robust than N_1 (which required perturbations on the order of √d). Thus, bias towards margin maximization might have a negative effect on the robustness. Of course, this is just an example, and in the following subsection we provide a more formal overview of the proof.
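The following sketch makes this example concrete; the constants (weight scaling x_i/d and bias −1/2 for N_1) are my own illustrative choices and not the exact values used above.

```python
import numpy as np

d = 100                                    # n = d examples, x_i = sqrt(d) * e_i
X = np.sqrt(d) * np.eye(d)
y = np.concatenate([np.ones(d // 2), -np.ones(d // 2)])

W = X / d                                  # w_i proportional to x_i, so <w_i, x_i> = 1
v = y.copy()

def net(x, bias):
    return v @ np.maximum(W @ x + bias, 0.0)

b1 = -0.5 * np.ones(d)                     # N_1: each x_i activates only its own neuron
b2 = np.zeros(d)                           # N_2: every x_i lies on the boundary of all other neurons

assert all(np.sign(net(X[i], b1)) == y[i] for i in range(d))   # both networks classify
assert all(np.sign(net(X[i], b2)) == y[i] for i in range(d))   # the dataset correctly

# A perturbation spanned by the negatively labeled inputs activates the
# corresponding (negative) neurons of N_2.
z = X[y < 0].sum(axis=0)
z = z / np.linalg.norm(z)                  # unit direction
x0 = X[0]                                  # a positively labeled example
for eps in [1.0, 2.0, 4.0]:                # all far smaller than sqrt(d) = 10
    print(f"eps={eps}:  N1(x0+eps*z)={net(x0 + eps * z, b1):+.3f}  N2(x0+eps*z)={net(x0 + eps * z, b2):+.3f}")
```

With these choices, a perturbation of norm about 2 already flips N_2 at a positively labeled example, while N_1 is unaffected by perturbations of that size (here √d = 10).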

5.2 Proof overview

We denote θ = (w_1, …, w_k, b_1, …, b_k, v_1, …, v_k). Thus, N_θ is a network of width k, where the weights in the first layer are w_1, …, w_k, the bias terms are b_1, …, b_k, and the weights in the second layer are v_1, …, v_k. We assume that θ satisfies the KKT conditions of Problem (2). We denote I_+ = {i ∈ [n] : y_i = 1} and I_- = {i ∈ [n] : y_i = -1}, and we denote by J_+ and J_- the sets of indices j with v_j > 0 and v_j < 0, respectively. Note that since the dataset contains both examples with label 1 and examples with label -1, the sets I_+ and I_- are non-empty. For simplicity, we assume here that v_j = 1 for j ∈ J_+ and v_j = -1 for j ∈ J_-. We emphasize that we make this simplifying assumption only in order to streamline the description of the proof idea; the formal proof does not rely on it.

Since θ satisfies the KKT conditions of Problem (2) (see Appendix A for the formal definition), there are λ_1, …, λ_n such that for every j ∈ [k] we have

    w_j = Σ_{i∈[n]} λ_i ∇_{w_j}(y_i · N_θ(x_i)) = Σ_{i∈[n]} λ_i y_i v_j σ'_{i,j} x_i ,   (3)

where σ'_{i,j} is a subdifferential of σ at ⟨w_j, x_i⟩ + b_j; i.e., if ⟨w_j, x_i⟩ + b_j ≠ 0 then σ'_{i,j} = 1[⟨w_j, x_i⟩ + b_j > 0], and otherwise σ'_{i,j} is some value in [0, 1] (we note that in this case σ'_{i,j} can be any value in [0, 1], and in our proof we do not have any further assumptions on it). Also, we have λ_i ≥ 0 for all i, and λ_i = 0 if y_i · N_θ(x_i) ≠ 1. Likewise, we have

    b_j = Σ_{i∈[n]} λ_i ∇_{b_j}(y_i · N_θ(x_i)) = Σ_{i∈[n]} λ_i y_i v_j σ'_{i,j} .   (4)
In the proof we use Eq. (3) and (4) in order to show that N_θ is non-robust. We focus here on the case where y_i = 1 and we show that N_θ(x_i + z) < 0; the result for y_i = -1 can be obtained in a similar manner. We denote by z the adversarial perturbation (its precise definition is given in the formal proof). The proof consists of three main components:

  1. We show that y_i · N_θ(x_i) = 1, namely, x_i attains exactly margin 1.

  2. For every j ∈ J_+ we have ⟨w_j, z⟩ ≤ 0. Since for j ∈ J_+ we have v_j > 0, it implies that when moving from x_i to x_i + z the non-negative contribution of the neurons in J_+ to the output does not increase.

  3. When moving from x_i to x_i + z, the total contribution of the neurons in J_- to the output (which is non-positive) decreases by more than 1.

Note that the combination of the above properties implies that N_θ(x_i + z) < 0, as required (a schematic summary is given below). We now describe the main ideas for the proof of each part.
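Schematically, the three components combine as follows for an example x_i with y_i = 1 (using the notation above; the exact constants are in the formal proof):

```latex
\[
N_\theta(\mathbf{x}_i+\mathbf{z})
 \;=\; \underbrace{N_\theta(\mathbf{x}_i)}_{=\,1\ \text{(component 1)}}
 \;+\; \underbrace{\sum_{j\in J_+} v_j\Big(\sigma\big(\langle\mathbf{w}_j,\mathbf{x}_i+\mathbf{z}\rangle+b_j\big)
        -\sigma\big(\langle\mathbf{w}_j,\mathbf{x}_i\rangle+b_j\big)\Big)}_{\le\,0\ \text{(component 2)}}
 \;+\; \underbrace{\sum_{j\in J_-} v_j\Big(\sigma\big(\langle\mathbf{w}_j,\mathbf{x}_i+\mathbf{z}\rangle+b_j\big)
        -\sigma\big(\langle\mathbf{w}_j,\mathbf{x}_i\rangle+b_j\big)\Big)}_{<\,-1\ \text{(component 3)}}
 \;<\; 0 .
\]
```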

5.3 The examples in the dataset attain the margin

We show that all examples in the dataset attain margin exactly 1. The main idea can be described informally as follows (see Lemma C.1 for the details). Assume that there is an example x_{i'} with y_{i'} · N_θ(x_{i'}) > 1. Hence, λ_{i'} = 0. Suppose w.l.o.g. that y_{i'} = 1. Using Eq. (3) and (4) we prove that in order to achieve y_{i'} · N_θ(x_{i'}) ≥ 1 while λ_{i'} = 0, there must be some other example x_i whose total sum of coefficients over the relevant neurons is large. Recall that by Eq. (3), the term λ_i y_i v_j σ'_{i,j} is the coefficient of x_i in the expression for w_j; hence, summing these terms over the relevant neurons gives the total sum of coefficients of x_i. Thus, our lower bound implies intuitively that the total sum of coefficients of x_i is large. We use this fact in order to show that x_i attains margin strictly larger than 1, which implies λ_i = 0, in contradiction to our lower bound on the sum of its coefficients.

5.4 The contribution of the neurons in J_+ to the output does not increase

For every j ∈ J_+ we show that ⟨w_j, z⟩ ≤ 0. Using Eq. (3) we have

    w_j = Σ_{i'∈[n]} λ_{i'} y_{i'} v_j σ'_{i',j} x_{i'} .

Therefore,

    ⟨w_j, z⟩ = Σ_{i'∈[n]} λ_{i'} y_{i'} v_j σ'_{i',j} ⟨x_{i'}, z⟩ .

By our assumptions on the dataset and on the choice of z, it follows easily that y_{i'}·⟨x_{i'}, z⟩ ≤ 0 for every i', and hence, since v_j > 0 and λ_{i'}, σ'_{i',j} ≥ 0, we conclude that ⟨w_j, z⟩ ≤ 0.

5.5 The contribution of the neurons in J_- to the output decreases

We show that for an example x_i with y_i = 1, when moving from x_i to x_i + z the total contribution of the neurons in J_- to the output decreases by more than 1. Since v_j = -1 for every j ∈ J_-, we need to show that the sum of the outputs of the neurons in J_- increases by more than 1.

By a similar calculation to the one given in Subsection 5.4, we obtain a positive lower bound on ⟨w_j, z⟩ for every j ∈ J_- (Eq. (5) in the formal proof). Hence, for every j ∈ J_- the input to neuron j increases by at least this amount when moving from x_i to x_i + z. However, if ⟨w_j, x_i⟩ + b_j < 0, namely, at x_i the input to neuron j is negative, then increasing the input may not affect the output of the network. Indeed, by moving from x_i to x_i + z we might increase the input to neuron j, but if it is still negative then the output of neuron j remains 0.

In order to circumvent this issue we analyze the perturbation in two stages, as follows. We write z = z_1 + z_2, where z_1 = α·z for some α ∈ (0, 1) to be chosen later, and we denote x'_i = x_i + z_1. We prove that for every j ∈ J_- the input ⟨w_j, x_i⟩ + b_j might be negative, but it can be lower bounded. Hence, the lower bound of Eq. (5) implies that by choosing α appropriately we have ⟨w_j, x'_i⟩ + b_j ≥ 0 for all j ∈ J_-. That is, in the first stage we move from x_i to x'_i and increase the inputs to all neurons in J_- so that at x'_i they are non-negative. In the second stage we move from x'_i to x_i + z (using the perturbation z_2). Note that when we move from x'_i to x_i + z, every increase in the inputs to the neurons in J_- results in a decrease in the output of the network.

Since we need the output of the network to decrease by more than 1, and since v_j = -1 for every j ∈ J_-, when moving from x'_i to x_i + z we need the sum of the inputs to the neurons in J_- to increase by more than 1. Similarly to Eq. (5), we obtain a lower bound on the increase of this sum when moving from x'_i to x_i + z (Eq. (6) in the formal proof). Then, we prove a lower bound on the quantity appearing in this expression. We show that such a lower bound must hold, since if the quantity were too small then it would be impossible for all the examples to attain margin 1. This lower bound allows us to choose the scaling of the perturbation such that the expression in Eq. (6) is larger than 1. Finally, it remains to bound ‖z‖ and show that it satisfies the required upper bound.

6 Experiments

We complement our theoretical results with an empirical study of the robustness of depth-2 ReLU networks trained on synthetically generated datasets. Theorem 4.1 shows that networks trained with gradient flow converge to non-robust networks, but a couple of questions remain regarding the scope and limitations of this result. First, although the theorem limits the number of samples, we show here that the result also applies when there are many more training samples. Second, the theorem does not depend on the width of the trained network, and we show that even when the size of the training set is much larger than the input dimension, the width of the network does not affect the size of the minimal perturbation that changes the labels of the samples.

Experimental setting.

In all of our experiments we trained a depth-2 fully-connected neural network with ReLU activations using SGD with a fixed batch size; for experiments with fewer samples than the batch size, this is equivalent to full-batch gradient descent. We used the exponential loss, although we also tested the logistic loss and obtained similar results. Each experiment was run with several different random seeds, and we present the results in terms of the average and (when relevant) the standard deviation over these runs. We used an increasing learning rate to accelerate the convergence to the KKT point (which in theory is only reached at infinity): we began training with a small learning rate, increased it by a constant factor every fixed number of iterations, and stopped training once the loss dropped below a small threshold. We emphasize that since we use an exponentially tailed loss, the gradients are extremely small at late stages of training, hence to achieve such a small loss we must use an increasing learning rate. We implemented our experiments using PyTorch (Paszke et al. (2019)).
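The following PyTorch sketch mirrors the protocol described above at a small scale. All hyperparameter values (dimension, width, number of samples, batch handling, initial learning rate, growth factor and schedule, loss threshold, iteration cap) are placeholders of my own choosing, since the exact values are not reproduced here.

```python
import torch

torch.manual_seed(0)
d, n, width = 100, 200, 1000                      # illustrative sizes
X = torch.randn(n, d)
X = X / X.norm(dim=1, keepdim=True) * d ** 0.5    # inputs uniform on the sphere of radius sqrt(d)
y = (torch.randint(0, 2, (n,)) * 2 - 1).float()   # labels uniform on {-1, +1}

model = torch.nn.Sequential(torch.nn.Linear(d, width), torch.nn.ReLU(), torch.nn.Linear(width, 1))
net = lambda inputs: model(inputs).squeeze(-1)

lr, growth, grow_every, loss_threshold = 1e-3, 1.1, 500, 1e-5
opt = torch.optim.SGD(model.parameters(), lr=lr)  # full-batch gradient descent here

for step in range(1, 100_001):                    # iteration cap for the sketch
    opt.zero_grad()
    loss = torch.exp(-y * net(X)).mean()          # exponential loss
    loss.backward()
    opt.step()
    if step % grow_every == 0:                    # increasing learning rate, as described above
        for group in opt.param_groups:
            group["lr"] *= growth
    if loss.item() < loss_threshold:
        break
print(f"stopped after {step} steps, loss = {loss.item():.2e}")
```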

Dataset.

In all of our experiments we sampled (x_i, y_i) such that x_i is uniform on the sphere of radius √d and y_i is uniform on {-1, 1}. We also tested inputs sampled from a Gaussian distribution of a comparable scale and obtained similar results; here we only report the results for the uniform distribution.

Margin.

In our experimental results we defined the margin in the following way. We train a network N_θ over a dataset {(x_i, y_i)}_{i=1}^n. Suppose that after training all the samples are classified correctly (this happened in all of our experiments), i.e., y_i · N_θ(x_i) > 0 for all i. We define the margin as min_i y_i · N_θ(x_i). Finally, we say that a sample is on the margin if y_i · N_θ(x_i) is within a small multiplicative slack of this minimum. In words, we consider the sample attaining the minimum to be exactly on the margin, but we also allow some slack for other samples to be counted as on the margin. We must allow some slack, because in practice we cannot converge exactly to the KKT point, where all the samples on the margin would have exactly the same output.
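Given a trained network, the margin and the on-margin set described above can be computed as in the following helper; the slack factor 1.1 is a placeholder for the tolerance mentioned in the text.

```python
import torch

def margin_and_on_margin_set(net, X, y, slack=1.1):
    """Return the margin min_i y_i * net(x_i) and a boolean mask of the samples
    whose output is within a multiplicative slack of it (slack value is a placeholder)."""
    with torch.no_grad():
        outputs = y * net(X)          # y_i * N_theta(x_i); all positive once the data is fitted
    margin = outputs.min()
    return margin, outputs <= slack * margin
```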

6.1 Results

Figure 1: The effects of the width and the number of samples on the minimal perturbation size. The x-axis corresponds to the input dimension of the samples. The y-axis corresponds to the minimal perturbation size that changes the labels of all the samples on the margin. We defined the perturbation direction as in Theorem 4.1, computed from the set of samples that are on the margin. (a) The minimal perturbation size plotted for different sample sizes; here the number of samples is proportional to the input dimension, for several proportionality factors. In all of the experiments, the minimal perturbation size is well below the 2√d line, above which a perturbation can trivially change the labels and is therefore not adversarial. (b) The minimal perturbation size for different widths of the network; here the number of samples is fixed and only the width varies.

Minimum perturbation size.

Figure 1(a) shows that the perturbation defined in Theorem 4.1 can change the labels of all the samples on the margin, even when there are many more samples than Theorem 4.1 allows. To this end, we trained our model on datasets whose size is a multiple of the input dimension; note that Theorem 4.1 only covers much smaller datasets for uniformly distributed data. The width of the network is fixed in these experiments. After training is completed, we considered perturbations in the direction defined in Theorem 4.1, computed from the set of samples that are on the margin. The y-axis represents the minimal size of a perturbation in this direction that changes the labels of all the samples on the margin. We emphasize that we used the same perturbation for all the samples.
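The perturbation search described in this paragraph can be sketched as follows. The direction z ∝ Σ_{i∈S} y_i x_i over the on-margin set S is my reading of the construction in Theorem 4.1 (the exact normalization used in the experiments is not reproduced here), and each sample is moved against its own label along this common direction.

```python
import torch

def minimal_universal_perturbation(net, X, y, on_margin, eps_grid):
    """Smallest eps in eps_grid such that moving every on-margin sample against its
    label along the common direction z (proportional to sum_{i in S} y_i x_i)
    flips all of their predictions; returns None if no eps in the grid works."""
    with torch.no_grad():
        z = (y[on_margin].unsqueeze(1) * X[on_margin]).sum(dim=0)
        z = z / z.norm()                                   # unit-norm universal direction
        for eps in eps_grid:
            X_adv = X[on_margin] - eps * y[on_margin].unsqueeze(1) * z
            if (torch.sign(net(X_adv)) != y[on_margin]).all():
                return float(eps)
    return None

# Example usage (continuing the training sketch above):
#   margin, on_margin = margin_and_on_margin_set(net, X, y)
#   eps_grid = torch.linspace(0.1, 2 * d ** 0.5, 200)     # compare the result to 2*sqrt(d)
#   print(minimal_universal_perturbation(net, X, y, on_margin, eps_grid))
```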

We also plot the line 2√d, as a perturbation above this line can trivially change the labels of all points. Recall that by Theorem 3.1, there exists a network that is robust to perturbations of size on the order of √d (if the width of the network is at least the size of the dataset). From Figure 1(a), it is clear that the minimal perturbation size is much smaller than that. We also plot the standard deviation over the different random seeds, showing that our results are consistent.

It is also important to understand how many samples lie on the margin, since our perturbation changes the labels of these samples. Figure 2(a) plots the ratio of samples on the margin out of the total number of samples, and Figure 2(b) plots the number of samples not on the margin. These plots correspond to the same experiments as in Figure 1(a), where the number of samples depends on the input dimension. For sufficiently small datasets all the samples are on the margin, as was proven in Lemma C.1. For