1 Introduction
Despite the tremendous success of deep neural networks in many computer vision and language modeling tasks, as well as in scientific discoveries, their properties and the reasons for their success are still poorly understood. Focusing on computer vision, a particularly surprising phenomenon, evidencing that these machines drift away from how humans perform image recognition, is the presence of adversarial examples: images that are almost identical to the original ones, yet are misclassified by otherwise accurate models.

Since their discovery [Sze+14], a vast amount of work has been devoted to understanding the sources of adversarial examples. Explanations include, but are not limited to, the close to linear operating mode of neural nets [GSS15], the curse of dimensionality carried by the input space [GSS15, Gab+19], insufficient model capacity [Tsi+19, Nakk19] and spurious correlations found in common datasets [Ily+19]. In particular, one widespread viewpoint is that adversarial vulnerability is the result of a model’s sensitivity to imperceptible yet well-generalizing features in the data, so-called useful non-robust features, giving rise to a tradeoff between accuracy and robustness [Tsi+19, Zha+19]. This gradual understanding has enabled the design of training algorithms that provide convincing, yet partial, remedies to the problem, the most prominent of them being adversarial training and its many variants [GSS15, Mad+18, robustbench20]. Yet we are far from a mature, unified theory of robustness that is powerful enough to universally guide engineering choices or defense mechanisms.

In this work, we aim to get a deeper understanding of adversarial robustness (or lack thereof) by focusing on the recently established connection of neural networks with kernel machines. Infinitely wide neural networks, trained via gradient descent with infinitesimal learning rate, provably become kernel machines with a data-independent, but architecture-dependent kernel, its Neural Tangent Kernel (NTK), that remains constant during training [JHG18, Lee+19, Aro+19b, Liu+20]. The analytical tools afforded by the rich theory of kernels have resulted in progress in understanding the optimization landscape and generalization capabilities of neural networks [Du+19b, Aro+19a], together with the discovery of interesting deep learning phenomena [Fort+20, Jim+21], while also inspiring practical advances in diverse areas of application such as the design of better classifiers [Sha+20], efficient neural architecture search [Chen+21], low-dimensional tasks in graphics [Tan+20] and dataset distillation [Ngu+21]. While the NTK approximation is increasingly utilized, even for finite width neural nets, little is known about the adversarial robustness properties of these infinitely wide models.

Our contribution: Our work inscribes itself into the quest to leverage analytical tools afforded by kernel methods, in particular spectral analysis, to track properties of interest in the associated neural nets, in this case as they pertain to robustness. To this end, we first demonstrate that adversarial perturbations generated analytically with the NTK can successfully lead the associated trained wide neural networks (in the kernel regime) to misclassify, thus allowing kernels to faithfully predict the lack of robustness of those trained neural networks. In other words, adversarial (non-)robustness transfers from kernels to networks, and adversarial perturbations generated via kernels resemble those generated by the corresponding trained networks. One implication of this transferability is that we can analytically devise adversarial examples that do not require access to the trained model and in particular its weights; instead these “blind spots” may be calculated a priori, before training starts.
A perhaps even more crucial implication of the NTK approach to robustness relates to the understanding of adversarial examples. Indeed, we show how the spectrum of the NTK provides an alternative way to define features of the model, to classify them according to their robustness and usefulness for correct predictions and visually inspect them via their contribution to the adversarial perturbation (see Fig. 1). This in turn allows us to verify previously conjectured properties of standard classifiers: dependence on both robust and non-robust features in the data [Tsi+19], and tradeoff of accuracy and robustness during training. In particular, we observe that features tend to be rather invariable across architectures, and that robust features tend to correspond to the top of the eigenspectrum (see Fig. 2), and as such are learned first by the corresponding wide nets [Aro+19a, JHG18]. Moreover, we are able to visualize useful non-robust features of standard models (Fig. 4). While this conceptual feature distinction has been highly influential in recent works that study the robustness of deep neural networks (see for example [ZhLi20, KLR21, SMK21]), to the best of our knowledge, none of them has explicitly demonstrated the dependence of networks on such feature functions (except for simple linear models [Goh19]). Rather, these works either reveal such features in some indirect fashion, or accept their existence as an assumption. Here, we show that Neural Tangent Kernel theory endows us with a natural definition of features through its eigendecomposition and provides a way to visualise and inspect robust and non-robust features directly on the function space of trained neural networks.

Interestingly, this connection also enables us to empirically demonstrate that robust features of standard models alone are not enough for robust classification. Aiming to understand, then, what makes robust models robust, we track the evolution of the data-dependent empirical NTK during adversarial training of neural networks used in practice. Prior experimental work has found that networks with non-trivial width-to-depth ratio that are trained with large learning rates depart from the NTK regime and fall in the so-called “rich feature” regime, where the NTK changes substantially during training [Gei+19, Fort+20, Bar+21, Jim+21]. In our work, which to the best of our knowledge is the first to provide insights on how the kernel behaves during adversarial training, we find that the NTK evolves much faster compared to standard training, simultaneously changing its features and assigning more importance to the more robust ones, giving direct insight into the mechanism at play during adversarial training (see Fig. 6).

In summary, the contributions of our work are the following:

We discuss how to generate adversarial examples for infinitely wide neural networks via the NTK, and show that they transfer to fool their associated (finite width) nets in the appropriate regime, yielding a “training-free” attack without need to access model weights (Sec. 3).

Using the spectrum of the NTK, we give an alternative definition of features, providing a natural decomposition of perturbations into robust and non-robust parts [Tsi+19, Ily+19] (Fig. 1). We confirm that robust features overwhelmingly correspond to the top part of the eigenspectrum; hence they are learned early on in training. We bolster previously conjectured hypotheses that prediction relies on both robust and non-robust features and that robustness is traded for accuracy during standard training. Further, we show that only utilizing the robust features of standard models is not sufficient for robust classification (Sec. 4).

We turn to finite-width neural nets with standard parameters to study the dynamics of their empirical NTK during adversarial training. We show that the kernel rotates in a way that both enables new (robust) feature learning and drastically increases the importance (relative weight) of the robust features over the non-robust ones. We further highlight the structural differences of the kernel change during adversarial training versus standard training and observe that the kernel seems to enter the “lazy” regime much faster (Sec. 5).
Collectively, our findings may help explain many phenomena present in the adversarial ML literature and further elucidate both the vulnerability of standard models and the robustness of adversarially trained ones. We provide code to visualize features induced by kernels, giving a unique and principled way to inspect features induced by standardly trained nets.
Related work: To the best of our knowledge, the only prior work that leverages NTK theory to derive perturbations in some adversarial setting is due to [YuWu21], yet with an entirely different focus. It deals with what is coined generalization attacks: the process of altering the training data distribution to prevent models from generalising on clean data. [Bai+21] study aspects of robust models through their linearized subnetworks, but do not leverage NTKs.
2 Preliminaries
We introduce background material and definitions important to our analysis. Here, we restrict ourselves to binary classification, to keep notation light. We defer the multiclass case, complete definitions and a more detailed discussion of prior work to the Appendix.
2.1 Adversarial Examples
Let $f$ be a classifier, $x$ be an input (e.g. a natural image) and $y$ its label (e.g. the image class). Then, given that $f$ is an accurate classifier on $x$, $\tilde{x}$ is an adversarial example [Sze+14] for $f$ if

(i) the distance $d(x, \tilde{x})$ is small. Common choices in computer vision are the $\ell_p$ norms, especially the $\ell_\infty$ norm on which we focus henceforth, and

(ii) $f(\tilde{x}) \neq y$. That is, the perturbed input is being misclassified.
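The two conditions above can be checked mechanically. The following is a minimal sketch, assuming an $\ell_\infty$ budget `eps` and a hypothetical toy classifier (the sign of a linear score); the names `is_adversarial`, `predict`, `w` and all numbers are illustrative and not taken from the paper.

```python
import numpy as np

def is_adversarial(predict, x, x_adv, y, eps):
    """True if x is classified correctly, x_adv stays within the
    l_inf ball of radius eps around x, and x_adv is misclassified."""
    close = np.max(np.abs(x_adv - x)) <= eps + 1e-12  # tolerance for float error
    return bool(close and predict(x) == y and predict(x_adv) != y)

# Hypothetical toy classifier: sign of a linear score w.x
w = np.array([1.0, -2.0, 0.5])
predict = lambda v: 1 if float(w @ v) >= 0 else -1

x = np.array([0.3, 0.1, 0.2])   # score 0.3 - 0.2 + 0.1 = 0.2, so label +1
y = 1
eps = 0.1
# For a linear score, moving every coordinate by eps against sign(w)
# lowers the score by eps * ||w||_1 = 0.35, which flips the prediction.
x_adv = x - eps * np.sign(w)

print(is_adversarial(predict, x, x_adv, y, eps))
```

The unperturbed input itself never qualifies, since condition (ii) fails for it.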
Given a loss function $L$, such as cross-entropy, one can construct an adversarial example $\tilde{x} = x + \delta$ by finding the perturbation $\delta$ that produces the maximal increase of the loss, solving
$$\max_{\|\delta\|_\infty \le \epsilon} L(f(x + \delta), y) \tag{1}$$
for some $\epsilon > 0$ that quantifies the dissimilarity between the two examples. In general, this is a nonconvex problem and one can resort to first order methods [GSS15]
$$\tilde{x} = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x L(f(x), y)\right) \tag{2}$$
or iterative versions for solving it [KGB17, Mad+18]. The former method is usually called Fast Gradient Sign Method (FGSM) and the latter Projected Gradient Descent (PGD). These methods are able to produce examples that are misclassified by common neural networks with a probability that approaches 1 [CaWa17]. Even more surprisingly, it has been observed that adversarial examples crafted to “fool” one machine learning model are consistently capable of “fooling” others [PMG16, Pap+17], a phenomenon that is known as the transferability of adversarial examples. Finally, adversarial training refers to the alteration of the training procedure to include adversarial samples for teaching the model to be robust [GSS15, Mad+18] and empirically holds as the strongest defense against adversarial examples [Mad+18, Zha+19].

2.2 Robust and Non-Robust Features
Despite a vast amount of research, the reasons behind the existence of adversarial examples are not perfectly clear. A line of work has argued that a central reason is the presence of robust and non-robust features in the data that standard models learn to rely upon [Tsi+19, Ily+19]. In particular, it is conjectured that reliance on useful but non-robust features during training is responsible for the brittleness of neural nets. Here, we slightly adapt the feature definitions of [Ily+19] (see Footnote 1), and extend them to multiclass problems (see Appendix A).

Footnote 1: We distinguish useful and robust features based on their accuracy as classifiers, not in terms of correlation with the labels as in [Ily+19], allowing a natural extension to the multiclass setting. For robustness, we consider any accuracy bounded away from zero as robust, quantifying that an adversary cannot drive accuracy to zero entirely.
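Let $\mathcal{D}$ be the data generating distribution with $x \in \mathcal{X}$ and $y \in \{\pm 1\}$. We define a feature as a function $\phi: \mathcal{X} \to \mathbb{R}$ and distinguish how they perform as classifiers. Fix $\rho, \gamma \geq 0$:

$\rho$-Useful feature: A feature $\phi$ is called $\rho$-useful if
$$\mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\mathbb{1}\{\mathrm{sign}[\phi(x)] = y\}\right] = \rho. \tag{3}$$

$\gamma$-Robust feature: A feature $\phi$ is called $\gamma$-robust if it remains useful under any perturbation inside a bounded “ball” $\mathcal{B}$, that is if
$$\mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\inf_{\delta \in \mathcal{B}} \mathbb{1}\{\mathrm{sign}[\phi(x + \delta)] = y\}\right] = \gamma. \tag{4}$$

In general, a feature adds predictive value if it gives an advantage above guessing the most likely label, i.e. $\rho > \max_{y'} \mathbb{P}(y = y')$, and we will speak of “useful” features in this case, omitting the $\rho$. We will call such a feature useful, non-robust if it is useful, but $\gamma$-robust only for $\gamma = 0$ or very close to $0$, depending on context.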
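To make the two notions concrete, here is a small sketch that estimates usefulness and robustness as accuracies, under assumptions that are ours rather than the paper's: the feature is linear, $\phi(x) = w^\top x$, the perturbation set is an $\ell_\infty$ ball of radius `eps`, and the data is a synthetic toy set. For a linear feature the worst-case perturbation is available in closed form, since the margin $y\,\phi(x)$ can be reduced by at most `eps * ||w||_1`.

```python
import numpy as np

def usefulness(phi, X, Y):
    """Accuracy of sign(phi(x)) as a classifier of y."""
    return float(np.mean(np.sign(phi(X)) == Y))

def linear_robustness(w, X, Y, eps):
    """Accuracy of sign(w.x) under the worst l_inf perturbation of size eps.
    Exact for linear features: the adversary lowers y * w.x by eps * ||w||_1."""
    margins = Y * (X @ w) - eps * np.sum(np.abs(w))
    return float(np.mean(margins > 0))

# Synthetic toy data (hypothetical): one strong but noisy coordinate,
# plus many weak coordinates that are each only slightly correlated with y.
rng = np.random.default_rng(0)
n, d, eps = 2000, 50, 0.2
Y = rng.choice([-1, 1], size=n)
X = np.empty((n, d))
X[:, 0] = Y * rng.choice([1, -1], size=n, p=[0.9, 0.1])   # agrees with y 90% of the time
X[:, 1:] = 0.1 * Y[:, None] + 0.1 * rng.normal(size=(n, d - 1))

w_robust = np.zeros(d); w_robust[0] = 1.0                 # uses the strong coordinate
w_nonrobust = np.zeros(d); w_nonrobust[1:] = 1.0 / (d - 1)  # averages the weak ones

for name, w in [("robust", w_robust), ("non-robust", w_nonrobust)]:
    rho = usefulness(lambda Z: Z @ w, X, Y)
    gamma = linear_robustness(w, X, Y, eps)
    print(f"{name:>10}: usefulness={rho:.3f}, robustness={gamma:.3f}")
```

In this toy setting the averaged weak coordinates give a highly useful feature whose robust accuracy collapses, while the single strong coordinate is less accurate but keeps its accuracy under perturbation.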
The vast majority of works imagines features as being induced by the activations of neurons in the net, most commonly those of the penultimate layer (representation-layer features), but the previous formal definitions are in no way restricted to activations, and we will show how to exploit them using the eigenspectrum of the NTK. In particular, in Sec. 4, we demonstrate that the above framework agrees perfectly with features induced by the eigenspectrum of the NTK of a network, providing a natural way to decompose the predictions of the NTK into such feature functions. Specifically, we can identify robust, useful, and, indeed, useful non-robust features.

2.3 Neural Tangent Kernel
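Let $f$ be a (scalar) neural network with a linear final layer parameterized by a set of weights $w \in \mathbb{R}^p$ and $\{X, Y\}$ be a dataset of size $n$, with $X \in \mathbb{R}^{n \times d}$ and $Y \in \{\pm 1\}^n$. Linearized training methods study the first order approximation
$$f(x; w_{t+1}) = f(x; w_t) + \nabla_w f(x; w_t)^\top (w_{t+1} - w_t). \tag{5}$$
The network gradient $\nabla_w f(x; w_t)$ induces a kernel function $\Theta_t$, usually referred to as the Neural Tangent Kernel (NTK) of the model
$$\Theta_t(x, x') = \nabla_w f(x; w_t)^\top \nabla_w f(x'; w_t). \tag{6}$$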
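As an illustration of how such a kernel arises from parameter gradients, the sketch below computes the empirical NTK Gram matrix of a small two-layer ReLU network with a linear final layer. It assumes the kernel is the inner product of the parameter gradients at two inputs; the architecture, the $1/\sqrt{m}$ scaling and all names (`grad_params`, `empirical_ntk`, `W`, `a`) are illustrative choices, not the models used in the paper.

```python
import numpy as np

def grad_params(x, W, a):
    """Gradient of f(x) = a . relu(W x) / sqrt(m) with respect to (a, W), flattened."""
    m = W.shape[0]
    pre = W @ x
    g_a = np.maximum(pre, 0.0) / np.sqrt(m)                      # df/da
    g_W = ((a * (pre > 0))[:, None] * x[None, :]) / np.sqrt(m)   # df/dW
    return np.concatenate([g_a, g_W.ravel()])

def empirical_ntk(X1, X2, W, a):
    """Gram matrix of inner products between parameter gradients."""
    J1 = np.stack([grad_params(x, W, a) for x in X1])
    J2 = np.stack([grad_params(x, W, a) for x in X2])
    return J1 @ J2.T

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 5))     # hidden-layer weights (m = 64 units, d = 5 inputs)
a = rng.normal(size=64)          # linear output layer
X = rng.normal(size=(6, 5))      # six toy inputs
K = empirical_ntk(X, X, W, a)
print(K.shape, np.allclose(K, K.T))
```

Because the Gram matrix is a product of a Jacobian with its own transpose, it is symmetric and positive semi-definite by construction.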
This kernel describes the dynamics with infinitesimal learning rate (gradient flow). In general, the tangent space spanned by the $\nabla_w f(x; w_t)$ twists substantially during training, and learning with the Gram matrix of Eq. (6) (empirical NTK) corresponds to training along an intermediate tangent plane. Remarkably, however, in the infinite width limit with appropriate initialization and low learning rate, it has been shown that $f$ becomes a linear function of the parameters [JHG18, Liu+20], and the NTK remains constant ($\Theta_t = \Theta_0 =: \Theta$). Then, for learning with $\ell_2$ loss the training dynamics of infinitely wide networks admits a closed form solution corresponding to kernel regression [JHG18, Lee+19, Aro+19b]
$$f_t(x) = \Theta(x, X)^\top \Theta^{-1}(X, X) \left(I - e^{-\eta \Theta(X, X) t}\right) Y, \tag{7}$$
where $x$ is any input (training or testing), $t$ denotes the time evolution of gradient descent, $\eta$ is the (small) learning rate and, slightly abusing notation, $\Theta(X, X) \in \mathbb{R}^{n \times n}$ denotes the matrix containing the pairwise training values of the NTK, $\Theta(X, X)_{ij} = \Theta(x_i, x_j)$, and similarly for $\Theta(x, X) \in \mathbb{R}^n$. To be precise, Eq. (7) gives the mean output of the network using a weight-independent kernel with variance depending on the initialization (see Footnote 2).

Footnote 2: For that reason, in the experiments, we often compare this with the centered prediction of the actual neural network, $f(x; w_t) - f(x; w_0)$, as is commonly done in similar studies [COB19].

3 Transfer Results in the Kernel Regime
In this section, we show how to generate adversarial examples from NTKs and discuss their similarity to the ones generated by the actual networks. Note that for network results, we restrict ourselves to wide networks initialized in the “lazy” regime with small learning rates (the “kernel regime”).
3.1 Generation of Adversarial Examples for Infinitely Wide Neural Networks
Adversarial examples arise in the context of classification, while the NTK learning process is described by a regression as in Eq. (7). Arguably the simplest way to align with the framework presented in Eq. (1) is to treat the outputs of the kernel similarly to the logits of a neural net, mapping them to a probability distribution via the sigmoid/softmax function and applying the cross-entropy loss.
A simple calculation (see Appendix B, together with the generalization to the multiclass case) gives:
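The optimal one step adversarial example of a scalar, infinitely wide, neural network is given by
$$\tilde{x} = x - y \, \epsilon \, \mathrm{sign}\left(\nabla_x f_t(x)\right) \tag{8}$$
for $\|\tilde{x} - x\|_\infty \le \epsilon$, where $\nabla_x f_t(x) = \nabla_x \Theta(x, X)^\top \Theta^{-1}(X, X) \left(I - e^{-\eta \Theta(X, X) t}\right) Y$.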
One can conceive other ways to generate adversarial perturbations for the kernel, either by changing the loss function (as previously done in neural networks (e.g. [CaWa17])) or through a Taylor expansion around the test input, and we present such alternative derivations in Appendix B. However, in practice we observe little difference between that approach and the one presented here.
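The one-step kernel attack can be sketched end to end on toy data. The code below is an illustration under explicit assumptions, not the paper's implementation: an RBF kernel stands in for the NTK (chosen only because its input gradient is simple to write down), labels are in $\{\pm 1\}$, the kernel output is treated as a logit under sigmoid cross-entropy so that the sign of the loss gradient is $-y\,\mathrm{sign}(\nabla_x f_t(x))$, and the prediction follows the kernel regression form of Eq. (7). All function names and data are hypothetical.

```python
import numpy as np

def rbf(A, B, s=1.0):
    """Stand-in kernel (NOT the NTK): k(a, b) = exp(-||a - b||^2 / (2 s^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * s**2))

def dual_coeffs(K, Y, eta, t):
    """c = K^{-1} (I - exp(-eta K t)) Y, computed via the eigendecomposition of K."""
    lam, V = np.linalg.eigh(K)
    scale = (1.0 - np.exp(-eta * lam * t)) / lam
    return V @ (scale * (V.T @ Y))

def f_t(x, X, c, s=1.0):
    """Kernel prediction k(x, X)^T c."""
    return float(rbf(x[None, :], X, s)[0] @ c)

def grad_f_t(x, X, c, s=1.0):
    """Input gradient of the kernel prediction (analytic for the RBF stand-in)."""
    k = rbf(x[None, :], X, s)[0]
    return ((X - x) * (k * c)[:, None]).sum(0) / s**2

def kernel_attack(x, y, X, c, eps, s=1.0):
    """One signed-gradient step that increases the loss of a +/-1 labelled point."""
    return x - y * eps * np.sign(grad_f_t(x, X, c, s))

# Toy training set: two small clusters mirrored through the origin.
X = np.array([[1.0, 1.0], [1.2, 0.8], [-1.0, -1.0], [-1.2, -0.8]])
Y = np.array([1.0, 1.0, -1.0, -1.0])
c = dual_coeffs(rbf(X, X), Y, eta=1.0, t=1000.0)

x, y, eps = np.array([0.3, 0.3]), 1, 0.5
x_adv = kernel_attack(x, y, X, c, eps)
print("clean prediction:", np.sign(f_t(x, X, c)))
print("attacked prediction:", np.sign(f_t(x_adv, X, c)))
```

Note that nothing here depends on trained network weights: the coefficients `c` are determined by the kernel, the training data and the training time alone.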
3.2 Transfer Results and Kernel Attacks
Predictions from NTK theory for infinitely wide neural networks have been used successfully for their large finite width counterparts, so it seems reasonable to conjecture that adversarial perturbations generated via the kernel as in Eq. (8) resemble those directly computed for the corresponding neural net as per Eq. (2). In particular, this would imply that adversarial perturbations derived from the NTK should not only fool the kernel machine itself, but also lead wide neural nets to misclassify.
While similar transfer results in different contexts have been observed indirectly, via the effects of the perturbation on metrics like accuracy [YuWu21, Ngu+21], we aim to look deeper and compare perturbations directly. High similarity would imply that any gradient based white-box attack on the neural net can be successfully mimicked by a “black-box” kernel derived attack.
Setting. To this end, we train multiple two-layer neural networks on image classification tasks extracted from MNIST and CIFAR10 and compare adversarial examples generated by Eqs. (2) (attacking the neural network) and (8) (attacking the kernel). The networks are trained with small learning rate and are sufficiently large, so they lie close to the NTK regime.

We track cosine similarity between the gradients of the loss from the NTK predictions and the gradients from the actual neural net as training evolves. Then, we generate adversarial perturbations from both the neural net and the kernel machine, and test whether those produced by the latter can fool the former. Full experimental details can be found in Appendix C.
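The alignment measurement described above reduces to comparing two gradient vectors of the same shape. A minimal sketch follows; the helper names and the additional sign-agreement statistic are our own illustrative choices (the latter is included because a signed-gradient attack only uses the sign pattern of the gradient).

```python
import numpy as np

def cosine_similarity(g1, g2):
    """Cosine of the angle between two gradients, flattened to vectors."""
    g1, g2 = np.ravel(g1), np.ravel(g2)
    return float(g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2)))

def sign_agreement(g1, g2):
    """Fraction of coordinates on which the two gradients share a sign."""
    return float(np.mean(np.sign(np.ravel(g1)) == np.sign(np.ravel(g2))))

# Hypothetical gradients for a 2x2 "image"
g_net = np.array([[0.5, -1.0], [2.0, 0.1]])
g_ntk = np.array([[0.4, -1.2], [1.9, -0.1]])
print(cosine_similarity(g_net, g_ntk), sign_agreement(g_net, g_ntk))
```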
Results. Our experiments confirm a very strong alignment of loss gradients from the neural nets and the NTK across the whole duration of training, as can be seen in Fig. 3 (top). Then, as expected, kernel-generated attacks produce a similar drop in accuracy throughout training as the networks’ own white-box attacks, including the final robust accuracy they reach (Fig. 3, bottom). We reproduce these plots for MNIST in Appendix C, leading to similar conclusions.

When concerned with security aspects of neural nets, adversarial attacks are mainly characterised as either white-box or black-box attacks [Pap+17]. White-box attacks assume full access to the neural network and in particular its weights; prominent examples include FGSM/PGD attacks. Black-box attacks, on the other hand, can only query the model to try to infer the loss gradient, either through training separate surrogate models [PMG16] or through carefully crafted input-output pairs fed to the target model [Che+17, Ily+18, And+20]. NTK theory and the experiments of this section suggest a threat model in which the attacker does not require access to the model or its weights, nor training of a substitute model. For fixed architecture and training data, all the information required for the computation of Eq. (8) is available at initialization, making the “NTK-attack” akin to a “training-free” substitution attack, and, at least in the kernel regime for wide nets considered here, as effective as white-box attacks.
4 NTK Eigenvectors Induce Robust and Non-Robust Features
This close connection between adversarial perturbations from the kernel and the corresponding neural net gives us the opportunity to bring to bear kernel tools on the study of adversarial robustness and its relation to features in a more direct fashion. Several recent works leverage properties of the NTK, and specifically its spectrum, to study aspects of approximation and generalization in neural networks [Aro+19a, Bas+19, BiMa19, Bas+20]. Here we show how the spectrum relates to robustness and helps to clarify the notion of robust/non-robust features.
We define features induced by the eigendecomposition of the Gram matrix $\Theta(X, X) = \sum_{i=1}^{n} \lambda_i v_i v_i^\top$. We will be most interested in the end of training, when the model has access to all the features it can extract from the training data $X$. As $t \to \infty$, Eq. (7) becomes $f_\infty(x) = \Theta(x, X)^\top \Theta^{-1}(X, X) Y$ and can be decomposed as $f_\infty(x) = \sum_{i=1}^{n} f^{(i)}(x)$, where
$$f^{(i)}(x) = \lambda_i^{-1} \, \Theta(x, X)^\top v_i v_i^\top Y. \tag{9}$$
Each $f^{(i)}$ can be seen as a unique feature captured from the (training) data. Note that these functions map the input to the output space, thus matching the definitions of Sec. 2.2. Also observe that all $f^{(i)}$’s jointly recover the original prediction of the model, while each one, intuitively, should contribute something different to it.

Importantly, these features induce a decomposition of the gradient of the loss into parts, each representing gradients of a unique feature as already advertised in Fig. 1. The binary case is particularly elegant as it gives rise to a linear decomposition of the gradient as
$$\nabla_x L(f_\infty(x), y) = \alpha \sum_{i=1}^{n} \nabla_x f^{(i)}(x) \tag{10}$$
for some $\alpha$ depending on $x$ and $y$ (see Appendix D). But if $f^{(i)}$’s are features, what do they look like?
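The spectral decomposition of the converged kernel predictor can be checked numerically. The sketch below assumes the end-of-training prediction is kernel regression, $k(x, X)^\top K^{-1} Y$, and that feature $i$ is the contribution of the $i$-th eigenpair of the Gram matrix $K$; a random positive definite matrix stands in for an NTK Gram matrix, and the function name `ntk_features` is hypothetical.

```python
import numpy as np

def ntk_features(k_xX, K, Y):
    """Per-eigenvector contributions to the kernel regression prediction.

    k_xX: (m, n) kernel values between m query inputs and n training inputs
    K:    (n, n) symmetric positive definite Gram matrix
    Y:    (n,)   training labels
    Returns an (m, n) array whose column i holds feature i evaluated at the
    queries, with eigenvalues sorted from largest to smallest.
    """
    lam, V = np.linalg.eigh(K)            # eigh returns ascending eigenvalues
    order = np.argsort(lam)[::-1]
    lam, V = lam[order], V[:, order]
    return (k_xX @ V) * (V.T @ Y) / lam

# Stand-in data (hypothetical): random PD Gram matrix and random kernel rows.
rng = np.random.default_rng(0)
n, m = 8, 3
A = rng.normal(size=(n, n))
K = A @ A.T + np.eye(n)
Y = rng.choice([-1.0, 1.0], size=n)
k_xX = rng.normal(size=(m, n))

F = ntk_features(k_xX, K, Y)
full = k_xX @ np.linalg.solve(K, Y)
print(np.allclose(F.sum(axis=1), full))   # the features sum to the full prediction
```

Keeping only a subset of the columns of `F` before summing gives a predictor that uses only the selected features, which is one way to probe how much each part of the spectrum contributes.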
Feature properties of common architectures:
With these definitions in place, we can now analyze the characteristics of features for commonly used architectures, leveraging their associated NTK. To be consistent with the previous section, we consider classification problems from MNIST (10 classes) and CIFAR10 (car vs airplane). We compose the Gram matrices from the whole training dataset (50000 and 10000, respectively), and compute the different feature functions $f^{(i)}$ using the eigendecomposition of the matrix. We estimate the usefulness of a feature by measuring its accuracy on a hold-out validation set, and its robustness by perturbing each input of this set, using an FGSM attack on feature $f^{(i)}$. We consider several different Fully Connected and Convolutional Kernels, whose expressions are available through the Neural Tangents library [Nov+20], built on top of JAX [Brad+18]. We summarize our findings on how these features behave:

Functions $f^{(i)}$ represent visually distinct features. We visualise each feature by plotting its gradient with respect to $x$. Fig. 2 shows the gradient of the first 5 features for various architectures for a specific image from the CIFAR10 dataset. We observe that features are fairly consistent across models, and they are interpretable: for example the 4th feature seems to represent the dominant color of an image, while the 5th one seems to be capturing horizontal edges.
Networks use both robust and non-robust features for prediction. It has been speculated that neural networks trained in a standard (non-adversarial) fashion rely on both robust and non-robust features. Our feature definition in Eq. (9) shows that this is indeed the case. The NTK of common neural networks consists of both robust features that match human expectations, such as the ones depicted in Fig. 2, and features that are predictive of the true label, while not being robust to adversarial perturbations of the input (Fig. 4). Fig. 2 depicts the first 100 features of a fully connected and a convolutional tangent kernel in Usefulness-Robustness space. The upper left region of the plots shows a large number of useful, yet non-robust features. These features seem random to human observers.
Robustness lies at the top. We observe in Fig. 2 that features corresponding to the top eigenvectors tend to be robust. This is consistent among different models and between the two datasets (see Appendix D). Since these eigenvectors are the ones fitted first during training [Aro+19a, JHG18], it is no wonder that the loss gradient evolves from coherence to noise, as observed in Fig. 6(b). This also explains the apparent tradeoff between robustness and accuracy of neural networks as training progresses: useful, robust features are fitted first, followed by useful, but non-robust ones. This ties in well with both empirical findings [Rah+19] and theoretical case studies [Bas+19, BiMa19, Bas+20] that demonstrate that low frequency functions are fitted first during training and provide favorable generalization properties; we would associate robust features with these low-frequency parts (in function space).

Robust features alone are not enough. In light of these findings, it might be reasonable to conjecture that we could obtain robust models by retaining the robust features of the prediction, while discarding the non-robust ones. The spectral approach gives a principled way to disentangle features and create kernel machines keeping only the robust ones. Our results show that in general it is not possible to obtain non-trivial performance without compromising robustness in this fashion, strengthening the case for the necessity of data augmentation in the form of adversarial training (see Appendix D.3).
5 Kernel Dynamics during Adversarial Training
Given the apparent necessity for adversarial training to produce robust models, how does it achieve this goal? To shed some light on this fundamental question, we depart from the “lazy” NTK regime and study the evolution of the NTK of adversarially trained models. For a neural network $f$ trained with gradient descent, as the learning rate $\eta \to 0$, the continuous time dynamics can be written as
$$\partial_t f(x; w_t) = -\Theta_t(x, X)^\top \, \nabla_f L(f(X; w_t), Y). \tag{11}$$
In the NTK regime, this kernel remains fixed at its initial value. However, outside this regime, it has been demonstrated, both empirically [Gei+19, Fort+20, Bar+21, Jim+21] and theoretically [ABP21], that $\Theta_t$ is not constant during training, and is changing as the weights move. In adversarial training, moreover, there is the additional effect that at each weight update, the data changes as well. For that reason, understanding the dynamics of adversarial training requires tracking the evolution of a kernel $\Theta_t(\tilde{X}_t, \tilde{X}_t)$, where $\tilde{X}_t$ denotes the current (mini) batch of training data. Notice that the tangent vector $\nabla_w f(\tilde{X}_t; w_t)$ is still describing the instantaneous change of $f$ on the current batch of data, thus is informative of the local geometry of the function space, justifying its value as a quantity to be measured during adversarial training.

We train a deep convolutional architecture on CIFAR10 (multiclass) with standard (SGD) and adversarial training using PGD with an $\ell_\infty$ constraint. Full implementation details and accuracy curves can be found in Appendix E, together with the reproduction of the same experiment on MNIST, where the observations are similar. We track the following quantities during training:
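Kernel distance. We compare two kernels using a scale invariant distance, which quantifies the relative rotation between them, as used in other works studying NTK dynamics (e.g. [Fort+20]):
$$d(K_1, K_2) = 1 - \frac{\mathrm{Tr}(K_1 K_2^\top)}{\sqrt{\mathrm{Tr}(K_1 K_1^\top)} \sqrt{\mathrm{Tr}(K_2 K_2^\top)}}. \tag{12}$$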
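A scale invariant kernel distance of this kind is short to implement. The sketch below assumes the form used in [Fort+20], one minus the Frobenius inner product of the two Gram matrices normalized by their Frobenius norms; the function name is illustrative.

```python
import numpy as np

def kernel_distance(K1, K2):
    """1 - <K1, K2>_F / (||K1||_F ||K2||_F): zero for kernels that differ
    only by a positive scale, larger the more they are rotated apart."""
    num = np.trace(K1 @ K2.T)
    den = np.sqrt(np.trace(K1 @ K1.T)) * np.sqrt(np.trace(K2 @ K2.T))
    return float(1.0 - num / den)

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
K = A @ A.T                      # a random PSD stand-in for a Gram matrix
print(kernel_distance(K, 3.0 * K))                          # rescaling alone: ~0
print(kernel_distance(np.diag([1.0, 0.0]), np.diag([0.0, 1.0])))  # orthogonal kernels
```

Evaluating this function on every pair of checkpoints of a training run yields the kind of distance heatmap used to compare standard and adversarial training.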
Polar dynamics. Zooming in on the change that the initial kernel undergoes, we define a polar space on which we measure the movement of the kernel:
(13) 
where are the initial and final kernel, respectively. Fig. 6 presents a heatmap of kernel distances at different time steps for both standard and adversarial training, as well as both training trajectories in polar space.
[Figure: panels for MNIST (left) and CIFAR10 (right). Darker colors indicate earlier epochs.]
Concentration on subspaces. To quantify weight concentration on the top region of the spectrum, we track the (normalized) Frobenius norm of subspaces as $\sqrt{\sum_{i=1}^{k} \lambda_i^2} \, / \, \sqrt{\sum_{i=1}^{n} \lambda_i^2}$, for various cutoffs $k$, where we have indexed the eigenvalues from largest to smallest. Fig. 5 depicts concentration on the top 20 eigenvalues during training.
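One way to compute such a concentration measure is sketched below. It assumes the quantity is the Frobenius norm of the kernel restricted to its top-$k$ eigendirections divided by the Frobenius norm of the full kernel, which for a symmetric matrix reduces to a ratio of root-sum-of-squares of eigenvalues; this reading and the function name are our assumptions.

```python
import numpy as np

def top_k_concentration(K, k):
    """Share of the Frobenius norm of a symmetric kernel K carried by its
    k largest eigenvalues (eigenvalues sorted from largest to smallest)."""
    lam = np.linalg.eigvalsh(K)[::-1]
    return float(np.sqrt(np.sum(lam[:k] ** 2)) / np.sqrt(np.sum(lam ** 2)))

K = np.diag([3.0, 4.0, 0.0])         # eigenvalues 4, 3, 0
print(top_k_concentration(K, 1))     # 4 / 5
print(top_k_concentration(K, 2))     # all of the norm
```

The value is non-decreasing in `k` and reaches one once all non-zero eigenvalues are included.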
Our findings show that similar to what has been reported in prior work [Fort+20], the kernel rotates significantly in the beginning of training and then slows down for both standard and adversarial training. However, in the latter case, this second phase begins a lot earlier. As Fig. 6 illuminates, the kernel moves a greater distance than when performing standard training, but after a few epochs it stops both rotating and expanding; note that this is not the case for standard training where the kernel increases its magnitude substantially later in training, and in fact grows to have a norm orders of magnitude larger than during adversarial training (see Fig. 5). In hindsight, this behavior is perhaps not surprising, as each element of the kernel measures similarity between data points, and a robust machine should be more conservative when estimating similarity. The observation that during adversarial training the kernel becomes relatively static relatively fast might indicate that linear dynamics govern the later phase of adversarial training. It has been observed in previous works [Gei+19, Fort+20, Jim+21] that linearization after a few initial epochs of rapid rotation often closely matches performance of full network training. Our results indicate that a similar phenomenon occurs even under the data shift of adversarial training (see Appendix E.1 for a study of linearized adversarial training), opening avenues to design robust machines more efficiently.
Moreover, endowed with the knowledge that, at least for kernels trained with static data, robust features lie at the top, we study polar dynamics of the top space only (see Fig. 14) and observe that there is substantial rotation in this space, suggesting that robust features are learned early on not only during standard, but in particular during adversarial training. Even more interestingly, Fig. 5 demonstrates that not only do the robust features change, but their relative weight, as measured by the concentration on the top-20 space, increases at the same time and remains large; in fact, significantly larger than during standard training. As each eigenvalue weights the importance of the corresponding feature on the final prediction, this implies that the kernel “learns” to depend more on the most robust features.

Put together, these findings reveal different kernel dynamics during standard and adversarial training: during adversarial training the kernel rotates much faster, expands much less and becomes “lazy” much earlier than during standard training. Fully understanding the properties of converged adversarial kernels remains an important avenue for future work, which might allow the design of faster algorithms for robust classification.
6 Final Remarks
We have studied adversarial robustness through the lens of the NTK across multiple architectures and data sets, both in the idealized NTK regime and the “rich feature” regime. When connecting the spectrum of the kernel with fundamental properties characterizing robustness, our phenomenological study reveals a universal picture of the emergence of robust and non-robust features and their role during training. There are certain limitations and unexplored themes in our work. Sec. 3 argues that transferable attacks from the NTK may be as effective as white-box attacks, but this warrants an in-depth study across architectures, kernels and data sets (which has not been the main focus of this work). Sec. 4 visualises features for fairly simple models, since the computation of kernel derivatives is a costly procedure. It would be interesting to use our framework to visualise features from more complicated architectures. Finally, our work in Sec. 5 invites more research on the kernel at the end of adversarial training, similar to what has been done for standard models [Long21].
We hope that our viewpoint can motivate further theoretical understanding of adversarial phenomena (such as transferability) and the design of better and/or faster adversarial learning algorithms, by further analyzing the kernels from robust deep neural networks.
Acknowledgements
The authors would like to thank Jingtong Su, Alberto Bietti, Yunzhen Feng, and Artem Vysogorets for fruitful discussions and feedback in various stages of this work. NT thanks Dimitris Tsipras for a helpful discussion in the beginning of this project. The authors would like to acknowledge support through the National Science Foundation under NSF Award 1922658. This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.
References
Checklist

For all authors…

Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

Did you describe the limitations of your work?

Did you discuss any potential negative societal impacts of your work? Our work sheds light on properties of adversarial examples, with the aim of making machine learning models more reliable in the long run.

Have you read the ethics review guidelines and ensured that your paper conforms to them?


If you are including theoretical results…

Did you state the full set of assumptions of all theoretical results?

Did you include complete proofs of all theoretical results?


If you ran experiments…

Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)?

Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)?

Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)?

Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)?


If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

If your work uses existing assets, did you cite the creators?

Did you mention the license of the assets?

Did you include any new assets either in the supplemental material or as a URL?

Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?


If you used crowdsourcing or conducted research with human subjects…

Did you include the full text of instructions given to participants and screenshots, if applicable?

Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix A Robust and Non-Robust Features
The idea that data features are to blame for the adversarial weakness of machine learning models was proposed in [Ily+19, Tsi+19]. In particular, [Ily+19] show that training with adversarially perturbed images labeled with the “wrong” label yields classifiers with non-trivial test performance (“learning from non-robust features only”), while, in a dual experiment, they demonstrate that standard training with “robustified” data (data that presumably are “denoised” from non-robust features) produces a classifier with non-trivial robust accuracy (“relies only on robust features”). Motivated by these observations, the authors propose a model of robust/non-robust features that are hidden in the data, and whose presence determines the eventual robustness of models. To accompany the definitions of Sec. 2.2, we extend them to multiclass classification, since Sec. 4 introduces our NTK feature framework for both binary and multiclass problems.
Let be the data generating distribution, with (input space) and (action space). We define features as functions from the input to the action space, and categorize them as follows, according to their performance as classifiers. Fix :

Useful feature: A feature is called useful if
(14) 
Robust feature: A feature is called robust if it is predictive of the true label under any perturbation inside a bounded “ball” , that is if
(15) 
Useful, nonrobust feature: A feature is called useful, nonrobust if it confers an advantage above guessing the most likely label, i.e. , but is robust only for (within some precision).
The above framework was introduced by [Ily+19, Tsi+19], and we have slightly adapted it in terms of the accuracy of classifiers derived from features. [Goh19] showed how such feature functions arise in a simple linear model, and proposed two mechanisms to construct useful, non-robust features. In [ZhLi20], the authors view the weights of neural networks as features, and show that adversarial training “purifies/robustifies” them.
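As a rough illustration of these definitions, the sketch below estimates usefulness (clean accuracy) and robustness (accuracy under the worst-case perturbation in a ball) for a hypothetical linear feature sign(w·x + b) with binary labels in {-1, +1}. A linear feature is chosen only because its worst-case perturbation inside an ℓ∞ ball of radius eps is known in closed form (each margin drops by eps·‖w‖₁). The toy data, the function names and the value of eps are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def usefulness(w, b, X, y):
    """Fraction of points that the linear feature sign(w.x + b) labels correctly."""
    margins = y * (X @ w + b)
    return float(np.mean(margins > 0))

def robustness(w, b, X, y, eps):
    """Accuracy under the worst-case l_inf perturbation of radius eps.

    For a linear feature, the worst case lowers every margin by eps * ||w||_1.
    """
    margins = y * (X @ w + b) - eps * np.sum(np.abs(w))
    return float(np.mean(margins > 0))

# Toy data: coordinate 0 carries the label strongly, coordinates 1..d-1 weakly.
rng = np.random.default_rng(0)
n, d = 2000, 20
y = rng.choice([-1.0, 1.0], size=n)
X = rng.normal(size=(n, d))
X[:, 0] += 3.0 * y             # one large-margin direction
X[:, 1:] += 0.3 * y[:, None]   # many small-margin directions

w_robust = np.zeros(d)
w_robust[0] = 1.0
w_nonrobust = np.ones(d)
w_nonrobust[0] = 0.0

eps = 0.5
for name, w in [("robust", w_robust), ("non-robust", w_nonrobust)]:
    print(name, usefulness(w, 0.0, X, y), robustness(w, 0.0, X, y, eps))
```

On this toy problem the second feature is useful in the clean setting but loses most of its accuracy under perturbation, which is the qualitative behaviour that the "useful, non-robust" category is meant to capture.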
Appendix B Derivation of Adversarial Perturbations for Kernel Regression
In this section, we derive expressions for adversarial attacks on Neural Tangent Kernels presented in the main paper, as well as additional derivations obtained from first-order expansions around the input.
B.1 Adversarial Perturbations from Cross-Entropy Loss
We first derive the expression in Eq. (8) of the paper. Let be an input to the NTK prediction
(16) 
where is a dataset of size . We consider the binary and the multiclass case separately.
In the binary case, where , we feed expression Eq. (16) to a sigmoid and maximize the cross entropy loss between the output and the true label:
(17) 
where we set to lie in . We compute the gradient of the loss with respect to :
(18) 
So the optimal one-step attack, under an adversary, reduces to computing the perturbation
(19) 
since for all .
In the case of a k-class classification problem with one-hot labels , we can express the cross-entropy loss between the NTK predictions of Eq. (16) and the labels as:
(20) 
where denotes the th output of Eq. (16). Computing the loss gradient as before yields the optimal perturbation ,
(21) 
The above calculations allow us to speed up the computation of the attacks in the case of NTKs with closed-form expressions, since the gradient
(22) 
with D being the Jacobian of with respect to , can be precomputed without the need for automatic differentiation tools. We leverage this in the experiments of Sec. 3.
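The binary-case recipe above can be sketched numerically as follows. This is a minimal illustration under stated assumptions: a Gaussian kernel stands in for the NTK (only because its input gradient has a simple closed form), labels lie in {0, 1} for the cross-entropy and in {-1, +1} for the regression targets, the adversary is assumed to be ℓ∞-bounded, and a small ridge term is added for numerical stability. The names `kernel`, `kernel_grad`, `fit`, `ce_loss` and `one_step_attack` are hypothetical.

```python
import numpy as np

GAMMA = 0.5  # bandwidth of the stand-in Gaussian kernel (assumption)

def kernel(x, X):
    """k(x, x_i) = exp(-GAMMA * ||x - x_i||^2) for every training point; shape (n,)."""
    return np.exp(-GAMMA * np.sum((X - x) ** 2, axis=1))

def kernel_grad(x, X):
    """Closed-form d k(x, x_i) / d x for every training point; shape (n, d)."""
    return -2.0 * GAMMA * (x - X) * kernel(x, X)[:, None]

def fit(X, y_pm, ridge=1e-2):
    """Kernel regression coefficients alpha = (K + ridge * I)^{-1} y."""
    K = np.stack([kernel(x, X) for x in X])
    return np.linalg.solve(K + ridge * np.eye(len(X)), y_pm)

def ce_loss(x, y01, X, alpha):
    """Binary cross-entropy between sigmoid(f(x)) and a label in {0, 1}."""
    f = kernel(x, X) @ alpha
    return np.logaddexp(0.0, f) - y01 * f

def one_step_attack(x, y01, X, alpha, eps):
    """Return x + eps * sign(grad_x loss), where grad_x loss = (sigmoid(f) - y) * grad_x f."""
    f = kernel(x, X) @ alpha
    grad_f = kernel_grad(x, X).T @ alpha   # closed form, no autodiff needed
    sigma = 1.0 / (1.0 + np.exp(-f))
    return x + eps * np.sign((sigma - y01) * grad_f)

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y01 = (X[:, 0] > 0).astype(float)
alpha = fit(X, 2.0 * y01 - 1.0)            # regress on +/-1 targets
x_adv = one_step_attack(X[0], y01[0], X, alpha, eps=0.05)
```

Because the factor (sigmoid(f) - y) is a scalar whose sign is fixed by the label, the attack direction depends on the data only through the sign of the kernel gradient, which is what makes precomputation possible.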
B.2 Alternative Approaches to Generate Perturbations
One can derive other perturbation variants by changing the loss function from cross-entropy to other functions studied in the literature in this context (e.g. [CaWa17]). Alternatively, we can study the output on a test input directly to devise strategies to most efficiently perturb it, using a Taylor expansion around the input, leading to a linear expression (shown here for scalar kernels):
(23) 
for some that depends on the training data and the NTK kernel only.
Binary case:
Suppose we would like to evaluate a model described by Eq. (7) at the end of training,
(24) 
on slightly perturbed variations of the original training data. Then, slightly abusing notation, we set , that is for all for small, but unknown, perturbations . By taking a first-order Taylor expansion in the perturbation, we can write the th element of as follows:
(25) 
For each row we obtain:
(26) 
Hence, can be written as for a perturbation matrix , with th row . Substituting into Eq. (24), we get:
(27) 
Thus, the output of the model on is:
(28) 
leading to the linear expression advertised in Eq. (23). The adversarial perturbation changes the output by , an expression which allows us to compute the adversarial perturbation to maximally change the output within the desired constraints on .
Since Eq. (24) describes regression models trained with a least-squares (LSE) loss, while adversarial examples are typically studied for classification models, we turn Eq. (24) into a classifier by thresholding (i.e. taking the sign of the output in binary classification tasks) or by outputting the class with the maximum prediction (in multiclass problems).
Inspecting Eq. (28), maximal “confusion” of the classification model is achieved by aligning with (directed towards the decision boundary). In case of the commonly used restriction, i.e. , the optimal adversarial perturbation is given by:
(29) 
The computation of this optimal adversarial perturbation requires an expression for the NTK and its gradient with respect to the training data. For models where an analytical expression of the NTK is available, only access to the labeled training data is necessary (as presented, for instance, in Sec. C). For more complicated models, or those that deviate from the assumptions behind Eq. (24), one can compute an empirical kernel by sampling kernels at initialization over a few instances, and obtain the matrices with automatic differentiation tools.
Eq. (28) has been derived for perturbations of the training data. Consider now the case when we evaluate Eq. (24) on perturbations of unseen test data, that is on . Then, Eq. (27) becomes:
(30) 
Again, solely the second term depends on the perturbation, so we proceed by choosing a maximally perturbing direction as before. The only difference lies in the matrix that now depends on the test set
(31) 
In practice, an adversary can calculate the NTK offline and calculate the optimal perturbation on a new test input by computing the corresponding row of the matrix . Importantly, no information on the test data labels is needed.
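For readers who want the binary-case argument in one place, the following block restates it compactly for a generic input. The notation is assumed for this sketch and may differ from the paper's: Θ denotes the NTK, (X, Y) the training inputs and labels, a(x) the coefficient of the linear term, ε an ℓ∞ budget, and s ∈ {-1, +1} the sign used to push the output towards the decision boundary (the label if it is known, or the sign of the clean prediction otherwise).

```latex
% Assumed notation: \Theta is the NTK, (X, Y) the training set, \epsilon the l_inf budget.
f(x) = \Theta(x, X)^\top \Theta(X, X)^{-1} Y
\quad\Longrightarrow\quad
f(x + \delta) \approx f(x) + \delta^\top a(x),
\qquad
a(x) = \left( \frac{\partial \Theta(x, X)}{\partial x} \right)^{\!\top} \Theta(X, X)^{-1} Y .

% The output moves fastest towards the decision boundary when s * delta^T a(x) is minimal:
\delta^\star
= \arg\min_{\|\delta\|_\infty \le \epsilon} \; s \, \delta^\top a(x)
= -\epsilon \, s \, \operatorname{sign}\!\big(a(x)\big),
\qquad
s \, {\delta^\star}^{\top} a(x) = -\epsilon \, \|a(x)\|_1 .
```

The last equality shows why the ℓ∞ case has a sign-based solution: the linear objective decouples across coordinates, so each coordinate of δ is pushed to the boundary of the budget.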
Multiclass case:
We adapt the derivations of the binary case to the setting where the output dimension is larger than one in the underlying regression setting (see below), resulting in a multiclass classifier. This leads to the multidimensional analogue of the linear Eq. (23) for , :
(32) 
Again, the can be computed from the NTK and its derivative as well as the training data labels. Exactly analogous considerations as in the binary case allow to adapt this expression to perturbations of the test data.
At this point we have a choice of how to adversarially perturb the classifier to achieve the largest effect on the network output. We present the two most obvious methods.
Max-of perturbation: Similar in spirit to traditional approaches in adversarial attacks [CaWa17] we choose such as to most efficiently decrease the correct response while maximally increasing one of the false responses . The solution is given by:
(33) 
It is obtained by solving
Then
(34) 
Sum-of perturbation: For one-hot vectors we could, instead, maximize the cross-entropy between the labels and the new outputs, thus choosing to produce a maximally mixed output. If is the correct label, this yields
(35) 
derived as follows
(36) 
Maximizing this cross entropy amounts to maximizing
For small perturbations we can expand the exponential to first order (the resulting expression for the maximum also holds when expanding to second order), which leads to finding the maximum of
yielding Eq. (35).
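The two strategies can be sketched on top of a linearized multiclass model f_j(x + δ) ≈ f_j(x) + a_j·δ. The code below is one possible reading under an assumed ℓ∞ budget, not the paper's exact expressions: `max_of_perturbation` selects the wrong class whose linearized margin over the correct class can grow the most, and `sum_of_perturbation` follows the gradient of the cross-entropy at δ = 0. The arrays `f` and `A` and both function names are hypothetical.

```python
import numpy as np

def max_of_perturbation(f, A, c, eps):
    """Pick the wrong class j whose linearized margin over class c can grow most.

    f: clean outputs, shape (k,); A: linear coefficients a_j as rows, shape (k, d).
    Returns (delta, j) with ||delta||_inf <= eps.
    """
    diffs = A - A[c]                                    # a_j - a_c for every j
    gains = f - f[c] + eps * np.abs(diffs).sum(axis=1)  # best reachable margin per class
    gains[c] = -np.inf                                  # exclude the correct class
    j = int(np.argmax(gains))
    return eps * np.sign(diffs[j]), j

def sum_of_perturbation(f, A, c, eps):
    """Maximize the first-order expansion of the cross-entropy around delta = 0."""
    p = np.exp(f - f.max())
    p /= p.sum()                                        # softmax of the clean outputs
    direction = p @ A - A[c]                            # gradient of the CE w.r.t. delta
    return eps * np.sign(direction)
```

The first variant concentrates the budget on a single competing class, while the second spreads it according to the current softmax weights, which tends to produce a more mixed output.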
Derivation of Eq. (32): While we remain with as in the binary case, the other quantities change as , and , i.e. for each data pair we have . Let denote the entry of that corresponds to the th and the th output of the model (evaluated at and ). Then, with similar reasoning that led to Eq. (25) we now obtain:
(37) 
For the prediction of the model on the whole dataset, we have:
(38) 
which for a given sample gives:
(39) 
where is equal to
(40) 
Appendix C Transfer Results for Wide Two-Layer Networks
In this section, we present additional experimental details for Sec. 3.2 and show the results of the experiments on MNIST. We train two-layer neural networks of the form
(41) 
where the first layer is initialized with the normal distribution, the second layer is frozen to its initial random values in , and denotes the width of the network. The NTK of this architecture is given by
(42) 
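To make the kernel of this architecture concrete, the sketch below compares a finite-width empirical NTK with the well-known closed form for a two-layer ReLU network whose first layer is trained and whose second layer is frozen. It assumes a ReLU activation, standard normal first-layer weights, second-layer weights in {-1, +1} and a 1/sqrt(m) output scaling; these choices and the function names are assumptions of the example and should be checked against Eqs. (41) and (42).

```python
import numpy as np

def two_layer_relu_ntk(x, z):
    """Infinite-width NTK of f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x),
    where w_r ~ N(0, I) is trained and a_r in {-1, +1} is frozen."""
    cos = np.clip(x @ z / (np.linalg.norm(x) * np.linalg.norm(z)), -1.0, 1.0)
    return (x @ z) * (np.pi - np.arccos(cos)) / (2.0 * np.pi)

def empirical_ntk(x, z, W, a):
    """Finite-width NTK: inner product of the output gradients w.r.t. the first layer."""
    m = W.shape[0]
    active = (W @ x > 0) & (W @ z > 0)   # units whose ReLU is on for both inputs
    return (x @ z) * np.sum(a[active] ** 2) / m
```

For a wide enough network the two quantities agree up to Monte Carlo error, since the fraction of units active on both inputs converges to (π - θ)/(2π), with θ the angle between the inputs.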
We choose this family of models in order to be consistent with early works that analyzed training and generalization properties of neural networks in the NTK regime [Aro+19a, Du+19a]. We perform experiments on image classification on MNIST and on a binary task extracted from CIFAR10 (car vs airplane). We train the networks in a regression fashion, minimizing the loss between the predictions and one-hot vectors, using full-batch gradient descent on the entire dataset (full training data for MNIST and 5K images for each of car and airplane in binary CIFAR). We keep the learning rate fixed to and vary the width of the network in . We train 3 networks for each dataset until convergence ( epochs), each initialized with a different random seed. When we measure quantities from the neural net, we subtract the initial prediction , since the NTK expression Eq. (16) does not take the initialization of the network into account. When attacking the models ( attacks), we use perturbation budget for MNIST and for CIFAR10. The experiments are performed with PyTorch [Pas+19].
For each model, we calculate the loss gradients with respect to the input during training, and compare them to those derived for the NTK in Eqs. (18) and (21) for the binary and the multiclass task, respectively, using cosine similarity:
(43) 
where is the NTK prediction defined in Eq. (16), denotes the output of the neural net and is the initial prediction of the neural net (prior to training). In order to match the timescales, we manually align the networks on epoch = with a timepoint for the NTK and, based on this number, we match the rest of the epochs assuming linear dependence (as theory predicts, see Eq. (16)). Fig. 6(a) shows cosine similarity of loss gradients and robust accuracy of the network (evaluated against its own adversarial examples, and those from the NTK) for MNIST. Fig. 6(b) illustrates the similarity of loss gradients of neural nets and their NTKs for 3 different epochs.
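The comparison metric itself is simple to compute once both sets of input gradients are available. The helper below is a minimal sketch that assumes each gradient has been flattened into one row per example and that the per-example similarities are averaged; whether the paper averages in exactly this way is an assumption, and the function name is hypothetical.

```python
import numpy as np

def mean_cosine_similarity(G_net, G_ntk, tiny=1e-12):
    """Average cosine similarity between matching rows of two gradient arrays.

    G_net, G_ntk: arrays of shape (n, d), one flattened input gradient per example.
    """
    num = np.sum(G_net * G_ntk, axis=1)
    den = np.linalg.norm(G_net, axis=1) * np.linalg.norm(G_ntk, axis=1)
    return float(np.mean(num / np.maximum(den, tiny)))  # tiny guards against zero gradients
```

Since one-step ℓ∞ attacks depend only on the sign pattern of the gradient, a high cosine similarity is a reasonable proxy for how well attacks computed from one model transfer to the other.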
Notice the very small discrepancy between the loss gradients of different networks (initialized with different random seeds) in Fig. 6(a). They are all centered around the loss gradient of the NTK, a manifestation of transferability of adversarial examples, at least for models with the same architecture. The NTK framework might possibly provide a wider explanation of this phenomenon, also across architectures. For instance, for fully connected kernels, the NTK expression for kernels of depth is a relatively simple function of expressions for depth [JHG18, BiMa19] which could explain transferability across architectures of varying depth.
Appendix D NTK Features: Additional Details
In this section, we present additional material for Sec. 4; we show derivations that are missing from the main text, and complement the plots by showing the same information for more architectures and datasets.
D.1 Loss Gradient Decomposition
First, recall our definitions of features from Sec. 4. Let be a dataset, where and (binary classification). Then, kernel regression on this dataset gives predictions of the form . Given the eigendecomposition of the Gram matrix , we can decompose the prediction as follows
(44) 
where . Notably, this means that the gradient of the cross entropy loss can be also understood as a composition of gradients coming from these features, as the following proposition shows.
Proposition 1.
The loss gradient of can be decomposed as follows:
(45) 
where is a quantity that depends on .
Proof.
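The decomposition of the prediction into per-eigenvector features can be checked numerically. The sketch below assumes the standard kernel regression predictor K(x, X) K(X, X)^{-1} y and defines feature i as the contribution of the i-th eigenpair of the Gram matrix; the function and argument names are hypothetical and the notation may differ from Eq. (44).

```python
import numpy as np

def ntk_features(K_test, K_train, y):
    """Split kernel regression predictions into one term per Gram-matrix eigenvector.

    K_test: kernel between test and training points, shape (t, n).
    K_train: Gram matrix, shape (n, n), symmetric positive definite.
    Returns an array of shape (n, t); row i is feature i evaluated on the test points.
    """
    lam, V = np.linalg.eigh(K_train)           # K_train = V diag(lam) V^T
    coeffs = (V.T @ y) / lam                   # lam_i^{-1} * v_i^T y
    return coeffs[:, None] * (K_test @ V).T    # feature i: (K_test v_i) * lam_i^{-1} * v_i^T y
```

Summing the rows of the returned array recovers the full prediction, so gradients with respect to the input decompose across features in the same way.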
D.2 Additional Plots
Complementing Fig. 2 in the main text, we show (the first 100) NTK features in the Robustness-Usefulness space defined in Sec. 4 for a larger number of architectures, for both MNIST and CIFAR, in Figs. 8 and 9. We use available analytical NTK expressions for standard FC{1,2,3,4,5} and CONV{1,2} architectures in the NTK regime to evaluate and decompose kernels on a subset of 10K MNIST training images and 10K binary CIFAR images (5K cars and 5K airplanes). We note that within a dataset, the plots do not change much between architectures, speaking to the universal nature of these kernel-induced features.
D.3 Robust Features Alone are not Enough
Feature definitions outlined in Sec. 4 open an avenue to use traditional feature selection methods to search for robust models. In particular, here we rank the features of an NTK based on their robustness on a validation set (accuracy against adversarial examples computed from the same feature; setting: FGSM with for MNIST or for CIFAR10). Specifically, we test and rank each “one-feature kernel” function . Given this ranking, we construct a sequence of new kernels by progressively aggregating the most robust features with their original eigenvalues. This gives rise to kernel machines of the form , where indicates the number of top robust features kept. We present the results of this approach in Figures 10 (MNIST) and 11 (CIFAR10), where we plot clean accuracy as well as robust accuracy against perturbations from the kernel itself and against “transfer” perturbations from the original (full) kernel.
On the binary classification task, some robustness can be gained by keeping the most robust features, and there seems to exist a sweet spot where robustness is maximized (this seems to be consistent across other models as well). On multiclass MNIST, however, despite the relative simplicity of the dataset, we are not able to obtain non-trivial performance without compromising robustness. We conclude that it is unlikely that robust features (of standard models) alone are sufficient for robust classification, and some form of data augmentation, such as adversarial training, seems necessary, at least for the models considered in our experiments.
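The aggregation step described above can be sketched as follows, assuming that a robustness score per eigen-feature has already been measured on a validation set (that measurement, FGSM against each one-feature kernel, is not shown). The predictor keeps only the selected eigenpairs with their original eigenvalues; the function name and the `scores` argument are hypothetical.

```python
import numpy as np

def top_k_predictions(K_test, K_train, y, scores, k):
    """Kernel regression predictions that keep only the k highest-scoring eigen-features.

    scores: one robustness score per eigenvector, in the order returned by np.linalg.eigh.
    """
    lam, V = np.linalg.eigh(K_train)
    keep = np.argsort(scores)[::-1][:k]        # indices of the k most robust features
    Vk, lamk = V[:, keep], lam[keep]
    return (K_test @ Vk) @ ((Vk.T @ y) / lamk)
```

Sweeping k from 1 to the number of training points traces out the accuracy curves described in the text, with the full kernel machine recovered at the largest k.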
Appendix E Experimental Details for the Kernel Dynamics Section
Here we provide the details of our experiments in Sec. 5, where we compare standard and adversarial training by tracking several kernel quantities.
For experiments with MNIST, we use a simple convolutional architecture with 3 layers. The first 2 layers compute a convolution (with a 3×3 kernel), followed by a ReLU and then by an average pooling layer (of kernel size 2×2 and stride 2). The 3rd layer is fully connected with a ReLU nonlinearity, followed by a linear prediction layer with 10 outputs. The layers have width 32, 64 and 256, respectively.
For CIFAR10, we use a deeper architecture consisting of 6 layers. Layers 1 and 2, 3 and 4, 5 and 6 are fully convolutional with 32, 64 and 128 channels, respectively, and a kernel of size 3×3. There is a max pooling operation after layer 2, and average pooling after the final layer, followed by a linear prediction layer. Both pooling operations use a kernel of size 2×2 and stride 2.
We use a fixed learning rate of for all experiments and no weight decay. We do not use any data augmentation, since we are interested in analyzing the behavior of kernels, rather than obtaining the best possible results. Stochastic gradient descent is used in all cases, with a batch size of 300 for MNIST and 250 for CIFAR10. The kernel quantities are tracked for the same (first) batch during training. For adversarial training, we used either FGSM or PGD (for generating the adversarial examples) with 20 steps against adversaries. The maximum perturbation size is set to and (for MNIST and CIFAR10, respectively), and in the case of PGD training we use an attack step size of and , respectively. Experiments were run with JAX [Brad+18], and empirical NTKs were computed using the Neural Tangents library [Nov+20]. Neural nets were trained using Flax [Hee+20] and the JaxOpt library [Blo+21], adapting code available from the JaxOpt repository. This code snippet was licensed under the Apache License, Version 2.0.
Models were trained for 200 epochs. Fig. 12 summarizes the performance of the networks during training. In Fig. 13, we show how norm concentration evolves during training, similar to the plots for CIFAR10 in Fig. 5, but for MNIST and for two choices of eigenvalue index cutoff.
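For reference, the attack used to generate adversarial examples during adversarial training can be sketched generically. The function below is a plain NumPy version of projected gradient ascent inside an ℓ∞ ball, written against a user-supplied gradient function. It omits details that a real training pipeline would typically include (random starts, clipping to the valid pixel range), and the budget, step size and names are placeholders, not the values used in the experiments.

```python
import numpy as np

def pgd_linf(x0, grad_fn, eps, step, n_steps=20):
    """Projected gradient ascent on a loss inside the l_inf ball of radius eps around x0.

    grad_fn(x) must return the gradient of the loss with respect to the input x.
    With n_steps=1 and step=eps this reduces to FGSM.
    """
    x = x0.copy()
    for _ in range(n_steps):
        x = x + step * np.sign(grad_fn(x))    # signed ascent step
        x = np.clip(x, x0 - eps, x0 + eps)    # project back onto the ball
    return x
```

In adversarial training, each minibatch would be replaced by the output of such an attack before the usual parameter update.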
Fig. 14 shows the polar dynamics for the top space (top 20 eigenvalues) of the kernel. We observe little to no change for adversarial training from Fig. 6 in the main text that showed the same information for the entire space, though for standard training there is less rotation in the top space. We take this as an indication that adversarial training modifies the “robust” (top) features of the kernel more than standard training.
Finally, Fig. 15 shows the values within the kernel matrices before and after training for MNIST for standard and adversarial training. We draw the same conclusions as the main text, namely the “standard” kernel has significantly larger values than the “adversarial” one.
E.1 Linearized Adversarial Training
Motivated by the apparent laziness of the kernel during adversarial training and the findings of prior works [Gei+19, Fort+20] that considered linearization (with respect to the parameters) of the model after some epochs, we do the same for adversarial training.
We include a small study that linearizes the kernel after a certain number of epochs. In particular, Fig. 16 shows the training behavior after linearizing the CIFAR10 model after 25 and 50 epochs, and also at initialization. After linearization, we continue adversarial training in this simple linearized model (meaning we generate adversarial examples from the linear model). We observe that adversarial training continues, without a collapse of the training method. In comparison to non-linearized training (Fig. 12), training seems to stagnate. We also observe that the earlier we linearize, the greater the gap is between standard and robust performance. We leave the investigation of this intriguing phenomenon and a detailed comparison to standard training to future work.
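Linearization with respect to the parameters means replacing the network by its first-order Taylor expansion around the parameters reached at the chosen epoch. The sketch below illustrates this on a tiny two-layer tanh network with a hand-written Jacobian. The model and all names are stand-ins chosen so that the example runs with NumPy alone; they are not the architecture or the tooling used in the experiments.

```python
import numpy as np

def net(params, x):
    """Tiny two-layer tanh network with scalar output (a stand-in model)."""
    W, v = params
    return v @ np.tanh(W @ x)

def net_jacobian(params, x):
    """Gradients of the output with respect to W and v at the given parameters."""
    W, v = params
    h = np.tanh(W @ x)
    dW = np.outer(v * (1.0 - h ** 2), x)   # d f / d W
    dv = h                                 # d f / d v
    return dW, dv

def linearized_net(params, params0, x):
    """First-order Taylor expansion of net around params0, in parameter space."""
    dW, dv = net_jacobian(params0, x)
    W, v = params
    W0, v0 = params0
    return net(params0, x) + np.sum(dW * (W - W0)) + dv @ (v - v0)
```

The linearized model is linear in the parameters but still nonlinear in the input x, so adversarial examples can be generated from it by differentiating with respect to x in the usual way.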