What Can the Neural Tangent Kernel Tell Us About Adversarial Robustness?

The adversarial vulnerability of neural nets, and subsequent techniques to create robust models, have attracted significant attention; yet we still lack a full understanding of this phenomenon. Here, we study adversarial examples of trained neural networks through analytical tools afforded by recent theory advances connecting neural networks and kernel methods, namely the Neural Tangent Kernel (NTK), following a growing body of work that leverages the NTK approximation to successfully analyze important deep learning phenomena and design algorithms for new applications. We show how NTKs allow us to generate adversarial examples in a “training-free” fashion, and demonstrate that they transfer to fool their finite-width neural net counterparts in the “lazy” regime. We leverage this connection to provide an alternative view on robust and non-robust features, which have been suggested to underlie the adversarial brittleness of neural nets. Specifically, we define and study features induced by the eigendecomposition of the kernel to better understand the role of robust and non-robust features, the reliance on both for standard classification and the robustness-accuracy trade-off. We find that such features are surprisingly consistent across architectures, and that robust features tend to correspond to the largest eigenvalues of the model, and thus are learned early during training. Our framework allows us to identify and visualize non-robust yet useful features. Finally, we shed light on the robustness mechanism underlying adversarial training of neural nets used in practice: quantifying the evolution of the associated empirical NTK, we demonstrate that its dynamics falls much earlier into the “lazy” regime and manifests a much stronger form of the well-known bias to prioritize learning features within the top eigenspaces of the kernel, compared to standard training.



1 Introduction

Despite the tremendous success of deep neural networks in many computer vision and language modeling tasks, as well as in scientific discoveries, their properties and the reasons for their success are still poorly understood. Focusing on computer vision, a particularly surprising phenomenon evidencing that those machines drift away from how humans perform image recognition is the presence of adversarial examples, images that are almost identical to the original ones, yet are misclassified by otherwise accurate models.

Since their discovery [Sze+14], a vast amount of work has been devoted to understanding the sources of adversarial examples; explanations include, but are not limited to, the close to linear operating mode of neural nets [GSS15], the curse of dimensionality carried by the input space [GSS15, Gab+19], insufficient model capacity [Tsi+19, Nakk19] or spurious correlations found in common datasets [Ily+19]. In particular, one widespread viewpoint is that adversarial vulnerability is the result of a model’s sensitivity to imperceptible yet well-generalizing features in the data, so-called useful non-robust features, giving rise to a trade-off between accuracy and robustness [Tsi+19, Zha+19]. This gradual understanding has enabled the design of training algorithms that provide convincing, yet partial, remedies to the problem; the most prominent of them being adversarial training and its many variants [GSS15, Mad+18, robustbench20]. Yet we are far from a mature, unified theory of robustness that is powerful enough to universally guide engineering choices or defense mechanisms.

In this work, we aim to get a deeper understanding of adversarial robustness (or lack thereof) by focusing on the recently established connection of neural networks with kernel machines. Infinitely wide neural networks, trained via gradient descent with infinitesimal learning rate, provably become kernel machines with a data-independent, but architecture-dependent kernel - its Neural Tangent Kernel (NTK) - that remains constant during training [JHG18, Lee+19, Aro+19b, Liu+20]. The analytical tools afforded by the rich theory of kernels have resulted in progress in understanding the optimization landscape and generalization capabilities of neural networks [Du+19b, Aro+19a], together with the discovery of interesting deep learning phenomena [Fort+20, Jim+21], while also inspiring practical advances in diverse areas of applications such as the design of better classifiers [Sha+20], efficient neural architecture search [Chen+21], low-dimensional tasks in graphics [Tan+20] and dataset distillation [Ngu+21]. While the NTK approximation is increasingly utilized, even for finite width neural nets, little is known about the adversarial robustness properties of these infinitely wide models.

Our contribution: Our work inscribes itself into the quest to leverage analytical tools afforded by kernel methods, in particular spectral analysis, to track properties of interest in the associated neural nets, in this case as they pertain to robustness. To this end, we first demonstrate that adversarial perturbations generated analytically with the NTK can successfully lead the associated trained wide neural networks (in the kernel-regime) to misclassify, thus allowing kernels to faithfully predict the lack of robustness of those trained neural networks. In other words, adversarial (non-) robustness transfers from kernels to networks; and adversarial perturbations generated via kernels resemble those generated by the corresponding trained networks. One implication of this transferability is that we can analytically devise adversarial examples that do not require access to the trained model and in particular its weights; instead these “blind spots” may be calculated a-priori, before training starts.

Figure 1: Top. Standard setup of an adversarial attack, where a barely perceivable perturbation is added to an image to confuse an accurate classifier. Bottom. The correspondence between neural networks and kernel machines allows us to visualize a decomposition of this perturbation, each part attributed to a different feature of the model. The first few features tend to be robust.

A perhaps even more crucial implication of the NTK approach to robustness relates to the understanding of adversarial examples. Indeed, we show how the spectrum of the NTK provides an alternative way to define features of the model, to classify them according to their robustness and usefulness for correct predictions and visually inspect them via their contribution to the adversarial perturbation (see Fig. 1). This in turn allows us to verify previously conjectured properties of standard classifiers; dependence on both robust and non-robust features in the data [Tsi+19], and tradeoff of accuracy and robustness during training. In particular we observe that features tend to be rather invariable across architectures, and that robust features tend to correspond to the top of the eigenspectrum (see Fig. 2), and as such are learned first by the corresponding wide nets [Aro+19a, JHG18]. Moreover, we are able to visualize useful non-robust features of standard models (Fig. 4). While this conceptual feature distinction has been highly influential in recent works that study the robustness of deep neural networks (see for example [ZhLi20, KLR21, SMK21]), to the best of our knowledge, none of them has explicitly demonstrated the dependence of networks on such feature functions (except for simple linear models [Goh19]). Rather, these works either reveal such features in some indirect fashion, or accept their existence as an assumption. Here, we show that Neural Tangent Kernel theory endows us with a natural definition of features through its eigen-decomposition and provides a way to visualise and inspect robust and non-robust features directly on the function space of trained neural networks.

Figure 2: Left: Top 5 features for 7 different kernel architectures for a car image extracted from the CIFAR10 dataset when trained on car and plane images. Right: Features according to their robustness (x-axis) and usefulness (y-axis). Larger/darker bullets correspond to larger eigenvalues. Useful features have -usefulness; shaded boxes are meant to help visualize useful-robust regions.

Interestingly, this connection also enables us to empirically demonstrate that robust features of standard models alone are not enough for robust classification. Aiming to understand, then, what makes robust models robust, we track the evolution of the data-dependent empirical NTK during adversarial training of neural networks used in practice. Prior experimental work has found that networks with non-trivial width to depth ratio which are trained with large learning rates, depart from the NTK regime and fall in the so-called “rich feature” regime, where the NTK changes substantially during training [Gei+19, Fort+20, Bar+21, Jim+21]. In our work, which to the best of our knowledge is the first to provide insights on how the kernel behaves during adversarial training, we find that the NTK evolves much faster compared to standard training, simultaneously both changing its features and assigning more importance to the more robust ones, giving direct insight into the mechanism at play during adversarial training (see Fig. 6). In summary, the contributions of our work are the following:

  • We discuss how to generate adversarial examples for infinitely-wide neural networks via the NTK, and show that they transfer to fool their associated (finite width) nets in the appropriate regime, yielding a "training-free" attack without need to access model weights (Sec. 3).

  • Using the spectrum of the NTK, we give an alternative definition of features, providing a natural decomposition of perturbations into robust and non-robust parts [Tsi+19, Ily+19] (Fig. 1). We confirm that robust features overwhelmingly correspond to the top part of the eigenspectrum; hence they are learned early on in training. We bolster previously conjectured hypotheses that prediction relies on both robust and non-robust features and that robustness is traded for accuracy during standard training. Further, we show that only utilizing the robust features of standard models is not sufficient for robust classification (Sec. 4).

  • We turn to finite-width neural nets with standard parameters to study the dynamics of their empirical NTK during adversarial training. We show that the kernel rotates in a way that both enables new (robust) feature learning and drastically increases the importance (relative weight) of the robust features over the non-robust ones. We further highlight the structural differences of the kernel change during adversarial training versus standard training and observe that the kernel seems to enter the “lazy” regime much faster (Sec. 5).

Collectively, our findings may help explain many phenomena present in the adversarial ML literature and further elucidate both the vulnerability of standard models and the robustness of adversarially trained ones. We provide code to visualize features induced by kernels, giving a unique and principled way to inspect features induced by standardly trained nets.

Related work: To the best of our knowledge, the only prior work that leverages NTK theory to derive perturbations in some adversarial setting is due to YuWu21, yet with an entirely different focus. It deals with what is coined generalization attacks: the process of altering the training data distribution to prevent models from generalising on clean data. Bai+21 study aspects of robust models through their linearized sub-networks, but do not leverage NTKs.

2 Preliminaries

We introduce background material and definitions important to our analysis. Here, we restrict ourselves to binary classification, to keep notation light. We defer the multiclass case, complete definitions and a more detailed discussion of prior work to the Appendix.

2.1 Adversarial Examples

Let $f$ be a classifier, $x$ be an input (e.g. a natural image) and $y$ its label (e.g. the image class). Then, given that $f$ is an accurate classifier on $x$, $\tilde{x}$ is an adversarial example [Sze+14] for $f$ if

  1. the distance $d(x, \tilde{x})$ is small. Common choices in computer vision are the $\ell_p$ norms, especially the $\ell_\infty$ norm on which we focus henceforth, and

  2. $f(\tilde{x}) \neq y$. That is, the perturbed input is being misclassified.

Given a loss function $L$, such as cross-entropy, one can construct an adversarial example $\tilde{x} = x + \delta$ by finding the perturbation $\delta$ that produces the maximal increase of the loss, solving

$$\delta = \arg\max_{\|\delta'\|_\infty \le \epsilon} L(f(x + \delta'), y), \tag{1}$$

for some $\epsilon > 0$ that quantifies the dissimilarity between the two examples. In general, this is a non-convex problem and one can resort to first order methods [GSS15]

$$\tilde{x} = x + \epsilon \cdot \mathrm{sign}\left(\nabla_x L(f(x), y)\right), \tag{2}$$

or iterative versions for solving it [KGB17, Mad+18]. The former method is usually called Fast Gradient Sign Method (FGSM) and the latter Projected Gradient Descent (PGD). These methods are able to produce examples that are misclassified by common neural networks with a probability that approaches 1. Even more surprisingly, it has been observed that adversarial examples crafted to “fool” one machine learning model are consistently capable of “fooling” others [PMG16, Pap+17], a phenomenon that is known as the transferability of adversarial examples. Finally, adversarial training refers to the alteration of the training procedure to include adversarial samples for teaching the model to be robust [GSS15, Mad+18] and empirically holds as the strongest defense against adversarial examples [Mad+18, Zha+19].
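As a minimal sketch of the two attacks named above, the following NumPy snippet applies FGSM and PGD to a toy linear model with a logistic loss. The model, the label convention $y \in \{-1, +1\}$, and all function names are assumptions made for illustration; the experiments in this paper use neural networks and kernels instead.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad_x(w, b, x, y):
    """Gradient w.r.t. x of the logistic loss log(1 + exp(-y * f(x)))
    for the toy linear model f(x) = w @ x + b and a label y in {-1, +1}."""
    margin = y * (w @ x + b)
    return -y * sigmoid(-margin) * w

def fgsm(w, b, x, y, eps):
    """One signed-gradient step of size eps (stays in the l_inf ball around x)."""
    return x + eps * np.sign(loss_grad_x(w, b, x, y))

def pgd(w, b, x, y, eps, step, n_steps):
    """Repeated signed-gradient steps, each projected back onto the l_inf ball."""
    x_adv = x.copy()
    for _ in range(n_steps):
        x_adv = x_adv + step * np.sign(loss_grad_x(w, b, x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # projection onto the ball
    return x_adv

rng = np.random.default_rng(0)
w, b = rng.normal(size=5), 0.1
x, y = rng.normal(size=5), 1.0

x_fgsm = fgsm(w, b, x, y, eps=0.1)
x_pgd = pgd(w, b, x, y, eps=0.1, step=0.025, n_steps=10)
print("clean margin:", y * (w @ x + b))
print("FGSM margin: ", y * (w @ x_fgsm + b))
print("PGD margin:  ", y * (w @ x_pgd + b))
```

For a linear model the loss gradient has the same sign pattern everywhere, so PGD ends at the same corner of the ball as FGSM; the two attacks only differ for non-linear models.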

2.2 Robust and Non-Robust Features

Despite a vast amount of research, the reasons behind the existence of adversarial examples are not perfectly clear. A line of work has argued that a central reason is the presence of robust and non-robust features in the data that standard models learn to rely upon [Tsi+19, Ily+19]. In particular it is conjectured that reliance on useful but non-robust features during training is responsible for the brittleness of neural nets. Here, we slightly adapt the feature definitions of [Ily+19] and extend them to multi-class problems (see Appendix A). (Footnote 1: We distinguish useful and robust features based on their accuracy as classifiers, not in terms of correlation with the labels as in Ily+19, allowing a natural extension to the multi-class setting. For robustness, we consider any accuracy bounded away from zero as robust, quantifying that an adversary cannot drive accuracy to zero entirely.)

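Let $\mathcal{D}$ be the data generating distribution with $x \in \mathcal{X}$ and $y \in \{\pm 1\}$. We define a feature as a function $\phi: \mathcal{X} \to \mathbb{R}$ and distinguish how they perform as classifiers. Fix $\rho, \gamma \in [0, 1]$:

  1. $\rho$-Useful feature: A feature $\phi$ is called $\rho$-useful if

  $$\mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\mathbb{1}\{\mathrm{sign}[\phi(x)] = y\}\right] = \rho. \tag{3}$$

  2. $\gamma$-Robust feature: A feature $\phi$ is called $\gamma$-robust if it remains useful under any perturbation inside a bounded “ball” $\mathcal{B}$, that is if

  $$\mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\inf_{\delta \in \mathcal{B}} \mathbb{1}\{\mathrm{sign}[\phi(x + \delta)] = y\}\right] = \gamma. \tag{4}$$

In general, a feature adds predictive value if it gives an advantage above guessing the most likely label, i.e. $\rho > \max_{y'} \mathbb{P}_{\mathcal{D}}[y = y']$, and we will speak of “useful” features in this case, omitting the $\rho$. We will call such a feature useful, non-robust if it is useful, but $\gamma$-robust only for $\gamma = 0$ or very close to $0$, depending on context.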
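To make the two definitions concrete, here is a small sketch that estimates usefulness and robustness as accuracies for linear features on synthetic data. Everything in it is an illustrative assumption: labels are taken in $\{-1, +1\}$, the perturbation set is an $\ell_\infty$ ball of radius `eps`, and the features are linear (for which the worst-case perturbation has a closed form), whereas the features studied later in the paper are induced by the NTK.

```python
import numpy as np

def usefulness(w, X, y):
    """Fraction of points with sign(phi(x)) == y, for the feature phi(x) = w @ x."""
    return float(np.mean(np.sign(X @ w) == y))

def robustness(w, X, y, eps):
    """Fraction of points still classified correctly under the worst l_inf
    perturbation of size eps. For a linear feature the worst case is exact:
    the margin y * (w @ x) drops by eps * ||w||_1."""
    worst_margin = y * (X @ w) - eps * np.sum(np.abs(w))
    return float(np.mean(worst_margin > 0))

# Synthetic data: one coordinate carries a strong signal, the remaining
# coordinates each carry a weak signal (hypothetical setup).
rng = np.random.default_rng(0)
n, d = 2000, 20
y = rng.choice([-1.0, 1.0], size=n)
X = rng.normal(size=(n, d))
X[:, 0] = 2.0 * y + 0.5 * rng.normal(size=n)
X[:, 1:] += 0.1 * y[:, None]

w_robust = np.zeros(d)
w_robust[0] = 1.0            # reads only the strong coordinate
w_fragile = np.ones(d)
w_fragile[0] = 0.0           # aggregates many weak coordinates

for name, w in [("robust", w_robust), ("fragile", w_fragile)]:
    print(name, usefulness(w, X, y), robustness(w, X, y, eps=0.5))
```

In this toy setup both features beat chance on clean data, but the feature that aggregates many weakly informative coordinates should lose most of its accuracy under perturbation, which is the "useful, non-robust" pattern described above.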

The vast majority of works imagines features as being induced by the activations of neurons in the net, most commonly those of the penultimate layer (representation-layer features), but the previous formal definitions are in no way restricted to activations, and we will show how to exploit them using the eigenspectrum of the NTK. In particular, in Sec. 4, we demonstrate that the above framework agrees perfectly with features induced by the eigenspectrum of the NTK of a network, providing a natural way to decompose the predictions of the NTK into such feature functions. In particular we can identify robust, useful, and, indeed, useful non-robust features.

2.3 Neural Tangent Kernel

Let $f(\cdot\,; w)$ be a (scalar) neural network with a linear final layer parameterized by a set of weights $w \in \mathbb{R}^p$ and $\{X, Y\}$ be a dataset of size $n$, with $X \in \mathbb{R}^{n \times d}$ and $Y \in \{\pm 1\}^n$. Linearized training methods study the first order approximation

$$f(x; w_{t+1}) \approx f(x; w_t) + \nabla_w f(x; w_t)^\top (w_{t+1} - w_t). \tag{5}$$

The network gradient $\nabla_w f(x; w_t)$ induces a kernel function $\Theta_t$, usually referred to as the Neural Tangent Kernel (NTK) of the model

$$\Theta_t(x, x') = \nabla_w f(x; w_t)^\top \nabla_w f(x'; w_t). \tag{6}$$

This kernel describes the dynamics with infinitesimal learning rate (gradient flow). In general, the tangent space spanned by the $\nabla_w f(x_i; w_t)$ twists substantially during training, and learning with the Gram matrix of Eq. (6) (empirical NTK) corresponds to training along an intermediate tangent plane. Remarkably, however, in the infinite width limit with appropriate initialization and low learning rate, it has been shown that $f$ becomes a linear function of the parameters [JHG18, Liu+20], and the NTK remains constant ($\Theta_t = \Theta_0 =: \Theta$). Then, for learning with $\ell_2$ loss the training dynamics of infinitely wide networks admits a closed form solution corresponding to kernel regression [JHG18, Lee+19, Aro+19b]

$$f_t(x) = \Theta(x, X)^\top \Theta^{-1}(X, X) \left(I - e^{-\eta \Theta(X, X) t}\right) Y, \tag{7}$$

where $x$ is any input (training or testing), $t$ denotes the time evolution of gradient descent, $\eta$ is the (small) learning rate and, slightly abusing notation, $\Theta(X, X) \in \mathbb{R}^{n \times n}$ denotes the matrix containing the pairwise training values of the NTK, $\Theta(X, X)_{ij} = \Theta(x_i, x_j)$, and similarly for $\Theta(x, X) \in \mathbb{R}^n$. To be precise, Eq. (7) gives the mean output of the network using a weight-independent kernel with variance depending on the initialization. (Footnote 2: For that reason, in the experiments, we often compare this with the centered prediction of the actual neural network, $f(x; w_t) - f(x; w_0)$, as is commonly done in similar studies [COB19].)
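The quantities in this subsection can be sketched numerically. The snippet below builds the empirical NTK of a small two-layer ReLU network from explicit parameter gradients and evaluates the kernel-regression prediction of Eq. (7) through an eigendecomposition of the Gram matrix. The architecture, its $1/\sqrt{m}$ scaling, the finite width, and all function names are assumptions for illustration; the analytic infinite-width kernels used in the paper come from the Neural Tangents library instead.

```python
import numpy as np

def param_grad(W, a, x):
    """Flattened gradient of f(x) = a @ relu(W @ x) / sqrt(m) w.r.t. (W, a)."""
    m = W.shape[0]
    pre = W @ x
    grad_a = np.maximum(pre, 0.0) / np.sqrt(m)
    grad_W = np.outer(a * (pre > 0), x) / np.sqrt(m)
    return np.concatenate([grad_W.ravel(), grad_a])

def empirical_ntk(W, a, X1, X2):
    """Theta[i, j] = <grad_w f(X1[i]), grad_w f(X2[j])> at the given weights."""
    J1 = np.stack([param_grad(W, a, x) for x in X1])
    J2 = np.stack([param_grad(W, a, x) for x in X2])
    return J1 @ J2.T

def kernel_prediction(K_test_train, K_train, Y, eta, t):
    """Theta(x, X)^T Theta^{-1} (I - exp(-eta * Theta * t)) Y, computed via
    the eigendecomposition of the symmetric Gram matrix K_train."""
    lam, V = np.linalg.eigh(K_train)
    coeff = (1.0 - np.exp(-eta * lam * t)) / lam
    return K_test_train @ (V @ (coeff * (V.T @ Y)))

rng = np.random.default_rng(0)
d, m, n = 10, 512, 5
W, a = rng.normal(size=(m, d)), rng.normal(size=m)   # weights at initialization
X_train = rng.normal(size=(n, d))
Y_train = rng.choice([-1.0, 1.0], size=n)
x_test = rng.normal(size=(1, d))

K_train = empirical_ntk(W, a, X_train, X_train)
K_test = empirical_ntk(W, a, x_test, X_train)
for t in [0.0, 0.1, 1.0, 1e4]:
    print(t, kernel_prediction(K_test, K_train, Y_train, eta=1.0, t=t))
```

At $t = 0$ the prediction is zero, and for large $t$ the prediction on the training inputs approaches the training labels, as expected from kernel regression with an invertible Gram matrix.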

3 Transfer Results in the Kernel Regime

In this section, we show how to generate adversarial examples from NTKs and discuss their similarity to the ones generated by the actual networks. Note that for network results, we restrict ourselves to wide networks initialized in the “lazy” regime with small learning rates (the “kernel regime”).

3.1 Generation of Adversarial Examples for Infinitely Wide Neural Networks

Adversarial examples arise in the context of classification, while the NTK learning process is described by a regression as in Eq. (7). Arguably the simplest way to align with the framework presented in Eq. (1) is to treat the outputs of the kernel like the logits of a neural net, mapping them to a probability distribution via the sigmoid/softmax function and applying the cross-entropy loss.

A simple calculation (see Appendix B, together with the generalization to the multi-class case) gives:

The optimal one step adversarial example of a scalar, infinitely wide, neural network is given by

$$\tilde{x} = x - y\,\epsilon \cdot \mathrm{sign}\left(\nabla_x f_t(x)\right), \tag{8}$$

for $\|\tilde{x} - x\|_\infty \le \epsilon$ and labels $y \in \{\pm 1\}$, where $\nabla_x f_t(x) = \nabla_x \Theta(x, X)^\top \Theta^{-1}(X, X) \left(I - e^{-\eta \Theta(X, X) t}\right) Y$.

One can conceive other ways to generate adversarial perturbations for the kernel, either by changing the loss function (as previously done in neural networks (e.g. [CaWa17])) or through a Taylor expansion around the test input, and we present such alternative derivations in Appendix B. However, in practice we observe little difference between that approach and the one presented here.
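The one-step kernel attack can be sketched as follows. With a sigmoid and cross-entropy on labels $y \in \{-1, +1\}$, the loss is $\log(1 + e^{-y f(x)})$, so the sign of its input gradient is $-y\,\mathrm{sign}(\nabla_x f(x))$ and only the input gradient of the kernel prediction is needed. As a stand-in for the infinite-width kernel, this sketch uses the empirical NTK of a finite two-layer ReLU network at initialization, takes the $t \to \infty$ limit of the prediction, and treats the ReLU gates as locally constant when differentiating; all of these choices and all function names are assumptions for illustration.

```python
import numpy as np

def ntk_row_and_grad(W, a, x, X):
    """Empirical NTK row Theta(x, X), shape (n,), for the network
    f(x) = a @ relu(W @ x) / sqrt(m), and its gradient w.r.t. x, shape (n, d)."""
    m = W.shape[0]
    pre_x = W @ x
    gate_x = (pre_x > 0).astype(float)        # ReLU gates at the test input
    pre_X = X @ W.T
    relu_X = np.maximum(pre_X, 0.0)
    gate_X = (pre_X > 0).astype(float)
    shared = gate_X @ (a**2 * gate_x)         # sum_j a_j^2 gate_j(x) gate_j(x_i)
    theta = (relu_X @ np.maximum(pre_x, 0.0) + (X @ x) * shared) / m
    grad = ((relu_X * gate_x) @ W + shared[:, None] * X) / m
    return theta, grad

def gram_matrix(W, a, X):
    """Gram matrix Theta(X, X) assembled row by row."""
    return np.stack([ntk_row_and_grad(W, a, xi, X)[0] for xi in X])

def kernel_attack(W, a, x, y, X, Y, eps):
    """x_adv = x - eps * y * sign(grad_x f(x)), with f(x) = Theta(x, X)^T Theta^{-1} Y."""
    alpha = np.linalg.solve(gram_matrix(W, a, X), Y)
    _, grad_theta = ntk_row_and_grad(W, a, x, X)
    grad_f = grad_theta.T @ alpha
    return x - eps * y * np.sign(grad_f)

rng = np.random.default_rng(0)
d, m, n = 10, 256, 8
W, a = rng.normal(size=(m, d)), rng.normal(size=m)   # weights at initialization only
X = rng.normal(size=(n, d))
Y = rng.choice([-1.0, 1.0], size=n)
x, y, eps = X[0] + 0.05 * rng.normal(size=d), Y[0], 0.1

x_adv = kernel_attack(W, a, x, y, X, Y, eps)
print("max |x_adv - x| =", np.max(np.abs(x_adv - x)))
```

Note that nothing in this computation uses trained weights: the kernel is evaluated at initialization and the labels enter only through the kernel-regression solution.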

3.2 Transfer Results and Kernel Attacks

Predictions from NTK theory for infinitely wide neural networks have been used successfully for their large finite width counterparts, so it seems reasonable to conjecture that adversarial perturbations generated via the kernel as in Eq. (8) resemble those directly computed for the corresponding neural net as per Eq. (2). In particular, this would imply that adversarial perturbations derived from the NTK should not only fool the kernel machine itself, but also lead wide neural nets to misclassify.

Figure 3: Top. Cosine similarity between the loss gradient of the neural net and of the NTK prediction for the same time point. Bottom. Robust accuracy of neural net against its own adversarial examples (solid) and corresponding NTK examples (dashed). CIFAR10, car vs plane.

While similar transfer results in different contexts have been observed indirectly, via the effects of the perturbation on metrics like accuracy [YuWu21, Ngu+21], we aim to look deeper and compare perturbations directly. High similarity would imply that any gradient based white-box attack on the neural net can be successfully mimicked by a “black-box” kernel derived attack.

To this end, we train multiple two-layer neural networks on image classification tasks extracted from MNIST and CIFAR-10 and compare adversarial examples generated by Eqs. (2) (attacking the neural network) and (8) (attacking the kernel). The networks are trained with a small learning rate and are sufficiently large, so that they lie close to the NTK regime.

We track cosine similarity between the gradients of the loss from the NTK predictions and the gradients from the actual neural net as training evolves. Then, we generate adversarial perturbations from both the neural net and the kernel machine, and test whether those produced by the latter can fool the former. Full experimental details can be found in Appendix C.
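The two measurements described here reduce to a pair of small helpers, sketched below. In the actual experiment the two gradients would be the loss gradients of the trained network and of the NTK prediction at the same time point; the toy linear score and the hand-made data used here are stand-ins, and the function names are hypothetical.

```python
import numpy as np

def cosine_similarity(g1, g2):
    """Cosine of the angle between two gradients, flattened to vectors."""
    g1, g2 = np.ravel(g1), np.ravel(g2)
    return float(g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2)))

def robust_accuracy(predict, X_adv, y):
    """Accuracy of `predict` (inputs -> real-valued scores) on perturbed
    inputs, for labels in {-1, +1}."""
    return float(np.mean(np.sign(predict(X_adv)) == y))

# Toy check with a linear score f(x) = w @ x as the attacked model.
w = np.array([1.0, -1.0])
X = np.array([[2.0, 0.0], [0.0, 2.0], [0.0, 5.0]])
y = np.array([1.0, -1.0, -1.0])
eps = 1.5
X_adv = X - eps * y[:, None] * np.sign(w)   # worst-case l_inf shift for a linear score

print(cosine_similarity(w, 2.0 * w))              # parallel gradients
print(robust_accuracy(lambda Z: Z @ w, X, y))     # clean accuracy
print(robust_accuracy(lambda Z: Z @ w, X_adv, y)) # accuracy under attack
```

To test transfer, `X_adv` would be produced by the kernel attack while `predict` is the trained network, so that a drop in `robust_accuracy` measures how well kernel-derived perturbations fool the network.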

Results. Our experiments confirm a very strong alignment of loss gradients from the neural nets and the NTK across the whole duration of training, as can be seen in Fig. 3 (top). Then, as expected, kernel-generated attacks produce a similar drop in accuracy throughout training as the networks' own white-box attacks, eventually driving robust accuracy to $0\%$, as seen in Fig. 3 (bottom). We reproduce these plots for MNIST in Appendix C, leading to similar conclusions.

When concerned with security aspects of neural nets, adversarial attacks are mainly characterised as either white-box or black-box attacks [Pap+17]. White box attacks assume full access to the neural network and in particular its weights; prominent examples include FGSM/PGD attacks. Black box attacks, on the other hand, can only query the model to try to infer the loss gradient, either through training separate surrogate models [PMG16] or through carefully crafted input-output pairs fed to the target model [Che+17, Ily+18, And+20]. NTK theory and the experiments of this section suggest a threat model in which the attacker does not require access to the model or its weights, nor training of a substitute model. For fixed architecture and training data, all the information required for the computation of Eq. (8) is available at initialization, making the “NTK-attack” akin to a “training free” substitution attack, and, at least in the kernel-regime for wide nets considered here, as effective as white-box attacks.

4 NTK Eigenvectors Induce Robust and Non-Robust Features

This close connection between adversarial perturbations from the kernel and the corresponding neural net gives us the opportunity to bring to bear kernel tools on the study of adversarial robustness and its relation to features in a more direct fashion. Several recent works leverage properties of the NTK, and specifically its spectrum, to study aspects of approximation and generalization in neural networks [Aro+19a, Bas+19, BiMa19, Bas+20]. Here we show how the spectrum relates to robustness and helps to clarify the notion of robust/non-robust features.

We define features induced by the eigendecomposition of the Gram matrix $\Theta(X, X) = \sum_{i=1}^n \lambda_i v_i v_i^\top$. We will be most interested in the end of training, when the model has access to all the features it can extract from the training data $X$. As $t \to \infty$, Eq. (7) becomes $f_\infty(x) = \Theta(x, X)^\top \Theta^{-1}(X, X) Y$ and can be decomposed as $f_\infty(x) = \sum_{i=1}^n f^{(i)}(x)$, where

$$f^{(i)}(x) = \lambda_i^{-1}\, \Theta(x, X)^\top v_i v_i^\top Y. \tag{9}$$

Each $f^{(i)}$ can be seen as a unique feature captured from the (training) data. Note that these functions map the input to the output space, thus matching the definitions of Sec. 2.2. Also observe that all $f^{(i)}$'s jointly recover the original prediction of the model, while each one, intuitively, should contribute something different to it.

Importantly, these features induce a decomposition of the gradient of the loss into parts, each representing gradients of a unique feature as already advertised in Fig. 1. The binary case is particularly elegant as it gives rise to a linear decomposition of the gradient as

$$\nabla_x L(f_\infty(x), y) = \alpha \sum_{i=1}^n \nabla_x f^{(i)}(x), \tag{10}$$

for some scalar $\alpha$ depending on $x$ and $y$ (see Appendix D). But if the $f^{(i)}$'s are features, what do they look like?
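The feature decomposition follows from writing the inverse Gram matrix through its eigenpairs, and the sketch below checks numerically that the per-eigenvector features sum to the full kernel-regression prediction. To keep the example short it uses an RBF kernel as a stand-in for the NTK (any symmetric positive definite Gram matrix works for this check); the kernel choice, the synthetic data and the function names are assumptions for illustration.

```python
import numpy as np

def rbf(A, B, bandwidth):
    """Stand-in kernel (not the NTK): Gaussian RBF between the rows of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * bandwidth**2))

def eigen_features(K_train, K_test_train, Y):
    """Return (F, lam) with F[i, s] = K(x_s, X)^T v_i v_i^T Y / lam_i,
    eigenpairs ordered from largest to smallest eigenvalue."""
    lam, V = np.linalg.eigh(K_train)      # ascending order
    lam, V = lam[::-1], V[:, ::-1]        # largest eigenvalue first
    proj = (V.T @ Y) / lam                # v_i^T Y / lam_i for every i
    return (K_test_train @ V).T * proj[:, None], lam

rng = np.random.default_rng(0)
n, n_test, d = 20, 50, 5
X = rng.normal(size=(n, d))
Y = np.sign(X[:, 0])                      # toy binary labels in {-1, +1}
X_test = rng.normal(size=(n_test, d))
Y_test = np.sign(X_test[:, 0])

K = rbf(X, X, bandwidth=np.sqrt(d))
K_test = rbf(X_test, X, bandwidth=np.sqrt(d))
F, lam = eigen_features(K, K_test, Y)

# Per-feature usefulness, estimated as accuracy on the held-out points.
usefulness = np.mean(np.sign(F) == Y_test[None, :], axis=1)
print("top-5 eigenvalues:", lam[:5])
print("usefulness of the top-5 features:", usefulness[:5])
```

Each row of `F` is one feature evaluated on the held-out points, so its usefulness can be read off as an accuracy; robustness would be estimated the same way after attacking each feature separately.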

Feature properties of common architectures:

Figure 4: Non-robust, useful features earlier and later in the spectrum, for CIFAR10 car and plane.

With these definitions in place, we can now analyze the characteristics of features for commonly used architectures, leveraging their associated NTK. To be consistent with the previous section, we consider classification problems from MNIST (10 classes) and CIFAR-10 (car vs airplane). We compose the Gram matrices from the whole training dataset (50000 and 10000, respectively), and compute the different feature functions $f^{(i)}$ using the eigendecomposition of the matrix. We estimate the usefulness of a feature by measuring its accuracy on a hold-out validation set, and its robustness by perturbing each input of this set, using an FGSM attack on feature $f^{(i)}$. We consider several different Fully Connected and Convolutional Kernels, whose expressions are available through the Neural Tangents library [Nov+20], built on top of JAX [Brad+18]. We summarize our findings on how these features behave:

Functions $f^{(i)}$ represent visually distinct features. We visualise each feature by plotting its gradient with respect to $x$. Fig. 2 shows the gradient of the first 5 features for various architectures for a specific image from the CIFAR-10 dataset. We observe that features are fairly consistent across models, and they are interpretable: for example the 4th feature seems to represent the dominant color of an image, while the 5th one seems to be capturing horizontal edges.

Networks use both robust and non-robust features for prediction. It has been speculated that neural networks trained in a standard (non adversarial) fashion rely on both robust and non-robust features. Our feature definition in Eq. (9) shows that this is indeed the case. The NTK of common neural networks consists of both robust features that match human expectations, such as the ones depicted in Fig. 2, and features that are predictive of the true label while not being robust to adversarial perturbations of the input (Fig. 4). Fig. 2 depicts the first 100 features of a fully connected and a convolutional tangent kernel in Usefulness-Robustness space. The upper left region of the plots shows a large number of useful, yet non-robust features. These features seem random to human observers.

Robustness lies at the top. We observe in Fig. 2 that features corresponding to the top eigenvectors tend to be robust. This is consistent among different models and between the two datasets (see Appendix D). Since these eigenvectors are the ones fitted first during training [Aro+19a, JHG18], it is no wonder that the loss gradient evolves from coherence to noise, as observed in Fig. 6(b). This also explains the apparent trade-off between robustness and accuracy of neural networks as training progresses: useful, robust features are fitted first, followed by useful, but non-robust ones. This ties in well with both empirical findings [Rah+19] and theoretical case studies [Bas+19, BiMa19, Bas+20] that demonstrate that low frequency functions are fitted first during training and provide favorable generalization properties; we would associate robust features with these low-frequency parts (in function space).

Robust features alone are not enough. In light of these findings, it might be reasonable to conjecture that we could obtain robust models by retaining the robust features of the prediction, while discarding the non-robust ones. The spectral approach gives a principled way to disentangle features and create kernel machines keeping only the robust ones. Our results show that in general it is not possible to obtain non-trivial performance without compromising robustness in this fashion, strengthening the case for the necessity of data augmentation in the form of adversarial training (see Appendix D.3).

5 Kernel Dynamics during Adversarial Training

Given the apparent necessity for adversarial training to produce robust models, how does it achieve this goal? To shed some light on this fundamental question, we depart from the “lazy” NTK regime and study the evolution of the NTK of adversarially trained models. For a neural network trained with gradient descent, as the learning rate $\eta \to 0$, the continuous time dynamics can be written as

$$\frac{\partial f(x; w_t)}{\partial t} = -\Theta_t(x, X)^\top \nabla_f L(f(X; w_t), Y). \tag{11}$$

In the NTK regime, this kernel remains fixed at its initial value. However, outside this regime, it has been demonstrated, both empirically [Gei+19, Fort+20, Bar+21, Jim+21] and theoretically [ABP21], that $\Theta_t$ is not constant during training, and changes as the weights move. In adversarial training, moreover, there is the additional effect that at each weight update, the data changes as well. For that reason, understanding the dynamics of adversarial training requires tracking the evolution of a kernel $\Theta_t(X_t, X_t)$, where $X_t$ denotes the current (mini) batch of training data. Notice that the tangent vector $\nabla_w f(X_t; w_t)$ still describes the instantaneous change of $f$ on the current batch of data, and is thus informative of the local geometry of the function space, justifying its value as a quantity to be measured during adversarial training.

We train a deep convolutional architecture on CIFAR-10 (multiclass) with standard (SGD) and adversarial training using PGD with an $\ell_\infty$ constraint. Full implementation details and accuracy curves can be found in Appendix E, together with the reproduction of the same experiment on MNIST, where the observations are similar. We track the following quantities during training:

Figure 5: Left: Kernel Matrices for a mini batch of size 256. Left to Right: Kernel at initialization, Kernel after standard training, Kernel after adversarial training (20 PGD steps). The standard kernel grows significantly more than the adversarial one. Right: (a) Kernel Frobenius norm evolution, and (b) concentration on the top 20 eigenvalues during standard and adversarial training. Setting: CIFAR10, .

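Kernel distance. We compare two kernels using a scale invariant distance, which quantifies the relative rotation between them, as used in other works studying NTK dynamics (e.g. Fort+20):

$$d(K_1, K_2) = 1 - \frac{\mathrm{Tr}(K_1 K_2^\top)}{\sqrt{\mathrm{Tr}(K_1 K_1^\top)}\,\sqrt{\mathrm{Tr}(K_2 K_2^\top)}}. \tag{12}$$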

Polar dynamics. Zooming in on the change that the initial kernel undergoes, we define a polar space on which we measure the movement of the kernel:


where are the initial and final kernel, respectively. Fig. 6 presents a heatmap of kernel distances at different time steps for both standard and adversarial training, as well as both training trajectories in polar space.

Figure 6: Left: Rotation (Eq. (12)) of the empirical NTK during standard and adversarial training. Left to right: MNIST standard, MNIST adversarial, CIFAR standard, CIFAR adversarial. Right: Kernel trajectories in polar space (Eq. (13)) for MNIST (left) and CIFAR10 (right). Darker colors indicate earlier epochs.

Concentration on subspaces. To quantify weight concentration on the top region of the spectrum, we track the (normalized) Frobenius norm of subspaces as $\sqrt{\sum_{i \le k} \lambda_i^2} \,\big/ \sqrt{\sum_{i} \lambda_i^2}$, for various cut-offs $k$, where we have indexed the eigenvalues from largest to smallest. Fig. 5 depicts concentration on the top 20 eigenvalues during training.
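Two of the tracked quantities are easy to sketch on Gram matrices. The snippet below implements the scale-invariant kernel distance in the trace form used in prior NTK-dynamics work, and a top-$k$ spectral concentration. The concentration formula is one reading of "normalized Frobenius norm of subspaces", namely the Frobenius norm restricted to the top-$k$ eigenvalues divided by the full Frobenius norm; that reading, the random test matrices and the function names are assumptions for illustration.

```python
import numpy as np

def kernel_distance(K1, K2):
    """1 - Tr(K1 K2^T) / (||K1||_F ||K2||_F). Invariant to rescaling either
    kernel, and zero exactly when one is a positive multiple of the other."""
    inner = np.sum(K1 * K2)               # equals Tr(K1 @ K2.T)
    return 1.0 - inner / (np.linalg.norm(K1) * np.linalg.norm(K2))

def top_k_concentration(K, k):
    """sqrt(sum of the k largest squared eigenvalues / sum of all squared
    eigenvalues) for a symmetric positive semi-definite K."""
    lam = np.linalg.eigvalsh(K)[::-1]     # eigenvalues, largest first
    return float(np.sqrt(np.sum(lam[:k] ** 2) / np.sum(lam**2)))

rng = np.random.default_rng(0)
n = 16
A, B = rng.normal(size=(n, n)), rng.normal(size=(n, n))
K0 = A @ A.T                              # stand-in for the kernel at initialization
K1 = K0 + 0.5 * (B @ B.T)                 # stand-in for a later, rotated kernel

print("distance to a rescaled copy:", kernel_distance(K0, 3.0 * K0))
print("distance to a later kernel: ", kernel_distance(K0, K1))
print("top-4 concentration:", top_k_concentration(K0, 4))
```

In a training run, these helpers would be applied to the empirical NTK Gram matrix of a fixed mini-batch at successive checkpoints, giving the heatmaps and concentration curves described in this section.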

Our findings show that, similar to what has been reported in prior work [Fort+20], the kernel rotates significantly at the beginning of training and then slows down, for both standard and adversarial training. However, in the latter case, this second phase begins much earlier. As Fig. 6 illustrates, during adversarial training the kernel moves a greater distance than during standard training, but after a few epochs it stops both rotating and expanding. This is not the case for standard training, where the kernel increases its magnitude substantially later in training and in fact grows to a norm orders of magnitude larger than during adversarial training (see Fig. 5). In hindsight, this behavior is perhaps not surprising, as each element of the kernel measures similarity between data points, and a robust machine should be more conservative when estimating similarity. The observation that the kernel becomes nearly static early in adversarial training might indicate that linear dynamics govern its later phase. It has been observed in previous works [Gei+19, Fort+20, Jim+21] that linearization after a few initial epochs of rapid rotation often closely matches the performance of full network training. Our results indicate that a similar phenomenon occurs even under the data shift of adversarial training (see Appendix E.1 for a study of linearized adversarial training), opening avenues to design robust machines more efficiently.

Moreover, endowed with the knowledge that, at least for kernels trained with static data, robust features lie at the top of the spectrum, we study the polar dynamics of the top space only (see Fig. 14). We observe substantial rotation in this space, suggesting that robust features are learned early on not only during standard training, but in particular during adversarial training. Even more interestingly, Fig. 5 demonstrates that not only do the robust features change, but their relative weight, as measured by the concentration on the top-20 space, simultaneously increases and remains significantly larger than during standard training. As each eigenvalue weights the importance of the corresponding feature in the final prediction, this implies that the kernel “learns” to depend more on the most robust features.

Put together, these findings reveal different kernel dynamics during standard and adversarial training: during adversarial training, the kernel rotates much faster, expands much less and becomes “lazy” much earlier than during standard training. Fully understanding the properties of converged adversarial kernels remains an important avenue for future work, which might allow the design of faster algorithms for robust classification.

6 Final Remarks

We have studied adversarial robustness through the lens of the NTK across multiple architectures and data sets both in the idealized NTK regime and the “rich feature” regime. When connecting the spectrum of the kernel with fundamental properties characterizing robustness our phenomenological study reveals a universal picture of the emergence of robust and non-robust features and their role during training. There are certain limitations and unexplored themes in our work; Sec. 3 argues that transferable attacks from the NTK may be as effective as white-box attacks, but this warrants an in-depth study across architectures, kernels and data sets (which has not been the main focus of this work). Sec. 4 visualises features for fairly simple models, since the computation of kernel derivatives is a costly procedure. It would be interesting to use our framework to visualise features from more complicated architectures. Finally, our work in Sec. 5 invites more research on the kernel at the end of adversarial training, similar to what has been done for standard models [Long21].

We hope that our viewpoint can motivate further theoretical understanding of adversarial phenomena (such as transferability) and the design of better and/or faster adversarial learning algorithms, by further analyzing the kernels from robust deep neural networks.


The authors would like to thank Jingtong Su, Alberto Bietti, Yunzhen Feng, and Artem Vysogorets for fruitful discussions and feedback in various stages of this work. NT thanks Dimitris Tsipras for a helpful discussion in the beginning of this project. The authors would like to acknowledge support through the National Science Foundation under NSF Award 1922658. This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.




  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

    2. Did you describe the limitations of your work?

    3. Did you discuss any potential negative societal impacts of your work? Our work sheds light on properties of adversarial examples, with the aim of making machine learning models more reliable in the long run.

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them?

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results?

    2. Did you include complete proofs of all theoretical results?

  3. If you ran experiments…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)?

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)?

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)?

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)?

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators?

    2. Did you mention the license of the assets?

    3. Did you include any new assets either in the supplemental material or as a URL?

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating?

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content?

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable?

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix A Robust and Non-Robust features

The idea that data features are to be blamed for the adversarial weakness of machine learning models was proposed in [Ily+19, Tsi+19]. In particular, Ily+19 show that training with adversarially perturbed images labeled with the “wrong” label yields classifiers with non-trivial test performance (“learning from non-robust features only”), while, in a dual experiment, they demonstrate that standard training with “robustified” data (data that presumably are “denoised” from non-robust features) produces a classifier with non-trivial robust accuracy (“relies only on robust features”). Motivated by these observations, the authors propose a model of robust/non-robust features that are hidden in the data, and whose presence determines the eventual robustness of models. To accompany the definitions of Sec. 2.2, we extend them for multiclass classification, since Sec. 4 introduces our NTK feature framework for both binary and multiclass problems.

Let be the data generating distribution, with (input space) and (action space). We define features as functions from the input to the action space, and categorize them as follows, according to their performance as classifiers. Fix :

  1. -Useful feature: A feature is called -useful if

  2. -Robust feature: A feature is called -robust if it is predictive of the true label under any perturbation inside a bounded “ball” , that is if

  3. Useful, non-robust feature: A feature is called useful, non-robust if it confers an advantage above guessing the most likely label, i.e. , but is -robust only for (within some precision).

The above framework was introduced by [Ily+19, Tsi+19], and we have slightly adapted it in terms of accuracy as classifiers derived from features. Goh19 showed how such feature functions arise in a simple linear model, and proposed two mechanisms to construct useful, non-robust features. In [ZhLi20], the authors view the weights of neural networks as features, and show that adversarial training “purifies/robustifies” them.
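To make the accuracy-based definitions concrete, here is a small sketch for the special case of a linear feature on binary labels in {-1, +1}, under an $\ell_\infty$ threat model. This is an illustrative assumption: the features studied in the paper are kernel-induced and generally nonlinear, so the worst case inside the ball must be approximated with attacks such as FGSM or PGD. For a linear feature it is available exactly. All names and the toy data are hypothetical.

```python
import numpy as np

def usefulness(w, b, x, y):
    """Clean accuracy of the classifier sign(w.x + b)."""
    return float(np.mean(np.sign(x @ w + b) == y))

def robustness(w, b, x, y, eps):
    """Accuracy under the worst l_inf perturbation of radius eps.

    For a linear feature the worst case lowers the margin y * (w.x + b)
    by exactly eps * ||w||_1.
    """
    worst_margin = y * (x @ w + b) - eps * np.sum(np.abs(w))
    return float(np.mean(worst_margin > 0))

# Toy data: the last point is classified correctly, but only by a small margin.
x = np.array([[2.0, 0.0], [1.0, 0.1], [-2.0, 0.0], [-0.2, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([1.0, 0.0])
b = 0.0

clean_acc = usefulness(w, b, x, y)
robust_acc = robustness(w, b, x, y, eps=0.5)
```

In this toy setting the feature is fully useful, but its accuracy drops once perturbations of radius 0.5 are allowed, because one point sits close to the decision boundary.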

Appendix B Derivation of Adversarial Perturbations for Kernel Regression

In this section, we derive expressions for adversarial attacks on Neural Tangent Kernels presented in the main paper, as well as additional derivations obtained from first-order expansions around the input.

B.1 Adversarial Perturbations from Cross-Entropy Loss

We first derive the expression in Eq. (8) of the paper. Let be an input to the NTK prediction


where is a dataset of size . We consider the binary and the multiclass case separately.

In the binary case, where , we feed expression Eq. (16) to a sigmoid and maximize the cross entropy loss between the output and the true label:


where we set to lie in . We compute the gradient of the loss with respect to :


So the optimal one-step attack, under an adversary, reduces to computing perturbation


since for all .
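The binary derivation can be sketched numerically. The block below is an illustration under explicit assumptions, not the paper's implementation: a Gaussian kernel stands in for the NTK (its input gradient has a simple closed form), the predictor is the infinite-time kernel regressor with a small ridge term added for numerical stability, regression targets are ±1, and the label passed to the cross-entropy lies in {0, 1}. All function names are ours.

```python
import numpy as np

def rbf(a, b, gamma):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * sq)

def predict(x, x_train, alpha, gamma):
    """Kernel regression output f(x) = k(x, X)^T alpha."""
    return rbf(x[None, :], x_train, gamma)[0] @ alpha

def grad_predict(x, x_train, alpha, gamma):
    """Closed-form gradient of f with respect to the input x (no autodiff)."""
    k = rbf(x[None, :], x_train, gamma)[0]                    # shape (n,)
    dk = -2.0 * gamma * (x[None, :] - x_train) * k[:, None]   # shape (n, d)
    return dk.T @ alpha

def one_step_attack(x, y, x_train, alpha, gamma, eps):
    """eps * sign of the cross-entropy gradient, for a label y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-predict(x, x_train, alpha, gamma)))
    grad_loss = (p - y) * grad_predict(x, x_train, alpha, gamma)
    return eps * np.sign(grad_loss)

rng = np.random.default_rng(0)
x_train = rng.normal(size=(20, 3))
y_train = (x_train[:, 0] > 0).astype(float)
gamma = 1.0
gram = rbf(x_train, x_train, gamma)
# Ridge of 1e-6 is a numerical stabilizer (our assumption); targets are +-1.
alpha = np.linalg.solve(gram + 1e-6 * np.eye(20), 2.0 * y_train - 1.0)

x = rng.normal(size=3)
g = grad_predict(x, x_train, alpha, gamma)
delta_pos = one_step_attack(x, 1.0, x_train, alpha, gamma, eps=0.1)
delta_neg = one_step_attack(x, 0.0, x_train, alpha, gamma, eps=0.1)
```

Since the sigmoid output lies strictly between 0 and 1, the factor multiplying the input gradient has a sign fixed by the label alone, so the attack direction only requires the sign of the gradient of the kernel prediction.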

In the case of a k-class classification problem with one hot labels , we can express the cross entropy loss between the NTK predictions Eq. (16) and the labels as:


where denotes the -th output of Eq. (16). Computing the loss gradient as before yields the optimal perturbation ,


The above calculations allow us to speed up the computation of the attacks in the case of NTKs with closed form expression, since the gradient


with D being the Jacobian of wrt to , can be pre-computed, without the need for auto-differentiation tools. We leverage this in the experiments of Sec. 3.
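A sketch of the multiclass case with a Jacobian computed in closed form rather than by auto-differentiation. As before, the Gaussian kernel is a stand-in for an NTK with an analytical expression, the ridge term is a numerical stabilizer we add, and all names are hypothetical. The coefficient matrix depends only on the training data, so it is computed once and reused for every attacked input.

```python
import numpy as np

def rbf_row(x, x_train, gamma):
    """Kernel values k(x, x_i) for all training points x_i."""
    return np.exp(-gamma * np.sum((x[None, :] - x_train) ** 2, axis=1))

def outputs_and_jacobian(x, x_train, coef, gamma):
    """Class outputs f(x) (shape C) and their input Jacobian (shape C x d)."""
    k = rbf_row(x, x_train, gamma)                             # (n,)
    dk = -2.0 * gamma * (x[None, :] - x_train) * k[:, None]    # (n, d)
    return coef.T @ k, coef.T @ dk

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def cross_entropy(x, y_onehot, x_train, coef, gamma):
    f, _ = outputs_and_jacobian(x, x_train, coef, gamma)
    return -float(y_onehot @ np.log(softmax(f)))

def multiclass_attack(x, y_onehot, x_train, coef, gamma, eps):
    """Return eps * sign(grad) and the loss gradient J^T (softmax(f) - y)."""
    f, jac = outputs_and_jacobian(x, x_train, coef, gamma)
    grad_loss = jac.T @ (softmax(f) - y_onehot)
    return eps * np.sign(grad_loss), grad_loss

rng = np.random.default_rng(1)
x_train = rng.normal(size=(30, 4))
y_train = np.eye(3)[rng.integers(0, 3, size=30)]               # one-hot labels
gamma = 0.2
gram = np.stack([rbf_row(xi, x_train, gamma) for xi in x_train])
coef = np.linalg.solve(gram + 1e-3 * np.eye(30), y_train)      # precomputed once

x = rng.normal(size=4)
y = np.eye(3)[0]
delta, grad_loss = multiclass_attack(x, y, x_train, coef, gamma, eps=0.03)
```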

B.2 Alternative Approaches to Generate Perturbations

One can derive other perturbation variants by changing the loss function from cross-entropy to other functions studied in the literature in this context (e.g. [CaWa17]). Alternatively, we can study the output on a test input directly to devise strategies to most efficiently perturb it, using a Taylor expansion around the input, leading to a linear expression (shown here for scalar kernels):


for some that depends on the training data and the NTK kernel only.

Binary case:

Suppose we would like to evaluate a model described by Eq. (7) at the end of training,


on slightly perturbed variations of the original training data. Then, slightly abusing notation, we set, , that is for all for small, but unknown, perturbations . By taking a first-order Taylor expansion in the perturbation, we can write the -th element of as follows:


For each row we obtain:


Hence, can be written as for a perturbation matrix , with -th row . Substituting into Eq. (24), we get:


Thus, the output of the model on is:


leading to the linear expression advertised in Eq. (23). The adversarial perturbation changes the output by , an expression which allows us to compute the adversarial perturbation to maximally change the output within the desired constraints on .

Since Eq. (24) describes regression models trained with least squares ($\ell_2$ loss), while adversarial examples are typically studied for classification models, we turn Eq. (24) into a classifier by thresholding (i.e. taking the sign of the output, in the case of binary classification tasks) or by outputting the maximum prediction (in the case of multiclass problems).

Inspecting Eq. (28), maximal “confusion” of the classification model is achieved by aligning with (directed towards the decision boundary). In case of the commonly used restriction, i.e. , the optimal adversarial perturbation is given by:


The computation of this optimal adversarial perturbation requires an expression for the NTK and its gradient with respect to the training data. For models where an analytical expression of the NTK is available, only access to the labeled training data is necessary (as presented, for instance, in Sec. C). In more complicated models or those that deviate from the assumptions for Eq. (24) one can compute an empirical kernel by sampling over kernels at initialization over a few instances and obtain the matrices with autodifferentiation tools.
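The alignment argument can be made explicit with a short worked step. We assume that the restriction in question is an $\ell_\infty$ bound of radius $\epsilon$, and we write the perturbation-dependent part of the output as $a^\top \delta$; the symbols $a$, $\delta$ and $\epsilon$ are our notation for this sketch.

```latex
\max_{\|\delta\|_\infty \le \epsilon} a^\top \delta
  \;=\; \epsilon \sum_j |a_j|
  \;=\; \epsilon \, \|a\|_1 ,
\qquad \text{attained at } \delta^\star = \epsilon \,\mathrm{sign}(a).
```

Each coordinate of $\delta$ can be chosen independently, so the maximum is reached by pushing every coordinate to the boundary of the ball in the direction of the corresponding entry of $a$. To move the output towards the decision boundary rather than away from it, the sign is flipped when the unperturbed output is positive.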

Eq. (28) has been derived for perturbations of the training data. Consider now the case when we evaluate Eq. (24) on perturbations of unseen test data, that is on . Then, Eq. (27) becomes:


Again, solely the second term depends on the perturbation, so we proceed by choosing a maximally perturbing direction as before. The only difference lies in the matrix that now depends on the test set


In practice, an adversary can calculate the NTK offline and calculate the optimal perturbation on a new test input by computing the corresponding row of the matrix . Importantly, no information on the test data labels is needed.

Multiclass case:

We adapt the derivations of the binary case to the setting where the output dimension is larger than one in the underlying regression setting (see below), resulting in a multiclass classifier. This leads to the multi-dimensional analogue of the linear Eq. (23) for , :


Again, the can be computed from the NTK and its derivative, as well as the training data labels. Considerations exactly analogous to the binary case allow us to adapt this expression to perturbations of the test data.

At this point we have a choice of how to adversarially perturb the classifier to achieve the largest effect on the network output. We present the two most obvious methods.

Max-of- perturbation: Similar in spirit to traditional approaches in adversarial attacks (CaWa17) we choose such as to most efficiently decrease the correct response while maximally increasing one of the false responses . The solution is given by:


It is obtained by solving



Sum-of- perturbation: For one-hot vectors we could, instead, maximize the cross-entropy between the labels and the new outputs, thus choosing to produce a maximally mixed output. If is the correct label, this yields


derived as follows


Maximizing this cross entropy amounts to maximizing

For small perturbations we can develop the exponential to first order (the resulting expression for the maximum also holds when developing to second order), which leads to finding the maximum of

yielding Eq. (35).
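Both strategies can be sketched on top of the linearized outputs. We assume a linear model of the form f_c(x + delta) ≈ f_c(x) + a_c · delta with an $\ell_\infty$ budget eps, where the rows a_c of the matrix `a` are our notation. The first function picks the false class whose gap to the correct class can be increased the most. The second uses the exact gradient of the cross-entropy at zero perturbation, which may weight the classes differently from the paper's Eq. (35).

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def max_of_k(a, c, eps):
    """Perturbation maximizing (a_j - a_c) . delta over false classes j."""
    gains = eps * np.sum(np.abs(a - a[c]), axis=1)   # eps * ||a_j - a_c||_1
    gains[c] = -np.inf                               # exclude the correct class
    j = int(np.argmax(gains))
    return eps * np.sign(a[j] - a[c]), j

def cross_entropy_direction(a, f, c, eps):
    """eps * sign of the gradient of -log softmax_c(f + a @ delta) at delta = 0."""
    grad = softmax(f) @ a - a[c]
    return eps * np.sign(grad)

# Toy linearization with 3 classes and 2 input dimensions; class 0 is correct.
a = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -2.0]])
f = np.zeros(3)
delta_max, target = max_of_k(a, c=0, eps=0.1)
delta_ce = cross_entropy_direction(a, f, c=0, eps=0.1)
```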

Derivation of Eq. (32): While we remain with as in the binary case, the other quantities change as , and , i.e. for each data pair we have . Let denote the entry of that corresponds to the -th and the -th output of the model (evaluated at and ). Then, with similar reasoning that led to Eq. (25) we now obtain:


For the prediction of the model on the whole dataset, we have:


which for a given sample gives:


where is equal to


Appendix C Transfer Results for Wide Two-Layer Networks

In this section, we present additional experimental details for Sec. 3.2 and show the results of the experiments on MNIST. We train two-layer neural networks of the form


where the first layer is initialized with the normal distribution, the second layer is frozen to its initial random values in , and denotes the width of the network. The NTK of this architecture is given by


We choose this family of models in order to be consistent with early works that analyzed training and generalization properties of neural networks in the NTK regime [Aro+19a, Du+19a]. We perform experiments on image classification on MNIST and on a binary task extracted from CIFAR-10 (car vs airplane). We train the networks in a regression fashion, minimizing the loss between the predictions and one-hot vectors, using full-batch gradient descent on the entire dataset (full training data for MNIST and 5K images for each of car and airplane in binary CIFAR). We keep the learning rate fixed to and vary the width of the network in . We train 3 networks for each dataset until convergence ( epochs), each initialized with a different random seed. When we measure quantities from the neural net, we subtract the initial prediction , since the NTK expression Eq. (16) does not take the initialization of the network into account. When attacking the models ( attacks), we use perturbation budget for MNIST and for CIFAR-10. The experiments are performed with PyTorch.
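As a sanity check on this setup, the sketch below compares a finite-width empirical NTK with a closed-form kernel. We assume a ReLU activation, a 1/sqrt(m) output scaling, second-layer weights in {-1, +1}, and the well-known closed form for a two-layer ReLU network whose first layer alone is trained (as analyzed in Aro+19a, Du+19a). The paper's own expression may differ in normalization, and the function names are ours.

```python
import numpy as np

def analytic_ntk(x1, x2):
    """x1.x2 * (pi - angle(x1, x2)) / (2 pi)."""
    cos = x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2))
    angle = np.arccos(np.clip(cos, -1.0, 1.0))
    return (x1 @ x2) * (np.pi - angle) / (2.0 * np.pi)

def empirical_ntk(x1, x2, w, a):
    """NTK of f(x) = (1/sqrt(m)) * sum_r a_r * relu(w_r . x), first layer only."""
    active1 = (w @ x1 > 0).astype(float)   # ReLU gates for x1
    active2 = (w @ x2 > 0).astype(float)   # ReLU gates for x2
    m = w.shape[0]
    return (x1 @ x2) * np.sum(a**2 * active1 * active2) / m

rng = np.random.default_rng(0)
d, m = 5, 200_000
w = rng.normal(size=(m, d))                # first layer at initialization
a = rng.choice([-1.0, 1.0], size=m)        # frozen second layer
x1 = rng.normal(size=d)
x2 = rng.normal(size=d)

ana = analytic_ntk(x1, x2)
emp = empirical_ntk(x1, x2, w, a)
```

At large width the two values agree up to Monte Carlo error, which is the regime in which attacks computed from the kernel are expected to transfer to the network.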


For each model, we calculate the loss gradients with respect to the input during training, and compare them to those derived for the NTK in Eqs. (18) and (21) for the binary and the multiclass task, respectively, using cosine similarity:


where is the NTK prediction defined in Eq. (16), denotes the output of the neural net and is the initial prediction of the neural net (prior to training). In order to match the time-scales, we manually align the networks on epoch = with a time-point for the NTK, and based on this number, we match the rest of the epochs assuming linear dependence (as theory predicts - Eq. (16)). Fig. 7(a) shows the cosine similarity of loss gradients and the robust accuracy of the network (evaluated against its own adversarial examples, and those from the NTK) for MNIST. Fig. 7(b) illustrates the similarity of loss gradients of neural nets and their NTKs for 3 different epochs.

(a) Comparison of NTK and neural net derived quantities for digit recognition (MNIST) during training.
(b) Loss gradients.
Figure 7: Visualizing the similarity of NTK and NN adversarial quantities. (a) Top. Cosine similarity between the loss gradient of the neural net and of the NTK prediction for the same time point (MNIST). Bottom. Robust accuracy of neural net against its own adversarial examples (solid) and corresponding NTK examples (dashed) for MNIST. Blue and red lines overlap in the second plot, and the effect of the random seed is insignificant. (b) Illustration of the similarity of loss gradients for NTK (right column) and neural nets of width (left) and (middle) for a specific image extracted from CIFAR-10. Columns show gradients for different epochs (, respectively).

Notice the very small discrepancy between the loss gradients of different networks (initialized with different random seeds) in Fig. 7(a). They are all centered around the loss gradient of the NTK, a manifestation of transferability of adversarial examples, at least for models with the same architecture. The NTK framework might provide a wider explanation of this phenomenon, also across architectures. For instance, for fully connected kernels, the NTK expression for kernels of depth is a relatively simple function of expressions for depth [JHG18, BiMa19], which could explain transferability across architectures of varying depth.

Appendix D NTK Features: Additional Details

In this section, we present additional material for Sec. 4; we show derivations that are missing from the main text, and complement the plots by showing the same information for more architectures and datasets.

D.1 Loss Gradient Decomposition

First, recall our definitions of features from Sec. 4. Let be a dataset, where and (binary classification). Then, kernel regression on this dataset gives predictions of the form . Given the eigendecomposition of the Gram matrix , we can decompose the prediction as follows


where . Notably, this means that the gradient of the cross entropy loss can be also understood as a composition of gradients coming from these features, as the following proposition shows.
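The decomposition can be checked numerically. The sketch assumes the infinite-time kernel regression predictor f(x) = k(x, X)^T K^{-1} y and defines the i-th feature as the contribution of the i-th eigenpair of the Gram matrix to that prediction. A Gaussian kernel is used as a stand-in for an NTK, and all names are hypothetical.

```python
import numpy as np

def kernel(a, b):
    """Toy positive-definite kernel (Gaussian); a stand-in for an NTK."""
    sq = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq)

rng = np.random.default_rng(0)
n, d = 12, 3
x_train = rng.normal(size=(n, d))
y_train = np.sign(x_train[:, 0])          # binary labels in {-1, +1}
x_test = rng.normal(size=d)

gram = kernel(x_train, x_train)
k_test = kernel(x_test[None, :], x_train)[0]

# gram = sum_i lambda_i v_i v_i^T, with eigenvectors in the columns of eigvecs.
eigvals, eigvecs = np.linalg.eigh(gram)

# i-th feature evaluated at x_test: lambda_i^{-1} * (k(x, X)^T v_i) * (v_i^T y).
features = (k_test @ eigvecs) * (eigvecs.T @ y_train) / eigvals

full_prediction = k_test @ np.linalg.solve(gram, y_train)
```

Summing the per-eigenpair features recovers the full prediction, so any quantity that is linear in the prediction, such as its input gradient, splits across features in the same way.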

Figure 8: Robustness Usefulness space for various kernels, MNIST multiclass. The axes lie in . "Useful" features have usefulness above 0.1 (the random guessing probability for our balanced data set). The colored red, green and yellow boxes are arbitrary, meant to visually distinguish useful-robust from other features.
Proposition 1.

The loss gradient of can be decomposed as follows:


where is a quantity that depends on .


Recall from Eq. (18), that . Then, we have


D.2 Additional Plots

Complementing Fig. 2 in the main text, we show (the first 100) NTK features in Robustness - Usefulness space defined in Sec. 4 for a larger number of architectures for both MNIST and CIFAR in Fig. 8 and 9. We use available analytical NTK expressions for standard FC{1,2,3,4,5} and CONV{1,2} architectures in the NTK regime to evaluate and decompose kernels on a subset of 10K MNIST training images and 10K binary CIFAR images - 5K cars and 5K airplanes. We note that within a dataset, the plots do not change much between architectures, speaking to the universal nature of these kernel-induced features.

Figure 9: Robustness Usefulness space for various kernels, CIFAR-10 car vs plane. The axes lie in . Fig. 2 in the main text shows FC2 and CONV 2. "Useful" features have usefulness (the random guessing probability for a binary balanced data set). The colored red, green and yellow boxes are arbitrary, meant to visually distinguish useful-robust from other features.
Figure 10: Robustness of kernel when keeping a few of the top robust features (MNIST). The Gram matrix is computed using 10k images from the training set. Blue lines show clean accuracy, red and purple (lines are overlapping) show accuracy against FGSM and PGD10 examples generated using the full kernel machines (consisting of all the features) and orange and green show the resulting robustness of the new model (FGSM and PGD10, respectively). Accuracy on the -axis lies in .

D.3 Robust Features Alone are not Enough

Feature definitions outlined in Sec. 4 open an avenue to use traditional feature selection methods to search for robust models. In particular, here we rank the features of an NTK based on their robustness on a validation set (accuracy against adversarial examples computed from the same feature - setting: FGSM with for MNIST or for CIFAR-10). Specifically, we test and rank each "one-feature kernel" function . Given this ranking, we construct a sequence of new kernels by progressively aggregating the most robust features with their original eigenvalues. This gives rise to kernel machines of the form , where indicates the number of top robust features kept. We present the results of this approach in Figures 10 (MNIST) and 11 (CIFAR-10), where we plot clean accuracy as well as robust accuracy against perturbation from the kernel itself as well as against "transfer" perturbations from the original (full) kernel.

On the binary classification task, some robustness can be gained by keeping the most robust features, and there seems to exist a sweet spot where the robustness is maximized (this seems to be consistent across other models as well). On multiclass MNIST, however, despite the relative simplicity of the dataset, we are not able to obtain non-trivial performance without compromising robustness. We conclude that it is unlikely that robust features (of standard models) alone are sufficient for robust classification, and that some form of data augmentation, such as adversarial training, seems necessary, at least for the models considered in our experiments.

Figure 11: Robustness of kernel keeping a few of the robust features (CIFAR car vs plane). The Gram matrix is computed using all 10K images from the training set. Blue lines show clean accuracy, red and purple (lines are overlapping) show accuracy against FGSM and PGD10 examples generated using the full kernel machines (consisting of all the features) and orange and green show the resulting robustness of the new model (FGSM and PGD10, respectively). Accuracy on the -axis lies in .

Appendix E Experimental Details for the Kernel Dynamics Section

Here we provide the details of our experiments in Sec. 5, where we compare standard and adversarial training by tracking several kernel quantities.

For experiments with MNIST, we use a simple convolutional architecture with 3 layers. The first 2 layers compute a convolution (with a 3×3 kernel), followed by a ReLU and then by an average pooling layer (of kernel size 2×2 and stride 2). The 3rd layer is fully-connected with a ReLU non-linearity, followed by a linear prediction layer with 10 outputs. The layers have width 32, 64 and 256, respectively.

For CIFAR-10, we use a deeper architecture consisting of 6 layers. Layers 1 and 2, 3 and 4, 5 and 6 are fully convolutional with 32, 64 and 128 channels, respectively, and a kernel of size 3×3. There is a max pooling operation after layer 2, and average pooling after the final layer, followed by a linear prediction layer. Both pooling operations use a kernel of size 2×2 and stride 2.

We use a fixed learning rate of for all experiments and no weight decay. We do not use any data augmentation, since we are interested in analyzing the behavior of kernels, rather than obtaining the best possible results. Stochastic gradient descent is used in all cases, with a batch size of 300 for MNIST and 250 for CIFAR-10. The kernel quantities are tracked for the same (first) batch during training. For adversarial training, we used either FGSM or PGD with 20 steps to generate the adversarial examples, against $\ell_\infty$ adversaries. The maximum perturbation size is set to and (for MNIST and CIFAR-10, respectively), and in the case of PGD training we use an attack step size of and , respectively. Experiments were run with JAX [Brad+18], and empirical NTKs were computed using the Neural Tangents Library [Nov+20]. Neural nets were trained using Flax [Hee+20] and the JaxOpt library [Blo+21], adapting code available from the JaxOpt repository. This code snippet was licensed under the Apache License, Version 2.0.

Figure 12: Training curves for networks trained in Sec. 5. From left to right: Standard training on MNIST, Adversarial (PGD-20) training on MNIST, Standard training on CIFAR-10, Adversarial (PGD-20) training on CIFAR-10. For each of the 4 settings, we show train/test accuracy on clean and on adversarially perturbed (PGD-20) data.

Models were trained for 200 epochs. Fig. 12 summarizes the performance of the networks during training. In Fig. 13, we show how norm concentration evolves during training - similar to the plots for CIFAR-10 in Fig. 5, but for MNIST and for two choices of eigenvalue index cut-off.

Figure 13: Concentration of norm during standard training vs adversarial training. Left: Concentration on top 10 (MNIST). Middle: Concentration on top 20 (MNIST). Right: Concentration on top 10 (CIFAR-10) (Fig. 5 in Sec. 5 shows concentration on top 20 for CIFAR-10). For MNIST, we observe that when performing adversarial training with just a one-step adversary (FGSM), the mass drops below the level of standard training. This is likely related to a phenomenon called catastrophic overfitting, which is widespread in simple FGSM training settings [Won+20].

Fig. 14 shows the polar dynamics for the top space (top 20 eigenvalues) of the kernel. For adversarial training, we observe little to no change compared to Fig. 6 in the main text, which showed the same information for the entire space, whereas for standard training there is less rotation in the top space. We take this as an indication that adversarial training modifies the “robust” (top) features of the kernel more than standard training.

Figure 14: Top-20 dynamics on polar space.

Finally, Fig. 15 shows the values within the kernel matrices before and after training for MNIST for standard and adversarial training. We draw the same conclusions as the main text, namely the “standard” kernel has significantly larger values than the “adversarial” one.

Figure 15: Kernel images for MNIST. Left to Right: Kernel at initialization, Kernel after standard training, Kernel after adversarial training (20 PGD steps). Notice that during training the values increase, but they increase substantially more for standard training. Also, observe that for adversarial training there is more spread between different classes. Each little square in the diagonal corresponds to a different class.

E.1 Linearized Adversarial Training

Motivated by the apparent laziness of the kernel during adversarial training and the findings of prior works [Gei+19, Fort+20] that considered linearization (with respect to the parameters) of the model after some epochs, we do the same for adversarial training.

Figure 16: Linearized Adversarial Training on CIFAR-10. Left: Linearized after 0 epochs. Middle: Linearized after 25 epochs. Right: Linearized after 50 epochs. Y-axis has range .

We include a small study that linearizes the kernel after a certain number of epochs. In particular, Fig. 16 shows the training behavior after linearizing the CIFAR-10 model after 25 and 50 epochs, and also at initialization. After linearization, we continue adversarial training in this simple linearized model (meaning we generate adversarial examples from the linear model). We observe that adversarial training continues, without a collapse of the training method. In comparison to non-linearized training (Fig. 12), training seems to stagnate. We also observe that the earlier we linearize, the greater the gap is between standard and robust performance. We leave the investigation of this intriguing phenomenon and a detailed comparison to standard training to future work.
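For reference, linearization with respect to the parameters is usually taken to mean a first-order Taylor expansion of the network output around the parameters at the linearization epoch. The notation below ($\theta_{t_0}$ for the parameters at that epoch, $f$ for the network output) is ours and may differ from the paper's.

```latex
f^{\mathrm{lin}}(x;\theta)
  \;=\; f(x;\theta_{t_0})
  \;+\; \nabla_\theta f(x;\theta_{t_0})^{\top} \,(\theta - \theta_{t_0})
```

The linearized model is linear in $\theta$ but still nonlinear in the input $x$, so adversarial examples can be generated from $f^{\mathrm{lin}}$ in the same way as from $f$. Its tangent kernel stays fixed at its value at $\theta_{t_0}$, which is what makes it a natural probe of the "lazy" phase.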