1 Introduction
The seminal work of Jacot et al. [JHG18] introduced the "Neural Tangent Kernel" (NTK) as the limit of neural networks as their widths approach infinity. Since this limit holds provably under certain initializations, and kernels are more amenable to analysis than neural networks, the NTK promises to be a useful reduction for understanding deep learning. It has thus initiated a rich research program that uses the NTK to explain various behaviors of neural networks, such as convergence to global minima [DZP+18, DLL+19], good generalization performance [ALL18, ADH+19a], the implicit bias of networks [TSM+20], and neural scaling laws [BDK+21a].
In addition to the infinite NTK, the empirical NTK (the kernel whose features are the gradients of a finite-width neural network) can be a useful object of study, since it approximates both the true neural network and the infinite NTK. It too has been studied extensively as a tool for understanding deep learning [FDP+20, LON21, PPG+21, OMF21].
In this work, we probe the upper limits of this research program: we want to understand the extent to which studying NTKs (empirical and infinite) can teach us about the success of neural networks. We study this question through the lens of scaling [KMH+20, RRB+19], that is, how performance improves as a function of samples and of training time, since scaling is an important "signature" of the mechanisms underlying any learning algorithm. We compare the scaling of real networks to the scaling of NTKs in the following ways.

Data scaling of initial kernel (Section 3): We show that both the infinite and empirical NTK (at initialization) can have worse data scaling exponents than neural networks, in realistic settings (see Figure 1). We find that this is robust to various important hyperparameter changes such as learning rate (in the range used in practice), batch size, and optimization method.

Width scaling of initial kernel (Section 3): Since neural networks provably converge to the NTK at infinite width, we investigate why the scaling behavior differs at finite width. We show (Figure 2) realistic settings where, as the width of the neural network increases to very large values, the test performance of the network gets worse and approaches the performance of the infinite NTK, unlike existing results in the literature which suggest that overparameterization is always good. This also raises new questions about the scaling of neural networks with width, in particular the "variance-limited" neural scaling regimes [BDK+21a].
Data scaling of after-kernel (Section 4): We consider the after-kernel [LON21], i.e., the empirical NTK extracted after training to completion on a fixed number of samples. We show (Figures 1(B) and 4) that the after-kernel continues to improve as we increase the training dataset size. On the other hand, we find (Figure 4) that the scaling exponent of the after-kernel, extracted after training on a fixed number of samples, remains worse than that of the corresponding neural network.

Time scaling (Section 5): We show (Figures 1(C) and 5) realistic settings where the empirical NTK continues to improve uniformly throughout most of training. This contrasts with prior work [FDP+20, OMF21, ABP21, LON21], which suggests that the empirical NTK changes rapidly at the beginning of training, followed by a slowing of this change.
We demonstrate that these phenomena occur in certain settings based on real, non-synthetic data and modern architectures (e.g., the CIFAR-10 and SVHN datasets with convolutional networks). While we do not claim that these phenomena manifest for all possible datasets and architectures, we believe our examples highlight important limitations of using the NTK to understand the test performance of neural networks. Formalizing the set of distributions or architectures for which these phenomena occur is an important direction for future theoretical research.

1.1 Comparison to Prior Work on NTK Generalization
Our main focus is to understand feature learning that occurs due to finite width. To do this, we make the following deliberate choices in all of our experiments: (a) we use the NTK parameterization, which ensures that infinite-width networks are equivalent to kernels; (b) we use the same optimization setup for the neural network, the empirical NTK, and the infinite NTK, which ensures that as width tends to infinity all three models have the same limit. We make sure that our comparisons are robust by (c) using scaling laws to compare these models and (d) performing various hyperparameter ablations (Figure 3).
Below we describe several lines of related works and how our work differs from them.
Small initialization and representation learning at infinite width.
Infinite-width neural networks in the NTK and standard initializations are equivalent to kernels [JHG18, YH21]. On the other hand, it has been shown [YH21, SS19, NP20, AOY19, FLY+20] that with small initialization, feature learning is possible at infinite width. The feature learning displayed in our experiments is not due to small initialization, as we initialize our networks in the NTK parameterization. This was a deliberate choice: we are interested in feature learning that occurs due to finite width, as this is the kind of feature learning displayed by neural networks used in practice (which usually do not have a small initialization).
Data Scaling for NTKs and neural networks.
Scaling laws have been shown empirically [KMH+20, RRB+19] for neural networks and proven theoretically [BCP20, CBP21, BP22] for NTKs under natural assumptions. Comparisons between the scaling laws of neural networks and empirical NTKs have previously been made by Paccolat et al. [PPG+21] and Ortiz-Jiménez et al. [OMF21], and both find that neural networks have better scaling than the empirical NTK at initialization. Neither paper compares to infinite NTKs, which leaves open the possibility that neural networks and infinite-width NTKs behave the same with respect to their scaling constants.
Pointwise comparisons of neural networks and corresponding infinite NTKs.
Such comparisons have also been studied extensively in the literature [ADH+19b, LSP+20, SDD21], but the results have been divided. As discussed earlier, we focus on comparing scaling laws. We argue that scaling laws, rather than pointwise comparisons, are the appropriate tool for comparing neural networks and NTKs. Practically, pointwise comparisons between any two models can be fraught with issues, as the ordering can flip depending on dataset size as well as the specific choice of hyperparameters. Scaling exponents, on the other hand, have been found to be more robust to the choice of hyperparameters [BGG+22, KMH+20]. More importantly, the claim that the NTK captures "most" of the performance of the neural network can be subjective, especially when comparing small error or loss values. We show that when we look closely at the scaling exponents of these objects instead, we find major differences.
Theoretically studied effects of finite width with respect to the NTK regime.
Finite-width corrections to NTK theory have been studied by [AD20, RYH21, BDK+21b]. While these results do not require infinite widths, they still require widths much larger than those used in practice, particularly for the training-set sizes used in practice. These papers either (a) consider the finite-width corrections of the empirical NTK, or (b) allow the NTK to change but predict that the higher-order analogues of the empirical NTK remain constant. For (a), we show that the empirical NTK is very far from the performance of finite-width neural networks. Regarding (b), in Appendix C we show that the higher-order analogues of the empirical NTK change significantly.
After-kernel and time dynamics.
We describe other related works in Appendix F.
2 Experimental Methodology
Here we describe the common methodology used in our experiments.
The core object we want to understand is the data-scaling law of real neural networks; that is, what is their asymptotic performance as a function of the number of train samples? Concretely, in this work we restrict to classification problems, where we measure performance in terms of test classification error. For a given classification algorithm, let E(n) be its learning curve: its expected test error as a function of the number of samples n. In practice, many neural networks exhibit power-law decay in their learning curves [KMH+20]. In such settings, we have E(n) ≈ c · n^{-α}, and we are interested primarily in the scaling exponent α, which determines the asymptotic rate of convergence.
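As a concrete illustration, such scaling exponents are typically estimated by a linear fit on log-log axes. The sketch below (our illustration on synthetic data, not the paper's fitting code) recovers a known exponent; it assumes the irreducible-error floor is negligible over the measured range of n:

```python
import numpy as np

def fit_scaling_exponent(ns, errs):
    """Fit E(n) ~ c * n**(-alpha) by least squares on log-log axes.

    Returns (alpha, c). Assumes power-law behavior holds over the range of n,
    i.e. any irreducible-error floor is negligible.
    """
    slope, intercept = np.polyfit(np.log(ns), np.log(errs), 1)
    return -slope, np.exp(intercept)

# Synthetic learning curve with known exponent alpha = 0.3 and constant c = 2.0
ns = np.array([1e3, 2e3, 4e3, 8e3, 16e3, 32e3])
errs = 2.0 * ns ** -0.3
alpha, c = fit_scaling_exponent(ns, errs)
```

On real learning curves one would fit only the power-law region and report uncertainty, e.g. via bootstrap over seeds, as the error bars in the figures suggest.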
Empirical and Infinite NTK
Let f(w, x) be a neural network with w representing the weights and x an input. By Taylor expansion around weights w0 we have:

f(w, x) ≈ f(w0, x) + ⟨∇_w f(w0, x), w − w0⟩ + O(‖w − w0‖²).

The empirical NTK of the neural network around weights w0 refers to the model g(w, x) = ⟨∇_w f(w0, x), w⟩. We note that this is not the same as linearizing the network, as we are omitting the f(w0, x) term. The empirical NTK is a linear model with respect to the weights w. The infinite NTK refers to the limit of the empirical NTK of the network around initial weights as width tends to infinity.
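To make the definition concrete, the toy sketch below computes empirical-NTK features as weight-gradients of a tiny two-layer network and forms the resulting Gram matrix. The architecture, sizes, and the use of finite differences are our illustration (the paper extracts these kernels with the [NXH+20] library, not like this):

```python
import numpy as np

def net(w, x, d=3, h=8):
    """Tiny two-layer network f(w, x); the 1/sqrt(h) factor mimics NTK parameterization."""
    W1 = w[: d * h].reshape(h, d)
    w2 = w[d * h:]
    return w2 @ np.tanh(W1 @ x) / np.sqrt(h)

def entk_features(w0, x, eps=1e-5):
    """Feature map x -> grad_w f(w0, x), here by central finite differences."""
    g = np.zeros_like(w0)
    for i in range(len(w0)):
        e = np.zeros_like(w0)
        e[i] = eps
        g[i] = (net(w0 + e, x) - net(w0 - e, x)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
d, h = 3, 8
w0 = rng.normal(size=d * h + h)
xs = rng.normal(size=(5, d))
phi = np.stack([entk_features(w0, x) for x in xs])  # empirical NTK features
K = phi @ phi.T                                     # empirical NTK Gram matrix
```

Since K is a Gram matrix of gradient features, it is symmetric and positive semi-definite by construction, which is what makes kernel-regression analysis applicable.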
For a given learning problem and a given neural network architecture NN, we want to understand its data-scaling law E_NN(n). We consider the infinite NTK of the NN and the empirical NTK of the NN at initialization, with corresponding learning curves E_NTK(n) and E_ENTK(n). Now we ask: is the scaling exponent of E_NN always close to the scaling exponent of either E_NTK or E_ENTK, in realistic settings? That is, how well does the NTK approximation capture the generalization of real networks on natural distributions?
Recall that this question is especially interesting because the three objects involved (neural network, NTK, and empirical NTK) all become provably equivalent in the appropriate width limit. Thus, at infinite width we know their scaling laws must be equivalent. The question is then: how far are we from this limit in practice? Are the widths used in practice large enough for their scaling behavior to be captured by the infinite-width limit? To probe these questions, we empirically study the scaling laws of these methods on image-classification problems.
Remark on comparisons.
We intentionally compare a neural network only to its corresponding NTK, and not to other kernels. Our motivation is not to address the question "can (some) kernel perform as well as a given neural network?"; indeed, there may be better kernels to consider than the NTK. Rather, our goal is to study the specific kernels given by the NTK approximation, in correspondence with real networks.
Datasets.
We use the following datasets:

A 2-class subset (dog, horse) of the CIFAR-5m [NNS21] dataset, as a binary classification problem, which we denote CIFAR-5m-bin. This is a dataset of synthetic but realistic RGB images similar to CIFAR-10, generated using a generative model.

A binary classification task on the SVHN dataset [NWC+11] with the labels being the parity of the digit, denoted SVHN-parity. For the training data we use a balanced random subset of the 'train' and 'extra' partitions; for the test data we use the 'test' partition.
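A balanced parity subset of the kind described above could be constructed as in the following sketch. The selection logic is our illustration, not the paper's released code; `digits` stands in for the SVHN digit labels:

```python
import numpy as np

def balanced_parity_subset(digits, n, seed=0):
    """Return indices of a class-balanced subset of size n for the
    parity (even vs. odd digit) binary task."""
    rng = np.random.default_rng(seed)
    parity = digits % 2
    even = rng.permutation(np.flatnonzero(parity == 0))[: n // 2]
    odd = rng.permutation(np.flatnonzero(parity == 1))[: n // 2]
    return np.concatenate([even, odd])

# Stand-in for SVHN digit labels (0-9)
digits = np.random.default_rng(1).integers(0, 10, size=10_000)
idx = balanced_parity_subset(digits, 1000)
```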
We focus on the CIFAR-5m-bin experiments in the main body. Corresponding SVHN-parity experiments can be found in Appendix E.
We use these particular datasets because we need datasets with a large number of samples in order to measure data scaling, and CIFAR-5m-bin and the SVHN dataset both have > 600k samples. We chose binary tasks because this makes the kernel experiments computationally feasible. Moreover, although there are other datasets with similar sample sizes (e.g., ImageNet), the datasets we use have the advantage of being low-resolution and easier tasks; thus, scaling-law experiments are far more computationally feasible. We also run some experiments on a synthetic dataset in Appendix D.

Architectures.
We use the following base architectures: a Myrtle CNN [PAG18, SFG+20] with 64 channels in the first layer for the CIFAR-5m-bin task, and a 5-layer CNN with 64 channels for the SVHN-parity task. We consider various width scalings of these networks: we vary the width from 16 to 1024 for the Myrtle CNN and from 16 to 4096 for the 5-layer CNN. See Appendix A for more details.
Experimental Details.
We describe some subtleties of the experimental setup. We use the NTK parameterization [JHG18] for both the neural network and the kernels, as this is the parameterization used in proving the equivalence of the neural network and the NTK at infinite width. We train with MSE loss. All of our networks are in the overparameterized regime, i.e., they are able to reach 0 train error. To preserve the correspondence between the neural networks, empirical NTKs, and infinite NTKs, we train all of them with SGD with momentum using the same hyperparameters. This also ensures that in all experiments the neural networks are trained below the critical learning rate, i.e., the learning rate at which training of the empirical and infinite NTK can converge (see also Appendix A.3). Training of the empirical NTK is done by linearizing the initialized neural network using the [NXH+20] library, while for the infinite NTK we directly run SGD with momentum on the linear system given by the infinite NTK and the labels. We describe further experimental details for each individual experiment in Appendix A. Experiments were run on a combination of Nvidia V100 and A40 GPUs on the FASRC Cannon cluster supported by the FAS Division of Science Research Computing Group at Harvard University.
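Training a kernel "directly on the linear system given by the kernel and the labels" can be sketched as below. This is a minimal full-batch heavy-ball version on a stand-in RBF kernel (the paper uses SGD with momentum on the actual NTK; all names, sizes, and hyperparameters here are illustrative):

```python
import numpy as np

def train_kernel_momentum(K, y, lr=0.1, beta=0.9, steps=500):
    """Full-batch gradient descent with momentum on the kernel linear system:
    minimize 0.5 * ||K @ alpha - y||**2 over the dual coefficients alpha."""
    alpha = np.zeros_like(y)
    v = np.zeros_like(y)
    for _ in range(steps):
        grad = K @ (K @ alpha - y)   # gradient of the quadratic objective
        v = beta * v + grad          # momentum (heavy-ball) buffer
        alpha = alpha - lr * v
    return alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
K = np.exp(-((X[:, None] - X[None, :]) ** 2).sum(-1))  # stand-in PSD kernel
y = rng.choice([-1.0, 1.0], size=20)                   # binary +/-1 labels
alpha = train_kernel_momentum(K, y)
preds = K @ alpha
```

As in the paper's setup, the learning rate must stay below the critical value set by the top eigenvalue of the (squared) kernel for this iteration to converge; here the overparameterized fit drives the train error to essentially zero.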
3 Data Scaling Laws of Neural Networks and NTKs in the Overparameterized Regime
In this section we compare the data-scaling laws of neural networks to their corresponding empirical NTKs and infinite NTKs. Our main claim is the following.
Claim 3.1.
There exists a natural setting of task and network architecture such that the neural network trained with SGD has a better scaling constant than its corresponding infinite and empirical NTK at initialization. Further, this gap in scaling continues to hold over a wide range of widths and learning rates used in practice.
The above claim can be interpreted as stating that there exist natural settings where the regime in which real neural networks are trained is meaningfully separated from the NTK regime, and real neural networks have a better scaling law.
In Figure 1(A), we train a Myrtle CNN [PAG18, SFG+20], its empirical NTK at initialization, and its infinite NTK on the CIFAR-5m-bin task. In each case, we train to fit the train set with SGD and optimal early stopping. We then numerically fit scaling laws and find scaling exponents of 0.185 (empirical NTK), 0.213 (infinite NTK), and 0.307 (neural network). Thus, in this image-classification setting, the real neural network significantly outperforms its corresponding NTKs with respect to data scaling. See Appendix A for full experimental details.
We now investigate how robust this result is to changes in the width of the architecture and optimizer, within realistic bounds.
Figure 2 caption: we observe that (a) the empirical NTK performance continues to improve with width, moving towards the infinite NTK performance, while (b) neural network performance improves initially and then starts to deteriorate towards the infinite NTK performance. Error bars represent estimated standard deviation. See Appendix A for more details.

Effect of Width.
We explore the effect of width. In Figure 2 we train neural networks with widths much smaller (16) and much larger (1024) than the width (64) used in Figure 1(A). We find that these networks behave similarly with respect to their scaling constants (0.276 and 0.279, respectively) and perform better than the infinite-width NTK (scaling constant 0.213), confirming that real neural networks are far from the NTK regime. However, we know that in the truly infinite-width limit, all these methods will perform identically. Moreover, as mentioned in Section 2, we are careful to ensure this limit is preserved by our optimization and initialization setup. This implies that at some point, increasing the width of the real network will start to hurt performance, although it may be computationally infeasible to observe such large widths.
To explore the width dependency, in Figure 2 we plot the expected performance of the empirical NTK at initialization and of the neural network as we increase the width, using a fixed training size of 4000. Here we see that (a) the empirical NTK at initialization continues to improve with larger width and approaches the infinite NTK's error from above, while (b) the neural network improves initially and then starts to deteriorate and approach the infinite NTK's error from below. In Figure 2 we repeat the experiment for the SVHN-parity task. In this setup it was computationally feasible to try much larger widths (up to 4096) with a smaller training size of 1000. Hence in this experiment, as the width increases, we can observe a stronger deterioration of the performance of the neural network towards the infinite NTK performance.
Together these results suggest that “intermediate” widths (not too large, not too small) are important for the performance of real overparameterized neural networks, and any explanatory theory must be consistent with this.
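The convergence of an empirical kernel to its infinite-width counterpart, which underlies the behavior above, can be illustrated in a toy random-features model where the limit has a closed form. This is an analogy to the ENTK-to-NTK convergence, not the paper's architecture: for random ReLU features the infinite-width kernel is the first-order arc-cosine kernel, and the empirical kernel's deviation from it shrinks roughly like 1/sqrt(width):

```python
import numpy as np

def relu_feature_kernel(x, y, width, rng):
    """Empirical kernel of width-h random ReLU features:
    (1/h) * sum_i relu(w_i . x) * relu(w_i . y), with w_i ~ N(0, I)."""
    W = rng.normal(size=(width, x.size))
    return np.mean(np.maximum(W @ x, 0.0) * np.maximum(W @ y, 0.0))

def limit_kernel(x, y):
    """Infinite-width limit: the first-order arc-cosine kernel,
    (|x||y| / 2pi) * (sin(theta) + (pi - theta) * cos(theta))."""
    nx, ny = np.linalg.norm(x), np.linalg.norm(y)
    theta = np.arccos(np.clip(x @ y / (nx * ny), -1.0, 1.0))
    return nx * ny * (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / (2 * np.pi)

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)
target = limit_kernel(x, y)
# Mean absolute deviation from the limit, averaged over 20 random feature draws
err = {h: np.mean([abs(relu_feature_kernel(x, y, h, rng) - target) for _ in range(20)])
       for h in (64, 8192)}
```

The wider model's empirical kernel sits much closer to the analytic limit, mirroring how the ENTK approaches the infinite NTK with width while a finite-width trained network need not.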
Effect of Learning Rate.
We now study how robust our results are to changes in the learning rate, within practically used bounds. Note that changing the learning rate only affects the neural network training, and does not affect any of the corresponding NTKs. In Figure 3 we train networks in the same setting as Figure 2, but with varying learning rates. We find that after moderate modifications of the learning rate, the neural network still has a better scaling law than the infinite and empirical NTK at initialization, suggesting that practically used learning rates (for practically used widths) are far from the NTK regime. Figure 3 reports the scaling constants for the 3x higher learning rate, the 10x lower learning rate, and the infinite NTK. We discuss the effects of more drastic changes (1000x) in the learning rate in Appendix B.
Other Changes in Optimization.
We now study whether our results hold under other changes to the optimization. In Figure 3 we see the effects of using GD instead of SGD, of training without momentum, and of using the final test error instead of optimal early stopping, respectively. We see that in all of these cases, while there is some change in the scaling laws, the neural network scaling constant is still always better than that of the infinite NTK. The scaling constants for the neural network in these three settings are 0.294, 0.310, and 0.292, respectively. The scaling constant for the infinite NTK is 0.213 in the first two settings and 0.219 in the third. This suggests that these optimization factors are not the fundamental reason behind the improved scaling laws of neural networks.
Various extensions of the NTK regime have been proposed in the literature [RYH21, AD20] which allow the empirical NTK to change but posit that higher-order analogues of the NTK remain constant. This would predict that higher-order analogues of the empirical NTK at initialization are sufficient to match the performance of neural networks. In Appendix C we show that this is not the case, suggesting that these theories may also not be sufficient to explain the performance of practical neural networks.
Discussion and Future Questions.
The equivalence between neural networks and their corresponding NTKs applies when the width is much larger than the number of training samples. On the other hand, nearly all overparameterized networks on natural tasks fall in the regime where the width is much smaller than the number of samples (though the width is still large enough to fit the dataset). The results of this section, which show separations between neural networks in the latter regime and NTKs, lead to the following concrete question on the gap between theory and practice, which could guide future work.
Question 3.1.
How can we understand the behavior of overparameterized networks in the regime where the width is much smaller than the number of samples?
4 Exploration of the After-Kernel with respect to Dataset Size
In the previous section, we studied the empirical NTK linearized around the weights at initialization. In this section we study the behaviour of the empirical NTK linearized around the weights obtained at the end of training. This is known as the after-kernel, in the terminology of [LON21]. We will show, in the more precise sense defined below, that (1) the after-kernel continues to improve with dataset size, and thus (2) no fixed-time after-kernel is sufficient to capture the data scaling law of its corresponding neural network.
Formally, denote the after-kernel obtained from the neural network trained on n samples as AK_n. We denote the accuracy of AK_n when fit on m samples as AK_n(m); here, the m samples are a subset of the original n samples. When we use fresh samples to fit AK_n, we use the notation AK_n^{fresh}(m). We study the after-kernel because the improved performance of neural networks over NTKs has been attributed [OMF21, ABP21] to the adaptation of the empirical NTK of the neural network to the task. Concretely, prior works [LON21, PPG+21] have shown that this explanation is complete in the following sense: the behaviour of AK_n(n) is similar to that of the neural network fit on n samples. In other words, when we fit an after-kernel obtained from training on n samples to the same n samples, we get an accuracy very close to that of the neural network fit on the same samples. (We again note that the empirical NTK does not refer to the linearization of the network; see Section 2 for an exact definition. If we had linearized the network, this statement would be trivially true, as the linearized network around the final weights would start out with an accuracy matching that of the trained neural network.) We verify this for our setup in Figure 4. This tells us that the following two factors are sufficient to explain the behaviour of neural networks fit on n samples: (1) the change in the empirical NTK, from the empirical NTK around the initial weights to the after-kernel, due to training on n samples; and (2) fitting the after-kernel on the n samples.
What this does not tell us is how these two improvements scale with the training size n. In particular, we know that the empirical NTK at initialization fit on n samples does not match the neural network trained on n samples, while the after-kernel trained and fit on the same n samples does. This raises the following natural question: how data-dependent does the kernel need to be to recover the performance of the neural network? For example, it is possible that for some sample size n0 and all n ≥ n0, the after-kernel is roughly constant, and has the same scaling law as the neural network itself. We find that this is not the case: the after-kernel continuously improves with the dataset size n.
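The after-kernel procedure described above can be sketched end-to-end on a toy problem: train a small network, extract empirical-NTK features at the trained weights, then fit them on a subset of the samples. Everything here (architecture, sizes, finite-difference gradients, the min-norm least-squares fit) is our illustration, not the paper's pipeline:

```python
import numpy as np

def net(w, X, d=2, h=16):
    """Tiny two-layer network; 1/sqrt(h) mimics NTK parameterization."""
    W1, w2 = w[: d * h].reshape(h, d), w[d * h:]
    return np.tanh(X @ W1.T) @ w2 / np.sqrt(h)

def feature_grads(w, X, eps=1e-5):
    """Per-example gradients of the output w.r.t. the weights (ENTK features)."""
    G = np.zeros((len(X), len(w)))
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        G[:, i] = (net(w + e, X) - net(w - e, X)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
d, h, n, m = 2, 16, 40, 20
X = rng.normal(size=(n, d))
y = np.sign(X[:, 0])                      # toy binary target with +/-1 labels
w = rng.normal(size=d * h + h)

mse_init = np.mean((net(w, X) - y) ** 2)
for _ in range(2000):                     # full-batch gradient descent on MSE
    resid = net(w, X) - y
    w = w - 0.1 * feature_grads(w, X).T @ resid / n
mse_final = np.mean((net(w, X) - y) ** 2)

# After-kernel: ENTK features at the *trained* weights, fit on m <= n samples
Phi = feature_grads(w, X[:m])
coef, *_ = np.linalg.lstsq(Phi, y[:m], rcond=None)  # min-norm fit = kernel regression
ak_preds = Phi @ coef
```

The min-norm least-squares fit on the gradient features is equivalent to interpolating kernel regression with the after-kernel; in the overparameterized setting it fits the m samples exactly.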
4.1 Experimental Results
After-kernel continues to improve with dataset size.
Fixed after-kernel is not sufficient to capture neural network data scaling.
In Figure 4 we plot the data scaling curves for the base Myrtle CNN, its empirical NTK at initialization, and after-kernels extracted after training on fixed sample sizes, together with their fitted scaling constants. We find that the neural network has the best scaling constant. This shows that the scaling of the after-kernel with the training size is an important component of neural network scaling laws: even the after-kernel learnt with 64k samples (on the simple CIFAR-5m-bin task) is not sufficient to explain the data scaling of neural networks. We also see that the after-kernel extracted from a larger training set outperforms the one extracted from a smaller training set, further evidence that the after-kernel improves with dataset size.
5 Time Dynamics
In the previous section, we saw that the change in the empirical NTK from initialization to the end of training (the after-kernel) is sufficient to explain the improved performance of neural networks. Thus the empirical NTK must evolve throughout training, and in this section we take a closer look at this evolution. Our main focus is to investigate the following informal proposal from the literature [FDP+20, LON21] about how the empirical NTK evolves:
Hypothesis 5.1 (Informal, from [FDP+20, LON21]).
The empirical NTK evolves rapidly in the beginning of training (the first few epochs), but then undergoes a "phase transition" into a slower regime.
One way to interpret the above hypothesis is that there is both a qualitative and quantitative difference in the empirical NTK between the “early phase” of training (the first few epochs) and the later stage of training. This is called a “phase transition” in the literature, in analogy to physics, where systems undergo discontinuities between two regimes with quantitatively different dynamics.
In this section we give evidence suggesting, contrary to prior work, that there is no such "phase transition". We show that if empirical NTK performance is measured at the appropriate scale, performance appears to improve continuously throughout training (from early to late stages), at approximately the same "rate". Our experiments are in fact compatible with the experiments in prior work (e.g. [FDP+20]): we simply observe that if performance and time are measured on a log-log scale (as is appropriate for measuring multi-scale dynamics), then the NTK is seen to improve continuously throughout most of training.
5.1 Experiments
Setup.
We now describe the setup more formally. Let K_t refer to the model given by the empirical NTK (as described in Section 2) extracted at time t of training and fit to the whole training data, where we measure time in terms of the number of SGD batches seen in optimization. Prior works [FDP+20, LON21] have used the slope of the curve of the test error of K_t versus t to decide whether the kernel is changing rapidly. We will do the same, with one crucial difference: we measure this slope on a log-log plot instead of directly plotting test error against time. We do this because scaling laws with respect to time (or tokens processed) have been observed empirically [KMH+20] for neural networks on natural language tasks, and proven formally for kernels on natural tasks [VY21]. These results suggest the need for log-log plots to observe qualitative phase transitions in training dynamics.
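The measurement this setup calls for, local slopes of the error curve on log-log axes, can be sketched as follows. A genuine "phase transition" would show up as an abrupt change in the local slopes; a single power law gives the same slope everywhere. The windowing choice is our illustration:

```python
import numpy as np

def local_loglog_slopes(ts, errs, window=5):
    """Slope of log(error) vs log(time) in sliding windows.
    Constant local slopes indicate a single power-law regime."""
    lt, le = np.log(ts), np.log(errs)
    return np.array([np.polyfit(lt[i:i + window], le[i:i + window], 1)[0]
                     for i in range(len(ts) - window + 1)])

ts = np.arange(10, 1000, 10).astype(float)
errs = ts ** -0.4          # synthetic curve: one power-law regime, no transition
slopes = local_loglog_slopes(ts, errs)
```

On a linear scale the same curve would look like a fast early drop followed by a slowdown, which is exactly the ambiguity the log-log view resolves.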
Results. Our main claim is that the test error of K_t as a function of the time t is approximately linear on a log-log scale throughout the course of training. Recall that K_t is the model obtained by extracting the empirical NTK after t batches of training the real neural network and fitting it to the training data.
In Figure 5 we compare the test error of the base Myrtle CNN at time t, the test error of the empirical NTK at initialization trained for time t, and the test error of K_t, all trained with the same hyperparameters as in Figure 1. Since we want to probe Hypothesis 5.1, which is about the beginning of training, we plot these quantities until the train error reaches 0 (which requires 32 epochs in our experiments). This should be sufficient to cover any reasonable definition of "beginning of training".
In Figure 5 we do not observe a "phase transition" after which the improvement in the kernel test error (in red) slows down. In fact, the kernel starts out essentially constant, then begins to improve and continues to improve uniformly.
We instead observe the following two regimes (in the initial part of training):

In the first regime (before the dashed vertical line) the empirical NTK at initialization and the neural network have very similar behaviour, and the extracted kernel is nearly constant. This regime lasts only around 140 batches, i.e., about 0.5 epochs. (Note that this means that if we plotted per epoch, this phase would not be visible at all.)

In the next regime (after the dashed line) the empirical NTK at initialization and the neural network diverge. As they diverge, the extracted kernel also starts to improve with a constant slope, and this improvement continues uniformly until the terminal stage of training.
Importantly, the kernel does not transition into a "slower phase" of learning at any point. (As we keep training, at some point the train loss will tend to 0 and the extracted kernel will converge to a fixed value. This does not affect our results, as we are only interested in the initial part of training, as described in Hypothesis 5.1.) Corresponding SVHN experiments can be found in Appendix E.
Next, we measure the performance of the extracted kernels in terms of their data scaling laws. Due to computational limitations (since measuring data scaling is expensive), we can only measure the scaling law for several selected values of the time t, instead of for every batch (as in Figure 5).
In Figure 5 we plot the data scaling of the kernels extracted at several selected times t (up to 32 epochs), in the same setup as Figure 5. We also plot the data scaling of the base Myrtle CNN with the same hyperparameters. As in Figure 4 of Section 4, we again observe that the neural network has the best scaling law, outperforming all of the extracted kernels. This shows that the representations learnt after any fixed amount of training are not sufficient to explain the data scaling of neural networks. Rather, these representations improve throughout training, and the entire course of training must be considered to recover the correct scaling law.
6 Conclusions
In this work, we compared the data-scaling properties of neural networks to those of their corresponding infinite NTKs and empirical NTKs (both at initialization and after various amounts of training). We found that the kernels do not scale as well as neural networks, even when the kernels are allowed to be partially "pre-trained" themselves. Since scaling laws capture the sample efficiency of a training algorithm, they are an important indicator of the similarities or differences between two methods. Thus, our results suggest that while the NTK is a good starting point for understanding deep learning, there are important aspects that it does not accurately capture, in particular data-scaling laws.
Acknowledgments
NV is supported by NSF CCF1909429. YB is supported by a Siebel Scholarship. PN is supported by NSF and the Simons Foundation for the Collaboration on the Theoretical Foundations of Deep Learning through awards DMS2031883 and #814639, by NSF through IIS1815697 and by the TILOS institute (NSF CCF2112665). The computations in this paper were run on the FASRC Cannon cluster and were supported by the FAS Division of Science Research Computing Group at Harvard University, Simons Investigator Fellowship, NSF grants DMS2134157 and CCF1565264 and DOE grant DESC0022199.
References

[AP20]
(2020)
The neural tangent kernel in high dimensions: triple descent and a multiscale theory of generalization.
In
Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 1318 July 2020, Virtual Event
, Proceedings of Machine Learning Research, Vol. 119, pp. 74–84. External Links: Link Cited by: Appendix F.  [ALL18] (2018) Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918. Cited by: §1.

[AD20]
(2020)
Asymptotics of wide convolutional neural networks
. CoRR abs/2008.08675. External Links: Link, 2008.08675 Cited by: §1.1, §3.  [AOY19] (2019) A meanfield limit for certain deep neural networks. arXiv. External Links: Document, Link Cited by: §1.1.
 [ADH+19a] (2019) Finegrained analysis of optimization and generalization for overparameterized twolayer neural networks. In International Conference on Machine Learning, pp. 322–332. Cited by: §1.
 [ADH+19b] (2019) On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 814, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’AlchéBuc, E. B. Fox, and R. Garnett (Eds.), pp. 8139–8148. External Links: Link Cited by: §1.1.
 [ABP21] (2021) Neural networks as kernel learners: the silent alignment effect. External Links: 2111.00034 Cited by: Appendix F, item 4, §4.
 [BDK+21a] (2021) Explaining neural scaling laws. arXiv preprint arXiv:2102.06701. Cited by: Appendix F, item 2, §1.
 [BDK+21b] (2021) Explaining neural scaling laws. CoRR abs/2102.06701. External Links: Link, 2102.06701 Cited by: §1.1.
 [BGG+22] (2022) Data scaling laws in NMT: the effect of noise and architecture. CoRR abs/2202.01994. External Links: Link, 2202.01994 Cited by: §1.1.
 [BHM+19] (2019) Reconciling modern machine learning practice and the biasvariance tradeoff. External Links: 1812.11118 Cited by: Appendix F.
 [BCP20] (2020) Spectrum dependent learning curves in kernel regression and wide neural networks. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13–18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 1024–1034. External Links: Link Cited by: §1.1.
 [BP22] (2022) Learning curves for SGD on structured features. In International Conference on Learning Representations, External Links: Link Cited by: §1.1.
 [CBP21] (2021) Spectral bias and taskmodel alignment explain generalization in kernel regression and infinitely wide neural networks. Nature communications 12 (1), pp. 1–12. Cited by: §1.1.
 [dSB20] (2020) Triple descent and the two kinds of overfitting: where & why do they appear?. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Link Cited by: Appendix F.
 [DM20] (2020) Learning parities with neural networks. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Link Cited by: Appendix F.
 [DLL+19] (2019) Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pp. 1675–1685. Cited by: §1.
 [DZP+18] (2018) Gradient descent provably optimizes overparameterized neural networks. arXiv preprint arXiv:1810.02054. Cited by: §1.
 [DG20] (2020) Asymptotics of wide networks from Feynman diagrams. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020, External Links: Link Cited by: §A.3.
 [FLY+20] (2020) Modeling from features: a mean-field framework for overparameterized deep neural networks. arXiv. External Links: Document, Link Cited by: §1.1.
 [FDP+20] (2020) Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the neural tangent kernel. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Link Cited by: Appendix F, item 4, §1, §5.1, Hypothesis 5.1, §5, §5.
 [GSd+19] (2019) Jamming transition as a paradigm to understand the loss landscape of deep neural networks. Physical Review E 100 (1). External Links: ISSN 2470-0053, Link, Document Cited by: Appendix F.
 [GMM+20] (2020) When do neural networks outperform kernel methods?. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Link Cited by: Appendix F.
 [JHG18] (2018) Neural tangent kernel: convergence and generalization in neural networks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3–8, 2018, Montréal, Canada, pp. 8580–8589. External Links: Link Cited by: Limitations of the NTK for Understanding Generalization in Deep Learning, §1.1, §1, §2.

[KMH+20] (2020) Scaling laws for neural language models. arXiv preprint arXiv:2001.08361. Cited by: §1.1, §1, §2, §5.1.
 [KWL+21] (2021) Local signal adaptivity: provable feature learning in neural networks beyond kernels. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6–14, 2021, virtual, M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan (Eds.), pp. 24883–24897. External Links: Link Cited by: Appendix F.
 [LSP+20] (2020) Finite versus infinite neural networks: an empirical study. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6–12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Link Cited by: Appendix F, §1.1.
 [LXS+19] (2019) Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8–14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 8570–8581. External Links: Link Cited by: §A.3.
 [LZG21] (2021) Towards an understanding of benign overfitting in neural networks. CoRR abs/2106.03212. External Links: Link, 2106.03212 Cited by: Appendix F.
 [LP21] (2021) Provable convergence of nesterov accelerated method for overparameterized neural networks. CoRR abs/2107.01832. External Links: Link, 2107.01832 Cited by: §A.3.
 [LON21] (2021) Properties of the after kernel. CoRR abs/2105.10585. External Links: Link, 2105.10585 Cited by: Appendix F, Appendix F, item 3, item 4, §1, §4, §4, §5.1, Hypothesis 5.1, §5.
 [NKB+20] (2020) Deep double descent: where bigger models and more data hurt. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020, External Links: Link Cited by: Appendix F.
 [NNS21] (2021) The deep bootstrap framework: good online learners are good offline generalizers. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3–7, 2021, External Links: Link Cited by: item 1.
 [NWC+11] (2011) Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011. Cited by: item 2.
 [NP20] (2020) A rigorous framework for the mean field limit of multilayer neural networks. CoRR abs/2001.11443. External Links: Link, 2001.11443 Cited by: §1.1.
 [NXH+20] (2020) Neural tangents: fast and easy infinite neural networks in python. In International Conference on Learning Representations, External Links: Link Cited by: §A.1, §2.
 [OMF21] (2021) What can linearized neural networks actually say about generalization?. In Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, External Links: Link Cited by: Appendix F, item 4, §1.1, §1, §4.
 [PPG+21] (2021) Geometric compression of invariant manifolds in neural networks. Journal of Statistical Mechanics: Theory and Experiment 2021 (4), pp. 044001. Cited by: Appendix F, §1.1, §1, §4.
 [PAG18] (2018) How to train your resnet 4: architecture. Note: https://web.archive.org/web/20210512184210/https://myrtle.ai/learn/howtotrainyourresnet4architecture/ Accessed: 2022-01-25. Cited by: §2, §3.
 [RYH21] (2021) The principles of deep learning theory. CoRR abs/2106.10165. External Links: Link, 2106.10165 Cited by: §1.1, §3.
 [RRB+19] (2019) A constructive prediction of the generalization error across scales. arXiv preprint arXiv:1909.12673. Cited by: §1.1, §1.
 [SFG+20] (2020) Neural kernels without tangents. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13–18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 8614–8623. External Links: Link Cited by: §2, §3.
 [SK20] (2020) A neural scaling law from the dimension of the data manifold. CoRR abs/2004.10802. External Links: Link, 2004.10802 Cited by: Appendix F.

[SDD21] (2021) Neural tangent kernel eigenvalues accurately predict generalization. CoRR abs/2110.03922. External Links: Link, 2110.03922 Cited by: §1.1.
 [SS19] (2019) Mean field analysis of deep neural networks. arXiv. External Links: Document, Link Cited by: §1.1.
 [TSM+20] (2020) Fourier features let networks learn high frequency functions in low dimensional domains. arXiv preprint arXiv:2006.10739. Cited by: §1.
 [VY21] (2021) Explicit loss asymptotics in the gradient descent training of neural networks. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Eds.), External Links: Link Cited by: §5.1.
 [YH21] (2021) Tensor programs IV: feature learning in infinite-width neural networks. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18–24 July 2021, Virtual Event, M. Meila and T. Zhang (Eds.), Proceedings of Machine Learning Research, Vol. 139, pp. 11727–11737. External Links: Link Cited by: §1.1.
Appendix A Experimental Details
A.1 Architecture
The exact architecture of the Myrtle-CNN we use is the following: Conv layer with $w$ channels, ReLU, Conv layer with $w$ channels, ReLU, AvgPool, Conv layer with $w$ channels, ReLU, AvgPool, Conv layer with $w$ channels, ReLU, AvgPool, AvgPool, Dense layer with 1 output. The stride is always 1. The network is implemented in the stax module of the Neural Tangents [NXH+20] library. We refer to $w$ as the "width" of the network; our base network in Figure 1 uses the base value of $w$.
For the SVHN-parity task we use a 5-layer CNN, again parameterized by a single width; the base network uses the base value of that width.
The MLP that we use in our synthetic-dataset experiments is a depth-4 MLP.
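The MLP's code listing was likewise lost in extraction. A minimal pure-Python sketch of a depth-4 fully-connected ReLU network is given below; the hidden width, seed, and the 1/sqrt(fan_in) initialization scaling are our assumptions for the sketch (the paper's version was written with Neural Tangents):

```python
import math
import random

def init_mlp(d_in, width, depth=4, seed=0):
    """Random Gaussian init for a depth-4 MLP with scalar output.

    The 1/sqrt(fan_in) scaling is an assumption for this sketch,
    not a detail taken from the paper.
    """
    rng = random.Random(seed)
    sizes = [d_in] + [width] * (depth - 1) + [1]
    layers = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        w = [[rng.gauss(0, 1) / math.sqrt(fan_in) for _ in range(fan_in)]
             for _ in range(fan_out)]
        layers.append(w)
    return layers

def forward(layers, x):
    """ReLU MLP forward pass; the final layer is linear."""
    h = x
    for i, w in enumerate(layers):
        z = [sum(wi * hi for wi, hi in zip(row, h)) for row in w]
        h = z if i == len(layers) - 1 else [max(0.0, v) for v in z]
    return h[0]

net = init_mlp(8, 32)
print(forward(net, [0.5] * 8))
```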
A.2 Scaling Laws
In our plots we use scaling laws of the form $e(n) = e_0 + a\,n^{-\alpha}$, where $\alpha$ is referred to as the scaling constant. The offset $e_0$ takes into account the fact that any given neural network has a maximum possible accuracy; note that, as our task is deterministic, there is no label noise to be accounted for. We calculate $\alpha$ by solving the least-squares problem between $\log n$ and the $\log$ of the empirically found test errors.
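As an illustration, a least-squares fit of this kind on log-transformed data can be done in closed form. The sketch below ignores any additive offset term (i.e. it assumes the error is a clean power law $a\,n^{-\alpha}$), which is our simplification:

```python
import math

def fit_power_law(ns, errs):
    """Fit err ~= a * n**(-alpha) by linear least squares in log-log space.

    Returns (alpha, a). Ignores any additive offset term; this is a
    simplification made for the sketch.
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in errs]
    m = len(xs)
    mx, my = sum(xs) / m, sum(ys) / m
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    # log e = log a - alpha * log n, so alpha = -slope, a = exp(intercept)
    return -slope, math.exp(my - slope * mx)

# Recover the exponent from synthetic errors obeying a clean power law.
ns = [500, 1000, 2000, 4000, 8000]
errs = [0.5 * n ** -0.3 for n in ns]
alpha, a = fit_power_law(ns, errs)
print(alpha, a)  # close to 0.3 and 0.5
```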
A.3 SGD with momentum, equivalence between infinite NTKs and neural networks
A.4 Additional Experimental Details for Figures
64-bit versus 32-bit precision. Our neural network experiments are done with 32-bit precision, while the kernel experiments are done with 64-bit precision. To verify that this difference is not the cause of the better performance of neural networks, we ran the experiment in Figure 1 with 64-bit precision for a train size of 64k. The test error actually improved: with optimal early stopping it was .03993, versus .04118 at 32-bit precision.
Starting with 0-output neural networks. The convergence between neural networks, the empirical NTK, and the infinite NTK as width tends to infinity requires that the neural networks have 0 output at initialization. This can be achieved by subtracting the initial outputs from the neural network output. While we do not do this for the experiments in the paper, we verify in Table 1 that it makes almost no difference for Figure 1. We use a table instead of a plot as most of the differences are too small to be visible on a plot.
Dataset Size                    500     1k      2k       4k       8k      16k      32k      64k
Usual Neural Networks           .1504   .1177   .09655   .08083   .0658   .0545    .04958   .04118
Neural Networks with 0 output   .1501   .1204   .09672   .08302   .0668   .05462   .04585   .04152
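The output-centering trick described above is a one-line wrapper around the model function; the sketch below (function names are ours) shows the idea for any model $f(w, x)$:

```python
def center_at_init(f, w0):
    """Return a model with 0 output at initialization.

    f(w, x) is the network as a function of weights and input; w0 are
    the initial weights. The wrapped model subtracts the prediction at
    w0, so it outputs exactly 0 when evaluated at w = w0.
    """
    def centered(w, x):
        return f(w, x) - f(w0, x)
    return centered

# Toy example: a "network" that is linear in its single weight.
f = lambda w, x: w * x + 1.0
g = center_at_init(f, w0=2.0)
print(g(2.0, 3.0))  # 0.0 at initialization
print(g(4.0, 3.0))  # (4*3+1) - (2*3+1) = 6.0
```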
We now describe additional experimental details for each figure beyond those described in Section 2.
In Figure 1 all 3 models were trained with minibatch SGD with momentum. For all models we used optimal early stopping, where the test error is logged after each multiplicative-factor increase in the number of gradient steps. Unless mentioned otherwise, this is the optimization setup we will use. For the neural network and the empirical NTK, each model is averaged over 4 random initializations, and error bars denote standard deviations (except the last two points of the empirical NTK, which we were not able to rerun due to computational constraints). The infinite NTK is a deterministic model, hence we have only a single run for it. We also trained the kernels by directly solving the linear system with optimal L2-regularization, but that yielded worse test performance.
In Figure 2, for the empirical NTK at large widths (128, 256) the values across different initializations were very similar, hence we did not do multiple runs at the even higher widths (512, 1024). For the neural network, the standard error of the mean (SEM) was calculated using 6 runs for widths up to 128 and 4 runs for higher widths.
In Figure 2 the models are trained with SGD: batch size of 200, learning rate of 10, and momentum. All models are averaged across 8 random initializations. In this figure we report, for all models, the final test error at convergence (convergence of the train loss, for the neural network).
In Figure 3 we train with a learning rate of 10.0 and no momentum.
In Figure 3 we train the width-512 version of the Myrtle-CNN until it reaches the train-loss threshold, by which point its test error has converged to a nearly fixed value.
Appendix B Effect of Very Low Learning Rate
In this section we explore the effects of very low learning rates. The motivation is to understand the gradient flow limit, i.e. the limiting behaviour of trained neural networks as the learning rate tends to 0. In Figure 6 we repeat Figure 1, except that we train the neural network with a very low learning rate. The measured scaling constants for the Myrtle-CNN, its empirical NTK, and the infinite NTK are reasonably close. This suggests a natural question:
Question B.1.
Do the benefits (with respect to the scaling constant) of finite-width networks over the corresponding empirical NTK vanish in the gradient flow limit?
To answer this question affirmatively we would need to repeat the plot of Figure 6 for various widths, which was computationally infeasible for us. We leave this for future work.
We now move to considering the effect of learning rate on a fixed training size.
In Figure 6 we plot the performance with respect to the learning rate for a training size of 4000 (from the setup in Figure 1) and observe that at low learning rates the performance is worse than the infinite NTK, but still better than the empirical NTK at initialization. (Note that this is just for a single width; we do not know how width affects these results. We do know that at infinite width the learning rate does not have any effect, other than through optimal early stopping.) From these plots it is not clear whether, at the lowest learning rates at which we could train, the performance has converged to the gradient flow performance; in this setup it was computationally infeasible for us to explore smaller learning rates. To do so we move to the synthetic setting (with the same setup as in Figure 8) in Figure 6. Here the performance (final test error) converges as we go towards smaller learning rates, which indicates that we have converged to the gradient flow limit. In this limit the neural network performs better than the infinite NTK and the empirical NTK at initialization. Note that higher learning rates still lead to even better performance.
We interpret all of these experiments as suggesting that while a high learning rate plays an important role in the performance of neural networks, it may not be necessary for achieving improved performance over the corresponding NTKs. More experimental evidence is needed to understand the role of the learning rate and the gradient flow limit, particularly for natural tasks.
Appendix C Higher order analogues of the NTK
Let $f(w, x)$ denote the network output, with $w$ representing the weights and $x$ a sample. By Taylor expansion around the initial weights $w_0$ we have:
$$f(w, x) = f(w_0, x) + \langle \nabla_w f(w_0, x),\, w - w_0 \rangle + \tfrac{1}{2}\, (w - w_0)^\top \nabla^2_w f(w_0, x)\, (w - w_0) + \ldots$$
The empirical NTK of the neural network around weights $w_0$ refers to the model $f(w_0, x) + \langle \nabla_w f(w_0, x),\, w - w_0 \rangle$.
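Concretely, the empirical NTK is the kernel whose features are the gradients of the network at $w_0$. The tiny pure-Python sketch below illustrates this; the two-weight "network" and the central finite-difference gradients are our toy stand-ins for a real model and autodiff:

```python
def empirical_ntk(f, w0, x1, x2, eps=1e-6):
    """K(x1, x2) = <grad_w f(w0, x1), grad_w f(w0, x2)>.

    f maps (weights list, input) -> scalar. Gradients are taken by
    central finite differences, a toy stand-in for autodiff.
    """
    def grad(x):
        g = []
        for i in range(len(w0)):
            wp = list(w0); wp[i] += eps
            wm = list(w0); wm[i] -= eps
            g.append((f(wp, x) - f(wm, x)) / (2 * eps))
        return g
    g1, g2 = grad(x1), grad(x2)
    return sum(a * b for a, b in zip(g1, g2))

# Toy two-weight "network": f(w, x) = w[0]*x + w[1]*x**2.
# Its gradient feature map is phi(x) = [x, x**2].
f = lambda w, x: w[0] * x + w[1] * x ** 2
print(empirical_ntk(f, [0.3, -0.7], 1.0, 2.0))  # <[1, 1], [2, 4]> = 6.0
```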
We consider the following two second-order analogues of the NTK: the full second-order Taylor model
$$f(w_0, x) + \langle \nabla_w f(w_0, x),\, w - w_0 \rangle + \tfrac{1}{2}\, (w - w_0)^\top \nabla^2_w f(w_0, x)\, (w - w_0),$$
and the model containing only the zeroth- and second-order terms,
$$f(w_0, x) + \tfrac{1}{2}\, (w - w_0)^\top \nabla^2_w f(w_0, x)\, (w - w_0).$$
In Figure 7 we use the setup of Figure 1 to plot the performance of the second-order analogue of the empirical NTK. This shows that even this higher-order analogue is not sufficient to recover the scaling law of neural networks.
In Figure 4 we saw that the after-kernel was sufficient to explain the improved performance of the neural network over the empirical NTK. We now show an analogous result for the second-order analogue of the NTK. As we want to understand the effect of the change in the higher-order terms, we need to remove the influence of the after-kernel. We do so by defining the higher-order analogue of the after-kernel as the model above which does not contain the (after-)kernel term, using the weights after training. In Figure 7 we show that the performance of this model is very close to that of the neural network.
Both of these experiments suggest that theories which assume that higher order analogues of the NTK remain fixed throughout the training may not be sufficient to explain the performance of neural networks.
Appendix D Experiments on Synthetic Data
Some of our experiments were not feasible on the CIFAR-5m-bin and SVHN-parity tasks. We ran these experiments on a synthetic task in which both the input sample and its label are generated from fixed sampling distributions.
Experiments on very low learning rates for this synthetic task can be found in Appendix B.
In Figure 8 we perform an experiment analogous to Figure 2 for this synthetic task. We again observe that neural networks at small width improve with increasing width, but at high width they start to worsen as the width increases further. The models are trained with SGD: batch size of 100, learning rate of 10, and no momentum. All models are averaged across 8 random initializations. In this figure we report, for all models, the final test error at convergence (convergence of the train loss, for the neural network).
Appendix E SVHN-parity Experiments
Figure 9 is the analogue of Figure 1. The scaling constants in Figure 9 are reported for the neural network, the infinite NTK, and the empirical NTK at initialization respectively. We also observe that, unlike the CIFAR-5m-bin task, here the neural network outperforms the infinite NTK even at small dataset sizes. This may be because the inductive bias of the NTKs is not suited to the parity task. In Figure 9 we plot the more extensive figure corresponding to Figure 5. We again observe that the kernel continues to improve for most of the training.
All of the above models are trained with SGD: batch size of 200, learning rate of 10, and momentum. In Figure 9 the neural network and empirical NTK experiments are averaged over 4 runs, with error bars denoting standard deviations. The infinite NTK is a deterministic model and hence we only do a single run.
Figure 2 is another SVHN-parity experiment.
Appendix F Other related work
Beyond Double Descent: Double descent [BHM+19, GSd+19, NKB+20] predicts that in the regime of overparameterized models, increasing the width improves the test error. We observe that the performance of overparameterized models is better than that of infinite-width models, showing that there is a natural setting with at least one more ascent after the double descent phenomenon. Behaviours beyond the double descent phenomenon have been predicted [AP20, dSB20, LZG21] and also observed [LSP+20] in empirical neural networks. Our work differs from these works in that we show that in our setup, simultaneously, a) the empirical NTK displays a monotonic improvement in the overparameterized regime towards the infinite NTK performance, while b) the neural network performs better than the infinite NTK. This directly points towards another ascent after the double descent, and also pinpoints its cause as the divergence between finite-width neural networks and the empirical NTK at initialization.
After Kernel: The empirical NTK after the training of the neural network has been termed the after-kernel [LON21]. It has been shown that the after-kernel performs significantly better than the NTK at initialization [LON21, PPG+21]. We extend these works by studying how the after-kernel changes with dataset size, and show that it continues to improve with dataset size.
Time dynamics of training from the NTK perspective: These have been studied by [FDP+20, OMF21, ABP21, LON21]. These papers suggest that the empirical NTK changes rapidly at the beginning of training, followed by a slowing of this change. We argue against this interpretation in Section 5.
Explanations for Scaling Laws: Current explanations of scaling laws [SK20, BDK+21a] rely on a fixed representation space. Operationalizing the representation as the after-kernel, our results suggest that in practical neural networks the representation itself improves as the dataset size increases. Hence we may need more refined theories for explaining the scaling laws of neural networks which take this into account.