The seminal work of Jacot et al. [JHG18] introduced the “Neural Tangent Kernel” (NTK) as the limit of neural networks as their widths approach infinity. Since this limit holds provably under certain initializations, and kernels are more amenable to analysis than neural networks, the NTK promises to be a useful reduction for understanding deep learning. Thus, it has initiated a rich research program that uses the NTK to explain various behaviors of neural networks, such as convergence to global minima [DZP+18, DLL+19], good generalization performance [ALL18, ADH+19a], implicit bias of networks [TSM+20], as well as neural scaling laws [BDK+21a].
In addition to the infinite NTK, the empirical NTK — the kernel whose features are gradients of a finite-width neural network — can be a useful object to study, since it approximates both the true neural network and the infinite NTK. It too has been studied extensively as a tool to understand deep learning [FDP+20, LON21, PPG+21, OMF21].
In this work, we probe the upper limits of this research program: we want to understand the extent to which studying NTKs (empirical and infinite) can teach us about the success of neural networks. We study this question under the lens of scaling [KMH+20, RRB+19] — how performance improves as a function of samples and as a function of time — since scaling is an important “signature” of the mechanisms underlying any learning algorithm. We thus compare the scaling of real networks to the scaling of NTKs in the following ways.
Data scaling of initial kernel (Section 3): We show that both the infinite and empirical NTK (at initialization) can have worse data scaling exponents than neural networks in realistic settings (see Figure 1). We find that this is robust to various important hyperparameter changes such as learning rate (in the range used in practice), batch size, and optimization method.
Width scaling of initial kernel (Section 3): Since neural networks provably converge to the NTK at infinite width, we investigate why the scaling behavior differs at finite width. We show (Figure 2) realistic settings where, as the width of the neural network increases to very large values, the test performance of the network gets worse and approaches the performance of the infinite NTK, unlike existing results in the literature which suggest that over-parameterization is always good. This also raises new questions about the scaling of neural networks with width, in particular the “variance-limited” neural scaling regimes [BDK+21a].
Data scaling of after-kernel (Section 4): We consider the after-kernel [LON21], i.e., the empirical NTK extracted after training to completion on a fixed number of samples. We show (Figure 1(B), 4) that the after-kernel continues to improve as we increase the training dataset size. On the other hand, we find (Figure 4) that the scaling exponent of the after-kernel, extracted after training on a fixed number of samples, remains worse than that of the corresponding neural network.
Time scaling (Section 5): We show (Figure 1(C), 5) realistic settings where the empirical NTK continues to improve uniformly throughout most of training. This is in contrast with prior work [FDP+20, OMF21, ABP21, LON21] which suggests that the empirical NTK changes rapidly at the beginning of training, followed by a slowing of this change.
We demonstrate that these phenomena occur in certain settings based on real, non-synthetic data and modern architectures (e.g., the CIFAR-10 and SVHN datasets and convolutional networks). While we do not claim that these phenomena manifest for all possible datasets and architectures, we believe that our examples highlight important limitations to the use of the NTK to understand the test performance of neural networks. Formalizing the set of distributions or architectures for which these phenomena occur is an important direction for future theoretical research.
1.1 Comparison to Prior Work on NTK Generalization
Our main focus is to understand feature learning occurring due to finite width. To do this, we make the following deliberate choices in all of our experiments: (a) we use the NTK parameterization, which ensures that infinite-width networks will be equivalent to kernels; (b) we use the same optimization setup for the neural network, empirical NTK, and infinite NTK, which ensures that as width tends to infinity all three models will have the same limit. We make sure that our comparisons are robust by (c) using scaling laws to compare these models and (d) performing various hyperparameter ablations (Figure 3).
Below we describe several lines of related works and how our work differs from them.
Small initialization and representation learning at infinite width.
Infinite-width neural networks in the NTK and standard initializations are equivalent to kernels [JHG18, YH21]. On the other hand, it has been shown [YH21, SS19, NP20, AOY19, FLY+20] that with small initialization, feature learning is possible at infinite width. The feature learning displayed in our experiments is not due to small initialization, as we initialize our networks in the NTK parameterization. This was a deliberate choice: we are interested in feature learning occurring due to finite width, as this is the kind of feature learning displayed by practical neural networks (which usually do not have a small initialization).
Data Scaling for NTKs and neural networks.
Scaling laws have been empirically demonstrated [KMH+20, RRB+19] for neural networks and theoretically proven [BCP20, CBP21, BP22] for NTKs under natural assumptions. A comparison between the scaling laws of neural networks and empirical NTKs has previously been carried out by Paccolat et al. [PPG+21] and Ortiz-Jiménez et al. [OMF21], both of whom find that neural networks have better scaling than the empirical NTK at initialization. Neither of these papers compares to infinite NTKs, which leaves open the possibility that neural networks and infinite-width NTKs behave the same with respect to their scaling constants.
Pointwise comparisons of neural networks and corresponding infinite NTKs
has also been studied extensively in the literature [ADH+19b, LSP+20, SDD21], but the results have been mixed. As discussed earlier, we focus on comparing scaling laws. We argue that scaling laws, rather than pointwise comparisons, are the appropriate tool for comparing neural networks and NTKs. Practically, pointwise comparisons between any two models can be fraught with issues, as the ordering can flip depending on dataset size as well as the specific choice of hyperparameters. Scaling exponents, on the other hand, have been found to be more robust to the choice of hyperparameters [BGG+22, KMH+20]. More importantly, the claim that the NTK captures "most" of the performance of the neural network can be subjective, especially when comparing small error or loss values. We show that when we instead look closely at the scaling exponents of these objects, we find major differences.
Theoretical studies of finite-width effects with respect to the NTK regime.
Finite-width corrections to NTK theory have been studied by [AD20, RYH21, BDK+21b]. While these results do not require infinite width, they still require widths much larger than those used in practice, particularly for the training-set sizes used in practice. These papers either (a) consider finite-width corrections to the empirical NTK, or (b) allow the NTK to change but predict that higher-order analogues of the empirical NTK remain constant. For (a), we show that the empirical NTK is very far from the performance of finite-width neural networks. Regarding (b), in Appendix C we show that the higher-order analogues of the empirical NTK change significantly.
After Kernel and Time Dynamics
We describe other related works in Appendix F.
2 Experimental Methodology
Here we describe the common methodology used in our experiments.
The core object we want to understand is the data-scaling law of real neural networks — that is, what is their asymptotic performance as a function of the number of train samples? Concretely, in this work we restrict to classification problems, where we measure performance in terms of test classification error. For a given classification algorithm, let $\mathcal{E}(n)$ be its learning curve: its expected test error as a function of the number of samples $n$. In practice, many neural networks exhibit power-law decay in their learning curves, $\mathcal{E}(n) \approx C n^{-\alpha}$ [KMH+20]. In such settings, we are interested primarily in the scaling exponent $\alpha$, which determines the asymptotic rate of convergence.
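As a concrete illustration (the function and data below are our own sketch, not the paper's exact fitting procedure), a scaling exponent can be estimated by linear regression on log-log axes:

```python
import numpy as np

def fit_scaling_exponent(ns, errs):
    # Fit err(n) ~ C * n**(-alpha) by linear regression on log-log axes;
    # the slope of log(err) versus log(n) is -alpha.
    slope, intercept = np.polyfit(np.log(ns), np.log(errs), 1)
    return -slope, np.exp(intercept)

# Sanity check on synthetic data generated with alpha = 0.3, C = 2.0.
ns = np.array([1000, 2000, 4000, 8000, 16000])
errs = 2.0 * ns ** -0.3
alpha, C = fit_scaling_exponent(ns, errs)
```

In practice one would fit against measured test errors at several training-set sizes; the synthetic data here only checks that the fit recovers a known exponent.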
Empirical and Infinite NTK
Let $f(w, x)$ be a neural network with $w$ representing the weights and $x$ an input. By Taylor expansion around weights $w_0$ we have:

$$f(w, x) = f(w_0, x) + \langle \nabla_w f(w_0, x),\, w - w_0 \rangle + O(\|w - w_0\|^2).$$
The empirical NTK of the neural network around weights $w_0$ refers to the model $g(w, x) = \langle \nabla_w f(w_0, x),\, w \rangle$. We note that this is not the same as linearizing the network, as we are omitting the constant $f(w_0, x)$ term. The empirical NTK is a linear model with respect to the weights $w$. The infinite NTK refers to the limit of the empirical NTK of the network around initial weights as width tends to infinity.
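As a toy illustration of this definition (a hypothetical 12-parameter network of our own, not any architecture from the paper), the sketch below builds the empirical-NTK model from gradient features at the initial weights and checks that it is linear in the weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(w, x):
    # A toy two-layer scalar network with 12 weights; it stands in for
    # any architecture -- the sizes here are illustrative only.
    w1, w2 = w[:8].reshape(4, 2), w[8:]
    return w2 @ np.tanh(w1 @ x)

def grad_f(w, x, eps=1e-6):
    # Numerical gradient of the network output with respect to the weights.
    g = np.zeros_like(w)
    for i in range(len(w)):
        wp, wm = w.copy(), w.copy()
        wp[i] += eps
        wm[i] -= eps
        g[i] = (f(wp, x) - f(wm, x)) / (2 * eps)
    return g

w0 = rng.normal(size=12)   # "initial weights"
x = rng.normal(size=2)     # a fixed input
phi = grad_f(w0, x)        # gradient features at w0 for this input

def f_entk(w):
    # Empirical-NTK model around w0 for the fixed input x: linear in the
    # weights, with the gradient at w0 as the feature vector.
    return phi @ w

# Linearity check: doubling the weight perturbation doubles the change in
# the empirical-NTK model's output (this fails for the network f itself).
d = rng.normal(size=12)
lin_gap = (f_entk(w0 + 2 * d) - f_entk(w0)) - 2 * (f_entk(w0 + d) - f_entk(w0))
```

The gradient features play the role of a fixed feature map: training the empirical NTK means fitting only the linear coefficients $w$, while the features themselves stay frozen at their values around $w_0$.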
For a given learning problem and a given neural network architecture NN, we want to understand its data-scaling law $\mathcal{E}_{NN}(n)$. We consider the infinite NTK of the NN and the empirical NTK of the NN at initialization, and their corresponding learning curves $\mathcal{E}_{NTK}(n)$ and $\mathcal{E}_{ENTK}(n)$. Now we ask: is the scaling exponent of $\mathcal{E}_{NN}$ always close to the scaling exponent of either $\mathcal{E}_{NTK}$ or $\mathcal{E}_{ENTK}$, in realistic settings? That is, how well does the NTK approximation capture the generalization of real networks on natural distributions?
Recall that this question is especially interesting because the three objects involved (neural network, NTK, and ENTK) all become provably equivalent in the appropriate width limit. Thus, at infinite width we know their scaling laws must be equivalent. The question is then: how far are we from this limit in practice? Are the widths used in practice large enough for their scaling behavior to be captured by the infinite-width limit? To probe these questions, we empirically study scaling laws of these methods on image-classification problems.
Remark on comparisons.
We intentionally compare a neural network only to its corresponding NTK, and not to other kernels. Our motivation is not to address the question “can (some) kernel perform as well as a given neural network?” — indeed, there may be a better kernel to consider than the NTK. Rather, our goal is to study the specific kernels given by the NTK approximation, in correspondence with real networks.
We use the following datasets:
A 2-class subset (dog, horse) of the CIFAR-5m [NNS21] dataset, as a binary classification problem, which we denote CIFAR-5m-bin. This is a dataset of synthetic but realistic RGB images similar to CIFAR-10, which were generated using a generative model.
A binary classification task on the SVHN dataset [NWC+11] with the labels being the parity of the digit, denoted SVHN-parity. For the training data we use a balanced random subset of the ’train’ and ’extra’ partitions; for the test data we use the ’test’ partition.
We focus on the CIFAR-5m-bin experiments in the main body. Corresponding SVHN-parity experiments can be found in Appendix E.
We use these particular datasets because we need datasets with a large number of samples in order to measure data-scaling, and CIFAR-5m-bin and the SVHN dataset both have > 600k samples. We chose binary tasks because this makes the kernel experiments computationally feasible. Moreover, although there are other datasets with similar sample sizes (e.g., ImageNet), the datasets we use have the advantage of being low-resolution and easier tasks; thus, scaling-law experiments are far more computationally feasible. We also run some experiments on a synthetic dataset in Appendix D.
We use the following base architectures: a Myrtle CNN [PAG18, SFG+20] with 64 channels in the first layer for the CIFAR-5m-bin task, and a 5-layer CNN with 64 channels for the SVHN-parity task. We consider various width scalings for these networks: for the Myrtle CNN we vary the width from 16 to 1024, and for the 5-layer CNN from 16 to 4096. See Appendix A for more details.
We now describe some subtleties in the experimental setup. We use the NTK parameterization [JHG18] for both the neural network and the kernels, as this is the parameterization used in proving the equivalence of neural networks and the NTK at infinite width. We train with MSE loss and $\pm 1$ labels. All of our networks are in the overparameterized regime, i.e., they are able to reach 0 train error. To preserve the correspondence between the neural networks, empirical NTKs, and infinite NTKs, we train all of them with SGD with momentum with the same hyperparameters. This also ensures that in all experiments neural networks are trained below the critical learning rate, i.e., the learning rate at which training of the empirical and infinite NTK can converge (see also Appendix A.3). Training of the empirical NTK is done by linearizing the initialized neural network using the [NXH+20] library, while for the infinite NTK we directly use SGD with momentum on the linear system given by the infinite NTK and the labels. We describe further experimental details for each individual experiment in Appendix A. Experiments were done on a combination of Nvidia V100 and A40 GPUs on the FASRC Cannon cluster supported by the FAS Division of Science Research Computing Group at Harvard University.
3 Data Scaling Laws of Neural Networks and NTKs in the Overparameterized Regime
In this section we compare the data-scaling laws of neural networks to their corresponding empirical NTKs and infinite NTKs. Our main claim is the following.
There exists a natural setting of task and network architecture such that the neural network trained with SGD has a better scaling constant than its corresponding infinite and empirical NTK at initialization. Further, this gap in scaling continues to hold over a wide range of widths and learning rates used in practice.
The above claim can be interpreted as stating that there exist natural settings where the regime in which real neural networks are trained is meaningfully separated from the NTK regime, and real neural networks have a better scaling law.
In Figure 1(A), we train a Myrtle CNN [PAG18, SFG+20], its empirical NTK at initialization, and its infinite NTK on the CIFAR-5m-bin task. In each case, we train to fit the train set with SGD and optimal early stopping. We then numerically fit scaling laws and find scaling exponents of .185 (empirical NTK), .213 (infinite NTK), and .307 (neural network). Thus, in this image-classification setting, the real neural network significantly outperforms its corresponding NTKs with respect to data-scaling. See Appendix A for full experimental details.
We now investigate how robust this result is to changes in the width of the architecture and optimizer, within realistic bounds.
[Figure 2 caption] We observe that (a) the empirical NTK performance continues to improve with width, moving towards the infinite NTK performance, while (b) neural network performance improves initially and then starts to deteriorate towards the infinite NTK performance. Error bars represent estimated standard deviation. See Appendix A for more details.
Effect of Width.
We explore the effect of width. In Figure 2 we train neural networks with widths much smaller (16) and much larger (1024) than the width (64) used in Figure 1(A). We find that these networks behave similarly with respect to their scaling constants (.276 and .279, respectively), and perform better than the infinite-width NTK (scaling constant: .213), confirming that real neural networks are far from the NTK regime. However, we know that in the truly infinite-width limit, all these methods will perform identically. Moreover, as mentioned in Section 2, we are careful to ensure this limit is preserved by our optimization and initialization setup. This implies that at some point, increasing the width of the real network will start to hurt performance — although it may be computationally infeasible to observe such large widths.
To explore the width dependency, in Figure 2 we plot the expected performance of the empirical NTK at initialization and of the neural network as we increase the width, using a fixed training size of 4000. Here we see that (a) the empirical NTK at initialization continues to improve with larger width and approaches the infinite NTK’s error from above, while (b) the neural network improves initially and then starts to deteriorate and approach the infinite NTK’s error from below. In Figure 2 we repeat the experiment for the SVHN-parity task. In this setup it was computationally feasible to try much larger widths (up to 4096) with a smaller training size of 1000. Hence, in this experiment, as the width increases we can observe a stronger deterioration of the performance of the neural network towards the infinite NTK performance.
Together these results suggest that “intermediate” widths (not too large, not too small) are important for the performance of real overparameterized neural networks, and any explanatory theory must be consistent with this.
Effect of Learning Rate.
We now study how robust our results are to changes in the learning rate, within practically used bounds. Note that changing the learning rate only affects the neural network training, and does not affect any of the corresponding NTKs. In Figure 3 we train networks in the same setting as Figure 2, but with varying learning rates. We find that after moderate modifications of the learning rate, the neural network still has a better scaling law than the infinite and empirical NTK at initialization, suggesting that practically used learning rates (for practically used widths) are far from the NTK regime. The scaling constants for the 3x higher learning rate, the 10x lower learning rate, and the infinite NTK are reported in Figure 3. We discuss the effects of more drastic changes (1000x) in the learning rate in Appendix B.
Other Changes in Optimization.
We now study whether our results hold under other changes to the optimization. In Figure 3 we see the effect of using GD instead of SGD, of training without momentum, and of using the final test error instead of optimal early stopping, respectively. We see that in all of these cases, while there is some change in the scaling laws, the neural network scaling constant is still always better than that of the infinite NTK. The scaling constants for the neural networks in these three ablations are .294, .310, and .292 respectively, while the scaling constant for the infinite NTK is .213 in the first two and .219 in the third. This suggests that these optimization factors are not the fundamental reason behind the improved scaling laws of neural networks.
Various extensions of the NTK regime have been proposed [RYH21, AD20] in the literature which allow the empirical NTK to change but posit that higher-order analogues of the NTK remain constant. This would predict that higher-order analogues of the empirical NTK at initialization are sufficient to match the performance of neural networks. In Appendix C we show that this is not the case, suggesting that these theories may also not suffice to explain the performance of practical neural networks.
Discussion and Future Questions.
The equivalence between neural networks and corresponding NTKs applies when width $\gg$ dataset size. On the other hand, nearly all overparameterized networks and natural tasks fall in the regime of width $\ll$ dataset size (though the width is still large enough to fit the dataset). The results of this section, showing separations between neural networks in the latter regime and NTKs, lead to the following concrete question on the gap between theory and practice, which could guide future work.
How can we understand the behavior of overparameterized networks in the regime where width is far smaller than dataset size?
4 Exploration of the After-Kernel with respect to Dataset Size
In the previous section, we studied the empirical NTK linearized around the weights at initialization. In this section we study the behaviour of the empirical NTK linearized around the weights obtained at the end of training. This is known as the after-kernel, in the terminology of [LON21]. We will show, in the more precise sense defined below, that (1) the after-kernel continues to improve with dataset size, and thus (2) no fixed after-kernel is sufficient to capture the data scaling law of its corresponding neural network.
Formally, denote the after-kernel obtained from the neural network trained on $n$ samples as $K_n$. We denote the test error of $K_n$ when fit on $m$ samples as $\mathcal{E}_{K_n}(m)$. Here, the $m$ samples are a subset of the original $n$ samples; when we instead use fresh samples to fit $K_n$, we write $\mathcal{E}^{fresh}_{K_n}(m)$. We study the after-kernel because the improved performance of neural networks over NTKs has been attributed [OMF21, ABP21] to the adaptation of the empirical NTK of the neural network to the task. Concretely, prior works [LON21, PPG+21] have shown that this explanation is complete in the following sense: the behaviour of $\mathcal{E}_{K_n}(n)$ is similar to that of the neural network fit on $n$ samples. In other words, when we fit an after-kernel obtained from training on $n$ samples to the same $n$ samples, we get an accuracy very close to that of the neural network fit on those samples. (We again note that the empirical NTK does not refer to the linearization of the network; see Section 2 for an exact definition. If we had linearized the network, this statement would be trivially true, as the linearized network around the final weights would start out with an accuracy matching that of the trained neural network.) We verify this for our setup in Figure 4. This tells us that the following two factors are sufficient to explain the behaviour of neural networks fit on $n$ samples: (1) the change in the empirical NTK from the empirical NTK around the initial weights to the after-kernel, due to training on $n$ samples; (2) fitting the after-kernel on the $n$ samples.
What this does not tell us is how these two improvements scale with the training size $n$. In particular, we know that the empirical NTK at initialization fit on $n$ samples does not match the neural network trained on $n$ samples, while the after-kernel from those $n$ samples, fit on the same samples, does. This raises the following natural question: how data-dependent does the kernel need to be to recover the performance of the neural network? For example, it is possible that for some sample size $n_0$ and all $n \ge n_0$, the after-kernel is roughly constant and has the same scaling law as the neural network itself. We find that this is not the case: the after-kernel continuously improves with dataset size $n$.
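To make these objects concrete: fitting any fixed kernel, such as an after-kernel, to a set of samples is just kernel regression with the corresponding Gram matrix. The minimal sketch below (with a stand-in linear-feature kernel; all names and sizes are illustrative, not from the paper) shows the computation:

```python
import numpy as np

rng = np.random.default_rng(1)

def kernel_regression(K_train, y_train, K_test_train, ridge=1e-8):
    # Solve (K + ridge*I) a = y on the training points, then predict on
    # new points via their kernel values against the training set.
    n = len(K_train)
    a = np.linalg.solve(K_train + ridge * np.eye(n), y_train)
    return K_test_train @ a

# Stand-in kernel: a linear-feature Gram matrix K = Phi Phi^T.
Phi = rng.normal(size=(20, 50))   # 20 "training points", 50 features
y = rng.normal(size=20)           # labels
K = Phi @ Phi.T

# Predicting back on the training points (nearly) interpolates the labels.
preds = kernel_regression(K, y, K)
```

For an after-kernel, the Gram matrix entries would be inner products of network gradients at the trained weights; only the kernel changes, the fitting procedure does not.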
4.1 Experimental Results
After-Kernel continues to improve with dataset size.
Fixed after-kernel is not sufficient to capture neural network data scaling.
In Figure 4 we plot the data scaling curves for the base Myrtle-CNN, its empirical NTK at initialization, and the after-kernels, and we find that the neural network has the best scaling constant. This shows that the scaling of the after-kernel with training size is an important component of neural network scaling laws: even the after-kernel learnt with 64k samples (on the simple CIFAR-5m-bin task) is not sufficient to explain the data scaling of neural networks. We also see that the after-kernel learnt from more samples has better performance, further evidence that the after-kernel improves with dataset size.
5 Time Dynamics
In the previous section, we saw that the change in the empirical NTK from initialization to the end of training (the after-kernel) is sufficient to explain the improved performance of neural networks. Thus the empirical NTK must have evolved throughout training, and in this section we take a closer look at this evolution. Our main focus in this section is to investigate the following informal proposal in the literature [FDP+20, LON21] about how the empirical NTK evolves:
Hypothesis 5.1 (informal). The empirical NTK evolves rapidly in the beginning of training (the first few epochs), but then undergoes a “phase transition” into a slower regime.
One way to interpret the above hypothesis is that there is both a qualitative and quantitative difference in the empirical NTK between the “early phase” of training (the first few epochs) and the later stage of training. This is called a “phase transition” in the literature, in analogy to physics, where systems undergo discontinuities between two regimes with quantitatively different dynamics.
In this section we will give evidence that suggests, contrary to prior work, that there is no such “phase transition”. We show that if empirical NTK performance is measured at the appropriate scale, performance appears to continuously improve throughout training (from early to late stages), at approximately the same “rate.” Our experiments are in fact compatible with the experiments in prior work (e.g., [FDP+20]): we simply observe that if performance and time are measured on a log-log scale (as is appropriate for measuring multi-scale dynamics), then the NTK is seen to improve continuously throughout most of training.
We now describe the setup more formally. Let $K_t$ refer to the empirical NTK (as described in Section 2) extracted at time $t$ in training, where we measure time in terms of the number of SGD batches seen in optimization, and let $\hat{K}_t$ denote the model obtained by fitting $K_t$ to the whole training data. Prior works [FDP+20, LON21] have used the slope of the curve of test error of $\hat{K}_t$ versus $t$ to decide whether the kernel is changing rapidly. We do the same, with one crucial difference: we measure this slope on a log-log plot instead of directly plotting test error against time. We do this because scaling laws with respect to time (or tokens processed) have been empirically observed [KMH+20] for natural language tasks in neural networks and formally proven for kernels [VY21] on natural tasks. These results suggest the need for log-log plots to observe qualitative phase transitions in training dynamics.
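As a sketch of this measurement (the function and windowing scheme below are our own, not the paper's), one can compare log-log slopes across consecutive windows of training time; a genuine phase transition would show up as a sharp change in slope between windows:

```python
import numpy as np

def loglog_slopes(t, err, n_windows=4):
    # Fit the slope of log(err) versus log(t) separately in consecutive
    # windows of training time. Roughly equal slopes across windows
    # indicate a single power-law regime; a phase transition would show
    # up as a sharp change in slope between windows.
    log_t, log_e = np.log(t), np.log(err)
    windows = np.array_split(np.arange(len(t)), n_windows)
    return [np.polyfit(log_t[i], log_e[i], 1)[0] for i in windows]

# Sanity check on a pure power law err(t) = t**-0.5: every window
# recovers the same slope of -0.5.
t = np.logspace(0, 4, 200)
slopes = loglog_slopes(t, t ** -0.5)
```

On a linear scale the same power-law curve looks like a fast early drop followed by a slow tail, which is one way an apparent "phase transition" can arise from plotting choices alone.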
Results. Our main claim is that the test error of $\hat{K}_t$ as a function of time is approximately linear on a log-log scale, throughout the course of training. Recall that $\hat{K}_t$ is the model obtained by extracting the empirical NTK after $t$ batches of training the real neural network and fitting it to the training data.
In Figure 5 we compare the test error of the base Myrtle-CNN at time $t$, the test error of the empirical NTK at initialization at time $t$, and that of the extracted kernel $\hat{K}_t$, each trained with the same hyperparameters as Figure 1. Since we want to probe Hypothesis 5.1, which is about the beginning of training, we plot these quantities until the train error reaches 0 (which requires 32 epochs in our experiments). This should be sufficient to cover any reasonable definition of “beginning of training”.
Observe that in Figure 5 we do not see a “phase transition” after which the improvement in kernel test error (in red) slows down. In fact, the kernel starts out essentially constant and then improves uniformly for the remainder of training.
We instead observe the following two regimes (in the initial part of training):
In the first regime (before the dashed vertical line) the empirical NTK at initialization and the neural network have very similar behaviour, and the extracted kernel is nearly constant. This lasts for only around 140 batches ($\approx$ 0.5 epochs). (Note that this means that if we plot per epoch, this phase would not be visible at all.)
In the next regime (after the dashed line) the empirical NTK at initialization and the neural network diverge. As they diverge, the extracted kernel also starts to improve with a constant slope, and this improvement continues uniformly until the terminal stage of training.
Importantly, in our experiments the kernel does not transition into a “slower phase” of learning at any point. (As we keep training, at some point the train loss will tend to 0 and the extracted kernel will converge to a fixed value; this does not affect our results, as we are only interested in the initial part of training, as described in Hypothesis 5.1.) Corresponding SVHN experiments can be found in Appendix E.
Next, we measure the performance of the extracted kernel $\hat{K}_t$ in terms of its data scaling law. Due to computational limitations (measuring data-scaling is expensive), we can only measure the scaling law for several selected values of the time $t$, instead of every batch (as in Figure 5).
In Figure 5 we plot the data-scaling of $\hat{K}_t$ for selected values of $t$ (up to 32 epochs), in the same setup as Figure 5. We also plot the data scaling of the base Myrtle-CNN with the same hyperparameters. As in Figure 4 of Section 4, we again observe that the neural network has the best scaling law, outperforming all of the extracted kernels. This shows that the representations learnt after any fixed amount of training are not sufficient to explain the data scaling of neural networks. Rather, these representations improve throughout training, and the entire course of training must be considered to recover the correct scaling law.
In this work, we compared the data-scaling properties of neural networks to their corresponding infinite NTKs and empirical NTKs (both at initialization and after various amounts of training). We found that the kernels do not scale as well as neural networks, even when the kernels are allowed to be partially “pretrained” themselves. Since scaling laws capture the sample efficiency of a training algorithm, they are an important indicator of the similarities or differences between two methods. Thus, our results suggest that while the NTK is a good starting point for understanding deep learning, there are important aspects that are not accurately captured by the NTK— in particular, data-scaling laws.
NV is supported by NSF CCF-1909429. YB is supported by a Siebel Scholarship. PN is supported by NSF and the Simons Foundation for the Collaboration on the Theoretical Foundations of Deep Learning through awards DMS-2031883 and #814639, by NSF through IIS-1815697 and by the TILOS institute (NSF CCF-2112665). The computations in this paper were run on the FASRC Cannon cluster and were supported by the FAS Division of Science Research Computing Group at Harvard University, Simons Investigator Fellowship, NSF grants DMS-2134157 and CCF-1565264 and DOE grant DE-SC0022199.
- (2020) The neural tangent kernel in high dimensions: triple descent and a multi-scale theory of generalization. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, Proceedings of Machine Learning Research, Vol. 119, pp. 74–84.
- [ALL18] (2018) Learning and generalization in overparameterized neural networks, going beyond two layers. arXiv preprint arXiv:1811.04918.
- [AD20] (2020) Asymptotics of wide convolutional neural networks. CoRR abs/2008.08675.
- [AOY19] (2019) A mean-field limit for certain deep neural networks. arXiv preprint.
- [ADH+19a] (2019) Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pp. 322–332.
- [ADH+19b] (2019) On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 8139–8148.
- [ABP21] (2021) Neural networks as kernel learners: the silent alignment effect.
- [BDK+21a] (2021) Explaining neural scaling laws. arXiv preprint arXiv:2102.06701.
- [BDK+21b] (2021) Explaining neural scaling laws. CoRR abs/2102.06701.
- [BGG+22] (2022) Data scaling laws in NMT: the effect of noise and architecture. CoRR abs/2202.01994.
- [BHM+19] (2019) Reconciling modern machine learning practice and the bias-variance trade-off.
- [BCP20] (2020) Spectrum dependent learning curves in kernel regression and wide neural networks. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, Proceedings of Machine Learning Research, Vol. 119, pp. 1024–1034.
- [BP22] (2022) Learning curves for SGD on structured features. In International Conference on Learning Representations.
- [CBP21] (2021) Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. Nature Communications 12(1), pp. 1–12.
- [dSB20] (2020) Triple descent and the two kinds of overfitting: where & why do they appear? In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
- [DM20] (2020) Learning parities with neural networks. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
- [DLL+19] (2019) Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pp. 1675–1685.
- [DZP+18] (2018) Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054.
- [DG20] (2020) Asymptotics of wide networks from Feynman diagrams. In 8th International Conference on Learning Representations (ICLR 2020).
- [FLY+20] (2020) Modeling from features: a mean-field framework for over-parameterized deep neural networks. arXiv preprint.
- [FDP+20] (2020) Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the neural tangent kernel. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020).
- [GSd+19] (2019-07) Jamming transition as a paradigm to understand the loss landscape of deep neural networks. Physical Review E 100 (1). External Links: Cited by: Appendix F.
- [GMM+20] (2020) When do neural networks outperform kernel methods?. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Cited by: Appendix F.
- [JHG18] (2018) Neural tangent kernel: convergence and generalization in neural networks. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 8580–8589. External Links: Cited by: Limitations of the NTK for Understanding Generalization in Deep Learning, §1.1, §1, §2.
Scaling laws for neural language models. arXiv preprint arXiv:2001.08361. Cited by: §1.1, §1.1, §1, §2, §5.1.
- [KWL+21] (2021) Local signal adaptivity: provable feature learning in neural networks beyond kernels. In Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, M. Ranzato, A. Beygelzimer, Y. N. Dauphin, P. Liang, and J. W. Vaughan (Eds.), pp. 24883–24897. External Links: Cited by: Appendix F.
- [LSP+20] (2020) Finite versus infinite neural networks: an empirical study. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin (Eds.), External Links: Cited by: Appendix F, §1.1.
- [LXS+19] (2019) Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett (Eds.), pp. 8570–8581. External Links: Cited by: §A.3.
- [LZG21] (2021) Towards an understanding of benign overfitting in neural networks. CoRR abs/2106.03212. External Links: Cited by: Appendix F.
- [LP21] (2021) Provable convergence of nesterov accelerated method for over-parameterized neural networks. CoRR abs/2107.01832. External Links: Cited by: §A.3.
- [LON21] (2021) Properties of the after kernel. CoRR abs/2105.10585. External Links: Cited by: Appendix F, Appendix F, item 3, item 4, §1, §4, §4, §5.1, Hypothesis 5.1, §5.
- [NKB+20] (2020) Deep double descent: where bigger models and more data hurt. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, External Links: Cited by: Appendix F.
- [NNS21] (2021) The deep bootstrap framework: good online learners are good offline generalizers. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, External Links: Cited by: item 1.
- [NWC+11] (2011) Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011. Cited by: item 2.
- [NP20] (2020) A rigorous framework for the mean field limit of multilayer neural networks. CoRR abs/2001.11443. External Links: Cited by: §1.1.
- [NXH+20] (2020) Neural tangents: fast and easy infinite neural networks in python. In International Conference on Learning Representations, External Links: Cited by: §A.1, §2.
- [OMF21] (2021) What can linearized neural networks actually say about generalization?. In Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, External Links: Cited by: Appendix F, item 4, §1.1, §1, §4.
- [PPG+21] (2021) Geometric compression of invariant manifolds in neural networks. Journal of Statistical Mechanics: Theory and Experiment 2021 (4), pp. 044001. Cited by: Appendix F, §1.1, §1, §4.
- [PAG18] (2018) How to train your resnet 4: architecture. Note: https://web.archive.org/web/20210512184210/https://myrtle.ai/learn/how-to-train-your-resnet-4-architecture/Accessed: 2022-01-25 Cited by: §2, §3.
- [RYH21] (2021) The principles of deep learning theory. CoRR abs/2106.10165. External Links: Cited by: §1.1, §3.
- [RRB+19] (2019) A constructive prediction of the generalization error across scales. arXiv preprint arXiv:1909.12673. Cited by: §1.1, §1.
- [SFG+20] (2020) Neural kernels without tangents. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, Proceedings of Machine Learning Research, Vol. 119, pp. 8614–8623. External Links: Cited by: §2, §3.
- [SK20] (2020) A neural scaling law from the dimension of the data manifold. CoRR abs/2004.10802. External Links: Cited by: Appendix F.
Neural tangent kernel eigenvalues accurately predict generalization. CoRR abs/2110.03922. External Links: Cited by: §1.1.
- [SS19] (2019) Mean field analysis of deep neural networks. arXiv. External Links: Cited by: §1.1.
- [TSM+20] (2020) Fourier features let networks learn high frequency functions in low dimensional domains. arXiv preprint arXiv:2006.10739. Cited by: §1.
- [VY21] (2021) Explicit loss asymptotics in the gradient descent training of neural networks. In Advances in Neural Information Processing Systems, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan (Eds.), External Links: Cited by: §5.1.
- [YH21] (2021) Tensor programs IV: feature learning in infinite-width neural networks. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, M. Meila and T. Zhang (Eds.), Proceedings of Machine Learning Research, Vol. 139, pp. 11727–11737. External Links: Cited by: §1.1.
Appendix A Experimental Details
The exact architecture of the Myrtle-CNN we use is the following: Conv layer with channels, ReLU, Conv layer with channels, ReLU, Avg-pooling, Conv layer with channels, ReLU, Avg-pooling, Conv layer with channels, ReLU, Avg-pooling, Avg-pooling, Dense layer with 1 output. The stride is always . Here is our code for the network in the module of the Neural Tangents [NXH+20] library:
We refer to as the “width” of the network. Our base network in Figure 1 has .
We use the following 5 layer CNN for the SVHN-parity task:
The base network has .
The MLP that we use in our synthetic dataset experiments is a depth-4 MLP with the following code:
A.2 Scaling Laws
In our plots we use scaling laws of the form $c + a\,n^{-b}$, where $c$ is referred to as the scaling constant. This takes into account the fact that any given neural network has a maximum possible accuracy. Note that as our task is deterministic, there is no label noise to be accounted for. We calculate the parameters by solving the least squares problem between the law's predictions and the empirically found test errors.
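As a concrete sketch, fitting a saturating power law TestErr(n) ≈ c + a·n^(−b) can be done by scanning over the floor c and solving a log-linear least squares problem for each candidate. This parameterization and procedure are an illustrative assumption, not necessarily the exact fitting code used for the plots:

```python
import numpy as np

def fit_scaling_law(n, err, n_grid=200):
    """Fit err ~= c + a * n**(-b): scan candidate floors c, and for each
    solve log(err - c) = log(a) - b * log(n) by linear least squares.
    Returns the (a, b, c) whose log-space fit has the smallest residual."""
    n, err = np.asarray(n, float), np.asarray(err, float)
    best = None
    for c in np.linspace(0.0, err.min() * 0.999, n_grid):
        y = np.log(err - c)                       # log(a) - b * log(n)
        A = np.vstack([np.ones_like(n), -np.log(n)]).T
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = float(((A @ coef - y) ** 2).sum())
        if best is None or resid < best[0]:
            best = (resid, np.exp(coef[0]), coef[1], c)
    _, a, b, c = best
    return a, b, c
```

At the true floor the log-transformed data is exactly linear in log n, so the grid search picks out c up to the grid resolution.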
A.3 SGD with Momentum, Equivalence between Infinite NTKs and Neural Networks
A.4 Additional Experimental Details for Figures
64-bit versus 32-bit precision. Our neural network experiments are done with 32-bit precision while kernel experiments are done with 64-bit precision. To verify that this difference is not the cause of the better performance of neural networks, we reran the experiment in Figure 1 with 64-bit precision for a train size of 64k. The test error actually improved: with optimal early stopping it was .03993, versus .04118 at 32-bit precision.
Starting with 0-output neural networks. The convergence between neural networks, the empirical NTK, and the infinite NTK as width tends to infinity requires that the neural networks have 0 output at initialization. This can be arranged by subtracting the initial outputs from the neural network output. While we do not do this for the experiments in the paper, we verify in Table 1 that it makes almost no difference for Figure 1. We use a table instead of a plot as most of the differences are too small to be visible on a plot.
Usual neural networks: .1504, .1177, .09655, .08083, .0658, .0545, .04958, .04118
Neural networks with 0 output: .1501, .1204, .09672, .08302, .0668, .05462, .04585, .04152
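The zero-output trick described above can be sketched as a small wrapper; `apply_fn` and the parameter format here are stand-ins for whatever model API is in use:

```python
import numpy as np

def center_at_init(apply_fn, params_init):
    """Wrap a model so its output at initialization is identically zero,
    by subtracting the (frozen) predictions of the initial parameters."""
    def centered_apply(params, x):
        return apply_fn(params, x) - apply_fn(params_init, x)
    return centered_apply
```

Training then proceeds on `centered_apply` with `params_init` held fixed, so only the difference from the initial function is learned.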
We now describe additional experimental details for each figure beyond those described in Section 2.
In Figure 1 all 3 models were trained with SGD with batch size , learning rate , and momentum (as implemented in ). For all models we used optimal early stopping, where the test error is logged after each multiplicative-factor increase in the number of gradient steps. Unless mentioned otherwise, this is the optimization setup we use. For the neural network and empirical NTK, each model is averaged over 4 random initializations and error bars denote standard deviations (except the last two points of the empirical NTK, which we were not able to rerun due to computational constraints). The infinite NTK is a deterministic model, hence we have only a single run for it. We also trained the kernels by directly solving the linear system with optimal L2-regularization, but that yielded worse test performance.
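The direct linear-system solve mentioned above can be sketched as a generic kernel ridge regression; the kernel matrices and regularization value here are placeholders, not the values used in the experiments:

```python
import numpy as np

def kernel_ridge_predict(K_train, y_train, K_test_train, reg):
    """Kernel regression by directly solving (K + reg * I) alpha = y,
    then predicting on test points with the test/train kernel block."""
    n = K_train.shape[0]
    alpha = np.linalg.solve(K_train + reg * np.eye(n), y_train)
    return K_test_train @ alpha
```

In practice the regularization `reg` would be tuned (the "optimal L2-regularization" above), e.g. by a sweep on held-out data.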
In Figure 2, for the empirical NTK with large widths (128, 256) the values across different initializations were very similar, hence we did not do multiple runs at even higher widths (512, 1024). For the neural network, the standard error of the mean (SEM) was calculated using 6 runs for widths up to 128 and 4 runs for higher widths.
In Figure 2 the models are trained with SGD: batch size 200, learning rate 10, with momentum. All models are averaged across 8 random initializations. In this figure we report the final test error at convergence (convergence of train loss for the neural network) for all models.
In Figure 3 we train with a learning rate of 10.0 and no momentum.
In Figure 3 we train the width-512 version of the Myrtle-CNN until it reaches train loss , by which point its test error has converged to a nearly fixed value.
Appendix B Effect of Very Low Learning Rate
In this section we explore the effects of very low learning rates. The motivation is to understand the gradient flow limit, i.e. the limiting behaviour of trained neural networks as the learning rate tends to 0. In Figure 6 we repeat Figure 1, except that we train the neural network with learning rate . The measured scaling constants are reasonably close: for the Myrtle-CNN, its empirical NTK, and the infinite NTK respectively. This suggests a natural question:
Do the benefits (with respect to scaling constant) of finite width networks over corresponding empirical NTK vanish in the gradient flow limit?
To answer this question affirmatively we would need to repeat the plot of Figure 6 for various widths, which was computationally infeasible for us. We leave this for future work.
We now move to considering the effect of learning rate on a fixed training size.
In Figure 6 we plot the performance with respect to learning rate for training size 4000 (from the setup in Figure 1) and observe that at low learning rates performance is worse than the infinite NTK but still better than the empirical NTK at initialization. (Note that this is just for a single width; we do not know how width affects these results. We do know that at infinite width the learning rate does not have any effect, other than through optimal early stopping.) From these plots it is not clear whether, at the lowest learning rates at which we could train, the performance has converged to the gradient flow performance. In this setup it was computationally infeasible for us to explore smaller learning rates. To do so we move to the synthetic setting (with the same setup as in Figure 8) in Figure 6. Here the performance (final test error) converges as we go towards smaller learning rates, which indicates that we have converged to the gradient flow limit. In this limit the neural network performs better than the infinite NTK and the empirical NTK at initialization. Note that higher learning rates still lead to even better performance.
We interpret all of these experiments as suggesting that while a high learning rate plays an important role in the performance of neural networks, it may not be necessary for improved performance over the corresponding NTKs. But more experimental evidence is needed to understand the role of the learning rate and the gradient flow limit, particularly for natural tasks.
Appendix C Higher order analogues of the NTK
Let $f(w, x)$ denote the network output, with $w$ representing the weights and $x$ a sample. By Taylor expansion around $w_0$ we have:

$$f(w, x) = f(w_0, x) + \langle \nabla_w f(w_0, x),\, w - w_0 \rangle + \tfrac{1}{2}\, (w - w_0)^\top \nabla^2_w f(w_0, x)\, (w - w_0) + \dots$$
The empirical NTK of the neural network around weights $w_0$ refers to the model obtained by truncating this expansion after the first-order term, i.e. $f(w_0, x) + \langle \nabla_w f(w_0, x),\, w - w_0 \rangle$.
We consider the following two second-order analogues of the NTK:
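These Taylor truncations can be illustrated numerically. Below is a minimal sketch that treats the weights as a flat vector, bakes the sample into `f`, and uses central finite differences for the directional derivatives; this is purely illustrative, and in practice such terms are computed with exact autodiff (e.g. JAX's `jvp`):

```python
import numpy as np

def jvp(f, w, v, eps=1e-4):
    """Directional derivative of f at w in direction v (central differences)."""
    return (f(w + eps * v) - f(w - eps * v)) / (2 * eps)

def taylor_model(f, w0, order):
    """Order-1 (linearized / empirical-NTK-style) or order-2 Taylor
    expansion of the map w -> f(w) around w0."""
    def g(w):
        dw = w - w0
        out = f(w0) + jvp(f, w0, dw)          # constant + linear term
        if order >= 2:
            # quadratic term 1/2 * dw^T H dw, via a nested directional derivative
            out = out + 0.5 * jvp(lambda u: jvp(f, u, dw), w0, dw)
        return out
    return g
```

On a function that is exactly quadratic in the weights, the order-2 model reproduces it exactly (central differences are exact for quadratics), while the order-1 model misses the curvature term.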
In Figure 7 we use the setup of Figure 1 to plot the performance of the second-order analogue of the empirical NTK (as labeled in the plot). This shows that even this higher-order analogue is not sufficient to recover the scaling law of neural networks.
In Figure 4 we saw that the after-kernel was sufficient to explain the improved performance of the neural network over the empirical NTK. We now show an analogous result for the second-order analogue of the NTK. As we want to understand the effect of changes in the higher-order terms, we need to remove the influence of the after-kernel. We do so by defining the higher-order analogue of the after-kernel as the model , which does not contain the after-kernel. We will denote this by when we use the weights after training on samples. In Figure 7 we show that the performance of is very close to that of the neural network.
Both of these experiments suggest that theories which assume that higher-order analogues of the NTK remain fixed throughout training may not be sufficient to explain the performance of neural networks.
Appendix D Experiments on Synthetic Data
Some of our experiments were not feasible on the CIFAR-5m-bin and SVHN-parity tasks. For these we used the following synthetic task: sample , ; the input sample is and the label is .
Experiments on very low learning rates for this synthetic task can be found in Appendix B.
In Figure 8 we run an experiment analogous to Figure 2 for this synthetic task. We again observe that neural networks at small widths improve as width increases, but at high widths they start to worsen as width increases further. The models are trained with SGD: batch size 100, learning rate 10, no momentum. All models are averaged across 8 random initializations. In this figure we report the final test error at convergence (convergence of train loss for the neural network) for all models.
Appendix E SVHN-parity Experiments
Figure 9 shows the analogue of Figure 1. The scaling constants in Figure 9 are , , and for the neural network, infinite NTK, and the empirical NTK at initialization respectively. We also observe that, unlike on the CIFAR-5m-bin task, here the neural network outperforms the infinite NTK even at small dataset sizes. This may be because the inductive bias of the NTKs is not suited to the parity task. In Figure 9 we also plot the more extensive figure corresponding to Figure 5. We again observe that the kernel continues to improve for most of the training.
All of the above models are trained with SGD: batch size 200, learning rate 10, with momentum. In Figure 9 the neural network and empirical NTK experiments are averaged over 4 runs, with error bars denoting standard deviation. The infinite NTK is a deterministic model and hence we only do a single run.
Figure 2 is another SVHN-parity experiment.
Appendix F Other related work
Beyond Double Descent: Double descent [BHM+19, GSd+19, NKB+20] predicts that in the regime of overparameterized models, increasing the width improves the test error. We observe that the performance of overparameterized models is better than that of infinite-width models, showing that there is a natural setting with at least one more ascent after the double descent phenomenon. Behaviours beyond double descent have been predicted [AP20, dSB20, LZG21] and also observed [LSP+20] in empirical neural networks. Our work differs from these works as we show that in our setup, simultaneously, a) the empirical NTK displays a monotonic improvement in the overparameterized regime towards the infinite NTK performance, while b) the neural network performs better than the infinite NTK. This directly points towards another ascent after the double descent, and also pinpoints its cause as the divergence between finite-width neural networks and the empirical NTK at initialization.
After Kernel: The empirical NTK after the training of the neural network has been termed the after-kernel [LON21], and its properties have been studied in prior work [LON21, PPG+21]. We extend these works by studying how the after-kernel changes with dataset size, and show that it continues to improve with dataset size.
Time dynamics of training from the NTK perspective have been studied by [FDP+20, OMF21, ABP21, LON21]. These papers suggest that the empirical NTK changes rapidly at the beginning of training, followed by a slowing of this change. We argue against this interpretation in Section 5.
Explanations for Scaling Laws: Current explanations of scaling laws [SK20, BDK+21a] rely on a fixed representation space. Operationalizing the representation as the after-kernel, our results suggest that in practical neural networks the representation itself improves as dataset size increases. Hence we may need more refined theories for explaining the scaling laws of neural networks which take this into account.