Due to the challenge of modeling the structure of realistic data, theoretical studies of generalization often attempt to derive data-agnostic generalization bounds or study the typical performance of the algorithm on simple data distributions. The first set of theories derives bounds based on the complexity or capacity of the function class and often struggles to explain the success of modern learning systems, which generalize well on real data but are sufficiently powerful to fit random noise [41, 58]. Rather than exploring data-independent worst-case performance, it is often useful to analyze how algorithms generalize typically or on average over a stipulated data distribution. A typical assumption made in this style of analysis is that the data distribution possesses a high degree of symmetry, for instance that the data follows a factorized probability distribution across input variables. For example, spherical cow models treat data vectors as drawn from the isotropic Gaussian distribution or uniformly from the sphere, while Boolean hypercube models treat data as random binary vectors. Models which study such simplified data distributions have been employed in several classic and recent studies exploring the capacity of supervised learning algorithms and associative memory [22, 30], overfitting peaks and phase transitions in learning [27, 42, 17, 1, 39, 31], and neural network training dynamics.
Rather than being distributed isotropically throughout the entire set of ambient dimensions, realistic datasets often lie on low dimensional structures. For example, MNIST and CIFAR-10 concentrate near surfaces whose intrinsic dimension is far smaller than the ambient pixel dimension. Incorporating data manifold structure into models of generalization has provided more accurate assessments of classifier capacity [15, 14], nonlinear function approximation [53, 23, 10, 11, 37, 33, 6], linear network dynamics, and two-layer neural network test error [25, 56] on realistic learning problems such as MNIST or CIFAR-10 [35, 34]. The analysis of two-layer networks revealed the importance of modeling the intrinsically low-dimensional latent structure of the data when analyzing learning dynamics: the authors of that study propose a hidden manifold model of the data, in which labels are generated by a teacher network that receives the low dimensional latent variables as input.
Of significant practical interest to machine learning theory are the dynamics of the test loss during stochastic gradient descent, which quantify the expected error rate of the model throughout optimization. Several works have provided asymptotic guarantees for the convergence rate of SGD in general settings [46, 49, 52, 48, 13, 19, 57, 4, 26], obtaining worst-case bounds in terms of general assumptions on the structure of the gradient and Hessian of the loss. Tight asymptotic loss scalings have been obtained for SGD on high dimensional least squares, though only the exponents of the power-law scalings were exactly computed from the feature covariance [8, 18, 45, 21]. Alternatively, SGD has been studied in the typical case in several works in the spirit of statistical physics, providing exact average test loss expressions for very simple data distributions. These include studies of single layer [28, 55, 9, 38] and two-layer [50, 16, 24] neural networks as well as shallow Gaussian mixture classification. To understand the average-case performance of SGD in more realistic learning problems, incorporating structural information about realistic data distributions is necessary.
In this paper, we first explore the minimal improvement on the spherical cow approximation by studying an elliptical cow model, in which the image of the data under a possibly nonlinear feature map is treated as a Gaussian with a given covariance. We express the generalization error in terms of the induced distribution of nonlinear features, akin to an SGD version of the offline kernel regression theory of recent works [10, 11, 37]. We derive test error dynamics throughout SGD in terms of the correlation structure in a feature space, such as a wide neural network's initial gradient features [32, 36]. Using this idea, we analyze SGD on random feature models and artificial neural networks using MNIST and CIFAR-10. We then analyze the general case where the feature distribution is arbitrary and provide an exact solution for the expected test loss dynamics. This result requires not only the second moment structure but also all of the fourth moments of the features. From this general theory, one can recover the Gaussian approximation in the limit of small learning rates, large batch sizes, or feature distributions with small fourth order cumulants. For MNIST and CIFAR-10, we empirically observe that the Gaussian model provides an excellent approximation to the true dynamics due to negligible non-Gaussian effects.
Another novelty of our approach is that it provides learning curves in discrete time with an explicit dependence on the minibatch size m, allowing us to interpolate our theory between single sample SGD (m = 1) and gradient descent on the population loss (m → ∞) by varying m. We show how learning rate, minibatch size and data structure interact in the learning problem to determine generalization dynamics and examine what the best sampling strategy is for a fixed compute budget.
2 Theoretical Results
2.1 Motivations: Examples of interesting linearized settings
We study stochastic gradient descent on a linear model f(x) = w · ψ(x) with parameters w and feature map ψ. In this setting we aim to optimize the set of parameters w to minimize a population loss of the form

L(w) = (1/2) E_x [ (w · ψ(x) − y(x))^2 ],

where x are input data vectors associated with a probability distribution p(x), ψ is a nonlinear feature map and y is a target function which we can evaluate on training samples. The aim of the present work is to elucidate how this population loss evolves during stochastic gradient descent on w. This simple setting is of relevance for understanding many models including the random feature model and the infinite width limit of neural networks [32, 5, 36] as we describe below. We derive a formula for the test loss in terms of the eigendecomposition of the feature correlation matrix, Σ = E_x[ψ(x) ψ(x)^T], and the target function, where λ_k and e_k denote the eigenvalues and eigenvectors of Σ. Our theory predicts the expected test loss averaged over training sample sequences in terms of the quantities λ_k and the projections of y onto the eigendirections e_k, revealing how the structure in the data and the learning problem influences test error dynamics during SGD. This theory is quite general, analyzing the performance of linearized models on arbitrary data distributions, feature maps ψ, and target functions y.
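To ground the notation, the following is a minimal numerical sketch (the dimensions, the power-law eigenvalue choice, and all variable names are our own illustrative assumptions) checking that the population loss of a linear model reduces to a quadratic form over the spectrum of the feature correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 8, 200_000
lam = 1.0 / np.arange(1, N + 1)          # eigenvalues of the feature correlation
w_star = rng.standard_normal(N)          # target weights (learnable target)
w = rng.standard_normal(N)               # current model weights

# Sample features directly in their eigenbasis: independent coords with variance lam_k
Z = rng.standard_normal((P, N)) * np.sqrt(lam)
y = Z @ w_star
L_mc = 0.5 * np.mean((Z @ w - y) ** 2)          # Monte Carlo population loss
L_spec = 0.5 * np.sum(lam * (w - w_star) ** 2)  # spectral formula: sum_k lam_k (w - w*)_k^2 / 2
```

Sampling the features in their eigenbasis makes the two quantities agree up to Monte Carlo error in the empirical average.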
2.1.1 Random Feature Models
Our theory can be used to study the popular random feature models on realistic data by constructing a feature map as ψ(x) = φ(Θx), with input data x ∈ R^D, elementwise nonlinearity φ, and projection matrix Θ ∈ R^{N×D}. The random feature model is thus a linear model with covariance structure Σ = E_x[ φ(Θx) φ(Θx)^T ]. By diagonalizing Σ we can find its eigenvalues λ_k and eigenvectors e_k. These quantities, along with information about the target function, will be inputs into our theory, allowing us to predict learning curves during SGD.
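As a sketch of this construction (the ReLU nonlinearity, Gaussian stand-in data, and all sizes below are illustrative assumptions; real experiments would use MNIST or CIFAR-10 inputs), the empirical feature correlation and its eigendecomposition can be computed as:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, P = 20, 50, 2000             # input dim, number of random features, samples

X = rng.standard_normal((P, D))                   # stand-in for real input data
Theta = rng.standard_normal((N, D)) / np.sqrt(D)  # random projection matrix
Psi = np.maximum(0.0, X @ Theta.T)                # ReLU random features psi(x)

Sigma = Psi.T @ Psi / P                           # empirical feature correlation matrix
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]                 # sort spectrum in decreasing order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
```

The sorted eigenvalues and the projections of the target labels onto the eigenvectors are exactly the inputs the theory requires.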
2.1.2 Kernel Methods and Linearized Neural Networks
Wide neural networks behave as linear functions of their parameters around the initialization and as nonlinear functions of the input data. To study such linearized networks with parameters θ and initial parameters θ_0 in the framework of our theory, we interpret w as the displacement θ − θ_0 of the weights from initialization. This allows construction of a nonlinear feature map of the form ψ(x) = ∇_θ f(x; θ_0), the gradient of the network output at initialization. In this setting it suffices to understand the correlation structure Σ = E_x[ ∇_θ f(x; θ_0) ∇_θ f(x; θ_0)^T ], which is simply the Fisher information matrix of the network at initialization.
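As a sketch of such gradient features (the tiny one-hidden-layer ReLU network, its dimensions, and the hand-derived gradients below are illustrative assumptions, not the paper's experimental setup), the Fisher-type correlation matrix can be estimated directly:

```python
import numpy as np

rng = np.random.default_rng(7)
d, h, P = 5, 10, 2000      # input dim, hidden width, number of samples

W0 = rng.standard_normal((h, d)) / np.sqrt(d)  # initial hidden weights
a0 = rng.standard_normal(h) / np.sqrt(h)       # initial readout weights

def grad_features(x):
    """Gradient of f(x) = a . relu(W x) with respect to (a, W) at initialization."""
    pre = W0 @ x
    g_a = np.maximum(0.0, pre)                    # df/da_i = relu(pre_i)
    g_W = (a0 * (pre > 0))[:, None] * x[None, :]  # df/dW_ij = a_i 1[pre_i>0] x_j
    return np.concatenate([g_a, g_W.ravel()])

X = rng.standard_normal((P, d))
Psi = np.stack([grad_features(x) for x in X])  # tangent features, shape (P, h + h*d)
F = Psi.T @ Psi / P                            # empirical Fisher / feature correlation
```

Diagonalizing F then plays the same role as diagonalizing the random feature covariance above.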
2.2 Problem Setup
Let ψ(x) ∈ R^N (with N possibly infinite) be feature vectors with correlation structure Σ = E_x[ψ(x) ψ(x)^T]. During learning, parameters w are updated to estimate a target function y(x) which can be expressed as a linear combination of features, y(x) = w* · ψ(x). At each time step t, the weights are updated by taking a stochastic gradient step on a fresh mini-batch of m examples:

w_{t+1} = w_t − (η/m) Σ_{μ=1}^{m} ψ(x_μ^t) [ w_t · ψ(x_μ^t) − y(x_μ^t) ],

where each of the vectors x_μ^t is sampled independently from p(x). The learning rate η controls the gradient descent step size while the batch size m sets the quality of the empirical estimate of the gradient at timestep t. At each timestep, the test loss, or generalization error, has the form

L(w_t) = (1/2) E_x [ (w_t · ψ(x) − y(x))^2 ],
which quantifies exactly the test error of the vector w_t. Note, however, that L(w_t) is a random variable, since w_t depends on the precise history of sampled feature vectors. Our theory, which generalizes the recursive method of Werfel, Xie and Seung, allows us to compute the expected test loss ⟨L(w_t)⟩ by averaging over all possible sample sequences. Using a similar technique, we also provide a calculation of the variance of the loss, which quantifies the fluctuations in the learning curve due to stochastic sampling of features.
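The update rule and test loss above can be simulated directly. The following sketch (Gaussian features with an assumed power-law covariance and a learnable target; all parameter choices are illustrative) tracks the exact population loss along one SGD trajectory:

```python
import numpy as np

rng = np.random.default_rng(2)
N, eta, m, T = 30, 0.05, 8, 300      # feature dim, learning rate, batch size, steps

lam = 1.0 / np.arange(1, N + 1) ** 2  # assumed feature eigenvalues (power law)
w_star = rng.standard_normal(N)       # learnable target: y(x) = w* . psi(x)
w = np.zeros(N)

losses = []
for t in range(T):
    losses.append(0.5 * np.sum(lam * (w - w_star) ** 2))  # exact population loss L(w_t)
    Z = rng.standard_normal((m, N)) * np.sqrt(lam)        # fresh minibatch of features
    w -= (eta / m) * Z.T @ (Z @ (w - w_star))             # stochastic gradient step
```

Because the features are sampled in their eigenbasis, the population loss can be evaluated exactly at every step rather than estimated from a held-out set.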
2.3 Learnable and Noise Free Problems: The Elliptical Cow Model
Before studying the general case, we first analyze the setting where the target function is learnable, meaning that there exist weights w* such that y(x) = w* · ψ(x). We will further assume that the induced feature distribution is Gaussian, so that all moments of ψ can be written in terms of the covariance Σ. We will remove these assumptions in later sections.
Suppose the features ψ(x) follow a Gaussian distribution and the target function is learnable in these features, y(x) = w* · ψ(x). After t steps of SGD with minibatch size m and learning rate η, the expected (over possible sample sequences) test loss has the form

⟨L(w_t)⟩ = (1/2) λ^T C^t v_0,

where λ is a vector containing the eigenvalues λ_k of Σ and v_0 is a vector containing elements ((w_0 − w*) · e_k)^2 for eigenvectors e_k of Σ. The matrix C has the form

C = (I − η diag(λ))^2 + (η^2/m) ( diag(λ)^2 + λ λ^T ),

where diag(·) constructs a matrix with the argument vector placed along the diagonal.
See Appendix A.1. ∎
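The theorem can be sanity-checked numerically. In the sketch below the evolution matrix is our own reconstruction from the Gaussian (Wick) moment calculation, a diagonal contraction from the average gradient plus O(η²/m) gradient-noise terms; the paper's appendix may arrange these terms differently. The predicted mean loss is compared to a direct average over SGD sample sequences:

```python
import numpy as np

rng = np.random.default_rng(3)
N, eta, m, T, runs = 10, 0.05, 4, 100, 500

lam = 1.0 / np.arange(1, N + 1) ** 2   # assumed feature eigenvalues
w_star = rng.standard_normal(N)        # learnable target weights; w_0 = 0
v = w_star ** 2                        # v_{0,k} = ((w_0 - w*) . e_k)^2

# One-step evolution matrix for the mode-wise squared errors (our reconstruction)
C = np.diag((1 - eta * lam) ** 2) + (eta**2 / m) * (np.diag(lam**2) + np.outer(lam, lam))

theory = []
for t in range(T):
    theory.append(0.5 * lam @ v)
    v = C @ v
theory = np.array(theory)

# Monte Carlo: average the test loss over independent SGD sample sequences
mc = np.zeros(T)
for r in range(runs):
    w = np.zeros(N)
    for t in range(T):
        mc[t] += 0.5 * lam @ (w - w_star) ** 2
        Z = rng.standard_normal((m, N)) * np.sqrt(lam)
        w -= (eta / m) * Z.T @ (Z @ (w - w_star))
mc /= runs
```

The off-diagonal λλ^T piece is what couples the eigenmodes; dropping it recovers independent exponential decay per mode.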
Below we provide some immediate interpretations of this result.
The matrix C can be thought of as containing two components: the matrix (I − η diag(λ))^2, which represents the time-evolution of the loss under average gradient updates, and the remaining matrix carrying the 1/m batch size dependence, which arises due to fluctuations in the gradients, a consequence of the stochastic sampling process.
The test loss obtained when training directly on the population loss can be recovered by taking the minibatch size m → ∞. In this case, C → (I − η diag(λ))^2, and one obtains the convergence of performing gradient descent directly on the population loss L(w). This population-loss behavior can also be obtained by considering small learning rates, i.e. the η → 0 limit, where the fluctuation term, being of order η^2, becomes negligible relative to the average update.
For general η and m, the matrix C is non-diagonal, indicating that the eigenmode errors are not learned independently as t increases, but rather interact during learning. Thus, we expect non-trivial coupling across eigenmodes at large η. This is unlike the offline theory for learning in feature spaces, i.e. kernel regression, where errors across eigenmodes were shown to decouple and are learned at different rates [10, 11].
We can compute not only the average test loss at time t, but also its variance.
Assuming Gaussian features and a learnable target function y(x) = w* · ψ(x), the variance of the loss at time t admits a closed form analogous to the expected loss formula, involving elementwise squares of the relevant spectral quantities and an evolution matrix defined in the same way as in the expected loss formula.
A proof is provided in Appendix A.2. ∎
2.3.1 Special Case 1: Unstructured Isotropic Features
This special case was previously analyzed by Werfel, Xie and Seung, who take isotropic features, λ_k = 1 for all k ∈ {1, ..., N}, and batch size m = 1. We extend their result to arbitrary m, giving the following learning curve:

⟨L(w_t)⟩ = [ (1 − η)^2 + (η^2/m)(N + 1) ]^t L(w_0),

which follows from the fact that 1 (the vector of all 1's) is an eigenvector of C with eigenvalue (1 − η)^2 + (η^2/m)(N + 1). We therefore find exponential convergence in the generalization error with effective rate R(η) = 1 − (1 − η)^2 − (η^2/m)(N + 1). We can further optimize the effective rate with respect to η to get the optimal convergence rate, giving η* = m/(m + N + 1) and

⟨L(w_t)⟩ = ( 1 − m/(m + N + 1) )^t L(w_0).
Again, we can immediately draw some interesting conclusions about this result:
Strong dimension dependence: as N → ∞, we see that, with the optimal choice of η, learning happens at a rate m/(m + N + 1) ≈ m/N per step. This small exponent is due to the necessity of scaling η inversely with the dimension, since the term coming from gradient variance in Equation (10) (the (η^2/m)(N + 1) factor) scales like N. Increasing the minibatch size improves the exponential rate by reducing the gradient noise variance. In the large batch limit m → ∞, the optimal rate no longer degrades with the dimension N.
At small m, the convergence at any learning rate is much slower than the convergence of the m → ∞ limit, which does not suffer from a dimensionality dependence due to gradient noise.
We also note that this feature model has the same rate of convergence for every learnable target function y.
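These closed-form expressions can be checked directly. The sketch below takes the contraction factor ρ = (1 − η)² + (η²/m)(N + 1) and the optimal rate η* = m/(m + N + 1) as working assumptions (our reconstruction of the formulas above) and compares against simulated SGD on isotropic features:

```python
import numpy as np

rng = np.random.default_rng(4)
N, m, T, runs = 30, 5, 60, 300
eta = m / (m + N + 1)                         # optimal learning rate (assumed form)
rho = (1 - eta) ** 2 + eta**2 * (N + 1) / m   # predicted per-step loss contraction
assert np.isclose(rho, 1 - m / (m + N + 1))   # algebraic identity at the optimum

w_star = rng.standard_normal(N) / np.sqrt(N)
mc = np.zeros(T)
for r in range(runs):
    w = np.zeros(N)
    for t in range(T):
        mc[t] += 0.5 * np.sum((w - w_star) ** 2)   # lam_k = 1 for isotropic features
        Z = rng.standard_normal((m, N))
        w -= (eta / m) * Z.T @ (Z @ (w - w_star))
mc /= runs

theory = 0.5 * np.sum(w_star**2) * rho ** np.arange(T)
```

Because the all-ones vector is an eigenvector of the evolution matrix here, the total loss contracts by exactly ρ per step regardless of the target direction.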
In Figure 1 (a) we show theoretical and simulated learning curves for this model for varying values of m at the optimal learning rate, and in Figure 1 (d) we show the loss as a function of minibatch size for a fixed compute budget. In this model of isotropic features, the best minibatch size is m = 1.
Figure 1 (partial caption): loss at fixed compute budget for (d) isotropic features, (e) power law features and (f) MNIST ReLU random features, with simulations (dots: average and standard deviation over runs). Intermediate batch sizes can be preferable on power law features and MNIST.
2.3.2 Special Case 2: Power Laws and Effective Dimensionality
Realistic datasets such as natural images or audio tend to exhibit nontrivial correlation structure, which often results in power-law spectra when the data is projected into a feature space, such as a randomly initialized neural network [53, 10, 11, 6, 54, 8]. In the large-N limit, if the feature spectrum and task spectrum follow power laws, λ_k ∼ k^{−b} and λ_k (w* · e_k)^2 ∼ k^{−a} with a > 1, then Theorem 2.1 implies that generalization error also falls with a power law, ⟨L(w_t)⟩ ∼ C t^{−(a−1)/b}, where C is a constant. Notably, these predicted exponents, recovered here as a special case of our theory, agree with prior work on SGD with power law spectra, which gives exponents in terms of the feature correlation structure [8, 18, 54]. As we show in Appendix A.4, this power law can be derived by taking an integral approximation of the population loss and approximating the integral with Laplace's method. After t steps of gradient descent, the error is dominated by the eigenmodes near the index k*(t) ∼ (ηt)^{1/b} which minimizes the exponent of the integrand, and the test error scaling under such an approximation is t^{−(a−1)/b}. We show an example of such a power law scaling in Figure 1 (b). Notably, since the total variance approaches a finite value as N → ∞, the learning curves are relatively insensitive to ambient dimension, and are rather sensitive to the intrinsic dimension of the data manifold. For this model, we find that there can exist optimal batch sizes when the compute budget is fixed (Figure 1 (e)).
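The predicted exponent can be checked numerically in the population-loss (large batch) limit, where each mode decays independently; the exponents a = 2, b = 1 below are an arbitrary illustrative choice:

```python
import numpy as np

a, b = 2.0, 1.0          # task and feature power-law exponents (illustrative)
N, eta = 100_000, 0.1
k = np.arange(1, N + 1)
lam = k ** (-b)          # feature eigenvalues lam_k ~ k^{-b}
task = k ** (-a)         # per-mode task power lam_k (w* . e_k)^2 ~ k^{-a}

ts = np.array([400, 800, 1600, 3200, 6400])
# population-loss dynamics: each mode's squared error decays as (1 - eta*lam_k)^(2t)
L = np.array([0.5 * np.sum(task * (1 - eta * lam) ** (2 * t)) for t in ts])

slope = np.polyfit(np.log(ts), np.log(L), 1)[0]
# predicted scaling L ~ t^{-(a-1)/b}, i.e. a log-log slope near -1 for this choice
```

The fitted slope approaches −(a−1)/b as the mode index dominating the error moves down the spectrum, away from both the top mode and the spectral cutoff.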
For a fixed feature map and data distribution, some target functions will be easier to learn than others, indicating a strong inductive bias. As we discussed in the previous section, the contributions to the error from each eigenmode decouple in the m → ∞ limit. In this limit, each eigendirection is learned with a different time constant: the coefficient along direction e_k is learned on a timescale inversely proportional to ηλ_k. Noting that λ_k (w* · e_k)^2 is the variance of y along the k-th eigenfunction, it follows that tasks which have most of their variance in the top eigenspace will be learned rapidly, since their variance is concentrated along feature space directions with small time constants. Thus, feature maps which give better alignment to the task (larger power λ_k (w* · e_k)^2 in the top modes) will have better generalization. For large t, the error can be crudely approximated as a tail sum of the remaining variance in the target function over the not-yet-learned modes. This motivates the use of tail sums to quantify feature and task alignment.
This power law scaling is of interest not only as an alternative to the isotropic setting, but also because it appears to accurately match the qualitative behavior of wide neural networks trained on realistic data [29, 6, 53, 10]. In Figure 1 (c), we see that the scaling of the loss in a random features model of MNIST is more similar to the power law setting than to the isotropic features setting, agreeing excellently with our theory. Again, an optimal batch size exists when the compute budget is fixed (Figure 1 (f)). We provide further evidence of the existence of power law structure on realistic data in Figure 2 (a)-(c), where we provide spectra and test loss learning curves for MNIST and CIFAR-10 on ReLU random features. The eigenvalues and the task power tail sums both follow power laws, generating power law test loss curves. These learning curves are contrasted with isotropically distributed data passed through the same ReLU random feature model, and we see that structured data distributions allow much faster learning than the unstructured data. Again, our theory predicts experimental curves accurately across variations in nonlinearities, learning rate, batch size and noise (Figure 2).
2.4 Arbitrary Induced Feature Distributions: The General Solution
The result in the previous section was proven exactly in the case of Gaussian vectors (see Appendix A.1). For arbitrary (possibly non-Gaussian) distributions, we obtain a slightly more involved result (see Appendix A.3).
Let ψ be an arbitrary feature map with covariance matrix Σ. After diagonalizing the features, z_k = e_k · ψ(x), introduce the fourth moment tensor

κ_{ijkl} = E[ z_i z_j z_k z_l ],

where the expectation is taken over the data distribution. Let vec(·) denote a flattening of an N × N matrix into a vector of length N^2 and let κ̃ represent a flattening of the 4D tensor κ into a two-dimensional N^2 × N^2 matrix. Then the expected loss (over sample sequences) can be written exactly in terms of λ and κ̃ (Appendix A.3). We see that the test loss dynamics depend on the second and fourth moments of the features through the quantities λ and κ̃ respectively. This result is exact; however, it requires analyzing the evolution of vectors in N^2 dimensions before calculating the final sum over the diagonal entries, rendering it impractical to simulate for high dimensional feature maps. We recover the Gaussian result as a special case when κ reduces to the Wick form, a sum of three products of Kronecker-type pairings of the covariance, κ_{ijkl} = C_{ij} C_{kl} + C_{ik} C_{jl} + C_{il} C_{jk} with C = diag(λ).
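In the Gaussian special case, the fourth moment tensor and its flattening can be built explicitly from Wick's theorem and checked against sampled features; the three-dimensional example below is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
N, P = 3, 200_000
lam = np.array([1.0, 0.5, 0.25])   # eigenvalues of the feature covariance
Sigma = np.diag(lam)

# Gaussian case: Wick's theorem gives the fourth moments as a sum of
# three pairwise products of the covariance.
kappa = (np.einsum('ij,kl->ijkl', Sigma, Sigma)
         + np.einsum('ik,jl->ijkl', Sigma, Sigma)
         + np.einsum('il,jk->ijkl', Sigma, Sigma))
kappa_flat = kappa.reshape(N * N, N * N)  # flatten 4D tensor to an N^2 x N^2 matrix

# Monte Carlo estimate of E[z_i z_j z_k z_l] from Gaussian feature samples
Z = rng.standard_normal((P, N)) * np.sqrt(lam)
kappa_mc = np.einsum('pi,pj,pk,pl->ijkl', Z, Z, Z, Z) / P
```

For non-Gaussian features the sampled tensor would deviate from the Wick form, and that deviation is exactly the fourth-order cumulant the general theory keeps track of.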
The question remains whether the Gaussian approximation will provide an accurate model on realistic data. This is a weak version of the Gaussian equivalence conjecture from random feature model theory [31, 37]. Based on these previous works, we expect that the test loss of the Gaussian model closely tracks the test loss of wide artificial neural networks. We do not provide a proof of this conjecture, but verify its accuracy in empirical experiments on MNIST and CIFAR-10, as shown in Figure 2. In Figure 3, we show that the fourth moment matrix for a ReLU random feature model, projected along the eigenbasis of the feature covariance, is accurately approximated by the equivalent Gaussian model.
2.5 Unlearnable or Noise Corrupted Problems
In general, the target function may depend on features of the input which cannot be expressed as linear combinations of the features ψ. We therefore decompose the target as y(x) = w* · ψ(x) + ε(x) and model the component ε, with variance σ^2 = E[ε(x)^2], which is not expressible with ψ. Note that ε does not have to be a deterministic function of x, but can also be a stochastic process which is uncorrelated with ψ.
For a target function with unlearnable variance σ^2, the expected test loss has a form analogous to the learnable case with an additional noise-driven contribution, and it approaches an asymptotic, irreducible error L_∞ > 0 as t → ∞.
See Appendix A.5 for the proof. The convergence to the asymptotic error takes the form of a linear combination of decaying exponentials in t. We note that this quantity is not necessarily monotonic in t and can exhibit local maxima for sufficiently large η, as in Figure 2 (f). This is reminiscent of the sample-wise double descent phenomenon in offline learning curves [39, 42, 11, 17], yet the peaking behavior in this model is limited to linear combinations of decaying exponentials (with rates set by the eigenvalues of C) rather than the divergences at the interpolation threshold which appear in the offline double descent model.
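The effect of unlearnable noise can be illustrated with a small simulation (the isotropic features and all parameter choices below are our own assumptions, and the plateau is observed empirically rather than computed from the paper's closed form):

```python
import numpy as np

rng = np.random.default_rng(6)
N, m, eta, T, runs = 20, 4, 0.1, 500, 20
w_star = rng.standard_normal(N) / np.sqrt(N)

def excess_loss_curve(sigma):
    """Average excess test loss for SGD when targets carry unlearnable noise of std sigma."""
    avg = np.zeros(T)
    for _ in range(runs):
        w = np.zeros(N)
        for t in range(T):
            avg[t] += 0.5 * np.sum((w - w_star) ** 2)   # excess loss, isotropic features
            Z = rng.standard_normal((m, N))
            eps = sigma * rng.standard_normal(m)        # unlearnable noise on the labels
            w -= (eta / m) * Z.T @ (Z @ (w - w_star) - eps)
    return avg / runs

clean, noisy = excess_loss_curve(0.0), excess_loss_curve(1.0)
# clean decays toward zero; noisy plateaus at a floor set by eta, m, and sigma
```

Shrinking η or growing m lowers the plateau, which is the discrete-time analogue of the irreducible error L_∞ discussed above.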
3 Comparing Neural Network Feature Maps
We can utilize our theory to compare how wide neural networks of different depths generalize when trained with SGD on a real dataset. In the limit of infinite width and small learning rates, neural network training and generalization behave as those of linear models in their parameters. In finite width neural networks, the NTK, which measures the geometry of the parameter gradients over different data points, can evolve in time. However, for sufficiently large widths, finite neural networks have been shown to behave as linear functions of their parameters. To predict test loss dynamics with our theory, it therefore suffices to characterize the geometry of the gradient features ψ(x) = ∇_θ f(x; θ_0). In Figure 4, we show the Neural Tangent Kernel (NTK) eigenspectra and task-power spectra for fully connected neural networks of varying depth, calculated with the Neural Tangents API. We compute the kernel on a subset of randomly sampled MNIST images and estimate the power law exponents b and a for the kernel and task spectra. We find that, across architectures, the task spectra are highly similar, but that the kernel eigenvalues decay more slowly for deeper models, corresponding to a smaller exponent b. As a consequence, deeper neural network models train more quickly during stochastic gradient descent, as we show in Figure 4 (c). After fitting power laws to the spectra λ_k and the task power, we compared the true test loss dynamics (color) for a width-500 neural network model with the power-law scalings predicted from the fit exponents. The predicted scalings from NTK regression accurately describe trained networks at finite width. On CIFAR-10, we compare the scalings of the convolutional model and a standard multi-layer perceptron and find that the convolutional model obtains a better exponent due to its faster decaying tail sum.
4 Conclusion

By studying a simple model of stochastic gradient descent, we were able to uncover how the geometry of the data in an induced feature space governs the dynamics of the test loss. We derived average learning curves for both Gaussian and general non-Gaussian features and showed the conditions under which the Gaussian approximation is accurate. The proposed model allowed us to explore the role of the data distribution and neural network architecture on the learning curves, demonstrating how the power-law spectra observed in wide neural networks on real data allow an escape from the curse of dimensionality during SGD. We verified our theory with experiments on MNIST and CIFAR-10. In addition, we explored the role of batch size, learning rate, and label noise level on generalization. We found that for a fixed compute budget small minibatch sizes can be best and that label noise can induce peaks in the average case test loss, though not as sharp as those in the offline learning case.
Limitations: Though our model successfully incorporates the structure of the data into a prediction of the test loss dynamics, it is limited in that it applies to linearized machine learning models, where one learns a linear combination of nonlinear static features. Thus, our theory's application to artificial neural networks is limited to random feature models, where only the last layer is trained, or to deep networks in the lazy learning regime, where the network acts as a structured and static feature map. In finite width neural networks, understanding the test loss dynamics during SGD will require coping with non-convexity of the objective and the time evolution of the gradient features. Adaptive learning rate schedules would also be a fruitful extension of the present work, closing the gap between theory and the optimizers used in practice. We hope that our work can inspire future studies on the structure of the data distribution and its interaction with network architecture in the nonlinear feature-learning regime.
We thank the Harvard Data Science Initiative and Harvard Dean’s Competitive Fund for Promising Scholarship for their support. We would also like to thank Jacob Zavatone-Veth for useful discussions and comments on this manuscript.
-  (2016-08) Statistical mechanics of optimal convex inference in high dimensions. 6, pp. 031034. External Links: Cited by: §1.
Statistical mechanics of complex neural systems and high dimensional data. 2013 (03), pp. P03014. External Links: Cited by: §1.
-  (2020) High-dimensional dynamics of generalization error in neural networks. 132, pp. 428–446. External Links: Cited by: §1.
-  (2019-25–28 Jun) Normal approximation for stochastic gradient descent via non-asymptotic rates of martingale clt. In Proceedings of the Thirty-Second Conference on Learning Theory, A. Beygelzimer and D. Hsu (Eds.), Proceedings of Machine Learning Research, Vol. 99, Phoenix, USA, pp. 115–137. External Links: Cited by: §1.
-  (2019) On exact computation with an infinitely wide neural net. Cited by: §2.1.
-  (2021) Explaining neural scaling laws. arXiv preprint arXiv:2102.06701. Cited by: §1, §2.3.2, §2.3.2.
-  (1999-01) Advanced mathematical methods for scientists and engineers: asymptotic methods and perturbation theory. Vol. 1. External Links: Cited by: §A.4, §2.3.2.
-  (2020) Tight nonparametric convergence rates for stochastic gradient descent under the noiseless linear model. External Links: Cited by: §1, §2.3.2.
-  (1994-12) On-line learning with a perceptron. 28 (7), pp. 525–530. External Links: Cited by: §1.
-  (2020) Spectrum dependent learning curves in kernel regression and wide neural networks. In International Conference on Machine Learning, pp. 1024–1034. Cited by: §1, §1, 3rd item, §2.3.2, §2.3.2.
-  (2020) Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. 12, pp. 1–12. Cited by: §1, §1, 3rd item, §2.3.2, §2.5.
-  (2019) On lazy training in differentiable programming. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. . External Links: Cited by: §4.
-  (1954) On a Stochastic Approximation Method. 25 (3), pp. 463 – 483. External Links: Cited by: §1.
-  (2018-10) Learning Data Manifolds with a Cutting Plane Method. 30 (10), pp. 2593–2615. External Links: Cited by: §1.
-  (2018-07) Classification and geometry of general perceptual manifolds. 8, pp. 031003. External Links: Cited by: §1.
-  (1991-05) Eigenvalues of covariance matrices: application to neural-network learning. Phys. Rev. Lett. 66, pp. 2396–2399. External Links: Cited by: §1.
-  (2020) Triple descent and the two kinds of overfitting: where & why do they appear?. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 3058–3069. External Links: Cited by: §1, §2.5.
-  (2016-02) Harder, better, faster, stronger convergence rates for least-squares regression. 18, pp. . Cited by: §1, §2.3.2.
-  (2021) Asymptotic optimality in stochastic optimization. 49 (1), pp. 21 – 48. External Links: Cited by: §1.
-  (2001) Statistical mechanics of learning. Cambridge University Press. External Links: Cited by: §1.
-  (2020) Sobolev norm learning rates for regularized least-squares algorithms. 21 (205), pp. 1–38. External Links: Cited by: §1.
-  (1988) Optimal storage properties of neural network models. 21, pp. 271–284. Cited by: §1.
-  (2020-13–18 Jul) Generalisation error in learning with random features and the hidden manifold model. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 3452–3462. External Links: Cited by: §1.
-  (2019) Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. dAlché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. . External Links: Cited by: §1.
-  (2020-12) Modeling the influence of data structure on learning in neural networks: the hidden manifold model. 10, pp. 041044. External Links: Cited by: §1.
-  (2020) The heavy-tail phenomenon in sgd. arXiv preprint arXiv:2006.04740. Cited by: §1.
-  (1989-06) Phase transitions in simple learning. 22 (12), pp. 2133–2150. External Links: Cited by: §1.
-  (1991-08) Learning processes in neural networks. 44, pp. 2718–2726. External Links: Cited by: §1.
-  (2017) Deep learning scaling is predictable, empirically. External Links: Cited by: §2.3.2.
-  (1982) Neural networks and physical systems with emergent collective computational abilities. 79 (8), pp. 2554–2558. External Links: Cited by: §1.
-  (2021) Universality laws for high-dimensional learning with random features. External Links: Cited by: §1, §2.4.
-  (2020) Neural tangent kernel: convergence and generalization in neural networks. External Links: Cited by: §1, §2.1.
-  (2020) Kernel alignment risk estimator: risk prediction from training data. External Links: Cited by: §1.
-  () CIFAR-10 (Canadian Institute for Advanced Research). External Links: Cited by: §1.
-  (2010) MNIST handwritten digit database. Note: http://yann.lecun.com/exdb/mnist/ External Links: Cited by: §1.
-  (2020-12) Wide neural networks of any depth evolve as linear models under gradient descent. Journal of Statistical Mechanics: Theory and Experiment 2020 (12), pp. 124002. External Links: Cited by: §1, §2.1.2, §2.1, §3.
-  (2021) Capturing the learning curves of generic features maps for realistic data sets with a teacher-student model. External Links: Cited by: §1, §1, §2.4.
-  (1998) Statistical mechanical analysis of the dynamics of learning in perceptrons. Statistics and Computing 8, pp. 55–88. Cited by: §1.
-  (2020) The generalization error of random features regression: precise asymptotics and double descent curve. External Links: Cited by: §1, §2.5.
-  (2020) Dynamical mean-field theory for stochastic gradient descent in gaussian mixture classification. External Links: Cited by: §1.
-  (2012) Foundations of machine learning. The MIT Press. External Links: Cited by: §1.
-  (2019) More data can hurt for linear regression: sample-wise double descent. External Links: Cited by: §1, §2.5.
-  (2020) Neural tangents: fast and easy infinite neural networks in python. In International Conference on Learning Representations, External Links: Cited by: §A.8, §3.
-  (2018) The spectrum of the fisher information matrix of a single-hidden-layer neural network. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31, pp. . External Links: Cited by: §2.1.2.
-  (2018) Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. External Links: Cited by: §1.
-  (1992) Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization 30, pp. 838–855. Cited by: §1.
-  (2008) Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, J. Platt, D. Koller, Y. Singer, and S. Roweis (Eds.), Vol. 20, pp. . External Links: Cited by: §2.1.
-  (1951) A Stochastic Approximation Method. The Annals of Mathematical Statistics 22 (3), pp. 400–407. External Links: Cited by: §1.
-  (1988-02) Efficient estimations from a slowly convergent Robbins–Monro process. pp. . Cited by: §1.
-  (1999-04) Dynamics of on-line gradient descent learning for multilayer neural networks. pp. . Cited by: §1.
-  (2014) Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Cited by: §1.
-  (1989) Asymptotic Properties of Statistical Estimators in Stochastic Programming. The Annals of Statistics 17 (2), pp. 841–858. External Links: Cited by: §1.
-  (2020-12) Asymptotic learning curves of kernel methods: empirical data versus teacher–student paradigm. Journal of Statistical Mechanics: Theory and Experiment 2020 (12), pp. 124001. External Links: Cited by: §1, §2.3.2, §2.3.2.
-  (2021) Universal scaling laws in the gradient descent training of neural networks. External Links: Cited by: §2.3.2.
-  (2004) Learning curves for stochastic gradient descent in linear feedforward networks. In Advances in Neural Information Processing Systems, S. Thrun, L. Saul, and B. Schölkopf (Eds.), Vol. 16, pp. . External Links: Cited by: §1, §2.2, §2.3.1.
-  (2019) Data-dependence of plateau phenomenon in learning with neural network — statistical mechanical analysis. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. External Links: Cited by: §1.
-  (2020) An analysis of constant step size sgd in the non-convex regime: asymptotic normality and bias. External Links: Cited by: §1.
-  (2017) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530. Cited by: §1.
Appendix A
A.1 Proof of Theorem 2.1
Let $v_t = w_t - w^*$ represent the difference between the current and optimal weights and define the correlation matrix for this difference
$$C_t = \left\langle v_t v_t^\top \right\rangle.$$
Using stochastic gradient descent, $v_{t+1} = v_t - \frac{\eta}{B} \sum_{\mu=1}^{B} g_\mu$ with gradient vector $g_\mu = \psi_\mu \psi_\mu^\top v_t$, the matrix $C_t$ satisfies the recursion
$$C_{t+1} = \left\langle \left( I - \frac{\eta}{B} \sum_{\mu=1}^{B} \psi_\mu \psi_\mu^\top \right) C_t \left( I - \frac{\eta}{B} \sum_{\nu=1}^{B} \psi_\nu \psi_\nu^\top \right) \right\rangle.$$
First, note that since $\psi_1, \dots, \psi_B$ are all independently sampled at timestep $t$, we can break up the average into the fresh batch of samples and an average over $v_t$:
$$C_{t+1} = C_t - \eta \left( \Sigma C_t + C_t \Sigma \right) + \frac{\eta^2}{B^2} \sum_{\mu,\nu=1}^{B} \left\langle \psi_\mu \psi_\mu^\top C_t \, \psi_\nu \psi_\nu^\top \right\rangle.$$
The last term requires computation of fourth moments. First, consider the case where $\mu \neq \nu$: independence of the two samples gives $\Sigma C_t \Sigma$ for each of the $B(B-1)$ such pairs. Letting $\mu = \nu$, we need to compute terms of the form
$$\left\langle \psi \psi^\top C_t \, \psi \psi^\top \right\rangle.$$
For Gaussian random vectors, we resort to the Wick-Isserlis theorem for the fourth moment
$$\left\langle \psi_i \psi_j \psi_k \psi_l \right\rangle = \Sigma_{ij} \Sigma_{kl} + \Sigma_{ik} \Sigma_{jl} + \Sigma_{il} \Sigma_{jk}.$$
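The matrix form of this identity can be spot-checked numerically. A minimal sketch, assuming only a zero-mean Gaussian vector with covariance `Sigma` and a symmetric test matrix `A` (the dimension and sample count are arbitrary choices of this sketch):

```python
import numpy as np

# Monte Carlo check of the Wick-Isserlis identity: for psi ~ N(0, Sigma)
# and symmetric A,
#   E[psi psi^T A psi psi^T] = 2 Sigma A Sigma + Sigma tr(A Sigma).
rng = np.random.default_rng(0)
d, n = 3, 500_000
L = rng.standard_normal((d, d))
Sigma = L @ L.T                                  # a random covariance matrix
A = rng.standard_normal((d, d))
A = (A + A.T) / 2                                # symmetrize the test matrix

psi = rng.standard_normal((n, d)) @ L.T          # samples with covariance Sigma
s = np.einsum('ni,ij,nj->n', psi, A, psi)        # psi^T A psi, one scalar per sample
empirical = np.einsum('n,ni,nj->ij', s, psi, psi) / n
exact = 2 * Sigma @ A @ Sigma + Sigma * np.trace(A @ Sigma)

rel_err = np.max(np.abs(empirical - exact)) / np.max(np.abs(exact))
print(rel_err)   # small, shrinking as n grows
```

The sample mean converges to the Wick prediction at the usual $n^{-1/2}$ Monte Carlo rate.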
This correlation structure for $\psi$ implies that its covariance has the form
$$\left\langle \psi \psi^\top C_t \, \psi \psi^\top \right\rangle = 2 \Sigma C_t \Sigma + \Sigma \operatorname{tr}\left( \Sigma C_t \right).$$
Using the formula for $\left\langle \psi \psi^\top C_t \, \psi \psi^\top \right\rangle$, we arrive at the following recursion relation for $C_t$
$$C_{t+1} = C_t - \eta \left( \Sigma C_t + C_t \Sigma \right) + \eta^2 \frac{B+1}{B} \Sigma C_t \Sigma + \frac{\eta^2}{B} \Sigma \operatorname{tr}\left( \Sigma C_t \right).$$
Since we are ultimately interested in the generalization error $\mathcal{L}_t = \frac{1}{2} \operatorname{tr}\left( \Sigma C_t \right)$, it suffices to track the evolution of the diagonal elements $c_k(t) = (C_t)_{kk}$ in the eigenbasis of $\Sigma$, where $\Sigma = \operatorname{diag}(\lambda_1, \dots, \lambda_N)$:
$$c_k(t+1) = \left( 1 - 2\eta\lambda_k + \eta^2 \frac{B+1}{B} \lambda_k^2 \right) c_k(t) + \frac{\eta^2}{B} \lambda_k \sum_{\ell} \lambda_\ell \, c_\ell(t).$$
Vectorizing this equation for $\mathbf{c}(t)$ generates the following solution
$$\mathbf{c}(t) = A^t \, \mathbf{c}(0), \qquad A = \operatorname{diag}\!\left( 1 - 2\eta\lambda_k + \eta^2 \tfrac{B+1}{B} \lambda_k^2 \right) + \frac{\eta^2}{B} \, \boldsymbol{\lambda} \boldsymbol{\lambda}^\top.$$
To get the generalization error, we merely compute $\mathcal{L}_t = \frac{1}{2} \boldsymbol{\lambda}^\top \mathbf{c}(t) = \frac{1}{2} \boldsymbol{\lambda}^\top A^t \mathbf{c}(0)$, as desired.
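The passage from the matrix recursion to its vectorized diagonal form can be sanity-checked numerically. A minimal sketch, assuming SGD on a noiseless linear model with fresh Gaussian batches of size $B$; the recursion coefficients below follow from the Wick-Isserlis fourth moment and are an assumption of this sketch rather than a verbatim transcription of the theorem:

```python
import numpy as np

# For SGD with batch size B, learning rate eta, and feature covariance
# Sigma, the error covariance C_t = <v_t v_t^T> obeys a closed matrix
# recursion. Working in the eigenbasis of Sigma, its diagonal decouples and
# vectorizes as c(t+1) = A c(t). We verify the two forms agree.
eta, B, T = 0.05, 4, 50
lam = np.array([1.0, 0.5, 0.1])          # eigenvalues of Sigma
Sigma = np.diag(lam)
C0 = np.diag([1.0, 2.0, 3.0])            # initial error covariance

# Full matrix recursion
Cm = C0.copy()
for _ in range(T):
    Cm = (Cm - eta * (Sigma @ Cm + Cm @ Sigma)
          + eta**2 * (B + 1) / B * Sigma @ Cm @ Sigma
          + eta**2 / B * Sigma * np.trace(Sigma @ Cm))

# Vectorized diagonal recursion c(t+1) = A c(t)
A = np.diag(1 - 2 * eta * lam + eta**2 * (B + 1) / B * lam**2) \
    + eta**2 / B * np.outer(lam, lam)
c = np.linalg.matrix_power(A, T) @ np.diag(C0)

print(np.max(np.abs(np.diag(Cm) - c)))   # agreement up to round-off
```

The generalization error along the trajectory is then `0.5 * lam @ c`, read off from the vectorized solution.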
A.2 Proof of Theorem 2.2
Under the assumption of Gaussian features, the discrepancy $v_t = w_t - w^*$ is a sum of Gaussian random variables and is therefore Gaussian. By again appealing to the Wick-Isserlis theorem, the second moment of the loss $\hat{\mathcal{L}}_t = \frac{1}{2} v_t^\top \Sigma v_t$ can be shown to have the form
$$\left\langle \hat{\mathcal{L}}_t^2 \right\rangle = \frac{1}{4} \left[ \left( \operatorname{tr} \Sigma C_t \right)^2 + 2 \operatorname{tr}\left( \Sigma C_t \Sigma C_t \right) \right].$$
Decomposing $C_t$ in the eigenbasis of $\Sigma$, we find
$$\operatorname{tr}\left( \Sigma C_t \Sigma C_t \right) = \sum_{k,\ell} \lambda_k \lambda_\ell \, C_{k\ell}(t)^2.$$
The diagonal elements can be solved for as in the proof of Theorem 2.1, while the off-diagonal elements all decouple and satisfy
$$C_{k\ell}(t+1) = \left( 1 - \eta \left( \lambda_k + \lambda_\ell \right) + \eta^2 \frac{B+1}{B} \lambda_k \lambda_\ell \right) C_{k\ell}(t), \qquad k \neq \ell.$$
Thus the total variance takes the form
$$\operatorname{Var} \hat{\mathcal{L}}_t = \left\langle \hat{\mathcal{L}}_t^2 \right\rangle - \left\langle \hat{\mathcal{L}}_t \right\rangle^2 = \frac{1}{2} \sum_{k,\ell} \lambda_k \lambda_\ell \, C_{k\ell}(t)^2.$$
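The second-moment formula for a Gaussian quadratic form can be verified by a brute-force Isserlis expansion. A small deterministic sketch (the dimension and matrices are arbitrary choices of this sketch):

```python
import numpy as np

# For v ~ N(0, C) and symmetric S,
#   E[(v^T S v)^2] = (tr SC)^2 + 2 tr(SCSC),
# so Var[(1/2) v^T S v] = (1/2) tr(SCSC). We expand the fourth moment of v
# term by term with the Isserlis formula and compare to the closed form.
rng = np.random.default_rng(1)
d = 4
M = rng.standard_normal((d, d))
C = M @ M.T                                   # covariance of v
S = np.diag(rng.uniform(0.1, 1.0, d))         # spectrum matrix (diagonal)

second = 0.0
for i in range(d):
    for j in range(d):
        for k in range(d):
            for l in range(d):
                # Isserlis: E[v_i v_j v_k v_l] as a sum over pairings
                m4 = C[i, j] * C[k, l] + C[i, k] * C[j, l] + C[i, l] * C[j, k]
                second += S[i, j] * S[k, l] * m4
closed = np.trace(S @ C) ** 2 + 2 * np.trace(S @ C @ S @ C)
print(abs(second - closed))   # ~0 up to round-off
```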
A.3 Proof of Theorem 2.3
We rotate all of the feature vectors into the eigenbasis of the covariance, generating diagonalized features $\tilde\psi = U^\top \psi$ with $\langle \tilde\psi \tilde\psi^\top \rangle = \Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_N)$, and introduce the following fourth moment tensor
$$\kappa_{ijkl} = \left\langle \tilde\psi_i \tilde\psi_j \tilde\psi_k \tilde\psi_l \right\rangle.$$
We redefine $C_t$ in an appropriate basis
$$\tilde C_t = U^\top C_t U,$$
where $C_t = \langle v_t v_t^\top \rangle$, so that $\tilde C_t$'s dynamics take the form
$$\tilde C_{t+1} = \tilde C_t - \eta \left( \Lambda \tilde C_t + \tilde C_t \Lambda \right) + \frac{\eta^2}{B^2} \sum_{\mu,\nu=1}^{B} \left\langle \tilde\psi_\mu \tilde\psi_\mu^\top \tilde C_t \, \tilde\psi_\nu \tilde\psi_\nu^\top \right\rangle.$$
The elements of the final matrix have the form
$$\left\langle \tilde\psi \tilde\psi^\top \tilde C_t \, \tilde\psi \tilde\psi^\top \right\rangle_{ij} = \sum_{k,l} \kappa_{iklj} \, (\tilde C_t)_{kl}$$
for each of the $B$ terms with $\mu = \nu$, while the $B(B-1)$ terms with $\mu \neq \nu$ give $\Lambda \tilde C_t \Lambda$. We thus generate the following dynamics for the diagonal elements $c_i(t) = (\tilde C_t)_{ii}$ (for diagonal $\tilde C_t$)
$$c_i(t+1) = \left( 1 - 2\eta\lambda_i + \eta^2 \frac{B-1}{B} \lambda_i^2 \right) c_i(t) + \frac{\eta^2}{B} \sum_j \kappa_{ijji} \, c_j(t).$$
Let $\tilde\kappa_{ij} = \kappa_{ijji}$ and $\tilde A = \operatorname{diag}\!\left( 1 - 2\eta\lambda_i + \eta^2 \tfrac{B-1}{B} \lambda_i^2 \right) + \frac{\eta^2}{B} \tilde\kappa$, then we have
$$\mathbf{c}(t) = \tilde A^t \, \mathbf{c}(0).$$
Solving these dynamics for $\mathbf{c}(t)$, recognizing that $\mathcal{L}_t = \frac{1}{2} \boldsymbol{\lambda}^\top \mathbf{c}(t)$, and taking an inner product against $\boldsymbol{\lambda}$ gives the desired result.
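As a consistency check, specializing the fourth moment tensor to its Gaussian value, $\kappa_{ijji} = \lambda_i \lambda_j + 2 \lambda_i^2 \delta_{ij}$ by Wick-Isserlis, should recover the Gaussian propagator of Theorem 2.1. A sketch under the assumed forms of the two propagator matrices:

```python
import numpy as np

# With the Gaussian fourth moment kappa_tilde[i,j] = lam_i lam_j
# + 2 lam_i^2 delta_{ij}, the general recursion matrix
#   diag(1 - 2 eta lam + eta^2 (B-1)/B lam^2) + eta^2/B kappa_tilde
# should coincide with the Gaussian one,
#   diag(1 - 2 eta lam + eta^2 (B+1)/B lam^2) + eta^2/B lam lam^T.
eta, B = 0.1, 8
lam = np.array([2.0, 1.0, 0.25])
kt = np.outer(lam, lam) + 2 * np.diag(lam**2)    # Gaussian kappa_{ijji}

A_general = np.diag(1 - 2 * eta * lam + eta**2 * (B - 1) / B * lam**2) \
    + eta**2 / B * kt
A_gauss = np.diag(1 - 2 * eta * lam + eta**2 * (B + 1) / B * lam**2) \
    + eta**2 / B * np.outer(lam, lam)

print(np.max(np.abs(A_general - A_gauss)))   # zero up to round-off
```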
A.4 Power Law Scalings in Small Learning Rate Limit
By either taking a small learning rate or a large batch size, the test loss dynamics reduce to the test loss obtained from gradient descent on the population loss. In this section, we consider the small learning rate limit $\eta \to 0$, where the average test loss follows
$$\mathcal{L}(t) = \frac{1}{2} \sum_k \lambda_k \bar w_k^2 \, e^{-2\eta\lambda_k t}.$$
Under the assumption that the eigenvalue and target function power spectra both follow power laws $\lambda_k \sim k^{-b}$ and $\lambda_k \bar w_k^2 \sim k^{-a}$, the loss can be approximated by an integral over all modes
$$\mathcal{L}(t) \approx \int_1^{\infty} \mathrm{d}k \, k^{-a} \, e^{-2\eta t k^{-b}}.$$
We identify the function $f(k) = a \ln k + 2\eta t k^{-b}$ and proceed with Laplace's method. This consists of Taylor expanding $f$ around its minimum $k^*$ to second order and computing a Gaussian integral
$$\int \mathrm{d}k \, e^{-f(k)} \approx e^{-f(k^*)} \sqrt{\frac{2\pi}{f''(k^*)}}.$$
We must identify the $k^*$ which minimizes $f$. The interpretation of this value is that it indexes the mode which dominates the error at a large time $t$. The first order condition gives
$$f'(k^*) = \frac{a}{k^*} - 2 b \, \eta t \, (k^*)^{-b-1} = 0 \implies k^* = \left( \frac{2 b \, \eta t}{a} \right)^{1/b}.$$
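The resulting large-time scaling $\mathcal{L}(t) \sim t^{-(a-1)/b}$ can be checked by evaluating the mode integral directly. A sketch assuming the integrand $k^{-a} e^{-2\eta t k^{-b}}$ and these exponent conventions (both assumptions of this sketch):

```python
import numpy as np

# Numerically integrate L(t) ~ int_1^inf k^{-a} exp(-2 eta t k^{-b}) dk
# on a log-spaced grid and estimate the power-law exponent from two large
# times; Laplace's method predicts a slope of -(a - 1) / b.
a, b, eta = 2.0, 1.0, 1.0

def loss(t, kmax=1e6, n=400_000):
    k = np.logspace(0.0, np.log10(kmax), n)
    y = k**(-a) * np.exp(-2 * eta * t * k**(-b))
    return np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(k))   # trapezoid rule

t1, t2 = 100.0, 400.0
slope = (np.log(loss(t2)) - np.log(loss(t1))) / (np.log(t2) - np.log(t1))
print(slope)   # approx -(a - 1) / b = -1
```

For $a = 2$, $b = 1$ the integral can also be done in closed form, $\mathcal{L}(t) \propto (1 - e^{-2\eta t})/(2\eta t)$, which makes the $t^{-1}$ decay explicit.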