Learning Curves for SGD on Structured Features

06/04/2021 · by Blake Bordelon, et al. · Harvard University

The generalization performance of a machine learning algorithm such as a neural network depends in a non-trivial way on the structure of the data distribution. Models of generalization in machine learning theory often ignore the low-dimensional structure of natural signals, either by considering data-agnostic bounds or by studying the performance of the algorithm when trained on uncorrelated features. To analyze the influence of data structure on test loss dynamics, we study an exactly solvable model of stochastic gradient descent (SGD) which predicts test loss when training on features with arbitrary covariance structure. We solve the theory exactly for both Gaussian and arbitrary features, and we show that the simpler Gaussian model accurately predicts the test loss of nonlinear random-feature models and deep neural networks trained with SGD on real datasets such as MNIST and CIFAR-10. We show that modeling the geometry of the data in the induced feature space is indeed crucial to accurately predicting the test error throughout learning.


1 Introduction

Due to the challenge of modeling the structure of realistic data, theoretical studies of generalization often attempt to derive data-agnostic generalization bounds or study the typical performance of an algorithm on simple data distributions. The first set of theories derive bounds based on the complexity or capacity of the function class and often struggle to explain the success of modern learning systems, which generalize well on real data but are sufficiently powerful to fit random noise [41, 58]. Rather than exploring data-independent worst-case performance, it is often useful to analyze how algorithms generalize typically, or on average, over a stipulated data distribution [20]. A typical assumption made in this style of analysis is that the data distribution possesses a high degree of symmetry, for instance that the data follow a factorized probability distribution across input variables [2]. For example, spherical cow models treat data vectors as drawn from an isotropic Gaussian distribution or uniformly from the sphere, while Boolean hypercube models treat data as random binary vectors. Models which study such simplified data distributions have been employed in several classic and recent studies exploring the capacity of supervised learning algorithms and associative memory [22, 30], overfitting peaks and phase transitions in learning [27, 42, 17, 1, 39, 31], and neural network training dynamics [3].

Rather than being distributed isotropically throughout the ambient space, realistic datasets often lie on low-dimensional structures. For example, MNIST and CIFAR-10 concentrate near surfaces whose intrinsic dimension is far smaller than the ambient pixel dimension [53]. Incorporating data manifold structure into models of generalization has provided more accurate assessments of classifier capacity [15, 14], nonlinear function approximation [53, 23, 10, 11, 37, 33, 6], linear network dynamics [51], and two-layer neural network test error [25, 56] on realistic learning problems such as MNIST or CIFAR-10 [35, 34]. The analysis of two-layer networks revealed the importance of modeling the intrinsically low-dimensional latent structure of the data when analyzing learning dynamics. The authors of that study propose a hidden manifold model of the data, where labels are generated by a teacher network which receives the low-dimensional latent variables as input [25].

Of significant practical interest to machine learning theory are the dynamics of the test loss during stochastic gradient descent, which quantify the expected error rate of the model throughout optimization. Several works have provided asymptotic guarantees for the convergence rate of SGD in general settings [46, 49, 52, 48, 13, 19, 57, 4, 26], obtaining worst-case bounds in terms of general assumptions on the structure of the gradient and Hessian of the loss. Tight asymptotic loss scalings have been obtained for SGD on high-dimensional least squares, though only the exponents of the power-law scalings were exactly computed from the feature covariance [8, 18, 45, 21]. Alternatively, several works in the spirit of statistical physics have studied the typical case of SGD, providing exact average test loss expressions for very simple data distributions. These include studies of single-layer [28, 55, 9, 38] and two-layer [50, 16, 24] neural networks as well as shallow Gaussian mixture classification [40]. To understand the average-case performance of SGD in more realistic learning problems, incorporating structural information about realistic data distributions is necessary.

In this paper, we first explore a minimal improvement on the spherical cow approximation by studying an elliptical cow model, where the image of the data under a possibly nonlinear feature map is treated as a Gaussian with a given covariance. We express the generalization error in terms of the induced distribution of nonlinear features, akin to an SGD version of the offline kernel regression theory of recent works [10, 11, 37]. We derive the test error dynamics throughout SGD in terms of the correlation structure in a feature space, such as a wide neural network's initial gradient features [32, 36]. Using this idea, we analyze SGD on random feature models and artificial neural networks trained on MNIST and CIFAR-10. We then analyze the general case where the feature distribution is arbitrary and provide an exact solution for the expected test loss dynamics. This result requires not only the second-moment structure but also all of the fourth moments of the features. From this general theory, one can recover the Gaussian approximation in the limit of small learning rates, large batch sizes, or feature distributions with small fourth-order cumulants. For MNIST and CIFAR-10, we empirically observe that the Gaussian model provides an excellent approximation to the true dynamics due to negligible non-Gaussian effects.

Another novelty of our approach is that it provides learning curves in discrete time and accounts for the minibatch size, allowing us to interpolate between single-sample SGD (batch size one) and gradient descent on the population loss (infinite batch size) by varying the minibatch size. We show how learning rate, minibatch size and data structure interact to determine generalization dynamics, and we examine what the best sampling strategy is for a fixed compute budget.

2 Theoretical Results

2.1 Motivations: Examples of interesting linearized settings

We study stochastic gradient descent on a linear model with parameters w and feature map ψ(x). In this setting we aim to optimize the parameters w to minimize a population loss of the form

\[
L(\mathbf{w}) = \mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}\Big[\big(\mathbf{w}\cdot\boldsymbol{\psi}(\mathbf{x}) - y(\mathbf{x})\big)^2\Big],
\tag{1}
\]

where x are input data vectors associated with a probability distribution p(x), ψ is a (generally nonlinear) feature map, and y(x) is a target function which we can evaluate on training samples. The aim of the present work is to elucidate how this population loss evolves during stochastic gradient descent on w. This simple setting is relevant for understanding many models, including the random feature model [47] and the infinite-width limit of neural networks [32, 5, 36], as we describe below. We derive a formula for the test loss in terms of the eigendecomposition of the feature correlation matrix,

\[
\boldsymbol{\Sigma} \equiv \mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}\big[\boldsymbol{\psi}(\mathbf{x})\boldsymbol{\psi}(\mathbf{x})^\top\big] = \sum_k \lambda_k \mathbf{u}_k \mathbf{u}_k^\top,
\tag{2}
\]

together with the projections of the target function onto the eigenvectors u_k. Our theory predicts the expected test loss, averaged over training sample sequences, in terms of the eigenvalues λ_k and these target projections, revealing how the structure of the data and the learning problem influence test error dynamics during SGD. The theory is quite general, describing the performance of linearized models on arbitrary data distributions, feature maps ψ, and target functions y.

2.1.1 Random Feature Models

Our theory can be used to study popular random feature models on realistic data by constructing a feature map ψ(x) = φ(Θx), with input data x ∈ R^D, projection matrix Θ ∈ R^{N×D} (usually taken to be a random matrix with Gaussian entries), and elementwise nonlinear activation function φ. The random feature model is thus a linear model with covariance structure

\[
\boldsymbol{\Sigma} = \mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}\big[\phi(\boldsymbol{\Theta}\mathbf{x})\,\phi(\boldsymbol{\Theta}\mathbf{x})^\top\big].
\tag{3}
\]

By diagonalizing Σ we can find its eigenvalues λ_k and eigenvectors u_k. These quantities, along with information about the target function, are the inputs to our theory, allowing us to predict learning curves during SGD.
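As a concrete illustration of this pipeline, the following sketch (a minimal example, not the code used for our experiments) builds a random ReLU feature map, estimates the covariance eigendecomposition, and projects an arbitrary target onto the eigenbasis; all dimensions, the Gaussian stand-in inputs, and the synthetic target are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper's experiments use MNIST / CIFAR-10 images instead.
D, N, P = 100, 500, 20000      # input dim, number of random features, samples for moment estimates

Theta = rng.standard_normal((N, D)) / np.sqrt(D)   # random Gaussian projection matrix
X = rng.standard_normal((P, D))                    # stand-in for samples from p(x)
Psi = np.maximum(X @ Theta.T, 0.0)                 # random ReLU features, shape (P, N)

# Empirical feature correlation matrix and its eigendecomposition
Sigma = Psi.T @ Psi / P
lam, U = np.linalg.eigh(Sigma)
lam, U = lam[::-1], U[:, ::-1]                     # sort eigenvalues in decreasing order

# Projections of a target onto the eigenbasis (a random linear target in feature
# space, purely for illustration)
w_bar = rng.standard_normal(N) / np.sqrt(N)
v = U.T @ w_bar                                    # components u_k . w_bar

print("top eigenvalues:", lam[:5])
print("task power in the top modes:", lam[:5] * v[:5] ** 2)
```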

2.1.2 Kernel Methods and Linearized Neural Networks

Wide neural networks behave as linear functions of their parameters around the initialization and as nonlinear functions of the input data [36]. To study such linearized networks with parameters θ and initial parameters θ₀ in the framework of our theory, we interpret w = θ − θ₀ as the displacement of the weights from their initialization. This allows the construction of a nonlinear feature map of the form ψ(x) = ∇_θ f(x; θ)|_{θ=θ₀}, the gradient of the network output at initialization. In this setting it suffices to understand the correlation structure

\[
\boldsymbol{\Sigma} = \mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}\big[\nabla_\theta f(\mathbf{x};\theta_0)\,\nabla_\theta f(\mathbf{x};\theta_0)^\top\big],
\tag{4}
\]

which is simply the Fisher information matrix [44].
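A minimal sketch of how such gradient features can be computed with JAX is given below; the two-layer architecture, widths, and random data are hypothetical placeholders, and the flattened gradient ∇_θ f(x; θ₀) plays the role of the feature map ψ(x).

```python
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

# Tiny network; names and sizes are illustrative, not the architectures used in the paper.
def init_params(key, D=10, H=64):
    k1, k2 = jax.random.split(key)
    return {"W1": jax.random.normal(k1, (H, D)) / jnp.sqrt(D),
            "W2": jax.random.normal(k2, (H,)) / jnp.sqrt(H)}

def f(params, x):
    return params["W2"] @ jnp.tanh(params["W1"] @ x)   # scalar network output

key = jax.random.PRNGKey(0)
params0 = init_params(key)                             # fixed initial parameters theta_0

def gradient_feature(x):
    # psi(x) = grad_theta f(x; theta_0), flattened into a single vector
    grads = jax.grad(f)(params0, x)
    flat, _ = ravel_pytree(grads)
    return flat

X = jax.random.normal(key, (2000, 10))                 # stand-in inputs
Psi = jax.vmap(gradient_feature)(X)                    # (num samples, num parameters)
Sigma = Psi.T @ Psi / Psi.shape[0]                     # empirical gradient-feature correlation, Eq. (4)
print(Sigma.shape)
```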

2.2 Problem Setup

Let ψ(x) ∈ R^N (with N possibly infinite) be feature vectors with correlation structure Σ = ⟨ψ(x)ψ(x)^⊤⟩. During learning, parameters w are updated to estimate a target function y(x) which can be expressed as a linear combination of the features, y(x) = w̄ · ψ(x). At each time step t, the weights are updated by taking a stochastic gradient step on a fresh minibatch of m examples,

\[
\mathbf{w}_{t+1} = \mathbf{w}_t - \frac{\eta}{m}\sum_{\mu=1}^{m} \boldsymbol{\psi}(\mathbf{x}_{t,\mu})\big(\mathbf{w}_t\cdot\boldsymbol{\psi}(\mathbf{x}_{t,\mu}) - y(\mathbf{x}_{t,\mu})\big),
\tag{5}
\]

where each of the m vectors x_{t,μ} is sampled independently from p(x) (the conventional factor of two in the gradient of the squared error is absorbed into η). The learning rate η controls the gradient descent step size while the batch size m controls the quality of the empirical estimate of the gradient at timestep t. At each timestep, the test loss, or generalization error, has the form

\[
L(\mathbf{w}_t) = \mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}\Big[\big(\mathbf{w}_t\cdot\boldsymbol{\psi}(\mathbf{x}) - y(\mathbf{x})\big)^2\Big],
\tag{6}
\]

which quantifies exactly the test error of the weight vector w_t. Note, however, that L(w_t) is a random variable, since w_t depends on the precise history of sampled feature vectors. Our theory, which generalizes the recursive method of Werfel, Xie and Seung [55], allows us to compute the expected test loss ⟨L(w_t)⟩ by averaging over all possible sample sequences. Using a similar technique, we also provide a calculation of the variance of the loss, which quantifies the fluctuations in the learning curve due to stochastic sampling of features.
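The following sketch simulates the update of Equation (5) for Gaussian features with a prescribed covariance and records the exact test loss of Equation (6) at every step; it is a minimal reference implementation under the stated assumptions (zero initialization, learnable target), not the code used for the paper's experiments.

```python
import numpy as np

def sgd_learning_curve(Sigma_sqrt, w_bar, eta, m, steps, seed=0):
    """Simulate the minibatch SGD of Eq. (5) on Gaussian features psi ~ N(0, Sigma)
    with learnable target y(x) = w_bar . psi(x), and return the exact test loss
    L(w_t) = (w_t - w_bar)^T Sigma (w_t - w_bar) at every step."""
    rng = np.random.default_rng(seed)
    N = len(w_bar)
    Sigma = Sigma_sqrt @ Sigma_sqrt.T
    w = np.zeros(N)                                        # zero initialization
    losses = []
    for _ in range(steps):
        delta = w - w_bar
        losses.append(delta @ Sigma @ delta)
        Psi = rng.standard_normal((m, N)) @ Sigma_sqrt.T   # fresh minibatch of Gaussian features
        w -= (eta / m) * Psi.T @ (Psi @ (w - w_bar))       # Eq. (5) with y = w_bar . psi
    return np.array(losses)

# Example usage with an arbitrary diagonal covariance
curve = sgd_learning_curve(np.diag(1.0 / np.sqrt(np.arange(1, 21))),
                           np.ones(20) / np.sqrt(20), eta=0.05, m=2, steps=500)
print(curve[0], curve[-1])
```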

2.3 Learnable and Noise Free Problems: The Elliptical Cow Model

Before studying the general case, we first analyze the setting where the target function is learnable, meaning that there exist weights w̄ such that y(x) = w̄ · ψ(x). We will further assume that the induced feature distribution is Gaussian, so that all moments of ψ can be written in terms of the covariance Σ. We will remove these assumptions in later sections.

Theorem 2.1.

Suppose the features ψ(x) follow a Gaussian distribution with covariance Σ and the target function is learnable in these features, y(x) = w̄ · ψ(x). After t steps of SGD with minibatch size m and learning rate η, starting from w₀ = 0, the expected (over possible sample sequences) test loss has the form

\[
\big\langle L(\mathbf{w}_t) \big\rangle = \boldsymbol{\lambda}^\top \mathbf{A}(\eta, m)^t\, \mathbf{v},
\tag{7}
\]

where λ is a vector containing the eigenvalues of Σ and v is a vector containing the elements v_k = (u_k · w̄)² for eigenvectors u_k of Σ. The matrix A has the form

\[
\mathbf{A}(\eta, m) = \big(\mathbf{I} - \eta\,\mathrm{diag}(\boldsymbol{\lambda})\big)^2 + \frac{\eta^2}{m}\Big(\mathrm{diag}(\boldsymbol{\lambda})^2 + \boldsymbol{\lambda}\boldsymbol{\lambda}^\top\Big),
\tag{8}
\]

where diag(·) constructs a matrix with the argument vector placed along the diagonal.

Proof.

See Appendix A.1. ∎

Below we provide some immediate interpretations of this result.


  • The matrix A can be thought of as containing two components: the matrix (I − η diag(λ))², which represents the time evolution of the loss under average gradient updates, and the remaining batch-size-dependent term (η²/m)(diag(λ)² + λλ^⊤), which arises from fluctuations in the gradients, a consequence of the stochastic sampling process.

  • The test loss obtained when training directly on the population loss can be recovered by taking the minibatch size m → ∞. In this case, A → (I − η diag(λ))² and one obtains the convergence of gradient descent performed directly on the population loss L(w). The population-loss dynamics can also be obtained by considering small learning rates, i.e. the η → 0 limit, where the fluctuation term, being second order in η, becomes negligible.

  • For general η and m, the matrix A is non-diagonal, indicating that the eigenmodes are not learned independently as t increases, as they would be if each error component simply decayed like (1 − ηλ_k)^{2t}; rather, they interact during learning. Thus, we expect non-trivial coupling across eigenmodes at large learning rates or small batch sizes. This is unlike the offline theory of learning in feature spaces, i.e. kernel regression, where errors across eigenmodes were shown to decouple and are learned at different rates [10, 11].

  • Though increasing the batch size m always improves generalization at a fixed number of steps t (proof given in Appendix A.6), learning with a fixed compute budget (number of gradient evaluations) mt can favor smaller batch sizes. We provide an example of this in the next sections and Figure 1 (d)-(f).
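To make the structure of Theorem 2.1 concrete, the sketch below iterates the matrix A of Equation (8) on the task vector v and compares the resulting theory curve with averaged SGD simulations on Gaussian features; the spectrum, learning rate, and batch size are arbitrary illustrative choices rather than the settings used in our figures.

```python
import numpy as np

def theory_learning_curve(lam, v, eta, m, steps):
    """Expected test loss of Theorem 2.1:  <L(t)> = lam^T A^t v,  with
    A = (I - eta*diag(lam))^2 + (eta^2/m) * (diag(lam)^2 + lam lam^T)."""
    lam, v = np.asarray(lam, float), np.asarray(v, float)
    A = np.diag((1 - eta * lam) ** 2) + (eta ** 2 / m) * (np.diag(lam ** 2) + np.outer(lam, lam))
    losses, c = [], v.copy()
    for _ in range(steps):
        losses.append(lam @ c)
        c = A @ c
    return np.array(losses)

N, eta, m, steps = 50, 0.05, 4, 2000
lam = 1.0 / np.arange(1, N + 1) ** 1.5                 # an illustrative power-law spectrum
rng = np.random.default_rng(1)
w_bar = rng.standard_normal(N) / np.sqrt(N)            # target weights in the eigenbasis

theory = theory_learning_curve(lam, w_bar ** 2, eta, m, steps)   # v_k = (u_k . w_bar)^2

# Average several SGD runs directly in the eigenbasis (features psi ~ N(0, diag(lam)))
runs = []
for seed in range(20):
    rng_s = np.random.default_rng(seed)
    w, losses = np.zeros(N), []
    for _ in range(steps):
        d = w - w_bar
        losses.append(d @ (lam * d))                   # L = sum_k lam_k (w_k - wbar_k)^2
        Psi = rng_s.standard_normal((m, N)) * np.sqrt(lam)
        w -= (eta / m) * Psi.T @ (Psi @ (w - w_bar))
    runs.append(losses)

print("theory vs simulation at t = 500:", theory[500], np.mean(runs, axis=0)[500])
```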

We can compute not only the average test loss at time t, but also its variance over the randomly sampled training sequences.

Theorem 2.2.

Assuming Gaussian features and a learnable target function y(x) = w̄ · ψ(x), the variance of the loss at time t is

\[
\mathrm{Var}\,L(\mathbf{w}_t) = 2\,\boldsymbol{\lambda}^\top\big(\mathbf{C}_t \odot \mathbf{C}_t\big)\,\boldsymbol{\lambda},
\tag{9}
\]

where ⊙ denotes the element-wise product (so C_t ⊙ C_t is the element-wise square) and C_t is the correlation matrix of the weight discrepancy w_t − w̄ expressed in the eigenbasis of Σ, whose evolution is governed by the same recursion that defines the expected loss formula.

Proof.

A proof is provided in Appendix A.2. ∎

2.3.1 Special Case 1: Unstructured Isotropic Features

This special case was previously analyzed by Werfel, Xie and Seung [55], who consider isotropic Gaussian features with Σ = I in D dimensions. We extend their result to arbitrary minibatch size m, giving the following learning curve:

\[
\big\langle L(\mathbf{w}_t) \big\rangle = \left[(1-\eta)^2 + \frac{\eta^2 (D+1)}{m}\right]^t \big\langle L(\mathbf{w}_0) \big\rangle,
\tag{10}
\]

which follows from the fact that 1 (the vector of all ones) is an eigenvector of A with eigenvalue (1−η)² + η²(D+1)/m. We therefore find exponential convergence of the generalization error with an effective rate set by this eigenvalue. We can further optimize the effective rate with respect to η to get the optimal convergence rate, giving η* = m/(m+D+1) and

\[
\big\langle L(\mathbf{w}_t) \big\rangle = \left(1 + \frac{m}{D+1}\right)^{-t} \big\langle L(\mathbf{w}_0) \big\rangle.
\tag{11}
\]

Again, we can immediately draw some interesting conclusions from this result:


  • Strong dimension dependence: as D → ∞, we see that, with the optimal choice of η, learning happens at a rate proportional to m/D, so reaching a fixed test loss requires on the order of D/m steps of SGD. This slow rate is due to the necessity of scaling η inversely with the dimension, since the term coming from gradient variance in Equation (10) (the η²(D+1)/m factor) scales linearly with D. Increasing the minibatch size improves the exponential rate by reducing the gradient noise variance; in the large batch limit the effective rate (D+1)/(m+D+1) can be made arbitrarily small.

  • At small m, the convergence at any learning rate is much slower than the convergence in the m → ∞ limit, which does not suffer from a dimensionality dependence due to gradient noise.

  • For a fixed compute budget (total number of gradient evaluations mt), the optimal batch size is m = 1; see Figure 1 (d). This can be shown by differentiating the loss with respect to m at fixed mt (see Appendix A.7).

  • We also note that this feature model has the same rate of convergence for every learnable target function y(x).

In Figure 1 (a) we show theoretical and simulated learning curves for this model for varying D at the optimal learning rate η*, and in Figure 1 (d) we show the loss as a function of minibatch size for a fixed compute budget. In this model of isotropic features, the best minibatch size is m = 1.
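A quick numerical consistency check of this special case: the snippet below compares the dominant eigenvalue of the matrix A of Equation (8) at the optimal learning rate with the closed-form rate of Equation (11); the dimension and batch size are arbitrary.

```python
import numpy as np

D, m = 200, 1
eta_star = m / (m + D + 1)                    # optimal learning rate for isotropic features
lam = np.ones(D)                              # Sigma = I

A = np.diag((1 - eta_star * lam) ** 2) + (eta_star ** 2 / m) * (np.diag(lam ** 2) + np.outer(lam, lam))
rate_matrix = np.linalg.eigvalsh(A)[-1]       # slowest decaying mode of A
rate_closed_form = (D + 1) / (m + D + 1)      # Eq. (11):  <L_t> = rate^t <L_0>
print(rate_matrix, rate_closed_form)          # the two rates should agree
```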

Figure 1 (panels: (a) Isotropic Features, (b) Power Law Features, (c) MNIST Random ReLU Features, (d) Fixed Compute Isotropic, (e) Fixed Compute Power Law, (f) Fixed Compute ReLU MNIST): Isotropic features have qualitatively different learning curves than the power-law features observed in real data. Black dashed lines are theory. (a) Online learning with D-dimensional isotropic features gives a test loss which decays at a rate inversely proportional to D for any target function, indicating that learning requires on the order of D steps of SGD; the optimal learning rates η* are used. (b) Power-law features have non-extensive effective dimensionality and give a power-law scaling of the loss with t. (c) Learning to discriminate MNIST 8's and 9's with random ReLU features generates a power-law scaling at large t, which is both quantitatively and qualitatively different from the scaling predicted by isotropic features. (d)-(f) The loss at a fixed compute budget mt for (d) isotropic features, (e) power-law features and (f) MNIST ReLU random features, with simulations (dots show the average and standard deviation over repeated runs). Intermediate batch sizes can be preferable on power-law features and MNIST.

2.3.2 Special Case 2: Power Laws and Effective Dimensionality

Realistic datasets such as natural images or audio tend to exhibit nontrivial correlation structure, which often results in power-law spectra when the data is projected into a feature space, such as a randomly initialized neural network [53, 10, 11, 6, 54, 8]. In the small learning rate (or large batch) limit, if the feature eigenvalue spectrum and the task power spectrum both follow power laws in the mode index k, then Theorem 2.1 implies that the generalization error also falls as a power law, ⟨L(t)⟩ ≈ c t^{−β}, where c is a constant and the exponent β is determined by the decay exponents of the two spectra. Notably, the predicted exponents recovered as a special case of our theory agree with prior work on SGD with power-law spectra, which gives exponents in terms of the feature correlation structure [8, 18, 54]. As we show in Appendix A.4, this power law can be derived by taking an integral approximation of the population loss and approximating the integral with Laplace's method. After t steps of gradient descent, the error is dominated by the eigenmode whose index minimizes the exponent function appearing in the integrand, which yields the power-law scaling of the test error [7]. We show an example of such a power-law scaling in Figure 1 (b). Notably, since the total variance of the target approaches a finite value as the number of modes grows, the learning curves are relatively insensitive to the ambient dimension and are instead sensitive to the intrinsic dimension of the data manifold. For this model, we find that there can exist optimal batch sizes when the compute budget is fixed (Figure 1 (e)).
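The sketch below evaluates the Theorem 2.1 recursion for power-law feature and task spectra and fits the resulting power-law decay of the loss; the decay exponents, learning rate, and batch size are illustrative choices, not the values behind Figure 1 (b).

```python
import numpy as np

N = 2000
k = np.arange(1, N + 1)
lam = k ** (-1.5)                  # illustrative feature eigenvalue spectrum
v = k ** (-1.2)                    # illustrative target power along each mode

eta, m, steps = 0.5, 8, 20000
A_diag = (1 - eta * lam) ** 2 + (eta ** 2 / m) * lam ** 2   # diagonal part of A
coupling = (eta ** 2 / m) * lam                             # rank-one part (eta^2/m) lam lam^T

ts = np.unique(np.logspace(0, np.log10(steps), 50).astype(int))
c, losses = v.copy(), []
for t in range(steps + 1):
    if t in ts:
        losses.append(lam @ c)
    c = A_diag * c + coupling * (lam @ c)      # c_{t+1} = A c_t without forming the N x N matrix

losses = np.array(losses)
slope = np.polyfit(np.log(ts[10:]), np.log(losses[10:]), 1)[0]
print("fitted power-law exponent of the loss (negative slope):", slope)
```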

For a fixed feature map and data distribution, some target functions will be easier to learn than others, indicating a strong inductive bias. As we discussed in the previous section, the error contributions from each eigenmode decouple in the m → ∞ (or small learning rate) limit. In this limit, each eigendirection k is learned with a different time constant, proportional to 1/(ηλ_k), so the coefficient along direction u_k is learned more quickly when λ_k is large. Noting that the task power along the k-th eigenfunction is the variance of y along that direction, it follows that tasks which have most of their variance in the top eigenspace will be learned rapidly, since their variance is concentrated in feature-space directions with small time constants. Thus, feature maps which give better alignment between the task and the top of the feature spectrum will give better generalization. At time t, the error can be crudely approximated as a tail sum of the remaining variance in the target function over the modes not yet learned. This motivates the use of tail sums to quantify feature and task alignment.

This power-law scaling is of interest not only as an alternative to the isotropic setting, but also because it appears to accurately match the qualitative behavior of wide neural networks trained on realistic data [29, 6, 53, 10]. In Figure 1 (c), we see that the scaling of the loss in a random features model of MNIST is more similar to the power-law setting than to the isotropic features setting, agreeing excellently with our theory. Again, an optimal batch size exists when the compute budget is fixed (Figure 1 (f)). We provide further evidence of the existence of power-law structure in realistic data in Figure 2 (a)-(c), where we provide spectra and test loss learning curves for MNIST and CIFAR-10 on ReLU random features. The eigenvalues and the task power tail sums both follow power laws, generating power-law test loss curves. These learning curves are contrasted with isotropically distributed input data passed through the same ReLU random feature model, and we see that structured data distributions allow much faster learning than unstructured data. Again, our theory predicts the experimental curves accurately across variations in nonlinearity, learning rate, batch size and noise (Figure 2).

Figure 2 (panels: (a) Feature Spectra, (b) Task Power Tail Sum, (c) Learning Curves, (d) MNIST Feature Spectra, (e) Task Power Tail Sum, (f) Learning Curves, (g) Varying Learning Rate, (h) Varying Batch Size, (i) Varying Noise, averaged over 10 trials): Structure in the data distribution, nonlinearity, batch size and learning rate all influence learning curves. (a) ReLU random feature embeddings of MNIST and CIFAR images have very different eigenvalue scalings than spherically isotropic vectors of the same dimension. (b) The task power spectrum decays much faster for MNIST than for random isotropic vectors. (c) Learning curves reveal the data-structure dependence of the test error dynamics; dashed lines are theory curves. (d) The spectra of the random feature map for ReLU and Tanh nonlinearities: the ReLU features have higher effective dimensionality than the Tanh features (slower decay in the eigenvalues). (e) The tail sums of the task power reveal that the top eigenfunctions explain a greater fraction of the variance in the target function for the ReLU random features compared to the Tanh random features. (f) Experimental (solid) and theory (dashed) learning curves for the two models; as expected from the task and feature spectra, the ReLU model obtains a better rate at large t. (g) Increasing the learning rate increases the initial speed of learning but induces large fluctuations in the loss and can be worse at large t. (h) Increasing the batch size alters both the average test loss and its variance. (i) Noise in the target values during training produces an asymptotic error which persists even as t → ∞.

2.4 Arbitrary Induced Feature Distributions: The General Solution

The result in the previous section was proven exactly in the case of Gaussian features (see Appendix A.1). For arbitrary (possibly non-Gaussian) feature distributions, we obtain a slightly more involved result (see Appendix A.3).

Theorem 2.3.

Let ψ(x) be an arbitrary feature map with covariance matrix Σ. After rotating the features into the eigenbasis of Σ, φ(x) = U^⊤ψ(x), introduce the fourth moment tensor

\[
\kappa_{klmn} \equiv \mathbb{E}_{\mathbf{x}\sim p(\mathbf{x})}\big[\phi_k(\mathbf{x})\,\phi_l(\mathbf{x})\,\phi_m(\mathbf{x})\,\phi_n(\mathbf{x})\big],
\tag{12}
\]

where the expectation is taken over the data distribution. Let vec(·) denote the flattening of an N × N matrix into a vector of length N², and let κ̃ represent the flattening of the 4D tensor κ into an N² × N² matrix. Then the expected loss (over sample sequences) is

\[
\big\langle L(\mathbf{w}_t) \big\rangle = \mathrm{vec}\big(\mathrm{diag}(\boldsymbol{\lambda})\big)^\top \mathbf{M}^t\, \mathrm{vec}(\mathbf{C}_0), \qquad \mathbf{M} = (\mathbf{I} - \eta\boldsymbol{\Lambda})\otimes(\mathbf{I} - \eta\boldsymbol{\Lambda}) + \frac{\eta^2}{m}\big(\tilde{\boldsymbol{\kappa}} - \boldsymbol{\Lambda}\otimes\boldsymbol{\Lambda}\big),
\tag{13}
\]

where Λ = diag(λ) and C₀ is the initial weight-discrepancy correlation matrix expressed in the eigenbasis.

We see that the test loss dynamics depend on the second and fourth moments of the features through the quantities λ and κ̃ respectively. This result is exact; however, it requires tracking the evolution of vectors in N² dimensions before computing the final sum over the diagonal entries, rendering it impractical to evaluate for high-dimensional feature maps. We recover the Gaussian result as a special case when κ is a simple weighted sum of three products of Kronecker delta tensors (the Wick decomposition).

The question remains whether the Gaussian approximation will provide an accurate model on realistic data. This is a weak version of the Gaussian equivalence conjecture from random feature model theory [31, 37]. Based on these previous works, we expect that the test loss of the Gaussian model closely tracks the test loss of wide artificial neural networks. We do not provide a proof of this conjecture, but verify its accuracy in empirical experiments on MNIST and CIFAR-10 as shown in Figure 2. In Figure 3, we show that the fourth moment matrix for a ReLU random feature model and its projection along the eigenbasis of the feature covariance is accurately approximated by the equivalent Gaussian model.
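As a rough empirical illustration of this point, the sketch below compares the fourth moments of a small ReLU random feature model against the Wick (Gaussian) prediction built from the same second moments; the dimensions, sample counts, and the centering of the features are simplifying assumptions for the example, not the protocol used for Figure 3.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, P = 30, 16, 20000            # small sizes keep the fourth-moment objects manageable

Theta = rng.standard_normal((N, D)) / np.sqrt(D)
Psi = np.maximum(rng.standard_normal((P, D)) @ Theta.T, 0.0)   # ReLU random features
Psi -= Psi.mean(axis=0)            # center, for a simplified zero-mean comparison

Sigma = Psi.T @ Psi / P
lam, U = np.linalg.eigh(Sigma)
Phi = Psi @ U                      # features rotated into the eigenbasis of Sigma

# Empirical fourth-moment tensor  kappa_{klmn} = E[ phi_k phi_l phi_m phi_n ]
M = (Phi[:, :, None] * Phi[:, None, :]).reshape(P, N * N)
kappa_emp = (M.T @ M / P).reshape(N, N, N, N)

# Gaussian (Wick) prediction built from the same second moments
I = np.eye(N)
kappa_gauss = (np.einsum('k,m,kl,mn->klmn', lam, lam, I, I)
               + np.einsum('k,l,km,ln->klmn', lam, lam, I, I)
               + np.einsum('k,l,kn,lm->klmn', lam, lam, I, I))

rel_dev = np.abs(kappa_emp - kappa_gauss).mean() / np.abs(kappa_gauss).mean()
print("mean deviation relative to the Gaussian fourth-moment scale:", rel_dev)
```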

Figure 3 (panels: (a) Non-Gaussian Effects on MNIST, (b) Non-Gaussian Effects on CIFAR): Non-Gaussian effects are small for random feature models. (a)-(b) The first 20 dimensions of the summed fourth moment matrix are plotted for the Gaussian approximation and for the empirical fourth moment. Differences between the Gaussian approximation and the true fourth moment matrices are visible in this example, but are only a small fraction of the size of the entries.

2.5 Unlearnable or Noise Corrupted Problems

In general, the target function may depend on features which cannot be expressed as linear combinations of the features ψ(x). We model this unlearnable component of the target as noise with variance σ², since it is not expressible with ψ(x). Note that this component does not have to be a deterministic function of x, but can also be a stochastic process which is uncorrelated with ψ(x).

Theorem 2.4.

For a target function with unlearnable variance σ², the expected test loss has the form

\[
\big\langle L(\mathbf{w}_t) \big\rangle = \boldsymbol{\lambda}^\top \mathbf{A}^t \mathbf{v} + \frac{\eta^2\sigma^2}{m}\,\boldsymbol{\lambda}^\top(\mathbf{I}-\mathbf{A})^{-1}\big(\mathbf{I}-\mathbf{A}^t\big)\boldsymbol{\lambda} + \sigma^2,
\tag{14}
\]

which has an asymptotic, irreducible error as t → ∞.

See Appendix A.5 for the proof. The convergence to the asymptotic error is a linear combination of decaying exponentials with rates set by the eigenvalues of A. We note that the loss is not necessarily monotonic in t and can exhibit local maxima, as in Figure 2 (f). This is reminiscent of the sample-wise double descent phenomenon in offline learning curves [39, 42, 11, 17], yet the peaking behavior in this model is limited to linear combinations of decaying exponentials rather than the divergences found in the offline double descent models.

3 Comparing Neural Network Feature Maps

We can utilize our theory to compare how wide neural networks of different depths generalize when trained with SGD on a real dataset. In the limit of infinite width and small learning rates, neural network training and generalization behave as those of linear models in the network parameters. In finite-width neural networks the NTK, which measures the geometry of the parameter gradients across different data points, can evolve in time. However, for sufficiently large widths, finite neural networks have been shown to behave as linear functions of their parameters [36]. To predict test loss dynamics with our theory, it therefore suffices to characterize the geometry of the gradient features ψ(x) = ∇_θ f(x; θ₀). In Figure 4, we show the Neural Tangent Kernel (NTK) eigenspectra and task-power spectra for fully connected neural networks of varying depth, calculated with the Neural Tangents API [43]. We compute the kernel on a subset of randomly sampled MNIST images and estimate the power-law exponents of the kernel eigenvalue spectrum and of the task spectrum. We find that, across architectures, the task spectra are highly similar, but that the kernel eigenvalues decay more slowly for deeper models, corresponding to a smaller decay exponent. As a consequence, deeper neural network models train more quickly during stochastic gradient descent, as we show in Figure 4 (c). After fitting power laws to the eigenvalue spectra and the task power, we compared the true test loss dynamics (color) for a width-500 neural network model with the power-law scalings predicted from the fitted exponents. The predicted scalings from NTK regression accurately describe trained networks at finite width. On CIFAR-10, we compare the scalings of a convolutional model and a standard multi-layer perceptron and find that the convolutional model obtains a better exponent due to its faster decaying task tail sum.
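For reference, NTK spectra and task spectra of the kind analyzed here can be estimated with the Neural Tangents library [43]; the sketch below uses an illustrative fully connected architecture and random stand-in data in place of MNIST, so the widths, depth, and array shapes are assumptions rather than the settings of Figure 4.

```python
import numpy as np
from neural_tangents import stax

# Depth-3 ReLU fully connected architecture (widths are illustrative)
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

# Random stand-in data; in practice X holds flattened, centered images and y the labels.
X = np.random.randn(1000, 784).astype(np.float32)
y = np.sign(np.random.randn(1000, 1)).astype(np.float32)

K = kernel_fn(X, None, 'ntk')              # infinite-width NTK Gram matrix, shape (1000, 1000)
lam, U = np.linalg.eigh(np.array(K) / len(X))
lam, U = lam[::-1], U[:, ::-1]             # decreasing eigenvalue order

# Task power along each kernel eigenvector and its tail sum
v2 = (U.T @ y).squeeze() ** 2
tail = np.cumsum(v2[::-1])[::-1]
print("top NTK eigenvalues:", lam[:5])
print("fraction of task power outside the top 10 modes:", tail[10] / tail[0])
```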

Figure 4 (panels: (a) MNIST NTK Spectra, (b) MNIST Task Spectra, (c) Test Loss Scaling Laws, (d) CIFAR-10 NTK Spectra, (e) CIFAR-10 Task Spectra, (f) Test Loss Scalings): ReLU neural networks of varying depth are trained with SGD on full MNIST. (a)-(b) Feature and task spectra are estimated by diagonalizing the infinite-width NTK matrix on the training data; we fit a simple power law to each of the curves. (c) Experimental test loss during SGD (color) compared to theoretical power-law scalings (dashed). Deeper networks train faster due to the slower decay of their feature eigenspectra, though they have similar task spectra. (d)-(f) The spectra and test loss for convolutional and fully connected networks on CIFAR-10. The convolutional model obtains a better convergence exponent due to its faster decaying task spectrum. The predicted test loss scalings (dashed black) match those observed in experiments (color).

4 Conclusion

By studying a simple model of stochastic gradient descent, we were able to uncover how the geometry of the data in an induced feature space governs the dynamics of the test loss. We derived average learning curves for both Gaussian and general non-Gaussian features and showed the conditions under which the Gaussian approximation is accurate. The proposed model allowed us to explore the role of the data distribution and the neural network architecture in shaping learning curves, demonstrating how the power-law spectra observed in wide neural networks on real data allow an escape from the curse of dimensionality during SGD. We verified our theory with experiments on MNIST and CIFAR-10. In addition, we explored the role of batch size, learning rate, and label noise level on generalization. We found that for a fixed compute budget small minibatch sizes can be best, and that label noise can induce peaks in the average-case test loss, though not as sharp as those in the offline learning case.

Limitations: Though our model successfully incorporates the structure of the data into a prediction of the test loss dynamics, it is limited in that it applies to linearized machine learning models, where one learns a linear combination of nonlinear static features. Thus, our theory's application to artificial neural networks is limited to random feature models, where only the last layer is trained, or to deep networks in the lazy learning regime, where the network acts as a structured and static feature map [12]. In finite-width neural networks, understanding the test loss dynamics during SGD will require coping with the non-convexity of the objective and the time evolution of the gradient features. Adaptive learning rate schedules would also be a fruitful extension of the present work, closing the gap between theory and the optimizers used in practice. We hope that our work can inspire future studies on the structure of the data distribution and its interaction with network architecture in the nonlinear feature-learning regime.

Acknowledgements

We thank the Harvard Data Science Initiative and Harvard Dean’s Competitive Fund for Promising Scholarship for their support. We would also like to thank Jacob Zavatone-Veth for useful discussions and comments on this manuscript.

References

  • [1] M. Advani and S. Ganguli (2016-08) Statistical mechanics of optimal convex inference in high dimensions. 6, pp. 031034. External Links: Document, Link Cited by: §1.
  • [2] M. Advani, S. Lahiri, and S. Ganguli (2013-03) Statistical mechanics of complex neural systems and high dimensional data. 2013 (03), pp. P03014. External Links: ISSN 1742-5468, Link, Document Cited by: §1.
  • [3] M. S. Advani, A. M. Saxe, and H. Sompolinsky (2020) High-dimensional dynamics of generalization error in neural networks. 132, pp. 428–446. External Links: ISSN 0893-6080, Document, Link Cited by: §1.
  • [4] A. Anastasiou, K. Balasubramanian, and M. A. Erdogdu (2019-25–28 Jun) Normal approximation for stochastic gradient descent via non-asymptotic rates of martingale clt. In Proceedings of the Thirty-Second Conference on Learning Theory, A. Beygelzimer and D. Hsu (Eds.), Proceedings of Machine Learning Research, Vol. 99, Phoenix, USA, pp. 115–137. External Links: Link Cited by: §1.
  • [5] S. Arora, S. S. Du, W. Hu, Z. Li, R. Salakhutdinov, and R. Wang (2019) On exact computation with an infinitely wide neural net. Cited by: §2.1.
  • [6] Y. Bahri, E. Dyer, J. Kaplan, J. Lee, and U. Sharma (2021) Explaining neural scaling laws. arXiv preprint arXiv:2102.06701. Cited by: §1, §2.3.2, §2.3.2.
  • [7] C. Bender and S. Orszag (1999-01) Advanced mathematical methods for scientists and engineers: asymptotic methods and perturbation theory. Vol. 1. External Links: ISBN 978-1-4419-3187-0, Document Cited by: §A.4, §2.3.2.
  • [8] R. Berthier, F. Bach, and P. Gaillard (2020) Tight nonparametric convergence rates for stochastic gradient descent under the noiseless linear model. External Links: 2006.08212 Cited by: §1, §2.3.2.
  • [9] M. Biehl and P. Riegler (1994-12) On-line learning with a perceptron. 28 (7), pp. 525–530. External Links: Document, Link Cited by: §1.
  • [10] B. Bordelon, A. Canatar, and C. Pehlevan (2020) Spectrum dependent learning curves in kernel regression and wide neural networks. In International Conference on Machine Learning, pp. 1024–1034. Cited by: §1, §1, 3rd item, §2.3.2, §2.3.2.
  • [11] A. Canatar, B. Bordelon, and C. Pehlevan (2020) Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks. 12, pp. 1–12. Cited by: §1, §1, 3rd item, §2.3.2, §2.5.
  • [12] L. Chizat, E. Oyallon, and F. Bach (2019) On lazy training in differentiable programming. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. . External Links: Link Cited by: §4.
  • [13] K. L. Chung (1954) On a Stochastic Approximation Method. 25 (3), pp. 463 – 483. External Links: Document, Link Cited by: §1.
  • [14] S. Chung, U. Cohen, H. Sompolinsky, and D. D. Lee (2018-10) Learning Data Manifolds with a Cutting Plane Method. 30 (10), pp. 2593–2615. External Links: ISSN 0899-7667, Document, Link, https://direct.mit.edu/neco/article-pdf/30/10/2593/1046637/neco_a_01119.pdf Cited by: §1.
  • [15] S. Chung, D. D. Lee, and H. Sompolinsky (2018-07) Classification and geometry of general perceptual manifolds. 8, pp. 031003. External Links: Document, Link Cited by: §1.
  • [16] Y. L. Cun, I. Kanter, and S. A. Solla (1991-05) Eigenvalues of covariance matrices: application to neural-network learning. Phys. Rev. Lett. 66, pp. 2396–2399. External Links: Document, Link Cited by: §1.
  • [17] S. d'Ascoli, L. Sagun, and G. Biroli (2020) Triple descent and the two kinds of overfitting: where & why do they appear?. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33, pp. 3058–3069. External Links: Link Cited by: §1, §2.5.
  • [18] A. Dieuleveut, N. Flammarion, and F. Bach (2016-02) Harder, better, faster, stronger convergence rates for least-squares regression. 18, pp. . Cited by: §1, §2.3.2.
  • [19] J. C. Duchi and F. Ruan (2021) Asymptotic optimality in stochastic optimization. 49 (1), pp. 21 – 48. External Links: Document, Link Cited by: §1.
  • [20] A. Engel and C. Van den Broeck (2001) Statistical mechanics of learning. Cambridge University Press. External Links: Document Cited by: §1.
  • [21] S. Fischer and I. Steinwart (2020) Sobolev norm learning rates for regularized least-squares algorithms. 21 (205), pp. 1–38. External Links: Link Cited by: §1.
  • [22] E. Gardner and B. Derrida (1988) Optimal storage properties of neural network models. 21, pp. 271–284. Cited by: §1.
  • [23] F. Gerace, B. Loureiro, F. Krzakala, M. Mezard, and L. Zdeborova (2020-13–18 Jul) Generalisation error in learning with random features and the hidden manifold model. In Proceedings of the 37th International Conference on Machine Learning, H. D. III and A. Singh (Eds.), Proceedings of Machine Learning Research, Vol. 119, pp. 3452–3462. External Links: Link Cited by: §1.
  • [24] S. Goldt, M. Advani, A. M. Saxe, F. Krzakala, and L. Zdeborová (2019) Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. . External Links: Link Cited by: §1.
  • [25] S. Goldt, M. Mézard, F. Krzakala, and L. Zdeborová (2020-12) Modeling the influence of data structure on learning in neural networks: the hidden manifold model. 10, pp. 041044. External Links: Document, Link Cited by: §1.
  • [26] M. Gurbuzbalaban, U. Simsekli, and L. Zhu (2020) The heavy-tail phenomenon in sgd. arXiv preprint arXiv:2006.04740. Cited by: §1.
  • [27] J. A. Hertz, A. Krogh, and G. I. Thorbergsson (1989-06) Phase transitions in simple learning. 22 (12), pp. 2133–2150. External Links: Document, Link Cited by: §1.
  • [28] T. M. Heskes and B. Kappen (1991-08) Learning processes in neural networks. 44, pp. 2718–2726. External Links: Document, Link Cited by: §1.
  • [29] J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, Md. M. A. Patwary, Y. Yang, and Y. Zhou (2017) Deep learning scaling is predictable, empirically. External Links: 1712.00409 Cited by: §2.3.2.
  • [30] J. J. Hopfield (1982) Neural networks and physical systems with emergent collective computational abilities. 79 (8), pp. 2554–2558. External Links: Document, ISSN 0027-8424, Link, https://www.pnas.org/content/79/8/2554.full.pdf Cited by: §1.
  • [31] H. Hu and Y. M. Lu (2021) Universality laws for high-dimensional learning with random features. External Links: 2009.07669 Cited by: §1, §2.4.
  • [32] A. Jacot, F. Gabriel, and C. Hongler (2020) Neural tangent kernel: convergence and generalization in neural networks. External Links: 1806.07572 Cited by: §1, §2.1.
  • [33] A. Jacot, B. Şimşek, F. Spadaro, C. Hongler, and F. Gabriel (2020) Kernel alignment risk estimator: risk prediction from training data. External Links: 2006.09796 Cited by: §1.
  • [34] A. Krizhevsky, V. Nair, and G. Hinton () CIFAR-10 (Canadian Institute for Advanced Research). External Links: Link Cited by: §1.
  • [35] Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. Note: http://yann.lecun.com/exdb/mnist/ External Links: Link Cited by: §1.
  • [36] J. Lee, L. Xiao, S. S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington (2020-12) Wide neural networks of any depth evolve as linear models under gradient descent. Journal of Statistical Mechanics: Theory and Experiment 2020 (12), pp. 124002. External Links: ISSN 1742-5468, Link, Document Cited by: §1, §2.1.2, §2.1, §3.
  • [37] B. Loureiro, C. Gerbelot, H. Cui, S. Goldt, F. Krzakala, M. Mézard, and L. Zdeborová (2021) Capturing the learning curves of generic features maps for realistic data sets with a teacher-student model. External Links: 2102.08127 Cited by: §1, §1, §2.4.
  • [38] C. Mace and A. Coolen (1998) Statistical mechanical analysis of the dynamics of learning in perceptrons. 8, pp. 55–88. Cited by: §1.
  • [39] S. Mei and A. Montanari (2020) The generalization error of random features regression: precise asymptotics and double descent curve. External Links: 1908.05355 Cited by: §1, §2.5.
  • [40] F. Mignacco, F. Krzakala, P. Urbani, and L. Zdeborová (2020) Dynamical mean-field theory for stochastic gradient descent in gaussian mixture classification. External Links: 2006.06098 Cited by: §1.
  • [41] M. Mohri, A. Rostamizadeh, and A. Talwalkar (2012) Foundations of machine learning. The MIT Press. External Links: ISBN 026201825X Cited by: §1.
  • [42] P. Nakkiran (2019) More data can hurt for linear regression: sample-wise double descent. External Links: 1912.07242 Cited by: §1, §2.5.
  • [43] R. Novak, L. Xiao, J. Hron, J. Lee, A. A. Alemi, J. Sohl-Dickstein, and S. S. Schoenholz (2020) Neural tangents: fast and easy infinite neural networks in python. In International Conference on Learning Representations, External Links: Link Cited by: §A.8, §3.
  • [44] J. Pennington and P. Worah (2018) The spectrum of the fisher information matrix of a single-hidden-layer neural network. In Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), Vol. 31, pp. . External Links: Link Cited by: §2.1.2.
  • [45] L. Pillaud-Vivien, A. Rudi, and F. Bach (2018) Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. External Links: 1805.10074 Cited by: §1.
  • [46] B. Polyak and A. Juditsky (1992) Acceleration of stochastic approximation by averaging. 30, pp. 838–855. Cited by: §1.
  • [47] A. Rahimi and B. Recht (2008) Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, J. Platt, D. Koller, Y. Singer, and S. Roweis (Eds.), Vol. 20, pp. . External Links: Link Cited by: §2.1.
  • [48] H. Robbins and S. Monro (1951) A Stochastic Approximation Method. 22 (3), pp. 400 – 407. External Links: Document, Link Cited by: §1.
  • [49] D. Ruppert (1988-02) Efficient estimations from a slowly convergent robbins-monro process. pp. . Cited by: §1.
  • [50] D. Saad and S. Solla (1999-04) Dynamics of on-line gradient descent learning for multilayer neural networks. pp. . Cited by: §1.
  • [51] A. M. Saxe, J. L. McClelland, and S. Ganguli (2014) Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §1.
  • [52] A. Shapiro (1989) Asymptotic Properties of Statistical Estimators in Stochastic Programming. 17 (2), pp. 841 – 858. External Links: Document, Link Cited by: §1.
  • [53] S. Spigler, M. Geiger, and M. Wyart (2020-12) Asymptotic learning curves of kernel methods: empirical data versus teacher–student paradigm. Journal of Statistical Mechanics: Theory and Experiment 2020 (12), pp. 124001. External Links: ISSN 1742-5468, Link, Document Cited by: §1, §2.3.2, §2.3.2.
  • [54] M. Velikanov and D. Yarotsky (2021) Universal scaling laws in the gradient descent training of neural networks. External Links: 2105.00507 Cited by: §2.3.2.
  • [55] J. Werfel, X. Xie, and H. Seung (2004) Learning curves for stochastic gradient descent in linear feedforward networks. In Advances in Neural Information Processing Systems, S. Thrun, L. Saul, and B. Schölkopf (Eds.), Vol. 16, pp. . External Links: Link Cited by: §1, §2.2, §2.3.1.
  • [56] Y. Yoshida and M. Okada (2019) Data-dependence of plateau phenomenon in learning with neural network — statistical mechanical analysis. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32, pp. . External Links: Link Cited by: §1.
  • [57] L. Yu, K. Balasubramanian, S. Volgushev, and M. A. Erdogdu (2020) An analysis of constant step size sgd in the non-convex regime: asymptotic normality and bias. External Links: 2006.07904 Cited by: §1.
  • [58] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2017) Understanding deep learning requires rethinking generalization. ArXiv abs/1611.03530. Cited by: §1.

Appendix A Appendix

A.1 Proof of Theorem 2.1

Let Δ_t = w_t − w̄ represent the difference between the current and optimal weights and define the correlation matrix for this difference,

\[
\mathbf{C}_t \equiv \big\langle \boldsymbol{\Delta}_t \boldsymbol{\Delta}_t^\top \big\rangle,
\tag{15}
\]

where the average is over the sequence of sampled minibatches. Under stochastic gradient descent, with empirical second moment matrix Ĥ_t = (1/m) Σ_{μ=1}^m ψ(x_{t,μ})ψ(x_{t,μ})^⊤, the update (5) gives Δ_{t+1} = (I − ηĤ_t)Δ_t, so the matrix C_t satisfies the recursion

\[
\mathbf{C}_{t+1} = \Big\langle (\mathbf{I} - \eta\hat{\mathbf{H}}_t)\,\mathbf{C}_t\,(\mathbf{I} - \eta\hat{\mathbf{H}}_t) \Big\rangle.
\tag{16}
\]

First, note that since the minibatch vectors are all independently sampled at timestep t, we can break up the average into the fresh batch of samples and an average over the history up to time t:

\[
\mathbf{C}_{t+1} = \mathbf{C}_t - \eta\big(\boldsymbol{\Sigma}\mathbf{C}_t + \mathbf{C}_t\boldsymbol{\Sigma}\big) + \eta^2 \big\langle \hat{\mathbf{H}}_t \mathbf{C}_t \hat{\mathbf{H}}_t \big\rangle.
\tag{17}
\]

The last term requires computation of fourth moments of the features:

\[
\big\langle \hat{\mathbf{H}}_t \mathbf{C}_t \hat{\mathbf{H}}_t \big\rangle = \frac{1}{m^2}\sum_{\mu,\nu=1}^{m} \big\langle \boldsymbol{\psi}_\mu \boldsymbol{\psi}_\mu^\top \mathbf{C}_t\, \boldsymbol{\psi}_\nu \boldsymbol{\psi}_\nu^\top \big\rangle
\tag{18}
\]
\[
= \frac{m(m-1)}{m^2}\,\boldsymbol{\Sigma}\mathbf{C}_t\boldsymbol{\Sigma} + \frac{1}{m^2}\sum_{\mu=1}^{m}\big\langle \boldsymbol{\psi}_\mu \boldsymbol{\psi}_\mu^\top \mathbf{C}_t\, \boldsymbol{\psi}_\mu \boldsymbol{\psi}_\mu^\top \big\rangle.
\tag{19}
\]

First, consider the case where μ = ν. Letting ψ = ψ_μ, we need to compute terms of the form

\[
\big\langle \boldsymbol{\psi}\boldsymbol{\psi}^\top \mathbf{C}\, \boldsymbol{\psi}\boldsymbol{\psi}^\top \big\rangle.
\tag{20}
\]

For Gaussian random vectors, we resort to the Wick-Isserlis theorem for the fourth moment,

\[
\langle \psi_k \psi_l \psi_m \psi_n \rangle = \Sigma_{kl}\Sigma_{mn} + \Sigma_{km}\Sigma_{ln} + \Sigma_{kn}\Sigma_{lm},
\tag{21}
\]

giving

\[
\big\langle \boldsymbol{\psi}\boldsymbol{\psi}^\top \mathbf{C}\, \boldsymbol{\psi}\boldsymbol{\psi}^\top \big\rangle = 2\,\boldsymbol{\Sigma}\mathbf{C}\boldsymbol{\Sigma} + \boldsymbol{\Sigma}\,\mathrm{tr}(\boldsymbol{\Sigma}\mathbf{C}).
\tag{22}
\]

This correlation structure for ψ implies that the fourth-moment term has the form

\[
\big\langle \hat{\mathbf{H}}_t \mathbf{C}_t \hat{\mathbf{H}}_t \big\rangle = \Big(1 + \frac{1}{m}\Big)\boldsymbol{\Sigma}\mathbf{C}_t\boldsymbol{\Sigma} + \frac{1}{m}\,\boldsymbol{\Sigma}\,\mathrm{tr}(\boldsymbol{\Sigma}\mathbf{C}_t).
\tag{23}
\]

Using this formula, we arrive at the following recursion relation for C_t:

\[
\mathbf{C}_{t+1} = (\mathbf{I} - \eta\boldsymbol{\Sigma})\,\mathbf{C}_t\,(\mathbf{I} - \eta\boldsymbol{\Sigma}) + \frac{\eta^2}{m}\Big[\boldsymbol{\Sigma}\mathbf{C}_t\boldsymbol{\Sigma} + \boldsymbol{\Sigma}\,\mathrm{tr}(\boldsymbol{\Sigma}\mathbf{C}_t)\Big].
\tag{24}
\]

Since we are ultimately interested in the generalization error ⟨L(w_t)⟩ = tr(ΣC_t), it suffices to track the evolution of the diagonal elements c_k(t) = u_k^⊤ C_t u_k in the eigenbasis of Σ:

\[
c_k(t+1) = (1 - \eta\lambda_k)^2\, c_k(t) + \frac{\eta^2}{m}\Big[\lambda_k^2\, c_k(t) + \lambda_k \sum_j \lambda_j\, c_j(t)\Big].
\tag{25}
\]

Vectorizing this equation for c(t) generates the following solution:

\[
\mathbf{c}(t) = \mathbf{A}(\eta, m)^t\, \mathbf{c}(0), \qquad \mathbf{A}(\eta, m) = \big(\mathbf{I} - \eta\,\mathrm{diag}(\boldsymbol{\lambda})\big)^2 + \frac{\eta^2}{m}\Big(\mathrm{diag}(\boldsymbol{\lambda})^2 + \boldsymbol{\lambda}\boldsymbol{\lambda}^\top\Big).
\tag{26}
\]

The coefficients at initialization are c_k(0) = (u_k · (w_0 − w̄))² = (u_k · w̄)² = v_k for zero initialization. To get the generalization error, we merely compute ⟨L(w_t)⟩ = λ · c(t) = λ^⊤ A^t v, as desired.

A.2 Proof of Theorem 2.2

Under the assumption of Gaussian features, the discrepancy Δ_t = w_t − w̄ is a sum of Gaussian random variables and is therefore treated as Gaussian. By again appealing to the Wick-Isserlis theorem, the second moment of the loss can be shown to have the form

\[
\big\langle L(\mathbf{w}_t)^2 \big\rangle = \big(\mathrm{tr}\,\boldsymbol{\Sigma}\mathbf{C}_t\big)^2 + 2\,\mathrm{tr}\big(\boldsymbol{\Sigma}\mathbf{C}_t\boldsymbol{\Sigma}\mathbf{C}_t\big).
\tag{27}
\]

Decomposing C_t in the eigenbasis of Σ, we find

\[
\mathrm{Var}\,L(\mathbf{w}_t) = \big\langle L(\mathbf{w}_t)^2 \big\rangle - \big\langle L(\mathbf{w}_t) \big\rangle^2
\tag{28}
\]
\[
= 2\sum_{k,l} \lambda_k \lambda_l\, \big[\mathbf{C}_t\big]_{kl}^2.
\tag{29}
\]

The diagonal elements can be solved for as [C_t]_{kk} = [A^t v]_k, exactly as in Appendix A.1, while the off-diagonal elements all decouple and satisfy

\[
\big[\mathbf{C}_{t+1}\big]_{kl} = \Big[(1-\eta\lambda_k)(1-\eta\lambda_l) + \frac{\eta^2}{m}\lambda_k\lambda_l\Big]\big[\mathbf{C}_t\big]_{kl}, \qquad k \neq l.
\tag{30}
\]

Thus the total variance takes the form

\[
\mathrm{Var}\,L(\mathbf{w}_t) = 2\sum_k \lambda_k^2 \big[\mathbf{A}^t\mathbf{v}\big]_k^2 + 2\sum_{k\neq l} \lambda_k\lambda_l \Big[(1-\eta\lambda_k)(1-\eta\lambda_l) + \frac{\eta^2}{m}\lambda_k\lambda_l\Big]^{2t} v_k v_l.
\tag{31}
\]

A.3 Proof of Theorem 2.3

We rotate all of the feature vectors into the eigenbasis of the covariance, generating diagonalized features φ(x) = U^⊤ψ(x) with ⟨φφ^⊤⟩ = Λ ≡ diag(λ), and introduce the following fourth moment tensor:

\[
\kappa_{klmn} \equiv \big\langle \phi_k \phi_l \phi_m \phi_n \big\rangle.
\tag{32}
\]

We redefine the discrepancy correlation matrix in this basis,

\[
\mathbf{C}_t \equiv \big\langle \mathbf{U}^\top \boldsymbol{\Delta}_t \boldsymbol{\Delta}_t^\top \mathbf{U} \big\rangle, \qquad \boldsymbol{\Delta}_t = \mathbf{w}_t - \bar{\mathbf{w}},
\tag{33}
\]

so that C_t's dynamics take the form

\[
\mathbf{C}_{t+1} = (\mathbf{I} - \eta\boldsymbol{\Lambda})\,\mathbf{C}_t\,(\mathbf{I} - \eta\boldsymbol{\Lambda}) + \frac{\eta^2}{m}\Big(\boldsymbol{\kappa}[\mathbf{C}_t] - \boldsymbol{\Lambda}\mathbf{C}_t\boldsymbol{\Lambda}\Big).
\tag{34}
\]

The elements of the final matrix have the form

\[
\big(\boldsymbol{\kappa}[\mathbf{C}]\big)_{kl} = \sum_{m,n} \kappa_{kmnl}\, C_{mn}.
\tag{35}
\]

We thus generate the following linear dynamics for the flattened matrix vec(C_t):

\[
\mathrm{vec}(\mathbf{C}_{t+1}) = \mathbf{M}\,\mathrm{vec}(\mathbf{C}_t).
\tag{36}
\]

Let M = (I − ηΛ) ⊗ (I − ηΛ) + (η²/m)(κ̃ − Λ ⊗ Λ), where κ̃ is the flattening of the fourth moment tensor into an N² × N² matrix; then we have

\[
\mathrm{vec}(\mathbf{C}_t) = \mathbf{M}^t\, \mathrm{vec}(\mathbf{C}_0).
\tag{37}
\]

Solving these dynamics for vec(C_t), recognizing that ⟨L(w_t)⟩ = Σ_k λ_k [C_t]_{kk}, and taking an inner product against vec(diag(λ)) gives the desired result.

A.4 Power-Law Scalings in the Small Learning Rate Limit

By either taking a small learning rate or a large batch size, the test loss dynamics reduce to the test loss obtained from gradient descent on the population loss. In this section, we consider the small learning rate limit, where the average test loss follows

\[
\big\langle L(\mathbf{w}_t) \big\rangle \approx \sum_k \lambda_k v_k\, e^{-2\eta\lambda_k t}.
\tag{38}
\]

Under the assumption that the eigenvalue spectrum and the target function power spectrum both follow power laws in the mode index k, the loss can be approximated by an integral over all modes,

\[
\big\langle L(t) \big\rangle \approx \int_{1}^{\infty} dk\; \lambda(k)\, v(k)\, e^{-2\eta\lambda(k) t}
\tag{39}
\]
\[
= \int_{1}^{\infty} dk\; e^{-g(k)}, \qquad g(k) \equiv 2\eta t\, \lambda(k) - \ln\big(\lambda(k)\, v(k)\big).
\tag{40}
\]

We identify the exponent function g(k) and proceed with Laplace's method [7]. This consists of Taylor expanding g around its minimum k* to second order and computing a Gaussian integral,

\[
\int dk\, e^{-g(k)} \approx e^{-g(k^*)}\sqrt{\frac{2\pi}{g''(k^*)}}.
\tag{41}
\]

We must identify the k* which minimizes g(k). The interpretation of this value is that it indexes the mode which dominates the error at a large time t. The first order condition gives