Learning Two-Layer Neural Networks, One (Giant) Step at a Time

by Yatin Dandi, et al.

We study the training dynamics of shallow neural networks, investigating the conditions under which a limited number of large-batch gradient descent steps can facilitate feature learning beyond the kernel regime. We compare the influence of batch size with that of multiple (but finitely many) steps. Our analysis of a single-step process reveals that while a batch size of n = O(d) enables feature learning, it is only adequate for learning a single direction, that is, a single-index model. In contrast, n = O(d^2) is essential for learning multiple directions and for specialization. Moreover, we demonstrate that “hard” directions, which lack the first ℓ Hermite coefficients, remain unobserved and require a batch size of n = O(d^ℓ) to be captured by gradient descent. Upon iterating a few steps, the scenario changes: a batch size of n = O(d) is enough to learn new target directions spanning the subspace linearly connected in the Hermite basis to the previously learned directions, thereby exhibiting a staircase property. Our analysis uses a blend of techniques related to concentration, projection-based conditioning, and Gaussian equivalence that are of independent interest. By determining the conditions necessary for learning and specialization, our results highlight the interaction between batch size and number of iterations, and lead to a hierarchical picture in which learning performance exhibits a stairway to accuracy over time and batch size, shedding new light on feature learning in neural networks.
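
To make the single-step claim concrete, the following is a minimal NumPy sketch of one large-batch gradient step on the first layer of a two-layer ReLU network trained on a single-index target. It is an illustration under stated assumptions, not the paper's experimental setup: the target g(z) = z + (z^2 - 1), the width p, the batch multiplier, the learning rate scaled with p, and all function names are hypothetical choices made here. The quantity reported is the mean absolute cosine between each first-layer weight vector and the target direction, before and after the step.

```python
import numpy as np


def hermite_target(z):
    # Assumed target link: He_1(z) + He_2(z). Its first Hermite coefficient
    # is nonzero, so the direction is "easy" in the abstract's sense.
    return z + (z**2 - 1.0)


def one_giant_step(d=64, p=32, batch_mult=16, step_scale=5.0, seed=0):
    """One full-batch gradient step on the first layer; returns overlaps."""
    rng = np.random.default_rng(seed)
    n = batch_mult * d                              # batch size proportional to d

    w_star = rng.standard_normal(d)
    w_star /= np.linalg.norm(w_star)                # unit-norm target direction

    W = rng.standard_normal((p, d)) / np.sqrt(d)    # first layer, rows of norm ~1
    a = rng.choice([-1.0, 1.0], size=p)             # second layer, kept fixed

    X = rng.standard_normal((n, d))                 # Gaussian inputs
    y = hermite_target(X @ w_star)                  # single-index labels

    pre = X @ W.T                                   # pre-activations, shape (n, p)
    f = np.maximum(pre, 0.0) @ a / p                # network output, shape (n,)
    resid = f - y

    # Gradient of (1 / 2n) * sum_i resid_i^2 with respect to W, shape (p, d).
    grad = ((resid[:, None] * (pre > 0)) * a[None, :]).T @ X / (n * p)

    # "Giant" step: learning rate proportional to the width p (an assumption).
    W_new = W - step_scale * p * grad

    def mean_abs_cos(M):
        # Average |cosine| between each row of M and the target direction.
        return float(np.mean(np.abs(M @ w_star) / np.linalg.norm(M, axis=1)))

    return mean_abs_cos(W), mean_abs_cos(W_new)


if __name__ == "__main__":
    before, after = one_giant_step()
    print(f"mean |cos| before: {before:.3f}, after: {after:.3f}")
```

With these illustrative settings, the overlap with the target direction should rise well above its initial value after the single step. Replacing the target with one whose first Hermite coefficient vanishes (for example, z^2 - 1 alone) is a simple way to probe the "hard direction" case the abstract describes.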


