Learning Two-Layer Neural Networks, One (Giant) Step at a Time

05/29/2023
by Yatin Dandi, et al.

We study the training dynamics of shallow neural networks, investigating the conditions under which a limited number of large-batch gradient descent steps can facilitate feature learning beyond the kernel regime. We compare the influence of batch size with that of multiple (but finitely many) steps. Our analysis of a single-step process reveals that while a batch size of n = O(d) enables feature learning, it is only adequate for learning a single direction, i.e. a single-index model. In contrast, n = O(d^2) is essential for learning multiple directions and specialization. Moreover, we demonstrate that "hard" directions, which lack the first ℓ Hermite coefficients, remain unobserved and require a batch size of n = O(d^ℓ) to be captured by gradient descent. Upon iterating a few steps, the picture changes: a batch size of n = O(d) is enough to learn new target directions spanning the subspace linearly connected in the Hermite basis to the previously learned directions, thereby realizing a staircase property. Our analysis relies on a blend of techniques related to concentration, projection-based conditioning, and Gaussian equivalence, which are of independent interest. By determining the conditions necessary for learning and specialization, our results highlight the interaction between batch size and number of iterations, and lead to a hierarchical picture in which learning performance exhibits a stairway to accuracy over time and batch size, shedding new light on feature learning in neural networks.
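
To make the setting concrete, below is a minimal NumPy sketch of the single giant-step protocol described in the abstract: a two-layer student with a frozen second layer takes one large-batch, large-learning-rate gradient step on its first-layer weights, and we then measure how much the neurons align with the teacher's hidden direction. The specific teacher, activation function, and learning-rate scaling are illustrative assumptions for this sketch, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

d, p = 256, 128          # input dimension, hidden width
n = 4 * d                # batch size n = O(d): enough for an "easy" direction
eta = p * d              # "giant" learning-rate scaling; an illustrative choice

# Single-index teacher (hypothetical choice): first Hermite coefficient is
# nonzero, so one giant step with n = O(d) samples should pick up theta.
theta = rng.standard_normal(d) / np.sqrt(d)   # hidden direction, ||theta|| ~ 1

def teacher(X):
    z = X @ theta
    return z + (z**2 - 1)       # He_1 + He_2 of the hidden projection

def act(z):                     # smooth activation
    return np.tanh(z)

def act_prime(z):
    return 1.0 - np.tanh(z)**2

# Two-layer student f(x) = (1/p) * a^T act(W x / sqrt(d)), second layer fixed
W = rng.standard_normal((p, d))
a = rng.choice([-1.0, 1.0], size=p)

# One giant gradient step on W under squared loss, averaged over the batch
X = rng.standard_normal((n, d))
y = teacher(X)

pre = X @ W.T / np.sqrt(d)                    # (n, p) pre-activations
out = act(pre) @ a / p                        # (n,)  student outputs
err = out - y                                 # (n,)  residuals

# dL/dW_{jk} = (1/n) sum_i err_i * (a_j/p) * act'(pre_ij) * X_ik / sqrt(d)
grad_W = (act_prime(pre) * err[:, None] * a[None, :] / p).T @ X / (n * np.sqrt(d))

W -= eta * grad_W

# Overlap of each updated neuron with the hidden direction: a macroscopic
# overlap after one step is the signature of feature learning beyond the
# kernel regime (at initialization it is only of order 1/sqrt(d)).
overlap = np.abs(W @ theta) / (np.linalg.norm(W, axis=1) * np.linalg.norm(theta))
print(f"mean |cos(w_j, theta)| after one giant step: {overlap.mean():.3f}")
```

Varying n from O(d) to O(d^2) in this sketch, or replacing the teacher with one whose first ℓ Hermite coefficients vanish, gives a hands-on way to probe the batch-size thresholds discussed above.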
