Gradient Descent on Two-layer Nets: Margin Maximization and Simplicity Bias

10/26/2021
by Kaifeng Lyu, et al.

The generalization mystery of overparametrized deep nets has motivated efforts to understand how gradient descent (GD) converges to low-loss solutions that generalize well. Real-life neural networks are initialized from small random values and trained with cross-entropy loss for classification (unlike the "lazy" or "NTK" regime, where analysis has been more successful), and a recent sequence of results (Lyu and Li, 2020; Chizat and Bach, 2020; Ji and Telgarsky, 2020) provides theoretical evidence that GD may converge to the "max-margin" solution with zero loss, which presumably generalizes well. However, the global optimality of the margin has been proved only in settings where the nets are infinitely or exponentially wide. The current paper establishes this global optimality for two-layer Leaky ReLU nets trained with gradient flow on linearly separable and symmetric data, regardless of the width. The analysis also provides some theoretical justification for recent empirical findings (Kalimeris et al., 2019) on the so-called simplicity bias of GD towards linear or other "simple" classes of solutions, especially early in training. On the pessimistic side, the paper suggests that such results may be fragile: a simple data manipulation can make gradient flow converge to a linear classifier with suboptimal margin.
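To make the setting concrete, here is a minimal sketch (not the paper's code) of the training setup it analyzes: a two-layer Leaky ReLU net trained by full-batch gradient descent on the logistic loss, from small random initialization, on symmetric, linearly separable toy data. The width, step size, initialization scale, and data distribution below are illustrative assumptions; the small step size stands in for gradient flow, and the printed quantity is the normalized margin min_i y_i f(x_i) / ||theta||^2, the natural margin measure for a 2-homogeneous net.

import numpy as np

rng = np.random.default_rng(0)

# Symmetric, linearly separable toy data in R^2: for every sample (x, y),
# the reflected sample (-x, -y) is also in the dataset.
n = 100
X_half = rng.normal(loc=[2.0, 0.5], scale=0.5, size=(n // 2, 2))
X = np.vstack([X_half, -X_half])
y = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])

m = 50            # hidden width (illustrative; the margin result is claimed regardless of width)
alpha = 0.1       # Leaky ReLU slope
scale = 1e-4      # small initialization, away from the lazy/NTK regime
W = scale * rng.normal(size=(m, 2))   # first-layer weights
a = scale * rng.normal(size=m)        # second-layer weights

lr = 0.05         # small step size as a stand-in for gradient flow
for step in range(100001):
    pre = X @ W.T                                # (n, m) pre-activations
    act = np.where(pre > 0, pre, alpha * pre)    # Leaky ReLU
    out = act @ a                                # f(x_i) for each sample
    margins = y * out
    # logistic loss: mean_i log(1 + exp(-y_i f(x_i)))
    s = 1.0 / (1.0 + np.exp(margins))            # sigmoid(-y_i f(x_i))
    dout = -(y * s) / n                          # dL/df for each sample
    da = act.T @ dout                            # gradient w.r.t. second layer
    dpre = np.outer(dout, a) * np.where(pre > 0, 1.0, alpha)
    dW = dpre.T @ X                              # gradient w.r.t. first layer
    W -= lr * dW
    a -= lr * da
    if step % 20000 == 0:
        loss = np.mean(np.log1p(np.exp(-margins)))
        # the net is 2-homogeneous, so normalize the margin by the squared parameter norm
        norm_margin = margins.min() / (np.sum(W ** 2) + np.sum(a ** 2))
        print(f"step {step:6d}  loss {loss:.4f}  normalized margin {norm_margin:.4f}")

In runs of this kind one would expect the normalized margin to keep creeping upward long after the training loss is near zero (the implicit margin maximization the paper studies), with the early iterates behaving much like a linear classifier (the simplicity bias).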


