Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?

12/25/2018
by Samet Oymak, et al.

Many modern learning tasks involve fitting nonlinear models to data and training them in an overparameterized regime, where the number of model parameters exceeds the size of the training dataset. Due to this overparameterization, the training loss may have infinitely many global minima, and it is critical to understand the properties of the solutions found by first-order optimization schemes such as (stochastic) gradient descent starting from different initializations. In this paper we demonstrate that when the loss has certain properties over a minimally small neighborhood of the initial point, first-order methods such as (stochastic) gradient descent exhibit a few intriguing properties: (1) the iterates converge at a geometric rate to a global optimum even when the loss is nonconvex, (2) among all global optima of the loss, the iterates converge to one with near-minimal distance to the initial point, and (3) the iterates take a nearly direct route from the initial point to this global optimum. As part of our proof technique, we introduce a new potential function that captures the precise tradeoff between the loss and the distance to the initial point as the iterations progress. For Stochastic Gradient Descent (SGD), we develop novel martingale techniques that guarantee SGD never leaves a small neighborhood of the initialization, even with rather large learning rates. We demonstrate the utility of our general theory for a variety of problem domains ranging from low-rank matrix recovery to neural network training. Underlying our analysis are novel insights that may have implications for the training and generalization of more sophisticated learning problems, including those involving deep neural network architectures.
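
The behavior described in the abstract is easy to visualize on a toy instance. Below is a minimal NumPy sketch, not the authors' code: it runs plain gradient descent on an overparameterized one-hidden-layer tanh network (the width, step size, data dimensions, and random seed are arbitrary illustrative choices) and prints the training loss alongside the distance of the weights from their initialization, so one can watch the loss shrink geometrically while the iterates remain in a small neighborhood of the initial point.

    # Minimal sketch (illustrative only): gradient descent on an overparameterized
    # one-hidden-layer tanh network. Loss is driven toward zero while the trained
    # weights stay close to their initialization.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d, k = 20, 20, 500                 # n samples, input dim d, hidden width k (k*d >> n)
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)

    W = rng.standard_normal((k, d)) / np.sqrt(d)       # trained hidden-layer weights
    v = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)   # fixed output weights
    W0 = W.copy()
    eta = 0.05                                         # step size (illustrative)

    def predict(W):
        """One-hidden-layer tanh network: f_i(W) = v . tanh(W x_i)."""
        return np.tanh(X @ W.T) @ v

    for t in range(1001):
        r = predict(W) - y                             # residuals
        S = 1.0 - np.tanh(X @ W.T) ** 2                # tanh'(X W^T), shape (n, k)
        G = ((r[:, None] * S) * v).T @ X               # gradient of 0.5*||r||^2 w.r.t. W
        W -= eta * G
        if t % 250 == 0:
            print(f"iter {t:4d}  loss {0.5 * np.sum(r**2):.3e}  "
                  f"||W - W0||_F {np.linalg.norm(W - W0):.3f}")

In the lazy regime this toy model sits in, a typical run shows the loss dropping by several orders of magnitude while ||W - W0||_F stays a small fraction of ||W0||_F, which is the qualitative behavior the paper analyzes.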

Related research

11/21/2018 - Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks
We study the problem of training deep neural networks with Rectified Lin...

06/10/2019 - Stochastic Mirror Descent on Overparameterized Nonlinear Models: Convergence, Implicit Regularization, and Generalization
Most modern learning problems are highly overparameterized, meaning that...

08/28/2020 - Predicting Training Time Without Training
We tackle the problem of predicting the number of optimization steps tha...

02/11/2020 - Unique Properties of Wide Minima in Deep Networks
It is well known that (stochastic) gradient descent has an implicit bias...

12/04/2020 - Effect of the initial configuration of weights on the training and function of artificial neural networks
The function and performance of neural networks is largely determined by...

06/26/2023 - Black holes and the loss landscape in machine learning
Understanding the loss landscape is an important problem in machine lear...

06/06/2019 - Bad Global Minima Exist and SGD Can Reach Them
Several recent works have aimed to explain why severely overparameterize...