Unique Properties of Wide Minima in Deep Networks

02/11/2020
by Rotem Mulayoff et al.

It is well known that (stochastic) gradient descent has an implicit bias towards wide minima. In deep neural network training, this bias effectively screens out sharp minima. However, its precise effect on the trained network is not yet fully understood. In this paper, we characterize the wide minima of linear neural networks trained with a quadratic loss. First, we show that linear ResNets with zero initialization necessarily converge to the widest of all minima. We then prove that these minima correspond to nearly balanced networks, in which the gain from the input to any intermediate representation does not change drastically from one layer to the next. Finally, we show that consecutive layers in wide-minima solutions are coupled: one of the left singular vectors of each weight matrix equals one of the right singular vectors of the next matrix. This forms a distinct path from input to output which, as we show, is dedicated to the signal that experiences the largest end-to-end gain. Experiments indicate that these properties are characteristic of both linear and nonlinear models trained in practice.
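As a rough illustration of the last two properties, here is a minimal numerical sketch (not the authors' code): it trains a toy three-layer linear network with full-batch gradient descent from a small, roughly balanced random initialization on a quadratic loss, then checks that the per-layer gains are nearly equal and that the top left singular vector of each layer aligns with the top right singular vector of the next. All dimensions, the learning rate, the iteration count, and the variable names are arbitrary choices made for this illustration.

```python
# Minimal sketch (assumed setup, not the paper's code): deep linear network,
# quadratic loss, full-batch gradient descent from a small random init.
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data for the least-squares loss 0.5 * ||W_L...W_1 X - Y||^2 / n.
d_in, d_out, n = 6, 4, 200
X = rng.standard_normal((d_in, n))
A_true = rng.standard_normal((d_out, d_in))
Y = A_true @ X

# Three-layer linear network: y_hat = W[2] @ W[1] @ W[0] @ x, small-scale init.
widths = [d_in, 8, 8, d_out]
W = [0.1 * rng.standard_normal((widths[i + 1], widths[i])) for i in range(3)]

lr = 0.01
for step in range(30000):
    # Forward pass, caching every intermediate representation.
    acts = [X]
    for Wi in W:
        acts.append(Wi @ acts[-1])
    err = acts[-1] - Y
    # Manual backprop for the quadratic loss, then a gradient step.
    g = err / n
    grads = [None] * len(W)
    for i in reversed(range(len(W))):
        grads[i] = g @ acts[i].T
        g = W[i].T @ g
    for i in range(len(W)):
        W[i] -= lr * grads[i]

Y_hat = W[2] @ W[1] @ W[0] @ X
print("final loss:", 0.5 * np.sum((Y_hat - Y) ** 2) / n)

# Near balancedness: the top singular value (gain) of every layer should come
# out roughly equal at the solution gradient descent reaches.
print("per-layer top singular values:",
      [round(float(np.linalg.svd(Wi, compute_uv=False)[0]), 3) for Wi in W])

# Layer coupling: the top left singular vector of W[i] should align (up to
# sign) with the top right singular vector of W[i+1].
for i in range(len(W) - 1):
    U, _, _ = np.linalg.svd(W[i])
    _, _, Vh = np.linalg.svd(W[i + 1])
    print(f"|<u1(W{i}), v1(W{i+1})>| =", round(float(abs(U[:, 0] @ Vh[0])), 3))
```

Gradient descent from a small, roughly balanced initialization is the regime in which these flat-minimum properties are expected to show up most cleanly, which is why the sketch initializes all layers at the same small scale; the nonlinear case is examined experimentally in the paper rather than with a toy script like this.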
