The Implicit Bias of Minima Stability in Multivariate Shallow ReLU Networks

by Mor Shpigel Nacson, et al.

We study the type of solutions to which stochastic gradient descent converges when used to train a single hidden-layer multivariate ReLU network with the quadratic loss. Our results are based on a dynamical stability analysis. In the univariate case, it was shown that linearly stable minima correspond to network functions (predictors) whose second derivative has a bounded weighted L^1 norm. Notably, the bound gets smaller as the step size increases, implying that training with a large step size leads to "smoother" predictors. Here we generalize this result to the multivariate case, showing that a similar result applies to the Laplacian of the predictor. We demonstrate the tightness of our bound on the MNIST dataset, and show that it accurately captures the behavior of the solutions as a function of the step size. Additionally, we prove a depth separation result on the approximation power of ReLU networks corresponding to stable minima of the loss. Specifically, although shallow ReLU networks are universal approximators, we prove that stable shallow networks are not: there is a function that cannot be well-approximated by stable single hidden-layer ReLU networks trained with a non-vanishing step size. In contrast, the same function can be realized as a stable two hidden-layer ReLU network. Finally, we prove that if a function is sufficiently smooth (in a Sobolev sense), then it can be approximated arbitrarily well using shallow ReLU networks that correspond to stable solutions of gradient descent.
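The notion of linear stability used in the abstract can be illustrated with a toy example. The sketch below is not the paper's analysis or its bound. It only shows the standard fact that plain gradient descent on a one-dimensional quadratic with curvature c stays near the minimum when the step size is at most 2/c, and diverges otherwise. The function names `gd_on_quadratic` and `is_linearly_stable`, and the chosen curvature values, are assumptions made for illustration.

```python
def gd_on_quadratic(curvature, step_size, x0=1.0, num_steps=200):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2.

    Returns the distance |x| from the minimum (x = 0) after num_steps updates.
    """
    x = x0
    for _ in range(num_steps):
        # The gradient of f at x is curvature * x.
        x = x - step_size * curvature * x
    return abs(x)


def is_linearly_stable(curvature, step_size):
    """Each GD step multiplies x by (1 - step_size * curvature).

    The minimum is linearly stable when that factor has magnitude at most 1,
    which is equivalent to curvature <= 2 / step_size.
    """
    return abs(1.0 - step_size * curvature) <= 1.0


step_size = 0.1  # stability threshold on the curvature is 2 / 0.1 = 20

# A "flat" minimum (curvature below the threshold): iterates contract.
flat_distance = gd_on_quadratic(curvature=5.0, step_size=step_size)

# A "sharp" minimum (curvature above the threshold): iterates blow up.
sharp_distance = gd_on_quadratic(curvature=25.0, step_size=step_size)

print("flat minimum stable: ", is_linearly_stable(5.0, step_size), flat_distance)
print("sharp minimum stable:", is_linearly_stable(25.0, step_size), sharp_distance)
```

Raising the step size lowers the curvature threshold 2 / step_size, so fewer minima remain stable. This is the one-dimensional intuition behind the abstract's statement that larger step sizes lead to smoother predictors.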




Gradient descent provably escapes saddle points in the training of shallow ReLU networks

Dynamical systems theory has recently been applied in optimization to pr...

Learning ReLU Networks on Linearly Separable Data: Algorithm, Optimality, and Generalization

Neural networks with ReLU activations have achieved great empirical succ...

Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process

We consider deep networks, trained via stochastic gradient descent to mi...

The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent

In this paper, we study the implicit regularization of stochastic gradie...

Exact Mean Square Linear Stability Analysis for SGD

The dynamical stability of optimization methods at the vicinity of minim...

A Function Space View of Bounded Norm Infinite Width ReLU Nets: The Multivariate Case

A key element of understanding the efficacy of overparameterized neural ...

Spurious Local Minima of Shallow ReLU Networks Conform with the Symmetry of the Target Model

We consider the optimization problem associated with fitting two-layer R...
