Demystifying the Global Convergence Puzzle of Learning Over-parameterized ReLU Nets in Very High Dimensions

06/05/2022
by Peng He, et al.

This theoretical paper develops a rigorous theory that demystifies the global convergence phenomenon in a challenging scenario: learning over-parameterized Rectified Linear Unit (ReLU) nets on very high-dimensional datasets under very mild assumptions. A major ingredient of our analysis is a fine-grained study of random activation matrices. The essential virtue of dissecting activation matrices is that it bridges the dynamics of optimization with the angular distribution of the data in high-dimensional space. This angle-based analysis leads to asymptotic characterizations of the gradient norm and the directional curvature of the objective function at each gradient descent iteration, revealing that the empirical loss function enjoys favorable geometric properties in the over-parameterized setting. Along the way, we significantly improve existing theoretical bounds on both the over-parameterization condition and the learning rate, under very mild assumptions, for learning very high-dimensional data. Moreover, we uncover the role of the geometric and spectral properties of the input data in determining the required over-parameterization size and the global convergence rate. All these clues allow us to discover a novel geometric picture of nonconvex optimization in deep learning: angular distribution in high-dimensional data space ↦ spectra of over-parameterized activation matrices ↦ favorable geometric properties of the empirical loss landscape ↦ global convergence phenomenon. Furthermore, our theoretical results imply that gradient-based nonconvex optimization algorithms enjoy much stronger statistical guarantees, under much milder over-parameterization conditions than existing theory states, when learning very high-dimensional data, a regime that has rarely been explored so far.
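To make the link between angular distribution and activation-matrix spectra concrete, the sketch below uses the standard two-layer ReLU setup common in this line of work, not necessarily the paper's exact construction: random Gaussian first-layer weights, unit-norm inputs, and the well-known expectation H[i, j] = x_i·x_j (π − θ_ij)/(2π) of the pairwise activation overlap, where θ_ij is the angle between x_i and x_j. The network width m, dimension d, and variable names are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (assumed setup, not the paper's exact construction):
# for a two-layer ReLU net with random Gaussian weights w_1, ..., w_m,
# the activation patterns A[r, i] = 1{w_r^T x_i > 0} tie the angular
# distribution of the data to the spectrum of the Gram matrix
#   H[i, j] = x_i^T x_j * (pi - theta_ij) / (2*pi),
# where theta_ij is the angle between inputs x_i and x_j.

rng = np.random.default_rng(0)
n, d, m = 50, 1000, 4096                         # n samples, dimension d, width m
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm inputs

W = rng.standard_normal((m, d))                  # random first-layer weights
A = (W @ X.T > 0).astype(float)                  # activation patterns, shape (m, n)

# Empirical Gram matrix induced by the random activation patterns
H_emp = (X @ X.T) * (A.T @ A) / m

# Analytic expectation, determined purely by the pairwise angles theta_ij
cos_theta = np.clip(X @ X.T, -1.0, 1.0)
theta = np.arccos(cos_theta)
H_inf = cos_theta * (np.pi - theta) / (2 * np.pi)

print("max |H_emp - H_inf| =", np.abs(H_emp - H_inf).max())
print("lambda_min(H_inf)   =", np.linalg.eigvalsh(H_inf).min())
```

As the width m grows, the empirical Gram matrix concentrates around its angle-determined expectation; its smallest eigenvalue is the kind of spectral quantity that such analyses use to control the gradient norm and curvature along the gradient descent trajectory.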


