On Convergence of Training Loss Without Reaching Stationary Points

10/12/2021
by Jingzhao Zhang, et al.

It is a well-known fact that nonconvex optimization is computationally intractable in the worst case. As a result, theoretical analyses of optimization algorithms such as gradient descent often focus on local convergence to stationary points, where the gradient norm is zero or negligible. In this work, we examine the disconnect between existing theoretical analyses of gradient-based algorithms and actual practice. Specifically, we provide numerical evidence that in large-scale neural network training, such as ResNet models trained on ImageNet and Transformer-XL models trained on WikiText-103 (WT103), the network weights do not converge to stationary points where the gradient of the loss function vanishes. Remarkably, however, we observe that while the weights do not converge to stationary points, the value of the loss function does converge. Inspired by this observation, we propose a new perspective based on the ergodic theory of dynamical systems: we prove convergence of the distribution of weight values to an approximate invariant measure, without smoothness assumptions, which explains this phenomenon. We further discuss how this perspective can better align the theory with empirical observations.
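The empirical claim above hinges on tracking two quantities side by side during training: the training loss and the full-batch gradient norm. Below is a minimal sketch, assuming a PyTorch setup, of how such monitoring could be done. The tiny MLP, synthetic data, and hyperparameters are hypothetical stand-ins for the large-scale ResNet/ImageNet and Transformer-XL/WT103 experiments reported in the paper; this is not the authors' code, only an illustration of the measurement.

```python
# Sketch (illustrative only): train a small MLP with SGD on synthetic data and
# log both the full-batch training loss and the full-batch gradient norm.
# The paper's observation is that, in large-scale training, the loss curve
# flattens while the gradient norm stays bounded away from zero.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(512, 20)                      # synthetic inputs (stand-in data)
y = torch.randint(0, 2, (512,))               # synthetic binary labels

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

for epoch in range(200):
    # One epoch of mini-batch SGD over a random permutation of the data.
    perm = torch.randperm(X.size(0))
    for i in range(0, X.size(0), 64):
        idx = perm[i:i + 64]
        opt.zero_grad()
        loss_fn(model(X[idx]), y[idx]).backward()
        opt.step()

    # Full-batch loss and gradient norm after the epoch: if the iterates were
    # approaching a stationary point, the gradient norm would shrink toward 0.
    opt.zero_grad()
    full_loss = loss_fn(model(X), y)
    full_loss.backward()
    grad_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
    if epoch % 20 == 0:
        print(f"epoch {epoch:3d}  loss {full_loss.item():.4f}  "
              f"||grad|| {grad_norm.item():.4f}")
```

On this toy problem both curves may simply decay; the point of the sketch is only the measurement recipe, not a reproduction of the large-scale phenomenon the paper documents.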


