
Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning

by Chaoyue Liu, et al.

The success of deep learning is due, to a great extent, to the remarkable effectiveness of gradient-based optimization methods applied to large neural networks. In this work we isolate some general mathematical structures that allow for efficient optimization in over-parameterized systems of non-linear equations, a setting that includes deep neural networks. In particular, we show that optimization problems corresponding to such systems are not convex, even locally, but instead satisfy the Polyak-Łojasiewicz (PL) condition, which allows for efficient optimization by gradient descent or SGD. We connect the PL condition of these systems to the condition number associated with the tangent kernel and develop a non-linear theory parallel to classical analyses of over-parameterized linear equations. We discuss how these ideas apply to training shallow and deep neural networks. Finally, we point out that tangent kernels associated with certain large systems may be far from constant, even locally. Yet our analysis still allows us to demonstrate the existence of solutions and the convergence of gradient descent and SGD.
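For intuition, the PL condition states that (1/2)·‖∇L(w)‖² ≥ μ·(L(w) − L*) for some μ > 0. It does not require convexity, yet it still guarantees linear (geometric) convergence of gradient descent. A minimal numerical sketch of this point (my own illustration, not code from the paper), using the standard textbook example f(x) = x² + 3 sin²(x), which is non-convex but satisfies the PL condition:

```python
import math

# f is non-convex (its second derivative changes sign) but satisfies the
# PL condition with some mu > 0, with unique global minimum f* = 0 at x = 0.
def f(x):
    return x**2 + 3.0 * math.sin(x)**2

def grad_f(x):
    # d/dx [x^2 + 3 sin^2(x)] = 2x + 3 sin(2x)
    return 2.0 * x + 3.0 * math.sin(2.0 * x)

x = 2.0    # arbitrary starting point
lr = 0.1   # step size below 2/L for the smoothness constant L = 8
losses = [f(x)]
for _ in range(200):
    x -= lr * grad_f(x)
    losses.append(f(x))

# Despite non-convexity, gradient descent drives the loss to f* = 0
# at a geometric rate, as the PL condition predicts.
print(losses[0], losses[-1])
```

The same mechanism is what the paper exploits at scale: for over-parameterized systems, the smallest eigenvalue of the tangent kernel plays the role of the PL constant μ.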

