DeepAI AI Chat
Log In Sign Up

On the Effect of Initialization: The Scaling Path of 2-Layer Neural Networks

by   Sebastian Neumayer, et al.

In supervised learning, the regularization path is sometimes used as a convenient theoretical proxy for the optimization path of gradient descent initialized with zero. In this paper, we study a modification of the regularization path for infinite-width 2-layer ReLU neural networks with non-zero initial distribution of the weights at different scales. By exploiting a link with unbalanced optimal transport theory, we show that, despite the non-convexity of the 2-layer network training, this problem admits an infinite dimensional convex counterpart. We formulate the corresponding functional optimization problem and investigate its main properties. In particular, we show that as the scale of the initialization ranges between 0 and +∞, the associated path interpolates continuously between the so-called kernel and rich regimes. The numerical experiments confirm that, in our setting, the scaling path and the final states of the optimization path behave similarly even beyond these extreme points.


page 18

page 19


Convergence and Implicit Regularization Properties of Gradient Descent for Deep Residual Networks

We prove linear convergence of gradient descent to a global minimum for ...

Training Integrable Parameterizations of Deep Neural Networks in the Infinite-Width Limit

To theoretically understand the behavior of trained deep neural networks...

All Local Minima are Global for Two-Layer ReLU Neural Networks: The Hidden Convex Optimization Landscape

We are interested in two-layer ReLU neural networks from an optimization...

Path Regularization: A Convexity and Sparsity Inducing Regularization for Parallel ReLU Networks

Despite several attempts, the fundamental mechanisms behind the success ...

Fast Convex Optimization for Two-Layer ReLU Networks: Equivalent Model Classes and Cone Decompositions

We develop fast algorithms and robust software for convex optimization o...

Kernel and Rich Regimes in Overparametrized Models

A recent line of work studies overparametrized neural networks in the "k...

Feature selection with gradient descent on two-layer networks in low-rotation regimes

This work establishes low test error of gradient flow (GF) and stochasti...