Training Overparametrized Neural Networks in Sublinear Time

08/09/2022
by Hang Hu, et al.

The success of deep learning comes at a tremendous computational and energy cost, and the scalability of training massively overparametrized neural networks is becoming a real barrier to the progress of AI. Despite the popularity and low cost per iteration of traditional backpropagation via gradient descent, SGD has a prohibitive convergence rate in non-convex settings, both in theory and in practice. To mitigate this cost, recent works have proposed to employ alternative (Newton-type) training methods with a much faster convergence rate, albeit with a higher cost per iteration. For a typical neural network with m = poly(n) parameters and an input batch of n datapoints in ℝ^d, the previous work of [Brand, Peng, Song, and Weinstein, ITCS'2021] requires ∼ mnd + n^3 time per iteration. In this paper, we present a novel training method that requires only m^{1-α} nd + n^3 amortized time in the same overparametrized regime, where α ∈ (0.01, 1) is some fixed constant. This method relies on a new and alternative view of neural networks, as a set of binary search trees, where each iteration corresponds to modifying a small subset of the nodes in the tree. We believe this view would have further applications in the design and analysis of DNNs.
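To make the tree view concrete, here is a minimal sketch, not the paper's actual construction: assume that, for one datapoint x, the m neuron pre-activations ⟨w_r, x⟩ are stored at the leaves of a binary tree whose internal nodes keep subtree maxima. Under that assumption, changing a few neuron weights touches only O(log m) tree nodes, and the set of fired (positive pre-activation) neurons can be enumerated by pruning subtrees whose maximum is non-positive. The class ActivationTree and its method names below are illustrative, not from the paper.

# Sketch only: one binary tree per datapoint x, leaves hold <w_r, x>,
# internal nodes hold subtree maxima (an assumed, simplified invariant).
import numpy as np

class ActivationTree:
    def __init__(self, W: np.ndarray, x: np.ndarray):
        """W: (m, d) neuron weights, x: (d,) datapoint."""
        self.m = W.shape[0]
        self.size = 1
        while self.size < self.m:
            self.size *= 2
        # tree[1] is the root; leaves live at indices size .. size + m - 1.
        self.tree = np.full(2 * self.size, -np.inf)
        self.tree[self.size:self.size + self.m] = W @ x
        for i in range(self.size - 1, 0, -1):
            self.tree[i] = max(self.tree[2 * i], self.tree[2 * i + 1])
        self.x = x

    def update_neuron(self, r: int, w_new: np.ndarray) -> None:
        """Refresh one neuron's pre-activation; only O(log m) nodes change."""
        i = self.size + r
        self.tree[i] = w_new @ self.x
        i //= 2
        while i >= 1:
            self.tree[i] = max(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2

    def fired_neurons(self) -> list:
        """Indices r with <w_r, x> > 0, found by pruning non-positive subtrees."""
        out, stack = [], [1]
        while stack:
            i = stack.pop()
            if self.tree[i] <= 0:
                continue  # no leaf in this subtree has a positive value
            if i >= self.size:
                out.append(i - self.size)  # leaf: neuron index
            else:
                stack.extend((2 * i, 2 * i + 1))
        return out

In such a sketch one would keep one tree per datapoint; when a Newton-type step changes only a small subset of the neurons, each tree is repaired with a few update_neuron calls instead of recomputing all m·n inner products.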


Related research:

- 05/28/2019: A Gram-Gauss-Newton Method Learning Overparameterized Deep Neural Networks for Regression Problems. "First-order methods such as stochastic gradient descent (SGD) are curren..."
- 05/30/2019: On the Convergence of Memory-Based Distributed SGD. "Distributed stochastic gradient descent (DSGD) has been widely used for ..."
- 06/20/2020: Training (Overparametrized) Neural Networks in Near-Linear Time. "The slow convergence rate and pathological curvature issues of first-ord..."
- 09/07/2023: Convergence Analysis of Decentralized ASGD. "Over the last decades, Stochastic Gradient Descent (SGD) has been intens..."
- 10/09/2021: Does Preprocessing Help Training Over-parameterized Neural Networks? "Deep neural networks have achieved impressive performance in many areas...."
- 03/08/2020: Generative Adversarial Imitation Learning with Neural Networks: Global Optimality and Convergence Rate. "Generative adversarial imitation learning (GAIL) demonstrates tremendous..."
- 05/17/2019: Adaptively Truncating Backpropagation Through Time to Control Gradient Bias. "Truncated backpropagation through time (TBPTT) is a popular method for l..."
