
AdaLoss: A computationally-efficient and provably convergent adaptive gradient method

by   Xiaoxia Wu, et al.

We propose a computationally friendly adaptive learning rate schedule, "AdaLoss", which directly uses the value of the loss function to adjust the stepsize in gradient descent methods. We prove that this schedule enjoys linear convergence for linear regression. Moreover, we provide a linear convergence guarantee in the non-convex regime, in the context of two-layer over-parameterized neural networks. If the width of the first hidden layer in the two-layer network is sufficiently large (polynomially), then AdaLoss converges robustly to the global minimum in polynomial time. We numerically verify the theoretical results and extend the scope of the numerical experiments by considering applications in LSTM models for text classification and policy gradients for control problems.
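The abstract describes a stepsize schedule driven by the loss value rather than by accumulated gradient norms (as in AdaGrad). A minimal sketch of this idea on linear regression is below; the accumulator update `b2 += alpha * loss` and the hyperparameter names are assumptions reconstructed from the abstract, not the paper's exact formulation:

```python
import numpy as np

def adaloss_gd(X, y, eta=1.0, b0=1.0, alpha=1.0, steps=500):
    """Gradient descent with an AdaLoss-style adaptive stepsize.

    The denominator b_t accumulates loss values, so the effective
    stepsize eta / b_t shrinks while the loss is large and stabilizes
    as the loss vanishes. This is a hypothetical reconstruction from
    the abstract; consult the paper for the exact update rule.
    """
    n, d = X.shape
    w = np.zeros(d)
    b2 = b0 ** 2
    for _ in range(steps):
        resid = X @ w - y
        loss = 0.5 * np.mean(resid ** 2)
        grad = X.T @ resid / n
        b2 += alpha * loss               # loss-driven accumulator
        w -= eta / np.sqrt(b2) * grad    # adaptive stepsize
    return w

# Usage: noiseless over-determined least squares, where linear
# convergence to the global minimum is expected.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))
w_true = rng.standard_normal(5)
y = X @ w_true
w_hat = adaloss_gd(X, y)
```

Because the loss decays geometrically in this well-conditioned setting, the accumulator `b2` converges to a finite limit and the stepsize does not vanish, which is what allows the linear (geometric) convergence rate claimed in the abstract.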
