Adaptive Gradient Methods Converge Faster with Over-Parameterization (and you can do a line-search)

06/11/2020
by Sharan Vaswani, et al.

As adaptive gradient methods are typically used for training over-parameterized models capable of exactly fitting the data, we study their convergence in this interpolation setting. Under this assumption, we prove that constant step-size, zero-momentum variants of Adam and AMSGrad can converge to the minimizer at the O(1/T) rate for smooth, convex functions. When this assumption is only approximately satisfied, we show that these methods converge to a neighbourhood of the solution. On the other hand, we show that Adagrad is robust to the violation of interpolation and converges to the minimizer at the optimal rate. However, we demonstrate that even for simple, convex problems satisfying interpolation, the empirical performance of these methods heavily depends on the step-size and requires tuning. We alleviate this problem by making use of stochastic line-search methods (SLS) and stochastic Polyak step-sizes (SPS) to help these methods adapt to the function's local smoothness. We prove that adaptive methods used in conjunction with these techniques do not require knowledge of problem-dependent constants and retain the convergence guarantees of their constant step-size counterparts. Experimentally, we show that using SLS or SPS consistently improves the convergence of adaptive methods across tasks, from binary classification with kernel mappings to classification with deep neural networks. Furthermore, our empirical results show that Adagrad equipped with SLS generalizes better than SGD.
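To make the idea concrete, here is a minimal sketch (not the authors' reference implementation) of an AdaGrad-style update whose step-size is chosen by a stochastic Armijo backtracking line-search on the sampled mini-batch, in the spirit of SLS. The function names, constants (eta_max, c, beta), and the synthetic over-parameterized least-squares problem are illustrative assumptions.

```python
import numpy as np

def adagrad_sls_step(x, loss_fn, grad_fn, state=None,
                     eta_max=1.0, c=0.1, beta=0.7, eps=1e-8, max_backtracks=50):
    """One zero-momentum AdaGrad step whose step-size is set by a stochastic
    Armijo backtracking line-search on the sampled mini-batch loss."""
    g = grad_fn(x)                                 # mini-batch gradient at x
    state = np.zeros_like(x) if state is None else state
    state = state + g ** 2                         # AdaGrad accumulator of squared gradients
    d = g / (np.sqrt(state) + eps)                 # preconditioned descent direction

    fx, eta = loss_fn(x), eta_max
    # Backtrack until the Armijo condition holds on the same mini-batch:
    #   f_i(x - eta * d) <= f_i(x) - c * eta * <g, d>
    for _ in range(max_backtracks):
        if loss_fn(x - eta * d) <= fx - c * eta * g.dot(d):
            break
        eta *= beta
    return x - eta * d, state


# Toy usage: over-parameterized least-squares, where interpolation holds
# because there are more parameters (100) than equations (50).
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 100))
b = A @ rng.normal(size=100)                       # an exact solution exists

def sample_batch(batch_size=8):
    idx = rng.integers(0, A.shape[0], size=batch_size)
    Ai, bi = A[idx], b[idx]
    loss = lambda w: 0.5 * np.mean((Ai @ w - bi) ** 2)
    grad = lambda w: Ai.T @ (Ai @ w - bi) / batch_size
    return loss, grad

x, state = np.zeros(A.shape[1]), None
for _ in range(500):
    loss_fn, grad_fn = sample_batch()
    x, state = adagrad_sls_step(x, loss_fn, grad_fn, state)
print("final full-batch loss:", 0.5 * np.mean((A @ x - b) ** 2))
```

Under interpolation, SPS avoids the backtracking loop by setting the step-size in closed form, roughly eta = f_i(x) / (c * ||grad f_i(x)||^2) on the sampled mini-batch.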

