Formal guarantees for heuristic optimization algorithms used in machine learning

07/31/2022
by Xiaoyu Li, et al.

Recently, Stochastic Gradient Descent (SGD) and its variants have become the dominant methods for large-scale optimization of machine learning (ML) problems. A variety of strategies have been proposed for tuning the step sizes, ranging from adaptive step sizes to heuristic schedules that change the step size at each iteration. Momentum has also been widely employed in ML tasks to accelerate training. Yet, a gap remains in our theoretical understanding of these methods. In this work, we start to close this gap by providing formal guarantees for several heuristic optimization methods and by proposing improved algorithms. First, we analyze a generalized version of the AdaGrad step sizes (Delayed AdaGrad) in both convex and non-convex settings, showing that these step sizes allow the algorithms to adapt automatically to the noise level of the stochastic gradients. We give the first sufficient conditions under which Delayed AdaGrad achieves almost sure convergence of the gradients to zero. Moreover, we present a high-probability analysis of Delayed AdaGrad and its momentum variant in the non-convex setting. Second, we analyze SGD with exponential and cosine step sizes, which are empirically successful but lack theoretical support. We provide the first convergence guarantees for them in the smooth, non-convex setting, with and without the Polyak-Łojasiewicz (PL) condition, and we show that they adapt to the noise level under the PL condition. Third, we study the last iterate of momentum methods. We prove the first lower bound in the convex setting for the last iterate of SGD with constant momentum. Moreover, we investigate a class of Follow-The-Regularized-Leader-based momentum algorithms with increasing momentum and shrinking updates, and we show that their last iterate achieves optimal convergence for unconstrained convex stochastic optimization problems.
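For concreteness, below is a minimal Python sketch of the step-size schedules mentioned in the abstract: an AdaGrad-style "delayed" step size that depends only on past gradients, plus exponential and cosine schedules. This is an illustrative assumption of how such schedules are commonly written, not the authors' exact formulations; constants such as `eta`, `b0`, `alpha`, and the horizon `T` are arbitrary choices for the toy run.

```python
import numpy as np

# Illustrative sketch (not the authors' exact algorithms) of the step-size
# schedules discussed above.

def delayed_adagrad_step_size(past_sq_grad_norms, eta=1.0, b0=1.0):
    """AdaGrad-style "delayed" step size: the step at iteration t depends only
    on gradients observed before t, so it is independent of the current
    stochastic gradient.  Assumed form: eta / sqrt(b0 + sum_{i<t} ||g_i||^2).
    """
    return eta / np.sqrt(b0 + sum(past_sq_grad_norms))

def exponential_step_size(t, eta0=0.1, alpha=0.999):
    """Exponentially decaying schedule: eta_t = eta0 * alpha**t."""
    return eta0 * alpha**t

def cosine_step_size(t, T, eta0=0.1):
    """Cosine-annealed schedule over T iterations: eta0/2 * (1 + cos(pi*t/T))."""
    return 0.5 * eta0 * (1.0 + np.cos(np.pi * t / T))

# Toy run: SGD with the delayed AdaGrad step size on f(x) = 0.5*||x||^2,
# using noisy gradients g_t = x_t + noise.
rng = np.random.default_rng(0)
x = np.ones(10)
past_sq_norms = []
for t in range(1000):
    g = x + 0.01 * rng.standard_normal(x.shape)       # stochastic gradient
    eta_t = delayed_adagrad_step_size(past_sq_norms)  # uses g_0 .. g_{t-1} only
    x -= eta_t * g
    past_sq_norms.append(float(g @ g))
print(f"final ||x|| = {np.linalg.norm(x):.4f}")
```

Swapping `delayed_adagrad_step_size` for `exponential_step_size(t)` or `cosine_step_size(t, T=1000)` gives the other two schedule families mentioned above. The "delayed" construction is what makes the step size at iteration t measurable with respect to the past gradients only, which is the structural property exploited in adaptation-to-noise analyses of this kind.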
