
Linear Convergence of Adaptive Stochastic Gradient Descent

by   Yuege Xie, et al.

We prove that the norm version of the adaptive stochastic gradient method (AdaGrad-Norm) achieves a linear convergence rate for a subset of either strongly convex functions or non-convex functions that satisfy the Polyak-Łojasiewicz (PL) inequality. We introduce the notion of a Restricted Uniform Inequality of Gradients (RUIG), which gives a uniform lower bound on the norm of the stochastic gradients in terms of the distance to the optimal solution. RUIG plays a key role in proving the robustness of AdaGrad-Norm to its hyper-parameter tuning. Building on RUIG, we develop a novel two-stage framework to prove linear convergence of AdaGrad-Norm without knowledge of the parameters of the objective function: in Stage I, the step-size decreases quickly until the algorithm enters Stage II; in Stage II, the step-size decreases slowly and converges. This framework can likely be extended to other adaptive step-size algorithms. Numerical experiments show desirable agreement with our theory.
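For concreteness, here is a minimal sketch of the AdaGrad-Norm update discussed above, which maintains a single adaptive scalar step size driven by accumulated squared gradient norms. The function names, default hyper-parameters, and the least-squares toy problem are illustrative assumptions, not values or code taken from the paper.

```python
import numpy as np

def adagrad_norm(grad, x0, eta=1.0, b0=1e-2, num_iters=1000):
    """Sketch of AdaGrad-Norm: the accumulator b_t grows with the squared
    stochastic gradient norms, so the effective step size eta / b_t adapts
    without knowing the smoothness or PL parameters of the objective."""
    x = np.asarray(x0, dtype=float)
    b_sq = b0 ** 2
    for _ in range(num_iters):
        g = grad(x)                       # stochastic gradient at x
        b_sq += np.dot(g, g)              # b_t^2 = b_{t-1}^2 + ||g_t||^2
        x = x - (eta / np.sqrt(b_sq)) * g
    return x

# Toy usage: least squares f(x) = 0.5 * ||A x - y||^2 (strongly convex, hence
# PL), with full gradients standing in for the stochastic gradient oracle.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, y = rng.standard_normal((50, 10)), rng.standard_normal(50)
    sol = adagrad_norm(lambda x: A.T @ (A @ x - y), np.zeros(10))
```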



