On Faster Convergence of Scaled Sign Gradient Descent

09/04/2021
by   Xiuxian Li, et al.
8

Communication has been seen as a significant bottleneck in industrial applications over large-scale networks. To alleviate the communication burden, sign-based optimization algorithms have gained popularity recently in both industrial and academic communities, which is shown to be closely related to adaptive gradient methods, such as Adam. Along this line, this paper investigates faster convergence for a variant of sign-based gradient descent, called scaled signGD, in three cases: 1) the objective function is strongly convex; 2) the objective function is nonconvex but satisfies the Polyak-Lojasiewicz (PL) inequality; 3) the gradient is stochastic, called scaled signGD in this case. For the first two cases, it can be shown that the scaled signGD converges at a linear rate. For case 3), the algorithm is shown to converge linearly to a neighborhood of the optimal value when a constant learning rate is employed, and the algorithm converges at a rate of O(1/k) when using a diminishing learning rate, where k is the iteration number. The results are also extended to the distributed setting by majority vote in a parameter-server framework. Finally, numerical experiments on logistic regression are performed to corroborate the theoretical findings.

READ FULL TEXT

page 1

page 2

page 3

page 4

page 5

page 8

research
10/31/2022

Almost Sure Convergence Rates of Stochastic Zeroth-order Gradient Descent for Łojasiewicz Functions

We prove almost sure convergence rates of Zeroth-order Gradient Descent ...
research
06/26/2023

Gradient Descent Converges Linearly for Logistic Regression on Separable Data

We show that running gradient descent with variable learning rate guaran...
research
04/12/2021

Meta-Regularization: An Approach to Adaptive Choice of the Learning Rate in Gradient Descent

We propose Meta-Regularization, a novel approach for the adaptive choice...
research
04/15/2020

On Learning Rates and Schrödinger Operators

The learning rate is perhaps the single most important parameter in the ...
research
02/14/2020

Stochasticity of Deterministic Gradient Descent: Large Learning Rate for Multiscale Objective Function

This article suggests that deterministic Gradient Descent, which does no...
research
02/05/2022

Distributed Learning With Sparsified Gradient Differences

A very large number of communications are typically required to solve di...
research
09/14/2020

A Qualitative Study of the Dynamic Behavior of Adaptive Gradient Algorithms

The dynamic behavior of RMSprop and Adam algorithms is studied through a...

Please sign up or login with your details

Forgot password? Click here to reset