Stochastic Polyak Step-size for SGD: An Adaptive Learning Rate for Fast Convergence

02/24/2020
by Nicolas Loizou, et al.

We propose a stochastic variant of the classical Polyak step-size (Polyak, 1987) commonly used in the subgradient method. Although computing the Polyak step-size requires knowledge of the optimal function values, this information is readily available for typical modern machine learning applications. Consequently, the proposed stochastic Polyak step-size (SPS) is an attractive choice for setting the learning rate for stochastic gradient descent (SGD). We provide theoretical convergence guarantees for SGD equipped with SPS in different settings, including strongly convex, convex, and non-convex functions. Furthermore, our analysis yields novel convergence guarantees for SGD with a constant step-size. We show that SPS is particularly effective when training over-parameterized models capable of interpolating the training data. In this setting, we prove that SPS enables SGD to converge to the true solution at a fast rate without requiring knowledge of any problem-dependent constants or additional computational overhead. We validate our theoretical results through extensive experiments on synthetic and real datasets, demonstrating the strong performance of SGD with SPS compared to state-of-the-art optimization methods when training over-parameterized models.
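
To make the abstract's idea concrete, the following is a minimal Python sketch of SGD driven by a stochastic Polyak step-size. It assumes the commonly used form gamma_k = min{ (f_i(x_k) - f_i^*) / (c * ||grad f_i(x_k)||^2), gamma_max }, with per-sample optimal values f_i^* (zero for many non-negative losses under interpolation); the constant c, the cap gamma_max, and all names in the code are illustrative choices rather than the paper's exact specification.

import numpy as np

def sgd_sps(x0, losses, grads, n_samples, c=0.5, gamma_max=1.0,
            n_iters=1000, f_star=0.0, rng=None):
    """SGD where each step-size is a (capped) stochastic Polyak step.

    losses(i, x): loss of sample i at parameters x.
    grads(i, x):  gradient of that loss with respect to x.
    f_star:       assumed per-sample optimal value (0 under interpolation).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_iters):
        i = rng.integers(n_samples)           # draw a random sample
        g = grads(i, x)
        g_norm_sq = float(np.dot(g, g))
        if g_norm_sq == 0.0:                  # sample i is already optimal
            continue
        # Stochastic Polyak step-size, capped at gamma_max.
        gamma = min((losses(i, x) - f_star) / (c * g_norm_sq), gamma_max)
        x -= gamma * g
    return x

# Toy usage: over-parameterized least squares (20 equations, 50 unknowns),
# where interpolation holds and every per-sample optimum is exactly 0.
rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 50)), rng.standard_normal(20)
losses = lambda i, x: 0.5 * (A[i] @ x - b[i]) ** 2
grads = lambda i, x: (A[i] @ x - b[i]) * A[i]
x_hat = sgd_sps(np.zeros(50), losses, grads, n_samples=20, rng=rng)
print("mean squared residual:", np.mean((A @ x_hat - b) ** 2))

In this interpolation regime no problem-dependent constants need to be estimated: the step-size adapts automatically from the current loss and gradient, which is the behaviour the abstract highlights for over-parameterized models.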

Related research

10/16/2018 - Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron
Modern machine learning focuses on highly expressive models that are abl...

08/10/2022 - Adaptive Learning Rates for Faster Stochastic Gradient Methods
In this work, we propose new adaptive step size strategies that improve ...

06/11/2020 - AdaS: Adaptive Scheduling of Stochastic Gradients
The choice of step-size used in Stochastic Gradient Descent (SGD) optimi...

11/08/2019 - MindTheStep-AsyncPSGD: Adaptive Asynchronous Parallel Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is very useful in optimization problem...

12/10/2018 - Why Does Stagewise Training Accelerate Convergence of Testing Error Over SGD?
Stagewise training strategy is commonly used for learning neural network...

06/05/2021 - Bandwidth-based Step-Sizes for Non-Convex Stochastic Optimization
Many popular learning-rate schedules for deep neural networks combine a ...

05/04/2022 - Making SGD Parameter-Free
We develop an algorithm for parameter-free stochastic convex optimizatio...
