Stochastic Polyak Step-size for SGD: An Adaptive Learning Rate for Fast Convergence

by   Nicolas Loizou, et al.

We propose a stochastic variant of the classical Polyak step-size (Polyak, 1987) commonly used in the subgradient method. Although computing the Polyak step-size requires knowledge of the optimal function values, this information is readily available for typical modern machine learning applications. Consequently, the proposed stochastic Polyak step-size (SPS) is an attractive choice for setting the learning rate for stochastic gradient descent (SGD). We provide theoretical convergence guarantees for SGD equipped with SPS in different settings, including strongly convex, convex and non-convex functions. Furthermore, our analysis results in novel convergence guarantees for SGD with a constant step-size. We show that SPS is particularly effective when training over-parameterized models capable of interpolating the training data. In this setting, we prove that SPS enables SGD to converge to the true solution at a fast rate without requiring the knowledge of any problem-dependent constants or additional computational overhead. We experimentally validate our theoretical results via extensive experiments on synthetic and real datasets. We demonstrate the strong performance of SGD with SPS compared to state-of-the-art optimization methods when training over-parameterized models.



There are no comments yet.


page 7

page 27


Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron

Modern machine learning focuses on highly expressive models that are abl...

AdaS: Adaptive Scheduling of Stochastic Gradients

The choice of step-size used in Stochastic Gradient Descent (SGD) optimi...

Why Does Stagewise Training Accelerate Convergence of Testing Error Over SGD?

Stagewise training strategy is commonly used for learning neural network...

Bandwidth-based Step-Sizes for Non-Convex Stochastic Optimization

Many popular learning-rate schedules for deep neural networks combine a ...

Making SGD Parameter-Free

We develop an algorithm for parameter-free stochastic convex optimizatio...

Resonance in Weight Space: Covariate Shift Can Drive Divergence of SGD with Momentum

Most convergence guarantees for stochastic gradient descent with momentu...

TDprop: Does Jacobi Preconditioning Help Temporal Difference Learning?

We investigate whether Jacobi preconditioning, accounting for the bootst...

Code Repositories


Official code for the Stochastic Polyak step-size optimizer

view repo
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.