PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization

by Zhize Li, et al.

In this paper, we propose a novel stochastic gradient estimator, the ProbAbilistic Gradient Estimator (PAGE), for nonconvex optimization. PAGE is easy to implement because it is a small modification of vanilla SGD: in each iteration, PAGE uses the vanilla minibatch SGD update with probability p, or, with probability 1-p, reuses the previous gradient with a small adjustment at a much lower computational cost. We give a simple formula for the optimal choice of p. We prove tight lower bounds for nonconvex problems, which are of independent interest. Moreover, we prove matching upper bounds in both the finite-sum and online regimes, which establish that PAGE is an optimal method. In addition, we show that for nonconvex functions satisfying the Polyak-Łojasiewicz (PL) condition, PAGE automatically switches to a faster linear convergence rate. Finally, we conduct several deep learning experiments (e.g., LeNet, VGG, ResNet) on real datasets in PyTorch. The results demonstrate that PAGE not only converges much faster than SGD in training but also achieves higher test accuracy, validating our theoretical results and confirming the practical superiority of PAGE.
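The update rule described in the abstract can be illustrated with a minimal sketch in plain Python. The abstract does not spell out the "small adjustment", so the sketch assumes the recursive form g_new = g + (minibatch gradient at the new point) - (same minibatch gradient at the old point), computed on a smaller batch. The function name `page_minimize`, the helper `grad_i`, the batch sizes `b` and `b_small`, and the value p = 0.2 are all illustrative assumptions, not the authors' implementation or their optimal choice of p.

```python
import random


def page_minimize(grad_i, n, x0, step, p, b, b_small, iters, seed=0):
    """Sketch of a PAGE-style loop for f(x) = (1/n) * sum_i f_i(x), scalar x.

    grad_i(i, x) is a hypothetical helper returning the gradient of the
    i-th component function at x.
    """
    rng = random.Random(seed)

    def minibatch_grad(x, batch):
        # Average of component gradients over the sampled indices.
        return sum(grad_i(i, x) for i in batch) / len(batch)

    x = x0
    # Start from a plain minibatch gradient estimate.
    g = minibatch_grad(x, rng.sample(range(n), b))
    for _ in range(iters):
        x_new = x - step * g
        if rng.random() < p:
            # With probability p: fresh minibatch SGD gradient (batch size b).
            g = minibatch_grad(x_new, rng.sample(range(n), b))
        else:
            # With probability 1 - p: reuse the previous estimate and adjust
            # it by a gradient difference on a smaller batch (assumed form).
            batch = rng.sample(range(n), b_small)
            g = g + minibatch_grad(x_new, batch) - minibatch_grad(x, batch)
        x = x_new
    return x


# Toy problem (an assumption for illustration): f_i(x) = 0.5 * (x - a_i)^2,
# whose average is minimized at the mean of the a_i.
data = [1.0, 2.0, 3.0, 4.0]
x_final = page_minimize(
    lambda i, x: x - data[i],
    n=len(data), x0=0.0, step=0.5, p=0.2, b=4, b_small=1, iters=100,
)
```

On this toy problem the cheap branch evaluates only `b_small` component gradients per point instead of `b`, which is where the lower per-iteration cost mentioned in the abstract would come from.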



