Convergence of Online Adaptive and Recurrent Optimization Algorithms

05/12/2020
by Pierre-Yves Massé, et al.

We prove local convergence of several notable gradient descent algorithms used in machine learning, for which standard stochastic gradient descent theory does not apply. This includes, first, online algorithms for recurrent models and dynamical systems, such as Real-Time Recurrent Learning (RTRL) and its computationally lighter approximations NoBackTrack and UORO; second, several adaptive algorithms, such as RMSProp, online natural gradient, and Adam with β₂ → 1. Despite local convergence being a relatively weak requirement for a new optimization algorithm, no local analysis was available for these algorithms, as far as we knew. Analysis of these algorithms does not immediately follow from standard stochastic gradient (SGD) theory. In fact, Adam has been proved to lack local convergence in some simple situations. For recurrent models, online algorithms modify the parameter while the model is running, which further complicates the analysis with respect to simple SGD. Local convergence for these various algorithms results from a single, more general set of assumptions, in the setup of learning dynamical systems online. Thus, these results can cover other variants of the algorithms considered. We adopt an "ergodic" rather than probabilistic viewpoint, working with empirical time averages instead of probability distributions. This is more data-agnostic and creates differences with respect to standard SGD theory, especially for the range of possible learning rates. For instance, with cycling or per-epoch reshuffling over a finite dataset instead of pure i.i.d. sampling with replacement, empirical averages of gradients converge at rate 1/T instead of 1/√(T) (cycling acts as a variance reduction method), theoretically allowing for larger learning rates than in SGD.
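The last claim about cycling versus i.i.d. sampling admits a quick numerical check. Below is a minimal Python sketch, not taken from the paper: the toy dataset of per-sample "gradients", the dataset size N, and the helper averaging_error are all illustrative assumptions. It compares how fast the running average of sampled gradients approaches the full-batch gradient under per-epoch reshuffling versus i.i.d. sampling with replacement.

# Minimal sketch (illustrative assumptions, not the paper's setup): compare the
# error of empirical gradient averages under per-epoch reshuffling (cycling)
# versus i.i.d. sampling with replacement over a finite dataset.
import numpy as np

rng = np.random.default_rng(0)
N = 100                            # finite dataset size (assumed)
g = rng.normal(size=N)             # per-sample "gradients" at a fixed parameter
g_mean = g.mean()                  # full-batch gradient the averages should estimate

def averaging_error(T, cycling):
    """Distance between the running average over T samples and the true mean."""
    if cycling:
        # per-epoch reshuffling: every sample is visited once per epoch
        idx = np.concatenate([rng.permutation(N) for _ in range(T // N + 1)])[:T]
    else:
        # pure i.i.d. sampling with replacement
        idx = rng.integers(0, N, size=T)
    return abs(g[idx].mean() - g_mean)

for T in [1_050, 10_050, 100_050]:
    err_cyc = np.mean([averaging_error(T, cycling=True) for _ in range(20)])
    err_iid = np.mean([averaging_error(T, cycling=False) for _ in range(20)])
    print(f"T={T:>6}  cycling: {err_cyc:.2e}   i.i.d.: {err_iid:.2e}")

# Expected behaviour: with cycling, only the truncated last epoch contributes to
# the error, which therefore decays roughly like 1/T; with i.i.d. sampling the
# error decays like 1/sqrt(T), matching the variance-reduction effect described
# in the abstract.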


