The Role of Memory in Stochastic Optimization

07/02/2019
by Antonio Orvieto, et al.

The choice of how to retain information about past gradients dramatically affects the convergence properties of state-of-the-art stochastic optimization methods, such as Heavy-ball, Nesterov's momentum, RMSprop and Adam. Building on this observation, we use stochastic differential equations (SDEs) to explicitly study the role of memory in gradient-based algorithms. We first derive a general continuous-time model that can incorporate arbitrary types of memory, for both deterministic and stochastic settings. We provide convergence guarantees for this SDE for weakly quasi-convex and quadratically growing functions. We then demonstrate how to discretize this SDE to get a flexible discrete-time algorithm that can implement a broad spectrum of memories ranging from short- to long-term. Not only does this algorithm increase the degrees of freedom in algorithmic choice for practitioners, but it also comes with better stability properties than classical momentum in the convex stochastic setting. In particular, no iterate averaging is needed for convergence. Interestingly, our analysis also provides a novel interpretation of Nesterov's momentum as stable gradient amplification and highlights a possible reason for its unstable behavior in the (convex) stochastic setting. Furthermore, we discuss the use of long-term memory for second-moment estimation in adaptive methods, such as Adam and RMSprop. Finally, we provide an extensive experimental study of the effect of different types of memory in both convex and nonconvex settings.
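As a rough illustration of the idea of retaining past gradients with different memory horizons, the following minimal Python sketch (not the paper's algorithm; the function names and parameters are hypothetical) updates the iterate with a normalized weighted sum of all past gradients. An exponential decay gives short-term, momentum-like memory, while a uniform weighting gives long-term, averaging-like memory.

```python
import numpy as np

def quadratic_grad(x):
    """Gradient of the toy objective f(x) = 0.5 * ||x||^2."""
    return x

def memory_gd(grad, x0, lr=0.1, decay=0.9, n_steps=100):
    """Toy gradient descent with a tunable memory of past gradients.
    decay < 1: short-term (momentum-like) memory;
    decay = 1: long-term memory (uniform average of all past gradients)."""
    x = np.asarray(x0, dtype=float)
    past_grads = []
    for _ in range(n_steps):
        past_grads.append(grad(x))
        # Weight the k-th most recent gradient by decay**k, then normalize.
        weights = np.array([decay**k for k in range(len(past_grads) - 1, -1, -1)])
        weights /= weights.sum()
        direction = sum(w * g for w, g in zip(weights, past_grads))
        x = x - lr * direction
    return x

# Short-term vs. long-term memory on the toy quadratic.
print(memory_gd(quadratic_grad, x0=[2.0, -1.0], decay=0.9))
print(memory_gd(quadratic_grad, x0=[2.0, -1.0], decay=1.0))
```

This sketch only contrasts the two extremes of the memory spectrum discussed in the abstract; the paper's actual method is derived from a discretization of the memory SDE.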

Related research
05/30/2022

Last-iterate convergence analysis of stochastic momentum methods for neural networks

The stochastic momentum method is a commonly used acceleration technique...
10/05/2018

Continuous-time Models for Stochastic Optimization Algorithms

We propose a new continuous-time formulation for first-order stochastic ...
07/25/2023

High Probability Analysis for Non-Convex Stochastic Optimization with Clipping

Gradient clipping is a commonly used technique to stabilize the training...
06/01/2020

Factorial Powers for Stochastic Optimization

The convergence rates for convex and non-convex optimization methods dep...
06/01/2016

CaMKII activation supports reward-based neural network optimization through Hamiltonian sampling

Synaptic plasticity is implemented and controlled through over thousand ...
04/19/2019

On the Convergence of Adam and Beyond

Several recently proposed stochastic optimization methods that have been...
05/04/2019

NAMSG: An Efficient Method For Training Neural Networks

We introduce NAMSG, an adaptive first-order algorithm for training neura...
