Stochastic Gradient Descent: Going As Fast As Possible But Not Faster

09/05/2017
by Alice Schoenauer-Sebag, et al.

When applied to training deep neural networks, stochastic gradient descent (SGD) often goes through steady progression phases interrupted by catastrophic episodes in which the loss and gradient norm explode. A possible mitigation of such events is to slow down the learning process. This paper presents a novel approach to controlling the SGD learning rate that relies on two statistical tests. The first one, aimed at fast learning, compares the momentum of the normalized gradient vectors to that of random unit vectors and accordingly increases or decreases the learning rate gracefully. The second one is a change-point detection test, aimed at detecting catastrophic learning episodes; when it is triggered, the learning rate is instantly halved. The combined ability to speed up and slow down the learning rate allows the proposed approach, called SALeRA, to learn as fast as possible but not faster. Experiments on standard benchmarks show that SALeRA performs well in practice and compares favorably to the state of the art.
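To make the mechanism concrete, below is a minimal Python sketch of a SALeRA-style learning-rate controller. It is not the authors' implementation: the class name, hyperparameters, and the two simplified tests (a coherence check on an exponential moving average of normalized gradients, and a crude loss-spike heuristic standing in for the paper's change-point detection test) are illustrative assumptions based only on the abstract above.

```python
import numpy as np


class SALeRALikeController:
    """Hypothetical sketch of a SALeRA-style learning-rate controller.

    The paper's statistical tests are replaced here by simple stand-ins
    with the same intent: grow the learning rate while gradient directions
    are more coherent than chance, shrink it otherwise, and halve it when
    the loss appears to blow up (catastrophic episode)."""

    def __init__(self, lr=0.01, beta=0.9, gain=1.05,
                 window=20, spike_factor=2.0):
        self.lr = lr
        self.beta = beta                  # decay of the direction "momentum"
        self.gain = gain                  # multiplicative lr adaptation step
        self.window = window              # number of recent losses kept
        self.spike_factor = spike_factor  # loss blow-up threshold
        self.momentum = None              # EMA of normalized gradients
        self.losses = []

    def step(self, grad, loss):
        # Test 1: are gradient directions more aligned than random chance?
        g = grad / (np.linalg.norm(grad) + 1e-12)   # unit gradient vector
        if self.momentum is None:
            self.momentum = np.zeros_like(g)
        self.momentum = self.beta * self.momentum + (1 - self.beta) * g

        # For an EMA of i.i.d. random unit vectors, the stationary expected
        # squared norm is (1 - beta) / (1 + beta). More coherent gradients
        # exceed this baseline -> increase lr; less coherent -> decrease lr.
        baseline = (1 - self.beta) / (1 + self.beta)
        if np.sum(self.momentum * self.momentum) > baseline:
            self.lr *= self.gain
        else:
            self.lr /= self.gain

        # Test 2: crude stand-in for catastrophic-episode detection.
        self.losses.append(loss)
        if len(self.losses) > self.window:
            self.losses.pop(0)
        if (len(self.losses) == self.window
                and loss > self.spike_factor * np.median(self.losses)):
            self.lr *= 0.5                # halve the lr on a suspected blow-up
            self.losses.clear()

        return self.lr


# Toy usage: adapt the learning rate while minimizing f(w) = ||w||^2 / 2.
if __name__ == "__main__":
    ctrl = SALeRALikeController(lr=0.1)
    w = np.random.randn(10)
    for _ in range(100):
        grad, loss = w, 0.5 * np.dot(w, w)
        w -= ctrl.lr * grad
        ctrl.step(grad, loss)
    print("final loss:", 0.5 * np.dot(w, w), "final lr:", ctrl.lr)
```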


Related research

07/06/2022 - BFE and AdaBFE: A New Approach in Learning Rate Automation for Stochastic Optimization
In this paper, a new gradient-based optimization approach by automatical...

06/12/2021 - Scaling transition from momentum stochastic gradient descent to plain stochastic gradient descent
The plain stochastic gradient descent and momentum stochastic gradient d...

10/02/2018 - Learning with Random Learning Rates
Hyperparameter tuning is a bothersome step in the training of deep learn...

01/22/2019 - DTN: A Learning Rate Scheme with Convergence Rate of O(1/t) for SGD
We propose a novel diminishing learning rate scheme, coined Decreasing-T...

07/09/2022 - Improved Binary Forward Exploration: Learning Rate Scheduling Method for Stochastic Optimization
A new gradient-based optimization approach by automatically scheduling t...

06/29/2020 - Adai: Separating the Effects of Adaptive Learning Rate and Momentum Inertia
Adaptive Momentum Estimation (Adam), which combines Adaptive Learning Ra...

02/14/2020 - Stochasticity of Deterministic Gradient Descent: Large Learning Rate for Multiscale Objective Function
This article suggests that deterministic Gradient Descent, which does no...
