The Golden Ratio of Learning and Momentum

06/08/2020
by Stefan Jaeger, et al.

Gradient descent has been a central training principle for artificial neural networks from their early beginnings to today's deep learning networks. The most common implementation is the backpropagation algorithm for training feed-forward neural networks in a supervised fashion. Backpropagation computes the gradient of a loss function with respect to the network weights and uses it to update the weights and thus minimize the loss. Although the mean square error is often used as a loss function, the general stochastic gradient descent principle is not tied to any specific loss function. Another drawback of backpropagation has been the search for optimal values of two important training parameters, the learning rate and the momentum weight, which are determined empirically in most systems. The learning rate specifies the step size towards a minimum of the loss function when following the gradient, while the momentum weight takes previous weight changes into account when updating the current weights. Using both parameters in conjunction is generally accepted as a means of improving training, although their specific values do not follow immediately from standard backpropagation theory. This paper proposes a new information-theoretical loss function motivated by neural signal processing in a synapse. The new loss function implies a specific learning rate and momentum weight, leading to empirical parameters often used in practice. The proposed framework also provides a more formal explanation of the momentum term and its smoothing effect on the training process. Taken together, these results show that loss, learning rate, and momentum are closely connected. To support these theoretical findings, experiments on handwritten digit recognition demonstrate the practical usefulness of the proposed loss function and training parameters.
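For reference, the update rule the abstract alludes to combines both parameters: the learning rate scales the step along the negative gradient, and the momentum weight carries over a fraction of the previous weight change. The minimal sketch below is not taken from the paper; the quadratic (mean square error) loss, the toy data, and the common empirical values lr = 0.1 and momentum = 0.9 are illustrative assumptions only.

import numpy as np

def train(lr=0.1, momentum=0.9, steps=200):
    # Fit a small linear model by minimizing the mean square error
    # L(w) = mean((X @ w - y)^2) with gradient descent plus momentum.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    true_w = np.array([1.5, -2.0, 0.5])
    y = X @ true_w

    w = np.zeros(3)        # current weights
    delta_w = np.zeros(3)  # previous weight change (momentum term)

    for _ in range(steps):
        grad = (2.0 / len(X)) * X.T @ (X @ w - y)  # dL/dw
        # New change = step along the negative gradient (learning rate)
        #              plus a fraction of the previous change (momentum).
        delta_w = -lr * grad + momentum * delta_w
        w = w + delta_w
    return w

print(train())  # converges towards [1.5, -2.0, 0.5]

With the momentum weight set to zero, the same loop reduces to plain gradient descent; a nonzero momentum term averages successive gradient steps and damps oscillations, which is the smoothing effect the abstract refers to.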
