On Learning Rates and Schrödinger Operators

by   Bin Shi, et al.

The learning rate is perhaps the single most important parameter in the training of neural networks and, more broadly, in stochastic (nonconvex) optimization. Accordingly, there are numerous effective, but poorly understood, techniques for tuning the learning rate, including learning rate decay, which starts with a large initial learning rate that is gradually decreased. In this paper, we present a general theoretical analysis of the effect of the learning rate in stochastic gradient descent (SGD). Our analysis is based on the use of a learning-rate-dependent stochastic differential equation (lr-dependent SDE) that serves as a surrogate for SGD. For a broad class of objective functions, we establish a linear rate of convergence for this continuous-time formulation of SGD, highlighting the fundamental importance of the learning rate in SGD, and contrasting to gradient descent and stochastic gradient Langevin dynamics. Moreover, we obtain an explicit expression for the optimal linear rate by analyzing the spectrum of the Witten-Laplacian, a special case of the Schrödinger operator associated with the lr-dependent SDE. Strikingly, this expression clearly reveals the dependence of the linear convergence rate on the learning rate – the linear rate decreases rapidly to zero as the learning rate tends to zero for a broad class of nonconvex functions, whereas it stays constant for strongly convex functions. Based on this sharp distinction between nonconvex and convex problems, we provide a mathematical interpretation of the benefits of using learning rate decay for nonconvex optimization.


On the Hyperparameters in Stochastic Gradient Descent with Momentum

Following the same routine as [SSJ20], we continue to present the theore...

DTN: A Learning Rate Scheme with Convergence Rate of O(1/t) for SGD

We propose a novel diminishing learning rate scheme, coined Decreasing-T...

Stochastic Gradient Descent for Nonconvex Learning without Bounded Gradient Assumptions

Stochastic gradient descent (SGD) is a popular and efficient method with...

Frequency-aware SGD for Efficient Embedding Learning with Provable Benefits

Embedding learning has found widespread applications in recommendation s...

Behavior of linear L2-boosting algorithms in the vanishing learning rate asymptotic

We investigate the asymptotic behaviour of gradient boosting algorithms ...

Gradient is All You Need?

In this paper we provide a novel analytical perspective on the theoretic...

On Faster Convergence of Scaled Sign Gradient Descent

Communication has been seen as a significant bottleneck in industrial ap...

Please sign up or login with your details

Forgot password? Click here to reset