Phase diagram of training dynamics in deep neural networks: effect of learning rate, depth, and width

02/23/2023
by Dayal Singh Kalra et al.

We systematically analyze optimization dynamics in deep neural networks (DNNs) trained with stochastic gradient descent (SGD) over long time scales and study the effect of learning rate, depth, and width of the neural network. By analyzing the maximum eigenvalue λ^H_t of the Hessian of the loss, which is a measure of sharpness of the loss landscape, we find that the dynamics can show four distinct regimes: (i) an early time transient regime, (ii) an intermediate saturation regime, (iii) a progressive sharpening regime, and finally (iv) a late time "edge of stability" regime. The early and intermediate regimes (i) and (ii) exhibit a rich phase diagram depending on learning rate η ≡ c/λ^H_0, depth d, and width w. We identify several critical values of c which separate qualitatively distinct phenomena in the early time dynamics of training loss and sharpness, and extract their dependence on d/w. Our results have implications for how to scale the learning rate with DNN depth and width in order to remain in the same phase of learning.
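The parameterization η ≡ c/λ^H_0 sets the learning rate relative to the sharpness at initialization. A minimal sketch of how one might compute it is below, using power iteration to estimate the top Hessian eigenvalue; the dense toy matrix and the constant c = 2.0 are illustrative assumptions (in practice λ^H is estimated from Hessian-vector products via automatic differentiation, without materializing the Hessian), not the authors' code.

```python
import numpy as np

def sharpness(hessian, num_iters=100, seed=0):
    """Estimate the largest eigenvalue of a symmetric Hessian by power iteration.

    A dense matrix stands in for the Hessian here; real implementations use
    Hessian-vector products from autodiff instead of forming the matrix.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(hessian.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        hv = hessian @ v
        v = hv / np.linalg.norm(hv)
    # Rayleigh quotient of the converged vector approximates λ^H.
    return float(v @ hessian @ v)

# Toy symmetric "Hessian" at initialization, with top eigenvalue 4.0.
H = np.diag([4.0, 1.0, 0.5])
lam0 = sharpness(H)          # λ^H_0 ≈ 4.0

# Learning rate parameterized as in the abstract: η = c / λ^H_0.
c = 2.0                      # illustrative choice of the constant c
eta = c / lam0               # η ≈ 0.5
```

With this parameterization, sweeping c (rather than η directly) compares networks of different depth and width at matched positions relative to their initial sharpness, which is the axis along which the paper's phase diagram is drawn.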


