A Loss Curvature Perspective on Training Instability in Deep Learning

10/08/2021
by Justin Gilmer, et al.

In this work, we study the evolution of the loss Hessian across many classification tasks in order to understand the effect that loss curvature has on training dynamics. Whereas prior work has focused on how different learning rates affect the loss Hessian observed during training, we also analyze the effects of model initialization, architectural choices, and common training heuristics such as gradient clipping and learning rate warmup. Our results demonstrate that successful model and hyperparameter choices allow the early optimization trajectory to either avoid, or navigate out of, regions of high curvature and into flatter regions that tolerate a higher learning rate. These results suggest a unifying perspective on how disparate mitigation strategies for training instability ultimately address the same underlying failure mode of neural network optimization, namely poor conditioning. Inspired by the conditioning perspective, we show that learning rate warmup can improve training stability just as much as batch normalization, layer normalization, MetaInit, GradInit, and Fixup initialization.
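The curvature quantity of interest here is the largest eigenvalue of the loss Hessian, lambda_max, which for plain gradient descent on a quadratic bounds the stable learning rate at roughly 2 / lambda_max. Below is a minimal sketch, not the authors' code, of estimating lambda_max with power iteration on Hessian-vector products in JAX; the least-squares loss and the names loss, hvp, and top_hessian_eigenvalue are illustrative placeholders for a real training loss.

```python
# Sketch: estimate the top Hessian eigenvalue (loss sharpness) via power
# iteration, using Hessian-vector products instead of the full Hessian.
import jax
import jax.numpy as jnp

def loss(params, batch):
    # Placeholder least-squares loss; stands in for a network's training loss.
    x, y = batch
    pred = x @ params
    return jnp.mean((pred - y) ** 2)

def hvp(params, batch, v):
    # Hessian-vector product: differentiate the gradient along direction v
    # (forward-over-reverse autodiff).
    return jax.jvp(lambda p: jax.grad(loss)(p, batch), (params,), (v,))[1]

def top_hessian_eigenvalue(params, batch, num_iters=50, seed=0):
    # Power iteration: repeatedly apply the Hessian to a random unit vector.
    v = jax.random.normal(jax.random.PRNGKey(seed), params.shape)
    v = v / jnp.linalg.norm(v)
    for _ in range(num_iters):
        hv = hvp(params, batch, v)
        eig = jnp.vdot(v, hv)               # Rayleigh quotient estimate
        v = hv / (jnp.linalg.norm(hv) + 1e-12)
    return eig

# Usage on synthetic data: flatter regions (smaller lambda_max) tolerate a
# larger learning rate, since stability requires lr below about 2 / lambda_max.
key = jax.random.PRNGKey(1)
x = jax.random.normal(key, (128, 10))
y = x @ jnp.arange(10.0)
params = jnp.zeros(10)
lam = top_hessian_eigenvalue(params, (x, y))
print("estimated lambda_max:", lam, " stable lr bound ~", 2.0 / lam)
```

Power iteration is the natural choice here because forming the full Hessian is infeasible for networks of realistic size; each step needs only a Hessian-vector product, which costs roughly as much as an extra gradient evaluation.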

Related research

02/21/2020
The Break-Even Point on Optimization Trajectories of Deep Neural Networks
The early phase of training of deep neural networks is critical for thei...

07/22/2023
The instabilities of large learning rate training: a loss landscape view
Modern neural networks are undeniably successful. Numerous works study h...

11/19/2018
Do Normalization Layers in a Deep ConvNet Really Need to Be Distinct?
Yes, they do. This work investigates a perspective for deep learning: wh...

06/16/2020
Curvature is Key: Sub-Sampled Loss Surfaces and the Implications for Large Batch Training
We study the effect of mini-batching on the loss landscape of deep neura...

07/05/2022
Improving Covariance Conditioning of the SVD Meta-layer by Orthogonality
Inserting an SVD meta-layer into neural networks is prone to make the co...

10/29/2018
A Closer Look at Deep Learning Heuristics: Learning rate restarts, Warmup and Distillation
The convergence rate and final performance of common deep learning model...

04/15/2023
Non-Proportional Parametrizations for Stable Hypernetwork Learning
Hypernetworks are neural networks that generate the parameters of anothe...
