Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization

12/28/2020
by Stanisław Jastrzębski, et al.

The early phase of training has been shown to be important in two ways for deep neural networks. First, the degree of regularization in this phase significantly impacts the final generalization. Second, it is accompanied by a rapid change in the local loss curvature that is influenced by regularization choices. Connecting these two findings, we show that stochastic gradient descent (SGD) implicitly penalizes the trace of the Fisher Information Matrix (FIM) from the beginning of training. We argue that this is an implicit regularizer in SGD by showing that explicitly penalizing the trace of the FIM can significantly improve generalization. We further show that the early value of the trace of the FIM correlates strongly with the final generalization. We highlight that in the absence of implicit or explicit regularization, the trace of the FIM can increase to a large value early in training, which we refer to as catastrophic Fisher explosion. Finally, to gain insight into the regularization effect of penalizing the trace of the FIM, we show that 1) it limits memorization by reducing the learning speed of examples with noisy labels more than that of clean examples, and 2) trajectories with a low initial trace of the FIM end in flat minima, which are commonly associated with good generalization.
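To make the explicit penalty concrete, below is a minimal sketch of a Monte Carlo estimate of the FIM trace added to the training loss. It is not the authors' implementation; the helper name fisher_trace_penalty, the coefficient alpha, and the PyTorch classifier setup are illustrative assumptions. The trace is estimated by sampling labels from the model's own predictive distribution and taking the squared norm of the gradient of the resulting log-likelihood.

import torch
import torch.nn.functional as F

def fisher_trace_penalty(model, x, alpha=0.1):
    """Monte Carlo estimate of tr(FIM), used as an explicit regularizer.

    tr(F) = E_x E_{y ~ p(y|x; theta)} ||grad_theta log p(y|x; theta)||^2,
    estimated here with a single sampled label per input. This is a sketch
    under the assumptions stated above, not the paper's exact procedure.
    """
    logits = model(x)
    with torch.no_grad():
        # Sample "model labels" y ~ p(y|x; theta) instead of using true labels.
        y_sampled = torch.distributions.Categorical(logits=logits).sample()
    nll = F.cross_entropy(logits, y_sampled)
    # create_graph=True so the squared gradient norm can itself be backpropagated.
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(nll, params, create_graph=True)
    return alpha * sum(g.pow(2).sum() for g in grads)

# Usage inside a training step (sketch):
#   loss = F.cross_entropy(model(x), y_true) + fisher_trace_penalty(model, x)
#   loss.backward()

The penalty is simply added to the task loss, so any standard optimizer (e.g. SGD with momentum) can be used unchanged; the extra cost comes from the second differentiation through the gradient norm.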

Related research

02/21/2020 · The Break-Even Point on Optimization Trajectories of Deep Neural Networks
The early phase of training of deep neural networks is critical for thei...

11/29/2022 · Disentangling the Mechanisms Behind Implicit Regularization in SGD
A number of competing hypotheses have been proposed to explain why small...

11/01/2018 · Implicit Regularization of Stochastic Gradient Descent in Natural Language Processing: Observations and Implications
Deep neural networks with remarkably strong generalization performances ...

05/27/2023 · The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent
In this paper, we study the implicit regularization of stochastic gradie...

09/29/2021 · Stochastic Training is Not Necessary for Generalization
It is widely believed that the implicit regularization of stochastic gra...

04/11/2019 · High dimensional optimal design using stochastic gradient optimisation and Fisher information gain
Finding high dimensional designs is increasingly important in applicatio...

08/11/2022 · Regularizing Deep Neural Networks with Stochastic Estimators of Hessian Trace
In this paper we develop a novel regularization method for deep neural n...
