A variance principle explains why dropout finds flatter minima

11/01/2021
by Zhongwang Zhang, et al.

Although dropout has achieved great success in deep learning, little is known about how it helps training find a well-generalizing solution in the high-dimensional parameter space. In this work, we show that training with dropout finds a neural network at a flatter minimum than standard gradient descent training does. Through experiments, we further study the underlying mechanism of why dropout finds flatter minima. We propose a Variance Principle: the variance of the noise is larger along sharper directions of the loss landscape. Existing work shows that SGD satisfies the variance principle, which leads training to flatter minima. Our work shows that the noise induced by dropout also satisfies the variance principle, which explains why dropout finds flatter minima. In general, our work points out that the variance principle is an important similarity between dropout and SGD that leads training to flatter minima and good generalization.
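To make the variance principle concrete, the sketch below shows one hypothetical way to probe it empirically (this is not the paper's code): train a tiny two-layer MLP on synthetic regression data, take the top and bottom eigen-directions of the full-batch loss Hessian as the "sharp" and "flat" directions, then compare the variance of dropout-induced gradient noise projected onto each. The model size, data, dropout rate, and number of sampled masks are all illustrative assumptions.

```python
"""Illustrative sketch (not the paper's code): checking whether dropout
gradient noise has larger variance along sharper loss directions."""
import torch
import torch.nn as nn

torch.manual_seed(0)

# --- tiny model and synthetic data (hypothetical setup) -----------------
d_in, d_hidden, n = 5, 8, 256
x = torch.randn(n, d_in)
y = torch.sin(x.sum(dim=1, keepdim=True))             # synthetic targets

model = nn.Sequential(nn.Linear(d_in, d_hidden), nn.Tanh(),
                      nn.Dropout(p=0.5), nn.Linear(d_hidden, 1))
loss_fn = nn.MSELoss()

# short full-batch training run to reach a reasonable point in parameter space
opt = torch.optim.SGD(model.parameters(), lr=0.1)
model.train()
for _ in range(500):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

# --- flatten parameters so the full Hessian can be formed ---------------
names = [name for name, _ in model.named_parameters()]
shapes = [p.shape for p in model.parameters()]
sizes = [p.numel() for p in model.parameters()]
theta0 = torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def loss_at(theta):
    """Full-batch loss (dropout off) as a function of a flat parameter vector."""
    chunks = torch.split(theta, sizes)
    params = {name: c.reshape(s) for name, c, s in zip(names, chunks, shapes)}
    model.eval()                                       # deterministic forward
    out = torch.func.functional_call(model, params, (x,))
    return loss_fn(out, y)

H = torch.autograd.functional.hessian(loss_at, theta0)
eigvals, eigvecs = torch.linalg.eigh(H)
flat_dir, sharp_dir = eigvecs[:, 0], eigvecs[:, -1]    # smallest / largest curvature

# --- dropout-induced gradient noise, projected onto those directions ----
def dropout_grad():
    """One full-batch gradient with a fresh dropout mask."""
    model.train()                                      # dropout on
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.reshape(-1) for p in model.parameters()])

grads = torch.stack([dropout_grad() for _ in range(200)])
noise = grads - grads.mean(dim=0)                      # remove the mean gradient

var_sharp = (noise @ sharp_dir).var().item()
var_flat = (noise @ flat_dir).var().item()
print(f"curvature (sharp / flat): {eigvals[-1].item():.3e} / {eigvals[0].item():.3e}")
print(f"noise variance (sharp / flat): {var_sharp:.3e} / {var_flat:.3e}")
```

Under the variance principle, one would expect the noise variance along the sharp direction to exceed that along the flat direction; the toy setup here only illustrates how such a measurement could be made, not the paper's actual experiments.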

Related research

07/13/2022  Implicit regularization of dropout
It is important to understand how the popular regularization method drop...

05/25/2023  Stochastic Modified Equations and Dynamics of Dropout Algorithm
Dropout is a widely utilized regularization technique in the training of...

05/20/2023  Loss Spike in Training Neural Networks
In this work, we study the mechanism underlying loss spikes observed dur...

03/02/2023  Dropout Reduces Underfitting
Introduced by Hinton et al. in 2012, dropout has stood the test of time ...

03/06/2015  To Drop or Not to Drop: Robustness, Consistency and Differential Privacy Properties of Dropout
Training deep belief networks (DBNs) requires optimizing a non-convex fu...

01/16/2018  Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift
This paper first answers the question "why do the two most powerful tech...

02/08/2021  Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise
The empirical success of deep learning is often attributed to SGD's myst...
