Improved generalization by noise enhancement

09/28/2020
by Takashi Mori, et al.

Recent studies have demonstrated that noise in stochastic gradient descent (SGD) is closely related to generalization: larger SGD noise, as long as it is not too large, results in better generalization. Since the covariance of the SGD noise is proportional to η^2/B, where η is the learning rate and B is the minibatch size, the SGD noise has so far been controlled by changing η and/or B. However, too large an η makes the training dynamics unstable, and a small B prevents scalable parallel computation. It is thus desirable to develop a method of controlling the SGD noise without changing η and B. In this paper, we propose a method that achieves this goal using “noise enhancement”, which is easy to implement in practice. We explain the underlying theoretical idea and demonstrate that noise enhancement improves generalization on real datasets. Notably, large-batch training with noise enhancement generalizes even better than small-batch training.
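
The η^2/B scaling quoted in the abstract can be checked numerically on a toy problem. The sketch below is only an illustration under stated assumptions: the one-dimensional quadratic loss, the synthetic data, and the helper names `minibatch_grad` and `update_variance` are all hypothetical and not taken from the paper. The `alpha` option shows one conceivable way to enlarge the noise at fixed η and B, by adding `alpha` times the difference of two independent minibatch gradients; the abstract does not spell out the construction, so this form is an assumption. It leaves the mean gradient unchanged and, for independent minibatches drawn with replacement, multiplies the noise variance by (1 + alpha)^2 + alpha^2.

```python
import random
import statistics


def minibatch_grad(data, theta, batch_size, rng):
    """Mean gradient of the toy loss 0.5 * (theta - x)^2 over a minibatch
    drawn with replacement (toy setup, not from the paper)."""
    batch = [rng.choice(data) for _ in range(batch_size)]
    return sum(theta - x for x in batch) / batch_size


def update_variance(data, theta, eta, batch_size, alpha=0.0, trials=10000, seed=0):
    """Empirical variance of a single SGD step eta * g at fixed theta.

    alpha > 0 switches on a hypothetical enhanced gradient
    g = g_B + alpha * (g_B - g_B'), built from two independent minibatches.
    This form is an assumption made for illustration only.
    """
    rng = random.Random(seed)
    steps = []
    for _ in range(trials):
        g = minibatch_grad(data, theta, batch_size, rng)
        if alpha:
            g_other = minibatch_grad(data, theta, batch_size, rng)
            g = g + alpha * (g - g_other)
        steps.append(eta * g)
    return statistics.pvariance(steps)


# Synthetic data set: 1000 samples from a standard normal distribution.
data_rng = random.Random(123)
data = [data_rng.gauss(0.0, 1.0) for _ in range(1000)]

base = update_variance(data, theta=0.5, eta=0.1, batch_size=40)
small_batch = update_variance(data, theta=0.5, eta=0.1, batch_size=10)
large_eta = update_variance(data, theta=0.5, eta=0.2, batch_size=40)
enhanced = update_variance(data, theta=0.5, eta=0.1, batch_size=40, alpha=1.0)

# Each ratio compares the step variance against the baseline (eta=0.1, B=40).
print("B: 40 -> 10      ratio:", small_batch / base)  # theory: 4
print("eta: 0.1 -> 0.2  ratio:", large_eta / base)    # theory: 4
print("alpha: 0 -> 1    ratio:", enhanced / base)     # theory: (1+1)^2 + 1^2 = 5
```

The first two ratios reproduce the familiar knobs (shrinking B by 4 or doubling η each quadruple the step variance), while the third changes the noise level with η and B held fixed. In this sketch the enhanced gradient needs two minibatch gradients per step.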

research
06/26/2019

Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD

Large-batch stochastic gradient descent (SGD) is widely used for trainin...
research
02/24/2018

A Walk with SGD

Exploring why stochastic gradient descent (SGD) based optimization metho...
research
02/21/2019

Interplay Between Optimization and Generalization of Stochastic Gradient Descent with Covariance Noise

The choice of batch-size in a stochastic optimization algorithm plays a ...
research
02/26/2020

Stagewise Enlargement of Batch Size for SGD-based Learning

Existing research shows that the batch size can seriously affect the per...
research
06/19/2018

Faster SGD training by minibatch persistency

It is well known that, for most datasets, the use of large-size minibatc...
research
02/10/2021

On Minibatch Noise: Discrete-Time SGD, Overparametrization, and Bayes

The noise in stochastic gradient descent (SGD), caused by minibatch samp...
research
08/13/2020

Deep Networks with Fast Retraining

Recent work [1] has utilized Moore-Penrose (MP) inverse in deep convoluti...