The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study

05/09/2019
by   Daniel S. Park, et al.
0

We investigate how the final parameters found by stochastic gradient descent are influenced by over-parameterization. We generate families of models by increasing the number of channels in a base network, and then perform a large hyper-parameter search to study how the test error depends on learning rate, batch size, and network width. We find that the optimal SGD hyper-parameters are determined by a "normalized noise scale," which is a function of the batch size, learning rate, and initialization conditions. In the absence of batch normalization, the optimal normalized noise scale is directly proportional to width. Wider networks, with their higher optimal noise scale, also achieve higher test accuracy. These observations hold for MLPs, ConvNets, and ResNets, and for two different parameterization schemes ("Standard" and "NTK"). We observe a similar trend with batch normalization for ResNets. Surprisingly, since the largest stable learning rate is bounded, the largest batch size consistent with the optimal normalized noise scale decreases as the width increases.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
11/13/2017

Three Factors Influencing Minima in SGD

We study the properties of the endpoint of stochastic gradient descent (...
research
06/07/2016

Systematic evaluation of CNN advances on the ImageNet

The paper systematically studies the impact of a range of recent advance...
research
10/17/2017

A Bayesian Perspective on Generalization and Stochastic Gradient Descent

This paper tackles two related questions at the heart of machine learnin...
research
06/26/2020

On the Generalization Benefit of Noise in Stochastic Gradient Descent

It has long been argued that minibatch stochastic gradient descent can g...
research
06/07/2021

Batch Normalization Orthogonalizes Representations in Deep Random Networks

This paper underlines a subtle property of batch-normalization (BN): Suc...
research
05/17/2022

Hyper-Learning for Gradient-Based Batch Size Adaptation

Scheduling the batch size to increase is an effective strategy to contro...
research
10/06/2020

Reconciling Modern Deep Learning with Traditional Optimization Analyses: The Intrinsic Learning Rate

Recent works (e.g., (Li and Arora, 2020)) suggest that the use of popula...

Please sign up or login with your details

Forgot password? Click here to reset