The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study

by Daniel S. Park, et al.

We investigate how the final parameters found by stochastic gradient descent are influenced by over-parameterization. We generate families of models by increasing the number of channels in a base network, and then perform a large hyper-parameter search to study how the test error depends on learning rate, batch size, and network width. We find that the optimal SGD hyper-parameters are determined by a "normalized noise scale," which is a function of the batch size, learning rate, and initialization conditions. In the absence of batch normalization, the optimal normalized noise scale is directly proportional to width. Wider networks, with their higher optimal noise scale, also achieve higher test accuracy. These observations hold for MLPs, ConvNets, and ResNets, and for two different parameterization schemes ("Standard" and "NTK"). We observe a similar trend with batch normalization for ResNets. Surprisingly, since the largest stable learning rate is bounded, the largest batch size consistent with the optimal normalized noise scale decreases as the width increases.
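
The last observation can be illustrated with a toy calculation. This is a minimal sketch under simplifying assumptions, not the paper's definition: the abstract only says that the normalized noise scale depends on batch size, learning rate, and initialization conditions, so the sketch assumes it is proportional to learning rate divided by batch size, and that its optimal value is proportional to width. The function name and the constants `c`, `k`, and `max_lr` are hypothetical.

```python
def largest_batch_size(width, max_lr, c=1.0, k=1.0):
    """Largest batch size that can still reach the optimal noise scale.

    Assumptions (illustrative only, not taken from the paper):
      noise_scale   = c * learning_rate / batch_size
      optimal_noise = k * width
      learning_rate <= max_lr  (largest stable learning rate)
    """
    optimal_noise = k * width
    # Solve c * max_lr / batch_size = optimal_noise for batch_size.
    return c * max_lr / optimal_noise


if __name__ == "__main__":
    for width in (1, 2, 4, 8):
        batch = largest_batch_size(width, max_lr=0.1, c=1024.0)
        print(f"width={width}: largest batch size = {batch:.1f}")
```

Under these assumptions, doubling the width halves the largest usable batch size, because the learning rate cannot grow past its stability bound to compensate.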




