Stagewise Enlargement of Batch Size for SGD-based Learning

by Shen-Yi Zhao, et al.

Existing research shows that the batch size can seriously affect the performance of stochastic gradient descent (SGD) based learning, including training speed and generalization ability. A larger batch size typically results in fewer parameter updates. In distributed training, a larger batch size also results in less frequent communication. However, a larger batch size can more easily lead to a generalization gap. Hence, how to set a proper batch size for SGD has recently attracted much attention. Although some methods for setting the batch size have been proposed, the batch size problem has still not been well solved. In this paper, we first provide theory to show that a proper batch size is related to the gap between the initialization and the optimum of the model parameter. Then, based on this theory, we propose a novel method, called stagewise enlargement of batch size (SEBS), to set a proper batch size for SGD. More specifically, SEBS adopts a multi-stage scheme and enlarges the batch size geometrically by stage. We theoretically prove that, compared to classical stagewise SGD, which decreases the learning rate by stage, SEBS can reduce the number of parameter updates without increasing generalization error. SEBS is suitable for SGD, momentum and AdaGrad. Empirical results on real data successfully verify the theories of SEBS. Furthermore, empirical results also show that SEBS can outperform other baselines.
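
The geometric, stage-by-stage enlargement described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the growth factor, the initial batch size, the stage lengths, or any cap, so the function names, parameters and numbers below are all hypothetical. The sketch also assumes each stage processes the same number of training samples, purely to show why larger batches mean fewer parameter updates.

```python
def stagewise_batch_sizes(initial_batch_size, growth_factor, num_stages,
                          max_batch_size=None):
    """Return one batch size per stage, growing geometrically.

    All arguments are illustrative assumptions; the abstract only states
    that the batch size is enlarged geometrically by stage.
    """
    sizes = []
    size = initial_batch_size
    for _ in range(num_stages):
        # Optionally cap the batch size (hypothetical, e.g. a memory limit).
        if max_batch_size is not None:
            size = min(size, max_batch_size)
        sizes.append(size)
        size = size * growth_factor
    return sizes


def updates_per_stage(samples_per_stage, batch_sizes):
    """Number of parameter updates in each stage (ceiling division),
    assuming every stage processes the same number of samples."""
    return [-(-samples_per_stage // b) for b in batch_sizes]


# Hypothetical settings: start at 128, double each stage, 4 stages.
sizes = stagewise_batch_sizes(128, 2, 4)
updates = updates_per_stage(51200, sizes)
print(sizes)    # [128, 256, 512, 1024]
print(updates)  # [400, 200, 100, 50]
```

Under these assumed settings, a fixed batch size of 128 would need 1600 updates for the same four stages, while the enlarged schedule needs 750.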




Stochastic Normalized Gradient Descent with Momentum for Large Batch Training

Stochastic gradient descent (SGD) and its variants have been the dominat...

Don't Decay the Learning Rate, Increase the Batch Size

It is common practice to decay the learning rate. Here we show one can u...

Improved generalization by noise enhancement

Recent studies have demonstrated that noise in stochastic gradient desce...

Disentangling the Mechanisms Behind Implicit Regularization in SGD

A number of competing hypotheses have been proposed to explain why small...

Batch size-invariance for policy optimization

We say an algorithm is batch size-invariant if changes to the batch size...

The Effect of Network Width on the Performance of Large-batch Training

Distributed implementations of mini-batch stochastic gradient descent (S...

Black-Box Generalization

We provide the first generalization error analysis for black-box learnin...
