
Stagewise Enlargement of Batch Size for SGD-based Learning

by Shen-Yi Zhao, et al.
Nanjing University

Existing research shows that the batch size can seriously affect the performance of stochastic gradient descent (SGD) based learning, including training speed and generalization ability. A larger batch size typically results in fewer parameter updates. In distributed training, a larger batch size also results in less frequent communication. However, a larger batch size can more easily lead to a generalization gap. Hence, how to set a proper batch size for SGD has recently attracted much attention. Although some methods for setting the batch size have been proposed, the batch size problem has still not been well solved. In this paper, we first provide theory to show that a proper batch size is related to the gap between the initialization and the optimum of the model parameter. Then, based on this theory, we propose a novel method, called stagewise enlargement of batch size (SEBS), to set a proper batch size for SGD. More specifically, SEBS adopts a multi-stage scheme and enlarges the batch size geometrically by stage. We theoretically prove that, compared to classical stagewise SGD, which decreases the learning rate by stage, SEBS can reduce the number of parameter updates without increasing generalization error. SEBS is suitable for SGD, momentum SGD and AdaGrad. Empirical results on real data successfully verify the theories of SEBS. Furthermore, empirical results also show that SEBS can outperform other baselines.
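To make the idea of geometric, stagewise batch-size enlargement concrete, here is a minimal sketch in Python. It is not the authors' implementation. The function names, the initial batch size of 128, the growth factor of 2, and the assumption that every stage processes the same number of samples are all hypothetical choices made for illustration, since the abstract does not specify them.

```python
def geometric_batch_schedule(initial_batch_size, growth_factor, num_stages):
    """Return one batch size per stage, enlarged geometrically.

    Stage s uses initial_batch_size * growth_factor**s.
    All three arguments are illustrative assumptions, not values from the paper.
    """
    return [initial_batch_size * growth_factor ** s for s in range(num_stages)]


def parameter_updates(samples_per_stage, batch_sizes):
    """Count parameter updates per stage (ceiling division).

    Assumes each stage processes the same number of samples, which is a
    simplification chosen here for illustration.
    """
    return [-(-samples_per_stage // b) for b in batch_sizes]


# Hypothetical setting: 4 stages, 51,200 samples processed per stage.
enlarged = geometric_batch_schedule(initial_batch_size=128, growth_factor=2, num_stages=4)
fixed = [128] * 4  # classical stagewise SGD keeps the batch size and decays the learning rate

enlarged_updates = sum(parameter_updates(51200, enlarged))
fixed_updates = sum(parameter_updates(51200, fixed))

print("batch sizes per stage:", enlarged)
print("updates with enlarged batches:", enlarged_updates)
print("updates with a fixed batch:", fixed_updates)
```

Under these assumptions, the enlarged schedule needs fewer parameter updates than the fixed-batch schedule for the same number of processed samples. This is the kind of saving the abstract describes. Whether generalization is preserved depends on the paper's theory and is not something this sketch can show.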




Stochastic Normalized Gradient Descent with Momentum for Large Batch Training

Stochastic gradient descent (SGD) and its variants have been the dominat...

Don't Decay the Learning Rate, Increase the Batch Size

It is common practice to decay the learning rate. Here we show one can u...

Towards Theoretical Understanding of Large Batch Training in Stochastic Gradient Descent

Stochastic gradient descent (SGD) is almost ubiquitously used for traini...

Improved generalization by noise enhancement

Recent studies have demonstrated that noise in stochastic gradient desce...

Black-Box Generalization

We provide the first generalization error analysis for black-box learnin...

The Effect of Network Width on the Performance of Large-batch Training

Distributed implementations of mini-batch stochastic gradient descent (S...

On the Power of Differentiable Learning versus PAC and SQ Learning

We study the power of learning via mini-batch stochastic gradient descen...