Revisiting Small Batch Training for Deep Neural Networks

04/20/2018
by Dominic Masters, et al.

Modern deep neural network training is typically based on mini-batch stochastic gradient optimization. While the use of large mini-batches increases the available computational parallelism, small batch training has been shown to provide improved generalization performance and allows a significantly smaller memory footprint, which might also be exploited to improve machine throughput. In this paper, we review common assumptions on learning rate scaling and training duration, as a basis for an experimental comparison of test performance for different mini-batch sizes. We adopt a learning rate that corresponds to a constant average weight update per gradient calculation (i.e., per unit cost of computation), and point out that this results in a variance of the weight updates that increases linearly with the mini-batch size m. The collected experimental results for the CIFAR-10, CIFAR-100 and ImageNet datasets show that increasing the mini-batch size progressively reduces the range of learning rates that provide stable convergence and acceptable test performance. On the other hand, small mini-batch sizes provide more up-to-date gradient calculations, which yields more stable and reliable training. The best performance has been consistently obtained for mini-batch sizes between m = 2 and m = 32, which contrasts with recent work advocating the use of mini-batch sizes in the thousands.
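
As a rough numerical illustration of the variance argument above (a sketch of the scaling behaviour, not code from the paper), the following NumPy snippet simulates noisy per-example gradients and applies a linear learning-rate scaling eta = m * eta_base. The mean weight update per gradient calculation stays constant across mini-batch sizes, while the variance of each weight update grows linearly with m. The gradient mean, noise level, and base learning rate are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-example gradient model: mean g_bar, i.i.d. noise with std sigma.
g_bar, sigma = 1.0, 0.5
eta_base = 0.01          # base learning rate for m = 1 (arbitrary toy value)
n_trials = 100_000       # number of simulated mini-batch updates per batch size

for m in (1, 8, 64):
    eta = eta_base * m                      # linear scaling rule: eta = m * eta_base
    grads = rng.normal(g_bar, sigma, size=(n_trials, m))
    updates = -eta * grads.mean(axis=1)     # one SGD weight update per mini-batch
    # Mean update per gradient calculation (per example) is constant in m,
    # while Var(update) = eta_base**2 * m * sigma**2 grows linearly with m.
    print(f"m={m:3d}  mean update per example = {updates.mean() / m:+.5f}  "
          f"Var(update) = {updates.var():.2e}")
```

In expectation, Var(update) = eta_base^2 * m * sigma^2, i.e. about 2.5e-5, 2.0e-4 and 1.6e-3 for m = 1, 8 and 64, so the printed estimates should land close to those values, while the mean update per example stays at roughly -0.01 throughout.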

Related research

AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks (12/06/2017)
Training deep neural networks with Stochastic Gradient Descent, or its v...

RSO: A Gradient Free Sampling Based Approach For Training Deep Neural Networks (05/12/2020)
We propose RSO (random search optimization), a gradient free Markov Chai...

Statistical Analysis of Fixed Mini-Batch Gradient Descent Estimator (04/13/2023)
We study here a fixed mini-batch gradient descent (FMGD) algorithm to sol...

A Study of Gradient Variance in Deep Learning (07/09/2020)
The impact of gradient noise on training deep models is widely acknowled...

Accelerated Large Batch Optimization of BERT Pretraining in 54 minutes (06/24/2020)
BERT has recently attracted a lot of attention in natural language under...

Train longer, generalize better: closing the generalization gap in large batch training of neural networks (05/24/2017)
Background: Deep learning models are typically trained using stochastic ...

Sequenced-Replacement Sampling for Deep Learning (10/19/2018)
We propose sequenced-replacement sampling (SRS) for training deep neural...
