1 Introduction
Minibatch stochastic gradient descent (SGD) is the dominant optimization method for training deep neural networks (DNNs)
[1, 2]. In the face of unprecedented growth in dataset size, a large body of work has attempted to scale SGD to train DNN models on increasingly large datasets, while keeping wallclock time manageable [13, 9, 29, 7]. The most common approach to train large models at scale is distributed synchronous minibatch SGD, which exploits additional computational resources through data parallelism. This technique reduces wallclock training time by increasing the minibatch size, i.e., the number of examples used to compute a stochastic estimate of the gradient of the loss function at each training iteration, while holding the number of epochs constant. Proponents of large batch size training often argue that the merits stem from its ability to decrease wallclock training time while maintaining final model performance. Indeed, an enormous amount of work has gone into designing systems that seem to operate under an assumption that equates large batch size training with machine learning at scale
[9, 15, 24]. Increasing the batch size improves the scaling performance of SGD per epoch, but there are significant challenges in building efficient distributed systems that can exploit additional computational resources to use large batch sizes [15]. However, even if we were able to address these systems challenges, there are more fundamental limitations to this approach. Large batch sizes often negatively impact important performance metrics of interest, including total computational cost (which usually determines monetary cost) and prediction quality.
In this paper, we measure the total computational cost as the number of training iterations times the work done per iteration; to simplify measurements, we use the number of training iterations as a proxy for wallclock time. We do this because the implementation of parallel algorithms depends on software and hardware choices, and our goal is to draw more general conclusions about the performance of SGD-based methods.
Based on this model for total computational cost and wallclock time, the following should be clear: unless increasing the batch size leads to a commensurate decrease in the total number of training iterations needed to find a good model, large batch training will result in greater total computational cost with little-to-no decrease in wallclock training time.
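This cost model can be made concrete with a short sketch (the function names and numbers below are illustrative choices, not taken from our experiments):

```python
def total_cost(iterations, batch_size, cost_per_example=1.0):
    """Total computational cost: number of iterations times the work per
    iteration, where per-iteration work scales with the batch size."""
    return iterations * batch_size * cost_per_example

def wallclock_proxy(iterations):
    """Under perfect data parallelism, wallclock time is proportional to
    the number of iterations, independent of batch size."""
    return iterations

# Doubling the batch size while iterations shrink by only 25%:
# wallclock time drops modestly, but total cost rises by 50%.
cost_small = total_cost(iterations=1000, batch_size=256)
cost_large = total_cost(iterations=750, batch_size=512)
```

In this hypothetical case, the larger batch size buys a 25% reduction in wallclock time at the price of 50% more computation, which is exactly the diminishing-returns trade-off described above.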
Based on our empirical results across a range of datasets and architectures, we find that as the batch size becomes larger, there are three main phases of scaling behavior for convergence speed:

Linear: there is a small regime of batch sizes in which increasing the batch size results in linear gains in convergence speed;

Diminishing returns: there is a larger regime of batch sizes that results in sublinear gains in convergence speed—in this regime, increasing the batch size can improve wallclock training time at the expense of greater total computational cost;

Stagnation: eventually, we reach a third regime where a higher batch size yields marginal or nonexistent gains in convergence speed.
In our experiments, we find that this third regime begins at a batch size that is too small to fully populate the memory of all GPUs at our disposal, leading to low GPU utilization. Even though training past this batch size allows for the GPU cycles to be fully utilized, doing so increases the total computational cost without reducing wallclock training time or improving prediction quality.
While there has been considerable excitement around heuristics that have been shown to make large batch training practical for certain problems
[9, 29], we demonstrate that these techniques still suffer from the same convergence trends we observe, and they often decrease the stability of the training process. Recent work has observed that the final test performance of models trained with large batch sizes degrades after training for a fixed number of epochs [31, 17]. This phenomenon is known as the generalization gap. Previous work addressing this problem has focused on training for more iterations in the large batch case [12] or on adopting various heuristics to select a learning rate for larger batch sizes [9, 29]. Based on our empirical results, we find that existing techniques to mitigate the generalization gap do not work on some problems, and for other problems they only work for batch sizes that are too small to fully populate the memory of all GPUs at our disposal. Perhaps more importantly, they do little to affect the diminishing returns in rates of convergence for training loss as batch size increases.
Our objective is to understand the behavior of SGD and existing large batch techniques for many network architectures and problem domains, e.g., image classification/segmentation and natural language processing (NLP). We observe markedly worse performance for these techniques in domains other than image classification, where large batch optimization has received the most attention
[15, 33]. Because we eschew the challenges of an efficient distributed implementation by measuring the number of iterations instead of wallclock time, our results assume the most optimistic circumstances for large batch training. Our key observations are:
Increasing the batch size beyond a certain point yields no improvement in wallclock time to convergence, even for a system with perfect parallelism. We observe that larger batch sizes result in a limited reduction in the number of training iterations needed to achieve low training or test error, and that eventually these gains become near-zero.

Increasing the batch size leads to a significant increase in generalization error, which cannot be mitigated by existing techniques. We observe that these techniques often result in divergent training behavior or that they only mitigate degradation in test performance for small batch sizes relative to available compute.

Dataset size plays a less decisive role in determining training efficiency than factors such as model architecture and data complexity. We observe that both the diminishing returns in convergence speed and the failure of existing methods seem to correlate more with these other problem properties than dataset size alone.
In Section 2, we review the formulation of SGD as well as existing strategies to train with large batch sizes. In Section 3, we review recent theoretical results regarding the convergence rates of SGD in highly overparameterized settings and discuss the potential impact of these results on the computational efficiency of SGD for deep learning. Section 4 presents our empirical results that demonstrate the inefficiencies of training SGD with large batch sizes, and we show that these persist when using existing large batch optimization techniques.
2 Background and Related Work
Stochastic Gradient Descent. SGD is the most widely used algorithm to train DNN models. The model is parameterized by weights $w \in \mathbb{R}^d$, and the objective is to minimize the empirical loss over $n$ data points:

$$F(w) = \frac{1}{n} \sum_{i=1}^{n} f(w; x_i) \qquad (1)$$

where $f$ is a loss function, e.g., cross-entropy or squared error. This loss gives a corresponding gradient

$$\nabla F(w) = \frac{1}{n} \sum_{i=1}^{n} \nabla f(w; x_i) \qquad (2)$$
A minibatch of size $B$ is a collection of indices $\mathcal{B}_k$ drawn uniformly at random from the set $\{1, \dots, n\}$, and we can use it to form an unbiased estimate of the gradient at iteration $k$, as well as the corresponding SGD update:

$$g_k = \frac{1}{B} \sum_{i \in \mathcal{B}_k} \nabla f(w_k; x_i), \qquad w_{k+1} = w_k - \eta_k g_k \qquad (3)$$

where $\eta_k$ is the learning rate for iteration $k$. One iteration of training for SGD corresponds to a single gradient computation / weight update. One epoch corresponds to $\lceil n/B \rceil$ iterations of training. This constitutes a single pass over the dataset, assuming the dataset is sampled without replacement.
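As a concrete illustration of the update in equation (3), the following sketch implements minibatch SGD on a small least squares problem (the data, dimensions, and learning rate are illustrative choices, not taken from this paper):

```python
import numpy as np

def sgd_step(w, X, y, grad_f, batch_size, lr, rng):
    """One SGD iteration: form the unbiased minibatch gradient estimate
    g_k and apply the update w_{k+1} = w_k - eta_k * g_k (Eq. 3)."""
    n = X.shape[0]
    idx = rng.choice(n, size=batch_size, replace=False)  # minibatch indices B_k
    g = grad_f(w, X[idx], y[idx])                        # average over the batch
    return w - lr * g

# Least squares loss: f(w; x_i) = 0.5 * (x_i @ w - y_i)^2
def lsq_grad(w, Xb, yb):
    return Xb.T @ (Xb @ w - yb) / len(yb)

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 4))
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true                      # noiseless targets: an interpolation setting
w = np.zeros(4)
for _ in range(200):
    w = sgd_step(w, X, y, lsq_grad, batch_size=32, lr=0.1, rng=rng)
```

Because the targets are noiseless, this toy problem lies in the interpolated regime later referenced via [21], so SGD converges to the true weights rather than to a noise floor.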
Efficient distributed systems reduce wallclock training time by parallelizing gradient calculations across many machines. When the batch size is large enough to populate all available compute resources, this allows us to amortize the cost of coordination for each weight update.
Existing large batch techniques. With the hope of keeping training times manageable as dataset sizes escalate, recent work has focused on the development of techniques that allow practitioners to increase the batch size to make use of growing computational resources [16, 15, 32]. However, there is a growing body of theoretical and empirical results suggesting that large batch sizes adversely affect the generalization performance of the final model [31, 17, 7].
In response to this, recent work has proposed changing two parameters in relation to batch size: the number of training iterations and the learning rate. However, they also make assumptions that limit the effectiveness of their proposals as useful heuristics for practitioners.

Training longer: [12] suggest increasing the number of training iterations. Even if this does reduce the generalization gap, it significantly increases both wallclock training time and computational cost. Moreover, in some problems it does not lead to minima with better generalization performance (as we found when running our experiments).

Square root LR scaling: Scaling the learning rate as $\eta \propto \sqrt{B}$ attempts to keep the statistics of the weight-update length constant, but the distance between SGD iterates is governed more by properties of the objective function than by the ratio of learning rate to batch size [3, 35]. This rule has also been found to be empirically suboptimal in various problem domains [18].

Linear LR scaling: The performance of large batch training can also be improved by using the linear scaling rule, which suggests choosing a learning rate proportional to the batch size ($\eta \propto B$) [9]. There are two motivations for this rule: the first assumes that one large-batch gradient step should resemble a series of small-batch gradient steps in order for convergence rates to improve linearly [9]; the other regards the SGD update equation as the Euler-Maruyama discretization of a stochastic differential equation [26, 30], and attempts to maintain a constant level of minibatch noise to help SGD explore the loss landscape [3, 35, 29].
Both justifications for the linear scaling rule implicitly impose strong conditions on the loss function by requiring that it behave linearly near SGD iterates; therefore, if the loss function is highly nonlinear along the SGD trajectory or the step size is not small enough, then we should not expect these rules to provide useful guidance for many problems. Whereas several groups have successfully used this rule to train on the ImageNet dataset in under an hour, e.g.
[9, 33], applying this heuristic to other datasets has not led to similarly impressive results so far [24]. The focus of this paper, however, is on more fundamental limitations of large batch training, and we empirically show that the above approaches fail to prevent diminishing returns in the rate of convergence for large batch sizes. We believe that these diminishing returns are of more immediate concern than the generalization gap and warrant more careful examination: if we cannot even minimize training error quickly, there is no real opportunity to minimize test error quickly, regardless of the difference in final test error across batch sizes by the time the model has converged.
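The scaling heuristics above can be sketched as simple learning rate schedules (the parameter names are ours; the warmup form follows the gradual warmup described in [9]):

```python
def linear_scaling(base_lr, base_bs, batch_size):
    """Linear scaling rule: eta proportional to B [9]."""
    return base_lr * batch_size / base_bs

def sqrt_scaling(base_lr, base_bs, batch_size):
    """Square root scaling rule: eta proportional to sqrt(B)."""
    return base_lr * (batch_size / base_bs) ** 0.5

def warmup_lr(step, warmup_steps, start_lr, target_lr):
    """Linearly ramp the learning rate from start_lr to target_lr over
    warmup_steps iterations, then hold it at target_lr."""
    if step >= warmup_steps:
        return target_lr
    return start_lr + (target_lr - start_lr) * step / warmup_steps
```

For example, moving from a base batch size of 256 at learning rate 0.1 to a batch size of 1024 gives a target learning rate of 0.4 under linear scaling but only 0.2 under square root scaling; the warmup schedule then ramps up to that target to avoid early divergence.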
3 Critical Batch Sizes and Diminishing Returns
The convergence rate of SGD, denoted by $N_\varepsilon(B)$, is the number of iterations needed to achieve training error less than a fixed constant $\varepsilon$ by using SGD with batch size $B$ (we will drop the subscript $\varepsilon$ when it is unambiguous). In order to guarantee that large batch sizes speed up training, $N(B)$ should continue to decrease near-linearly with $B$. Otherwise, a larger batch size increases computational cost with only limited reductions in wallclock training time. For near-constant $N(B)$, the benefit of large batch sizes becomes near-zero.
[21] showed theoretically that in convex, overparameterized settings, the reduction in convergence time obtained by increasing the batch size decays dramatically to a near-constant level after a critical batch size that is independent of the dataset size. This speedup is measured with respect to the number of SGD iterations required to reach a fixed loss error $\varepsilon$ for some baseline batch size $B_0$, and for this purpose we define the speedup ratio $s(B) = N(B_0)/N(B)$. The speedup ratio represents the amount of time we save by increasing the batch size from $B_0$ to $B$. Beyond the critical batch size mentioned above, even with no communication overhead and unlimited resources (where each batch size requires the same amount of wallclock time to process), we would prefer to use the critical batch size because it requires less overall computation.
This result is surprising because researchers have asserted that it should be possible to achieve linear gains in convergence speed so long as the batch size remains small relative to the dataset size [29]. It presents significant difficulties for future work on large minibatch training because it prevents us from using large batch sizes as a catch-all approach to quickly train models as datasets grow larger.
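These quantities are straightforward to compute from training curves. A small sketch (the iteration counts below are hypothetical, chosen only to illustrate the three regimes from the introduction):

```python
import numpy as np

def iterations_to_threshold(losses, eps):
    """N_eps(B): the first iteration at which training loss drops below eps,
    or None if the threshold is never reached."""
    below = np.flatnonzero(np.asarray(losses) < eps)
    return int(below[0]) + 1 if below.size else None

def speedup_ratio(n_base, n_b):
    """s(B) = N(B0) / N(B): wallclock time saved (under perfect parallelism)
    by raising the batch size from B0 to B."""
    return n_base / n_b

# Hypothetical iteration counts for B0 = 64, illustrating the three regimes.
counts = {64: 8000, 128: 4000, 256: 2000,   # linear scaling
          512: 1400, 1024: 1100,            # diminishing returns
          2048: 1050, 4096: 1040}           # stagnation
speedups = {B: speedup_ratio(counts[64], n) for B, n in counts.items()}
```

Here $s(256) = 4 = 256/64$ (linear regime), while $s(4096) \approx 7.7$, far below the ideal $4096/64 = 64$, so almost all of the extra computation at batch size 4096 is wasted.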
4 Empirical Evaluation
Recent work studying large batch training has looked primarily at image classification (IC) [14, 31], especially on the ImageNet dataset [6]. We perform large batch size experiments on traditional IC tasks (such as CIFAR10/100 [19]), as well as on previously unexplored tasks such as image segmentation (IS) using the Cityscapes dataset [4] and natural language processing (NLP) using the WikiText2 dataset [23]. We also test how these results vary across other modern DNN architectures, namely ResNets [10], LSTMs [11, 8], AlexNet [20], VGG [27], Dilated Residual Networks [34], and MobileNetV2 [25]. We tested all of the large batch training techniques described in Section 2. We tried training longer, based on the work of [12], but we found that this by construction cannot improve convergence speed and often does not improve final test performance. The two other techniques are the square root scaling rule strategy (SRSR) and the linear scaling rule strategy (LSR); for the latter, we used a warmup period at the start of training as suggested by [9]. Table 1 reports our datasets, models, and training strategies. For each model, we evaluated against a base learning rate strategy (BLR) that used the same learning rate across all batch sizes. We selected this learning rate based on its performance at a small baseline batch size.
4.1 Diminishing Returns in Rates of Convergence
We demonstrate the rapidly diminishing returns in rates of convergence across various problem domains and network configurations. Researchers increase the batch size in an attempt to achieve nearly linear speedups in convergence compared to a small minibatch size. In particular, if the speedup ratio $s(B) = N(B_0)/N(B)$ is near-linear, i.e., $s(B) \approx B/B_0$, then the computational cost remains nearly constant between large and small minibatch SGD. However, if $s(B) \ll B/B_0$, then the benefit of using large batch size training is negligible.
In Figure 1, we show contour plots of training loss as a function of both the batch size and the number of training iterations for ResNet34 on CIFAR10, an LSTM on WikiText2, and DRN-D-22 on Cityscapes. Consider, for example, the contour plot for ResNet34 trained on CIFAR10. We can see that as the batch size increases from 16 to roughly 2048, in the reasonably well-trained model regime, the number of SGD iterations needed to achieve a particular loss value decreases linearly. Beyond this regime, however, the speedup ratio becomes increasingly sublinear, and soon we have $s(B) \ll B/B_0$. For batch sizes of roughly 4096 and above, the training procedure does not achieve the lowest training loss. From this perspective, even if we did not care about computational cost or training time, we would not be able to find an accurate model. We observe even worse scaling behavior for test performance (see Figure 5 for details).
Dataset     Task  Architecture                           Training Strategy  BS range
MNIST       IC    ResNet34                               BLR, LSR           –
CIFAR10     IC    AlexNet, MobileNetV2,                  BLR, LSR, SRSR     –
                  ResNet34, VGG16
CIFAR100    IC    ResNet34                               BLR, LSR           –
SVHN        IC    ResNet34                               BLR, LSR           –
WikiText2   NLP   LSTM                                   BLR, LSR           –
Cityscapes  IS    DRN-D-22                               BLR, LSR           –
For NLP and IS, note that the gain from large batch training diminishes even faster. Neither the LSTM on WikiText2 nor DRND22 on Cityscapes can reach their respective baseline performances after reasonably small batch sizes of about and , respectively. Although [24] showed that training on the Amazon Reviews dataset [22] can be done within 4 hours, they tune hyperparameters heavily. This poses an issue for many practical deployments because these problems are often already slow to train.
4.2 Existing Strategies Break Down for Large Batch Sizes
We further explore how training with the linear and square root scaling rules compares to training with a fixed baseline learning rate (BLR) that does not change with batch size. In the left subfigure of Figure 2, we show the speedup curves of the BLR, LSR, and SRSR strategies for ResNet34 on CIFAR10. Note that LSR and SRSR outperform BLR from batch size 256 to 2048, which implies that LSR and SRSR can help the model train at small-to-medium batch sizes. However, the speedup of LSR and SRSR is still worse than the ideal linear case, and the curves plateau quickly after a batch size of 2048, at which point BLR becomes better than LSR and SRSR. This means that for certain problems, scaling up the learning rate to compensate for an increased batch size hurts performance.
In the right subfigure of Figure 2, we plot the test performance and the approximation error for LSR of ResNet34 on CIFAR10. We measure the approximation error at the end of training, with final weights $w_T$. We take this error to be the absolute difference between the true loss value $F(w_T - \eta g)$ and the linear approximation at $w_T$, given by $F(w_T) - \eta g^\top \nabla F(w_T)$. The approximation is calculated for a single SGD iterate using the LSR, in order to understand the behavior of the approximation along the trajectory. It appears that there exists a strong relationship between linear approximation error and test accuracy: as the linear approximation error increases, the test accuracy drops. Note the transition that happens at the critical batch size of 2048. After this point, the test accuracy drops significantly and the linear approximation error grows sharply, showing that we quickly exit the regime in which the linear approximation is valid.
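The approximation error described above is simple to compute directly; a minimal sketch on a quadratic toy loss (the loss and step direction are our own illustrative choices, not the paper's experimental setup):

```python
import numpy as np

def linear_approx_error(F, grad_F, w, eta, g):
    """Absolute gap between the true loss after one step, F(w - eta*g),
    and its first-order model, F(w) - eta * g @ grad_F(w)."""
    actual = F(w - eta * g)
    predicted = F(w) - eta * (g @ grad_F(w))
    return abs(actual - predicted)

# Toy quadratic loss with an ill-conditioned Hessian (illustrative only).
A = np.diag([1.0, 10.0])
F = lambda w: 0.5 * w @ A @ w
grad_F = lambda w: A @ w

w = np.array([1.0, 1.0])
g = grad_F(w)  # use the full gradient as the step direction for simplicity
err_small = linear_approx_error(F, grad_F, w, 0.01, g)
err_large = linear_approx_error(F, grad_F, w, 0.5, g)
```

For a quadratic, the gap is exactly $0.5\,\eta^2\, g^\top A g$, growing quadratically with the step length; large effective steps, such as those produced by the LSR at large batch sizes, quickly leave the regime where the linear model is accurate.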
4.3 Convergence speed has a weak dependence on dataset size
Figure 3: Speedup curves across different problem configurations. Left: different architectures result in different rates of convergence on CIFAR10. Right: ResNet34 exhibits different rates of convergence on CIFAR10, CIFAR100, and SVHN. Note that in this experiment, we only used 50k training examples for SVHN so that the dataset sizes would be consistent for all runs. Loss thresholds are obtained by computing the lower quartile of loss values achieved by the largest batch size.
Previous works have conjectured that the maximum batch size that can result in a good model is proportional to the size of the whole dataset [28, 29]. However, for convex, overparameterized problems, [21] show that there is a model-dependent critical batch size after which we observe rapidly diminishing returns in convergence speed. In this section, to determine whether a similar critical batch size exists in the nonconvex case, we compare how changing the model architecture or data complexity affects the shapes of speedup curves, relative to changing the dataset size alone.
First, in order to show that these diminishing returns depend on data complexity and DNN architecture, we plot speedup curves in Figure 3 to compare the scaling behaviors across different models and dataset configurations. For the error threshold $\varepsilon$, we chose the lower quartile of loss values reached by the largest batch size, to make a fair comparison across configurations. This setup actually favors the large batch case, because there are lower loss thresholds that are attainable only in the small batch case. On the left, for the CIFAR10 dataset, we compared four model architectures. For each architecture, we plotted the speedup curve obtained by training this model on the dataset for various batch sizes. The variety of speedup curve shapes indicates that model architecture is an important factor in determining the convergence speed of training for large batch sizes. For MobileNetV2/AlexNet, the diminishing returns become visible at a batch size of 1024. However, for VGG16/ResNet34, the speedup does not flatten out until batch size 8192. Hence, in practice, the choice of model strongly affects our ability to use large batch sizes in SGD.
On the right, in order to investigate the effect of problem complexity, we compared the performance of ResNet34 on four datasets of the same size: CIFAR10, CIFAR100, MNIST, and SVHN (we cut off MNIST and SVHN so that all datasets have the same number of training examples). Although all problems display diminishing returns in rates of convergence, the point at which the curves plateau varies according to problem complexity. For simpler problems such as SVHN, the curves flatten out later than for harder problems (e.g., CIFAR10/100).
In all of the above cases, the diminishing rates of return in convergence speed become visible after only moderate increases in the batch size. Previous works have only studied convergence behavior for a fairly limited range of batch sizes (e.g., up to for CIFAR10) [12, 17]. By increasing the batch size past this point, it becomes immediately apparent that the primary issue with large batch size optimization is training speed, not the generalization gap.
In order to test whether the sublinear behavior of depends primarily on dataset size, we compare the speedup curves obtained when training a single model on different fractions of the original training data. We trained ResNet34 models on the CIFAR10 and SVHN datasets (for SVHN in this experiment, we train on all available training images). For each dataset, we trained on , , and then of the available training data.
In Figure 4, we plot the resulting speedup curves for the various partitions. In order to maintain a fair comparison (as baseline loss values change for different dataset sizes), we again choose the loss threshold to be the lower quartile of loss values obtained by the largest batch size.¹ Notably, the batch size at which the curves begin to plateau remains constant as dataset size changes. For ResNet34 on CIFAR10, the linear speedup behavior breaks down around batch size 128 for all three curves. By a batch size of 1024, all curves have flattened. We see similar behavior for ResNet34 on SVHN. Overall, looking back to Figure 3, the choice of model and the complexity of the dataset appear to be more related to the shape of the speedup curve than dataset size alone.

¹ We observed that the loss threshold for a smaller partition is higher than that of the full dataset. This may be because, as we decrease the dataset size, the large batch behavior that determines our threshold approaches that of vanilla gradient descent, which typically displays poor training convergence speed for DNN problems.
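The threshold and plateau conventions used throughout this section can be sketched as follows (the tolerance and example numbers are our own illustrative choices):

```python
import numpy as np

def quartile_threshold(largest_batch_losses):
    """Loss threshold for fair comparison: the lower quartile of the loss
    values achieved by the largest batch size."""
    return float(np.percentile(largest_batch_losses, 25))

def plateau_batch_size(batch_sizes, iter_counts, tol=0.1):
    """Return the first batch size beyond which doubling B reduces N(B)
    by less than a fraction tol: a simple proxy for where the speedup
    curve flattens."""
    for b, n_prev, n in zip(batch_sizes, iter_counts, iter_counts[1:]):
        if (n_prev - n) / n_prev < tol:
            return b
    return batch_sizes[-1]
```

Applied to the hypothetical curve `plateau_batch_size([128, 256, 512, 1024], [1000, 600, 560, 555])`, doubling past 256 reduces iterations by under 10%, so 256 would be flagged as the plateau point.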
5 Conclusion
By experimenting across a wide range of network architectures and problem domains, we find that, after a certain point, increasing the batch size fails to decrease wallclock time to convergence and results in low computational efficiency, even assuming perfect parallelism. The critical batch size after which these returns diminish tends to be small relative to existing system capabilities. These trends present impediments to progress in developing effective machine learning systems that are capable of handling growing data demands.
Recent works also suggest heuristics to decrease the generalization gap, but we find that these heuristics cannot be used to solve the underlying issue of training convergence speed. Moreover, we find that they usually only help decrease the generalization error in a smalltomedium batch size regime. There does not seem to be a simple training heuristic to improve large batch performance in general.
These results suggest that we should not assume that increasing the batch size for larger datasets will keep training times manageable for all problems. Even though it is a natural form of data parallelism for largescale optimization, alternative forms of parallelism should be explored to utilize all of our data more efficiently.
References
 [1] Y. Bengio and Y. LeCun, Scaling Learning Algorithms Towards AI, in Large Scale Kernel Machines, MIT Press, 2007.
 [2] L. Bottou, Largescale machine learning with stochastic gradient descent, in Proceedings of COMPSTAT’2010, Springer, 2010, pp. 177–186.
 [3] P. Chaudhari and S. Soatto, Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks, arXiv:1710.11029, (2017).

 [4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, The Cityscapes Dataset for Semantic Urban Scene Understanding, in Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
 [5] D. Das, S. Avancha, D. Mudigere, K. Vaidyanathan, S. Sridharan, D. D. Kalamkar, B. Kaul, and P. Dubey, Distributed Deep Learning Using Synchronous Stochastic Gradient Descent, arXiv:1602.06709, (2016).
 [6] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei, ImageNet: A largescale hierarchical image database, in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, IEEE, 2009, pp. 248–255.
 [7] A. Devarakonda, M. Naumov, and M. Garland, AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks, arXiv:1712.02029, (2017).
 [8] F. A. Gers, J. A. Schmidhuber, and F. A. Cummins, Learning to Forget: Continual Prediction with LSTM, Neural Computing, 12 (2000), pp. 2451–2471.
 [9] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, Accurate, large minibatch SGD: training ImageNet in 1 hour, arXiv:1706.02677, (2017).
 [10] K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 [11] S. Hochreiter and J. Schmidhuber, Long ShortTerm Memory, Neural Computing, 9 (1997), pp. 1735–1780.
 [12] E. Hoffer, I. Hubara, and D. Soudry, Train longer, generalize better: closing the generalization gap in large batch training of neural networks, in Advances in Neural Information Processing Systems, 2017, pp. 1731–1741.
 [13] F. N. Iandola, K. Ashraf, M. W. Moskewicz, and K. Keutzer, FireCaffe: nearlinear acceleration of deep neural network training on compute clusters, arXiv:1511.00175, (2015).
 [14] S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. J. Storkey, Three Factors Influencing Minima in SGD, arXiv:1711.04623, (2018).
 [15] X. Jia, S. Song, W. He, Y. Wang, H. Rong, F. Zhou, L. Xie, Z. Guo, Y. Yang, L. Yu, T. Chen, G. Hu, S. Shi, and X. Chu, Highly Scalable Deep Learning Training System with MixedPrecision: Training ImageNet in Four Minutes, arXiv:1807.11205, (2018).
 [16] P. H. Jin, Q. Yuan, F. N. Iandola, and K. Keutzer, How to scale distributed deep learning?, arXiv:1611.04581, (2016).
 [17] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, On largebatch training for deep learning: Generalization gap and sharp minima, arXiv:1609.04836, (2016).
 [18] A. Krizhevsky, One weird trick for parallelizing convolutional neural networks, arXiv:1404.5997, (2014).
 [19] A. Krizhevsky and G. Hinton, Learning multiple layers of features from tiny images, tech. rep., Citeseer, 2009.

 [20] A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, in Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS'12, Curran Associates Inc., 2012, pp. 1097–1105.
 [21] S. Ma, R. Bassily, and M. Belkin, The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Overparametrized Learning, arXiv:1712.06559, (2017).
 [22] J. McAuley, C. Targett, Q. Shi, and A. van den Hengel, ImageBased Recommendations on Styles and Substitutes, in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’15, ACM, 2015, pp. 43–52.
 [23] S. Merity, C. Xiong, J. Bradbury, and R. Socher, Pointer Sentinel Mixture Models, arXiv:1609.07843, (2016).
 [24] R. Puri, R. Kirby, N. Yakovenko, and B. Catanzaro, Large Scale Language Modeling: Converging on 40GB of Text in Four Hours, arXiv:1808.01371, (2018).
 [25] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen, MobileNetV2: Inverted Residuals and Linear Bottlenecks, arXiv:1801.04381, (2018).
 [26] T. Sauer, Numerical solution of stochastic differential equations in finance, in Handbook of computational finance, Springer, 2012, pp. 529–550.
 [27] K. Simonyan and A. Zisserman, Very Deep Convolutional Networks for LargeScale Image Recognition, arXiv:1409.1556, (2014).
 [28] S. L. Smith, P.J. Kindermans, and Q. V. Le, Don’t Decay the Learning Rate, Increase the Batch Size, arXiv:1711.00489, (2017).
 [29] S. L. Smith and Q. V. Le, A Bayesian Perspective on Generalization and Stochastic Gradient Descent, in International Conference on Learning Representations, 2018.
 [30] C. Xing, D. Arpit, C. Tsirigotis, and Y. Bengio, A Walk with SGD, arXiv:1802.08770, (2018).
 [31] Z. Yao, A. Gholami, Q. Lei, K. Keutzer, and M. W. Mahoney, Hessianbased Analysis of Large Batch Training and Robustness to Adversaries, arXiv:1802.08241, (2018).
 [32] Y. You, I. Gitman, and B. Ginsburg, Large Batch Training of Convolutional Networks, arXiv:1708.03888, (2017).
 [33] Y. You, Z. Zhang, C. Hsieh, and J. Demmel, 100epoch ImageNet Training with AlexNet in 24 Minutes, arXiv:1709.05011, (2017).
 [34] F. Yu, V. Koltun, and T. Funkhouser, Dilated Residual Networks, in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 07 2017, pp. 636–644.
 [35] Z. Zhu, J. Wu, B. Yu, L. Wu, and J. Ma, The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Minima and Regularization Effects, arXiv:1803.00195, (2018).
Appendix A Additional results
Batch Size   BLR    LSR
32           93.58  93.58
64           93.44  93.23
128          92.92  93.21
512          91.91  92.90
1024         91.50  92.37
2048         90.63  92.17
4096         89.92  87.15
8192         86.93  13.00
16384        81.66  11.01
32768        70.01  10.96