Extrapolation for Large-batch Training in Deep Learning

06/10/2020
by Tao Lin, et al.

Deep neural networks are typically trained with Stochastic Gradient Descent (SGD) methods that iteratively improve the model parameters by estimating a gradient on a very small fraction of the training data. A major roadblock when increasing the batch size to a substantial fraction of the training data, in order to reduce training time, is the persistent degradation in performance (generalization gap). To address this issue, recent work proposes to add small perturbations to the model parameters when computing the stochastic gradients, and reports improved generalization performance due to smoothing effects. However, this approach is poorly understood; it often requires model-specific noise and fine-tuning. To alleviate these drawbacks, we instead propose to use computationally efficient extrapolation (extragradient) to stabilize the optimization trajectory while still benefiting from smoothing to avoid sharp minima. This principled approach is well grounded from an optimization perspective, and we show that a host of variations can be covered in the unified framework that we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer models. We demonstrate in a variety of experiments that the scheme allows scaling to much larger batch sizes than previously possible while reaching or surpassing state-of-the-art accuracy.
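The abstract builds on the classical extragradient idea: take a provisional (extrapolation) step, evaluate the stochastic gradient at the extrapolated point, and then apply that gradient to the original iterate. The sketch below illustrates this generic extragradient SGD step in NumPy; the function name, `grad_fn` interface, and step sizes (`lr`, `extrapolation_lr`) are illustrative assumptions, and the sketch is not the paper's exact algorithm, which additionally incorporates smoothing.

```python
# Minimal sketch of a generic extragradient SGD step (assumed names and
# interfaces; not the paper's exact scheme).
import numpy as np

def extragradient_sgd_step(w, grad_fn, batch, lr, extrapolation_lr):
    """One extragradient-style step.

    w                 -- current model parameters (np.ndarray)
    grad_fn(w, batch) -- returns a stochastic gradient at w on the given batch
    lr                -- step size for the actual update
    extrapolation_lr  -- step size for the lookahead (extrapolation) step
    """
    # 1) Extrapolate: take a provisional step from the current iterate.
    g = grad_fn(w, batch)
    w_lookahead = w - extrapolation_lr * g

    # 2) Update the *original* iterate using the gradient evaluated at the
    #    extrapolated point; this is what stabilizes the trajectory.
    g_lookahead = grad_fn(w_lookahead, batch)
    return w - lr * g_lookahead

# Toy usage on a quadratic loss f(w) = 0.5 * ||w||^2, whose gradient is w
# (the "batch" argument is unused in this toy example).
if __name__ == "__main__":
    w = np.array([1.0, -2.0])
    for _ in range(100):
        w = extragradient_sgd_step(w, lambda w, batch: w, batch=None,
                                    lr=0.1, extrapolation_lr=0.1)
    print(w)  # converges toward the minimizer at the origin
```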


Related research

05/21/2018
SmoothOut: Smoothing Out Sharp Minima for Generalization in Large-Batch Deep Learning
In distributed deep learning, a large batch size in Stochastic Gradient ...

02/21/2019
Interplay Between Optimization and Generalization of Stochastic Gradient Descent with Covariance Noise
The choice of batch-size in a stochastic optimization algorithm plays a ...

06/26/2019
Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD
Large-batch stochastic gradient descent (SGD) is widely used for trainin...

01/27/2019
Augment your batch: better training with larger batches
Large-batch SGD is important for scaling training of deep neural network...

05/24/2017
Train longer, generalize better: closing the generalization gap in large batch training of neural networks
Background: Deep learning models are typically trained using stochastic ...

10/02/2018
Large batch size training of neural networks with adversarial training and second-order information
Stochastic Gradient Descent (SGD) methods using randomly selected batche...

08/28/2020
Predicting Training Time Without Training
We tackle the problem of predicting the number of optimization steps tha...
