Inefficiency of K-FAC for Large Batch Size Training

03/14/2019
by Linjian Ma, et al.

In stochastic optimization, large batch training can leverage parallel resources to produce faster wall-clock training times per epoch. However, for both training loss and testing error, recent results analyzing large batch Stochastic Gradient Descent (SGD) have found sharp diminishing returns beyond a certain critical batch size. In the hopes of addressing this, the Kronecker-Factored Approximate Curvature (K-FAC) method has been hypothesized to allow for greater scalability to large batch sizes for non-convex machine learning problems, as well as greater robustness to variation in hyperparameters. Here, we perform a detailed empirical analysis of these two hypotheses, evaluating performance in terms of both wall-clock time and aggregate computational cost. Our main results are twofold: first, we find that K-FAC does not exhibit improved large-batch scalability behavior, as compared to SGD; and second, we find that K-FAC, in addition to requiring more hyperparameters to tune, suffers from the same hyperparameter sensitivity patterns as SGD. We discuss extensive results using residual networks on CIFAR-10, as well as more general implications of our findings.

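For context, K-FAC approximates each layer's Fisher information block by a Kronecker product of two small covariance matrices (one over layer inputs, one over back-propagated output gradients), so the curvature-corrected update reduces to two small matrix inversions instead of one large one. The snippet below is a minimal NumPy sketch of that preconditioning step for a single fully-connected layer; the shapes, damping value, and learning rate are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a K-FAC-style preconditioned step for one fully-connected
# layer. Shapes, damping, and learning rate are hypothetical choices.
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 256, 100, 50          # illustrative sizes

a = rng.standard_normal((batch, d_in))     # layer inputs (activations)
g = rng.standard_normal((batch, d_out))    # back-propagated output gradients
grad_W = g.T @ a / batch                   # ordinary gradient, shape (d_out, d_in)

# Kronecker factors of the layer's Fisher block: F is approximated by A (x) G
A = a.T @ a / batch                        # input second moment, (d_in, d_in)
G = g.T @ g / batch                        # output-gradient second moment, (d_out, d_out)

damping = 1e-3                             # Tikhonov damping, a K-FAC hyperparameter
A_inv = np.linalg.inv(A + damping * np.eye(d_in))
G_inv = np.linalg.inv(G + damping * np.eye(d_out))

# Kronecker identity: (A (x) G)^{-1} vec(grad_W) = vec(G^{-1} grad_W A^{-1})
precond_grad = G_inv @ grad_W @ A_inv

lr = 0.1
W = rng.standard_normal((d_out, d_in))
W -= lr * precond_grad                     # preconditioned parameter update
```

In practical K-FAC implementations, these factors are typically tracked with running averages and their inverses refreshed only every few iterations; the damping term and those update intervals are among the additional hyperparameters the abstract refers to.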

Related Research

11/30/2018 - On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent
Increasing the mini-batch size for stochastic gradient descent offers si...

12/03/2018 - Towards Theoretical Understanding of Large Batch Training in Stochastic Gradient Descent
Stochastic gradient descent (SGD) is almost ubiquitously used for traini...

12/02/2021 - On Large Batch Training and Sharp Minima: A Fokker-Planck Perspective
We study the statistical properties of the dynamic trajectory of stochas...

07/25/2023 - How to Scale Your EMA
Preserving training dynamics across batch sizes is an important tool for...

10/02/2018 - Large batch size training of neural networks with adversarial training and second-order information
Stochastic Gradient Descent (SGD) methods using randomly selected batche...

07/09/2020 - AdaScale SGD: A User-Friendly Algorithm for Distributed Training
When using large-batch training to speed up stochastic gradient descent,...

06/16/2019 - One Epoch Is All You Need
In unsupervised learning, collecting more data is not always a costly pr...
