Inefficiency of K-FAC for Large Batch Size Training

03/14/2019
by Linjian Ma, et al.

In stochastic optimization, large batch training can leverage parallel resources to produce faster wall-clock training times per epoch. However, for both training loss and testing error, recent results analyzing large batch Stochastic Gradient Descent (SGD) have found sharp diminishing returns beyond a certain critical batch size. In the hopes of addressing this, the Kronecker-Factored Approximate Curvature (K-FAC) method has been hypothesized to allow for greater scalability to large batch sizes for non-convex machine learning problems, as well as greater robustness to variation in hyperparameters. Here, we perform a detailed empirical analysis of these two hypotheses, evaluating performance in terms of both wall-clock time and aggregate computational cost. Our main results are twofold: first, we find that K-FAC does not exhibit improved large-batch scalability behavior, as compared to SGD; and second, we find that K-FAC, in addition to requiring more hyperparameters to tune, suffers from the same hyperparameter sensitivity patterns as SGD. We discuss extensive results using residual networks on CIFAR-10, as well as more general implications of our findings.
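
For context, K-FAC preconditions each layer's gradient with a Kronecker-factored approximation of that layer's Fisher block, F ≈ A ⊗ G, where A is the covariance of the layer's inputs and G is the covariance of the backpropagated pre-activation gradients. The NumPy sketch below illustrates the idea for a single fully connected layer; the shapes, random data, damping value, and learning rate are illustrative placeholders, not the paper's experimental setup or the authors' implementation.

```python
import numpy as np

# Minimal K-FAC-style preconditioning sketch for one fully connected layer.
# All constants below (shapes, damping, learning rate) are illustrative.

rng = np.random.default_rng(0)
batch, d_in, d_out = 256, 100, 10

a = rng.standard_normal((batch, d_in))   # layer inputs (activations)
g = rng.standard_normal((batch, d_out))  # backpropagated pre-activation gradients
grad_W = g.T @ a / batch                 # ordinary minibatch gradient, shape (d_out, d_in)

# Kronecker factors of the layer's Fisher block: F ≈ A ⊗ G
A = a.T @ a / batch                      # input covariance, (d_in, d_in)
G = g.T @ g / batch                      # gradient covariance, (d_out, d_out)

lam = 1e-3                               # damping, one of the extra hyperparameters to tune
A_inv = np.linalg.inv(A + lam * np.eye(d_in))
G_inv = np.linalg.inv(G + lam * np.eye(d_out))

# (A ⊗ G)^{-1} vec(grad_W) corresponds to G^{-1} grad_W A^{-1}
precond_grad = G_inv @ grad_W @ A_inv

lr = 0.1
W = rng.standard_normal((d_out, d_in))
W -= lr * precond_grad                   # K-FAC-style update step
```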

11/30/2018
On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent
Increasing the mini-batch size for stochastic gradient descent offers si...

12/03/2018
Towards Theoretical Understanding of Large Batch Training in Stochastic Gradient Descent
Stochastic gradient descent (SGD) is almost ubiquitously used for traini...

12/02/2021
On Large Batch Training and Sharp Minima: A Fokker-Planck Perspective
We study the statistical properties of the dynamic trajectory of stochas...

10/02/2018
Large batch size training of neural networks with adversarial training and second-order information
Stochastic Gradient Descent (SGD) methods using randomly selected batche...

09/29/2021
Stochastic Training is Not Necessary for Generalization
It is widely believed that the implicit regularization of stochastic gra...

11/29/2022
Disentangling the Mechanisms Behind Implicit Regularization in SGD
A number of competing hypotheses have been proposed to explain why small...

06/16/2019
One Epoch Is All You Need
In unsupervised learning, collecting more data is not always a costly pr...