On the Difficulty of Warm-Starting Neural Network Training

10/18/2019
by Jordan T. Ash, et al.

In many real-world deployments of machine learning systems, data arrive piecemeal. These learning scenarios may be passive, where data arrive incrementally due to structural properties of the problem (e.g., daily financial data) or active, where samples are selected according to a measure of their quality (e.g., experimental design). In both of these cases, we are building a sequence of models that incorporate an increasing amount of data. We would like each of these models in the sequence to be performant and take advantage of all the data that are available to that point. Conventional intuition suggests that when solving a sequence of related optimization problems of this form, it should be possible to initialize using the solution of the previous iterate—to "warm start" the optimization rather than initialize from scratch—and see reductions in wall-clock time. However, in practice this warm-starting seems to yield poorer generalization performance than models that have fresh random initializations, even though the final training losses are similar. While it appears that some hyperparameter settings allow a practitioner to close this generalization gap, they seem to only do so in regimes that damage the wall-clock gains of the warm start. Nevertheless, it is highly desirable to be able to warm-start neural network training, as it would dramatically reduce the resource usage associated with the construction of performant deep learning systems. In this work, we take a closer look at this empirical phenomenon and try to understand when and how it occurs. Although the present investigation did not lead to a solution, we hope that a thorough articulation of the problem will spur new research that may lead to improved methods that consume fewer resources during training.
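To make the setup concrete, the following is a minimal sketch (not code from the paper) of the warm-start versus fresh-initialization comparison on a sequence of growing datasets. The PyTorch model, synthetic data, and training schedule are illustrative placeholders, not the paper's experimental configuration.

# Minimal sketch (assumptions: PyTorch, a toy MLP, synthetic data) of the
# warm-start vs. fresh-initialization comparison described above.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def make_model():
    # Stand-in architecture; the paper's experiments use larger networks.
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

def train(model, loader, epochs=10, lr=1e-3):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model

# Synthetic data that "arrive piecemeal" in cumulative chunks.
x_all = torch.randn(5000, 20)
y_all = (x_all[:, 0] > 0).long()

warm = make_model()  # carried forward and re-trained as data accumulate
for n in range(1000, 5001, 1000):
    loader = DataLoader(TensorDataset(x_all[:n], y_all[:n]),
                        batch_size=64, shuffle=True)
    # Warm start: continue training from the previous model's parameters.
    warm = train(warm, loader)
    # Baseline: a freshly initialized model trained on the same cumulative data.
    fresh = train(make_model(), loader)
    # The phenomenon studied in the paper: `fresh` tends to generalize
    # better than `warm`, even when their final training losses are similar.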


Related research

08/18/2017 · Accelerating recurrent neural network training using sequence bucketing and multi-GPU data parallelization
11/30/2018 · On the Computational Inefficiency of Large Batch Sizes for Stochastic Gradient Descent
10/24/2018 · Learned optimizers that outperform SGD on wall-clock and test loss
10/24/2018 · Learned optimizers that outperform SGD on wall-clock and validation loss
03/18/2023 · Learn, Unlearn and Relearn: An Online Learning Paradigm for Deep Neural Networks
06/16/2019 · One Epoch Is All You Need
03/11/2023 · Knowledge Distillation for Efficient Sequences of Training Runs
