Dynamic Mini-batch SGD for Elastic Distributed Training: Learning in the Limbo of Resources

by   Haibin Lin, et al.

With an increasing demand for training powers for deep learning algorithms and the rapid growth of computation resources in data centers, it is desirable to dynamically schedule different distributed deep learning tasks to maximize resource utilization and reduce cost. In this process, different tasks may receive varying numbers of machines at different time, a setting we call elastic distributed training. Despite the recent successes in large mini-batch distributed training, these methods are rarely tested in elastic distributed training environments and suffer degraded performance in our experiments, when we adjust the learning rate linearly immediately with respect to the batch size. One difficulty we observe is that the noise in the stochastic momentum estimation is accumulated over time and will have delayed effects when the batch size changes. We therefore propose to smoothly adjust the learning rate over time to alleviate the influence of the noisy momentum estimation. Our experiments on image classification, object detection and semantic segmentation have demonstrated that our proposed Dynamic SGD method achieves stabilized performance when varying the number of GPUs from 8 to 128. We also provide theoretical understanding on the optimality of linear learning rate scheduling and the effects of stochastic momentum.


page 1

page 2

page 3

page 4


Trajectory of Mini-Batch Momentum: Batch Size Saturation and Convergence in High Dimensions

We analyze the dynamics of large batch stochastic gradient descent with ...

Don't Decay the Learning Rate, Increase the Batch Size

It is common practice to decay the learning rate. Here we show one can u...

Approximate Fisher Information Matrix to Characterise the Training of Deep Neural Networks

In this paper, we introduce a novel methodology for characterising the p...

DecentLaM: Decentralized Momentum SGD for Large-batch Deep Training

The scale of deep learning nowadays calls for efficient distributed trai...

A view of mini-batch SGD via generating functions: conditions of convergence, phase transitions, benefit from negative momenta

Mini-batch SGD with momentum is a fundamental algorithm for learning lar...

Towards Practical Adam: Non-Convexity, Convergence Theory, and Mini-Batch Acceleration

Adam is one of the most influential adaptive stochastic algorithms for t...

Large Scale Language Modeling: Converging on 40GB of Text in Four Hours

Recent work has shown how to train Convolutional Neural Networks (CNNs) ...

Please sign up or login with your details

Forgot password? Click here to reset