Large-Batch Training for LSTM and Beyond

01/24/2019
by Yang You, et al.

Large-batch training approaches have enabled researchers to utilize large-scale distributed processing and greatly accelerate deep neural network (DNN) training. For example, by scaling the batch size from 256 to 32K, researchers have been able to reduce the training time of ResNet50 on ImageNet from 29 hours to 2.2 minutes (Ying et al., 2018). In this paper, we propose a new approach called linear-epoch gradual-warmup (LEGW) for better large-batch training. With LEGW, we are able to conduct large-batch training for both CNNs and RNNs using the Sqrt Scaling scheme. LEGW makes the Sqrt Scaling scheme practical, and as a result we achieve much better results than with the Linear Scaling learning-rate scheme. For LSTM applications, we are able to scale the batch size by a factor of 64 without losing accuracy and without tuning the hyper-parameters. For CNN applications, LEGW achieves the same accuracy even as we scale the batch size to 32K. LEGW works better than previous large-batch auto-tuning techniques, achieving a 5.3x average speedup over the baselines for four LSTM-based applications on the same hardware. We also provide theoretical explanations for LEGW.
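The LEGW rule above can be stated concretely: when the batch size grows by a factor of k, scale the peak learning rate by sqrt(k) (Sqrt Scaling) and stretch the warmup period, measured in epochs, by the same factor k. The sketch below is a minimal illustration of such a schedule, not the authors' implementation; the baseline constants (base_lr, base_warmup_epochs, dataset_size) are hypothetical placeholders.

```python
import math

def legw_lr(step, *, batch, base_batch=256, base_lr=0.1,
            base_warmup_epochs=0.3125, dataset_size=1_281_167):
    """Sketch of a linear-epoch gradual-warmup (LEGW) schedule.

    Scaling the batch by k = batch / base_batch:
      * peak LR scales by sqrt(k)            (Sqrt Scaling)
      * warmup length in epochs scales by k  (linear-epoch warmup)
    All baseline constants here are illustrative assumptions.
    """
    k = batch / base_batch
    peak_lr = base_lr * math.sqrt(k)            # Sqrt Scaling of the peak LR
    warmup_epochs = base_warmup_epochs * k      # warmup epochs grow linearly with k
    steps_per_epoch = dataset_size / batch      # fewer steps per epoch at large batch
    warmup_steps = warmup_epochs * steps_per_epoch
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps  # linear ramp from ~0 to peak
    return peak_lr                              # post-warmup decay policy omitted

# Example: quadrupling the batch doubles the peak LR (0.1 * sqrt(4) = 0.2),
# e.g. legw_lr(0, batch=1024) starts near 0 and ramps toward 0.2.
```

Note that because steps per epoch shrink by k while warmup epochs grow by k, the warmup length measured in iterations stays roughly constant, which helps explain why the schedule transfers across batch sizes without retuning.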
