ImageNet/ResNet-50 Training in 224 Seconds

11/13/2018
by Hiroaki Mikami, et al.

Scaling distributed deep learning to a massive GPU cluster is challenging due to the instability of large mini-batch training and the overhead of gradient synchronization. We address the instability of large mini-batch training with batch size control, and the overhead of gradient synchronization with 2D-Torus all-reduce. Specifically, 2D-Torus all-reduce arranges GPUs in a logical 2D grid and performs a series of collective operations in different orientations. These two techniques are implemented with Neural Network Libraries (NNL). We successfully trained ImageNet/ResNet-50 in 224 seconds on the ABCI cluster without significant accuracy loss.
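The abstract only outlines the 2D-Torus all-reduce. As a rough illustration, the NumPy sketch below simulates one common way such a reduction can be organized on a rows x cols logical grid: a reduce-scatter within each row, an all-reduce within each column, and an all-gather within each row. The function name two_d_torus_allreduce, the grid shape, and the chunking scheme are assumptions made for this sketch; it is not the NNL implementation used in the paper, and batch size control is not shown.

    # Minimal sketch of a 2D-Torus all-reduce, simulated with NumPy.
    # Assumptions: rows x cols workers, worker (r, c) owns grads[r * cols + c],
    # and each gradient is split into `cols` equal chunks.
    import numpy as np

    def two_d_torus_allreduce(grads, rows, cols):
        n = grads[0].size
        assert n % cols == 0, "toy example: gradient length must divide evenly"
        chunks = [g.reshape(cols, n // cols).copy() for g in grads]

        # Phase 1: reduce-scatter within each row. Afterwards, the worker in
        # column c holds the row-wise sum of chunk c.
        for r in range(rows):
            row = [chunks[r * cols + c] for c in range(cols)]
            for c in range(cols):
                row[c][c] = sum(w[c] for w in row)

        # Phase 2: all-reduce within each column. Chunk c on every worker in
        # column c becomes the global sum of chunk c.
        for c in range(cols):
            col_sum = sum(chunks[r * cols + c][c] for r in range(rows))
            for r in range(rows):
                chunks[r * cols + c][c] = col_sum

        # Phase 3: all-gather within each row. Every worker copies the fully
        # reduced chunks from its row neighbours.
        for r in range(rows):
            for c in range(cols):
                for src in range(cols):
                    chunks[r * cols + c][src] = chunks[r * cols + src][src]

        return [ch.reshape(-1) for ch in chunks]

    if __name__ == "__main__":
        rows, cols = 2, 4                  # 8 simulated GPUs in a 2 x 4 grid
        rng = np.random.default_rng(0)
        grads = [rng.standard_normal(16) for _ in range(rows * cols)]
        reduced = two_d_torus_allreduce(grads, rows, cols)
        assert all(np.allclose(g, sum(grads)) for g in reduced)
        print("2D-Torus all-reduce matches the plain sum over all workers")

The toy uses 8 workers and a 16-element gradient so the result can be checked against a plain sum; on a real cluster the same three-phase pattern would run over NCCL-style collectives across thousands of GPUs, which is outside the scope of this sketch.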
