On Scale-out Deep Learning Training for Cloud and HPC

01/24/2018
by Srinivas Sridharan, et al.

The exponential growth in the use of large deep neural networks has accelerated the need to train them in hours or even minutes. Such speed can only be achieved through scalable and efficient distributed training, since a single node or card cannot satisfy the compute, memory, and I/O requirements of today's state-of-the-art deep neural networks. However, scaling synchronous Stochastic Gradient Descent (SGD) remains a challenging problem that requires continued research and development, with innovations spanning algorithms, frameworks, communication libraries, and system design. In this paper, we describe the philosophy, design, and implementation of the Intel Machine Learning Scalability Library (MLSL) and present proof points demonstrating the scaling of DL training to hundreds and thousands of nodes across Cloud and HPC systems.
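The synchronous SGD scheme the abstract refers to can be illustrated with a minimal sketch: each worker computes a gradient on its own data shard, the gradients are averaged with an allreduce, and every replica applies the identical update. The sketch below simulates this in plain NumPy; the function names (`local_gradient`, `allreduce_mean`, `sync_sgd_step`) are illustrative, and `allreduce_mean` is only a stand-in for a real MPI/MLSL allreduce, not MLSL's actual API.

```python
import numpy as np

def local_gradient(w, X, y):
    """Gradient of mean squared error 0.5*mean((Xw - y)^2) on this worker's shard."""
    return X.T @ (X @ w - y) / len(y)

def allreduce_mean(grads):
    """Stand-in for an MPI/MLSL allreduce: every worker receives the global average."""
    avg = np.mean(grads, axis=0)
    return [avg.copy() for _ in grads]

def sync_sgd_step(weights, shards, lr=0.1):
    """One synchronous SGD step across all workers.

    Each worker holds a full model replica and one data shard. Gradients are
    averaged via allreduce, so every replica applies the identical update and
    the model stays consistent across workers.
    """
    grads = [local_gradient(w, X, y) for w, (X, y) in zip(weights, shards)]
    reduced = allreduce_mean(grads)
    return [w - lr * g for w, g in zip(weights, reduced)]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(64, 3)), rng.normal(size=64)
    # Evenly shard one global batch across 4 simulated workers.
    shards = list(zip(np.split(X, 4), np.split(y, 4)))
    w0 = np.zeros(3)
    weights = sync_sgd_step([w0.copy() for _ in range(4)], shards)
    # All replicas remain identical, and with equal shard sizes the
    # averaged-gradient step equals a full-batch SGD step.
    assert all(np.allclose(w, weights[0]) for w in weights)
    assert np.allclose(weights[0], w0 - 0.1 * local_gradient(w0, X, y))
```

With equal shard sizes, averaging per-shard gradients reproduces the full-batch gradient exactly, which is why synchronous SGD preserves single-node convergence behavior while distributing the compute; the scaling challenge the paper addresses lies in making the allreduce itself efficient at hundreds to thousands of nodes.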


