Scaling GRPC Tensorflow on 512 nodes of Cori Supercomputer

12/26/2017
by Amrita Mathuriya, et al.

We explore scaling of the standard distributed Tensorflow with GRPC primitives on up to 512 Intel Xeon Phi (KNL) nodes of the Cori supercomputer with synchronous stochastic gradient descent (SGD), and identify causes of scaling inefficiency at higher node counts. To our knowledge, this is the first exploration of distributed GRPC Tensorflow scalability on an HPC supercomputer at such large scale with synchronous SGD. We studied scaling of two convolutional neural networks: ResNet-50, a state-of-the-art deep network for classification with roughly 25.5 million parameters, and HEP-CNN, a shallow topology with fewer than 1 million parameters for common scientific usages. For ResNet-50, we achieve >80% scaling efficiency for up to 128 workers using 32 parameter servers (PS tasks), with a steep decline down to 23% for 512 workers using 64 PS tasks. Our analysis of the efficiency drop points to low network bandwidth utilization due to the combined effect of three factors: (a) the heterogeneous distributed parallelization algorithm, which uses PS tasks as centralized servers for gradient averaging, is suboptimal for utilizing interconnect bandwidth; (b) load imbalance among PS tasks hinders their efficient scaling; and (c) the underlying communication primitive, GRPC, is currently inefficient on Cori's high-speed interconnect. The HEP-CNN demands less interconnect bandwidth and shows >80% weak scaling efficiency with only a single PS task. Our findings are applicable to other deep learning networks: big networks with millions of parameters run into the issues discussed here, while shallower networks like HEP-CNN, with a relatively low number of parameters, can efficiently enjoy weak scaling even with a single parameter server.
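To make the setup concrete, the "standard distributed Tensorflow" benchmarked here is the TF 1.x parameter-server architecture: variables live on PS tasks, workers pull weights and push gradients over GRPC every step, and synchronous SGD aggregates the gradients from all workers before each update. The following is a minimal sketch of that setup, launched once per task; the host names, toy model, and constants are illustrative placeholders and are not taken from the paper's actual Cori configuration.

    import tensorflow as tf  # TensorFlow 1.x, the version studied in the paper

    # Illustrative constants; real runs would read these from the job launcher.
    JOB_NAME, TASK_INDEX, NUM_WORKERS = "worker", 0, 2

    cluster = tf.train.ClusterSpec({
        "ps":     ["nid00001:2222"],                   # PS tasks host the weights
        "worker": ["nid00002:2222", "nid00003:2222"],  # workers compute gradients
    })
    server = tf.train.Server(cluster, job_name=JOB_NAME, task_index=TASK_INDEX)

    if JOB_NAME == "ps":
        server.join()  # PS tasks only serve variable reads/writes over GRPC
    else:
        # Variables are placed on PS tasks (round-robin); ops stay on this worker.
        with tf.device(tf.train.replica_device_setter(
                worker_device="/job:worker/task:%d" % TASK_INDEX,
                cluster=cluster)):
            x = tf.random_normal([32, 1024])       # stand-in for an input batch
            w = tf.get_variable("w", [1024, 10])   # lives on a PS task
            loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))
            step = tf.train.get_or_create_global_step()
            # Synchronous SGD: gradients from all workers are aggregated
            # before a single shared update is applied.
            opt = tf.train.SyncReplicasOptimizer(
                tf.train.GradientDescentOptimizer(0.1),
                replicas_to_aggregate=NUM_WORKERS,
                total_num_replicas=NUM_WORKERS)
            train_op = opt.minimize(loss, global_step=step)

        hooks = [opt.make_session_run_hook(is_chief=(TASK_INDEX == 0))]
        with tf.train.MonitoredTrainingSession(
                master=server.target, is_chief=(TASK_INDEX == 0),
                hooks=hooks) as sess:
            while not sess.should_stop():
                sess.run(train_op)

Because every worker ships its full gradient (roughly 25.5 million values for ResNet-50) to the PS tasks each step, traffic concentrates on a few PS endpoints, which is the centralization effect the abstract identifies as factor (a).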


