HyPar-Flow: Exploiting MPI and Keras for Scalable Hybrid-Parallel DNN Training using TensorFlow

11/12/2019
by Ammar Ahmad Awan, et al.

The enormous amount of data and computation required to train DNNs has led to the rise of various parallelization strategies. Broadly, there are two strategies: 1) Data-Parallelism – replicating the DNN on multiple processes and training on different training samples, and 2) Model-Parallelism – dividing elements of the DNN itself into partitions across different processes. While data-parallelism has been extensively studied and developed, model-parallelism has received less attention because it is non-trivial to split the model across processes. In this paper, we propose HyPar-Flow: a framework for scalable and user-transparent parallel training of very large DNNs (up to 5,000 layers). We exploit TensorFlow's Eager Execution features and Keras APIs for model definition and distribution. HyPar-Flow exposes a simple API to offer data, model, and hybrid (model + data) parallel training for models defined using the Keras API. Under the hood, we introduce MPI communication primitives such as send and recv on layer boundaries for data exchange between model-partitions, and allreduce for gradient exchange across model-replicas. Our proposed designs in HyPar-Flow offer up to 3.1x speedup over sequential training for ResNet-110 and up to 1.6x speedup over Horovod-based data-parallel training for ResNet-1001, a model with 1,001 layers and 30 million parameters. We provide an in-depth performance characterization of the HyPar-Flow framework on multiple HPC systems with diverse CPU architectures, including Intel Xeon(s) and AMD EPYC. HyPar-Flow provides a 110x speedup on 128 nodes of the Stampede2 cluster at TACC for hybrid-parallel training of ResNet-1001.
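To make the communication pattern described above concrete, the sketch below pairs point-to-point send/recv on a layer boundary with an allreduce across model-replicas, using mpi4py and tf.keras in eager mode. It is a minimal illustration under stated assumptions (two partitions per replica, a toy two-layer split, and a hypothetical train_step); the communicator layout and all names are illustrative and are not HyPar-Flow's actual API.

```python
# Hypothetical sketch of hybrid (model + data) parallelism; run e.g.: mpirun -np 4 python hybrid_sketch.py
from mpi4py import MPI
import tensorflow as tf

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

NUM_PARTITIONS = 2                        # model-parallel degree (illustrative)
replica_id = rank // NUM_PARTITIONS       # which model-replica this rank belongs to
partition_id = rank % NUM_PARTITIONS      # which slice of layers this rank holds
num_replicas = comm.Get_size() // NUM_PARTITIONS

# One communicator per replica for send/recv on layer boundaries,
# and one per partition for allreduce of gradients across replicas.
p2p = comm.Split(color=replica_id, key=partition_id)
ar = comm.Split(color=partition_id, key=replica_id)

# Each rank owns only its slice of the model (hidden layer vs. classification head).
if partition_id == 0:
    part = tf.keras.Sequential([tf.keras.layers.Dense(256, activation="relu")])
else:
    part = tf.keras.Sequential([tf.keras.layers.Dense(10)])
opt = tf.keras.optimizers.SGD(0.01)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

def train_step(x, y):
    if partition_id == 0:
        with tf.GradientTape() as tape:
            acts = part(x)                               # forward on local layers
        p2p.send(acts.numpy(), dest=1, tag=0)            # activations -> next partition
        d_acts = tf.constant(p2p.recv(source=1, tag=1))  # gradient w.r.t. boundary activations
        grads = tape.gradient(acts, part.trainable_variables,
                              output_gradients=d_acts)   # backprop with received gradient
    else:
        acts = tf.constant(p2p.recv(source=0, tag=0))    # activations from previous partition
        with tf.GradientTape() as tape:
            tape.watch(acts)
            loss = loss_fn(y, part(acts))
        d_acts, *grads = tape.gradient(loss, [acts] + part.trainable_variables)
        p2p.send(d_acts.numpy(), dest=0, tag=1)          # boundary gradient -> previous partition

    # Data-parallel step: average gradients across replicas holding the same partition.
    averaged = []
    for g in grads:
        buf = g.numpy()
        ar.Allreduce(MPI.IN_PLACE, buf, op=MPI.SUM)
        averaged.append(tf.constant(buf / num_replicas))
    opt.apply_gradients(zip(averaged, part.trainable_variables))

# One toy step on random data; in practice each replica reads a different data shard.
x = tf.random.normal([32, 784])
y = tf.random.uniform([32], maxval=10, dtype=tf.int32)
train_step(x, y)
```

With four processes, ranks 0-1 form one model-replica and ranks 2-3 the other: each pair exchanges activations and boundary gradients via send/recv, while ranks holding the same partition average their gradients with allreduce, mirroring the hybrid scheme the abstract describes.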
